The Eiffel Compiler / Interpreter (tecomp)

Context free grammars are usually known to describe languages over an alphabet. A language is understood as a set of strings over the alphabet.

But context free grammars can also be used to describe set of trees over an alphabet.

We consider trees where only leaf nodes carry information. E.g. the set of trees

* * * * .... | / \ / \ / \ 0 s * s * s * | / \ / \ 0 s * s * | / \ 0 s * | 0 -- * marks a node with no information -- The alphabet is {s,0}

can be characterized by the context free grammar

nat: 0 nat: s nat -- Terminals: {0,s} -- Nonterminals: {nat} -- Start symbol: nat

We can use parentheses (as metasymbols) to describe the tree structure in a linear form. The above trees can be describe in parenthesis notation as

(0) (s (0)) (s (s (0))) (s (s (s (0))))

Each such a tree can be interpreted as a representation for natural numbers.

A different grammar

btree: btree: btree a btree -- Terminals: {a} -- Nonterminals: {btree} -- Start symbol: btree

describes the trees

* * * /|\ /|\ * a * * a * /|\ * a * /|\ * a * -- or in notation with parentheses () (() a ()) ((() a (() a ())) a ())

The second example shows the effect of empty productions in the grammar. In order to describe the trees in our programming notation we need trees which carry information only in the leaves. Furthermore the information in the leaves has to be optional in order to represent the case of zero productions.

In order to describe the trees we use the following inductive type

-- module: "leaf_tree" case class LEAF_TREE[G] create leaf(leaf: G) tree(children: LIST[LEAF_TREE[G])) end

If a type T represents a superset of the alphabet our needed trees are of the type

LEAF_TREE[T]

Since LEAF_TREE is an induction data type the compiler generates the induction principle

feature -- generated induction principle all(e: BOOLEAN) require all(a:G) e[t:=leaf(a)] all(cs:LIST[LEAF_TREE[G]]) (all(c) c in cs => e[t:=c]) => e[t:=tree(cs)] ensure all(t) e end -- default type is LEAF_TREE[G] -- -- e[t:=leaf(e)]: all free occurences of "t" within the expression -- "e" substituted by "leaf(a) -- -- Implication "=>" is right associative, i.e. "a=>b=>c" and -- "a=>(b=>c)" are equivalent which is equivalent to the uncurried -- version "a and b => c". -- -- The range of the quantifiers "all" and "some" spans as far to the -- right as possible. end

In words: If an assertion is true for all trees which consist only of a leaf and if all children of a tree carry over the truth of the assertion to its parent, then the assertion is true for all trees.

In order to handle context free grammars we define a module to describe them.

A context free grammar has an alphabet of terminal symbols, an alphabet of nonterminal symbols and start symbol and a set of rules.

-- module: context_free_grammar case class CONTEXT_FREE_GRAMMAR[T,NT] create cfg(start:NT, rules:SET[RULE]) end

A grammar symbol is either a terminal or a nonterminal

case class SYMBOL create terminal(t:T) nonterminal(nt:NT) end -- Note: The class SYMBOL is contained within the module -- "context_free_grammar". This module has as its current type -- CONTEXT_FREE_GRAMMAR[T,NT] with the formal generic types T and -- NT. These formal generics are available to all types in the same -- module (i.e. they are considered as local types of the module)

A rule has a nonterminal symbols on its left hand side and a possibly empty sequence of grammar symbols on the right hand side.

case class RULE create rule(left:SYMBOL, right:LIST[SYMBOL]) require left as nonterminal(_) end end -- Note: The precondition of the constructor "rule" guarantees that no -- terminal symbol can be entered on the left hand side of a rule.

A parse tree is a tree where each node carries a grammar symbol and which is consistent with the grammar. A symbol tree is consistent with the grammar if and only if each tree node and its children represent one of the grammar rules.

-- grammar leaf tree parse tree nat: 0 * nat nat: s nat / \ / \ s * s nat / \ / \ s * s nat / \ / \ s * s nat | | 0 0

Each nonterminal node in the parse tree together with its children has one of the two following forms

nat nat / \ | s nat 0 nat: s nat nat: 0

Each of the two forms represent one of the rules of the grammar.

The structure of symbol trees is defined in the module "tree"

-- module: tree case class TREE[G] create tree(info:G, children: LIST[CURRENT]) end

feature -- Generated induction principle all(e:BOOLEAN) require all(a:G, cs:LIST[CURRENT]) (all(c:CURRENT) c in cs => e[t:=c]) => e[t:=tree(a,cs)] ensure all(t) e end end

Sometimes it is important to get the set of symbols

feature infos(t:TREE[G]): SET[G] -- The set of all information elements of the tree `t'. ensure Result = {t.info} + t.children.mapped(c->c.infos).to_set.union end end

A symbol tree is valid for a set of rules, if each tree node together with its children is consistent with one of the rules.

It is easy to design a predicate which tells weather a symbol tree is a parse tree of a set of rules.

feature is_root_valid(t:TREE[SYMBOL], rs:SET[RULE]): BOOLEAN -- Is the root node of the tree `t' together with its children -- consistent with a rule in the set of rules `rs'? ensure Result = inspect t.info case terminal(_) then t.chilren=nil case nonterminal(_) then rule(t.info,t.children.mapped(x->x.info)) in rs end end is_valid(t:TREE[SYMBOL], rs:SET[RULE]): BOOLEAN -- Is `t' a validparse tree for the rules in `rs'? ensure Result = t.is_root_valid and t.children.forall({c: c.is_valid(rs)}) end end

A tree is a parse tree of a grammar if it is a parse tree of the ruleset and if its root node is the start symbol.

feature is_valid(t:TREE[SYMBOL], g:CURRENT): BOOLEAN -- Is `t' a valid parse tree for the grammar `g'? ensure Result = t.is_valid(g.rules) and t.info~g.start end parse_trees(g:CURRENT): TREE[SYMBOL]? -- The set of valid parse trees of the grammar `g'. ensure Result = {t: t.is_valid(g)} end end

The transformation of a symbol tree into a leaf tree is straightforward by removing all information about nonterminals in the symbol tree.

feature leaf_tree(t:TREE[SYMBOL]): LEAF_TREE[T] -- The leaf tree which corresponds to the symbol tree `t'. ensure Result = inspect t.info case terminal(term) then leaf(term) case nonterminal(_) then tree(t.children.mapped(c->c.leaf_tree)) end end end

- Note: All subtrees below a terminal node are ignored. This is no restriction, since all valid parse trees have no children below a terminal node.

A set of rules generate a leaf tree if there is a parse tree for the rules which transforms into the leaf tree. A grammar generates a leaf tree if there is a parse tree for the grammar which transforms into the leaf tree.

feature is_valid(t:LEAF_TREE[T], rs:SET[RULES]): ghost BOOLEAN -- Is `t' a valid leaf tree with respect to the rules in `rs'? ensure Result = some(u:TREE[SYMBOL]) u.is_valid(rs) and t = u.leaf_tree end is_valid(t:LEAF_TREE[T], g:CURRENT): ghost BOOLEAN -- Is `t' a valid leaf tree with respect to the grammar `g'? ensure Result = some(u:TREE[SYMBOL]) u.is_valid(g) and t = u.leaf_tree end trees(g:CURRENT): ghost LEAF_TREE[T]? -- The set of valid leaf trees for the grammar `g'. ensure Result = {t: t.is_valid(g)} end end

Both validity predicates for leaf trees are ghost predicates because of the use of the existential quantifier (which is not computable) in their definition. In the following we want to find an implementation to convert these two predicates into computable predicates.

Furthermore we want to investigate whether a tree grammar is contained in another tree grammar (i.e. all leaf trees produced by one grammar can be produced by the other as well) and whether two tree grammars are equivalent (i.e. produce the same set of leaf trees).

If all parse trees generated by a grammar g1 are also valid parse trees for a grammar g2, then all leaf trees generated by the grammar g1 are also leaf trees generated by the grammar g2. The same applies if the set of trees are equivalent.

feature all(g1,g2:CURRENT) ensure -- proofs: def "trees" g1.parse_tress<=g2.parse_trees => g1.trees<=g2.trees g1.parse_trees=g2.parse_trees => g1.trees=g2.trees end end

If the rules of a grammar is a subset of the rules of another grammar having the same start symbol, then the generated parse trees (and leaf trees) are subsets.

feature all(g1,g2:CURRENT) ensure g1.start~g2.start => g1.rules<=g2.rules => g1.parse_trees<=g2.parse_trees end end

Each context free grammar might contain nonterminals which produce nothing or which are not reachable from the start symbol. A trimmed grammar does not contain such superfluous nonterminals (i.e. no such rules which have a superfluous nonterminal at its left hand side).

In order to trim a grammar we need the set of terminals, nonterminals and the set of all grammar symbols which are contained in the rules.

feature terminals(rs:SET[RULE]): SET[SYMBOL] -- The set of all terminals used in the set of rules `rs'. ensure Result = rs.mapped(r->r.right .filtered({s:s as terminal(_)}) .to_set).union end all(rs:SET[RULE]) ensure rs.terminals.predicate = {s: s as terminal(_) and some(r:RULE) r in rs and s in r.right} end nonterminals(rs:SET[RULE]): SET[SYMBOL] -- The set of all nonterminals used in the set of rules `rs'. ensure Result = rs.mapped(r->r.left) end all(rs:SET[RULE]) ensure rs.nonterminals.predicate = {s: some(r:RULE) r in rs and s=r.left} end symbols(rs:SET[RULE]): SET[SYMBOL] -- The set of all symbols used in the set of rules `rs'. ensure Result = rs.mapped(r->{r.left}+r.right.to_set) end all(rs:SET[RULE]) ensure rs.symbols.predicate = {s: some(r:RULE) r in rs and (s=r.left or s in r.right)} end all(rs:SET[RULE]) ensure rs.terminals <= rs.symbols rs.nonterminals <= rs.symbols rs.terminals * rs.nonterminals = 0 rs.terminals + rs.nonterminals = rs.symbols -- "0": empty set -- "*": set intersection -- "+": set union end end

The producers of a set of rules are all the terminals (which produce themselves) and all nonterminals which produce something over the terminals.

step_producers(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL] -- The set of grammar symbols which produce strings over -- the symbols of the set `ss' using one rule of the rule -- set `rs'. ensure Result = rs.filtered(r->r.right.forall(s -> s in ss)) end all(rs:SET[RULE], ss:SET[SYMBOL]) ensure bounded: rs.step_producers(ss) <= rs.nonterminals rs.step_producers(ss).predicate = {s: some(r:RULE) r in rs and all(s:SYMBOL) s in r.right => s in ss} end producers(rs:SET[RULE]): SET[SYMBOL] -- The set of all grammar symbols which produce "something" by -- using any sequence of rules from the set `rs'. ensure Result = rs.terminals.closed(ss->rs.step_producers(ss)) end

It is important to note that only producers can occur in valid parse trees.

all(rs:SET[RULE], t:TREE[SYMBOL]) ensure t.is_valid(rs) => t.infos<=rs.producers end

This claim can be proved by induction on the trees

all(t:TREE[SYMBOL]) check c1: all(s:SYMBOL,cs:LIST[TREE[SYMBOL]],rs:SET[RULE]) require r1: all(c:TREE[SYMBOL]) c in cs => c.is_valid(rs) => c.infos<=rs.producers r2: tree(s,cs).is_valid(rs) check c2: all(c:TREE[SYMBOL]) c in cs => c.is_valid(rs) -- r2, def "is_valid" c3: all(c:TREE[SYMBOL]) c in cs => c.infos<=rs.producers -- c2, r1 c4: cs.mapped(c->c.infos).to_set.union <= rs.producers -- c3, def "mapped", def "union" c5: s as terminal(_) => cs=nil -- r2, def "is_valid" c6: s as nonterminal(_) => rule(s,t.children.mapped(c->c.info) in rs -- r2, def "is_valid" c7: cs.mapped(c->c.info).to_set <= rs.producers -- c4, def "infos" c8: s in rs.producers -- c6, c7, def "producers" -- all children are producers (c7), there is a -- corresponding rule (c6) in case of a nonterminal -- therefore s is a producer as well (closure property -- of producers, terminals are producers anyhow) c9: {s}+cs.mapped(c->c.infos).to_set.union <= rs.producers -- c8, c4 ensure tree(s,cs).infos<=rs.producers -- c9 end ensure all(rs:SET[RULE]) t.is_valid(rs) => t.infos<=rs.producers -- c1, induction principle end

The reachables of a grammar are all the symbols which can be reached from the start symbol.

feature -- Reachables step_reachables(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL] -- Set of all symbols reachable directly from the symbol -- set `ss' via one of the rules of `rs'. ensure Result = rs.filtered(r->r.left in ss) .mapped(r -> r.right.to_set) .union end all(rs:SET[RULE], ss:SET[SYMBOL]) ensure rs.step_reachables(ss) <= rs.symbols end reachables(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL] -- Set of all symbols reachable from the symbol set -- `ss' via any number of rules in `rs'. ensure Result = ss.closed(ss -> rules.step_reachables(ss)) end reachables(g:CURRENT): SET[SYMBOL] -- Set of all grammar symbols which are reachable from the -- start symbol of the grammar `g' by using the rules of -- this grammar ensure Result = g.rules.reachables({g.start}) end end

A valid parse tree can carry only reachable symbols.

feature all(t:TREE[SYMBOL], g:CURRENT) ensure t.is_valid(g) => t.infos<=g.reachables end end

This claim can be proved with the help of a lemma which states that all symbols within a valid parse tree are reachable from the symbol at the root node via the rules.

feature all(t:TREE[SYMBOL], rs:SET[RULE]) check all(s:SYMBOL, cs:LIST[TREE[SYMBOL]]) require r1: all(c:TREE[SYMBOL]) c in cs => c.is_valid(rs) => c.infos<=rs.reachables(c.info) r2: tree(s,cs).is_valid(rs) check c1: rules(s,cs.mapped(c->c.info)) in rs -- r2, def "is_valid" c2: all(c:TREE[SYMBOL]) require r3: c in cs check c3: c.is_valid(rs) -- r3, r2, def "is_valid" c4: c.infos<=rs.reachables(c.info) -- r1,r3,c3 c5: c.info in rs.reachables({s}) -- r3, c1, def "reachable" ensure c.infos<= rs.reachables({s}) -- c4,c5, closure property of "reachables" end c6: {s}+cs.mapped(c->c.infos).to_set.union <= rs.reachables({s}) -- c2, def "union", "s in rs.reachables({s})" ensure tree(s,cs).infos<=rs.reachables({s}) -- c6, def "infos" end ensure t.is_valid(rs) => t.infos<=rs.reachables(t.info) -- induction principle end end

A grammar is trimmed if all symbols are reachables and producers.

feature is_trimmed(g:CURRENT): BOOLEAN -- Are there only reachables and producers in the grammar? ensure Result = (g.symbols = g.reachables*g.rules.producers) end end

We can generate a trimmed grammar from an arbitrary grammar.

feature trimmed(g:CURRENT): CURRENT -- The grammar `g' where with all rules removed where the -- left symbol is neither reachable nor a producer. ensure Result = cfg(g.start, g.rules .filtered(r->r.left in (g.reachables*g.rules.producers))) end end

A trimmed grammar produces the same set of parse trees as an untrimmed grammar.

feature all(g:CURRENT, t:TREE[SYMBOL]) ensure t.is_valid(g) = t.is_valid(g.trimmed) end end

A parse tree for a leaf tree is not unique in general. There might be a set of possible parse trees. If the set is empty, then there is no parse tree for a leaf tree. The following algorithm computes the set of possible parse trees of a leaf tree.

feature combined[G](l:LIST[SET[G]]): SET[LIST[G]] ensure Result = inspect l case nil then {nil} case f::t then f.mapped(e->t.combined.mapped(l->e::l)).union end end parse_trees(t:LEAF_TREE[T], s:SYMBOL, rs:SET[RULE]): SET[TREE[SYMBOL]] -- The set of parse trees for the leaf tree `t' -- starting with the symbol `s' which are valid for the -- rules in `rs' ensure Result = inspect t case leaf(opt) then inspect opt case none then rs.filtered(r->r.left=s and r.right=nil) .mapped(r->tree(s,nil)) case value(term) then rs.filtered(r->r.left=s and s=terminal(term)) .mapped(r->tree(s,[]) end case tree(cs) then rs.filtered(r->r.left=s and r.right.count=cs.count) .mapped(r->cs.zipped(r.right) .mapped([c,s]->c.parse_trees(s,rs))) .combined .mapped(cs->tree(s,cs)) end end end

If the algorithm is correct, the following claim must be valid.

all(t:LEAF_TREE[T], g:CURRENT) ensure all(u:TREE[SYMBOL]) u in t.parse_trees(g.start, g.rules) => (u.is_valid(g) and u.leaf_tree=t) (some(u:TREE[SYMBOL]) u.is_valid(g) and u.leaf_tree=t) = (t.parse_trees(g.start,g.rules) /= 0) end

It requires more structuring of the algorithm to verify that the algorithm is correct.

Local Variables: mode: outline coding: iso-latin-1 outline-regexp: "=\\(=\\)*" End: