SourceForge.net LogoThe Eiffel Compiler / Interpreter (tecomp)

doc/papers/proof/tree grammar

Tree grammars (draft)

Introduction

Context free grammars are usually known to describe languages over an alphabet. A language is understood as a set of strings over the alphabet.

But context free grammars can also be used to describe set of trees over an alphabet.

Examples of tree grammars

We consider trees where only leaf nodes carry information. E.g. the set of trees

        *         *           *                  *          ....
        |        / \         / \                / \     
        0       s   *       s   *              s   *    
                    |          / \                / \   
                    0         s   *              s   *  
                                  |                 / \ 
                                  0                s   *
                                                       |
                                                       0
    -- * marks a node with no information
    -- The alphabet is {s,0}
 

can be characterized by the context free grammar

  nat:   0
  nat:   s nat
 
  -- Terminals:    {0,s}
  -- Nonterminals: {nat}
  -- Start symbol: nat
 

We can use parentheses (as metasymbols) to describe the tree structure in a linear form. The above trees can be describe in parenthesis notation as

  (0)      (s (0))        (s (s (0)))     (s (s (s (0))))
 

Each such a tree can be interpreted as a representation for natural numbers.

A different grammar

  btree:
  btree: btree a btree
 
  -- Terminals:    {a}
  -- Nonterminals: {btree}
  -- Start symbol: btree
 

describes the trees

  *           *              *
             /|\            /|\
            * a *          * a *
                          /|\
                         * a *
                            /|\
                           * a *
 
  -- or in notation with parentheses
 
  ()       (() a ())       ((() a (() a ())) a ())
 

The second example shows the effect of empty productions in the grammar. In order to describe the trees in our programming notation we need trees which carry information only in the leaves. Furthermore the information in the leaves has to be optional in order to represent the case of zero productions.

Leaf trees

In order to describe the trees we use the following inductive type

  -- module: "leaf_tree"
  case class
      LEAF_TREE[G]
  create
      leaf(leaf: G)
      tree(children: LIST[LEAF_TREE[G]))
  end
 

If a type T represents a superset of the alphabet our needed trees are of the type

  LEAF_TREE[T]
 

Since LEAF_TREE is an induction data type the compiler generates the induction principle

  feature
      -- generated induction principle
      all(e: BOOLEAN)
          require
              all(a:G) e[t:=leaf(a)]
              all(cs:LIST[LEAF_TREE[G]])
                   (all(c) c in cs => e[t:=c]) => e[t:=tree(cs)]
          ensure
              all(t) e
          end
 
     -- default type is LEAF_TREE[G]
     --
     -- e[t:=leaf(e)]:   all free occurences of "t" within the expression
     --                  "e" substituted by "leaf(a)
     --
     -- Implication "=>" is right associative, i.e. "a=>b=>c" and
     -- "a=>(b=>c)" are equivalent which is equivalent to the uncurried
     -- version "a and b => c".
     --
     -- The range of the quantifiers "all" and "some" spans as far to the
     -- right as possible.
  end
 

In words: If an assertion is true for all trees which consist only of a leaf and if all children of a tree carry over the truth of the assertion to its parent, then the assertion is true for all trees.

Context free grammar

In order to handle context free grammars we define a module to describe them.

A context free grammar has an alphabet of terminal symbols, an alphabet of nonterminal symbols and start symbol and a set of rules.

  -- module: context_free_grammar
  case class
      CONTEXT_FREE_GRAMMAR[T,NT]
  create
      cfg(start:NT, rules:SET[RULE])
  end
 

A grammar symbol is either a terminal or a nonterminal

  case class
      SYMBOL
  create
      terminal(t:T)
      nonterminal(nt:NT)
  end
 
  -- Note: The class SYMBOL is contained within the module
  --       "context_free_grammar". This module has as its current type
  --       CONTEXT_FREE_GRAMMAR[T,NT] with the formal generic types T and
  --       NT. These formal generics are available to all types in the same
  --       module (i.e. they are considered as local types of the module)
 

A rule has a nonterminal symbols on its left hand side and a possibly empty sequence of grammar symbols on the right hand side.

  case class
      RULE
  create
      rule(left:SYMBOL, right:LIST[SYMBOL])
          require
              left as nonterminal(_)
          end
  end
 
  -- Note: The precondition of the constructor "rule" guarantees that no
  --       terminal symbol can be entered on the left hand side of a rule.
 

Parse trees

A parse tree is a tree where each node carries a grammar symbol and which is consistent with the grammar. A symbol tree is consistent with the grammar if and only if each tree node and its children represent one of the grammar rules.

Example

  -- grammar              leaf tree              parse tree
 
     nat: 0                    *                     nat
     nat: s nat               / \                   /   \
                             s   *                 s    nat
                                / \                    /   \
                               s   *                  s    nat
                                  / \                     /   \
                                 s   *                   s    nat
                                     |                         |
                                     0                         0
 

Each nonterminal node in the parse tree together with its children has one of the two following forms

 
 
            nat                       nat
           /   \                       |
          s    nat                     0
 
 
         nat: s nat                  nat: 0
 

Each of the two forms represent one of the rules of the grammar.

Symbol trees

The structure of symbol trees is defined in the module "tree"

  -- module: tree
  case class
      TREE[G]
  create
      tree(info:G, children: LIST[CURRENT])
  end
 
  feature   -- Generated induction principle
      all(e:BOOLEAN)
          require
              all(a:G, cs:LIST[CURRENT])
                   (all(c:CURRENT) c in cs => e[t:=c]) 
                   => e[t:=tree(a,cs)]
          ensure
              all(t) e
          end
  end
 

Sometimes it is important to get the set of symbols

  feature
      infos(t:TREE[G]): SET[G]
              -- The set of all information elements of the tree `t'.
          ensure
              Result 
              = {t.info} + t.children.mapped(c->c.infos).to_set.union
          end
  end
 

Validity of symbol trees

A symbol tree is valid for a set of rules, if each tree node together with its children is consistent with one of the rules.

It is easy to design a predicate which tells weather a symbol tree is a parse tree of a set of rules.

  feature
      is_root_valid(t:TREE[SYMBOL], rs:SET[RULE]): BOOLEAN
              -- Is the root node of the tree `t' together with its children
              -- consistent with a rule in the set of rules `rs'?
          ensure
              Result = inspect
                           t.info
                       case terminal(_) then 
                           t.chilren=nil
                       case nonterminal(_) then
                           rule(t.info,t.children.mapped(x->x.info)) in rs
                       end
          end
 
      is_valid(t:TREE[SYMBOL], rs:SET[RULE]): BOOLEAN
              -- Is `t' a validparse tree for the rules in `rs'?
          ensure
              Result = t.is_root_valid and
                       t.children.forall({c: c.is_valid(rs)})
          end
  end
 

A tree is a parse tree of a grammar if it is a parse tree of the ruleset and if its root node is the start symbol.

  feature
      is_valid(t:TREE[SYMBOL], g:CURRENT): BOOLEAN
              -- Is `t' a valid parse tree for the grammar `g'?
          ensure
              Result = t.is_valid(g.rules) and t.info~g.start
          end
 
      parse_trees(g:CURRENT): TREE[SYMBOL]?
              -- The set of valid parse trees of the grammar `g'.
          ensure
              Result = {t: t.is_valid(g)}
          end
  end
 

Symbol trees and leaf trees

The transformation of a symbol tree into a leaf tree is straightforward by removing all information about nonterminals in the symbol tree.

  feature
      leaf_tree(t:TREE[SYMBOL]): LEAF_TREE[T]
              -- The leaf tree which corresponds to the symbol tree `t'.
          ensure
              Result =
                  inspect
                      t.info
                  case terminal(term) then
                      leaf(term)
                  case nonterminal(_) then
                      tree(t.children.mapped(c->c.leaf_tree))
                  end
          end
  end
 

Generated leaf trees

A set of rules generate a leaf tree if there is a parse tree for the rules which transforms into the leaf tree. A grammar generates a leaf tree if there is a parse tree for the grammar which transforms into the leaf tree.

  feature
      is_valid(t:LEAF_TREE[T], rs:SET[RULES]): ghost BOOLEAN
              -- Is `t' a valid leaf tree with respect to the rules in `rs'?
          ensure
              Result = some(u:TREE[SYMBOL])
                           u.is_valid(rs) and
                           t = u.leaf_tree
          end
 
      is_valid(t:LEAF_TREE[T], g:CURRENT): ghost BOOLEAN
              -- Is `t' a valid leaf tree with respect to the grammar `g'?
          ensure
              Result = some(u:TREE[SYMBOL])
                           u.is_valid(g) and
                           t = u.leaf_tree
          end
 
      trees(g:CURRENT): ghost LEAF_TREE[T]?
              -- The set of valid leaf trees for the grammar `g'.
          ensure
              Result = {t: t.is_valid(g)}
          end
  end
 

Both validity predicates for leaf trees are ghost predicates because of the use of the existential quantifier (which is not computable) in their definition. In the following we want to find an implementation to convert these two predicates into computable predicates.

Furthermore we want to investigate whether a tree grammar is contained in another tree grammar (i.e. all leaf trees produced by one grammar can be produced by the other as well) and whether two tree grammars are equivalent (i.e. produce the same set of leaf trees).

Some evident facts

If all parse trees generated by a grammar g1 are also valid parse trees for a grammar g2, then all leaf trees generated by the grammar g1 are also leaf trees generated by the grammar g2. The same applies if the set of trees are equivalent.

  feature
      all(g1,g2:CURRENT)
          ensure
              -- proofs: def "trees"
              g1.parse_tress<=g2.parse_trees => g1.trees<=g2.trees
 
              g1.parse_trees=g2.parse_trees => g1.trees=g2.trees
          end
  end
 

If the rules of a grammar is a subset of the rules of another grammar having the same start symbol, then the generated parse trees (and leaf trees) are subsets.

  feature
      all(g1,g2:CURRENT)
          ensure
              g1.start~g2.start
              => g1.rules<=g2.rules     
              => g1.parse_trees<=g2.parse_trees
          end
  end
 

Trimmed grammars

Each context free grammar might contain nonterminals which produce nothing or which are not reachable from the start symbol. A trimmed grammar does not contain such superfluous nonterminals (i.e. no such rules which have a superfluous nonterminal at its left hand side).

Symbol sets

In order to trim a grammar we need the set of terminals, nonterminals and the set of all grammar symbols which are contained in the rules.

  feature
      terminals(rs:SET[RULE]): SET[SYMBOL]
              -- The set of all terminals used in the set of rules `rs'.
          ensure
              Result = rs.mapped(r->r.right
                                    .filtered({s:s as terminal(_)})
                                    .to_set).union
          end
 
      all(rs:SET[RULE])
          ensure
              rs.terminals.predicate 
                  = {s: s as terminal(_) 
                        and some(r:RULE) r in rs and s in r.right}
          end
 
 
      nonterminals(rs:SET[RULE]): SET[SYMBOL]
              -- The set of all nonterminals used in the set of rules `rs'.
          ensure
              Result = rs.mapped(r->r.left)
          end
 
      all(rs:SET[RULE])
          ensure
              rs.nonterminals.predicate
              = {s: some(r:RULE) r in rs and s=r.left}
          end
 
 
      symbols(rs:SET[RULE]): SET[SYMBOL]
              -- The set of all symbols used in the set of rules `rs'.
          ensure
              Result = rs.mapped(r->{r.left}+r.right.to_set)
          end
 
      all(rs:SET[RULE])
          ensure
              rs.symbols.predicate
              = {s: some(r:RULE) r in rs 
                                 and (s=r.left or s in r.right)}
          end
 
 
      all(rs:SET[RULE])
          ensure
              rs.terminals    <= rs.symbols
              rs.nonterminals <= rs.symbols
              rs.terminals * rs.nonterminals = 0
              rs.terminals + rs.nonterminals = rs.symbols
                  -- "0": empty set
                  -- "*": set intersection
                  -- "+": set union
          end
  end
 

Producers

The producers of a set of rules are all the terminals (which produce themselves) and all nonterminals which produce something over the terminals.

     step_producers(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL]
             -- The set of grammar symbols which produce strings over
             -- the symbols of the set `ss' using one rule of the rule
             -- set `rs'.
         ensure
             Result = rs.filtered(r->r.right.forall(s -> s in ss))
         end
 
     all(rs:SET[RULE], ss:SET[SYMBOL])
         ensure
             bounded: rs.step_producers(ss) <= rs.nonterminals
 
             rs.step_producers(ss).predicate
             = {s: some(r:RULE) 
                      r in rs and
                      all(s:SYMBOL) s in r.right => s in ss}
         end
 
     producers(rs:SET[RULE]): SET[SYMBOL]
             -- The set of all grammar symbols which produce "something" by
             -- using any sequence of rules from the set `rs'.
         ensure
             Result = rs.terminals.closed(ss->rs.step_producers(ss))
         end
 

It is important to note that only producers can occur in valid parse trees.

  all(rs:SET[RULE], t:TREE[SYMBOL])
      ensure
          t.is_valid(rs) => t.infos<=rs.producers
      end
 

This claim can be proved by induction on the trees

  all(t:TREE[SYMBOL])
      check
          c1: all(s:SYMBOL,cs:LIST[TREE[SYMBOL]],rs:SET[RULE])
              require
                  r1: all(c:TREE[SYMBOL]) c in cs 
                                           => c.is_valid(rs)
                                           => c.infos<=rs.producers 
                  r2: tree(s,cs).is_valid(rs)
              check
                  c2: all(c:TREE[SYMBOL]) c in cs => c.is_valid(rs)
                            -- r2, def "is_valid"
 
                  c3: all(c:TREE[SYMBOL]) c in cs => c.infos<=rs.producers
                            -- c2, r1
 
                  c4: cs.mapped(c->c.infos).to_set.union <= rs.producers
                            -- c3, def "mapped", def "union"
 
                  c5: s as terminal(_) => cs=nil  -- r2, def "is_valid"
 
                  c6: s as nonterminal(_)
                  => rule(s,t.children.mapped(c->c.info) in rs
                                                   -- r2, def "is_valid"
 
                  c7: cs.mapped(c->c.info).to_set <= rs.producers
                            -- c4, def "infos"
 
                  c8: s in rs.producers  -- c6, c7, def "producers"
                      -- all children are producers (c7), there is a
                      -- corresponding rule (c6) in case of a nonterminal
                      -- therefore s is a producer as well (closure property
                      -- of producers, terminals are producers anyhow)
 
                  c9: {s}+cs.mapped(c->c.infos).to_set.union <= rs.producers
                              -- c8, c4
              ensure
                  tree(s,cs).infos<=rs.producers   -- c9
              end
      ensure
          all(rs:SET[RULE]) t.is_valid(rs) => t.infos<=rs.producers
              -- c1, induction principle
      end
 

Reachables

The reachables of a grammar are all the symbols which can be reached from the start symbol.

  feature    -- Reachables
      step_reachables(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL]
              -- Set of all symbols reachable directly from the symbol
              -- set `ss' via one of the rules of `rs'.
          ensure
              Result = rs.filtered(r->r.left in ss)     
                         .mapped(r -> r.right.to_set)
                         .union
          end
 
      all(rs:SET[RULE], ss:SET[SYMBOL])
          ensure
              rs.step_reachables(ss) <= rs.symbols
          end
 
      reachables(rs:SET[RULE], ss:SET[SYMBOL]): SET[SYMBOL]
              -- Set of all symbols reachable from the symbol set
              -- `ss' via any number of rules in `rs'.
          ensure
              Result = ss.closed(ss -> rules.step_reachables(ss))
          end
 
      reachables(g:CURRENT): SET[SYMBOL]
              -- Set of all grammar symbols which are reachable from the
              -- start symbol of the grammar `g' by using the rules of
              -- this grammar
          ensure
              Result = g.rules.reachables({g.start})
          end
  end
 

A valid parse tree can carry only reachable symbols.

  feature
      all(t:TREE[SYMBOL], g:CURRENT)
          ensure
              t.is_valid(g) => t.infos<=g.reachables
          end
  end
 

This claim can be proved with the help of a lemma which states that all symbols within a valid parse tree are reachable from the symbol at the root node via the rules.

  feature
      all(t:TREE[SYMBOL], rs:SET[RULE])
          check
              all(s:SYMBOL, cs:LIST[TREE[SYMBOL]])
                  require
                      r1: all(c:TREE[SYMBOL]) c in cs
                                              => c.is_valid(rs)
                                              => c.infos<=rs.reachables(c.info)
                      r2: tree(s,cs).is_valid(rs)
                  check
                      c1: rules(s,cs.mapped(c->c.info)) in rs
                                -- r2, def "is_valid"
                      c2: all(c:TREE[SYMBOL])
                          require
                              r3: c in cs
                          check
                              c3: c.is_valid(rs)  -- r3, r2, def "is_valid"
                              c4: c.infos<=rs.reachables(c.info)
                                                  -- r1,r3,c3
                              c5: c.info in rs.reachables({s})
                                                  -- r3, c1, def "reachable"
                          ensure
                              c.infos<= rs.reachables({s})
                                    -- c4,c5, closure property of "reachables"
                          end
 
                      c6: {s}+cs.mapped(c->c.infos).to_set.union
                          <= rs.reachables({s})
                             -- c2, def "union", "s in rs.reachables({s})"
                  ensure
                      tree(s,cs).infos<=rs.reachables({s})  -- c6, def "infos"
                  end
          ensure
              t.is_valid(rs) => t.infos<=rs.reachables(t.info)
                  -- induction principle
          end
  end
 

Equivalence of trimmed grammars

A grammar is trimmed if all symbols are reachables and producers.

   feature
       is_trimmed(g:CURRENT): BOOLEAN
               -- Are there only reachables and producers in the grammar?
           ensure
               Result = (g.symbols = g.reachables*g.rules.producers)
           end
   end
 

We can generate a trimmed grammar from an arbitrary grammar.

  feature
      trimmed(g:CURRENT): CURRENT
              -- The grammar `g' where with all rules removed where the
              -- left symbol is neither reachable nor a producer.
          ensure
              Result 
                 = cfg(g.start,
                       g.rules
                        .filtered(r->r.left in
                                     (g.reachables*g.rules.producers)))
          end
  end
 

A trimmed grammar produces the same set of parse trees as an untrimmed grammar.

  feature
      all(g:CURRENT, t:TREE[SYMBOL])
          ensure
              t.is_valid(g) = t.is_valid(g.trimmed)
          end
  end
 

Parse tree construction

A parse tree for a leaf tree is not unique in general. There might be a set of possible parse trees. If the set is empty, then there is no parse tree for a leaf tree. The following algorithm computes the set of possible parse trees of a leaf tree.

  feature
      combined[G](l:LIST[SET[G]]): SET[LIST[G]]
          ensure
              Result = inspect
                           l
                       case nil then
                           {nil}
                       case f::t then
                           f.mapped(e->t.combined.mapped(l->e::l)).union
                       end
          end
 
      parse_trees(t:LEAF_TREE[T],
                  s:SYMBOL,
                  rs:SET[RULE]):   SET[TREE[SYMBOL]]
              -- The set of parse trees for the leaf tree `t'
              -- starting with the symbol `s' which are valid for the
              -- rules in `rs'
          ensure
              Result = inspect
                           t
                       case leaf(opt) then
                           inspect
                               opt
                           case none then
                               rs.filtered(r->r.left=s and r.right=nil)
                                 .mapped(r->tree(s,nil))
                           case value(term) then
                               rs.filtered(r->r.left=s
                                           and s=terminal(term)) 
                                 .mapped(r->tree(s,[])
                           end
                       case tree(cs) then
                           rs.filtered(r->r.left=s 
                                       and r.right.count=cs.count)
                             .mapped(r->cs.zipped(r.right)
                                          .mapped([c,s]->c.parse_trees(s,rs)))
                             .combined
                             .mapped(cs->tree(s,cs))
                       end
          end
  end
 

If the algorithm is correct, the following claim must be valid.

  all(t:LEAF_TREE[T], g:CURRENT)
      ensure
          all(u:TREE[SYMBOL])
              u in t.parse_trees(g.start, g.rules)
              => 
              (u.is_valid(g) and u.leaf_tree=t)
 
          (some(u:TREE[SYMBOL]) u.is_valid(g) and u.leaf_tree=t)
          =
          (t.parse_trees(g.start,g.rules) /= 0)
      end
 

It requires more structuring of the algorithm to verify that the algorithm is correct.

Emacs suffix

 Local Variables:
 mode: outline
 coding: iso-latin-1
 outline-regexp: "=\\(=\\)*"
 End:
Table of contents

- Introduction

- Examples of tree grammars

- Leaf trees

- Context free grammar

- Parse trees

- Example

- Symbol trees

- Validity of symbol trees

- Symbol trees and leaf trees

- Generated leaf trees

- Some evident facts

- Trimmed grammars

- Symbol sets

- Producers

- Reachables

- Equivalence of trimmed grammars

- Parse tree construction

- Emacs suffix


ip-location