A context-free grammar CFG is a 4-tuple {N, S, S, P}, where:
This section has been superseded. Go to the section Structural relations in parse trees.
Derivations with respect to context-free grammars fall into equivalence classes, whose members differ only in the order in which the productions are applied. Each equivalence class can be represented by a parse tree (phrase marker) in which the arcs connecting nonterminal symbols represent the application of a production. The parse tree of a terminal string x generated by a context-free grammar G is called a structural description of x with respect to G. The occurrences of nonterminal and terminal symbols that appear in a parse tree are called nodes. The node corresponding to the start of the derivation is called the root, and the terminal nodes are called leaves. All other nodes are called interior nodes.
Note: A regular grammar that generates the rational numerals in binary notation is given in the Regular languages handout.
Let GB =
1. A ® 0 D |
5. A ® 1 C D |
9. B ® 1 C |
12. C ® 1 |
2. A ® 0 |
6. A ® 1 D |
10. C ® 1 C |
13. C ® 0 |
3. A ® - B D |
7. A ® 1 C |
11. C ® 0 C |
14. D ® / B |
4. A ® - B |
8. A ® 1 |
Production |
Line |
Production |
Line |
start |
A |
9 |
- 1 0 / 1 C |
3 |
- B D |
10 |
- 1 0 / 1 1 C |
9 |
- 1 C D |
11 |
- 1 0 / 1 1 0 C |
13 |
- 1 0 D |
12 |
- 1 0 / 1 1 0 1 |
14 |
- 1 0 / B |
|
|
+-- - | | +-- 1 | | A-- B | | | +-- C -- 0 | | +-- / | | +-- D +-- 1 | | +-- B +-- 1 | | +-- C +-- 0 | | +-- C | +-- C -- 1
In this parse tree, the node "A" is the root; the nodes "-", "/". the two "0"s and four "1"s are leaves; and the nodes "D", the two "B"s, and four "C"s are interior nodes. Note that the same productions can be used to generate the numerator and the denominator of fractions. Compare this with the regular grammar that generates the same language, in which different productions must be used.
Let Gc =
Production |
Line |
Production |
Line |
0 |
S |
8 |
the cats think that S |
1 |
NP VP |
1 |
the cats think that NP VP |
2 |
D N VP |
2 |
the cats think that D N VP |
6 |
the N VP |
6 |
the cats think that the N VP |
7 |
the cats VP |
7 |
the cats think that the dogs VP |
3 |
the cats Vc S' |
4 |
the cats think that the dogs Vi |
9 |
the cats think S' |
10 |
the cats think that the dogs stink |
5 |
the cats think C S |
|
|
+-- D -- the | +-- NP | | | +-- N -- cats S | +-- Vc -- think | | +-- VP +-- C -- that | | | | +-- D -- the | | | +-- S' +-- NP | | | +-- S +-- N -- dogs | +-- VP -- Vi -- stink
Identify the root, leaves and interior nodes of this parse tree.
Let Gf be:
11. Ns ® cats | dogs |
||
7. VPs ® Vi |
12. C ® that |
|
3. NPf ® D Nf S' |
8. S' ® C S |
13. Vc ® think |
9. D ® the |
14. Vf ® amazes | bothers |
|
5. VPf ® Vf NPs |
10. Nf ® fact |
15. Vi ® stink |
Draw the parse tree for the sentence the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats with respect to Gf.
Suppose a production of the form A ® f B y, where A, B Î N and f and y are strings in V*, applies in the course of a derivation with respect to a context-free grammar G. Then, using a familial metaphor, we say that that occurrence of A is the mother (parent) of that occurrence of B in that derivation, and also in the parse tree that includes that derivation. We also say that that B is a daughter (child) of that A in that derivation and parse tree.
In addition to being the mother of B, A is the mother of every terminal and nonterminal symbol appearing in f and y, and every such symbol is a child of A. For concreteness, suppose the production in question is A ® B C, where A, B, C Î N. Then A is the mother of both B and C, and B and C are the children of A. Continuing the familial metaphor, we say that B is the sister (sibling) of C, more specifically its left sister. In addition, C is the sister of B, more specifically its right sister. We also say that A immediately dominates B C. The domain of this relation (the B C) can be interpreted in two ways: as a sequence, in which case it matters that B precedes C, or as a bag, in which case it doesn't matter. (A bag is like a set, but repetition is allowed.) I adopt the sequence interpretation. The immediate dominance relation is to be distinguished from the relation expressed by the arrow in a production. The former holds in a parse tree, the latter in a grammar, and the latter (which may be called immediate generation) licenses the former.
Just as we can recursively define familial relations among people, so we can recursively define "familial" relations among nodes in a parse tree. Here is a recursive definition of descendant of node n in parse tree t according to which it is a reflexive relation. (Every node is a descendant of itself.)
If n is the root of t, then every node in t is a descendant of n. However if n is a leaf, then n is its only descendant in t.
Just as 'daughter' is used in the recursive definition of descendant, so ' immediate dominance' is used in the recursive definition of (exhaustive) dominance.
If n is the root of t, then it (exhaustively) dominates the sequence of leaves in t, but not any partial subsequence of leaves in t. If those leaves consist entirely of terminal symbols, then that sequence can be interpreted as a string s, and s is generated by the grammar G. Expressing the (exhaustive) dominance relation in t by Þt, we write S Þt s.
Finally, using the 'immediate generation' relation, we can recursively define generation with respect to a context-free grammar G as follows. (Note that generation, unlike (exhaustive) dominance, is not reflexive.)
Expressing the generation relation with respect to G by "ÞG", then S ÞG s for every s in the language L generated by G. That is, the language L generated by G is {s | s is a string of terminal symbols in G and S ÞG s}.
Suppose that G generates s but associates two or more distinct parse trees t, u, ... with s; i.e. S ÞG s, S Þt s, S Þu s, .... Then s is structurally ambiguous with respect to G. We can use the ability to represent structural ambiguity as a test for adequacy of a grammar. For example, suppose we use the productions in the grammar GP1 to generate the language LP consisting of a noun phrase (NP) followed by zero or more preposition phrases (PP), schematically NP PP*, such as the ball, the ball above the box, the ball above the box behind the camera, etc. (NP is the start symbol of GP1.) It is easy to verify that every phrase generated by this grammar is structurally unambiguous.
However, careful consideration of phrases with two or more PPs reveals that they are structurally ambiguous. The second PP can be associated with either the first noun or the second. We can obtain a grammar GP2 that represents such phrases as structurally ambiguous by replacing the first production of GP1 by the following production.
Click here for the two parse trees that GP2 associates with the phase the ball above the box behind the camera.
However, the number of parse trees associated with the phrases in LP by GP2 grows exponentially with the number of preposition phrases, as shown in the following table. (In the limit the number of parse trees associated with a phrase containing n prepositional phrases approaches 4n.) This result may give us pause.
Phrase |
# parses |
the ball above the box |
1 |
the ball above the box behind the camera |
2 |
the ball above the box behind the camera beside the doll |
5 |
the ball above the box behind the camera beside the doll below the keys |
14 |
the ball above the box behind the camera beside the doll below the keys next to the wallet |
42 |
A similar result is obtained if we use the productions in grammar GC1 to generate simple and coordinate proper noun phrases like Alice, Alice and Betty, Alice and Betty and Carla. (PN is the start symbol.) This result may give us even more pause, because these phrases hardly seem structurally ambiguous, much less massively so.
Phrase |
# parses |
Alice and Betty |
1 |
Alice and Betty and Carla |
2 |
Alice and Betty and Carla and Diane |
5 |
Alice and Betty and Carla and Diane and Ellen |
14 |
Alice and Betty and Carla and Diane and Ellen and Flora |
42 |
The productions in GC1don't account for the possibility of "flat" coordinate structure with 3 or more conjuncts. One possibility is to provide an infinite schema of productions as in GC2.
However, this change results in an increase in the rate of growth of ambiguity to 6n in the limit.
Phrase |
# parses |
Alice and Betty |
1 |
Alice and Betty and Carla |
3 |
Alice and Betty and Carla and Diane |
11 |
Alice and Betty and Carla and Diane and Ellen |
45 |
Alice and Betty and Carla and Diane and Ellen and Flora |
197 |
This explosive growth in degree of structural ambiguity seems excessive. Is it right? If not, what sort of grammatical description do we need?
If A ÞG A f, where f Î V+, then G is left-embedding (or left-recursive). Moreover, A is a left-embedding (or left-recursive) category in G if either A = S, or there is some category B (called its introducing category) and strings f, y such that B ÞG f A y and it is not the case that there are strings f¢ , y¢ such that A ÞG f¢ B y¢ .
Let Gg contain the following productions, where S is the start symbol.
Gg is left-embedding, since NP ÞGg NP f, where f is a non-empty string. Moreover, NP is a left-embedding category, since S is its introducing category. Det is not a left-embedding category, since although Det ÞGg Det f, where f is a non-empty string, it is neither the start symbol, nor does it have an introducing category.
If A ÞG f A with respect to G, where f Î V+, then G is right-embedding (or right-recursive). The definition of a right-embedding (right-recursive) category is similar to that of a left-embedding category.
The grammar Gc above is right-embedding. Its right-embedding category is S.
If A ÞG f A y with respect to G, where f, y Î V+, then G is center-embedding (or self-embedding or center-recursive). The definition of a center-embedding (self-embedding or center-recursive) category is similar to that of a left- or right-embedding category.
The grammar Gf above is center-embedding. Its center-embedding category is S.
A context-free language is strictly context free if and only if all of its grammars are center-embedding. If any context-free language is finite or has a left- or right-embedding grammar, it is also a regular language.