Context free grammars

LING 501, Fall 2004: Context-free grammars

Revised September 11 and 12, 2004, allowing the empty string in various normal-form productions, reformatting some charts and diagrams and moving two to separate files, and adding a new section on the analysis of parse trees. The material in the section "Derivation from a nonterminal" has been incorporated into this new section.

Revised September 8 and 10, 2004, eliminating the boundary symbol # from the definition, correcting the tree diagram, and other editorial changes such as replacing A₀ (the designated start symbol) by S.

Definition

A context-free grammar CFG is a 4-tuple {N, S, S, P}, where:

N is a finite nonempty set of nonterminal symbols (categories) {A₁, ..., A_n};

S is a finite nonempty set of terminal symbols (words) {a₁, ... a_m}, where N Ç S = Æ and N È S = V;

S Î N is the start symbol (axiom);

P is a finite set of productions (rules) of the form A ® w, where A Î N, and w Î V*.

Derivation from a nonterminal

This section has been superseded. Go to the section Structural relations in parse trees.

Comparison of regular and context-free productions

Below, S' = S È {e}.

Regular productions

Ordinary right-linear productions

A ® x B or A ® x, where A, B Î N, and x Î S*

Normal-form right-linear productions

A ® a B or A ® a, where A, B Î N, and a Î S'

Ordinary left-linear productions

A ® B x or A ® x, where A, B Î N, and x Î S*

Normal-form left-linear productions

A ® B a or A ® a, where A, B Î N, and a Î S'

Context-free productions

Ordinary context-free productions

A ® w, where A Î N, and w Î V*

Chomsky-normal-form (binary-branching form) productions

A ® B C or A ® a, where A, B, C Î N, and a Î S'

Greibach-normal-form productions

A ® a B C or A ® a B or A ® a, where A, B, C Î N, and a Î S'

Linear productions

A ® x B y or A ® x, where A, B Î N, and x, y Î S*

Two-sided linear productions

A ® x B or A ® B x or A ® x, where A, B Î N, and x Î S*

Derivations, parse trees and structural descriptions

Derivations with respect to context-free grammars fall into equivalence classes, whose members differ only in the order in which the productions are applied. Each equivalence class can be represented by a parse tree (phrase marker) in which the arcs connecting nonterminal symbols represent the application of a production. The parse tree of a terminal string x generated by a context-free grammar G is called a structural description of x with respect to G. The occurrences of nonterminal and terminal symbols that appear in a parse tree are called nodes. The node corresponding to the start of the derivation is called the root, and the terminal nodes are called leaves. All other nodes are called interior nodes.

Example 1: Greibach-normal-form grammar G_B that generates the binary rational numerals

Note: A regular grammar that generates the rational numerals in binary notation is given in the Regular languages handout.

Let G_B =

N = {A, B, C, D}

S = {0, 1, -, /}
S = A

1. A ® 0 D	5. A ® 1 C D	9. B ® 1 C	12. C ® 1
2. A ® 0	6. A ® 1 D	10. C ® 1 C	13. C ® 0
3. A ® - B D	7. A ® 1 C	11. C ® 0 C	14. D ® / B
4. A ® - B	8. A ® 1

Derivation of the string -10/1101 using G_B

Production	Line	Production	Line
start	A	9	- 1 0 / 1 C
3	- B D	10	- 1 0 / 1 1 C
9	- 1 C D	11	- 1 0 / 1 1 0 C
13	- 1 0 D	12	- 1 0 / 1 1 0 1
14	- 1 0 / B

Parse tree for the string -10/1101 with respect to G_B

+-- -
| 
|   +-- 1 
|   | 
A-- B 
|   | 
|   +-- C -- 0 
|
|   +-- / 
|   | 
+-- D   +-- 1 
    |   | 
    +-- B   +-- 1 
        |   | 
        +-- C   +-- 0 
            |   | 
            +-- C 
                | 
                +-- C -- 1

In this parse tree, the node "A" is the root; the nodes "-", "/". the two "0"s and four "1"s are leaves; and the nodes "D", the two "B"s, and four "C"s are interior nodes. Note that the same productions can be used to generate the numerator and the denominator of fractions. Compare this with the regular grammar that generates the same language, in which different productions must be used.

Example 2: Chomsky-normal form grammar `Gc` that generates a regular language

Let Gc =

N = {C, D, N, NP, S, S', Vc, Vi, VP}

S = {cats, dogs, stink, that, the, think}
S = S

1. S ® NP VP	6. D ® the
2. NP ® D N	7. N ® cats \| dogs
3. VP ® Vc S'	8. C ® that
4. VP ® Vi	9. Vc ® think
5. S' ® C S	10. Vi ® stink

Derivation of the cats think that the dogs stink using `Gc`

Production	Line	Production	Line
0	S	8	the cats think that S
1	NP VP	1	the cats think that NP VP
2	D N VP	2	the cats think that D N VP
6	the N VP	6	the cats think that the N VP
7	the cats VP	7	the cats think that the dogs VP
3	the cats Vc S'	4	the cats think that the dogs Vi
9	the cats think S'	10	the cats think that the dogs stink
5	the cats think C S

Parse tree of the cats think that the dogs stink using `Gc`

    +-- D -- the
    |
+-- NP
|   |
|   +-- N -- cats
S
|   +-- Vc -- think
|   |
+-- VP  +-- C -- that
    |   |
    |   |       +-- D -- the
    |   |       |
    +-- S'  +-- NP
        |   |   |
        +-- S   +-- N -- dogs
            |
            +-- VP -- Vi -- stink

Exercise

Identify the root, leaves and interior nodes of this parse tree.

Example 3: Chomsky-normal form grammar `Gf` that generates a strictly context-free language language

Let Gf be:

N ={C, D, Nf, Ns, NPf, NPs, S, S', Vc, Vf, Vi, VPf, VPs}

S = {amazes, bothers, cats, dogs, fact, stink, that, the, think}
S = S

1. S ® NPf VPf	6. VPs ® Vc S'	11. Ns ® cats \| dogs
2. S ® NPs VPs	7. VPs ® Vi	12. C ® that
3. NPf ® D Nf S'	8. S' ® C S	13. Vc ® think
4. NPs ® D Ns	9. D ® the	14. Vf ® amazes \| bothers
5. VPf ® Vf NPs	10. Nf ® fact	15. Vi ® stink

Click here for derivation of the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats using `Gf`

Exercise

Draw the parse tree for the sentence the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats with respect to Gf.

Structural relations in parse trees

Return to top

Immediate "family" and domination relations

Suppose a production of the form A ® f B y, where A, B Î N and f and y are strings in V*, applies in the course of a derivation with respect to a context-free grammar G. Then, using a familial metaphor, we say that that occurrence of A is the mother (parent) of that occurrence of B in that derivation, and also in the parse tree that includes that derivation. We also say that that B is a daughter (child) of that A in that derivation and parse tree.

In addition to being the mother of B, A is the mother of every terminal and nonterminal symbol appearing in f and y, and every such symbol is a child of A. For concreteness, suppose the production in question is A ® B C, where A, B, C Î N. Then A is the mother of both B and C, and B and C are the children of A. Continuing the familial metaphor, we say that B is the sister (sibling) of C, more specifically its left sister. In addition, C is the sister of B, more specifically its right sister. We also say that A immediately dominates B C. The domain of this relation (the B C) can be interpreted in two ways: as a sequence, in which case it matters that B precedes C, or as a bag, in which case it doesn't matter. (A bag is like a set, but repetition is allowed.) I adopt the sequence interpretation. The immediate dominance relation is to be distinguished from the relation expressed by the arrow in a production. The former holds in a parse tree, the latter in a grammar, and the latter (which may be called immediate generation) licenses the former.

Recursively defined family and dominance relations

Just as we can recursively define familial relations among people, so we can recursively define "familial" relations among nodes in a parse tree. Here is a recursive definition of descendant of node n in parse tree t according to which it is a reflexive relation. (Every node is a descendant of itself.)

Base case: For every node n in a parse tree t, n is a descendant of n in t.
Recursive step: If m is a descendant of n in t, then every daughter of m is a descendant of n in t.
Closure: Nothing else is a descendant of n in t.

If n is the root of t, then every node in t is a descendant of n. However if n is a leaf, then n is its only descendant in t.

Just as 'daughter' is used in the recursive definition of descendant, so ' immediate dominance' is used in the recursive definition of (exhaustive) dominance.

Base case: For every node n in a parse tree t, n (exhaustively) dominates n in t.
Recursive step: If n (exhaustively) dominates the sequence of nodes f m y in t, and m immediately dominates c in t, then n (exhaustively) dominates f c y in t.
Closure: Nothing else is (exhaustively) dominated by n in t.

If n is the root of t, then it (exhaustively) dominates the sequence of leaves in t, but not any partial subsequence of leaves in t. If those leaves consist entirely of terminal symbols, then that sequence can be interpreted as a string s, and s is generated by the grammar G. Expressing the (exhaustive) dominance relation in t by Þ^t, we write S Þ^t s.

Finally, using the 'immediate generation' relation, we can recursively define generation with respect to a context-free grammar G as follows. (Note that generation, unlike (exhaustive) dominance, is not reflexive.)

Base case: For every nonterminal symbol A in a context-free grammar G such that A ® w, A generates w with respect to G.
Recursive step: If A generates f B y with respect to G for some nonterminal symbol B in G such that B ® c, then A generates f c y.
Closure: Nothing else is generated by G.

Expressing the generation relation with respect to G by "Þ^G", then S Þ^G s for every s in the language L generated by G. That is, the language L generated by G is {s | s is a string of terminal symbols in G and S Þ^G s}.

Structural ambiguity

Suppose that G generates s but associates two or more distinct parse trees t, u, ... with s; i.e. S Þ^G s, S Þ^t s, S Þ^u s, .... Then s is structurally ambiguous with respect to G. We can use the ability to represent structural ambiguity as a test for adequacy of a grammar. For example, suppose we use the productions in the grammar G_P1 to generate the language L_P consisting of a noun phrase (NP) followed by zero or more preposition phrases (PP), schematically NP PP*, such as the ball, the ball above the box, the ball above the box behind the camera, etc. (NP is the start symbol of G_P1.) It is easy to verify that every phrase generated by this grammar is structurally unambiguous.

Productions of G_P1

NP ® D N PP
NP ® D N
PP ® P NP
N ® ball | box | camera | doll | keys | wallet | ...
P ® above | behind | beside | below | next to | ...

However, careful consideration of phrases with two or more PPs reveals that they are structurally ambiguous. The second PP can be associated with either the first noun or the second. We can obtain a grammar G_P2 that represents such phrases as structurally ambiguous by replacing the first production of G_P1 by the following production.

NP ® NP PP

Click here for the two parse trees that G_P2 associates with the phase the ball above the box behind the camera.

Exponential growth in degree of structural ambiguity in certain context-free grammars

However, the number of parse trees associated with the phrases in L_P by G_P2 grows exponentially with the number of preposition phrases, as shown in the following table. (In the limit the number of parse trees associated with a phrase containing n prepositional phrases approaches 4ⁿ.) This result may give us pause.

Phrase	# parses
the ball above the box	1
the ball above the box behind the camera	2
the ball above the box behind the camera beside the doll	5
the ball above the box behind the camera beside the doll below the keys	14
the ball above the box behind the camera beside the doll below the keys next to the wallet	42

Binary coordinate structure

A similar result is obtained if we use the productions in grammar G_C1 to generate simple and coordinate proper noun phrases like Alice, Alice and Betty, Alice and Betty and Carla. (PN is the start symbol.) This result may give us even more pause, because these phrases hardly seem structurally ambiguous, much less massively so.

Productions of G_C1

PN ® PN Conj PN
PN ® Alice | Betty | Carla | Diane | Ellen | Flora | ...
Conj ® and | or

Phrase	# parses
Alice and Betty	1
Alice and Betty and Carla	2
Alice and Betty and Carla and Diane	5
Alice and Betty and Carla and Diane and Ellen	14
Alice and Betty and Carla and Diane and Ellen and Flora	42

N-ary coordinate structure

The productions in G_C1don't account for the possibility of "flat" coordinate structure with 3 or more conjuncts. One possibility is to provide an infinite schema of productions as in G_C2.

Productions of G_C2

PN ® PN (and PN)+
PN ® PN (or PN)+

However, this change results in an increase in the rate of growth of ambiguity to 6ⁿ in the limit.

Phrase	# parses
Alice and Betty	1
Alice and Betty and Carla	3
Alice and Betty and Carla and Diane	11
Alice and Betty and Carla and Diane and Ellen	45
Alice and Betty and Carla and Diane and Ellen and Flora	197

Concluding remark on structural ambiguity

This explosive growth in degree of structural ambiguity seems excessive. Is it right? If not, what sort of grammatical description do we need?

Embedding

Left-embedding

If A Þ^G A f, where f Î V+, then G is left-embedding (or left-recursive). Moreover, A is a left-embedding (or left-recursive) category in G if either A = S, or there is some category B (called its introducing category) and strings f, y such that B Þ^G f A y and it is not the case that there are strings f¢ , y¢ such that A Þ^G f¢ B y¢ .

Example

Let Gg contain the following productions, where S is the start symbol.

S ® NP VP
VP ® V NP
NP ® Det N
Det ® NP Poss
Det ® my | your | our | her | his | their
Poss ® 's
N ® sister | cousin | friend | neighbor | teacher | babysitter
V ® loves | hates | admires | detests | knows | remembers

Gg is left-embedding, since NP Þ^Gg NP f, where f is a non-empty string. Moreover, NP is a left-embedding category, since S is its introducing category. Det is not a left-embedding category, since although Det Þ^Gg Det f, where f is a non-empty string, it is neither the start symbol, nor does it have an introducing category.

Right-embedding

If A Þ^G f A with respect to G, where f Î V+, then G is right-embedding (or right-recursive). The definition of a right-embedding (right-recursive) category is similar to that of a left-embedding category.

Example

The grammar Gc above is right-embedding. Its right-embedding category is S.

Center-embedding

If A Þ^G f A y with respect to G, where f, y Î V+, then G is center-embedding (or self-embedding or center-recursive). The definition of a center-embedding (self-embedding or center-recursive) category is similar to that of a left- or right-embedding category.

Example

The grammar Gf above is center-embedding. Its center-embedding category is S.

Embedding and strict context-free languages

A context-free language is strictly context free if and only if all of its grammars are center-embedding. If any context-free language is finite or has a left- or right-embedding grammar, it is also a regular language.

LING 501, Fall 2004: Context-free grammars

Revised September 8 and 10, 2004, eliminating the boundary symbol # from the definition, correcting the tree diagram, and other editorial changes such as replacing A0 (the designated start symbol) by S.

Definition

Derivation from a nonterminal

Comparison of regular and context-free productions

Below, S' = S È {e}.

Regular productions

Ordinary right-linear productions

Normal-form right-linear productions

Ordinary left-linear productions

Normal-form left-linear productions

Context-free productions

Ordinary context-free productions

Chomsky-normal-form (binary-branching form) productions

Greibach-normal-form productions

Linear productions

Two-sided linear productions

Derivations, parse trees and structural descriptions

Example 1: Greibach-normal-form grammar GB that generates the binary rational numerals

Derivation of the string -10/1101 using GB

Parse tree for the string -10/1101 with respect to GB

Example 2: Chomsky-normal form grammar Gc that generates a regular language

Derivation of the cats think that the dogs stink using Gc

Parse tree of the cats think that the dogs stink using Gc

Exercise

Example 3: Chomsky-normal form grammar Gf that generates a strictly context-free language language

Click here for derivation of the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats using Gf

Exercise

Structural relations in parse trees

Return to top

Immediate "family" and domination relations

Recursively defined family and dominance relations

Structural ambiguity

Productions of GP1

Exponential growth in degree of structural ambiguity in certain context-free grammars

Binary coordinate structure

Productions of GC1

N-ary coordinate structure

Productions of GC2

Concluding remark on structural ambiguity

Embedding

Left-embedding

Example

Right-embedding

Example

Center-embedding

Example

Embedding and strict context-free languages

Revised September 8 and 10, 2004, eliminating the boundary symbol # from the definition, correcting the tree diagram, and other editorial changes such as replacing A₀ (the designated start symbol) by S.

Example 1: Greibach-normal-form grammar G_B that generates the binary rational numerals

Derivation of the string -10/1101 using G_B

Parse tree for the string -10/1101 with respect to G_B

Example 2: Chomsky-normal form grammar `Gc` that generates a regular language

Derivation of the cats think that the dogs stink using `Gc`

Parse tree of the cats think that the dogs stink using `Gc`

Example 3: Chomsky-normal form grammar `Gf` that generates a strictly context-free language language

Click here for derivation of the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats using `Gf`

Productions of G_P1

Productions of G_C1

Productions of G_C2