LING 501, Fall 2004: Context-free grammars

Revised September 11 and 12, 2004, allowing the empty string in various normal-form productions, reformatting some charts and diagrams and moving two to separate files, and adding a new section on the analysis of parse trees. The material in the section "Derivation from a nonterminal" has been incorporated into this new section.
Revised September 8 and 10, 2004, eliminating the boundary symbol # from the definition, correcting the tree diagram, and other editorial changes such as replacing A0 (the designated start symbol) by S.

Definition

A context-free grammar CFG is a 4-tuple {N, S, S, P}, where:

Derivation from a nonterminal

This section has been superseded. Go to the section Structural relations in parse trees.

Comparison of regular and context-free productions

Below, S' = S È {e}.

Regular productions

Ordinary right-linear productions
Normal-form right-linear productions
Ordinary left-linear productions
Normal-form left-linear productions

Context-free productions

Ordinary context-free productions
Chomsky-normal-form (binary-branching form) productions
Greibach-normal-form productions
Linear productions
Two-sided linear productions

Derivations, parse trees and structural descriptions

Derivations with respect to context-free grammars fall into equivalence classes, whose members differ only in the order in which the productions are applied. Each equivalence class can be represented by a parse tree (phrase marker) in which the arcs connecting nonterminal symbols represent the application of a production. The parse tree of a terminal string x generated by a context-free grammar G is called a structural description of x with respect to G. The occurrences of nonterminal and terminal symbols that appear in a parse tree are called nodes. The node corresponding to the start of the derivation is called the root, and the terminal nodes are called leaves. All other nodes are called interior nodes.

Example 1: Greibach-normal-form grammar GB that generates the binary rational numerals

Note: A regular grammar that generates the rational numerals in binary notation is given in the Regular languages handout.

Let GB =

1. A ® 0 D

5. A ® 1 C D

9. B ® 1 C

12. C ® 1

2. A ® 0

6. A ® 1 D

10. C ® 1 C

13. C ® 0

3. A ® - B D

7. A ® 1 C

11. C ® 0 C

14. D ® / B

4. A ® - B

8. A ® 1

   
Derivation of the string -10/1101 using GB

Production

Line

Production

Line

start

A

9

- 1 0 / 1 C

3

- B D

10

- 1 0 / 1 1 C

9

- 1 C D

11

- 1 0 / 1 1 0 C

13

- 1 0 D

12

- 1 0 / 1 1 0 1

14

- 1 0 / B

 

 

Parse tree for the string -10/1101 with respect to GB
+-- -
| 
|   +-- 1 
|   | 
A-- B 
|   | 
|   +-- C -- 0 
|
|   +-- / 
|   | 
+-- D   +-- 1 
    |   | 
    +-- B   +-- 1 
        |   | 
        +-- C   +-- 0 
            |   | 
            +-- C 
                | 
                +-- C -- 1

In this parse tree, the node "A" is the root; the nodes "-", "/". the two "0"s and four "1"s are leaves; and the nodes "D", the two "B"s, and four "C"s are interior nodes. Note that the same productions can be used to generate the numerator and the denominator of fractions. Compare this with the regular grammar that generates the same language, in which different productions must be used.

Example 2: Chomsky-normal form grammar Gc that generates a regular language

Let Gc =

1. S ® NP VP

6. D ® the

2. NP ® D N

7. N ® cats | dogs

3. VP ® Vc S'

8. C ® that

4. VP ® Vi

9. Vc ® think

5. S' ® C S

10. Vi ® stink

Derivation of the cats think that the dogs stink using Gc

Production

Line

Production

Line

0

S

8

the cats think that S

1

NP VP

1

the cats think that NP VP

2

D N VP

2

the cats think that D N VP

6

the N VP

6

the cats think that the N VP

7

the cats VP

7

the cats think that the dogs VP

3

the cats Vc S'

4

the cats think that the dogs Vi

9

the cats think S'

10

the cats think that the dogs stink

5

the cats think C S

 

 

Parse tree of the cats think that the dogs stink using Gc
    +-- D -- the
    |
+-- NP
|   |
|   +-- N -- cats
S
|   +-- Vc -- think
|   |
+-- VP  +-- C -- that
    |   |
    |   |       +-- D -- the
    |   |       |
    +-- S'  +-- NP
        |   |   |
        +-- S   +-- N -- dogs
            |
            +-- VP -- Vi -- stink
Exercise

Identify the root, leaves and interior nodes of this parse tree.

Example 3: Chomsky-normal form grammar Gf that generates a strictly context-free language language

Let Gf be:

1. S ® NPf VPf

6. VPs ® Vc S'

11. Ns ® cats | dogs

2. S ® NPs VPs

7. VPs ® Vi

12. C ® that

3. NPf ® D Nf S'

8. S' ® C S

13. Vc ® think

4. NPs ® D Ns

9. D ® the

14. Vf ® amazes | bothers

5. VPf ® Vf NPs

10. Nf ® fact

15. Vi ® stink

Click here for derivation of the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats using Gf

Exercise

Draw the parse tree for the sentence the fact that the cats think that the fact that the dogs stink bothers the dogs amazes the cats with respect to Gf.

Structural relations in parse trees

Return to top

Immediate "family" and domination relations

Suppose a production of the form A ® f B y, where A, B Î N and f and y are strings in V*, applies in the course of a derivation with respect to a context-free grammar G. Then, using a familial metaphor, we say that that occurrence of A is the mother (parent) of that occurrence of B in that derivation, and also in the parse tree that includes that derivation. We also say that that B is a daughter (child) of that A in that derivation and parse tree.

In addition to being the mother of B, A is the mother of every terminal and nonterminal symbol appearing in f and y, and every such symbol is a child of A. For concreteness, suppose the production in question is A ® B C, where A, B, C Î N. Then A is the mother of both B and C, and B and C are the children of A. Continuing the familial metaphor, we say that B is the sister (sibling) of C, more specifically its left sister. In addition, C is the sister of B, more specifically its right sister. We also say that A immediately dominates B C. The domain of this relation (the B C) can be interpreted in two ways: as a sequence, in which case it matters that B precedes C, or as a bag, in which case it doesn't matter. (A bag is like a set, but repetition is allowed.) I adopt the sequence interpretation. The immediate dominance relation is to be distinguished from the relation expressed by the arrow in a production. The former holds in a parse tree, the latter in a grammar, and the latter (which may be called immediate generation) licenses the former.

Recursively defined family and dominance relations

Just as we can recursively define familial relations among people, so we can recursively define "familial" relations among nodes in a parse tree. Here is a recursive definition of descendant of node n in parse tree t according to which it is a reflexive relation. (Every node is a descendant of itself.)

If n is the root of t, then every node in t is a descendant of n. However if n is a leaf, then n is its only descendant in t.

Just as 'daughter' is used in the recursive definition of descendant, so ' immediate dominance' is used in the recursive definition of (exhaustive) dominance.

If n is the root of t, then it (exhaustively) dominates the sequence of leaves in t, but not any partial subsequence of leaves in t. If those leaves consist entirely of terminal symbols, then that sequence can be interpreted as a string s, and s is generated by the grammar G. Expressing the (exhaustive) dominance relation in t by Þt, we write S Þt s.

Finally, using the 'immediate generation' relation, we can recursively define generation with respect to a context-free grammar G as follows. (Note that generation, unlike (exhaustive) dominance, is not reflexive.)

Expressing the generation relation with respect to G by "ÞG", then S ÞG s for every s in the language L generated by G. That is, the language L generated by G is {s | s is a string of terminal symbols in G and S ÞG s}.

Structural ambiguity

Suppose that G generates s but associates two or more distinct parse trees t, u, ... with s; i.e. S ÞG s, S Þt s, S Þu s, .... Then s is structurally ambiguous with respect to G. We can use the ability to represent structural ambiguity as a test for adequacy of a grammar. For example, suppose we use the productions in the grammar GP1 to generate the language LP consisting of a noun phrase (NP) followed by zero or more preposition phrases (PP), schematically NP PP*, such as the ball, the ball above the box, the ball above the box behind the camera, etc. (NP is the start symbol of GP1.) It is easy to verify that every phrase generated by this grammar is structurally unambiguous.

Productions of GP1

However, careful consideration of phrases with two or more PPs reveals that they are structurally ambiguous. The second PP can be associated with either the first noun or the second. We can obtain a grammar GP2 that represents such phrases as structurally ambiguous by replacing the first production of GP1 by the following production.

Click here for the two parse trees that GP2 associates with the phase the ball above the box behind the camera.

Exponential growth in degree of structural ambiguity in certain context-free grammars

However, the number of parse trees associated with the phrases in LP by GP2 grows exponentially with the number of preposition phrases, as shown in the following table. (In the limit the number of parse trees associated with a phrase containing n prepositional phrases approaches 4n.) This result may give us pause.

Phrase

# parses

the ball above the box

1

the ball above the box behind the camera

2

the ball above the box behind the camera beside the doll

5

the ball above the box behind the camera beside the doll below the keys

14

the ball above the box behind the camera beside the doll below the keys next to the wallet

42

Binary coordinate structure

A similar result is obtained if we use the productions in grammar GC1 to generate simple and coordinate proper noun phrases like Alice, Alice and Betty, Alice and Betty and Carla. (PN is the start symbol.) This result may give us even more pause, because these phrases hardly seem structurally ambiguous, much less massively so.

Productions of GC1

Phrase

# parses

Alice and Betty

1

Alice and Betty and Carla

2

Alice and Betty and Carla and Diane

5

Alice and Betty and Carla and Diane and Ellen

14

Alice and Betty and Carla and Diane and Ellen and Flora

42

N-ary coordinate structure

The productions in GC1don't account for the possibility of "flat" coordinate structure with 3 or more conjuncts. One possibility is to provide an infinite schema of productions as in GC2.

Productions of GC2

However, this change results in an increase in the rate of growth of ambiguity to 6n in the limit.

Phrase

# parses

Alice and Betty

1

Alice and Betty and Carla

3

Alice and Betty and Carla and Diane

11

Alice and Betty and Carla and Diane and Ellen

45

Alice and Betty and Carla and Diane and Ellen and Flora

197

Concluding remark on structural ambiguity

This explosive growth in degree of structural ambiguity seems excessive. Is it right? If not, what sort of grammatical description do we need?

Embedding

Left-embedding

If A ÞG A f, where f Î V+, then G is left-embedding (or left-recursive). Moreover, A is a left-embedding (or left-recursive) category in G if either A = S, or there is some category B (called its introducing category) and strings f, y such that B ÞG f A y and it is not the case that there are strings , such that A ÞG B .

Example

Let Gg contain the following productions, where S is the start symbol.

Gg is left-embedding, since NP ÞGg NP f, where f is a non-empty string. Moreover, NP is a left-embedding category, since S is its introducing category. Det is not a left-embedding category, since although Det ÞGg Det f, where f is a non-empty string, it is neither the start symbol, nor does it have an introducing category.

Right-embedding

If A ÞG f A with respect to G, where f Î V+, then G is right-embedding (or right-recursive). The definition of a right-embedding (right-recursive) category is similar to that of a left-embedding category.

Example

The grammar Gc above is right-embedding. Its right-embedding category is S.

Center-embedding

If A ÞG f A y with respect to G, where f, y Î V+, then G is center-embedding (or self-embedding or center-recursive). The definition of a center-embedding (self-embedding or center-recursive) category is similar to that of a left- or right-embedding category.

Example

The grammar Gf above is center-embedding. Its center-embedding category is S.

Embedding and strict context-free languages

A context-free language is strictly context free if and only if all of its grammars are center-embedding. If any context-free language is finite or has a left- or right-embedding grammar, it is also a regular language.