LING 501, Fall 2004: The Chomsky hierarchy

Introduction

Much of the material in this handout is technical, and I don't expect you to know how to prove any of the results mentioned here. We will be focusing on the linguistic significance of the material, and it is that that I want you to try to understand.

Classes of languages

Recall that a (formal) language L is any subset of S*, where S is a finite vocabulary of symbols. The powerset à of S* is the set of all subsets of S*, i.e. set of all languages over S. A class CL of languages is any subset of Ã(S*), or member of Ã(Ã(S*)). We consider here the Chomsky hierarchy of classes (or families) of languages, which were originally defined by the form of the rules needed to generate the languages in those classes, but which can also be characterized at least in part by "dependencies" between elements that appear in the strings that make up the languages of those classes.

Regular languages

The smallest infinite class of languages in the Chomsky hierarchy is the class RL of regular languages. These are the languages that can be represented by regular expressions.

Serial dependencies in regular languages

Let R1 be the regular language represented by the regular expression (ab*cb*)+. In any string r of R1 there is at least one dependency between an a and a c, which we can describe in this way: Following any occurrence of a in r, a c must occur, and between that a and that c only bs can occur. An example is a1b4c1ba2b9c2b15a3c3b6, where I have indexed (using subscripts) the three occurrences of a and c. In this example, there is a dependency between a1 and c1, between a2 and c2, and between a3 and c3, and no others, as shown in Figure 1. Moreover, these dependencies do not intersect or overlap; they are serial; each one ends before the next one begins. If there is no upper limit on the number of occurrences of a particular type of dependency in a language, we call it unbounded. The serial dependency just described is unbounded.

a1b4c1ba2b9c2b15a3c3b6
│   │ │   │   │  │
└───┘ └───┘   └──┘

Figure 1. Serial dependencies in a1b4c1ba2b9c2b15a3c3b6

Notice that the distance between any two dependent elements in R1 can be arbitrarily large or small (e.g., between a2 and c2, there are 9 bs, but between a3 and c3, there are no bs). That doesn't matter as far as unbounded serial dependencies are concerned. What does matter is that they don't intersect or overlap, and that there can be any number of them. Every unbounded dependency in a regular language is serial. On the other hand, a finite language has no unbounded dependencies of any sort.

Unbounded serial dependencies in natural language

Let R2 be the regular language (he thinks | they think)* I stink). The dependencies in R2 between he and thinks and between they and think are serial and unbounded, so if R2 is part of English, then some set of English sentences exhibits unbounded serial dependencies.

Pumping lemma for regular languages (PLRL)

PLRL: Let R Î RL. Then there is a constant n such that for any z in R of length n, z = uvw, where v is of length 1, uv is of length n, and every string of the form uv*w is in R.

The primary use for PLRL is to prove that certain languages are not regular. Here I show how to use it to prove that the palindrome language P ={ocô : o Î (a | b)* and ô = x backwards} = {c, aca, bcb, a2ca2, abcba, bacab, b2cb2, ...} Ï RL. Proof: Suppose P Î RL. Then pick z = uvw of the PLRL. Clearly c cannot be part of v, because then it would not occur at all in uw, and would occur twice in uv2w, contrary to the definition of P. But c cannot be part of u or w either. Suppose it is part of u. Then u = xcy, and yw would have to be the mirror image of x. But then yvw would not be the mirror image of x. Therefore it cannot be part of u. A similar argument shows it cannot be part of w. Therefore c cannot be part of u, v, or w. But this is a contradiction since c is part of z = uvw. Therefore P Ï RL.

Context-free languages

The next larger class of languages in the Chomsky hierarchy is the class CFL of context-free languages. Every regular language is also a context-free language, but the converse is not true. The language P above is a context-free language, but is not a regular language. We call the class SCFL of context-free languages that are not regular languages strictly context-free languages (i.e. SCFL = CFL - RL). Unlike regular languages which can be expressed by regular expressions, there is no class of "expressions" for context-free languages. They are defined only by the class of grammars that generate them. These grammars are presented in the context-free languages handout.

Nesting dependencies in strictly context-free languages

In the string a1a2b1a3b2cb2*a3*b1*a2*a1*, I exhibit a member of the SCF palindrome language P, indexing the occurrences of as and bs. Clearly there are dependencies between each ai and its counterpart ai*; similarly for each bj and bj*; if one occurs, then so must the other. Figure 2 (with subscripted *s omitted) shows that these dependencies nest; that is, they start in a particular order and end in the opposite order ("first-start, last-end"). If there is no bound on the number of dependencies of a given type that nest, we say that that type is unbounded. Clearly, in P, there is an unbounded nesting dependency. Every SCFL has at least one unbounded nesting dependency, and may have unbounded serial dependencies as well.

a1a2b1a3b2 c b2a3b1a2a1
│ │ │ │ │  │ │ │ │ │
│ │ │ │ └──┘ │ │ │ │
│ │ │ └──────┘ │ │ │
│ │ └──────────┘ │ │
│ └──────────────┘ │
└──────────────────┘

Figure 2. Nesting dependencies in a1a2b1a3b2cb2*a3*b1*a2*a1*

Nesting dependencies in natural languages

Clearly there are nesting dependencies in natural languages, as in the English sentence the cats that the dog chases howl, in which the dependency between the dog and chases nests within the dependency between the cats and howl. (The strings *the cats that the dog chase howl, *the cats that the dog chases howls, and *the cats that the dog chase howls are all ungrammatical in English.) The question then arises whether natural languages have unbounded nesting dependencies. The answer depends on judgments like whether the string the cats that the dog that the horse that the zebras admire frightens chases howl is grammatical, whereas the strings in which at least one of the following replacements is made are ungrammatical: admires for admire, frighten for frightens, chase for chases, and howls for howl.

Pumping lemma for context-free languages (PLCFL)

PLCFL: Let F Î CFL. Then there is a constant n such that for any z in F of length n or greater, z = uvwxy, where vx is of length 1 or greater, vwx is of length n or less, and every string of the form uvnwxny (n 0) is in F. Proof of PLCFL is in several textbooks on formal language theory, such as Hopcroft & Ullman (1979).

Again the major utility of the PLCFL is to show that certain languages are not context free. An example is the language I = aibici = {e, abc, a2b2c2, a3b3c3, ...}. Suppose I Î CFL. Let z =uvwxy = anbncn, where 3n is the constant of PLCFL. Then vwx cannot contain both as and cs, because the string uv2wx2y would contain more as and cs than bs. Moreover, v and x cannot contain as only, since then uwy would contain fewer as than bs and cs, contrary to the assumption that uwy Î I. All other cases are similarly disposed of. Therefore I Ï CFL.

Intersection with regular languages

Context-free languages are closed under intersection with regular languages. Proof of this theorem is also in Hopcroft & Ullman (1979). That is, if F Î CFL and R Î RL, the language F' = F Ç R Î CFL. This result can also be used together with PLCFL to show that certain languages are not context free, for example the language C = {xx : x Î (a | b)+} = {aa, bb, a2a2, abab, baba, b2b2, a3a3, a2ba2b, ...}. Suppose C Î CFL. Let R = a*b*a*b*. Clearly R Î RL. Then C Ç R = J = {aibjaibj : i or j > 0}, which is in CFL if and only if C is. However, from PLCFL, J Ï CFL. Therefore C Ï CFL.

Intersection of context-free languages

Context-free languages are not closed under intersection. For example, let I1 = aibicj and I2 = ajbici. Both I1 and I2 Î CFL. However I1 Ç I2 = I, and I Ï CFL.

Recursive, or context-sensitive languages

The next larger class of languages in the Chomsky hierarchy is the class CSL of recursive languages, often called context-sensitive languages. These are the languages for which there is a decision procedure for determining whether an arbitrary string does or does not belong to the language. CFL Ì CSL. For example, the languages C and I described above are in CSL but not in CFL. CSL is closed under intersection with RL, just as CFL is.

Indexed languages

Linguists are not interested in the class CSL as a whole, but only with a proper subclass IXL that includes C, I and J together with the CFLs, called indexed (or mildly context-sensitive) languages. We may talk more about this class of languages (which was discovered after the Chomsky hierarchy was established) and grammars for generating them later on in this course. The class SIXL of strictly indexed languages is IXL - CFL.

Crossing dependencies in strictly indexed languages

In a1a2b1a3b2a1*a2*b1*a3*b2*, I exhibit a member of the strictly indexed language C, indexing the occurrences of as and bs. Clearly there are dependencies between each ai and its counterpart ai*; similarly for each bj and bj*; if one occurs, then so must the other. Figure 3 (with subscripted *s omitted) shows that these dependencies cross. Each one begins before any of them end, and they end in the same order that they begin ("first-start, first-end"). If there is no bound on the number of dependencies of a given type that cross, we say that that type is unbounded. Clearly, in C, there is an unbounded crossing dependency. Every SIXL has at least one unbounded crossing dependency, and may also have unbounded nesting and serial dependencies.

a1a2b1a3b2 a1a2b1a3b2
│ │ │ │ │ │ │ │ │ │
└─┼─┼─┼─┼─┘ │ │ │ │
  └─┼─┼─┼───┘ │ │ │
    └─┼─┼─────┘ │ │
      └─┼───────┘ │
        └─────────┘

Figure 3. Crossing dependencies in a1a2b1a3b2a1*a2*b1*a3*b2*

Crossing dependencies in natural languages

If unbounded reduplication is grammatical in some language, then the reduplicated strings would show unbounded crossing dependencies. In almost all cases, reduplication is bounded, so that the crossing dependencies they give rise to are bounded. However, Culy (1985) has argued that Bambara (a Northwestern Mande language spoken in Mali) exhibits a kind of unbounded reduplication, and the A-not-A construction of Chinese may be another instance.

Recursively enumerable languages

The largest computable class of languages is the class REL of recursively enumerable languages. These are the languages for which there is a decision procedure for determining for each string in the language that it is in the language, but not necessarily for determining for each string not in the language that it is not. CSL Ì REL. The class SREL of strictly recursively enumerable languages REL - CSL is the class of undecidable languages. The study of undecidable languages is an interesting branch of mathematics, but not directly relevant to linguistics.