Ling 501, Fall 2004: Regular languages

Revised September 8, 2004, notational changes in definition of regular grammar to make notation consistent with context-free grammar handout
Revised September 4, 2004, correcting the recursive definition of regular language.
Revised September 3, 2004, eliminating reference to the boundary symbol #.
Revised September 2, 2004 to include a link to a page illustrating the concatenation and union steps.

Recursive definition of the class of regular languages

The empty set Æ = {} is a regular language.

The singleton set consisting of the empty string {e} is a regular language.

Recall that Æ and {e} are distinct sets, hence distinct languages.

The singleton set {w} such that w Î S is a regular language.

If L and M are regular languages, then so is L ^ M = {st: s is in L and t is in M}

The concatenation of two regular languages is a regular language.

If L and M are regular languages, then so is L È M = {s: s is in L or s is in M}

The union (disjunction) of two regular languages is a regular language.

If L is a regular language, then so is L+ {sin: n > 0 and si is in L}.

The unbounded repetition of the members of a regular language is a regular language. Note that a different choice of si can be made on each repetition. If n is large enough, then the same si will recur.

Nothing else is a regular language. The class of regular languages over S is closed under concatenation, union and unbounded repetition.

Additional closure properties of regular languages

If L and M are regular languages, then so is L Ç M = {s: s is in L and s is in M}

The intersection (conjunction) of two regular languages is a regular language.

If L is a regular language, then so is Ø L = {s : s is in S* and s is not in L}.

The complement with respect to S* (negation) of a regular language is a regular language.

For an illustration of how concatenation and union work to build regular languages, click here.

Regular expressions

A regular expression is a compact notation for representing a particular regular language, using concatenation, union and unbounded repetition.

Notational conventions for regular expressions

The notations that we use for regular expressions include the following repetition operators.

The Kleene star operator is different from the "wildcard star", as in the DOS command delete *.*. Wildcard star means 'any string', including e, and so is equivalent to S*. Also the optional operator is different from the wildcard question mark, as in delete ??.???. Wildcard question mark means 'any single character', and so is equivalent to S.

Here is the notation for union.

We can take the Kleene plus and pipe operators as the basic notations, and define the Kleene star and optionality operators in terms of them as follows. However, we allow the use of all four operators in homework answers.

Scope of string operators

The scope of any string operator can be indicated explicitly by parentheses. Without parentheses, the scope of n, +, * and ? is the string to their immediate left, and the scope of | is the strings to its immediate left and right. Here are some examples.

A regular expression problem

Let R the language consisting of the set of rational numbers in binary notation.

Write a regular expression for R. (Hint: First formulate a set of rules for the "use" of the members of S. Then "translate" these into regular-expression notation.)

Some rules regarding the composition of members of R.

The minus sign:

The division sign:

Zero:

One:

Forming the regular expression

Given that that the pipe as a whole is parenthesized, it is not necessary to parenthesize its right-hand side, but no harm is done if you do. That is, the above expression is equivalent to: (0 | (-? 1 (0 | 1)*))

Question: what does the unparenthesized pipe expression 0 | -? 1 (0 | 1)* represent?

Regular grammars

A regular grammar is a 4-tuple {N, S, S, P}, where:

Derivations and generated languages

A derivation of the string s in a regular grammar RG is a sequence of lines starting with A0 and ending with s, where s Î V*. Each line in the derivation except the first is formed from the previous one by replacing a nonterminal symbol A by the right hand side of a rule whose left hand side is A. If s Î S*, the derivation is said to be terminated. The language generated by RG is the set L consisting of strings that have terminated derivations.

Example

Here is a regular grammar G for the regular language consisting of the rational numbers in binary notation. (The productions are numbered for ease of reference.)

G = {N, S, S, P}, where:

1

A ® 0

9

D ® 1 D

2

A ® 1

10

D ® / E

3

A ® 0 B

11

E ® 1 F

4

A ® 1 D

12

F ® 0

5

A ® - C

13

F ® 0 F

6

B ® / E

14

F ® 1

7

C ® 1 D

15

F ® 1 F

8

D ® 0 D

 

 

Derivation of the string -10/11:

start

A

5

- C

7

- 1 D

8

- 1 0 D

10

- 1 0 / E

11

- 1 0 / 1 F

14

- 1 0 / 1 1