Regular languages

Ling 501, Fall 2004: Regular languages

Revised September 8, 2004: notational changes in the definition of regular grammar, to make the notation consistent with the context-free grammar handout.

Revised September 4, 2004, correcting the recursive definition of regular language.

Revised September 3, 2004, eliminating reference to the boundary symbol #.

Revised September 2, 2004 to include a link to a page illustrating the concatenation and union steps.

Recursive definition of the class of regular languages

Base cases:

The empty set ∅ = {} is a regular language.

The singleton set consisting of the empty string {ε} is a regular language.

Recall that ∅ and {ε} are distinct sets, hence distinct languages.

The singleton set {w} such that w ∈ Σ is a regular language.

Recursive steps:

If L and M are regular languages, then so is L ^ M = {st: s is in L and t is in M}

The concatenation of two regular languages is a regular language.

If L and M are regular languages, then so is L ∪ M = {s: s is in L or s is in M}

The union (disjunction) of two regular languages is a regular language.

If L is a regular language, then so is L⁺ = {s₁s₂...s_n: n > 0 and each s_i is in L}.

The unbounded repetition of the members of a regular language is a regular language. Note that a different choice of s_i can be made on each repetition. If L is finite and n exceeds the number of members of L, then some s_i must recur.

Closure:

Nothing else is a regular language. The class of regular languages over Σ is closed under concatenation, union and unbounded repetition.
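The three recursive steps can be tried out on small finite languages. The following Python sketch is only an illustration: the function names are hypothetical, and because unbounded repetition yields an infinite language, the repetition step is cut off at an assumed bound max_n.

```python
from itertools import product

def concat(L, M):
    """Concatenation: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def union(L, M):
    """Union: strings that are in L or in M."""
    return L | M

def plus_up_to(L, max_n):
    """Finite approximation of unbounded repetition:
    concatenations of 1..max_n members of L, chosen freely each time."""
    result = set()
    for n in range(1, max_n + 1):
        for choice in product(L, repeat=n):
            result.add("".join(choice))
    return result

L = {"a", "b"}
M = {"c"}
print(sorted(concat(L, M)))      # ['ac', 'bc']
print(sorted(union(L, M)))       # ['a', 'b', 'c']
print(sorted(plus_up_to(L, 2)))  # ['a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The difference between the empty language and the language containing only the empty string shows up under concatenation: concat(set(), M) is empty, whereas concat({""}, M) is M itself.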

Additional closure properties of regular languages

If L and M are regular languages, then so is L ∩ M = {s: s is in L and s is in M}

The intersection (conjunction) of two regular languages is a regular language.

If L is a regular language, then so is ¬L = {s : s is in Σ* and s is not in L}.

The complement with respect to Σ* (negation) of a regular language is a regular language.
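On finite samples, intersection and complement can be illustrated in the same spirit. This is a sketch under stated assumptions: the alphabet {0, 1}, the helper names, and the length bound are all chosen for illustration, and the complement is computed only relative to the strings up to that bound, since the full complement is generally infinite.

```python
from itertools import product

SIGMA = ["0", "1"]  # assumed alphabet, for illustration only

def strings_up_to(sigma, max_len):
    """All strings over sigma of length 0..max_len (a finite slice of all strings)."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(sigma, repeat=n)}

def complement_up_to(L, sigma, max_len):
    """Strings of length <= max_len over sigma that are not in L."""
    return strings_up_to(sigma, max_len) - L

L = {"", "0", "00"}
M = {"0", "1"}
print(sorted(L & M))                          # ['0']
print(sorted(complement_up_to(L, SIGMA, 2)))  # ['01', '1', '10', '11']
```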

For an illustration of how concatenation and union work to build regular languages, click here.

Regular expressions

A regular expression is a compact notation for representing a particular regular language, using concatenation, union and unbounded repetition.

Notational conventions for regular expressions

The notations that we use for regular expressions include the following repetition operators.

"Kleene star": s* = {sⁿ : n ≥ 0} = s+ ∪ {ε} = {ε, s, s², ...}
"Kleene plus": s+ = {sⁿ : n > 0} = {s, s², ...}
"Optional": s? = {sⁿ : 0 ≤ n ≤ 1} = {ε, s}

The Kleene star operator is different from the "wildcard star", as in the DOS command delete *.*. Wildcard star means 'any string', including ε, and so is equivalent to Σ*. Also the optional operator is different from the wildcard question mark, as in delete ??.???. Wildcard question mark means 'any single character', and so is equivalent to Σ.
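The contrast between the two kinds of star and question mark can be checked directly. The sketch below uses Python's fnmatch module, which implements Unix shell-style wildcards and serves here only as a stand-in for DOS wildcards, alongside the re module for regular expressions.

```python
import re
from fnmatch import fnmatchcase

# Wildcard star: any string at all, including the empty string.
print(fnmatchcase("", "*"), fnmatchcase("abc", "*"))   # True True

# Kleene star: zero or more repetitions of what precedes it.
print(bool(re.fullmatch(r"a*", "aaa")),
      bool(re.fullmatch(r"a*", "abc")))                # True False

# Wildcard question mark: exactly one character, whatever it is.
print(fnmatchcase("a", "?"), fnmatchcase("", "?"))     # True False

# Optional operator: the preceding symbol zero times or once.
print(bool(re.fullmatch(r"a?", "")),
      bool(re.fullmatch(r"a?", "b")))                  # True False
```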

Here is the notation for union.

"Pipe": s | t = {x : x = s or x = t} = {s, t}

We can take the Kleene plus and pipe operators as the basic notations, and define the Kleene star and optionality operators in terms of them as follows. However, we allow the use of all four operators in homework answers.

s* = s+ | ε.

s? = s | ε.
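These two definitions can be spot-checked with Python's re module, in which an empty alternative plays the role of the empty string. The helper name and the sample strings are assumptions made for illustration, and agreement on a handful of samples is a sanity check, not a proof of equivalence.

```python
import re

def agree_on(p, q, samples):
    """True if patterns p and q accept exactly the same strings among samples."""
    return all(
        (re.fullmatch(p, w) is not None) == (re.fullmatch(q, w) is not None)
        for w in samples
    )

samples = ["", "s", "ss", "sss", "t", "st"]

# s* written as "s+ or the empty string"
print(agree_on(r"s*", r"(?:s+|)", samples))  # True

# s? written as "s or the empty string"
print(agree_on(r"s?", r"(?:s|)", samples))   # True
```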

Scope of string operators

The scope of any string operator can be indicated explicitly by parentheses. Without parentheses, the scope of ⁿ, +, * and ? is the single symbol to their immediate left, and the scope of | is the entire strings to its immediate left and right. Here are some examples.

st² = stt

(st)² = stst

st+ = {st, st², ...}
(st)+ = {st, (st)², ...}

st* = {s, st, st², ...}
(st)* = {ε, st, (st)², ...}

st? = {s, st}
(st)? = {ε, st}

s | t = (s | t) = {s, t}

s* | t* = {ε, s, t, s², t², ...}
(s | t)* = {ε, s, t, s², st, ts, t², ...}
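Python's re module follows the same scope conventions, so the examples above can be reproduced by filtering a list of candidate strings. The helper name and the candidate list are hypothetical and chosen only to make the differences visible.

```python
import re

def members(pattern, candidates):
    """Candidates that the pattern matches in full, in their original order."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

candidates = ["", "s", "t", "st", "stt", "stst", "ss", "ts"]

print(members(r"st+", candidates))     # ['st', 'stt']
print(members(r"(st)+", candidates))   # ['st', 'stst']
print(members(r"st*", candidates))     # ['s', 'st', 'stt']
print(members(r"(st)*", candidates))   # ['', 'st', 'stst']
print(members(r"s*|t*", candidates))   # ['', 's', 't', 'ss']
print(members(r"(s|t)*", candidates) == candidates)  # True
```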

A regular expression problem

Let R be the language consisting of the set of rational numbers in binary notation.

The vocabulary Σ of R is {-, /, 0, 1}.
Some well-formed expressions: {0, 1, 10, 11, 100, -1, -10, 1/10, 101/10, 0/1, -1/100, -11/10}.
Some ill-formed expressions: {ε, -, -0, 1-, 1-1, /, /1, 1/, 0/0, 100/0, 1/-1, 01, 00001, -01, 01/1, 1/01, 1/1/1}.

Write a regular expression for R. (Hint: First formulate a set of rules for the "use" of the members of Σ. Then "translate" these into regular-expression notation.)

Some rules regarding the composition of members of R.

The minus sign:

Can occur at most once.
Can only occur at the beginning of the string.

The division sign:

Can occur at most once.
Can only occur between two numerals (call the first one the 'numerator', the second one the 'denominator').

Zero:

Can occur as the only symbol in a string, or as the only symbol in the numerator. Otherwise it cannot occur at the beginning of a numeral (no 'leading zeroes').
Cannot occur by itself as the denominator.

One:

Must occur as the first symbol in a numeral other than '0'.
Following an initial '1', zeroes and ones can occur in any combination.

Forming the regular expression

Start with: (0 | -? 1 (0 | 1)*) [for whole numbers or numerators of fractions]

Given that the pipe as a whole is parenthesized, it is not necessary to parenthesize its right-hand side, but no harm is done if you do. That is, the above expression is equivalent to: (0 | (-? 1 (0 | 1)*))

Question: what does the unparenthesized pipe expression 0 | -? 1 (0 | 1)* represent?

Continue with: (/ 1 (0 | 1)*)? [for denominators]
The entire regular expression is: (0 | -? 1 (0 | 1)*) (/ 1 (0 | 1)*)?
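The finished expression can be tested against the well-formed and ill-formed examples listed in the problem statement. The sketch below transcribes it into Python's re syntax with the spaces removed; the names R and in_R are hypothetical.

```python
import re

# The handout's expression, with the spaces removed.
R = re.compile(r"(0|-?1(0|1)*)(/1(0|1)*)?")

def in_R(s):
    """True if the whole string s is matched by the expression."""
    return R.fullmatch(s) is not None

well_formed = ["0", "1", "10", "11", "100", "-1", "-10",
               "1/10", "101/10", "0/1", "-1/100", "-11/10"]
ill_formed = ["", "-", "-0", "1-", "1-1", "/", "/1", "1/", "0/0",
              "100/0", "1/-1", "01", "00001", "-01", "01/1", "1/01", "1/1/1"]

print(all(in_R(s) for s in well_formed))  # True
print(any(in_R(s) for s in ill_formed))   # False
```

Using fullmatch rather than search matters here: the expression has to account for the entire string, not just some substring of it.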

Regular grammars

A regular grammar is a 4-tuple {N, Σ, S, P}, where:

N is a finite nonempty set of nonterminal symbols (categories) {A₁, ..., A_n};

Σ is a finite nonempty set of terminal symbols (words) {a₁, ..., a_m}, where N ∩ Σ = ∅ and N ∪ Σ = V;
S ∈ N is the start (axiom) symbol;

P is a finite set of productions (rules) of the form A → x B or A → x (right-linear), or of the form A → B x or A → x (left-linear), where A, B ∈ N and x ∈ Σ*. All the productions of a given grammar must be of the same kind, since mixing right-linear and left-linear productions can generate languages that are not regular.

Derivations and generated languages

A derivation of the string s in a regular grammar RG is a sequence of lines starting with the start symbol S and ending with s, where s ∈ V*. Each line in the derivation except the first is formed from the previous one by replacing a nonterminal symbol A by the right hand side of a rule whose left hand side is A. If s ∈ Σ*, the derivation is said to be terminated. The language generated by RG is the set L consisting of strings that have terminated derivations.

Example

Here is a regular grammar G for the regular language consisting of the rational numbers in binary notation. (The productions are numbered for ease of reference.)

G = {N, Σ, S, P}, where:

N = {A, B, C, D, E, F}

Σ = {0, 1, -, /}
S = A

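1	A → 0	11	E → 1 F
2	A → 1	12	F → 0
3	A → 0 B	13	F → 0 F
4	A → 1 D	14	F → 1
5	A → - C	15	F → 1 F
6	B → / E	16	C → 1
7	C → 1 D	17	D → 0
8	D → 0 D	18	D → 1
9	D → 1 D	19	E → 1
10	D → / E

Productions 16-19 let a numeral end: without them D could only terminate by way of a division sign, and E could only yield denominators of two or more digits, so that strings such as 10, -1 and 0/1 would have no terminated derivation.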

Derivation of the string -10/11:

start	A
5	- C
7	- 1 D
8	- 1 0 D
10	- 1 0 / E
11	- 1 0 / 1 F
14	- 1 0 / 1 1
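The derivation above can be replayed mechanically. The following Python sketch lists only the productions that this derivation uses, keyed by their numbers in the grammar; the representation of a line as a list of symbols and the function name derive are assumptions made for illustration.

```python
# Only the productions used in the derivation of -10/11, keyed by number.
PRODUCTIONS = {
    5: ("A", ["-", "C"]),
    7: ("C", ["1", "D"]),
    8: ("D", ["0", "D"]),
    10: ("D", ["/", "E"]),
    11: ("E", ["1", "F"]),
    14: ("F", ["1"]),
}
NONTERMINALS = {"A", "B", "C", "D", "E", "F"}

def derive(start, rule_numbers):
    """Apply the numbered productions in order; return every line of the derivation."""
    line = [start]
    lines = [line]
    for k in rule_numbers:
        lhs, rhs = PRODUCTIONS[k]
        i = line.index(lhs)  # raises ValueError if the rule does not apply
        line = line[:i] + rhs + line[i + 1:]
        lines.append(line)
    return lines

steps = derive("A", [5, 7, 8, 10, 11, 14])
for line in steps:
    print(" ".join(line))

final = steps[-1]
terminated = not any(sym in NONTERMINALS for sym in final)
print("".join(final), terminated)  # -10/11 True
```

Because every production in a right-linear grammar has at most one nonterminal on its right-hand side, each line contains at most one nonterminal, so there is never a choice about which symbol to rewrite.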