Sets and formal languages

LING 501, Fall 2004: Sets and formal languages

Sets and set membership

Sets are defined by their members. So if the set S has the members a, b, and c, and only these, then we write S = {a, b, c}, where the curly brackets and the comma can be considered "punctuation". Set members must be distinct objects. While it is not incorrect to represent S as {a, b, b, c}, for example, it is redundant, and equivalent to {a, b, c}.

To say that a is a member of the set S, we may write a Î S, where the relation Î can be read "is in", "is a member of", or "belongs to". The sign "Î" looks like the sign "e" for the empty string (especially in my handwriting), but they are distinct, just as the letters "f" and "t" are distinct, though they look similar. The expression d Ï S means that d is not a member of S.

If a set has exactly one member, it is called a singleton set, for example the set T = {a}. Finally, there is one, and only one set, with no members. It is called the empty set, and is symbolized Æ . Another way to write the empty set is {}. The empty set is distinct from the singleton set E whose one member is the empty string e.

Cardinality

The number of members of a set is its size, or cardinality. The cardinality of the empty set is 0, which we can express as C(Æ ) = 0. The cardinality of the set S above is 3. The cardinality of the set S* of every string over a finite vocabulary S is infinity, more precisely countable infinity, symbolized À₀ (read "aleph null" or "aleph zero"). The members of any countably infinite set can be put into one-to-one correspondence with the natural numbers N = {1, 2, ...}. We will talk more about this notion later in this course (or now, if you ask me nicely). One of the curious facts about countable infinity is that the cardinality of S* does not vary with the cardinality of S (as long as the latter is finite).

Subsets

If every member of the set R is also a member of the set S, then R is a subset of S, written R Í S. For example, let S be as above, and R = {a, b}. Then R Í S. If as in this example, some member of S is not a member of R, then R is a proper subset of S, written R Ì S. The empty set is a subset of every set, including itself, though of course it is not a proper subset of itself.

Singleton subsets of a given set are distinct from the members of that set. For example, a is a member of the set S = {a, b, c}, whereas the singleton set {a} is a subset of that set; that is a Î S and {a} Ì S.

Now consider the set U = {a, {a}}. Then {a} Î U and {a} Ì U. This is not a paradox, but merely a fact. Note also that {{a}} Ì U. {{a}} is a singleton set, whose member is a singleton set whose member is a.

Powersets

The set of all subsets of a given set is called its powerset. For example, the powerset of the set S = {a, b, c}, symbolized Ã(S), is {Æ, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}. C(Ã(S)) = 8, and in general if C(A) = n, then the cardinality of its powerset C(Ã(A)) = 2ⁿ.

This relation also holds for countably infinite sets, for example the set S* of strings over a finite vocabulary, whose cardinality is À₀. The cardinality of Ã(S*) = 2^À0, which the 19th century mathematician Georg Cantor showed is greater than countable infinity. He symbolized it as À₁ (read "aleph one"), and conjectured that it is the same as the cardinality of the real numbers (a conjecture known as the continuum hypothesis).

Formal languages

A formal language is any subset of S*. Therefore the set of all languages is Ã(S*), and the cardinality of that set is À₁. We can consider the set Ã(S*) to be the "universe of discourse" for formal language theory, since all and only all the members of Ã(S*) are languages.

Classes of formal languages

Since Ã(S*) is a set, we can form its powerset Ã(Ã(S*)), the set of all sets of languages. To help avoid confusion with languages as sets of strings, I will refer to sets of languages as classes of languages. (The term "class" is generally used in set theory to mean something different from "set", so my choice of terms can lead to other confusions, but I will persist.) The cardinality of the set Ã(Ã(S*)), the set of all classes of languages, is À₂, the cardinality of the powerset of the set of real numbers, if the continuum hypothesis is correct. One member of this extraordinarily huge set is, presumably, the class of natural languages. Another is the class of regular languages, as previously defined. Our problem as linguists is to discover the class of natural languages, out of the enormous range of possibilities that formal language theory presents.