Features

Lexicon

  1. Lexicon:

    Predicates that must be supplied in a lexicon.

  2. Lexical Features:

    Obligatory and optional lexical features.

  3. Parse PF:

    The ParsePF system and predicates for taking apart and combining words.


Contents

Obligatory:
lexicon/3
probeLexicon/1
term
contraction/{3,4}
blockContraction
superClass
Optional:
lexicon/4
probeLexicon/2
define_contraction_defaults
contraction_default
define_lex_forms
ending
lexForms
lexTemplate
rootEnding
agr_feature(*)

(*) Not actually described here. Reference is to predicate agreeAGR/4 from constituent feature access.


Obligatory Components

lexicon(W,C,L) Word W, category label C, and a simple list of features L.

This is the sole interface to the lexicon for word lookup. The implementation of lexicon/3 is left up to the user. This allows for much flexibility in how the lexicon may be organized, for example, with respect to inflected forms.

However, note that there are no mode restrictions. This means that it should not be assumed that the word itself or any other parameter will necessarily be supplied upon lookup. In other words, the implemented predicate must be able to function as an enumerator (and, of course, terminate) when called with one or more uninstantiated parameters. For example, here are some of the many possible modes of usage:

Call Comment
lexicon(+W,-C,-L) (WordMatch)
lexicon(+W,+C,-L) (evalExpand for X=word)

Also, we assume that in the case of ambiguity the predicate will supply the possible matches one at a time (i.e. on backtracking).

Sample Implementation:

Here is a simple implementation of lexicon/3 as used in a small English lexicon supplied with PAPPI:

lexicon(Word,C,Fs) :- lex(Word,C,Fs).	% directly available
lexicon(Form,v,Fs) :-			% non-base verb forms
	lex(Form,v,Base,F1),
	verbFeatures(Base,F2),
	append1(F1,F2,Fs).

verbFeatures(Base,F) :-
	lex(Base,v,F1),
	pick1(morph(Base,_),F1,F).
Here, words are either stored directly as lex/3 facts, or, in the case of inflected verbs, as pairs with the inflected forms in lex/4 and the base (or infinitival) form in lex/3. This pairing scheme is used to avoid unnecessary replication of information. For example, here is the entry for the verb eat:
%% lex/3
lex(eat,v,[morph(eat,[]),grid([agent],[[patient]])]).

%% lex/4
lex(eating,v,eat,[morph(eat,ing)]).
lex(eats,v,eat,[morph(eat,s)]).
lex(ate,v,eat,[morph(eat,ed(1))]).
lex(eaten,v,eat,[morph(eat,ed(2))]).
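The pairing scheme above can be illustrated with a small Python sketch (PAPPI itself is written in Prolog; the table names and data here are illustrative, not part of PAPPI). Inflected forms inherit the base verb's features, with the base's own morph/2 feature replaced by the inflected one:

```python
# lex/3: base entries keyed by (word, category); lex/4: inflected forms
# paired with their base form. Features are rendered as plain strings.
LEX3 = {("eat", "v"): ["morph(eat,[])", "grid([agent],[[patient]])"]}
LEX4 = {("eating", "v"): ("eat", ["morph(eat,ing)"]),
        ("eats", "v"): ("eat", ["morph(eat,s)"])}

def lexicon(word, cat):
    """Enumerate feature lists for word/cat, one match at a time."""
    if (word, cat) in LEX3:                      # directly available
        yield LEX3[(word, cat)]
    if (word, cat) in LEX4:                      # non-base verb forms
        base, f1 = LEX4[(word, cat)]
        # verbFeatures: base features minus the base's own morph/2 entry
        f2 = [f for f in LEX3[(base, cat)] if not f.startswith("morph(")]
        yield f1 + f2                            # append1(F1,F2,Fs)
```

As in the Prolog version, looking up eating yields the inflected morph feature followed by the base verb's remaining features.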

Referenced by: X-bar rules compiler (compile-time) / GLR machine builder (compile-time) / Expand Contractions (run-time)

References:


probeLexicon(+Atom) Holds if atom Atom is present in the lexicon.

This is used by the contraction mechanism for output items of the form Atom=word to check that Atom can be found in the lexicon. For efficiency, the predicate should be defined to be a deterministic lookup procedure.

Example:

In the English lexicon, words are encoded using lex/3 (base forms) or lex/4 (fully inflected forms). Hence the following definition:

% deterministic
probeLexicon(Word) :- ( lex(Word,_,_) -> true ; lex(Word,_,_,_), ! ).
That is, as far as probeLexicon is concerned, it doesn't matter whether Word occurs once or many times in the lexicon.
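A minimal Python sketch of the same idea (table names and data are assumptions, not part of PAPPI): the probe answers at most once, regardless of how many entries mention the word.

```python
# Assumed tables: base entries as (word, cat), inflected as (word, cat, base).
LEX3 = {("eat", "v"), ("actor", "n")}
LEX4 = {("eats", "v", "eat"), ("ate", "v", "eat")}

def probe_lexicon(word):
    """Deterministic lookup: succeeds once, however often word occurs."""
    return any(entry[0] == word for entry in LEX3 | LEX4)
```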

Reference: contraction


term(C) Holds if C is a terminal symbol (or category label) present in the lexicon.

Here is a sample definition:

term(n).     term(v).    term(a).    term(p). 
term(c). 
term(adv).   term(det).
term(neg).   term(mrkr). term('$').
$ is a dummy terminal symbol (used by the GLR machine) that must be present in every lexicon. It is purely a dummy since there will be no lexical entries of category $.

Note that C is not restricted to heads and the usual non-projecting terminal symbols. For example, in a theory of noun phrases based on DPs (Determiner Phrases), we might choose to exercise the option of lexical insertion at a phrasal level as well:

term(dp). term(np).
term(n).  term(d).
...
lex(one,np,[count(+),agr([3,sg,[]]),a(-),p(-)]).
...
%% Common nouns
lex(actor,n,[a(-),p(-),count(+),agr([3,sg,m]),vow]).

%% Proper nouns
lex(bill,dp,[a(-),p(-),agr([3,sg,m])]).

Referenced by: X-bar rules compiler (compile-time) / GLR machine builder (compile-time) / ParsePF (run-time)

References:


contraction(+K,+W,+L)
contraction(+K,+W1,+W2,+L)
Contractions expand an input word W, or two adjacent words W1 W2, into a list of output words L.
Contractions are grouped into classes, denoted by K. Classes may be used to define a context in which a contraction may fire. Any contraction belonging to the null class, [], may fire at any time.

Every lexicon must contain declarations for contraction/3 and contraction/4. If there are no entries, the lexicon should contain the following two lines:

no contraction(_,_,_).
no contraction(_,_,_,_).
For more details on how the contraction mechanism operates, see the description of ParsePF.

We now proceed to describe the pattern matching options for the word to be expanded, i.e. W for contraction/3, and W1 and W2 for contraction/4.

Word Match Patterns

The following patterns are available for both contraction/3 and contraction/4.

Given a word W in the input:

  1. -V V is a variable. Succeeds with V bound to W.

    For example, in the French lexicon, W = l'homme successfully matches the following rule with X set to l and Y to homme.

    contraction([],X,''''+Y,[X+[e,a]=word,Y=word]).
    
  2. +Atom Atom is a simple atom. Succeeds if Atom is identical to W.

    For example, W = 'd successfully matches 'd and triggers the rule:

    contraction([],'''d',[would]).
    
  3. -V+suffix V is a variable. suffix is an atom. Succeeds if suffix is a non-empty suffix of W. V is bound to the remainder of W.

    For example, in the Turkish lexicon, W = elmayı successfully matches X+yı with X bound to elma using the rule:

    contraction(case,X+yı,[X=pf([block(cop),block(case)]),acc]).
    
  4. -V+double(-C,+L) V and C are variables. L is a list of atoms representing single characters. Succeeds if W ends with a doubled character C chosen from L.

    For example, in the Hungarian lexicon, we have a rule for a doubled final consonant:

    contraction(doubledFC,X+double(C,[m,l,t,r]),
    	    [X+C=pf([allowOnly(poss)]),lengthenedFC]).
    
  5. -V+single(-C,+L) V and C are variables. L is a list of atoms representing single characters. Succeeds if W ends in a character C chosen from L.

    Example:

    contraction(past,X+single(C,[t,d])+ett,[X+C=pf([allowOnly(caus)]),pst]).
    
  6. -V$+L V is a variable. L contains a list of lexical features. Succeeds if W contains all the features in L. Note W must be a word in the lexicon.

    For example, here is a rule in the Hungarian lexicon for forcing az to be analyzed as a demonstrative, as opposed to being a determiner, when it precedes a non-vowel-initial word.

    contraction([],az,Y$[not(vow(+))],[az$[dem],Y]).
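The word match patterns above can be illustrated with a small Python sketch (a simplification for exposition only; the real matcher operates on Prolog atoms, and the non-empty-remainder requirement on the suffix pattern is an assumption):

```python
def match_suffix(w, suffix):
    """-V+suffix: succeed if suffix is a non-empty suffix of w, binding V
    to the remainder (assumed here to be non-empty as well)."""
    if suffix and w.endswith(suffix) and len(w) > len(suffix):
        return w[:-len(suffix)]
    return None

def match_double(w, letters):
    """-V+double(-C,+L): w ends with a doubled character C drawn from letters."""
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] in letters:
        return w[:-2], w[-1]
    return None

def match_single(w, letters):
    """-V+single(-C,+L): w ends in a single character C drawn from letters."""
    if w and w[-1] in letters:
        return w[:-1], w[-1]
    return None
```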
    

We now move on to describe the possible output patterns for the contraction rules. In general, the output pattern is a (possibly empty) list. Each list item must conform to one of the following forms:

Output Patterns Items

  1. +Atom Atom is a simple atom.

    For example, here the output of the rule for can+'t is a pair of atoms, can followed by neg:

    contraction([],can,'''t',[can,neg]).
    
  2. +Atom$[F1,..,Fn] Atom must have features [F1,..,Fn]. Atom is a simple atom. Fi are features.

    Feature values can be used to select particular lexicon items or instantiate feature values in generic elements. Here is an example of the latter from the Japanese lexicon:

    contraction(u,X+tta,[X=morph(_,u),past$[suffix(tta,a4c3a4bf)]]).
    
    The second element in the output, namely past, is a generic past tense suffix. The morphological form of this suffix will vary according to the verb stem. In this case, it will take the form -tta when preceded by a verb of the -u-class. The morphological form is encoded using the feature suffix. Hence, this will force the suffix feature for [past] to unify with suffix(tta,a4c3a4bf).

  3. +Atom$$[F1,..,Fn] Feature Hopping: features [F1,..,Fn] will attach and apply to the item immediately following Atom. Atom is a simple atom. Fi are features.

    Note: the item immediately following Atom, call it Next, must arise from the same original input word. That is, Atom and Next must share the same derivation sequence of contractions. If Atom happens to be the last or rightmost output word, the rule containing the feature hop will fail.

    This output pattern is designed for use when Atom affects the appearance of or applies some constraint to the next morpheme.

    For instance, here are two verb stemming rules from the Japanese lexicon for verbs that end in -u and -ku, respectively:

    contraction(vNStem,X+wa,[X=word(base(u))$$[prefix(wa,a4ef)]]).
    contraction(vNStem,X+ka,[X=word(base(ku(_)))$$[prefix(ka,a4ab)]]).
    
    The first rule is used, for example, in conjunction with the base entry for the verb kau (buy):
    lexicon(ka,v,base(u),[index(_),morph(kau,base(u)),
    		      grid([agent],[theme]),k(c7e3),eng(buy)]).
    
    and:
    contraction(vEnd,X+nai,[X=pf([require(vNStem)]),negnpast]).
    
    to correctly decompose kawanai as the negative non-past form of kau. Basically, the vNStem rule says that the negative stem for kau is kawa.

    The feature prefix(wa,a4ef) hops onto the following item, namely negnpast, derived earlier (via vEnd) from the negative non-past ending -nai. In other words, the prefix -wa effectively serves as a bridge or linking mora for combining -u verbs with negative endings.

    Note that feature hopping is local in the sense that prefix/2 hops onto an element that is local to ka+wa+nai. In particular, -wa cannot be used to bridge a word that follows this original input word.

  4. -X=word Restricts X to be a word in the lexicon. Note, X is typically a variable that (must) occur in the input pattern.

    (Note: the lexicon will be accessed twice for X: once using the predicate probeLexicon during contraction processing, and a second time using lexicon/3 to pick up the lexical features.)

    For example, here is an (over-productive) rule from the French lexicon for transforming vowel contractions like l'homme into le homme:

    contraction([],X,''''+Y,[X+[e,a]=word,Y=word]).
    
    Here, X and Y must refer to variables in the input pattern. Note also, unlike in the case of Y=word, the left side of the output pattern X+[e,a]=word is not a simple variable. In general, the left side of an equative output item can be one of the following:

    LHS          Explanation
    -V           V is a variable that must be instantiated to an atom when the input pattern is matched.
    +Atom        Atom is a simple atom.
    [X1,..,Xn]   A disjunctive list of items. Each Xi must be either a variable or a simple atom.
    X+Y          X is concatenated with Y. X and Y can be any of the possible forms for the LHS provided they evaluate to simple atoms after input pattern matching.

  5. -X=word(+Form) Restricts X to be a word with Form in the lexicon. Note, X is typically a variable that (must) occur in the input pattern.

    This output form behaves identically to the standard output form X=word except with respect to word lookup. Lookup is performed via probeLexicon(Word,Form) and lexicon(Word,C,Form,Fs). These two predicates must be defined in the lexicon only if this output form is used.

    For example, this is used in the Japanese implementation to separate verb base forms from other lexical entries. For instance:

    contraction(vStem4,X+i,[X=word(base(ku(1)))$$[prefix(i,a4a4)]]).
    contraction(vStem4,X+t,[X=word(base(ku(2)))$$[prefix(t,a4c3)]]).
    
    Here, X will be looked up with Form as base(ku(1)) or base(ku(2)) depending on whether X is immediately followed by -i or -t. These rules are used in conjunction with:
    contraction(vEnd,X+ta,[X=pf([require(vStem4)]),past]).
    
    to decompose forms such as kaita (buy+past) and itta (go+past) into ka+i+ta and i+t+ta, respectively.

    The base forms for verbs kaku (write) and iku (go) should be defined using lexicon/4 as follows:

    lexicon(ka,v,base(ku(1)),[index(_),morph(kaku,base(ku(1))),
    			  grid([agent],[theme]),k(bdf1),eng(write)]).
    lexicon(i,v,base(ku(2)),[index(_),morph(iku,base(ku(2))),grid([agent],[]),
    			 allowExt(goal),k(b9d4),eng(go)]).
    
    We can then define probeLexicon/2 (deterministic) as:
    probeLexicon(Word,X) :- lexicon(Word,v,X,_), !.
    

    References: probeLexicon/2 / lexicon/4

  6. X=pf(R) X must be either a lexical item or contain further contractions. X, typically an input pattern variable, is one of the possible forms described above for the left side of the equative expression. pf(R) indicates that either:
    1. X can be found in the lexicon, i.e. X=pf(R) is equivalent to X=word, or:
    2. X is subject to further rounds of contraction processing. R is a list of class restrictions (possibly null) that applies to all further rounds.
    (Class restrictions will be explained shortly.)

    For example, the following rule in the Japanese lexicon strips off the past tense verb ending -ta and looks for either a verb stem, as in the case of mi+ta, or further contraction processing, as in the case of the passive form mi-rare-ta:

    contraction(vEnding,X+ta,[X=pf([block(vEnding)]),past]).
    

    Here, we have a restriction [block(vEnding)] which means that if X is subject to further contraction processing, rules of the class vEnding are blocked from applying. In particular, this encodes the fact that a single verb cannot be doubly marked with respect to tense.

    In general, R in pf(R) is a (possibly empty) list of class restrictions, each of which may be of the following form:

    Restriction Explanation
    block(C) Rules of class C are blocked from applying. In general, once a class is blocked, the block remains in operation for all further cycles of contraction processing for the word in question.
    blockOnly(C) Only rules of class C are blocked from further application. All other (possibly blocked) rules are activated.
    allow(C) Activates rules of class C. Use this to cancel a previous block.
    allowOnly(C) Only class C is allowed to apply next. That is, it deactivates all classes except for C. Note, this may not be used in conjunction with block.
    require(C)
    require(Cs)
    Forces a contraction rule of class C or one from the list of classes Cs to apply next. In particular, the (normally always present) option of not applying a contraction rule, i.e. of directly looking the left side up in the lexicon, is temporarily suppressed in this case. However, the null class ([]) is exempt from this rule.

  7. X=+F X must be a lexical item with feature F. X, typically an input pattern variable, is one of the possible forms described above for the left side of the equative expression.

    The "progressive" or habitual form of iru in Japanese:

    contraction([],X+iru,[X=morph(_,te),iru]).
    
    Here, X must be a verb with the te-form ending.
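One plausible reading of the pf(R) restriction bookkeeping from item 6 can be sketched in Python (names and the exact update order are assumptions; the real mechanism operates inside PAPPI's contraction processor):

```python
def apply_restrictions(active, all_classes, restrictions):
    """Given the set of classes active for the next round, return
    (classes active next round, classes required to fire next)."""
    required = set()
    for op, arg in restrictions:
        if op == "block":          # block(C): C stays blocked from here on
            active = active - {arg}
        elif op == "blockOnly":    # blockOnly(C): all classes but C activated
            active = all_classes - {arg}
        elif op == "allow":        # allow(C): cancel a previous block
            active = active | {arg}
        elif op == "allowOnly":    # allowOnly(C): deactivate all except C
            active = {arg}
        elif op == "require":      # require(C) or require(Cs)
            required |= set(arg) if isinstance(arg, list) else {arg}
    return active, required
```

For instance, the vEnding rule's [block(vEnding)] removes vEnding from the active set, encoding that a verb cannot be doubly tense-marked.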

See also the section on Expand Contractions for details on debugging the contraction rule mechanism.

Finally, we summarize the class mechanism for contraction.

Classes and the Application of Contraction Rules

In defining rules of the form:

contraction(+K,+W,+L)
contraction(+K,+W1,+W2,+L)
the grammar writer is free to group contraction rules into classes by naming the class of each rule (K).

The class restriction mechanism, encoded by R in the output pattern item X=pf(R) described above, allows named classes to be blocked or permitted to apply as needed. For example, if number agreement follows case endings, the class mechanism can be used to properly sequence firing of the various suffix rules by having number agreement rules explicitly block case rules.

More generally, one can write default restriction rules for contractions. This is provided as a notational convenience. The contraction_default declaration may be used to specify default rules either for unrestricted classes or for some particular class. For example, the following two declarations prevent rules of any named class from firing again once that class has applied to the input word:

define_contraction_defaults.
contraction_default(X,[block(X)]).

Classes may also be grouped into superclasses. This is also used to restrict the scope of contraction rule application. For example, one might have a set of classes that apply only to nouns and another set that apply only to verbs. If a superclass has been declared for a given class, subsequent rounds of contraction processing for a given word will be restricted to rules that belong to the same superclass.

Finally, note:

  1. Unlike named contraction rules, contraction rules from the special null ([]) class are not subject to restriction by any of the mechanisms discussed above. In other words, null class contraction rules are always applicable, except as noted in (2).
  2. Contraction rule processing can be blocked altogether on a word-by-word basis. See blockContraction.

References: ParsePF / blockContraction / define_contraction_defaults / contraction_default / superClass / Expand Contractions


blockContraction(W) Declares that no contraction rule should be used to expand word W.

In general, contraction rules optionally apply. That is, PAPPI will pursue both lines of inference if there is an input item for which both a lexical entry exists and to which a contraction rule may apply. The blockContraction declaration allows the user to override this default behaviour on a word-by-word basis, for example, in the case of irregular expansions.

Example:

In the Turkish lexicon, we prevent any further morphological decomposition of the following genitive-marked personal pronouns:

blockContraction(benim).	% my
blockContraction(senin).	% your
blockContraction(onun).		% his/her/its
That is, there exist lexical entries for benim, senin and onun.

Every lexicon must contain declarations for blockContraction. If there are no entries, the lexicon should contain the following line:

no blockContraction(_).

References: contraction


superClass(SK,K) Declares contraction class SK to be the superclass of class K.

Every lexicon must contain declarations for superClass. If there are no entries, the lexicon should contain the following line:

no superClass(_,_).

Example:

In the Hungarian implementation, we have two superclasses n and v:

superClass(n,case).
superClass(n,num).
superClass(n,doubledFC).
superClass(n,poss).
superClass(n,poss1).
superClass(n,poss2).
superClass(v,infin).
superClass(v,infl).
superClass(v,past).
superClass(v,agr).
superClass(v,mood).
superClass(v,subj).
superClass(v,caus).

The basic idea behind the superclass declaration is to restrict contraction expansion sequences to particular groups of classes. For example, if a contraction rule belonging to superclass n has been used to expand a word, any subsequent application of a contraction rule to that word must come from a compatible class, i.e. some class with superclass n.
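This compatibility check can be sketched in Python (the class-to-superclass table below is a fragment of the Hungarian declarations above; the function name is illustrative):

```python
# class -> superclass, per the superClass/2 declarations
SUPER = {"case": "n", "num": "n", "doubledFC": "n", "poss": "n",
         "infin": "v", "infl": "v", "past": "v", "agr": "v"}

def compatible(prev_class, next_class):
    """Once a rule of a class with a declared superclass has fired on a
    word, later rules for that word must share the same superclass."""
    sk = SUPER.get(prev_class)
    return sk is None or SUPER.get(next_class) == sk
```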

References: contraction


Optional Components

define_contraction_defaults Precedes all contraction_default declarations.

If the lexicon uses the contraction_default meta-rule mechanism, a define_contraction_defaults header line must precede all such declarations.

Example:

%% Defaults for contraction class restrictions 

define_contraction_defaults.
contraction_default(X,[block(X),block(doubledFC)]).
...
References: contraction_default


contraction_default(C,R) Declares a contraction restriction list R for class C.

The possible contraction restrictions are described above in the section for X=pf(R). Note that there are no mode restrictions on C. In particular, C can be a variable - in which case it applies to all named classes. Example from the Hungarian lexicon:

contraction_default(X,[block(X),block(doubledFC)]).
This states that rules of any named class X are blocked from firing again once class X has applied. Also, by default, all named classes block the class doubledFC. For instance, this implies that the rule:
contraction(case,X+át,[X+a=pf([]),acc]).
is actually equivalent to writing:
contraction(case,X+át,[X+a=pf([block(case),block(doubledFC)]),acc]).
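The equivalence above suggests the following Python sketch (a reading, not PAPPI's implementation: a rule written with pf([]) picks up the declared defaults for its class; how defaults would combine with a non-empty explicit list is deliberately left open):

```python
def effective_restrictions(klass, explicit, default_for):
    """pf([]) in a rule of class klass expands to the declared defaults;
    a non-empty explicit restriction list is used as given."""
    return explicit if explicit else default_for(klass)

# contraction_default(X,[block(X),block(doubledFC)]) as a function of the class
hungarian_default = lambda c: [("block", c), ("block", "doubledFC")]
```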

Finally, note that a define_contraction_defaults declaration must precede the contraction_default rules.

References: contraction / define_contraction_defaults


define_lex_forms
(PAPPI 3.x only)
Precedes all ending, rootEnding, lexTemplate and lexForms declarations.

If the lexicon uses the lexForms off-line compilation mechanism, a define_lex_forms header must precede all such entries.

Example:

define_lex_forms.
ending(ing,ing).
ending(s,s).
ending(ed(_),ed).
...
lexTemplate(Root,C,Form,Infl,lex(Infl,C,Root,[morph(Root,Form)])).

lexForms(appear,v,[ing,s,ed(_)]).
lexForms(appreciate,v,[ing,s,ed(_)]).
References: ending / rootEnding / lexTemplate / lexForms


ending(Ending,Form)
(PAPPI 3.x only)
Defines the morphological form Form associated with ending code Ending. Ending is used in lexForms macro declarations.

[Note: the define_lex_forms declaration must precede all ending entries.]

Example:

In the English implementation we have:

ending(ing,ing).
ending(s,s).
ending(ed(_),ed).
ending(ed(1),ed).
ending(ed(2),en).
The first parameter will be referenced in lexForms declarations. For instance:
lexForms(appear,v,[ing,s,ed(_)]).
The verb appear has ending forms ing, s and ed(_). These are associated with the strings -ing, -s and -ed, respectively. The stem appear will combine with the strings to produce the inflected forms appearing, appears and appeared, respectively.

See rootEnding for information on defining concatenation rules for endings.

References: define_lex_forms / rootEnding / lexTemplate / lexForms


rootEnding(Root,Ending,Form)
(PAPPI 3.x only)
Defines a rule for producing Form from Root combined with Ending. Ending is defined separately using ending declarations. Root is matched against word stems in lexForms macro declarations.

[Note: the define_lex_forms declaration must precede all rootEnding entries.]

Example:

In the English implementation, we have:

ending(ing,ing).
ending(s,s).
ending(ed(_),ed).
lexForms(appear,v,[ing,s,ed(_)]).
lexForms(appreciate,v,[ing,s,ed(_)]).
Under the default concatenation rule, the lexForms macro will generate the correct inflected forms for appear but not appreciate.

We can either list the lex/4 entries manually, as in the examples in the section on lexicon/3, or define rootEnding rules to cope with stems ending in the vowel e as follows:

rootEnding(X+e,ing,X+ing).
rootEnding(X+e,ed,X+ed).
rootEnding(X+e,en,X+en).
These rules will override the default concatenation rule when the stem ends in the vowel e.

References: define_lex_forms / ending / lexicon/3 / lexForms


lexTemplate(Root,Category,Ending,Form,Clause)
(PAPPI 3.x only)
Defines a template for macro expansion of lexForms declarations: it is used to generate a lexical entry Clause for a given word Form with category label Category, derived from stem Root plus ending code Ending.

Example:

In the English implementation, we have:

ending(ing,ing).
ending(s,s).
ending(ed(_),ed).

lexTemplate(Root,C,Form,Infl,lex(Infl,C,Root,[morph(Root,Form)])).

lexForms(appear,v,[ing,s,ed(_)]).
The lexForms declaration for appear in conjunction with the above template will generate the following lex/4 clauses:
lex(appearing,v,appear,[morph(appear,ing)]).
lex(appears,v,appear,[morph(appear,s)]).
lex(appeared,v,appear,[morph(appear,ed(_))]).
References: ending / lexForms


lexForms(Root,Category,Endings)
(PAPPI 3.x only)
Declares that stem Root should be macro-expanded using the list of ending codes Endings to form a series of lexical entries with category label Category.

The lexForms mechanism is a compilation scheme. That is, macro expansion of lexForms will be carried out (off-line) at lexicon compilation time. (By contrast, stemming done using the contraction mechanism will be performed at run-time.)

For each lexForms declaration, the following sequence of steps will be carried out:

  1. Each code given in Endings will be looked up via ending declarations to retrieve the corresponding ending.

  2. The ending will be concatenated with the stem Root to produce a final morphological form.

    Note: if a rootEnding rule applies to the combination, it will override the default simple concatenation rule.

  3. Each resulting form will produce a separate lexical entry or clause. The template for the lexical entry is defined by lexTemplate.

Finally, the whole series of definitions must be headed by a define_lex_forms declaration. Within the various declarations, there is the further restriction that clauses for lexForms must go at the end. Definitions for ending and rootEnding may appear in any order.

Examples:

In the English implementation we have:

define_lex_forms.
ending(ing,ing).
ending(s,s).
ending(ed(_),ed).
ending(ed(1),ed).
ending(ed(2),en).

rootEnding(X+e,ing,X+ing).
rootEnding(X+e,ed,X+ed).
rootEnding(X+e,en,X+en).

lexTemplate(Root,C,Form,Infl,lex(Infl,C,Root,[morph(Root,Form)])).
Given this, the following macro definition:
lexForms(appear,v,[ing,s,ed(_)]).
will produce the inflected entries for appear shown below:
lex(appearing,v,appear,[morph(appear,ing)]).
lex(appears,v,appear,[morph(appear,s)]).
lex(appeared,v,appear,[morph(appear,ed(_))]).
Similarly, we can define a macro for arrive as follows:
lexForms(arrive,v,[ing,s,ed(1),ed(2)]).
This produces the following block of entries:
lex(arriving,v,arrive,[morph(arrive,ing)]).
lex(arrives,v,arrive,[morph(arrive,s)]).
lex(arrived,v,arrive,[morph(arrive,ed(1))]).
lex(arriven,v,arrive,[morph(arrive,ed(2))]).
Note, for instance, that the entry for arriving has been derived via the concatenation rule:
rootEnding(X+e,ing,X+ing).
Inflected forms for verbs that are irregular with respect to these rules can simply be spelled out directly, as in the lexicon/3 examples. For example:
lex(eating,v,eat,[morph(eat,ing)]).
lex(eats,v,eat,[morph(eat,s)]).
lex(ate,v,eat,[morph(eat,ed(1))]).
lex(eaten,v,eat,[morph(eat,ed(2))]).
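The compile-time expansion steps can be sketched in Python (illustrative only: the string codes ed1/ed2 stand in for the Prolog terms ed(1)/ed(2), and the rootEnding override is hard-coded for e-final stems as in the English rules above):

```python
ENDINGS = {"ing": "ing", "s": "s", "ed1": "ed", "ed2": "en"}  # code -> string
E_DROP = {"ing", "ed", "en"}  # rootEnding(X+e,F,X+F): drop a stem-final e

def expand(root, cat, codes):
    """Steps 1-3: look up each ending code, concatenate it with the stem
    (applying the rootEnding override where it fires), and emit one
    lex/4 clause per resulting form, per the lexTemplate above."""
    entries = []
    for code in codes:
        form = ENDINGS[code]
        stem = root[:-1] if root.endswith("e") and form in E_DROP else root
        entries.append(f"lex({stem + form},{cat},{root},[morph({root},{code})])")
    return entries
```

Running expand("arrive","v",["ing","s","ed1","ed2"]) reproduces the arriving/arrives/arrived/arriven block shown above.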
References: define_lex_forms / ending / rootEnding / lexicon/3 / lexTemplate
