Statistical Natural Language Processing

- Syllabus
- Survey
- Automata handout
- N-gram handout
- Exercise #1
- Information theory
- Smoothing
- Smoothing continued
- Toolkits
- Exercise #2
- HMMs
- HMM algorithms
- HMM algorithms continued
- PCFGs
- Handout on prospectus
- PCFG algorithms

- Chris Manning's Statistical NLP links
- Linguistic Data Consortium
- Project Gutenberg
- Oxford Text Archive

- Krenn & Samuelson's book on statistical NLP*
- Chris Brew's book on Data-intensive Linguistics, as postscript* or HTML
- Abney's paper on Statistical methods and linguistics*
- Chen & Goodman as pdf or as postscript*
- Clarkson & Rosenfeld* (This paper describes the CMU-Cambridge package below.)
- Stolcke as gzipped postscript or as pdf (This paper describes the SRILM package below.)
- Goldsmith on probability for language modeling

Items marked * are postscript files, which can be printed with `a2ps` or viewed/printed with Ghostview (free for all platforms).

- Brown corpus
- Susanne corpus (tagged) (This file has to be extracted with `tar` and `uncompress` in a Unix/Linux environment.)

- My book manuscript
  - Ch.1: Introduction (2/10/03)
  - Ch.2: Formal language theory (2/10/03)
  - Ch.3: Probability theory (2/10/03)
  - Ch.4: N-grams (2/10/03)
  - Ch.5: Information theory (2/17/03)
  - Ch.6: Sparse data (2/24/03)
  - Ch.7: Hidden Markov Models (3/10/03)
  - Ch.8: HMM Algorithms (3/24/03)
  - References (2/1/03)

- Programs from book (Note that these are not production quality and require that you have Perl installed on your system.)
  - Unigrams.pm
  - Bigrams.pm
  - uniapprox.pl*
  - biapprox.pl*
  - entropy.pl*
  - entropy2.pl*
  - hapax.pl*
  - novelwords.pl*
  - addonecross.pl*
  - addx.pl*
  - forngram.pl* (Corrected: 3/7/03)

Programs marked * have a `.txt` extension. This should be stripped off before running the program. For example, `uniapprox.txt` should be renamed `uniapprox.pl`. The programs can be run in several ways; the easiest is:

```
perl program-name (arguments)
```
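If several programs have been downloaded, the renaming step can be done in one go. The following is only a sketch for a unix/linux shell: `rename_to_pl` is a hypothetical helper name, not part of the course materials, and it assumes the downloaded files are in the current directory.

```shell
# rename_to_pl: hypothetical helper (not from the course page).
# For each file named on the command line, strip a trailing ".txt"
# and replace it with ".pl", e.g. uniapprox.txt -> uniapprox.pl.
rename_to_pl() {
    for f in "$@"; do
        case "$f" in
            *.txt)
                # ${f%.txt} removes the ".txt" suffix from the name
                mv -- "$f" "${f%.txt}.pl"
                ;;
            *)
                # leave files without a .txt extension untouched
                echo "skipping $f (no .txt extension)" >&2
                ;;
        esac
    done
}
```

For example, `rename_to_pl uniapprox.txt biapprox.txt` would leave `uniapprox.pl` and `biapprox.pl` in the directory, after which a program can be run as `perl uniapprox.pl` followed by its arguments. Naming the files explicitly, rather than renaming every `.txt` file, avoids touching any plain-text corpus files stored in the same directory.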
Mike Hammond

Dept. of Linguistics