- Instructor: Ying Lin
- Email: yinglin@email.arizona.edu
- Class: TTh 2:00-3:15
- Room: Social Sciences 224 (ICL)
- Office location: Douglass 305
- Phone: 626-0678
- Office hours: W 1:15-3:15, or by appointment

This course is an introduction to the statistical modeling of spoken language and text data. Statistical methods have become popular in recent years in both theoretical and applied research, partly because of rapid increases in computing power and data set size. In this class, we will provide an introduction to the statistical concepts and algorithms used in NLP, and survey how they are applied in human language technology. This course is for linguistics students who are interested in the methodology of statistical modeling, as well as those who wish to acquire hands-on experience with speech and natural language processing tools. In terms of content, the materials can be roughly divided into the following sub-units:

- Data acquisition: working with text corpora, programming in the Python language
- Basic probability and statistical inference: distributions, hypothesis testing, maximum likelihood estimation, Bayesian methods, basic information theory
- Stochastic grammars: n-gram models, hidden Markov models, probabilistic context-free grammars
- Unsupervised methods and supervised methods
- Applications of statistical NLP

Upon completion of this course, the student is expected to be able to do the following:

- Perform various text processing tasks with the Python language and packages
- Perform calculations and simulations based on probability distributions
- Apply the Bayes formula to calculate conditional probability
- Use hypothesis testing to discover terms from collocations
- Implement simple statistical classifiers and apply them to classification problems in NLP
- Formulate and implement the EM algorithm for training various statistical models
- Apply the methodology of clustering in scenarios of unsupervised learning
- Relate various training and smoothing techniques to principles of parameter estimation
- Calculate probabilities based on HMMs, and implement the Viterbi and EM algorithms for HMMs
- Calculate probabilities based on PCFGs, and implement the CKY algorithm for CFGs and PCFGs
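To give a sense of the kind of calculation meant by the Bayes formula objective above, here is a minimal sketch in Python. The function name `posterior` and all the probabilities are made up for illustration and are not taken from the course materials; the spam setting is used only because spam classification appears later in the schedule.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H) using the Bayes formula."""
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    if evidence == 0:
        raise ValueError("P(E) is zero, so the posterior is undefined")
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of messages are spam, a given word occurs in
# 50% of spam messages and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.2, likelihood=0.5, likelihood_given_not=0.05)
print(round(p_spam_given_word, 4))  # 0.7143
```

With these invented numbers, seeing the word raises the probability of spam from 0.2 to roughly 0.71.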

The prerequisite for this class is LING 438/538 (or LING 388 with the instructor's approval). Students who have not taken the prerequisite should speak to me for permission to enroll.

Class participation: | 10% |

Quizzes: | 20% |

Homework assignments: | 40% |

Project -- proposal: | 5% |

Project -- presentation: | 5% |

Project -- final report: | 20% |

There will be 6-8 homework problem sets and 3-4 short quizzes. The homework sets and quizzes are identical for 439 and 539. Repair work may be allowed for the quizzes after they are graded. Students enrolled in 539 will have their lowest assignment grade dropped. Students enrolled in 439 will have both their lowest assignment grade AND their lowest quiz grade dropped.

Most teaching materials for this class, including slides, handouts, sample code, homework assignments, and an online discussion forum, will be available to registered students on Desire2Learn (D2L) at http://d2l.arizona.edu. A NetID is required for login. Questions regarding course materials, homework, etc. should be directed to the discussion forum in D2L. Some problems in the homework assignments will require the results to be submitted to the dropbox in D2L.

NOTE: If you need to send me an email, do not use the D2L email since I do not check it. If you are not enrolled in the class, but need to get access to the course website, please let me know.

Required:

- Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press. (The electronic version can be accessed from an on-campus IP address.)
- NLTK tutorial (available online and downloadable as a PDF)

Recommended:

- Jurafsky and Martin (2000). Speech and Language Processing. Prentice-Hall. (Some sample chapters of the new edition are available online.)
- Python tutorial (available online from http://www.python.org, also included in a Python distribution)

Some extra reading materials will also be made available during the course. The book "Learning Python", by Lutz and Ascher, may also help you get started with Python programming.

Although prior programming experience will be helpful, it is not strictly necessary, since we will cover some basics at the beginning.

Programming languages:

- Python is the recommended language for the homework assignments and the term project. It runs on most platforms. However, if you are already familiar with other languages, you may also do your work in Perl, Java, Prolog or C/C++ and submit your code.

Software packages:

- NLTK: a natural language toolkit written in Python.
- NumPy/SciPy: Python packages for scientific computing.
- IPython: a nice environment for running Python that integrates many shell-like features.
- Matplotlib: a MATLAB-inspired 2D plotting package for Python.
- Other packages (mostly command-line tools) will be made available on the Linux server later in the semester.

To install Python, go to: http://www.python.org/download

To install IPython, go to: http://ipython.scipy.org/

To install NLTK (lite), go to: http://nltk.sourceforge.net/

To install SciPy/NumPy, go to: http://www.scipy.org/

To install Matplotlib, go to: http://matplotlib.sourceforge.net/
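Once the packages are installed, a quick check like the one below can confirm that Python is able to find them. This is only a sketch under some assumptions: it assumes a recent Python 3 interpreter (for `importlib.util.find_spec`), the helper name `missing_packages` is made up for illustration, and the import names in `COURSE_PACKAGES` are assumed to match the packages listed above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names in `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the course software (NLTK, NumPy, SciPy, IPython, Matplotlib)
COURSE_PACKAGES = ["nltk", "numpy", "scipy", "IPython", "matplotlib"]

if __name__ == "__main__":
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All course packages were found.")
```

The check only looks the packages up and does not import them, so it runs quickly and reports every missing package in one pass.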

SERVER:

The HLT server (hlt.sbs.arizona.edu) will be available for you to access the files required for the course and to work on your homework and term project. An account will be created for each student who is enrolled in the class. Your user name will be the same as your NetID. Passwords will be distributed in the lab. Please change your password after your first login.

Points will be deducted from late homework assignments for each day they are late. You may work with your fellow students to solve these exercises, but you must list the names of everyone you have worked with, write up your solutions yourself, and hand them in individually.

The term project allows you to put together some ideas and tools that you've learned in this class and to apply them to a topic in which you are interested. Students may form pairs to work on the term project and turn in one proposal and write-up, but more will be expected from students working in pairs than from those who work individually. Names of those working in pairs must be included in the project proposal. Use of third-party libraries, packages or toolboxes is allowed, as long as the source is acknowledged.

Dates related to term project:

Proposal (team, description of the problem, references, plan) | 3/04 |

Presentation | 5/01, 5/06 |

Report due | 5/13 |

Further guidelines for the term project will be distributed later in the semester.

Week of | Topic |

Jan. 14th | Initial Meeting |

Jan. 22nd | Working with corpora, introduction to Python programming (ch. 1, ch. 4) |

Jan. 29th | Tokenization, regular expressions, Zipf's law (ch. 1, ch. 4) |

Feb. 5th | Basic probability theory, Bayes theorem, parameter estimation (ch. 2) |

Feb. 12th | Naive Bayes classifier, spam classification |

Feb. 19th | Hypothesis testing, word collocations (ch. 5) |

Feb. 26th | Word sense disambiguation (ch. 7) |

Mar. 4th | Unsupervised learning, EM algorithm (ch. 8, ch. 14) |

Mar. 11th | N-grams, information theory, language modeling (ch. 6) |

Mar. 18th | Spring Break -- No class |

Mar. 25th | Hidden Markov models (ch. 9, ch. 10) |

April 1st | Hidden Markov models, tagging (ch. 9, ch. 10) |

April 8th | Probabilistic context-free grammars and parsing (ch. 11, ch. 12) |

April 15th | Probabilistic context-free grammars and parsing (ch. 11, ch. 12) |

April 22nd | Word alignment, statistical machine translation (ch. 13) |

April 29th | Guest lecture, student presentations |

May 6th | Student presentations |

*This schedule is subject to change.