LING 439/539: Statistical Natural Language Processing

BASIC INFO:

DESCRIPTION:

This course is an introduction to the statistical modeling of spoken language and text data. Statistical methods have become popular in recent years in both theoretical and applied research, partly because of rapid increases in computing power and data set size. In this class, we will provide an introduction to the statistical concepts and algorithms used in NLP, and survey how they are applied in human language technology. This course is for linguistics students who are interested in the methodology of statistical modeling, as well as those who wish to acquire hands-on experience with speech and natural language processing tools. In terms of content, the materials can be roughly divided into the following sub-units:

LEARNING OBJECTIVES:

Upon completion of this course, the student is expected to be able to do the following:

PREREQUISITES:

The prerequisite for this class is LING 438/538 (or LING 388 with the instructor's approval). Students who have not taken the prerequisite should speak to me for permission to enroll.


COURSE REQUIREMENTS:


Class participation: 10%
Quizzes: 20%
Homework assignments: 40%
Project -- proposal: 5%
Project -- presentation: 5%
Project -- final report: 20%

There will be 6-8 homework problem sets and 3-4 short quizzes. The homework sets and quizzes are identical for 439 and 539. Repair work may be allowed for the quizzes after they are graded. 539 students will have the lowest assignment grade dropped. Students enrolling in 439 will have the lowest assignment AND the lowest quiz dropped from their grade.

COURSE WEBPAGE: 

Most teaching materials for this class, including slides, handouts, sample code, homework assignments, and an online discussion forum, will be available on Desire2Learn (D2L) for registered students (http://d2l.arizona.edu). A NetID is required for login. Questions regarding course materials, homework, etc. should be directed to the discussion forum in D2L. Some problems in the homework assignments will require the results to be submitted to the dropbox in D2L.

NOTE: If you need to send me an email, do not use the D2L email since I do not check it. If you are not enrolled in the class, but need to get access to the course website, please let me know. 

READING

Required: 

Recommended:

Some extra reading materials will also be made available during the course. The book "Learning Python", by Lutz and Ascher, may also help you get started with Python programming.

SOFTWARE:

Although prior programming experience will be helpful, it is not strictly necessary since we will cover some basics in the beginning. 

Programming languages:

Software packages: 

To install Python, go to: http://www.python.org/download
To install iPython, go to: http://ipython.scipy.org/
To install NLTK (lite), go to: http://nltk.sourceforge.net/
To install SciPy/NumPy, go to: http://www.scipy.org/
To install Matplotlib, go to: http://matplotlib.sourceforge.net/
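
Once the packages are installed, a quick check along the following lines can confirm that Python is able to find them. This is only a sketch: the import names (IPython, nltk, numpy, scipy, matplotlib) are the names these packages are commonly imported under and are assumed here, and check_packages is a hypothetical helper, not part of any course material.

```python
import importlib.util

# Import names assumed for the packages listed above; adjust if yours differ.
PACKAGES = ["IPython", "nltk", "numpy", "scipy", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one line per package so missing installs are easy to spot.
    for name, found in check_packages(PACKAGES).items():
        status = "found" if found else "MISSING"
        print(f"{name}: {status}")
```

Any package reported as MISSING can be installed from the corresponding link above.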


SERVER:

The HLT server (hlt.sbs.arizona.edu) will be available for you to access the files required for the course and to work on your homework and term project. An account will be created for each student who is enrolled in the class. Your user name will be the same as your NetID. Passwords will be distributed in the lab. Please change your password after your first login.

HOMEWORK POLICIES:

Points will be deducted from late homework assignments for each day they are late. You may work with your fellow students to solve these exercises, but you must list the names of everyone you have worked with, and you must write up your solutions and hand them in individually.

TERM PROJECT:

The term project allows you to put together some ideas and tools that you've learned in this class and to apply them to a topic in which you are interested. Students may form pairs to work on the term project and turn in one proposal and write-up, but more will be expected from students working in pairs than from those who work individually. Names of those working in pairs must be included in the project proposal. The use of third-party libraries, packages, or toolboxes is allowed, as long as the source is acknowledged.

Dates related to term project:

Proposal (team, description of the problem, references, plan): 3/04
Presentation: 5/01, 5/06
Report due: 5/13

Further guidelines for the term project will be distributed later in the semester.

TENTATIVE CLASS SCHEDULE

Week of       Topic
Jan. 14th     Initial Meeting
Jan. 22nd     Working with corpora, introduction to Python programming (ch. 1, ch. 4)
Jan. 29th     Tokenization, regular expressions, Zipf's law (ch. 1, ch. 4)
Feb. 5th      Basic probability theory, Bayes' theorem, parameter estimation (ch. 2)
Feb. 12th     Naive Bayes classifier, spam classification
Feb. 19th     Hypothesis testing, word collocations (ch. 5)
Feb. 26th     Word sense disambiguation (ch. 7)
Mar. 4th      Unsupervised learning, EM algorithm (ch. 8, ch. 14)
Mar. 11th     N-grams, information theory, language modeling (ch. 6)
Mar. 18th     Spring Break -- No class
Mar. 25th     Hidden Markov models (ch. 9, ch. 10)
April 1st     Hidden Markov models, tagging (ch. 9, ch. 10)
April 8th     Probabilistic context-free grammars and parsing (ch. 11, ch. 12)
April 15th    Probabilistic context-free grammars and parsing (ch. 11, ch. 12)
April 22nd    Word alignment, statistical machine translation (ch. 13)
April 29th    Guest lecture, student presentations
May 6th       Student presentations

*This schedule is subject to change.