This course is an introduction to the statistical modeling of spoken language and text data. Statistical methods have become popular in recent years in both theoretical and applied research, partly because of rapid increases in computing power and data set size. We will introduce the statistical concepts and algorithms used in natural language processing (NLP) and survey how they are applied in human language technology. The course is intended for linguistics students who are interested in the methodology of statistical modeling, as well as those who wish to acquire hands-on experience with speech and natural language processing tools. In terms of content, the materials can be roughly divided into the following sub-units:
Upon completion of this course, the student is expected to be able to do the following:
The prerequisite for this class is LING 438/538 (or LING 388 with the instructor's approval). Students who have not taken the prerequisite should speak to me for permission to enroll.
|Project -- proposal:||5%|
|Project -- presentation
|Project -- final report
There will be 6-8 homework problem sets and 3-4 short quizzes. The homework sets and quizzes are identical for 439 and 539. Repair work may be allowed for the quizzes after they are graded. Students enrolled in 539 will have their lowest assignment grade dropped. Students enrolled in 439 will have both their lowest assignment grade and their lowest quiz grade dropped.
Most teaching materials for this class, including slides, handouts, sample code, homework assignments, and an online discussion forum, will be available to registered students on Desire2Learn (D2L) at http://d2l.arizona.edu. A NetID is required for login. Questions regarding course materials, homework, etc. should be directed to the discussion forum in D2L. Some problems in the homework assignments will require the results to be submitted to the dropbox in D2L.
NOTE: If you need to send me an email, do not use the D2L email since I do not check it. If you are not enrolled in the class, but need to get access to the course website, please let me know.
Some extra reading materials will also be made available during the course. The book "Learning Python", by Lutz and Ascher, may also help you get started with Python programming.
Although prior programming experience will be helpful, it is not
strictly necessary since we will cover some basics in the
To install Python, go to: http://www.python.org/download
To install IPython, go to: http://ipython.scipy.org/
To install NLTK (lite), go to: http://nltk.sourceforge.net/
To install SciPy/NumPy, go to: http://www.scipy.org/
To install Matplotlib, go to: http://matplotlib.sourceforge.net/
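Once the packages above are installed, a short check like the one below can confirm that Python is able to find them. This is only a sketch, assuming Python 3.6 or later: the helper name `check_packages` is made up for illustration, and the module names in `COURSE_PACKAGES` are assumed to be the names under which these packages install. Older releases, such as NLTK-Lite, may use a different module name.

```python
import importlib.util

# Import names assumed for the packages listed above (an assumption, not
# taken from the course materials; adjust if your versions differ).
COURSE_PACKAGES = ["IPython", "nltk", "numpy", "scipy", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each module name to True if Python can find it."""
    status = {}
    for name in names:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_packages(COURSE_PACKAGES).items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed from the corresponding link above and the check run again.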
The HLT server (hlt.sbs.arizona.edu) will be available for accessing the files required for the course and for working on your homework and term project. An account will be created for each student who is enrolled in the class. Your user name will be the same as your NetID. Passwords will be distributed in the lab. Please change your password after your first login.
Points will be deducted from late homework assignments for each day past the due date. You may work with your fellow students to solve these exercises, but you must list the names of everyone you worked with, and you must write up and hand in your solutions individually.
The term project allows you to bring together ideas and tools that you have learned in this class and apply them to a topic that interests you. Students may form pairs to work on the term project and turn in one proposal and write-up, but more will be expected from students working in pairs than from those who work individually. The names of both students in a pair must be included in the project proposal. Use of third-party libraries, packages, or toolboxes is allowed, as long as the source is acknowledged.
Dates related to the term project:
|Proposal (team, description of the problem, references, plan)||3/04
||Working with corpora, introduction to Python programming (ch.1, ch.4)|
expressions, Zipf's law (ch.1, ch.4)
||Basic probability theory, Bayes theorem, parameter estimation (ch. 2)|
||Naive Bayes classifier, spam
||Hypothesis testing, word collocations (ch. 5)|
||Word sense disambiguation (ch. 7)|
||Unsupervised learning, EM algorithm (ch.8, ch.14)|
||N-grams, information theory, language modeling (ch. 6)|
||Spring Break -- No class|
||Hidden Markov models (ch.9,
||Hidden Markov models, tagging (ch.9, ch.10)|
grammars and parsing (ch.11, ch.12)
grammars and parsing (ch.11, ch.12)
||Word alignment, statistical machine translation (ch. 13)|
||Guest lecture, student