Introduction
  LING 572
  Fei Xia, Dan Jinguji
  Week 1: 1/08/08
Outline
  General course information
  Course contents
  Reading assignment #1: due 1/10
  Take-home exam #1: due 1/10
General info
  Course url:
    – Syllabus (incl. slides, assignments, and papers): updated every week.
    – GoPost:
    – Collect it:
  Please check your email and GoPost at least once per day.
Slides
  The slides will be online before class.
  The final version will be uploaded a few hours after class.
  “Additional slides” are not required and not covered in class.
Prerequisites
  CS 326 (Data Structures) or equivalent:
    – Ex: hash table, array, tree, …
  Stat 391 (Prob. and Stats for CS) or equivalent: basic concepts in probability and statistics
    – Ex: random variables, chain rule, Bayes’ rule
  Programming in Perl, C, C++, Java, or Python
  Basic unix/linux commands (e.g., ls, cd, ln, sort, head): tutorials on unix
  LING570
  If you don’t meet all the prerequisites above, you need to get permission from Fei before taking LING572.
LING570 prerequisites
  If you did not take LING570 last quarter, you need to understand all the material covered in that class, especially:
    – the material in Weeks #9-#10
    – hw #10, Quiz #4
    – the Mallet package
Topics covered in Ling570
  Unit #1:
    – Formal languages and formal grammars
    – FSA, FST
    – Morphological analysis
  Unit #2: LM, ngram, and smoothing
  Unit #3: HMM and POS tagging
  Unit #4: Classification and sequence labeling tasks
Grades for LING572
  No midterm or final exams.
  Graded:
    – Assignments (9): 65-75%
    – Take-home exams (3-5): 15-25%
  Not graded:
    – Reading assignments (5-9): 5-10%
    – Class participation: 5%
Office hour
  Fei:
    – Email address:
        Subject line should include “ling572”
        The 48-hour rule: it works both ways
    – Office hour:
        Time: Fr: 10:30-11:30am
        Location: Padelford A-210G
TA hour
  Dan Jinguji
    – Time: T: ??
    – Location: Art 337
Assignments / Exams
  Assignments: the same as in ling570
  Take-home exams: to replace in-class quizzes
  Reading assignments: some papers should be read before class.
  When there are take-home exams or reading assignments for the same period, the amount of assignments will be reduced accordingly.
  The total amount of time spent on Ling572 will be about hours.
Assignments
  Nine assignments
  Programming languages: C, C++, Java, Perl, or Python.
  Please follow the instructions in the assignments, including
    – command line format
    – file format
    – the probability model
    – …
The Mallet package
  Several assignments will use the Mallet package.
  If you don’t know how to use Mallet, you should get familiar with the package ASAP. You can start with the hw10 from LING570.
  The Mallet slides are at
  The LING570 hw10 is at
Assignment submission
  Use “Collect it”: submit the tar file.
    – E.g., tar -cvf hw1.tar hw1_dir
  Due date: every Thurs at 1pm unless specified otherwise.
  The submission area is closed 4 days after the due date.
  There is a 1% penalty for every hour after the due date.
Homework Submission (cont)
  Each submission includes
    – a note file: hw1.(txt|doc|pdf) for hw1. If your code does not work, explain in the note file what you have implemented so far.
    – a set of shell scripts: e.g., kNN.sh
    – source code: e.g., kNN.C
    – binary code (for C/C++/Java): kNN.out
    – data files if any.
  The TA will NOT compile or debug your code.
  Time spent on an assignment: hours/week
  I would appreciate it if you could tell me the time you spent on the homework.
Take-home exams
  Normally, an exam is handed out on Tuesday and due on Thursday.
  Bring the hardcopy of your solution to class.
  The dates on the tentative schedule are subject to change.
  Extensions will be granted for the exams ONLY under extremely unusual circumstances.
  There are no makeup exams.
  If you know that you will miss a class in which an exam could be given, you need to inform me at least two hours before the class starts.
Take-home exams (cont)
  You should complete the exams on your own. No discussion among students is allowed.
  You can refer to anything that is available on the course url and on patas, but please don’t search the Web for answers.
  If you have any questions about the exams, please email Fei.
Reading assignments
  You will answer some questions about the papers that will be discussed in the next class.
  The questions are on the teaching slides; there are no separate documents for them.
  Your answers should be concise, no more than a few lines.
  Your answers are due before the next class. Bring the hardcopy of your answers to class.
Summary of assignments and exams
  Columns: Assignments (hw) | Take-home exams | Reading assignments
  Num:
  Distribution: Download from the course url | Distributed in class | Download from the course url
  Discussion: Allowed | Disallowed | Allowed
  Submission: Collect It | Bring to class | Not graded
  Due date: 1pm every Thurs | Before next class
  Extension: 1% penalty per hour | Normally disallowed | Disallowed
  Estimate of hours: 10-20 hours | 1-5 hours | 2-5 hours
  Solution files: On Patas | Discussed in class
Patas
  If you need a patas account, you need to request one by email right away.
  The directory for LING572: ~/dropbox/07-08/572/
    – hw1/, hw2/, …: Assignments and solutions
    – misc_slides/: Solutions to exams and misc slides that are not on the course url.
  For jobs that run more than 5 minutes, use the cluster submission commands: see Hw1.
Course plan
Types of ML problems
  Classification problem
  Estimation problem
  Clustering
  Discovery
  …
  A learning method can be applied to one or more types of ML problems.
  We will focus on the classification problem.
Course objectives
  Covering basic statistical methods that produce state-of-the-art results
  Focusing on classification and sequence labeling problems
  Some ML algorithms are complex. We will focus on basic ideas, not theoretical proofs.
Main units
  Unit #1 (2 weeks): simple classification algorithms
    – kNN
    – Decision tree
    – Naïve Bayes
  Unit #2 (3 weeks): advanced classification algorithms
    – MaxEnt*
    – SVM**
Main units (cont)
  Unit #3 (2 weeks): sequence labeling algorithms
    – TBL* (if time permits)
    – CRF**
  Unit #4 (1 week): system combination
  Unit #5 (if time permits)
    – Introduction to semi-supervised learning**
    – Introduction to EM**
Other topics
  Information theory
  Feature selection
  Converting multi-class task to binary classification task
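The slide does not say which conversion scheme the course will use. As one common possibility, a one-vs-rest setup builds one binary task per class and then picks the class whose binary classifier scores highest. The sketch below is only an illustration under that assumption; the function names `one_vs_rest` and `combine` and the toy labels are hypothetical.

```python
def one_vs_rest(labels):
    """Build one binary labeling per class: +1 for that class, -1 for all others.

    Assumption: one-vs-rest is used; other schemes (e.g., all-pairs) exist.
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


def combine(scores):
    """Given one score per binary classifier, return the class with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy labels for four training instances.
    labels = ["a", "b", "c", "a"]
    for cls, binary in one_vs_rest(labels).items():
        print(cls, binary)
    # Hypothetical scores produced by the three binary classifiers for one test instance.
    print(combine({"a": 0.2, "b": 0.9, "c": -0.4}))
```

Each binary task can then be handed to any binary learner; only the relabeling and the final combination step are specific to the conversion.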
Three levels of discussion
  The default (i.e., unmarked): We will discuss the model, training, decoding, etc.
    – kNN, Decision tree, Naïve Bayes
  *: We will discuss the model, but not the training and other implementation issues.
    – MaxEnt, TBL
  **: We will only go over the main intuition about the algorithms.
    – SVM, CRF, semi-supervised learning, EM
Questions for each ML method
  Modeling:
    – What is the model?
    – What kind of assumption is made by the model?
    – How many types of model parameters?
    – How many “internal” (or non-model) parameters?
    – …
Questions for each method (cont)
  Training: how to estimate parameters?
  Decoding: how to find the “best” solution?
  Weaknesses and strengths?
    – Is the algorithm robust (e.g., handling outliers)? Scalable? Prone to overfitting? Efficient in training time? In test time?
    – How much data is needed?
        Labeled data
        Unlabeled data
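To make the modeling / training / decoding split concrete, here is a minimal sketch using a multinomial Naïve Bayes text classifier, one of the Unit #1 methods. It is an illustration under stated assumptions, not the course's reference implementation: the add-delta smoothing, the function names `train` and `decode`, and the toy documents are all hypothetical choices. The model parameters are the class priors and the per-class word probabilities; `delta` is an example of an "internal" (non-model) parameter.

```python
import math
from collections import Counter, defaultdict


def train(docs, delta=1.0):
    """Training: estimate log P(c) and log P(w|c) from (label, tokens) pairs.

    delta is the add-delta smoothing constant (an assumed design choice).
    """
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + delta * len(vocab)
        log_cond[c] = {
            w: math.log((token_counts[c][w] + delta) / total) for w in vocab
        }
    return log_prior, log_cond, vocab


def decode(model, tokens):
    """Decoding: return the class c maximizing log P(c) + sum of log P(w|c).

    Tokens not seen in training are skipped (another assumed choice).
    """
    log_prior, log_cond, vocab = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy training data.
    docs = [
        ("sports", ["game", "team", "win"]),
        ("sports", ["team", "score"]),
        ("politics", ["vote", "election"]),
        ("politics", ["election", "win", "vote"]),
    ]
    model = train(docs)
    print(decode(model, ["team", "game"]))
    print(decode(model, ["vote", "election"]))
```

The modeling assumption here is that words are conditionally independent given the class; the remaining questions on the slide (robustness, scalability, data requirements) can be asked of this sketch in the same way as of any other method.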
Reading assignment #1
Reading assignment #1
  Read M&S 2.2: Essential Information Theory
  Questions: For a random variable X, p(x) and q(x) are two distributions. Assume p is the real distribution.
    – p(X=a)=p(X=b)=1/8, p(X=c)=1/4, p(X=d)=1/2
    – q(X=a)=q(X=b)=q(X=c)=q(X=d)=1/4
  (a) What is H(X)?
  (b) What is cross entropy H(X,q)?
  (c) What is KL divergence D(p||q)?
  (d) What is D(q||p)?
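The quantities in these questions can be checked with a short script. The sketch below assumes the standard definitions with base-2 logarithms (results in bits): H(X) = -sum p(x) log p(x), H(X,q) = -sum p(x) log q(x), and D(p||q) = sum p(x) log (p(x)/q(x)). The function names are hypothetical, and the code is meant for verifying hand calculations rather than replacing them.

```python
import math


def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy(p, q):
    """H(X, q) = -sum_x p(x) log2 q(x), where p is the real distribution."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)


def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2 (p(x) / q(x)); not symmetric in general."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


# The two distributions from the questions.
p = {"a": 1 / 8, "b": 1 / 8, "c": 1 / 4, "d": 1 / 2}
q = {"a": 1 / 4, "b": 1 / 4, "c": 1 / 4, "d": 1 / 4}

if __name__ == "__main__":
    print("H(X)    =", entropy(p))
    print("H(X,q)  =", cross_entropy(p, q))
    print("D(p||q) =", kl_divergence(p, q))
    print("D(q||p) =", kl_divergence(q, p))
```

A useful sanity check on any answer is the identity H(X,q) = H(X) + D(p||q), which follows directly from the three definitions.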
Next time
  Both Reading #1 and Exam #1 are due before class. Bring the hardcopy to class.
  Topics:
    – Information theory and hw #1
    – Solution to Exam #1 (if time permits)
    – Recap on the classification problem (if time permits)