CS 4705 Natural Language Processing Julia Hirschberg COMS 4705 Fall 2010
What is Natural Language Processing?
Software that can recognize, analyze and generate text and speech
AKA computational linguistics
At Columbia:
–Michael Collins, CS, parsing, machine translation
–Mona Diab, CCLS, semantics
–Nizar Habash, CCLS, morphology, machine translation
–Julia Hirschberg, CS, spoken language processing
–Kathy McKeown, CS, summarization, generation
–Becky Passonneau, CCLS, dialogue systems, reference resolution
–Owen Rambow, CCLS, syntax, parsing

Why is NLP hard? Some Headlines…
Something Went Wrong In Jet Crash, Expert Says
Police Begin Campaign To Run Down Jaywalkers
Drunk Gets Nine Months In Violin Case
Farmer Bill Dies In House
Iraqi Head Seeks Arms
Enraged Cow Injures Farmer With Ax
Stud Tires Out
Eye Drops Off Shelf
Teacher Strikes Idle Kids
Squad Helps Dog Bite Victim

What will we learn about in this course?
Morphology: the way words are formed
Syntax: the way words are grouped together into larger constituents and phrases, and the way these phrases can be ordered
Semantics: the context-independent ‘meaning’ of utterances
Pragmatics: the context-dependent ‘meaning’ of utterances
Goal: What does a speaker/writer mean to convey?

Morphology
Stud tires out: Is ‘stud’ an adjective or a noun? ‘tires’: a noun or a verb?
Internet search: ‘union activities in New York’
–What to look for?
Union/unions; activities/activity
Active? Action? Actor? Actual? Academic?
New vs. New York, York vs. yorkie

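The search example above asks which word forms should count as a match for a query term. A minimal sketch in Python of one way to explore this, assuming a toy suffix-stripping rule set: the functions `naive_stem` and `crude_prefix` are hypothetical illustrations, not a method from the course or the textbook, and the rules cover only the plural forms on the slide.

```python
def naive_stem(word: str) -> str:
    """Strip two common English plural endings (toy rules, not a real stemmer)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"   # activities -> activity
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]         # unions -> union
    return w


def crude_prefix(word: str, n: int = 2) -> str:
    """Keep only the first n letters: a deliberately over-aggressive matcher."""
    return word.lower()[:n]


# Plural and singular forms collapse to the same search key.
for w in ["union", "unions", "activities", "activity"]:
    print(w, "->", naive_stem(w))

# Matching on a short prefix conflates unrelated words as well.
candidates = ["active", "action", "actor", "actual", "academic"]
print({crude_prefix(w) for w in candidates})  # a single shared key: {'ac'}
```

The contrast is the point of the slide: light suffix stripping helps recall for union/unions, while overly loose matching pulls in words such as "academic" that have nothing to do with "activities".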
Syntax
Constituent Structure:
–Teacher Strikes Idle Kids
–Enraged Cow Injures Farmer With Ax
Word Order and Position and Meaning
–John hit Bill.
–Bill was hit by John.
–Bill, John hit.
–Who John hit was Bill.
–I said John hit Bill.
–John hits Bill.

Semantics
Word meaning – semantic roles
–John picked up a bad cold.
–John picked up a large rock.
–John picked up Radio Netherlands on his radio.
Is meaning compositional?
–Squad helps dog bite victim
–Enraged cow injures farmer with ax

Pragmatics
Going Home, a play in one act (thanks to Bonnie Dorr)
–Scene 1: Pennsylvania Station, NY
Bonnie: Long Beach?
Passerby: Downstairs, LIRR Station.
–Scene 2: Ticket Counter, LIRR Station
Bonnie: Long Beach?
Clerk: $4.50.
–Scene 3: Information Booth, LIRR Station
Bonnie: Long Beach?
Clerk: 4:19, Track 17.
–Scene 4: On the train, vicinity of Forest Hills
Bonnie: Long Beach?
Conductor: Change at Jamaica.
–Scene 5: On the next train, vicinity of Lynbrook
Bonnie: Long Beach?
Conductor: Right after Island Park.

Algorithms
Rule-based
–Symbolic parsers and morphological analyzers
–Finite state automata
Probabilistic/statistical
–Learned from observation of (labeled) data
–Predicting new data based on old
–Machine learning

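The two families of algorithms above can be contrasted with a small sketch in Python. Everything here is an assumption made for illustration: the automaton's pattern (dollar amounts such as the "$4.50" from the ticket-counter scene), the state names, and the tiny labeled data set are invented, and the most-frequent-tag rule is only a simple baseline, not the course's method.

```python
from collections import Counter, defaultdict

# --- Rule-based: a hand-written deterministic finite-state automaton ---
# Accepts "$" + one or more digits, optionally followed by "." and two digits.
def char_class(ch):
    if ch in "0123456789":
        return "digit"
    if ch == "$":
        return "dollar"
    if ch == ".":
        return "dot"
    return "other"

TRANSITIONS = {
    ("start", "dollar"): "need_digit",
    ("need_digit", "digit"): "whole",
    ("whole", "digit"): "whole",
    ("whole", "dot"): "cents0",
    ("cents0", "digit"): "cents1",
    ("cents1", "digit"): "cents2",
}
ACCEPTING = {"whole", "cents2"}

def accepts(text):
    """Run the automaton; a missing transition means rejection."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

# --- Statistical: behavior learned from (labeled) observations ---
# Invented toy data: each pair is (word, part-of-speech tag).
LABELED = [
    ("tires", "NOUN"), ("tires", "NOUN"), ("tires", "VERB"),
    ("strikes", "VERB"), ("strikes", "VERB"), ("strikes", "NOUN"),
]

def train(pairs):
    """Count how often each word was observed with each tag."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return counts

def predict(counts, word, default="NOUN"):
    """Predict the tag seen most often in training; fall back for unseen words."""
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]

model = train(LABELED)
print(accepts("$4.50"), accepts("$4.5"))                    # True False
print(predict(model, "tires"), predict(model, "strikes"))  # NOUN VERB
```

The automaton's behavior is fixed by its author, while the tagger's predictions change whenever the labeled data changes, which is the distinction the slide draws.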
Current Real-World Applications
Search: very large corpora, e.g. Google
Question answering: e.g. IBM’s Jeopardy!, DARPA who/what/where…, Ask Jeeves
Translating between one language and another: e.g. Google Translate, Babelfish
Summarizing very large amounts of text or speech: e.g. your email, the news, voicemail
Sentiment analysis: restaurant or movie reviews
Dialogue systems: e.g. Amtrak’s ‘Julie’

Instructor
Julia Hirschberg
–CEPSR 705
–Focus: Spoken Language Processing
–Lab: The Speech Lab, CEPSR 7LW3-A
–Research:
Deceptive speech
Charismatic speech
Emotional speech: anger, uncertainty
Speech summarization: Broadcast News
Spoken Dialogue Systems: Games Corpus
‘Translating Prosody’: English – Mandarin
Text2Scene Synthesis

Course Details
Teaching Assistants:
–Mohamed Altantawy
Office Hours: CEPSR 7LW1 (Speech Lab), W 5-6, Th 5:30-6:30
Will manage CVN course
–Wei Yun Ma
Office Hours: CEPSR 725, Tu
Syllabus: http://www1.cs.columbia.edu/~julia/courses/CS4705/syllabus10.htm
Text: Daniel Jurafsky and James H. Martin, Speech and Language Processing, second edition
–Note errata available on website
Check CourseWorks for additional information on class, homework assignments, posting questions
Assignments:
–3 homework assignments: question answering, text classification, delightful surprise
–Midterm and final exams
–Five ‘free’ late days for homeworks; after that, 10% off per late day. Late days are not usable on HW1, though
–You will need a CS account

Recorded Lecture Availability
For on-campus students
–On CVN website

Grading
HW1: 10%
HW2: 20%
HW3: 20%
Midterm: 15%
Final: 25%
Class participation: 10%

Academic Integrity
Copying or paraphrasing someone's work (code included), or permitting your own work to be copied or paraphrased, even if only in part, is forbidden, and will result in an automatic grade of 0 for the entire assignment or exam in which the copying or paraphrasing was done. Your grade should reflect your own work. If you are going to have trouble completing an assignment, please talk to the instructor or TA in advance of the due date.
Everyone: Read/write protect your homework files at all times.

For Next Class
Look at syllabus – ask questions about anything you don’t understand
Read Chapters 1-2 of J&M