Introduction to CL Session 1: 7/08/2011
What is computational linguistics?
Processing natural language text by computers
  – for practical applications
  – ... or linguistic research
Among practical applications
  – Sometimes the computer only needs to classify or transform the text
  – ... but sometimes it needs to “understand”
  – Ex: Watson: winner of ‘Jeopardy’
CL vs. NLP (natural language processing)

NLP applications
Automatic speech recognition (ASR): speech → text
Machine translation (MT): L1 → L2
Information retrieval (IR): query + documents → a subset of the documents
Information extraction (IE): document → “database”

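The IR mapping above (query + documents → a subset of the documents) can be sketched in a few lines. This is only a minimal illustration, assuming boolean AND matching on lowercased whitespace tokens; the function name `retrieve` and the sample documents are hypothetical and not part of the lecture.

```python
def retrieve(query, documents):
    """Return the documents that contain every query term (boolean AND)."""
    terms = set(query.lower().split())
    # A document matches when the query terms are a subset of its tokens.
    return [doc for doc in documents if terms <= set(doc.lower().split())]


# Hypothetical three-document collection.
docs = [
    "he visited new york in 2003",
    "the bank raised interest rates",
    "she sat on the river bank",
]
hits = retrieve("river bank", docs)
```

Real IR systems rank documents rather than filter them, so this shows only the input and output shape of the task.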
NLP applications (cont)
Question answering (QA): question + documents → answer
Summarization: documents → summary
Natural language generation (NLG): representation → text

Other applications
Call center
Spam filter
Spell checker
Sentiment analysis: product reviews
Bio-NLP: processing clinical data
…

Basic NLP tasks: Shallow processing
Tokenization:
  – He visited New York in
Morphological analysis:
  – visited → visit + -ed
Part-of-speech tagging:
  – He/Pron visited/V New/?? York/N in/Prep 2003/CD
Named-entity tagging:
  – He visited [LOCATION New York] in [YEAR 2003]
Chunking:
  – [NP He] [V visited] [NP New York] in [NP 2003]

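The first two shallow-processing steps can be sketched as follows. This is a toy illustration only: the regular expression and the helper names `tokenize` and `strip_ed` are assumptions made for this example, and the suffix rule handles only the simplest regular past-tense forms (it would wrongly split "loved" into "lov" + "-ed").

```python
import re


def tokenize(text):
    """Split text into words, numbers, and single punctuation marks."""
    # A word may contain internal hyphens or apostrophes; anything else that
    # is not whitespace becomes its own one-character token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


def strip_ed(word):
    """Toy morphological analysis: split off a regular past-tense suffix."""
    if word.endswith("ed") and len(word) > 3:
        return (word[:-2], "-ed")
    return (word, "")


tokens = tokenize("He visited New York in 2003.")
analysis = strip_ed("visited")
```

Note that the tokenizer splits "New York" into two tokens; treating it as a single unit is left to later steps such as named-entity tagging or chunking.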
Basic NLP tasks: Deep processing
Parsing
  – (S (NP (PRON he)) (VP (V visited) ….)
Semantic analysis
  – Semantic tagging: [AGENT He] visited [DEST New York] ….
  – Meaning: visit (he, New-York)
Discourse
  – Co-reference: “He” refers to “John”
  – Discourse structure
Dialogue
Generation

Ambiguity
Phonological ambiguity: (ASR)
  – “too”, “two”, “to”
  – “ice cream” vs. “I scream”
  – “ta” in Mandarin: he, she, or it
Morphological ambiguity: (morphological analysis)
  – unlockable: [[un-lock]-able] vs. [un-[lock-able]]
Syntactic ambiguity: (parsing)
  – John saw a man with a telescope.
  – Time flies like an arrow.

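The syntactic ambiguity in "John saw a man with a telescope" can be made concrete by counting its parse trees. The sketch below uses a CKY-style chart over a tiny hand-written grammar in Chomsky normal form; the grammar, the lexicon, and the name `count_parses` are assumptions made for this illustration, not a grammar from the lecture.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "John": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` under the toy grammar."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of subtrees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1, tag] += 1
    # Fill longer spans from shorter ones.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY:
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, root]


ambiguous = count_parses("John saw a man with a telescope".split())
unambiguous = count_parses("John saw a man".split())
```

Under this grammar the full sentence receives two trees, one with the prepositional phrase attached to the verb phrase (the seeing was done with a telescope) and one with it attached to the noun phrase (the man has a telescope), while the shorter sentence receives one.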
Ambiguity (cont)
Lexical ambiguity: (WSD)
  – Ex: “bank”, “saw”, “run”
Semantic ambiguity: (semantic representation)
  – Ex: every boy loves his mother
  – Ex: John and Mary bought a house
Discourse ambiguity:
  – Susan called Mary. She was sick. (coreference resolution)
  – It is pretty hot here. (intention resolution)
Machine translation:
  – “brother”, “cousin”, “uncle”, etc.

Ambiguity resolution
Rule-based or knowledge-based:
  – Parsing:
      I saw a man with a hat
      I saw a man with a telescope (in my hand)
  – WSD: “bank”
  – MT: “brother”, “cousin”, “uncle”
Statistical approach:
  – Require training data
  – Build a statistical model
  – Knowledge and rules can be incorporated into the model as features etc.

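The statistical recipe above (collect training data, then build a model) can be sketched for word sense disambiguation of "bank". This is a minimal illustration using a Naive Bayes classifier with add-one smoothing; the four labeled sentences, the sense labels "money" and "river", and the function names are all invented for this example and are far too small for real use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (sense of "bank", context sentence).
TRAIN = [
    ("money", "deposit money in the bank account"),
    ("money", "the bank approved the loan"),
    ("river", "fishing on the river bank"),
    ("river", "the bank of the stream was muddy"),
]


def train(examples):
    """Collect the sense counts and per-sense word counts."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, sentence in examples:
        sense_counts[sense] += 1
        for word in sentence.split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def classify(sentence, model):
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in sentence.split():
            if word in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best


model = train(TRAIN)
sense_1 = classify("he took a loan from the bank", model)
sense_2 = classify("the river bank was muddy", model)
```

Here the only "knowledge" the model has is which words co-occur with each sense; richer knowledge would enter as additional features.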
Major approaches to NLP
Rule-based approach
Statistical approach
  – Supervised learning
  – Semi-supervised learning
  – Unsupervised learning

Supervised learning algorithms
Hidden Markov Model (HMM)
Decision tree
Decision list
Naïve Bayes
Transformation-based Learning (TBL)
Maximum Entropy (MaxEnt)
Support Vector Machine (SVM)
Conditional Random Field (CRF)
…

Data
Raw text:
  – Monolingual: English/Chinese/Arabic Gigawords
  – Parallel data: UN data, EuroParl
Treebank:
  – Syntactic treebanks: a set of parse trees
  – Proposition Bank
  – Discourse Treebank
Dictionaries
WordNet
FrameNet
…

[Figure: diagram with four groups of labels: Applications; Task1, Task2, …, Task_i; ML1, ML2, …, ML_m; D1, D2, …, D_n]

The role of linguistic knowledge in NLP
Suppose an NLP system is language-independent. Good or bad?
  – Good: it can be ported to many languages without any changes.
  – Bad: it cannot take advantage of properties of certain languages.
How to incorporate (linguistic) knowledge in statistical systems?
  – the design of models
  – as features
  – as filters
  – …
Building a treebank is an effective way to do this.

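As one illustration of the "as features" option, the sketch below turns a few pieces of linguistic knowledge (suffixes, capitalization, the preceding word) into a feature dictionary that a statistical tagger could consume. The feature names and the function `token_features` are hypothetical, chosen only for this example.

```python
def token_features(tokens, i):
    """Describe token i with a few linguistically motivated features."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "suffix2": word[-2:],                  # e.g. "-ed" hints at a past-tense verb
        "is_capitalized": word[0].isupper(),   # hints at a proper noun in English
        "is_digit": word.isdigit(),            # hints at a number or year
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }


sentence = "He visited New York in 2003".split()
features = token_features(sentence, 1)
```

Features such as capitalization are language-specific, which shows the trade-off above: they help for English but would need rethinking for a language without letter case.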