Robust Semantics, Information Extraction, and Information Retrieval

Problems with Syntax-Driven Semantics
- Syntactic structures often don't fit semantic structures very well
- Important semantic elements are often distributed very differently in the trees of sentences that mean 'the same':
  I like soup. / Soup is what I like.
- Parse trees contain many structural elements that are not clearly important to making semantic distinctions
- Syntax-driven semantic representations are sometimes pretty verbose: V --> serves

Alternatives?
- Semantic grammars
- Information extraction techniques
- Information retrieval --> information extraction

Semantic Grammars
- An alternative to modifying syntactic grammars to deal with semantics too
- Define grammars specifically in terms of the semantic information we want to extract
- Domain specific: rules correspond directly to entities and activities in the domain
  I want to go from Boston to Baltimore on Thursday, September 24th
  Greeting --> {Hello | Hi | Um...}
  TripRequest --> Need-spec travel-verb from City to City on Date
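A rule like TripRequest can be approximated in code. Below is a minimal sketch using a regular expression in place of a real grammar; the City and Date patterns are illustrative placeholders, not a real gazetteer or date parser.

```python
import re

# Sketch of the semantic-grammar rule:
#   TripRequest --> Need-spec travel-verb from City to City on Date
# The slot patterns are toy approximations for illustration only.
TRIP_RULE = re.compile(
    r"I (?:want|need) to (?:go|travel|fly) "
    r"from (?P<from_city>[A-Z][a-z]+) "
    r"to (?P<to_city>[A-Z][a-z]+) "
    r"on (?P<date>\w+,? \w+ \d+\w*)"
)

def parse_trip_request(utterance):
    """Return the semantic slots for a trip request, or None if no match."""
    m = TRIP_RULE.search(utterance)
    return m.groupdict() if m else None

slots = parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th")
# slots == {'from_city': 'Boston', 'to_city': 'Baltimore',
#           'date': 'Thursday, September 24th'}
```

Because the rule encodes the domain directly, the match result *is* the semantic representation; there is no separate parse tree to interpret.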

Predicting User Input
- Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, and when
- This allows them to handle quite sophisticated phenomena:
  I want to go to Boston on Thursday. I want to leave from there on Friday for Baltimore.
  TripRequest --> Need-spec travel-verb from City on Date for City
- A dialogue postulate maps the filler for 'from-city' to the pre-specified from-city

Drawbacks of Semantic Grammars
- Lack of generality: a new grammar for each application, at a large cost in development time
- Can be very large, depending on how much coverage you want
- If users go outside the grammar, things may break disastrously:
  I want to leave from my house.
  I want to talk to someone human.

Information Extraction
- Another 'robust' alternative
- The idea is to 'extract' particular types of information from arbitrary text or transcribed speech
- Examples: named entities (people, places, organizations, times, dates), telephone numbers
  <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
- Domains: medical texts, broadcast news, voicemail, ...

Appropriate Where Semantic Grammars and Syntactic Parsers Are Not
- Appropriate where the information needed is very specific:
  question answering systems, gisting of news or mail, ...
  job ads, financial information, terrorist attacks
- Input too complex and far-ranging to build semantic grammars
- But full-blown syntactic parsers are impractical:
  too much ambiguity for arbitrary text (50 parses or none at all)
  too slow for real-time applications

Information Extraction Techniques
- Often use a set of simple templates or frames, with slots to be filled in from the input text; everything else is ignored
  My number is 212-555-1212.
  The inventor of the wiggleswort was Capt. John T. Hart.
  The king died in March of 1932.
- Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
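The template idea above can be sketched with one pattern per slot, each keyed on context cues (neighboring words, capitalization, punctuation). The slot names and patterns below are hypothetical, chosen just to cover the three example sentences.

```python
import re

# Sketch of template filling: one illustrative pattern per slot.
# Anything the patterns don't match is simply ignored.
SLOT_PATTERNS = {
    "phone": re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"),
    "inventor": re.compile(r"inventor of the \w+ was "
                           r"((?:[A-Z][a-z]*\.? )*[A-Z][a-z]+)"),
    "death_date": re.compile(r"died in (\w+ of \d{4})"),
}

def fill_template(text):
    """Fill whichever slots match in the text; leave the rest empty."""
    return {slot: m.group(1)
            for slot, pat in SLOT_PATTERNS.items()
            if (m := pat.search(text))}

fill_template("My number is 212-555-1212.")
# {'phone': '212-555-1212'}
fill_template("The inventor of the wiggleswort was Capt. John T. Hart.")
# {'inventor': 'Capt. John T. Hart'}
```

Note how capitalization and the literal cue words ("inventor of", "died in") do the work of identifying the filler, with no parsing at all.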

The IE Process
Given a corpus and a target set of items to be extracted:
- Clean up the corpus
- Tokenize it
- Hand-label some of the target items
- Extract some simple features: POS tags, phrase chunks, ...
- Use machine learning to associate features with the target items, or derive the association by intuition
- Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers
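The cascade idea in the last step can be illustrated with two transducer-like passes, each rewriting the output of the previous one. The patterns and pass names here are invented for illustration; real systems compose many such stages.

```python
import re

# Sketch of a cascade: each pass annotates the output of the one before it.
def pass_titles(text):
    # Pass 1: mark honorific titles (illustrative list).
    return re.sub(r"\b(Capt\.|Dr\.|Mr\.|Ms\.)", r"<title>\1</title>", text)

def pass_person(text):
    # Pass 2: a marked title followed by capitalized words is a Person.
    return re.sub(r"<title>(.+?)</title> ((?:[A-Z][a-z]*\.? )*[A-Z][a-z]+)",
                  r"<Person>\1 \2</Person>", text)

def annotate(text):
    for stage in (pass_titles, pass_person):
        text = stage(text)
    return text

annotate("Capt. John T. Hart invented it.")
# '<Person>Capt. John T. Hart</Person> invented it.'
```

Later passes can match on the annotations earlier passes introduced, which is exactly what makes cascading useful.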

Some Examples
- Semantic grammars
- Information extraction

Information Retrieval
How is it related to NLP?
- It operates on language (speech or text)
- Does it use linguistic information? Stemming; the bag-of-words approach
- Does it make use of document formatting? Headlines, punctuation, captions
Terminology:
- Collection: a set of documents
- Term: a word or phrase
- Query: a set of terms

But... what is a term?
- Stop list
- Stemming
- Homonymy, polysemy, synonymy

Vector Space Model
- Simple versions represent documents and queries as feature vectors, with one binary feature for each term in the collection: is term t in this document (or query) or not?
  D = (t1, t2, ..., tn)
  Q = (t1, t2, ..., tn)
- Similarity metric: how many terms does a query share with each candidate document?
- Weighted terms: term-by-document matrix
  D = (wt1, wt2, ..., wtn)
  Q = (wt1, wt2, ..., wtn)
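The binary version of this model is small enough to write out in full. A minimal sketch over a toy vocabulary (the collection and vocabulary are made up for illustration):

```python
# Binary vector space sketch: one 0/1 feature per vocabulary term.
def binary_vector(text, vocabulary):
    terms = set(text.lower().split())
    return [1 if t in terms else 0 for t in vocabulary]

def overlap(q_vec, d_vec):
    """Similarity metric: how many terms do query and document share?"""
    return sum(q * d for q, d in zip(q_vec, d_vec))

vocab = ["cat", "dog", "fish", "bird"]
doc = binary_vector("the cat chased the dog", vocab)    # [1, 1, 0, 0]
query = binary_vector("cat fish", vocab)                # [1, 0, 1, 0]
# overlap(query, doc) == 1, since only "cat" is shared
```

Replacing the 0/1 entries with term weights gives the weighted term-by-document matrix mentioned above.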

How do we compare the vectors?
- Normalize each term weight by the number of terms in the document: how important is each t in D?
- Compute the dot product between vectors to see how similar they are
- Cosine of the angle: 1 = identical; 0 = no common terms
How do we get the weights?
- Term frequency (tf): how often does t occur in D?
- Inverse document frequency (idf): # docs / # docs in which term t occurs
- tf.idf weighting: the weight of term i for doc j is the product of the frequency of i in j with the log of its idf in the collection
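Putting tf.idf and cosine similarity together gives a complete toy retrieval scorer. This sketch uses the plain tf * log(N/df) weighting from the slide; real systems typically add smoothing, which is omitted here.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Weight of term i in doc j = tf(i, j) * log(N / df(i))."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    return vocab, [[Counter(toks)[t] * math.log(n / df[t]) for t in vocab]
                   for toks in tokenized]

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab, vecs = tf_idf_vectors(["cat dog", "cat fish", "dog dog bird"])
# "cat fish" and "dog dog bird" share no terms, so cosine(vecs[1], vecs[2]) == 0.0
```

Note that a term occurring in every document gets idf = log(1) = 0, so it contributes nothing to any similarity score, which is the intended effect.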

Evaluating IR Performance
- Precision: # relevant docs returned / total # docs returned; how often are you right when you say a document is relevant?
- Recall: # relevant docs returned / # relevant docs in the collection; how many of the relevant documents do you find?
- F-measure combines P and R
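These definitions translate directly into code. The sketch below uses the balanced F-measure (the harmonic mean of P and R); the document IDs are made up for the worked example.

```python
def precision(returned, relevant):
    """Of the docs we returned, what fraction were relevant?"""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Of the relevant docs in the collection, what fraction did we return?"""
    return len(returned & relevant) / len(relevant)

def f_measure(p, r):
    """Balanced F (beta = 1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

returned = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}   # 3 relevant docs exist in the collection
# precision = 2/4 = 0.5, recall = 2/3, F = 4/7 ~ 0.571
```

The harmonic mean punishes imbalance: a system with high precision but terrible recall (or vice versa) gets a low F.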

Improving Queries
- Relevance feedback: users rate the retrieved docs
- Query expansion: many techniques, e.g. add the top N retrieved docs to the query
- Term clustering: cluster rows of terms to produce synonyms, and add these to the query
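The "add the top N docs to the query" idea (pseudo-relevance feedback) can be sketched in a few lines. The parameter names and the toy document list are invented for illustration.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_n=2, extra_terms=3):
    """Query expansion sketch: assume the top-N retrieved docs are relevant
    and add their most frequent terms to the query."""
    counts = Counter()
    for doc in ranked_docs[:top_n]:
        counts.update(doc.lower().split())
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:extra_terms]

docs = ["jaguar speed engine", "jaguar engine parts", "jaguar cat habitat"]
expand_query(["jaguar"], docs, top_n=2, extra_terms=2)
# ['jaguar', 'engine', 'speed']
```

Because only the top-ranked documents feed the expansion, a query about the car is pushed further toward car documents; the risk, of course, is drift when the top documents are not actually relevant.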

IR Tasks
- Ad hoc retrieval: 'normal' IR
- Routing/categorization: assign a new doc to one of a predefined set of categories
- Clustering: divide a collection into N clusters
- Segmentation: segment a text into coherent chunks
- Summarization: compress a text by extracting summary items
- Question answering: find a stretch of text containing the answer to a question

Summary
Many approaches to 'robust' semantic analysis:
- Semantic grammars targeting particular domains
  Utterance --> Yes/No-Reply
  Yes/No-Reply --> Yes-Reply | No-Reply
  Yes-Reply --> {yes, yeah, right, ok, "you bet", ...}
- Information extraction techniques targeting specific tasks, e.g. extracting information about terrorist events from news
- Information retrieval techniques --> more like NLP