Evidence from Content INST 734 Module 2 Doug Oard

Agenda
Character sets
➤ Terms as units of meaning
Boolean retrieval
Building an index

Segmentation
Retrieval is (often) a search for concepts
–But what we actually search are character strings
What strings best represent concepts?
–In English, words are often a good choice
 But well-chosen phrases might also be helpful
–In German, compounds may need to be split
 Otherwise queries using constituent words would fail
–In Chinese, word boundaries are not marked
 Thissegmentationproblemissimilartothatofspeech

Tokenization
Words (from linguistics):
–Morphemes are the units of meaning
–Combined to make words
 Anti (disestablishmentarian) ism
Tokens (from computer science):
–Doug ’s running late !
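As a minimal sketch (not the course's tokenizer), a regular expression that treats clitics like ’s and punctuation marks as separate tokens reproduces the slide's example:

```python
import re

def tokenize(text):
    # Match a clitic ('s, 'll, ...), then a run of word characters,
    # then any single non-space punctuation mark.
    return re.findall(r"['’]\w+|\w+|[^\w\s]", text)

print(tokenize("Doug's running late!"))
# ['Doug', "'s", 'running', 'late', '!']
```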

Morphology
Inflectional morphology
–Preserves part of speech
–Destructions = Destruction+PLURAL
–Destroyed = Destroy+PAST
Derivational morphology
–Relates parts of speech
–Destructor = AGENTIVE(destroy)

Stemming
Conflates words, usually preserving meaning
–Rule-based suffix-stripping helps for English
 {destroy, destroyed, destruction}: destr
–Prefix-stripping is needed in some languages
 Arabic: {alselam}: selam [Root: SLM (peace)]
Imperfect, but goal is to usually be helpful
–Overstemming: {centennial, century, center}: cent
–Understemming: {acquire, acquiring, acquired}: acquir; {acquisition}: acquis
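To make over- and understemming concrete, here is a toy suffix-stripping stemmer. Its rules are chosen only so that it reproduces the slide's English examples; it is a sketch, not the Porter algorithm or any real stemmer.

```python
# Suffix rules picked to reproduce the slide's examples; a real stemmer
# has many more rules and conditions.
SUFFIXES = ("uctions", "uction", "ennial", "ition", "oyed",
            "ury", "ing", "oy", "ed", "er", "e", "s")

def crude_stem(word):
    for suffix in SUFFIXES:                      # most specific suffixes first
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Conflation:    destroy, destroyed, destruction -> destr
# Overstemming:  centennial, century, center     -> cent
# Understemming: acquire/acquiring/acquired -> acquir, but acquisition -> acquis
for w in ("destroy", "destroyed", "destruction",
          "centennial", "century", "center",
          "acquire", "acquiring", "acquired", "acquisition"):
    print(w, "->", crude_stem(w))
```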

Segmentation Option 1: Longest Substring Segmentation
Greedy algorithm based on a lexicon
Start with a list of every possible term
For each unsegmented string:
–Remove the longest substring that appears in the term list
–Repeat until no more substrings from the list are found

Longest Substring Example
Possible German compound term:
–washington
List of German words:
–ach, hin, hing, sei, ton, was, wasch
Longest substring segmentation:
–was-hing-ton
–Roughly translates as “What tone is attached?”
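The greedy procedure is short enough to sketch directly (function and variable names are my own); run on the toy lexicon above, it reproduces was-hing-ton:

```python
def longest_substring_segment(s, lexicon):
    """Greedily pull out the longest lexicon entry found anywhere in s,
    then recurse on what is left to either side."""
    if not s:
        return []
    best, best_pos = None, -1
    for term in lexicon:
        pos = s.find(term)
        if pos >= 0 and (best is None or len(term) > len(best)):
            best, best_pos = term, pos
    if best is None:
        return [s]            # no lexicon term found; leave the piece unsegmented
    return (longest_substring_segment(s[:best_pos], lexicon)
            + [best]
            + longest_substring_segment(s[best_pos + len(best):], lexicon))

lexicon = {"ach", "hin", "hing", "sei", "ton", "was", "wasch"}
print(longest_substring_segment("washington", lexicon))   # ['was', 'hing', 'ton']
```

Ties between equal-length lexicon entries are broken arbitrarily here; the slide's description leaves that choice open.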

Segmentation Option 2: Most Likely Segmentation
Given an input word c₁c₂c₃…cₙ
Try all possible partitions into words w₁w₂w₃…, for example:
–c₁ | c₂c₃ … cₙ
–c₁c₂ | c₃ … cₙ
–c₁c₂c₃ | … cₙ
–…
Choose the highest probability partition
–Based on word combinations seen in segmented text (we will call this a “language model” next week)
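A brute-force sketch of Option 2: enumerate partitions recursively (with memoization) and score each one with a unigram word model. The probability table and every name below are invented for illustration; a real system would estimate a smoothed language model from a large segmented corpus.

```python
import math
from functools import lru_cache

# Invented unigram probabilities, purely for illustration.
P = {"was": 0.01, "wash": 0.002, "washing": 0.001,
     "hing": 1e-7, "ington": 1e-7, "ton": 0.003}

def most_likely_segmentation(s):
    """Return (log-probability, best word tuple) over all partitions of s."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(s):
            return 0.0, ()
        candidates = []
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in P:                            # only consider known words
                rest_lp, rest_words = best(j)
                candidates.append((math.log(P[w]) + rest_lp, (w,) + rest_words))
        return max(candidates) if candidates else (float("-inf"), ())
    return best(0)

lp, words = most_likely_segmentation("washington")
print(words, lp)     # ('washing', 'ton') with these made-up numbers
```

With these made-up numbers the model prefers washing + ton over was + hing + ton, which is exactly the point: the quality of the segmentation rests on the probability estimates.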

Generic Segmentation: N-gram Indexing
Consider a Chinese document c₁c₂c₃…cₙ
–Don’t commit to any single segmentation
Instead, treat every character bigram as a term
–c₁c₂, c₂c₃, c₃c₄, …, cₙ₋₁cₙ
Break up queries the same way
Bad matches will happen, but rarely
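Bigram indexing needs only a few lines; the Chinese string below is my own illustrative example, not one from the slides:

```python
def char_bigrams(text):
    # Every overlapping character bigram becomes an indexing term.
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("北京大学"))   # ['北京', '京大', '大学']
```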

Relating Words and Concepts
Homonymy: bank (river) vs. bank (financial)
–Different words are written the same way
–We’d like to work with word senses rather than words
Polysemy: fly (pilot) vs. fly (passenger)
–A word can have different “shades of meaning”
–Not bad for IR: often helps more than it hurts
Synonymy: class vs. course
–Causes search failures … we’ll address this next week!

Phrases
Phrases can yield more precise queries
–“University of Maryland”, “solar eclipse”
We would not usually index only phrases
–Because people might search for parts of a phrase
Positional indexes allow query-time phrase match
–At the cost of a larger index and slower queries
IR systems are good at evidence combination
–Better evidence combination → less help from phrases
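A minimal sketch of how a positional index supports query-time phrase matching (whitespace tokenization, lowercasing, and all names are my own simplifications):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc_ids where the phrase's terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, positions in index[terms[0]].items():
        for p in positions:
            if all(doc_id in index[t] and p + k in index[t][doc_id]
                   for k, t in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
    return hits

docs = {1: "the University of Maryland campus",
        2: "a university in Maryland"}
idx = build_positional_index(docs)
print(phrase_match(idx, "University of Maryland"))   # {1}
```

The cost the slide mentions is visible here: every posting carries a position list, and phrase matching must walk those lists at query time.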

“Named Entity” Tagging
Automatically assign “types” to words or phrases
–Person, organization, location, date, money, …
More rapid and robust than parsing
Best algorithms use supervised learning
–Annotate a corpus identifying entities and types
–Train a probabilistic model
–Apply the model to new text
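As a hedged illustration of the “train a probabilistic model, apply it to new text” pipeline, an off-the-shelf tagger such as spaCy (not a tool the course specifies) can be applied directly; its English models were produced from annotated corpora in exactly this supervised fashion.

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Doug Oard teaches INST 734 at the University of Maryland.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, ORG, DATE, MONEY
```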

A “Term” is Whatever You Index
Word sense
Token
Word
Stem
Character n-gram
Phrase

Summary
The key is to index the right kind of terms
Start by finding fundamental features
–So far all we have talked about are character codes
–Same ideas apply to handwriting, OCR, and speech
Combine them into easily recognized units
–Words where possible, character n-grams otherwise
Apply further processing to optimize the system
–Stemming is the most commonly used technique
–Some “good ideas” don’t pan out that way

Agenda
Character sets
Terms as units of meaning
➤ Boolean retrieval
Building an index