Text Analysis Everything Data CompSci 290.01 Spring 2014

Announcements (Thu. Jan. 16) Sample solutions for HW 1 / Lab 1 are online. HW 2 will be posted by tomorrow (Fri) morning at the latest. Please work on it individually (not in teams). Updated team assignments will be posted before Lab 2 (to account for slight changes in enrollment).

Outline Basic Text Processing – Word Counts – Tokenization (Pointwise Mutual Information) – Normalization (Stemming)

Outline Basic Text Processing Finding Salient Tokens – TF-IDF scoring

Outline Basic Text Processing Finding Salient Tokens Document Similarity & Keyword Search – Vector Space Model & Cosine Similarity – Inverted Indexes

Outline Basic Text Processing – Word Counts – Tokenization (Pointwise Mutual Information) – Normalization (Stemming) Finding Salient Tokens Document Similarity & Keyword Search

Basic Query: Word Counts Examples: Google Flu Trends; tracking emotional trends.

Basic Query: Word Counts How many times does each word appear in the document?

Problem 1 What is a word? – I’m … 1 word or 2 words: {I’m} or {I, am} – State-of-the-art … 1 word or 4 words – San Francisco … 1 or 2 words – Other languages: French: l’ensemble; German: Freundschaftsbezeigungen; Chinese: 這是一個簡單的句子 (no spaces; word-by-word gloss: this, one, easy, sentence)

Solution: Tokenization For English: – Splitting the text on non-alphanumeric characters is a reasonable way to find individual words. – But this will also split multi-word terms like “San Francisco” and hyphenated terms like “state-of-the-art”. For other languages: – Need more sophisticated algorithms for identifying words.
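
A minimal sketch (not from the slides) of this splitting approach in Python; the function name tokenize and the example strings are mine:

import re

def tokenize(text):
    # Split on runs of non-alphanumeric characters and drop empty strings.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

print(tokenize("I'm visiting San Francisco."))
# ['i', 'm', 'visiting', 'san', 'francisco']  -- "I'm" is oversplit
print(tokenize("state-of-the-art"))
# ['state', 'of', 'the', 'art']               -- the hyphenated term is broken apart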

Finding simple phrases Want to find pairs of tokens that frequently occur together (co-occurrence) Intuition: Two tokens are co-occurrent if they appear together more often than “random” – More often than monkeys typing English words

Formalizing “more often than random” Language Model – Assigns a probability P(x1, x2, …, xk) to any sequence of tokens. – More common sequences have a higher probability – Sequences of length 1: unigrams – Sequences of length 2: bigrams – Sequences of length 3: trigrams – Sequences of length n: n-grams

Formalizing “more often than random” Suppose we have a language model – P(“San Francisco”) is the probability that “San” and “Francisco” occur together (and in that order) in the language.

Formalizing “random”: Bag of words Suppose we only have access to the unigram language model – Think: all unigrams in the language thrown into a bag, with counts proportional to their P(x) – Monkeys drawing words at random from the bag – P(“San”) × P(“Francisco”) is the probability that “San” followed by “Francisco” occurs under the (random) unigram model

Formalizing “more often than random” Pointwise Mutual Information: PMI(x1, x2) = log [ P(x1, x2) / ( P(x1) P(x2) ) ] – Positive PMI suggests the words co-occur more often than chance – Negative PMI suggests they appear together less often than chance
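
A toy illustration (my own, not from the lecture) of computing PMI from unigram and bigram counts; the one-sentence "corpus" is made up:

import math
from collections import Counter

tokens = "new york is a big city . san francisco is a city by the bay".split()
N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def pmi(x1, x2):
    # PMI(x1, x2) = log( P(x1, x2) / (P(x1) * P(x2)) )
    p_x1 = unigrams[x1] / N
    p_x2 = unigrams[x2] / N
    p_x1_x2 = bigrams[(x1, x2)] / N
    return math.log(p_x1_x2 / (p_x1 * p_x2))

print(pmi("san", "francisco"))   # > 0: the pair occurs more often than chance predicts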

What is P(x)? “Suppose we have a language model …” Idea: Use counts from a large corpus of text to compute the probabilities

What is P(x)? Unigram: P(x) = count(x) / N, where count(x) is the number of times token x appears and N is the total number of tokens. Bigram: P(x1, x2) = count(x1, x2) / N, where count(x1, x2) = # times the sequence (x1, x2) appears. Trigram: P(x1, x2, x3) = count(x1, x2, x3) / N, where count(x1, x2, x3) = # times the sequence (x1, x2, x3) appears.
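
The following sketch (mine; the helper name ngram_probs is an invention for illustration) estimates these probabilities directly from counts, matching the definitions above:

from collections import Counter

def ngram_probs(tokens, n):
    # P(x1..xn) = count(x1..xn) / N, with N = total number of tokens, as on the slide.
    N = len(tokens)
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c / N for gram, c in counts.items()}

tokens = "to be or not to be".split()
print(ngram_probs(tokens, 1)[("to",)])        # 2/6
print(ngram_probs(tokens, 2)[("to", "be")])   # 2/6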

Large text corpora – Corpus of Contemporary American English – Google Books Ngram Viewer

Summary of Problem 1 Word tokenization can be hard Splitting on spaces / non-alphanumeric characters may oversplit the text We can find co-occurring tokens: – Build a language model from a large corpus – Check whether a pair of tokens appears together more often than random using pointwise mutual information.

Language models … … have many applications – Tokenizing long strings – Word/query completion/suggestion – Spell checking – Machine translation – …

Problem 2 A word may be represented in many forms – car, cars, car’s, cars’ → car – automation, automatic → automate Lemmatization: the problem of finding the correct dictionary headword form

Solution: Stemming Words are made up of – Stems: the core word – Affixes: modifiers added (often with grammatical function) Stemming: reduce terms to their stems by crudely chopping off affixes – automation, automatic → automat

Porter’s algorithm for English Sequences of rewrite rules applied to words. Example rules (in regular-expression form):
/(.*)sses$/ → \1ss
/(.*)ies$/ → \1i
/(.*)ss$/ → \1ss
/(.*[^s])s$/ → \1
/(.*[aeiou]+.*)ing$/ → \1
/(.*[aeiou]+.*)ed$/ → \1
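
Below is a small Python sketch of such a rule cascade (first matching rule wins). It covers only the six example rules above, so it is far from the full Porter stemmer; in practice one would use an existing implementation such as NLTK's PorterStemmer:

import re

# (pattern, replacement) pairs mirroring the example rules above.
RULES = [
    (re.compile(r"(.*)sses$"), r"\1ss"),
    (re.compile(r"(.*)ies$"), r"\1i"),
    (re.compile(r"(.*)ss$"), r"\1ss"),
    (re.compile(r"(.*[^s])s$"), r"\1"),
    (re.compile(r"(.*[aeiou]+.*)ing$"), r"\1"),
    (re.compile(r"(.*[aeiou]+.*)ed$"), r"\1"),
]

def stem(word):
    # Apply the first rule whose pattern matches; otherwise leave the word unchanged.
    for pattern, repl in RULES:
        if pattern.match(word):
            return pattern.sub(repl, word)
    return word

for w in ["caresses", "ponies", "cats", "walking", "automated"]:
    print(w, "->", stem(w))
# caresses -> caress, ponies -> poni, cats -> cat, walking -> walk, automated -> automat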

Any other problems? Same word can mean different things – Florence the person vs. Florence the city – Paris Hilton (person or hotel) Abbreviations – I.B.M. vs. International Business Machines Different words can mean the same thing – Big Apple vs. New York …

Any other problems? Same word can mean different things – Word Sense Disambiguation Abbreviations – Translations Different words can mean the same thing – Named Entity Recognition & Entity Resolution …

Outline Basic Text Processing Finding Salient Tokens – TF-IDF scoring Document Similarity & Keyword Search

Summarizing text

Finding salient tokens (words) Most frequent tokens?

Document Frequency Intuition: Uninformative words appear in many documents (not just the one we are concerned about) Salient word: – High count within the document – Low count across documents (appears in few documents)

TF × IDF score Term Frequency: TF(x, D) = c(x), where c(x) is the # of times x appears in document D. Inverse Document Frequency: IDF(x) = log( N / DF(x) ), where N is the total # of documents and DF(x) is the number of documents x appears in. TFIDF(x, D) = TF(x, D) × IDF(x).
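
A short sketch (mine, with a made-up three-document corpus) that computes these scores from the definitions above:

import math
from collections import Counter

docs = {
    "d1": "caesar conquered gaul".split(),
    "d2": "caesar and cleopatra".split(),
    "d3": "cleopatra ruled egypt".split(),
}
N = len(docs)
# DF(x): number of documents in which token x appears.
df = Counter(x for tokens in docs.values() for x in set(tokens))

def tfidf(x, tokens):
    tf = tokens.count(x)           # TF(x, D) = c(x)
    idf = math.log(N / df[x])      # IDF(x) = log(N / DF(x))
    return tf * idf

print(tfidf("caesar", docs["d1"]))   # appears in 2 of 3 docs -> low IDF
print(tfidf("gaul", docs["d1"]))     # appears in only 1 doc  -> higher IDF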

Back to summarization Simple heuristic: – Pick the sentences S = {x1, x2, …, xk} whose tokens have the highest aggregate TF-IDF score (e.g., Σi TFIDF(xi, D)).
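
A sketch of one way to realize this heuristic; treating each sentence as its own "document" for the DF counts, and averaging the tokens' TF-IDF scores, are both simplifications of mine:

import math
from collections import Counter

sentences = [
    "caesar conquered gaul".split(),
    "caesar returned to rome".split(),
    "the senate feared caesar".split(),
]
N = len(sentences)
df = Counter(x for sent in sentences for x in set(sent))

def tfidf(x, sent):
    return sent.count(x) * math.log(N / df[x])

def sentence_score(sent):
    # Average TF-IDF of the tokens, so long sentences are not automatically favoured.
    return sum(tfidf(x, sent) for x in sent) / len(sent)

summary = max(sentences, key=sentence_score)
print(" ".join(summary))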

Outline Basic Text Processing Finding Salient Tokens Document Similarity & Keyword Search – Vector Space Model & Cosine Similarity – Inverted Indexes

Document Similarity

Vector Space Model Let V = {x1, x2, …, xn} be the set of all tokens (across all documents) A document is an n-dimensional vector D = [w1, w2, …, wn], where wi = TFIDF(xi, D)

Distance between documents Euclidean distance: D1 = [w1, w2, …, wn], D2 = [y1, y2, …, yn], d(D1, D2) = √( Σi (wi - yi)² ) Why might this be a poor measure? (Two documents on the same topic but of very different lengths can end up far apart.) (Figure: two document vectors D1 and D2 and the distance between them.)

Cosine Similarity (Figure: the angle θ between document vectors D1 and D2.)

Cosine Similarity D1 = [w1, w2, …, wn], D2 = [y1, y2, …, yn] cos(D1, D2) = (D1 · D2) / ( ‖D1‖ ‖D2‖ ) = Σi wi yi / ( √(Σi wi²) √(Σi yi²) ); the numerator is the dot product and the denominator is the product of the L2 norms.
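
A direct translation of this formula into Python (the vectors below are arbitrary example values):

import math

def cosine_similarity(d1, d2):
    # cos(D1, D2) = dot(D1, D2) / (||D1||_2 * ||D2||_2)
    dot = sum(w * y for w, y in zip(d1, d2))
    norm1 = math.sqrt(sum(w * w for w in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # 0.0: no shared terms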

Keyword Search How to find documents that are similar to a keyword query? Intuition: Think of the query as another (very short) document

Keyword Search Simple algorithm: For every document D in the corpus, compute sim(q, D); return the top-k highest-scoring documents.
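
Putting the pieces together, here is a brute-force version of this algorithm over a toy corpus (the documents, names, and vocabulary construction are illustrative, not the course's reference implementation):

import math
from collections import Counter

docs = {
    "d1": "caesar conquered gaul".split(),
    "d2": "caesar and cleopatra".split(),
    "d3": "cleopatra ruled egypt".split(),
}
N = len(docs)
vocab = sorted({x for tokens in docs.values() for x in tokens})
df = Counter(x for tokens in docs.values() for x in set(tokens))

def to_vector(tokens):
    # D = [w1, ..., wn] with wi = TFIDF(xi, D); query terms outside the vocabulary are ignored.
    return [tokens.count(x) * math.log(N / df[x]) for x in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, k=2):
    # Treat the query as a (very short) document and score every document in the corpus.
    q = to_vector(query.split())
    scores = sorted(((cosine(q, to_vector(t)), name) for name, t in docs.items()), reverse=True)
    return scores[:k]

print(search("cleopatra"))   # d2 and d3 rank above d1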

Does it scale?

Inverted Indexes Let V = {x1, x2, …, xn} be the set of all tokens For each token x, store the sorted list of ids of the documents containing x (its inverted list, or posting list) – How does this help?
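
A minimal sketch of building such an index (doc ids are visited in increasing order, so each posting list comes out sorted); the documents and the tokenize helper are made up:

import re
from collections import defaultdict

docs = {
    1: "I came, I saw, I conquered.",
    2: "Caesar and Cleopatra met in Egypt.",
    3: "Cleopatra ruled Egypt.",
}

def tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# token -> sorted list of ids of the documents containing it (its posting list)
index = defaultdict(list)
for doc_id in sorted(docs):
    for token in set(tokenize(docs[doc_id])):
        index[token].append(doc_id)

print(index["cleopatra"])   # [2, 3]
print(index["caesar"])      # [2]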

Using Inverted Lists Documents containing “caesar” – Use the inverted list to find documents containing “caesar” – What additional information should we keep to compute similarity between the query and documents?

Using Inverted Lists Documents containing “caesar” AND “cleopatra” – Return the documents in the intersection of the two inverted lists. – Why is the inverted list sorted by document id? (So the intersection can be computed with a single linear merge; see the sketch below.) OR? NOT? – Union and difference, respectively.
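
Because both posting lists are sorted by document id, the AND query reduces to a linear merge, sketched below (the list values are arbitrary examples):

def intersect(postings1, postings2):
    # Walk both sorted lists in lockstep: O(len(postings1) + len(postings2)).
    result, i, j = [], 0, 0
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            result.append(postings1[i])
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 11, 31], [2, 3, 4, 10, 31]))   # [2, 4, 31]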

Many other things in a modern search engine … Maintain positional information to answer phrase queries Scoring is not based only on token similarity – Importance of webpages: PageRank (in later classes) User feedback – Clicks and query reformulations

Summary Word counts are very useful when analyzing text – Need good algorithms for tokenization, stemming, and other normalizations Algorithms for finding word co-occurrence – Language models – Pointwise Mutual Information

Summary (contd.) Raw counts are not sufficient to find salient tokens in a document – Term Frequency × Inverse Document Frequency (TF-IDF) scoring Keyword Search – Use cosine similarity over TF-IDF scores to compute similarity – Use inverted indexes to speed up processing.