INFO 624 - Week 5: Text Properties and Operations
Dr. Xia Lin, Assistant Professor
College of Information Science and Technology, Drexel University

Objectives of Assignment 1
- Practice basic Web skills
- Get familiar with a few search engines
- Learn to describe features of search engines
- Learn to compare search engines

Grading Sheet for Assignment 1
1. Memo
2. Selection of search engines
   - Is it downloadable?
   - Can it be controlled by the small business?
3. Quality of reviews
4. Format of the review pages, including metadata
5. Appropriate links in the reviews and in the registered page

What's missing?
- Who are the current users of the selected search engine?
- Hands-on experience with the selected search engines
  - Personal observation or experience
  - Some testing on demos or customer sites
- Convincing statements on the differences between the search engines

Properties of Text
- Classic theories
  - Zipf's Law
  - Information Theory
  - Benford's Law
  - Bradford's Law
  - Heaps' Law
- English letter/word frequencies

Zipf's Law (1945)
In a large, well-written English document,
   r * f = c
where r is the rank of a word, f is the number of times that word is used in the document, and c is a constant.
Different collections may have different c. English text tends to have c = N/10, where N is the number of words in the collection.

Zipf's Law is an approximate empirical observation.
- Examples:
  - Word frequencies in Alice in Wonderland
  - The Time magazine collection
Zipf's Law has been verified over many years on many different collections.
There are also many revised versions of Zipf's Law.

Example:
The word "the" is the most frequently occurring word in the novel Moby Dick, occurring 1450 times.
The word "with" is the second-most frequently occurring word in that novel.
- How many times would we expect "with" to occur?
- How many times would we expect the third most frequently occurring word to appear?
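A quick worked answer using Zipf's Law (taking the count of 1450 above at face value): since c = r * f = 1 * 1450 = 1450, the rank-2 word "with" would be expected about 1450 / 2 = 725 times, and the rank-3 word about 1450 / 3, or roughly 483 times.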

Information Theory
- Entropy (1948)
  - Uses the distribution of symbols to predict the amount of information in a text
  - A quantified measure of information
    - Useful for (physical) data transfer
    - And for compression
    - Not directly applicable to IR
- Example:
  - Which letter is likely to appear after the letter "c" is received?
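To make the entropy idea concrete, here is a minimal sketch (not part of the original slides) that computes the Shannon entropy of an assumed symbol distribution; the probabilities are illustrative placeholders, not measured values.

/* entropy.c -- minimal sketch: Shannon entropy of a symbol distribution.
   The probabilities below are illustrative placeholders, not real data. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* assumed probabilities for a few symbols; they must sum to 1.0 */
    double p[] = { 0.5, 0.25, 0.125, 0.125 };
    int n = sizeof(p) / sizeof(p[0]);
    double h = 0.0;

    for (int i = 0; i < n; i++)
        if (p[i] > 0.0)
            h -= p[i] * log2(p[i]);   /* H = -sum p_i * log2(p_i) */

    printf("entropy = %.3f bits per symbol\n", h);   /* prints 1.750 */
    return 0;
}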

English Letter Usage Statistics
Letter use frequencies (most frequent first): E, T, A, O, N, I, H, ...

Doubled letter frequencies (most frequent first): LL, EE, SS, OO, TT, RR, PP, FF, ...

Initial letter frequencies (most frequent first): T, A, H, W, I, S, O, M, B, ...

Ending letter frequencies (most frequent first): E, D, S, T, N, R, Y, O, ...

Benford's Law
If we randomly select a number from a table of statistical data, the probability that the first digit will be a "1" is about 0.301, rather than the 0.1 we might expect if all digits were equally likely.
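A hedged sketch of where the 0.301 comes from: Benford's Law gives the first-digit probability as log10(1 + 1/d), which the snippet below (not from the slides) prints for d = 1..9.

/* benford.c -- print Benford first-digit probabilities log10(1 + 1/d) */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (int d = 1; d <= 9; d++)
        printf("P(first digit = %d) = %.3f\n", d, log10(1.0 + 1.0 / d));
    /* d = 1 prints 0.301, matching the figure quoted above */
    return 0;
}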

Bradford's Law
On a given subject, a few core journals will provide 1/3 of the articles on that subject, a medium number of secondary journals will provide another 1/3, and a large number of peripheral journals will provide the final 1/3.

For example
If you found 300 citations for IR:
- 100 of those citations likely came from a core group of 5 journals,
- another 100 citations came from a group of 25 journals,
- and the final 100 citations came from 125 peripheral journals.
Bradford expressed his law with the formula 1 : n : n^2.

Heaps' Law
The relationship between the size of the vocabulary and the size of the collection is
   V = K * n^b
where V is the number of unique words, n is the text size (number of words in the collection), and K and b are constants.
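A worked example (the parameter values K = 10 and b = 0.5 are illustrative assumptions, not from the slides): a collection of n = 1,000,000 words would then be expected to contain roughly V = 10 * 1,000,000^0.5 = 10,000 distinct words, and doubling the text size would grow the vocabulary by only about 41% (a factor of 2^0.5).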

Computerized Text Analysis
- Word (token) extraction
- Stop words
- Stemming
- Frequency counts
- Clustering

Word Extraction
- Basic problems
  - Digits
  - Hyphens
  - Punctuation
  - Cases
- Lexical analyzer
  - Define all possible characters in a finite state machine
  - Specify which states should cause a token break
- Example: Parser.c
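The course's Parser.c is not reproduced here; the following is only a minimal sketch of the kind of token extraction described above, treating letters and digits as token characters, treating everything else (punctuation, hyphens, whitespace) as a break, and lowercasing for case folding.

/* tokenize.c -- minimal word-extraction sketch (not the course's Parser.c):
   letters/digits continue a token, any other character ends it. */
#include <stdio.h>
#include <ctype.h>

void tokenize(const char *text)
{
    char token[64];
    int len = 0;

    for (const unsigned char *p = (const unsigned char *)text; ; p++) {
        if (isalnum(*p)) {
            if (len < (int)sizeof(token) - 1)
                token[len++] = (char)tolower(*p);   /* case folding */
        } else {                                    /* break character */
            if (len > 0) {
                token[len] = '\0';
                printf("%s\n", token);
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
}

int main(void)
{
    tokenize("State-of-the-art IR systems, built in 1999!");
    return 0;
}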

Stop Words
Many of the most frequently used words in English are worthless for indexing; these words are called stop words.
- the, of, and, to, ...
- Typically about 400 to 500 such words
Why do we need to remove stop words?
- Reduce indexing file size
  - Stop words account for 20-30% of total word counts
- Improve efficiency
  - Stop words are not useful for searching
  - Stop words always have a large number of hits

Stop Words
Potential problems of removing stop words:
- A small stop list does not improve indexing much
- A large stop list may eliminate some words that might be useful for someone or for some purpose
- Stop words might be part of phrases
- Stop words need to be processed for both indexing and queries
Examples: common.c, commonwords
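The files referenced above (common.c, commonwords) are not reproduced; this is only a minimal sketch of stop-word removal with a tiny hard-coded stop list (a real list would hold 400-500 entries).

/* stopwords.c -- minimal stop-word removal sketch (tiny illustrative list) */
#include <stdio.h>
#include <string.h>

static const char *stoplist[] = { "the", "of", "and", "to", "a", "in" };

int is_stopword(const char *word)
{
    for (size_t i = 0; i < sizeof(stoplist) / sizeof(stoplist[0]); i++)
        if (strcmp(word, stoplist[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *tokens[] = { "properties", "of", "the", "text", "collection" };
    for (int i = 0; i < 5; i++)
        if (!is_stopword(tokens[i]))
            printf("%s\n", tokens[i]);   /* keeps: properties, text, collection */
    return 0;
}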

Stemming
Techniques used to find the root/stem of a word:
- Lookup of "user engineering":
     user     15      engineering   12
     users     4      engineered    23
     used      5      engineer      12
     using     5
- Stems: use, engineer

Advantages of Stemming
- Improving effectiveness
  - Matching similar words
- Reducing indexing size
  - Combining words with the same root may reduce indexing size by as much as 40-50%
Criteria for stemming:
- Correctness
- Retrieval effectiveness
- Compression performance

Basic Stemming Methods
Use tables and rules:
- Remove the ending
  - If a word ends with a consonant other than s, followed by an s, then delete the s.
  - If a word ends in es, drop the s.
  - If a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th".
  - If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  - ...

- Transform the remaining word
  - If a word ends with "ies" but not "eies" or "aies", then "ies" --> "y".
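A minimal sketch of the rule style listed above (only a few of the rules, applied in order; not the full table-and-rule stemmer from the course).

/* stemrules.c -- sketch of a few of the suffix rules listed above */
#include <stdio.h>
#include <string.h>

int ends_with(const char *w, const char *suf)
{
    size_t lw = strlen(w), ls = strlen(suf);
    return lw >= ls && strcmp(w + lw - ls, suf) == 0;
}

void stem(char *w)
{
    size_t n = strlen(w);
    if (ends_with(w, "ies") && !ends_with(w, "eies") && !ends_with(w, "aies")) {
        w[n - 3] = 'y';                  /* "ies" --> "y" */
        w[n - 2] = '\0';
    } else if (ends_with(w, "es")) {
        w[n - 1] = '\0';                 /* ends in "es": drop the s */
    } else if (ends_with(w, "ing") && n > 4) {
        w[n - 3] = '\0';                 /* delete "ing" if enough remains */
    } else if (n >= 2 && w[n - 1] == 's' &&
               w[n - 2] != 's' && strchr("aeiou", w[n - 2]) == NULL) {
        w[n - 1] = '\0';                 /* consonant (other than s) + "s": delete s */
    }
}

int main(void)
{
    char words[][16] = { "queries", "indexes", "searching", "documents" };
    for (int i = 0; i < 4; i++) {
        stem(words[i]);
        printf("%s\n", words[i]);        /* query, indexe, search, document */
    }
    return 0;
}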

Example 1: Porter Stemming Algorithm
- A set of condition/action rules
  - Conditions on the stem
  - Conditions on the suffix
  - Conditions on the rules
  - Different combinations of conditions activate different rules
- Implementation (sketch from stem.c):
    Stem(word)
    ...
    ReplaceEnd(word, step1a_rule);
    rule = ReplaceEnd(word, step1b_rule);
    if ((rule == 106) || (rule == 107))
        ReplaceEnd(word, step1b1_rule);
    ...

Example 2: Sound-Based Stemming
Soundex rules (letter --> numeric equivalent):
- B, F, P, V             --> 1
- C, G, J, K, Q, S, X, Z --> 2
- D, T                   --> 3
- L                      --> 4
- M, N                   --> 5
- R                      --> 6
- A, E, I, O, U, W, Y    --> not coded
Words that sound similar often have the same code.
The code is not unique.
High compression rate.
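A minimal Soundex sketch following the table above (keep the first letter, encode the rest, skip uncoded letters and immediate repeats, pad to four characters); the exact variant used in the course is assumed, not reproduced.

/* soundex.c -- minimal Soundex sketch following the code table above */
#include <stdio.h>
#include <ctype.h>

static char code_of(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'B': case 'F': case 'P': case 'V':                          return '1';
    case 'C': case 'G': case 'J': case 'K':
    case 'Q': case 'S': case 'X': case 'Z':                          return '2';
    case 'D': case 'T':                                              return '3';
    case 'L':                                                        return '4';
    case 'M': case 'N':                                              return '5';
    case 'R':                                                        return '6';
    default:                                                         return '0'; /* not coded */
    }
}

void soundex(const char *name, char out[5])
{
    char prev = code_of(name[0]);
    int len = 1;

    out[0] = (char)toupper((unsigned char)name[0]);   /* keep the first letter */
    for (int i = 1; name[i] != '\0' && len < 4; i++) {
        char c = code_of(name[i]);
        if (c != '0' && c != prev)                    /* skip uncoded letters and repeats */
            out[len++] = c;
        prev = c;
    }
    while (len < 4)
        out[len++] = '0';                             /* pad with zeros */
    out[4] = '\0';
}

int main(void)
{
    char code[5];
    soundex("Robert", code);
    printf("Robert -> %s\n", code);   /* R163 */
    soundex("Rupert", code);
    printf("Rupert -> %s\n", code);   /* R163: similar-sounding, same code */
    return 0;
}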

Frequency Counts
The idea:
- The best a computer can do is count
  - Count the number of times a word occurs in a document
  - Count the number of documents in a collection that contain a word
- Use occurrence frequencies to indicate the relative importance of a word in a document
  - If a word appears often in a document, the document likely "deals with" subjects related to the word.

- Use occurrence frequencies to select the most useful words to index a document collection
  - If a word appears in every document, it is not a good indexing word
  - If a word appears in only one or two documents, it may not be a good indexing word
  - If a word appears in a title, each occurrence might be counted 5 (or 10) times
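A minimal sketch of the two counts described above: term frequency within a document and document frequency across a tiny hard-coded collection (the documents and terms are illustrative, not from the course).

/* counts.c -- sketch: term frequency and document frequency over a toy collection */
#include <stdio.h>
#include <string.h>

#define NDOCS 3

/* toy collection: each document is already tokenized, NULL-terminated */
static const char *docs[NDOCS][6] = {
    { "text", "retrieval", "text", "operations", NULL },
    { "query", "operations", "retrieval", NULL },
    { "text", "properties", NULL },
};

int term_frequency(const char *term, const char *doc[])
{
    int tf = 0;
    for (int i = 0; doc[i] != NULL; i++)
        if (strcmp(doc[i], term) == 0)
            tf++;
    return tf;
}

int document_frequency(const char *term)
{
    int df = 0;
    for (int d = 0; d < NDOCS; d++)
        if (term_frequency(term, docs[d]) > 0)
            df++;
    return df;
}

int main(void)
{
    printf("tf(text, doc0) = %d\n", term_frequency("text", docs[0]));  /* 2 */
    printf("df(text)       = %d\n", document_frequency("text"));       /* 2 */
    return 0;
}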

Automatic Indexing
1. Parse individual words (tokens)
2. Remove stop words
3. Apply stemming
4. Use frequency data
   - Decide the head (high-frequency) threshold
   - Decide the tail (low-frequency) threshold
   - Decide the variance of counting

5. Create the indexing structure
   - Inverted index
   - Other structures

Term Associations
- Counting word pairs
  - If two words appear together very often, they are likely to be a phrase
- Counting document pairs
  - If two documents have many common words, they are likely related

More Counting
- Counting citation pairs
  - If documents A and B both cite documents C and D, then A and B might be related
  - If documents C and D are often cited together, they are likely related
- Counting link patterns
  - Get all pages that have links to my pages
  - Get all pages that contain links similar to those on my pages

Google Search Engine
- Link analysis
  - PageRank: the ranking of a web page is based on the number of links that refer to that page
  - If page A has a link to page B, page A casts one vote for B
  - The more votes a page gets, the more useful the page is
  - If page A itself receives many votes, its vote for B counts more heavily
- Combining link analysis with word matching
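A minimal power-iteration sketch of the PageRank idea described above, on a tiny made-up four-page link graph; the damping factor 0.85 is the commonly quoted value, and the graph itself is an illustrative assumption.

/* pagerank.c -- sketch: power iteration on a tiny made-up link graph */
#include <stdio.h>

#define N 4          /* pages 0..3 */
#define ITER 50
#define D 0.85       /* damping factor */

/* links[i][j] = 1 if page i links to page j (illustrative graph) */
static const int links[N][N] = {
    { 0, 1, 1, 0 },
    { 0, 0, 1, 0 },
    { 1, 0, 0, 1 },
    { 0, 0, 1, 0 },
};

int main(void)
{
    double pr[N], next[N];
    int outdeg[N] = { 0 };

    for (int i = 0; i < N; i++) {
        pr[i] = 1.0 / N;                       /* start uniform */
        for (int j = 0; j < N; j++)
            outdeg[i] += links[i][j];
    }

    for (int it = 0; it < ITER; it++) {
        for (int j = 0; j < N; j++)
            next[j] = (1.0 - D) / N;           /* random-jump share */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (links[i][j])
                    next[j] += D * pr[i] / outdeg[i];   /* page i "votes" for j */
        for (int j = 0; j < N; j++)
            pr[j] = next[j];
    }

    for (int j = 0; j < N; j++)
        printf("PageRank(page %d) = %.3f\n", j, pr[j]);
    return 0;
}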

ConceptLink
- Uses terms' co-occurrence frequencies
  - To predict semantic relationships
  - To build concept clusters
  - To suggest search terms
- Visualization of term relationships
  - Link displays
  - Map displays
  - Drag-and-drop interface for searching

Document Clustering
Grouping similar documents into different sets:
- Create a similarity matrix
- Apply a hierarchical clustering algorithm:
  1. Identify the two closest documents and combine them into a cluster
  2. Identify the next two closest documents or clusters and combine them into a cluster
  3. If more than one cluster remains, return to step 1
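A minimal sketch of the agglomerative procedure above on an assumed 4x4 similarity matrix, merging the two most similar clusters at each step with a single-link (maximum-similarity) update; the matrix values are illustrative.

/* hcluster.c -- sketch: single-link agglomerative clustering on a toy matrix */
#include <stdio.h>

#define N 4

int main(void)
{
    /* illustrative pairwise document similarities (symmetric) */
    double sim[N][N] = {
        { 1.0, 0.8, 0.1, 0.2 },
        { 0.8, 1.0, 0.3, 0.1 },
        { 0.1, 0.3, 1.0, 0.7 },
        { 0.2, 0.1, 0.7, 1.0 },
    };
    int active[N] = { 1, 1, 1, 1 };   /* clusters still available for merging */

    for (int merges = 0; merges < N - 1; merges++) {
        int bi = -1, bj = -1;
        double best = -1.0;

        /* steps 1-2: find the closest pair of active clusters */
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (active[i] && active[j] && sim[i][j] > best) {
                    best = sim[i][j];
                    bi = i;
                    bj = j;
                }

        printf("merge clusters %d and %d (similarity %.2f)\n", bi, bj, best);

        /* absorb bj into bi using single-link (keep the larger similarity) */
        for (int k = 0; k < N; k++)
            if (k != bi && sim[bj][k] > sim[bi][k])
                sim[bi][k] = sim[k][bi] = sim[bj][k];
        active[bj] = 0;
        /* step 3: repeat while more than one cluster remains */
    }
    return 0;
}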

Applications of Document Clustering
- Vivisimo
  - Clusters search results on the fly
  - Hierarchical categories for drill-down capability
- AltaVista
  - Refine search: clusters related words into different groups based on their co-occurrence rates in documents

AltaVista

Document Similarity
Documents as binary term vectors:
- D1 = {t11, t12, t13, ..., t1n}
- D2 = {t21, t22, t23, ..., t2n}
where each tik is either 0 or 1.
Simple measurements of difference/similarity:
- w = the number of positions where t1k = 1 and t2k = 1
- x = the number of positions where t1k = 1 and t2k = 0
- y = the number of positions where t1k = 0 and t2k = 1
- z = the number of positions where t1k = 0 and t2k = 0

Similarity Measure
Cosine Coefficient:
   Cosine(D1, D2) = w / sqrt(n1 * n2)
which is the same as
   Cosine(D1, D2) = w / sqrt((w + x) * (w + y))

- D1's terms only: n1 = w + x (the number of positions where t1k = 1)
- D2's terms only: n2 = w + y (the number of positions where t2k = 1)
- Sameness count: sc = (w + z) / (n1 + n2)
- Difference count: dc = (x + y) / (n1 + n2)
- Rectangular distance: rd = MAX(n1, n2)
- Conditional probability: cp = min(n1, n2)
- Mean: mean = (n1 + n2) / 2

Similarity Measure
- Dice's Coefficient:
     Dice(D1, D2) = 2w / (n1 + n2)
  where w is the number of terms that D1 and D2 have in common, and n1, n2 are the numbers of terms in D1 and D2.
- Jaccard Coefficient:
     Jaccard(D1, D2) = w / (N - z) = w / (n1 + n2 - w)
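A small sketch of the binary-vector measures above: it tallies w, x, y, z, then prints Dice, Jaccard, and cosine. The two example term vectors are made up for illustration.

/* simcoef.c -- sketch: Dice, Jaccard, and cosine for binary term vectors */
#include <stdio.h>
#include <math.h>

#define NTERMS 8

int main(void)
{
    /* illustrative binary term vectors for two documents */
    int d1[NTERMS] = { 1, 1, 0, 1, 0, 1, 0, 0 };
    int d2[NTERMS] = { 1, 0, 0, 1, 1, 1, 0, 1 };
    int w = 0, x = 0, y = 0, z = 0;

    for (int k = 0; k < NTERMS; k++) {
        if (d1[k] && d2[k])        w++;   /* both have the term   */
        else if (d1[k] && !d2[k])  x++;   /* only D1 has the term */
        else if (!d1[k] && d2[k])  y++;   /* only D2 has the term */
        else                       z++;   /* neither has the term */
    }

    int n1 = w + x, n2 = w + y;
    printf("Dice    = %.3f\n", 2.0 * w / (n1 + n2));          /* 0.667 */
    printf("Jaccard = %.3f\n", (double)w / (n1 + n2 - w));    /* 0.500 */
    printf("Cosine  = %.3f\n", w / sqrt((double)n1 * n2));    /* 0.671 */
    return 0;
}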

Similarity Metric
A metric has three defining properties:
- Its values are non-negative
- It is symmetric
- It satisfies the triangle inequality: |AC| <= |AB| + |BC|

Lp Metrics
   Lp(D1, D2) = (sum over k of |t1k - t2k|^p)^(1/p)
where p = 1 gives the city-block (Manhattan) distance and p = 2 gives the Euclidean distance.

Similarity Matrix
Pairwise similarities among a group of documents:
   S11 S12 S13 S14 S15 S16 S17 S18
   S21 S22 S23 S24 S25 S26 S27 S28
   S31 S32 S33 S34 S35 S36 S37 S38
   S41 S42 S43 S44 S45 S46 S47 S48
   S51 S52 S53 S54 S55 S56 S57 S58
   S61 S62 S63 S64 S65 S66 S67 S68
   S71 S72 S73 S74 S75 S76 S77 S78
   S81 S82 S83 S84 S85 S86 S87 S88

Metadata
- Data about data
- Descriptive metadata
  - External to the meaning of the document
  - Dublin Core Metadata Element Set
  - Author, title, publisher, etc.
- Semantic metadata
  - Subject indexing
- Challenge: automatic generation of metadata for documents

Markup Languages (diagram)
- Metalanguages: SGML, XML, HyTime
- Languages: HTML; RDF, MathML, SMIL (Semantic Web?)
- Stylesheets: XSL, CSS

Midterm
Concepts:
- What is information retrieval?
- Data, information, text, and documents
- Two abstraction principles
- Users' information needs
- Queries and query formats
- Precision and recall
- Relevance
- Zipf's Law, Benford's Law

Midterm
Procedures and problem solving:
- How to translate a request into a query
- How to expand queries for better recall or better precision
- How to create an inverted index
- How to create a vector space
- How to calculate similarities of documents
- How to match a query to documents in a vector space

Discussions:
- Challenges of IR
- Advantages and disadvantages of Boolean search (vector space, automatic indexing, association-based queries, etc.)
- Evaluation of IR systems, with or without using precision/recall
- The difference between data retrieval and information retrieval