
1 INFO 624 -- Week 5 Text Properties and Operations Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University

2 Objectives of Assignment 1
- Practice basic Web skills
- Get familiar with a few search engines
- Learn to describe features of search engines
- Learn to compare search engines

3 Grading Sheet for Assignment 1
1. Memo
2. Selection of search engines
   - Is it downloadable?
   - Can it be controlled by the small business?
3. Quality of reviews
4. Format of the review pages, including metadata
5. Appropriate links in the reviews and in the registered page

4 What's missing?
- Who are the current users of the selected search engines?
- Hands-on experience with the selected search engines
  - Personal observation or experience
  - Some testing on demos or customer sites
- Convincing statements on the differences between the search engines

5 Properties of Text
- Classic theories
  - Zipf's Law
  - Information Theory
  - Benford's Law
  - Bradford's Law
  - Heaps' Law
- English letter/word frequencies

6 Zipf's Law (1945)
In a large, well-written English document,
  r * f = c
where r is the rank of a word, f is the number of times that word occurs in the document, and c is a constant.
Different collections may have different c. English text tends to have c = N/10, where N is the number of words in the collection.

7 Zipf's Law is an empirical observation; it holds only approximately.
- Examples:
  - Word frequencies in Alice in Wonderland
  - The Time magazine collection
- Zipf's Law has been verified over many years on many different collections.
- There are also many revised versions of Zipf's Law.

8 Example: The word "the" is the most frequently occurring word in the novel Moby Dick, occurring 1450 times. The word "with" is the second-most frequently occurring word in that novel.
- How many times would we expect "with" to occur?
- How many times would we expect the third most frequently occurring word to appear?
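Assuming the relation r * f = c holds, the slide's count for the rank-1 word fixes c, and the expected counts for the lower ranks follow:

```latex
% Zipf's Law applied to the slide's Moby Dick figures
c = r \cdot f = 1 \times 1450 = 1450
\qquad
f_{r=2} \approx \frac{1450}{2} = 725
\qquad
f_{r=3} \approx \frac{1450}{3} \approx 483
```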

9 Information Theory
- Entropy (1948)
  - Uses the distribution of symbols to predict the amount of information in a text
  - A quantified measure of information
    - Useful for (physical) data transfer
    - And for compression
    - Not directly applicable to IR
- Example: Which letter is likely to appear after a letter "c" is received?
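For reference, the "amount of information" here is Shannon's entropy; for a source emitting symbols with probabilities p_i, the standard definition is:

```latex
H = -\sum_{i} p_i \log_2 p_i \quad \text{(bits per symbol)}
```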

10 English Letter Usage Statistics
Letter use frequencies:
  E: 72881 (12.4%)
  T: 52397 (8.9%)
  A: 47072 (8.0%)
  O: 45116 (7.6%)
  N: 41316 (7.0%)
  I: 39710 (6.7%)
  H: 38334 (6.5%)

11 Doubled letter frequencies:
  LL: 2979 (20.6%)
  EE: 2146 (14.8%)
  SS: 2128 (14.7%)
  OO: 2064 (14.3%)
  TT: 1169 (8.1%)
  RR: 1068 (7.4%)
  --: 701 (4.8%)
  PP: 628 (4.3%)
  FF: 430 (2.9%)

12 Initial letter frequencies:
  T: 20665 (15.2%)
  A: 15564 (11.4%)
  H: 11623 (8.5%)
  W: 9597 (7.0%)
  I: 9468 (6.9%)
  S: 9376 (6.9%)
  O: 8205 (6.0%)
  M: 6293 (4.6%)
  B: 5831 (4.2%)

13 Ending letter frequencies:
  E: 26439 (19.4%)
  D: 17313 (12.7%)
  S: 14737 (10.8%)
  T: 13685 (10.0%)
  N: 10525 (7.7%)
  R: 9491 (6.9%)
  Y: 7915 (5.8%)
  O: 6226 (4.5%)

14 Benford's Law
If we randomly select a number from a table of statistical data, the probability that the first digit will be a "1" is about 0.301, rather than 0.1 as we might expect if all digits were equally likely.

15 Bradford's Law
On a given subject, a few core journals will provide 1/3 of the articles on that subject, a medium number of secondary journals will provide another 1/3, and a large number of peripheral journals will provide the final 1/3.

16 For example
If you found 300 citations for IR,
- 100 of those citations likely came from a core group of 5 journals,
- another 100 citations came from a group of 25 journals,
- and the final 100 citations came from 125 peripheral journals.
Bradford expressed his law with the ratio 1 : n : n².

17 Heaps' Law
The relationship between vocabulary size and collection size is V = K * n^b, where n is the size of the text (total number of words), V is the number of unique words, and K and b are constants.
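A quick numerical illustration of Heaps' Law; the constants K = 50 and b = 0.5 are assumed for the example, not taken from the slide:

```latex
V = K \cdot n^{b} = 50 \times (1{,}000{,}000)^{0.5} = 50 \times 1000 = 50{,}000 \text{ unique words}
```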

18 Computerized Text Analysis
- Word (token) extraction
- Stop words
- Stemming
- Frequency counts
- Clustering

19 Word Extraction
- Basic problems
  - Digits
  - Hyphens
  - Punctuation
  - Cases
- Lexical analyzer
  - Define all possible characters in a finite state machine
  - Specify which states should break off a token
- Example: parser.c (a sketch follows below)
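A minimal tokenizer sketch in C, in the spirit of the parser.c example referenced above; it is not the course's parser.c, and the choice to treat digits, hyphens, and punctuation as token breaks and to lower-case everything is an assumption:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Minimal word extractor: runs of letters (lower-cased) form tokens;
   digits, hyphens, and punctuation end the current token. */
static void extract_tokens(const char *text)
{
    char token[64];
    size_t len = 0;

    for (const char *p = text; ; p++) {
        if (*p != '\0' && isalpha((unsigned char)*p)) {
            if (len < sizeof(token) - 1)
                token[len++] = (char)tolower((unsigned char)*p);
        } else {
            if (len > 0) {                 /* a token just ended */
                token[len] = '\0';
                printf("%s\n", token);
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
}

int main(void)
{
    extract_tokens("Text-processing, circa 1998: parser.c breaks text into tokens.");
    return 0;
}
```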

20 Stop words
Many of the most frequently used words in English are worthless for indexing; these are called stop words.
- the, of, and, to, ...
- Typically about 400 to 500 such words
Why do we need to remove stop words?
- Reduce indexing file size
  - Stop words account for 20-30% of total word counts
- Improve efficiency
  - Stop words are not useful for searching
  - Stop words always have a large number of hits

21 Stop words
Potential problems of removing stop words
- A small stop list does not improve indexing much
- A large stop list may eliminate words that are useful for some users or some purposes
- Stop words might be part of phrases
- Needs to be applied to both indexing and queries
Examples: common.c, commonwords (a sketch follows below)
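A hedged sketch of stop-word removal in C; the tiny word list and the is_stopword helper are illustrative only and are not the course's common.c:

```c
#include <stdio.h>
#include <string.h>

/* A tiny illustrative stop list; real lists run to 400-500 words. */
static const char *STOPWORDS[] = { "the", "of", "and", "to", "a", "in", "is" };

static int is_stopword(const char *word)
{
    for (size_t i = 0; i < sizeof(STOPWORDS) / sizeof(STOPWORDS[0]); i++)
        if (strcmp(word, STOPWORDS[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *tokens[] = { "the", "properties", "of", "text", "and", "retrieval" };
    for (size_t i = 0; i < sizeof(tokens) / sizeof(tokens[0]); i++)
        if (!is_stopword(tokens[i]))
            printf("index: %s\n", tokens[i]);   /* only non-stop words are indexed */
    return 0;
}
```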

22 Stemming
Techniques used to find the root/stem of a word.
- Lookup for "user engineering":
    user        15    engineering  12
    users        4    engineered   23
    used         5    engineer     12
    using        5
    stem: use         stem: engineer

23 Advantages of stemming
- Improving effectiveness
  - Matching similar words
- Reducing indexing size
  - Combining words with the same root may reduce indexing size by as much as 40-50%
Criteria for stemming
- Correctness
- Retrieval effectiveness
- Compression performance

24 Basic stemming methods
Use tables and rules
- Remove the ending:
  - If a word ends with a consonant other than s, followed by an s, then delete the s.
  - If a word ends in es, drop the s.
  - If a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th".
  - If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  - ...

25
- Transform the remaining word:
  - If a word ends with "ies" but not "eies" or "aies", then change "ies" to "y".
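A sketch in C of the suffix-stripping rules listed on the two slides above; the rule set is the slides' own, but the ordering (checking the "ies" transform before the generic "es" rule so it can fire) and the helper names are my assumptions:

```c
#include <stdio.h>
#include <string.h>

/* Treat anything that is not a, e, i, o, u as a consonant (lower-case input assumed). */
static int is_consonant(char c) { return strchr("aeiou", c) == NULL; }

/* Apply the suffix-stripping rules from the slides, in place; first match wins. */
static void simple_stem(char *w)
{
    size_t n = strlen(w);

    /* "ies" -> "y", unless preceded by 'e' or 'a' */
    if (n > 4 && strcmp(w + n - 3, "ies") == 0 && w[n - 4] != 'e' && w[n - 4] != 'a') {
        strcpy(w + n - 3, "y");
        return;
    }
    /* consonant other than s, followed by s: delete the s */
    if (n > 2 && w[n - 1] == 's' && w[n - 2] != 's' && is_consonant(w[n - 2])) {
        w[n - 1] = '\0';
        return;
    }
    /* ends in "es": drop the s */
    if (n > 2 && strcmp(w + n - 2, "es") == 0) {
        w[n - 1] = '\0';
        return;
    }
    /* ends in "ing": delete it unless only one letter or "th" would remain */
    if (n > 3 && strcmp(w + n - 3, "ing") == 0) {
        size_t rem = n - 3;
        if (rem > 1 && !(rem == 2 && w[0] == 't' && w[1] == 'h'))
            w[rem] = '\0';
        return;
    }
    /* ends in "ed" preceded by a consonant: delete it unless a single letter remains */
    if (n > 3 && strcmp(w + n - 2, "ed") == 0 && is_consonant(w[n - 3]) && n - 2 > 1)
        w[n - 2] = '\0';
}

int main(void)
{
    char words[][16] = { "users", "engineered", "wanted", "studies", "cats" };
    for (size_t i = 0; i < sizeof(words) / sizeof(words[0]); i++) {
        simple_stem(words[i]);
        printf("%s\n", words[i]);   /* user, engineer, want, study, cat */
    }
    return 0;
}
```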

26 Example 1: the Porter stemming algorithm
A set of condition/action rules
- Conditions on the stem
- Conditions on the suffix
- Conditions on the rules
- Different combinations of conditions activate different rules
Implementation (stem.c):
  Stem(word)
    ...
    ReplaceEnd(word, step1a_rule);
    rule = ReplaceEnd(word, step1b_rule);
    if ((rule == 106) || (rule == 107))
        ReplaceEnd(word, 1b1_rule);
    ...

27 Example 2: Sound-based stemming
Soundex rules (letter → numeric equivalent):
  B, F, P, V              → 1
  C, G, J, K, Q, S, X, Z  → 2
  D, T                    → 3
  L                       → 4
  M, N                    → 5
  R                       → 6
  A, E, I, O, U, W, Y     → not coded
- Words that sound similar often have the same code
- The code is not unique
- High compression rate
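A compact Soundex encoder in C following the table above. Padding to four characters and collapsing adjacent repeated codes follow the classic Soundex convention, which the slide does not spell out, and treating H as "not coded" is an assumption since the table omits it:

```c
#include <ctype.h>
#include <stdio.h>

/* Map a letter to its Soundex digit per the slide's table;
   returns 0 for letters that are not coded. */
static char soundex_digit(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'B': case 'F': case 'P': case 'V':                  return '1';
    case 'C': case 'G': case 'J': case 'K':
    case 'Q': case 'S': case 'X': case 'Z':                  return '2';
    case 'D': case 'T':                                      return '3';
    case 'L':                                                return '4';
    case 'M': case 'N':                                      return '5';
    case 'R':                                                return '6';
    default:                                                 return 0;   /* vowels, W, Y, H */
    }
}

/* Classic 4-character Soundex code: first letter + up to three digits, zero-padded. */
static void soundex(const char *word, char code[5])
{
    char prev = soundex_digit(word[0]);
    size_t len = 0;

    code[len++] = (char)toupper((unsigned char)word[0]);
    for (size_t i = 1; word[i] != '\0' && len < 4; i++) {
        char d = soundex_digit(word[i]);
        if (d != 0 && d != prev)        /* skip uncoded letters and adjacent repeats */
            code[len++] = d;
        prev = d;
    }
    while (len < 4)
        code[len++] = '0';
    code[4] = '\0';
}

int main(void)
{
    char code[5];
    const char *words[] = { "Robert", "Rupert", "Smith", "Smyth" };
    for (size_t i = 0; i < 4; i++) {
        soundex(words[i], code);
        printf("%-8s -> %s\n", words[i], code);   /* R163, R163, S530, S530 */
    }
    return 0;
}
```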

28 Frequency counts
The idea:
- The best a computer can do is count
  - Count the number of times a word occurs in a document
  - Count the number of documents in a collection that contain a word
- Use occurrence frequencies to indicate the relative importance of a word in a document
  - If a word appears often in a document, the document likely "deals with" subjects related to that word.

29
- Use occurrence frequencies to select the most useful words to index a document collection
  - If a word appears in every document, it is not a good indexing word
  - If a word appears in only one or two documents, it may not be a good indexing word
  - If a word appears in a title, each occurrence should be counted 5 (or 10) times.

30 Automatic indexing
1. Parse individual words (tokens)
2. Remove stop words
3. Apply stemming
4. Use frequency data
   - Decide the heading threshold
   - Decide the tail threshold
   - Decide the variance of counting

31
5. Create the indexing structure
   - Inverted indexing (see the struct sketch below)
   - Other structures
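A minimal sketch in C of what an inverted-index entry might look like; the struct names and the fixed-size posting array are illustrative only:

```c
#include <stdio.h>

/* One posting: a document that contains the term, with its in-document count. */
struct posting {
    int doc_id;
    int term_freq;
};

/* One inverted-index entry: a term and the list of documents it appears in. */
struct index_entry {
    const char    *term;
    int            num_postings;
    struct posting postings[8];     /* fixed size only for the sketch */
};

int main(void)
{
    /* "retrieval" occurs in doc 1 (3 times) and doc 4 (once). */
    struct index_entry entry = {
        .term = "retrieval",
        .num_postings = 2,
        .postings = { { .doc_id = 1, .term_freq = 3 },
                      { .doc_id = 4, .term_freq = 1 } }
    };

    for (int i = 0; i < entry.num_postings; i++)
        printf("%s -> doc %d (tf = %d)\n",
               entry.term, entry.postings[i].doc_id, entry.postings[i].term_freq);
    return 0;
}
```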

32 Term Associations
- Counting word pairs
  - If two words appear together very often, they are likely to be a phrase
- Counting document pairs
  - If two documents have many common words, they are likely related

33 More Counting
- Counting citation pairs
  - If documents A and B both cite documents C and D, then A and B might be related.
  - If documents C and D are often cited together, they are likely related.
- Counting link patterns
  - Get all pages that have links to my pages
  - Get all pages that contain links similar to those on my pages

34 Google Search Engine
- Link analysis
  - PageRank: the ranking of a web page is based on the number of links that refer to that page
  - If page A has a link to B, page A casts one vote for B.
  - The more votes a page gets, the more useful the page is.
  - If page A itself receives many votes, its vote for B counts more heavily.
- Combining link analysis with word matching
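For reference, the PageRank formula as published by Brin and Page, where d is a damping factor, T_1 ... T_n are the pages linking to A, and C(T_i) is the number of outgoing links on T_i:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```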

35 ConceptLink
- Uses terms' co-occurrence frequencies
  - to predict semantic relationships
  - to build concept clusters
  - to suggest search terms
- Visualization of term relationships
  - Link displays
  - Map displays
  - Drag-and-drop interface for searching

36 Document clustering
Grouping similar documents into different sets
- Create a similarity matrix
- Apply a hierarchical clustering algorithm (see the sketch after this list):
  1. Identify the two closest documents and combine them into a cluster
  2. Identify the next two closest documents or clusters and combine them into a cluster
  3. If more than one cluster remains, return to step 1
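A small C sketch of steps 1-3 using single-linkage merging over a similarity matrix; the 4x4 matrix values and the stopping rule (stopping at two clusters rather than one) are illustrative choices:

```c
#include <stdio.h>

#define N 4

/* Symmetric document-document similarity matrix (illustrative values). */
static double sim[N][N] = {
    { 1.00, 0.80, 0.10, 0.20 },
    { 0.80, 1.00, 0.15, 0.25 },
    { 0.10, 0.15, 1.00, 0.70 },
    { 0.20, 0.25, 0.70, 1.00 },
};

int main(void)
{
    int cluster[N];                       /* cluster label of each document */
    for (int i = 0; i < N; i++)
        cluster[i] = i;

    /* Agglomerative clustering, single linkage: repeatedly merge the two
       clusters containing the most similar pair of documents. */
    for (int merges = 0; merges < N - 2; merges++) {
        int best_i = -1, best_j = -1;
        double best = -1.0;

        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (cluster[i] != cluster[j] && sim[i][j] > best) {
                    best = sim[i][j];
                    best_i = i;
                    best_j = j;
                }

        int from = cluster[best_j], to = cluster[best_i];
        for (int k = 0; k < N; k++)       /* merge the two clusters */
            if (cluster[k] == from)
                cluster[k] = to;
        printf("merged doc %d and doc %d (sim = %.2f)\n", best_i, best_j, best);
    }

    for (int i = 0; i < N; i++)
        printf("doc %d -> cluster %d\n", i, cluster[i]);
    return 0;
}
```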

37 Applications of Document Clustering
- Vivisimo
  - Clusters search results on the fly
  - Hierarchical categories for drill-down capability
- AltaVista
  - Refine search: clusters related words into different groups based on their co-occurrence rates in documents

38 AltaVista

39 Document Similarity
Documents as binary term vectors:
  D1 = {t_11, t_12, t_13, ..., t_1n}
  D2 = {t_21, t_22, t_23, ..., t_2n}
where each t_ik is either 0 or 1.
Simple measurements of difference/similarity:
  w = the number of positions where t_1k = 1 and t_2k = 1
  x = the number of positions where t_1k = 1 and t_2k = 0
  y = the number of positions where t_1k = 0 and t_2k = 1
  z = the number of positions where t_1k = 0 and t_2k = 0

40 Similarity Measure
Cosine Coefficient (for the binary term vectors defined above):
  cos(D1, D2) = w / sqrt(n1 * n2)
which is the same as w / sqrt((w + x)(w + y)).

41
D1's terms only: n1 = w + x (the number of positions where t_1k = 1)
D2's terms only: n2 = w + y (the number of positions where t_2k = 1)
Sameness count: sc = (w + z) / (n1 + n2)
Difference count: dc = (x + y) / (n1 + n2)
Rectangular distance: rd = MAX(n1, n2)
Conditional probability: cp = min(n1, n2)
Mean: mean = (n1 + n2) / 2

42 Similarity Measure
- Dice's Coefficient:
  Dice(D1, D2) = 2w / (n1 + n2)
  where w is the number of terms that D1 and D2 have in common, and n1, n2 are the numbers of terms in D1 and D2.
- Jaccard Coefficient:
  Jaccard(D1, D2) = w / (N - z) = w / (n1 + n2 - w)
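A short C sketch that derives w, x, y, z from two binary term vectors and computes the Dice, Jaccard, and cosine coefficients (the cosine form w / sqrt(n1 * n2) is the standard binary-vector case); the example vectors are made up. Compile with -lm for sqrt:

```c
#include <math.h>
#include <stdio.h>

#define NTERMS 8

/* Binary term vectors: 1 if the term occurs in the document, else 0. */
static const int d1[NTERMS] = { 1, 1, 0, 1, 0, 1, 0, 0 };
static const int d2[NTERMS] = { 1, 0, 0, 1, 1, 1, 0, 1 };

int main(void)
{
    int w = 0, x = 0, y = 0, z = 0;

    for (int k = 0; k < NTERMS; k++) {
        if (d1[k] && d2[k])        w++;   /* term in both documents */
        else if (d1[k] && !d2[k])  x++;   /* term in D1 only        */
        else if (!d1[k] && d2[k])  y++;   /* term in D2 only        */
        else                       z++;   /* term in neither        */
    }

    int n1 = w + x, n2 = w + y;

    printf("Dice    = %.3f\n", 2.0 * w / (n1 + n2));
    printf("Jaccard = %.3f\n", (double)w / (n1 + n2 - w));
    printf("Cosine  = %.3f\n", w / sqrt((double)n1 * n2));
    return 0;
}
```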

43 Similarity Metric
A metric has three defining properties:
- Its values are non-negative
- It is symmetric
- It satisfies the triangle inequality: |AC| ≤ |AB| + |BC|

44 L_p Metrics
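For reference, the standard L_p metric family, with the common L_1 and L_2 special cases:

```latex
L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p},
\qquad
L_1(X, Y) = \sum_i |x_i - y_i| \ (\text{Manhattan}),
\quad
L_2(X, Y) = \sqrt{\sum_i (x_i - y_i)^2} \ (\text{Euclidean})
```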

45 Similarity Matrix
Pairwise similarities among a group of documents:
  S11 S12 S13 S14 S15 S16 S17 S18
  S21 S22 S23 S24 S25 S26 S27 S28
  S31 S32 S33 S34 S35 S36 S37 S38
  S41 S42 S43 S44 S45 S46 S47 S48
  S51 S52 S53 S54 S55 S56 S57 S58
  S61 S62 S63 S64 S65 S66 S67 S68
  S71 S72 S73 S74 S75 S76 S77 S78
  S81 S82 S83 S84 S85 S86 S87 S88

46 Metadata
- Data about data
- Descriptive metadata
  - External to the meaning of the document
  - Dublin Core Metadata Element Set: author, title, publisher, etc.
- Semantic metadata
  - Subject indexing
- Challenge: automatic generation of metadata for documents

47 Markup Languages
- Metalanguages: SGML, XML, HyTime
- Languages: HTML, Semantic Web? (RDF, MathML, SMIL)
- Stylesheets: XSL, CSS

48 Midterm
Concepts
- What is information retrieval?
- Data, information, text, and documents
- Two abstraction principles
- Users' information needs
- Queries and query formats
- Precision and recall
- Relevance
- Zipf's Law, Benford's Law

49 Midterm
Procedures & problem solving
- How to translate a request into a query?
- How to expand queries for better recall or better precision?
- How to create an inverted index?
- How to create a vector space?
- How to calculate similarities of documents?
- How to match a query to documents in a vector space?

50
Discussions
- Challenges of IR
- Advantages and disadvantages of Boolean search (vector space, automatic indexing, association-based queries, etc.)
- Evaluation of IR systems, with or without using precision/recall
- Differences between data retrieval and information retrieval

