WEB MINING. Why IR ? Research & Fun

1 WEB MINING

2 Why IR ?

3

4 Research & Fun http://duilian.msra.cn

5 Overview of Search Engine

6 Flow Chart of SE

7 Text Processing (1) - Indexing
- A list of terms with relevant information
  - Frequency of terms
  - Location of terms
  - Etc.
- Index terms: represent document content and separate documents
  - “economy” vs. “computer” in a news article of the Financial Times
- To get an index
  - Extraction of index terms
  - Computation of their weights

8 Text Processing (2) - Extraction
- Extraction of index terms
  - Word or phrase level
  - Morphological analysis (stemming in English)
    - “information”, “informed”, “informs”, “informative” -> inform
  - Removal of stop words
    - “a”, “an”, “the”, “is”, “are”, “am”, ...
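The extraction steps above (tokenize, remove stop words, stem) can be sketched in a few lines of Python. This is only an illustration: the stop list is the one from the slide, and `toy_stem` with its suffix list is a hypothetical stand-in invented for this example, not the Porter stemmer or any other real stemming algorithm.

```python
import re

# Stop list taken from the slide; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Toy suffix list (an assumption for this sketch), checked in order.
SUFFIXES = ("ation", "ative", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_index_terms(text):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this toy rule set, “information”, “informed”, “informs”, and “informative” all reduce to `inform`, as on the slide. A production system would likely use an established stemmer instead.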

9 Text Processing (3) - Term Weight
- Calculation of term weights
  - Statistical weights using frequency information
  - Importance of a term in a document
- E.g. TF*IDF
  - TF: term frequency, the number of times term k occurs in a document
  - IDF: inverse document frequency of term k in a collection
  - DF: document frequency, the number of documents in which the term appears
- High TF, low DF: a good word to represent the text
- High TF, high DF: a bad word
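A minimal sketch of TF*IDF weighting follows. The slide does not give an exact formula, so this assumes one common variant, weight = TF * log(N / DF), where N is the number of documents. The function name `tf_idf` and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = TF * log(N / DF), an assumed
    variant; other TF and IDF formulations are also in use.
    """
    n_docs = len(docs)
    # DF: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Illustrative data only.
docs = [
    ["economy", "market", "economy", "news"],
    ["computer", "market", "news"],
    ["computer", "chip", "news"],
]
weights = tf_idf(docs)
```

Here “news” appears in every document, so its weight is zero everywhere, while “economy” (high TF, low DF) gets the largest weight in the first document.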

10 An Example
[Figure: Document 1 and Document 2]

11 Text Processing (4) - Storing indexing results
[Figure: index table with columns "Index Word" and "Word Info.", listing entries such as “Arizona” and “University” for Document 1 and Document 2]

12 Text Processing (2) - Storing indexing result

13 Text Processing (3) - Inverted File
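The slide content for the inverted file is not preserved, so the following is only a sketch of the general idea under stated assumptions: each index term maps to the documents that contain it, along with the term positions in each document. The function name `build_inverted_file` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: [positions]}.

    docs maps a document id to its list of tokens. The term frequency in
    a document is the length of its position list.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Illustrative data only.
docs = {
    1: ["arizona", "university"],
    2: ["university", "of", "arizona", "university"],
}
index = build_inverted_file(docs)
```

Looking up `index["university"]` returns the documents containing the term and where it occurs, without scanning any document text at query time.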

14 Matching & Ranking (2)
- Ranking
- Retrieval model
  - Boolean (exact) => Fuzzy Set (inexact)
  - Vector Space
  - Probabilistic
  - Inference Net
  - ...
- Weighting schemes
  - Index terms, query terms
  - Document characteristics

15 Vector Space Model
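In the vector space model, documents and queries are represented as term-weight vectors and ranked by how close they are. The sketch below assumes cosine similarity, which is the usual choice, although the slide itself does not say which measure was presented. The names `cosine_similarity` and `rank` are illustrative.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scores = {d: cosine_similarity(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The weights in each vector would typically come from a scheme such as TF*IDF. Because cosine similarity normalizes by vector length, long documents are not favored simply for containing more words.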

16 Matching & Ranking (2)
- Techniques for efficiency
  - New storage structures, esp. for new document types
  - Use of accumulators for efficient generation of ranked output
  - Compression/decompression of indexes
- Techniques for Web search engines
  - Use of hyperlinks
    - Inlinks & outlinks (PageRank)
    - Authority vs. hub pages (HITS)
  - In conjunction with directory services (e.g. Yahoo)
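One way to read "use of accumulators" is term-at-a-time scoring: the engine walks the postings list of each query term and adds partial scores into a per-document accumulator, so only documents that contain at least one query term are ever touched. The sketch below assumes that interpretation. The function `ranked_retrieval` and the `(doc_id, weight)` postings layout are hypothetical.

```python
import heapq
from collections import defaultdict


def ranked_retrieval(query_terms, postings, k=10):
    """Term-at-a-time scoring with one accumulator per candidate document.

    postings maps a term to a list of (doc_id, weight) pairs.
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight  # only touched docs get a slot
    # Pick the k highest-scoring documents without sorting everything.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

Selecting the top k with a heap avoids a full sort of all candidates, which matters when many documents match but only the first page of results is shown.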

17

18 PageRank Algorithm
- Basic idea: more links to a page implies a better page
  - But not all links are created equal
  - Links from a more important page should count more than links from a weaker page
- Basic PageRank R(A) for page A:
  R(A) = sum, over all pages B that link to A, of R(B) / outDegree(B)
  - outDegree(B) = number of edges leaving page B = hyperlinks on page B
  - Page B distributes its rank boost over all the pages it points to
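The idea that page B distributes its rank evenly over the pages it points to, so each target receives R(B) / outDegree(B), can be sketched as a simple iteration. This is a minimal illustration that assumes every page has at least one outlink and every linked page appears as a key. It omits the damping factor used in the full PageRank algorithm. The function name `basic_pagerank` and the three-page graph are made up for the example.

```python
def basic_pagerank(links, iterations=50):
    """Iterate R(A) = sum of R(B) / outDegree(B) over pages B linking to A.

    links maps each page to the list of pages it points to. Every page is
    assumed to have at least one outlink, so no rank is lost.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start uniform
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page, targets in links.items():
            share = rank[page] / len(targets)  # split rank over outlinks
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Illustrative graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(links)
```

On this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4: C ends up as important as A because it collects rank from both A and B, while B receives only half of A's rank.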

19

20

21

22 Readings
- Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
- Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
- Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
- James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
- Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
- Badrul Sarwar et al. (2001). “Item-based Collaborative Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html
- Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
- Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
- Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)

