Web Mining ( 網路探勘 ) 1 1011WM06 TLMXM1A Wed 8,9 (15:10-17:00) U705 Information Retrieval and Web Search ( 資訊檢索與網路搜尋 ) Min-Yuh Day 戴敏育 Assistant Professor.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Introduction to Information Retrieval
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Consequence: If there isn’t an HREF path from some Yahoo.
Information Retrieval in Practice
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Social Media Marketing Analytics 社群網路行銷分析 SMMA02 TLMXJ1A (MIS EMBA) Fri 12,13,14 (19:20-22:10) D326 社群網路行銷分析 (Social Media Marketing Analytics) Min-Yuh.
Information Retrieval Review
Chapter 7: Text mining UIC - CS 594 Bing Liu 1 1.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Ch 4: Information Retrieval and Text Mining
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Information Retrieval
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
Case Study for Information Management 資訊管理個案
Chapter 5: Information Retrieval and Web Search
Data Warehousing 資料倉儲 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University Dept. of Information ManagementTamkang.
Overview of Search Engines
Social Media Marketing Research 社會媒體行銷研究 SMMR06 TMIXM1A Thu 7,8 (14:10-16:00) L511 Measuring the Construct Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授.
Web Mining ( 網路探勘 ) WM12 TLMXM1A Wed 8,9 (15:10-17:00) U705 Web Usage Mining ( 網路使用挖掘 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information.
Web Mining ( 網路探勘 ) WM09 TLMXM1A Wed 8,9 (15:10-17:00) U705 Structured Data Extraction ( 結構化資料擷取 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept.
Social Media Management 社會媒體管理 SMM06 TMIXM1A Fri. 7,8 (14:10-16:00) L215 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management,
Web Mining ( 網路探勘 ) WM10 TLMXM1A Wed 8,9 (15:10-17:00) U705 Information Integration ( 資訊整合 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of.
Special Topics in Social Media Services 社會媒體服務專題 1 992SMS07 TMIXJ1A Sat. 6,7,8 (13:10-16:00) D502 Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information.
INF 141 COURSE SUMMARY Crista Lopes. Lecture Objective Know what you know.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Social Media Marketing Management 社會媒體行銷管理 SMMM12 TLMXJ1A Tue 12,13,14 (19:20-22:10) D325 確認性因素分析 (Confirmatory Factor Analysis) Min-Yuh Day 戴敏育.
Social Media Marketing Analytics 社群網路行銷分析 SMMA04 TLMXJ1A (MIS EMBA) Fri 12,13,14 (19:20-22:10) D326 測量構念 (Measuring the Construct) Min-Yuh Day 戴敏育.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
商業智慧實務 Practices of Business Intelligence BI06 MI4 Wed, 9,10 (16:10-18:00) (B130) 資料科學與巨量資料分析 (Data Science and Big Data Analytics) Min-Yuh Day 戴敏育.
Social Media Marketing Research 社會媒體行銷研究 SMMR09 TMIXM1A Thu 7,8 (14:10-16:00) L511 Confirmatory Factor Analysis Min-Yuh Day 戴敏育 Assistant Professor.
Data Mining 資料探勘 DM04 MI4 Thu 9, 10 (16:10-18:00) B216 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management,
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Web Mining ( 網路探勘 ) WM04 TLMXM1A Wed 8,9 (15:10-17:00) U705 Unsupervised Learning ( 非監督式學習 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Case Study for Information Management 資訊管理個案 CSIM4B08 TLMXB4B Thu 8, 9, 10 (15:10-18:00) B508 Securing Information System: 1. Facebook, 2. European.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Web Mining ( 網路探勘 ) WM01 TLMXM1A Wed 8,9 (15:10-17:00) U705 Introduction to Web Mining ( 網路探勘導論 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept.
Information Retrieval and Web Search
Case Study for Information Management 資訊管理個案 CSIM4B03 TLMXB4B (M1824) Tue 3,4 (10:10-12:00) L212 Thu 9 (16:10-17:00) B601 Global E-Business and Collaboration:
Web Mining ( 網路探勘 ) WM05 TLMXM1A Wed 8,9 (15:10-17:00) U705 Partially Supervised Learning ( 部分監督式學習 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授.
Data Mining 資料探勘 DM04 MI4 Wed, 6,7 (13:10-15:00) (B216) 分群分析 (Cluster Analysis) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management,
Web Mining ( 網路探勘 ) WM07 TLMXM1A Wed 8,9 (15:10-17:00) U705 Social Network Analysis ( 社會網路分析 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of.
Vector Space Models.
Information Retrieval
Social Media Marketing Research 社會媒體行銷研究 SMMR04 TMIXM1A Thu 7,8 (14:10-16:00) L511 Marketing Research Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授.
Case Study for Information Management 資訊管理個案
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Social Computing and Big Data Analytics 社群運算與大數據分析
Data e Web Mining. - S. Orlando 1 Information Retrieval and Web Search Salvatore Orlando Bing Liu. “Web Data Mining: Exploring Hyperlinks, Contents”, and.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Information Retrieval in Practice
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Case Study for Information Management 資訊管理個案
Search Engine Architecture
Multimedia Information Retrieval
Data Mining Chapter 6 Search Engines
Chapter 5: Information Retrieval and Web Search
Boolean and Vector Space Retrieval Models
Information Retrieval and Web Design
Information Retrieval and Web Design
Case Study for Information Management 資訊管理個案
Presentation transcript:

Web Mining ( 網路探勘 ) WM06 TLMXM1A Wed 8,9 (15:10-17:00) U705 Information Retrieval and Web Search ( 資訊檢索與網路搜尋 ) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University Dept. of Information ManagementTamkang University 淡江大學 淡江大學 資訊管理學系 資訊管理學系 tku.edu.tw/myday/

週次 日期 內容( Subject/Topics ) 1 101/09/12 Introduction to Web Mining ( 網路探勘導論 ) 2 101/09/19 Association Rules and Sequential Patterns ( 關聯規則和序列模式 ) 3 101/09/26 Supervised Learning ( 監督式學習 ) 4 101/10/03 Unsupervised Learning ( 非監督式學習 ) 5 101/10/10 國慶紀念日 ( 放假一天 ) 6 101/10/17 Paper Reading and Discussion ( 論文研讀與討論 ) 7 101/10/24 Partially Supervised Learning ( 部分監督式學習 ) 8 101/10/31 Information Retrieval and Web Search ( 資訊檢索與網路搜尋 ) 9 101/11/07 Social Network Analysis ( 社會網路分析 ) 課程大綱 ( Syllabus) 2

週次 日期 內容( Subject/Topics ) /11/14 Midterm Presentation ( 期中報告 ) /11/21 Web Crawling ( 網路爬行 ) /11/28 Structured Data Extraction ( 結構化資料擷取 ) /12/05 Information Integration ( 資訊整合 ) /12/12 Opinion Mining and Sentiment Analysis ( 意見探勘與情感分析 ) /12/19 Paper Reading and Discussion ( 論文研讀與討論 ) /12/26 Web Usage Mining ( 網路使用挖掘 ) /01/02 Project Presentation 1 ( 期末報告 1) /01/09 Project Presentation 2 ( 期末報告 2) 課程大綱 ( Syllabus) 3

Chapter 6: Information Retrieval and Web Search Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 4 Bing Liu (2011), “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data,” 2 nd Edition, Springer.

Introduction Text mining refers to data mining using text documents as data. Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents. These methods are quite different from traditional data pre-processing methods used for relational tables. Web search also has its root in IR. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 5

Information Retrieval (IR) Conceptually, IR is the study of finding needed information. I.e., IR helps users find information that matches their information needs. – Expressed as queries Historically, IR is about document retrieval, emphasizing document as the basic unit. – Finding documents relevant to user queries Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 6

IR architecture Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 7

IR queries Keyword queries Boolean queries (using AND, OR, NOT) Phrase queries Proximity queries Full document queries Natural language questions Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 8

Information retrieval models An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined. Main models: – Boolean model – Vector space model – Statistical language model – etc Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 9

Boolean model Each document or query is treated as a “bag” of words or terms. Word sequence is not considered. Given a collection of documents D, let V = {t 1, t 2,..., t |V| } be the set of distinctive words/terms in the collection. V is called the vocabulary. A weight w ij > 0 is associated with each term t i of a document d j ∈ D. For a term that does not appear in document d j, w ij = 0. d j = (w 1j, w 2j,..., w |V|j ), Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 10

Boolean model (contd) Query terms are combined logically using the Boolean operators AND, OR, and NOT. – E.g., ((data AND mining) AND (NOT text)) Retrieval – Given a Boolean query, the system retrieves every document that makes the query logically true. – Called exact match. The retrieval results are usually quite poor because term frequency is not considered. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 11

Vector space model Documents are also treated as a “bag” of words or terms. Each document is represented as a vector. However, the term weights are no longer 0 or 1. Each term weight is computed based on some variations of TF or TF-IDF scheme. Term Frequency (TF) Scheme: The weight of a term t i in document d j is the number of times that t i appears in d j, denoted by f ij. Normalization may also be applied. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 12

TF-IDF term weighting scheme The most well known weighting scheme – TF: still term frequency – IDF: inverse document frequency. N: total number of docs df i : the number of docs that t i appears. The final TF-IDF term weight is: Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 13

Retrieval in vector space model Query q is represented in the same way or slightly differently. Relevance of d i to q: Compare the similarity of query q and document d i. Cosine similarity (the cosine of the angle between the two vectors) Cosine is also commonly used in text clustering Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 14

An Example A document space is defined by three terms: – hardware, software, users – the vocabulary A set of documents are defined as: – A1=(1, 0, 0),A2=(0, 1, 0), A3=(0, 0, 1) – A4=(1, 1, 0),A5=(1, 0, 1), A6=(0, 1, 1) – A7=(1, 1, 1)A8=(1, 0, 1).A9=(0, 1, 1) If the Query is “hardware and software” what documents should be retrieved? Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 15

An Example (cont.) In Boolean query matching: – document A4, A7 will be retrieved (“AND”) – retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) In similarity matching (cosine): – q=(1, 1, 0) – S(q, A1)=0.71, S(q, A2)=0.71,S(q, A3)=0 – S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 – S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 – Document retrieved set (with ranking)= {A4, A7, A1, A2, A5, A6, A8, A9} Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 16

Okapi relevance method Another way to assess the degree of relevance is to directly compute a relevance score for each document to the query. The Okapi method and its variations are popular techniques in this setting. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 17

Relevance feedback Relevance feedback is one of the techniques for improving retrieval effectiveness. The steps: – the user first identifies some relevant (D r ) and irrelevant documents (D ir ) in the initial list of retrieved documents – the system expands the query q by extracting some additional terms from the sample relevant and irrelevant documents to produce q e – Perform a second round of retrieval. Rocchio method (α, β and γ are parameters) Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 18

Rocchio text classifier In fact, a variation of the Rocchio method above, called the Rocchio classification method, can be used to improve retrieval effectiveness too – so are other machine learning methods. Why? Rocchio classifier is constructed by producing a prototype vector c i for each class i (relevant or irrelevant in this case): In classification, cosine is used. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 19

Text pre-processing Word (term) extraction: easy Stopwords removal Stemming Frequency counts and computing TF-IDF term weights. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 20

Stopwords removal Many of the most frequently used words in English are useless in IR and text mining – these words are called stop words. – the, of, and, to, …. – Typically about 400 to 500 such words – For an application, an additional domain specific stopwords list may be constructed Why do we need to remove stopwords? – Reduce indexing (or data) file size stopwords accounts 20-30% of total word counts. – Improve efficiency and effectiveness stopwords are not useful for searching or text mining they may also confuse the retrieval system. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 21

Stopwords a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 22

Stemming Techniques used to find out the root/stem of a word. E.g., – user engineering – users engineered – used engineer – using stem: use engineer Usefulness: improving effectiveness of IR and text mining – matching similar words – Mainly improve recall reducing indexing size – combing words with same roots may reduce indexing size as much as 40-50%. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 23

Basic stemming methods Using a set of rules. E.g., remove ending – if a word ends with a consonant other than s, followed by an s, then delete s. – if a word ends in es, drop the s. – if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th. – If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter. – …... transform words – if a word ends with “ies” but not “eies” or “aies” then “ies --> y.” Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 24

Frequency counts + TF-IDF Counts the number of times a word occurred in a document. – Using occurrence frequencies to indicate relative importance of a word in a document. if a word appears often in a document, the document likely “deals with” subjects related to the word. Counts the number of documents in the collection that contains each word TF-IDF can be computed. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 25

Evaluation: Precision and Recall Given a query: – Are all retrieved documents relevant? – Have all the relevant documents been retrieved? Measures for system performance: – The first question is about the precision of the search – The second is about the completeness (recall) of the search. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 26

Precision and recall values at each rank position 27 Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 27

Precision-recall curve Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 28

Compare different retrieval algorithms Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 29

Compare with multiple queries Compute the average precision at each recall level. Draw precision recall curves Do not forget the F-score evaluation measure. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 30

Rank precision Compute the precision values at some selected rank positions. Mainly used in Web search evaluation. For a Web search engine, we can compute precisions for the top 5, 10, 15, 20, 25 and 30 returned pages – as the user seldom looks at more than 30 pages. Recall is not very meaningful in Web search. – Why? Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 31

Web Search as a huge IR system A Web crawler (robot) crawls the Web to collect all the pages. Servers establish a huge inverted indexing database and other indexing databases At query (search) time, search engines conduct different types of vector query matching. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 32

Inverted index The inverted index of a document collection is basically a data structure that – attaches each distinctive term with a list of all documents that contains the term. Thus, in retrieval, it takes constant time to – find the documents that contains a query term. – multiple query terms are also easy handle as we will see soon. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 33

An example Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 34

Index construction Example: Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 35

Search using inverted index Given a query q, search has the following steps: Step 1 (vocabulary search): find each term/word in q in the inverted index. Step 2 (results merging): Merge results to find documents that contain all or some of the words/terms in q. Step 3 (Rank score computation): To rank the resulting documents/pages, using, – content-based ranking – link-based ranking Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 36

Different search engines The real differences among different search engines are – their index weighting schemes Including location of terms, e.g., title, body, emphasized words, etc. – their query processing methods (e.g., query classification, expansion, etc) – their ranking algorithms – Few of these are published by any of the search engine companies. They are tightly guarded secrets. Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 37

Summary Introduction to Information Retrieval and Web Search Related topics – Statistical language model – Latent semantic indexing (LSI and SVD). – Web search Index compression Ranking: combining contents and hyperlinks – Web page pre-processing – Combining multiple rankings and meta search – Web spamming Source: Bing Liu (2011), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data 38

References Bing Liu (2011), “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data,” 2 nd Edition, Springer