Text & Web Mining 9/22/2018.


Structured Data
So far we have focused on mining structured data, stored as attribute/value pairs:
  Attribute    Value
  Outlook      Sunny
  Temperature  Hot
  Windy        Yes
  Humidity     High
  Play         Yes
Most data mining involves such data.

Complex Data Types
Complex data is of increasing importance:
- Spatial data: geographic data, medical images, and satellite images
- Multimedia data: images, audio, and video
- Time-series data: for example, banking data and stock exchange data
- Text data: word descriptions of objects
- World Wide Web: highly unstructured text and multimedia data
Text data and the World Wide Web are the focus here.

Text Databases
Many text databases exist in practice:
- News articles
- Research papers
- Books
- Digital libraries
- E-mail messages
- Web pages
They are growing rapidly in size and importance.

Semi-Structured Data
Text databases are often semi-structured. Example:
- Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
- Unstructured: Abstract, Content

Handling Text Data
- Modeling semi-structured data
- Information Retrieval (IR) from unstructured documents
- Text mining
  - Compare documents
  - Rank importance and relevance
  - Find patterns or trends across documents

Information Retrieval
- IR locates relevant documents based on:
  - Key words
  - Similar documents
- IR systems:
  - On-line library catalogs
  - On-line document management systems

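Performance Measure
Two basic measures, defined over the set of all documents in terms of the relevant documents and the retrieved documents:
- Precision: $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Retrieved}|$
- Recall: $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Relevant}|$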
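
The two measures on this slide are usually read as precision and recall, both computed from the overlap between the relevant and the retrieved documents. The sketch below assumes documents are identified by ids; the function name `precision_recall` and the sample ids are illustrative, not taken from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero.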

Retrieval Methods
- Keyword-based IR
  - E.g., the query “data and mining”
  - Synonymy problem: a relevant document may talk about “knowledge discovery” instead
  - Polysemy problem: “mining” can mean different things
- Similarity-based IR
  - Based on a set of common keywords
  - Returns the degree of relevance
  - Problem: what is the similarity of “data mining” and “data analysis”?

Modeling a Document
- A set of n documents and m terms
- Each document is a vector v in $\mathbb{R}^m$
- The j-th coordinate of v measures the association of the j-th term with the document
- Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term
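
The weighting formula itself did not survive in the transcript. As a minimal sketch, the code below assumes the simplest weight that uses the two quantities named on the slide, the relative frequency r / R, and a naive whitespace tokenizer. The function name `term_vector` and the sample text are illustrative.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Map a document to a vector with one coordinate per vocabulary term.

    Assumption: coordinate j is r / R, where r counts occurrences of
    term j and R counts all term occurrences in the document.
    """
    tokens = text.lower().split()  # naive tokenizer, for illustration only
    counts = Counter(tokens)
    total = len(tokens)  # R
    return [counts[term] / total if total else 0.0 for term in vocabulary]


vocabulary = ["data", "mining", "web"]
doc = "web mining extends data mining to web data"
vector = term_vector(doc, vocabulary)
print(vector)
```

Other weightings, such as logarithmically damped frequencies, fit the same vector layout; only the expression inside the list comprehension would change.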

Frequency Matrix

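Similarity Measures
Cosine measure: $\mathrm{sim}(v_1, v_2) = \dfrac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$
- $v_1 \cdot v_2$ is the dot product
- $\|v_1\|$ and $\|v_2\|$ are the norms of the vectors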

Example
Google search for “association mining”. Two of the documents retrieved:
- Idaho Mining Association: mining in Idaho (doc 1)
- Scalable Algorithms for Association mining (doc 2)
The documents are compared using only the two terms “association” and “mining”.
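
The comparison can be sketched with the cosine measure. The term counts on the original slide are not preserved, so the vectors below are an assumption: they count “association” and “mining” in the two titles only.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Assumed counts of ("association", "mining"), taken from the titles only.
doc1 = [1, 2]  # "Idaho Mining Association: mining in Idaho"
doc2 = [1, 1]  # "Scalable Algorithms for Association mining"
similarity = cosine(doc1, doc2)
print(round(similarity, 3))
```

Under these assumed counts the two documents score as highly similar even though they concern different subjects, which is why a richer set of terms is needed to tell them apart.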

New Model
Add the term “data” to the document model.

Frequency Matrix
- Will quickly become large
- Singular value decomposition can be used to reduce it

Association Analysis
- Collect sets of keywords that are frequently used together and find associations among them
- Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}
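
As a minimal sketch of the {document_id, a_set_of_keywords} format, the code below counts keyword pairs that occur together in at least a given fraction of documents. It covers only the pair-counting step, not a full association rule algorithm such as Apriori. The function name, the sample documents, and the support threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(documents, min_support):
    """Return {keyword pair: support} for pairs meeting min_support.

    documents maps a document id to its set of keywords; support is the
    fraction of documents that contain both keywords of the pair.
    """
    n = len(documents)
    pair_counts = Counter()
    for keywords in documents.values():
        # sorted() gives each pair one canonical order
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical database in the {document_id, a_set_of_keywords} format.
documents = {
    1: {"data", "mining", "association"},
    2: {"data", "mining", "web"},
    3: {"web", "search"},
    4: {"data", "mining"},
}
pairs = frequent_keyword_pairs(documents, min_support=0.5)
print(pairs)
```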

Document Classification
- Needs already classified documents as a training set
- Induce a classification model
- Any difference from before? The set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification
Classify documents based on associated, frequently occurring text patterns:
- Extract keywords and terms with IR and simple association analysis
- Create a concept hierarchy of terms
- Classify training documents into class hierarchies
- Use association mining to discover associated terms that distinguish one class from another

Remember Generalized Association Rules
Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear (ancestor of Shoes and Hiking Boots)
    Shoes
    Hiking Boots
Generalized association rule: X → Y, where no item in Y is an ancestor of an item in X

Classifiers
- Let X be a set of terms
- Let Anc(X) be those terms and their ancestor terms
- Consider a rule X → C and a document d
- If X ⊆ Anc(d), then X → C covers d
- A rule that covers d may be used to classify d (but only one can be used)

Procedure
- Step 1: Generate all generalized association rules X → C, where X is a set of terms and C is a class, that satisfy minimum support
- Step 2: Rank the rules according to some rule-ranking criterion
- Step 3: Select rules from the list
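
The covering test and the use of a ranked rule list can be sketched as follows. This assumes the rules have already been generated and ranked (Steps 1 and 2), and it assumes that selection means applying the first rule in the list that covers the document. The slides do not specify the selection policy, and the term taxonomy, rules, and class labels below are hypothetical.

```python
def ancestor_closure(terms, parent):
    """Anc(X): the given terms together with all of their ancestors."""
    closure = set()
    for term in terms:
        # Walk up the taxonomy until the root or an already visited term.
        while term is not None and term not in closure:
            closure.add(term)
            term = parent.get(term)
    return closure


def covers(rule_terms, doc_terms, parent):
    """A rule X -> C covers document d when X is a subset of Anc(d)."""
    return set(rule_terms) <= ancestor_closure(doc_terms, parent)


def classify(doc_terms, ranked_rules, parent, default=None):
    """Apply the first covering rule in the ranked list (assumed policy)."""
    for rule_terms, label in ranked_rules:
        if covers(rule_terms, doc_terms, parent):
            return label
    return default


# Hypothetical term taxonomy, stored as child -> parent.
parent = {
    "apriori": "association rules",
    "association rules": "data mining",
    "pagerank": "link analysis",
    "hits": "link analysis",
    "link analysis": "web mining",
}
# Hypothetical rules (X, C), best rule first.
ranked_rules = [
    ({"link analysis", "hubs"}, "web"),
    ({"data mining"}, "dm"),
    ({"link analysis"}, "web"),
]
label_a = classify({"apriori", "support"}, ranked_rules, parent)
label_b = classify({"hits", "pagerank"}, ranked_rules, parent)
print(label_a, label_b)
```

The first document never mentions “data mining” directly; the rule covers it only because “apriori” has that term as an ancestor, which is the point of generalizing over the taxonomy.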

Web Mining
The World Wide Web may offer more opportunities for data mining than any other area. However, there are serious challenges:
- It is huge
- The complexity of Web pages is greater than that of any traditional text document collection
- It is highly dynamic
- It has a broad diversity of users
- Only a tiny portion of the information is truly useful

Search Engines → Web Mining
- Current technology: search engines
  - Keyword-based indices
  - Too many relevant pages
  - Synonymy and polysemy problems
- More challenging: web mining
  - Web content mining
  - Web structure mining
  - Web usage mining

Web Content Mining

Example: Classification of Web Documents
- Assign a class to each document based on predefined topic categories
- E.g., use Yahoo!’s taxonomy and associated documents for training
- Keyword-based document classification
- Keyword-based association analysis

Web Structure Mining

Authoritative Web Pages
- High-quality, relevant Web pages are termed authoritative
- Explore linkages (hyperlinks)
  - Linking to a Web page can be considered an endorsement of that page
  - Pages that are linked to frequently are considered authoritative
  - (This idea has its roots in IR methods based on journal citations)

Structure via Hubs
- A hub is a set of Web pages containing collections of links to authorities
- There is a wide variety of hubs:
  - A simple list of recommended links on a person’s home page
  - Professional resource lists on commercial sites

HITS
Hyperlink-Induced Topic Search (HITS):
- Form a root set of pages using the query terms in an index-based search (about 200 pages)
- Expand it into a base set by including all pages the root set links to and all pages that link to it (1000-5000 pages)
- Run an iterative process to determine hubs and authorities

Calculating Weights
- Authority weight: $a_p = \sum_{q \to p} h_q$
- Hub weight: $h_p = \sum_{p \to q} a_q$
Here $q \to p$ means that page p is pointed to by page q.

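Adjacency Matrix
Let's number the pages {1, 2, …, n}. The adjacency matrix $A$ is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise. Writing the authority and hub weights as vectors $a$ and $h$, we have $a = A^{T} h$ and $h = A a$.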

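Recursive Calculations
With $A$ the adjacency matrix, we now have $a = A^{T} A \, a$ and $h = A A^{T} h$. By linear algebra theory, the iteration (with normalization) converges to the principal eigenvectors of the two matrices $A^{T} A$ and $A A^{T}$.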
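
The iterative process can be sketched as a power iteration over a small link graph. This is an illustration of the standard hub and authority updates, not the Clever implementation. It assumes the graph is given as a dictionary from each page to the set of pages it links to, runs a fixed number of iterations, and rescales both weight vectors to unit Euclidean length in each round. The page names are hypothetical.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """Return (hub, authority) weights for a graph page -> linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages q that point to p.
        auth = {page: 0.0 for page in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub weight of p: sum of authority weights of the pages p points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth


# Hypothetical base set: three hub-like pages pointing at two targets.
links = {
    "h1": {"a1", "a2"},
    "h2": {"a1", "a2"},
    "h3": {"a1"},
    "a1": set(),
    "a2": set(),
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Without the normalization step the weights would grow without bound; rescaling each round keeps them comparable while leaving the ranking unchanged.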

Output
The HITS algorithm finally outputs:
- A short list of pages with high hub weights
- A short list of pages with high authority weights
Context has not been accounted for.

Applications
- The Clever Project at IBM’s Almaden Labs
  - Developed the HITS algorithm
- Google
  - Developed at Stanford
  - Uses algorithms similar to HITS (PageRank)
  - On-line version

Web Usage Mining

Complex Data Types Summary
Emerging areas of mining complex data types:
- Text mining can be done quite effectively, especially if the documents are semi-structured
- Web mining is more difficult due to the lack of such structure
  - Data includes text documents, hypertext documents, link structure, and logs
  - Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification