Web Mining: Page Rank and Web Content Mining


Web Mining

Two Key Problems
• Page Rank
• Web Content Mining

PageRank  Intuition: solve the recursive equation: “a page is important if important pages link to it.”  Maximailly: importance = the principal eigenvector of the stochastic matrix of the Web. A few fixups needed.

Stochastic Matrix of the Web
• Enumerate the pages; page i corresponds to row and column i.
• M[i,j] = 1/n if page j links to n pages, one of which is page i; M[i,j] = 0 if j does not link to i.
• M[i,j] is the probability that we will next be at page i if we are now at page j.
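As a sketch, this matrix can be built from an adjacency list; the graph below is the three-page example used on the following slides, and the variable names are ours:

```python
# Build the column-stochastic matrix M for a tiny link graph.
links = {            # page -> pages it links to
    "y": ["y", "a"],
    "a": ["y", "m"],
    "m": ["a"],
}
pages = sorted(links)                      # fix an ordering of the pages
idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

M = [[0.0] * n for _ in range(n)]
for j, src in enumerate(pages):            # column j describes src's out-links
    out = links[src]
    for dst in out:
        M[idx[dst]][j] = 1.0 / len(out)    # M[i][j] = 1/n if j has n out-links

# Every column sums to 1, so M is column-stochastic.
```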

Example: if page j links to 3 pages, one of which is page i, then M[i,j] = 1/3.

Random Walks on the Web
• Suppose v is a vector whose i-th component is the probability that we are at page i at a certain time.
• If we follow a link from i at random, the probability distribution for the page we are at next is given by the vector Mv.

Random Walks --- (2)
• Starting from any vector v, the limit M(M(...M(Mv)...)) is the distribution of page visits during a random walk.
• Intuition: pages are important in proportion to how often a random walker would visit them.
• The math: limiting distribution = principal eigenvector of M = PageRank.

Example: The Web in 1839
Three pages: Yahoo (y), Amazon (a), and M'soft (m). Yahoo links to itself and to Amazon; Amazon links to Yahoo and to M'soft; M'soft links only to Amazon.

          y    a    m
    y [ 1/2  1/2   0 ]
    a [ 1/2   0    1 ]
    m [  0   1/2   0 ]

Simulating a Random Walk
• Start with the vector v = [1,1,...,1], representing the idea that each Web page gets one unit of importance.
• Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
• The limit exists, but about 50 iterations suffice to estimate the final distribution.
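A minimal sketch of this simulation on the three-page example (the matrix encodes y = Yahoo, a = Amazon, m = M'soft as above; helper names are ours):

```python
M = [[0.5, 0.5, 0.0],   # y receives half of y's and half of a's importance
     [0.5, 0.0, 1.0],   # a receives half of y's and all of m's
     [0.0, 0.5, 0.0]]   # m receives half of a's

def step(v):
    """One application of M to v."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

v = [1.0, 1.0, 1.0]
for _ in range(50):      # ~50 iterations are enough here
    v = step(v)

print(v)  # close to the limit [6/5, 6/5, 3/5]
```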

Example  Equations v = M v : y = y /2 + a /2 a = y /2 + m m = a /2 y a = m /2 1/2 5/4 1 3/4 9/8 11/8 1/2 6/5 3/5...

Solving the Equations
• Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution.
• Add the fact that y + a + m = 3 to solve.
• In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).

Real-World Problems
• Some pages are "dead ends" (have no links out). Such a page causes importance to leak out.
• Other (groups of) pages are spider traps (all out-links stay within the group). Eventually a spider trap absorbs all the importance.

Microsoft Becomes a Dead End
M'soft now has no out-links, so its column is all zeros:

          y    a    m
    y [ 1/2  1/2   0 ]
    a [ 1/2   0    0 ]
    m [  0   1/2   0 ]

Example  Equations v = M v : y = y /2 + a /2 a = y /2 m = a /2 y a = m /2 3/4 1/2 1/4 5/8 3/8 1/

M'soft Becomes a Spider Trap
M'soft now links only to itself:

          y    a    m
    y [ 1/2  1/2   0 ]
    a [ 1/2   0    0 ]
    m [  0   1/2   1 ]

Example  Equations v = M v : y = y /2 + a /2 a = y /2 m = a /2 + m y a = m /2 3/2 3/4 1/2 7/4 5/8 3/

Google's Solution to Traps, Etc.
• "Tax" each page a fixed percentage at each iteration.
• Add the same constant to all pages.
• This models a random walk with a fixed probability of jumping to a random page next.

Example: Previous with 20% Tax
• Equations v = 0.8(Mv) + 0.2:
    y = 0.8(y/2 + a/2) + 0.2
    a = 0.8(y/2) + 0.2
    m = 0.8(a/2 + m) + 0.2
• Solution: y = 7/11, a = 5/11, m = 21/11.
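A sketch of the taxed iteration v ← 0.8(Mv) + 0.2 on the spider-trap matrix; it now converges to (7/11, 5/11, 21/11) instead of letting the trap take everything:

```python
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]   # M'soft is still a spider trap

v = [1.0, 1.0, 1.0]
for _ in range(200):
    # Tax 20% of each page's importance, then hand 0.2 back to every page.
    v = [0.8 * sum(M[i][j] * v[j] for j in range(3)) + 0.2 for i in range(3)]

print(v)  # close to [7/11, 5/11, 21/11]
```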

General Case
• In this example, because there are no dead ends, the total importance remains at 3.
• In examples with dead ends, some importance leaks out, but the total remains finite.

Solving the Equations
• Because there are constant terms, we can expect to solve small examples by Gaussian elimination.
• Web-sized examples still need to be solved by relaxation.

Speeding Convergence
• Newton-like prediction of where the components of the principal eigenvector are heading.
• Take advantage of locality in the Web.
• Each technique can reduce the number of iterations by 50%. Important --- PageRank takes time!

Web Content Mining
• The Web is perhaps the single largest data source in the world.
• Much of Web content mining is about:
  - data/information extraction from semi-structured objects and free text, and
  - integration of the extracted data/information.
• Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.

Wrapper Induction
• Use machine learning to generate extraction rules:
  - The user marks the target items in a few training pages.
  - The system learns extraction rules from these pages.
  - The rules are applied to extract target items from other pages.
• Many wrapper induction systems exist, e.g., WIEN (Kushmerick et al., IJCAI-97), Softmealy (Hsu and Dung, 1998), Stalker (Muslea et al., Agents-99), BWI (Freitag and McCallum, AAAI-00), WL2 (Cohen et al., WWW-02), IDE (Liu and Zhai, WISE-05), and Thresher (Hogue and Karger, WWW-05).

Stalker: A Wrapper Induction System (Muslea et al., Agents-99)
Training examples:
  E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
  E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
  E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
  E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008
We want to extract the area code.
• Start rules:
    R1: SkipTo(()
    R2: SkipTo(-<b>)
• End rules:
    R3: SkipTo())
    R4: SkipTo(</b>)

Learning Extraction Rules
• Stalker uses sequential covering to learn extraction rules for each target item:
  - In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items.
  - Once a positive item is covered by a rule, the whole example is removed.
  - The algorithm ends when all positive items are covered.
• The result is an ordered list of all learned rules.

Rule Induction Through an Example
Training examples:
  E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
  E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
  E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
  E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008
We learn a start rule for the area code.
• Assume the algorithm starts with E2. It creates three initial candidate rules, using the first prefix symbol and two wildcards:
    R1: SkipTo(()
    R2: SkipTo(Punctuation)
    R3: SkipTo(Anything)
• R1 is perfect: it covers two positive examples (E2 and E4) and no negative example.

Rule Induction (cont ...)
• R1 covers E2 and E4, which are removed. E1 and E3 need additional rules.
• Three candidates are created:
    R4: SkipTo(<b>)
    R5: SkipTo(HtmlTag)
    R6: SkipTo(Anything)
• None is good; refinement is needed.
• Stalker chooses R4 to refine, i.e., it adds additional symbols to specialize it.
• It finds R7: SkipTo(-<b>), which is perfect.
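A much-simplified sketch of how a learned rule is applied: here landmarks are plain literal strings (Stalker's landmarks can also be wildcard classes such as Punctuation), the helper name is ours, and the phone digits are illustrative.

```python
def apply_rule(landmarks, text):
    """Return the position just past the last landmark, or None if one is missing."""
    pos = 0
    for lm in landmarks:
        hit = text.find(lm, pos)
        if hit < 0:
            return None          # the rule does not apply to this example
        pos = hit + len(lm)
    return pos

# Start rule R1 = SkipTo(() and end rule R3 = SkipTo()) extract the area code.
e2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
start = apply_rule(["("], e2)
end = e2.find(")", start)
print(e2[start:end])  # 800
```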

Limitations of Supervised Learning
• Manual labeling is labor-intensive and time-consuming, especially if one wants to extract data from a huge number of sites.
• Wrapper maintenance is very costly if Web sites change frequently:
  - It is necessary to detect when a wrapper stops working properly.
  - Any change may make existing extraction rules invalid.
  - Re-learning is needed, and most likely manual re-labeling as well.

The RoadRunner System (Crescenzi et al., VLDB-01)
• Given a set of positive examples (multiple sample pages), each containing one or more data records, generate from these pages a wrapper as a union-free regular expression (i.e., no disjunction).
• The approach:
  - To start, a sample page is taken as the wrapper.
  - The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.

Comparison with Wrapper Induction
• No manual labeling is needed, but RoadRunner requires a set of positive pages generated from the same template; that is unnecessary when a single page contains multiple data records.
• It produces wrappers for pages, not for data records, and a Web page can contain many pieces of irrelevant information.
• Issues of automatic extraction:
  - hard to handle disjunctions;
  - hard to generate attribute names for the extracted data;
  - data extracted from multiple sites needs integration, manual or automatic.

Relation Extraction
• Assumptions:
  - No single source contains all the tuples.
  - Each tuple appears on many web pages.
  - Components of a tuple appear "close" together, e.g.:
      Foundation, by Isaac Asimov
      Isaac Asimov's masterpiece, the Foundation trilogy
  - There are repeated patterns in the way tuples are represented on web pages.

Naïve Approach
• Study a few websites and come up with a set of patterns, e.g., regular expressions:
    letter = [A-Za-z. ]
    title  = letter{5,40}
    author = letter{10,30}
    pattern: (title) by (author)
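The slide's pattern can be sketched as a Python regular expression; the book/author string below is just an illustration:

```python
import re

letter = r"[A-Za-z. ]"                       # the slide's character class
pattern = re.compile(rf"({letter}{{5,40}}) by ({letter}{{10,30}})")

m = pattern.fullmatch("The Comedy of Errors by William Shakespeare")
print(m.groups())  # ('The Comedy of Errors', 'William Shakespeare')
```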

Problems with the Naïve Approach
• A pattern that works on one web page might produce nonsense when applied to another, so patterns need to be page-specific, or at least site-specific.
• It is impossible for a human to exhaustively enumerate patterns for every relevant website, which results in low coverage.

Better Approach (Brin)
• Exploit the duality between patterns and tuples:
  - Find tuples that match a set of patterns.
  - Find patterns that match a lot of tuples.
• DIPRE (Dual Iterative Pattern Relation Extraction): patterns generate tuples, and tuples match patterns, in a loop.

DIPRE Algorithm
1. R ← SampleTuples
   (e.g., a small set of known pairs)
2. O ← FindOccurrences(R)
   (occurrences of tuples on web pages; keep some surrounding context)
3. P ← GenPatterns(O)
   (look for patterns in the way tuples occur; make sure patterns are not too general!)
4. R ← MatchingTuples(P)
5. Return, or go back to Step 2.
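The loop above can be sketched on toy in-memory "pages". This is a big simplification: a "pattern" here is only the middle string between title and author, whereas DIPRE patterns also keep the URL prefix, prefix, and suffix; all pages and tuples below are made up.

```python
import re

pages = [
    "I loved Foundation, by Isaac Asimov, when I was young.",
    "She rereads Dune, by Frank Herbert, every summer.",
    "Isaac Asimov's masterpiece, the Foundation trilogy",
]
seed = {("Foundation", "Isaac Asimov")}

def one_round(tuples, pages):
    # Steps 2-3: find occurrences of known tuples, keep the middle context,
    # and reject middles that look too general.
    middles = set()
    for title, author in tuples:
        for page in pages:
            i = page.find(title)
            j = page.find(author, i + len(title)) if i >= 0 else -1
            if j >= 0:
                mid = page[i + len(title):j]
                if 0 < len(mid) <= 10:
                    middles.add(mid)
    # Step 4: match the patterns everywhere to harvest new tuples.
    found = set()
    for mid in middles:
        rx = re.compile(r"([A-Z][a-z]+)" + re.escape(mid) + r"([A-Z][a-z]+ [A-Z][a-z]+)")
        for page in pages:
            for m in rx.finditer(page):
                found.add(m.groups())
    return found

new_tuples = one_round(seed, pages)
print(new_tuples)   # the seed pattern ", by " also recovers ("Dune", "Frank Herbert")
```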

Web Query Interface Integration
• There are many integration tasks:
  - integrating Web query interfaces (search forms)
  - integrating extracted data
  - integrating textual information
  - integrating ontologies (taxonomies)
  - ...
• We only introduce integration of query interfaces:
  - Many web sites provide forms to query the deep Web.
  - Applications: meta-search and meta-query.

Global Query Interface
(Figure: a unified query interface built over the search forms of united.com, airtravel.com, delta.com, and hotwire.com.)

Synonym Discovery (He and Chang, KDD-04)
Discover synonym attributes, e.g., Author - Writer and Subject - Category, across interfaces such as:
  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding
• Pairwise attribute correspondence matches pairs of sources, e.g., S1.author <-> S3.name and S1.subject <-> S2.category.
• Holistic model discovery instead builds one model over all sources at once, e.g., grouping {author, name, writer} and {subject, category}.

Schema Matching as Correlation Mining
Across many sources:
• Synonym attributes are negatively correlated:
  - synonym attributes are semantic alternatives,
  - and thus rarely co-occur in query interfaces.
• Grouping attributes are positively correlated:
  - grouping attributes semantically complement each other,
  - and thus often co-occur in query interfaces.
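A sketch of the co-occurrence signal on hypothetical interfaces (the actual KDD-04 approach uses a statistical correlation measure; raw counts here just illustrate the idea):

```python
# Each set is the attribute schema of one (made-up) query interface.
interfaces = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
]

def cooccur(a, b):
    """Number of interfaces in which attributes a and b appear together."""
    return sum(1 for s in interfaces if a in s and b in s)

print(cooccur("author", "writer"))  # 0 -- never together: synonym candidates
print(cooccur("author", "title"))   # 2 -- co-occur: complementary attributes
```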