CS345 Data Mining Mining the Web for Structured Data.

Our view of the web so far
- Web pages as atomic units
- Great for some applications, e.g., conventional web search
- But not always the right model

Going beyond web pages
- Question answering: What is the height of Mt Everest? Who killed Abraham Lincoln?
- Relation extraction: find all (book, author) pairs
- Virtual databases: answer database-like queries over web data, e.g., find all software engineering jobs in Fortune 500 companies

Question Answering
- E.g., Who killed Abraham Lincoln?
- Naïve algorithm:
  - Find all web pages containing the terms "killed" and "Abraham Lincoln" in close proximity
  - Extract k-grams from a small window around the terms
  - Find the most commonly occurring k-grams
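
The counting step of the naïve algorithm can be sketched as follows; the snippet extraction from page windows is assumed to have been done already, and the example strings are made up for illustration.

```python
from collections import Counter

def top_kgrams(snippets, k=3, top=5):
    """Count k-grams (word sequences of length k) across the text
    snippets extracted around the query terms, and return the most
    common ones."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for i in range(len(words) - k + 1):
            counts[tuple(words[i:i + k])] += 1
    return counts.most_common(top)

# Hypothetical snippets taken from windows around "killed" and
# "Abraham Lincoln":
snippets = [
    "John Wilkes Booth killed Abraham Lincoln in 1865",
    "Abraham Lincoln was killed by John Wilkes Booth",
    "Booth killed Abraham Lincoln at Ford's Theatre",
]
print(top_kgrams(snippets, k=3, top=2))
```

The frequently repeated k-grams ("killed abraham lincoln", "john wilkes booth") surface the answer without any linguistic analysis.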

Question Answering
- The naïve algorithm works fairly well!
- Some improvements:
  - Use sentence structure, e.g., restrict to noun phrases only
  - Rewrite questions before matching: "What is the height of Mt Everest" becomes "The height of Mt Everest is"
- For simple questions, the number of pages analyzed is more important than the sophistication of the NLP
- Reference: Dumais et al.

Relation Extraction
- Find pairs (title, author), where title is the name of a book
  - E.g., (Foundation, Isaac Asimov)
- Find pairs (company, hq)
  - E.g., (Microsoft, Redmond)
- Find pairs (abbreviation, expansion)
  - E.g., (ADA, American Dental Association)
- Can also have tuples with more than 2 components

Relation Extraction
- Assumptions:
  - No single source contains all the tuples
  - Each tuple appears on many web pages
  - Components of a tuple appear "close" together:
    - Foundation, by Isaac Asimov
    - Isaac Asimov's masterpiece, the Foundation trilogy
  - There are repeated patterns in the way tuples are represented on web pages

Naïve approach
- Study a few websites and come up with a set of patterns, e.g., regular expressions:
  letter = [A-Za-z. ]
  title = letter{5,40}
  author = letter{10,30}
  (title) by (author)
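
As a sketch, the hand-written pattern above can be written as a Python regular expression; the character class and length bounds are taken directly from the slide.

```python
import re

LETTER = r"[A-Za-z. ]"  # the slide's "letter" class
PATTERN = re.compile(rf"({LETTER}{{5,40}}) by ({LETTER}{{10,30}})")

m = PATTERN.search("Foundation by Isaac Asimov (1951)")
if m:
    # Trim the whitespace the loose character class inevitably captures.
    print(m.group(1).strip(), "|", m.group(2).strip())
```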

Problems with the naïve approach
- A pattern that works on one web page might produce nonsense when applied to another
  - So patterns need to be page-specific, or at least site-specific
- It is impossible for a human to exhaustively enumerate patterns for every relevant website
  - Will result in low coverage

Better approach (Brin)
- Exploit the duality between patterns and tuples:
  - Find tuples that match a set of patterns
  - Find patterns that match a lot of tuples
- DIPRE (Dual Iterative Pattern Relation Extraction)
  (Diagram: Patterns -Match-> Tuples -Generate-> Patterns)

DIPRE Algorithm
1. R ← SampleTuples
   - E.g., a small set of (title, author) pairs
2. O ← FindOccurrences(R)
   - Occurrences of tuples on web pages
   - Keep some surrounding context
3. P ← GenPatterns(O)
   - Look for patterns in the way tuples occur
   - Make sure patterns are not too general!
4. R ← MatchingTuples(P)
5. Return, or go back to Step 2
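
The loop can be sketched as below; find_occurrences, gen_patterns, and matching_tuples are hypothetical stand-ins for the web-scale operations the slides describe.

```python
def dipre(seed_tuples, find_occurrences, gen_patterns, matching_tuples,
          max_iterations=5):
    """Minimal sketch of the DIPRE iteration (steps 2-5)."""
    relation = set(seed_tuples)
    for _ in range(max_iterations):
        occurrences = find_occurrences(relation)   # step 2
        patterns = gen_patterns(occurrences)       # step 3
        new_tuples = matching_tuples(patterns)     # step 4
        if new_tuples <= relation:                 # nothing new found
            break
        relation |= new_tuples
    return relation
```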

Occurrences  e.g., Titles and authors  Restrict to cases where author and title appear in close proximity on web page Foundation by Isaac Asimov (1951)  url =  order = [title,author] (or [author,title]) denote as 0 or 1  prefix = “ ” (limit to e.g., 10 characters)  middle = “ by ”  suffix = “(1951) ”  occurrence = (’Foundation’,’Isaac Asimov’,url,order,prefix,middle,suffix)

Patterns
  Foundation by Isaac Asimov (1951)
  Nightfall by Isaac Asimov (1941)
- order = [title, author] (say 0)
- shared prefix =
- shared middle = " by "
- shared suffix = " (19"
- pattern = (order, shared prefix, shared middle, shared suffix)

URL Prefix
- Patterns may be specific to a website, or even parts of it
- Add a urlprefix component to the pattern:
  occurrence: Foundation by Isaac Asimov (1951)
  occurrence: Nightfall by Isaac Asimov (1941)
  shared urlprefix =
  pattern = (urlprefix, order, prefix, middle, suffix)

Generating Patterns
1. Group occurrences by order and middle
2. Let O = a set of occurrences with the same order and middle:
   - pattern.order = O.order
   - pattern.middle = O.middle
   - pattern.urlprefix = longest common prefix of all URLs in O
   - pattern.prefix = longest common suffix of the prefix strings in O (the text just before each tuple)
   - pattern.suffix = longest common prefix of the suffix strings in O (the text just after each tuple)
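
A sketch of GeneratePattern for one group of occurrences, following the reading (as in Brin's paper and the example slides) that the pattern keeps the common tail of the prefix strings and the common head of the suffix strings; the occurrence layout and example values are assumptions.

```python
from os.path import commonprefix

def common_suffix(strings):
    # Longest common suffix = reversed common prefix of reversed strings.
    return commonprefix([s[::-1] for s in strings])[::-1]

def generate_pattern(occurrences):
    """occurrences: list of (url, order, prefix, middle, suffix) tuples
    that already share the same order and middle."""
    return {
        "order": occurrences[0][1],
        "middle": occurrences[0][3],
        "urlprefix": commonprefix([o[0] for o in occurrences]),
        # prefix text sits just before the tuple, so keep its common tail;
        # suffix text follows the tuple, so keep its common head.
        "prefix": common_suffix([o[2] for o in occurrences]),
        "suffix": commonprefix([o[4] for o in occurrences]),
    }

occs = [
    ("http://example.com/scifi/a.html", 0, "<li> ", " by ", " (1951)"),
    ("http://example.com/scifi/b.html", 0, "<li> ", " by ", " (1941)"),
]
print(generate_pattern(occs))
```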

Example
  occurrence: Foundation by Isaac Asimov (1951)
  occurrence: Nightfall by Isaac Asimov (1941)
- order = [title, author]
- middle = " by "
- urlprefix =
- prefix = " "
- suffix = " (19"

Example
  occurrence: Foundation, by Isaac Asimov, has been hailed…
  occurrence: Nightfall, by Isaac Asimov, tells the tale of…
- order = [title, author]
- middle = ", by "
- urlprefix =
- prefix = ""
- suffix = ", "

Pattern Specificity
- We want to avoid generating patterns that are too general
- One approach: for a pattern p, define
  specificity(p) = |urlprefix| · |middle| · |prefix| · |suffix|
- Let n(p) = the number of occurrences that match pattern p
- Discard patterns with n(p) < n_min
- Discard patterns p with specificity(p) · n(p) < threshold
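
A sketch of the filter, reading specificity as the product of component lengths (one interpretation of the slide's notation); the n_min and threshold values are made up.

```python
def specificity(p):
    # Product of the lengths of the pattern's literal components.
    return (len(p["urlprefix"]) * len(p["middle"])
            * len(p["prefix"]) * len(p["suffix"]))

def keep_pattern(p, n_matches, n_min=2, threshold=100):
    # Reject rare patterns and patterns that are short (too general)
    # relative to how often they fire.
    return n_matches >= n_min and specificity(p) * n_matches >= threshold

p = {"urlprefix": "http://example.com/scifi/", "middle": " by ",
     "prefix": "<li> ", "suffix": " (19"}
print(specificity(p), keep_pattern(p, 3))
```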

Pattern Generation Algorithm
1. Group occurrences by order and middle
2. Let O = a set of occurrences with the same order and middle
3. p = GeneratePattern(O)
4. If p meets the specificity requirements, add p to the set of patterns
5. Otherwise, try to split O into multiple subgroups by extending the urlprefix by one character
   - If all occurrences in O are from the same URL, we cannot extend the urlprefix, so we discard O

Extending the URL prefix
- Suppose O contains occurrences from URLs of the form
  urlprefix =
- When we extend the urlprefix, we split O into two subsets:
  urlprefix =
  urlprefix =

Finding occurrences and matches
- Finding occurrences:
  - Use an inverted index on web pages
  - Examine the resulting pages to extract occurrences
- Finding matches:
  - Use the urlprefix to restrict the set of pages to examine
  - Scan each page using a regex constructed from the pattern
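
Constructing the scanning regex from a pattern might look like this; the pattern fields and the example page are assumptions.

```python
import re

def pattern_to_regex(p):
    """Build a regex from a pattern's literal prefix/middle/suffix;
    the non-greedy groups capture the two tuple components."""
    return re.compile(
        re.escape(p["prefix"]) + r"(.+?)"
        + re.escape(p["middle"]) + r"(.+?)"
        + re.escape(p["suffix"])
    )

p = {"prefix": "<li> ", "middle": " by ", "suffix": " (19"}
m = pattern_to_regex(p).search("<li> Foundation by Isaac Asimov (1951)")
print(m.groups())
```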

Relation Drift
- Small contaminations can easily lead to huge divergences
- Need to tightly control the process
- Snowball (Agichtein and Gravano):
  - Trust only tuples that match many patterns
  - Trust only patterns with high "support" and "confidence"

Pattern support
- Similar to DIPRE
- Eliminate patterns not supported by at least n_min known good tuples
  - Either seed tuples or tuples generated in a prior iteration

Pattern Confidence
- Suppose tuple t matches pattern p
- What is the probability that tuple t is valid?
- Call this probability the confidence of pattern p, denoted conf(p)
  - Assume it is independent of the other patterns
- How can we estimate conf(p)?

Categorizing pattern matches
- Given a pattern p, suppose we can partition its matching tuples into groups p.positive, p.negative, and p.unknown
- The grouping methodology is application-specific

Categorizing Matches
- E.g., organizations and headquarters:
  - A tuple that exactly matches a known pair (org, hq) is positive
  - A tuple that matches the org of a known tuple but a different hq is negative
    - Assume org is a key for the relation
  - A tuple whose hq is not a known city is negative
    - Assume we have a list of valid city names
  - All other occurrences are unknown

Categorizing Matches
- Books and authors. One possibility:
  - A tuple that matches a known tuple is positive
  - A tuple that matches the title of a known tuple but has a different author is negative
    - Assume title is a key for the relation
  - All other tuples are unknown
- Can come up with other schemes if we have more information, e.g., a list of valid person names
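
The books-and-authors scheme can be sketched as follows, treating title as the key; known is a hypothetical dictionary of trusted tuples.

```python
def categorize(candidate, known):
    """known: dict mapping title -> author for trusted tuples."""
    title, author = candidate
    if known.get(title) == author:
        return "positive"
    if title in known:          # known title, different author
        return "negative"
    return "unknown"

known = {"Foundation": "Isaac Asimov", "Startide Rising": "David Brin"}
print(categorize(("Foundation", "Isaac Asimov"), known))
print(categorize(("Foundation", "Doubleday"), known))
print(categorize(("Rendezvous with Rama", "Arthur C. Clarke"), known))
```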

Example  Suppose we know the tuples Foundation, Isaac Asimov Startide Rising, David Brin  Suppose pattern p matches Foundation, Isaac Asimov Startide Rising, David Brin Foundation, Doubleday Rendezvous with Rama, Arthur C. Clarke  |p.positive| = 2, |p.negative| = 1, |p.unknown| = 1

Pattern Confidence (1)
- pos(p) = |p.positive|
- neg(p) = |p.negative|
- un(p) = |p.unknown|
- conf(p) = pos(p) / (pos(p) + neg(p))

Pattern Confidence (2)
- Another definition: penalize patterns with many unknown matches
- conf(p) = pos(p) / (pos(p) + neg(p) + η·un(p)), where 0 ≤ η ≤ 1
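
Both definitions fit in one helper; eta is an assumed name for the weighting symbol, and eta = 0 recovers the first definition.

```python
def pattern_confidence(pos, neg, unknown=0, eta=0.0):
    """conf(p) = pos / (pos + neg + eta * unknown), with 0 <= eta <= 1."""
    return pos / (pos + neg + eta * unknown)

# The example pattern: 2 positive, 1 negative, 1 unknown match.
print(pattern_confidence(2, 1))            # ignoring unknowns
print(pattern_confidence(2, 1, 1, 0.5))    # penalizing unknowns
```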

Tuple confidence
- Suppose candidate tuple t matches patterns p1 and p2
- What is the probability that t is a valid tuple?
  - Assume matches of different patterns are independent events

Tuple confidence
- Pr[t matches p1 and t is not valid] = 1 - conf(p1)
- Pr[t matches p2 and t is not valid] = 1 - conf(p2)
- Pr[t matches {p1, p2} and t is not valid] = (1 - conf(p1))(1 - conf(p2))
- Pr[t matches {p1, p2} and t is valid] = 1 - (1 - conf(p1))(1 - conf(p2))
- If tuple t matches a set of patterns P:
  conf(t) = 1 - ∏_{p ∈ P} (1 - conf(p))
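
The combination rule on this slide is a noisy-or over the pattern confidences:

```python
import math

def tuple_confidence(pattern_confs):
    """conf(t) = 1 - product over matching patterns of (1 - conf(p))."""
    return 1 - math.prod(1 - c for c in pattern_confs)

# A tuple matching two patterns with confidences 0.8 and 0.5:
print(tuple_confidence([0.8, 0.5]))
```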

Snowball algorithm
1. Start with a seed set R of tuples
2. Generate a set P of patterns from R
   - Compute support and confidence for each pattern in P
   - Discard patterns with low support or confidence
3. Generate a new set T of tuples matching the patterns P
   - Compute the confidence of each tuple in T
4. Add to R the tuples t ∈ T with conf(t) > threshold
5. Go back to step 2
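
One iteration of the loop might be sketched as below; gen_patterns, matching_tuples, and tuple_conf are hypothetical stand-ins, with pattern filtering assumed to happen inside gen_patterns.

```python
def snowball_step(R, gen_patterns, matching_tuples, tuple_conf,
                  conf_threshold=0.8):
    """One Snowball iteration: generate patterns from R (step 2),
    find matching tuples (step 3), keep only confident ones (step 4)."""
    P = gen_patterns(R)
    T = matching_tuples(P)
    return R | {t for t in T if tuple_conf(t) > conf_threshold}
```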

Some refinements
- Give more weight to tuples found earlier
- Approximate pattern matches
- Entity tagging

Approximate matches
- If tuple t matches a set of patterns P exactly:
  conf(t) = 1 - ∏_{p ∈ P} (1 - conf(p))
- Suppose we allow tuples that don't exactly match patterns, but only approximately:
  conf(t) = 1 - ∏_{p ∈ P} (1 - conf(p)·match(t, p))
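
The extension only changes the per-pattern term; match scores are assumed to lie in [0, 1], with 1.0 meaning an exact match.

```python
import math

def tuple_confidence_approx(matches):
    """matches: list of (pattern_confidence, match_score) pairs;
    conf(t) = 1 - product of (1 - conf(p) * match(t, p))."""
    return 1 - math.prod(1 - c * m for c, m in matches)

# An exact match (score 1.0) and a partial match (score 0.5):
print(tuple_confidence_approx([(0.8, 1.0), (0.5, 0.5)]))
```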