CS246 Extracting Structured Information from the Web.

Junghoo "John" Cho (UCLA Computer Science)

A Story of Nightmare
- Spam Inc.
- Task from your boss
  - 10M Web pages
  - Find all [person name, email] pairs
- Big salary cut unless you collect 100,000 “quality records” in a week

How?
- Any idea?
- Why such a task?
  - Information is already there…
  - → To use it for other programs: use the addresses to send emails
- For now, let us ignore the techniques in the papers and see how we can approach the problem

Solution 1
- Manual approach
- 10 sec/record
  - 8,640 records/day
  - 60,480 records/week
- Okay if 5 sec/record
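
The throughput figures can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the 8,640 records/day figure implies, that someone tags records around the clock; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: nonstop manual work, as 8,640/day implies

def records_per_week(seconds_per_record):
    """Records one person can collect in 7 days at the given pace."""
    return SECONDS_PER_DAY // seconds_per_record * 7

print(records_per_week(10))  # 60480, short of the 100,000 target
print(records_per_week(5))   # 120960, enough to meet it
```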

Solution 2
- Write an “extraction rule”
  - Regular expression
  - Name: [A-Z][a-z]* [A-Z][a-z]*
- Find all matches using the rule
- Maybe “filter out” manually
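
A minimal sketch of the name rule in Python, to show why manual filtering may still be needed. The sample sentence and the word-boundary anchors (`\b`) are assumptions added for illustration; only the `[A-Z][a-z]* [A-Z][a-z]*` part comes from the slide.

```python
import re

# The slide's name rule, wrapped in word boundaries (an added assumption).
NAME_RULE = re.compile(r"\b[A-Z][a-z]* [A-Z][a-z]*\b")

def extract_names(text):
    """Return every substring that matches the name rule."""
    return NAME_RULE.findall(text)

sample = "contact John Smith or Paul Jones; offices in New York."
print(extract_names(sample))
# ['John Smith', 'Paul Jones', 'New York']
```

The rule also picks up “New York”, which is the kind of false match the manual filtering step would have to remove.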

Question
- Do we have to construct an “extraction rule” for every task?
- Can we automate “rule construction”?

General Problem
- Web pages or plain text → extraction rule or pattern → structured data: (John, …), (Eric, …), (James, …)
- Extraction rule or pattern: how to generate it?

Basic Idea
- Users provide small “examples” or a “training set”
  - Tag some [name, email] pairs from the data

Tagging
- [Figure: tagging example, label: Name]

Basic Idea
- Users provide small “examples” or a “training set”
  - Tag some [name, email] pairs from the data
- System “generalizes” the examples and derives a “rule” or “patterns”
  - Find common patterns among the tagged pairs

Pattern Generation
- Chu … Cong Cho … …
- → #Name # !

Basic Idea
- Users provide small “examples” or a “training set”
  - Tag some [name, email] pairs from the data
- System “generalizes” the examples and derives a “rule” or “patterns”
  - Find common patterns among the tagged pairs
- Use the rule to extract other instances

Fundamental Questions
- How to generalize?
  - Examples → patterns: how?
  - Pattern construction algorithm
- How to express “patterns” or “rules”?
  - Regular expression? Context-free grammar?
  - Pattern language
- How to select the right pattern?
  - Many possible patterns. Which one to choose?
  - Evaluation function

Dual Questions
- What kind of sources?
  - Unstructured vs. regular
  - Plain text vs. table
  - Noisy vs. clean
- What kind of data to extract?
  - Difficult to identify vs. easy to describe
    - Name vs. …
  - Single occurrence vs. multiple occurrences
    - … vs. Song title

Questions?

Book and Author paper
- How many people understood it?
- What is the problem?
- What is the basic idea?
- How many people got it?
- How many people liked it?
- What did you like/hate about the paper?

Basic Algorithm (1)
- Start with a small example
  - (Isaac Asimov, The Robots of Dawn)
  - (David Brin, Startide Rising)
- Find all matches from Web pages (with surrounding text)
  - … Startide Rising by David Brin (2nd …
  - …book The Robots of Dawn by Isaac Asimov (19…
- Derive common patterns among matches
  - → #Book by #Author (

Basic Algorithm (2)
- Find more examples using the pattern #Book by #Author (
  - → … The Time Machine by H.G. Wells (…
  - … The Lurker at the Threshold by H.P. Lovecraft (…
  - → (H.G. Wells, The Time Machine)
  - (H.P. Lovecraft, The Lurker at the Threshold)

Basic Algorithm (3)
- Find more occurrences of the new examples
  - …book The Time Machine by H.G. Wells (…
  - … The Lurker at the Threshold by H.P. Lovecraft (…
- Derive more rules based on the matches
  - → #Book by #Author
- Repeat the process

Basic Algorithm (Summary)
- Examples: (Asimov, Dawn)
- → Matching strings: Dawn by Asimov (
- → Patterns: #Book by #Author (
- → More examples: (Brin, Star)
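
The loop summarized above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the corpus, the fixed context widths, the 20-character limit on the middle string, and all function names are hypothetical, and no pattern filtering is applied.

```python
import re

def occurrences(corpus, pairs, prefix_width=4, suffix_width=2):
    """Find known (author, book) pairs in the corpus and record their context
    as (prefix, middle, suffix) triples."""
    found = set()
    for author, book in pairs:
        rx = re.compile(re.escape(book) + r"(.{1,20}?)" + re.escape(author))
        for doc in corpus:
            for m in rx.finditer(doc):
                prefix = doc[max(0, m.start() - prefix_width):m.start()]
                suffix = doc[m.end():m.end() + suffix_width]
                found.add((prefix, m.group(1), suffix))
    return found

def apply_patterns(corpus, patterns):
    """Treat each triple as '[prefix] #Book [middle] #Author [suffix]'."""
    pairs = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                        + r"(.+?)" + re.escape(suffix))
        for doc in corpus:
            for m in rx.finditer(doc):
                pairs.add((m.group(2), m.group(1)))  # (author, book)
    return pairs

def bootstrap(corpus, seeds, iterations=2):
    """Examples -> matching strings -> patterns -> more examples, repeated."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = occurrences(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs

corpus = [  # hypothetical mini "Web"
    "<li>Startide Rising by David Brin (1983)",
    "<li>The Robots of Dawn by Isaac Asimov (1983)",
    "<li>The Time Machine by H.G. Wells (1895)",
    "<p>Review: The Time Machine, written by H.G. Wells, is a classic.",
    "<p>Review: Foundation, written by Isaac Asimov, is a classic.",
]
seeds = {("Isaac Asimov", "The Robots of Dawn"), ("David Brin", "Startide Rising")}
for pair in sorted(bootstrap(corpus, seeds)):
    print(pair)
```

In this toy run the first iteration learns the list-style pattern and picks up the Wells entry; the second iteration finds Wells in a review sentence, learns a new pattern from it, and uses that pattern to reach the Foundation entry.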

Result
- 23M Web pages
- 5 examples
- 5 iterations
- 1 manual filtering
- → 15,257 pairs with few errors

What’s New?
- No tagging. Simple examples
- (Pattern, Relation) duality
  - Conceptually elegant
- Feedback loop
  - Why don’t we use learned examples?
- Small initial sample
- Promising results

Problems of Feedback Loop
- What if there are erroneous examples?
- Expand to meaningless data?

What Did the Author Do?
- Manual filtering in the 4th iteration
- Stopped after 5 iterations
- Specificity factor
  - |middle| x |prefix| x |suffix| x |urlprefix|
  - Adopt a pattern if it has a long prefix, suffix and/or mid-string
- Limit rules to a very specific URL space
  - Rule includes URL prefix

Divergence?
- Another experiment
  - Initial examples: baseball team names
  - Data: newspaper articles
  - Results: all sports team names
- Given a set of examples, where would it converge?

How to Control Divergence?
- Example → Pattern
  - More than k examples
- Pattern → Example
  - More than k patterns

Matrix Interpretation
- Rows: examples (items)
  - We assume a hypothetical set of all examples occurring in the data
- Columns: patterns
  - We assume a hypothetical set of all patterns that can be derived
- Cell[i, j] = 1 iff the j-th pattern matches the i-th example
  - Row[i] = (Book of worm, Asimov)
  - Column[j] = #Book by #Author
  - Cell[i, j] = 1 if “Book of worm by Asimov” exists
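
A small sketch of the matrix view, also showing the “more than k examples” rule for keeping a pattern. The items, patterns, corpus, the `{book}`/`{author}` template syntax, and the function names are all assumptions made for illustration.

```python
def build_matrix(items, patterns, corpus):
    """Cell[i][j] = 1 iff pattern j, filled in with item i, occurs in the corpus."""
    return [
        [1 if any(p.format(book=book, author=author) in doc for doc in corpus) else 0
         for p in patterns]
        for book, author in items
    ]

def supported_patterns(matrix, patterns, k):
    """Keep a pattern (column) only if it matches more than k items (rows)."""
    support = [sum(row[j] for row in matrix) for j in range(len(patterns))]
    return [p for p, s in zip(patterns, support) if s > k]

items = [  # rows: (book, author), hypothetical
    ("The Robots of Dawn", "Isaac Asimov"),
    ("Startide Rising", "David Brin"),
    ("The Time Machine", "H.G. Wells"),
]
patterns = ["{book} by {author}", "{author}'s {book}"]  # columns, hypothetical
corpus = [
    "Startide Rising by David Brin (1983)",
    "The Robots of Dawn by Isaac Asimov",
    "Isaac Asimov's The Robots of Dawn",
    "The Time Machine by H.G. Wells",
]

matrix = build_matrix(items, patterns, corpus)
print(matrix)                                     # [[1, 1], [1, 0], [1, 0]]
print(supported_patterns(matrix, patterns, k=1))  # ['{book} by {author}']
```

The symmetric rule (keep an item only if more than k patterns match it) would sum over rows instead of columns.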

Matrix Example
- [Figure: matrix with items (A, B), (C, D), (C, A), (D, E), (S, L), (N, U), … as rows and patterns as columns]

How to Control Divergence?
- Example → Pattern
  - More than k examples
- Pattern → Example
  - More than k patterns
- Fix the matrix!

How to Change Matrix?
- Change row?
  - Filter out noise from data
  - Use only the pages mentioning “books”
    - Classify pages based on word frequency
  - Identify only the “relevant” part of pages
  - Identify only the “structured” part of pages
    - Lists? Tables?

How to Change Matrix?
- Change column?
  - Use a different pattern language
  - E.g., the author used “url prefix”
  - Context-free grammar?
- What will be a good pattern space?

Fundamental Questions
- How to express “patterns” or “rules”?
  - Pattern language
- How to go from examples → patterns?
  - Pattern construction algorithm
- How to select the right one?
  - Evaluation function

Pattern Language?
- Very limited regular expression
- With URL filter
  - URL filter seems to be important to minimize noise
- [prefix] #book [midstring] #author [suffix]

Pattern Construction Algorithm?
1. Group matching strings based on “mid-string”
2. Find longest prefix, suffix and URL-prefix
3. If the pattern is long enough, adopt it

Evaluation Function?
- The longer, the better
- Specificity factor
  - |middle| x |prefix| x |suffix| x |urlprefix|
- To minimize noise
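
The three construction steps and the specificity factor can be sketched as follows. This is an illustration under assumptions: the occurrence tuples, the threshold value, and the requirement of at least two occurrences per group are choices made for this example, not details given in the slides.

```python
from collections import defaultdict
from os.path import commonprefix  # character-wise longest common prefix

def common_suffix(strings):
    """Longest common suffix, computed on the reversed strings."""
    return commonprefix([s[::-1] for s in strings])[::-1]

def build_patterns(occurrences, threshold):
    """occurrences: (url, left_context, middle, right_context) tuples."""
    groups = defaultdict(list)
    for occ in occurrences:
        groups[occ[2]].append(occ)                 # step 1: group by mid-string
    patterns = []
    for middle, group in groups.items():
        if len(group) < 2:                         # assumption: need 2+ occurrences
            continue
        urlprefix = commonprefix([g[0] for g in group])   # step 2: longest
        prefix = common_suffix([g[1] for g in group])     # URL-prefix, prefix
        suffix = commonprefix([g[3] for g in group])      # and suffix
        specificity = len(middle) * len(prefix) * len(suffix) * len(urlprefix)
        if specificity > threshold:                # step 3: long enough, adopt it
            patterns.append((urlprefix, prefix, middle, suffix))
    return patterns

occs = [  # hypothetical occurrences
    ("http://books.example.com/scifi/a.html", "<li><b>", "</b> by ", " (1983)"),
    ("http://books.example.com/scifi/b.html", "<p><li><b>", "</b> by ", " (1895)"),
    ("http://other.example.org/x.html", "read ", ", written by ", ", today"),
]
print(build_patterns(occs, threshold=100))
```

Because the factor is a product, a pattern with an empty prefix, suffix, or URL prefix scores zero and is never adopted.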

Dual Question
- Regular vs. unstructured source
  - Relatively regular source required
- Noisy vs. clean source
  - General noise okay
- Single vs. multiple occurrences
  - Multiple occurrences

Would It Work? [name, phone number]

Would It Work?
- [name, phone number]?
- No: [mid-string] not fixed
- More expressive pattern language
  - HTML parse-tree based?

Any Other Questions?

RoadRunner
- What is the problem?
- What is the main idea?

Key Observation
- Many Web pages are generated from a structured database
- These pages are based on “templates” and thus follow an extremely regular structure
- We can extract data by identifying the “different parts”

Key Idea
- Compare two pages
- Extract different parts

Simplest Case
- Page 1: Books of: John Smith Title: DB Primer
- Page 2: Books of: Paul Jones Title: XML at Work
- Mismatch!

Simplest Case
- Template (same on both pages): Books of: Title:

Simplest Case
- Data: (John Smith, DB Primer), (Paul Jones, XML at Work)

What Other Cases?

Repeated Items (from Amazon)

Missing Items (from Amazon)
- No image!

Varying Items (from Amazon)
- Item varies!

Other Cases
- Repeated items
  - Number of items may vary
- Missing items
  - Optional
- Varying items
  - Multiple choices
- How can we express these cases?
  - Pattern language

Pattern Language
- What patterns can express the previous cases?
- Regular expression?
  - Repeated items (+)
  - Optional items (?)
  - Varying items ( | )
- Why not context-free grammar?
  - More expressive, but not necessary
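
As a rough illustration of how the three operators cover the three cases, here is a regular expression over a made-up page layout. The tag structure and names are assumptions, not taken from the Amazon pages in the slides.

```python
import re

# One list item: optional image (?), a title, and one of two conditions ( | ).
ITEM = r"<li>(?:<img/>)?<b>[^<]+</b>(?:<i>new</i>|<i>used</i>)</li>"
# A page: one or more items (+) inside a list.
PAGE = re.compile(r"<ul>(?:" + ITEM + r")+</ul>")

one_item = "<ul><li><img/><b>DB Primer</b><i>new</i></li></ul>"
two_items = ("<ul><li><b>XML at Work</b><i>used</i></li>"
             "<li><img/><b>DB Primer</b><i>new</i></li></ul>")
no_items = "<ul></ul>"

print(bool(PAGE.fullmatch(one_item)))   # True
print(bool(PAGE.fullmatch(two_items)))  # True: repeated item, one without image
print(bool(PAGE.fullmatch(no_items)))   # False: + requires at least one item
```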

One Step Back
- What are we doing here? How can we formalize the problem?
- Given a set of strings (instances), find a regular language/grammar that includes the strings
- Grammar inference problem
  - (One of the most important contributions of the paper)

Grammar Inference
- T: all possible strings
- Example strings
- Which one?

Minimal Regular Language
- Pick the minimal language
  - Conservative approach
  - May minimize bogus tuples
- Is it the right choice?
  - May not match the actual semantics
  - But easier to solve and looks fancy!
- Do the authors actually pick the minimal language?
  - No. They prefer list over optional. List is larger than optional.

Why Union-Free?
- Union is ugly
  - Major source of exponential blow-up
  - (a|b)(c|d)(e|f)(g|h): 2 x 2 x 2 x 2
- Limited expressive power, but easier to work with
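
The blow-up in the example can be made concrete by enumerating the language. This is a small sketch; the variable names are arbitrary.

```python
import re
from itertools import product

UNION_PATTERN = re.compile(r"(a|b)(c|d)(e|f)(g|h)")
alternatives = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]

# Every way of picking one branch per union gives a distinct string.
strings = ["".join(choice) for choice in product(*alternatives)]

print(len(strings))                                      # 16 = 2 x 2 x 2 x 2
print(all(UNION_PATTERN.fullmatch(s) for s in strings))  # True
```

With n independent two-way unions the count is 2^n, which is why each added union doubles the number of cases to consider.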

Pattern Space of RoadRunner
- Minimal union-free regular expression
  - List (+) and optional (?)
  - No choice ( | )
- List has precedence over optional
  - Not exactly minimal

Language Inference Algorithm
- String mismatches
  - Replace with #PCDATA
- Tag mismatches
  - Try list first and then optional
- Heavily depends on tag mismatch

String Mismatch
- Page 1: Books of: John Smith Title: DB Primer
- Page 2: Books of: Paul Jones Title: XML at Work
- Wrapper: Books of: #PCDATA Title: #PCDATA
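
A minimal sketch of the string-mismatch step only, assuming two pages with identical tag structure. The HTML markup around the slide's text, the tokenizer, and the function names are assumptions; tag mismatches (lists and optionals) are deliberately not handled here.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text between tags

def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def generalize(page_a, page_b):
    """Align two pages token by token; differing text becomes #PCDATA."""
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("different structure: needs tag-mismatch handling")
    wrapper = []
    for x, y in zip(a, b):
        if x == y:
            wrapper.append(x)                       # part of the template
        elif not x.startswith("<") and not y.startswith("<"):
            wrapper.append("#PCDATA")               # string mismatch: a data field
        else:
            raise ValueError("tag mismatch: not handled in this sketch")
    return wrapper

def extract(page, wrapper):
    """Read the data fields of a page off the #PCDATA positions."""
    return [tok for tok, w in zip(tokenize(page), wrapper) if w == "#PCDATA"]

page1 = "<html>Books of: <b>John Smith</b><ul><li>Title: <i>DB Primer</i></li></ul></html>"
page2 = "<html>Books of: <b>Paul Jones</b><ul><li>Title: <i>XML at Work</i></li></ul></html>"

wrapper = generalize(page1, page2)
print(" ".join(wrapper))
print(extract(page1, wrapper))  # ['John Smith', 'DB Primer']
print(extract(page2, wrapper))  # ['Paul Jones', 'XML at Work']
```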

Tag Mismatches
- Try to generalize by list
- If it does not work, consider optional

List Identification
- Page 1: Title DB Primer 1, Title DB Primer 2, Title DB Primer 3
- Page 2: Title XML Primer 1, Title XML Primer 2, (third item missing!)
- Search for previous tag to identify end of item
- Verify it by matching with previous one

Recursive Mismatch
- Title DB Primer 1: 1st Edition, 1996
- Title DB Primer 2: 1st Edition, … 2nd Edition, 2001
- Title XML Primer 1: 1st Edition, 1996
- Missing!
- Apply matching algorithm recursively

Optional
- If list does not work, use optional
- For multiple choices, what to choose?
  - Many different choices to consider
  - The authors do not explain…
  - Some heuristic pruning criteria

Multiple Choices
- Mismatch!
- Potential wrappers: (( )? )+, (( )? )+, … and many others

Fundamental Questions
- Pattern space
  - Union-free regular expression
- Example → Pattern algorithm
  - Just described
- Evaluation function
  - Supposedly minimal language… but not really
  - Exact evaluation function not explained…

Dual Question
- Regular vs. unstructured source
  - Very regular
- Noisy vs. clean source
  - Very clean
- Single vs. multiple occurrences
  - Does not matter

Limitations
- Heavily dependent on HTML tags
  - Cannot extract data in free text, even if the format is regular
  - e.g., John is the author of Great Book
- Very fragile to noise
- Of course, limitations from union-free regular expressions
  - No recursive items: …

Potential Improvements?
- Consider multiple pages simultaneously
  - May provide more evidence to select one choice over the other

One More Consideration
- Is Section 3 necessary?
  - Read the paper without Section 3
  - Is it still as impressive?
- Generalization and theoretical background study are always helpful to make a paper more “impressive”