CS246 Extracting Structured Information from the Web
Junghoo "John" Cho (UCLA Computer Science)
A Story of Nightmare: Spam Inc. Task from your boss: given 10M Web pages, find all [person name, email address] pairs. Big salary cut unless you collect 100,000 “quality records” in a week.
How? Any idea? Why such a task? The information is already there, but we want to use it in other programs: use the addresses to send emails. For now, let us ignore the techniques in the papers and see how we can approach the problem.
Solution 1: Manual approach. At 10 sec/record: 8,640 records/day, 60,480 records/week. Okay if 5 sec/record (120,960 records/week).
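The arithmetic behind the manual approach can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes someone tags records around the clock, 24 hours a day, 7 days a week.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: tagging never stops

def records_per_week(seconds_per_record: int) -> int:
    """Records one person can tag in a week at the given pace."""
    return SECONDS_PER_DAY // seconds_per_record * 7

print(records_per_week(10))  # 60480, short of the 100,000 target
print(records_per_week(5))   # 120960, enough to meet the target
```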
Solution 2: Write an “extraction rule” as a regular expression. Name: [A-Z][a-z]* [A-Z][a-z]*. Find all matches using the rule. Maybe “filter out” bad matches manually.
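As a rough illustration, the name rule from the slide can be applied with Python's `re` module. The word boundaries (`\b`), the function name and the sample sentence are assumptions added for this sketch; only the core expression comes from the slide.

```python
import re

# The slide's rule, wrapped in word boundaries (an added assumption).
NAME_RULE = re.compile(r"\b[A-Z][a-z]* [A-Z][a-z]*\b")

def extract_names(text: str) -> list[str]:
    """Return every substring that looks like 'Capitalized Capitalized'."""
    return NAME_RULE.findall(text)

sample = "written by John Smith and Paul Jones in New York."
print(extract_names(sample))  # ['John Smith', 'Paul Jones', 'New York']
```

The false positive “New York” is the kind of match that the manual “filter out” step would have to remove.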
Question: Do we have to construct an “extraction rule” for every task? Can we automate “rule construction”?
General Problem: Web pages or plain text → extraction rule or pattern → structured data, e.g. (John, …), (Eric, …), (James, …). How to generate the rule?
Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data.
Tagging: Name
Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data. The system “generalizes” the examples and derives a “rule” or “patterns”: find common patterns among the tagged pairs.
Pattern Generation (figure): tagged occurrences for Chu, Cong and Cho are generalized into a pattern over #Name.
Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data. The system “generalizes” the examples and derives a “rule” or “patterns”: find common patterns among the tagged pairs. Use the rule to extract other instances.
Fundamental Questions: How to generalize? Examples → patterns: how? (Pattern construction algorithm.) How to express “patterns” or “rules”? Regular expression? Context-free grammar? (Pattern language.) How to select the right pattern? Many patterns are possible; which one to choose? (Evaluation function.)
Dual Questions: What kind of sources? Unstructured vs. regular. Plain text vs. table. Noisy vs. clean. What kind of data to extract? Difficult to identify vs. easy to describe: name vs. email. Single occurrences vs. multiple occurrences: … vs. song title.
Questions?
Book and Author paper: How many people understood it? What is the problem? What is the basic idea? How many people got it? How many people liked it? What did you like/hate about the paper?
Basic Algorithm (1): Start with a small set of examples: (Isaac Asimov, The Robots of Dawn), (David Brin, Startide Rising). Find all matches from Web pages (with surrounding text): “… Startide Rising by David Brin (2nd …”, “… book The Robots of Dawn by Isaac Asimov (19…”. Derive common patterns among the matches: #Book by #Author (
Basic Algorithm (2): Find more examples using the pattern #Book by #Author ( : “… The Time Machine by H.G. Wells (…”, “… The Lurker at the Threshold by H.P. Lovecraft (…”. New examples: (H.G. Wells, The Time Machine), (H.P. Lovecraft, The Lurker at the Threshold).
Basic Algorithm (3): Find more occurrences of the new examples: “… book The Time Machine by H.G. Wells (…”, “… The Lurker at the Threshold by H.P. Lovecraft (…”. Derive more rules based on the matches: #Book by #Author. Repeat the process.
Basic Algorithm (Summary): Examples, e.g. (Asimov, Dawn) → matching strings, e.g. “Dawn by Asimov (” → patterns, e.g. “#Book by #Author (” → more examples, e.g. (Brin, Star) → back to matching strings.
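The loop summarized above can be sketched in Python. This is a heavily simplified illustration, not the paper's implementation: patterns here are only a (middle, suffix) pair, the prefix and URL prefix are ignored, the `title` regex for what a book or author may look like is an assumption, and all function names and the `min_support` parameter are hypothetical.

```python
import re
from collections import defaultdict

def derive_patterns(pages, pairs, min_support=2):
    """Keep (middle, suffix) contexts shared by at least min_support known pairs."""
    support = defaultdict(set)
    for page in pages:
        for author, book in pairs:
            start = page.find(book)
            if start < 0:
                continue
            mid_start = start + len(book)
            author_at = page.find(author, mid_start)
            if author_at < 0:
                continue
            middle = page[mid_start:author_at]
            suffix = page[author_at + len(author):][:2]
            # Assumption: ignore empty or very long contexts.
            if not suffix or not (0 < len(middle) <= 10):
                continue
            support[(middle, suffix)].add((author, book))
    return {p for p, who in support.items() if len(who) >= min_support}

def apply_patterns(pages, patterns):
    """Use each pattern as '#Book<middle>#Author<suffix>' to find new pairs."""
    title = r"([A-Z][\w.' ]*?)"  # assumed shape of a book title or author name
    found = set()
    for middle, suffix in patterns:
        rule = re.compile(title + re.escape(middle) + title + re.escape(suffix))
        for page in pages:
            for book, author in rule.findall(page):
                found.add((author, book))
    return found

def bootstrap(pages, seeds, iterations=2):
    """Examples -> matching strings -> patterns -> more examples, repeated."""
    pairs = set(seeds)
    for _ in range(iterations):
        pairs |= apply_patterns(pages, derive_patterns(pages, pairs))
    return pairs

pages = [
    "... Startide Rising by David Brin (2nd ...",
    "... book The Robots of Dawn by Isaac Asimov (19...",
    "... The Time Machine by H.G. Wells (...",
    "... The Lurker at the Threshold by H.P. Lovecraft (...",
]
seeds = {("Isaac Asimov", "The Robots of Dawn"), ("David Brin", "Startide Rising")}
result = bootstrap(pages, seeds)
print(sorted(result))
```

On these four toy pages the two seed pairs yield the single pattern (" by ", " ("), which in turn picks up the Wells and Lovecraft pairs.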
Result: 23M Web pages. 5 examples. 5 iterations. 1 manual filtering. 15,257 pairs with few errors.
What’s New? No tagging: simple examples. (Pattern, Relation) duality: conceptually elegant. Feedback loop: why don’t we use learned examples? Small initial sample. Promising results.
Problems of Feedback Loop: What if there are erroneous examples? Expand to meaningless data?
What Did the Author Do? Manual filtering in the 4th iteration. Stopped after 5 iterations. Specificity factor: |middle| x |prefix| x |suffix| x |urlprefix|; adopt a pattern if it has a long prefix, suffix and/or mid-string. Limit rules to a very specific URL space: the rule includes a URL prefix.
Divergence? Another experiment. Initial examples: baseball team names. Data: newspaper articles. Results: all sports team names. Given a set of examples, where would it converge?
How to Control Divergence? Example → Pattern: require more than k examples. Pattern → Example: require more than k patterns.
Matrix Interpretation: Rows: examples (items); we assume a hypothetical set of all examples occurring in the data. Columns: patterns; we assume a hypothetical set of all patterns that can be derived. Cell[i, j] = 1 iff the j-th pattern matches the i-th example. E.g., Row[i] = (Book of worm, Asimov), Column[j] = #Book by #Author, Cell[i, j] = 1 if “Book of worm by Asimov” exists.
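A small sketch of the matrix view, with the “more than k” thresholds read off as column and row sums. The toy corpus, the second pattern, the second item and the value of k are all invented for illustration; only the (Book of worm, Asimov) row and the “#Book by #Author” column come from the slide.

```python
def build_matrix(items, patterns, corpus):
    """cell[i][j] = 1 iff pattern j, instantiated with item i, occurs in the corpus."""
    matrix = []
    for book, author in items:
        row = []
        for pattern in patterns:
            text = pattern.replace("#Book", book).replace("#Author", author)
            row.append(1 if text in corpus else 0)
        matrix.append(row)
    return matrix

items = [("Book of worm", "Asimov"), ("Star", "Brin")]
patterns = ["#Book by #Author", "#Author wrote #Book"]
corpus = "Book of worm by Asimov. Asimov wrote Book of worm. Star by Brin."

matrix = build_matrix(items, patterns, corpus)
k = 1  # hypothetical threshold

# Column sums: how many examples support each pattern.
examples_per_pattern = [sum(col) for col in zip(*matrix)]
# Row sums: how many patterns match each example.
patterns_per_example = [sum(row) for row in matrix]

kept_patterns = [p for p, n in zip(patterns, examples_per_pattern) if n > k]
kept_items = [it for it, n in zip(items, patterns_per_example) if n > k]
print(matrix, kept_patterns, kept_items)
```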
Matrix Example (figure): rows are items (A, B), (C, D), (C, A), (D, E), (S, L), (N, U), …; columns are patterns.
How to Control Divergence? Example → Pattern: require more than k examples. Pattern → Example: require more than k patterns. Fix the matrix!
How to Change Matrix? Change rows? Filter out noise from the data. Use only the pages mentioning “books”. Classify pages based on word frequency. Identify only the “relevant” part of pages. Identify only the “structured” part of pages: lists? Tables?
How to Change Matrix? Change columns? Use a different pattern language. E.g., the author used “url prefix”. Context-free grammar? What would be a good pattern space?
Fundamental Questions: How to express “patterns” or “rules”? Pattern language. How to go from examples to patterns? Pattern construction algorithm. How to select the right one? Evaluation function.
Pattern Language? Very limited regular expression: [prefix] #book [midstring] #author [suffix], with a URL filter. The URL filter seems to be important to minimize noise.
Pattern Construction Algorithm? 1. Group matching strings based on “mid-string”. 2. Find the longest prefix, suffix and URL-prefix. 3. If the pattern is long enough, adopt it.
Evaluation Function? The longer, the better. Specificity factor: |middle| x |prefix| x |suffix| x |urlprefix|. To minimize noise.
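The three construction steps and the specificity factor can be sketched as follows. The occurrence records (dicts with `url`, `before`, `middle`, `after`), the HTML snippets, the URLs, the threshold value and the rule of skipping groups with a single occurrence are all assumptions made for this illustration.

```python
from collections import defaultdict
from os.path import commonprefix  # character-wise longest common prefix

def common_suffix(strings):
    """Longest string that every input ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]

def construct_patterns(occurrences, min_specificity):
    # Step 1: group matching strings by their mid-string.
    groups = defaultdict(list)
    for occ in occurrences:
        groups[occ["middle"]].append(occ)

    adopted = []
    for middle, group in groups.items():
        if len(group) < 2:  # assumption: one occurrence is not a pattern
            continue
        # Step 2: longest shared context around the pair, and shared URL prefix.
        prefix = common_suffix([o["before"] for o in group])
        suffix = commonprefix([o["after"] for o in group])
        urlprefix = commonprefix([o["url"] for o in group])
        # Step 3: adopt the pattern only if it is specific ("long") enough.
        specificity = len(middle) * len(prefix) * len(suffix) * len(urlprefix)
        if specificity >= min_specificity:
            adopted.append({"prefix": prefix, "middle": middle, "suffix": suffix,
                            "urlprefix": urlprefix, "specificity": specificity})
    return adopted

occurrences = [
    {"url": "http://books.example/sf/a.html", "before": "<li><b>",
     "middle": "</b> by ", "after": " (1983)</li>"},
    {"url": "http://books.example/sf/b.html", "before": "</li><li><b>",
     "middle": "</b> by ", "after": " (1980)</li>"},
    {"url": "http://other.example/x.html", "before": "read ",
     "middle": " written by ", "after": "."},
]
adopted = construct_patterns(occurrences, min_specificity=1000)
print(adopted)
```

Because the factor is a product, a pattern with an empty prefix, suffix or URL prefix scores zero and is never adopted, which is one way “the longer, the better” keeps noise out.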
Dual Question: Regular vs. unstructured source: relatively regular source required. Noisy vs. clean source: general noise okay. Single vs. multiple occurrences: multiple occurrences.
Would It Work? [name, phone number]
Would It Work? [name, phone number]? No: [mid-string] is not fixed. A more expressive pattern language is needed. HTML parse-tree based?
Any Other Questions?
RoadRunner: What is the problem? What is the main idea?
Key Observation: Many Web pages are generated from a structured database. These pages are based on “templates” and thus follow an extremely regular structure. We can extract data by identifying the “different parts”.
Key Idea: Compare two pages. Extract the different parts.
Simplest Case: Page 1: “Books of: John Smith, Title: DB Primer”. Page 2: “Books of: Paul Jones, Title: XML at Work”. Mismatch!
Simplest Case, template: “Books of: …, Title: …” on both pages.
Simplest Case, data: (John Smith, DB Primer), (Paul Jones, XML at Work).
What Other Cases?
Repeated Items (from Amazon)
Missing Items (from Amazon): No image!
Varying Items (from Amazon): Item varies!
Other Cases: Repeated items: the number of items may vary. Missing items: optional. Varying items: multiple choices. How can we express these cases? Pattern language.
Pattern Language: What patterns can express the previous cases? Regular expression? Repeated items (+), optional items (?), varying items ( | ). Why not a context-free grammar? More expressive, but not necessary.
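To make the three operators concrete, here is a hand-written template in Python's regex syntax that uses a list (+), an optional item (?) and a choice ( | ). The page markup, the stock notices and the function name are invented for this sketch; they do not come from the slides or from Amazon.

```python
import re

TEMPLATE = re.compile(
    r"<h1>Books of: (?P<author>[^<]+)</h1>"
    r"(?:<img>)?"                              # optional item: image may be missing
    r"(?P<items>(?:<li>[^<]+</li>)+)"          # repeated items: one or more titles
    r"(?:<p>In stock</p>|<p>Sold out</p>)"     # varying item: one of two choices
)

def extract(page):
    """Return (author, [titles]) if the page fits the template, else None."""
    m = TEMPLATE.fullmatch(page)
    if m is None:
        return None
    titles = re.findall(r"<li>([^<]+)</li>", m["items"])
    return m["author"], titles

page_1 = ("<h1>Books of: John Smith</h1><img>"
          "<li>DB Primer</li><li>XML at Work</li><p>In stock</p>")
page_2 = "<h1>Books of: Paul Jones</h1><li>XML at Work</li><p>Sold out</p>"

print(extract(page_1))  # ('John Smith', ['DB Primer', 'XML at Work'])
print(extract(page_2))  # ('Paul Jones', ['XML at Work'])
```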
One Step Back: What are we doing here? How can we formalize the problem? Given a set of strings (instances), find a regular language/grammar that includes the strings. Grammar inference problem (one of the most important contributions of the paper).
Grammar Inference (figure): T, the set of all possible strings, contains the example strings; many languages cover them. Which one?
Minimal Regular Language: Pick the minimal language. Conservative approach: may minimize bogus tuples. Is it the right choice? May not match the actual semantics. But easier to solve and looks fancy! Do the authors actually pick the minimal language? No. They prefer list over optional, and list is larger than optional.
Why Union-free? Union is ugly: a major source of exponential blow-up. (a|b)(c|d)(e|f)(g|h): 2 x 2 x 2 x 2 = 16 strings. Limited expressive power, but easier to work with.
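The blow-up is easy to see by enumerating the language of (a|b)(c|d)(e|f)(g|h). The snippet below is just an illustration using `itertools.product`; the variable names are arbitrary.

```python
from itertools import product

# One tuple of alternatives per union in (a|b)(c|d)(e|f)(g|h).
choices = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]

expansions = ["".join(combo) for combo in product(*choices)]
print(len(expansions))  # 16, i.e. 2 x 2 x 2 x 2
print(expansions[:4])   # ['aceg', 'aceh', 'acfg', 'acfh']
```

Each additional two-way union doubles the count, so n unions give 2^n strings.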
Pattern Space of RoadRunner: Minimal union-free regular expression. List (+) and optional (?). No choice ( | ). List has precedence over optional, so not exactly minimal.
Language Inference Algorithm: String mismatches: replace with #PCDATA. Tag mismatches: try list first and then optional. Heavily depends on tag mismatches.
String Mismatch: “Books of: John Smith, Title: DB Primer” and “Books of: Paul Jones, Title: XML at Work” generalize to “Books of: #PCDATA, Title: #PCDATA”.
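The string-mismatch step can be sketched by tokenizing two pages into tags and text and comparing them position by position. This handles only the simplest case: the markup is invented (the original tags are not shown in the slide text), the function names are hypothetical, and any tag mismatch is rejected rather than generalized into a list or optional.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text between tags

def tokenize(page):
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]

def generalize_strings(page_a, page_b):
    """Return (template, data_a, data_b), replacing differing strings with #PCDATA."""
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("tag mismatch: not handled in this sketch")
    template, data_a, data_b = [], [], []
    for x, y in zip(a, b):
        if x == y:
            template.append(x)          # shared token: part of the template
        elif x.startswith("<") or y.startswith("<"):
            raise ValueError("tag mismatch: not handled in this sketch")
        else:
            template.append("#PCDATA")  # string mismatch: a data field
            data_a.append(x)
            data_b.append(y)
    return template, data_a, data_b

page_a = "<p>Books of: <b>John Smith</b></p><p>Title: <i>DB Primer</i></p>"
page_b = "<p>Books of: <b>Paul Jones</b></p><p>Title: <i>XML at Work</i></p>"

template, data_a, data_b = generalize_strings(page_a, page_b)
print(" ".join(template))
print(data_a, data_b)
```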
Tag Mismatches: Try to generalize by list. If it does not work, consider optional.
List Identification: Page 1: Title DB Primer 1, Title DB Primer 2, Title DB Primer 3. Page 2: Title XML Primer 1, Title XML Primer 2, then the third item is missing! Search for the previous tag to identify the end of the item. Verify it by matching with the previous item.
Recursive Mismatch: Title DB Primer 1 (1st Edition, 1996); Title DB Primer 2 (1st Edition, … Edition, 2001); Title XML Primer 1 (1st Edition, 1996). Missing! Apply the matching algorithm recursively.
Optional: If list does not work, use optional. For multiple choices, what to choose? Many different choices to consider. The authors do not explain… Some heuristic pruning criteria.
Multiple Choices: Mismatch! Potential wrappers: (( )? )+, (( )? )+, … and many others.
Fundamental Questions: Pattern space: union-free regular expression. Example → pattern algorithm: just described. Evaluation function: supposedly minimal language… but not really; the exact evaluation function is not explained…
Dual Question: Regular vs. unstructured source: very regular. Noisy vs. clean source: very clean. Single vs. multiple occurrences: does not matter.
Limitations: Heavily dependent on HTML tags. Cannot extract data in free text, even if the format is regular, e.g., “John is the author of Great Book”. Very fragile to noise. Of course, limitations from union-free regular expressions: no recursive items: …
Potential Improvements? Consider multiple pages simultaneously: may provide more evidence to select one choice over the other.
One More Consideration: Is Section 3 necessary? Read the paper without Section 3. Is it still as impressive? Generalization and theoretical background study are always helpful to make a paper more “impressive”.