Annotation Free Information Extraction

Presentation on theme: "Annotation Free Information Extraction"— Presentation transcript:

1 Annotation Free Information Extraction
Chia-Hui Chang Department of Computer Science & Information Engineering National Central University 10/4/2002

2 IEPAD: Information Extraction based on Pattern Discovery
C.H. Chang. National Central University WWW10

3 Semi-structured Information Extraction
Information Extraction (IE) Input: Html pages Output: A set of records

4 Pattern Discovery based IE
Motivation Display of multiple records often forms a repeated pattern The occurrences of the pattern are spaced regularly and adjacently Now the problem becomes ... Find regular and adjacent repeats in a string

5 IEPAD Architecture
[Architecture diagram. Components: Pattern Generator, Pattern Viewer, Extractor. Labels: HTML pages, patterns, users, extraction rule, extraction results.]

6 The Pattern Generator
Translator PAT tree construction Pattern validator Rule Composer
[Diagram labels: HTML Page, Token Translator, A Token String, PAT Tree Constructor, PAT trees and Maximal Repeats, Validator, Advanced Patterns, Rule Composer, Extraction Rules]

7 1. Web Page Translation
Encoding of HTML source Rule 1: Each tag is encoded as a token Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
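
The two encoding rules above can be sketched in a few lines of Python. This is only an illustration, not the IEPAD translator itself: the function name `encode_html`, dropping tag attributes, upper-casing tag names, and ignoring whitespace-only text are all assumptions made for the example.

```python
import re

# A tag (anything between < and >) or a run of text between tags.
PIECE = re.compile(r"<[^>]*>|[^<]+")
# Optional closing slash followed by the tag name.
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode_html(html: str) -> list[str]:
    """Translate HTML source into a list of tokens (hypothetical helper)."""
    tokens = []
    for piece in PIECE.findall(html):
        if piece.startswith("<"):
            match = TAG_NAME.match(piece)
            if match:
                # Rule 1: each tag becomes one token; attributes are dropped
                # (assumption), so <a href="x"> is encoded as <A>.
                tokens.append(f"<{match.group(1)}{match.group(2).upper()}>")
            # Comments and doctype declarations are skipped (assumption).
        elif piece.strip():
            # Rule 2: any non-blank text between two tags becomes TEXT ("_").
            tokens.append("_")
    return tokens


example = "<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"
print(encode_html(example))
```

On the slide's example, each of the two records is encoded as the same seven-token sequence, which is exactly the repetition the later steps look for.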

8 Various Encoding Schemes

9 2. PAT Tree Construction
PAT tree: binary suffix tree. A Patricia tree constructed over all possible suffix strings of a text
Example token codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110
Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

10 The Constructed PAT Tree

11 Definition of Maximal Repeats
Let α occur in S at positions p1, p2, p3, …, pk
α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1]
α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
α is a maximal repeat if it is both left maximal and right maximal

12 Finding Maximal Repeats
Definition: Call character S[pi-1] the left character of suffix pi. A node v is left diverse if at least two leaves in v’s subtree have different left characters. Lemma: The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
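
The definition of a maximal repeat can be checked directly on a short string. The sketch below is a brute-force enumeration written for illustration; it deliberately does not build a PAT tree, so it is a stand-in for the tree-based method on the slides, not a reimplementation of it. The function name and the use of `None` as a boundary character are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return {substring (as a tuple): positions} for every maximal repeat.

    Brute force: cubic in len(s), fine for slide-sized examples only.
    """
    n = len(s)
    occurrences = defaultdict(list)
    for start in range(n):
        for end in range(start + 1, n + 1):
            occurrences[tuple(s[start:end])].append(start)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue  # not a repeat
        # None stands for "no character" at either end of the string
        # (assumption); it differs from every real character.
        lefts = {s[p - 1] if p > 0 else None for p in positions}
        rights = {
            s[p + len(sub)] if p + len(sub) < n else None for p in positions
        }
        # Left maximal: two occurrences with different left characters.
        # Right maximal: two occurrences with different right characters.
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = positions
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])
```

In this string, "adc" is a maximal repeat, while "ad" is not, because every occurrence of "ad" is followed by the same character "c".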

13 3. Pattern Validator
Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence. Characteristics of a pattern: Regularity: variance coefficient. Adjacency: density

14 Pattern Validator (Cont.)
Basic Screening: for each maximal repeat α, compute V(α) and D(α)
a) check the pattern’s variance: V(α) < 0.5
b) check the pattern’s density: 0.25 < D(α) < 1.5
If both checks pass, keep pattern α; otherwise discard it
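
A minimal sketch of the basic screening follows. The slides give only the thresholds, not the formulas, so both formulas here are assumptions: V is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D as (k-1)·|α| / (pk - p1). The function names are hypothetical.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, length):
    """Assumed D: tokens covered by k-1 occurrences over the spanned range."""
    k = len(positions)
    return (k - 1) * length / (positions[-1] - positions[0])


def passes_screening(positions, length):
    """Apply the slide's thresholds: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a repeat needs at least two occurrences
    v = variance_coefficient(positions)
    d = density(positions, length)
    return v < 0.5 and 0.25 < d < 1.5


# A 7-token pattern repeated back to back at positions 0, 7, 14, 21.
print(passes_screening([0, 7, 14, 21], 7))
```

Under these assumed formulas, perfectly regular and adjacent occurrences give V = 0 and D = 1, which sits inside both thresholds.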

15 4. Rule Composer
Occurrence partition: flexible variance threshold control
Multiple string alignment: increase density of a pattern

16 Occurrence Partition
Problem: Some patterns are divided into several blocks (ex: Lycos, Excite), with large regularity (variance)
Solution: Clustering of the occurrences of such a pattern. For each cluster P: if V(P) < 0.1, check density; otherwise discard

17 Multiple String Alignment
Problem Patterns with density less than 1 can extract only part of the information Solution Align k-1 substrings among the k occurrences A natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths.

18 Multiple String Alignment (Cont.)
Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”. If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
the extraction pattern can be generalized as “adc[w|x]b[d|-]”
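
Once the occurrences are aligned, the generalization step is a column-by-column merge. The sketch below assumes the alignment has already been computed and is given as equal-length rows with "-" for gaps; the function name `generalize` is hypothetical.

```python
def generalize(aligned_rows):
    """Merge aligned rows into one pattern, column by column."""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("aligned rows must all have the same length")

    parts = []
    for column in zip(*aligned_rows):
        # Keep distinct symbols in first-seen order.
        distinct = list(dict.fromkeys(column))
        if len(distinct) == 1:
            parts.append(distinct[0])
        else:
            # Several alternatives (possibly including the gap "-").
            parts.append("[" + "|".join(distinct) + "]")
    return "".join(parts)


print(generalize(["adcwbd", "adcxb-", "adcxbd"]))  # adc[w|x]b[d|-]
```

A column with a single symbol is kept as is; a column with several symbols becomes a bracketed list of alternatives, which is how the gap in the second row turns into the optional "[d|-]".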

19 Pattern Viewer Java-application based GUI Web based GUI

20 The Extractor Matching the pattern against the encoding token string
Knuth-Morris-Pratt’s algorithm Boyer-Moore’s algorithm Alternatives in a rule matching the longest pattern What are extracted? The whole record
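
As an illustration of the matching step, here is a small Knuth-Morris-Pratt search over a list of tokens rather than characters. It handles only a fixed pattern; rules with alternatives such as "[w|x]" are outside this sketch. The function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def find_all(tokens, pattern):
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens matched so far
    for i, token in enumerate(tokens):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going, allowing overlapping matches
    return hits


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(find_all(record * 2, record))  # [0, 7]
```

Each hit marks the start of one record in the encoded token string; mapping the token positions back to the original HTML text is left out here.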

21 Experiment Setup Fourteen sources: search engines Performance measures
Number of patterns Retrieval rate and Accuracy rate Parameters Encoding scheme Thresholds control

22 Translation Average page length is 22.7KB

23 Accuracy and Retrieval Rate

24 Problems Guarantee high retrieval rate instead of accuracy rate
Generalized rule can extract more than the desired data Only applicable when there are several records in a Web page, currently

25 ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo VLDB2001

26 Observations 1. Wrapper generator works by using additional information. (labeled samples) 2. Wrapper induction system has some a priori knowledge about the page organization. 3. Finally, systems generate wrapper by examining one HTML page at a time.

27 ROADRUNNER new perspective
1. Don’t rely on any interaction with the user. (Completely automatic) 2. No a priori knowledge HTML schema will be inferred along with wrapper. Can handle any nested structures. 3. Works with two HTML pages at a time. (based on the study of similarities and dissimilarities between the pages)

28

29 Theoretical Background
Site generation = Encoding of database content Data extraction = Decoding The problem is based on a close correspondence between nested types and union-free regular expressions (UFRE).

30 Delimiter
#PCDATA : maps to strings
+ : maps to (possibly nested) lists, acting as the iterator
? : maps to nullable fields (optional patterns)
Finding the schema and extracting the data = finding the minimal UFRE.

31 Matching Technique
It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract). HTML → XHTML → tokens. The matching algorithm works on two objects: a list of tokens, called the sample, and a wrapper (one UFRE). Matching is done by solving mismatches between the wrapper and the sample.

32

33 Mismatches
1. String mismatches: May be due only to different values of a database field. These mismatches are used to discover fields (#PCDATA). Ex: ‘John Smith’ and ‘Paul Jones’ at token 4
2. Tag mismatches: Optional patterns, Iterative patterns
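
Field discovery from string mismatches can be sketched as follows. This covers only the simplest case and is not the ACME algorithm: it assumes the wrapper and the sample have the same number of tokens and differ only in text tokens, and it raises an error on any tag mismatch instead of searching for optionals or iterators. The token lists and function names are made up for illustration.

```python
def is_tag(token):
    """Treat any token starting with '<' as a tag (assumption)."""
    return token.startswith("<")


def generalize_strings(wrapper, sample):
    """Replace every string mismatch with a #PCDATA field."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles equal-length token lists")
    generalized = []
    for w, s in zip(wrapper, sample):
        if w == s:
            generalized.append(w)
        elif not is_tag(w) and not is_tag(s):
            # String mismatch: two different values of the same field.
            generalized.append("#PCDATA")
        else:
            # Tag mismatch: needs optional/iterator discovery, not shown.
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return generalized


wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>"]
print(generalize_strings(wrapper, sample))
```

Text that is identical in both pages ("Books of:") stays in the wrapper as template text, while the differing names collapse into a single field.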

34 Discovering Optionals
Strategy: Look for repeated patterns as a first step and then, if this attempt fails, try to identify an optional pattern. Two steps: 1. Optional Pattern Location by Cross-Search: mismatch at token 6 - <UL> and <IMG…/>. Assume the optional pattern is located on either the wrapper or the sample. 2. Wrapper Generalization: ( <IMG src=…/> )?

35 Discovering Iterators
1. Square Location by Terminal-Tag Search: Both the wrapper and the sample contain at least one occurrence of the square. Terminal tag = the tag at the position before the mismatch; in this example it is </LI>. Test which is the square's initial tag: </UL> ~ </LI> vs. <LI> ~ </LI>. Finally, we can infer that the sample contains one candidate occurrence of the square at token

36 Discovering Iterators (con’t)
2. Square Matching : Try to match the candidate square occurrence (tokens 20-25). Backwards : matching token 25 and 19, then moves to 24 and 18 and so on. 3. Wrapper Generalization : If we denote the newly found square by s, we replace the repeated pattern by (s)+
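
The wrapper generalization step alone (step 3) can be sketched as a simple collapse of consecutive copies of a known square into (s)+. Locating the square and the backward matching of steps 1 and 2 are not shown; the function name and the token lists are hypothetical.

```python
def collapse_square(tokens, square):
    """Replace each run of consecutive copies of `square` with '(s)+'."""
    if not square:
        raise ValueError("square must contain at least one token")
    n = len(square)
    label = "(" + " ".join(square) + ")+"
    out = []
    i = 0
    while i < len(tokens):
        j = i
        # Advance over as many back-to-back copies of the square as exist.
        while tokens[j:j + n] == square:
            j += n
        if j > i:
            out.append(label)
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out


page = ["<UL>", "<LI>", "_", "</LI>", "<LI>", "_", "</LI>", "</UL>"]
print(collapse_square(page, ["<LI>", "_", "</LI>"]))
```

Note that this sketch also rewrites a single occurrence of the square as (s)+, which is consistent with "+" meaning one or more repetitions.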

37 More Complex Example
First mismatch at token 15 (external mismatch)
Find iterators: terminal tag = </LI>. Candidate square is found: <LI> ~ </LI> at tokens 15-28
Backward match: second mismatch at tokens 23 and 9 (internal mismatch) → solve the mismatch recursively

38 Recursively solve mismatch
Internal mismatch at tokens 23 and 9. Solve it the same way as an external mismatch, but instead of comparing one wrapper and one sample, compare two different portions of the same object. Terminal tag = <B> Candidate square is </B>~<B> token 23-18 Backward match: mismatch at tokens 20 and 26. The mismatching token is found to be an optional pattern.

39

40 Matching as an AND-OR tree
Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree. (i) match(w,s) must solve all external mismatches encountered during the parsing (AND node) (ii) each mismatch is solved by introducing either one field, one iterator, or one optional (OR) (iii) the search may be done on either the wrapper or the sample (OR) (iv) there may be various candidate iterators and optionals (OR) (v) discovering an iterator may require recursively solving several internal mismatches (AND)

41 AND-OR tree

42 Experimental Results

43 Experimental Results (con’t)

44 Extracting Structured Data from Web Pages
Arvind Arasu, Hector Garcia-Molina ACM SIGMOD 2003

45 Cue Keywords: schema, template
Web pages belonging to the same site are generated by encoding data of the same schema with a common template => a common template by plugging-in value

46 Figuration

47 Goal and Challenge
Previous IE techniques rely on human-supplied heuristics (ex. wrappers): time consuming and error-prone; optional attributes are ignored
Goal: to deduce the template without human input
Challenge: no obvious way of differentiating what text is template and what is data; the schema of data in pages isn’t flat but more complex and semi-structured

48 Model, Problem Formulation
Structured Data Model of Page Creation Optionals and Disjunctions Problem Statement Miscellaneous Terminology, Definition

49 Structured Data
Token: A token is some basic unit of text
Structured Data: any set of data values conforming to a common schema or type
Define “Type”: 1. Basic Type (β): string of tokens, e.g. <html>, text 2. Ordered List Type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types 3. Set Type: set constructor, e.g. {T}, where T is a type

50 Define term value and example
Define “instance”: 1. an instance of the basic type β is a string of tokens 2. an instance of type <T1, T2, …, Tn> is any tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn 3. an instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T Instance → Value; String → token Example: Schema S1= Value =

51 Model of Page Creation
Definition: A template T for a schema S (written TS) is a function that maps each type constructor τ of S to an ordered set of strings T(τ), such that: if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings; if τ is a set constructor, T(τ) is a single string Sτ. λ(T, x): the encoding of a value x that is an instance of a sub-schema of S

52 Encoding of a value x
1. if x is of basic type β, then λ(T, x) = x
2. if x = <x1, x2, …, xn>τt, then λ(T, x) = C1 λ(T, x1) C2 … λ(T, xn) Cn+1, where T(τt) = <C1, …, Cn+1>
3. if x = {e1, e2, …, em}τs, then λ(T, x) = λ(T, e1) S λ(T, e2) S … S λ(T, em), where T(τs) = S
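
The three encoding rules translate almost directly into a recursive function. The sketch below is an illustration under simplifying assumptions: template strings and basic values are plain Python strings rather than token sequences, and the class names `TupleValue` and `SetValue`, the constructor labels, and the sample template are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TupleValue:
    ctor: str    # name of the tuple constructor, e.g. "t1"
    items: list  # the n component values x1..xn


@dataclass
class SetValue:
    ctor: str    # name of the set constructor, e.g. "s1"
    elems: list  # the m elements e1..em


def encode_value(template, x):
    """Compute the encoding λ(T, x) of value x under template T."""
    if isinstance(x, str):
        return x  # rule 1: a basic value encodes as itself
    if isinstance(x, TupleValue):
        strings = template[x.ctor]  # C1..Cn+1
        if len(strings) != len(x.items) + 1:
            raise ValueError("a tuple of order n needs n+1 template strings")
        out = strings[0]
        for item, following in zip(x.items, strings[1:]):
            out += encode_value(template, item) + following  # rule 2
        return out
    if isinstance(x, SetValue):
        separator = template[x.ctor]  # the single string S
        return separator.join(encode_value(template, e) for e in x.elems)  # rule 3
    raise TypeError(f"unsupported value: {x!r}")


template = {
    "t1": ["<b>", "</b><ol>", "</ol>"],
    "s1": "<hr>",
    "t2": ["<li>", "</li>"],
}
book = TupleValue("t1", [
    "Databases",
    SetValue("s1", [TupleValue("t2", ["Ann"]), TupleValue("t2", ["Bob"])]),
])
print(encode_value(template, book))
```

The EXTRACT problem discussed later is the inverse of this function: given only pages produced this way, recover both the template and the values.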

53 Example of Schema S1

54 Optionals and Disjunctions
If T is a type, the optional type (T)? ≡ {T}τ with |τ| = 0 or 1. Disjunction: if T1 and T2 are types, the disjunction type (T1 | T2) ≡ <{T1}τ1, {T2}τ2>τ with |τ1| + |τ2| = 1

55 Problem Statement
EXTRACT problem: given a set of n pages pi = λ(T, xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1, …, xn}, deduce the template T and the values {x1, …, xn} from the set of pages alone

56 Example of correct solution of EXTRACT (cont.)

57 Example of correct solution of EXTRACT (cont.)

58 Miscellaneous Terminology, Definition
An occurrence of a token in a template is called a template-token. An occurrence of a token in a value is called a value-token. An occurrence of a token in a page is called a page-token. Two page-tokens in the input pages Pe have the same role iff they have been generated by the same template-token

59 Overview Approach - EXALG
(ECGM)

60 EXALG - ECGM – FINDEQ (step2)
The module used to compute equivalence classes (ε): sets of tokens having the same frequency of occurrence in every page of Pe. Ex. εe1: { <html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html> } Ex. εe3: { <li>, Reviewer, Rating, Text, </li> } EXALG retains only the equivalence classes that are Large and Frequently occurring (LFEQs)
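
Grouping tokens by their per-page occurrence counts can be sketched as below. This is a simplified illustration, not the FINDEQ module: it works on raw tokens without any role differentiation, and treating the size and support thresholds as inclusive (>=) is an assumption. The function names and sample pages are made up.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens whose occurrence vectors over all pages are identical.

    `pages` is a list of token lists. Returns {occurrence vector: tokens}.
    """
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        classes[vector].add(token)
    return dict(classes)


def large_and_frequent(classes, size_thres, sup_thres):
    """Keep classes with enough tokens (size) and enough pages (support)."""
    kept = {}
    for vector, tokens in classes.items():
        support = sum(1 for count in vector if count > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            kept[vector] = tokens
    return kept


pages = [
    ["<html>", "<li>", "a", "</li>", "</html>"],
    ["<html>", "<li>", "b", "</li>", "<li>", "c", "</li>", "</html>"],
]
classes = equivalence_classes(pages)
print(large_and_frequent(classes, size_thres=2, sup_thres=2))
```

In this toy input the page-level tags share the vector (1, 1) and the list-item tags share (1, 2), while the data tokens fall into small or low-support classes and are filtered out.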

61 EXALG - ECGM – HANDINV (step3)
The module used to detect and remove invalid LFEQs – those that are not formed by tokens associated with a type constructor

62 DIFFFORM (step1) and DIFFEQ (step4)
The modules used to add more tokens to LFEQs by “differentiating” roles. Ex. Name has multiple roles: one occurrence is in Book Name and the other in Reviewer Name. Differentiating the multiple roles: the occurrences lie on different paths from the root in the HTML parse tree (DIFFFORM); the occurrences lie in different positions with respect to an LFEQ such as εe1 (DIFFEQ). dtoken: ex. Name5 and Name14 are regarded as different tokens NameA and NameB

63 Review ECGM
1. Find dtokens from paths in the HTML parse tree
2. Find LFEQs
3. Detect and remove invalid LFEQs
4. Find dtokens from positions in valid LFEQs

64 Example After ECGM Process
εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> } 8 →13 εe3: { <li>, <b>, Reviewer, Name, </b>, <b>, Rating, </b>, <b>, Text, </b>, </li>} 5 →12 Position: empty and non-empty

65 Construct Schema from ECGM
Construct schema S’ from εe1
The 1st non-empty position is of Basic Type β
The 2nd non-empty position is εe3, generated by set type constructor τe3
→ T(τe1) = <C11, C12, C13>, S’ = <β, {S”}τe2>τe1
→ T(τe2) = S” = <C31, C32, C33, C34>
→ T(τe3) = <C31, C32, C33, C34>, <β, β, β>τe3
→ S’ = <β, {<β, β, β>τe3}τe2>τe1

66 Equivalence Classes (Cont.)
Pages P = { p1, … , pn } , pi = λ(TS, xi) TS = {τ1, … , τk }: type constructor Definition: All tokens of equivalence class have the same occurrence vector ex. εe1: <1,1,1,1>; εe3: <1,2,1,0> Observation1 : Tokens associated with the same type constructor τj in T that have unique-roles occur in the same equivalence class. (used to decide EQ valid or not) Support of token: #(page contain) Size of EQ class: #(token of EQ)

67 Equivalence Classes (Cont.)
Observation2: for real pages, an equivalence class of large size and support is usually valid. Properties of an EQ class <t1, … , tm>: Ordered. Nested: for a pair of classes, either all occurrences of one lie within some fixed position p of the other, or their spans do not overlap. Observation3: A valid equivalence class is ordered and a pair of two valid equivalence classes is nested

68 Handling Invalid Equivalence classes
Detect the existence of invalid LFEQs using violations of the ordered and nested properties. If any are found, discard some of the LFEQs and break others into smaller LFEQs. Differentiating roles of tokens: By path: different roles of a token lie on different paths of the HTML parse tree. By position: different roles of a token are located at different (non-empty) positions

69 Equivalence Class Generation Module
OUTPUT: the set of LFEQs of dtokens, and the pages represented as strings of dtokens. FINDEQ: two parameters are used to select LFEQs (SIZETHRES, SUPTHRES). On the running example: SIZETHRES = SUPTHRES = 3, iteration = 2; εe1 and εe3 are found

70 Building Template and Extracting Values
Input to this module is {ε1, ε2, … , εm}. The ANALYSIS stage consists of 2 modules: CONSTTEMP and EXVAL. CONSTTEMP, for εi = { d1, d2, … , dl }: starts from the basic class ε1 = { <html>, <body>, … , </body>, </html> }; recursively constructs a template Tεi corresponding to εi, and a template Tεi,p corresponding to each non-empty position p of εi; checks whether the set of strings PosString(εi, p) for that position has some recognizable pattern

71 Example
In the running example, PosString(εe1+, 6) is a string of dtokens for every occurrence of εe1+, which matches Pattern 5 of the table; PosString(εe1+, 10) is always a string of 0 or more occurrences of εe3+, which matches Pattern 1. εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> }

72 Assumption
The 4 assumptions:
(A1) A large number of tokens occurring in the template have unique roles
(A2) The EQ class derived from a type constructor is recognized as an LFEQ
(A3) Irregularity in encoded data that leads to invalid EQ classes
(A4) The separators are around data values. In this model, strings associated with type constructors are at non-empty positions

73 Evaluation
Leaf attribute Am in schema Sm
Correct: the set of values of Am in the page is equal to the set of extracted values Ae in the page
Partially correct: the set of values of Am in the page is not equal to the set of extracted values Ae, but appears as part of the values of Ae
Incorrect: neither correct nor partially correct

74 Result
For 18 (40%) of the input collections, the system correctly extracted all the attributes. Around 80% of the attributes were extracted correctly. Normalized average Input size <=10 Parameter = 3

75 Conclusion
EXALG uses two novel concepts, equivalence classes and differentiating roles, to discover the template. The impact of failed assumptions is limited to a few attributes. Future work: Develop techniques for crawling, indexing, and providing querying support for the structured pages in the web. Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template

76 References
C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW2001, pp
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001,
Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD2003,

