Wrapping Semistructured Web Pages with Finite-State Transducers Chun-Nan Hsu and Ming-Tzung Dung Department of Computer Science & Engineering Arizona State.


1 Wrapping Semistructured Web Pages with Finite-State Transducers Chun-Nan Hsu and Ming-Tzung Dung Department of Computer Science & Engineering Arizona State University Tempe, AZ, USA

2 Information Integration Systems need wrappers
[Figure: layered architecture of an information integration service. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) hold unprocessed, unintegrated details. A translation and wrapping layer (SQL, ORB, Wrapper) sits above the sources. A mediator layer performs semantic integration and mediation and delivers abstracted information to human and computer users. User services: query, monitor, update. Agent/module coordination.]

3 Web wrappers
• Web wrappers wrap...
  – "Query-able" or "search-able" Web sites
  – Web pages with large itemized lists
• The primary issues are:
  – How can the contents of a Web page be translated (or extracted) into machine-understandable data?
  – How can the extractor be built quickly? Can it be learned?

4 Free Text Extraction vs. Semistructured Text Extraction
• Example: extract the attributes job title, employer, and phone number from a job item
• Free text extraction can depend on natural-language (NL) knowledge
  – "The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details."
• Semistructured text extraction depends on appearance and regularity instead
  – "Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555"

5 Wrapper representations in previous work
• Shopbot (Doorenbos, Etzioni, Weld, AA-97), Ariadne (Ashish, Knoblock, CoopIS-97), WIEN (Kushmerick, Weld, IJCAI-97)...
• Delimiter-based, linear finite-state transducers

For i = 1 to k
    skip through the input string until the delimiter at the beginning of attribute Ai is located
    extract Ai until the delimiter at the end of attribute Ai is located

[Figure: linear transducer over attributes A1, A2, A3, A4, with transitions marked "extract" and "skip"]
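
The skip-and-extract loop above can be sketched in Python. This is only an illustration of a delimiter-based linear extractor, not code from any of the cited systems. The function name `extract_linear`, the sample HTML, and the delimiter pairs are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def extract_linear(text: str, delimiters: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Extract attributes A1..Ak in one fixed order.

    Each attribute has a (begin, end) delimiter pair. Returns None as soon
    as any delimiter cannot be found.
    """
    values = []
    pos = 0
    for begin, end in delimiters:
        # Skip through the input until the begin delimiter of this attribute.
        start = text.find(begin, pos)
        if start == -1:
            return None
        start += len(begin)
        # Extract until the end delimiter of this attribute.
        stop = text.find(end, start)
        if stop == -1:
            return None
        values.append(text[start:stop])
        pos = stop
    return values


# Hypothetical markup for a faculty item: URL, name, academic title.
item = '<LI><A HREF="chandy.html">Mani Chandy</A>, <I>Professor</I>'
delims = [('HREF="', '"'), ('">', '</A>'), ('<I>', '</I>')]
print(extract_linear(item, delims))
```

Because the loop visits the attributes in one fixed order, an item that lacks the `<I>...</I>` part makes the whole extraction return `None`.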

6 Situations where previous work fails
• Missing attributes
  – e.g., a faculty member may not have an administrative title
• Multiple attribute values
  – e.g., a faculty member may have two administrative titles
• Variant attribute permutations
  – e.g., (U,N,A,M), (U,N,M,A)...
• Exceptions and typos

7 Why does previous work fail?
• The one-attribute-permutation assumption
• The use of delimiters
  – In many cases, delimiters prevent the wrapper from recognizing different attribute permutations
  – How can state and zip code be extracted from "CA90210"? In such cases there are no delimiters at all.
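
The "CA90210" case can be handled by describing the context on each side of the border instead of looking for a delimiter string. The sketch below uses zero-width regular-expression assertions for this. The pattern and the name `split_state_zip` are assumptions chosen for the example, not part of the presented system.

```python
import re
from typing import Optional, Tuple

# Zero-width border: two capital letters on the left, five digits on the right.
STATE_ZIP_BORDER = re.compile(r"(?<=[A-Z]{2})(?=\d{5})")


def split_state_zip(text: str) -> Optional[Tuple[str, str]]:
    """Return (state, zip) if a state/zip border is found, else None."""
    match = STATE_ZIP_BORDER.search(text)
    if match is None:
        return None
    border = match.start()
    return text[border - 2:border], text[border:border + 5]


print(split_state_zip("CA90210"))
```

No character of the input is consumed as a delimiter here. The border is found purely from what lies to its left and right.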

8 Example
Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science
David E. Breen, Assistant Director of Computer Graphics Laboratory

9 U (URL), N (Name), A (Academic title), M (Admin title)

10 SoftMealy wrapper representation
Key features:
• Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path
• Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes (new)

11 Advantages of SoftMealy wrapper representation
• Expressive enough to tolerate Web pages with the four problems:
  – missing attributes
  – multiple attribute values
  – variant attribute permutations
  – exceptions and typos
• Polynomially learnable
• Retains extraction efficiency

12 Basic building blocks of SoftMealy
• Token: a segment of the input string
  – e.g., HTML tags, punctuation symbols, words
• Separator: the invisible borderline between two tokens
• Dummy attribute: a substring we want to skip; if it follows attribute k, it is denoted -k
• Contextual rules: characterize the context of a class of separators that separate two adjacent attributes (including dummy attributes)
  – Each consists of a left context and a right context
  – Can be disjunctive
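
A minimal sketch of tokens and separators, assuming a simple tokenizer that yields HTML tags, words, and single punctuation symbols and drops whitespace. The slides do not specify the actual tokenization, so the regular expression and the sample markup are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed token classes: an HTML tag, a word, or one punctuation symbol.
TOKEN_RE = re.compile(r"<[^>]*>|\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split the input string into tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


def separators(tokens: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return every border position as (tokens to the left, tokens to the right)."""
    return [(tokens[:i], tokens[i:]) for i in range(len(tokens) + 1)]


tokens = tokenize('<A HREF="chandy.html">Mani Chandy</A>, <I>Professor</I>')
print(tokens)
print(len(separators(tokens)))
```

A separator carries no characters of its own. It is described only by the tokens on its left and right, which is what a contextual rule inspects.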

13 Example of tokens and separators
Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science
David E. Breen, Assistant Director of Computer Graphics Laboratory
[Figure: the example items split into tokens, with the borders between them marked as "useless separator" or "useful separator"]

14 Example of a contextual rule
Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science
David E. Breen, Assistant Director of Computer Graphics Laboratory
Contextual rule for the separator between -N and A:
  left: ", " or ", "
  right: any word token with an initial capital
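
One possible encoding of a disjunctive contextual rule is sketched below. Each disjunct pairs a left context with a right context, and the rule holds at a separator if any disjunct matches. The class names, the whitespace-free token list, and the second disjunct (a comma followed by an `<I>` tag) are assumptions for illustration. The exact left contexts on the slide are not recoverable from this transcript.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Matcher = Callable[[str], bool]


def literal(expected: str) -> Matcher:
    """Match one specific token."""
    return lambda token: token == expected


def cap_word(token: str) -> bool:
    """Match any word token with an initial capital."""
    return token.isalpha() and token[0].isupper()


@dataclass
class Context:
    left: List[Matcher]   # matchers for the tokens just left of the separator
    right: List[Matcher]  # matchers for the tokens just right of the separator

    def matches(self, tokens: Sequence[str], i: int) -> bool:
        # Separator i lies between tokens[i - 1] and tokens[i].
        if i < len(self.left) or i + len(self.right) > len(tokens):
            return False
        left_tokens = tokens[i - len(self.left):i]
        right_tokens = tokens[i:i + len(self.right)]
        return (all(m(t) for m, t in zip(self.left, left_tokens))
                and all(m(t) for m, t in zip(self.right, right_tokens)))


@dataclass
class ContextualRule:
    disjuncts: List[Context]

    def matches(self, tokens: Sequence[str], i: int) -> bool:
        return any(d.matches(tokens, i) for d in self.disjuncts)


# Hypothetical rule for the border between -N and A.
rule = ContextualRule([
    Context(left=[literal(",")], right=[cap_word]),
    Context(left=[literal(","), literal("<I>")], right=[cap_word]),
])

tokens = ["Mani", "Chandy", ",", "Professor", "of", "Computer", "Science"]
hits = [i for i in range(len(tokens) + 1) if rule.matches(tokens, i)]
print(hits)
```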

15 Finite-state transducer
• Input: separator instances
• Output: strings
• States: an initial state b, a final state e, and one state for each attribute and each dummy attribute
• Edges: (i, r, o, j) denotes a state transition from i to j, taken when the input separator instance satisfies contextual rule r, with output string o
  – o = empty when we want to skip
  – o = the next token when we want to extract
  – i and j cannot both be dummy attributes
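
The transducer definition above can be sketched as a small interpreter. This is a simplified reading, not the authors' implementation. Rules see only one token on each side of the separator. The output o is implied by the target state (the next token for an attribute state, nothing for a dummy state). The transducer stays in its current state when no edge fires, and the final state e is omitted. The rule functions and the `EDGES` table are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A rule inspects the token left of the separator (None at the start)
# and the token right of it.
Rule = Callable[[Optional[str], str], bool]
Edge = Tuple[Rule, str]  # (contextual rule r, target state j)


def is_dummy(state: str) -> bool:
    return state.startswith("-")


def run_fst(edges: Dict[str, List[Edge]], tokens: List[str],
            start: str = "b") -> List[Tuple[str, str]]:
    """Feed each separator to the transducer and collect (attribute, string) pairs."""
    state = start
    out: List[Tuple[str, List[str]]] = []
    for p, right in enumerate(tokens):
        left = tokens[p - 1] if p > 0 else None
        for rule, target in edges.get(state, []):
            if rule(left, right):
                state = target
                if not is_dummy(state):
                    out.append((state, []))  # entering an attribute starts a new value
                break
        if state != start and not is_dummy(state):
            out[-1][1].append(right)  # extract: output the next token
    return [(attr, " ".join(toks)) for attr, toks in out]


def starts_item(left: Optional[str], right: str) -> bool:
    return left is None and right[:1].isupper()


def at_comma(left: Optional[str], right: str) -> bool:
    return right == ","


def after_comma_cap(left: Optional[str], right: str) -> bool:
    return left == "," and right[:1].isupper()


EDGES: Dict[str, List[Edge]] = {
    "b": [(starts_item, "N")],
    "N": [(at_comma, "-N")],
    "-N": [(after_comma_cap, "A")],
}

tokens = ["Mani", "Chandy", ",", "Professor", "of", "Computer", "Science"]
print(run_fst(EDGES, tokens))
```

Adding further edges out of `-N` (for example to an `M` state) would let the same transducer accept items with a different attribute permutation.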

16 Example FST
[Figure: FST with states b, U, -U, N, -N, A, -A, M, and e; edges are marked "extract" or "skip"]

17 Expressiveness of SoftMealy
• SoftMealy can deal with
  – missing attributes
  – multiple attribute values
  – variant attribute permutations
• SoftMealy can deal with exceptions and typos
• SoftMealy subsumes the wrapper classes in (Kushmerick, Ph.D. thesis, U of WA, 1997)
• SoftMealy can wrap nested sources

18 Example of nested sources
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Shopbot
  2.2 Ariadne
  2.3 WIEN
Chapter 3 SoftMealy Wrapper Representation
  3.1 Representation
    3.1.1 Tokens and Separators
    3.1.2 Contextual Rules
  3.2 Expressiveness Analysis
Chapter 4 Learning SoftMealy Wrappers

19 FST for nested sources
[Figure: FST with states b, chapter, section, subsection, and e]
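
A rough sketch of how a transducer with chapter, section, and subsection states could label the lines of a nested outline. The edges of the original figure are not recoverable from the transcript, so the transition constraint used here (a line may go at most one level deeper than the previous one) and the heading patterns are assumptions.

```python
import re
from typing import List, Tuple

# Assumed heading patterns, most specific first.
RULES = [
    ("subsection", re.compile(r"^\d+\.\d+\.\d+\s")),
    ("section", re.compile(r"^\d+\.\d+\s")),
    ("chapter", re.compile(r"^Chapter\s+\d+\s")),
]
DEPTH = {"b": 0, "chapter": 1, "section": 2, "subsection": 3}


def classify(line: str) -> str:
    """Pick the state whose heading pattern matches the line."""
    for state, pattern in RULES:
        if pattern.match(line):
            return state
    raise ValueError(f"no rule matches: {line!r}")


def wrap_nested(lines: List[str]) -> List[Tuple[str, str]]:
    """Label each line with the state the transducer is in after reading it."""
    state = "b"
    labeled = []
    for line in lines:
        target = classify(line)
        # Assumed edge set: never descend more than one level at a time.
        if DEPTH[target] > DEPTH[state] + 1:
            raise ValueError(f"no edge from {state} to {target}")
        state = target
        labeled.append((state, line))
    return labeled


outline = [
    "Chapter 2 Related Work",
    "2.1 Shopbot",
    "2.2 Ariadne",
    "Chapter 3 SoftMealy Wrapper Representation",
    "3.1 Representation",
    "3.1.1 Tokens and Separators",
    "3.2 Expressiveness Analysis",
]
for state, line in wrap_nested(outline):
    print(state, "|", line)
```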

20 Learnability of SoftMealy
• How difficult is it (that is, how many example items must be seen) to learn a correct graph structure of a SoftMealy FST that covers all attribute permutations?
• PAC model: given k attributes, SoftMealy
• Represent each attribute permutation as a linear FST: (multiple attribute values not allowed)

21 Learnability of SoftMealy (continued)
• Multinomial model: how many training items are needed so that, with probability greater than 0.95, there is at least one instance of each attribute permutation?
  – Let ub be the upper bound on the number of items needed
  – Let [symbol missing] be the number of attribute permutations
  – For each permutation j, let p_j be the probability that the attribute permutation of a randomly selected item is j
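
To give a feel for the multinomial question, the sketch below uses the standard union bound: after n independent items, the probability that some permutation is still unseen is at most the sum over j of (1 - p_j)^n. This is not necessarily the bound ub derived in the talk. The function names and the example probabilities are assumptions.

```python
from typing import Sequence


def miss_bound(probs: Sequence[float], n: int) -> float:
    """Union bound on P(at least one permutation unseen after n items)."""
    return sum((1.0 - p) ** n for p in probs)


def items_needed(probs: Sequence[float], confidence: float = 0.95) -> int:
    """Smallest n for which the union bound guarantees the given confidence."""
    if any(p <= 0.0 for p in probs):
        raise ValueError("every permutation needs a positive probability")
    n = 1
    while miss_bound(probs, n) > 1.0 - confidence:
        n += 1
    return n


# Two equally likely permutations versus one common and one rare permutation.
print(items_needed([0.5, 0.5]))
print(items_needed([0.9, 0.1]))
```

The rare permutation dominates the count: the less likely a permutation is, the more items must be drawn before it is likely to appear.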

22 Learning SoftMealy Wrappers: a simple algorithm
• Input: the attributes to be extracted, and example Web pages in which some items are labeled
• Output: a SoftMealy wrapper
• Algorithm:
  1. Create states according to the given attributes
  2. Create edges according to the attribute permutations of the example items
  3. For each edge, collect the corresponding separator instances (as positive examples)
  4. Generalize the separator instances into contextual rules
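
The four steps above can be sketched as follows, assuming each labeled item is given as a list of (state label, tokens) segments. The generalization step shown here, which replaces words by coarse token classes, is a stand-in chosen for illustration. The transcript does not describe the actual generalization procedure, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Segment = Tuple[str, List[str]]  # (state label, tokens of that segment)
Item = List[Segment]
Separator = Tuple[Optional[str], Optional[str]]  # (left token, right token)
EdgeKey = Tuple[str, str]


def token_class(token: str) -> str:
    """Assumed generalization: abstract words and numbers, keep the rest literal."""
    if token.isalpha():
        return "CapWord" if token[0].isupper() else "LowerWord"
    if token.isdigit():
        return "Number"
    return token


def learn_structure(items: List[Item]) -> Tuple[Set[str], Dict[EdgeKey, List[Separator]]]:
    """Steps 1-3: create states and edges, and collect separator instances per edge."""
    states = {"b", "e"}
    instances: Dict[EdgeKey, List[Separator]] = defaultdict(list)
    for item in items:
        prev_state, prev_token = "b", None
        for state, tokens in item:
            states.add(state)
            instances[(prev_state, state)].append((prev_token, tokens[0]))
            prev_state, prev_token = state, tokens[-1]
        instances[(prev_state, "e")].append((prev_token, None))
    return states, dict(instances)


def generalize(instances: Dict[EdgeKey, List[Separator]]) -> Dict[EdgeKey, Set[Separator]]:
    """Step 4: turn separator instances into disjuncts over token classes."""
    def cls(token: Optional[str]) -> Optional[str]:
        return None if token is None else token_class(token)
    return {edge: {(cls(left), cls(right)) for left, right in seps}
            for edge, seps in instances.items()}


items = [
    [("N", ["Mani", "Chandy"]), ("-N", [","]), ("A", ["Professor"]),
     ("-A", ["and"]), ("M", ["Executive", "Officer"])],
    [("N", ["David", "E", ".", "Breen"]), ("-N", [","]), ("M", ["Assistant", "Director"])],
]
states, instances = learn_structure(items)
rules = generalize(instances)
for edge in sorted(rules):
    print(edge, sorted(rules[edge], key=str))
```

In this sketch the edges from -N to A and from -N to M end up with the same generalized rule, so the learned transducer is ambiguous at that separator.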

23 Experimental results on expressiveness
• Wrapped 30 hand-coded CS faculty Web pages, randomly selected from the cra.org list
  – SoftMealy successfully wraps all of them
  – Number of distinct attribute permutations in the sample pages: up to 13, 2.63 on average
  – The number of training items used is roughly linear in the number of edges (separator classes)
  – The number of disjuncts learned is also roughly linear in the number of edges

24 Generalizing over unseen pages
• ASU directory (www.asu.edu/asuweb/directory): 28 known distinct attribute permutations
• Randomly selected 11 output pages; the largest one serves as the test page and the other 10 are used for training
  – The test page contains 69 items and 17 permutations
  – The training pages contain a total of 85 items and 18 permutations
  – Only 7 permutations are shared between the test page and the training pages
• Trained the system using the training pages in ascending order of size
  – Labeled a total of 15 items
  – Achieved 87% coverage on the test page

25 Future work
• Learning algorithms that use negative examples
• Determinization, disambiguation, and minimization of learned FSTs
• Robustness of wrappers

26 Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
Chun-Nan Hsu, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Copyright © Chun-Nan Hsu, all rights reserved
Prepared for presentation at the AAAI-98 Workshop on AI and Information Integration, Madison, Wisconsin, USA, July 26, 1998

