1
Information Extraction on the Web. Chia-Hui Chang, Department of Computer Science & Information Engineering, National Central University. chia@csie.ncu.edu.tw
2
Outline: What is information extraction? Document types. Applications. Wrapper induction. Automatic wrapper generator. Conclusions.
3
What's information extraction? An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically. Example: a parser takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete.
4
Modules. Text zoner: turns a text into a set of text segments. Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes. Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
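The three modules above can be chained as a small cascade. The following is a minimal sketch, not the design of any particular system: the blank-line zoning, the two lexical attributes, and the keyword-based relevance test are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

def text_zoner(text):
    """Turn a text into segments (assumption: segments are separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences; each sentence is a list of (word, attributes)."""
    sentences = []
    for raw in re.split(r"(?<=[.!?])\s+", segment):
        words = re.findall(r"\w+", raw)
        if words:
            sentences.append(
                [(w, {"capitalized": w[0].isupper(), "numeric": w.isdigit()})
                 for w in words]
            )
    return sentences

def sentence_filter(sentences, keywords):
    """Keep only sentences that contain at least one keyword (a toy relevance test)."""
    return [s for s in sentences if any(w.lower() in keywords for w, _ in s)]

def run_cascade(text, keywords):
    """Apply zoner, preprocessor, and filter in sequence."""
    sentences = []
    for segment in text_zoner(text):
        sentences.extend(preprocessor(segment))
    return sentence_filter(sentences, keywords)

if __name__ == "__main__":
    sample = ("The department has a faculty position opening. Please call us.\n\n"
              "The weather is nice today.")
    for sentence in run_cascade(sample, {"position"}):
        print([word for word, _ in sentence])
```

Each stage adds structure (segments, sentences, lexical items) and the filter discards material, which mirrors the "add structure, lose information" description of the cascade.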
5
Document types. Plain text (sentence after sentence, straightforward narration): uses lexical and semantic analysis. AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95). Web page (semi-structured document): uses features of HTML syntax, i.e. tags. Heuristics obtained by observation: layout.
6
Applications. Meta search engines. Information agents: oriented toward a specific purpose, e.g. news agents (news spiders) that gather news, comparison shopping, job hunting. ShopBot (Doorenbos 97), Software LEGO (Hsu 99).
7
Information Integration Systems. [Diagram: heterogeneous data sources (text, images/video, spreadsheets; hierarchical & network databases; relational databases; object & knowledge bases) hold unprocessed, unintegrated details. Translation and wrapping (SQL, ORB, wrapper) and semantic integration and mediation (mediators, agent/module coordination) turn these into abstracted information. The information integration service delivers it to human & computer users. User services: query, monitor, update.]
8
What is a wrapper? A wrapper is an extracting program that extracts desired information from Web pages. Semi-structured document → wrapper → structured information.
9
Web Wrappers. Web wrappers wrap "query-able" or "search-able" Web sites and Web pages with large itemized lists. The primary issue is: how to build the extractor quickly?
10
Free Text Extraction vs. Semi-structured Text Extraction. Example: extract the attributes job title, employer, and phone number from a job item list. Free text extraction can depend on natural-language knowledge: "The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details." Semi-structured text extraction depends on appearance and regularity: "Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555"
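To make the contrast concrete, here is a minimal sketch of extraction that relies only on the appearance and regularity of the semi-structured item. The pattern (title up to the first comma, employer up to ". Call", then a phone number) is an assumption read off the single example above, not a rule taken from any of the systems discussed.

```python
import re

# Assumed layout: "<title>, <employer>. Call <phone>"
ITEM = re.compile(
    r"^(?P<title>[^,]+),\s*"
    r"(?P<employer>.+?)\.\s*Call\s+"
    r"(?P<phone>\(\d{3}\)\d{3}-\d{4})"
)

def extract_job(item):
    """Return the three attributes as a dict, or None if the layout does not match."""
    match = ITEM.search(item)
    return match.groupdict() if match else None

if __name__ == "__main__":
    semi = ("Faculty position, department of computer science, "
            "Cranberry Lemon University. Call (555)333-5555")
    free = ("The department of computer science at Cranberry Lemon University "
            "has a faculty position opening. Please call (555)333-5555 for more details.")
    print(extract_job(semi))
    print(extract_job(free))  # the layout rule does not apply to free text
```

The same rule returns nothing for the free-text sentence, which is why free-text extraction has to fall back on linguistic knowledge instead of layout.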
11
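Wrapper Representations: delimiter-based finite state automata. [Figure: a sample page titled "Some Country Codes" listing Congo 242, Egypt 20, Belize 501, Spain 34, and an automaton with states 1 to 4 whose transitions alternate between skip and extract.]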
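A delimiter-based automaton of this kind can be sketched as a four-state loop that alternates between skipping and extracting. The HTML markup of the country-code page is not shown on the slide, so the `<B>`/`<I>` tags, the state table, and the function name below are assumptions made for illustration.

```python
# Each state: (action, delimiter that ends the state, next state).
# Assumed markup: country in <B>...</B>, code in <I>...</I>.
STATES = {
    1: ("skip", "<B>", 2),
    2: ("extract", "</B>", 3),
    3: ("skip", "<I>", 4),
    4: ("extract", "</I>", 1),
}

def run_automaton(page, states=STATES, start=1):
    """Walk the page, collecting the text consumed in 'extract' states."""
    extracted, state, pos = [], start, 0
    while True:
        action, delimiter, next_state = states[state]
        end = page.find(delimiter, pos)
        if end < 0:          # delimiter no longer found: stop
            break
        if action == "extract":
            extracted.append(page[pos:end])
        pos = end + len(delimiter)
        state = next_state
    # Two attributes per tuple: (country, code)
    return list(zip(extracted[0::2], extracted[1::2]))

if __name__ == "__main__":
    page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
            "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
            "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
            "</BODY></HTML>")
    print(run_automaton(page))
```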
12
Related Work. ShopBot: Doorenbos, Etzioni, Weld, AA-97. Ariadne: Ashish, Knoblock, CoopIS-97. WIEN: Kushmerick, Weld, IJCAI-97.
13
Related Work (Cont.). SoftMealy wrapper representation: Hsu, IJCAI-99. STALKER: Muslea, Minton, Knoblock, AA-99, a hierarchical FST. IEPAD: Chang, WWW01.
14
WIEN. HLRT (Head-Left-Right-Tail) wrappers. Labeling: by PageOracle and LabelOracle. PAC analysis. Extracts 48% of Web pages successfully. Weaknesses: missing attributes, attributes not in order, tabular data, etc.
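An HLRT wrapper can be executed roughly as follows: skip past the head, then repeatedly extract one tuple with the left/right delimiters as long as the next left delimiter appears before the tail. This is a simplified sketch in the spirit of the HLRT description, not WIEN's code; the sample page, the delimiter values, and the function name `exec_hlrt` are assumptions.

```python
def exec_hlrt(page, head, tail, delimiters):
    """Run an HLRT wrapper: head, tail, and one (left, right) pair per attribute."""
    rows = []
    pos = page.find(head)
    if pos < 0:
        return rows
    pos += len(head)                        # skip the page head
    first_left = delimiters[0][0]
    while True:
        tail_pos = page.find(tail, pos)
        if tail_pos < 0:
            tail_pos = len(page)
        next_left = page.find(first_left, pos)
        if next_left < 0 or next_left > tail_pos:
            break                           # the next tuple would lie in the tail
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows

if __name__ == "__main__":
    page = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
            "<B>Some Country Codes</B><P>"
            "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
            "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
            "<HR><B>End</B></BODY></HTML>")
    print(exec_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
```

The head and tail keep the bold page title and the bold "End" marker from being mistaken for country names. Because the attribute loop always expects every attribute in a fixed order, a missing or reordered attribute breaks the extraction, which is the weakness listed above.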
15
SoftMealy. Chun-Nan Hsu, 1998, Arizona State University
16
SoftMealy: Finite-State Transducers for Semi-Structured Text Mining. Labeling: examples are labeled manually through an interface. FST (Finite-State Transducer): single-pass and multi-pass.
17
SoftMealy wrapper representation. Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path. Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
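A contextual rule can be pictured as a pair of patterns: one for the text immediately to the left of a separator and one for the text immediately to the right. The sketch below is an illustration of that idea only, not SoftMealy's rule language; the regular expressions, the hypothetical faculty-list items, and the function names are assumptions.

```python
import re

def find_separator(text, left_ctx, right_ctx, start=0):
    """Return the first position p >= start where left_ctx matches the text
    ending at p and right_ctx matches the text beginning at p, else -1."""
    left = re.compile(r"(?:" + left_ctx + r")\Z")
    right = re.compile(right_ctx)
    for p in range(start, len(text) + 1):
        if right.match(text, p) and left.search(text[:p]):
            return p
    return -1

def extract(text, begin_rule, end_rule):
    """Extract the attribute lying between a begin separator and an end separator."""
    begin = find_separator(text, *begin_rule)
    if begin < 0:
        return None
    end = find_separator(text, *end_rule, start=begin)
    return text[begin:end] if end >= 0 else None

# Hypothetical contextual rules: (left context, right context).
# The name N starts after "<LI>" or after an anchor tag, at a capital letter.
N_BEGIN = (r"<LI>|<A [^>]*>", r"[A-Z]")
N_END = (r"\w", r"</A>|,")
# The academic title A starts after ", <I>" and ends before "</I>".
A_BEGIN = (r",\s*<I>", r"[A-Z]")
A_END = (r"\w", r"</I>")

if __name__ == "__main__":
    with_link = '<LI><A HREF="ann.html">Ann Lee</A>, <I>Professor</I>'
    without_link = "<LI>Bob Ray, <I>Lecturer</I>"
    for item in (with_link, without_link):
        print(extract(item, N_BEGIN, N_END), "|", extract(item, A_BEGIN, A_END))
```

Because the begin rule for N lists two alternative left contexts, the same rule covers items with and without a URL, which a single fixed delimiter could not do.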
18
Example
19
4 cases. Output
20
Finite State Transducer. [Diagram: states b, U, -U, N, -N, A, -A, M, e, with transitions labeled extract or skip.] Additionally handles two more cases: (N, M) and (N, A, M).
21
Find the starting position (single pass). Newly added definitions.
22
Taxonomy Tree
23
STALKER. Muslea, Minton, Knoblock, AA-99. A hierarchical FST.
24
STALKER. "STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources". AAAI-98, Muslea. The Embedded Catalog description is a tree-like structure.
25
EC Tree of a page
26
Multi-Pass or Hierarchical Wrapper. First extract the body, then extract the tuples. Pass 1: extract U. Pass 2: extract N. Pass 3: extract A. Pass 4: extract M.
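The multi-pass scheme can be sketched as nested extraction: body first, then tuples, then one independent pass per attribute. The page layout, the four regular expressions, and the reading of U, N, A, M as URL, name, academic title, and administrative title are assumptions for this illustration; none of them comes from the slides' actual rules.

```python
import re

# One hypothetical rule per attribute; each pass looks at the whole tuple on its own.
RULES = [
    ("U", re.compile(r'<A HREF="([^"]*)">')),
    ("N", re.compile(r"^(?:<A [^>]*>)?([^<,]+)")),
    ("A", re.compile(r",\s*<I>([^<]*)</I>")),
    ("M", re.compile(r"\band\s+<I>([^<]*)</I>")),
]

def extract_body(page):
    """First level: isolate the list that holds the tuples."""
    match = re.search(r"<UL>(.*?)</UL>", page, re.S)
    return match.group(1) if match else ""

def extract_tuples(body):
    """Second level: split the body into one string per tuple."""
    return [t.strip() for t in body.split("<LI>") if t.strip()]

def multi_pass(page):
    """Third level: run one pass per attribute over every tuple."""
    tuples = extract_tuples(extract_body(page))
    records = [{} for _ in tuples]
    for name, rule in RULES:
        for record, item in zip(records, tuples):
            match = rule.search(item)
            record[name] = match.group(1) if match else None
    return records

if __name__ == "__main__":
    page = """<HTML><BODY><H1>Faculty</H1><UL>
<LI><A HREF="ann.html">Ann Lee</A>, <I>Professor</I> and <I>Dean</I>
<LI>Bob Ray, <I>Lecturer</I>
</UL></BODY></HTML>"""
    for record in multi_pass(page):
        print(record)
```

Since each pass searches the tuple independently, a missing attribute simply yields `None` and the order in which attributes appear does not matter, at the cost of scanning every tuple once per attribute.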
27
Rule Generating. 1st iteration: terminals: {; reservation _Symbol_ _Word_}; candidates: {; _Symbol_ _HtmlTag_}; perfect disjunct: {_HtmlTag_}; positive examples: D3, D4. 2nd iteration: uncovered: {D1, D2}; candidates: {; _Symbol_}. Extract Credit info.
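The two iterations above follow a sequential-covering pattern: pick a disjunct that is perfect on the remaining examples, remove the examples it covers, and repeat. The sketch below is a heavily simplified illustration restricted to single-landmark SkipTo rules; it omits STALKER's landmark refinement entirely, and the four training documents D1 to D4, the tokenizer, and all function names are hypothetical.

```python
import re

def tokenize(text):
    """Split into HTML tags, words, and single punctuation symbols."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", text)

def matches(token, landmark):
    """A landmark is either a wildcard class or a literal token."""
    if landmark == "_HtmlTag_":
        return re.fullmatch(r"<[^>]+>", token) is not None
    if landmark == "_Symbol_":
        return re.fullmatch(r"[^\w\s]", token) is not None
    if landmark == "_Word_":
        return re.fullmatch(r"\w+", token) is not None
    return token == landmark

def skip_to(tokens, landmark):
    """Index just after the first token matching the landmark, or None."""
    for i, token in enumerate(tokens):
        if matches(token, landmark):
            return i + 1
    return None

def learn_disjunctive_rule(examples, candidates):
    """examples: list of (tokens, index where the target item starts)."""
    uncovered, rule = list(examples), []
    while uncovered:
        best, best_cover = None, []
        for cand in candidates:
            results = [(ex, skip_to(ex[0], cand)) for ex in uncovered]
            # Perfect disjunct: on every example it is either correct or does not match.
            if any(pos is not None and pos != ex[1] for ex, pos in results):
                continue
            cover = [ex for ex, pos in results if pos == ex[1]]
            if len(cover) > len(best_cover):
                best, best_cover = cand, cover
        if best is None:
            break
        rule.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_cover]
    return rule

def apply_rule(tokens, rule):
    """Try the disjuncts in order; the first one that matches decides."""
    for landmark in rule:
        pos = skip_to(tokens, landmark)
        if pos is not None:
            return pos
    return None

if __name__ == "__main__":
    docs = {  # hypothetical documents; the credit-card name is the target
        "D1": ("reservation ; Visa", 2),
        "D2": ("reservation ; MasterCard", 2),
        "D3": ("Parking ; reservation <b> Visa", 4),
        "D4": ("Lunch ; reservation <i> Amex", 4),
    }
    examples = [(tokenize(text), target) for text, target in docs.values()]
    candidates = [";", "reservation", "_Symbol_", "_Word_", "_HtmlTag_"]
    print(learn_disjunctive_rule(examples, candidates))
```

On these made-up examples the first iteration can only accept `_HtmlTag_` (it is correct on D3 and D4 and does not match D1 or D2), and the second iteration covers the remaining D1 and D2 with `;`, mirroring the two steps listed above.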
28
Possible Rules
31
Features. The process is performed in a hierarchical manner. There is no attributes-not-in-order problem. Using disjunctive rules can handle missing attributes.
32
Comparison. Both: can handle irregular missing attributes; attributes not seen before require training. Single-pass: the allowed attribute permutations are limited; good for tabular pages; faster. Multi-pass: attribute permutations have no effect; good for tagged-list pages; slower.
33
Comparison.
Quote Server. Stalker: 10 example tuples, 79%, test set of 500. WIEN: the collection is beyond the learner's capability. SoftMealy: multi-pass 85%, single-pass 97%.
Internet Address Finder. Stalker: 80% to 100%, test set of 500. WIEN: the collection is beyond the learner's capability. SoftMealy: multi-pass 68%, single-pass 41%.
34
Comparison.
Okra (tabular pages). Stalker: 97%, 1 example tuple. WIEN: 100%, 13 example tuples, 30 test. SoftMealy: single-pass 100%, 1 example tuple, 30 test.
Big-book (tagged-list pages). Stalker: 97%, 8 example tuples. WIEN: perfect, 18 example tuples, 30 test. SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test.