Information Extraction on the Web Chia-Hui Chang Department of Computer Science & Information Engineering National Central University

chia@csie.ncu.edu.tw

Outline
What is information extraction?
Document types
Applications
Wrapper induction
Automatic wrapper generator
Conclusions

What's information extraction?
An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information (hopefully irrelevant information) by applying rules that are acquired manually and/or automatically.
Example: a parser takes as input a sequence of lexical items and perhaps small-scale structures (phrases), and outputs a set of parse tree fragments, possibly complete.

Modules
Text zoner: turns a text into a set of text segments.
Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes.
Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones.
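The three modules above can be sketched as a small cascade. This is only an illustration under stated assumptions: the function names, the blank-line segmentation, the punctuation-based sentence split, the two lexical attributes, and the keyword filter are all hypothetical choices, not something the slides specify.

```python
import re

def text_zoner(text):
    """Turn a text into text segments (assumption: blank lines separate segments)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]

def preprocessor(segment):
    """Turn a segment into sentences, each a list of lexical items.

    A lexical item here is a word plus two illustrative attributes.
    """
    sentences = []
    for raw in re.split(r"(?<=[.!?])\s+", segment):
        items = [
            {"word": w, "capitalized": w[0].isupper(), "numeric": w.isdigit()}
            for w in re.findall(r"\w+", raw)
        ]
        if items:
            sentences.append(items)
    return sentences

def sentence_filter(sentences, keywords):
    """Keep only sentences that contain at least one keyword (hypothetical relevance test)."""
    return [s for s in sentences if any(item["word"].lower() in keywords for item in s)]

def cascade(text, keywords):
    """Run the modules in order: zoner, preprocessor, filter."""
    sentences = [s for seg in text_zoner(text) for s in preprocessor(seg)]
    return sentence_filter(sentences, keywords)

SAMPLE = "Weather was fine. The university has a faculty position.\n\nCall 5553335555 now."
kept = cascade(SAMPLE, {"position", "call"})
```

Each stage adds structure (segments, sentences, attributed words) and the filter discards material judged irrelevant, which mirrors the "add structure, lose information" description of the cascade.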

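Document types
Plain text (sentence after sentence, in plain narrative): extraction relies on lexical and semantic analysis. Systems: AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95).
Web pages (semi-structured documents): extraction relies on HTML syntax, i.e. tags, and on heuristics observed from the page layout.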

Applications
Meta search engines
Information agents: goal-oriented, for example news agents (news spiders) that gather news, comparison shopping, and job hunting.
Examples: ShopBot (Doorenbos 97), Software LEGO (Hsu 99).

Information Integration Systems
[Figure: layered architecture of an information integration service. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) hold unprocessed, unintegrated details. They are reached through SQL, ORB, and wrapper interfaces (translation and wrapping), and mediators perform semantic integration and mediation to deliver abstracted information to human and computer users. User services: query, monitor, update. Agent/module coordination.]

What is a wrapper?
A wrapper is an extraction program that extracts desired information from Web pages.
Semi-structured document → wrapper → structured information

Web Wrappers
Web wrappers wrap:
"Query-able" or "search-able" Web sites
Web pages with large itemized lists
The primary issue is: how to build the extractor quickly?

Free Text Extraction vs. Semi-structured Text Extraction
Example: extract three attributes (job title, employer, and phone number) from a job item list.
Free text extraction can depend on natural-language knowledge: "The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details."
Semi-structured text extraction depends on appearance and regularity: "Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555"
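A minimal sketch of what "depending on appearance and regularity" can mean for the semi-structured job item quoted above. The function name, the phone pattern, and the rule that the first comma-separated field is the job title while the rest is the employer are assumptions made for illustration.

```python
import re

# Assumption: phone numbers look like (555)333-5555.
PHONE = re.compile(r"\(\d{3}\)\d{3}-\d{4}")

def extract_job_item(item):
    """Extract job title, employer, and phone from one comma-delimited job item.

    Relies only on surface regularities: field order, commas, and the
    literal ". Call" that precedes the phone number.
    """
    phone = PHONE.search(item)
    head = item.split(". Call")[0]
    fields = [f.strip() for f in head.split(",")]
    return {
        "job_title": fields[0],
        "employer": ", ".join(fields[1:]),
        "phone": phone.group(0) if phone else None,
    }

ITEM = ("Faculty position, department of computer science, "
        "Cranberry Lemon University. Call (555)333-5555")
record = extract_job_item(ITEM)
```

The same comma-based rule would not work on the free-text sentence, where the job title and employer are embedded in a grammatical clause; only the phone pattern carries over.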

Wrapper Representations
Delimiter-based finite state automata
Example page "Some Country Codes": Congo 242, Egypt 20, Belize 501, Spain 34
[Figure: automaton with states 1 to 4 whose transitions are labeled extract and skip.]
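A delimiter-based wrapper for the country-code page can be sketched as a loop that alternates between skipping to a left delimiter and extracting up to a right delimiter. The HTML markup below (`<B>` around the country, `<I>` around the code) is an assumption, since the slide text only preserves the page content; `lr_extract` is a hypothetical name.

```python
def lr_extract(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    For each attribute: skip to the left delimiter, then extract
    everything up to the right delimiter. Stops when a delimiter
    can no longer be found.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)        # skip
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)       # extract
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))

# Assumed markup for the "Some Country Codes" page.
PAGE = ("<HTML><TITLE>Some Country Codes</TITLE><BODY>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "</BODY></HTML>")

codes = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
```

The skip and extract steps correspond to the two kinds of transitions in the automaton; each pass through the inner loop walks one cycle of it.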

Related Work
ShopBot: Doorenbos, Etzioni, Weld, AA-97
Ariadne: Ashish, Knoblock, CoopIS-97
WIEN: Kushmerick, Weld, IJCAI-97

Related Work (Cont.)
SoftMealy wrapper representation: Hsu, IJCAI-99
STALKER, a hierarchical FST: Muslea, Minton, Knoblock, AA-99
IEPAD: Chang, WWW01

WIEN
HLRT (Head-Left-Right-Tail) wrappers
Labeling: by PageOracle and LabelOracle
PAC analysis
Successfully extracts from 48% of Web pages.
Weaknesses: missing attributes, attributes out of order, tabular data, etc.
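An HLRT wrapper can be read as a left/right delimiter wrapper that is confined to the region between a head delimiter and a tail delimiter. The sketch below assumes that reading; the function name, the page markup, and the chosen delimiters are hypothetical, and this is not WIEN's learning procedure, only the execution of an already-given wrapper.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Run left/right extraction only between the head and tail delimiters."""
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)
    end = page.find(tail, pos)
    if end == -1:
        end = len(page)
    rows = []
    while True:
        row = []
        for left, right in delimiters:
            i = page.find(left, pos, end)   # search is bounded by the tail
            if i == -1:
                return rows
            i += len(left)
            j = page.find(right, i, end)
            if j == -1:
                return rows
            row.append(page[i:j])
            pos = j + len(right)
        rows.append(tuple(row))

# Assumed markup: a bold title in the head and a bold "End" in the tail
# would confuse a wrapper that only knows the <B>/<I> delimiters.
PAGE = ("<HTML><BODY><B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")

rows = hlrt_extract(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

Because every attribute is still located by a fixed delimiter pair in a fixed order, this form cannot cope with a missing attribute or a different attribute order, which matches the weaknesses listed on the slide.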

SoftMealy
Chun-Nan Hsu, 1998
Arizona State University

SoftMealy: Finite-State Transducers for Semi-Structured Text Mining
Labeling: examples are labeled manually through an interface.
FST (finite-state transducer): single-pass or multi-pass

SoftMealy wrapper representation
Uses a finite-state transducer in which each distinct attribute permutation is encoded as a successful path.
Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
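The idea of paths through a transducer can be illustrated with a heavily simplified sketch. The attribute names U, N, A, M are taken from the later slides; reading them as URL, name, academic title, and administrative title is an assumption, as are the item markup and every regular expression below. Real SoftMealy contextual rules are learned and are richer than these hand-written patterns.

```python
import re

# Each transition: (source state, contextual rule for the separator, target state).
# "b" is the begin state and "e" the end state.
TRANSITIONS = [
    ("b", re.compile(r'<LI><A HREF="'), "U"),
    ("b", re.compile(r"<LI>"), "N"),
    ("U", re.compile(r'">'), "N"),
    ("N", re.compile(r"(?:</A>)?, <I>"), "A"),
    ("N", re.compile(r"(?:</A>)?$"), "e"),
    ("A", re.compile(r"</I> and <I>"), "M"),
    ("A", re.compile(r"</I>$"), "e"),
    ("M", re.compile(r"</I>$"), "e"),
]

def run_fst(item):
    """Follow one path through the transducer and return the extracted attributes."""
    state, pos, out = "b", 0, {}
    while state != "e":
        matches = []
        for order, (src, rule, dst) in enumerate(TRANSITIONS):
            if src == state:
                m = rule.search(item, pos)
                if m:
                    matches.append((m.start(), order, m.end(), dst))
        if not matches:
            raise ValueError(f"no transition from state {state!r} at position {pos}")
        # Take the separator that occurs earliest; ties go to the rule listed first.
        start, _, end, dst = min(matches)
        if state != "b":
            out[state] = item[pos:start]   # extract the attribute for this state
        state, pos = dst, end              # skip over the separator
    return out

full = run_fst('<LI><A HREF="mani.html">Mani Chandy</A>, <I>Professor</I> and <I>Executive Officer</I>')
no_url = run_fst('<LI>Jo Kim, <I>Lecturer</I> and <I>Dean</I>')
name_only = run_fst('<LI><A HREF="x.html">Amy Lin</A>')
```

Each of the three items follows a different successful path (U, N, A, M; N, A, M; U, N), so missing attributes are handled by the transition structure rather than by a fixed delimiter sequence.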

Example

Four cases; output

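Finite State Transducer
[Figure: transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled extract and skip.]
It additionally handles two cases: (N, M) and (N, A, M).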

Find the starting position -- Single Pass
Newly added definitions

Taxonomy Tree

STALKER
Muslea, Minton, Knoblock, AA-99
A Hierarchical FST

STALKER
"STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources". AAAI-98, Muslea.
The Embedded Catalog description is a tree-like structure.

EC Tree of a page

Multi-Pass or Hierarchical Wrapper
First extract the body, then extract the tuples.
Pass 1: extract U
Pass 2: extract N
Pass 3: extract A
Pass 4: extract M
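The body, tuples, then per-attribute passes described above can be sketched as follows. The page markup, the landmark pairs, and the helper names are assumptions for illustration; only two attributes (N and A) are used to keep the example short.

```python
import re

def between(text, left, right):
    """Return every substring found between the `left` and `right` landmarks."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, text, flags=re.S)

def multi_pass_extract(page, attribute_rules):
    """Extract the body, then the tuples, then run one pass per attribute.

    Each attribute pass scans the whole tuple on its own, so the order
    in which attributes appear inside a tuple does not matter.
    """
    body = between(page, "<UL>", "</UL>")[0]
    records = []
    for tup in between(body, "<LI>", "</LI>"):
        record = {}
        for name, (left, right) in attribute_rules.items():
            found = between(tup, left, right)
            record[name] = found[0] if found else None   # missing attribute
        records.append(record)
    return records

# Assumed markup: names in <B>, titles in <I>, in varying order.
PAGE = ("<HTML><H1>Staff</H1><UL>"
        "<LI><B>Ann Lee</B> <I>Professor</I></LI>"
        "<LI><I>Lecturer</I> <B>Bob Wu</B></LI>"
        "<LI><B>Cy Ng</B></LI>"
        "</UL></HTML>")

records = multi_pass_extract(PAGE, {"N": ("<B>", "</B>"), "A": ("<I>", "</I>")})
```

The cost of this independence is that every tuple is scanned once per attribute, which is one plausible reason a multi-pass wrapper runs slower than a single-pass one.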

Rule Generating (extracting the Credit info)
1st iteration: terminals: {; reservation _Symbol_ _Word_}; candidates: {; _Symbol_ _HtmlTag_}; perfect disjunct: {_HtmlTag_}, covering positive examples D3 and D4
2nd iteration: uncovered examples: {D1, D2}; candidates: {; _Symbol_}
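The two iterations above follow a sequential-covering pattern: pick a candidate that is correct on some uncovered examples and wrong on none, remove the examples it covers, and repeat. The sketch below shows only that loop. The four example strings standing in for D1 to D4, the regular expressions behind `_HtmlTag_` and `_Symbol_`, and the function names are all hypothetical, and STALKER's landmark refinement step is omitted.

```python
import re

# Assumed definitions of the candidate landmarks.
LANDMARKS = {
    "_HtmlTag_": r"<[^>]+>",
    "_Symbol_": r"[^\w\s]",
    ";": r";",
}

def skip_to(landmark, text):
    """Position just after the first occurrence of the landmark, or None."""
    m = re.search(LANDMARKS[landmark], text)
    return m.end() if m else None

def learn_disjunctive_rule(examples, candidates):
    """Greedy sequential covering over (text, target_start) examples."""
    uncovered, rule = list(examples), []
    while uncovered:
        best = None
        for name in candidates:
            correct = [ex for ex in uncovered if skip_to(name, ex[0]) == ex[1]]
            wrong = [ex for ex in uncovered if skip_to(name, ex[0]) not in (None, ex[1])]
            # A "perfect" disjunct is right on some examples and wrong on none.
            if correct and not wrong and (best is None or len(correct) > len(best[1])):
                best = (name, correct)
        if best is None:
            break
        rule.append(best[0])
        uncovered = [ex for ex in uncovered if ex not in best[1]]
    return rule

def apply_rule(rule, text):
    """Apply the disjuncts in order; the first one that matches wins."""
    for name in rule:
        pos = skip_to(name, text)
        if pos is not None:
            return pos
    return None

def example(text, marker):
    """Build a labeled example whose target starts right after `marker`."""
    return (text, text.index(marker) + len(marker))

EXAMPLES = [
    example("Open daily; Visa, MC", ";"),        # stands in for D1
    example("Reservations ok; Amex", ";"),       # stands in for D2
    example("Lunch only <i>Visa</i>", "<i>"),    # stands in for D3
    example("Dinner <b>MC, Amex</b>", "<b>"),    # stands in for D4
]

rule = learn_disjunctive_rule(EXAMPLES, ["_HtmlTag_", "_Symbol_", ";"])
```

On this toy data the first iteration selects `_HtmlTag_`, which covers the two tagged examples, and the second iteration covers the remaining two with `_Symbol_`, echoing the shape of the slide's trace.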

Possible Rules

Features
Extraction is performed in a hierarchical manner.
Attributes appearing out of order are not a problem.
Disjunctive rules handle missing attributes.

Comparison
Both: can handle irregular, missing attributes; attributes not seen before require training.
Single-pass: only a limited set of attribute permutations is allowed; good for tabular pages; faster.
Multi-pass: attribute permutations have no effect; good for tagged-list pages; slower.

Comparison
Quote Server
STALKER: 79%, 10 example tuples, test set of 500
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 85%, single-pass 97%
Internet Address Finder
STALKER: 80% ~ 100%, test set of 500
WIEN: the collection is beyond the learner's capability
SoftMealy: multi-pass 68%, single-pass 41%

Comparison
Okra (tabular pages)
STALKER: 97%, 1 example tuple
WIEN: 100%, 13 example tuples, test set of 30
SoftMealy: single-pass 100%, 1 example tuple, test set of 30
Big-book (tagged-list pages)
STALKER: 97%, 8 example tuples
WIEN: perfect, 18 example tuples, test set of 30
SoftMealy: single-pass 97%, 4 examples, test set of 30; multi-pass 100%, 6 examples, test set of 30
