
1 The Semantic Web - Week 22: Information Extraction and Integration (continued)
Module Website: http://scom.hud.ac.uk/scomtlm/chs2533
Practical this week: http://www.isi.edu/info-agents/

2 Recap
- Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
- Information Agents are capable of retrieving information from some web sites via database-like queries (such as required in the example above) and of integrating information from web sites to solve complex queries.
- They use ‘similarity-based’ machine learning techniques to learn/extract meaning from traditional web page content.

3 Induction algorithm
[Diagram labels: SEED, INSTANCE SPACE, GENERALISATION SPACE]

4 Induction algorithm - continued
[Diagram labels: SEED, INSTANCE SPACE, GENERALISATION SPACE]

5 Induction algorithm - continued
[Diagram labels: NEW SEED, INSTANCE SPACE, GENERALISATION SPACE]
Hypothesis = V (a disjunction)

6 Induction algorithm – abstract example
[Diagram labels: INSTANCE SPACE, GENERALISATION SPACE, SEED]
Expressions shown: a&b&f, a&b&c, e&c&a, a&f

7 Induction algorithm – abstract example
[Diagram labels: INSTANCE SPACE, GENERALISATION SPACE]
Expressions shown: a&b&f, a&b&c, e&c&a, a&b, b&c, a&c, a&f
VERSION SPACE (all expressions that cover all +exs and no –exs) IS EMPTY

8 Induction algorithm – abstract example
[Diagram labels: INSTANCE SPACE, GENERALISATION SPACE]
Expressions shown: a&b&f, a&b&c, e&c&a, a&b, b&c, a&c, a&f, a
VERSION SPACE (all expressions that cover all +exs and no –exs) IS EMPTY
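The empty version space can be reproduced with a small brute-force sketch. The transcript does not preserve which instances are positive or negative, so the labelling below is purely an assumption: the three instances a&b&f, a&b&c and e&c&a are treated as positive, and a hypothetical negative instance a&d is added. The names `covers` and `conjunctive_version_space` are illustrative.

```python
from itertools import combinations

# ASSUMPTION: the labelling is not recoverable from the slides.
# The three slide instances are taken as positive; a&d is a made-up negative.
POSITIVES = [frozenset("abf"), frozenset("abc"), frozenset("eca")]
NEGATIVES = [frozenset("ad")]


def covers(conjunction, instance):
    """A conjunction covers an instance if every feature it names is present."""
    return conjunction <= instance


def conjunctive_version_space(positives, negatives):
    """Enumerate every non-empty conjunction of features that covers all
    positives and no negatives (brute force, fine for a handful of features)."""
    features = sorted(set().union(*positives, *negatives))
    space = []
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            hypothesis = frozenset(combo)
            if all(covers(hypothesis, p) for p in positives) and not any(
                covers(hypothesis, n) for n in negatives
            ):
                space.append(hypothesis)
    return space


print(conjunctive_version_space(POSITIVES, NEGATIVES))  # prints []
```

Under this assumed labelling the only conjunction shared by all positives is `a`, and `a` also covers the negative, so no purely conjunctive hypothesis survives.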

9 Induction algorithm – disjunction + ¬
[Diagram labels: INSTANCE SPACE, GENERALISATION SPACE]
Expressions shown: a&b&f, a&b&c, e&c&a, a&f, a&b&f V a&b&c V e&c&a, ¬a V ¬f, b V c, a&¬f
VERSION SPACE: all expressions that cover all +exs and no –exs
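Allowing disjunction makes a seed-by-seed covering loop possible: pick an uncovered positive as the seed, generalise it as far as the negatives allow, then repeat with a new seed. The sketch below handles disjunction only (not negation) and rests on an assumed labelling, since the slides' own labelling is not recoverable: the three instances are positive and a&d is a hypothetical negative. The function names are illustrative.

```python
from itertools import combinations

# ASSUMPTION: labelling invented for illustration (see lead-in).
POSITIVES = [frozenset("abf"), frozenset("abc"), frozenset("eca")]
NEGATIVES = [frozenset("ad")]


def covers(conjunction, instance):
    """A conjunction covers an instance if every feature it names is present."""
    return conjunction <= instance


def most_general_consistent(seed, negatives):
    """Smallest subset of the seed's features that covers no negative."""
    for size in range(1, len(seed) + 1):
        for combo in combinations(sorted(seed), size):
            candidate = frozenset(combo)
            if not any(covers(candidate, n) for n in negatives):
                return candidate
    return None  # the seed itself is covered by a negative


def learn_disjunction(positives, negatives):
    """Covering loop: one conjunction per seed until all positives are covered."""
    uncovered = list(positives)
    disjuncts = []
    while uncovered:
        seed = uncovered[0]
        rule = most_general_consistent(seed, negatives)
        if rule is None:
            raise ValueError("seed cannot be separated from the negatives")
        disjuncts.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return disjuncts


hypothesis = learn_disjunction(POSITIVES, NEGATIVES)
print(" V ".join("&".join(sorted(d)) for d in hypothesis))  # prints b V c
```

With this labelling the loop arrives at `b V c`, one of the expressions listed on the slide.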

10 Generalisation hierarchies
[Diagram: hierarchy over a&b&c, a&b, a&c, b&c, a, b, c]
[Diagram: hierarchy over Onto(a,b)&red(a), ∃x∃y on(x,y), ∃x red(x)]
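For the propositional hierarchy, each step up can be obtained by dropping one conjunct. The following is a minimal sketch of that idea only; it does not attempt the first-order case, where constants are replaced by variables. The function names are illustrative.

```python
def generalisations(conjunction):
    """One-step generalisations: drop a single conjunct."""
    if len(conjunction) <= 1:
        return []
    return [conjunction - {literal} for literal in sorted(conjunction)]


def hierarchy(conjunction):
    """Levels of the hierarchy, from the most specific expression upwards."""
    levels = [{frozenset(conjunction)}]
    while len(next(iter(levels[-1]))) > 1:
        levels.append({g for c in levels[-1] for g in generalisations(c)})
    return levels


def show(level):
    """Render one level as sorted strings such as 'a&b'."""
    return sorted("&".join(sorted(c)) for c in level)


for level in hierarchy("abc"):
    print(show(level))
# prints:
# ['a&b&c']
# ['a&b', 'a&c', 'b&c']
# ['a', 'b', 'c']
```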

11 Back to Example: ISI’s project
“Wrappers” are rules (actually they are like finite state machines!) for extracting information from Web pages. See “Hierarchical Wrapper Induction for Semi-structured Information Sources”, Ion Muslea, Steven Minton, Craig A. Knoblock, Kluwer, 1999.
At the heart of ISI’s Heracles system is the Stalker inductive algorithm, which generates certain types of wrappers: rules that identify the start and end of an item within a web page.

12 Example of training examples
Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
E1: 513 Pico, Venice, Phone: 1- 800 -555-1515
E2: 90 Colfax, Palms, Phone: ( 818 ) 508-1570
E3: 523 1st St., LA, Phone: 1- 888 -578-2293
E4: 403 La Tijera, Watts, Phone: ( 310 ) 798-0008
Imagine you had to write an FSM to extract this data: this is the kind of thing that the Learning Algorithm has to learn.
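To make the "write it by hand" exercise concrete, here is one possible hand-coded extractor. It is only a sketch: a regular expression stands in for the FSM, and the pattern is an assumption fitted to the four examples above rather than anything taken from the Stalker work.

```python
import re

EXAMPLES = {
    "E1": "513 Pico, Venice, Phone: 1- 800 -555-1515",
    "E2": "90 Colfax, Palms, Phone: ( 818 ) 508-1570",
    "E3": "523 1st St., LA, Phone: 1- 888 -578-2293",
    "E4": "403 La Tijera, Watts, Phone: ( 310 ) 798-0008",
}

# Hand-written pattern (an assumption): after "Phone:", the area code is the
# three digits that follow either "1-" or an opening parenthesis.
AREA_CODE = re.compile(r"Phone:\s*(?:1-|\()\s*(\d{3})")


def area_code(line):
    """Return the area code found in the line, or None if the pattern fails."""
    match = AREA_CODE.search(line)
    return match.group(1) if match else None


for name, line in EXAMPLES.items():
    print(name, area_code(line))
# prints:
# E1 800
# E2 818
# E3 888
# E4 310
```

The two alternatives inside the pattern correspond to the two phone-number layouts in the examples, which is why a learner ends up needing a disjunctive rule.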

13 Brief example of Stalker execution
E1: 513 Pico, Venice, Phone: 1- 800 -555-1515
E2: 90 Colfax, Palms, Phone: ( 818 ) 508-1570
E3: 523 1st St., LA, Phone: 1- 888 -578-2293
E4: 403 La Tijera, Watts, Phone: ( 310 ) 798-0008
- SEED = E2: R1 = SkipTo((), R2 = SkipTo(Punctuation), R3 = SkipTo(AnyToken)
- Choose R1: it covers E4 and E2 and no –ve exs
- NEW SEED for E1/E3 = E1: R4 = SkipTo( ), R5 = SkipTo(HtmlTag), R6 = SkipTo(AnyToken)
- All cover E1/E3 but also cover –ve exs
- Specialise R4 = SkipTo( ): R7 = SkipTo( - ), R8 = SkipTo( Punctuation ), R9 = SkipTo( AnyToken )
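The behaviour of R1 and R2 on the four examples can be checked with a small sketch. This is not Stalker's implementation: the tokenizer, the `skip_to` helper and the predicate names are assumptions made for illustration, and a rule is taken to match correctly when the token right after the landmark is the area code.

```python
import re

EXAMPLES = {
    "E1": "513 Pico, Venice, Phone: 1- 800 -555-1515",
    "E2": "90 Colfax, Palms, Phone: ( 818 ) 508-1570",
    "E3": "523 1st St., LA, Phone: 1- 888 -578-2293",
    "E4": "403 La Tijera, Watts, Phone: ( 310 ) 798-0008",
}


def tokenize(text):
    """Split into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def is_open_paren(token):
    """Landmark test for R1 = SkipTo(()."""
    return token == "("


def is_punctuation(token):
    """Landmark test for R2 = SkipTo(Punctuation)."""
    return len(token) == 1 and not token.isalnum()


def skip_to(tokens, landmark):
    """Index just past the first token accepted by `landmark`, or None."""
    for i, token in enumerate(tokens):
        if landmark(token):
            return i + 1
    return None


def extracted(landmark, text):
    """Token at which the rule says the item starts, or None if no match."""
    tokens = tokenize(text)
    start = skip_to(tokens, landmark)
    if start is None or start >= len(tokens):
        return None
    return tokens[start]


for name, text in EXAMPLES.items():
    print(name, extracted(is_open_paren, text), extracted(is_punctuation, text))
# prints:
# E1 None Venice
# E2 818 Palms
# E3 None St
# E4 310 Tijera
```

R1 lands on the area code in E2 and E4 and does not match E1 or E3 at all, whereas R2 matches everywhere but stops at the wrong place, which is consistent with R1 being the candidate chosen for the first seed.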

14 Other Refinements:
E1: 513 Pico, Venice, Phone: 1- 800 -555-1515
R10: SkipTo(Venice) SkipTo( )
R11: SkipTo( ) SkipTo( )
R12: SkipTo(:) SkipTo( )
R13: SkipTo(-) SkipTo( )
R14: SkipTo(,) SkipTo( )
R15: SkipTo(Phone) SkipTo( )
R16: SkipTo(1) SkipTo( )
R17: SkipTo(Numeric) SkipTo( )
R18: SkipTo(Punctuation) SkipTo( )
R19: SkipTo(HtmlTag) SkipTo( )
R20: SkipTo(AlphaNum) SkipTo( )
R21: SkipTo(Alphabetic) SkipTo( )
R22: SkipTo(Capitalized) SkipTo( )
R23: SkipTo(NonHtml) SkipTo( )
R24: SkipTo(Anything) SkipTo( )
R7, R11, R12, R13, R15, R16, and R19 all match correctly on E1 and E3, and fail to match on E2 and E4. R7 represents the best solution according to the algorithm's heuristics. Consequently, Stalker completes its execution by returning the disjunctive rule “either R1 or R7”.
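A disjunctive rule of the form "either R1 or R7" can be applied by trying the disjuncts in order and taking the first that matches. The sketch below assumes that reading. It uses the literal landmark "(" for R1 and the literal "-" for R7 as it is printed in this transcript (part of that landmark may not have survived); `tokenize`, `skip_to` and `apply_disjunction` are illustrative names, not Stalker's own.

```python
import re

EXAMPLES = {
    "E1": "513 Pico, Venice, Phone: 1- 800 -555-1515",
    "E2": "90 Colfax, Palms, Phone: ( 818 ) 508-1570",
    "E3": "523 1st St., LA, Phone: 1- 888 -578-2293",
    "E4": "403 La Tijera, Watts, Phone: ( 310 ) 798-0008",
}


def tokenize(text):
    """Split into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def skip_to(tokens, literal):
    """Index just past the first occurrence of `literal`, or None."""
    for i, token in enumerate(tokens):
        if token == literal:
            return i + 1
    return None


def apply_disjunction(landmarks, text):
    """Try each disjunct in order; return the token after the first match."""
    tokens = tokenize(text)
    for literal in landmarks:
        start = skip_to(tokens, literal)
        if start is not None and start < len(tokens):
            return tokens[start]
    return None


# "either R1 or R7", with R7 taken as printed in the transcript (assumption).
RULE = ["(", "-"]

for name, text in EXAMPLES.items():
    print(name, apply_disjunction(RULE, text))
# prints:
# E1 800
# E2 818
# E3 888
# E4 310
```

On this plain-text rendering of the examples the order of the disjuncts matters, because "-" also occurs later in E2 and E4; trying R1 first keeps those two examples correct.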

15 Summary
Stalker is an example of an inductive learning algorithm which
- is given examples of fields in web pages, and
- learns the begin/end patterns of fields
so that it can be used to ‘mine’ data in unseen web pages.
Many other examples exist of the use of “wrapper induction” to automatically extract information from web pages.
