The Semantic Web, Week 22
Information Extraction and Integration (continued)
Module Website:
Practical this week:
Recap
- Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
- Information agents can retrieve information from some web sites via database-like queries (such as required in the example above) and integrate information from several web sites to solve complex queries.
- ‘Similarity-based’ machine learning techniques can be used to learn/extract meaning from traditional web page content.
Induction algorithm
[Figure labels: SEED, INSTANCE SPACE, GENERALISATION SPACE]
Induction algorithm - continued
[Figure labels: SEED, INSTANCE SPACE, GENERALISATION SPACE]
Induction algorithm - continued
[Figure labels: NEW SEED, INSTANCE SPACE, GENERALISATION SPACE; Hypothesis =V]
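The seed, generalise, new-seed loop pictured on these slides can be sketched in Python as follows. This is only a minimal sketch under assumptions that are not in the slides: instances are sets of feature names, each disjunct of the hypothesis is a conjunction of features, and generalisation greedily drops features while no negative example is covered. The names `covers`, `generalise_seed` and `induce`, and the labelling of the examples as positive or negative, are hypothetical.

```python
def covers(hypothesis, instance):
    """A conjunction covers an instance if every feature it names is present."""
    return hypothesis <= instance


def generalise_seed(seed, negatives):
    """Greedily drop features from the seed while no negative example is covered."""
    hypothesis = set(seed)
    for feature in sorted(seed):
        trial = hypothesis - {feature}
        # Keep the more general conjunction only if it is non-empty
        # and still excludes every negative example.
        if trial and not any(covers(trial, n) for n in negatives):
            hypothesis = trial
    return frozenset(hypothesis)


def induce(positives, negatives):
    """Pick a seed, generalise it, drop the positives it covers, repeat."""
    uncovered = list(positives)
    disjuncts = []
    while uncovered:
        seed = uncovered[0]                      # the (new) seed
        rule = generalise_seed(seed, negatives)  # always still covers the seed
        disjuncts.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return disjuncts                             # read as rule1 V rule2 V ...


# Assumed labelling, for illustration only.
positives = [frozenset("abf"), frozenset("ace")]
negatives = [frozenset("abc")]
hypothesis = induce(positives, negatives)
```

With this assumed data the loop needs two seeds, so the returned hypothesis is a disjunction of two conjunctions.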
Induction algorithm – abstract example
[Figure labels: INSTANCE SPACE, GENERALISATION SPACE, SEED. Expressions shown: a&b&f, a&b&c, e&c&a, a&f]
Induction algorithm – abstract example
[Figure labels: INSTANCE SPACE, GENERALISATION SPACE. Expressions shown: a&b&f, a&b&c, e&c&a, a&b, b&c, a&c, a&f]
VERSION SPACE (all expressions that cover all +ve examples and no -ve examples) IS EMPTY
Induction algorithm – abstract example
[Figure labels: INSTANCE SPACE, GENERALISATION SPACE. Expressions shown: a&b&f, a&b&c, e&c&a, a&b, b&c, a&c, a&f, a]
VERSION SPACE (all expressions that cover all +ve examples and no -ve examples) IS EMPTY
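One way to see how a version space of pure conjunctions can come out empty is to enumerate it directly. The sketch below assumes, purely for illustration, that a&b&f and e&c&a are positive examples and a&b&c is negative; the slides do not state the labelling. The function name `version_space` is hypothetical.

```python
from itertools import combinations


def covers(conjunction, instance):
    """A conjunction covers an instance if all its features are present."""
    return conjunction <= instance


def version_space(positives, negatives):
    """All non-empty conjunctions covering every positive and no negative."""
    features = sorted(set().union(*positives, *negatives))
    space = []
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            conjunction = frozenset(combo)
            if (all(covers(conjunction, p) for p in positives)
                    and not any(covers(conjunction, n) for n in negatives)):
                space.append(conjunction)
    return space


# Assumed labelling, for illustration only.
positives = [frozenset("abf"), frozenset("eca")]
negatives = [frozenset("abc")]

empty_space = version_space(positives, negatives)   # conjunctions only
without_negatives = version_space(positives, [])    # ignoring the negative
```

Under this labelling the only conjunction common to both positives is `a`, and `a` also covers the negative a&b&c, so no conjunctive expression survives.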
Induction algorithm – disjunction + ¬
[Figure labels: INSTANCE SPACE, GENERALISATION SPACE. Expressions shown: a&b&f, a&b&c, e&c&a, a&f, a&b&f V a&b&c V e&c&a, ¬a V ¬f, b V c, a&¬f]
VERSION SPACE (all expressions that cover all +ve examples and no -ve examples)
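Once disjunction and negation are allowed, checking which instances an expression covers is still mechanical. The sketch below evaluates three of the expressions from this slide against the three instances. The encoding is an assumption: an expression is a list of conjunctions (read as a disjunction), each conjunction is a list of literals, and `~x` stands for ¬x.

```python
def literal_holds(literal, instance):
    """'x' holds if x is in the instance; '~x' holds if x is absent."""
    negated = literal.startswith("~")
    feature = literal.lstrip("~")
    return (feature not in instance) if negated else (feature in instance)


def covers(dnf, instance):
    """A disjunction covers an instance if any one conjunction fully holds."""
    return any(all(literal_holds(lit, instance) for lit in conj) for conj in dnf)


instances = {
    "a&b&f": set("abf"),
    "a&b&c": set("abc"),
    "e&c&a": set("eca"),
}

expressions = {
    "b V c": [["b"], ["c"]],
    "a&~f": [["a", "~f"]],
    "~a V ~f": [["~a"], ["~f"]],
}

# For each expression, list the instances it covers.
coverage = {
    name: [label for label, feats in instances.items() if covers(dnf, feats)]
    for name, dnf in expressions.items()
}
```

Here `b V c` covers all three instances, while `a&~f` and `~a V ~f` both exclude a&b&f.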
Generalisation hierarchies
a&b&c
a&b   a&c   b&c
a   b   c
Onto(a,b)&red(a)
∃x∃y on(x,y)   ∃x red(x)
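The propositional hierarchy above can be generated by repeatedly dropping one conjunct. The following is a small sketch; the function names `drop_one` and `hierarchy` are hypothetical, and it handles only the propositional case, not the first-order generalisation from constants to quantified variables.

```python
def drop_one(conjunction):
    """All generalisations obtained by removing exactly one conjunct."""
    if len(conjunction) <= 1:
        return set()  # a single feature has no non-empty generalisation
    return {conjunction - {feature} for feature in conjunction}


def hierarchy(start):
    """Levels of the hierarchy, from most specific to most general."""
    levels = [{frozenset(start)}]
    while True:
        next_level = set()
        for conjunction in levels[-1]:
            next_level |= drop_one(conjunction)
        if not next_level:
            return levels
        levels.append(next_level)


levels = hierarchy("abc")
# Render each level as sorted strings such as "a&b".
rendered = [sorted("&".join(sorted(c)) for c in level) for level in levels]
```

Each level is strictly more general than the one before it: every expression covers at least the instances covered by the expressions it was derived from.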
Back to Example: ISI’s project
“Wrappers” are rules (actually they are like finite state machines!) for extracting information from Web pages.
See “Hierarchical Wrapper Induction for Semi-structured Information Sources”, Ion Muslea, Steven Minton, Craig A. Knoblock, Kluwer.
At the heart of ISI’s Heracles system is the Stalker inductive algorithm, which generates certain types of wrappers: rules that identify the start and end of an item within a web page.
Example of training examples
Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
E1: 513 Pico, Venice, Phone:
E2: 90 Colfax, Palms, Phone: ( 818 )
E3: 523 1st St., LA, Phone:
E4: 403 La Tijera, Watts, Phone: ( 310 )
Imagine you had to write an FSM to extract this data; this is the kind of thing that the learning algorithm has to learn.
Brief example of Stalker execution
E1: 513 Pico, Venice, Phone:
E2: 90 Colfax, Palms, Phone: ( 818 )
E3: 523 1st St., LA, Phone:
E4: 403 La Tijera, Watts, Phone: ( 310 )
- SEED = E2: R1 = SkipTo((), R2 = SkipTo(Punctuation), R3 = SkipTo(AnyToken)
- Choose R1: it covers E4 and E2 and no negative examples
- NEW SEED for E1/E3 = E1: R4 = SkipTo( ), R5 = SkipTo(HtmlTag), R6 = SkipTo(AnyToken)
- All of these cover E1/E3 but also cover negative examples
- Specialise R4 = SkipTo( ): R7 = SkipTo( - ), R8 = SkipTo( Punctuation ), R9 = SkipTo( AnyToken )
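To make the candidate rules R1 to R3 concrete, here is a minimal sketch of applying a SkipTo rule to a token sequence. It assumes a simple regex tokeniser and only three wildcard classes. The names `tokenize`, `matches` and `apply_rule` are hypothetical and are not Stalker's actual implementation. The example strings are used exactly as they appear on the slide.

```python
import re

# Simplified wildcard classes (an assumption; Stalker's own classes are richer).
WILDCARDS = {
    "Punctuation": lambda tok: not tok.isalnum(),
    "Numeric": lambda tok: tok.isdigit(),
    "AnyToken": lambda tok: True,
}


def tokenize(text):
    """Split into word tokens and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)


def matches(landmark, token):
    """A landmark is either a wildcard class name or a literal token."""
    test = WILDCARDS.get(landmark)
    return test(token) if test else token == landmark


def apply_rule(rule, tokens):
    """Apply SkipTo(l1) SkipTo(l2) ...; return the index just after the
    last landmark, or None if some landmark is never found."""
    pos = 0
    for landmark in rule:
        hit = next((i for i in range(pos, len(tokens))
                    if matches(landmark, tokens[i])), None)
        if hit is None:
            return None
        pos = hit + 1
    return pos


E1 = tokenize("513 Pico, Venice, Phone:")
E2 = tokenize("90 Colfax, Palms, Phone: ( 818 )")
E3 = tokenize("523 1st St., LA, Phone:")
E4 = tokenize("403 La Tijera, Watts, Phone: ( 310 )")

R1 = ["("]            # SkipTo(()
R2 = ["Punctuation"]  # SkipTo(Punctuation)
R3 = ["AnyToken"]     # SkipTo(AnyToken)
```

On E2, R1 stops right before the area code, whereas R2 stops after the first comma and R3 after the first token, which illustrates why R1 is the candidate that gets chosen.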
Other Refinements:
E1: 513 Pico, Venice, Phone:
R10: SkipTo(Venice) SkipTo( )
R11: SkipTo( ) SkipTo( )
R12: SkipTo(:) SkipTo( )
R13: SkipTo(-) SkipTo( )
R14: SkipTo(,) SkipTo( )
R15: SkipTo(Phone) SkipTo( )
R16: SkipTo(1) SkipTo( )
R17: SkipTo(Numeric) SkipTo( )
R18: SkipTo(Punctuation) SkipTo( )
R19: SkipTo(HtmlTag) SkipTo( )
R20: SkipTo(AlphaNum) SkipTo( )
R21: SkipTo(Alphabetic) SkipTo( )
R22: SkipTo(Capitalized) SkipTo( )
R23: SkipTo(NonHtml) SkipTo( )
R24: SkipTo(Anything) SkipTo( )
R7, R11, R12, R13, R15, R16, and R19 all match correctly on E1 and E3, and fail to match on E2 and E4; R7 represents the best solution according to the algorithm's heuristics. Consequently, Stalker completes its execution by returning the disjunctive rule “either R1 or R7”.
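A disjunctive rule such as “either R1 or R7” can be applied by trying each disjunct in order and keeping the first one that matches. The sketch below is illustrative only: R7 is modelled as the single landmark `-`, which is an assumption, and both token sequences (including the phone digits) are hypothetical inputs invented for the example.

```python
def skip_to(landmarks, tokens):
    """Consume literal landmarks left to right; return the index just after
    the last one, or None if any landmark is missing."""
    pos = 0
    for landmark in landmarks:
        try:
            pos = tokens.index(landmark, pos) + 1
        except ValueError:
            return None
    return pos


def apply_disjunction(rules, tokens):
    """Return the token found by the first rule that matches, else None."""
    for rule in rules:
        pos = skip_to(rule, tokens)
        if pos is not None and pos < len(tokens):
            return tokens[pos]
    return None


R1 = ["("]  # SkipTo(()
R7 = ["-"]  # assumed form of R7, for illustration

# Hypothetical token sequences, not taken from the slides.
paren_style = ["Phone", ":", "(", "818", ")"]
dash_style = ["Phone", ":", "1", "-", "800", "-", "555", "-", "0100"]
no_number = ["Phone", ":"]

wrapper = [R1, R7]
```

Ordering matters in this design: the first matching disjunct wins, so a rule that is more reliable on the training examples would normally be placed first.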
Summary
Stalker is an example of an inductive learning algorithm which
-- is given examples of fields in web pages, and
-- learns the begin/end patterns of fields,
so that it can be used to ‘mine’ data in unseen web pages.
Many other examples exist of the use of “wrapper induction” to automatically extract information from web pages.