Integration of Friendly Data Islands on the Web. Information Extraction.

1 Integration of Friendly Data Islands on the Web. Information Extraction.

2 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Conclusions

3 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Conclusions

4 The theory. A wrapper is a building block that provides an ad-hoc, message-based API to an app. Wrappers interface apps at one or more layers but, more often than not, they must deal with the user interface or the data layer. Layers shown: User Interface; Controller; Business Logic; Data Access Layer; Data Layer.

5 The problem. Sample page content: The Da Vinci Code; Buy; Dan Brown; Doubleday, 2006; 15.95 €; Robert Langdon is a Harvard Professor of Symbology…

6 Features of current web documents: trillions of documents; generated on demand by software applications; change continuously; require navigation from search forms; written in telegraphic language; formatted according to HTML templates.

7 The solution

8 Wrapping in a nutshell. Goals: endow data islands with APIs; ease implementing software applications. Implications: form filling; navigation; info extraction; “ontologisation”.

9 Look out! Information extraction has driven most research efforts. Few wrapping systems are complete. Wrapping is usually mistaken for information extraction. This talk is about engineering information extraction for enabling information integration.

10 How IE works. Diagram labels: Document; Extraction rules; Information extractor; Attributes (The Da Vinci Code, Dan Brown, 15.95 €, 2006, Robert Langdon…, Doubleday); Templating/ontologisation rules; Templates (Message ID: MUC-0001; Message Template: Court resolution; Date of Event: April 30, 2007; Charge: Terrorist attack; Perpetrator: Salahuddin Amin; Perpetrator: Anthony Garcia; Perpetrator: Waheed Mahmood; Perpetrator: Omar Khyam; …); Ontology instances (P1, A1, B1 over The Da Vinci Code, Dan Brown, 15.95 €, 2006, Robert Langdon…, Doubleday).

11 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Side by side comparison; Conclusions

12 Running example

13 Book name: Ontologies
Reviews:
Reviewer: John Doe Rating: 7 Text: blah, blah
Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah
Book name: SPARQL in action
Reviews:
Reviewer: Dan Smith Rating: 9 Text: cough, cough
Book name: W4F explained
Reviews:

14 Kinds of extraction rules: regular expressions; first-order logic rules; pointers into the DOM tree; context-free grammars; tag trees.

15 Regular expressions
TSIMMIS
[Root, get("page.html"), "#"]
[BookReview, Root, " # "]
[BookName, BookReview, " # "]
[Tmp, Root, " # "]
[Reviews, Tmp, "split(Tmp, ' ')"]
[ReviewerNames, Reviews, "Reviewer: # "]
[Ratings, Reviews, "Rating: # "]
[Text, Reviews, "Text: # "]
RoadRunner
$FileName Book name: $BookTitle Reviews: (( Reviewer: $ReviewerName Rating: $Rating Text: $Text )+)?
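To make the regular-expression style concrete, here is a minimal Python sketch that pulls the same attributes out of the running example. It is neither TSIMMIS nor RoadRunner syntax. The `PAGE` string, the two patterns and the `extract` function are assumptions made for illustration, since the HTML tags of the original page did not survive in this transcript.

```python
import re

# Plain-text rendering of the running example (assumption: the real page was HTML).
PAGE = """Book name: Ontologies
Reviews:
Reviewer: John Doe
Rating: 7
Text: blah, blah
Reviewer: Alan Wohl
Rating: 8
Text: yeah, yeah
Book name: SPARQL in action
Reviews:
Reviewer: Dan Smith
Rating: 9
Text: cough, cough
Book name: W4F explained
Reviews:
"""

# One book: a title line, a "Reviews:" line, then every line up to the next book.
BOOK = re.compile(
    r"Book name: (?P<title>.+)\nReviews:\n(?P<reviews>(?:(?!Book name:).*\n)*)"
)
# One review: three consecutive labelled lines.
REVIEW = re.compile(
    r"Reviewer: (?P<reviewer>.+)\nRating: (?P<rating>\d+)\nText: (?P<text>.+)\n"
)


def extract(page):
    """Return one dict per book, each with its (possibly empty) list of reviews."""
    books = []
    for book in BOOK.finditer(page):
        reviews = [m.groupdict() for m in REVIEW.finditer(book.group("reviews"))]
        books.append({"title": book.group("title"), "reviews": reviews})
    return books


for entry in extract(PAGE):
    print(entry["title"], len(entry["reviews"]))
```

The nested use of two patterns mirrors the way the rules above first isolate a book and then split its reviews.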

16 First-order logic rules
SRV
bookTitle(X) :- prev(X, "Book name: "), next(X, " ").
reviewerName(X) :- prev(X, "name: "), next(X, " "), !bookTitle(X).
rating(X) :- isNatural(X), length(X, 1), inList(X).
text(X) :- prev(X, "Text: "), next(X, " ").
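A rough way to read such rules is as boolean tests applied to every token of the page. The Python sketch below assumes a hypothetical tokenisation of the running example into labels and values, and it only encodes the parts of the rules whose constants are legible in this transcript (the `next` constants and `inList` are left out).

```python
# Hypothetical token stream for one book of the running example.
TOKENS = [
    "Book name:", "Ontologies", "Reviews:",
    "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah",
    "Reviewer:", "Alan Wohl", "Rating:", "8", "Text:", "yeah, yeah",
]


def prev(tokens, i, label):
    """prev(X, label): the token right before position i equals label."""
    return i > 0 and tokens[i - 1] == label


def is_natural(token):
    """isNatural(X): the token consists of digits only."""
    return token.isdigit()


# Each rule is a conjunction of support predicates over the token at position i.
RULES = {
    "bookTitle": lambda t, i: prev(t, i, "Book name:"),
    "rating": lambda t, i: is_natural(t[i]) and len(t[i]) == 1,
    "text": lambda t, i: prev(t, i, "Text:"),
}


def apply_rules(tokens):
    """Collect, per rule, every token that satisfies it."""
    found = {name: [] for name in RULES}
    for i, token in enumerate(tokens):
        for name, rule in RULES.items():
            if rule(tokens, i):
                found[name].append(token)
    return found


print(apply_rules(TOKENS))
```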

17 Pointer into the DOM tree
WebOQL
select x’.Text, y’.Text, y’’’’.Text, y’’’’’’’.Text
from x, y in browse("page.html")
where x.Text = "Book name:" and y.Text = "Reviewer:"

18 Context-free grammars
Minerva
Page ::= $FileName Review
Review ::= Book name: $BookName Reviews: ( Reviewer Rating Text )*
Reviewer ::= Reviewer: $Reviewer
Rating ::= Rating: $Rating
Text ::= Text: $Text
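The grammar above can be read as a small recursive-descent parser. The sketch below is plain Python, not Minerva; the token list and the function names are assumptions, and the `Page ::= $FileName Review` production is omitted so that only the `Review` rule and its three sub-rules are shown.

```python
# Hypothetical token stream for one Review.
TOKENS = [
    "Book name:", "Ontologies", "Reviews:",
    "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah",
    "Reviewer:", "Alan Wohl", "Rating:", "8", "Text:", "yeah, yeah",
]


def parse_review(tokens, pos=0):
    """Review ::= 'Book name:' $BookName 'Reviews:' ( Reviewer Rating Text )*"""

    def literal(expected):
        # Consume one fixed token of the grammar.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != expected:
            raise SyntaxError(f"expected {expected!r} at token {pos}")
        pos += 1

    def field(label):
        # X ::= label $X, i.e. a fixed label followed by a captured value.
        nonlocal pos
        literal(label)
        if pos >= len(tokens):
            raise SyntaxError(f"missing value after {label!r}")
        value = tokens[pos]
        pos += 1
        return value

    book = field("Book name:")
    literal("Reviews:")
    reviews = []
    # ( Reviewer Rating Text )* : repeat while another review starts here.
    while pos < len(tokens) and tokens[pos] == "Reviewer:":
        reviewer = field("Reviewer:")
        rating = field("Rating:")
        text = field("Text:")
        reviews.append({"reviewer": reviewer, "rating": rating, "text": text})
    return {"book": book, "reviews": reviews}, pos


tree, end = parse_review(TOKENS)
print(tree["book"], len(tree["reviews"]), end)
```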

19 Tag trees. DEPTA. [Figure: tag tree; surviving node labels: li, b, br]

20 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Conclusions

21 Classification: hand-crafted; supervised induction; little-supervised induction; unsupervised induction.

22 Hand-crafted. The pattern to extract the title is “…”. Techniques: natural intelligence. Systems: TSIMMIS, Minerva, WebOQL, W4F, XWrap.

23 Supervised induction. Techniques: bottom-up ILP; top-down ILP; ad-hoc algorithms. Systems: SRV, RAPIER, WIEN, WHISK, NoDoSE, SoftMealy, STALKER, DEByE. Stages shown: raw documents; labelled documents; automated induction.

24 Little-supervised induction. Techniques: string alignment; tree alignment. Systems: OLERA, Thresher. Stages shown: raw document; record and attribute labelling; automated induction.

25 Unsupervised induction. Techniques: string alignment; tree alignment; statistical roles. Systems: DeLa, RoadRunner, EXALG, DEPTA, IEPAD. Stages shown: raw documents; automated induction; pattern interpretation.

26 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Conclusions

27 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems (RoadRunner, SRV); Conclusions

28 Token matching. Samples compared: Book name: Ontologies Reviews: Reviewer: John Doe Rating: 7 … Reviewer: Alan Wohl Rating: 8 … Book name: SPARQL in action Reviews: Reviewer: Dan Smith Rating: 9 … Step: string mismatch. Wrapper so far: $1

29 ...and matching… (same samples as slide 28) Step: tag match. Wrapper so far: $1

30 ...and matching… (same samples as slide 28) Step: tag match. Wrapper so far: $1

31 ...and matching… (same samples as slide 28) Step: tag match, string match, … Wrapper so far: $1 Book name:

32 ...and matching… (same samples as slide 28) Step: string mismatch, tag match. Wrapper so far: $1 Book name: $2

33 ...and matching… (same samples as slide 28) Step: … Wrapper so far: $1 Book name: $2 Reviewer: $3 Rating: $4 Text: $5

34 Stop: lists and optionals (same samples as slide 28) Step: tag mismatch. Wrapper so far: $1 Book name: $2 Reviewer: $3 Rating: $4 Text: $5

35 Stop: lists and optionals (same samples as slide 28) Wrapper so far: $1 Book name: $2 Reviewer: $3 Rating: $4 Text: $5

36 Stop: lists and optionals (same samples as slide 28) Wrapper so far: $1 Book name: $2 ( Reviewer: $3 Rating: $4 Text: $5 )+

37 …and matching finishes (same samples as slide 28) Wrapper: $1 Book name: $2 ( Reviewer: $3 Rating: $4 Text: $5 )+

38 Just union-free grammars!
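The first half of the walkthrough (slides 28 to 33) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the RoadRunner algorithm: it only generalises string mismatches into fields and gives up on a tag mismatch, which is exactly where the real system starts looking for lists and optionals. The token lists, the `is_tag` test and the function name are assumptions.

```python
def is_tag(token):
    # Assumption: in this plain-text rendering, labels ending in ":" play the role of tags.
    return token.endswith(":")


def generalise(sample_a, sample_b):
    """Align two equally long token lists and turn string mismatches into fields."""
    if len(sample_a) != len(sample_b):
        raise ValueError("different lengths: list/optional discovery is not sketched here")
    wrapper, fields = [], 0
    for a, b in zip(sample_a, sample_b):
        if a == b:
            wrapper.append(a)            # match: keep the token as a literal
        elif not is_tag(a) and not is_tag(b):
            fields += 1                  # string mismatch: introduce a new field
            wrapper.append(f"${fields}")
        else:
            raise ValueError("tag mismatch: list/optional discovery is not sketched here")
    return wrapper


page_1 = ["page1.html", "Book name:", "Ontologies",
          "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah"]
page_2 = ["page2.html", "Book name:", "SPARQL in action",
          "Reviewer:", "Dan Smith", "Rating:", "9", "Text:", "cough, cough"]

print(" ".join(generalise(page_1, page_2)))
```

With one review per page the result has the same shape as the wrapper on slide 33. Two values that happen to coincide in both samples would stay in the wrapper as literals.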

39 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems (RoadRunner, SRV); Conclusions

40 Exercise. Support predicates: next(x, y), previous(x, y). Try to explain isCorD(X). Sample strings: abcabdab, bbcaabda

41 Exercise. Support predicates: next(x, y), previous(x, y). Now, try to explain isCorDorE(X). Sample strings: abcabdabee, bbcaabdaee

42 Target predicates. Define target predicates:
title: #PCDATA.
reviewer: #PCDATA.
rating: #PCDATA.
text: #PCDATA.

43 Instantiate target predicates
Book name: Ontologies
Reviews:
Reviewer: John Doe Rating: 7 Text: blah, blah
Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah
Book name: SPARQL in action
Reviews:
Reviewer: Dan Smith Rating: 9 Text: cough, cough
Book name: W4F explained
Reviews:

44 Instantiate target predicates
Positive samples
title("Ontologies"). title("SPARQL in action"). title("W4F explained").
reviewer("John Doe"). reviewer("Alan Wohl"). reviewer("Dan Smith").
rating("7"). rating("8"). rating("9").
text("blah, blah"). text("yeah, yeah"). text("cough, cough").
Negative samples
!title("Book name:"). !reviewer("Book name:"). !rating("Book name:"). !text("Book name:").
!title("Reviews:"). !reviewer("Reviews:"). !rating("Reviews:"). !text("Reviews:").
!title("Reviewer:"). !reviewer("Reviewer:"). !rating("Reviewer:"). !text("Reviewer:").
!title("Rating:"). !reviewer("Rating:"). !rating("Rating:"). …

45 Support predicates. Define support predicates:
prev: #PCDATA, #PCDATA.
next: #PCDATA, #PCDATA.
length: #PCDATA, #PCDATA.
isNatural: #PCDATA.

46 Instantiate support predicates
On positive samples
prev("Ontologies", " "). next("Ontologies", " "). length("Ontologies", 10). !isNatural("Ontologies").
prev("SPARQL in action", " "). next("SPARQL in action", " "). length("SPARQL in action", 16). !isNatural("SPARQL in action").
prev("W4F explained", " "). next("W4F explained", " "). length("W4F explained", 13). !isNatural("W4F explained").
…
On negative samples
prev("Book name:", " "). next("Book name:", " "). length("Book name:", 10). !isNatural("Book name:").
prev("Reviews:", " "). next("Reviews:", " "). !isNatural("Reviews:").
prev("Reviewer:", " "). next("Reviewer:", " "). !isNatural("Reviewer:").
prev("Rating:", " "). next("Rating:", " "). !isNatural("Rating:").
…
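Instantiating the support predicates is mechanical, so it can be sketched as a loop over the tokens of a page. The Python below is an illustration under assumptions: the token list is a hypothetical plain-text tokenisation, so `prev` and `next` point at neighbouring labels or values instead of the HTML tags that the original facts referred to.

```python
# Hypothetical token stream for part of the running example.
TOKENS = [
    "Book name:", "Ontologies", "Reviews:",
    "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah",
    "Book name:", "W4F explained", "Reviews:",
]


def support_facts(tokens):
    """Return (predicate, token, value) facts for every token in the stream."""
    facts = []
    for i, token in enumerate(tokens):
        if i > 0:
            facts.append(("prev", token, tokens[i - 1]))
        if i + 1 < len(tokens):
            facts.append(("next", token, tokens[i + 1]))
        facts.append(("length", token, len(token)))
        facts.append(("isNatural", token, token.isdigit()))
    return facts


for fact in support_facts(TOKENS):
    if fact[1] in ("Ontologies", "W4F explained", "7"):
        print(fact)
```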

47 … Top-down induction
title(X) :-. (3, 14)
title(X) :- prev(X, X). (0, 0)
title(X) :- !prev(X, X). (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- !prev(X, Y). (?, ?)
title(X) :- next(X, X). (0, 0)
title(X) :- !next(X, X). (3, 14)
title(X) :- next(X, Y). (3, 14)
title(X) :- !next(X, Y). (?, ?)
title(X) :- length(X, X). (0, 0)
title(X) :- prev(X, " "). (0, 5)
title(X) :- !prev(X, " "). (3, 9)
title(X) :- prev(X, " "). (3, 9)
title(X) :- !prev(X, " "). (0, 5)
…

48 Rule selection. p0 = # positive bindings of R; n0 = # negative bindings of R; p1 = # positive bindings of R&A; n1 = # negative bindings of R&A; t = # positive bindings of both R and R&A. Formula annotations: new covering; old covering; combined covering.
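The formula itself is missing from this transcript. The three annotations match the information gain used by FOIL-style top-down learners, so a plausible reconstruction (an assumption, not a quotation from the slide) is:

```latex
\[
\mathrm{Gain}(R, A) \;=\; t \left( \log_2 \frac{p_1}{p_1 + n_1} \;-\; \log_2 \frac{p_0}{p_0 + n_0} \right)
\]
```

Under this reading, t is the combined covering, the first logarithm is the new covering and the second is the old covering. As a worked example, refining a rule that covers (3, 14) into one that covers (3, 0) with t = 3 gives 3 · (log2 1 − log2 (3/17)) ≈ 7.51.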

49 Induction goes on…
title(X) :-. (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), X = Y. (?, ?)
title(X) :- prev(X, Y), X != Y. (?, ?)
title(X) :- prev(X, Y), prev(X, X). (?, ?)
title(X) :- prev(X, Y), !prev(X, X). (?, ?)
title(X) :- prev(X, Y), prev(X, Z). (?, ?)
title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
title(X) :- prev(X, Y), prev(Y, X). (?, ?)
…

50 …and on…
title(X) :-. (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = " ". (?, ?)
title(X) :- prev(X, Y), Y = " ", prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = " ", !prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = " ", prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = " ", !prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = " ", prev(X, Z). (?, ?)
title(X) :- prev(X, Y), Y = " ", !prev(X, Z). (?, ?)
…

51 …and eventually finishes
title(X) :-. (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = " ". (?, ?)
title(X) :- prev(X, Y), Y = " ", prev(Y, "Book name:"). (3, 0)
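The search shown on these slides can be approximated by a greedy loop: start from the empty rule, score each candidate literal by how it changes the (positive, negative) coverage, and stop once no negative sample is covered. The Python below is a sketch under assumptions, not SRV itself: it works on a hypothetical plain-text token list, uses ground literals only (no variables such as Y), and scores candidates with a FOIL-style gain, which may differ from the measure on the original slide.

```python
from math import log2

# Hypothetical token stream; positions 1 and 10 hold the book titles.
TOKENS = [
    "Book name:", "Ontologies", "Reviews:",
    "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah",
    "Book name:", "W4F explained", "Reviews:",
]
POSITIVES = {1, 10}

# Candidate literals, each a test on the token at position i.
LITERALS = {
    "prev(X, 'Book name:')": lambda t, i: i > 0 and t[i - 1] == "Book name:",
    "isNatural(X)": lambda t, i: t[i].isdigit(),
    "!isNatural(X)": lambda t, i: not t[i].isdigit(),
}


def coverage(rule, tokens, positives):
    """Return (p, n): positive and negative positions satisfying every literal."""
    covered = [i for i in range(len(tokens)) if all(lit(tokens, i) for lit in rule)]
    p = sum(1 for i in covered if i in positives)
    return p, len(covered) - p


def gain(old, new):
    """FOIL-style gain; with ground literals, t equals the new positive count."""
    (p0, n0), (p1, n1) = old, new
    if p1 == 0:
        return float("-inf")  # never pick a literal that loses every positive
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))


def induce(tokens, positives, literals):
    """Greedily add literals until the rule covers no negative sample."""
    rule, names = [], []
    current = coverage(rule, tokens, positives)
    while current[1] > 0:
        scored = {
            name: coverage(rule + [lit], tokens, positives)
            for name, lit in literals.items()
        }
        best = max(scored, key=lambda name: gain(current, scored[name]))
        if gain(current, scored[best]) <= 0:
            break  # no candidate improves the rule
        rule.append(literals[best])
        names.append(best)
        current = scored[best]
    return names, current


print(induce(TOKENS, POSITIVES, LITERALS))
```

On this toy input the loop stops after one step, because a single literal already separates the titles from everything else.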

52 Optimisations. Intelligent predicates: nonsense atoms; nonsense atom combinations; non-bindable variables. Instantiated target predicates. Statistical analysis of constants. Keep track of non-instantiable predicates.

53 Roadmap: Introduction; What extraction rules are; Generating extraction rules; A couple of systems; Conclusions

54 That's quite clear! Information extraction enables information integration

55 Research challenges. Information extraction: efficient rule generation; maintaining rules automatically; non-union-free grammars (unsupervised). Ontologisation rules: everything is a challenge.

56 Thanks! Drop by our web site at http://www.tdg-seville.info

