1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages
2 Motivation – Birth date of my great-grandpa? – Price and mileage of red Nissans, 1990 or newer? – Protein and amino-acid information for gene cdk-4? – US states with property crime rates above 1%?
3 Search by Search Engine
4 Search the Hidden Web The Hidden Web: – Hidden behind forms – Hard to query “cdk-4”
5 Query for Data The Hidden Web: – Hidden behind forms – Hard to query Find the protein and amino-acid information for gene “cdk-4”
6 A Web of Pages A Web of Knowledge Web of Knowledge – Machine-“understandable” – Publicly accessible – Queryable by standard query languages Semantic annotation – Domain ontologies – Populated conceptual model Problems to resolve – How do we create ontologies? – How do we annotate pages for ontologies?
7 Contributions of Dissertation Work Web of Pages to Web of Knowledge – Knowledge & meta-knowledge extraction – Reformulation as machine-“understandable” knowledge Automatic & semi-automatic solutions via: – Sibling tables (TISP/TISP++) – User-created forms (FOCIH)
8 Automatic Annotation with TISP (Table Interpretation with Sibling Pages) Recognize tables (discard non-tables) Locate table labels Locate table values Find label/value associations
9 Recognize Tables Data Table Layout Tables (discard) Nested Data Tables
10 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE
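The association on this slide can be sketched in code. This is only an illustration under stated assumptions, not TISP's implementation: the function name `associate`, the column header `Gene Model`, and the placeholder cell values are hypothetical; only the label paths and the value `WP:CE` come from the slide.

```python
def associate(prefix, header, rows):
    """Pair each cell value with a (column-label path, row-number path) key."""
    pairs = {}
    for row_no, row in enumerate(rows, start=1):
        for label, value in zip(header, row):
            pairs[(f"{prefix}.{label}", f"{prefix}.{row_no}")] = value
    return pairs


# Hypothetical nested table found under "Identification" -> "Gene model(s)".
# All cell values except "WP:CE" are placeholders.
pairs = associate(
    "Identification.Gene model(s)",
    ["Gene Model", "Protein"],
    [["model-1", "protein-1"], ["model-2", "WP:CE"]],
)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(pairs[key])
```

Keying each value by both its column-label path and its row path keeps the nesting context, so the same label in a different nested table does not collide.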
11 Interpretation Technique: Sibling Page Comparison
12 Interpretation Technique: Sibling Page Comparison Same
13 Interpretation Technique: Sibling Page Comparison Almost Same
14 Interpretation Technique: Sibling Page Comparison Different Same
15 Technique Details Unnest tables Match tables in sibling pages – “Perfect” match (layout table: discard) – “Reasonable” match (sibling table) Determine & use table-structure pattern – Discover pattern – Pattern usage – Dynamic pattern adjustment
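The matching step might be sketched as follows. This is a deliberately simplified stand-in: it compares aligned cell text, and the 0.3 threshold, the function names, and the sample tables are assumptions made for illustration. The slides do not say how TISP measures a match.

```python
def match_ratio(table_a, table_b):
    """Fraction of aligned cells whose text is identical in both tables."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0  # simplification: differently sized tables never match
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)


def classify(table_a, table_b, low=0.3):
    """Return 'layout' for a perfect match, 'sibling' for a reasonable one."""
    ratio = match_ratio(table_a, table_b)
    if ratio == 1.0:
        return "layout"    # nothing varies between pages: discard
    if ratio >= low:
        return "sibling"   # labels agree, values differ
    return "no match"


# Hypothetical tables taken from two sibling pages.
page1 = [["Make", "Nissan"], ["Year", "1994"]]
page2 = [["Make", "Toyota"], ["Year", "1999"]]
nav = [["Home", "Search"]]

print(classify(page1, page2))
print(classify(nav, nav))
```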
16 Table Unnesting
17 Table Structure Patterns Regularity expectations: – ({L} {V})^n – ({L})^n (({V})^n)^+ – … Pattern combinations are also possible.
18 Table Structure Patterns ({L})^n (({V})^n)^+
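One way to picture pattern discovery: mark each cell that is identical across two sibling tables as a label (L) and each differing cell as a value (V), then test the marks against the two patterns. The sketch below assumes that reading of the patterns (one label/value pair per row, or a header row followed by value rows). The function names and car-ad data are hypothetical, and a value that happens to repeat across siblings would be mislabeled as L.

```python
def label_value_grid(table_a, table_b):
    """Mark each cell 'L' if identical in both sibling tables, else 'V'."""
    return [
        "".join("L" if a == b else "V" for a, b in zip(row_a, row_b))
        for row_a, row_b in zip(table_a, table_b)
    ]


def detect_pattern(grid):
    """Name the structure pattern that a grid of L/V marks follows, if any."""
    if not grid or not grid[0]:
        return None
    if all(row == "LV" for row in grid):
        return "({L} {V})^n"            # one label/value pair per row
    n = len(grid[0])
    if (len(grid) >= 2 and grid[0] == "L" * n
            and all(row == "V" * n for row in grid[1:])):
        return "({L})^n (({V})^n)^+"    # header row, then value rows
    return None


# Hypothetical sibling tables.
a = [["Make", "Year", "Price"], ["Nissan", "1994", "$2,500"], ["Nissan", "1996", "$4,100"]]
b = [["Make", "Year", "Price"], ["Toyota", "1999", "$6,300"], ["Honda", "2001", "$7,800"]]
grid = label_value_grid(a, b)
print(grid)
print(detect_pattern(grid))
```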
19 Pattern Usage
20 Dynamic Pattern Adjustment
21 TISP++ Automatic ontology generation Automatic information annotation
22 Ontology Generation – OSM Object set: table labels – Lexical: labels that associate with actual values – Non-lexical: labels that associate with other tables Relationship set: table nesting Constraints: updates based on observation
23 Ontology Generation – OWL Object set: OWL class Relationship set: OWL object property Lexical object set: – OWL data type property – Different annotation properties to keep track of the provenance
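The mapping above could be sketched as a small triple generator. This is an assumption-laden illustration, not the TISP++ generator: the `ex:` prefix, the property-naming scheme, and the sample labels are hypothetical, and provenance annotation properties are left out. A label mapped to `None` is treated as lexical; a label mapped to a dict is treated as a nested table (non-lexical).

```python
def owl_triples(name, children, ns="ex:"):
    """Yield (subject, predicate, object) triples for one non-lexical label."""
    cls = ns + name
    yield (cls, "rdf:type", "owl:Class")
    for child, grandchildren in children.items():
        if grandchildren is None:
            # Lexical object set: becomes a datatype property of the class.
            prop = f"{ns}{name}-{child}"
            yield (prop, "rdf:type", "owl:DatatypeProperty")
            yield (prop, "rdfs:domain", cls)
        else:
            # Table nesting: becomes an object property linking two classes.
            prop = f"{ns}{name}-has-{child}"
            yield (prop, "rdf:type", "owl:ObjectProperty")
            yield (prop, "rdfs:domain", cls)
            yield (prop, "rdfs:range", ns + child)
            yield from owl_triples(child, grandchildren, ns)


# Hypothetical label structure from an interpreted table.
schema = {"Name": None, "GeneModel": {"Protein": None}}
triples = list(owl_triples("Gene", schema))
for triple in triples:
    print(triple)
```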
Generated Ontology
26 RDF Graph
27 Query the Data Find the protein and amino-acid information for gene “cdk-4”
28 TISP Evaluation Applications – Commercial: car ads – Scientific: molecular biology – Geopolitical: US states and countries Data: > 2,000 tables in 35 sites Evaluation – Initial two sibling pages Correct separation of data tables from layout tables? Correct pattern recognition? – Remaining tables in site Information properly extracted? Able to detect and adjust for pattern variations?
29 Experimental Results Table recognition: correctly discarded 157 of 158 layout tables Pattern recognition: correctly found 69 of 72 structure patterns Extraction and adjustments: 5 path adjustments and 34 label adjustments all correct
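As a quick check of what the counts above amount to, the rates can be computed directly. The dictionary keys are descriptive names chosen here; the counts are those on the slide.

```python
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
    "path and label adjustments": (5 + 34, 5 + 34),
}

# Percentage correct for each evaluation step, rounded to one decimal place.
rates = {name: round(100 * ok / total, 1) for name, (ok, total) in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

This gives roughly 99.4% for table recognition, 95.8% for pattern recognition, and 100% for the adjustments.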
30 TISP++ Performance Performance depends on TISP TISP test set – Generates all ontologies correctly – Annotates all information in tables correctly
31 Form-based Ontology Creation and Information Harvesting (FOCIH) Personalized ontology creation by form – General familiarity – Reasonable conceptual framework – Appropriate correspondence Transformable to ontological descriptions Capable of accepting source data Automated ontology creation Automated information harvesting
32 Form Creation
33 Created Sample Form
34 Generated Ontology View
35 Source-to-Form Mapping
39 Almost Ready to Harvest Need reading path: DOM-tree structure Need to resolve mapping problems – Pattern recognition – Instance recognition
40 Reading Path
41 Pattern & Instance Recognition
43 Pattern & Instance Recognition regular expression for decimal number left context right context
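The idea of a value expression framed by left and right context could look like the sketch below. The helper name `extract`, the exact decimal regex, and the sample sentence are assumptions for illustration; the slide names only the three parts (left context, regular expression for a decimal number, right context).

```python
import re

# Assumed form of a decimal-number expression: digits with an optional fraction.
DECIMAL = r"\d+(?:\.\d+)?"


def extract(text, left, right, value=DECIMAL):
    """Return the first `value` match framed by literal left and right context."""
    pattern = re.escape(left) + r"\s*(" + value + r")\s*" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical source string.
sample = "Population density: 87.4 per sq mi"
found = extract(sample, "density:", "per")
print(found)
```

Anchoring on both contexts lets the same value expression pick out the intended number even when the string contains several decimals.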
44 Pattern & Instance Recognition list pattern, delimiter is “,”
45 Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma
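Both kinds of list pattern (a plain comma delimiter, and a delimiter made of a percentage number plus a comma) can be sketched with a regex split. The helper name, the delimiter expressions, and the sample strings are hypothetical; this is one plausible reading of the slides, not FOCIH's code.

```python
import re


def split_list(text, delimiter=r"\s*,\s*"):
    """Split a list-valued string on a delimiter regex, dropping empty items."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


# Assumed delimiter: a percentage number followed by an optional comma.
PERCENT_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?\s*"

# Hypothetical list-valued strings.
languages = split_list("English, Spanish, French")
colors = split_list("Red 40%, Green 35.5%, Blue 24.5%", PERCENT_COMMA)
print(languages)
print(colors)
```

Treating the percentage as part of the delimiter means only the item names survive, which is the behavior wanted when the form field expects names alone.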
47 Can Now Harvest
50 Semantic Annotation
55 Semantic Query
56 FOCIH Performance Ontology creation Semantic annotation – Depends on TISP performance – Depends on pattern and instance recognition performance
57 FOCIH Performance Pattern and instance recognition: – Works with highly regular data – Tested 71 mappings – 25 full-string values (25/25 correct) – 38 substring values (29/38 correct) – 8 list patterns (6/8 correct)
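The per-category counts above can be totaled to confirm that they sum to the 71 tested mappings and to get an overall rate. The category names are labels chosen here; the counts are those on the slide.

```python
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

tested = sum(total for _, total in mappings.values())
correct = sum(ok for ok, _ in mappings.values())
overall = round(100 * correct / tested, 1)

for name, (ok, total) in mappings.items():
    print(f"{name}: {ok}/{total} = {round(100 * ok / total, 1)}%")
print(f"overall: {correct}/{tested} = {overall}%")
```

Overall, 60 of the 71 mappings are correct, about 84.5%, with the errors concentrated in substring values and list patterns.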
58 FOCIH Difficulties
60 FOCIH Difficulties No selection
61 WoK via TISP
63 WoK via FOCIH
65 Contributions TISP: automatic sibling table interpretation TISP++: – Automatic ontology generation based on interpreted tables – Automatic semantic annotation for interpreted tables FOCIH: – Semi-automatic personalized ontology creation – Automatic personalized information harvesting and semantic annotation All together: contributes to turning the current Web of Pages into a Web of Knowledge
66 Future Work Sibling pages in addition to sibling tables Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.