1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages
2 Motivation
Birth date of my great-grandpa
Price and mileage of red Nissans, 1990 or newer
Protein and amino-acid information for gene cdk-4
US states with property crime rates above 1%
3 Search by Search Engine
4 Search the Hidden Web
The Hidden Web:
– Hidden behind forms
– Hard to query
“cdk-4”
5 Query for Data
The Hidden Web:
– Hidden behind forms
– Hard to query
Find the protein and amino-acid information for gene “cdk-4”
6 From a Web of Pages to a Web of Knowledge
Web of Knowledge
– Machine-“understandable”
– Publicly accessible
– Queryable by standard query languages
Semantic annotation
– Domain ontologies
– Populated conceptual model
Problems to resolve
– How do we create ontologies?
– How do we annotate pages for ontologies?
7 Contributions of Dissertation Work
Web of Pages to Web of Knowledge
– Knowledge & meta-knowledge extraction
– Reformulation as machine-“understandable” knowledge
Automatic & semi-automatic solutions via:
– Sibling tables (TISP/TISP++)
– User-created forms (FOCIH)
8 Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
Recognize tables (discard non-tables)
Locate table labels
Locate table values
Find label/value associations
9 Recognize Tables
Data table
Layout tables (discard)
Nested data tables
10 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918 1212
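The association in the example pairs a dotted label path with a value. A minimal Python sketch of that idea follows, assuming a nested dict stands in for an interpreted table; the dict layout and the helper name `flatten_label_paths` are hypothetical and not TISP's actual data model. It handles only a single label path, not the two-dimensional (column label, row label) case shown on the slide.

```python
def flatten_label_paths(table, prefix=()):
    """Yield (label_path, value) pairs from a nested dict of labels."""
    for label, content in table.items():
        path = prefix + (label,)
        if isinstance(content, dict):
            # The label leads to a nested table: descend one level.
            yield from flatten_label_paths(content, path)
        else:
            # The label leads to an actual value: emit the dotted path.
            yield ".".join(path), content


# Hypothetical stand-in for the interpreted table on the slide.
interpreted = {
    "Identification": {
        "Gene model(s)": {
            "Protein": "WP:CE28918",
        }
    }
}

pairs = list(flatten_label_paths(interpreted))
```

Each pair holds the full path from the outermost label down to the label directly attached to the value.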
11 Interpretation Technique: Sibling Page Comparison
12 Interpretation Technique: Sibling Page Comparison Same
13 Interpretation Technique: Sibling Page Comparison Almost Same
14 Interpretation Technique: Sibling Page Comparison Different Same
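The "Same" and "Different" callouts suggest the core idea: cells that repeat across sibling pages are likely labels, and cells that vary are likely values. A minimal sketch under that reading, assuming both tables are already aligned lists of rows with the same shape; the function name and the car-ad data are hypothetical.

```python
def classify_cells(table_a, table_b):
    """Mark each cell 'L' (label) if identical in both siblings, else 'V' (value)."""
    marks = []
    for row_a, row_b in zip(table_a, table_b):
        marks.append(["L" if a == b else "V" for a, b in zip(row_a, row_b)])
    return marks


# Two hypothetical sibling tables from the same site.
page_1 = [["Make", "Nissan"], ["Year", "1995"]]
page_2 = [["Make", "Toyota"], ["Year", "2001"]]

marks = classify_cells(page_1, page_2)
```

A value that happens to be identical on both pages would be marked as a label here; comparing more than two sibling pages is one way to reduce that risk.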
15 Technique Details
Unnest tables
Match tables in sibling pages
– “Perfect” match (layout table: discard)
– “Reasonable” match (sibling table)
Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
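The perfect/reasonable distinction can be illustrated with a simple similarity score. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the matching method used in the dissertation, and the threshold value and function names are arbitrary assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def flatten(table):
    """Read a table (list of rows) as one flat sequence of cell strings."""
    return [cell for row in table for cell in row]


def match_kind(table_a, table_b, sibling_threshold=0.3):
    """Classify a pair of tables taken from two sibling pages."""
    ratio = SequenceMatcher(None, flatten(table_a), flatten(table_b)).ratio()
    if ratio == 1.0:
        return "layout"     # "perfect" match: nothing varies, so discard
    if ratio >= sibling_threshold:
        return "sibling"    # "reasonable" match: shared labels, varying values
    return "unmatched"


# Hypothetical tables from two sibling pages.
nav_a = [["Home", "About", "Contact"]]
nav_b = [["Home", "About", "Contact"]]
data_a = [["Make", "Nissan"], ["Year", "1995"]]
data_b = [["Make", "Toyota"], ["Year", "2001"]]
```

With this data, the navigation tables match perfectly and would be discarded as layout, while the data tables share only their labels and would be kept as sibling tables.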
16 Table Unnesting
17 Table Structure Patterns
Regularity expectations:
– ({L} {V})^n
– ({L})^n (({V})^n)^+
– …
Pattern combinations are also possible.
18 Table Structure Patterns
({L})^n (({V})^n)^+
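Read as a regular expression over cells, this pattern describes one row of n labels followed by one or more rows of n values, while the other pattern on the previous slide describes n alternating label/value pairs. A minimal sketch of applying both, assuming that reading and assuming tables arrive as plain Python lists; the function names and sample data are hypothetical.

```python
def apply_row_pattern(rows):
    """One row of n labels, then one or more rows of n values."""
    labels, *value_rows = rows
    if not value_rows or any(len(r) != len(labels) for r in value_rows):
        raise ValueError("rows do not fit the label-row/value-rows pattern")
    return [dict(zip(labels, r)) for r in value_rows]


def apply_pair_pattern(cells):
    """n alternating (label, value) pairs in a flat cell sequence."""
    if len(cells) % 2:
        raise ValueError("cells do not fit the label/value pair pattern")
    return dict(zip(cells[0::2], cells[1::2]))


# Hypothetical car-ad data in each layout.
records = apply_row_pattern([
    ["Make", "Year"],
    ["Nissan", "1995"],
    ["Toyota", "2001"],
])
record = apply_pair_pattern(["Make", "Nissan", "Year", "1995"])
```

Raising an error on a misfit is a simplification; the slides describe adjusting the pattern dynamically when a table deviates from it.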
19 Pattern Usage
20 Dynamic Pattern Adjustment
21 TISP++
Automatic ontology generation
Automatic information annotation
22 Ontology Generation – OSM
Object set: table labels
– Lexical: labels that associate with actual values
– Non-lexical: labels that associate with other tables
Relationship set: table nesting
Constraints: updates based on observation
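The mapping from table structure to object and relationship sets can be sketched as a walk over a nested table. This is an illustration only: the nested-dict representation, the function name `derive_object_sets`, and the sample values are assumptions, and constraint discovery is left out.

```python
def derive_object_sets(table, parent=None, model=None):
    """Collect lexical/non-lexical object sets and nesting relationships."""
    if model is None:
        model = {"lexical": [], "non_lexical": [], "relationships": []}
    for label, content in table.items():
        nested = isinstance(content, dict)
        # Labels over nested tables are non-lexical; labels over values are lexical.
        model["non_lexical" if nested else "lexical"].append(label)
        if parent is not None:
            # Table nesting yields a relationship between parent and child label.
            model["relationships"].append((parent, label))
        if nested:
            derive_object_sets(content, label, model)
    return model


# Hypothetical interpreted table; the values are placeholders.
model = derive_object_sets({
    "Identification": {
        "Name": "cdk-4",
        "Gene model(s)": {"Protein": "WP:CE28918"},
    }
})
```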
23 Ontology Generation – OWL
Object set: OWL class
Relationship set: OWL object property
Lexical object set:
– OWL datatype property
– Different annotation properties to keep track of provenance
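The correspondence above can be illustrated by emitting Turtle-style declarations from plain Python strings. This is a sketch, not the TISP++ generator: the `ex:` prefix, the `has…` property naming, and both function names are assumptions, and prefix declarations and the provenance annotation properties are omitted.

```python
import re


def local_name(label):
    """Turn a table label into a CamelCase name, e.g. 'Gene model(s)' -> 'GeneModelS'."""
    return "".join(w.capitalize() for w in re.findall(r"[A-Za-z0-9]+", label))


def to_turtle(non_lexical, lexical, relationships, prefix="ex"):
    """Emit one declaration per class, object property, and datatype property."""
    lines = []
    for label in non_lexical:
        lines.append(f"{prefix}:{local_name(label)} a owl:Class .")
    for parent, child in relationships:
        domain = f"{prefix}:{local_name(parent)}"
        name = local_name(child)
        if child in lexical:
            # Lexical object set: datatype property on the parent class.
            lines.append(f"{prefix}:{name} a owl:DatatypeProperty ; rdfs:domain {domain} .")
        else:
            # Relationship between two classes: object property.
            lines.append(
                f"{prefix}:has{name} a owl:ObjectProperty ; "
                f"rdfs:domain {domain} ; rdfs:range {prefix}:{name} ."
            )
    return lines


turtle = to_turtle(
    non_lexical=["Identification", "Gene model(s)"],
    lexical=["Protein"],
    relationships=[("Identification", "Gene model(s)"), ("Gene model(s)", "Protein")],
)
```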
24 Generated Ontology
26 RDF Graph
27 Query the Data
Find the protein and amino-acid information for gene “cdk-4”
28 TISP Evaluation
Applications
– Commercial: car ads
– Scientific: molecular biology
– Geopolitical: US states and countries
Data: > 2,000 tables in 35 sites
Evaluation
– Initial two sibling pages
  Correct separation of data tables from layout tables?
  Correct pattern recognition?
– Remaining tables in site
  Information properly extracted?
  Able to detect and adjust for pattern variations?
29 Experimental Results
Table recognition: correctly discarded 157 of 158 layout tables
Pattern recognition: correctly found 69 of 72 structure patterns
Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
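For readers who prefer rates, the counts above convert to percentages as follows. The percentages are derived here from the reported counts and are not stated on the slide; the dictionary keys are descriptive labels chosen for this sketch.

```python
# (correct, total) pairs taken from the reported counts.
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
    "path adjustments": (5, 5),
    "label adjustments": (34, 34),
}

rates = {name: correct / total for name, (correct, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

This gives roughly 99.4% for table recognition and 95.8% for pattern recognition.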
30 TISP++ Performance
Performance depends on TISP
TISP test set
– Generates all ontologies correctly
– Annotates all information in tables correctly
31 Form-based Ontology Creation and Information Harvesting (FOCIH)
Personalized ontology creation by form
– General familiarity
– Reasonable conceptual framework
– Appropriate correspondence
  Transformable to ontological descriptions
  Capable of accepting source data
Automated ontology creation
Automated information harvesting
32 Form Creation
33 Created Sample Form
34 Generated Ontology View
35 Source-to-Form Mapping
36 Source-to-Form Mapping
37 Source-to-Form Mapping
38 Source-to-Form Mapping
39 Almost Ready to Harvest
Need reading path: DOM-tree structure
Need to resolve mapping problems
– Pattern recognition
– Instance recognition
40 Reading Path
41 Pattern & Instance Recognition
42 Pattern & Instance Recognition
43 Pattern & Instance Recognition
Regular expression for decimal number
Left context
Right context
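The three callouts describe a substring pattern: a value expression framed by the text on either side of it. A minimal sketch with Python's `re` module, assuming the contexts are literal strings; the function name, the exact decimal expression, and the sample text are hypothetical and not FOCIH's generated patterns.

```python
import re

# Assumed regular expression for a decimal number such as 33 or 33.7.
DECIMAL = r"\d+(?:\.\d+)?"


def extract_between(text, left, right, value_pattern=DECIMAL):
    """Return the value found between a literal left and right context, or None."""
    pattern = re.escape(left) + r"\s*(" + value_pattern + r")\s*" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical source text.
weight = extract_between("Molecular weight: 33.7 kDa", "Molecular weight:", "kDa")
```

When either context is missing from the text, the function returns `None` rather than guessing.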
44 Pattern & Instance Recognition list pattern, delimiter is “,”
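A list pattern with a literal delimiter amounts to splitting the source string and trimming each item. A minimal sketch, with a hypothetical function name and sample text:

```python
def split_list(text, delimiter=","):
    """Split a delimited string into trimmed, non-empty items."""
    return [item.strip() for item in text.split(delimiter) if item.strip()]


# Hypothetical source value.
languages = split_list("English, Spanish, Hawaiian")
```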
45 Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma
46 Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma
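When the delimiter is itself a pattern, the split can use a regular expression instead of a literal string. The sketch below assumes the delimiter is a percentage number followed by an optional comma; the exact expression, the function name, and the sample text are assumptions for illustration.

```python
import re

# Assumed delimiter: a percentage number, then an optional comma.
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?\s*"


def split_on_pattern(text, delimiter_pattern=PERCENT_AND_COMMA):
    """Split text on a regular-expression delimiter and drop empty items."""
    return [item for item in re.split(delimiter_pattern, text) if item]


# Hypothetical source value.
groups = split_on_pattern("Christian 78.5%, Jewish 1.7%, Muslim 0.6%")
```

The percentages are discarded here because they act only as delimiters; harvesting them as values would need a separate mapping.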
47 Can Now Harvest
48 Can Now Harvest
49 Can Now Harvest
50 Semantic Annotation
51 Semantic Annotation
52 Semantic Annotation
53 Semantic Annotation
54 Semantic Annotation
55 Semantic Query
56 FOCIH Performance
Ontology creation
Semantic annotation
– Depends on TISP performance
– Depends on pattern and instance recognition performance
57 FOCIH Performance
Pattern and instance recognition:
– Works with highly regular data
– Tested 71 mappings
– 25 full-string values (25/25 correct)
– 38 substring values (29/38 correct)
– 8 list patterns (6/8 correct)
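The three categories add up to the 71 tested mappings, and the per-category counts imply an overall rate. The short check below derives those figures from the reported counts; the overall rate is a derived number, not one stated on the slide.

```python
# (correct, total) per mapping category, taken from the reported counts.
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

total = sum(n for _, n in mappings.values())
correct = sum(c for c, _ in mappings.values())
overall = correct / total

for name, (c, n) in mappings.items():
    print(f"{name}: {c}/{n} = {c / n:.1%}")
print(f"overall: {correct}/{total} = {overall:.1%}")
```

That is 60 of 71 mappings correct, about 84.5%, with substring values (about 76.3%) and list patterns (75%) accounting for all of the errors.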
58 FOCIH Difficulties
59 FOCIH Difficulties
60 FOCIH Difficulties No selection
61 WoK via TISP
62 WoK via TISP
63 WoK via FOCIH
64 WoK via FOCIH
65 Contributions
TISP: automatic sibling table interpretation
TISP++:
– Automatic ontology generation based on interpreted tables
– Automatic semantic annotation for interpreted tables
FOCIH:
– Semi-automatic personalized ontology creation
– Automatic personalized information harvesting and semantic annotation
All together: contributes to turning the current web of pages into a web of knowledge
66 Future Work
Sibling pages in addition to sibling tables
Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies