Towards the Self-Annotating Web
Philipp Cimiano, Siegfried Handschuh, Steffen Staab
Presenter: Hieu K Le (most of the slides come from Philipp Cimiano)
CS598CXZ, Spring, UIUC
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
The annotation problem in 4 cartoons
The annotation problem from a scientific point of view
The annotation problem in practice
The vicious cycle
Annotating: A Noun → A Concept?
Annotating To annotate terms in a web page: –Manually defining annotations –Learning extraction rules Both require a lot of labor
A small Quiz What is “Laksa”? A. A dish B. A city C. A temple D. A mountain The answer is: A. A dish
From Google: “Laksa”
From Google:
–"cities such as Laksa": 0 hits
–"dishes such as Laksa": 10 hits
–"mountains such as Laksa": 0 hits
–"temples such as Laksa": 0 hits
Google knows more than all of you together! An example of using syntactic information + statistics to derive semantic information
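In code, the slide's idea fits in a few lines; the hit counts below are the ones reported on the slide, and this is only a toy illustration of the selection step:

```python
# Google hit counts for '<concept>s such as Laksa', as reported on the slide
hits = {"city": 0, "dish": 10, "mountain": 0, "temple": 0}

# semantics ~ syntax + statistics: pick the concept with maximal evidence
print(max(hits, key=hits.get))  # -> "dish"
```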
Self-annotating PANKOW (Pattern-based Annotation through Knowledge On the Web) –Unsupervised –Pattern-based –Within a fixed ontology –Involves information from the whole Web
The Self-Annotating Web There is a huge amount of implicit knowledge in the Web Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle: semantics ≈ syntax + statistics? Annotation by maximal statistical evidence
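Stated as a formula (a restatement of the slide's idea, using the hit-count function count(i, c, p) defined formally on the "Asking Google" slide), annotation by maximal statistical evidence picks:

```latex
c^{*}(i) = \operatorname*{argmax}_{c \in C} \; \sum_{p \in P} \mathrm{count}(i, c, p)
```

where $i$ is the instance to annotate, $C$ is the set of ontology concepts, and $P$ is the set of patterns.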
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
PANKOW Process
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
Patterns
HEARST1: <concept>s such as <instance>
HEARST2: such <concept>s as <instance>
HEARST3: <concept>s, (especially|including) <instance>
HEARST4: <instance> (and|or) other <concept>s
Examples: –dishes such as Laksa –such dishes as Laksa –dishes, especially Laksa –dishes, including Laksa –Laksa and other dishes –Laksa or other dishes
Patterns (Cont'd)
DEFINITE1: the <instance> <concept>
DEFINITE2: the <concept> <instance>
APPOSITION: <instance>, a <concept>
COPULA: <instance> is a <concept>
Examples: the Laksa dish the dish Laksa Laksa, a dish Laksa is a dish
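A minimal sketch of pattern instantiation for one instance/concept pair; the "+s" pluralization is my simplification (a real system would need a lexicon for irregular plurals):

```python
def instantiate_patterns(instance: str, concept: str) -> list[str]:
    """Generate the ten query strings for all patterns (naive pluralization)."""
    plural = concept + "s"  # simplification; irregular plurals need a lexicon
    return [
        f"{plural} such as {instance}",      # HEARST1
        f"such {plural} as {instance}",      # HEARST2
        f"{plural}, especially {instance}",  # HEARST3 (especially)
        f"{plural}, including {instance}",   # HEARST3 (including)
        f"{instance} and other {plural}",    # HEARST4 (and)
        f"{instance} or other {plural}",     # HEARST4 (or)
        f"the {instance} {concept}",         # DEFINITE1
        f"the {concept} {instance}",         # DEFINITE2
        f"{instance}, a {concept}",          # APPOSITION
        f"{instance} is a {concept}",        # COPULA
    ]

print(instantiate_patterns("Laksa", "dish"))
```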
Asking Google (more formally) Instance i ∈ I, concept c ∈ C, pattern p ∈ {HEARST1, ..., COPULA} count(i,c,p) returns the number of Google hits of the instantiated pattern E.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ... Restrict to the best concepts beyond a threshold
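A sketch of the scoring step, reusing the `instantiate_patterns` helper above; `hit_count` is a placeholder for whatever search-engine hit-count API is available (the original system queried the Google API), and the threshold and top-n values are illustrative, not the paper's settings:

```python
def hit_count(query: str) -> int:
    """Placeholder: return the number of web search hits for an exact-phrase query."""
    raise NotImplementedError("plug in a search engine API here")

def count(instance: str, concept: str) -> int:
    """count(i, c) := sum over all patterns p of count(i, c, p)."""
    return sum(hit_count(q) for q in instantiate_patterns(instance, concept))

def categorize(instance: str, concepts: list[str], threshold: int = 50, top_n: int = 5):
    """Keep only concepts whose aggregated evidence beats the threshold; return the best."""
    scored = [(c, count(instance, c)) for c in concepts]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)[:top_n]
```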
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
Evaluation Scenario Corpus: 45 texts from http://www.lonelyplanet.com (destinations) Ontology: tourism ontology from the GETESS project –#concepts: original – 1043; pruned – 682 Manual annotation by two subjects: –A: 436 instance/concept assignments –B: 392 instance/concept assignments –Overlap: 277 instances (gold standard) –A and B used 59 different concepts –Categorial (Kappa) agreement on the 277 instances: 63.5%
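The 63.5% figure is the Kappa statistic over the shared instances; as a minimal, self-contained sketch (with invented toy data, not the real annotations), Cohen's kappa for two annotators' concept assignments looks like:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators' concept assignments over the same instances."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[c] * cb[c] for c in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

# Toy example (not the real annotation data):
print(cohens_kappa(["city", "dish", "dish"], ["city", "dish", "river"]))  # 0.5
```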
Examples (instance → concept): Atlantic → city; Bahamas → island; USA → country; Connecticut → state; Caribbean → sea; Mediterranean → sea; Canada → country; Guatemala → city; Africa → region; Australia → country; France → country; Germany → country; Easter → island; St Lawrence → river; Commonwealth → state; New Zealand → island; Adriatic → sea; Netherlands → country; St John → church; Belgium → country; San Juan → island; Mayotte → island; EU → country; UNESCO → organization; Austria → group; Greece → island; Malawi → lake; Israel → country; Perth → street; Luxembourg → city; Nigeria → state; St Croix → river; Nakuru → lake; Kenya → country; Benin → city; Cape Town → city
Results (automatic mode): F-measure = 28.24%, Recall/Accuracy = 24.90%
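The slide reports F and R/Acc but not precision; assuming the standard balanced F-measure, the implied precision can be recovered:

```latex
F = \frac{2PR}{P + R}
\quad\Rightarrow\quad
P = \frac{F \cdot R}{2R - F}
  = \frac{28.24 \times 24.90}{2 \times 24.90 - 28.24}
  \approx 32.6\%
```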
Comparison
System         | #Concepts | Preprocessing / Cost       | Accuracy
[MUC-7]        | 3         | various (?)                | >> 90%
[Fleischman02] | 8         | n-gram extraction ($)      | 70.4%
PANKOW         | 59        | none                       | 24.9%
[Hahn98]-TH    | 196       | syn. & sem. analysis ($$$) | 21%
[Hahn98]-CB    | 196       | syn. & sem. analysis ($$$) | 26%
[Hahn98]-CB    | 196       | syn. & sem. analysis ($$$) | 31%
[Alfonseca02]  | 1200      | syn. analysis ($$)         | 17.39% (strict)
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
CREAM/OntoMat [architecture diagram] The Annotation Tool GUI (Document Editor / Viewer, Ontology Guidance & Fact Browser, Annotation by Markup) connects to Document Management, an Annotation Inference Server (query/extract), and domain ontologies (load); the PANKOW plugin annotates by querying the WWW; web pages are crawled and annotated web pages are produced
PANKOW & CREAM/OntoMat
Results (Interactive Mode): F-measure = 51.65%, Recall/Accuracy = 49.46%
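By the same F-measure identity as before, the implied precision in interactive mode is:

```latex
P = \frac{F \cdot R}{2R - F} = \frac{51.65 \times 49.46}{2 \times 49.46 - 51.65} \approx 54.0\%
```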
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
Current State-of-the-art Large-scale IE –only disambiguation Standard IE (MUC) –needs handcrafted rules ML-based IE –needs a hand-annotated training corpus –does not scale to large numbers of concepts –rule induction takes time KnowItAll (Etzioni et al., WWW'04) –shallow (pattern-matching-based) approach
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
Conclusion Summary: –new paradigm to overcome the annotation problem –unsupervised instance categorization –first step towards the self-annotating Web –difficult task: open domain, many categories –decent precision, low recall –very good results in interactive mode –currently inefficient (590 Google queries per instance: 59 concepts × 10 patterns) Challenges: –contextual disambiguation –annotating relations (currently restricted to instances) –scalability (e.g. only send reasonable queries to Google) –accurate recognition of Named Entities (currently a POS tagger)
Outline Introduction The Process of PANKOW Pattern-based categorization Evaluation Integration to CREAM Related work Conclusion
Thanks to… Philipp Cimiano (…karlsruhe.de) for the slides The audience for listening