What’s the difference between Tony Blair and Mother Theresa? (Human Language Technology for Preservation return on investment)


1 What’s the difference between Tony Blair and Mother Theresa? (Human Language Technology for Preservation return on investment)
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham
Dept. Computer Science, University of Sheffield
Alghero, March 2004

2 20th Century Rot
20th Century audio-visual media is rapidly disappearing
Preservation and restoration are high cost
The costs must be justified by increased access
“Metadata”: descriptive information about content
Therefore the rest of the talk will cover:
–rich metadata and semantic access
–cross-lingual access
–syndicated delivery
–repurposable content

3 IT context: the Knowledge Economy and Human Language
Gartner, December 2002:
–taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
–through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction:
–to deal with the information deluge we need formal knowledge in semantics-based systems
–our archived history is in informal and ambiguous natural language
The challenge: to reconcile these two phenomena

4 HLT: Closing the Loop
[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; CLIE; (M)NLG; Controlled Language; OIE; Semantic Web; Semantic Grid; Semantic Web Services]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE

5 Information Extraction
Information Extraction (IE) pulls facts and structured information from the content of large text collections.
Contrast IE and Information Retrieval
NLP history: from NLU to IE
Progress driven by quantitative measures
MUC: Message Understanding Conferences
ACE: Automatic Content Extraction

6 IE Example
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
Named entities (NE): "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
Co-reference resolution (CO): "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
Template Elements (TE): the rocket is "shiny red" and Head's "brainchild".
Template Relations (TR): Dr. Head works for We Build Rockets Inc.
Scenario Templates (ST): a rocket launching event occurred with the various participants.
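The five levels in the rocket example can be pictured as progressively richer data structures. The following is only a minimal sketch in Python: the class names (`Entity`, `Relation`, `Scenario`), the entity kinds and the slot names are hypothetical, chosen for illustration, and are not taken from MUC or from GATE.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                                        # NE: the canonical mention
    kind: str                                        # NE: hypothetical type label
    aliases: list = field(default_factory=list)      # CO: other mentions of the same thing
    attributes: dict = field(default_factory=dict)   # TE: descriptive attributes


@dataclass
class Relation:                                      # TR: a link between two entities
    subject: str
    predicate: str
    obj: str


@dataclass
class Scenario:                                      # ST: an event with its participants
    event: str
    participants: dict


rocket = Entity("rocket", "Artefact", aliases=["it"],
                attributes={"description": "shiny red",
                            "brainchild_of": "Dr. Big Head"})
head = Entity("Dr. Big Head", "Person", aliases=["Dr. Head"])
company = Entity("We Build Rockets Inc.", "Organisation",
                 aliases=["We Build Rockets"])
tuesday = Entity("Tuesday", "Date")
entities = [rocket, head, company, tuesday]

works_for = Relation(head.name, "works_for", company.name)
launch = Scenario("rocket_launch",
                  {"vehicle": rocket.name, "date": tuesday.name})


def resolve(mention, known):
    """Return the entity a mention refers to (exact name or alias), else None."""
    for entity in known:
        if mention == entity.name or mention in entity.aliases:
            return entity
    return None
```

Here `resolve("Dr. Head", entities)` returns the same object as `resolve("Dr. Big Head", entities)`, which is all that co-reference resolution amounts to in this toy form.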

7 Performance levels
(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)
Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but: human level may be only 80%)
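The slide does not say which measure the percentages refer to. Evaluations in the MUC tradition usually report precision, recall and their harmonic mean (F-measure), so the sketch below assumes that reading. The gold and predicted sets are made-up illustrative data built from the rocket example, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: (mention, type) pairs a human annotator marked as correct ...
gold = {("rocket", "Artefact"), ("Tuesday", "Date"),
        ("Dr. Head", "Person"), ("We Build Rockets", "Organisation")}
# ... and what a hypothetical system produced (one wrong type, one miss).
predicted = {("Tuesday", "Date"), ("Dr. Head", "Person"),
             ("We Build Rockets", "Person")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With this data two of three predictions are correct and two of four gold items are found, which gives a precision of 2/3 and a recall of 1/2.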

8 Ontology-based IE
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
[Diagram: Ontology & KB. Node and edge labels: Company, City, Country, Location, type, HQ, establOn, partOf, “03/11/1978”, XYZ, London, UK, Bulgaria]
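One common way to hold the output of ontology-based IE is as subject/predicate/object triples. The sketch below uses the labels that survive from the slide's diagram, but the exact edges of the original figure cannot be recovered from the transcript, so the triples are an assumed reading, and the helper names `objects` and `located_in` are hypothetical.

```python
# Assumed reading of the diagram: instances are typed by ontology classes
# and linked to each other by properties.
triples = [
    ("XYZ", "type", "Company"),
    ("XYZ", "establOn", "03/11/1978"),
    ("XYZ", "HQ", "London"),
    ("London", "type", "City"),
    ("London", "partOf", "UK"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the knowledge base."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def located_in(place):
    """Follow partOf edges upward and list every enclosing location."""
    chain = []
    current = place
    while True:
        parents = objects(current, "partOf")
        if not parents or parents[0] in chain:   # stop at the top or on a cycle
            return chain
        current = parents[0]
        chain.append(current)


hq = objects("XYZ", "HQ")[0]
print(hq, located_in(hq))
```

Once the text is in this form, a question such as "in which country is XYZ headquartered?" becomes a short graph walk rather than a keyword search.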

9 Classes, instances & metadata
[Diagram, "Classes+instances before": Entity; Person; …; Job-title (president, chancellor, minister, …); G.Brown; Bush]
“Gordon Brown met George Bush during his two day visit.”
http://… 1.html
0 12 Gordon Brown …#Person …#Person12345
18 32 George Bush …#Person …#Person67890
[Diagram: "Classes+instances after"]
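The metadata rows on this slide look like stand-off annotations: character offsets into the document plus a class and an instance identifier. A minimal sketch of that idea follows. The dictionary keys and the `annotate` helper are hypothetical, the identifiers are copied from the slide with their elisions intact, and the offsets are computed from this one sentence, so they need not match the slide's figures, which presumably refer to the full source document.

```python
text = "Gordon Brown met George Bush during his two day visit."


def annotate(text, mention, cls, instance):
    """Build a stand-off annotation for the first occurrence of a mention."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"{mention!r} not found in text")
    return {"start": start, "end": start + len(mention),
            "string": mention, "class": cls, "instance": instance}


annotations = [
    annotate(text, "Gordon Brown", "…#Person", "…#Person12345"),
    annotate(text, "George Bush", "…#Person", "…#Person67890"),
]

for a in annotations:
    # The text itself is untouched; the offsets point back into it.
    print(a["start"], a["end"], text[a["start"]:a["end"]], a["class"], a["instance"])
```

Keeping annotations separate from the text means several layers of metadata can describe the same document without interfering with one another.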

10 An example: the MUMIS project
Multimedia Indexing and Searching Environment
Composite index of a multimedia programme from multiple sources in different languages
ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
An important experimental result: multiple sources for the same events can improve extraction quality
–PrestoSpace applications in news and sports archiving
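Why several sources for the same event can improve quality is easy to see with a toy merge. The sketch below uses a simple per-slot majority vote; this is a stand-in technique chosen for illustration, not a description of the merging component MUMIS actually used, and the three extractions are made-up data.

```python
from collections import Counter


def merge_slots(extractions):
    """Merge several extractions of one event by majority vote on each slot."""
    merged = {}
    slots = {slot for e in extractions for slot in e}
    for slot in slots:
        values = [e[slot] for e in extractions if e.get(slot) is not None]
        if values:
            merged[slot] = Counter(values).most_common(1)[0][0]
    return merged


# Hypothetical extractions of the same goal from three reports.
sources = [
    {"event": "goal", "player": "Beckham", "minute": None},   # minute not found
    {"event": "goal", "player": "Beckam", "minute": 12},      # name misrecognised
    {"event": "goal", "player": "Beckham", "minute": 12},
]

merged = merge_slots(sources)
print(merged["event"], merged["player"], merged["minute"])
```

Each individual source is incomplete or noisy, yet the merged record has the right player and a minute value, because the sources rarely make the same mistake.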

11 Semantic Query
Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)
Instead: “goal events with scorer David Beckham”
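The difference between the two kinds of query can be sketched in a few lines. The event records below are made-up illustrative data and the function names are hypothetical; the point is only that a keyword match looks at the words, while a semantic query looks at the extracted event type and its roles.

```python
events = [
    {"type": "goal", "scorer": "David Beckham",
     "text": "Beckham scores from a free kick: goal!"},
    {"type": "miss", "player": "David Beckham",
     "text": "Beckham shoots wide of the goal."},
    {"type": "disallowed", "player": "Michael Owen",
     "text": "This was not a goal; Beckham protests."},
]


def keyword_search(events, *terms):
    """Return events whose text contains every term (case-insensitive)."""
    return [e for e in events
            if all(t.lower() in e["text"].lower() for t in terms)]


def semantic_query(events, event_type, **roles):
    """Return events of the given type whose role slots match exactly."""
    return [e for e in events
            if e["type"] == event_type
            and all(e.get(role) == value for role, value in roles.items())]


keyword_hits = keyword_search(events, "goal", "Beckham")
semantic_hits = semantic_query(events, "goal", scorer="David Beckham")
print(len(keyword_hits), len(semantic_hits))
```

The keyword search returns all three events, including the miss and the disallowed goal, while the semantic query returns only the event in which Beckham actually scored.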

12 The results: England win!

13 PSpace: good news and bad news
The good news: PrestoSpace has some of the world leaders in AI and metadata
The bad news: AI always fails
How does the machine tell the difference between “Mother Theresa is a saint” and “Tony Blair is a saint”?
(Or, who tells Google which statement is important?)
Other web users do, by linking (also cf. Amazon)
Two solutions to the AI problem:
–allow archivists and users to build their own models (simple specific models can succeed, but the cost may be too high)
–use recommender systems to make the user an archivist’s assistant (researchers and students may barter for access)
Any route to searchable content!

14 Syndication and Merging
The web promotes diversity, but also fragmentation
Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)
Now: many incompatible/inaccessible interfaces
Archives need to:
–pool their impact: syndication in networked communities
–support repurposable content
Therefore data must be presentation independent
Candidate technologies: XML, RSS, RDF, OWL (“semantic web”)
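Presentation independence means the archive stores one record and derives each delivery format from it. The sketch below renders the same record as a simplified RSS 2.0 feed and as plain text, using Python's standard `xml.etree.ElementTree`. The record, the `archive.example.org` URLs and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# One presentation-independent record; the content and URL are made up.
records = [{
    "title": "Match report",
    "link": "http://archive.example.org/items/1",
    "description": "Goal events with scorer David Beckham.",
}]


def to_rss(channel_title, channel_link, records):
    """Render the records as a simplified RSS 2.0 document (a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Syndicated archive items"
    for record in records:
        item = ET.SubElement(channel, "item")
        for name in ("title", "link", "description"):
            ET.SubElement(item, name).text = record[name]
    return ET.tostring(rss, encoding="unicode")


def to_plain_text(records):
    """Render the same records for a text-only interface."""
    return "\n".join(f"{r['title']}: {r['description']} <{r['link']}>"
                     for r in records)


feed = to_rss("Example archive", "http://archive.example.org/", records)
print(feed)
print(to_plain_text(records))
```

Neither renderer changes the record, so a partner site can syndicate the feed while the archive's own interface reuses the same data in a different layout.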

15 GATE, a General Architecture for Text Engineering
GATE is...
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, a graphical development environment.
GATE comes with...
Free components, and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL) at http://gate.ac.uk/download/
Used by thousands of people at hundreds of sites

16 A bit of a nuisance (GATE users)
GATE team projects.
Past:
–Conceptual indexing: MUMIS: automatic semantic indices for sports video
–MUSE, cross-genre entity finder
–HSL, Health-and-safety IE
–Old Bailey: collaboration with HRI on 17th century court reports
–Multiflora: plant taxonomy text analysis for biodiversity research e-science
Present:
–Advanced Knowledge Technologies: €12m UK five site collaborative project
–EMILLE: S. Asian languages corpus
–ACE / TIDES: Arabic, Chinese NE
–JHU summer w/s on semtagging
Future:
–Five new projects inc. PrestoSpace
Thousands of users at hundreds of sites. A representative sample:
–the American National Corpus project
–the Perseus Digital Library project, Tufts University, US
–Longman Pearson publishing, UK
–Merck KGaA, Germany
–Canon Europe, UK
–Knight Ridder, US
–BBN (leading HLT research lab), US
–SMEs inc. Sirma AI Ltd., Bulgaria
–Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
–UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

17 GATE – infrastructure for semantic metadata extraction
Combines learning and rule-based methods (new work on mixed-initiative learning)
Allows combination of IE and IR
Enables use of large-scale linguistic resources for IE, such as WordNet
Supports ontologies as part of IE applications - Ontology-Based IE
Supports languages from Hindi to Chinese, Italian to German

18 (Not the) MAD Semantics Architecture
[Diagram labels: Sources; Formal Text (EN, IT); AV Signals; Signal md, Transcriptions; ASR, etc.; IE; Annotations; Merging; Final Annotations; Ontology-Based Metadata; Multilingual Conceptual Q & A...]

19 Archiving is not a luxury
C21st: all the C20th mistakes but bigger & better?
If you don’t know where you’ve been, how can you know where you’re going?
Archives: ammunition in the war on ignorance
Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures
More information: http://gate.ac.uk/ http://www.prestospace.org/

