Multi-Source Information Extraction
Valentin Tablan, University of Sheffield


University of Sheffield, NLP GATE Summer School, Sheffield

Multi-Source IE
□ Redundant sources: better precision.
□ Complementary sources: better recall.
Diagram: Input 1 … Input N → Information Extraction (one per input) → Results Merge → Output (Template / Ontology)
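The merge step can be illustrated with a small voting sketch. This is only an assumption about how a merge might work, not the method used in the systems described here: the function name `merge_results`, the `min_votes` parameter and the sample entities are all hypothetical. Keeping everything any source found favours recall (complementary sources), while requiring agreement between sources favours precision (redundant sources).

```python
from collections import Counter

def merge_results(per_source, min_votes=1):
    """Merge entity sets extracted from several inputs (hypothetical helper).

    per_source: list of sets of (entity_type, text) pairs, one set per input.
    min_votes:  number of sources that must agree before an entity is kept.
    """
    votes = Counter(entity for found in per_source for entity in set(found))
    return {entity for entity, n in votes.items() if n >= min_votes}

# Illustrative data only: three inputs describing the same story.
sources = [
    {("Person", "Saddam Hussein"), ("Organization", "United Nations")},
    {("Person", "Saddam Hussein"), ("Location", "Baghdad")},
    {("Person", "Saddam Hussein"), ("Organization", "United Nations")},
]

union = merge_results(sources, min_votes=1)   # keep everything: favours recall
agreed = merge_results(sources, min_votes=2)  # require agreement: favours precision
```

With `min_votes=1` the result is the union of all sources; raising it drops entities that only one source reported.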

RichNews
□ A prototype addressing the automation of semantic annotation for multimedia material
□ Fully automatic
□ Aimed at news material
□ Not aiming to match the performance of human experts
□ TV and radio news broadcasts from the BBC were used during development and testing

Motivation
□ Broadcasters produce many hours of material daily (the BBC has 8 TV and 11 radio national channels)
□ Some of this material can be reused in new productions
□ Access to archive material is provided by some form of semantic annotation and indexing
□ Manual annotation is time-consuming (up to 10x real time) and expensive
□ Currently some 90% of the BBC's output is only annotated at a very basic level

Overview
□ Input: multimedia file
□ Output: OWL/RDF descriptions of content
  ○ Headline (short summary)
  ○ List of entities (Person/Location/Organization/…)
  ○ Related web pages
  ○ Segmentation
□ Multi-source Information Extraction system
  ○ Automatic speech transcript
  ○ Subtitles/closed captions (if available)
  ○ Related web pages
  ○ Legacy metadata

Key Problems
□ Obtaining a transcript:
  ○ Speech recognition produces poor-quality transcripts with many mistakes (error rate ranging from 10% to 90%)
  ○ More reliable sources (subtitles/closed captions) are not always available
□ Broadcast segmentation:
  ○ A news broadcast contains several stories. How do we work out where one stops and the next one starts?

Workflow (diagram)
□ Media File → THISL Speech Recogniser → ASR Transcript → C99 Topical Segmenter → ASR Segments → TF/IDF Keyphrase Extraction → Search Terms → Web Search & Document Matching → Related Web Pages
□ Related Web Pages → KIM Information Extraction → Web Entities
□ ASR text → Degraded Text Information Extraction → ASR Entities
□ Web Entities + ASR Entities → Entity Validation and Alignment → Output Entities

Using ASR Transcripts
□ ASR is performed by the THISL system.
□ Based on the ABBOT connectionist speech recognizer.
□ Optimized specifically for use on BBC news broadcasts.
□ Average word error rate of 29%.
□ Error rate of up to 90% for out-of-studio recordings.

ASR Errors
□ ASR output: he was suspended after his arrest [SIL] but the process were set never to have lost confidence in him
□ Correct text: he was suspended after his arrest [SIL] but the Princess was said never to have lost confidence in him
□ ASR output: and other measures weapons inspectors have the first time entered one of saddam hussein's presidential palaces
□ Correct text: United Nations weapons inspectors have for the first time entered one of saddam hussein's presidential palaces
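To make the "word error rate" figure concrete, the sketch below computes it for the first example pair above using a standard word-level edit distance (substitutions, insertions and deletions divided by the number of reference words). The function name is hypothetical, and dropping the [SIL] marker and lowercasing the text are assumptions made for this illustration; it is not the scoring tool used for THISL.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# First example pair, lowercased and with the [SIL] marker removed.
reference = ("he was suspended after his arrest but the princess was said "
             "never to have lost confidence in him")
hypothesis = ("he was suspended after his arrest but the process were set "
              "never to have lost confidence in him")

wer = word_error_rate(reference, hypothesis)  # 3 substitutions over 18 words
```

Three wrong words out of eighteen look minor as a rate, yet they remove the only named entity in the sentence, which is why extraction directly on ASR text is hard.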

Topical Segmentation
□ Uses the C99 segmenter:
  ○ Removes common words from the ASR transcripts.
  ○ Stems the other words to get their roots.
  ○ Then looks to see in which parts of the transcripts the same words tend to occur.
□ → These parts will probably report the same story.
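The three steps above can be sketched as follows. This is deliberately not C99 itself, which is more elaborate; it is a much simpler comparison of adjacent text blocks, intended only to show stop-word removal, stemming and "same words occur together" in code. The stop-word list, the crude suffix stripper, the threshold and the sample blocks are all assumptions made for this illustration.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one C99 uses).
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "has", "have", "is", "on", "for", "at"}

def crude_stem(word):
    # Placeholder stemmer: strips a few common suffixes. A real system
    # would use a proper stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def block_vector(words):
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def segment(blocks, threshold=0.1):
    """Return each index i where a story boundary falls before blocks[i]."""
    vectors = [block_vector(b.lower().split()) for b in blocks]
    return [i for i in range(1, len(vectors))
            if cosine(vectors[i - 1], vectors[i]) < threshold]

blocks = [
    "weapons inspectors entered the presidential palaces",
    "the inspectors searched the palaces for weapons",
    "the football match ended in a draw",
    "both football teams missed chances in the match",
]
boundaries = segment(blocks)
```

Adjacent blocks that share stemmed vocabulary stay in one story; a boundary is placed where the shared vocabulary drops away.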

Key Phrase Extraction
Term frequency inverse document frequency (TF.IDF):
□ Chooses sequences of words that tend to occur more frequently in the story than they do in the language as a whole.
□ Any sequence of up to three words can be a phrase.
□ Up to four phrases extracted per story.
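A minimal sketch of this idea, under stated assumptions: candidate phrases are all word sequences of up to three words, "the language as a whole" is approximated by a small background collection of documents, the IDF is smoothed, and ties are broken in favour of longer phrases. The function names, the smoothing formula, the tie-break rule and the sample texts are illustrative choices, not details taken from RichNews.

```python
import math
from collections import Counter

def ngrams(words, max_len=3):
    """Yield every sequence of 1..max_len consecutive words as a string."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def key_phrases(story, background, max_len=3, top_k=4):
    story_counts = Counter(ngrams(story.lower().split(), max_len))
    background_sets = [set(ngrams(doc.lower().split(), max_len)) for doc in background]
    n_docs = len(background_sets)

    def score(phrase):
        # Document frequency in the background collection.
        df = sum(phrase in doc for doc in background_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        return story_counts[phrase] * idf

    # Highest score first; ties prefer longer phrases, then alphabetical order.
    ranked = sorted(story_counts, key=lambda p: (-score(p), -len(p.split()), p))
    return ranked[:top_k]

story = "weapons inspectors entered the palace and weapons inspectors left the palace"
background = [
    "the minister entered the building",
    "the team left the stadium and the crowd left",
]
phrases = key_phrases(story, background)
```

Words such as "the" occur in every background document, so their IDF is low and they are not selected on their own, while "weapons inspectors" is frequent in the story and absent from the background.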

Web Search and Document Matching
□ The key phrases are used to search the BBC, Times, Guardian and Telegraph websites for web pages reporting each story in the broadcast.
□ Searches are restricted to the day of broadcast, or the day after.
□ Searches are repeated using different combinations of the extracted key phrases.
□ The text of the returned web pages is compared to the text of the transcript to find matching stories.
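The two mechanical parts of this step, generating query combinations and comparing page text with the transcript, can be sketched as below. The slides do not say how the texts are compared, so cosine similarity over word counts is an assumption, as are the function names, the threshold and the placeholder page identifiers. No real search engine is called; the pages are supplied as a dictionary.

```python
import math
from collections import Counter
from itertools import combinations

def query_variants(phrases, min_size=2):
    """Yield combinations of key phrases, most specific (largest) first."""
    for size in range(len(phrases), min_size - 1, -1):
        for combo in combinations(phrases, size):
            yield " ".join(f'"{p}"' for p in combo)

def cosine_similarity(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(transcript, pages, threshold=0.3):
    """pages maps a page identifier to its text; returns the best id or None."""
    scored = [(cosine_similarity(transcript, text), page_id)
              for page_id, text in pages.items()]
    score, page_id = max(scored, default=(0.0, None))
    return page_id if score >= threshold else None

queries = list(query_variants(
    ["weapons inspectors", "presidential palaces", "saddam hussein"]))

transcript = ("and other measures weapons inspectors have the first time "
              "entered one of saddam hussein's presidential palaces")
pages = {  # hypothetical identifiers and texts
    "page-a": ("un weapons inspectors have entered one of saddam hussein's "
               "presidential palaces for the first time"),
    "page-b": "the football match ended in a draw",
}
match = best_match(transcript, pages)
```

Even though the transcript contains recognition errors, enough words survive for the correct page to score far above an unrelated one.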

Using the Web Pages
The web pages contain:
□ A headline, summary and section for each story.
□ Good-quality text that is readable and contains correctly spelt proper names.
□ More in-depth coverage of the stories.

Semantic Annotation
The KIM knowledge management system can semantically annotate the text derived from the web pages:
□ KIM will identify people, organizations, locations, etc.
□ KIM performs well on the web page text, but very poorly when run on the transcripts directly.
□ It allows for semantic, ontology-aided searches for stories about particular people, locations, etc.
□ So we could search for people called Sydney, which would be difficult with a text-based search.
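The "people called Sydney" example can be illustrated with a toy index keyed on entity type as well as name. This is not KIM's API or data model; the class, its methods and the story identifiers are hypothetical and only show why typed annotations answer a query that plain text matching cannot.

```python
from collections import defaultdict

class EntityIndex:
    """Toy index from (entity type, name) to the stories mentioning it."""

    def __init__(self):
        self._stories = defaultdict(set)

    def add(self, story_id, entity_type, name):
        self._stories[(entity_type, name.lower())].add(story_id)

    def search(self, entity_type, name):
        return sorted(self._stories.get((entity_type, name.lower()), set()))

index = EntityIndex()
index.add("story-1", "Location", "Sydney")  # the city
index.add("story-2", "Person", "Sydney")    # a person with that name
index.add("story-3", "Location", "Sydney")

people = index.search("Person", "Sydney")
places = index.search("Location", "Sydney")
```

A text search for "Sydney" would return all three stories; the typed lookup separates the person from the city.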

Entity Matching

Search for Entities

Story Retrieval

Evaluation
Success in finding matching web pages was investigated.
□ Evaluation based on 66 news stories from 9 half-hour news broadcasts.
□ Web pages were found for 40% of stories.
□ 7% of pages reported a closely related story, instead of that in the broadcast.

Possible Improvements
□ Use teletext subtitles (closed captions) when they are available
□ Better story segmentation through visual cues and latent semantic analysis
□ Use the output for content augmentation in interactive media consumption

Other Examples: Multiflora
□ Improve recall in analysing botany texts by using multiple sources and unification of populated templates.
□ Store templates as an ontology (which gets populated from the multiple sources).
□ Recall for the full template improves from 22% (1 source) to 71% (6 sources)
□ Precision decreases from 74% to 63%
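A small sketch of what unifying populated templates could look like, assuming a template is a mapping from slot name to value and that conflicting values are all kept. The slot names and sample values are made up for illustration and the `unify` helper is hypothetical, not Multiflora's implementation. The `f1` helper is not part of the slides either; it is the standard harmonic mean of precision and recall, used here only to put the reported figures on one scale.

```python
def unify(templates):
    """Merge slot -> value templates; each slot keeps every distinct value seen."""
    merged = {}
    for template in templates:
        for slot, value in template.items():
            if value is not None:  # empty slots contribute nothing
                merged.setdefault(slot, set()).add(value)
    return merged

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative templates filled from two different texts about one plant.
source_a = {"species": "Ranunculus acris", "petal_colour": "yellow", "height_cm": None}
source_b = {"species": "Ranunculus acris", "petal_colour": "golden yellow", "height_cm": "30-100"}

merged = unify([source_a, source_b])

# Figures reported above: precision 74% / recall 22% with one source,
# precision 63% / recall 71% with six sources.
one_source = f1(0.74, 0.22)
six_sources = f1(0.63, 0.71)
```

The merged template fills the slot that the first source left empty, which is where the recall gain comes from, and it also admits a second, conflicting petal colour, which is one way precision can fall. On the reported figures the F1 score rises from about 0.34 to about 0.67.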

Multiflora: IE

Multiflora: Output

Other Examples: MUMIS
□ Multi-Media Indexing and Search
□ Indexing of football matches, using multiple sources:
  ○ Tickers (time-aligned with video stream)
  ○ Match reports (more in-depth)
  ○ Comments (extra details, such as player profiles)

MUMIS Interface

More Information
Thank You! Questions?