Web-Assisted Annotation, Semantic Indexing and Search of Television and Radio News (proceedings page 255) Mike Dowman Valentin Tablan Hamish Cunningham.

Presentation transcript:

Web-Assisted Annotation, Semantic Indexing and Search of Television and Radio News (proceedings page 255)
Mike Dowman, Valentin Tablan, Hamish Cunningham (University of Sheffield)
Borislav Popov (Ontotext Lab, Sirma AI)

2 WWW 2005, Chiba, Japan Motivation
– Broadcasters produce many hours of material daily (the BBC has 8 TV and 11 radio national channels).
– Some of this material can be reused in new productions.
– Access to archive material is provided by some form of semantic annotation and indexing.
– Manual annotation is time-consuming (up to 10x real time) and expensive.
– Currently some 90% of the BBC's output is only annotated at a very basic level.

3 RichNews
– A prototype addressing the automation of semantic annotation for multimedia material.
– Not aiming to reach performance comparable to that of human documentarists.
– Fully automatic.
– Aimed at news material; further extensions are possible.
– TV and radio news broadcasts from the BBC were used during development and testing.

4 Overview
Input: multimedia file
Output: OWL/RDF descriptions of content
– Headline (short summary)
– List of entities (Person/Location/Organization/…)
– Related web pages
– Segmentation
Multi-source Information Extraction system
– Automatic speech transcript
– Subtitles/closed captions
– Related web pages
– Legacy metadata

5 Key Problems
Obtaining a transcript:
– Speech recognition produces poor-quality transcripts with many mistakes (error rate ranging from 10% to 90%).
– More reliable sources (subtitles/closed captions) are not always available.
Broadcast segmentation:
– A news broadcast contains several stories. How do we work out where one starts and another one stops?

6 Architecture
[Architecture diagram. Components shown: Media File; THISL Speech Recogniser; C99 Topical Segmenter; TF.IDF Key Phrase Extraction; Web-Search and Document Matching; KIM Information Extraction; Degraded Text Information Extraction; Entity Validation; Manual Annotation (Optional); Semantic Index.]

7 Using ASR Transcripts
– Automatic speech recognition (ASR) is performed by the THISL system.
– THISL is based on the ABBOT connectionist speech recognizer.
– It is optimized specifically for use on BBC news broadcasts.
– Average word error rate of 29%.
– Error rate of up to 90% for out-of-studio recordings.

8 ASR
ASR output: he was suspended after his arrest [SIL] but the process were set never to have lost confidence in him
Corrected: he was suspended after his arrest [SIL] but the Princess was said never to have lost confidence in him
ASR output: and other measures weapons inspectors have the first time entered one of saddam hussein's presidential palaces
Corrected: United Nations weapons inspectors have for the first time entered one of saddam hussein's presidential palaces
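
Word error rate, the measure quoted for the recogniser, can be illustrated with the first transcript pair on the ASR slide. The sketch below is only an illustration: it assumes the common definition (word-level edit distance divided by the number of reference words), ignores the [SIL] marker and capitalisation, and the function names are hypothetical rather than taken from THISL.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference."""
    return word_edit_distance(reference, hypothesis) / len(reference.split())


# First example pair from the slide, lower-cased and without [SIL].
reference = ("he was suspended after his arrest but the princess was said "
             "never to have lost confidence in him")
hypothesis = ("he was suspended after his arrest but the process were set "
              "never to have lost confidence in him")

errors = word_edit_distance(reference, hypothesis)
rate = word_error_rate(reference, hypothesis)
```

For this pair the recogniser substitutes three of the 18 words, giving a word error rate of about 17%.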

9 Topical Segmentation
Uses the C99 segmenter, which:
– Removes common words from the ASR transcripts.
– Stems the other words to get their roots.
– Then looks to see in which parts of the transcripts the same words tend to occur.
These parts will probably report the same story.
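
The idea behind this step can be sketched in a few lines. The code below is not C99 itself: it is a much simpler stand-in that removes stop words, applies a toy suffix stripper in place of a real stemmer, and then picks the single split point where the vocabulary on either side overlaps least. The stop-word list, the stemmer and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "for"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag(sentence):
    """Stemmed, stop-word-free bag of words for one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(crude_stem(w) for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_boundary(sentences):
    """Index i such that splitting before sentences[i] separates the
    two least similar halves of the transcript."""
    bags = [bag(s) for s in sentences]
    best_index, best_similarity = None, float("inf")
    for i in range(1, len(bags)):
        left = sum(bags[:i], Counter())
        right = sum(bags[i:], Counter())
        similarity = cosine(left, right)
        if similarity < best_similarity:
            best_index, best_similarity = i, similarity
    return best_index


# Made-up transcript fragments covering two different stories.
sentences = [
    "weapons inspectors entered the palace",
    "inspectors searched the palace for weapons",
    "the princess attended the trial",
    "the trial of the princess ended",
]
boundary = best_boundary(sentences)
```

Here the split falls before the third sentence, where the shared vocabulary changes. A full segmenter would also have to decide how many boundaries a broadcast contains, which this sketch does not attempt.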

10 Key Phrase Extraction
Term frequency inverse document frequency (TF.IDF):
– Chooses sequences of words that tend to occur more frequently in the story than they do in the language as a whole.
– Any sequence of up to three words can be a phrase.
– Up to four phrases are extracted per story.
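
A minimal sketch of this kind of scoring, under stated assumptions: candidate phrases are all word sequences of one to three tokens, a small set of background documents stands in for "the language as a whole", and the smoothed IDF formula is one common variant rather than the one RichNews necessarily uses. Function names and sample data are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_len=3):
    """Yield every sequence of 1 to max_len consecutive tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def key_phrases(story_tokens, background_docs, top_k=4):
    """Rank candidate phrases by term frequency times a smoothed IDF."""
    term_freq = Counter(ngrams(story_tokens))
    doc_phrase_sets = [set(ngrams(doc)) for doc in background_docs]
    n_docs = len(background_docs)
    scores = {}
    for phrase, count in term_freq.items():
        doc_freq = sum(1 for phrases in doc_phrase_sets if phrase in phrases)
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[phrase] = count * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_k]


# Made-up story and background collection.
story = "weapons inspectors entered the palace weapons inspectors left".split()
background = [
    "the palace".split(),
    "the weather today".split(),
    "the match entered extra time".split(),
]
phrases = key_phrases(story, background)
```

Phrases that recur in the story but are absent from the background, such as "weapons inspectors", rise to the top, while a word like "the" that appears in every background document scores zero.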

11 Web Search and Document Matching
– The key phrases are used to search the BBC, Times, Guardian and Telegraph websites for web pages reporting each story in the broadcast.
– Searches are restricted to the day of broadcast or the day after.
– Searches are repeated using different combinations of the extracted key phrases.
– The text of the returned web pages is compared to the text of the transcript to find matching stories.
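
The two mechanical parts of this step, building queries from combinations of key phrases and comparing page text with the transcript, might look roughly like the sketch below. The slides do not say how the comparison is done, so the word-overlap (Jaccard) measure, the threshold and every name here are assumptions; no real search engine is called, and the candidate pages are passed in as plain strings.

```python
import re
from itertools import combinations


def query_combinations(key_phrases, min_size=2):
    """Quoted-phrase queries for every subset of key phrases, largest first."""
    queries = []
    for size in range(len(key_phrases), min_size - 1, -1):
        for combo in combinations(key_phrases, size):
            queries.append(" ".join(f'"{phrase}"' for phrase in combo))
    return queries


def word_set(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(transcript, page_text):
    """Jaccard overlap between the word sets of two texts."""
    a, b = word_set(transcript), word_set(page_text)
    return len(a & b) / len(a | b) if a | b else 0.0


def best_match(transcript, pages, threshold=0.2):
    """Identifier of the best-overlapping page, or None if none is close."""
    scored = [(overlap(transcript, text), page_id)
              for page_id, text in pages.items()]
    score, page_id = max(scored, default=(0.0, None))
    return page_id if score >= threshold else None


# ASR transcript from the slides; page texts and identifiers are made up.
transcript = ("weapons inspectors have the first time entered one of "
              "saddam hussein's presidential palaces")
pages = {
    "page-a": ("UN weapons inspectors entered one of Saddam Hussein's "
               "presidential palaces for the first time"),
    "page-b": "Football results from the weekend fixtures",
}
queries = query_combinations(["weapons inspectors", "palaces", "saddam"])
match = best_match(transcript, pages)
```

Returning None when no page clears the threshold reflects the possibility that no candidate page reports the broadcast story at all.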

12 Using the Web Pages
The web pages contain:
– A headline, summary and section for each story.
– Good-quality text that is readable and contains correctly spelt proper names.
They also give more in-depth coverage of the stories.

13 Semantic Annotation
The KIM knowledge management system can semantically annotate the text derived from the web pages:
– KIM identifies people, organizations, locations, etc.
– KIM performs well on the web page text, but very poorly when run directly on the transcripts.
– This allows semantic, ontology-aided searches for stories about particular people, locations, etc.
– For example, we could search for people called Sydney, which would be difficult with a text-based search.
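
The "people called Sydney" example can be made concrete with a toy index of typed annotations. This is only a sketch of the idea: the Annotation record, the entity type labels and the sample data are hypothetical and do not reflect KIM's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    story_id: str
    text: str         # surface text of the entity mention
    entity_type: str  # e.g. "Person", "Location", "Organization"


def find_stories(annotations, text, entity_type):
    """Stories containing a mention of the given type matching the text."""
    return sorted({
        a.story_id
        for a in annotations
        if a.entity_type == entity_type and text.lower() in a.text.lower()
    })


# Made-up annotations for two stories.
annotations = [
    Annotation("story-1", "Sydney", "Location"),
    Annotation("story-2", "Sydney Smith", "Person"),
    Annotation("story-2", "London", "Location"),
]
people_called_sydney = find_stories(annotations, "Sydney", "Person")
places_called_sydney = find_stories(annotations, "Sydney", "Location")
```

A plain text search for "Sydney" would return both stories, whereas filtering on the entity type separates the person from the city.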

14 Entity Matching

15 Search for Entities

16 Story Retrieval

17 Evaluation
– Success in finding matching web pages was investigated.
– The evaluation was based on 66 news stories from 9 half-hour news broadcasts.
– Web pages were found for 40% of stories.
– 7% of pages reported a closely related story instead of the one in the broadcast.
– Results are based on an earlier version of the system that used only BBC web pages.

18 Future Improvements
– Use teletext subtitles (closed captions) when they are available.
– Better story segmentation through visual cues and latent semantic analysis.
– Use for content augmentation for interactive media consumption.

19 Acknowledgments
This work has been supported by European Union grants under the Sixth Framework Program projects PrestoSpace (FP ) and SEKT (EU IST IP ).
More Information
Thank you!