1 CS 430 / INFO 430 Information Retrieval Lecture 22 Metadata 4.

Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 22 Metadata 4

2 Course Administration

3 Automated Creation of Metadata Records
Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field.
Examples:
  Images -- characteristics of color, texture, shape, etc. (crude)
  Music -- optical recognition of score (good)
  Bird song -- spectral analysis of sounds (good)
  Fingerprints (good)

4 Automated Information Retrieval Using Feature Extraction
Example: features extracted from images
  Spectral features: color or tone, gradient, spectral parameter, etc.
  Geometric features: edge, shape, size, etc.
  Textural features: pattern, spatial frequency, homogeneity, etc.
Features can be recorded in a feature vector space (as in a term vector space). A query can be expressed in terms of the same features.
Machine learning methods, such as a support vector machine, can be used with training data to create a similarity metric between image and query.
Example: Searching satellite photographs for dams in California
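The feature vector idea on this slide can be illustrated with a small sketch. It assumes a coarse color histogram as the only feature and uses plain cosine similarity in place of a trained similarity metric such as a support vector machine; the function names and the toy pixel data are hypothetical and are not taken from the lecture.

```python
import math

def color_histogram(pixels, bins_per_channel=4):
    """Build a normalized color histogram from (r, g, b) pixels in 0..255."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each pixel to one cell of a coarse 3-D color grid.
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1.0
    total = sum(hist) or 1.0
    return [count / total for count in hist]

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "images": lists of pixels (illustrative data only).
water = [(10, 40, 200)] * 90 + [(120, 120, 120)] * 10
lake = [(20, 50, 210)] * 80 + [(110, 110, 110)] * 20
forest = [(20, 150, 30)] * 100

# The query is expressed in the same feature space as the indexed images.
query = color_histogram(water)
sim_lake = cosine_similarity(query, color_histogram(lake))
sim_forest = cosine_similarity(query, color_histogram(forest))
```

With this toy data the lake image scores much closer to the query than the forest image, which shares no histogram cells with it. A real system would combine spectral, geometric, and textural features rather than color alone.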

5 Example: Blobworld

8 Effective Information Discovery With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval
  Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).
Full text indexing with ranked retrieval
  Can be excellent, but methods developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
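As a rough illustration of Boolean retrieval over standardized metadata, the sketch below builds an inverted index on a single metadata field and answers AND and NOT queries with set operations. The records, the field name `subject`, and the helper `build_index` are hypothetical examples, not part of any catalog format mentioned in the lecture.

```python
from collections import defaultdict

# Hypothetical metadata records keyed by record id.
records = {
    1: {"subject": ["birds", "sound recordings"]},
    2: {"subject": ["earthquakes", "data sets"]},
    3: {"subject": ["birds", "images"]},
}

def build_index(records, field):
    """Map each term in the given field to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for term in rec[field]:
            index[term].add(rec_id)
    return index

index = build_index(records, "subject")

# Boolean queries are plain set operations on the postings.
birds_and_images = index["birds"] & index["images"]
birds_not_images = index["birds"] - index["images"]
```

The approach only works when every record uses the same field with the same controlled terms, which is the homogeneity requirement the slide points out.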

9 Mixed Content
Examples: NSDL-funded collections at Cornell
  Atlas. Data sets of earthquakes, volcanoes, etc.
  Reuleaux. Digitized kinematics models from the nineteenth century.
  Laboratory of Ornithology. Sound recordings, images, videos of birds and other animals.
  Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.

10 Mixed Metadata: the Chimera of Standardization
Technical reasons
  (a) Characteristics of formats and genres
  (b) Differing user needs
Social and cultural reasons
  (a) Economic factors
  (b) Installed base

11 Information Discovery in a Messy World
Building blocks
  Brute force computation
  The expertise of users -- human in the loop
Methods
  (a) Better understanding of how and why users seek information
  (b) Relationships and context information
  (c) Multi-modal information discovery
  (d) User interfaces for exploring information

12 Understanding How and Why Users Seek Information
Homogeneous content
  All documents are assumed equal
  Criterion is relevance (binary measure)
  Goal is to find all relevant documents (high recall)
  Hits ranked in order of similarity to query
Mixed content
  Some documents are more important than others
  Goal is to find the most useful documents on a topic and then browse
  Hits ranked in an order that combines importance and similarity to query
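One simple way to rank hits in an order that combines importance and similarity is a weighted sum of the two scores. The sketch below assumes both scores are already normalized to the range 0 to 1; the linear combination, the default weight, and the sample hits are illustrative assumptions, since the slide does not specify a formula.

```python
def combined_score(similarity, importance, weight=0.5):
    """Linear blend: weight on query similarity, the remainder on importance."""
    return weight * similarity + (1.0 - weight) * importance

# Hypothetical hits: (document id, similarity to query, importance).
hits = [
    ("doc_a", 0.9, 0.1),
    ("doc_b", 0.7, 0.8),
    ("doc_c", 0.4, 0.9),
]

ranked = sorted(hits, key=lambda h: combined_score(h[1], h[2]), reverse=True)
ranked_ids = [h[0] for h in ranked]
```

Ranking by similarity alone would put doc_a first. With equal weights, doc_b leads because it is both reasonably similar and important.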

13 Automatic Creation of Surrogates for Non-textual Materials Discovery of non-textual materials usually requires surrogates How far can these surrogates be created automatically? Automatically created surrogates are much less expensive than manually created, but have high error rates. If surrogates have high rates of error, is it possible to have effective information discovery?

14 Example: Informedia Digital Video Library
Collections: Segments of video programs, e.g., TV and radio news and documentary broadcasts. Cable News Network, British Open University, WQED television.
Segmentation: Automatically broken into short segments of video, such as the individual items in a news broadcast.
Size: More than 4,000 hours, 2 terabytes.
Objective: Research into automatic methods for organizing and retrieving information from video.
Funding: NSF, DARPA, NASA and others.
Principal investigator: Howard Wactlar (Carnegie Mellon University).

15 Informedia Digital Video Library: History
Carnegie Mellon has broad research programs in speech recognition, image recognition, and natural language processing.
A basic mock-up demonstrated the general concept of a system using speech recognition to build an index from a sound track, matched against spoken queries. (DARPA funded.)
Informedia developed the concept of multi-modal information discovery with a series of user interface experiments. (NSF/DARPA/NASA Digital Libraries Initiative.)
Continued research, particularly in human-computer interaction.
Commercial spin-off failed.

16 The Challenge
A video sequence is awkward for information discovery:
  Textual methods of information retrieval cannot be applied.
  Browsing requires the user to view the sequence. Fast skimming is difficult.
  Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec).
Surrogates are required.
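The bit rate quoted on this slide can be checked against the collection size given on the Informedia slide (more than 4,000 hours, 2 terabytes). The short calculation below assumes a constant 1.2 Mbits/sec stream and decimal terabytes (10^12 bytes); both are simplifying assumptions.

```python
BITRATE_BITS_PER_SEC = 1.2e6   # MPEG-1 rate quoted on the slide
HOURS = 4000                   # collection size quoted for Informedia

seconds = HOURS * 3600
total_bytes = BITRATE_BITS_PER_SEC * seconds / 8   # 8 bits per byte
terabytes = total_bytes / 1e12                      # decimal terabytes

print(f"{terabytes:.2f} TB")
```

Under these assumptions the result is about 2.16 TB, which is consistent with the 2 terabyte figure.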

17 Multi-Modal Information Discovery
The multi-modal approach to information retrieval:
  Computer programs analyze video materials for clues, e.g., changes of scene.
  Methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition.
  Analysis of video track, sound track, closed captioning if present, and any other information.
Each mode gives imperfect information. Therefore use many approaches and combine the evidence.

18 Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined.
"The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task." [Wactlar, 2000]
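Combining evidence from several imperfect modes can be sketched as a weighted average over whichever modes produced a score for a given segment. This is a generic late-fusion scheme chosen for illustration, not the method used by Informedia; the mode names, weights, and scores are hypothetical.

```python
def combine_evidence(mode_scores, weights):
    """Weighted average over the modes that actually produced a score.

    mode_scores maps a mode name to a score in 0..1, or None when that
    mode gave no information for the segment (e.g., no closed captions).
    """
    present = [m for m, score in mode_scores.items() if score is not None]
    if not present:
        return 0.0
    total_weight = sum(weights[m] for m in present)
    return sum(weights[m] * mode_scores[m] for m in present) / total_weight

# Hypothetical trust placed in each mode.
weights = {"speech": 0.5, "captions": 0.3, "screen_text": 0.2}

# Two segments with different amounts of available evidence.
segment_a = {"speech": 0.6, "captions": None, "screen_text": 0.9}
segment_b = {"speech": 0.6, "captions": 0.2, "screen_text": None}

score_a = combine_evidence(segment_a, weights)
score_b = combine_evidence(segment_b, weights)
```

Renormalizing by the weights of the modes that are present lets segments with missing modes still be compared, which matches the situation described above where the amount of information varies greatly between resources.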

19 Informedia Library Creation
[Diagram. Labels: Video, Audio, Text; Speech recognition, Image extraction, Natural language interpretation, Segmentation; Segments with derived metadata.]

20 Text Extraction
Source
  Sound track: Automatic speech recognition using Sphinx II and III recognition systems. (Unrestricted vocabulary, speaker independent, multi-lingual, background sounds.) Error rates of 25% and up.
  Closed captions: Digitally encoded text. (Not on all video. Often inaccurate.)
  Text on screen: Can be extracted by image recognition and optical character recognition. (Matches speaker with name.)
Query
  Spoken query: Automatic speech recognition using the same system as is used to index the sound track.
  Typed by user
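The slide does not say how the speech recognition error rate is measured. A common measure is word error rate: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below assumes that definition; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0

reference = "the senate passed the budget bill today"
hypothesis = "the senate past the budget bill"
wer = word_error_rate(reference, hypothesis)
```

Here one substitution and one deletion over seven reference words give a rate of 2/7, roughly 29%, which is in the range the slide describes.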

21 Multimodal Metadata Extraction

22 Informedia: Information Discovery
[Diagram. Labels: User; Segments with derived metadata; Browsing via multimedia surrogates; Querying via natural language; Requested segments and metadata.]

24 Limits to Scalability
Informedia has demonstrated effective information discovery with moderately large collections.
Problems with increased scale:
  Technical -- storage, bandwidth, etc.
  Diversity of content -- difficult to tune heuristics
  User interfaces -- complexity of browsing grows with scale

25 Lessons Learned
Searching and browsing must be considered integrated parts of a single information discovery process.
Data (content and metadata), computing systems (e.g., search engines), and user interfaces must be designed together.
Multi-modal methods compensate for incomplete or error-prone data.

26 Interoperability: The Problem
Conventional approaches require partners to support agreements (technical, content, and business).
But a Web-based digital library program needs thousands of very different partners, most of whom are not directly part of the program.
The challenge is to create incentives for independent digital libraries to adopt agreements.

27 Approaches to interoperability
The conventional approach
  Wise people develop standards: protocols, formats, etc.
  Everybody implements the standards.
  This creates an integrated, distributed system.
Unfortunately...
  Standards are expensive to adopt.
  Concepts are continually changing.
  Systems are continually changing.
  Different people have different ideas.

28 Interoperability is about agreements
Technical agreements cover formats, protocols, security systems so that messages can be exchanged, etc.
Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
The challenge is to create incentives for independent digital libraries to adopt agreements.

29 Function versus cost of acceptance
[Figure. Axes: Function, Cost of acceptance. Regions labeled: Many adopters, Few adopters.]

30 Example: security
[Figure. Axes: Function, Cost of acceptance. Points labeled: Public key infrastructure, IP address, Login ID and password.]

31 Example: metadata standards
[Figure. Axes: Function, Cost of acceptance. Points labeled: MARC, Free text, Dublin Core.]

32 NSDL: The Spectrum of Interoperability
Level | Agreements | Example
Federation | Strict use of standards (syntax, semantic, and business) | AACR, MARC, Z
Harvesting | Digital libraries expose metadata; simple protocol and registry | Open Archives metadata harvesting
Gathering | Digital libraries do not cooperate; services must seek out information | Web crawlers and search engines
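At the harvesting level, a service collects metadata that a digital library exposes through the Open Archives protocol for metadata harvesting (OAI-PMH). The sketch below only builds a harvesting request URL. The `verb`, `metadataPrefix`, and `set` parameters and the `oai_dc` prefix come from the OAI-PMH specification; the base URL and the helper name `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict harvesting to one set defined by the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Hypothetical repository endpoint, for illustration only.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

The repository only has to answer a small number of simple requests like this one, which is why harvesting sits between federation and gathering in cost of acceptance.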