IR & Metadata

Metadata

Didn't we already talk about this? We discussed what metadata is and its types:
– Data about data
– Descriptive metadata is external to the meaning of the content
– Semantic metadata is related to the content

How is it created?
– Catalogers, authors, data entry, etc.
– Requires lots of human effort

Automating Metadata

Can some metadata be assigned automatically?
– Yes, depending on how willing you are to live with mistakes
– But humans also make mistakes …

How are metadata values determined?
– Natural language processing
– Pattern matching
– Term/phrase recognition
– Information retrieval

Natural Language Processing

Use rules of sentence construction (grammar) to “understand” the meaning of the text.

Difficulties
– The grammar involved is a formal grammar, not the kind taught in grammar school
– Human communication requires non-literal interpretation

What types of metadata fields could NLP provide? Example: weather forecasts.
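
A minimal sketch of NLP-based field extraction, assuming the third-party spaCy library and its small English model (neither appears in the slides). The named entities a parser recognizes (dates, places, organizations) are natural candidates for metadata values in a domain like weather forecasts:

    import spacy  # third-party; the en_core_web_sm model must be downloaded first

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Heavy rain is expected over Seattle on Friday, "
              "according to the National Weather Service.")

    # Each named entity is a candidate metadata value, keyed by its entity type.
    for ent in doc.ents:
        print(ent.label_, "->", ent.text)  # e.g. GPE -> Seattle, DATE -> Friday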

Pattern Matching

Use patterns (e.g., regular expressions) to locate and interpret specific forms of meaning.

Difficulties
– Patterns must be expressible in the pattern language
– Lots of variation requires lots of patterns
– Polysemy

What types of metadata fields could pattern matching provide?
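
A sketch of regex-based extraction for a couple of single-valued fields; the field names and patterns are illustrative placeholders, and the slide's point about variation applies: real collections need many more patterns per field.

    import re

    # Illustrative patterns only; real fields need many variants per format.
    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "date":  re.compile(r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|"
                            r"Sep|Oct|Nov|Dec)[a-z]* \d{4}\b"),
    }

    def extract_fields(text):
        """Return every match for every pattern, keyed by field name."""
        return {field: pat.findall(text) for field, pat in PATTERNS.items()}

    print(extract_fields("Contact j.doe@example.org; published 6 February 2014."))
    # {'email': ['j.doe@example.org'], 'date': ['6 February 2014']}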

Term/Phrase Matching

Look for specific terms or phrases in order to determine document characteristics.

Difficulties
– No understanding of context
– Polysemy

What types of metadata fields could term/phrase matching provide?
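
Term/phrase matching reduces to dictionary lookup against a controlled list. A toy sketch (the vocabulary entries are invented) that also exhibits the context problem: a phrase matches wherever it literally occurs, relevant or not.

    # Toy controlled vocabulary mapping phrases to subject labels (invented).
    VOCAB = {
        "neural network": "Machine learning",
        "precision and recall": "IR evaluation",
        "regular expression": "Pattern matching",
    }

    def match_terms(text):
        """Flag each vocabulary phrase that literally occurs in the text.
        No context is used, so polysemous phrases will over-match."""
        lower = text.lower()
        return sorted({label for phrase, label in VOCAB.items() if phrase in lower})

    print(match_terms("We trained a neural network and report precision and recall."))
    # ['IR evaluation', 'Machine learning']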

Information Retrieval

Use statistical analysis of vocabulary use and document structure to determine document characteristics.

Difficulties
– No understanding of terms
– No understanding of semantic context

What types of metadata fields could information retrieval provide?
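
One common statistical treatment is tf-idf weighting, sketched here with scikit-learn (an assumption; the slides name no tools): terms frequent in one document but rare across the collection get high weights and become keyphrase candidates, without any understanding of what they mean.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "metadata extraction with natural language processing",
        "pattern matching extracts dates and identifiers",
        "statistical term weighting for information retrieval",
    ]

    # Weight terms by in-document frequency, discounted by collection frequency.
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()

    # The highest-weighted terms of the first document are keyphrase candidates.
    weights = X[0].toarray().ravel()
    top = sorted(zip(weights, terms), reverse=True)[:3]
    print([term for w, term in top if w > 0])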

Practical Metadata

No metadata extraction algorithm works 100% of the time.
– Could send results to a human to okay, but that still requires lots of human resources
– Need to decide how good the algorithm has to be, or how confident it has to be (if it provides confidence values), before accepting its results

INFOMINE
– Project crawling and generating metadata for scholarly resources on the Web
– Has 100,000 automatically created records
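
A minimal sketch of the accept-or-review decision described above, assuming the extractor reports a confidence value; the 0.9 threshold is an arbitrary placeholder that would need tuning per field.

    def route(field, value, confidence, threshold=0.9):
        """Accept a machine-assigned value only when the extractor's
        self-reported confidence clears the threshold; otherwise queue
        it for a human to okay. The threshold is a placeholder."""
        if confidence >= threshold:
            return ("accept", field, value)
        return ("human-review", field, value)

    print(route("title", "IR & Metadata", confidence=0.95))  # accepted
    print(route("creator", "J. Smith?", confidence=0.40))    # sent to a human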

Types of Metadata Extraction

Assignment
– Assigns values drawn from the text of the document
– NLP, pattern matching, term/phrase matching

Classification
– Assigns values from a controlled vocabulary
– Uses machine learning during a training stage to match document attributes (e.g., a term vector) to an element of the controlled vocabulary
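
A sketch of the classification style, again assuming scikit-learn: during a training stage a model learns to map a document's term vector to a controlled-vocabulary label. The documents and labels are invented stand-ins.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny labeled training set (invented documents and categories).
    train_docs = [
        "rainfall temperature forecast humidity",
        "court ruling statute appeal verdict",
        "rainfall flooding storm warning",
        "contract law liability clause",
    ]
    train_labels = ["weather", "legal", "weather", "legal"]

    # Term vector -> controlled-vocabulary label, learned from examples.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_docs, train_labels)
    print(clf.predict(["storm and heavy rainfall expected"]))  # ['weather']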

Evaluating Metadata Extraction

Automatic evaluation
– Based on a document set with metadata previously assigned by human experts
– Compares the similarity between system-assigned and human-assigned metadata
– Limited to document/metadata sets where the correct values are known

Human evaluation
– Subject experts rate the appropriateness of the assigned metadata
– Allows for near misses and alternate values
– Expensive to do

Metadata Extraction Metrics

Single-value metadata fields
– Accuracy is a good performance measure
– Partial-match fields: a parent or child in the ontological hierarchy may count

Multi-value metadata fields
– Precision = # right / # assigned
– Recall = # right / # expert-assigned values

Semantic summaries and keyphrases
– Content-word precision = # same words / # words
– Content-word recall = # same words / # expert words
– Requires stopword removal and stemming
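
The multi-value metrics on this slide translate directly into set arithmetic; a small sketch with invented subject headings:

    def precision_recall(assigned, expert):
        """Precision = # right / # assigned; recall = # right / # expert-assigned.
        The content-word variants would first apply stopword removal and stemming."""
        assigned, expert = set(assigned), set(expert)
        right = len(assigned & expert)
        precision = right / len(assigned) if assigned else 0.0
        recall = right / len(expert) if expert else 0.0
        return precision, recall

    system = {"information retrieval", "metadata", "databases"}
    human = {"information retrieval", "metadata", "cataloging", "indexing"}
    print(precision_recall(system, human))  # 2/3 precision, 2/4 recall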

INFOMINE Assignments

Title
– Single-value open-text field
– The title tag worked well

Creator
– Multiple-value field
– Used the “creator” meta tag when present (good precision, no smarts)

Keyphrase
– Used the “keyword” meta tag together with PhraseRate (an IR approach)
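
A minimal sketch of harvesting the title and meta tags these assignments rely on, using only Python's standard html.parser; which meta names to collect is an assumption based on the slide.

    from html.parser import HTMLParser

    class MetaTagExtractor(HTMLParser):
        """Collects the <title> text and selected <meta name=...> values."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True
            elif tag == "meta":
                d = dict(attrs)
                name = (d.get("name") or "").lower()
                if name in ("creator", "keyword", "keywords", "description"):
                    self.meta.setdefault(name, []).append(d.get("content", ""))

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    p = MetaTagExtractor()
    p.feed('<html><head><title>Example</title>'
           '<meta name="creator" content="J. Smith"></head></html>')
    print(p.title, p.meta)  # Example {'creator': ['J. Smith']}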

INFOMINE Assignments (continued)

Description
– 1–2 paragraphs long
– Meta tags plus AutoAnnotator (an NLP + IR approach)

LCSH
– Selected from over 200,000 values
– Determines the nearest neighbor in a human-assigned data set (IR and ML)

INFOMINE Category
– Places each document within a set of nine categories
– Uses nine binary classifiers created using ML
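
A sketch of the nearest-neighbor idea behind the LCSH assignment, again assuming scikit-learn; the labeled documents and headings are invented stand-ins, whereas the real system searched over 200,000 values.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Documents with human-assigned headings (invented stand-ins).
    labeled_docs = [
        "library catalog records and subject headings",
        "machine learning for text classification",
    ]
    labels = ["Cataloging", "Machine learning"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(labeled_docs)

    def assign_heading(new_doc):
        """Give the new document the heading of its most similar labeled neighbor."""
        sims = cosine_similarity(vec.transform([new_doc]), X).ravel()
        return labels[sims.argmax()]

    print(assign_heading("subject headings in a library catalog"))  # Cataloging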

Summary

Metadata is useful but expensive
– Takes lots of human effort to generate
– Need to automate when possible

Metadata generation
– NLP, pattern matching, term/phrase matching, IR
– Different approaches are appropriate for generating different types of metadata

Evaluating generated metadata
– Automatic vs. human evaluation
– Accuracy, precision/recall, etc.