
1 IR & Metadata

2 Metadata
Didn’t we already talk about this?
We discussed what metadata is and its types
–Data about data
–Descriptive metadata is external to meaning of content
–Semantic metadata is related to content
How is it created?
–Catalogers, authors, data entry, etc.
–Requires lots of human effort

3 Automating Metadata
Can some metadata be assigned automatically?
–Yes, depending on how willing you are to live with mistakes
–But humans also make mistakes …
How to determine metadata values?
–Natural language processing
–Pattern matching
–Term/phrase recognition
–Information retrieval

4 Natural Language Processing
Use rules of sentence construction (grammar) to “understand” the meaning of the text.
Difficulties
–Grammar is not from grammar school
–Human communication requires non-literal interpretation
What types of metadata fields could NLP provide?
Example: weather forecasts

5 Pattern Matching
Use patterns (e.g. regular expressions) to locate and interpret specific forms of meaning
Difficulties
–Patterns must be expressible in pattern language
–Lots of variations require lots of patterns
–Polysemy
What types of metadata fields could pattern matching provide?
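As a rough illustration of pattern matching, the sketch below pulls dates and e-mail addresses out of text with Python's `re` module. The two patterns, the field names, and the sample sentence are assumptions made for this example; they are not taken from the slides.

```python
import re

# Hypothetical patterns for illustration; real documents need many more variants.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")          # ISO-style dates only
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")     # simplified e-mail shape


def extract_patterns(text):
    """Return metadata fields found by regular-expression matching."""
    return {
        "dates": [m.group(0) for m in DATE_RE.finditer(text)],
        "emails": EMAIL_RE.findall(text),
    }


sample = "Posted 2004-03-15 by jane.doe@example.edu; revised March 20, 2004."
fields = extract_patterns(sample)
print(fields["dates"])   # ['2004-03-15']
print(fields["emails"])  # ['jane.doe@example.edu']
```

The second date ("March 20, 2004") is missed because no pattern covers that form, which is the "lots of variations require lots of patterns" difficulty in miniature.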

6 Term/Phrase Matching
Look for specific terms or phrases in order to determine document characteristics
Difficulties
–No understanding of context
–Polysemy
What types of metadata fields could term/phrase matching provide?
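A minimal sketch of term/phrase matching, assuming a small hand-built table that maps cue phrases to a "resource type" value. The table contents and the function name are hypothetical.

```python
import re

# Hypothetical phrase list mapping cue phrases to a "resource type" value.
PHRASE_TO_TYPE = {
    "table of contents": "book",
    "abstract": "article",
    "course syllabus": "teaching material",
}


def match_phrases(text):
    """Return the sorted set of values whose cue phrase occurs in the text."""
    lowered = text.lower()
    found = set()
    for phrase, value in PHRASE_TO_TYPE.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(value)
    return sorted(found)


print(match_phrases("Abstract: we study metadata extraction."))  # ['article']
print(match_phrases("An abstract painting of a table."))         # ['article']
```

The second call shows the polysemy problem: "abstract" is used in a different sense, but the matcher has no context and still reports an article.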

7 Information Retrieval
Use statistical analysis of vocabulary use and document structure to determine document characteristics
Difficulties
–No understanding of terms
–No understanding of semantic context
What types of metadata fields could information retrieval provide?
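One common statistical approach is TF-IDF term weighting, which favors terms that are frequent in one document but rare in the collection. The slides do not name a specific weighting scheme, so TF-IDF here is an assumption, as are the tiny document collection and the function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(docs, index, k=3):
    """Rank the terms of docs[index] by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    tf = Counter(tokenized[index])      # raw term frequency in the target document
    n = len(docs)
    scores = {t: count * math.log(n / df[t]) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the metadata record describes the metadata fields",
    "the crawler fetches the page",
    "the classifier labels the page",
]
print(top_terms(docs, 0, k=1))  # ['metadata']
```

Note that "the" scores zero because it appears in every document. The method ranks "metadata" first purely from counts, with no understanding of what the term means.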

8 Practical Metadata
No metadata extraction algorithm works 100% of the time
–Could send results to a human to okay, but that still requires lots of human resources
–Need to decide how good the algorithm has to be, or (if it provides confidence values) how sure it has to be, before accepting results
INFOMINE
–Project crawling and generating metadata for scholarly resources on the Web
–Has 100,000 automatically created records
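The accept-or-review decision can be sketched as a simple threshold on the confidence value. The threshold of 0.8, the tuple layout, and the function name `route` are all assumptions for illustration; nothing here describes how INFOMINE actually does it.

```python
def route(predictions, threshold=0.8):
    """Split (field, value, confidence) triples into accepted and human-review lists.

    The threshold is a hypothetical policy choice, not a recommended value.
    """
    accepted, review = [], []
    for field, value, confidence in predictions:
        target = accepted if confidence >= threshold else review
        target.append((field, value))
    return accepted, review


predictions = [
    ("title", "Physics Preprints", 0.97),
    ("creator", "A. Author", 0.55),
    ("category", "Physical Sciences", 0.81),
]
accepted, review = route(predictions)
print(accepted)  # [('title', 'Physics Preprints'), ('category', 'Physical Sciences')]
print(review)    # [('creator', 'A. Author')]
```

Raising the threshold trades human effort for fewer accepted mistakes; lowering it does the opposite.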

9 Types of Metadata Extraction
Assignment
–Assigns values drawn from text of the document
–NLP, pattern matching, term/phrase matching
Classification
–Assigns values from a controlled vocabulary
–Use machine learning during training stage to match document attributes (e.g. term vector) to element in controlled vocabulary

10 Evaluating Metadata Extraction
Automatic evaluation
–Based on document set with metadata previously assigned by human experts
–Compare similarity between system-assigned and human-assigned metadata
–Limited to document/metadata sets where the values are known
Human evaluation
–Subject experts rate the appropriateness of the assigned metadata
–Allows for near misses and alternate values
–Expensive to do

11 Metadata Extraction Metrics
Single-value metadata fields
–Accuracy is a good performance measure
–Partial match fields (e.g. parent or child in ontological hierarchies)
Multi-value metadata fields
–Precision = # right / # assigned
–Recall = # right / # of expert-assigned values
Semantic summaries and keyphrases
–Content-word precision = # same words / # words
–Content-word recall = # same words / # expert words
–Requires stopword removal and stemming
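The multi-value and content-word metrics above can be sketched as follows. Treating values and words as sets (so duplicates count once), the short stopword list, and the suffix-stripping `crude_stem` are simplifying assumptions; a real evaluation would more likely use a full stopword list and an established stemmer.

```python
import re


def precision_recall(assigned, expert):
    """Precision = # right / # assigned; recall = # right / # expert-assigned."""
    assigned, expert = set(assigned), set(expert)
    right = len(assigned & expert)
    precision = right / len(assigned) if assigned else 0.0
    recall = right / len(expert) if expert else 0.0
    return precision, recall


# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "to"}


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def content_words(text):
    """Lowercase, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}


def content_word_pr(system_text, expert_text):
    """Content-word precision and recall between two free-text values."""
    return precision_recall(content_words(system_text), content_words(expert_text))


p, r = precision_recall(
    ["physics", "chemistry", "biology"],
    ["physics", "biology", "geology", "astronomy"],
)
print(round(p, 3), r)  # 0.667 0.5

cp, cr = content_word_pr(
    "Crawling scholarly resources",
    "Crawls of scholarly resources on the Web",
)
print(cp, cr)  # 1.0 0.75
```

Without stemming, "Crawling" and "Crawls" would not count as the same word, which is why the slide lists stopword removal and stemming as requirements.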

12 INFOMINE Assignments
Title
–Single value open text field
–Title tag worked well
Creator
–Multiple value field
–Used “creator” meta tag if there (good precision, no smarts)
Keyphrase
–Used “keyword” meta tag with PhraseRate (IR approach)
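Reading the title tag and meta tags can be sketched with Python's standard `html.parser` module. This is not INFOMINE's code: the class and function names are hypothetical, the exact meta names checked ("creator", "keywords", "keyword") are assumptions, and PhraseRate is not reproduced.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            name = attrs["name"].lower()
            self.meta.setdefault(name, []).append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return title, creators, and keywords found in the page markup."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    raw_keywords = parser.meta.get("keywords", []) + parser.meta.get("keyword", [])
    return {
        "title": parser.title.strip(),
        "creators": parser.meta.get("creator", []),
        "keywords": [k.strip() for v in raw_keywords for k in v.split(",") if k.strip()],
    }


page = """<html><head><title> Physics Preprints </title>
<meta name="creator" content="A. Author">
<meta name="creator" content="B. Writer">
<meta name="keywords" content="physics, preprints">
</head><body>...</body></html>"""
record = extract_head_metadata(page)
print(record["title"])     # Physics Preprints
print(record["creators"])  # ['A. Author', 'B. Writer']
print(record["keywords"])  # ['physics', 'preprints']
```

This is the "good precision, no smarts" case: when the tags are present the values are copied straight through, and when they are absent the fields simply come back empty.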

13 INFOMINE Assignments
Description
–1-2 paragraphs long
–Meta tags and AutoAnnotator (NLP + IR approach)
LCSH
–Selected from over 200,000 values
–Determines nearest neighbor in human-assigned data set (IR and ML)
INFOMINE Category
–Put document in set of nine categories
–Uses nine binary classifiers created using ML
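The nearest-neighbor idea can be sketched as: represent each document as a term vector, find the most similar human-labeled document, and reuse its heading. Cosine similarity over raw term counts is an assumption here, and the labeled examples and headings are invented for illustration; the slides do not say which similarity measure or features INFOMINE used.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_heading(text, labeled):
    """Return the heading of the labeled document most similar to the text."""
    query = term_vector(text)
    best = max(labeled, key=lambda pair: cosine(query, term_vector(pair[0])))
    return best[1]


# Hypothetical human-assigned examples: (document text, subject heading).
labeled = [
    ("stars galaxies telescope observation", "Astronomy"),
    ("cells proteins genes organisms", "Biology"),
]
print(nearest_heading("new telescope images of distant galaxies", labeled))  # Astronomy
```

With a vocabulary of over 200,000 headings, the quality of the result depends heavily on how well the human-assigned data set covers the headings that new documents need.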

14 Summary
Metadata is useful but expensive
–Lots of human effort to generate
–Need to automate when possible
Metadata generation
–NLP, pattern matching, term/phrase matching, IR
–Approaches appropriate for generating different types of metadata
Evaluating generated metadata
–Automatic vs. human evaluation
–Accuracy, precision/recall, etc.

