
1 Automatic Metadata Generation & Evaluation
Generating & Evaluating MetaData
Elizabeth D. Liddy
Center for Natural Language Processing
School of Information Studies, Syracuse University
Cornell Libraries, March 21, 2003

2 Automatic Metadata Generation & Evaluation
“Practical Digital Libraries” by Mike Lesk
Definition: Enable digital information to be accessed rapidly around the world, copied for preservation without error, stored compactly, & searched very quickly.
Combine:
– Principle-based acquisition and organization of information – Library Science
– Digital representation and searching – Computer Science
The Web serves as a prime example of a shared world-wide collection of information
– Turned into mini digital libraries by groups that select, organize, & metatag a special-interest collection

3 Automatic Metadata Generation & Evaluation
From Modern Information Retrieval: A Digital Library is a combination of:
– a collection of digital objects (Repository)
– descriptions of those objects (Metadata)
– a set of users or target audience (Community)
– a system that offers a variety of services, such as: capture, indexing, cataloging, search, browsing, retrieval, delivery, archiving, & preservation

4 Automatic Metadata Generation & Evaluation
Metadata – Data about data
Originally based on MARC (Machine Readable Catalog)
– Record-oriented language with rigorous formats tailored for bibliographic entries
– Does not support many of the features needed for full-text documents
Moved to SGML, a more flexible standard
– A syntax and a philosophy
– Information about an object “is to be tagged meaningfully,” with tags contained within angle brackets, e.g. <title>Huckleberry Finn</title> <author>Mark Twain</author>

5 Automatic Metadata Generation & Evaluation
Metadata (cont’d)
Dublin Core Metadata Initiative (DCMI) Mission – Make it easier to find resources on the Internet:
– Develop metadata standards for resource discovery across domains
– Define frameworks for the interoperation of metadata sets
– Facilitate the development of community- or domain-specific metadata sets that work within these frameworks

6 Automatic Metadata Generation & Evaluation
National Science Digital Library (NSDL)
Provide access to the best standards-based, inquiry-driven learning for K-12 and undergraduate students.
Focus on gathering & organizing content in science, technology, engineering & math (STEM)
Library went live on December 3, 2002
http://www.nsdl.org

7 Automatic Metadata Generation & Evaluation
NSDL MetaData Research Projects
Breaking the MetaData Generation Bottleneck
– CNLP, IST, Syracuse
– University of Washington
– Information Institute of Syracuse
StandardConnection
– University of Washington
– CNLP
MetaTest
– CNLP
– Center for Human Computer Interaction, Cornell University

8 Automatic Metadata Generation & Evaluation
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest

9 Automatic Metadata Generation & Evaluation
Breaking the MetaData Generation Bottleneck
Goal: Demonstrate feasibility of high-quality automatically-generated metadata for digital libraries through Natural Language Processing
Data: Full-text resources from ERIC and the Eisenhower National Clearinghouse on Science & Mathematics
Metadata Schema: Dublin Core + Gateway for Educational Materials (GEM) Schema

10 Automatic Metadata Generation & Evaluation
Metadata Schema Elements
GEM Metadata Elements: Audience, Cataloging, Duration, Essential Resources, Grade, Pedagogy, Quality, Standards
Dublin Core Metadata Elements: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type

11 Automatic Metadata Generation & Evaluation
Method: Information Extraction
Natural Language Processing
– Technology which enables a system to accomplish human-like understanding of document contents
– Extracts both explicit and implicit meaning
Sublanguage Analysis
– Utilizes domain- and genre-specific regularities vs. full-fledged linguistic analysis
Discourse Model Development
– Extractions specialized for the communication goals of the document type and the activities under discussion

12 Automatic Metadata Generation & Evaluation
Information Extraction – Types of Features
Non-linguistic
– Length of document
– HTML and XML tags
Linguistic
– Root forms of words
– Part-of-speech tags
– Phrases (Noun, Verb, Proper Noun, Numeric Concept)
– Categories (Proper Name & Numeric Concept)
– Concepts (sense-disambiguated words / phrases)
– Semantic Relations
– Discourse-Level Components

13 Automatic Metadata Generation & Evaluation
Sample Lesson Plan: Stream Channel Erosion Activity
Student/Teacher Background: Rivers and streams form the channels in which they flow. A river channel is formed by the quantity of water and debris that is carried by the water in it. The water carves and maintains the conduit containing it. Thus, the channel is self-adjusting. If the volume of water or amount of debris is changed, the channel adjusts to the new set of conditions. …
Student Objectives: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam. …

14 Automatic Metadata Generation & Evaluation
NLP Processing of Lesson Plan
Input: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Morphological Analysis: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Lexical Analysis: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.
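The lexical analysis step can be reproduced with off-the-shelf tools. A minimal sketch using NLTK (an assumption; the slides do not name CNLP's tagger, and NLTK's Penn Treebank tagset writes NNP where the slide's tagset writes NP):

    # Part-of-speech tagging sketch with NLTK (not CNLP's actual pipeline).
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = ("The student will discuss stream sedimentation that occurred "
                "in the Grand Canyon as a result of the controlled release "
                "from Glen Canyon Dam.")
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)   # [('The', 'DT'), ('student', 'NN'), ...]
    print(" ".join(f"{word}|{tag}" for word, tag in tagged))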

15 Automatic Metadata Generation & Evaluation
NLP Processing of Lesson Plan (cont’d)
Syntactic Analysis – Phrase Identification: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.
Semantic Analysis Phase 1 – Proper Name Interpretation: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.

16 Automatic Metadata Generation & Evaluation
NLP Processing of Lesson Plan (cont’d)
Semantic Analysis Phase 2 – Event & Role Extraction
Teaching event: discuss
  actor: student
  topic: stream sedimentation
event: stream sedimentation
  location: Grand Canyon
  cause: controlled release
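The slides do not describe the extraction machinery itself; as a rough illustration only, here is a dependency-parse sketch with spaCy that pulls out actor and topic roles for each verb (a simplification, not CNLP's sublanguage-based extractor):

    # Illustrative event/role extraction via dependency parsing (spaCy).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The student will discuss stream sedimentation that occurred "
              "in the Grand Canyon as a result of the controlled release "
              "from Glen Canyon Dam.")

    for token in doc:
        if token.pos_ == "VERB":
            actors = [c.text for c in token.children if c.dep_ == "nsubj"]
            topics = [" ".join(w.text for w in c.subtree)
                      for c in token.children if c.dep_ == "dobj"]
            if actors or topics:
                # e.g. event: discuss  actor: ['student']  topic: [...]
                print(f"event: {token.lemma_}  actor: {actors}  topic: {topics}")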

17 Automatic Metadata Generation & Evaluation
MetaExtract
[System diagram: a gathering program fetches an HTML document, which a preprocessor and HTML converter prepare for three generators. The Cataloger (driven by configuration data) supplies Catalog Date, Rights, Publisher, Format, Language, and Resource Type; the eQuery Extraction Module supplies Creator, Grade/Level, Duration, Date, Pedagogy, Audience, and Standard; tf/idf keyword identification supplies Keywords; the Metadata Retrieval Module supplies Title, Description, Essential Resources, and Relation. Output: the HTML document with metadata attached.]

18 Automatic Metadata Generation & Evaluation
Automatically Generated Metadata
Title: Grand Canyon: Flood! – Stream Channel Erosion Activity
Grade Levels: 6, 7, 8
GEM Subjects: Science--Geology; Mathematics--Geometry; Mathematics--Measurement
Keywords:
  Proper Names: Colorado River (river), Grand Canyon (geography / location), Glen Canyon Dam (buildings & structures)
  Subject Keywords: channels, clayboard, conduit, controlled_release, cookie_sheet, cup, dam, flow_volume, hold, paper_towel, pencil, reservoir, rivers, roasting_pan, sand, sediment, streams, water

19 Automatic Metadata Generation & Evaluation
Automatically Generated Metadata (cont’d)
Pedagogy: Collaborative learning; Hands-on learning
Tool For: Teachers
Resource Type: Lesson Plan
Format: text/HTML
Placed Online: 1998-09-02
Name: PBS Online
Role: onlineProvider
Homepage: http://www.pbs.org

20 Automatic Metadata Generation & Evaluation
Evaluating MetaData: Blind Test of Automatic vs. Manual Metadata
Expectation Condition – Subjects reviewed:
1st – metadata record
2nd – lesson plan
and then judged whether the metadata provided an accurate preview of the lesson plan on a 1-to-5 scale

21 Automatic Metadata Generation & Evaluation
Evaluating MetaData: Blind Test of Automatic vs. Manual Metadata
Expectation Condition – Subjects reviewed:
1st – metadata record
2nd – lesson plan
and then judged whether the metadata provided an accurate preview of the lesson plan on a 1-to-5 scale
Satisfaction Condition – Subjects reviewed:
1st – lesson plan
2nd – metadata record
and then judged the accuracy and coverage of the metadata on a 1-to-5 scale, with 5 being high

22 Automatic Metadata Generation & Evaluation
Qualitative Study Results
                                    Expec   Satis   Comb
# Manual Metadata Records             153     571    724
# Automatic Metadata Records          139     532    671

23 Automatic Metadata Generation & Evaluation
Qualitative Study Results
                                    Expec   Satis   Comb
# Manual Metadata Records             153     571    724
# Automatic Metadata Records          139     532    671
Manual Metadata Average Score        4.03    3.81   3.85
Automatic Metadata Average Score     3.76    3.55   3.59

24 Automatic Metadata Generation & Evaluation
Qualitative Study Results
                                    Expec   Satis   Comb
# Manual Metadata Records             153     571    724
# Automatic Metadata Records          139     532    671
Manual Metadata Average Score        4.03    3.81   3.85
Automatic Metadata Average Score     3.76    3.55   3.59
Difference                           0.27    0.26   0.26

25 Automatic Metadata Generation & Evaluation
Current Status of Metadata Generation
Improving automatic metadata extraction / generation capabilities
– Based on findings of the pilot user study
Moving towards integration of NLP-based metadata generation in the NSDL
– Core Integration Team at Cornell (Carl Lagoze, Donna Bergmark)
Revising evaluation procedures

26 Automatic Metadata Generation & Evaluation
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest

27 Automatic Metadata Generation & Evaluation
StandardConnection
Goal: Determine feasibility & quality of automatically mapping teaching standards to learning resources, e.g.:
“Solve linear equations and inequalities algebraically and non-linear equations using graphing, symbol-manipulating or spreadsheet technology.”
Data:
– Educational Resources: Lesson Plans, Activities, Assessment Units, etc. from ERIC
– Teaching Standards: Achieve/McREL Compendix

28 Automatic Metadata Generation & Evaluation
StandardConnection Components
– State Standards
– Compendium (Compendix), e.g. Mathematics 6.2.1 CG: Adds, subtracts, multiplies, and divides whole numbers and decimals (A. addition, B. subtraction, C. multiplication, D. division, E. whole number, F. decimal, G. product, H. remainder, I. quotient)
– Educational Resources: Lesson Plans, Activities, Assessment Units, etc.

29 Automatic Metadata Generation & Evaluation
Cross-mapping through the Compendix Meta-language
“Simultaneous Equations Using Elimination” – URI: M8.4.11ABCJ
The Compendix links the state mappings: Washington, Arkansas, Alaska, Michigan, California, New York, Florida, Texas

30 Automatic Metadata Generation & Evaluation
Lesson Plan: “Simultaneous Equations Using Elimination”
Submitted by: Leslie Howe
Email: teachhowe2@hotmail.com
School/University/Affiliation: Farragut High School, Knoxville, TN
Grade Level: 9, 10, 11, 12, Higher education, Vocational education, Adult/Continuing education
Subject(s): Mathematics / Algebra
Duration: 30 minutes
Description: The Elimination method is an effective method for solving a system of two unknowns. This lesson provides students with immediate feedback using a computer program or online applet.
Goals: The student will be able to solve a system of two equations when there are two unknowns.
Materials: Online computer applet / program http://www.usit.com/howe2/eqations/index.htm Similar downloadable C++ application available at the same site.
Procedure: A system of two unknowns can be solved by multiplying each equation by the constant that will make the coefficient of one of the variables become the LCM (least common multiple) of the initial coefficients. Students may use the scroll bars on the indicated applet to multiply the equations by constants until the GCF is located. When the "add" button is activated after the correct constants are chosen, one of the variables will be eliminated. The process can be repeated for the second variable. The student may enter the solution of the system by using scroll bars. When the "check" button is pressed, the answer is evaluated and the student is given immediate feedback. (The same procedure can be done using the downloadable C++ application.) After 5-10 correct responses the student should make the transition to paper and solve the equations without using the applet. The student can still use the applet to check the answer. The applet will generate problems in a random fashion. All solutions are integers.
Assessment: The lesson itself provides alternative assessment. The correct responses are recorded.

31 Automatic Metadata Generation & Evaluation
Lesson Plan: “Simultaneous Equations Using Elimination” (cont’d)
The same lesson plan record, now with the automatically assigned standard attached:
Standard: McREL 8.4.11 – Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities

32 Automatic Metadata Generation & Evaluation
Automatic Assigning of Standards as a Retrieval Process
[Diagram: the lesson plan serves as the query, run against indexed terms from the standards; the best-matching standard is assigned to the lesson plan.]

33 Automatic Metadata Generation & Evaluation
Standards Assembled, Processed & Indexed (DOCUMENT COLLECTION = Compendix)
The index of standards is assembled from the subject heading, secondary subject, actual standard, and vocabulary.
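A sketch of how one indexable document per standard might be assembled from those four fields (the field names and record layout are assumptions, not the actual Compendix schema):

    # Assemble one indexable document per standard from its descriptive fields.
    def standard_to_document(standard: dict) -> str:
        """Concatenate the fields of one standard into indexable text."""
        parts = [
            standard.get("subject_heading", ""),
            standard.get("secondary_subject", ""),
            standard.get("text", ""),                 # the standard itself
            " ".join(standard.get("vocabulary", [])),
        ]
        return " ".join(p for p in parts if p)

    example = {
        "subject_heading": "Mathematics",
        "secondary_subject": "Arithmetic",
        "text": "Adds, subtracts, multiplies, and divides whole numbers and decimals",
        "vocabulary": ["addition", "subtraction", "multiplication", "division"],
    }
    print(standard_to_document(example))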

34 Automatic Metadata Generation & Evaluation
QUERY = NLP-Processed Lesson Plan
New lesson plan, processed in stages:
1. Natural Language Processing: includes part-of-speech tagging, bracketing of phrases & proper names, e.g. Simultaneous|JJ Equations|NNS Using|VBG Elimination|NN
2. Filtering: sections are eliminated or given greater weight (e.g., citations are removed), leaving the relevant parts of the lesson plan
3. TF/IDF: relative frequency weights of words, phrases, proper names, etc.
4. Query = top 30 terms, e.g.: equation, eliminate, solve
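Steps 3-4 can be sketched with scikit-learn's TF/IDF implementation (an approximation; CNLP's weighting of phrases and proper names may differ, and the toy texts below are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy collection; IDF statistics require the rest of the collection too.
    lesson_plans = [
        "solve simultaneous equations using elimination eliminate equation variable",
        "stream channel erosion rivers sediment water dam canyon",
        "living and non-living things plant insect classify",
    ]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(lesson_plans)

    terms = vectorizer.get_feature_names_out()
    weights = tfidf[0].toarray().ravel()          # weights for the first lesson plan
    top = [terms[i] for i in weights.argsort()[::-1][:30]]
    print(top)                                    # query vector = top 30 terms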

35 Automatic Metadata Generation & Evaluation
Teaching Standard Assignment as Retrieval Task Experiment
Exploratory test run:
– 3,326 standards (documents)
– TF/IDF term weighting scheme
– 2,239 lesson plans (queries)
– top 30 weighted terms from each lesson plan as a query vector
Manual evaluation
– Focusing on understanding of issues & solutions
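The retrieval run itself (each standard a document, the lesson plan's top terms a query, ranked by similarity) might look like this; a sketch of the general technique, not the project's actual system, with invented example texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    standards = [   # document collection: one string per standard
        "Solves simple inequalities and non-linear equations with rational number solutions",
        "Uses a variety of methods with graphs, algebraic methods, and matrices "
        "to solve systems of equations and inequalities",
        "Adds, subtracts, multiplies, and divides whole numbers and decimals",
    ]
    query = "equation eliminate solve system unknown variable"  # top lesson-plan terms

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(standards)
    query_vec = vectorizer.transform([query])

    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    for i in scores.argsort()[::-1]:              # best-matching standard first
        print(f"{scores[i]:.3f}  {standards[i]}")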

36 Automatic Metadata Generation & Evaluation
Information Retrieval Experiments – Baseline Results
– 68 queries (lesson plans) reviewed: 30 in math, 37 in science, 1 in art
– 24 (35%) of queries: the appropriate standard was ranked first
– 28 (41%) of queries: the predominant standard was in the top 5
– Room for improvement

37 Automatic Metadata Generation & Evaluation
Future Research
– Improve current retrieval performance: matching algorithm, document expansion, etc.
– Apply a classification approach to the StandardConnection project
– Compare the information retrieval approach and the classification approach

38 Automatic Metadata Generation & Evaluation
Automatic Assignment of Standard to Lesson Plan – Linked Browsing Access to Learning Resources
Lesson Plan with Standards attached:
– Standard 8.3.6: Solves simple inequalities and non-linear equations with rational number solutions, using concrete and informal methods.
– Standard 8.4.11: Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities
– Standard 8.4.12: Understands formal notation (e.g., sigma notation, factorial representation) and various applications (e.g., compound interest) of sequences and series
Standard 8.4.11 links into a browsable map of standards, e.g. Strand Maps

39 Automatic Metadata Generation & Evaluation
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest

40 Automatic Metadata Generation & Evaluation
Questioning of Metadata Assumptions
– Do we need metadata? Why?
– How much metadata do we need? For what purposes?
– Which elements do we need? For which digital library tasks?
– How do information-seekers utilize the metadata when browsing / searching / previewing?
– How do automatically vs. manually generated metadata perform in a standard IR experiment?

41 Automatic Metadata Generation & Evaluation
Life-Cycle Evaluation of Metadata
1. Initial generation
   – Methods: Manual, Automatic
   – Costs: Time, Human Resources, Technology
2. Accessing DL resources
   – Users’ interactions: Browsing, Searching
   – Relative contribution of each metadata element
3. Search Effectiveness
   – Precision
   – Recall

42 Automatic Metadata Generation & Evaluation
GOAL: Measure Quality & Usefulness of Metadata
[Diagram: Metadata Generation System → User Metadata Understanding → Evaluation, with measures Precision, Recall, Browsing, Searching]
METHODS: Manual, Semi-Automatic, Automatic
COSTS: Time, Human Resources, Technology

43 Automatic Metadata Generation & Evaluation
Evaluation Methodology
Automatically metatag a Digital Library collection that has already been manually metatagged.
Solicit a range of appropriate Digital Library users.
For each metadata element:
1. Users qualitatively evaluate it in light of the digital resource.
2. Observe subjects while searching & browsing; monitor with eye-tracking & think-aloud protocols.
3. Conduct a standard IR experiment.

44 Automatic Metadata Generation & Evaluation Qualitative Evaluation

45 Automatic Metadata Generation & Evaluation
[System diagram repeated from slide 17: the MetaExtract pipeline, whose modules generate the metadata elements detailed on the following slides.]

46 Automatic Metadata Generation & Evaluation
GEM Metadata: Automatically Generated – Configuration Files
Cataloger: CNLP metaExtract™
Catalog Date: 9/24/01
Rights: http://askeric.org/Virtual/Lessons/copyright.htm
Publisher: AskERIC
Format: Text/html
Language: English

47 Automatic Metadata Generation & Evaluation
GEM Metadata: Automatically Generated – eQuery
Creator: Kelli Carfang
Grade: Preschool education, Kindergarten, 1
Duration: One 45-minute session
Date: September 22, 1999
Pedagogy-teaching-method: Brainstorming

48 Automatic Metadata Generation & Evaluation
GEM Metadata: Automatically Generated – eQuery (cont’d)
Pedagogy-method-teaching-process: Recognize, Classify, Introduce, Invite, Explain, Make, Pass, Do, Make, Go over, Give

49 Automatic Metadata Generation & Evaluation
GEM Metadata: Automatically Generated – tf/idf Keyword Identification
Keywords: live things, characteristics, plant, think up, living things, live, insect

50 Automatic Metadata Generation & Evaluation
GEM Metadata: Automatically Generated – MetaData Retrieval Module
Title: Living or Non-living
Description: In this lesson plan, the students will recognize the differences between living creatures and non-living objects. The students will classify different objects as being living or non-living.
Resources: a plant; a live insect in a jar; an artificial plant
Relations: Linked to: http://manning.boston.k12.ma.us/scilivingthings.htm

51 Automatic Metadata Generation & Evaluation System Evaluation

52 Automatic Metadata Generation & Evaluation
Information Retrieval Experiment
– Users ask queries of the system
– System retrieves documents using either manually assigned metadata or automatically generated metadata
– System ranks documents in order by its estimation of relevance
– User reviews retrieved documents & judges if relevant
– Compute metrics:
  Precision: How many retrieved documents are relevant?
  Recall: How many of the relevant documents in the collection are retrieved?
– Compare results between the two methods of assignment
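A minimal sketch of the two metrics, set-based and ignoring rank position (the document IDs are hypothetical):

    def precision_recall(retrieved: set, relevant: set) -> tuple:
        """Set-based precision and recall for a single query."""
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical document IDs for illustration.
    p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
    print(f"precision={p:.2f}  recall={r:.2f}")   # precision=0.50  recall=0.67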

53 Automatic Metadata Generation & Evaluation Sample Lesson Plans

54 Automatic Metadata Generation & Evaluation User Evaluation

55 Automatic Metadata Generation & Evaluation
User Studies: Methods & Questions
1. Observations of Users Seeking DL Resources
– How do users search & browse the digital library?
– Do search attempts reflect the available metadata?
– Which metadata elements are the most important to users?
– Which metadata elements are used most consistently with the best results?

56 Automatic Metadata Generation & Evaluation
User Studies: Methods & Questions (cont’d)
2. Eye-tracking with Think-aloud Protocols
– Which metadata elements do users spend the most time viewing?
– What are users thinking about when seeking digital library resources?
– Show the correlation between what users are looking at and what they are thinking.
– Use eye-tracking to measure the number & duration of fixations, scan paths, dilation, etc.
3. Individual Subject Data
– How does expertise / role influence seeking resources from digital libraries?

57 Automatic Metadata Generation & Evaluation Eye Scan Path For Bug Club Document

58 Automatic Metadata Generation & Evaluation Eye Scan Path For Sigmund Freud Document

59 Automatic Metadata Generation & Evaluation
What, When, Where, and How Long
[Table: for each fixated word — the word, its fixation number, and its fixation duration.]

60 Automatic Metadata Generation & Evaluation
Conclusion: Metadata Research Goals
1. Increase the number of educational resources available electronically.
2. Increase the speed with which educational resources are added to digital libraries.
3. Add teaching standards to each resource, mappable to each state’s standards.
4. Provide empirical results on the quality and utility of automatic vs. manual metadata.
5. Inform HCI design with a better understanding of users’ behaviors when browsing and searching Digital Libraries.
6. Provide improved access to a digital library via richer, more complete and consistent metadata.

61 Automatic Metadata Generation & Evaluation Thanks! Questions?

62 Automatic Metadata Generation & Evaluation
Evaluation Methodology
1. Automatically metatag a Digital Library collection that has already been manually metatagged.
2. Solicit a range of appropriate Digital Library users.
3. Have users qualitatively evaluate metadata tags.
4. Conduct searching & browsing experiments.
5. Monitor with eye-tracking & think-aloud protocols.
6. Develop metrics of the relative utility of each metadata element (manual & automatic) for both tasks.
7. Conduct a standard IR experiment to compare the two types of metadata generation.

