Testing appraisal models with digital corpora


Patricia Galloway, School of Information, University of Texas at Austin

Why appraise?
- Not enough room [Moore's Law?]
- Not enough time for description [Google?]
- Nobody will care about most of the material anyway [the long tail?]

How appraise?
- Reject outright
- Accept everything
- Accept partially (reductive appraisal):
  - Accept by initial agreement only part of the materials offered (front-end)
  - Perform granular "processing-appraisal" (back-end)

What are the effects of reductive appraisal?
- The ideal: appraisal accurately chooses the best selection of materials for informational and evidentiary uses, according to the best knowledge of the time... and without going broke
- How (short of living forever) can we test how closely this ideal is reached?

Reductive appraisal as preemptive IR
- Is there a fit between appraisal and conventional IR methods?
- Appraisal as preemptive information retrieval:
  - IR selects desirable records, but can always go back to the original corpus
  - Appraisal selects desirable records and discards the rest; the original corpus is gone
- Can evaluation methods and measures borrowed from IR be used?
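
If appraisal is treated as a retrieval run, standard IR measures apply directly. A minimal sketch, assuming hypothetical document-ID sets `kept` (what the appraiser retained) and `relevant` (what later users turned out to need); neither set comes from the original study:

```python
def precision_recall(kept: set, relevant: set) -> tuple:
    """Score an appraisal decision as if it were a retrieval run.
    Precision: fraction of kept documents that proved relevant.
    Recall: fraction of relevant documents that were kept."""
    hit = len(kept & relevant)
    precision = hit / len(kept) if kept else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative IDs only: two of three kept documents were relevant,
# and one relevant document ("d4") was discarded and is now unrecoverable.
p, r = precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4"})
```

The asymmetry noted above shows up here: in IR a low recall can be repaired by re-querying the corpus, while in appraisal the missed documents are gone.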

Testing appraisal effectiveness against digital corpora
- Digital corpora permit digital tools, so digital corpora permit complete testing
- Sources of digital tools:
  - Corpus linguistics
  - Literary analysis tools (e.g., style, authorship)
  - Text mining/clustering
  - Information retrieval evaluation tools

Proposed experiment
- Begin with a corpus
- Use a simple appraisal model to reduce the corpus
- Measure the "information loss"
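
One crude way to operationalize "information loss" (the measure itself is not specified in the slides) is the KL divergence between the term distributions of the full and reduced corpora. A sketch, assuming documents are whitespace-tokenizable strings:

```python
import math
from collections import Counter

def term_dist(docs):
    """Relative term frequencies over a list of document strings."""
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_information_loss(full_docs, reduced_docs, eps=1e-9):
    """KL divergence D(P_full || P_reduced): one proxy for the
    information lost by reducing the corpus. Terms absent from the
    reduced corpus are smoothed with a small epsilon."""
    p = term_dist(full_docs)
    q = term_dist(reduced_docs)
    return sum(pv * math.log(pv / q.get(t, eps)) for t, pv in p.items())
```

A reduction that preserves the term distribution scores near zero; a reduction that drops whole topics scores high, which matches the intuition behind the experiment.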

Example: PKG's MDAH email
- Sent mail only
- Attachments removed
- Consistent class of records
- No privacy concerns
- Potential for classification:
  - Topics
  - Correspondents

Appraisal modelling
- Can appraisal methods be modelled formally? Maybe not simply (cf. Gilliland's results)
- Selection constraints
- Selection as a decision tree
- Appraisal as a data-reduction process
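
"Selection as a decision tree" can be made concrete as a keep/discard predicate. The rules and field names below (`sender`, `has_attachment`, `size_kb`) are invented for illustration and are not taken from any actual appraisal model:

```python
def appraise(msg: dict) -> bool:
    """Hypothetical decision-tree appraisal rule: return True to keep
    a message, False to discard it. Every branch is an assumption."""
    if msg.get("sender") in {"director", "records-officer"}:
        return True                      # correspondence of key actors always kept
    if msg.get("has_attachment"):
        return False                     # discard messages carrying attachments
    return msg.get("size_kb", 0) >= 2    # discard trivially short messages

kept = [m for m in [{"sender": "director"}, {"size_kb": 1}] if appraise(m)]
```

Writing the tree as code is what makes the rest of the experiment possible: the same predicate can be replayed over the corpus and its reductive effects measured, rather than reconstructed from narrative case studies.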

Appraisal data reduction
- Implementing the data-reduction procedure
- Reduction of the corpus
- FBI appraisal decision tree (based on the results of the FBI appraisal)
- Selection profiles applied to Sent97: systematic sample, "fat files"
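
The two selection profiles named above have straightforward readings; a sketch of each, where the `size_kb` field on a file record is an assumed representation, not the study's actual data model:

```python
def systematic_sample(items, k):
    """Systematic-sampling profile: keep every k-th item in order."""
    return items[::k]

def fat_files(files, min_kb):
    """'Fat file' profile: keep only files at or above a size threshold,
    on the archival heuristic that bulk signals sustained activity."""
    return [f for f in files if f["size_kb"] >= min_kb]
```

Applied to a message series like Sent97, each profile yields a different reduced corpus, and the analysis stage can then compare their information loss against the original.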

Corpus preprocessing
- Isolating one message per file for the entire corpus
- Tokenization of all messages, including removal of headers
- Stopword removal
- Stemming
- Derivation of reduced versions of the corpus
- Preparing term/document incidence and term/document frequency matrices
- Calculating distance and similarity measures
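
The pipeline above can be sketched end-to-end in a few lines. The stopword list is a tiny sample, the header heuristic (everything before the first blank line) and the suffix-stripping "stemmer" are deliberate simplifications of what real tools (e.g., a Porter stemmer) would do:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # sample list only

def tokenize(text):
    """Drop the header block, lowercase, remove stopwords, and apply
    a naive suffix-stripping stand-in for a real stemmer."""
    body = text.split("\n\n", 1)[-1]              # headers end at first blank line
    tokens = re.findall(r"[a-z]+", body.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

def term_doc_matrix(docs):
    """Term/document frequency matrix, one Counter per document."""
    return {i: Counter(tokenize(d)) for i, d in enumerate(docs)}
```

The resulting per-document term vectors are exactly what the distance and similarity measures in the next step consume.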

Analysis of reduced versions against the original corpus
- Internal semantic structure (clustering of tokens)
- Network structure (correspondents, dates)
- Similarity in vector space
- Information gain (or loss)
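
"Similarity in vector space" reduces to cosine similarity between the aggregate term vectors of the original and reduced corpora; a minimal sketch over term-frequency Counters:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors:
    1.0 for identical direction, 0.0 for no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative vectors: how far has the reduced corpus drifted?
full = Counter("appraisal corpus email records corpus".split())
kept = Counter("appraisal corpus email".split())
drift = cosine(full, kept)
```

A reduced corpus whose vector stays close to the original's (cosine near 1.0) has, on this measure, lost little; the same comparison can be run per cluster or per correspondent to localize the loss.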

Larger project
- Creating multiple formal appraisal models from case studies in the archival literature
- Testing appraisal models against appropriate digital corpora and against each other

Characterizing appraisal with more elaborate formal models
- Operationalizing the implied model of record production (provenance)
- Appraisal modelled explicitly as a data-reduction process
- Discovering and specifying formal effects on content through automated content analysis

Characterizing specific appraisal contexts
- Actual digital record corpus
- Formal methods for characterizing the corpus
- Stakeholder/corpus actor-network as provenance specification
- Correspondence analysis
- Effects of assumptions on selection procedures

Evaluating appropriate appraisal procedures
- Choosing a digital corpus:
  - As a generic model for similar digital collections
  - For behavior analogous to some paper collection of interest
- Testing against formal appraisal models
- Comparing results to appraisal goals
- Formally defining appraisal method choice as a function of "acceptable loss"