Recognizing Document Value from Reading and Organizing Activities in Document Triage
Rajiv Badi, Soonil Bae, J. Michael Moore, Konstantinos Meintanis, Anna Zacchi, Haowei Hsieh, Frank Shipman
Center for the Study of Digital Libraries & Department of Computer Science, Texas A&M University
Catherine C. Marshall
Microsoft Corporation
Document Triage
Document triage is the rapid evaluation of a set of documents for later use.
Document triage places different demands on attention than single-document reading activities.
Continuum of types of reading:
– working in overview (metadata)
– reading at various levels of depth (skimming)
– reading intensively
Visual Knowledge Builder (VKB)
Search in VKB
Supporting Document Triage
The central problem in document triage is limited time.
VKB enables rapid expression of human assessment using visual cues.
Goal: have the system aid in selecting documents.
How: observe the user’s triage activities to provide cues that will aid in the selection of further documents.
Process for Providing Support
1. Recognize user interest in and interpretation of documents
2. Generate an abstract representation of user interests
3. Identify documents that match these interests
4. Provide visual cues to indicate the potential value of documents
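The four steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the event weights, the term-based profile, and the highlight threshold are all hypothetical and are not taken from VKB or from the study.

```python
from collections import Counter

def recognize_interest(events):
    """Step 1: sum per-event weights into a per-document interest score."""
    # Hypothetical weights, chosen only for illustration.
    weights = {"move": 2.0, "scroll": 1.0, "open": 0.5}
    scores = Counter()
    for doc_id, action in events:
        scores[doc_id] += weights.get(action, 0.0)
    return dict(scores)

def build_profile(scores, doc_terms):
    """Step 2: weight each term by the interest in the documents containing it."""
    profile = Counter()
    for doc_id, score in scores.items():
        for term in doc_terms.get(doc_id, ()):
            profile[term] += score
    return dict(profile)

def match_documents(profile, candidates):
    """Step 3: score unseen documents by the profile weight of their distinct terms."""
    return {
        doc_id: sum(profile.get(term, 0.0) for term in set(terms))
        for doc_id, terms in candidates.items()
    }

def visual_cue(score, threshold=2.0):
    """Step 4: map a match score to a cue (the threshold is an assumption)."""
    return "highlight" if score >= threshold else "none"

events = [("d1", "move"), ("d1", "scroll"), ("d2", "open")]
scores = recognize_interest(events)
profile = build_profile(scores, {"d1": ["ethnomathematics", "culture"], "d2": ["algebra"]})
matches = match_documents(profile, {"d3": ["culture", "geometry"], "d4": ["algebra"]})
cues = {doc_id: visual_cue(score) for doc_id, score in matches.items()}
```

A term-weight profile is just one possible "abstract representation"; any representation that lets unseen documents be scored would fit step 2.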
Acquiring User Interest Model
Explicit methods
– Users tend not to provide explicit feedback
Implicit methods
– Reading time has been used in many cases
– Scrolling and mouse events have been shown to be somewhat predictive
– Annotations have been used to identify passages of interest
Problem: individuals vary greatly and have idiosyncratic work practices
Data from an Earlier Study
Task: subjects were placed in the role of a reference librarian, selecting and organizing information on ethnomathematics for a teacher
Setting: top 20 search results from NSDL and top 20 search results from Google
Subjects were given as much time as they deemed necessary (after training)
After completing the task, the 24 subjects were asked to identify:
– the 5 documents they found most valuable
– the 5 documents they found least valuable
[Screenshot: VKB and Internet Explorer (IE)]
What Actions Were Correlated with Document Preferences?
Lots (ordered from most to least correlated)
– Number of object moves
– Scroll offset
– Number of scrolls
– Number of border color changes
– Number of object resizes
– Total number of scroll groups
– Number of scrolling direction changes
– Number of background color changes
– Time spent in document
– Number of border width changes
– Number of object deletions
– Number of document accesses
– Length of document in characters
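A ranking like the one above could be produced by correlating each activity feature with a preference label. The sketch below assumes Pearson correlation and a label of +1 for "most valuable" and -1 for "least valuable"; the slides do not say which statistic or coding was used, and the numbers are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def rank_features(features, preference):
    """Order features from most to least correlated (by absolute value)."""
    scored = {name: pearson(values, preference) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)

# Invented example: four documents, two most valuable (+1), two least valuable (-1).
preference = [1, 1, -1, -1]
features = {
    "doc_length": [100, 300, 200, 250],
    "object_moves": [4, 3, 1, 0],
}
ranked = rank_features(features, preference)
```

Ranking by absolute value keeps strongly negative correlations near the top, since an action that signals disinterest is as informative as one that signals interest.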
Modeling based on Reading and Interpretation
Document triage combines multiple forms of reading and interpretation
Infrastructure for applications to construct and share interest models
[Architecture diagram. Components: Location/Overview Application, Organizing Application, Reading Applications, User Interest Estimation Engine, Interest Profile Manager, Interest Profile]
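One way to picture shared interest models is a manager that collects per-document estimates from several applications and combines them. The class, its method names, and the averaging rule below are assumptions made for illustration; they are not the actual Interest Profile Manager interface.

```python
from collections import defaultdict

class InterestProfileManager:
    """Hypothetical sketch: gathers interest estimates reported by applications."""

    def __init__(self):
        # doc_id -> {application name: latest estimate in [0, 1]}
        self._evidence = defaultdict(dict)

    def report(self, app_name, doc_id, estimate):
        """Record (or overwrite) one application's estimate for a document."""
        self._evidence[doc_id][app_name] = estimate

    def interest(self, doc_id):
        """Combine estimates by simple averaging; None if nothing was reported."""
        estimates = self._evidence.get(doc_id)
        if not estimates:
            return None
        return sum(estimates.values()) / len(estimates)

manager = InterestProfileManager()
manager.report("reading", "d1", 0.4)
manager.report("organizing", "d1", 0.8)
combined = manager.interest("d1")
```

Averaging is the simplest possible combination rule; a weighted combination would be the natural next step if one application's evidence proved more predictive than another's.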
Interest Models
Based on data from an earlier study, we developed four interest models
– Three were mathematically derived:
  Reading-Activity Model
  Organizing-Activity Model
  Combined Model
– One hand-tuned model included human assessment based on observations of user activity and interviews with users
Quick Comparison of Models
How much of the difference in the original data was modeled?
– Reading-activity model: 47.7%
– Organizing-activity model: 63.6%
– Combined model: 70.8%
How well would the models do for new data?
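If these percentages are read as explained variance (the coefficient of determination, R²), they could be computed as sketched below. That reading is an assumption, since the slides do not name the statistic, and the sample values are invented.

```python
def r_squared(actual, predicted):
    """Fraction of the variance in `actual` accounted for by `predicted`."""
    mean = sum(actual) / len(actual)
    total = sum((a - mean) ** 2 for a in actual)
    if total == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - residual / total

# Invented example: four documents with known preference scores.
actual = [1, 2, 3, 4]
predicted = [1.5, 2, 3, 3.5]
explained = r_squared(actual, predicted)  # 0.9, i.e. 90% of the variance
```

Under this reading, a value measured on the data a model was fitted to says little about new data, which is why the slide closes with the question of generalization.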
Evaluation of Models
16 subjects with the same
– Task (collecting information on ethnomathematics for a teacher) and
– Setting (20 NSDL and 20 Google results)
Different display configuration
– Subjects used a single display in this study, where the earlier study used two displays
Different rating of documents
– Subjects rated all documents on a 5-point Likert scale (with 1 meaning “not useful” and 5 meaning “very useful”)
Predictive Power of Models
Models were conservative due to data from original study.
Used aggregated user activity and user evaluations to evaluate models
[Table. Columns: Model, Avg. Residue, Std. Dev. Rows: Reading-activity model, Organizing-activity model, Combined model, Hand-tuned model]
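A minimal sketch of how the two table columns could be computed, assuming "residue" means the absolute difference between a model's predicted value and the subjects' rating for a document. The slides do not define the term, and the ratings and predictions below are invented.

```python
from statistics import mean, stdev

def residue_summary(ratings, predictions):
    """Return (average residue, sample standard deviation of residues)."""
    # Assumption: residue = |predicted - rated| on the same 5-point scale.
    residues = [abs(p - r) for r, p in zip(ratings, predictions)]
    return mean(residues), stdev(residues)

# Invented example: aggregated ratings for four documents and one model's predictions.
ratings = [5, 3, 1, 4]
predictions = [4, 3, 2, 2]
avg_residue, std_dev = residue_summary(ratings, predictions)
```

A lower average residue means predictions sit closer to the ratings; the standard deviation shows how consistent that closeness is across documents.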
Size of Errors
Next Steps
Update models
– Revise weights based on Likert-scale data
– Incorporate additional features of user activity
Run another set of subjects with the same form of document evaluation
Evaluate predictive power for individuals
Evaluation with other domains/tasks
– Effect of document set
– Effect of domain/subject matter expertise
Summary
Our goal is to support document triage by inferring user interest
Developed infrastructure for applications to share interest models
Compared reading-activity, organizing-activity, and combined models
Combined model was better than the reading-activity model (p=0.02) and the organizing-activity model (p=0.07)
Lots of work left to do …
Contact Information Frank Shipman Download VKB 2 from: