Planning for the TREC 2008 Legal Track Douglas Oard Stephen Tomlinson Jason Baron
Agenda Track goals Deciding on a document collection “Beating Boolean” Handling nasty OCR Making the best use of the metadata Ad hoc task design Interactive task design Relevance feedback task design Other issues
Track Goals Develop a reusable test collection –Documents, topics, evaluation measures Foster formation of a research community Establish baseline results
Choosing a Collection FERC Enron (w/attachments, full headers) –Somewhat larger than CMU –E-mail is the real killer app for E-discovery IIT CDIP version 1.0 (same as 2006/07) –We have 83 topics. Do we need more? State Department Cables –Task model would be FOIA, not E-discovery
TREC Topic Number: 1
Title: Marketers or Traders of Electricity on the Financial Market
Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.
Narrative: A relevant document must at a minimum identify the name and address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.
Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.
Query Possibilities:
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
  o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.
(marketer or marketers or EPMI) and (short or long)
  o As in have a long or short position in sales/purchases.
(marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
  o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
  o EOL was the forward market trading place. (36, p. 3)
Identity Modeling in Enron
[Figure: name and address variants linked to one identity, e.g. susan m scott, susan scott, susan, sue, suebob, sscott, sscott5]
66,715 models 82,084 addr-name 3,151 addr-nickname 19,708 addr-addr
Enron Identity Test Collections
Collection s Identities Mention Candidates Queries Min. Avg. Max.
Sager 1,
Shapiro
Enron-subset 54,018 27,
Enron-all 248,451 123,
Example Document
Metadata:
Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY
Organization Authors: PMUSA, PHILIP MORRIS USA
Person Authors: HALLE, L
Document Date:
Document Type: MEMO, MEMORANDUM
Bates Number: /9377
Page Count: 2
Collection: Philip Morris
OCR:
Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa Benffrts Departmext Rieh>pwna, Yfe&ia Ta: Dishlbutfon Data aday 90,1997. From: Lisa Fislla Sabj.csr CIGNA WeWedng Newsbttsr - Yntsre StratsU During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a msiter of disanision. I Imvm done somme reaearc>>, and wanted to pruedt you with my Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee*. I believe.vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on whetlne you concur with my reeommendatioa …
State Department Cables 791,857 records – 550,983 of which are full text
Handling Nasty OCR Index pruning Error estimation Character n-grams Duplicate detection Expansion using a cleaner collection
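One way to picture the character n-gram option: an OCR-damaged word rarely matches its clean form as a whole token, but it often still shares some n-grams with it. The sketch below is only an illustration under that assumption; the function names, the underscore padding, and the Jaccard overlap score are hypothetical choices, not something the track prescribes.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of each word in text."""
    grams = []
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (an illustrative choice)
        if len(padded) <= n:
            grams.append(padded)
        else:
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # "Newsbttsr" is how the OCR in the example document rendered "Newsletter".
    print(ngram_overlap("newsletter", "newsbttsr"))
```

Whole-word matching scores this pair as a miss, while the trigram sets still share the word's leading trigrams, so a query on "newsletter" can retain some evidence for the damaged document.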
How to “Beat Boolean” Work from reference Boolean? –Swap out low-ranked-in for high-ranked-out Relax Boolean somehow? –Cover density, proximity perturbation, …
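The "swap out low-ranked-in for high-ranked-out" idea can be sketched as follows. This is a minimal illustration, assuming a ranked list of unique document ids from some hypothetical ranker and a reference Boolean result set; the function name and the `max_swaps` parameter are invented for the example.

```python
def swap_boolean(boolean_set, ranked_ids, max_swaps):
    """Swap low-ranked Boolean matches for high-ranked non-matches.

    boolean_set: ids returned by the reference Boolean query.
    ranked_ids:  unique ids from a ranked retrieval run, best first.
    The result has the same size as boolean_set.
    """
    position = {doc: i for i, doc in enumerate(ranked_ids)}
    ranked_in = [d for d in ranked_ids if d in boolean_set]
    ranked_out = [d for d in ranked_ids if d not in boolean_set]

    result = set(boolean_set)
    # Pair the worst-ranked Boolean match with the best-ranked non-match.
    for in_doc, out_doc in zip(reversed(ranked_in), ranked_out):
        # Stop once the budget is spent or the outsider no longer outranks
        # the Boolean match it would replace.
        if max_swaps <= 0 or position[out_doc] > position[in_doc]:
            break
        result.remove(in_doc)
        result.add(out_doc)
        max_swaps -= 1
    return result


if __name__ == "__main__":
    boolean_hits = {"a", "b", "c"}
    ranking = ["x", "a", "y", "b", "c", "z"]
    print(sorted(swap_boolean(boolean_hits, ranking, max_swaps=5)))
```

Keeping the result the same size as the Boolean set means any gain or loss in recall comes from the swaps themselves rather than from returning more documents.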
Using Metadata Title (term match) Author (social network) Bates number (sequence)
Ad Hoc Task Design Evaluation measures Index size? –Error bars / Statistical significance testing –Limits on post-hoc use of the collection? –What are “meaningful” differences? Topic design –Negotiation transcript? Inter-annotator agreement
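For the inter-annotator agreement item, one common chance-corrected measure is Cohen's kappa. The slide does not say which measure the track would use, so the sketch below is an assumption for illustration; the label lists are made-up relevance judgments from two hypothetical assessors.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors labeling the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of documents on which the two assessors agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Agreement expected by chance from each assessor's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:  # both assessors used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    assessor_1 = [1, 1, 0, 0]  # 1 = relevant, 0 = not relevant
    assessor_2 = [1, 0, 0, 0]
    print(cohens_kappa(assessor_1, assessor_2))
```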
Interactive Track Design Evaluation measure –Precision-oriented? –Recall-oriented? –Effect of assessor disagreement
Relevance Feedback Task Evaluation measure –Residual recall at B_Residual? Two-stage feedback?
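A rough sketch of residual recall, assuming the usual residual-collection setup: documents already judged and given to systems as feedback are removed from both the ranking and the relevant set before recall is computed. The slide does not define B_Residual, so the sketch simply takes the cutoff as a parameter; all names are hypothetical.

```python
def residual_recall_at(ranking, relevant, feedback_docs, cutoff):
    """Recall at a cutoff after removing feedback documents.

    ranking:       doc ids returned by the system, best first.
    relevant:      ids of all known relevant documents.
    feedback_docs: ids already judged and supplied as feedback.
    cutoff:        depth in the residual ranking (e.g. B_Residual).
    """
    feedback = set(feedback_docs)
    residual_ranking = [d for d in ranking if d not in feedback]
    residual_relevant = set(relevant) - feedback
    if not residual_relevant:
        return 0.0
    retrieved = set(residual_ranking[:cutoff])
    return len(retrieved & residual_relevant) / len(residual_relevant)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    relevant = {"d1", "d3", "d5", "d9"}
    feedback = {"d1", "d2"}
    print(residual_recall_at(ranking, relevant, feedback, cutoff=2))
```

Removing the feedback documents first keeps a system from earning credit for simply returning documents it was told were relevant.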
Some Open Questions Test collection reusability –Unbiased estimates? Tight error bars? Why can’t we beat Boolean??? –Different strategies? Detailed failure analysis? Can we improve topic formulation? –Structured relevance feedback? Is OCR masking effects we need to see? –Is it time for a new collection? –Must it be de-duped? Is metadata needed? Does Δscope invalidate the interactive task?