Planning for the TREC 2008 Legal Track Douglas Oard Stephen Tomlinson Jason Baron.

1 Planning for the TREC 2008 Legal Track Douglas Oard Stephen Tomlinson Jason Baron

2 Agenda Track goals Deciding on a document collection “Beating Boolean” Handling nasty OCR Making the best use of the metadata Ad hoc task design Interactive task design Relevance feedback task design Other issues

3 Track Goals Develop a reusable test collection –Documents, topics, evaluation measures Foster formation of a research community Establish baseline results

4 Choosing a Collection FERC Enron (w/attachments, full headers) –Somewhat larger than CMU –Email is the real killer app for E-discovery IIT CDIP version 1.0 (same as 2006/07) –We have 83 topics. Do we need more? State Department Cables –Task model would be FOIA, not E-Discovery

5 TREC Topic Number: 1
Title: Marketers or Traders of Electricity on the Financial Market
Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.
Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.
Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.
Query Possibilities:
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)
(marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.
(marketer or marketers or EPMI) and (short or long)
o As in have a long or short position in sales/purchases.
(marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
o EOL was the forward market trading place. (36, p. 3)

6 Identity Modeling in Enron
[Figure: identity models linking email addresses to observed names and nicknames]
Addresses shown: m..scott@enron.com, scott.susan@enron.com, sscott5@enron.com
Labels shown: susan m scott, suebob, susan scott, sue, susan, ciao again, m scott, scott susan, sscott5, friday, sscott, com, members
Counts: 66,715 models; 82,084 addr-name; 3,151 addr-nickname; 19,708 addr-addr

7 Enron Identity Test Collections
Collection      Emails     Identities   Mention Queries   Candidates (Min. / Avg. / Max.)
Sager           1,628      627          51                1 / 4 / 11
Shapiro         974        855          49                1 / 8 / 21
Enron-subset    54,018     27,340       78                1 / 152 / 489
Enron-all       248,451    123,783      78                3 / 518 / 1785

8 Example Document
Metadata:
Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY
Organization Authors: PMUSA, PHILIP MORRIS USA
Person Authors: HALLE, L
Document Date: 19970530
Document Type: MEMO, MEMORANDUM
Bates Number: 2078039376/9377
Page Count: 2
Collection: Philip Morris
OCR:
Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa Benffrts Departmext Rieh>pwna, Yfe&ia Ta: Dishlbutfon Data aday 90,1997. From: Lisa Fislla Sabj.csr CIGNA WeWedng Newsbttsr - Yntsre StratsU During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a msiter of disanision. I Imvm done somme reaearc>>, and wanted to pruedt you with my Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee*. I believe.vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on whetlne you concur with my reeommendatioa …
[Panel labels: Scanned, OCR, Metadata]

9 State Department Cables 791,857 records – 550,983 of which are full text

10 State Department Cables

11 Handling Nasty OCR Index pruning Error estimation Character n-grams Duplicate detection Expansion using a cleaner collection
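
One way to picture the "character n-grams" option: OCR errors usually damage only part of a word, so overlapping character n-grams still match where whole-word lookup fails. The sketch below is illustrative only. The function names, the trigram size, the `_` padding character, and the use of Jaccard overlap are assumptions, not something the slides specify.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of a lowercased string.

    The string is padded with '_' on both sides (an assumed convention) so
    that word-initial and word-final characters get their own n-grams.
    """
    padded = f"_{text.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "Newsbttsr" is the OCR rendering of "Newsletter" in the example document.
clean_vs_ocr = ngram_overlap("newsletter", "Newsbttsr")
clean_vs_other = ngram_overlap("newsletter", "strategy")
```

Here the OCR-damaged form still shares its leading trigrams with the clean term, while an unrelated word shares none, which is the property an n-gram index would exploit.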

12 How to “Beat Boolean” Work from reference Boolean? –Swap out low-ranked-in for high-ranked-out Relax Boolean somehow? –Cover density, proximity perturbation, …
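
The "swap out low-ranked-in for high-ranked-out" idea can be sketched as follows, assuming a ranked list that covers every document in the reference Boolean result set. The function name `swap_boolean` and the parameter `k` (how many documents to swap) are hypothetical. The slides do not say how many documents should be exchanged.

```python
def swap_boolean(boolean_hits, ranking, k):
    """Replace the k lowest-ranked Boolean hits with the k highest-ranked non-hits.

    boolean_hits: iterable of doc ids matched by the reference Boolean query.
    ranking: list of doc ids, best first; assumed to include every Boolean hit.
    Returns a list the same size as the Boolean result set.
    """
    hits = set(boolean_hits)
    ranked_in = [d for d in ranking if d in hits]       # Boolean hits, best first
    ranked_out = [d for d in ranking if d not in hits]  # non-hits, best first
    k = min(k, len(ranked_in), len(ranked_out))
    kept = ranked_in[:len(ranked_in) - k]               # drop the k weakest hits
    return kept + ranked_out[:k]                        # add the k strongest non-hits


result = swap_boolean({"d2", "d4", "d5"}, ["d1", "d2", "d3", "d4", "d5"], k=1)
```

Because the output has the same size as the Boolean set, it can be compared with the reference Boolean run at the same cutoff.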

13 Using Metadata Title (term match) Author (social network) Bates number (sequence)
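
As a small illustration of using the Bates number as a sequence, the sketch below expands a value such as `2078039376/9377` from the example document into a start and end number. It assumes the part after the slash replaces the trailing digits of the start number. That reading is an assumption, although it agrees with the example document's page count of 2. The function name `bates_range` is hypothetical.

```python
def bates_range(bates):
    """Expand 'START/SUFFIX' into (start, end) integers.

    Assumption: SUFFIX replaces the same number of trailing digits of START.
    A value with no slash is treated as a single-page document.
    """
    start, _, suffix = bates.partition("/")
    if not suffix:
        return int(start), int(start)
    end = start[:len(start) - len(suffix)] + suffix
    return int(start), int(end)


first, last = bates_range("2078039376/9377")
page_count = last - first + 1
```

Documents whose ranges are adjacent in this sequence were presumably filed or scanned together, which is the kind of signal the slide hints at.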

14 Ad Hoc Task Design Evaluation measures –R@B?, P@R?, Index size? –Error bars / Statistical significance testing –Limits on post-hoc use of the collection? –What are “meaningful” differences? Topic design –Negotiation transcript? Inter-annotator agreement
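
The candidate measures can be sketched under simplifying assumptions: complete relevance judgments (the track itself would rely on sampled estimates), B taken as the size of the reference Boolean result set, and R taken as the number of relevant documents. The reading of P@R as precision at cutoff R is an assumption, and the function names are illustrative.

```python
def recall_at(ranking, relevant, cutoff):
    """Fraction of all relevant documents found in the top `cutoff` of the ranking."""
    if not relevant:
        return 0.0
    found = sum(1 for d in ranking[:cutoff] if d in relevant)
    return found / len(relevant)


def precision_at(ranking, relevant, cutoff):
    """Fraction of the top `cutoff` documents that are relevant."""
    if cutoff <= 0:
        return 0.0
    found = sum(1 for d in ranking[:cutoff] if d in relevant)
    return found / cutoff


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
relevant = {"d1", "d4", "d6"}
B = 4                   # assumed size of the reference Boolean result set
R = len(relevant)       # number of relevant documents

r_at_b = recall_at(ranking, relevant, B)
p_at_r = precision_at(ranking, relevant, R)
```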

15 Interactive Track Design Evaluation measure –Precision-oriented? –Recall-oriented? –Effect of assessor disagreement

16 Relevance Feedback Task Evaluation measure –Residual recall at B_Residual? Two-stage feedback?
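
Residual recall can be sketched as recall computed after removing the documents already judged during feedback from both the ranking and the relevant set. The function name and the idea that B_Residual is supplied by the caller are assumptions. The slide leaves the exact definition open.

```python
def residual_recall(ranking, relevant, judged, b_residual):
    """Recall at `b_residual` over documents not already judged in feedback.

    ranking: list of doc ids, best first.
    relevant: set of relevant doc ids.
    judged: set of doc ids whose judgments were given to the system as feedback.
    """
    judged = set(judged)
    residual_ranking = [d for d in ranking if d not in judged]
    residual_relevant = set(relevant) - judged
    if not residual_relevant:
        return 0.0
    top = residual_ranking[:b_residual]
    found = sum(1 for d in top if d in residual_relevant)
    return found / len(residual_relevant)


score = residual_recall(
    ranking=["d1", "d2", "d3", "d4", "d5", "d6"],
    relevant={"d1", "d3", "d5"},
    judged={"d1", "d2"},
    b_residual=2,
)
```

Removing judged documents first keeps a system from getting credit for simply returning the relevant documents it was handed as feedback.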

17 Some Open Questions Test collection reusability –Unbiased estimates? Tight error bars? Why can’t we beat Boolean??? –Different strategies? Detailed failure analysis? Can we improve topic formulation? –Structured relevance feedback? Is OCR masking effects we need to see? –Is it time for a new collection? –Must it be de-duped? Is metadata needed? Does Δscope invalidate the interactive task?

