The Enron and W3C Collections
Tamer Elsayed and Douglas W. Oard, University of Maryland
ICAIL 2007, DESI Workshop, June 4th, 2007
Variants of Search

Collection      Searcher: Participant    Searcher: Non-participant
Personal        My own emails            Shneiderman's, Postel's
Organization    Help desks               White House, Enron
Public          Online communities       Usenet news, W3C
The (Extended) Enron Collection
- Rich multimodal data: emails, phone calls, databases
The (Extended) Enron Collection
- "Public" version of Enron collection (CMU)
  - 150 sets of rescued Outlook folders
  - 517,431 emails, 52% duplicates, 133,581 unique addresses
  - Subset annotated with genre, speech act, mentioned calls, …
- Extended Enron collection (Aspen Systems)
  - Attachments, additional emails (later release, redaction)
- Phone calls from/to Enron traders (Snohomish PUD)
  - Transcribed subset from 52 DVDs of recorded audio
  - Recovered from scanned transcripts using OCR
  - 93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...
- Relational databases (Aspen Systems)
Cross-References
[Figure; surviving label: Phone Calls]
Phone Call Transcripts

Message-ID:
Message-Type: PhoneCall
Date: Fri, 26 Jan :43: (CST)
From:
To:
Parties:
Subject: Snohornish deal, Houston Chronicle Article, Bonuses, Houston Chronicle Article, Deal, to Jane King
Subject-TimePos: 145, 313, 713, 775, 920, 1018
InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
Keywords: CDWR, ,
Keywords-TimePos: 55, 689, 1038
X-From: Stack, Shari <>
X-To: Wolfe, Greg <>
X-Parties: Stack, Shari <>, Wolfe, Greg <>
X-AudioFile: R.wav
X-TranscriptFile: R.txt

SHARI STACK: Hey.
GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]
SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]
GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?
SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting.
[both laughing]
SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.
GREG: Oh, he sent a - Christian sent an email shortly after, you know, that, and said we're not doin' business with this guy.
SHARI: [laughs]
GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -
SHARI: [laughs]
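Because the transcript annotations are laid out like email headers, they can probably be read with a standard mail parser. The sketch below is a minimal illustration, not the authors' tooling: it assumes the headers are RFC 822-style `Name: value` lines followed by a blank line, that `InCallNames` and `InCallNames-TimePos` are parallel comma-separated lists, and that each transcript turn starts with `SPEAKER:`. The time positions are kept as opaque integers because the slide does not state their unit. `SAMPLE`, `split_list`, and `parse_call` are hypothetical names, and the sample is a shortened excerpt.

```python
from email import message_from_string

# Shortened excerpt in the transcript format shown on the slide.
SAMPLE = """\
Message-Type: PhoneCall
InCallNames: Christian, Ken Lay, Greg
InCallNames-TimePos: 42, 81, 90
Keywords: CDWR
Keywords-TimePos: 55

SHARI STACK: Hey.
GREG WOLFE: All right, let me get my fax machine workin'.
"""


def split_list(value):
    """Split a comma-separated header value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_call(text):
    """Parse annotation headers and speaker turns from one transcript."""
    msg = message_from_string(text)

    # Pair each mentioned name with its time position (assumed parallel lists).
    names = split_list(msg.get("InCallNames", ""))
    positions = [int(p) for p in split_list(msg.get("InCallNames-TimePos", ""))]
    mentions = list(zip(names, positions))

    # Each body line of the form "SPEAKER: utterance" becomes one turn.
    turns = []
    for line in msg.get_payload().splitlines():
        speaker, sep, utterance = line.partition(":")
        if sep:
            turns.append((speaker.strip(), utterance.strip()))

    return {"type": msg.get("Message-Type"), "mentions": mentions, "turns": turns}


if __name__ == "__main__":
    parsed = parse_call(SAMPLE)
    print(parsed["mentions"])
    print(parsed["turns"])
```

Lines without a colon, such as `[both laughing]`, are skipped by this sketch; a fuller parser would need a rule for such stage directions and for OCR noise in speaker names.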
Typical Enron Email

Parts of a message:
- Message Header
- Message Body
  - Main Body: Salutation, Signature Block
  - Quoted Text: Quoted Header, Quoted Main Body, Quoted Signature

Example message header:
Message-ID:
Date: Mon, 30 Jul :40: (PDT)
From:
To:
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth
X-To:

Example quoted header:
-----Original Message-----
From:
Sent: Monday, July 30, :24 PM
To: Sager, Elizabeth; Murphy, Harlan;
Cc:
Subject: Shhhh.... it's a SURPRISE !

Example body text:
Hope all is well. Count me in for the group present. See ya next week if not earlier
Please call me (713)
Liza
Elizabeth Sager
Hi Shari
Thanks!
Shari
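One simple way to separate new text from quoted text is to split the body at the Outlook-style "Original Message" separator. The following is a rough sketch under stated assumptions, not the method used for the collection: it assumes the separator sits on its own line, that the quoted header is a run of `Name: value` lines ended by a blank line, and that only one level of quoting is present. `SEPARATOR`, `split_reply`, `split_quoted`, and the sample `BODY` (with invented names and text) are all hypothetical.

```python
import re

# Outlook-style separator on a line of its own (assumed form).
SEPARATOR = re.compile(r"^-+[ \t]*Original Message[ \t]*-+[ \t]*$", re.MULTILINE)

# Invented message body used only for illustration.
BODY = """Hi Pat

Count me in.

Lee

-----Original Message-----
Sent: Monday
Subject: Group present

Please reply by Friday.

Thanks!
Pat
"""


def split_reply(body):
    """Return (new_text, quoted_text); quoted_text is '' if nothing is quoted."""
    match = SEPARATOR.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.end():].strip()


def split_quoted(quoted):
    """Split quoted text into (header_fields, quoted_body)."""
    header, _, text = quoted.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields, text.strip()


if __name__ == "__main__":
    new_text, quoted = split_reply(BODY)
    fields, quoted_body = split_quoted(quoted)
    print(new_text)
    print(fields)
    print(quoted_body)
```

Finding salutations and signature blocks inside each part would need further heuristics or a trained classifier, which this sketch does not attempt.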
Research Problems (Enron)
- Threading
- Classification
- Social Network Analysis
- Mention Resolution
Mention Resolution Example

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Who is that "Sheila"?
Rich Evidence about Identity
- Name and address variants shown in the figure: susan m scott, susan scott, scott susan, susan, sue, suebob, sscott, sscott5
- 66,715 models
- 82,084 addr-name
- 3,151 addr-nickname
- 19,708 addr-addr
Test Collections
Test collection of mention resolution

Collection      Emails     Identities   Queries   Candidates (Min. / Avg. / Max.)
Sager           1,
Shapiro
Enron-subset    54,018     27,
Enron-all       248,451    123,
Evaluation
- Task: named mention -> ranked list of people
- Measures
  - Mean Reciprocal Rank
  - K 1
  - Confidence-based scoring
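Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the correct person first appears:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$

A minimal sketch of that computation follows. It assumes one correct identity per mention and scores a query as 0 when the correct identity is missing from the ranked list; the slide does not say how the authors handled that case. The function names and the sample identifiers are hypothetical.

```python
from typing import Dict, Sequence


def reciprocal_rank(ranked: Sequence[str], correct: str) -> float:
    """Return 1/rank of the correct candidate, or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Dict[str, Sequence[str]],
                         answers: Dict[str, str]) -> float:
    """Average the reciprocal rank over all queries in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(ranked, answers[query])
                for query, ranked in rankings.items())
    return total / len(rankings)


if __name__ == "__main__":
    # Invented mention queries and candidate identities.
    rankings = {
        "mention-1": ["person-a", "person-b", "person-c"],  # correct at rank 1
        "mention-2": ["person-b", "person-a"],              # correct at rank 2
        "mention-3": ["person-c"],                          # correct is absent
    }
    answers = {"mention-1": "person-a",
               "mention-2": "person-a",
               "mention-3": "person-x"}
    print(mean_reciprocal_rank(rankings, answers))  # (1 + 0.5 + 0) / 3 = 0.5
```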
Limitations (Mention Resolution)
- Small number of queries
- Only resolved by Enron employees
  - Much easier
  - Most participants are outsiders
- Measures focus only on accuracy
Identity-Content Interplay
- Search for People
- Search for Content
- Social Context
- Topical Context
W3C Collection
- Set of mailing lists
  - Public, not private
  - Topically oriented
  - ~175,000 emails
- Introduced at TREC 2005
  - 50 topics (x 2 years)
  - Relevance judgments available for ad hoc retrieval
Research Problems (W3C)
- Expert Finding
  - Topic -> ranked list of experts
- Known-item Retrieval
  - Query -> ranked list of emails
- Discussion Search (i.e., ad hoc retrieval)
  - Pro/con retrieval
  - Query -> ranked list of emails
Topic Type Analysis
- Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)
Limitations (Pro/Con Retrieval)
- Not private/personal communication
- Mailing lists: receivers are hidden
- Topical categories are unbalanced
- Developed by researchers, NOT users
Related Projects
- Others working with CMU's Enron emails
  - Berkeley, CMU, U Mass, SIAM Workshop
- University of Southern California ISI/ICT
  - eArchivarius, Postel collection (Anton Leuski)
- Georgia Tech Research Institute
  - PERPOS, Presidential records (Bill Underwood)
Conclusion
- Two test collections
  - Public
  - Hundreds of thousands of emails
  - Annotated emails and transcripts
  - Tasks and ground truth
- Need for "real" user needs
- Development of evaluation measures for utility
For More Information
Joint Institute for Knowledge Discovery
Running System