Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý.


1 Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý

2 Motivation and Approach
Motivation
- To exploit the information and knowledge contained in email communication
Approach
- Social network extraction
- Extraction of entities such as people, organizations, locations and contact data
- Forming semantic trees and graphs
- User interaction with graph data
Bratislava, 26 October 2011, GCCP 2011

3 Email Social Networks
Email social networks are less explored
- Several scientific publications: Apache mailing list, Enron, …
- Commercial: Xobni (contacts and attachments)
Benefit
- Web social network sites: owned by third parties
- Email social networks: owned by an organization, individual or community
- An additional level of interaction and context is present in emails
Information and Knowledge
- People, locations, contacts, products, services, attachments or links
- Interactions
- Time
- Discovering relations can bring significant benefits
- Spread of activation: a simple way to discover relations

4 Ontea: Information Extraction Tool
- Regex patterns
- Gazetteers
- Results: key-value pairs, structured into trees and graphs
- Transformers, configuration
- Automatic loading of extractors
- Visual annotation tool
- Integration with external tools: GATE, stemmers, Hadoop, …
- Multilingual tests: English, Slovak, Spanish, Italian
http://ontea.sf.net
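The combination of regex patterns and gazetteers producing key-value pairs can be illustrated with a minimal Python sketch. This is not Ontea's API (the slides do not show it); the pattern definitions, the gazetteer contents and the function name `extract` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; Ontea's real pattern files are not shown in the slides.
PATTERNS = {
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "TelephoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical gazetteer: a fixed list of known company names.
GAZETTEER = {"Company": ["ENRON", "UBS Warburg Energy, LLC"]}


def extract(text):
    """Return (key, value) pairs found by regex patterns and gazetteer lookup."""
    pairs = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            pairs.append((key, match.group(0)))
    for key, names in GAZETTEER.items():
        for name in names:
            # Case-sensitive substring lookup, the simplest gazetteer match.
            if name in text:
                pairs.append((key, name))
    return pairs


signature = "Michael D. Grigsby, UBS Warburg Energy, LLC, 713-853-7031"
for key, value in extract(signature):
    print(f"{key}=>{value}")
```

Each pair corresponds to one extracted object, which can then become a vertex attached to the sentence or paragraph it was found in.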

5 Business Objects in Emails
A study of 6 organizations shows:
- Objects can be identified by patterns and gazetteers
- It is possible to define a set of common objects
Objects identified:
- Organization: org:Name, org:RegNo, org:TaxNo
- Person: person:Name, person:Function
- Contact: contact:Phone, contact:Email, contact:Webpage
- Address: address:ZIP, address:Street, address:Settlement
- Product: product:Name, product:Module, product:Component, product:BOID
- Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
- Inventory: inventory:ResID, inventory:ResType
- Other business object ID: BOID

6 Email Social Graph/Network

7 Use of Social Network from Email
- Includes extracted objects
- Full text of extracted objects
- Related objects discovered and ordered by spread of activation on the social network graph
- Faceted search, navigation
Email Search Prototype: http://ikt.ui.sav.sk/esns/
gSemSearch: Graph based Semantic Search
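Ordering related objects by spread of activation can be sketched as follows. The slides do not say which variant gSemSearch uses, so this is one simple form under stated assumptions: the `decay`, `threshold` and `max_steps` parameters, the equal split of activation among neighbours, and the function name are all illustrative choices.

```python
from collections import defaultdict


def spread_activation(graph, start, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from `start` and rank the other vertices by score.

    `graph` maps each vertex to a list of neighbours (undirected edges are
    listed in both directions). Each firing vertex splits its activation,
    scaled by `decay`, equally among its neighbours.
    """
    activation = defaultdict(float)
    activation[start] = 1.0
    frontier = {start: 1.0}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for vertex, value in frontier.items():
            neighbours = graph.get(vertex, [])
            if not neighbours:
                continue
            share = value * decay / len(neighbours)
            if share < threshold:
                continue  # too weak to propagate further
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for vertex, value in next_frontier.items():
            activation[vertex] += value
        frontier = next_frontier
    # Highest activation first; the start vertex itself is not a result.
    return sorted(
        ((v, a) for v, a in activation.items() if v != start),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical toy graph with short vertex names.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
for vertex, score in spread_activation(toy, "a", threshold=0.01):
    print(vertex, score)
```

Vertices reached over several short paths accumulate more activation and rank higher, which is why the method works as a simple relation-discovery heuristic.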

8 Email Example
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)
1 Vertex: Paragraph=>/6.eml0:1:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml0:1:0)
1 Vertex: Sentence=>/6.eml0:1:0
2 Edge: (Paragraph=>/6.eml0:1:0)=>(Sentence=>/6.eml0:1:0)
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
1 Vertex: Email=>mike.grigsby@enron.com
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>mike.grigsby@enron.com)
1 Vertex: Email=>robert.badeer@enron.com
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>robert.badeer@enron.com)
1 Vertex: Person:Name=>Grigsby, Mike
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Grigsby, Mike)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:GivenName=>Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:GivenName=>Robert)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Paragraph=>/6.eml659:19:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml659:19:0)
1 Vertex: Sentence=>/6.eml659:19:0
2 Edge: (Paragraph=>/6.eml659:19:0)=>(Sentence=>/6.eml659:19:0)
1 Vertex: Person:Name=>Michael D. Grigsby
2 Edge: (Sentence=>/6.eml659:19:0)=>(Person:Name=>Michael D. Grigsby)
1 Vertex: Company=>UBS Warburg Energy, LLC
2 Edge: (Sentence=>/6.eml659:19:0)=>(Company=>UBS Warburg Energy, LLC)
1 Vertex: TelephoneNumber=>713-853-7031
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-853-7031)
1 Vertex: TelephoneNumber=>713-408-6256
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-408-6256)
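A listing in this form can be loaded into an in-memory graph with a few lines of Python. The sketch below assumes one record per line, that the leading `1`/`2` is an optional record-type code, and that an edge is written as `(source)=>(target)`; the function name `parse_listing` is hypothetical and this is not the authors' loader.

```python
import re
from collections import defaultdict

# Optional numeric prefix, then the record kind, then the record body.
RECORD = re.compile(r"^(?:\d+\s+)?(Vertex|Edge):\s*(.*)$")


def parse_listing(lines):
    """Build (vertices, adjacency) from 'Vertex:' / 'Edge:' records."""
    vertices = set()
    adjacency = defaultdict(set)
    for line in lines:
        match = RECORD.match(line.strip())
        if not match:
            continue  # ignore anything that is not a record
        kind, body = match.groups()
        if kind == "Vertex":
            vertices.add(body)
        else:
            # Split at the first ")=>(" so values such as "(PST)" stay intact.
            source, sep, target = body.partition(")=>(")
            if not sep:
                continue
            source, target = source[1:], target[:-1]  # drop outer "(" and ")"
            vertices.update((source, target))
            adjacency[source].add(target)
    return vertices, adjacency


sample = [
    "1 Vertex: Sentence=>/6.eml0:1:0",
    "1 Vertex: Company=>ENRON",
    "2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)",
]
vertices, adjacency = parse_listing(sample)
print(sorted(adjacency["Sentence=>/6.eml0:1:0"]))
```

Because vertices are keyed by their `Type=>value` string, a repeated vertex such as `Company=>ENRON` collapses into a single node, which is what turns the per-email trees into one connected graph.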

9 Enron Graph Corpus Statistics

10 Conclusions and Future Directions

11 Future Direction: Relations Discovery in Large Graph Data
Motivation
- Graph/network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
- Text can also be converted to a graph.
- Interconnecting graph data and searching for relations is crucial.
Approach
- Forming semantic trees and graphs from text, web, communication, databases and LinkedData
- User interaction with graph data in order to achieve integration and data cleansing
- Users will do it if their effort has an immediate impact on search results

12 SGDB: Simple Graph Database
- Storage for graphs
- Optimized for graph traversal and spread of activation
- Faster than Neo4j for graph traversal operations
- Supports Blueprints API
- https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
Graph Database Benchmarks
- Graph Traversal Benchmark for Graph Databases
- http://ups.savba.sk/~marek/gbench.html
- Blueprints API: possibility to test compliant graph databases
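The kind of operation a graph traversal benchmark measures can be illustrated with a k-hop breadth-first traversal timed over several runs. This is only a sketch on an in-memory adjacency dict, not the authors' benchmark and not SGDB or Blueprints code; the function names and the `repeats` parameter are assumptions.

```python
import time
from collections import deque


def k_hop_neighbourhood(graph, start, k):
    """Return the set of vertices reachable from `start` in at most k hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbour in graph.get(vertex, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def time_traversal(graph, start, k, repeats=5):
    """Return (visited_count, best_seconds) over several repeated runs."""
    best = float("inf")
    visited = set()
    for _ in range(repeats):
        began = time.perf_counter()
        visited = k_hop_neighbourhood(graph, start, k)
        best = min(best, time.perf_counter() - began)
    return len(visited), best


# Hypothetical chain graph 0 -> 1 -> 2 -> 3 -> 4.
chain = {i: [i + 1] for i in range(4)}
count, seconds = time_traversal(chain, 0, 2)
print(count, f"{seconds:.6f}s")
```

Taking the best of several runs reduces noise from caching and scheduling; a real comparison between graph databases would run the same traversal through each store's API on the same data.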

13 Conclusion
Email Archives
- Valuable source of knowledge
- Hidden social networks owned by an enterprise or individual
- Information extraction and social network analysis can help
Challenges
- Graph-based querying
- New data and approach for information search
- Relation search
Applications
- Recommendation and search in emails
- Population of databases (cold start problem)
- Possibility to extend the social network graph with transaction data, processed document repositories and other business data
- Business intelligence and knowledge management

