1
Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation
Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý
2
Motivation and Approach
Motivation
– To exploit information and knowledge included in email communication
Approach
– Social network extraction
– Extraction of entities such as people, organizations, locations, and contact data
– Forming semantic trees and graphs
– User interaction with graph data
Bratislava, 26 October 2011, GCCP 2011
3
Email Social Networks
Email social networks are less explored
– Several scientific publications: Apache mailing list, Enron, …
– Commercial: Xobni (contacts and attachments)
Benefit
– Web social network sites: owned by third parties
– Email social networks: owned by an organization, individual, or community
– An additional level of interaction and context is present in emails
Information and Knowledge
– People, locations, contacts, products, services, attachments, or links
– Interactions
– Time
– Discovering relations can bring significant benefits
– Spread of activation: a simple way to discover relations
4
Ontea: Information Extraction Tool
Regex patterns
Gazetteers
Results
– Key-value pairs
– Structured into trees and graphs
Transformers, configuration
Automatic loading of extractors
Visual annotation tool
Integration with external tools
– GATE, stemmers, Hadoop, …
Multilingual tests
– English, Slovak, Spanish, Italian
http://ontea.sf.net
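The combination of regex patterns and gazetteers producing key-value pairs can be sketched as follows. This is a minimal illustration in Python and not Ontea's actual API: the function name `extract`, the regular expressions, and the gazetteer entries are assumptions made for the example. Only the key names (Email, TelephoneNumber, Company) are borrowed from the slides.

```python
import re

# Hypothetical patterns; Ontea's real pattern set is not shown in the slides.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "TelephoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical gazetteer: a fixed list of known names per key.
GAZETTEER = {
    "Company": ["ENRON", "UBS Warburg Energy, LLC"],
}

def extract(text):
    """Return (key, value) pairs: regex matches first, then gazetteer hits."""
    pairs = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            pairs.append((key, match.group(0)))
    for key, names in GAZETTEER.items():
        for name in names:
            # Case-sensitive substring lookup, the simplest possible gazetteer match.
            if name in text:
                pairs.append((key, name))
    return pairs

sample = "Contact mike.grigsby@enron.com at UBS Warburg Energy, LLC, phone 713-853-7031."
pairs = extract(sample)
```

Each pair can then become a vertex attached to the sentence or paragraph it was found in, which is how the flat key-value output turns into a tree or graph.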
5
Business objects in Emails
A study of 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
Objects identified:
– Organization: org:Name, org:RegNo, org:TaxNo
– Person: person:Name, person:Function
– Contact: contact:Phone, contact:Email, contact:Webpage
– Address: address:ZIP, address:Street, address:Settlement
– Product: product:Name, product:Module, product:Component, product:BOID
– Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory: inventory:ResID, inventory:ResType
– Other business object ID: BOID
6
Email Social Graph/Network
7
gSemSearch: Graph-based Semantic Search
Email Search Prototype
Use of the social network from email
– Includes extracted objects
– Full text of extracted objects
– Related objects discovered and ordered by spread of activation on the social network graph
– Faceted search, navigation
http://ikt.ui.sav.sk/esns/
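Ordering related objects by spread of activation can be illustrated with a small sketch. The slides do not give the exact algorithm used in gSemSearch, so the version below is an assumption: a generic spreading-activation pass in which each vertex hands a decayed share of its energy equally to its neighbours. The function name, the `steps` and `decay` parameters, and the toy graph are all hypothetical.

```python
from collections import defaultdict

def spread_activation(adjacency, seed, steps=2, decay=0.5):
    """Rank vertices by the activation they accumulate from the seed vertex."""
    activation = defaultdict(float)
    frontier = {seed: 1.0}
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for vertex, energy in frontier.items():
            neighbours = adjacency.get(vertex, [])
            if not neighbours:
                continue
            # Each neighbour receives an equal, decayed share of the energy.
            share = decay * energy / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for vertex, energy in next_frontier.items():
            activation[vertex] += energy
        frontier = next_frontier
    activation.pop(seed, None)  # the seed itself is not a "related" object
    # Highest activation first; ties broken by vertex name for stable output.
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy graph: two messages linking a person to other objects.
graph = {
    "person:A": ["msg:1", "msg:2"],
    "msg:1": ["person:A", "person:B", "org:X"],
    "msg:2": ["person:A", "org:X"],
    "person:B": ["msg:1"],
    "org:X": ["msg:1", "msg:2"],
}
ranked = spread_activation(graph, "person:A")
```

In this toy graph `org:X` ranks above `person:B` because it is reachable from the seed through both messages, which matches the intuition that objects sharing more context with the query are more closely related.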
8
Email Example
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)
1 Vertex: Paragraph=>/6.eml0:1:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml0:1:0)
1 Vertex: Sentence=>/6.eml0:1:0
2 Edge: (Paragraph=>/6.eml0:1:0)=>(Sentence=>/6.eml0:1:0)
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
1 Vertex: Email=>mike.grigsby@enron.com
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>mike.grigsby@enron.com)
1 Vertex: Email=>robert.badeer@enron.com
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>robert.badeer@enron.com)
1 Vertex: Person:Name=>Grigsby, Mike
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Grigsby, Mike)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:GivenName=>Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:GivenName=>Robert)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Paragraph=>/6.eml659:19:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml659:19:0)
1 Vertex: Sentence=>/6.eml659:19:0
2 Edge: (Paragraph=>/6.eml659:19:0)=>(Sentence=>/6.eml659:19:0)
1 Vertex: Person:Name=>Michael D. Grigsby
2 Edge: (Sentence=>/6.eml659:19:0)=>(Person:Name=>Michael D. Grigsby)
1 Vertex: Company=>UBS Warburg Energy, LLC
2 Edge: (Sentence=>/6.eml659:19:0)=>(Company=>UBS Warburg Energy, LLC)
1 Vertex: TelephoneNumber=>713-853-7031
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-853-7031)
1 Vertex: TelephoneNumber=>713-408-6256
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-408-6256)
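The vertex and edge listing on this slide maps directly onto an adjacency structure. The sketch below shows one way to read it, in Python. It is an assumption about the format rather than the authors' loader: the function name `parse_listing` is hypothetical, the leading `1`/`2` marker is skipped because the slide does not explain it, and the parser assumes the separator `)=>(` never occurs inside a vertex label. The sample lines are copied from the slide.

```python
from collections import defaultdict

def parse_listing(lines):
    """Parse 'Vertex:' / 'Edge:' lines into a vertex set and a child map."""
    vertices = set()
    children = defaultdict(list)
    for line in lines:
        # Drop the leading numeric marker, then split off the record kind.
        _, _, rest = line.strip().partition(" ")
        kind, _, body = rest.partition(": ")
        if kind == "Vertex":
            vertices.add(body)
        elif kind == "Edge":
            # Body looks like "(source)=>(target)"; strip the outer parentheses.
            source, _, target = body.partition(")=>(")
            children[source[1:]].append(target[:-1])
    return vertices, children

listing = """\
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)
1 Vertex: Sentence=>/6.eml0:1:0
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
""".splitlines()

vertices, children = parse_listing(listing)
```

Because vertices are stored in a set, a label that appears several times in the listing (for example `Company=>ENRON`) becomes a single shared vertex, which is what connects separate sentences and emails into one graph.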
9
Enron Graph Corpus Statistics
10
Conclusions and Future Directions
11
Future Direction: Relations Discovery in Large Graph Data
Motivation
– Graph/network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.
Approach
– Forming semantic trees and graphs from text, web, communication, databases, and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results
12
SGDB: Simple Graph Database
– Storage for graphs
– Optimized for graph traversal and spread of activation
– Faster than Neo4j for graph traversal operations
– Supports Blueprints API
– https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: possibility to test compliant graph databases
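To make "graph traversal operations" concrete, the following Python sketch times a k-hop breadth-first traversal over an in-memory adjacency map. It is not the benchmark linked above and does not use SGDB, Neo4j, or the Blueprints API (which is a Java interface). The function names and the toy chain graph are hypothetical, and the choice of k-hop neighbourhood as the measured operation is an assumption.

```python
import time
from collections import deque

def k_hop_neighbourhood(adjacency, start, k):
    """Return the set of vertices reachable from start in at most k hops (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen

def time_traversal(adjacency, start, k):
    """Return (number of vertices reached, elapsed seconds) for one traversal."""
    began = time.perf_counter()
    reached = k_hop_neighbourhood(adjacency, start, k)
    return len(reached), time.perf_counter() - began

# Hypothetical toy graph: a directed chain 0 -> 1 -> ... -> 9.
chain = {i: [i + 1] for i in range(9)}
count, seconds = time_traversal(chain, 0, 3)
```

A real comparison would run the same traversal many times from many start vertices against each storage backend and report aggregate timings; a single run on a toy graph only shows the shape of the measurement.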
13
Conclusion
Email Archives
– Valuable source of knowledge
– Hidden social networks owned by an enterprise or individual
– Information extraction and social network analysis can help
Challenges
– Graph-based querying
– New data and approach for information search
– Relation search
Applications
– Recommendation and search in emails
– Population of databases (cold start problem)
– Possibility to extend the social network graph with transaction data, processed document repositories, and other business data
– Business intelligence and knowledge management