Algorithms and Data Structures for Massive Datasets (Acube Lab)
Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito

A³ Lab (acube.di.unipi.it)
Algorithms and data structures for massive datasets
– Data Compression
– Compressed Indexing
  Web or arbitrary texts
  Storage and analysis of massive graphs
– Information Retrieval on news, tweets, …
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent

Social Networks and Social Data
Graph structure + Textual Content
– Nodes: users (~ 1 bil)
– Explicit edges: friend, follower, retweet, +1, … (~ 10 bil)
– Implicit edges: similarity, co-occurrence, click, … (≫ 100 bil)
Given an idea, you need the right platform to implement it:
– HW + SW (IT Center)
– Algorithms (our Lab)

NoSQL
HyperTable, Cassandra, Hadoop 2006, Cosmos

Storage and access to Labeled Graphs
– Compress the graph structure
– Compress the node and edge labels
– Guarantee fast access, dynamicity and search

Data Compression: Theory & Engineering
Key issue:
– Minimize space occupancy
– Maximize decompression speed
[Table: compressors on DBLP. Columns: Compressed space (MB), Decompression time (secs). Rows: Gzip, bzip, Snappy, LZ, Our result. Surviving values: 130, 1.9]
Publications: J. ACM '05, ACM-SIAM SODA '09-'14, ACM WSDM '10, ESA '11-'14, Algorithmica '12, SIAM J. Computing '13
Two interesting scenarios:
– Energy-efficiency issues
– Cloud computing
A new algorithmic concept: Multi-objective design of compressors
Can we fix the space occupancy and minimize the decompression time? Or, vice versa?

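The two questions above can be illustrated with a minimal sketch. It is not the Lab's method: it only selects among off-the-shelf compressors from the Python standard library, treating each one as a (space, time) point. The function names `pick_fastest_within_budget`, `pick_smallest_within_time` and `measure` are hypothetical, introduced here for illustration.

```python
import bz2
import lzma
import time
import zlib


def pick_fastest_within_budget(candidates, space_budget):
    """Fix the space, minimize the time.

    candidates: list of (name, compressed_bytes, decompress_seconds).
    Returns the candidate with the smallest decompression time among
    those fitting in space_budget, or None if none fits.
    """
    feasible = [c for c in candidates if c[1] <= space_budget]
    return min(feasible, key=lambda c: c[2]) if feasible else None


def pick_smallest_within_time(candidates, time_budget):
    """Vice versa: fix the decompression time, minimize the space."""
    feasible = [c for c in candidates if c[2] <= time_budget]
    return min(feasible, key=lambda c: c[1]) if feasible else None


def measure(data):
    """Measure (name, compressed size, decompression time) for three
    standard-library codecs. Timings vary from run to run."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = []
    for name, (compress, decompress) in codecs.items():
        blob = compress(data)
        start = time.perf_counter()
        decompress(blob)
        results.append((name, len(blob), time.perf_counter() - start))
    return results
```

Usage: `pick_fastest_within_budget(measure(data), space_budget)` returns the fastest-to-decompress codec whose output fits in `space_budget` bytes. A multi-objective compressor would explore this trade-off inside a single parsing rather than across separate tools.
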
Compressed Indexing: Theory & Engineering
Key issue:
– Minimize space occupancy
– Maximize substring-search throughput
December 2003
Suffix-array compressible ↔ Bzip searchable
Performance over hundreds of MBs on a commodity PC:
– Count(P) takes 5 microsecs/char, taking about bzip's space
– Locate(P) outputs 100K occ/sec, taking +10% space
This may be 4x faster than IL, within <35% space occupancy
Publications: J. ACM '05, ACM SIGIR '07, J. ACM '09, ACM Trans. Algo. '10, ESA '13, ACM-SIAM SODA '13, … and many others

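To make the link between the suffix array and bzip's Burrows-Wheeler transform concrete, here is a minimal, uncompressed sketch of Count(P) and Locate(P) via textbook backward search. It assumes the text does not contain the "\0" sentinel, builds the suffix array naively, and stores plain occurrence tables, so it shows only the query logic and none of the space savings quoted above. The helper names are hypothetical.

```python
def build_index(text):
    """Build suffix array, C table and occurrence table for text + sentinel."""
    text += "\0"  # sentinel, assumed absent from the text and smallest
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = [text[i - 1] for i in sa]  # char preceding each sorted suffix

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, index):
    """Return the half-open suffix-array range of suffixes prefixed by pattern."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    lo, hi = backward_search(pattern, index)
    return hi - lo


def locate(pattern, index):
    lo, hi = backward_search(pattern, index)
    return sorted(index[0][lo:hi])
```

Usage: with `index = build_index("abracadabra")`, `count("abra", index)` gives 2 and `locate("abra", index)` gives the starting positions 0 and 7. Count needs one step per pattern character, which is why its cost is naturally quoted per character.
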
Compressed Indexing: Theory & Engineering
The problem:
– Trie: 14x more space than input data.
– Front-coding & two-level indexing: 110% of input data, 4 microsecs/char
– Our Compressed Permuterm: < 25% of input data, i.e. close to bzip2, 10 60 microsecs/char
So, time close to FC but one-fourth of its space
Under Y! patenting
NoSQL DB

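For readers unfamiliar with the permuterm idea, the following is a minimal sketch of the classical, uncompressed permuterm index for single-wildcard dictionary queries. It is not the compressed variant the slide refers to, and it assumes that terms do not contain "$" and that each query has exactly one "*". The function names are hypothetical.

```python
from bisect import bisect_left


def build_permuterm(terms):
    """Store every rotation of term + '$', each pointing back to its term."""
    rotations = []
    for term in terms:
        marked = term + "$"  # '$' marks the end of the term
        for i in range(len(marked)):
            rotations.append((marked[i:] + marked[:i], term))
    rotations.sort()
    return rotations


def wildcard_search(rotations, query):
    """Answer a query of the form 'prefix*suffix' (exactly one '*').

    Rotating the query to 'suffix$prefix' turns it into a prefix search
    over the sorted rotations.
    """
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix
    start = bisect_left(rotations, (key, ""))
    matches = set()
    for rotation, term in rotations[start:]:
        if not rotation.startswith(key):
            break  # rotations sharing the key as prefix are contiguous
        matches.add(term)
    return sorted(matches)
```

Usage: over the terms "mexico", "maradona", "mention" and "won", the query "m*o" returns only "mexico" and "*on" returns "mention" and "won". Storing all rotations explicitly multiplies the dictionary size, which is the space problem a compressed permuterm addresses.
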
We know how to “manage” everything…

Information Retrieval
“Diego Maradona won against Mexico”
Dictionary: against, Diego, Maradona, Mexico, won
TF-IDF vector
Vector Space model: Similarity(v, w) ≈ cos(α), where α is the angle between the vectors v and w in the space of terms t1, t2, t3

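The vector space model above can be sketched in a few lines. This is a minimal illustration under simple assumptions: whitespace tokenization, lowercasing, raw term frequency, and idf = log(N / df). The function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(v, w):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)
```

Usage: "Diego Maradona won against Mexico" and "Mexico won against Diego Maradona" get cosine similarity 1 because the model ignores word order, while a document sharing no term with them gets 0.
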
Topic Annotators
“Diego Maradona won against Mexico”
Detect mentions and annotate them with entity/topic extracted from a catalog (Wikipedia!)
– Diego Maradona: The soccer player
– Mexico: Mexico soccer team
we serve about 170k requests/day

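The mention-detection step can be illustrated with a toy sketch. It is not the annotator served by the Lab: it does greedy longest-match lookup against a tiny hand-made catalog and performs no disambiguation. `CATALOG` and `annotate` are hypothetical names, and the catalog entries are taken from the slide's example.

```python
# Hypothetical toy catalog: token sequence of a mention -> entity label
CATALOG = {
    ("diego", "maradona"): "The soccer player",
    ("mexico",): "Mexico soccer team",
}


def annotate(text, catalog, max_len=3):
    """Greedy longest-match spotting of catalog mentions in text.

    Returns a list of (mention, entity) pairs in reading order.
    """
    tokens = text.lower().split()
    annotations = []
    i = 0
    while i < len(tokens):
        # try the longest candidate span first
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + length])
            if span in catalog:
                annotations.append((" ".join(span), catalog[span]))
                i += length
                break
        else:
            i += 1  # no mention starts at this token
    return annotations
```

Usage: `annotate("Diego Maradona won against Mexico", CATALOG)` yields the two mentions with their catalog entries. A real annotator must also choose among several candidate entities for the same mention, which this sketch skips.
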
A new scenario
“obama asks iran for RQ-170 sentinel drone back”
“us president issues Ahmadinejad ultimatum”
Annotations: Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel, President of the United States, Mahmoud Ahmadinejad, Ultimatum

The literature
Paper at WWW 2013, we serve about 170k requests/day
Many commercial tools: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.

Details on...
– Paper at ACM WSDM 2012
– Paper at ECIR 2012
– Paper at IEEE Software 2012