Published by Amy Brook Baldwin. Modified over 9 years ago.
1
Algorithms and Data Structures for Massive Datasets (Acube Lab)
Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito
2
A³ Lab (acube.di.unipi.it)
Algorithms and data structures for massive datasets
– Data Compression
– Compressed Indexing: Web or arbitrary texts; storage and analysis of massive graphs
– Information Retrieval on news, tweets, …
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent
3
Social Networks and Social Data
Graph structure + Textual Content
– Nodes = users (~ 1 bil)
– Explicit edges = friend, follower, retweet, +1, … (~ 10 bil)
– Implicit edges = similarity, co-occurrence, click, … (≫ 100 bil)
Given an idea, you need the right platform to implement it:
– HW + SW (IT Center)
– Algorithms (our Lab)
4
NoSQL: HyperTable, Cassandra, Hadoop 2006, Cosmos
5
Storage and access to Labeled Graphs
– Compress the graph structure
– Compress the node and edge labels
– Guarantee fast access, dynamicity and search
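To make "compress the graph structure" concrete, here is a minimal sketch of one classic technique: storing each sorted adjacency list as gaps between consecutive node ids, written as variable-length bytes. The slide does not say which encoding the lab uses, so the technique and all function names below are assumptions chosen for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_neighbors(neighbors) -> bytes:
    """Gap-encode a list of neighbor ids (hypothetical helper)."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        out += encode_varint(v - prev)  # small gaps need fewer bytes
        prev = v
    return bytes(out)


def decompress_neighbors(data: bytes) -> list:
    """Invert compress_neighbors: rebuild ids by summing the gaps."""
    neighbors, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            neighbors.append(prev)
            value, shift = 0, 0
    return neighbors


adjacency = [1000, 1003, 1010, 5000]
packed = compress_neighbors(adjacency)
restored = decompress_neighbors(packed)
```

Decoding is a single left-to-right pass, which is why this kind of scheme is often used when fast access matters as much as space; supporting updates and search would need extra structure on top.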
6
Data Compression: Theory & Engineering
Key issue:
- Minimize space occupancy
- Maximize decompression speed

Compressor on DBLP   Compressed space (MB)   Decompression time (secs)
Gzip                 191                     11.6
bzip2                121                     49
Snappy               323                     2.1
LZ4                  215                     1.9
Our result           130                     2.9
                     149                     1.9

Publications: J. ACM ’05, ACM-SIAM SODA ’09-’14, ACM WSDM ’10, ESA ’11-’14, Algorithmica ’12, SIAM J. Computing ’13
Two interesting scenarios:
- Energy-efficiency issues
- Cloud computing
A new algorithmic concept: Multi-objective design of compressors
Can we fix the space occupancy and minimize the decompression time? Or, vice versa?
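The two closing questions can be read as a constrained choice: fix a budget on one resource and optimize the other. The sketch below applies that reading to the DBLP figures as read from the table on this slide. It only selects among finished measurements and does not reproduce how the lab's compressor is designed; the type, the function names and the labels "ours (130 MB)" and "ours (149 MB)" are assumptions.

```python
from typing import List, NamedTuple, Optional


class Compressor(NamedTuple):
    name: str
    space_mb: float
    decompress_secs: float


def fastest_within_space(cands: List[Compressor], space_budget_mb: float) -> Optional[Compressor]:
    """Fix the space occupancy, minimize the decompression time."""
    feasible = [c for c in cands if c.space_mb <= space_budget_mb]
    return min(feasible, key=lambda c: c.decompress_secs, default=None)


def smallest_within_time(cands: List[Compressor], time_budget_secs: float) -> Optional[Compressor]:
    """Vice versa: fix the decompression time, minimize the space."""
    feasible = [c for c in cands if c.decompress_secs <= time_budget_secs]
    return min(feasible, key=lambda c: c.space_mb, default=None)


# Figures as read from the slide's table (space in MB, time in secs).
DBLP = [
    Compressor("gzip", 191, 11.6),
    Compressor("bzip2", 121, 49),
    Compressor("Snappy", 323, 2.1),
    Compressor("LZ4", 215, 1.9),
    Compressor("ours (130 MB)", 130, 2.9),
    Compressor("ours (149 MB)", 149, 1.9),
]

best_for_150_mb = fastest_within_space(DBLP, 150)
best_for_3_secs = smallest_within_time(DBLP, 3.0)
```

With a 150 MB budget the choice falls on the 149 MB configuration, and with a 3-second budget on the 130 MB one, which matches the two "Our result" rows being two points on the same space/time trade-off.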
7
Compressed Indexing: Theory & Engineering
Key issue:
– Minimize space occupancy
– Maximize substring-search throughput
December 2003: Suffix-array compressible ↔ Bzip searchable
Performance over hundreds of MBs on a commodity PC:
– Count(P) takes 5 microsecs/char, using about bzip’s space
– Locate(P) outputs 100K occ/sec, using +10% space
This may be 4x faster than IL, within <35% space occupancy
Publications: J. ACM ’05, ACM SIGIR ’07, J. ACM ’09, ACM Trans. Algo. ’10, ESA ’13, ACM-SIAM SODA ’13, … and many others
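For readers unfamiliar with Count(P) and Locate(P), the following is a minimal, uncompressed sketch of backward search over the Burrows-Wheeler transform, the textbook mechanism behind this family of indexes. It is an illustration under assumptions: the tables here are plain Python lists and dicts rather than compressed structures, the suffix array is built naively, and every name is hypothetical.

```python
def build_index(text: str):
    """Build BWT-based tables for `text` (naive construction, small inputs only)."""
    s = text + "\0"  # sentinel, assumed smaller than every text character
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # suffix array
    last = "".join(s[i - 1] for i in sa)             # BWT column
    alphabet = sorted(set(last))
    # first[c] = number of characters in s strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += last.count(c)
    # occ[c][i] = occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in alphabet}
    for i, ch in enumerate(last):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def search_range(pattern: str, index):
    """Return the half-open suffix-array range of rows prefixed by `pattern`."""
    sa, first, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in first:
            return 0, 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern: str, index) -> int:
    lo, hi = search_range(pattern, index)
    return hi - lo


def locate(pattern: str, index) -> list:
    lo, hi = search_range(pattern, index)
    return sorted(index[0][lo:hi])


idx = build_index("mississippi")
```

Counting touches the tables once per pattern character, independent of the text length, which is consistent with the slide quoting Count(P) in microseconds per character; Locate(P) additionally has to map rows back to text positions, hence the separate occurrences-per-second figure.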
8
Compressed Indexing: Theory & Engineering
Trie: 14x more space than input data.
Front-coding & two-level indexing: 110% of input data, 4 microsecs/char
Our Compressed Permuterm: < 25% of input data, i.e. close to bzip2, 10 60 microsecs/char
So, time close to FC but one-fourth of its space
The problem:
Under Y! patenting
NoSQL DB
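Front coding (FC), the baseline compared against here, can be sketched in a few lines: in a sorted dictionary, each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. This is a minimal sketch of the plain technique only; the two-level indexing mentioned on the slide is not shown, and the function names are assumptions.

```python
def front_encode(sorted_words):
    """Encode a sorted word list as (shared-prefix length, suffix) pairs."""
    encoded, prev = [], ""
    for word in sorted_words:
        lcp = 0
        limit = min(len(prev), len(word))
        while lcp < limit and prev[lcp] == word[lcp]:
            lcp += 1
        encoded.append((lcp, word[lcp:]))
        prev = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list by replaying each pair against the previous word."""
    words, prev = [], ""
    for lcp, suffix in encoded:
        word = prev[:lcp] + suffix
        words.append(word)
        prev = word
    return words


words = ["algebra", "algorithm", "algorithms", "alpha"]
coded = front_encode(words)
```

Decoding a word requires replaying its predecessors, which is why practical layouts restart the coding at regular intervals and keep a small index over the restart points.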
9
We know how to “manage” everything…
10
Information Retrieval: a Vector Space model
“Diego Maradona won against Mexico”
Dictionary: against, Diego, Maradona, Mexico, won
TF-IDF vector: 2.2 5.1 9.1 1.0 0.1
Similarity(v,w) ≈ cos(α)
[Figure: vectors v and w in the term space with axes t1, t2, t3; α is the angle between them]
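The similarity on this slide is the cosine of the angle between two TF-IDF vectors v and w. A minimal sketch follows; the first vector reuses the five weights shown on the slide, while the second vector and the function name are made up for illustration.

```python
import math


def cosine_similarity(v, w) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_v * norm_w)


v = [2.2, 5.1, 9.1, 1.0, 0.1]  # TF-IDF weights from the slide
w = [0.0, 5.1, 9.1, 0.0, 0.0]  # hypothetical second document
similarity = cosine_similarity(v, w)
```

Because the measure depends only on the angle, scaling a vector (for example, repeating a document twice) leaves the similarity unchanged.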
11
Topic Annotators
“Diego Maradona won against Mexico”
Detect mentions and annotate them with entity/topic extracted from a catalog (Wikipedia!)
– Diego Maradona: the soccer player
– Mexico: Mexico soccer team
We serve about 170k requests/day
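To show the input/output shape of a topic annotator, here is a toy sketch that looks up mentions in a small catalog using greedy longest match. It is not the lab's annotator: it performs no disambiguation, and the catalog contents, the function name and the `max_len` parameter are all hypothetical.

```python
def annotate(text: str, catalog: dict, max_len: int = 3):
    """Return (mention, entity) pairs found by greedy longest-match lookup."""
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate mention first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                matched = n
                break
        i += matched if matched else 1
    return annotations


# Toy catalog (assumption): lowercase surface form -> entity label.
toy_catalog = {
    "diego maradona": "Diego Maradona (the soccer player)",
    "mexico": "Mexico soccer team",
}

result = annotate("Diego Maradona won against Mexico", toy_catalog)
```

A real annotator has to decide that "Mexico" means the team rather than the country from the surrounding context, which a fixed lookup table like this one cannot do.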
12
A new scenario
“obama asks iran for RQ-170 sentinel drone back”
“us president issues Ahmadinejad ultimatum”
Annotated entities: Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel, President of the United States, Mahmoud Ahmadinejad, Ultimatum
13
The literature
Paper at WWW 2013; we serve about 170k requests/day
Many commercial software products: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.
14
Paper at ACM WSDM 2012
Paper at ECIR 2012
Paper at IEEE Software 2012
Details on... http://acube.di.unipi.it/tagme