Algorithms and Data Structures for Massive Datasets (Acube Lab) Rossano Venturini Dipartimento di Informatica Università di Pisa Paolo Ferragina Giuseppe.

Presentation transcript:

Algorithms and Data Structures for Massive Datasets (Acube Lab)
Dipartimento di Informatica, Università di Pisa
Rossano Venturini, Paolo Ferragina, Giuseppe Prencipe, Marco Cornolti, Andrea Farruggia, Giovanni Micale, Francesco Piccinno, Giorgio Audrito

2 A³ Lab (acube.di.unipi.it)
Algorithms and data structures for massive datasets
– Data Compression
– Compressed Indexing
   Web or arbitrary texts
   Storage and analysis of massive graphs
– Information Retrieval on news, tweets, …
Submitted US patents: 3 with Yahoo, 1 with NYU
Accepted US patents: 1 with U. Rutgers, 1 with AT&T-Lucent

3 Social Networks and Social Data
Graph structure + Textual Content
Nodes = users (~ 1 bil)
Edges explicit = friend, follower, retweet, +1, … (~ 10 bil)
Edges implicit = similarity, co-occurrence, click, … (>> 100 bil)
Given an idea, you need the right platform to implement it:
– HW + SW (IT Center)
– Algorithms (our Lab)

4 No SQL HyperTable Cassandra Hadoop 2006 Cosmos

5 Storage and access to Labeled Graphs
– Compress the graph structure
– Compress the node and edge labels
– Guarantee fast access, dynamicity and search

Data Compression: Theory & Engineering
Key issue: minimize space occupancy, maximize decompression speed
Compressors on DBLP, compared by compressed space (MB) and decompression time (secs): Gzip, bzip, Snappy, LZ, Our result (table values not preserved, except 130 and 1.9)
Publications: J. ACM ’05, ACM-SIAM SODA ’09-’14, ACM WSDM ’10, ESA ’11-’14, Algorithmica ’12, SIAM J. Computing ’13
Two interesting scenarios:
– Energy-efficiency issues
– Cloud computing
A new algorithmic concept: multi-objective design of compressors
Can we fix the space occupancy and minimize the decompression time? Or vice versa?
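To make the multi-objective question concrete, here is a minimal sketch, not the Lab's method: it only picks, among three standard-library codecs, the one with the fastest measured decompression whose output fits a given space budget. The name `pick_codec` and the codec table are assumptions made for illustration; the slides do not say how the actual compressor is built.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical codec table for illustration: name -> (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def pick_codec(data: bytes, space_budget: int):
    """Fix the space occupancy, minimize the decompression time.

    Returns (codec_name, compressed_bytes) for the codec with the fastest
    measured decompression among those whose output fits in space_budget
    bytes, or None if no codec fits the budget.
    """
    best = None
    for name, (compress, decompress) in CODECS.items():
        blob = compress(data)
        if len(blob) > space_budget:
            continue  # violates the space constraint
        start = time.perf_counter()
        restored = decompress(blob)
        elapsed = time.perf_counter() - start
        assert restored == data  # sanity check: lossless round trip
        if best is None or elapsed < best[0]:
            best = (elapsed, name, blob)
    return None if best is None else (best[1], best[2])
```

Swapping the roles of the two quantities (a time budget, minimal space) gives the "vice versa" variant; the timing here is a single wall-clock measurement, so repeated runs may pick different codecs.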

Compressed Indexing: Theory & Engineering
Key issue: minimize space occupancy, maximize substring-search throughput
December 2003: Suffix-array compressible «-» Bzip searchable
Performance over hundreds of MBs and commodity PC:
– Count(P) takes 5 microsecs/char, taking about bzip’s space
– Locate(P) outputs 100K occ/sec, taking +10% space
This may be 4x faster than IL, within <35% space occupancy
Publications: J. ACM ’05, ACM SIGIR ’07, J. ACM ’09, ACM Trans. Algo. ’10, ESA ’13, ACM-SIAM SODA ’13, … and many others
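The two queries on this slide can be illustrated with a toy, uncompressed index. The sketch below assumes that Count(P) and Locate(P) are answered by backward search over the Burrows-Wheeler transform, the textbook way to make a bzip-style transform searchable; the slides do not confirm the exact data structure, and the function names are hypothetical. It stores plain tables, so it shows the query logic only and none of the space savings.

```python
def build_index(text):
    """Naive BWT-based index. Assumes the character '\0' is not in text."""
    s = text + "\0"  # unique, smallest end-of-text sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # suffix array
    bwt = "".join(s[i - 1] for i in sa)  # char preceding each sorted suffix
    # C[c] = number of characters in s strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(s) + 1) for c in C}
    for i, ch in enumerate(bwt):
        for c in C:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def sa_range(index, pattern):
    """Half-open suffix-array range of suffixes prefixed by pattern."""
    sa, C, occ = index
    lo, hi = 0, len(sa)
    for c in reversed(pattern):  # backward search: last char first
        if c not in C:
            return 0, 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(index, pattern):
    """Count(P): number of occurrences of pattern in the text."""
    lo, hi = sa_range(index, pattern)
    return hi - lo


def locate(index, pattern):
    """Locate(P): sorted starting positions of pattern in the text."""
    lo, hi = sa_range(index, pattern)
    return sorted(index[0][lo:hi])
```

For example, on the text `abracadabra`, `count(index, "abra")` is 2 and `locate(index, "abra")` is `[0, 7]`. A compressed index would replace the explicit `sa` and `occ` tables with succinct structures.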

Compressed Indexing: Theory & Engineering
– Trie: 14x more space than input data
– Front-coding & two-level indexing:
   110% of input data
   4 microsecs/char
– Our Compressed Permuterm:
   < 25% of input data, i.e. close to bzip2
   10-60 microsecs/char
So, time close to FC but one-fourth of its space
The problem
Under Y! patenting
NoSQL DB
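As a reminder of what the front-coding (FC) baseline does, here is a minimal sketch under the usual textbook definition: each string in a sorted dictionary is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names are hypothetical, and the two-level indexing mentioned on the slide is not shown.

```python
from os.path import commonprefix  # character-wise common prefix of strings


def front_encode(sorted_words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs."""
    pairs, prev = [], ""
    for word in sorted_words:
        lcp = len(commonprefix([prev, word]))
        pairs.append((lcp, word[lcp:]))
        prev = word
    return pairs


def front_decode(pairs):
    """Rebuild the word list by replaying the pairs in order."""
    words, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix
        words.append(prev)
    return words
```

For instance, `front_encode(["car", "card", "care", "cat"])` yields `[(0, "car"), (3, "d"), (3, "e"), (2, "t")]`. Decoding is sequential, which is why practical layouts restart the encoding at regular intervals.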

9 We know how to “manage” everything…

Information Retrieval: a Vector Space model
“Diego Maradona won against Mexico”
Dictionary: against, Diego, Maradona, Mexico, won
TF-IDF vector
Similarity(v,w) ≈ cos(angle between v and w)
[Figure: vectors v and w plotted against term axes t1, t2, t3]
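The vector space model on this slide can be sketched in a few lines. The code below is an illustration under stated assumptions: whitespace tokenization, lowercasing, and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, which is one common variant and not necessarily the weighting used in the course. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: freq * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, freq in tf.items()
        })
    return vectors


def cosine(v, w):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * w.get(term, 0.0) for term, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)
```

Documents that share no term get similarity 0 regardless of meaning, which is the limitation that motivates the topic annotators on the next slides.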

Topic Annotators
“Diego Maradona won against Mexico”
Detect mentions and annotate them with entity/topic extracted from a catalog (Wikipedia!)
– Diego Maradona: the soccer player
– Mexico: Mexico soccer team
We serve about 170k requests/day
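The "detect mentions" step can be sketched as a greedy longest-match lookup against a catalog of surface forms. Everything named here is hypothetical: the toy `CATALOG`, its entity labels, and the `annotate` function. The sketch also skips disambiguation entirely, so it maps each surface form to one fixed entity, whereas a real annotator must choose among candidates (Mexico the country versus the soccer team) using context.

```python
# Hypothetical toy catalog: lowercase surface form -> entity label.
CATALOG = {
    "diego maradona": "Diego Maradona (soccer player)",
    "mexico": "Mexico soccer team",
}


def annotate(text, catalog=CATALOG, max_len=3):
    """Greedy longest-match mention detection over whitespace tokens.

    Returns a list of (mention, entity) pairs in text order. max_len is
    the longest mention, in tokens, that is looked up in the catalog.
    """
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            entity = catalog.get(mention.lower())
            if entity is not None:
                annotations.append((mention, entity))
                i += n
                break
        else:
            i += 1  # no catalog entry starts at this token
    return annotations
```

On the slide's sentence this returns the two mentions "Diego Maradona" and "Mexico" with their catalog entities, and ignores "won" and "against".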

A new scenario
“obama asks iran for RQ-170 sentinel drone back”
“us president issues Ahmadinejad ultimatum”
Annotated entities: Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel, President of the United States, Mahmoud Ahmadinejad, Ultimatum

13 The literature
Paper at WWW 2013; we serve about 170k requests/day
Many commercial systems: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, SemiTags, TextRazor, Wikimeta, Yahoo! Content Analysis, Zemanta.

14 Details on...
– Paper at ACM WSDM 2012
– Paper at ECIR 2012
– Paper at IEEE Software 2012