The course 2014 - 2015. Project #1: Dictionary search with 1 error The problem consists of building a data structure to index a large dictionary of strings.

Slides:



Advertisements
Similar presentations
Paolo Ferragina, Università di Pisa XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio,
Advertisements

File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Aggregating local image descriptors into compact codes
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
Greedy Algorithms Amihood Amir Bar-Ilan University.
Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Fast Compressed Tries through Path Decompositions Roberto Grossi Giuseppe Ottaviano* Università di Pisa * Part of the work done while at Microsoft Research.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Inverted Index Hongning Wang
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes Meng He, J. Ian Munro, and S. Srinivasa Rao University of Waterloo.
Tries Standard Tries Compressed Tries Suffix Tries.
Speaker: Sattam Alsubaiee Supporting Location-Based Approximate-Keyword Queries Sattam Alsubaiee, Alexander Behm, and Chen Li University of California,
Packing bag-of-features ICCV 2009 Herv´e J´egou Matthijs Douze Cordelia Schmid INRIA.
Welcome CS 315 Data Structures. Instructor: B. Ravikumar Office: 116 I Darwin Hall Phone: Course Web site:
1 Compressed Index for Dictionary Matching WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
CS Lecture 9 Storeing and Querying Large Web Graphs.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
CS728 Lecture 16 Web indexes II. Last Time Indexes for answering text queries –given term produce all URLs containing –Compact representations for postings.
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Anatomy of a Large-Scale Hypertextual Web Search Engine ECE 7995: Term.
Chair of Software Engineering Einführung in die Programmierung Introduction to Programming Prof. Dr. Bertrand Meyer Exercise Session 10.
Automatic Evaluation Of Search Engines Project Presentation Team members: Levin Boris Laserson Itamar Instructor Name: Gurevich Maxim.
Spring 2004 ECE569 Lecture ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University.
Data Structures and Programming.  John Edgar2.
To quantitatively test the quality of the spell checker, the program was executed on predefined “test beds” of words for numerous trials, ranging from.
Algorithms and Data Structures for Massive Datasets (Acube Lab) Rossano Venturini Dipartimento di Informatica Università di Pisa Paolo Ferragina Giuseppe.
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa.
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Data Compression By, Keerthi Gundapaneni. Introduction Data Compression is an very effective means to save storage space and network bandwidth. A large.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 3: tolerant retrieval.
 CIKM  Implementation of Smoothing techniques on the GPU  Re running experiments using the wt2g collection  The Future.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
Random access to arrays of variable-length items
Szymon Grabowski, Marcin Raniszewski Institute of Applied Computer Science, Lodz University of Technology, Poland The Prague Stringology Conference, 1-3.
David Evans CS150: Computer Science University of Virginia Computer Science Class 38: Googling.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
Clusterpoint Margarita Sudņika ms RDBMS & NoSQL Databases & tables → Document stores Columns, rows → Schemaless documents Scales UP → Scales UP.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper.
Improving Search for Emerging Applications * Some techniques current being licensed to Bimaple Chen Li UC Irvine.
Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56,
CS 315 Data Structures Fall 2011 Instructor: B. (Ravi) Ravikumar Office: 116 I Darwin Hall Phone: Course Web site:
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
CS 315 Data Structures Spring 14 Instructors (lecture) Bala Ravikumar Office: 116 I Darwin Hall Phone: piazza, (Lab.
A simple storage scheme for strings achieving entropy bounds Paolo Ferragina and Rossano Venturini Dipartimento di Informatica University of Pisa, Italy.
Index construction: Compression of postings
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Locality-sensitive hashing and its applications
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Tries 07/28/16 11:04 Text Compression
Text Indexing and Search
Succinct Data Structures
Succinct Data Structures
Locality-sensitive hashing and its applications
Implementation Issues & IR Systems
Index Construction: sorting
Searching Similar Segments over Textual Event Sequences
Query processing: phrase queries and positional indexes
Minwise Hashing and Efficient Search
Presentation transcript:

The course

Project #1: Dictionary search with 1 error The problem consists of building a data structure to index a large dictionary of strings of variable length that supports efficient search with edit-distance at most one. This problem has important applications in correcting misspelled keywords in search engine queries. The project consists of designing and implementing a solution based on one/more of libraries below in which the strings to be indexed are the queries in dataset. Then, the goal is to execute a serie of queries sampled from the provided dataset and possibly changed in one character (per query) in order to simulate misspells. The students have to profile and optimize their code and provide measurements of its time/space efficiency (suggested by Yahoo! Research, Barcelona)

Datasets MSN Query Logs (2006 and 2007). [14M and 100M queries]. Software and Libraries  Compressed Permuterm Index  Succinct Data Structure Library  Ternary Search Tree  Minimal Perfect Hashing References D. Belazzougui, R. Venturini: Compressed String Dictionary Look-Up with Edit Distance One. CPM , I. Chegrane, D. Belazzougui: Simple, compact and robust approximate string dictionary. CoRR abs/ , Project #1: Dictionary search with 1 error (suggested by Yahoo! Research, Barcelona)

Project #2: Compressing a large web collection (suggested by Tiscali, Pisa) The problem consists of grouping Web pages and then compressing each group individually via Lzma2 compressor. The goal is to find the grouping strategy that minimizes the achieved compression. This problem has important applications in storage of text collections in search engines. The project consists of designing and implementing an efficient clustering solution which can scale to Million of pages. Then, the goal is to test the proposed solution on the provided dataset by measuring the compressed space with Lzma2, and the overall time to compress the whole dataset. The students have to profile and optimize their code and provide measurements of its time/space efficiency. As a baseline for comparison the students can compare their solution with the one that sort Web pages by their URLs.

Project #2: Compressing a large web collection (suggested by Tiscali, Pisa) Datasets Gov2 crawl - 25M Web pages. 81 Gb gzipped. Software and Libraries  Lzma2 compressor -  Sorting large datasets: Unix sort or STXXL (  Library for LSH implementation and shingling  Weka - implements k-means and k-means++ References …few… P. Ferragina, G. Manzini: On compressing the textual web. ACM WSDM, , 2010.

10 Lectures Lez 1. Trie + Ternary Search Tree - Rossano Lez 2. Rank/Select + Succinct tree encodings - Rossano Lez 3. FMI + SA (no construction) - Paolo Lez 4. q-grammi + (compr) permuterm + (min-)perfect hash - Paolo Lez 5. Discussion on the project #1 – Rossano Lez 6. Sorting and Permuting - Paolo Lez 7. Shingling, Min-Hash and LSH - Paolo Lez 8. Clustering - Paolo Lez. 9. Compressors LZ77 and Lzma - Rossano Lez 10. Discussion on the project #2 - Paolo

The exam It consists of three parts: Project 70% Written/oral test 20% Project presentation 10%

Page of the course