Indexing The World Wide Web: The Journey So Far Abhishek Das, Ankit Jain 2011 Paper Presentation : Abhishek Rangnekar 1.

Presentation transcript:


Better Understanding of User Intent

Laying Out the Index
To handle the load of a modern search engine, a combination of distribution and replication techniques is required.
– Distribution: the document collection and its index are split across multiple machines, so the answer to a query as a whole must be synthesized from the various collection components.
– Replication (or mirroring): enough identical copies of the system are made that the required query load can be handled even during single or multiple machine failures.

Document vs. Term Based Partitioning
– Document based: Partition the collection and allocate one sub-collection to each processor. A local index is built for each sub-collection; when queries arrive, they are passed to every sub-collection and evaluated against every local index. The sets of sub-collection answers are then combined to provide an overall set of answers.
– Term based: The index is split into components by partitioning the dictionary. Each processor has full information about only a subset of the terms, so only the relevant subset of processors needs to respond to a query.
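The difference between the two layouts can be sketched in a few lines of Python. This is a toy, single-process illustration, not the paper's implementation: the sample documents, the function names, and the routing rule in `shard_for_term` are all assumptions made for the example, and only single-term queries are handled.

```python
from collections import defaultdict

# Toy collection; document ids and texts are made up for illustration.
DOCS = {
    1: "the quick fox",
    2: "the lazy dog",
    3: "quick brown dog",
    4: "lazy fox",
}
NUM_PROCESSORS = 2


def build_index(docs):
    """Build a term -> sorted list of document ids index."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].split())):
            index[term].append(doc_id)
    return dict(index)


def shard_for_term(term, n):
    # Hypothetical routing rule: character-code sum modulo n.
    # It is deterministic across runs, unlike Python's built-in hash().
    return sum(map(ord, term)) % n


def document_partition(docs, n):
    """Each processor builds a local index over its own sub-collection."""
    shards = [{} for _ in range(n)]
    for doc_id, text in docs.items():
        shards[doc_id % n][doc_id] = text
    return [build_index(shard) for shard in shards]


def term_partition(docs, n):
    """Each processor holds the full posting lists for a subset of terms."""
    shards = [{} for _ in range(n)]
    for term, postings in build_index(docs).items():
        shards[shard_for_term(term, n)][term] = postings
    return shards


def query_document_partitioned(shards, term):
    """Broadcast to every shard and merge the partial answers."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits), len(shards)  # (answers, processors contacted)


def query_term_partitioned(shards, term):
    """Contact only the processor that owns the term."""
    shard = shards[shard_for_term(term, len(shards))]
    return shard.get(term, []), 1  # (answers, processors contacted)
```

Both layouts return the same documents for a term; what changes is how many processors take part in answering, which is the trade-off the slide describes.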


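Compressing the Index
– Faster query processing
– Optimized caching
– Using gaps between document identifiers to reduce space
Original posting lists (the first number of each pair is the document identifier):
  the: (1, 9) (2, 8) (3, 8) (4, 5) (5, 6) (6, 9)
  to: (1, 5) (3, 1) (4, 2) (5, 2) (6, 6)
  john: (2, 4) (4, 1) (6, 4)
With gaps (each document identifier is replaced by its difference from the previous one):
  the: (1, 9) (1, 8) (1, 8) (1, 5) (1, 6) (1, 9)
  to: (1, 5) (2, 1) (1, 2) (1, 2) (1, 6)
  john: (2, 4) (2, 1) (2, 4)
– Variable byte encoding of the gaps for faster querying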
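As a minimal sketch of the gap transformation, assuming each posting is a (document identifier, second value) pair sorted by identifier, the conversion and its inverse might look like this. The function names are illustrative only.

```python
def to_gaps(postings):
    """Replace each document id with its difference from the previous id.

    postings: list of (doc_id, value) pairs sorted by doc_id.
    """
    gaps = []
    previous = 0
    for doc_id, value in postings:
        gaps.append((doc_id - previous, value))
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild absolute document ids with a running sum over the gaps."""
    postings = []
    doc_id = 0
    for gap, value in gaps:
        doc_id += gap
        postings.append((doc_id, value))
    return postings


# The "to" posting list from the slide.
TO = [(1, 5), (3, 1), (4, 2), (5, 2), (6, 6)]
TO_GAPS = to_gaps(TO)  # [(1, 5), (2, 1), (1, 2), (1, 2), (1, 6)]
```

Gaps are small for frequent terms, which is what makes the variable-length codes on the following slides pay off.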

Variable Byte Encoding
– Encoding: The first bit of each byte is a continuation bit, which is flipped only in the last byte of the encoded gap. The remaining 7 bits in each byte encode part of the gap.
– Decoding: Read a sequence of bytes until the continuation bit flips, then extract and concatenate the 7-bit parts to get the magnitude of the gap.
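The scheme can be sketched as follows. One assumption is made loudly here: the slide does not say which polarity the continuation bit uses, so this sketch follows the common textbook convention in which the high bit is 0 on every byte except the last, where it is set to 1.

```python
def vb_encode_number(n):
    """Encode one non-negative gap as a variable-length byte string."""
    if n < 0:
        raise ValueError("gaps must be non-negative")
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # low 7 bits go at the end
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # assumed convention: high bit set marks the last byte
    return bytes(chunks)


def vb_encode(gaps):
    """Concatenate the encodings of a sequence of gaps."""
    return b"".join(vb_encode_number(g) for g in gaps)


def vb_decode(data):
    """Read bytes until the high bit flips, joining the 7-bit parts."""
    numbers = []
    n = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte  # more bytes follow for this gap
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

Small gaps cost a single byte, and decoding needs only byte-aligned reads, which is why this code is usually faster to process than bit-level codes even though it compresses less tightly.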

Bit Level Encoding
Each codeword has two parts, a prefix and a suffix.
– The prefix indicates the binary magnitude of the value and tells the decoder how many bits there are in the suffix.
– The suffix indicates the value of the number within the corresponding binary range.
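The slide does not name a specific code. One well-known code with this prefix/suffix shape is Elias gamma, so the sketch below uses it as an assumed example. Bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of bits.

    Prefix: one '1' per suffix bit, terminated by '0' (a unary length).
    Suffix: the binary digits of n without the leading 1.
    """
    if n < 1:
        raise ValueError("Elias gamma codes positive integers only")
    suffix = bin(n)[3:]  # bin(13) == '0b1101', so this drops '0b1'
    prefix = "1" * len(suffix) + "0"
    return prefix + suffix


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into integers."""
    numbers = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # the unary prefix gives the suffix length
            length += 1
            i += 1
        i += 1  # skip the terminating '0' of the prefix
        suffix = bits[i:i + length]
        i += length
        numbers.append(int("1" + suffix, 2))  # restore the leading 1
    return numbers
```

Because gamma cannot represent 0, it fits gap lists naturally: gaps between distinct, sorted document identifiers are always at least 1.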

Ordering By Highest Impact First
– Posting list as (document identifier, term frequency) pairs:
  (12, 2) (17, 2) (29, 1) (32, 1) (40, 6) (78, 1) (101, 3) (106, 1)
– When the list is reordered by term frequency, it becomes:
  (40, 6) (101, 3) (12, 2) (17, 2) (29, 1) (32, 1) (78, 1) (106, 1)
– The repeated frequency information can then be factored out into a prefix component with a counter that indicates how many documents share that frequency value (frequency : count : document identifiers):
  6 : 1 : 40
  3 : 1 : 101
  2 : 2 : 12, 17
  1 : 4 : 29, 32, 78, 106
– Not storing the repeated frequencies gives a considerable saving. Finally, if differences of document identifiers are taken within each group, we get:
  6 : 1 : 40
  3 : 1 : 101
  2 : 2 : 12, 5
  1 : 4 : 29, 3, 46, 28
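The reordering and factoring steps can be reproduced with a short sketch. The function name and the (frequency, count, gaps) tuple layout are choices made for this example, not notation from the paper.

```python
from itertools import groupby


def impact_order(postings):
    """Group (doc_id, frequency) postings by descending frequency.

    Returns a list of (frequency, count, gaps) blocks, where gaps holds the
    first document id of the block followed by differences between
    consecutive ids inside that block.
    """
    # Highest frequency first; ascending doc id inside each frequency.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    blocks = []
    for freq, group in groupby(ordered, key=lambda p: p[1]):
        doc_ids = [doc_id for doc_id, _ in group]
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        blocks.append((freq, len(doc_ids), gaps))
    return blocks


# The posting list from the slide.
POSTINGS = [(12, 2), (17, 2), (29, 1), (32, 1),
            (40, 6), (78, 1), (101, 3), (106, 1)]
BLOCKS = impact_order(POSTINGS)
# [(6, 1, [40]), (3, 1, [101]), (2, 2, [12, 5]), (1, 4, [29, 3, 46, 28])]
```

With this layout a query processor can stop after the high-frequency blocks when only the top results are needed, at the cost of gaps that restart in every block.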

Managing Multiple Indices
Indexes are classified by how often they are refreshed.
– The large, rarely-refreshing pages index: re-crawled and re-indexed once a month
– The small, ever-refreshing pages index: re-crawled and re-indexed daily
– The dynamic real-time/news pages index: re-crawled and re-indexed on a per-second basis

Scaling The System
DISTRIBUTED FILE SYSTEM
– To manage large amounts of data across large commodity clusters, a distributed file system is essential: one that provides efficient remote file access, file transfers, and the ability to carry out concurrent independent operations while being extremely fault tolerant.
MAP – SHUFFLE – REDUCE
– MAP: The master node chops up the problem into small chunks and assigns each chunk to a worker. The worker either processes the chunk of data with the mapper and returns the result to the master, or further chops up the input data and assigns it hierarchically.
– SHUFFLE (optional): Data is transferred between nodes to group key-value pairs from the mapper output in a way that enables proper reducing.
– REDUCE: The master takes the sub-answers and combines them to create the final output.
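The three phases can be illustrated with a single-process sketch that builds a small inverted index. It only mimics the data flow: there is no master, no workers, and no network transfer, and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per word in the document."""
    for term in text.lower().split():
        yield term, doc_id


def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as the reducer expects."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Reducer: collapse the grouped values into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    mapped = [pair
              for doc_id, text in docs.items()
              for pair in map_phase(doc_id, text)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Made-up sample documents.
SAMPLE_DOCS = {1: "The Web is big", 2: "the web is dynamic"}
INDEX = build_inverted_index(SAMPLE_DOCS)
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different machine, with the shuffle moving pairs between them; keeping the mapper and reducer free of shared state is what makes that distribution possible.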

Future Research Directions
Real Time Data and Search (Social Media)
– Create a social graph: Users create a graph of who they are interested in (who they follow) and of the topics they are interested in (what they tweet). A user's influence on his or her followers can be referred to as UserRank, and a user's influence on a given topic as UserTopicRank.
– Extract and index the links: This involves parsing each tweet, extracting a URL if present, crawling it, and indexing it. The secondary inputs for the indexing stage are similar to those needed for webpage indexing. Instead of anchor text, we have tweet text. Instead of PageRank, we have UserRank.
– Real-time related topics: Related topics help users discover information about current topics better than traditional suggestions. A lot of work on topic clustering has been done and is ongoing.
– Sentiment analysis: Many teams are working on using NLP techniques to extract sentiment from tweets and other real-time sources.

Social Search and Personalized Web Search
– Facebook Connect (2009): opened up the data to any third-party service as long as its users authenticate themselves using Facebook Connect. Facebook has started returning web search results based on the recommendations of friends who are within two degrees of the user.
– Facebook Graph Search: combines the big data acquired from its more than one billion users with external data into a search engine that provides user-specific search results.

Pros & Cons of the paper
Pros
– Good overview of search-engine architecture
– Explains the search engine lifecycle in a nutshell
– Latest developments support the Future Work section of the paper
Cons
– Compression section lacks examples
– Encoding techniques could have been easier to understand with more examples

Conclusion
Indexing is the key factor in organizing the World Wide Web.
– Query processing and result generation
– Term-based and phrase-based indexing
– Document-based and term-based partitioning
– Memory-based and disk-based storage
– Compression techniques
– Managing multiple indices
– Scaling
– Future scope
– Social search