
1 Efficient Index Updates over the Cloud
Panagiotis Antonopoulos, Microsoft Corp (panant@microsoft.com)
Ioannis Konstantinou, National Technical University of Athens (ikons@ece.ntua.gr)
Dimitrios Tsoumakos, Ionian University (dtsouma@ionio.gr)
Nectarios Koziris, National Technical University of Athens (nkoziris@ece.ntua.gr)

2 Requirements in the Web
Huge volume of data: > 1.8 zettabytes, growing by 80% each year
Huge number of users: > 2 billion users searching and updating web content
Explosion of user-generated content:
– Facebook: 90 updates/user/month, 30 billion/day
– Wikipedia: 30 updates/article/month, 8K new articles/day
Users demand fresh results

3 Our Contribution
A distributed system that allows fast and frequent updates on web-scale inverted indexes:
– Incremental processing of updates
– Distributed processing – MapReduce
– Distributed index storage and serving – NoSQL

4 Goals
Update time independent of existing index size – fast and frequent updates on large indexes
Index consistency after an update – system stability and performance unaffected by updates
Scalability – exploit large commodity clusters

5 Inverted Index
Maps each term in a collection of documents to the documents that contain it: (term, list(doc_ref))
Popular for fast content search and search engines
Index record: (term, doc_ref)
Example:
Term         List of documents
distributed  Doc2, Doc3, Doc7, Doc10
update       Doc2, Doc5, Doc12
Hadoop       Doc1, Doc2, Doc8
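The mapping above can be sketched in a few lines. This is an illustrative in-memory version (a plain dict of sorted lists), not the distributed store the slides describe; the sample documents are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "Doc1": "Hadoop",
    "Doc2": "distributed update Hadoop",
    "Doc3": "distributed",
}
index = build_inverted_index(docs)
# index["distributed"] == ["Doc2", "Doc3"]
```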

6 Related Work
Google: distributed index creation – Google Caffeine, fast and continuous index updates
Apache Solr: distributed search through index replication
Katta: distributed index creation and serving
CSLAB: distributed index creation and serving
LucidWorks: distributed index creation and updates on top of Solr (not open-source)

7 Basic Update Procedure
Input: collection of new/modified documents
For each new document: simply add each term to the corresponding list
For each modified document:
– Delete all index records that refer to the old version
– Add each term of the new version to the corresponding list

8 Basic Update Procedure
For modified documents we need to:
– Obtain the indexed terms of the old version
– Locate and delete the corresponding index records (complexity depends on the schema of the index)
Update time critically depends on these operations. How can we do them efficiently?
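To see why locating the old records is the expensive step, here is a naive sketch: without any record of which terms the old version contained, every posting list in the index must be scanned. This is an assumption-laden toy (a dict of sets stands in for the index store), meant only to show the cost the following slides eliminate.

```python
def delete_old_records(index, doc_id):
    """Locate and delete every record of a document by scanning the
    whole index: cost grows with the total number of indexed terms,
    not with the size of the document being updated."""
    for term in list(index):
        index[term].discard(doc_id)
        if not index[term]:      # drop posting lists that became empty
            del index[term]

index = {"distributed": {"Doc2", "Doc3"}, "update": {"Doc2"}, "Hadoop": {"Doc1"}}
delete_old_records(index, "Doc2")
# "Doc2" is removed from every list; the now-empty "update" row disappears
```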

9 Proposed Schema
HBase:
– Stores and indexes millions of columns per row
– Stores a varying number of columns per row
Proposed schema:
– One row for every indexed term
– One column for each document contained in the term's list
– Use the document ID as the column name

10 Proposed Schema
Each cell (row, column) corresponds to an index record (term, docID)
Advantages:
– Fast record discovery and deletion, almost independent of the list size
Disadvantages:
– Required storage space (per-column overhead)
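The schema can be mimicked with a dict of dicts: the outer key is the row (term), the inner keys are the column names (document IDs). This is a local stand-in for the HBase table, not its client API, but it shows why deleting one record is a constant-time cell operation rather than a list scan.

```python
index = {}

def add_record(index, term, doc_id):
    """One row per term; one column per document, named by its ID."""
    index.setdefault(term, {})[doc_id] = b""   # the cell value is unused

def delete_record(index, term, doc_id):
    """Deleting a record touches exactly one cell: O(1), independent
    of how many documents the term's list contains."""
    row = index.get(term)
    if row is not None:
        row.pop(doc_id, None)
        if not row:                            # remove an emptied row
            del index[term]

add_record(index, "update", "Doc2")
add_record(index, "update", "Doc5")
delete_record(index, "update", "Doc2")
```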

11 Forward Index
Forward index: the list of terms of each document
Advantages:
– Immediate access to the terms of the old version
– Retrieving the forward index is faster (smaller size)
Disadvantages:
– Required storage space
– Small overhead to the indexing process
Example:
Document ID  Words
Doc1         data, management, in, the, cloud
Doc2         inverted, index, updates
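With the forward index available, deletion no longer scans the whole inverted index: only the posting lists of the old version's own terms are touched. A minimal sketch, again using dicts and sets in place of the HBase tables:

```python
def delete_with_forward_index(index, forward, doc_id):
    """The forward index names exactly the terms of the old version,
    so the cost is proportional to the document, not to the index."""
    for term in forward.pop(doc_id, []):
        index[term].discard(doc_id)
        if not index[term]:
            del index[term]

index = {"data": {"Doc1", "Doc3"}, "cloud": {"Doc1"}}
forward = {"Doc1": ["data", "cloud"]}
delete_with_forward_index(index, forward, "Doc1")
# only the "data" and "cloud" rows were visited
```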

12 Minimizing Index Changes
General idea:
– Modifications in the documents' content are limited
– Update the index based only on the content modifications
Procedure:
– Compare the two versions of each document
– Delete the terms contained in the old version but not in the new
– Add the terms contained in the new version but not in the old

13 Minimizing Index Changes
No changes are required for the common terms
Advantages:
– Minimize the changes required to the index
– Minimize costly insertions and deletions in HBase
– Minimize the volume of intermediate K/V pairs (distributed)
Disadvantages:
– Increased complexity of the indexing process
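The comparison step is two set differences. A sketch of the procedure the slide describes, with invented sample terms:

```python
def diff_update(old_terms, new_terms):
    """Compute only the index changes a modified document requires."""
    old_set, new_set = set(old_terms), set(new_terms)
    deletions = old_set - new_set   # terms only in the old version
    additions = new_set - old_set   # terms only in the new version
    return additions, deletions    # common terms are left untouched

adds, dels = diff_update(
    ["data", "management", "cloud"],
    ["data", "index", "cloud"],
)
# adds == {"index"}, dels == {"management"}; "data" and "cloud" cost nothing
```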

14 Distributed Index Updates
Better, but still centralized!
Perfectly suited to the MapReduce model:
– Each document can be processed independently
– The updates have to be merged before they are applied to the index
Using the MR model we can:
– Easily distribute the processing
– Exploit the resources of large commodity clusters

15 Distributed Index Updates
Mappers:
– Scan each modified document and retrieve its old forward index
– Compare the two versions
– Emit K/V pairs for additions: (term, docID)
– Emit K/V pairs for deletions: (term, docID)
– Emit K/V pairs for the forward index and the content
Combiners (additions and deletions only):
– Merge the K/V pairs into a list of values per key
– Emit one K/V pair for additions: (term, list(docID))
– Emit one K/V pair for deletions: (term, list(docID))
Reducers:
– For additions: create an index record for each (term, docID) pair and write the records to HFiles
– For deletions: delete the corresponding cells using the HBase Client API
– Bulk-load the output HFiles into HBase
Tables:
– Content Table: the raw documents
– Forward Index Table: the forward index
– Inverted Index Table: the inverted index, using the schema described in the previous slides
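The map and group/merge steps can be simulated locally. This toy replaces Hadoop's shuffle and the combiners with a single in-memory grouping, and omits the HFile/bulk-load output path entirely; keys are tagged "+"/"-" here purely for illustration.

```python
from collections import defaultdict

def mapper(doc_id, old_terms, new_terms):
    """Emit tagged (op, term) -> docID pairs for one modified document."""
    old_set, new_set = set(old_terms), set(new_terms)
    for term in new_set - old_set:
        yield ("+", term), doc_id
    for term in old_set - new_set:
        yield ("-", term), doc_id

def run_update_job(changes):
    """changes: {doc_id: (old_terms, new_terms)}. Group the emitted
    pairs by key (the shuffle/combine step) and return the merged
    per-term addition and deletion lists."""
    grouped = defaultdict(list)
    for doc_id, (old, new) in changes.items():
        for key, value in mapper(doc_id, old, new):
            grouped[key].append(value)
    additions = {t: sorted(v) for (op, t), v in grouped.items() if op == "+"}
    deletions = {t: sorted(v) for (op, t), v in grouped.items() if op == "-"}
    return additions, deletions

additions, deletions = run_update_job({
    "Doc1": (["data", "cloud"], ["data", "index"]),
    "Doc2": (["cloud"], ["index"]),
})
# additions == {"index": ["Doc1", "Doc2"]}, deletions == {"cloud": ["Doc1", "Doc2"]}
```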

16 Even Load Distribution
Two different types of keys:
Document ID:
– One K/V pair for the content and one for the forward index of each document
– Divide the keys into equally sized partitions using a hash function
Term:
– Skewed (Zipfian) distribution in natural languages
– The number of values per key-term varies significantly

17 Even Load Distribution
Solution: sampling the input
Mappers: process a sample using the same algorithm; emit one K/V pair (term, 1) for each addition or deletion
Reducers (one for additions, one for deletions): count the occurrences to determine the splitting points
Indexer: loads the splitting points and chooses the reducer for each key
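The sampling step boils down to choosing term boundaries from the sorted sample so each partition carries about the same number of pairs, then routing each term by those boundaries. A simplified sketch (the real job counts occurrences across distributed reducers; here the sample is just a flat list of sampled terms):

```python
def splitting_points(sample_terms, num_reducers):
    """Pick term boundaries so each reducer receives roughly the same
    number of sampled (term, 1) pairs."""
    ordered = sorted(sample_terms)
    step = len(ordered) / num_reducers
    return [ordered[int(i * step)] for i in range(1, num_reducers)]

def choose_reducer(term, splits):
    """Route a term to its partition via the precomputed split points."""
    for i, boundary in enumerate(splits):
        if term < boundary:
            return i
    return len(splits)

# A skewed sample: "b" is far more frequent than the other terms.
sample = ["a", "a", "b", "b", "b", "b", "c", "d", "d", "d"]
splits = splitting_points(sample, 2)
# splits == ["b"]: terms before "b" go to reducer 0, the rest to reducer 1
```

A hash partitioner would send all of a hot term's pairs to one reducer regardless of frequency; the sampled boundaries spread the total pair count instead.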

18 Experimental Setup
Cluster:
– 2–12 worker nodes (default: 8)
– 8 cores @ 2 GHz, 8 GB RAM per node
– Hadoop v0.20.2-CDH3 (Cloudera), HBase v0.90.3-CDH3 (Cloudera)
– 6 mappers and 6 reducers per node
Datasets:
– Wikipedia snapshots of April 5, 2011 and May 26, 2011
– Default initial dataset: 64.2 GB, 23.7 million documents
– Default update dataset: 15.4 GB, 2.2 million documents

19 Experimental Results
Evaluating our design choices:
– Version comparison: benefit depends on the number of indexed terms
– Forward index: important in both cases
– Bulk loading: benefit depends on the number of indexed terms
– Sampling: not important here, due to the small number of intermediate K/V pairs

20 Experimental Results
Update time vs. update dataset size
– For a fixed initial dataset: 64.2 GB (≈24 million documents)
– Update time is linear in the update dataset size

21 Experimental Results
Update time vs. initial dataset size
– For a fixed set of new/modified documents: 5.1 GB (≈400 thousand docs)
– A 4X larger initial dataset increases update time by less than 6%
– Update time is roughly independent of the initial index size

22 Experimental Results
Update time vs. available resources (# of mappers/reducers)
– For fixed initial/update dataset sizes: 64.2 GB / 15.4 GB
– 5X faster indexing from 2 to 12 nodes
– Bulk loading into HBase does NOT scale as expected
– 3.3X better performance in total

23 Conclusion
Incremental processing: process updates, minimizing the required changes
Update time:
– Almost independent of the initial index size
– Linear in the update dataset size
Distributed processing:
– Reduced update time
– Scalability

24 Conclusion
Fast and frequent updates on web-scale indexes
Wikipedia: > 6X faster than an index rebuild
Disadvantages:
– Slower index creation (done only once)
– Increase in required storage space (low cost)

25 The End
Thank you!

