2/25/2004 The Google Cluster Architecture February 25, 2004.

Presentation transcript:

The Google Cluster Architecture
February 25, 2004

Assignments
Work on Registrar Assignment
Study for your quiz!

Web Crawling
Start with seed URL
Follow all URLs in page, etc.
Store documents
Create index
  – mapping between word in document and document
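The crawling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (a real crawler would fetch pages over HTTP), and the names `crawl` and `build_index` are made up for this example, not taken from the slides.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download each page instead of reading a dict.
PAGES = {
    "a.example/": ("google cluster architecture", ["a.example/b", "a.example/c"]),
    "a.example/b": ("cluster of cheap pcs", ["a.example/"]),
    "a.example/c": ("web search architecture", ["a.example/b", "a.example/missing"]),
}


def crawl(seed):
    """Breadth-first crawl from the seed URL; returns {url: text}."""
    store = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url not in PAGES:  # dead link: nothing to store or follow
            continue
        text, links = PAGES[url]
        store[url] = text  # store the document
        for link in links:  # follow every URL found in the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


def build_index(store):
    """Inverted index: word -> set of URLs whose text contains that word."""
    index = defaultdict(set)
    for url, text in store.items():
        for word in text.split():
            index[word].add(url)
    return index


documents = crawl("a.example/")
index = build_index(documents)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, and the index is the word-to-document mapping named in the last bullet.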

Web Search

Properties of Web Search
Embarrassingly parallel
  – Stateless
  – Read-only
Requires lots of storage
Requires lots of computation
Requires short response time

Google Design Goals
Energy efficiency
Price/performance ratio

Software Architecture
Reliability in software
  – Fault tolerance, not prevention
  – Cheap PCs
High degree of replication

Load Distribution/Balancing
Geographically distributed clusters
  – Increased fault tolerance
DNS-based load balancing
  – Select closest cluster to minimize RTT
Hardware-based local load balancing
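The two levels of load balancing can be sketched as follows. This is a rough illustration, not Google's mechanism: the cluster names and RTT values are invented, `pick_cluster` stands in for the DNS-based choice of the closest cluster, and `LocalBalancer` uses plain round-robin as an assumed stand-in for the hardware load balancer, whose actual policy the slides do not state.

```python
import itertools


def pick_cluster(rtt_ms, healthy):
    """Return the healthy cluster with the lowest round-trip time.

    rtt_ms:  {cluster name: measured RTT in milliseconds} (hypothetical data)
    healthy: set of cluster names currently able to serve traffic
    """
    candidates = {name: rtt for name, rtt in rtt_ms.items() if name in healthy}
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    return min(candidates, key=candidates.get)


class LocalBalancer:
    """Round-robin over the machines of one cluster (assumed policy)."""

    def __init__(self, machines):
        self._cycle = itertools.cycle(machines)

    def next_machine(self):
        return next(self._cycle)


rtts = {"us-west": 20, "us-east": 80, "europe": 150}
closest = pick_cluster(rtts, {"us-west", "us-east", "europe"})
fallback = pick_cluster(rtts, {"us-east", "europe"})  # us-west is down
```

If the closest cluster fails, the same selection simply falls through to the next-closest one, which is where the geographic distribution adds fault tolerance.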

Query Execution

Query Execution
1. Look up each query term in the index
2. Compute relevance score across results

Index Shards
A pool of machines serves a particular shard
Each request goes to one machine in the pool
If a machine goes down, capacity is only marginally reduced
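A minimal sketch of a shard pool, assuming random selection among live replicas (the slides do not say how the machine is chosen). The class name `ShardPool` and the machine names are hypothetical.

```python
import random


class ShardPool:
    """A pool of replica machines that all serve the same index shard."""

    def __init__(self, shard_id, machines):
        self.shard_id = shard_id
        self.machines = list(machines)
        self.down = set()

    def mark_down(self, machine):
        self.down.add(machine)

    def live_machines(self):
        return [m for m in self.machines if m not in self.down]

    def route(self, rng=random):
        """Send a request to one live machine (random choice is an assumption)."""
        live = self.live_machines()
        if not live:
            raise RuntimeError(f"shard {self.shard_id} has no live machines")
        return rng.choice(live)

    def capacity_fraction(self):
        """Fraction of the pool's original capacity still available."""
        return len(self.live_machines()) / len(self.machines)


pool = ShardPool(0, [f"idx-{i}" for i in range(10)])
pool.mark_down("idx-3")
remaining = pool.capacity_fraction()  # 9 of 10 machines are still serving
```

With ten replicas, losing one machine leaves nine tenths of the shard's capacity, and requests keep being served by the remaining machines.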

Query Execution
1. Look up each query term in the index
2. Compute relevance score across results
3. Retrieve document
  – Highlight keywords
4. Generate/return HTML
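The four steps can be traced end to end in a toy example. Everything here is an assumption made for illustration: the three documents, the whitespace tokenization, the term-count score (a placeholder for a real relevance function), and the names `build_index`, `highlight`, and `search`.

```python
import html

# Hypothetical document store: document id -> text.
DOCS = {
    1: "The Google cluster uses cheap PCs",
    2: "Web search is embarrassingly parallel",
    3: "A cluster serves web search queries",
}


def build_index(docs):
    """Inverted index: lowercase word -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def highlight(text, terms):
    """Escape each word for HTML and wrap query terms in <b> tags."""
    out = []
    for word in text.split():
        escaped = html.escape(word)
        out.append(f"<b>{escaped}</b>" if word.lower() in terms else escaped)
    return " ".join(out)


def search(query, docs, index):
    terms = query.lower().split()
    if not terms:
        return "<ul></ul>"
    # Step 1: look up the posting list for each query term, then intersect.
    postings = [index.get(term, set()) for term in terms]
    matches = set.intersection(*postings)

    # Step 2: score each match; counting term occurrences is a toy stand-in.
    def score(doc_id):
        words = docs[doc_id].lower().split()
        return sum(words.count(term) for term in terms)

    ranked = sorted(matches, key=lambda d: (-score(d), d))
    # Step 3: retrieve each document and highlight the keywords.
    items = [f"<li>{highlight(docs[d], terms)}</li>" for d in ranked]
    # Step 4: generate and return the HTML.
    return "<ul>" + "".join(items) + "</ul>"


INDEX = build_index(DOCS)
page = search("cluster search", DOCS, INDEX)
```

Only document 3 contains both terms, so `page` holds a single list item with `cluster` and `search` wrapped in `<b>` tags.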

Replication
No consistency issues
Nearly linear speedup

Discussion
For which other applications would this architecture be useful/not useful?