B534 distributed computing


Developing a Dynamic Virtual Cluster for Massively Parallel Applications: Case Study of Performance Analysis with the PageRank Algorithm on FutureGrid
Team 009: Joshi Harshad, Joshi Swapnil, Nachankar Vaibhav

Distributed systems
"A distributed system is a collection of independent computers that appears to its users as a single coherent system." (Tanenbaum's book)

Distributed systems
Historically, computers were used only for complex scientific and engineering problems, which engaged large computer clusters. Issues of performance and benchmarking of these clusters were thus mainly limited to a select set of scientists and engineers.
With the birth of the Internet, distributed systems are becoming ubiquitous. Their uses range from mobile phones to booking travel tickets to office work. The Internet and Internet-based computing can be found everywhere.
Cloud computing is becoming another measure of success and has prompted many academic and commercial institutions to adopt this platform for their work.
It is therefore important to understand the features of, and differences between, distributed systems and to study their components.
In this project we attempt to decompose and study in detail two systems: an academic cloud and an academic bare-metal supercomputing platform.

Two popular systems: bare-metal and cloud computing
Bare-metal platform: a platform formed by joining compute nodes via an interconnect communication switch. There are many types of these switches, including the commonly used Gigabit Ethernet, Myrinet and InfiniBand, with InfiniBand being the fastest among them.
Cloud computing: a model for delivering Internet-based information and technology services in real time. It allows users to see the services while the infrastructure that delivers them remains transparent (or in the "cloud").

Hypothesis for the study in question
The hypothesis for this study is that larger and more complex problems, where the performance of the computation on a distributed system relies on communication, will show stark differences between the results obtained from the above two platforms. The cloud platform is expected to show lower performance in this case, since the InfiniBand interconnect of the bare-metal platform provides much faster communication between compute nodes.

Overview of the PageRank algorithm
In the Web 2.0 era it is becoming increasingly important to find the data most relevant to a query among the millions of webpages on the Internet. Thousands or more webpages are added every day, so the filtering behind the search has to be updated constantly, or at least often enough to keep the data properly indexed. This requires sorting and indexing the webpages by some scoring index. The PageRank algorithm, introduced by the Google search engine, tries to address this need.

Overview of the PageRank algorithm Taken from Prof Qiu’s lecture notes

PageRank algorithm contd…
PR: PageRank (a probability value)
p_i: a page under consideration
L(p_j): the number of outbound links on page p_j
d: damping factor, which can be set between 0 and 1 (usually 0.85)
N: total number of pages
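The formula itself did not survive in the transcript. The symbols listed match the commonly quoted form PR(p_i) = (1 - d)/N + d * sum of PR(p_j)/L(p_j) over the pages p_j that link to p_i. The sketch below iterates that form on a small link graph. It is an illustration under stated assumptions, not the team's code: the function name `pagerank`, the `links` dictionary layout, the stopping tolerance, and the handling of pages without outbound links (their rank is spread evenly over all pages) are all choices made here.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(p_i) = (1 - d)/N + d * sum(PR(p_j) / L(p_j)) until it settles.

    links maps each page to the list of pages it links to (outbound links).
    """
    # Every page that appears as a source or as a link target.
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(max_iter):
        # Assumption: rank held by pages with no outbound links is shared by all pages.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}

        # Each page p_j gives PR(p_j) / L(p_j) to every page it links to.
        for p_j in pages:
            targets = links.get(p_j, [])
            if targets:
                share = d * pr[p_j] / len(targets)
                for p_i in targets:
                    new[p_i] += share

        delta = sum(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:
            break
    return pr


# Toy three-page graph (made up for illustration).
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph page C receives links from both A and B, so it ends up with the highest value, and the values sum to 1 because PR is a probability distribution.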

Implementation of Parallel PageRank
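The transcript does not preserve how the team parallelized the computation. One common approach, shown here only as a hedged sketch, is to split the pages across processes, let each process push rank along the outbound links of its own pages, and then combine the partial results from all processes in every iteration. The code simulates the processes sequentially in plain Python; the names `partial_contributions` and `parallel_pagerank`, the cyclic partition, and the fixed iteration count are assumptions made for illustration.

```python
def partial_contributions(links, pr, my_pages, d):
    """Work of one process: push rank from its own pages along their outbound links."""
    contrib = {}
    dangling = 0.0
    for p_j in my_pages:
        targets = links.get(p_j, [])
        if targets:
            share = d * pr[p_j] / len(targets)
            for p_i in targets:
                contrib[p_i] = contrib.get(p_i, 0.0) + share
        else:
            dangling += pr[p_j]  # rank of pages without outbound links
    return contrib, dangling


def parallel_pagerank(links, num_procs=2, d=0.85, iterations=50):
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    # Cyclic partition: simulated process r owns pages r, r + num_procs, ...
    parts = [pages[r::num_procs] for r in range(num_procs)]
    pr = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Computation phase: on a cluster each call would run on its own process.
        results = [partial_contributions(links, pr, part, d) for part in parts]

        # Communication phase: every process needs the sum of all partial results.
        dangling = sum(dang for _, dang in results)
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for contrib, _ in results:
            for page, value in contrib.items():
                new[page] += value
        pr = new
    return pr
```

The combine step is where data would cross the interconnect on a real cluster, and its volume grows with the number of pages, which is why the interconnect matters for the hypothesis stated earlier.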

Results Performance Analysis for small dataset

Varying No. of URLs (BareMetal)

Varying No. of Processes (BareMetal)

Monitoring system
Test and implement parallel PageRank on FutureGrid.
Optimize for better speedup over the initial results.
Build a monitoring system using Pub/Sub.
Build a dynamic virtual cluster.
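The publish/subscribe pattern named above can be illustrated with a very small in-process example. This is a sketch of the pattern only: the transcript does not say which messaging system the team used, and the `Broker` class, the topic name `node.cpu`, and the sample reading are hypothetical.

```python
class Broker:
    """In-process stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register a callback that is called with every message on the topic."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        """Deliver a message to all subscribers of the topic; return how many got it."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# A monitor subscribes once; compute nodes publish readings without knowing who listens.
broker = Broker()
readings = []
broker.subscribe("node.cpu", readings.append)
broker.publish("node.cpu", {"node": "node-01", "cpu_percent": 42.0})  # made-up sample
```

Because publishers and subscribers only share a topic name, nodes can be added to or removed from a cluster without changing the monitoring code, which suits a dynamic virtual cluster.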

Implementation of Monitoring system - Results on bare-metal cluster

Conclusions
A parallel algorithm for PageRank calculations was successfully implemented.
The algorithm was tested on two systems: a bare-metal cluster and a virtual platform (Eucalyptus).
The results obtained were in agreement with the hypothesis: the InfiniBand interconnect provided better communication, and for large datasets the communication between nodes becomes the bottleneck for the calculations.

Future Work
Performance can be tested more rigorously on other systems and with a variety of datasets.
If possible, performance can also be tested with different interconnect protocols.

Thanks
We are grateful to the FutureGrid administrators for providing FutureGrid access for our work and for their help in running the program successfully. Special thanks go to Andrew Young, who helped solve major issues whenever technical problems with FutureGrid arose. We are also thankful for the questions and answers raised by our classmates, as the forums helped us solve problems while executing our tasks. Last but not least, we thank Prof. Qiu and the AIs for guiding us in each task and for showing the distributed-system approach of the overall project.