Team 3: Md Liakat Ali, Abdulaziz Altowayan, Andreea Cotoranu, Stephanie Haughton, Gene Locklear, Leslie Meadows


o Hadoop
o MapReduce
o Download Links
o Install and Tutorials

Software platform that lets one easily write and run applications that process vast amounts of data. It includes:
o MapReduce – offline computing engine
o HDFS – Hadoop Distributed File System
o HBase (pre-alpha) – online data access
Here's what makes it especially useful:
o Scalable: it can reliably store and process petabytes.
o Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
o Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
o Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.

Hadoop implements Google's MapReduce, using HDFS. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

o A9.com (Amazon): builds Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
o Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2×4-CPU boxes with 4 TB disk each); used to support research for Ad Systems and Web Search.
o AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster of 50 machines (Intel Xeon, dual processor, dual core, each with 16 GB RAM and 800 GB hard disk), giving a total of 37 TB of HDFS capacity.
o Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage.
o FOX Interactive Media: 3 × 20-machine clusters (8 cores/machine, 2 TB/machine storage); 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining, and machine learning.
o University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.

o Adknowledge: builds the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
o Contextweb: stores ad-serving logs and uses them as a source for ad optimization, analytics, reporting, and machine learning; 23-machine cluster with 184 cores and about 35 TB raw storage; each (commodity) node has 8 cores, 8 GB RAM, and 1.7 TB of storage.
o Cornell University Web Lab: generating web graphs on 100 nodes (dual 2.4 GHz Xeon processors, 2 GB RAM, 72 GB hard drive).
o NetSeer: up to 1,000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving, and log analysis.
o The New York Times: large-scale image conversions; uses EC2 to run Hadoop on a large virtual cluster.
o Powerset / Microsoft: natural language search; up to 400 instances on Amazon EC2; data storage in Amazon S3.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of:
o a Map() procedure (method) that performs filtering and sorting, and
o a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
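The Map()/Reduce() pair above can be sketched in plain Python. This toy, single-process version (illustrative only, not Hadoop's actual Java API) implements the name-frequency example: Map emits (name, 1) pairs, a shuffle step groups them by key, and Reduce sums each group.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: emit a (key, 1) pair for each name in the input record.
    for name in record.split():
        yield (name, 1)

def reduce_fn(key, values):
    # Reduce: summary operation, here counting occurrences of each key.
    return (key, sum(values))

def map_reduce(records):
    # Map phase over all input records.
    pairs = [kv for r in records for kv in map_fn(r)]
    # Shuffle: sort and group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase, one call per distinct key.
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["alice bob", "bob carol bob"]))
# [('alice', 1), ('bob', 3), ('carol', 1)]
```

In real Hadoop the shuffle happens over the network between map and reduce tasks on different nodes; the logical structure is the same.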

Pioneered by Google:
o processes 20 PB of data per day
o sort/merge-based distributed computing
o initially intended for their internal search/indexing application, but now used extensively by many more organizations
Popularized by the open-source Hadoop project:
o used by Yahoo!, Facebook, Amazon, …

At Google:
o index building for Google Search
o article clustering for Google News
o statistical machine translation
At Yahoo!:
o index building for Yahoo! Search
o spam detection for Yahoo! Mail
At Facebook:
o data mining
o ad optimization
o spam detection
In research:
o analyzing Wikipedia conflicts (PARC)
o natural language processing (CMU)
o bioinformatics (Maryland)
o particle physics (Nebraska)
o ocean climate simulation (Washington)

1. Scalability to large data volumes:
o scanning 100 TB on 1 node at 50 MB/s takes about 24 days
o scanning on a 1000-node cluster takes about 35 minutes
2. Cost-efficiency:
o commodity nodes (cheap, but unreliable)
o commodity network
o automatic fault tolerance (fewer admins)
o easy to use (fewer programmers)
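The scan-time figures above follow from simple division (the slide's "24 days" and "35 minutes" include some rounding; the raw arithmetic gives about 23 days and 33 minutes):

```python
# Back-of-the-envelope scan times, assuming the slide's 50 MB/s disk rate.
data_mb = 100 * 10**6   # 100 TB expressed in MB (decimal units)
rate_mb_s = 50          # sequential read rate of one node, MB/s

# One node reads the whole dataset serially.
one_node_days = data_mb / rate_mb_s / 86400

# 1000 nodes each scan 1/1000th of the data in parallel.
cluster_minutes = data_mb / (rate_mb_s * 1000) / 60

print(f"1 node:     {one_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

The parallel number is idealized: it ignores scheduling, stragglers, and network overhead, which is part of why the slide's quoted 35 minutes exceeds the raw 33.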

Distributed file system (DFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
o is highly fault-tolerant and is designed to be deployed on low-cost hardware;
o provides high-throughput access to application data and is suitable for applications that have large data sets;
o is part of the Apache Hadoop Core project.

o Files are split into 128 MB blocks
o Blocks are replicated across several datanodes (usually 3)
o The namenode stores metadata (file names, block locations, etc.)
o Optimized for large files and sequential reads
o Files are append-only

MapReduce framework:
o executes user jobs specified as "map" and "reduce" functions
o manages work distribution and fault tolerance

o HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java.
o It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS.
o It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data).
o HBase now serves several data-driven websites, including Facebook's Messaging Platform.
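HBase's sparse, BigTable-style data model can be pictured as a map of maps: row key → {column → value}. This toy sketch (plain Python, not the real HBase client API; row and column names are made up for illustration) shows why sparse rows are cheap: absent columns simply take no space.

```python
# Toy model of a BigTable/HBase-style sparse table.
# Rows are keyed strings; each row holds only the columns actually written.
table = {}

def put(row, column, value):
    # Create the row lazily on first write; store only present columns.
    table.setdefault(row, {})[column] = value

def get(row, column, default=None):
    # Missing rows and missing columns both return the default.
    return table.get(row, {}).get(column, default)

put("user1", "info:name", "Alice")
put("user1", "msg:2015-01-01", "hello")
put("user2", "info:name", "Bob")   # user2 has no msg:* columns at all

print(get("user1", "msg:2015-01-01"))  # hello
print(get("user2", "msg:2015-01-01"))  # None
```

Real HBase adds column families, timestamped versions of each cell, and distribution of row ranges across region servers, but the sparse map-of-maps view is the core idea.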

o Download links
o Sandbox
o How to Download and Install
o Tutorials

o Running Hadoop on Ubuntu Linux (Single-Node Cluster): node-cluster/ program-in-python/
o Sandbox
o Some other links:

The easiest way to get started with Enterprise Hadoop. The Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials to guide you through the basics of Hadoop. It includes the Hortonworks Data Platform in an easy-to-use form. We can add our own datasets and connect it to our existing tools and applications. We can test new functionality with the Sandbox before putting it into production: simply, easily, and safely.

o Install any one of the following on your host machine:
1. VirtualBox
2. VMware Fusion
3. Hyper-V
o Oracle VirtualBox version 4.2 or later
o Hortonworks Sandbox virtual appliance for VirtualBox: download the correct virtual appliance file for your environment from http://hortonworks.com/products/hortonworks-sandbox/#install

o CPU: a 64-bit machine with a multi-core CPU that supports virtualization
o BIOS: virtualization support enabled
o RAM: at least 4 GB
o Browsers: Chrome 25+, IE 9+ (the Sandbox will not run on IE 10), Safari 6+
o The complete installation guide can be found at: content/uploads/2015/07/Import_on_Vbox_7_20_2015.pdf