Hadoop Team: Role of Hadoop in the IDEAL Project
●Jose Cadena ●Chengyuan Wen ●Mengsu Chen
CS5604, Spring 2015
Instructor: Dr. Edward Fox

Big data and Hadoop

Data sets are so large or complex that traditional data processing tools are inadequate. Challenges include:
●analysis
●search
●storage
●transfer

Big data and Hadoop
Hadoop solution (inspired by Google):
●distributed storage: HDFS
○a distributed, scalable, and portable file system
○high capacity at very low cost
●distributed processing: MapReduce
○a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
○composed of Map() and Reduce() procedures

Hadoop Cluster for this Class
●Nodes
○19 Hadoop nodes
○1 manager node
○2 tweet DB nodes
○1 HDFS backup node
●CPU: Intel i5 Haswell quad-core 3.3 GHz; Xeon
●RAM: 660 GB
○32 GB x 19 (Hadoop nodes) + 4 GB x 1 (manager node)
○16 GB x 1 (HDFS backup) + 16 GB x 2 (tweet DB nodes)
●HDD: 60 TB, plus backup HDD and SSD storage
●Hadoop distribution: CDH 5.3.1

Data sets of this class
Seven tweet collections: 5.3 GB, 3.0 GB, 9.9 GB, 8.7 GB, 2.2 GB, 9.6 GB, and 0.5 GB
~87 million tweets in total

MapReduce
●Originally developed to rewrite the indexing system for the Google web search product
●Simplifies large-scale computations
●MapReduce programs are automatically parallelized and executed on a large cluster
●Programmers without any experience with parallel and distributed systems can easily use large distributed resources

Typical problem solved by MapReduce ●Read data as input ●Map: extract something you care about from each record ●Shuffle and Sort ●Reduce: aggregate, summarize, filter, or transform ●Write the results
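The steps above can be sketched in plain Python. This is a minimal word-count example; the function names and the in-memory shuffle are illustrative stand-ins for what the Hadoop framework does, not actual Hadoop API calls.

```python
from collections import defaultdict

def map_phase(records):
    """Map: extract something from each record -- here, emit (word, 1)."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key (here, sum the counts)."""
    return {key: sum(values) for key, values in grouped}

records = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_and_sort(map_phase(records)))
# counts == {"data": 2, "hadoop": 2, "processes": 1, "stores": 1}
```

In a real job, the map and reduce functions run on many machines in parallel, and the shuffle moves data over the network between them.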

MapReduce Process (data-flow diagram, from input through map, shuffle/sort, and reduce)

Requirements ●Design a workflow for the IDEAL project using appropriate Hadoop tools ●Coordinate data transfer between the different teams ●Help other teams to use the cluster effectively

Workflow (diagram): tweets are imported from SQL databases via Sqoop, and web pages are crawled by Nutch starting from seedURLs.txt; original and noise-reduced tweets and web pages are stored in HDFS as Avro files; analysis teams (Clustering, Classifying, NER, Social, LDA) process the data with MapReduce; results are loaded into HBase, and the Lily Indexer indexes them into Solr.

Schema Design - HBase
●Separate tables for tweets and web pages
●Both tables have two column families
○original: tweet / web page content and metadata
○analysis: results of the analysis of each team
●Row ID of a document
○[collection_name]--[UID]
○allows fast retrieval of the documents of a specific collection
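The row-ID scheme can be illustrated with a short Python sketch. The collection names and UIDs are made up, and the sorted list stands in for HBase's key-ordered storage; real reads would go through the HBase API.

```python
def make_row_id(collection, uid):
    """Build a row ID in the [collection_name]--[UID] format."""
    return f"{collection}--{uid}"

# Hypothetical documents from two collections.
row_ids = sorted(make_row_id(c, u) for c, u in [
    ("charlie_hebdo", "001"), ("charlie_hebdo", "002"), ("ebola", "001"),
])

def scan_collection(rows, collection):
    """Prefix scan: HBase stores rows sorted by key, so all documents of
    one collection are contiguous and cheap to retrieve together."""
    prefix = collection + "--"
    return [r for r in rows if r.startswith(prefix)]

hebdo_rows = scan_collection(row_ids, "charlie_hebdo")
# hebdo_rows == ["charlie_hebdo--001", "charlie_hebdo--002"]
```

Because the collection name is the key prefix, fetching one collection is a single contiguous range scan rather than a full-table filter.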

Schema Design - HBase (schema diagram)

●Why HBase?
○our datasets are sparse
○real-time random I/O access to data
○Lily Indexer allows real-time indexing of data into Solr

Schema Design - Avro
●One schema for each team
○no risk of teams overwriting each other's data
○changes in the schema for one team do not affect others
●Each schema contains the fields to be indexed into Solr
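A per-team schema along these lines might look as follows. The record and field names here are hypothetical, since the actual schemas used in the course are not shown in the slides; the dict mirrors Avro's JSON schema declaration format.

```python
import json

# A hypothetical Avro record schema for one team's noise-reduced tweets.
tweet_schema = {
    "type": "record",
    "name": "NoiseReducedTweet",
    "namespace": "ideal.tweets",
    "fields": [
        {"name": "doc_id", "type": "string"},       # [collection_name]--[UID]
        {"name": "text_clean", "type": "string"},   # field indexed into Solr
        # Optional field: a union with null plus a default allows older
        # readers and writers to coexist (schema versioning).
        {"name": "created_at", "type": ["null", "long"], "default": None},
    ],
}

schema_json = json.dumps(tweet_schema, indent=2)
field_names = [f["name"] for f in tweet_schema["fields"]]
# field_names == ["doc_id", "text_clean", "created_at"]
```

Each team owning its own record type like this is what prevents one team's schema change from breaking another's data.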

Schema Design - Avro
●Why Avro?
○supports versioning, and a schema can be split into smaller schemas
○we take advantage of these properties for the data upload
○schemas can be used to generate a Java API
○MapReduce support and libraries for the different programming languages used in this course
○supports compression formats used in MapReduce

Loading Data Into HBase
●Sequential Java program
○good solution for the small collections
○does not scale to the big collections: out-of-memory errors on the master node

Loading Data Into HBase
●MapReduce program
○map-only job
○each map task writes one document to HBase
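The map-only load can be sketched as follows, with pure Python standing in for the Hadoop mapper and a dict standing in for the HBase table; the record fields and column layout are illustrative, not the course's actual ones.

```python
def to_put(record):
    """Mapper body: convert one input record into an HBase put,
    i.e. a row key plus column-family:qualifier -> value cells."""
    row_key = f"{record['collection']}--{record['uid']}"
    cells = {"original:text": record["text"]}
    return row_key, cells

def map_only_load(records, table):
    """Map-only job: no shuffle and no reducer; each map task writes
    its document straight to the table."""
    for record in records:
        row_key, cells = to_put(record)
        table[row_key] = cells  # stands in for an HBase Put

table = {}
map_only_load([{"collection": "ebola", "uid": "001", "text": "..."}], table)
# table == {"ebola--001": {"original:text": "..."}}
```

Skipping the shuffle and reduce phases is what makes this pattern cheap: the job is just many independent writers running in parallel.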

Loading Data Into HBase
●Bulk-loading
○use a MapReduce job to generate HFiles
○write HFiles directly, bypassing the normal HBase write path
○much faster than our map-only job, but requires pre-configuration of the HBase table
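The bulk-load path can be simulated to show why the table must be pre-configured: HFiles hold sorted keys, and each file must fall within one pre-split region of the table. The region split keys below are made up, and lists of tuples stand in for real HFiles.

```python
import bisect

def bulk_load(puts, split_keys):
    """Simulate HFile generation: sort all (row_key, cells) pairs and
    partition them by the table's pre-configured region split keys.
    Each partition corresponds to one HFile handed to one region."""
    hfiles = [[] for _ in range(len(split_keys) + 1)]
    for row_key, cells in sorted(puts):
        region = bisect.bisect_right(split_keys, row_key)
        hfiles[region].append((row_key, cells))
    return hfiles

puts = [("ebola--001", {}), ("charlie_hebdo--002", {}), ("charlie_hebdo--001", {})]
# Table pre-split so that each collection lands in its own region.
hfiles = bulk_load(puts, split_keys=["ebola"])
# hfiles[0] holds the charlie_hebdo rows; hfiles[1] holds the ebola row
```

If the split keys were wrong or missing, all output would pile into one region, which is why bulk loading needs the table configured before the job runs.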

Loading Data Into HBase

Collaboration with other teams
●Helped other teams interact with Avro files and output data
○multiple rounds and revisions were needed
○thank you, everyone!
●Helped with MapReduce programming
○the Classification team had to adapt a third-party tool for their task

Acknowledgements
●Dr. Fox
●Mr. Sunshin Lee
●Solr and Noise Reduction teams
●National Science Foundation
●NSF grant IIS, III: Small: Integrated Digital Event Archiving and Library (IDEAL)

Thank you