Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech Feb. 18, 2015 presentation for.

Presentation transcript:

Big Data and Hadoop and DLRL
Introduction to the DLRL Hadoop Cluster
Sunshin Lee and Edward A. Fox
DLRL, CS, Virginia Tech
Feb. 18, 2015 presentation for NLI: Working with Big Data

Google First

DLRL Cluster – Order Parts

DLRL Cluster – Assemble
No ODD (optical disc drive) such as DVD or CD-ROM

DLRL Cluster – Configure Network
Difficult parts: configuring the gateway (NAT) and the firewall

DLRL Cluster Spec
# of Nodes: (manager node)
CPU: Intel Xeon and i5
RAM: 208 GB
– 32GB * 2 (node1 (master node) and node2)
– 16GB * 9 (all nodes)
HDD: 51.3 TB
– 12TB (node 1), 6TB (node 2)
– 3TB * 7, 2TB * 2 (remaining nodes)
– 8.3TB NAS Backup
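
The RAM and HDD totals on this slide can be checked against the per-node figures it lists. The short sketch below only redoes that arithmetic; the variable names are illustrative and not part of the presentation.

```python
# Per-node figures copied from the spec slide (RAM in GB, disk in TB).
ram_gb = 32 * 2 + 16 * 9                 # two 32 GB nodes plus nine 16 GB nodes
hdd_tb = 12 + 6 + 3 * 7 + 2 * 2 + 8.3    # node 1, node 2, remaining nodes, NAS backup

print(f"RAM: {ram_gb} GB")
print(f"HDD: {hdd_tb:.1f} TB")
```

Both sums match the totals stated on the slide (208 GB and 51.3 TB), so the HDD total appears to include the NAS backup.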

DLRL Cluster – Architecture

DLRL Cluster (Upgraded Feb. 2015)
Funding provided by CS to expand to 20 nodes
– To support multiple courses as well as research
– CS4984 Computational Linguistics
– CS5604 Information Retrieval
– CS4624 Multimedia, Hypertext & Information Access
– Monitoring by Prof. Ali Butt and his students
Cloudera: CDH High Availability architecture

DLRL Cluster

DLRL Cluster Spec
# of Nodes:
– 19 (Hadoop nodes) + 1 (manager node)
– 1 (HDFS backup) + 2 (tweet DB nodes)
CPU: Intel i5 Haswell quad-core 3.3 GHz, Xeon
RAM: 660 GB
– 32GB * 19 (Hadoop nodes) + 4GB * 1 (manager node)
– 16GB * 1 (HDFS backup) + 16GB * 2 (tweet DB nodes)
HDD: 60 TB TB (backup) TB SSD
– 3TB * 19 (Hadoop nodes, head node: 6TB)
– 3TB (HDFS backup), 1TB + 256GB SSD (tweet DB nodes)
– 8.3TB NAS Backup
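
As a rough cross-check of the upgraded spec, the sketch below recomputes the RAM total from the per-role figures. The HDD line rests on an assumption that the slide does not state outright: that "head node: 6TB" means 18 Hadoop nodes carry 3 TB each and the head node carries 6 TB. All names are illustrative.

```python
# (node count, RAM per node in GB) for each role listed on the slide.
ram_parts = {
    "Hadoop nodes": (19, 32),
    "manager node": (1, 4),
    "HDFS backup": (1, 16),
    "tweet DB nodes": (2, 16),
}
ram_total_gb = sum(count * size for count, size in ram_parts.values())

# Assumption: 18 Hadoop nodes at 3 TB plus the head node at 6 TB.
hadoop_hdd_tb = 18 * 3 + 6

print(f"RAM: {ram_total_gb} GB")
print(f"Hadoop HDD: {hadoop_hdd_tb} TB")
```

Under that reading, both figures agree with the 660 GB and 60 TB shown on the slide.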

DLRL Cluster – Architecture

DLRL Cluster – Install Cloudera Hadoop

DLRL Cluster – Services

Tools for research
Mahout: classification, clustering, topic analysis (LDA), frequent pattern mining
Solr/Lucene: search/(faceted) browse
Natural Language Processing and Named Entity Recognition
– Hadoop Streaming
– NLTK (Python) and SNER (Stanford NER)
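
Hadoop Streaming lets a job use any program that reads lines from standard input and writes tab-separated key/value lines to standard output, which is how Python tools such as NLTK can run on the cluster. The following is a minimal sketch of a token-count mapper and reducer in that style, not the DLRL scripts themselves. The regular-expression tokenizer is a stand-in (an assumption) where an NLTK tokenizer or NER tagger could be swapped in, and the `map`/`reduce` command-line argument is a hypothetical convention.

```python
import re
import sys
from itertools import groupby

# Stand-in tokenizer (assumption); an NLTK tokenizer could replace it.
TOKEN = re.compile(r"[A-Za-z']+")

def mapper(lines):
    """Emit one 'token<TAB>1' record per token found in the input lines."""
    for line in lines:
        for token in TOKEN.findall(line.lower()):
            yield f"{token}\t1"

def reducer(lines):
    """Sum the counts per token; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical convention: pass 'map' or 'reduce' as the only argument.
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for record in stage(sys.stdin):
        print(record)
```

The pair can be tried locally without a cluster by piping a text file through the map stage, then `sort`, then the reduce stage, which mimics the sorted-by-key input that the reducer receives from Hadoop.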

CS4984 Projects in IDEAL Project

What is Big Data and Hadoop
Definition
– Big data: a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. 1)
– Apache Hadoop: a framework for distributed processing of large data sets across clusters of computers using simple programming models. 2)
1) Big data definition: wikipedia.org
2) Hadoop definition: hadoop.apache.org

Hadoop solutions
Hadoop
– MapReduce (YARN: MapReduce V2): a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
– HDFS: a distributed, scalable, and portable file system written in Java for the Hadoop framework.
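
To make the MapReduce programming model concrete, here is a small in-memory sketch of its three phases (map, shuffle, reduce) building an inverted index from words to document IDs. It runs in a single Python process and does not use Hadoop; the function names and the two sample documents are invented for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, doc_id) pair for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Reduce: collapse the values for one key into a sorted posting list."""
    return word, sorted(doc_ids)

def run_job(docs):
    mapped = (pair for doc_id, text in docs.items()
              for pair in map_phase(doc_id, text))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())

# Invented sample input.
docs = {"d1": "hadoop stores data", "d2": "hadoop processes data"}
index = run_job(docs)
print(index["hadoop"])
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network, but the contract between the phases is the same as in this sketch.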

Which Hadoop distributions 1)
Vanilla Apache Hadoop
– Manual installation/configuration
Cloudera Hadoop
– Oldest distribution, #1 in revenue 2), open source
– Management software: Cloudera Manager
– Doug Cutting, Chief Architect
– Recently Intel invested $740 million 3)
MapR
– Not Apache Hadoop: modified HDFS, written in C rather than Java
1) 2) 3)