DLRL Cluster
CS4624, Spring 2014
Virginia Tech, Blacksburg, VA
Client: Sunshin Lee
Adam Lech, Joseph Pontani, Matthew Bollinger

Outline
- Deliverables
- Data Preprocessing
- Hive
- HBase
- Impala
- Mahout
- Future Work

Deliverables
- Tutorials
- Video demos
- Report generation
- Technologies covered: HBase, Hive, Impala, Mahout

Data Preprocessing
- Needed to preprocess the data for meaningful analysis (the SQL sketch below illustrates the source and date transformations)
- source: strip excess URL information, e.g. <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a> becomes tapbots.com
- tweet date: separate into fields for year, month, and day
- tweet text: remove stop words ('the', 'and', etc.)
- The input CSV is fed into a Python script and dumped back out to a CSV file for use
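The slides describe these transformations as a Python script whose code is not shown. Purely as a hedged illustration, the source and date splits could equivalently be expressed in MySQL at dump time; the table and column names here are assumptions, and stop-word removal would still happen in the script.

SELECT
  -- pull the href out of '<a href="http://tapbots.com/tweetbot" ...>...</a>',
  -- then keep only the host part, e.g. tapbots.com
  SUBSTRING_INDEX(
    SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(source, '"', 2), '"', -1), '/', 3),
    '/', -1) AS source_site,
  YEAR(created_at)  AS year,   -- split the tweet date into
  MONTH(created_at) AS month,  -- separate year/month/day fields
  DAY(created_at)   AS day,
  text
FROM twitter_raw;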

Preprocessing Challenges
- Tweets come from two different sources: twitter-stream and twitter-search
- The sources use different formats, e.g.:
  <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a>
  <a href="http://twitter.com/download/android">Twitter for Android</a>
- The data is full of unusual characters that threw the script off
- Large datasets take a very long time to process

Hive
- Importing the pothole dataset into Hive
- The load statements are similar to loading data in MySQL (see the HiveQL sketch below)
- The data is still stored in files; queries are transformed into MapReduce jobs
- Workflow: MySQL table -> dumped into a CSV file -> imported into the Hadoop Distributed File System -> imported into a Hive table
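To make the load step concrete, here is a minimal HiveQL sketch. The column layout is assumed from the fields used in the Impala benchmark queries later in this report, and the HDFS path is hypothetical.

CREATE TABLE twitter (
  from_user  STRING,
  created_at STRING,
  year       INT,
  month      INT,
  day        INT,
  source     STRING,
  text       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the preprocessed CSV that was copied into HDFS:
LOAD DATA INPATH '/user/dlrl/pothole.csv' INTO TABLE twitter;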

HBase
- Importing the pothole dataset into HBase
- HBase requires exactly one column to serve as the row key (HBASE_ROW_KEY)
- HBase organizes fields into column families
- The import runs as MapReduce jobs, using the ImportTsv tool (a companion sketch for querying the result follows below)
- Workflow: MySQL table -> dumped into a CSV file -> imported into the Hadoop Distributed File System -> imported into an HBase table
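ImportTsv itself is a command-line MapReduce utility rather than SQL. As a hedged, hypothetical companion sketch, a Hive external table can be mapped onto the resulting HBase table so that it is queryable with the same SQL used elsewhere in this report; the table name, column family, and qualifiers below are all assumptions.

CREATE EXTERNAL TABLE pothole_hbase (
  rowkey    STRING,
  from_user STRING,
  text      STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- map the row key plus two qualifiers in a 'tweet' column family
  'hbase.columns.mapping' = ':key,tweet:from_user,tweet:text'
)
TBLPROPERTIES ('hbase.table.name' = 'pothole');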

Impala
- Set up tables in Impala with the new datasets (see the setup note after the queries)
- Create benchmark queries to test Impala vs. Hive:

select count(*) as num_tweets, from_user from twitter group by from_user order by num_tweets desc limit 15;
select count(*) as num_tweets, source from twitter group by source order by num_tweets desc limit 15;
select count(*) as num_tweets, created_at from twitter group by created_at order by num_tweets desc limit 15;
select count(*) as num_tweets, month from twitter group by month order by num_tweets desc limit 15;
select count(*) as num_tweets, day from twitter group by day order by num_tweets desc limit 15;
select count(*) as num_tweets, year from twitter group by year order by num_tweets desc limit 15;
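One setup detail the slide does not spell out (an assumption on our part, but standard when tables are created through Hive): Impala keeps its own metadata cache, so it has to be told about new tables before it can query them, and collecting statistics helps it plan the aggregations above.

-- Run in impala-shell after the Hive load:
INVALIDATE METADATA twitter;  -- make the Hive-created table visible to Impala
COMPUTE STATS twitter;        -- gather table/column statistics for query planning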

Impala - Example Report
- Top 15 users by tweet count:
select count(*) as num_tweets, from_user from twitter group by from_user order by num_tweets desc limit 15;
- Tweets per month, busiest months first:
select count(*) as num_tweets, month from twitter group by month order by num_tweets desc limit 12;
- Tweets per month, in reverse month order:
select count(*) as num_tweets, month from twitter group by month order by month desc limit 12;

Mahout
How to use Mahout:
- Preprocess the dataset: remove stop words and other unnecessary text
- Import the dataset into HDFS (the pothole and shooting datasets are one tweet per line)
- Data-mine with the FPGrowth algorithm to find frequent patterns, specifying the word separator (in this case a space)
- View/dump the results
Deliverable: a tutorial (PDF) and a demo video (YouTube) showing how to run Mahout with FPGrowth on a dataset (tweets about potholes, a 20 MB CSV file)
Finally, FPGrowth was run on the actual cluster with a much larger dataset (tweets about shootings, an 8 GB CSV file)

Mahout - Issues with the 'Shooting' Dataset
- FPGrowth needs only the tweet text, so all other columns had to be removed; the 8 GB CSV file took far too long to preprocess via the Python script
  Solution: export only the tweet text directly from MySQL (a sketch follows below)
- Java heap space is exhausted when running Mahout with MapReduce on a large dataset
  Solution: reduce memory pressure by keeping fewer top-k patterns via the -k switch (from -k 50 to -k 10) and by raising the minimum support, i.e. how often a pattern must occur, via the -s switch (from -s 3 to -s 10)
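A sketch of the first workaround, assuming a MySQL table named shooting with the tweet text in a column named text (both names hypothetical); SELECT ... INTO OUTFILE writes the result straight to a server-side file, skipping the Python pass entirely.

-- Export only the tweet text column for Mahout:
SELECT text
FROM shooting
INTO OUTFILE '/tmp/shooting_text.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';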

Future Work
- Preprocess tweets as they arrive, rather than after the fact
- Leverage different technologies for specific tasks in the DLRL Cluster