Adam Lech Joseph Pontani Matthew Bollinger DLRL CLUSTER CS4624 Spring 2014 Virginia Tech Blacksburg, VA Client: Sunshin Lee Adam Lech Joseph Pontani Matthew Bollinger
Outline Deliverables Data Preprocessing Hive HBase Impala Mahout Future Work
Deliverables Tutorials Video demos Report generation HBase Hive Impala Mahout
Data preprocessing Needed to preprocess data for meaningful analysis source strip excess URL information <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a> tapbots.com tweet date separate into fields for year, month, and date tweet text remove stop words (‘the’, ‘and’, etc.) Input CSV into Python script Dumped out to CSV file for use
Preprocessing Challenges Tweets are from two different sources twitter-stream twitter-search Different formats <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a> <a href="http://twitter.com/download/android">Twitter for Android</a> Full of weird characters that threw script off Large datasets take FOREVER to process
Hive Importing pothole dataset to Hive Statements similar to loading data in MySql Data still stored in file Queries transformed into MapReduce jobs Hive table MySql Table dumped into imported into CSV FILe imported into Hadoop Distributed File System
HBase Importing pothole dataset to HBASE HBase requires KEY_ROW for exactly one column HBase organizes fields into column family MapReduce jobs performed - ImportTsv tool used HBase table MySql Table imported into dumped into CSV File imported into Hadoop Distributed File System
Impala Setup tables in Impala with new datasets Create benchmark queries to test Impala vs Hive select count(*) as num_tweets, from_user from twitter group by from_user order by num_tweets desc limit 15; select count(*) as num_tweets, source from twitter group by source order by num_tweets desc limit 15; select count(*) as num_tweets, created_at from twitter group by created_at order by num_tweets desc limit 15; select count(*) as num_tweets, month from twitter group by month order by num_tweets desc limit 15; select count(*) as num_tweets, day from twitter group by day order by num_tweets desc limit 15; select count(*) as num_tweets, year from twitter group by year order by num_tweets desc limit 15;
Impala - Example Report select count(*) as num_tweets, from_user from twitter group by from_user order by num_tweets desc limit 15; select count(*) as num_tweets, month from twitter group by month order by num_tweets desc limit 12; select count(*) as num_tweets, month from twitter group by month order by month desc limit 12;
Mahout How to use Mahout preprocess dataset remove ‘stop words’ and other unnecessary text import dataset to HDFS pothole and shooting dataset are 1 tweet per line datamine using FPGrowth algorithm to get frequent patterns specify word separator, in this case a space view/dump results Deliverable: Tutorial (PDF) and Demo video (Youtube) tweets about potholes, 20MB CSV file how to run Mahout with FPGrowth on a dataset Finally, run FPG on actual cluster with a much larger dataset tweets about shootings, 8GB CSV file
Mahout Issues with ‘Shooting’ dataset FPG only needs the tweet text, how to preprocess dataset to remove all other columns, 8GB CSV file took forever to preprocess via Python script solution was to just export only the tweet from MySQL Java heap space is exhausted when running Mahout using mapreduce on a large dataset lower the requested heap size (top k values are kept) when running FPG via the k switch (from -k 50 to -k 10) and increase minimum groupings via s switch (from -s 3 to -s 10)
Future Work Preprocess tweets on their way in, not after the fact Leverage different technologies for specific tasks in DLRL Cluster