MalStone: Towards a Benchmark for Analytics on Large Data Clouds. Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman, Steve Vejcik (Open Data Group, 400 Lathrop Ave, Suite 90, River Forest, IL). KDD ’10, July 25–28, 2010, Washington, DC, USA
OUTLINE 0. ABSTRACT 1. INTRODUCTION 2. Common Elements 3. MalStone A & B 4. MalGen 5. THREE IMPLEMENTATIONS 6. EXPERIMENTAL STUDIES 7. DISCUSSION 8. RELATED WORK 9. SUMMARY
0. ABSTRACT Terasort, MalStone, MalGen
1. INTRODUCTION Data mining for clouds: tools such as HBase, Apache Pig, Hive and ZooKeeper exist, but there are no benchmarks similar to Terasort for comparing two large data clouds that support building analytic models on large datasets. We propose MalStone for this purpose and describe the implementation of MalGen, a data generator for MalStone.
2. Common Elements Time stamps; sites (e.g. Web sites, computers, network devices); entities (e.g. visitors, users, flows). Log files fill disks, many, many disks. Behavior occurs at all scales. Want to identify phenomena at all scales. Need to group “similar behavior”. Need to do statistics (not just sorting).
2. Common Elements Abstract the problem using site-entity logs.
Example | Sites | Entities
Measuring online advertising | Web sites | Consumers
Drive-by exploits | Web sites | Computers (identified by cookies or IP)
Compromised systems | Compromised computers | User accounts
3. MalStone A & B MalStone Benchmark: a benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing. It is a stylized analytic computation that is easy to implement in MapReduce and its generalizations. Code to generate the required synthetic data is available from code.google.com/p/malgen
3. MalStone A & B MalStone A computes $\rho_j$ for all sites $j$ in the log files. MalStone B computes $\rho_{j,t}$ for sites $j$ in the log files and time windows $t$.
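3. MalStone A & B Let $A_j$ be the set of entities that visit site $j$ during the exposure window, and let $B_j$ be the set of all entities $e_i \in A_j$ that become marked at any time in the monitor window.
3. MalStone A & B $B_j \subseteq A_j$ is the set of entities that become marked at any time during the monitor window.
3. MalStone A & B The statistic is $\rho_j = |B_j| / |A_j|$, which equals 1/2 in the pictured example.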
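The per-site ratio (marked visitors divided by all visitors) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each record is a hypothetical tuple (site, entity, is_marked), where is_marked already says whether the entity became marked during the monitor window. The record layout and the name malstone_a are invented for this example and are not MalGen's format.

```python
from collections import defaultdict

def malstone_a(records):
    """Return {site: marked visitors / all visitors}.

    records: iterable of (site, entity, is_marked) tuples (assumed layout).
    """
    visitors = defaultdict(set)  # entities that visited each site
    marked = defaultdict(set)    # visitors of each site that became marked
    for site, entity, is_marked in records:
        visitors[site].add(entity)
        if is_marked:
            marked[site].add(entity)
    return {site: len(marked[site]) / len(visitors[site]) for site in visitors}

# Tiny hand-made log: site1 has two visitors, one of them marked.
records = [
    ("site1", "e1", True),
    ("site1", "e2", False),
    ("site2", "e1", True),
]
ratios = malstone_a(records)
```

A MalStone B style variant would key the two dictionaries by (site, time window) instead of by site alone.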
4. MalGen Tens of millions of sites. Hundreds of millions of entities. Billions of events. Most sites have few events; some sites have many events. Most entities visit few sites; some entities visit many sites.
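The skew described here (most sites with few events, a few with many) is typical of heavy-tailed distributions. The following is a minimal sketch that assumes Zipf-like weights to reproduce that shape at toy scale. It is not MalGen's actual algorithm, and generate_events and its parameters are hypothetical names chosen for illustration.

```python
import random
from collections import Counter

def generate_events(n_events, n_sites, n_entities, seed=0):
    """Draw (site, entity) pairs with Zipf-like popularity (assumed model)."""
    rng = random.Random(seed)
    # Item with index k-1 gets weight 1/k, so low indices are much more popular.
    site_weights = [1.0 / k for k in range(1, n_sites + 1)]
    entity_weights = [1.0 / k for k in range(1, n_entities + 1)]
    sites = rng.choices(range(n_sites), weights=site_weights, k=n_events)
    entities = rng.choices(range(n_entities), weights=entity_weights, k=n_events)
    return list(zip(sites, entities))

events = generate_events(n_events=10_000, n_sites=100, n_entities=1_000)
site_counts = Counter(site for site, _ in events)
```

Counting events per site in site_counts shows a few very busy sites and a long tail of quiet ones, which is the property the slide asks of the generated logs.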
4. MalGen For generating site-entity log files
5. THREE IMPLEMENTATIONS (1) Hadoop HDFS, Hadoop Streams and Python. (2) Hadoop HDFS and MapReduce. (3) Sector and Sphere UDFs (User Defined Functions).
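All three implementations rest on the same idea: group log records by site, then compute a ratio per group. As a rough illustration of that decomposition, here is a plain Python sketch that simulates the map, shuffle and reduce phases in memory. It is not the code of any of the three implementations, and the record layout (site, entity, is_marked) and the function names are assumptions.

```python
from itertools import groupby
from operator import itemgetter

def map_record(record):
    """Map step: emit the site as key, the rest of the record as value."""
    site, entity, is_marked = record
    yield site, (entity, is_marked)

def reduce_site(site, values):
    """Reduce step: ratio of marked visitors to all visitors for one site."""
    visitors, marked = set(), set()
    for entity, is_marked in values:
        visitors.add(entity)
        if is_marked:
            marked.add(entity)
    return site, len(marked) / len(visitors)

def run_job(records):
    """Run map, an in-memory sort standing in for the shuffle, then reduce."""
    pairs = [kv for record in records for kv in map_record(record)]
    pairs.sort(key=itemgetter(0))  # stands in for the framework's shuffle/sort
    return dict(
        reduce_site(site, (value for _, value in group))
        for site, group in groupby(pairs, key=itemgetter(0))
    )

job_result = run_job([
    ("a", "e1", True),
    ("b", "e3", True),
    ("a", "e2", False),
])
```

In a real cluster the sort and grouping are done by the middleware across nodes; only the two small functions would be supplied by the user.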
6. EXPERIMENTAL STUDIES
MalStone B, Sector/Sphere v min. # Nodes: 20 nodes. # Records: 10 billion. Size of dataset: 1 TB. Tests done on the Open Cloud Testbed.
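7. DISCUSSION Hadoop Streams does not require the mapper and reducer to be written against Hadoop's Java MapReduce API. Any executable that reads standard input and writes standard output, such as a Python program, can be invoked by Hadoop Streams.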
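To make this concrete, here is a sketch of the kind of Python program that can be run this way: it reads lines from standard input and writes tab-separated key/value lines to standard output. The pipe-delimited input layout (site|entity|flag) is an assumption made for this example and is not taken from the MalGen documentation; run_mapper is a hypothetical name.

```python
import io
import sys

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Read 'site|entity|flag' lines (assumed layout), emit 'site<TAB>entity,flag'."""
    for line in stdin:
        fields = line.rstrip("\n").split("|")
        if len(fields) != 3:
            continue  # skip malformed lines instead of failing the whole task
        site, entity, flag = fields
        stdout.write(f"{site}\t{entity},{flag}\n")

# Local check with in-memory streams in place of a cluster job.
out = io.StringIO()
run_mapper(io.StringIO("site1|e1|1\nbad line\nsite2|e2|0\n"), out)
mapper_output = out.getvalue()
```

Used as a script, the file would simply call run_mapper() with no arguments so that it works on the real standard streams.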
8. RELATED WORK In 2008, Hadoop set the Terasort record at 209 sec, beating the previous record of 297 sec. Terasort has since been replaced by Minute Sort, because 1 TB can now be sorted in about 1 min. [Map-Reduce for Machine Learning on Multicore] uses MapReduce, but does not describe a computation similar to the MalStone statistic.
9. SUMMARY MalGen creates large amounts of synthetic site-entity log data. Performance depends on which cloud middleware is used for the computation.