Applications on Spark
Prof. Harold Liu, Beijing Institute of Technology, December 2015

Who Is Using Spark These Days?

According to the figure, over 1,000 companies have put Spark into production, including well-known traditional manufacturers such as Toyota and O2O (online-to-offline) companies such as Uber and Airbnb.
This indicates that Spark's user base has expanded beyond Internet companies into traditional industries.
Many big data platform distributors, including established Hadoop distributors such as Hortonworks and Cloudera, are beginning to deploy Spark, which will further accelerate its adoption.

Open Source Spark Community
The figure shows that the number of contributors has grown rapidly since 2010. Among these contributors, many Chinese organizations and developers have shown enthusiasm for Spark. The largest Spark cluster, with over 8,000 nodes, is at Tencent, and the largest amount of data processed in a single job is 1 PB, a record held by Alibaba and Databricks.

Architecture of Spark

Entertainment: Tencent
Company Background: The largest social service provider in China.
Data Background: By the end of 2015, QQ had more than 800 million monthly active users and WeChat had more than 600 million. Together they generate over 200 TB of data every day.
Business Requirement: Over 90% of the data needs to be processed online.

Tencent Distributed Data Warehouse (TDW)
TDW collects all product-level data and provides data storage and analysis services. It supports PB-level data storage and computing. It has two parts: offline computing with MapReduce (M/R) and online computing with Storm.

Hadoop vs. Spark on M/R
The table compares three running modes by compute resource, running time (min), and cost (slot*s): MapReduce with 200 Map + 100 Reduce, Spark with 200 executors, and Spark with 400 executors.
Spark runs much faster than Hadoop: its running time is only a quarter of Hadoop's, and adding more executors speeds up the computation further. Overall, the traditional Hadoop M/R framework has serious performance problems on data mining workloads, while Spark handles them well thanks to its iterative, in-memory computing.

E-commerce: Taobao
Company Background: The largest C2C e-commerce company in China and a pioneering Spark user (since 2012).
Data Background: As of 2014, Taobao had over 500 million registered members and 120 million active members. Its turnover on November 11 exceeded 90 billion. Its various businesses generate TB-level data every day.
Business Requirement: For the past few years, Taobao has used Yun Ti, a platform based on Hadoop. However, Hadoop runs into many problems with iterative computing, which drew Taobao's attention to Spark.

Spark in Taobao
The figure shows the history of Spark at Taobao, which has used Spark since 2012, when the project was still very young. The figure's labels mention the YARN version and a 200-node YARN cluster.

Spark Development Process in Taobao
Before a job is deployed to production servers, it is tested on test servers. The code is then merged into the local repository or pushed to the open source community.

Recommender System in Taobao
The recommender system combines Spark, Spark MLlib, and Spark Streaming. It performs both offline and online analysis and covers most of Taobao's business requirements.

Test of K-Means Algorithm
Increasing each worker's memory cuts the running time, and increasing the number of workers also improves performance.
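
To illustrate why K-Means is sensitive to memory, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not Spark MLlib and not the code used in the Taobao test; the function names and the sample points are assumptions made for illustration. The point to notice is that every iteration scans the whole dataset again, which is why keeping the data in memory between iterations pays off.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, initial_centroids, max_iterations=20):
    """Lloyd's algorithm on 2-D points.

    Returns the final centroids and the number of full passes
    made over the dataset.
    """
    centroids = list(initial_centroids)
    passes = 0
    for _ in range(max_iterations):
        # Assignment step: one full pass over the dataset.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        passes += 1

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                n = len(members)
                new_centroids.append(
                    (sum(x for x, _ in members) / n,
                     sum(y for _, y in members) / n)
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, passes


# Hypothetical sample data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, passes = kmeans(points, [(0, 0), (10, 10)])
print(centroids, passes)  # [(0.5, 0.5), (10.5, 10.5)] 2
```

Even this tiny example needs two passes over the data. On a disk-based framework each pass would reread the input, whereas an in-memory engine can reuse the cached dataset.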

Telecom: Telefonica
Company Background: Telefonica is a Spanish telecommunications company that provides comprehensive services, including mobile phone, Internet, data, and cable television services.
Data Background: Telefonica is the largest multinational enterprise in Spain and serves customers in over 40 countries. Its various businesses generate huge volumes of data.
Business Requirement: As the volume of data grows rapidly, network security problems such as DDoS attacks, SQL injection attacks, and account theft have come to the fore. Using big data analysis to prevent cybercrime has become urgent for the company.

Why Spark?
Spark provides a full stack of components (SQL, Streaming, MLlib, GraphX).
It is easy to use Spark to analyze both historical data and streaming data.
It supports a variety of applications and data sources, which helps with complex application scenarios.
SQL can be used to tap the power of Spark.
Spark has far fewer components than Hadoop.

Components of Spark and Hadoop
According to the figure, Spark has about half as many components as Hadoop. Using Spark can therefore potentially lead to far fewer errors, because there are fewer components.

Spark Production Architecture in Telefonica
Telefonica uses a distributed message queue system, Kafka, to collect data from various sources. The data is then consumed by Storm for pre-processing. Finally, the data is processed by Spark or saved in Cassandra.
Data collection: Kafka
Data pre-processing: Storm
Batch processing: Cassandra + Spark
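
The three stages above can be sketched in plain Python to show how data flows between them. This is only an illustration of the flow: an in-memory queue stands in for Kafka, a generator for Storm, and a list plus a dictionary for Cassandra and Spark. The event fields `src_ip` and `action` are hypothetical and do not come from the presentation.

```python
from collections import deque


def collect(sources):
    """Stand-in for Kafka: merge events from all sources into one queue."""
    queue = deque()
    for source in sources:
        queue.extend(source)
    return queue


def preprocess(queue):
    """Stand-in for Storm: drop malformed events and normalize fields."""
    while queue:
        event = queue.popleft()
        if "src_ip" in event and "action" in event:
            yield {"src_ip": event["src_ip"],
                   "action": event["action"].lower()}


def batch_process(events, store):
    """Stand-in for Spark + Cassandra: save each event to the store and
    count how often each (source IP, action) pair occurs."""
    counts = {}
    for event in events:
        store.append(event)
        key = (event["src_ip"], event["action"])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical events from two sources; one event is malformed.
sources = [
    [{"src_ip": "10.0.0.1", "action": "LOGIN_FAIL"}, {"bad": 1}],
    [{"src_ip": "10.0.0.1", "action": "login_fail"}],
]
store = []
counts = batch_process(preprocess(collect(sources)), store)
print(counts)  # {('10.0.0.1', 'login_fail'): 2}
```

Keeping collection, pre-processing, and batch processing as separate stages means each one can be scaled or replaced independently, which is the main appeal of this kind of architecture.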

Retail: Euclid
Company Background: Euclid Analytics is a geo-data analysis company that provides solutions to customers based on offline location information.
Data Background: Euclid mainly relies on WiFi devices to collect data from the physical world.
Business Requirement: Euclid's main job is to provide location-based analysis services for its customers. By collecting shopper behavior data, it tries to understand shoppers' behavior and shopping habits and to predict their future behavior.

Retail Customer Features
Using the data collected from WiFi devices, customers can be divided into three groups: frequent customers, pass-by customers, and quick-leave customers. Some customers like to buy products, some spend a lot of time in the store, and some like to wander around a zone.

Analysis Procedure with Spark
First, WiFi devices collect mobile data from pinged signals, including the device MAC address, signal strength, and other information. These data are then sent to the cloud and processed on a Spark cluster. Finally, customers can view the analysis results on the web.
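
As a rough sketch of how pings could be turned into the customer groups mentioned earlier, the plain-Python example below groups pings by MAC address and applies simple rules. The thresholds, the ping tuple layout, and the extra "other" label are all assumptions made for illustration; the presentation does not describe Euclid's actual rules, and a real deployment would run this kind of grouping on a Spark cluster.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
IN_STORE_RSSI_DBM = -70    # signal at least this strong counts as inside
QUICK_LEAVE_SECONDS = 300  # longest visit shorter than this: quick-leave
FREQUENT_VISIT_DAYS = 3    # in-store on at least this many days: frequent


def classify_devices(pings):
    """Label each device from (mac, day, second_of_day, rssi_dbm) pings."""
    in_store = defaultdict(lambda: defaultdict(list))
    seen = set()
    for mac, day, second, rssi in pings:
        seen.add(mac)
        if rssi >= IN_STORE_RSSI_DBM:
            in_store[mac][day].append(second)

    labels = {}
    for mac in seen:
        days = in_store.get(mac, {})
        if not days:
            labels[mac] = "pass-by"      # never detected inside the store
        elif len(days) >= FREQUENT_VISIT_DAYS:
            labels[mac] = "frequent"
        elif max(max(s) - min(s) for s in days.values()) < QUICK_LEAVE_SECONDS:
            labels[mac] = "quick-leave"
        else:
            labels[mac] = "other"        # outside the three groups
    return labels


# Hypothetical pings for four devices.
pings = [
    ("aa", 1, 100, -80), ("aa", 1, 130, -85),
    ("bb", 1, 100, -50), ("bb", 1, 160, -55),
    ("cc", 1, 0, -40), ("cc", 2, 0, -40), ("cc", 3, 0, -40),
    ("dd", 1, 0, -40), ("dd", 1, 3600, -45),
]
for mac, label in sorted(classify_devices(pings).items()):
    print(mac, label)
```

Grouping by device first and applying the rules afterwards maps naturally onto a group-by-key step followed by a map step on a cluster.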

Other Area: PubMatic
Company Background: PubMatic is an advertising company. It developed the first real-time advertisement analysis system in the marketing field.
Data Background: PubMatic has 6 geographically distributed data centers with 6 PB of data to manage. Every day it serves 12 billion ads and handles 1,000 billion bids. Its system currently produces 22 TB of data.
Business Requirement: Because its ad data is complex and varied, PubMatic needs to process the data in real time.

System Architecture in PubMatic
As the figure shows, various data streams are fed into memory and processed by Spark. Finally, the data is saved in HDFS and Amazon S3.

Spark vs. Hive on Query Performance
When the data volume is 192 GB, the query takes 550 seconds on Spark, while Hive needs 850 seconds for the same query. As the data volume increases, Spark's running time is on average 40% less than Hive's.
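
The 192 GB data point can be checked with a little arithmetic. The short Python sketch below (the function name is just for illustration) computes the relative time saving from the two running times quoted above.

```python
def relative_saving(spark_seconds, hive_seconds):
    """Fraction of Hive's running time that Spark saves."""
    return (hive_seconds - spark_seconds) / hive_seconds


saving = relative_saving(550, 850)
print(f"{saving:.1%}")  # 35.3%
```

At 192 GB the saving is about 35%, slightly below the quoted 40% average, which suggests that the saving is larger at other data volumes in the comparison.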

Effect of Using Spark in PubMatic
Spark supports both offline and online data processing. It has active community support and is compatible with the Hadoop ecosystem. By using Spark Streaming, Spark SQL, and Spark MLlib together, PubMatic can deliver real-time ad services and business analysis reports to customers faster than ever before.