Beijing Institute of Technology December 2015


Applications on Spark
Prof. Harold Liu

Who Is Using Spark These Days?
https://spark-summit.org/

According to the figure, more than 1,000 companies have put Spark into production, including traditional manufacturers such as Toyota and O2O companies such as Uber and Airbnb. Spark's user base has thus expanded beyond Internet companies into traditional industries. Many big data platform distributors, including the established Hadoop distributors Hortonworks and Cloudera, have begun to deploy Spark, which will further accelerate its spread.

Open Source Spark Community
The figure shows that the number of open source contributors to Spark grew steadily from 2010 to 2014. Many Chinese companies and developers are among these contributors. For example, the largest Spark cluster in the world, with 8,000 nodes, runs at Tencent, and the largest amount of data processed by a single job is 1 PB, a record held jointly by Alibaba and Databricks.

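Architecture of Spark
Unlike Hadoop with its MapReduce and HDFS, Spark consists of Spark Core and the application frameworks built on top of it: Spark SQL, Spark Streaming, MLlib and GraphX. These handle interactive queries, stream processing, machine learning and graph computation, respectively. The enterprise applications of Spark described in the following slides focus on practical uses in these areas.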

Entertainment: Tencent
Company Background: The biggest social service provider in China.
Data Background: By the end of 2015, QQ had more than 800 million monthly active users and WeChat more than 600 million. Together they generate over 200 TB of data every day.
Business Requirement: Over 90% of the data needs to be processed online.

Tencent Distributed Data Warehouse
The Tencent Distributed Data Warehouse (TDW) collects product-level data from all products and provides data storage and analysis services. TDW supports PB-level data storage and computing. It has two parts: offline computing with MapReduce (M/R) and online computing with Storm.

Hadoop vs. Spark on M/R

Running Mode   Compute Resource       Running Time (min)   Cost (Slot*s)
MapReduce      200 Map + 100 Reduce   120                  693872
Spark          200 Executor           33                   396000
Spark          400 Executor           21                   504000

The table compares Hadoop MapReduce with Spark running the same MapReduce algorithm. Spark is much faster: with 200 executors its running time is roughly a quarter of Hadoop's, and adding more executors shortens it further. Overall, data mining jobs usually involve complex processing logic, and the traditional Hadoop M/R framework has serious performance problems with them. Spark's iterative, in-memory computing greatly reduces both running time and compute cost for these jobs.
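The Cost figures for the two Spark runs are consistent with executors multiplied by running time in seconds. The slides do not state this formula; it is inferred from the numbers, and the function names below are illustrative. A minimal sketch that checks the arithmetic:

```python
# Figures taken from the comparison above.
# The cost formula is an inference from the numbers, not stated in the slides.
MAPREDUCE_MINUTES = 120
MAPREDUCE_COST = 693_872  # Slot*s, as reported


def spark_cost(executors: int, minutes: int) -> int:
    """Slot-seconds, assuming each executor holds one slot for the whole run."""
    return executors * minutes * 60


def speedup(minutes: int) -> float:
    """Ratio of the MapReduce running time to a Spark running time."""
    return MAPREDUCE_MINUTES / minutes


for executors, minutes in [(200, 33), (400, 21)]:
    cost = spark_cost(executors, minutes)
    print(executors, minutes, cost,
          round(speedup(minutes), 2),
          round(cost / MAPREDUCE_COST, 2))
```

Both Spark rows reproduce the reported costs (396000 and 504000). Doubling the executors shortens the run but raises the slot cost, although both Spark runs stay below the MapReduce cost.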

E-commerce: Taobao
Company Background: The biggest C2C e-commerce company in China and a pioneer user of Spark (since 2012).
Data Background: As of 2014, Taobao had over 500 million registered members and 120 million active members. Its turnover on November 11, 2014 exceeded 90 billion. Its various businesses generate TB-level data every day.
Business Requirement: For the past few years Taobao has used Yun Ti, a platform based on Hadoop. Hadoop runs into many problems with iterative computing, however, which brought Spark into consideration.

Spark in Taobao
The figure shows the history of Spark at Taobao, which has used Spark since 2012, when the project was still very young. Figure labels: 10-node cluster; Yarn version 0.23.7; 200-node Yarn cluster.

Spark Development Process in Taobao
Before a job is put on the production servers, it is tested on test servers. The code is then merged into the local repository or pushed to the open source community.

Recommender System in Taobao
The recommender system combines Spark, Spark MLlib and Spark Streaming. It performs both offline and online analysis, covering most of the business requirements at Taobao.

Test of K-Means Algorithm
Increasing each worker's memory cuts the running time, and increasing the number of workers improves performance further.
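The slides do not show the K-means code used in the test. As a reference point, here is a single-machine sketch of Lloyd's algorithm in plain Python. It is not Taobao's code and does not use Spark MLlib; the function names and parameters are illustrative. Every iteration rescans all points, which is one reason keeping the dataset in worker memory helps iterative jobs like this one.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(
                    tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # empty cluster: keep old center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers
```

For example, `kmeans([(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)], k=2)` converges to the centers `(1.0,)` and `(11.0,)`, in either order.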

Telecom: Telefonica
Company Background: Telefonica is a Spanish telecommunications company that provides comprehensive services including mobile phone, internet, data and cable television services.
Data Background: Telefonica is the biggest multinational enterprise in Spain and serves customers in over 40 countries. Its various businesses generate huge volumes of data.
Business Requirement: As data volume grows rapidly, network security problems such as DDoS attacks, SQL injection attacks and account theft have come to the fore. Using big data analysis to prevent cybercrime has become urgent for the company.

Why Spark?
Spark provides a full application stack (SQL, Streaming, MLlib, GraphX).
It is easy to use Spark to analyze both historical data and streaming data.
It supports various applications and data sources, so it can handle complex application scenarios.
SQL can be used to tap the power of Spark.
Spark has far fewer components than Hadoop.

Components of Spark and Hadoop
According to the figure, Spark has about half as many components as Hadoop. With fewer components, a Spark deployment can potentially produce far fewer errors.

Spark Production Architecture in Telefonica
Data collection: Kafka
Data pre-processing: Storm
Batch processing: Cassandra + Spark
Telefonica uses Kafka, a distributed message queue system, to collect data from various sources. Storm then consumes the data for pre-processing. Finally, the data is processed by Spark or saved in Cassandra.
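The three stages above can be illustrated with in-process stand-ins. This is only a sketch of the data flow: it uses no Kafka, Storm, Spark or Cassandra client, and the event fields (`src` for a source address) and function names are hypothetical.

```python
from collections import Counter, deque


def collect(raw_events):
    """Stand-in for the Kafka stage: buffer events from various sources."""
    return deque(raw_events)


def preprocess(queue):
    """Stand-in for the Storm stage: drop malformed records, normalise fields."""
    while queue:
        event = queue.popleft()
        if "src" in event:  # hypothetical field: source address of a request
            yield {"src": event["src"].strip().lower()}


def batch_process(events, store):
    """Stand-in for the Spark + Cassandra stage: aggregate and persist counts."""
    store.update(Counter(e["src"] for e in events))
    return store


raw = [{"src": "10.0.0.1 "}, {"src": "10.0.0.1"}, {"bad": True}, {"src": "10.0.0.2"}]
store = batch_process(preprocess(collect(raw)), Counter())
print(store.most_common(1))  # the busiest source, a crude signal for traffic spikes
```

Counting requests per source is just one example of an aggregate that the batch stage could compute for security analysis; the slides do not say which analyses Telefonica runs.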

Retail: Euclid
Company Background: Euclid Analytics is a geo-data analysis company that provides solutions to customers based on offline location information.
Data Background: Euclid mainly relies on WiFi devices to collect data from the physical world.
Business Requirement: Euclid's main business is location-based analysis services for its customers. By collecting in-store customer behavior data, it tries to understand customers' behavior and shopping habits and to predict their future behavior.

Retail Customer Features
Based on the data collected from WiFi devices, customers can be divided into three groups: frequent customers, pass-by customers and quick-leave customers. Some like to buy products, some spend a lot of time in the store, and some like to wander around a zone.

Analysis Procedure with Spark
First, WiFi devices collect mobile data from the signals that phones ping, which include the device MAC address, signal strength and other information. These data are then sent to the cloud and processed on a Spark cluster. Finally, Euclid's customers view the analysis results on the web.
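The grouping of customers from WiFi pings can be sketched in plain Python. The slides give no numeric cut-offs or rules, so the thresholds, the ping tuple layout and the "other" fallback label below are all hypothetical, and this single-machine version stands in for the Spark job.

```python
from collections import defaultdict

# Hypothetical thresholds; the slides give no numeric cut-offs.
IN_STORE_DBM = -65         # signal at or above this counts as "inside the store"
QUICK_LEAVE_SECONDS = 120  # longest in-store stay still counted as a quick leave
FREQUENT_DAYS = 3          # in-store on at least this many distinct days


def classify(pings):
    """pings: iterable of (mac, unix_seconds, signal_dbm). Returns {mac: label}."""
    in_store = defaultdict(lambda: defaultdict(list))  # mac -> day -> timestamps
    seen = set()
    for mac, ts, dbm in pings:
        seen.add(mac)
        if dbm >= IN_STORE_DBM:
            in_store[mac][ts // 86400].append(ts)
    labels = {}
    for mac in seen:
        days = in_store.get(mac, {})
        if not days:
            labels[mac] = "pass-by"       # detected, but never inside
        elif len(days) >= FREQUENT_DAYS:
            labels[mac] = "frequent"
        elif max(max(t) - min(t) for t in days.values()) <= QUICK_LEAVE_SECONDS:
            labels[mac] = "quick-leave"   # entered, but left almost at once
        else:
            labels[mac] = "other"         # fallback not named in the slides
    return labels
```

A device seen only at weak signal is labelled pass-by, while one seen inside on three different days is labelled frequent.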

Other Area: PubMatic
Company Background: PubMatic is an advertising company. It developed the first real-time advertising analysis system in the marketing field.
Data Background: PubMatic manages 6 PB of data across 6 geographically distributed data centers. Every day it serves 12 billion ads and handles 1,000 billion bids. Its system currently produces 22 TB of data.
Business Requirement: Because its ad data is complex and varied, PubMatic needs to process the data in real time.

System Architecture in PubMatic
As the figure shows, various data streams are fed into memory and processed by Spark. Finally, the data is saved in HDFS and Amazon S3.

Spark vs. Hive on Query Performance
With a data volume of 192 GB, the query takes 550 seconds on Spark and 850 seconds on Hive. As the data volume increases, Spark's running time is on average 40% less than Hive's.
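For the single 192 GB data point, the relative saving can be worked out as below. The 40% figure is an average over several data volumes, which the slide does not list individually, so only this one point can be checked. The function name is illustrative.

```python
def reduction(spark_seconds: float, hive_seconds: float) -> float:
    """Fraction by which Spark's running time is lower than Hive's."""
    return (hive_seconds - spark_seconds) / hive_seconds


r = reduction(550, 850)
print(f"{r:.1%}")  # prints 35.3%
```

At 192 GB the saving is therefore somewhat below the reported average.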

Effect of Using Spark in PubMatic
Spark supports both offline and online data processing. It has active community support and is compatible with the Hadoop ecosystem. By using Spark Streaming, Spark SQL and Spark MLlib together, PubMatic can deliver real-time ad services and business analysis reports to customers faster than ever before.