1 Large-Scale Machine Learning at Twitter Jimmy Lin and Alek Kolcz Twitter, Inc. Presented by: Yishuang Geng and Kexin Liu.

Slides:



Advertisements
Similar presentations
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Advertisements

Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Matei Zaharia Large-Scale Matrix Operations Using a Data Flow Engine.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
HOL9396: Oracle Event Processing 12c
Overview of Web Data Mining and Applications Part I
Hadoop Ecosystem Overview
SM STRATA PRESENTATION Tim Garnto - SVP Engineering, edo Interactive Rob Rosen – Big Data Field Lead, Pentaho.
Big data analytics with R and Hadoop Chapter 5 Learning Data Analytics with R and Hadoop 데이터마이닝연구실 김지연.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Real-Time Stream Processing CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
Clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Tyson Condie.
Hosted on the Powerful Microsoft Azure Platform, Advent Countdown Lets Companies Run Reliable and Scalable Holiday Marketing Campaigns MICROSOFT AZURE.
As news analysis tool SNATZ TECHNOLOGY. Main terms used in presentation Term – a phrase, which system uses for training NLP algorithms. Summary – a phrase,
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
` tuplejump The data engineering platform. A startup with a vision to simplify data engineering and empower the next generation of data powered miracles!
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Why I LIKE the Facebook Database… Sharon Viente May 2010.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Sentiment Analysis with Incremental Human-in-the-Loop Learning and Lexical Resource Customization Shubhanshu Mishra 1, Jana Diesner 1, Jason Byrne 2, Elizabeth.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Matthew Winter and Ned Shawa
TACTIC | Workflow: Project Management OSS on Microsoft Azure Helps Enterprises to Create Streamline, Manage, and Track Digital Content MICROSOFT AZURE.
Copyright © 2006, GemStone Systems Inc. All Rights Reserved. Increasing computation throughput with Grid Data Caching Jags Ramnarayan Chief Architect GemStone.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
From Use Cases to Implementation 1. Structural and Behavioral Aspects of Collaborations  Two aspects of Collaborations Structural – specifies the static.
Next Generation of Apache Hadoop MapReduce Owen
Part III BigData Analysis Tools (Storm) Yuan Xue
Dato Confidential 1 Danny Bickson Co-Founder. Dato Confidential 2 Successful apps in 2015 must be intelligent Machine learning key to next-gen apps Recommenders.
From Use Cases to Implementation 1. Mapping Requirements Directly to Design and Code  For many, if not most, of our requirements it is relatively easy.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Microsoft Ignite /28/2017 6:07 PM
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
4/19/ :02 AM © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN.
PROTECT | OPTIMIZE | TRANSFORM
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
International Conference on Data Engineering (ICDE 2016)
Original Slides by Nathan Twitter Shyam Nutanix
Spark Presentation.
Speculative Region-based Memory Management for Big Data Systems
9/18/2018 Big Data Analytics with HDInsight Module 6 – Storm Essentials Asad Khan Nishant Thacker Principal PM Manager Technical Product Manager.
Introduction to Spark.
Boyang Peng, Le Xu, Indranil Gupta
CS110: Discussion about Spark
Overview of big data tools
Charles Tappert Seidenberg School of CSIS, Pace University
Computational Advertising and
Igor Stančin, Alan Jović to: {igor.stancin,
Presentation transcript:

1 Large-Scale Machine Learning at Twitter Jimmy Lin and Alek Kolcz Twitter, Inc. Presented by: Yishuang Geng and Kexin Liu

2 Outline Is twitter big data? How can machine learning help twitter? Existing challenges? Existing literature of large-scale learning Overview of machine learning Twitter analytic stack Extending pig Scalable machine learning Sentiment analysis application

2

2 The Scale of Twitter Twitter has more than 140 million active users 340 million Tweets are sent per day 50 million people log into Twitter every day Over 400 million monthly unique visitors to twitter.com Large scale infrastructure of information delivery Users interact via web-ui, sms, and various apps Over 55% of our active users are mobile users Real-time redistribution of content Plus search and other services living on top of it –E.g., 2.3B search queries/day (26K/sec)

2 Support for user interaction Search – Relevance ranking User recommendation – WTF or Who To Follow Content recommendation – Relevant news, media, trends (other) problems we are trying to solve Trending topics Language detection Anti-spam Revenue optimization User interest modeling Growth optimization

2

2 Literature Traditionally, the machine learning community has assumed sequential algorithms on data fit in memory (which is no longer realistic) Few publication on machine learning work-flow and tool integration with data management platform Google – adversarial advertisement detection Predictive analytic into traditional RDBMSes Facebook – business intelligence tasks LinkedIn – Hadoop based offline data processing But they are not for machine learning specificly. Spark ScalOps But they result in end-to-end pipeline.

2 Contribution Provided an overview of Twitter’s analytic stack Describe pig extension that allow seamless integration of machine learning capability into production platform Identify stochastic gradient descent and ensemble methods as being particularly amenable to large-scale machine learning Note that, No fundamental contributions to machine learning

2

2 Hadoop cluster HDFS Real-time processes Batch processes Database Application log Other sources Serialization Protocol buffer /Thrift Oink: Aggregation query Standard business intelligence tasks Ad hoc query One-off business request Prototypes of new function Experiment by analytic group

2 Maximizing the use of Hadoop We cannot afford too many diverse computing environments Most of analytics job are run using Hadoop cluster –Hence, that’s where the data live –It is natural to structure ML computation so that it takes advantage of the cluster and is performed close to the data Seamless scaling to large datasets Integration into production workflows

2 Core libraries: Core Java library Basic abstractions similar to existing packages (weka, mallet, mahout) Lightweight wrapper Expose functionalities in Pig

2 Training models: Storage function

2 Shuffling data:

2 Using models:

2 Scalable Machine learning Techniques for large-scale machine learning Stochastic gradient descent Ensemble method

2 Stochastic gradient descent Gradient Descent Compute the gradient in the loss function by optimizing value in dataset. This method will do the iteration for all the data in order to one a gradient value. Inefficient and everything in the dataset must be considered.

2 Stochastic gradient descent Approximating gradient depends on the value of gradient for one instance. Solve the iteration problem and it does not need to go over the whole dataset again and again. Stream the dataset through a single reduce even with limited memory resource. But when a huge dataset stream goes through a single node in cluster, it will cause network congestion problem. Solution ?

Ensemble Methods Classifier ensembles: high performance learner Performance: very well Some rely mostly on randomization -Each learner is trained over a subset of features and/or instances of the data Ensembles of linear classifiers Ensembles of decision trees (random forest) 2

2 Ensemble Methods

Example: Sentiment Analysis Emotion Trick  Test dataset: 1 million English tweets, minimum 20 letters-long Training data: 1 million, 10 million and 100 million English training examples Preparation: training and test sets contains equal number of positive and negative examples, removed all emoticons. 2

2

1 Twitter: STORM Nathan Marz Twitter, Inc. Presented by: James Forkey Distributed and Fault-Tolerant Real-Time Computation

2 Key Concepts of Storm: Tuples Ordered list of elements

2 Key Concepts of Storm: Streams Unbound sequence of tuples

2 Key Concepts of Storm: Spouts Source of streams

2 Key Concepts of Storm: Spouts Source of streams

2 Key Concepts of Storm: Spouts Source of streams Spouts can talk with: Queues Web logs API calls Event data

2 Key Concepts of Storm: Bolts Process tuples and create new streams Bolts can: Apply functions / transforms Filter Aggregation Stream joining or splitting Access DBs, APIs, etc

2 Key Concepts of Storm: Topologies A directed graph of spouts and bolts

2 Key Concepts of Storm: Topologies A directed graph of spouts and bolts

2 Key Concepts of Storm: Topologies A directed graph of spouts and bolts

2 Key Concepts of Storm: Tasks Processes which execute streams or bolts Spout [“sentence”] SplitSentence Bolt [“word”] WordCount Bolt [“word”, “count”]

2 Grouping Fields Grouping Groups tuples by specific named fields and routes (Very similar to Hadoops partitioning behavior) Shuffle Grouping tuples are randomly distributed across all of the tasks running the bolt

2 Grouping Fields Grouping Groups tuples by specific named fields and routes (Very similar to Hadoops partitioning behavior) Shuffle Grouping tuples are randomly distributed across all of the tasks running the bolt

2 Key Concept of Storm: Distributed RPC for communication

26 Many thanks! Questions?