Scaling Big Data Mining Infrastructure: The Twitter Experience

Slides:



Advertisements
Similar presentations
Building LinkedIn’s Real-time Data Pipeline
Advertisements

Hive: A data warehouse on Hadoop
Web-Enabling the Warehouse Chapter 16. Benefits of Web-Enabling a Data Warehouse Better-informed decision making Lower costs of deployment and management.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
DEV-42: Achieving Real-time BAM with OpenEdge ®, Sonic ™, and Apama ® Eric DebeijBart Schouw Business Development Manager Senior Product Consultant.
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.
INTRODUCTION TO DBS Database: a collection of data describing the activities of one or more related organizations DBMS: software designed to assist in.
Fall CSE330/CIS550: Introduction to Database Management Systems Prof. Susan Davidson Office: 278 Moore Office hours: TTh
Extracting value from grey literature Processes and technologies for aggregating and analysing the hidden Big Data treasure of the organisations.
-Active Directory is the brain of the Microsoft windows Server Network. -It’s a database that keeps track of huge amount of stuffs and gives us a centralized.
Data Analytics Challenges Some faults cannot be avoided Decrease the availability for running physics Preventive maintenance is not enough Does not take.
Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.
Databases and Database User ch1 Define Database? A database is a collection of related data.1 By data, we mean known facts that can be recorded and that.
Petr Škoda, Jakub Koza Astronomical Institute Academy of Sciences
A presentation on ElasticSearch
James A. Senn’s Information Technology, 3rd Edition
Information Retrieval in Practice
ADVANCED HOSTING Adrian Newby, CTO.
Connected Infrastructure
Integrating ArcSight with Enterprise Ticketing Systems
Chapter 19: Network Management
Organizations Are Embracing New Opportunities
Integrating ArcSight with Enterprise Ticketing Systems
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CLOUD COMPUTING
Web Engineering CS-4513 Prepared By: Junaid Hassan Lecturer at UOS M.B.Din Campus
Chapter 1: Introduction to Systems Analysis and Design
Hadoop.
An Open Source Project Commonly Used for Processing Big Data Sets
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
Status and Challenges: January 2017
THE OSI MODEL By: Omari Dasent.
Mobile Application Development
Overview – SOE PatchTT November 2015.
WEB SERVICES.
A Warehousing Solution Over a Map-Reduce Framework
Software Design and Architecture
Connected Infrastructure
Hive Mr. Sriram
SQOOP.
MID-SEM REVIEW.
Central Florida Business Intelligence User Group
Hadoop EcoSystem B.Ramamurthy.
Lesson 1: Introduction to Trifacta Wrangler
Chapter 10: Process Implementation with Executable Models
NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.
Big Data The huge amount of data being collected and stored about individuals, items, and activities and to the process of drawing useful information from.
Big Data - in Performance Engineering
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Data Warehousing and Data Mining
Overview of big data tools
Interpret the execution mode of SQL query in F1 Query paper
Chapter 7 –Implementation Issues
Distributed Systems through Web Services
Data Warehousing in the age of Big Data (1)
Middleware, Services, etc.
Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin |
Chapter 1: Introduction to Systems Analysis and Design
Charles Tappert Seidenberg School of CSIS, Pace University
Enterprise Integration
Big DATA.
TN19-TCI: Integration and API management using TIBCO Cloud™ Integration
Chapter 1: Introduction to Systems Analysis and Design
MapReduce: Simplified Data Processing on Large Clusters
EAST MDSplus Log Data Management System
Games Development 2 Entity / Architecture Review
Best Practices in Higher Education Student Data Warehousing Forum
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
UNIT 6 RECENT TRENDS.
Big Data.
Presentation transcript:

Scaling Big Data Mining Infrastructure: The Twitter Experience Sun Weize

Index Schema is important, but not sufficient to represent the whole data set. Build platform from heterogeneity of various component.

Current trend of big data Tremendous explosion in the amount of data. Increasingly sophisticated data types. Open-source software is playing an increasingly important role. TRUMP?

The big data mining cycle

Service Architecture and logging Twitter is powered by many loose-coordinated services. Most services are not aware of the details of other services, and communicate through API. Data scientists and software engineers are two groups of people.

Exploratory data analysis Problem: data collected from users may be stale due to system crashes or unreasonable action of users. Problem: Outliers exists, and hurt the result. Solution: sanity checking.

Data mining Twitter uses Pig script.

Production and Other Considerations It is not sufficient to solve the problem once–we must set up recursive workflows. At Twitter, refinements are assessed via A/B testing.

Summary: big data mining cycle. Generated data Sanity check & preprocess Data mining Adapt into production

What should academics care? We suggest the most impactful research is not incrementally better data mining algorithms, but techniques to help data scientists “gork” the data.

Schemas and Beyond

Don’t use MySQL as a log Too much written traffic in the user end Scaling challenge Database is designed for mutable data. Evolving schemas and model is painful in database

The log transport problem Twitter uses Scribe Ascribe daemon runs on each production host and is responsible for sending local log data to the aggregator. The aggregator merge per-category stream and send to HDFS.

The log transport problem Within Twitter, data in the main Hadoop data warehouse are deposited in per- category, per-hour directories. (/logs/category/YYYY/MM/DD/HH)

From Plain Text to JSON Problem: parsing plain-text log message is slow. Regular expression is also slow. JSON is suitable for small enterprise. Problem with JSON: userid? User_id? Uid? Problem with JSON: int? double? String?

Structured representation Twitter uses Apache Thrift to address the format of logged messages. One benefits is that you can define the physical representation of the log by Thrift!

Looking beyond Schemas A complete data catalog is needed! HCataLog provides a basic table abstraction for files stored in HDFS. The challenge occurs when we attempt to reconstruct user sessions. (need session tokens, which is usually not required by APIs). “Client events” is an attempt to address the session issue. Basic idea of “client events”: establish a Thrift schema for all data sharing the same fields.

Ad Hoc Data Mining How to mine the huge data with limited resource? The traditional way is to feed the data batch by batch. Problem: a batch of data may be too few for machine learning. Traffic from Pig to machine learning software. Approach used by Twitter: Integrating machine learning components into Pig.