Scaling Big Data Mining Infrastructure: The Twitter Experience

Slides:

Advertisements

Similar presentations

Building LinkedIn’s Real-time Data Pipeline

Advertisements

Hive: A data warehouse on Hadoop

Web-Enabling the Warehouse Chapter 16. Benefits of Web-Enabling a Data Warehouse Better-informed decision making Lower costs of deployment and management.

Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.

DEV-42: Achieving Real-time BAM with OpenEdge ®, Sonic ™, and Apama ® Eric DebeijBart Schouw Business Development Manager Senior Product Consultant.

Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.

Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.

INTRODUCTION TO DBS Database: a collection of data describing the activities of one or more related organizations DBMS: software designed to assist in.

Fall CSE330/CIS550: Introduction to Database Management Systems Prof. Susan Davidson Office: 278 Moore Office hours: TTh

Extracting value from grey literature Processes and technologies for aggregating and analysing the hidden Big Data treasure of the organisations.

-Active Directory is the brain of the Microsoft windows Server Network. -It’s a database that keeps track of huge amount of stuffs and gives us a centralized.

Data Analytics Challenges Some faults cannot be avoided Decrease the availability for running physics Preventive maintenance is not enough Does not take.

Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.

Databases and Database User ch1 Define Database? A database is a collection of related data.1 By data, we mean known facts that can be recorded and that.

Petr Škoda, Jakub Koza Astronomical Institute Academy of Sciences

A presentation on ElasticSearch

James A. Senn’s Information Technology, 3rd Edition

Information Retrieval in Practice

ADVANCED HOSTING Adrian Newby, CTO.

Connected Infrastructure

Integrating ArcSight with Enterprise Ticketing Systems

Chapter 19: Network Management

Organizations Are Embracing New Opportunities

Integrating ArcSight with Enterprise Ticketing Systems

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CLOUD COMPUTING

Web Engineering CS-4513 Prepared By: Junaid Hassan Lecturer at UOS M.B.Din Campus

Chapter 1: Introduction to Systems Analysis and Design

An Open Source Project Commonly Used for Processing Big Data Sets

INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER

Status and Challenges: January 2017

THE OSI MODEL By: Omari Dasent.

Mobile Application Development

Overview – SOE PatchTT November 2015.

A Warehousing Solution Over a Map-Reduce Framework

Software Design and Architecture

Connected Infrastructure

Hive Mr. Sriram

MID-SEM REVIEW.

Central Florida Business Intelligence User Group

Hadoop EcoSystem B.Ramamurthy.

Lesson 1: Introduction to Trifacta Wrangler

Chapter 10: Process Implementation with Executable Models

NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.

Big Data The huge amount of data being collected and stored about individuals, items, and activities and to the process of drawing useful information from.

Big Data - in Performance Engineering

Introduction to PIG, HIVE, HBASE & ZOOKEEPER

Data Warehousing and Data Mining

Overview of big data tools

Interpret the execution mode of SQL query in F1 Query paper

Chapter 7 –Implementation Issues

Distributed Systems through Web Services

Data Warehousing in the age of Big Data (1)

Middleware, Services, etc.

Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin |

Chapter 1: Introduction to Systems Analysis and Design

Charles Tappert Seidenberg School of CSIS, Pace University

Enterprise Integration

TN19-TCI: Integration and API management using TIBCO Cloud™ Integration

Chapter 1: Introduction to Systems Analysis and Design

MapReduce: Simplified Data Processing on Large Clusters

EAST MDSplus Log Data Management System

Games Development 2 Entity / Architecture Review

Best Practices in Higher Education Student Data Warehousing Forum

Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.

UNIT 6 RECENT TRENDS.

Presentation transcript:

Scaling Big Data Mining Infrastructure: The Twitter Experience Sun Weize

Index Schema is important, but not sufficient to represent the whole data set. Build platform from heterogeneity of various component.

Current trend of big data Tremendous explosion in the amount of data. Increasingly sophisticated data types. Open-source software is playing an increasingly important role. TRUMP?

The big data mining cycle

Service Architecture and logging Twitter is powered by many loose-coordinated services. Most services are not aware of the details of other services, and communicate through API. Data scientists and software engineers are two groups of people.

Exploratory data analysis Problem: data collected from users may be stale due to system crashes or unreasonable action of users. Problem: Outliers exists, and hurt the result. Solution: sanity checking.

Data mining Twitter uses Pig script.

Production and Other Considerations It is not sufficient to solve the problem once–we must set up recursive workflows. At Twitter, refinements are assessed via A/B testing.

Summary: big data mining cycle. Generated data Sanity check & preprocess Data mining Adapt into production

What should academics care? We suggest the most impactful research is not incrementally better data mining algorithms, but techniques to help data scientists “gork” the data.

Schemas and Beyond

Don’t use MySQL as a log Too much written traffic in the user end Scaling challenge Database is designed for mutable data. Evolving schemas and model is painful in database

The log transport problem Twitter uses Scribe Ascribe daemon runs on each production host and is responsible for sending local log data to the aggregator. The aggregator merge per-category stream and send to HDFS.

The log transport problem Within Twitter, data in the main Hadoop data warehouse are deposited in per- category, per-hour directories. (/logs/category/YYYY/MM/DD/HH)

From Plain Text to JSON Problem: parsing plain-text log message is slow. Regular expression is also slow. JSON is suitable for small enterprise. Problem with JSON: userid? User_id? Uid? Problem with JSON: int? double? String?

Structured representation Twitter uses Apache Thrift to address the format of logged messages. One benefits is that you can define the physical representation of the log by Thrift!

Looking beyond Schemas A complete data catalog is needed! HCataLog provides a basic table abstraction for files stored in HDFS. The challenge occurs when we attempt to reconstruct user sessions. (need session tokens, which is usually not required by APIs). “Client events” is an attempt to address the session issue. Basic idea of “client events”: establish a Thrift schema for all data sharing the same fields.

Ad Hoc Data Mining How to mine the huge data with limited resource? The traditional way is to feed the data batch by batch. Problem: a batch of data may be too few for machine learning. Traffic from Pig to machine learning software. Approach used by Twitter: Integrating machine learning components into Pig.