BIG DATA ANALYTICS
A Presentation by Meg Monsen, Michael Leonard, and Eric Zeng
AGENDA
  Big Data Analytics and its Objectives
  Financial Impact
  Structured vs Unstructured Data
  Users of Big Data
  Relevant Technologies (Hadoop, MongoDB)
  Coding Examples
  Future of Analytics
WHAT IS BIG DATA AND WHY DOES IT MATTER?
  Defining Big Data Analytics
    Examining large sets of data
    Discovering patterns and trends
    Data warehouses are insufficient
  Purposes
    Uncovering hidden needs of customers
    Improving operational efficiency
BIG DATA & OPERATIONAL EFFICIENCY
  “By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.” – IBM
  Core Objectives
    Gain
    Analyze
    Apply
    Optimize
FINANCIAL IMPACT OF BIG DATA
  High cost of poor data quality
    $3.1 trillion to the US economy annually
    10-25% of US business revenues
  Opportunities for qualified analysts
    Business Analyst: $66,000
    Data Analyst: $60,000
    Data Scientist: $113,000
DIMENSIONS OF BIG DATA
  Essential Characteristics:
    Volume - Data quantity
    Velocity - Data speed
    Variety - Data types
STRUCTURED VS. UNSTRUCTURED DATA
  Structured Data
    Represented as text
    Transactional data, formal reports, accounting records of sales and costs
    Relational databases / data warehouse
    SQL
  Unstructured Data
    May be textual or non-textual
    Mobile usage, click stream activity, social media responses, genomic data
    No structured database / data lake
    NoSQL (Not only SQL), SQL Batch Queries
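The contrast between the two kinds of data can be sketched in a few lines of Python. This is only an illustration: the `sales` table, the click-stream events, and every field name below are made-up sample data, not taken from the presentation. The structured half uses a fixed schema queried with SQL; the unstructured half reads records whose fields vary from one record to the next.

```python
import json
import sqlite3

# Structured data: a fixed schema in a relational table, queried with SQL.
# (Table and column names are hypothetical sample data.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 3, 2.50), ("gadget", 1, 10.00), ("widget", 2, 2.50)],
)
revenue_by_item = dict(
    conn.execute("SELECT item, SUM(units * price) FROM sales GROUP BY item")
)

# Unstructured / loosely structured data: records share no fixed schema,
# so each one is inspected for the fields it happens to carry.
raw_events = [
    '{"user": "a1", "action": "click", "page": "/home"}',
    '{"user": "b2", "action": "post", "text": "Love the new gadget!"}',
    '{"user": "a1", "action": "click", "page": "/cart", "device": "mobile"}',
]
events = [json.loads(line) for line in raw_events]

clicks_per_user = {}
for event in events:
    if event.get("action") == "click":
        user = event["user"]
        clicks_per_user[user] = clicks_per_user.get(user, 0) + 1

print(revenue_by_item)
print(clicks_per_user)
```

The SQL query works because every row has the same columns; the event loop has to check each record because nothing guarantees which fields are present.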
ILLUSTRATIVE EXAMPLE
  Inventory Analyst
  Insurance Actuary
INTERPRETATIONS
  Big Data Analytics
  Structured Data
USERS OF BIG DATA
  Device manufacturers, ERP providers, and consulting firms make up 7 of the top 10 users of Big Data
  Based on a survey of large corporations conducted by Dell in 2014:
    55% now follow a Big Data strategy
    60% of Big Data projects involve a cloud
    32% involve real-time or near real-time processing
    22% use a data lake
    20% of projects are handled by outside consultants
HADOOP
  Free, Java-based programming framework
  Distributes the storage and processing of large data sets
  Started from the Google File System paper published in October 2003
  Development was furthered by Apache
  Named after the toy elephant of Doug Cutting’s son (hence the logo!)
WHEN TO USE (AND NOT USE) HADOOP
  YES!
    Analytics
    Search
    Data retention
    Log file processing
    Analysis of text, image, audio, and video content
    Recommendation systems, such as those on e-commerce websites
  NO!
    Low-latency or near real-time data access
    Large numbers of small files to process
    Scenarios with multiple writers or arbitrary writes within files
WHO USES HADOOP?
HADOOP FRAMEWORK
  Hadoop Common: Contains the shared libraries and utilities
  Hadoop Distributed File System (HDFS): Storage with high bandwidth
  Hadoop YARN: Resource-management platform
  Hadoop MapReduce: Programming model for data processing
HDFS
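As a rough illustration of how a distributed file system spreads one large file across machines, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. It is a toy model, not HDFS code: the 128 MB block size and replication factor of 3 are assumed here as commonly cited defaults, the node names are invented, and the round-robin placement is a simplification (a real cluster uses a more elaborate placement policy).

```python
import math

BLOCK_SIZE_MB = 128  # assumption: commonly cited default block size
REPLICATION = 3      # assumption: commonly cited default replication factor


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Each replica of a block lands on a different node.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 300 MB file on a hypothetical four-node cluster.
layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
for block, holders in layout.items():
    print(block, holders)
```

Because every block lives on more than one node, losing a single machine does not lose data, and different blocks can be read in parallel.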
MAPREDUCE
MAPREDUCE EXAMPLE
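The slide's original example is not reproduced in this text, so the following is a stand-in: a minimal word-count sketch of the MapReduce programming model, run in a single Python process. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are hypothetical; on a real cluster the framework would run the map and reduce steps on many machines and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))


# Hypothetical input documents.
documents = ["big data is big", "data lakes hold raw data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(
    reduce_phase(key, values) for key, values in shuffle(mapped).items()
)
print(word_counts)
```

The map step never needs to see more than one document and the reduce step never needs more than one key, which is what lets the work be spread over many nodes.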
MONGODB
MONGODB = “THE DATABASE FOR GIANT IDEAS”
  Cross-platform document-oriented database
  Open-source
  “The database for giant ideas”
  Founded in 2007; written to handle specific problems encountered at DoubleClick
  Classified as a NoSQL database
MONGODB EXAMPLE
  Also, we can practice! exercises/#PracticeOnline
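For readers without a MongoDB server at hand, the idea behind a document-oriented database can be sketched in plain Python. This is not the MongoDB API: the `find` helper below is a hypothetical stand-in, loosely modeled on querying by a filter document, and the `customers` records are invented sample data. The point is that documents in one collection need not share the same fields.

```python
# A "collection" of documents; note that the fields differ between records.
# (All names and values are hypothetical sample data.)
customers = [
    {"name": "Ana", "city": "Austin", "orders": [{"sku": "A1", "qty": 2}]},
    {"name": "Ben", "city": "Boston"},
    {"name": "Cy", "city": "Austin", "loyalty": {"tier": "gold"}},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in query.

    A toy stand-in for a document-store query, not a real driver call.
    """
    return [
        doc for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in query.items())
    ]


austin_names = [doc["name"] for doc in find(customers, {"city": "Austin"})]
gold_members = find(customers, {"loyalty": {"tier": "gold"}})
print(austin_names)
print(gold_members)
```

No schema change is needed to add the `loyalty` field to one customer, which is the flexibility that distinguishes a document store from a relational table.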
THE FUTURE OF BIG DATA ANALYTICS
ANY QUESTIONS?