Published by Cori Edwards, modified over 7 years ago
Big Data and Analytics Systems: Course Introduction
Fayé A. Briggs, PhD, Adjunct Professor and Intel Fellow (Retired), Rice University
Material mostly derived from "Mining of Massive Datasets" by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Stanford University
What is Big Data? What is (Analytics) Data Mining?
What system architectures? Knowledge discovery from data.
The Big Deal With Big Data
One petabyte is roughly: ~50x the content archived in the Library of Congress; ~13 years of continuous HD video viewing; what the world generated in ~11 seconds in 2012.
A transatlantic flight in a Boeing 777 produces about 30 terabytes of telemetry data.
Big Data: a new generation of technologies and architectures designed to economically extract value from very large VOLUMES of a wide VARIETY of data by enabling high-VELOCITY capture, discovery, and/or analysis, while ensuring VERACITY.
Machine-to-machine and sensed data from intelligent devices dwarfs web traffic, with a non-linear growth pattern.
Source: IDC's Digital Universe Study, sponsored by EMC, December 2012
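The petabyte comparisons above can be sanity-checked with back-of-envelope arithmetic. In the sketch below, the HD bitrate (~20 Mbit/s) and the 2012 worldwide data figure (~2.8 zettabytes created and replicated, per IDC) are assumed values for illustration, not taken from the slide:

```python
# Back-of-envelope check of the "one petabyte" comparisons.
PB = 10**15                      # bytes in a petabyte (decimal)

# ~13 years of continuous HD video
hd_bitrate = 20e6                # bits per second, assumed for 1080p HD
seconds = PB * 8 / hd_bitrate
years = seconds / (3600 * 24 * 365)
print(f"{years:.1f} years of HD video")          # ~12.7 years

# ~11 seconds of worldwide data generation in 2012
world_2012 = 2.8e21              # bytes created/replicated in 2012 (IDC estimate)
per_second = world_2012 / (365 * 24 * 3600)
print(f"1 PB generated every {PB / per_second:.0f} seconds")  # ~11 seconds
```

Both slide figures come out consistent under these assumptions.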
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
IoT Landscape Overview
By 2020, more than 200 billion devices will be connected to the cloud and to each other – commonly called the Internet of Things (IoT). IDC predicts that a third of these billions of devices will be intelligent devices. A large amount of legacy equipment is not connected, managed, or secured. We need to address the interoperability of legacy systems, to avoid the large cost of replacing interfaces that could otherwise be securely connected to the cloud (public/private).
The Opportunity With Extracting Knowledge From “Big Data”
"Economically" – productivity, efficiency, and better decisions
"Extract Value & Analysis" – through parallel computation and algorithms for better decisions
"Very Large Volumes Capture" – memory and storage
"Wide Variety" – sensors, software, and security
"High-Velocity" – networking and I/O
"Veracity" – correlation to reality (quality and correctness)
Big Data Usages Examples (Telecom/Financial/Search)
Telecom: calling patterns, signal processing, forecasting. Analyze switch/router data for call quality, call frequency, regional loads, etc. Act before problems happen; act before customer calls arrive.
Financial: trading behavior. Analyze real-time data to understand market behavior and the role of an individual institution/investor. Detect fraud, detect the impact of an event, detect the players.
Search engines: process the data collected by web bots in multiple dimensions to enhance the relevance of search.
Big Data impacts e-connected businesses through efficient capture, processing, and storage of huge amounts of data.
Big Data Usages Examples (E-Biz, Media)
Click-stream analysis: analysis of online user behavior. Develops comprehensive insight (business intelligence) to run effective strategies in real time.
Graph analysis: a term for discovering the online influencers in various walks of life. Enables a business to understand key players and devise effective strategies.
Lifecycle marketing: strategies to move away from spam/mass mail. Enables a business to spend money only on high-probability customers.
Revenue attribution: a term for analyzing data to accurately attribute revenue back to various marketing investments. A business can identify the effectiveness of a campaign to control expenses.
The Big Data phenomenon allows businesses to know, predict, and influence customer behavior!
We are at an Inflection Point in Healthcare - TRENDS
[Map: % of population over age 60 by country in 2050, bucketed 0-9%, 10-19%, 20-24%, 25-29%, 30+%. Source: United Nations, "Population Ageing 2002"]
Healthcare costs are RISING: a significant % of GDP.
Global AGING: the worldwide population aged 60+ grows from 10% to 21% by 2050.
U.S. healthcare BIG DATA value: $300 billion in value per year, ~0.7% annual productivity growth.
Healthcare-effectiveness analysis draws on medical histories, clinical information, imaging results, laboratory test results, physician interactions, preferred prescriptions, and patient accountability for taking those medications. Providers use graph analytics to assess many similar medical histories managed within a graph model that links patients to physicians, medications, presumed diagnoses, and providers. Providers can rapidly scan the graph to discover therapies used with other patients with similar characteristics (such as age, diagnostics, clinical history, associated risk factors, etc.) that have the most positive outcomes.
Data storage growth in healthcare: 2.9 EB to 13.5 EB in 4 years (35.3% CAGR). Largest growth areas: imaging (824 PB to 3.9 EB) and EMR (219 PB to 2.3 EB). Unstructured data / file services will add 2.3 EB of disk over the next 4 years.
Sources: McKinsey Global Institute Analysis; ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
Big Data in Healthcare
Where is the data coming from? 1. Pharma/life sciences; 2. Clinical decision support & trends (including diagnostic imaging); 3. Claims, utilization, and fraud; 4. Patient behavior/social networking.
How do we create value? (examples) 1. Personalized medicine; 2. Clinical decision support; 3. Enhanced fraud detection; 4. Analytics for lifestyle- and behavior-induced diseases.
Sources include: nurse or doctor notes from case management, disease management, or EMR records; streaming biometric data from medical home devices; comments from call centers (individuals, clients, and healthcare professionals); social media (blogs, Facebook, Twitter); web logs (online sales, member/patient self-service applications); comments on insurance applications, wellness online tools, and HRAs.
Business impact example: pattern matching for predictive health.
Business goals: millions of patients and 40,000+ providers, with 10 years of history; increase effectiveness of treatment and contain costs.
Analytics applications: find repeatable patterns in patient data and long-term illness diagnoses (hypertension, diabetes, cancer, etc.); correlate visits, diagnostics, and hospital/provider interactions across years of multiple visits; predict re-treatment risk and proactively address it (care recommendations prior to discharge) to avoid re-admission within Medicare's 30-day window.
Technical advantages: massively parallel data loading (reducing time from months/weeks to days); a host of analyses – path, statistical, relational, cluster, data transformation, text analysis, etc.
Optimizing consumer engagement: analyzing social media and web details to understand and influence consumer behavior.
Next-generation genetic sequencing. Business goal: assembling and aligning genomic sequences for rapid pattern detection and investigation, with reduced time to decision. Rapidly align millions of SNPs from a single sample to a known standard [MAP]; rapidly assemble alignment results and interrogate variations to determine base pairs at each position [REDUCE]. Leverage MAQ library functions on nCluster (merge, map, prepare). Benefits: rapid alignment and assembly of sequencing results, high-speed data ingestion, and the ability to parallelize work via multiple parameters (sample ID, chromosome, etc.).
More sources of data – members: phone calls, e-mails, SMS, photos, videos, blogs, tweets, Internet activities; treatment compliance, difficulties, and progress; support structure, activities of daily living, environment; alternative therapy, lifestyle, and preventive health behaviors; facility, provider, and kiosk encounters.
Devices: physiological, biological, and chemical sensors – vital signs/BP, weight, spirometry, glucose, pulse oximetry, ECG, EEG, activity level, body position, microarrays, environment, location (venues outside the medical facility); imaging – photo, video, voice, microscope, x-ray, ultrasound, CT, MRI, PET; EHRs (KP HealthConnect) and other IT systems; set-top boxes, smart TVs, gaming consoles, smart phones, smart meters, smart appliances, homes, cars, and worksites.
A lifetime of BP readings every 10 days plus metadata = 6 TB of storage. Some "quantified self" advocates take a BP reading every hour = 1,460 TB of storage (~6x the digitized content of the Library of Congress). BPs are but a small part of the clinical record, which also includes other biometrics, imaging files, lab and pharmacy data, insurance information, etc.
Organizations: employers, retailers, marketers, credit agencies, government agencies; researchers; regulators – quality data, scores, recommendations; professional specialty organizations – best-practice guidelines; the World Wide Web – global biomedical literature.
Penn State (planned): broke ground on a $54M data center dedicated to using big data to enhance medical research and patient care. It will allow the university to better gather and analyze large volumes of rich health data for effective prediction and modeling of diseases and disease behaviors. A digital rendering shows the 46,000 sq ft data center, scheduled to open in April 2016.
Source: McKinsey Global Institute Analysis
Big Data Solution for Healthcare
The solution stack, top to bottom: new healthcare applications (health info services, primary care, personal health management, aging society, personalized medicine, clinical decision support, cancer genomics); analytics and visualization (SQL-like query, medical imaging analytics, machine learning); data processing/management (medical images, medical records, genome data); and a distributed platform (storage optimization, security and privacy, imaging acceleration).
We mentioned earlier that Big Data can help the healthcare industry save $300B/year. To validate our two-step process, we did an exercise with healthcare experts at Intel. Our conclusion: we have the opportunity and the capability to deliver a healthcare solution stack. At the platform level, we can provide a healthcare-optimized appliance with storage optimization (such as compression and medical-image deduplication), security and privacy support, and medical-imaging acceleration. For better software, SSG can provide optimized and specialized Hadoop data stores for medical records, genome data, and medical images. The other critical component of better software is high-performance analytics; the cross-BU analytics taskforce will address vertical opportunities such as healthcare. Meanwhile, Intel Labs and the Intel Science and Technology Centers have a strong portfolio in machine learning and medical-imaging analytics. Better software enables the applications above, and these applications are not new to Intel: our vertical and field teams have long worked with health-information service providers in these areas. Now they expect more from Intel, and to them a vertically integrated solution stack is preferable to a hardware platform plus some enabling efforts.
Big Data & Analytics Goals & Challenges
Goal: advance the state of the art in the core technologies required to collect, store, preserve, manage, and analyze massive amounts of data, and to visualize the results for business intelligence.
Challenges: Large data sets – from, for example, the proliferation of sensors, patient records, and experiential medical data – are overwhelming data analysts, who lack the tools to efficiently store, process, analyze, and visualize the meta-data extracted from the vast amounts of raw data. There is a lack of IoT interface and data-delivery standards; fractured providers cause data-format wars (gerrymandering) and data loss from consolidations. System architectures that store this big data, extract meta-data on the fly, and provide the computing capability to analyze the data in real time pose major challenges.
The Big Data phenomenon allows businesses to know, predict, and influence customer behavior!
Data contains value and knowledge
Data Mining
But to extract the knowledge, data needs to be stored, managed, and ANALYZED to:
Predict actionable insight from the data
Create data products that have business impact
Communicate relevant visuals to influence the business
Build confidence in data value to drive business decisions
Carlos Somohano, Data Science, London: What does a Data Scientist Do?
By Requiring A Variety of Well Optimized Technologies Working Together
Extracting knowledge from this data will need dynamically adaptable, balanced systems – analytics leading to insight – spanning technologies to generate, transport, store, protect, and compute on data within a SW framework, with context and location awareness.
Key points: expect the unexpected, and plan for dynamically adaptable technology. Although some Big Data workloads will be well defined and highly predictable, many others will require rapidly scalable solutions that depend on high levels of automation. HW-SW co-design is a must.
Good news: Demand for Data Mining
What is Data Mining?
Given lots of data, discover patterns and models that are:
Valid: hold on new data with some certainty
Useful: should be possible to act on the item
Unexpected: non-obvious to the system
Understandable: humans should be able to interpret the pattern
Data Mining Tasks
Descriptive methods: find human-interpretable patterns that describe the data. Example: clustering.
Predictive methods: use some variables to predict unknown or future values of other variables. Example: recommender systems.
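As a minimal illustration of a descriptive method, here is a toy 1-D k-means clustering sketch; the data, the choice of k, and the initialization scheme are invented for illustration:

```python
# Minimal 1-D k-means (k=2): clustering finds human-interpretable
# groups in raw values, the classic descriptive method.
def kmeans_1d(xs, k=2, iters=20):
    # crude init: pick k spread-out points from the sorted data
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:                                  # assign to nearest center
            i = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[i].append(x)
        centers = [sum(g) / len(g) if g else centers[i]  # recompute means
                   for i, g in enumerate(groups)]
    return sorted(centers)

# Two obvious groups: values near 1 and values near 10
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
print(kmeans_1d(data))   # centers near [1.0, 10.0]
```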
Meaningfulness of Analytic Answers
A risk with "data mining" is that an analyst can "discover" patterns that are meaningless. Statisticians call this Bonferroni's principle: roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
Meaningfulness of Analytic Answers
Example: we want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.
10^9 people being tracked, over 1,000 days
Each person stays in a hotel 1% of the time (1 day out of 100)
Hotels hold 100 people (so 10^5 hotels)
If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious? Expected number of "suspicious" pairs of people: 250,000 – too many combinations to check. We need additional evidence to find "suspicious" pairs of people in some more efficient way.
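The 250,000 figure follows from a short expected-value computation, sketched here using the slide's parameters:

```python
from math import comb

# Parameters from the slide
people = 10**9        # people being tracked
days = 1000           # days of observation
p_hotel = 0.01        # probability a given person is in a hotel on a given day
hotels = 10**5        # number of hotels (10^9 * 0.01 guests / 100 per hotel)

# Probability two specific people are in the SAME hotel on one specific day:
# both are in some hotel (0.01 * 0.01), and they pick the same one (1/hotels)
p_same_hotel_same_day = p_hotel * p_hotel / hotels      # = 1e-9

# "Suspicious" = same hotel on (at least) two different days
pairs_of_people = comb(people, 2)   # ~5e17
pairs_of_days = comb(days, 2)       # ~5e5

expected_suspicious = pairs_of_people * pairs_of_days * p_same_hotel_same_day**2
print(round(expected_suspicious))   # ~250,000
```

Even with zero real events, random coincidences alone produce a quarter-million "hits" – Bonferroni's principle in action.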
What matters when dealing with data?
Challenges: usage, quality, context, streaming, scalability
Data operators: collect, prepare, represent, model, reason, visualize
Data modalities: ontologies, structured data, networks, text, signals, multimedia
Data Mining: Cultures
Data mining overlaps with:
Databases: large-scale data, simple queries
Machine learning: small data, complex models
CS theory: (randomized) algorithms
Different cultures:
To a DB person, data mining is an extreme form of analytic processing – queries that examine large amounts of data; the result is the query answer.
To an ML person, data mining is the inference of models; the result is the parameters of the model.
In this class we will review both!
Comp620 Coverage
This class overlaps with machine learning, statistics, artificial intelligence, databases, and systems architecture, but stresses:
Scalability (big data)
Algorithms
Computing architectures
Handling large data
Visualization
What will we learn?
Learn to mine different types of data:
Data is high dimensional
Data is a graph
Data is infinite/never-ending
Data is labeled
Learn to use different models of computation:
MapReduce
Streams and online algorithms
Single machine in-memory
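The MapReduce model can be sketched on a single machine: map emits key-value pairs, the framework groups them by key, and reduce aggregates each group. A minimal word-count sketch (the classic example; the documents here are made up):

```python
from collections import defaultdict

def map_fn(doc):
    # map: emit (word, 1) for every word in the document
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # reduce: sum the counts collected for one key
    return (word, sum(counts))

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                          # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)         # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

docs = ["big data big systems", "data mining"]
print(mapreduce(docs))   # {'big': 2, 'data': 2, 'systems': 1, 'mining': 1}
```

On a cluster, the map and reduce calls run in parallel across machines and the grouping happens over the network, but the programming model is exactly this.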
What will we learn?
Review real-world problems:
Recommender systems
Market basket analysis
Spam detection
Duplicate document detection
Review various "tools":
Linear algebra (SVD, recommender systems, communities)
Optimization (stochastic gradient descent)
Dynamic programming (frequent itemsets)
Hashing (LSH, Bloom filters)
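As a taste of the hashing "tools", here is a toy Bloom filter sketch: set membership in O(k) bit probes, with no false negatives but some false positives. The sizes (m, k) and the salted-SHA-256 hashing scheme are illustrative choices, not a production design:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m          # the filter is just m bits

    def _hashes(self, item):
        # derive k hash positions by salting one digest with an index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._hashes(item))

bf = BloomFilter()
bf.add("spam.example.com")
print(bf.might_contain("spam.example.com"))   # True: never a false negative
print(bf.might_contain("rice.edu"))           # almost surely False here
```

With one item inserted, at most 3 of 1024 bits are set, so a random other key collides on all k positions with negligible probability.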
How It All Fits Together
High-dimensional data: locality-sensitive hashing, clustering, dimensionality reduction
Graph data: PageRank, SimRank, community detection, spam detection
Infinite data: filtering data streams, web advertising, queries on streams
Machine learning: SVM, decision trees, perceptron, kNN
Apps: recommender systems, association rules, duplicate document detection
How do you want that data?
I ♥ data
About the Course
2015 Comp620 Course Staff
Instructor: TBD
Office hours: TBD
Wish I had TAs!
Course Logistics
Course website: TBD
Lecture slides: to be posted to a Rice U website (TBD)
Readings: the book Mining of Massive Datasets by J. Leskovec, A. Rajaraman, and J. Ullman, available free online.
Work for the Course
(1+)4 longer homeworks: 40%. Theoretical and programming questions. HW0 (a Hadoop tutorial) has just been posted. Assignments take lots of time – start early!
How to submit? Homework write-up: Stanford students submit in class or in the Gates submission box; SCPD students submit write-ups via SCPD, attaching the HW cover sheet (and the SCPD routing form). Upload code: put the code for one question into one file and submit at: