Big Data and Analytics Systems: Course Introduction


Big Data and Analytics Systems: Course Introduction
Fayé A. Briggs, PhD, Adjunct Professor and Intel Fellow (Retired), Rice University
Material mostly derived from “Mining Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Stanford University, http://www.mmds.org

What is Big Data? What is (Analytics) Data Mining?
What is Big Data?
What is (analytics) data mining?
What system architectures?
Knowledge discovery from data

The Big Deal With Big Data
One petabyte is:
~50x the content in the Library of Congress
~13 years to view as continuous HD video
~11 seconds to generate, at the 2012 rate of data creation
A transatlantic flight in a Boeing 777 produces about 30 terabytes of telemetry data.
Big Data: a new generation of technologies and architectures designed to economically extract value from very large VOLUMES of a wide VARIETY of data by enabling high-VELOCITY capture, discovery, and/or analysis, while ensuring VERACITY. (Source: IDC)
Notes: How big is big, and how do we put it in human terms? Start with 1 PB. It is 50x the volume of archived data in the Library of Congress, and it would take an individual 13 years to watch a petabyte of HD video. Yet the world generates this much information in 11 seconds, based on the amount of data created and replicated in 2012. Machine-to-machine and sensed data from intelligent devices dwarfs web traffic, and the growth pattern is nonlinear.
Sources: IDC's Digital Universe Study, sponsored by EMC, December 2012; http://blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data/
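As a rough sanity check on the petabyte figures, the arithmetic can be sketched as below. The decimal petabyte (10^15 bytes) and the 365.25-day year are assumptions made for illustration; the slide does not state which definitions IDC used.

```python
# Back-of-the-envelope check of the "one petabyte" comparisons.
# Assumptions (not stated on the slide): 1 PB = 10**15 bytes, 1 year = 365.25 days.
PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# If 1 PB of HD video takes 13 years to watch, the implied video bitrate is:
viewing_seconds = 13 * SECONDS_PER_YEAR
implied_bitrate_mbps = PETABYTE_BYTES * 8 / viewing_seconds / 1e6

# If the world generates 1 PB every 11 seconds, the implied yearly volume is:
petabytes_per_year = SECONDS_PER_YEAR / 11
zettabytes_per_year = petabytes_per_year / 1e6  # 1 ZB = 10**6 PB

print(f"Implied HD bitrate: {implied_bitrate_mbps:.1f} Mbit/s")
print(f"Implied data creation: {zettabytes_per_year:.2f} ZB/year")
```

Under these assumptions the implied bitrate is just under 20 Mbit/s, which is in the range of high-quality HD video, and the implied creation rate is a little under 3 zettabytes per year.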

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

IoT Landscape Overview
By 2020, more than 200 billion devices will be connected to the cloud and to each other; this is commonly called the Internet of Things (IoT)
IDC predicts that one third of these billions of devices will be intelligent devices
A large amount of legacy equipment is not connected, managed, or secured
Need to address the interoperability of legacy systems, to avoid the incredibly large cost of replacing them, with an I/F (interface) that can be securely connected to the cloud (public/private)

The Opportunity With Extracting Knowledge From “Big Data”
“Economically”: productivity, efficiency, and better decisions
“Extract Value & Analysis”: through parallel computations and algorithms for better decisions
“Very Large Volumes Capture”: memory and storage
“Wide Variety”: sensors, software, and security
“High-Velocity”: networking and I/O
“Veracity”: correlation to reality (quality and correctness)

Big Data Usage Examples (Telecom/Financial/Search)
Telecom: calling patterns, signal processing, forecasting
Analyze switch/router data for call quality, call frequency, regional loads, etc.
Act before problems happen. Act before customer calls arrive.
Financial: trading behavior
Analyze real-time data to understand market behavior and the role of individual institutions/investors
Detect fraud, detect the impact of an event, detect the players
Search engines
Process the data collected by web bots in multiple dimensions
Enhance the relevance of search
Big Data impacts e-connected businesses through efficient capture, processing, and storage of huge amounts of data

Big Data Usage Examples (E-Biz, Media)
Click stream analysis: analysis of online users' behavior; develop comprehensive insight (business intelligence) to run effective strategies in real time
Graph analysis: discovering the online influencers in various walks of life; enables a business to understand key players and devise effective strategies
Lifecycle marketing: strategies to move away from spam/mass mail; enables a business to spend money only on high-probability customers
Revenue attribution: analyzing the data to accurately attribute revenue back to various marketing investments; a business can identify the effectiveness of a campaign to control expenses
The Big Data phenomenon allows businesses to know, predict, and influence customer behaviors!

We Are at an Inflection Point in Healthcare: Trends
[Map: % of population over age 60 in 2050, in bands of 0-9%, 10-19%, 20-24%, 25-29%, and 30+%; worldwide average age 60+: 21%. Source: United Nations “Population Aging 2002”]
Healthcare costs are RISING: a significant % of GDP
Global AGING: share of population aged 60+ growing from 10% to 21% by 2050
U.S. healthcare BIG DATA value: $300 billion in value/year, ~0.7% annual productivity growth
Healthcare effectiveness analysis draws on medical histories, clinical information, imaging results, laboratory test results, physician interactions, preferred prescriptions, and patient accountability for taking those medications.
Providers use graph analytics to assess many similar medical histories managed within a graph model that links patients to physicians, medications, presumed diagnoses, and providers. Providers can rapidly scan through the graph to discover the therapies with the most positive outcomes among other patients with similar characteristics (such as age, diagnostics, clinical history, associated risk factors, etc.).
Data storage growth in healthcare: 2.9 XB to 13.5 XB in 4 years (35.3% CAGR)
Largest growth areas: imaging, 824 PB to 3.9 XB; EMR, 219 PB to 2.3 XB
Unstructured data / file services will add 2.3 XB worth of disk over the next 4 years
Sources: McKinsey Global Institute Analysis; ESG Research Report 2011, North American Health Care Provider Market Size and Forecast

Big Data in Healthcare
Where is the data coming from?
1. Pharma/Life Sciences
2. Clinical Decision Support & Trends (includes Diagnostic Imaging)
3. Claims, Utilization and Fraud
4. Patient Behavior/Social Networking
How do we create value? (examples)
1. Personalized Medicine
2. Clinical Decision Support
3. Enhanced Fraud Detection
4. Analytics for Lifestyle and Behavior-induced Diseases
Example data sources:
Nurse or doctor notes from case management, disease management, or EMR records
Streaming biometric data from medical home devices
Comments from call centers: individuals, clients, and health care professionals
Social media: blogs, Facebook, Twitter
Web logs: online sales, member/patient self-service applications
Comments on insurance applications, online wellness tools, HRAs
Pattern Matching for Predictive Health (slide columns: Business Impact, Business Goals, Analytics Applications, Technical Advantages)
Millions of patients and 40,000+ providers, 10 years of history
Increase effectiveness of treatment and contain costs
Find repeatable patterns in patient data and long-term illness diagnosis (hypertension, diabetes, cancer, etc.)
Correlate visits, diagnostics, and hospital/provider interactions across years of multiple visits
Predict re-treatment risk and address it proactively (care recommendations prior to discharge), to avoid re-admission within Medicare’s 30-day window
Host of analyses: path, statistical, relational, cluster, data transformation, text analysis, etc.
Massively parallel data loading (reduce time from months/weeks to days)
Optimizing Consumer Engagement
Analyzing social media and web details to understand and influence consumer behavior
Next Generation Genetic Sequencing
Business goal: assembling and aligning genomic sequences for rapid pattern detection and investigation
Reduced time to decision
Rapidly assemble alignment results and interrogate variations to determine base pairs at each position [REDUCE]
Rapidly align millions of SNPs from a single sample to a known standard [MAP]
Leverage MAQ library functions on nCluster (merge, map, prepare)
Rapid alignment and assembly of sequencing results
High-speed data ingestion
Ability to parallelize work via multiple parameters (sample-id, chromosome, etc.)
More Sources of Data
Members:
Phone calls, emails, SMS, photos, videos, blogs, tweets, Internet activities
Treatment compliance, difficulties, and progress
Support structure, activities of daily living, environment
Alternative therapy, lifestyle, and preventive health behaviors
Facility, provider, and kiosk encounters
Devices:
Physiological, biological, and chemical sensors: VS/BP, weight, spirometry, glucose, pulse oximetry, ECG, EEG, activity level, body position, microarrays, environment, location (venues outside the medical facility)
Imaging: photo, video, voice, microscope, x-ray, ultrasound, CT, MRI, PET
EHRs (KP HealthConnect) and other IT systems
Set-top boxes, smart TVs, gaming consoles, smart phones, smart meters, smart appliances, homes, cars, and worksites
A lifetime of BP readings every 10 days plus metadata = 6 TB of storage
Some ‘quantified self’ advocates take a BP reading every hour = 1,460 TB of storage (~6X the digitized content of the Library of Congress)
BPs are but a small part of the clinical record, which also includes other biometrics, imaging files, lab and pharmacy data, insurance information, etc.
Organizations:
Employers, retailers, marketers, credit agencies, government agencies
Research:
Regulators: quality data, scores, recommendations
Professional specialty organizations: best practice guidelines
World Wide Web
Global biomedical literature
Penn State (planned): broke ground on a new $54M data center dedicated to making use of big data to enhance medical research and patient care. It will allow the university to better gather and analyze large volumes of rich health data for effective prediction and modeling of diseases and disease behaviors. (McKinsey Global Institute Analysis)
[Image: digital rendering of the new 46,000 sq ft data center, to open April 2016]

Big Data Solution for Healthcare
[Solution stack diagram, top to bottom]
Health Info Services, Primary Care, Personal Health Management, Aging Society
New Healthcare Applications: Personalized Medicine, Clinical Decision Support, Cancer Genomics
Analytics and Visualization: SQL-like Query, Medical Imaging Analytics, Machine Learning
Data Processing/Management: Medical Images, Medical Records, Genome Data
Distributed Platform: Storage Optimization, Security and Privacy, Imaging Acceleration
Notes: We mentioned earlier that Big Data can help the healthcare industry save $300B/year. To validate our two-step process, we did an exercise with some healthcare experts at Intel. Our conclusion is that we have the opportunity and the capability to deliver a healthcare solution stack. At the platform level, we can provide a healthcare-optimized appliance with storage optimization (such as compression and medical image deduplication), security and privacy support, and medical imaging acceleration. For better software, SSG is able to provide optimized and specialized Hadoop data stores for medical records, genome data, and medical images. The other critical component of better software is high-performance analytics. The cross-BU analytics taskforce will address vertical opportunities such as healthcare. Meanwhile, Intel Labs and the Intel Science and Technology Centers have a strong portfolio in machine learning and medical imaging analytics. Better software enables applications such as … These applications are not new to Intel. Our vertical and field teams have been working with health information service providers in these areas for a long time. Now they are expecting more from Intel. To them, a vertically integrated solution stack is preferable to a hardware platform plus some enabling efforts.

Big Data & Analytics Goals & Challenges
Goal: advance the state-of-the-art core technologies required to collect, store, preserve, manage, and analyze massive amounts of data, and to visualize results for business intelligence.
Challenges:
Large data sets from, for example, the proliferation of sensors, patient records, and experiential medical data are overwhelming data analysts, who lack the tools to efficiently store, process, analyze, and visualize the metadata resulting from these vast amounts of data.
Lack of IoT interface and data delivery standards. Fractured providers are causing data format wars (gerrymandering) and data loss from consolidations.
Designing a systems architecture that stores this big data, extracts metadata on the fly, and provides the computing capability to analyze the data in real time poses major challenges.
The Big Data phenomenon allows businesses to know, predict, and influence customer behaviors!

Data contains value and knowledge

Data Mining
But to extract the knowledge, data needs to be:
Stored
Managed
And ANALYZED to:
Predict actionable insight from the data
Create data products that have business impact
Communicate relevant visuals to influence business
Build confidence in data value to drive business decisions

Carlos Somohano, Data Science, London: What does a Data Scientist Do?

Extracting Knowledge From This Data Will Need Dynamically Adaptable Balanced Systems
By Requiring a Variety of Well-Optimized Technologies Working Together
[Diagram labels: Analytics Leading to Insight; Protect, Store, Transport, Compute, Generate; SW Framework; Context & Location]
Key points: Expect the unexpected. Plan for dynamically adaptable technology. Although some Big Data workloads will be well defined and highly predictable, many others will require rapidly scalable solutions that depend on high levels of automation. HW-SW co-design is a must.

Good news: Demand for Data Mining

What is Data Mining?
Given lots of data, discover patterns and models that are:
Valid: hold on new data with some certainty
Useful: should be possible to act on the item
Unexpected: non-obvious to the system
Understandable: humans should be able to interpret the pattern

Data Mining Tasks
Descriptive methods: find human-interpretable patterns that describe the data. Example: clustering
Predictive methods: use some variables to predict unknown or future values of other variables. Example: recommender systems

Meaningfulness of Analytic Answers
A risk with “data mining” is that an analyst can “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Meaningfulness of Analytic Answers
Example: we want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.
10^9 people being tracked
1,000 days
Each person stays in a hotel 1% of the time (1 day out of 100)
Hotels hold 100 people (so 10^5 hotels)
If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
Expected number of “suspicious” pairs of people: 250,000
That is too many combinations to check. We need some additional evidence to find “suspicious” pairs of people in a more efficient way.
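The figure of about 250,000 pairs can be reproduced with a short calculation. The sketch below assumes 10^9 people, 1,000 days, a 1% chance of a hotel stay on any day, and 10^5 hotels chosen uniformly at random; the variable names are illustrative only.

```python
from math import comb

num_people = 10**9
num_days = 1_000
p_hotel_visit = 0.01      # each person is in some hotel 1 day out of 100
num_hotels = 10**5        # 1% of 10^9 people per day / 100 people per hotel

# Probability that two given people are both in a hotel on a given day,
# and that it is the same hotel (hotel chosen uniformly at random).
p_same_hotel_one_day = p_hotel_visit * p_hotel_visit / num_hotels

# Probability that this happens on two given, distinct days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

# Number of ways to pick a pair of people and a pair of days.
pairs_of_people = comb(num_people, 2)
pairs_of_days = comb(num_days, 2)

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_same_hotel_two_days
print(f"Expected suspicious pairs: {expected_suspicious_pairs:,.0f}")
```

The product is roughly 5 x 10^17 pairs of people times 5 x 10^5 pairs of days times 10^-18, which is on the order of 250,000 coincidences from purely random behavior.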

What matters when dealing with data?
Challenges: Usage, Quality, Context, Streaming, Scalability
Data Modalities: Ontologies, Structured, Networks, Text, Multimedia, Signals
Data Operators: Collect, Prepare, Represent, Model, Reason, Visualize

Data Mining: Cultures
Data mining overlaps with:
Databases: large-scale data, simple queries
Machine learning: small data, complex models
CS Theory: (randomized) algorithms
Different cultures:
To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the query answer.
To an ML person, data mining is the inference of models. The result is the parameters of the model.
In this class we will review both!
[Diagram labels: CS Theory, Machine Learning, Data Mining, Database systems]

Comp620 Coverage
This class overlaps with machine learning, statistics, artificial intelligence, databases, and systems architecture, but places more emphasis on:
Scalability (big data)
Algorithms
Computing architectures
Review of handling large data
Visualization
[Diagram labels: Statistics, Machine Learning, Data Mining, Database systems, Computer Systems]

What will we learn?
Learn to mine different types of data:
Data is high dimensional
Data is a graph
Data is infinite/never-ending
Data is labeled
Learn to use different models of computation:
MapReduce
Streams and online algorithms
Single machine in-memory
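To give a flavor of the MapReduce model of computation, here is a minimal single-process sketch of the classic word count. It only imitates the map, shuffle, and reduce phases in plain Python; it is not Hadoop code, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "data mining of big data"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the programmer still writes only the two small functions.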

What will we learn?
Review real-world problems:
Recommender systems
Market Basket Analysis
Spam detection
Duplicate document detection
Review various “tools”:
Linear algebra (SVD, Rec. Sys., Communities)
Optimization (stochastic gradient descent)
Dynamic programming (frequent itemsets)
Hashing (LSH, Bloom filters)
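As one small illustration of the hashing tools, a Bloom filter answers "have I seen this item?" using a fixed-size bit array, with no false negatives and a tunable rate of false positives. The sketch below is a simplified version written for this summary; the class name, default sizes, and the SHA-256 based double hashing are assumptions, not material from the course.

```python
import hashlib

class BloomFilter:
    """Simplified Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Derive several positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("alice@example.com")
print("alice@example.com" in seen)
```

An item that was added always tests as present. An item that was never added usually tests as absent, but may occasionally collide with bits set by other items, which is the false positive the data structure trades for its small memory footprint.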

How It All Fits Together
High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank, SimRank; Community Detection; Spam Detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: SVM, Decision Trees, Perceptron, kNN
Apps: Recommender systems, Association Rules, Duplicate document detection

How do you want that data?
I ♥ data

About the Course

2015 Comp620 Course Staff
Instructor:
Office hours: TBD
Wish I had TAs!

Course Logistics
Course website: TBD
Lecture slides: to be posted to a Rice U website (TBD)
Readings: the book Mining of Massive Datasets by J. Leskovec, A. Rajaraman, and J. Ullman
Free online: http://www.mmds.org

Work for the Course
(1+)4 longer homeworks: 40%
Theoretical and programming questions
HW0 (Hadoop tutorial) has just been posted
Assignments take lots of time. Start early!!
How to submit?
Homework write-up: Stanford students: in class or in the Gates submission box. SCPD students: submit write-ups via SCPD. Attach the HW cover sheet (and SCPD routing form).
Upload code: put the code for 1 question into 1 file and submit at http://snap.stanford.edu/submit/