Lecture 1: Introduction


Big Data Analysis

Big Data in the News https://www.google.com/trends/explore#q=big%20data&date=1%2F2009%2084m&cmpt=q&tz=Etc%2FGMT%2B5

Growth of Big Data (Source: http://editorial.designtaxi.com/editorial-images/news-data14082015/big.jpg)

How Much Data is Out There? Source: http://www.emc.com/leadership/digital-universe/index.htm

How Much Is a Zettabyte?
- 1 zettabyte (ZB) = 1000 exabytes = $10^6$ petabytes = $10^9$ terabytes = $10^{12}$ gigabytes.
- Storing 1 ZB takes about $10^{21} / (5 \times 10^9) = 2 \times 10^{11}$ DVDs, i.e. 200 billion, assuming each DVD stores about 5 GB of data.
- Each DVD case is about 1 cm thick, and the distance from Earth to the moon is 384,000 km = $3.84 \times 10^{10}$ cm.
- If you stack all the DVDs that contain 1 ZB of data, the stack is about 3 times the distance to the moon and back ($2 \times 10^{11}$ cm divided by $7.68 \times 10^{10}$ cm is roughly 2.6).
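The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that takes the slide's own assumptions at face value (about 5 GB per DVD, about 1 cm per case, 384,000 km to the moon); the constant names are chosen here for illustration.

```python
# Back-of-the-envelope check of the zettabyte figures on the slide.
ZETTABYTE = 10**21                 # bytes
DVD_CAPACITY = 5 * 10**9           # bytes, about 5 GB per DVD (slide's assumption)
CASE_THICKNESS_CM = 1              # about 1 cm per DVD case (slide's assumption)
EARTH_MOON_CM = 384_000 * 10**5    # 384,000 km expressed in centimetres

dvds = ZETTABYTE // DVD_CAPACITY               # number of DVDs needed for 1 ZB
stack_cm = dvds * CASE_THICKNESS_CM            # height of the stacked cases
round_trips = stack_cm / (2 * EARTH_MOON_CM)   # Earth-moon-Earth distances covered

print(f"DVDs needed: {dvds:,}")
print(f"Stack height: {stack_cm / 10**5:,.0f} km")
print(f"Earth-moon round trips: {round_trips:.1f}")
```

The stack works out to about 2 million km, which is where the "about 3 times to the moon and back" figure comes from after rounding.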

Why Analyze Big Data?
- Data is an asset, even the lifeblood, of many organizations.
- Lots of data are being collected and warehoused.
- The data often contain useful information that can be harnessed to improve the organization.
- But their sheer size makes them difficult to analyze effectively.
- Meanwhile, computers have become cheaper and more powerful.
- This presents a unique opportunity to apply computational techniques to big data in order to help businesses plan and optimize their operations.

Applications of Big Data
- Customer relationship management: improve our ability to nurture and retain the most valuable customers.
- Customer acquisition and product promotion: identify new customers and cross-selling or up-selling opportunities.
- Brand management: monitor brand health and track customers' sentiments.
- Optimizing the business model and operations: identify best practices; reduce fraud, waste, and abuse.

Big Data for Scientific Discovery
- Big data is not just a problem for businesses.
- There are many big data problems in scientific research. Examples: biomedical data, astronomy, high-energy physics, climatology/hydrology.
- Data-intensive computing is the 4th paradigm for scientific discovery; theory, experiments, and simulations are the other 3 paradigms.
Source: The Fourth Paradigm: Data-Intensive Scientific Discovery. http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Characteristics (5 V's) of Big Data
- Volume: a large amount of data that is continuously growing.
- Velocity: rapid streams of data that must be processed in real time.
- Variety: structured and unstructured data obtained from (potentially) multiple data sources.
- Veracity: the messiness or trustworthiness of the data.
- Value: the usefulness of the data; a careful cost/benefit analysis is needed before embarking on a big data project.

Challenges of Big Data Analysis
- Storage limitation
  - Traditional approaches assume the entire data set can fit into memory.
  - This is infeasible for big data problems.
- Computation time
  - There are few sublinear-time algorithms.
  - How long does it take to sort 1 million floating-point numbers? 10 million? 100 million?
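One way to get a feel for the sorting question is to time it directly. The sketch below is an illustration only: `time_sort` and `SIZES` are names made up here, the sizes are kept small so the script finishes quickly, and the timings depend entirely on the machine. Extending `SIZES` to the slide's 10 million or 100 million values needs correspondingly more memory and time.

```python
import random
import time

def time_sort(n, seed=0):
    """Return the seconds taken to sort n random floats with the built-in sort."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    start = time.perf_counter()
    data.sort()
    return time.perf_counter() - start

# Illustrative sizes; add 10_000_000 or 100_000_000 if memory allows.
SIZES = (10_000, 100_000, 1_000_000)

for n in SIZES:
    print(f"{n:>11,d} numbers: {time_sort(n):.3f} s")
```

Comparison-based sorting takes on the order of n log n steps, so each tenfold increase in n should cost somewhat more than ten times as long, and that is before the data stops fitting in memory.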

Other Challenges: Privacy http://techland.time.com/2012/02/17/how-target-knew-a-high-school-girl-was-pregnant-before-her-parents/

Other Challenges: Security http://arstechnica.com/security/2012/12/how-an-internet-connected-samsung-tv-can-spill-your-deepest-secrets/

Types of Data Analysis
[Figure: types of data analysis placed along a "Complexity of analysis" axis running from simple to complex. Types shown: descriptive statistics, queries, cluster analysis, predictive modeling, anomaly detection, collaborative filtering.]

(Simple) Descriptive Statistics
- Mean (average)
- Standard deviation
- Median
- Mode
- Quartiles
- Correlation
- etc.

Example: Descriptive Statistics
Number of characters in the last names of students (50 values):
2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8 8 8 9 9 9 10 11 11 14
- Mean = $\frac{2 + 2 + 3 + 3 + \dots + 11 + 14}{50} = 6.06$
- Standard deviation = $\sqrt{\frac{(2 - 6.06)^2 + \dots + (14 - 6.06)^2}{50 - 1}} = 2.46$
- Median (50th percentile) = 6
- Mode = 6 (the value 6 occurs 9 times in the list, the value 5 occurs 8 times)
- 1st quartile (25th percentile) = 4
- 3rd quartile (75th percentile) = 7
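These statistics can be recomputed from the 50 listed values with Python's standard `statistics` module. This is a sketch, not the lecture's own code; the variable names are illustrative. Quartile conventions differ between tools, and `method="inclusive"` is the choice assumed here.

```python
import statistics

# Last-name lengths copied from the slide (50 values).
lengths = [
    2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5,
    6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8,
    9, 9, 9, 10, 11, 11, 14,
]

mean = statistics.mean(lengths)
stdev = statistics.stdev(lengths)      # sample standard deviation (divides by n - 1)
median = statistics.median(lengths)
mode = statistics.mode(lengths)        # most frequent value
q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")

print(f"n = {len(lengths)}")
print(f"mean = {mean:.2f}, standard deviation = {stdev:.2f}")
print(f"median = {median}, mode = {mode}")
print(f"quartiles: Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

Note that `statistics.stdev` divides by n - 1, matching the 50 - 1 denominator on the slide, whereas `statistics.pstdev` would divide by n.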

Querying
Find the top-10 most frequently purchased items at a given store in 2015.
Table transactions has the columns: TID, Tdate, Item, CustID, …, Price
SQL:
    SELECT item, count(*) AS freq
    FROM transactions
    WHERE Year(Tdate) = 2015
    GROUP BY item
    ORDER BY freq DESC
    LIMIT 10
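To try the query without a database server, it can be run against an in-memory SQLite database from Python. This is a sketch under assumptions: the rows are made up, the column types are guesses, and because SQLite has no `Year()` function the filter is written with `strftime('%Y', ...)` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(TID INTEGER, Tdate TEXT, Item TEXT, CustID INTEGER, Price REAL)"
)

# Made-up rows for illustration; dates are stored as ISO-8601 text.
rows = [
    (1, "2015-01-03", "milk", 10, 2.50),
    (2, "2015-01-03", "bread", 10, 1.80),
    (3, "2015-02-11", "milk", 11, 2.50),
    (4, "2014-12-30", "milk", 12, 2.40),   # outside 2015, must be ignored
    (5, "2015-03-05", "eggs", 11, 3.10),
    (6, "2015-03-09", "milk", 13, 2.50),
    (7, "2015-04-01", "bread", 12, 1.80),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows)

# Same query as on the slide, with the year filter adapted to SQLite.
query = """
    SELECT Item, COUNT(*) AS freq
    FROM transactions
    WHERE strftime('%Y', Tdate) = '2015'
    GROUP BY Item
    ORDER BY freq DESC
    LIMIT 10
"""
top_items = conn.execute(query).fetchall()
for item, freq in top_items:
    print(item, freq)
```

With these sample rows, milk is counted three times, because the 2014 purchase is filtered out before grouping.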

Predictive Modeling
- Goal: predict the unknown value of a target attribute.
- Examples of predictive modeling tasks:
  - Predict the future price of a stock.
  - Predict whether a customer will purchase an item at a store.
  - Predict which product a customer is interested in buying when visiting an online store.
  - Detect whether there is congestion or a traffic accident on a highway.
- Though the prediction tasks are different, the same class of algorithms can be applied to solve them.

Framework for Predictive Modeling
[Figure: flow diagram with the labels: labeled examples, training set, test set, train, model, unlabeled examples, and the predicted outputs "congestion" / "no congestion".]
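The framework can be illustrated with a deliberately tiny example: split labeled examples into a training set and a test set, train a model on the former, evaluate it on the latter, then apply it to an unlabeled example. Everything specific below is an assumption made for illustration: the speed readings are invented, and the "model" is just a threshold halfway between the two class means rather than any algorithm named in the lecture.

```python
import random
from statistics import mean

# Made-up labeled examples: (average vehicle speed in km/h, label).
labeled_examples = [
    (12, "congestion"), (65, "no congestion"), (18, "congestion"),
    (80, "no congestion"), (25, "congestion"), (72, "no congestion"),
    (30, "congestion"), (90, "no congestion"), (22, "congestion"),
    (58, "no congestion"), (15, "congestion"), (77, "no congestion"),
]

# Split the labeled examples into a training set (75%) and a test set (25%).
rng = random.Random(42)
shuffled = labeled_examples[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
training_set, test_set = shuffled[:split], shuffled[split:]

def train(examples):
    """Learn a speed threshold halfway between the two class means."""
    congested = [speed for speed, label in examples if label == "congestion"]
    free = [speed for speed, label in examples if label == "no congestion"]
    return (mean(congested) + mean(free)) / 2

def predict(threshold, speed):
    """Classify a speed reading using the learned threshold."""
    return "congestion" if speed < threshold else "no congestion"

model = train(training_set)

# Evaluate on the held-out test set, then label a new, unlabeled example.
correct = sum(predict(model, speed) == label for speed, label in test_set)
accuracy = correct / len(test_set)
print(f"threshold = {model:.1f} km/h, test accuracy = {accuracy:.2f}")
print("prediction for 20 km/h:", predict(model, 20))
```

Holding out the test set matters because accuracy measured on the training examples would overstate how well the model handles data it has not seen.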

Cluster Analysis
- Goal: identify groups (clusters) of observations such that observations in the same group are more similar to each other than to those in other groups.
- Example: crime hotspot detection.
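As one concrete instance of cluster analysis, the sketch below runs a basic k-means on made-up 2-D incident locations. k-means is chosen here for illustration and is not named on the slide; the function name, the coordinates, and the simplistic initialization (the first k points become the starting centroids) are all assumptions.

```python
import math

def kmeans(points, k, max_iter=100):
    """Basic k-means: returns (cluster label per point, list of centroids)."""
    centroids = list(points[:k])   # simplistic init: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # no change, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:            # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids

# Made-up incident locations forming two visible groups.
incidents = [
    (1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (8.5, 9.0),
    (1.0, 1.5), (9.0, 8.0), (2.0, 1.0), (8.0, 8.5),
]
labels, centroids = kmeans(incidents, k=2)
print(labels)
print(centroids)
```

Each centroid ends up at the centre of one group, which is the sense in which a cluster can mark a "hotspot". Real data usually calls for a better initialization and several restarts.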

Association Analysis
Extract patterns of frequently co-occurring events.

Time                 Sensor ID   State
3/1/2015 07:48:05    BR1         OFF
3/1/2015 07:48:07    LR1         ON
3/1/2015 07:48:10    LR6
3/1/2015 07:48:20    BT1
3/1/2015 07:48:40
3/1/2015 07:49:30    BT3

Example patterns:
- Weekday, 7-8am, BR2 = OFF, BR1 = OFF, LR6 = ON → LR1 = ON
- Weekday, 10-11pm, BR1 = ON, BR2 = ON, LR6 = OFF → LR1 = OFF
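A very small version of "frequently co-occurring events" is to count how often each pair of sensor states appears in the same time window and keep the pairs that reach a minimum count. The sketch below does only that; the windows are invented, `frequent_pairs` and `min_count` are hypothetical names, and full association-rule mining involves more than pair counting.

```python
from collections import Counter
from itertools import combinations

# Made-up time windows, each holding the sensor states observed together.
windows = [
    {"BR1=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=OFF", "LR6=ON", "LR1=ON"},
    {"BR1=ON", "LR6=OFF", "LR1=OFF"},
    {"BR1=OFF", "LR6=ON", "LR1=ON", "BT1=ON"},
    {"BR1=ON", "LR6=OFF", "LR1=OFF"},
]

def frequent_pairs(windows, min_count):
    """Return the pairs of events that co-occur in at least min_count windows."""
    counts = Counter()
    for events in windows:
        # Sorting gives each pair one canonical order before counting.
        counts.update(combinations(sorted(events), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

frequent = frequent_pairs(windows, min_count=3)
for pair, n in sorted(frequent.items()):
    print(pair, n)
```

Pairs that survive the threshold are candidates for patterns of the form "when these states hold, that state also holds".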

Anomaly Detection
- Goal: detect significant deviations from normal observations.
- Examples:
  - Smart transportation: congestion detection
  - Smart home/building: pipe burst detection
  - Network intrusion detection
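One simple way to flag "significant deviations" is to mark any reading that lies more than a chosen number of standard deviations from the mean. The sketch below assumes that rule; the water-flow readings, the function name, and the threshold of 2 are made up for illustration, and the slide does not prescribe a particular method.

```python
from statistics import mean, stdev

# Made-up water-flow readings; one value is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs lying more than threshold std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [(i, x) for i, x in enumerate(values) if abs(x - mu) > threshold * sigma]

anomalies = find_anomalies(readings)
print(anomalies)
```

A large outlier inflates both the mean and the standard deviation, so on small samples a stricter threshold can fail to flag it; statistics based on the median are a common, more robust alternative.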

Ranking (Collaborative Filtering)
- Given a query q, rank items in a specific order based on their relevance to q.
- Examples:
  - Location-aware services
  - Recommender systems
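Ranking by relevance can be sketched by scoring every item against the query and sorting by score. The example below assumes that items and the query are represented as small feature vectors and that cosine similarity stands in for relevance; both are illustrative choices, and the item names and vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up items described by three hypothetical features.
items = {
    "cafe": (1, 0, 1),
    "museum": (1, 1, 0),
    "gym": (0, 1, 0),
}
query = (1, 0, 1)

# Rank item names from most to least relevant to the query.
ranked = sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cosine(query, items[name]):.2f}")
```

In a recommender system the query vector would typically describe a user's preferences, and the item vectors would be derived from the ratings of many users.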

Creating Value from Big Data
[Figure: diagram linking a target domain to the stages data collection and storage, data preprocessing, modeling and analysis, and postprocessing.]

What Will You Learn in this Class?
- How to collect data from online sources
- How to clean and preprocess data
- How to query and visualize data
- How to choose the right methods to analyze data
- How to evaluate the results of your analysis
- Programming languages and software:
  - Python, Java
  - SQL, Hive, Pig
  - Hadoop
  - Weka, Mahout, Spark

Summary
- Big data analysis plays a significant role in various sectors, from businesses to scientific research.
- This lecture presented an overview of big data analysis:
  - Challenges in analyzing big data
  - Types of data analysis
- Next lecture: data and how it is represented.