Introduction to Big Data & Basic Data Analysis. Big Data EveryWhere! Lots of data is being collected and warehoused – Web data, e-commerce – purchases.

Slides:



Advertisements
Similar presentations
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Chapter 18: Data Analysis and Mining Kat Powell. Chapter 18: Data Analysis and Mining ➔ Decision Support Systems ➔ Data Analysis and OLAP ➔ Data Warehousing.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Xyleme A Dynamic Warehouse for XML Data of the Web.
OLAP. Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, analytic queries.
Week 9 Data Mining System (Knowledge Data Discovery)
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Introduction Lecture Notes for Chapter 1 Introduction to Data Mining by Tan,
Data Basics. Data Matrix Many datasets can be represented as a data matrix. Rows corresponding to entities Columns represents attributes. N: size of the.
Evaluation of MineSet 3.0 By Rajesh Rathinasabapathi S Peer Mohamed Raja Guided By Dr. Li Yang.
Data Mining By Archana Ketkar.
1 Lecture 10: More OLAP - Dimensional modeling
Recommender systems Ram Akella November 26 th 2008.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Decision Support: Data Mining Introduction.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
Data mining By Aung Oo.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
On-Line Application Processing Warehousing Data Cubes Data Mining 1.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Knowledge Discovery & Data Mining process of extracting previously unknown, valid, and actionable (understandable) information from large databases Data.
MAKING THE BUSINESS BETTER Presented By Mohammed Dwikat DATA MINING Presented to Faculty of IT MIS Department An Najah National University.
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue.
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
1 1 Slide Introduction to Data Mining and Business Intelligence.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Database Design Part of the design process is deciding how data will be stored in the system –Conventional files (sequential, indexed,..) –Databases (database.
BUSINESS ANALYTICS AND DATA VISUALIZATION
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
1 What is Data Mining? l Data mining is the process of automatically discovering useful information in large data repositories. l There are many other.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Business Intelligence - 2 BUS 782. Topics Data warehousing Data Mining.
MIS2502: Data Analytics Advanced Analytics - Introduction.
Data Resource Management Agenda What types of data are stored by organizations? How are different types of data stored? What are the potential problems.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
Databases 2 On-Line Application Processing: Warehousing, Data Cubes, Data Mining.
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Data Resource Management – MGMT An overview of where we are right now SQL Developer OLAP CUBE 1 Sales Cube Data Warehouse Denormalized Historical.
An Introduction to Data Mining
Department of Computer Science Sir Syed University of Engineering & Technology, Karachi-Pakistan. Presentation Title: DATA MINING Submitted By.
Book web site:
On-Line Application Processing
From BI to Big DatA.
Data Mining: Introduction
MIS2502: Data Analytics Advanced Analytics - Introduction
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data Mining: Introduction
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Mining: Introduction
MIS5101: Data Analytics Advanced Analytics - Introduction
Sangeeta Devadiga CS 157B, Spring 2007
Data Analysis.
Data Mining: Introduction
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Introduction of Week 9 Return assignment 5-2
Data Mining: Introduction
On-Line Application Processing
Presentation transcript:

Introduction to Big Data & Basic Data Analysis

Big Data EveryWhere! Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/ grocery stores – Bank/Credit Card transactions – Social Network

How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB TB/month (3/2009) Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERN’s Large Hydron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.

Maximilien Brice, © CERN

The Earthscope The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. ( /ns/technology_and_science -future_of_technology/#.TmetOdQ- -uI) 1.

Type of Data Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data – Social Network, Semantic Web (RDF), … Streaming Data – You can only scan the data once

What to do with these data? Aggregation and Statistics – Data warehouse and OLAP Indexing, Searching, and Querying – Keyword based search – Pattern matching (XML/RDF) Knowledge discovery – Data Mining – Statistical Modeling

Statistics 101

Random Sample and Statistics Population: is used to refer to the set or universe of all entities under study. However, looking at the entire population may not be feasible, or may be too expensive. Instead, we draw a random sample from the population, and compute appropriate statistics from the sample, that give estimates of the corresponding population parameters of interest.

Statistic Let Si denote the random variable corresponding to data point xi, then a statistic ˆθ is a function ˆθ : (S1, S2, · · ·, Sn) → R. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called as an estimator of the parameter.

Empirical Cumulative Distribution Function Where Inverse Cumulative Distribution Function

Example

Measures of Central Tendency (Mean) Population Mean: Sample Mean (Unbiased, not robust):

Measures of Central Tendency (Median) Population Median: or Sample Median:

Example

Measures of Dispersion (Range) Range:  Not robust, sensitive to extreme values Sample Range:

Measures of Dispersion (Inter-Quartile Range) Inter-Quartile Range (IQR):  More robust Sample IQR:

Measures of Dispersion (Variance and Standard Deviation) Standard Deviation: Variance:

Measures of Dispersion (Variance and Standard Deviation) Standard Deviation: Variance: Sample Variance & Standard Deviation:

Univariate Normal Distribution

Multivariate Normal Distribution

OLAP and Data Mining

Warehouse Architecture 23 Client Warehouse Source Query & Analysis Integration Metadata

24 Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1.Fact table : a very large accumulation of facts such as sales. wOften “insert-only.” 2.Dimension tables : smaller, generally static information about the entities involved in the facts.

Terms Fact table Dimension tables Measures 25

Star 26

Cube 27 Fact table view: Multi-dimensional cube: dimensions = 2

3-D Cube 28 day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing 29

Aggregates 30 Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

Aggregates 31 Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example 32 Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup

Aggregates Operators: sum, count, max, min, median, ave “Having” clause Using dimension hierarchy – average by region (within store) – maximum by month (within date) 33

What is Data Mining? Discovery of useful, possibly unexpected, patterns in data Non-trivial extraction of implicit, previously unknown and potentially useful information from data Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] Collaborative Filter [Predictive]

Classification: Definition Given a collection of records (training set ) – Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

Decision Trees 37 Example: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign training set

Clustering 38 age income education

K-Means Clustering 39

Association Rule Mining 40 transaction id customer id products bought sales records: Trend: Products p5, p8 often bough together Trend: Customer 12 likes product p9 market-basket data

Association Rule Discovery Marketing and Sales Promotion: – Let the rule discovered be {Bagels, … } --> {Potato Chips} – Potato Chips as consequent => Can be used to determine what should be done to boost its sales. – Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels. – Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips! Supermarket shelf management. Inventory Managemnt

Collaborative Filtering Goal: predict what movies/books/… a person may be interested in, on the basis of – Past preferences of the person – Other people with similar past preferences – The preferences of such people for a new movie/book/… One approach based on repeated clustering – Cluster people on the basis of preferences for movies – Then cluster movies on the basis of being liked by the same clusters of people – Again cluster people based on their preferences for (the newly created clusters of) movies – Repeat above till equilibrium Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest 42

Other Types of Mining Text mining: application of data mining to textual documents – cluster Web pages to find related pages – cluster pages a user has visited to organize their visit history – classify Web pages automatically into a Web directory Graph Mining: – Deal with graph data 43

Data Streams What are Data Streams? – Continuous streams – Huge, Fast, and Changing Why Data Streams? – The arriving speed of streams and the huge amount of data are beyond our capability to store them. – “Real-time” processing Window Models – Landscape window (Entire Data Stream) – Sliding Window – Damped Window Mining Data Stream 44

A Simple Problem Finding frequent items – Given a sequence (x 1, … x N ) where x i ∈ [1,m], and a real number θ between zero and one. – Looking for x i whose frequency > θ – Na ï ve Algorithm (m counters) The number of frequent items ≤ 1/θ Problem: N>>m>>1/θ 45 P×(Nθ) ≤ N

KRP algorithm ─ Karp, et. al (TODS ’ 03) 46 Θ=0.35 ⌈ 1/θ ⌉ = 3 N=30m=12 N/ ( ⌈ 1/θ ⌉ ) ≤ Nθ

Streaming Sample Problem Scan the dataset once Sample K records – Each one has equally probability to be sampled – Total N record: K/N