Data Mining: An Introduction Billy Mutell

“The Library of Babel” Analogy
A network of bookshelves holding every book ever written.
All the books one could possibly imagine must exist somewhere in this library.
Books have titles like ‘Axaxxas mlo’, ‘The Bible’ and ‘Tomorrow's Winning Lottery Numbers’.
Roughly $25^{1,312,000}$, on the order of $10^{1,834,097}$, volumes in the library.
May be viewed as a metaphor for information in today’s society, where there are growing amounts of data but not enough information.
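The volume count quoted on this slide, 25 raised to the power 1,312,000, can be converted to a power of ten with a logarithm. The sketch below is only a sanity check of that arithmetic; the constant names are illustrative, and reading the figure as 25 possible symbols in each of 1,312,000 character positions is an assumption about where it comes from.

```python
import math

ALPHABET_SIZE = 25          # assumed: symbols available for each character position
CHARS_PER_BOOK = 1_312_000  # assumed: character positions in one book

# log10(25 ** 1,312,000) = 1,312,000 * log10(25); this avoids building the huge integer.
log10_volumes = CHARS_PER_BOOK * math.log10(ALPHABET_SIZE)

exponent = math.floor(log10_volumes)         # power of ten
mantissa = 10 ** (log10_volumes - exponent)  # leading factor, between 1 and 10

print(f"25^{CHARS_PER_BOOK} is about {mantissa:.3f} x 10^{exponent}")
```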

Content
General Information
Approaches to searching for information
Project and plans

What is Data Mining?
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
The science of extracting useful information from large data sets or databases.

How Did it Evolve to What We Have Today?
With increased data, new techniques needed to be created.
Information Retrieval, Statistics, Machine Learning Algorithms, and Database Management feed into Data Mining.

Practical Applications
Government Intelligence
Insurance
Bank Finance Branch Evaluation
Pharmaceutical Reactions in Patients

There are two models for mining data
Predictive: Makes projected conclusions about values based on known results from different data.
Includes: Regression, Classification, Time Series Analysis
Classification: Maps data into predefined groups.
  Example: Identifying potential credit risks
Time Series Analysis: Examines the value of an attribute as it varies over time.
  Example: Choosing stocks

There are two models for mining data
Descriptive: Identifies patterns or relationships in data.
Includes: Clustering, Association Rules, Sequence Discovery
Clustering: Very similar to Classification, but the groups are defined by the data and not predefined.
Association Rules: Identify specific types of data pairings.
  Example: If someone buys jelly, they’re probably buying peanut butter
Sequence Discovery: Highlights patterns in temporal sequences.
  Example: If someone buys a CD player, they’ll probably buy CDs within a week
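The jelly and peanut butter rule can be made concrete with two common measures, support and confidence. The slide does not name them, so treat them as one usual way to score such a rule rather than the presenter's method. The baskets below are made-up illustration data, and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "peanut butter"},
    {"jelly", "peanut butter", "milk"},
    {"milk", "bread"},
    {"jelly", "bread"},
]

rule_support = support(baskets, {"jelly", "peanut butter"})
rule_confidence = confidence(baskets, {"jelly"}, {"peanut butter"})
print(f"jelly -> peanut butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, two of the three baskets that contain jelly also contain peanut butter, which is the sense in which a jelly buyer is "probably" buying peanut butter.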

Information Analysis
Statistical Based Algorithms
Decision Tree Based Algorithms
Rule Based Algorithms
Distance Based Algorithms

Linear Regression Examples
Regression: Estimation of an output value based on input values; takes the input data and fits it to a formula that predicts the output.

Statistical Based Algorithms
By determining the regression coefficients $\{c_0, c_1, \ldots, c_n\}$, we can estimate the relationship between the output parameter, $y$, and the input parameters, $\{x_1, \ldots, x_n\}$.
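Assuming the usual linear form $y = c_0 + c_1 x_1 + \cdots + c_n x_n$, the coefficients can be estimated by least squares. The sketch below covers only the single-input case ($n = 1$) with the closed-form solution. The function name `fit_line` and the sample data are illustrative assumptions, not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimate of (c0, c1) in y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    c0 = mean_y - c1 * mean_x
    return c0, c1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
c0, c1 = fit_line(xs, ys)
print(f"y = {c0:.2f} + {c1:.2f} * x")
```

With several inputs, the same idea is normally solved as a matrix least-squares problem rather than with this closed form.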

Decision Tree Example: 20 Questions
Dead or Alive?
  Alive?
  Dead?
    Woman?
    Man?
      Non-Mathematician?
      Mathematician?
        Modern?
        Ancient?
          Pythagoras!
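The 20 Questions tree can be sketched as nested dictionaries that are walked from the root to a leaf. Only the path ending at Pythagoras comes from the slide. The other leaves are hypothetical placeholders, and the names `TREE` and `walk` are illustrative.

```python
# Each internal node asks a question; each branch leads to another node or to a leaf (a string).
TREE = {
    "question": "Dead or alive?",
    "branches": {
        "alive": "(placeholder: some living person)",
        "dead": {
            "question": "Woman or man?",
            "branches": {
                "woman": "(placeholder: some woman)",
                "man": {
                    "question": "Mathematician or non-mathematician?",
                    "branches": {
                        "non-mathematician": "(placeholder: some non-mathematician)",
                        "mathematician": {
                            "question": "Modern or ancient?",
                            "branches": {
                                "modern": "(placeholder: some modern mathematician)",
                                "ancient": "Pythagoras",
                            },
                        },
                    },
                },
            },
        },
    },
}


def walk(node, answers):
    """Follow the answer to each question until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][answers[node["question"]]]
    return node


answers = {
    "Dead or alive?": "dead",
    "Woman or man?": "man",
    "Mathematician or non-mathematician?": "mathematician",
    "Modern or ancient?": "ancient",
}
guess = walk(TREE, answers)
print(guess)
```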

Rule Based Algorithms
Work well for classification through if-then analysis.
Trees have an implied order in which splitting occurs; rules have no order.
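A minimal sketch of if-then classification follows, written so that the order of the rules does not matter. The height thresholds are assumptions chosen for illustration, and `RULES` and `classify_height` are hypothetical names.

```python
# Each rule is (condition, label). The conditions are mutually exclusive,
# so exactly one rule fires whatever order they are checked in.
# The thresholds are assumed values, not taken from the slides.
RULES = [
    (lambda h: h >= 2.0, "Tall"),
    (lambda h: h < 1.75, "Short"),
    (lambda h: 1.75 <= h < 2.0, "Medium"),
]


def classify_height(height_m, rules=RULES):
    """Return the label of the single rule whose condition holds."""
    matches = [label for condition, label in rules if condition(height_m)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching rule, got {matches}")
    return matches[0]


print(classify_height(1.6), classify_height(1.85), classify_height(2.1))
```

Reversing `RULES` gives the same answers. A decision tree, by contrast, fixes the sequence in which the tests are applied.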

Parametric vs Nonparametric Models
Parametric Model: Describes the relationship between input and output through algebraic equations in which some parameters aren’t specified.
Nonparametric Model: Data driven and more appropriate for mining applications.
  Creates models based on the input, while parametric methods assume models ahead of time.
  More flexible than parametric models and generally easier to work with.

Netflix: A Case Study
Quest to improve customer/movie predictability through data mining and linear regression.
Teams compete for a $1,000,000 prize.
Must beat Cinematch, Netflix’s current program for predicting movie preferences.

What others have done so far:
“If I have seen further, it is by standing on the shoulders of giants.” -Isaac Newton, 1676
There are currently 31,443 contestants on 25,713 teams from 167 different countries.
Important to remember that everyone is given the same incomplete data, and we have to use it to predict the rest of the data (unknown to us, known to Netflix).
The current leaders are from Budapest, Hungary, and their predictions are 8.7% better than Cinematch’s.

K-Nearest Neighbor Algorithm (k-NN)
A set of $n$ pairs $(x_1, \theta_1), \ldots, (x_n, \theta_n)$ is given, where the $x_i$’s take values in a metric space $X$ upon which is defined a metric $d$, and the $\theta_i$’s take values in the set $\{1, 2, \ldots, M\}$ of possible classes.
Each $\theta_i$ is considered to be the index of the category to which the $i$-th individual belongs, and each $x_i$ is the outcome of the set of measurements made upon that individual.
A new pair $(x, \theta)$ is given, where only the measurement $x$ is observable, and it is desired to estimate $\theta$ by using the information in the set of correctly classified points.
We call $x'_n \in \{x_1, \ldots, x_n\}$ the nearest neighbor of $x$ if $\min_{i} d(x_i, x) = d(x'_n, x)$, $i = 1, 2, \ldots, n$.
The nearest-neighbor decision method gives $x$ the category $\theta'_n$ of its nearest neighbor $x'_n$.

K-Nearest Neighbor Algorithm (k-NN)
[Figure: a dot to be classified, surrounded by triangles and rectangles]
If k=3, we classify the dot as a triangle.
If k=5, we classify the dot as a rectangle.

Name       Gender  Height (m)  Output
Kristina   F       1.6         Short
Jim        M       2           Tall
Maggie     F       1.9         Medium
Martha     F       1.88        Medium
Stephanie  F       1.7         Short
Bob        M       1.85        Medium
Kathy      F       1.6         Short
Dave       M       1.7         Short
Worth      M       2.2         Tall
Steven     M       2.1         Tall
Debbie     F       1.8         Medium
Todd       M       1.95        Medium
Kim        F       1.9         Medium
Amy        F       1.8         Medium
Wynette    F       1.75        Medium
Suppose we want to know how a new entry would be classified.
Set K=5 and find the K nearest neighbors: most => SHORT, the rest => MEDIUM.
Thus k-NN would classify the entry as SHORT.
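The K=5 vote over this table can be sketched as follows. The query height of 1.6 is a hypothetical value, since the transcript does not preserve the entry being classified. Using height alone as the feature with absolute difference as the distance is also an assumption, as is the function name `knn_classify`.

```python
from collections import Counter

# (name, gender, height in meters, class label), copied from the table.
TRAINING = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]


def knn_classify(height, k=5, data=TRAINING):
    """Majority label among the k training rows closest in height."""
    # Distance is the absolute height difference (assumed one-dimensional metric).
    nearest = sorted(data, key=lambda row: abs(row[2] - height))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical query: the original entry is not preserved in the transcript.
print(knn_classify(1.6, k=5))
```

For a query height of 1.6, the five nearest rows are Kristina, Kathy, Stephanie, Dave and Wynette, giving four votes for Short and one for Medium.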

What I plan to do from here:
Take data from Netflix and sift through it.
Develop a function that maps non-linear data to a linear format so that it may be clustered and regressed.
Map data to matrices in $\mathbb{R}^n$.
Use Support Vector Machines to map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed.
Create a way to interpret this data in the form of movie recommendations.
Also: use the k-NN approach along with Latent Semantic Indexing techniques to analyze scripts and key thematic plots and look for correlations/clusters.
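The idea of mapping inputs to a higher-dimensional space can be illustrated without training an actual Support Vector Machine. In the sketch below, one-dimensional points that no single threshold can separate become separable after the assumed feature map $\phi(x) = (x, x^2)$. The data, the map and the hand-picked hyperplane are all hypothetical, and a real SVM would instead search for the maximal-margin hyperplane.

```python
def phi(x):
    """Assumed feature map from the line into the plane."""
    return (x, x * x)


# Hypothetical labeled points: class +1 lies on both sides of class -1.
samples = [(-3.0, 1), (-2.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (2.0, 1), (3.0, 1)]


def separable_by_threshold(values, labels):
    """True if the labels, ordered by value, switch class at most once."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return switches <= 1


xs = [x for x, _ in samples]
ys = [y for _, y in samples]
raw_separable = separable_by_threshold(xs, ys)
mapped_separable = separable_by_threshold([phi(x)[1] for x in xs], ys)

# Hand-picked hyperplane in the mapped space: w . phi(x) + b = 0 with w = (0, 1), b = -2.5.
w, b = (0.0, 1.0), -2.5


def predict(x):
    score = w[0] * phi(x)[0] + w[1] * phi(x)[1] + b
    return 1 if score > 0 else -1


print(raw_separable, mapped_separable, all(predict(x) == y for x, y in samples))
```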

Questions?