Course Introduction CSC 600: Data Mining Class 1.

Today
What is Data Mining?
Syllabus / Course Webpage
Types of Data

What is Data Mining?
How would you define data mining?
Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
Goal: solving business problems
Data collection (more and more data is being collected)
Warehousing of data (readily available for analysis; data from numerous sources already integrated)
Computer storage and computer power cheaper every day
Good software for performing analysis (prompt)

Data Mining …
blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms
Programming ability to process big data
Businesses interested in decision making
“Art” of data mining
Math, Business, CS

Predictive Data Mining
Moving from data to insights to decisions.

Data Mining Applications
Businesses collect lots of data:
  Purchase information
  Web site browsing habits
  Social network data
Business goals: customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
  “Who are the most profitable customers?”
  “What products can be cross-sold?”
  “What is the revenue outlook for the company next year?”
Many variables are collected; few turn out to be useful.

More Applications
Price Prediction
Fraud Detection
Risk Assessment
Diagnosis

Target Example
2010 project to predict customer pregnancy (pregnancy scores)
Tremendous sales opportunity when a family prepares for a newborn
Send specific marketing material (baby coupon book)
Awareness of false positives; camouflaged activities
Links:
  target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
  pregnancy-inside-story.html

Data Mining Applications
Medicine, Science, Engineering are collecting lots of data:
  NASA / weather observations (collecting land surface, ocean, atmosphere readings)
  Molecular biology data (large amounts of genomic data being gathered to better understand the function of genes)
  Medical data (outcomes of procedures)
Questions that a scientist will try to answer using data mining:
  “How is land surface precipitation and temperature affected by ocean surface temperature?”
  “How well can we predict the beginning and end of the growing season for a region?”

What we will do in this Course
Learn Basic-to-Intermediate Data Mining Techniques
Apply them on Datasets
Program using Python
Read, Understand, Discuss, Critique Scientific Papers
Perform Significant Individual Data Mining Project

Syllabus / Course Webpage

What is Data Mining? What is NOT Data Mining?
Data mining:
  “the process of automatically discovering useful information in large data repositories”
  “to find novel and useful patterns that might otherwise remain unknown”
NOT data mining:
  “looking up records in a MySQL database” (database)
  “finding relevant web pages based on a Google search query” (information retrieval)

Data Mining and Knowledge Discovery
Process of converting raw data into useful information
Input Data
  MySQL
  .csv
Data Preprocessing
  Feature Selection
  Dimensionality Reduction
  Normalization
Data Mining
  Decision Trees
  Support Vector Machines
  Linear Regression
Postprocessing
  Visualization
  Pattern Interpretation
  Reporting to Boss
“closing the loop”
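The four stages above can be sketched end to end in a few lines of Python. This is only an illustrative sketch, not the course's reference code: the CSV text, the column names (sqft, price), and the helper names (load, normalize, fit_line, report) are all made up for the example, and least-squares linear regression stands in for the mining step.

```python
import csv
import io

# Hypothetical raw input standing in for a .csv file; the values are made up.
RAW_CSV = """sqft,price
1000,200000
1500,290000
2000,410000
2500,500000
"""

def load(text):
    """Input data: parse CSV text into (feature, target) float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(r["sqft"]), float(r["price"])) for r in reader]

def normalize(rows):
    """Preprocessing: min-max scale the feature into [0, 1]."""
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    return [((x - lo) / (hi - lo), y) for x, y in rows]

def fit_line(rows):
    """Data mining: least-squares fit of y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    s_xx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

def report(intercept, slope):
    """Postprocessing: turn the fitted model into a readable summary."""
    return f"price = {intercept:.0f} + {slope:.0f} * scaled_sqft"

rows = normalize(load(RAW_CSV))
intercept, slope = fit_line(rows)
summary = report(intercept, slope)
print(summary)
```

Each stage is a separate function so that any one of them (for example, the mining step) can be swapped out without touching the others.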

Input Data
Data is available in a variety of formats:
  Flat files (.csv or .txt)
  Spreadsheets (Excel .xls is tougher to deal with)
  Relational tables (MySQL)
  Text, data on web pages (scraping necessary)
Big Data / Data Warehouse
  Data spread out over multiple locations
  CS programming ability often necessary
Sometimes an enormous amount of effort
  Digitizing hand-written notes

Preprocessing
To transform raw input data into an appropriate format for subsequent analysis
Fusing data from multiple sources
Cleaning data to remove noise
  Duplicate observations
  “garbage in – garbage out” also applies to data mining
Selecting records and features that are relevant to the data mining task at hand
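As a rough illustration of the three preprocessing steps on this slide (fusing, cleaning duplicates, selecting features), here is a small sketch using only plain Python data structures. The two data sources, the field names, and the helper functions are hypothetical and exist only for this example.

```python
# Two hypothetical sources describing the same customers (made-up values).
purchases = [
    {"customer_id": 1, "total_spent": 120.0},
    {"customer_id": 2, "total_spent": 45.5},
    {"customer_id": 2, "total_spent": 45.5},  # duplicate observation
    {"customer_id": 3, "total_spent": 300.0},
]
web_visits = {1: 14, 2: 3, 3: 27}  # customer_id -> number of site visits

def fuse(purchase_rows, visits_by_id):
    """Fuse the two sources by joining on customer_id."""
    return [
        {**row, "visits": visits_by_id.get(row["customer_id"])}
        for row in purchase_rows
    ]

def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def select_features(rows, features):
    """Keep only the attributes relevant to the task at hand."""
    return [{name: row[name] for name in features} for row in rows]

fused = fuse(purchases, web_visits)
clean = select_features(drop_duplicates(fused), ["total_spent", "visits"])
print(clean)
```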

Data Mining
Applying Appropriate Data Mining Task
  Linear Regression
  Support Vector Machines
  Decision Trees
  Clustering
  …

Postprocessing
Performing:
  Visualization
  Statistical significance tests, confidence intervals, hypothesis testing to eliminate spurious data mining results (yikes, math!)
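One way a confidence interval helps flag a possibly spurious result: if the interval for an observed improvement includes zero, the improvement may be noise. The sketch below uses a normal approximation (z = 1.96 for roughly 95%), which is an assumption and is crude for a sample this small; the gain values and the function name are made up for the example.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * std_err, mean + z * std_err

# Hypothetical accuracy gains of a new model over a baseline across 8 runs.
gains = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02, 0.04, 0.01]

low, high = mean_confidence_interval(gains)
includes_zero = low <= 0.0 <= high  # True means the gain could be spurious
print(f"mean gain CI: ({low:.4f}, {high:.4f}); includes zero: {includes_zero}")
```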

Challenges of Data Mining
Scalability
Gigabytes, terabytes, petabytes, exabytes of data
Storage, processing
“Are data mining algorithms scalable?”
Limits of Python statistical framework libraries

High Dimensionality
Datasets with hundreds or thousands of attributes
Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
Many variables are collected; few turn out to be useful.
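One commonly cited reason distance-based techniques struggle in high dimensions is that the nearest and farthest points end up at almost the same distance. The sketch below illustrates this on uniformly random data; it is one illustration chosen for this note, not something the slide itself specifies, and the function name, point count, and seed are arbitrary.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(f"relative contrast in 2-D:    {low_dim:.2f}")
print(f"relative contrast in 1000-D: {high_dim:.2f}")
```

A small relative contrast means "nearest neighbor" carries little information, which is part of why feature selection and dimensionality reduction matter.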

Heterogeneous and Complex Data
Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)

Data Ownership
“Good data” is geographically distributed and owned by more than one organization (e.g. medical records)
Access to “good data”
  Facebook and Google keep their collected data private

Traditional Data Analysis
Based on a hypothesize-and-test paradigm
  Hypothesis proposed
  Experiment designed to gather data
  Data analyzed w/ respect to hypothesis
  Hypothesis accepted or rejected

Traditional Data Analysis
  Hypothesis-and-test pattern
  Data collection
  Laborious process
  Usually on relatively smaller datasets
Data Mining
  Generation and evaluation of thousands of hypotheses
  Datasets analyzed are typically not the result of a carefully designed experiment
  Opportunistic samples of data
  Datasets of size TB
  Because of data quantity, the role of traditional statistical concepts (confidence intervals, statistical significance tests) is reduced
  With large data sets, almost any small difference becomes significant
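The last point can be checked with a little arithmetic. The sketch below computes a two-sample z statistic for a fixed, tiny difference in means and shows how the p-value changes with sample size alone. The means, the standard deviation, and the function name are hypothetical, and a shared known standard deviation is assumed for simplicity.

```python
import math

def two_sample_z(mean_a, mean_b, std, n):
    """z statistic and two-sided p-value for two groups of size n with shared std."""
    std_err = std * math.sqrt(2.0 / n)
    z = (mean_a - mean_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical: two customer groups whose average basket differs by only 0.1.
results = {}
for n in (100, 10_000, 1_000_000):
    results[n] = two_sample_z(50.1, 50.0, 10.0, n)
    z, p = results[n]
    print(f"n={n:>9,}  z={z:5.2f}  p={p:.3g}")
```

The difference between the groups never changes; only n does. That is why, at terabyte scale, statistical significance by itself says little about whether a pattern is useful.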

Sample Data
What is interesting in this data?
Vocabulary:
  Column: “attribute”, “feature”, “field”, “dimension”, “variable”
  Row: “instance”, “record”, “observation”

Data Mining Tasks Predictive Tasks
Objective: predict the value of a particular attribute, based on the values of other attributes
“Defaulted Borrower?” is the target (or dependent variable)
Attributes/features used for making the prediction are known as explanatory (or independent) variables

Supervised Machine Learning
Machine Learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature from a set of historical examples.
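To make "learning a model from historical examples" concrete, here is a minimal sketch of about the simplest possible learner: a one-threshold rule (a decision stump) relating a single descriptive feature to the target. The income values, the labels, and the function names are made up; the target mirrors the “Defaulted Borrower?” attribute from the predictive-tasks slide.

```python
# Hypothetical historical examples: (annual income in $1000s, defaulted?).
history = [
    (25, True), (32, True), (40, True),
    (48, False), (60, False), (75, False), (90, False),
]

def learn_stump(examples):
    """Learn the income threshold that misclassifies the fewest examples.

    The learned rule is: predict "defaulted" when income < threshold.
    """
    incomes = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive observed incomes.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum((x < threshold) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = learn_stump(history)

def predict(income):
    """Apply the learned model to a new, unseen borrower."""
    return income < threshold

print(f"learned threshold: {threshold}")
print(f"income 30 -> defaulted? {predict(30)}")
print(f"income 70 -> defaulted? {predict(70)}")
```

The threshold is not written by hand; it is chosen automatically from the examples, which is the sense of "learn" on the slide.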

Data Mining Tasks Descriptive Tasks
Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
Often more exploratory, and requires an explanation of found results
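As a small illustration of a descriptive task, the sketch below groups one-dimensional values into two clusters using Lloyd's k-means iteration with k fixed at 2. There is no target attribute here; the algorithm only summarizes structure already present in the data. The spend values and the function name are hypothetical, and seeding the centers with the minimum and maximum is a simplification chosen for this example.

```python
def two_means_1d(values, iterations=10):
    """Lloyd's k-means with k=2 in one dimension, seeded with min and max."""
    if min(values) == max(values):
        raise ValueError("need at least two distinct values")
    centers = [min(values), max(values)]
    for _ in range(iterations):
        clusters = ([], [])
        for v in values:
            # Assign each value to its nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its assigned values.
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

# Hypothetical monthly spend per customer (made-up values).
spend = [12, 15, 14, 90, 95, 11, 88, 16]
centers, clusters = two_means_1d(spend)
print(f"cluster centers: {centers}")
print(f"low spenders:  {sorted(clusters[0])}")
print(f"high spenders: {sorted(clusters[1])}")
```

The output still needs interpretation (for example, deciding what "low spender" means for the business), which is the exploratory part the slide refers to.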

Available Datasets

References
Kelleher et al., Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition
Grus, Data Science from Scratch, 1st edition
Tan et al., Introduction to Data Mining, 1st edition
Ledolter, Data Mining and Business Analytics in R, 1st edition
