Course Introduction CSC 576: Data Mining.



Today
- What is Data Mining?
- Syllabus / Course Webpage
- Types of Data

What is Data Mining?
How would you define data mining?
- Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
- Goal: solving business problems
- Data collection (more and more data is being collected)
- Warehousing of data (readily available for analysis; data from numerous sources already integrated)
- Computer storage and computer power are cheaper every day
- Good software for performing analysis (prompt)

Data Mining …
- blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms
- Programming ability to process big data
- Businesses interested in decision making
- “Art” of data mining
- Math, Business, CS

Predictive Data Mining Moving from data to insights to decisions.

Data Mining Applications
Businesses collect lots of data:
- Purchase information
- Web site browsing habits
- Social network data
Business goals: customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
- “Who are the most profitable customers?”
- “What products can be cross-sold?”
- “What is the revenue outlook for the company next year?”
Many variables are collected; few turn out to be useful.

More Applications
- Price Prediction
- Fraud Detection
- Risk Assessment
- Diagnosis

What we will do in this Course
- Learn Basic-to-Intermediate Data Mining Techniques
- Apply them on Datasets
- Program using Python
- Read, Understand, Discuss, Critique Scientific Papers
- Perform Significant Individual Data Mining Project

Syllabus / Course Webpage

What is Data Mining?
- “the process of automatically discovering useful information in large data repositories”
- “to find novel and useful patterns that might otherwise remain unknown”
What is NOT Data Mining?
- “looking up records in a MySQL database” (database)
- “finding relevant web pages based on a Google search query” (information retrieval)

Data Mining and Knowledge Discovery
Process of converting raw data into useful information
- Input Data: MySQL, .csv, JSON, Twitter API
- Data Preprocessing: Feature Selection, Dimensionality Reduction, Normalization
- Data Mining: Decision Trees, Support Vector Machines, Linear Regression, Neural Networks
- Postprocessing: Visualization, Pattern Interpretation, Reporting to Boss
- “closing the loop”
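
The four stages above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the `ad_spend`/`sales` data is made up, min-max scaling stands in for "Normalization", and a one-variable least-squares fit stands in for "Linear Regression". The function names are hypothetical and use only the standard library.

```python
import csv
import io

# Hypothetical input data (made-up numbers), standing in for a .csv file.
RAW_CSV = """ad_spend,sales
10,25
20,45
30,65
40,85
"""

def load_rows(text):
    """Input data: parse CSV text into a list of (x, y) float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(r["ad_spend"]), float(r["sales"])) for r in reader]

def min_max_normalize(values):
    """Preprocessing: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def fit_line(xs, ys):
    """Data mining: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def report(slope, intercept):
    """Postprocessing: turn the fitted model into a readable summary."""
    return f"sales = {slope:.1f} * normalized_ad_spend + {intercept:.1f}"

rows = load_rows(RAW_CSV)
xs = min_max_normalize([x for x, _ in rows])
ys = [y for _, y in rows]
slope, intercept = fit_line(xs, ys)
print(report(slope, intercept))
```

Each stage is a separate function so that any one of them can be swapped out (a different data source, a different model) without touching the others.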

Input Data
Data is available in a variety of formats:
- Flat files (.csv or .txt)
- Spreadsheets (Excel .xls is tougher to deal with)
- Relational tables (MySQL)
- Text, data on web page (scraping necessary)
- Big Data / Data Warehouse
Data is often spread out over multiple locations
- CS programming ability often necessary
- Sometimes an enormous amount of effort (e.g. digitizing hand-written notes)
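
As a small illustration of handling two of the formats listed above, the sketch below reads the same kind of record from CSV text and from JSON text and converts both to a common shape. The field names `customer_id` and `amount` and the sample values are hypothetical; in practice the text would come from files or an API rather than string constants.

```python
import csv
import io
import json

# Hypothetical sample content standing in for files on disk.
CSV_TEXT = "customer_id,amount\n1,19.99\n2,5.00\n"
JSON_TEXT = '[{"customer_id": 3, "amount": 42.5}]'

def to_record(raw):
    """Convert one raw row (strings or JSON values) to a typed record."""
    return {"customer_id": int(raw["customer_id"]), "amount": float(raw["amount"])}

def read_csv_records(text):
    """Parse CSV text; every value arrives as a string and must be converted."""
    return [to_record(row) for row in csv.DictReader(io.StringIO(text))]

def read_json_records(text):
    """Parse a JSON array of objects into the same record shape."""
    return [to_record(obj) for obj in json.loads(text)]

records = read_csv_records(CSV_TEXT) + read_json_records(JSON_TEXT)
print(len(records), "records loaded")
```

Converting every source to one record shape early makes the later preprocessing steps independent of where the data came from.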

Preprocessing
To transform raw input data into an appropriate format for subsequent analysis
- Fusing data from multiple sources
- Cleaning data to remove noise
- Removing duplicate observations
- “garbage in – garbage out” also applies to data mining
- Selecting records and features that are relevant to the data mining task at hand
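
The preprocessing steps above (fusing sources, removing duplicates, cleaning noise, selecting features) might look like this on a toy dataset. The records, the field names, and the rule that an age outside 0 to 120 counts as noise are all assumptions made for illustration.

```python
# Hypothetical records from two sources; values are made up.
store_sales = [
    {"customer_id": 1, "age": 34, "amount": 20.0, "notes": "in store"},
    {"customer_id": 2, "age": -5, "amount": 15.0, "notes": "in store"},  # noisy age
]
web_sales = [
    {"customer_id": 1, "age": 34, "amount": 20.0, "notes": "in store"},  # duplicate
    {"customer_id": 3, "age": 51, "amount": 7.5, "notes": "online"},
]

def preprocess(sources, features):
    """Fuse sources, drop exact duplicates and noisy rows, keep chosen features."""
    fused = [rec for source in sources for rec in source]
    seen = set()
    cleaned = []
    for rec in fused:
        key = tuple(sorted(rec.items()))
        if key in seen:                    # duplicate observation
            continue
        seen.add(key)
        if not 0 <= rec["age"] <= 120:     # assumed validity rule for "age"
            continue
        cleaned.append({name: rec[name] for name in features})
    return cleaned

clean = preprocess([store_sales, web_sales], features=["customer_id", "age", "amount"])
print(clean)
```

Of the four input rows, only two survive: one is an exact duplicate and one has an impossible age.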

Data Mining
Applying the appropriate data mining task:
- Linear Regression
- Support Vector Machines
- Decision Trees
- Clustering
- …

Postprocessing
Performing:
- Visualization
- Statistical significance tests, confidence intervals, and hypothesis testing to eliminate spurious data mining results (yikes, math!)
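
One way to check whether a result might be spurious is to put a confidence interval around it. The sketch below computes a normal-approximation interval for a mean; the accuracy differences are made-up numbers, and for a sample this small a t-based interval would be somewhat wider, so treat it as a rough check only.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical data: accuracy of a new model minus a baseline, over 8 trials.
differences = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.03, 0.02]
low, high = mean_confidence_interval(differences)
print(f"95% interval for the mean improvement: ({low:.4f}, {high:.4f})")
```

If the interval excludes zero, the improvement is less likely to be an accident of the particular trials; if it straddles zero, the result should be reported with caution.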

Challenges of Data Mining: Scalability
- Gigabytes, terabytes, petabytes, exabytes of data
- Storage, processing
- “are data mining algorithms scalable?”
- Limits of Python statistical framework libraries

Challenges of Data Mining: High Dimensionality
- Datasets with hundreds or thousands of attributes
- Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
- Many variables are collected; few turn out to be useful.
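
One commonly cited reason distance-based techniques struggle in high dimensions is that distances between random points become nearly equal. The sketch below measures this with a "relative contrast" (how much farther the farthest point is than the nearest, relative to the nearest). The uniform random data, the point counts, and the function name are assumptions chosen for illustration.

```python
import math
import random

def relative_contrast(num_points, num_dims, rng):
    """(max distance - min distance) / min distance from one random query point."""
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
low_dim = relative_contrast(200, 2, rng)
high_dim = relative_contrast(200, 1000, rng)
print(f"contrast in 2 dimensions:    {low_dim:.2f}")
print(f"contrast in 1000 dimensions: {high_dim:.2f}")
```

When the contrast is small, the "nearest" neighbor is barely nearer than any other point, which weakens methods that depend on that distinction.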

Challenges of Data Mining: Heterogeneous and Complex Data
- Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
- Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)

Challenges of Data Mining: Data Ownership
- “Good data” may be geographically distributed or owned by more than one organization (e.g. medical records)
- Access to “good data”: Facebook and Google keep their collected data private

Sample Data
What is interesting in this data?
Vocabulary:
- Column: “attribute”, “feature”, “field”, “dimension”, “variable”
- Row: “instance”, “record”, “observation”

Data Mining Tasks: Predictive Tasks
- Objective: predict the value of a particular attribute, based on the values of other attributes
- “Defaulted Borrower?” is the target (or dependent) variable
- Attributes/features used for making the prediction are known as explanatory (or independent) variables

Supervised Machine Learning
Machine Learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature from a set of historical examples.
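
To make "learning a model from historical examples" concrete, here is a minimal sketch of a decision stump (a one-level decision tree). It uses a single hypothetical explanatory variable, a debt-to-income ratio, to predict the target “Defaulted Borrower?”. The data is invented, and the rule direction (a higher ratio means "Yes") is assumed rather than learned.

```python
# Hypothetical historical examples: (debt_to_income_ratio, defaulted?)
history = [
    (0.08, "No"), (0.20, "No"), (0.16, "No"),
    (0.86, "Yes"), (0.92, "Yes"), (0.82, "Yes"),
]

def fit_stump(examples):
    """Learn the threshold that makes the fewest mistakes on the examples."""
    values = sorted({x for x, _ in examples})
    # Candidate thresholds: midpoints between neighboring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != (label == "Yes") for x, label in examples)

    return min(candidates, key=errors)

def predict(threshold, x):
    """Apply the learned rule: ratio above the threshold predicts a default."""
    return "Yes" if x > threshold else "No"

threshold = fit_stump(history)
print(predict(threshold, 0.30), predict(threshold, 0.70))
```

The learned threshold is the "model": it summarizes the historical examples in a form that can be applied to a borrower who has not been seen before.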

Data Mining Tasks: Descriptive Tasks
- Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
- Often more exploratory and requires an explanation of found results
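
As an illustration of a descriptive task, the sketch below groups one-dimensional values into clusters with a bare-bones k-means loop. There is no target variable here: the algorithm only summarizes structure already in the data. The purchase amounts and the hand-picked starting centroids are assumptions; real implementations choose starting points more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal k-means on 1-D data, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical purchase amounts in dollars.
amounts = [5, 7, 6, 52, 48, 50, 8, 49]
centroids, clusters = kmeans_1d(amounts, centroids=[5, 52])
print(centroids, clusters)
```

The output separates small purchases from large ones, but deciding what the two groups mean for the business is the explanation step the slide refers to.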

“Free Public Datasets”
- https://www.reddit.com/r/datasets/
- https://www.reddit.com/r/opendata/
- https://www.kaggle.com/datasets
- https://github.com/awesomedata/awesome-public-datasets
- https://www.forbes.com/sites/bernardmarr/2018/02/26/big-data-and-ai-30-amazing-and-free-public-data-sources-for-2018/

References
- Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
- Data Science from Scratch, 1st edition, Grus
- Introduction to Data Mining, 1st edition, Tan et al.
- Data Mining and Business Analytics in R, 1st edition, Ledolter