Data Mining Mohammed J. Zaki.


Traditional Hypothesis Driven Research
[Diagram labels: Experiment, Data, Result, Design, Data analysis]

Data Driven Science
[Diagram labels: Process/Experiment, Data, No Prior Hypothesis, New Science of Data]

Bioinformatics Datasets: Integrative Science
Genomes
Protein structure
DNA/Protein arrays
Interaction Networks
Pathways
Metagenomics
Integrative Science
Systems Biology
Network Biology

Astro-Informatics: US National Virtual Observatory (NVO)
New Astronomy
Local vs. Distant Universe
Rare/exotic objects
Census of active galactic nuclei
Search extra-solar planets
Turn anyone into an astronomer

Ecological Informatics
Analyze complex ecological data from a highly distributed set of field stations, laboratories, research sites, and individual researchers

Geo-Informatics

Cheminformatics
Structural Descriptors
Physicochemical Descriptors
Topological Descriptors
Geometrical Descriptors
AAACCTCATAGGAAGCATACCAGGAATTACATCA…

Materials Informatics

Economics & Finance

World Wide Web

What is Data Mining?
The iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases.

What is Data Mining?
Valid: generalize to the future
Novel: what we don't know
Useful: be able to take some action
Understandable: leading to insight
Iterative: takes multiple passes
Interactive: human in the loop

Why Data Mining?
Massive amounts of data being collected in different disciplines
  Biology, Chemistry, Materials science, Astronomy, Ecology, Geology, Economics, and many more
Search for a systematic way to address the challenges across/at the intersection of the diverse fields
Leverage the unique strengths of each area
  Techniques from bioinformatics can be applied to other areas (like network intrusion detection)
  Game theory from Economics can be applied to problems in CS
  Database development in Astronomy can help Ecology applications
Enable Data-informatics: bio-, chem-, eco-, geo-, astro-, materials-informatics

Why Data Mining?
Dynamic nature of modern data sets: streams
Massive and distributed datasets: tera-/peta-scale
Various modalities:
  Tables
  Images
  Video
  Audio
  Text, hyper-text, “semantic” text
  Networks
  Spreadsheets
  Multi-lingual

Data Mining: Main Goals
Prediction: What? Opaque
Description: Why? Transparent
[Figure labels: Model; Age, Salary, CarType; High/Low Risk; outlier]

Data Mining: Main Techniques
Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g., 90% of the people who buy book X also buy book Y (10% of all shoppers buy both).
Sequence mining (categorical): discover sequences of events that commonly occur together, e.g., in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability.
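The two percentages in the book example correspond to what are usually called the confidence of a rule (90%) and its support (10%). A minimal sketch of how both could be computed from a list of shopping baskets is given below. The function names and the toy baskets are assumptions made for illustration; they are not taken from the slides.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the whole itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    return count(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical toy data: 90 baskets, 10 contain book X, 9 of those also contain Y.
baskets = [{"X", "Y"}] * 9 + [{"X"}] + [{"Z"}] * 80

rule_support = support(baskets, {"X", "Y"})          # share of shoppers buying both
rule_confidence = confidence(baskets, {"X"}, {"Y"})  # share of X buyers who also buy Y
```

With these baskets the rule X => Y has roughly 10% support and 90% confidence, matching the shape of the example on the slide.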

Data Mining: Main Techniques
Classification and regression: classification assigns a new data record to one of several predefined categories or classes; regression deals with predicting real-valued fields. Also called supervised learning.
Clustering: partition the dataset into subsets or groups such that elements of a group share a common set of properties, with high within-group similarity and low inter-group similarity. Also called unsupervised learning.
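To make the supervised/unsupervised contrast concrete, the sketch below pairs a k-nearest-neighbour classifier (which needs labels) with a basic k-means clustering loop (which does not). Both are chosen here only as simple representatives of each family; the slides do not prescribe them. The data points, the labels, and the choice to seed k-means with the first k points are illustrative assumptions.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Supervised: majority label among the k training points nearest to query.

    train is a list of (point, label) pairs; points are tuples of floats.
    """
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def kmeans(points, k, iterations=20):
    """Unsupervised: partition points into k groups around their means.

    Assumption for reproducibility: the first k points seed the centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Hypothetical labelled records (two well-separated groups).
labelled = [
    ((1.0, 1.0), "low"), ((5.0, 5.0), "high"),
    ((1.2, 0.8), "low"), ((5.2, 4.9), "high"),
    ((0.9, 1.1), "low"), ((4.8, 5.1), "high"),
]

predicted = knn_classify(labelled, (1.1, 1.0))  # uses the labels

points = [p for p, _ in labelled]               # labels discarded
assignment, centroids = kmeans(points, k=2)
```

The classifier can only answer in terms of the predefined classes it was shown, while k-means returns anonymous group indices whose meaning still has to be interpreted.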

Data Mining: Main Techniques
Deviation detection: find the record(s) most different from the other records, i.e., find all outliers. These may be thrown away as noise or may be the “interesting” ones.
Similarity search: given a database of objects and a “query” object, find the object(s) within a user-defined distance of the query object, or find all pairs within some distance of each other.
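Both tasks can be sketched in a few lines. The outlier test below uses a z-score cutoff, which is just one simple way to define "most different" and is an assumption of this example, not something the slides specify. The similarity search uses Euclidean distance as the user-defined distance. All data values are hypothetical.

```python
import math
import statistics


def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


def range_query(database, query, radius):
    """All objects within `radius` (Euclidean distance) of the query object."""
    return [obj for obj in database if math.dist(obj, query) <= radius]


def close_pairs(database, radius):
    """All unordered pairs of objects within `radius` of each other."""
    return [
        (a, b)
        for i, a in enumerate(database)
        for b in database[i + 1:]
        if math.dist(a, b) <= radius
    ]


# Hypothetical data.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(readings)

objects = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (10.0, 10.0)]
neighbours = range_query(objects, (0.0, 0.0), radius=2.0)
pairs = close_pairs(objects, radius=2.0)
```

Both search functions scan the whole database, which is fine for a sketch; large collections would normally need an index structure instead.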

Data Mining Process
[Figure: Original Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation -> Knowledge]

Data Mining Process
Understand application domain
  Prior knowledge, user goals
Create target dataset
  Select data, focus on subsets
Data cleaning and transformation
  Remove noise, outliers, missing values
  Select features, reduce dimensions

Data Mining Process
Apply data mining algorithm
  Associations, sequences, classification, clustering, etc.
Interpret, evaluate and visualize patterns
  What's new and interesting?
  Iterate if needed
Manage discovered knowledge
  Close the loop
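The process steps can be read as a chain of small functions, each consuming the output of the previous one. The sketch below is only an illustration of that shape: the records, field names, the age cutoff, and the deliberately trivial "mining" step (majority risk class per car type) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"age": 23, "salary": 30_000, "car": "sports", "risk": "high", "region": "N"},
    {"age": 45, "salary": 80_000, "car": "family", "risk": "low", "region": "S"},
    {"age": 31, "salary": None, "car": "sports", "risk": "high", "region": "N"},
    {"age": 52, "salary": 95_000, "car": "family", "risk": "low", "region": "E"},
    {"age": 27, "salary": 41_000, "car": "sports", "risk": "high", "region": "S"},
    {"age": 38, "salary": 62_000, "car": "truck", "risk": "low", "region": "N"},
]


def select(records, fields):
    """Create the target dataset: keep only the fields of interest."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: bucket age into a coarse band and drop salary."""
    return [
        {
            "age_band": "young" if r["age"] < 35 else "older",
            "car": r["car"],
            "risk": r["risk"],
        }
        for r in records
    ]


def mine(records, feature, target="risk"):
    """Toy mining step: the majority target value for each feature value."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[feature]][r[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def interpret(patterns, feature):
    """Render the patterns as readable rules for a human to evaluate."""
    return [f"{feature}={v} -> risk={risk}" for v, risk in sorted(patterns.items())]


target = select(RAW, ["age", "salary", "car", "risk"])
cleaned = clean(target)
transformed = transform(cleaned)
patterns = mine(transformed, "car")
report = interpret(patterns, "car")
```

In practice the loop is closed by a person: if the report is uninteresting, an earlier step (which fields to select, how to bucket age) is revised and the chain is run again.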

Components of Data Mining Methods
Representation: language for patterns/models, expressive power
Evaluation: scoring methods for deciding what is a good fit of model to data
Search: method for enumerating patterns/models
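The three components can be seen separately even in a very small learner. In the sketch below the representation is a single threshold on one numeric attribute, the evaluation is classification accuracy, and the search is an exhaustive scan over candidate thresholds. The attribute, labels, and data are hypothetical, and real methods use far richer choices for each component.

```python
def predict(threshold, x):
    """Representation: a model is one threshold; below it means 'high' risk."""
    return "high" if x < threshold else "low"


def accuracy(threshold, data):
    """Evaluation: fraction of records the threshold model labels correctly."""
    correct = sum(1 for x, label in data if predict(threshold, x) == label)
    return correct / len(data)


def best_threshold(data):
    """Search: enumerate midpoints between sorted distinct values, keep the best.

    Assumes the data contains at least two distinct attribute values.
    """
    xs = sorted({x for x, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))


# Hypothetical (age, risk) records.
data = [
    (22, "high"), (25, "high"), (29, "high"),
    (41, "low"), (47, "low"), (53, "low"),
]

best = best_threshold(data)
score = accuracy(best, data)
```

Swapping any one component (a decision tree instead of a threshold, a different score, a greedy instead of exhaustive search) gives a different method while the other two can stay the same.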

New Science of Data
New data models: dynamic, streaming, etc.
New mining, learning, and statistical algorithms that offer timely and reliable inference and information extraction: online, approximate
Self-aware, intelligent continuous data monitoring and management
Data and model compression
Data provenance
Data security and privacy
Data sensation: visual, aural, tactile
Knowledge validation: domain experts

Data Science
Core Areas
  Data Mining and Machine Learning
  Mathematical Modeling and Optimization
  Databases and Data Warehousing
  High Performance Computing
  Data Compression/Representation
  Statistics, Algebra, and Geometry
  Visualization, Sonification
  Social/ethical/legal Dimensions
Application Domains
  Biology, medicine, chemistry, astronomy, finance, economics, geology, environment, materials, large-scale simulations, national security, WWW

Course Topics
Exploratory Data Analysis (EDA):
  Multivariate statistics
  Numeric, Categorical
  Kernel Approach
  Graph Data Analysis
  High dimensional data
  Dimensionality reduction
Frequent Pattern Mining (FPM):
  Itemsets
  Sequences
  Graphs
Classification (CLASS):
  Decision trees
  Naïve Bayes
  Instance-based
  Rule-based
  Discriminant analysis
  Support vector machines (SVMs)
Clustering (CLUS):
  Partitional
  Probabilistic
  Hierarchical
  Density-based
  Subspace
  Spectral
  Graph clustering

Course Syllabus and Schedule
Main Course Page: http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Dmcourse/Main