Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU.

Presentation transcript:

Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU

Why Data Mining?
• Parkinson's Law of Data
  - Data expands to fill the space available for storage
• Disk-storage version of Moore's law
  - $\text{Capacity} \propto 2^{t / 18\,\text{months}}$
• Available data grows exponentially!

Outline
• Motivation of 3 challenges
  - More records (rows)
  - More attributes (columns)
  - More subject domains
• Some answers to the challenges
  - Thesis work
    * Generalized P-Tree structure
    * Kernel-based semi-naïve Bayes classification
  - KDD-cup 02/03 and with CSci 366 students
    * Data with graph relationship
• Outlook: Data with time dependence

Examples
• More records
  - Many stores save each transaction
  - Data warehouses keep historic data
  - Monitoring network traffic
  - Micro sensors / sensor networks
• More attributes
  - Items in a shopping cart
  - Keywords in text
  - Properties of a protein (multi-valued categorical)
• More subject domains
  - Data mining hype increases audience

Algorithmic Perspective
• More records
  - Standard scaling problem
• More attributes
  - Different algorithms needed for 1000 vs. 10 attributes
• More subject domains
  - New techniques needed
  - Joining of separate fields
  - Algorithms should be domain-independent
  - Need for experts does not scale well
    * Twice as many data sets, twice as many domain experts??
  - Ignore domain knowledge?
    * No! Formulate it systematically

Some Answers to Challenges
• Large data quantity (Thesis)
  - Many records
    * P-Tree concept and its generalization to non-spatial data
  - Many attributes
    * Algorithm that defies curse of dimensionality
• New techniques / Joining separate fields
  - Mining data on a graph
  - Outlook: Mining data with time dependence

Challenge 1: Many Records
• Typical question
  - How many records satisfy given conditions on attributes?
• Typical answer
  - In record-oriented database systems
    * Database scan: O(N)
  - Sorting / indexes?
    * Unsuitable for most problems
• P-Trees
  - Compressed bit-column-wise storage
  - Bit-wise AND replaces database scan
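
The idea of answering a counting query with bit-wise ANDs instead of a record scan can be sketched as follows. This is a minimal illustration only: it stores each bit column as an uncompressed Python integer and leaves out the compression and tree structure that P-Trees rely on. The function names and the `ages` data are hypothetical.

```python
from typing import List


def build_bit_columns(values: List[int], n_bits: int) -> List[int]:
    """Store an integer attribute column-wise: one bitmap per bit position.

    Bit `row` of columns[b] equals bit b of values[row].
    """
    columns = [0] * n_bits
    for row, value in enumerate(values):
        for b in range(n_bits):
            if (value >> b) & 1:
                columns[b] |= 1 << row
    return columns


def count_equal(columns: List[int], n_rows: int, target: int) -> int:
    """Count rows whose value equals `target` using only bit-wise operations."""
    all_rows = (1 << n_rows) - 1          # bitmap with one set bit per row
    result = all_rows
    for b, column in enumerate(columns):
        if (target >> b) & 1:
            result &= column              # rows where bit b is 1
        else:
            result &= ~column & all_rows  # rows where bit b is 0
    return bin(result).count("1")


# Hypothetical 3-bit attribute with six records.
ages = [5, 3, 5, 7, 5, 0]
age_columns = build_bit_columns(ages, n_bits=3)
matches = count_equal(age_columns, len(ages), target=5)
```

Without compression the AND still touches every bit, so any better-than-O(N) behaviour would have to come from the compressed representation, which this sketch does not model.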

P-Trees: Compression Aspect

P-Trees: Ordering Aspect
• Compression relies on long sequences of 0 or 1
• Images
  - Neighboring pixels are probably similar
  - Peano-ordering
• Other data?
  - Peano-ordering can be generalized
  - Peano-order sorting
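
One way to picture sorting non-spatial records in a Peano-like order is to interleave the bits of the attributes and sort by the resulting key. The sketch below assumes that "Peano order" here means bit interleaving (often called Z-order or Morton order); the generalization used in the thesis may differ in detail. The function name and the sample records are hypothetical.

```python
def interleave_bits(values, n_bits):
    """Build a sort key by interleaving attribute bits, most significant first.

    Records that agree on the high-order bits of all attributes get nearby
    keys, so they end up next to each other after sorting.
    """
    key = 0
    for b in range(n_bits - 1, -1, -1):
        for v in values:
            key = (key << 1) | ((v >> b) & 1)
    return key


# Hypothetical records with two 2-bit attributes each.
records = [(3, 0), (0, 0), (1, 1), (0, 1), (2, 3)]
sorted_records = sorted(records, key=lambda r: interleave_bits(r, n_bits=2))
```

After sorting, records with similar values sit together, which tends to produce the long runs of 0 or 1 in each bit column that compression depends on.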

Peano-Order Sorting

Impact of Peano-Order Sorting
• Speed improvement, especially for large data sets
• Less than O(N) scaling for all algorithms

So Far
• Answer to challenge 1: Many records
  - P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
  - Introduced effective generalization to non-spatial data (thesis)
• Challenge 2: Many attributes
  - Focus: Classification
  - Curse of dimensionality
  - Some algorithms suffer more than others

Curse of Dimensionality
• Many standard classification algorithms
  - E.g., decision trees, rule-based classification
  - For each attribute, 2 halves: relevant vs. irrelevant
  - How often can we divide by 2 before the small size of the "relevant" part makes results insignificant?
    * Inverse of doubling the number of rice grains for each square of the chess board
• Many domains have hundreds of attributes
  - Occurrence of terms in text mining
  - Properties of genes
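
The halving argument can be made concrete with a small calculation. The sketch assumes each attribute test keeps roughly half of the remaining records; the threshold of 30 records is an arbitrary illustrative choice, not a figure from the talk.

```python
def max_binary_splits(n_records: int, min_records: int) -> int:
    """Number of halvings possible while at least `min_records` remain."""
    splits = 0
    remaining = n_records
    while remaining // 2 >= min_records:
        remaining //= 2
        splits += 1
    return splits


# With one million records and a (hypothetical) minimum of 30 records
# per partition, only a handful of attributes can be used for splitting.
usable_attributes = max_binary_splits(1_000_000, 30)
```

Under these assumptions the answer grows only logarithmically with the number of records, so a data set with hundreds of attributes cannot use most of them in an algorithm that splits on one attribute at a time.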

Possible Solution
• Additive models
  - Each attribute contributes to a sum
  - Techniques exist (statistics)
    * Computationally intensive
• Simplest: Naïve Bayes
  - $x^{(k)}$ is the value of the $k$-th attribute
  - Considered an additive model
    * Logarithm of probability is additive
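
The additive view of Naïve Bayes can be sketched for categorical attributes as follows: the log-score of a class is the log prior plus one log-likelihood term per attribute. Laplace smoothing, the function names, and the toy data are assumptions made for the illustration and are not taken from the talk.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect the counts needed for categorical Naive Bayes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(label, k)][value] += 1
            domains[k].add(value)
    return class_counts, value_counts, domains


def log_score(row, label, model):
    """log P(class) + sum over k of log P(x^(k) | class): one term per attribute."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)
    for k, value in enumerate(row):
        count = value_counts[(label, k)][value] + 1      # Laplace smoothing (assumption)
        denom = class_counts[label] + len(domains[k])
        score += math.log(count / denom)
    return score


def predict(row, model):
    """Return the class with the highest additive log-score."""
    return max(model[0], key=lambda label: log_score(row, label, model))


# Hypothetical toy data: two categorical attributes, two classes.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
prediction = predict(("rainy", "cool"), model)
```

Because each attribute only adds a term to the sum, adding attributes does not partition the data further, which is the property the slide contrasts with splitting algorithms.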

Semi-Naïve Bayes Classifier
• Correlated attributes are joined
  - Has been done for categorical data
    * Kononenko '91, Pazzani '96
  - Previously: Continuous data discretized
• New (thesis)
  - Kernel-based evaluation of correlation

Results
• Error decrease in units of standard deviation for different parameter sets
• Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)

So Far
• Answer to challenge 1: More records
  - Generalized P-tree structure
• Answer to challenge 2: More attributes
  - Additive algorithms
  - Example: Kernel-based semi-naïve Bayes
• Challenge 3: More subject domains
  - Data on a graph
  - Outlook: Data with time dependence

Standard Approach to Data Mining
• Conversion to a relation (table)
  - Domain knowledge goes into table creation
  - Standard table can be mined with standard tools
• Does that solve the problem?
  - To some degree, yes
  - But we can do better

“Everything should be made as simple as possible, but not simpler” Albert Einstein

Claim: Representation as single relation is not rich enough
• Example: Contribution of a graph structure to standard mining problems
  - Genomics
    * Protein-protein interactions
  - WWW
    * Link structure
  - Scientific publications
    * Citations
(Scientific American 05/03)

Data on a Graph: Old Hat?
• Common topics
  - Analyze edge structure
    * Google
    * Biological networks
  - Sub-graph matching
    * Chemistry
  - Visualization
    * Focus on graph structure
• Our work
  - Focus on mining node data
  - Graph structure provides connectivity

Protein-Protein Interactions
• Protein data
  - From Munich Information Center for Protein Sequences (also KDD-cup 02)
  - Hierarchical attributes
    * Function
    * Localization
    * Pathways
  - Gene-related properties
• Interactions
  - From experiments
  - Undirected graph

Questions
• Prediction of a property (KDD-cup 02: AHR*)
  - Which properties in neighbors are relevant?
  - How should we integrate neighbor knowledge?
• What are interesting patterns?
  - Which properties say more about neighboring nodes than about the node itself?
  - But not:
*AHR: Aryl Hydrocarbon Receptor Signaling Pathway

Possible Representations
• OR-based
  - At least one neighbor has property
  - Example: Neighbor essential true
• AND-based
  - All neighbors have property
  - Example: Neighbor essential false
• Path-based (depends on maximum hops)
  - One record for each path
  - Classification: weighting?
  - Association Rule Mining: Record base changes
[Figure: example graph with nodes labeled AHR, essential, and not essential]
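
The OR-based and AND-based representations can be sketched as derived boolean attributes computed from an undirected interaction graph. The function name, the toy graph, and the choice to treat a node without neighbors as AND-false are assumptions for this illustration; the path-based representation is not covered.

```python
def neighbor_features(edges, has_property):
    """Derive OR-based and AND-based neighbor attributes for every node.

    edges: iterable of (node, node) pairs of an undirected graph.
    has_property: dict mapping each node to True/False for one property.
    """
    neighbors = {node: set() for node in has_property}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    features = {}
    for node, adjacent in neighbors.items():
        flags = [has_property[n] for n in adjacent]
        features[node] = {
            "or": any(flags),                    # at least one neighbor has the property
            "and": bool(flags) and all(flags),   # all neighbors have it (False if isolated)
        }
    return features


# Hypothetical interaction graph; "essential" is the node property.
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4")]
essential = {"p1": False, "p2": True, "p3": True, "p4": False, "p5": True}
features = neighbor_features(edges, essential)
```

Both representations keep one record per node, so the resulting table can be passed to standard tools; only the path-based variant changes the record base.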

Association Rule Mining
• OR-based representation
• Conditions
  - Association rule involves AHR
  - Support across a link greater than within a node
  - Conditions on minimum confidence and support
  - Top 3 with respect to support: (Results by Christopher Besemann, project CSci 366)
    * AHR → essential
    * AHR → nucleus (localization)
    * AHR → transcription (function)
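
The comparison of support "across a link" and "within a node" can be sketched with the usual definitions of support and confidence, applied once to a neighbor attribute (OR-based) and once to the node's own attribute. The data below are invented for the illustration and do not reproduce the project results; the function name is hypothetical.

```python
def rule_stats(antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Both arguments are boolean lists aligned by record (one entry per node).
    """
    n = len(antecedent)
    both = sum(a and c for a, c in zip(antecedent, consequent))
    n_antecedent = sum(antecedent)
    support = both / n
    confidence = both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical data for five proteins.
is_ahr = [True, True, False, True, False]
is_essential = [False, False, True, False, True]            # property of the node itself
neighbor_essential = [True, True, False, False, True]       # OR-based neighbor attribute

across_link = rule_stats(is_ahr, neighbor_essential)        # AHR -> neighbor essential
within_node = rule_stats(is_ahr, is_essential)              # AHR -> essential (same node)
keep_rule = across_link[0] > within_node[0]                 # support across link is larger
```

A rule would additionally have to pass the minimum support and confidence thresholds mentioned on the slide, which are not specified there and are therefore left out.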

Classification Results
• Problem (especially path-based representation)
  - Varying amount of information per record
  - Many algorithms unsuitable in principle
    * E.g., algorithms that divide domain space
• KDD-cup 02
  - Very simple additive model
  - Based on visually identifying relationship
  - Number of interacting essential genes adds to probability of predicting protein as AHR

KDD-Cup 02: Honorable Mention NDSU Team

Outlook: Time-Dependent Data
• KDD-cup 03
  - Prediction of citations of scientific papers
  - Old: Time-series prediction
  - New: Combination with similarity-based prediction

Conclusions and Outlook
• Many exciting problems in data mining
• Various challenges
  - Scaling of existing algorithms (more records)
  - Different properties in algorithms become relevant (more attributes)
  - Identifying and solving new domain-independent challenges (more subject areas)
• Examples of general structural components that apply to many domains
  - Graph-structure
  - Time-dependence
  - Relationships between attributes