Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU.

Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU

Why Data Mining?  Parkinson’s Law of Data Data expands to fill the space available for storage  Disk-storage version of Moore’s law Capacity  2 t / 18 months  Available data grows exponentially!

Outline  Motivation of 3 challenges More records (rows) More attributes (columns) More subject domains  Some answers to the challenges Thesis work  Generalized P-Tree structure  Kernel-based semi-naïve Bayes classification KDD-cup 02/03 and with Csci 366 students  Data with graph relationship  Outlook: Data with time dependence

Examples  More records Many stores save each transaction Data warehouses keep historic data Monitoring network traffic Micro sensors / sensor networks  More attributes Items in a shopping cart Keywords in text Properties of a protein (multi-valued categorical)  More subject domains Data mining hype increases audience

Algorithmic Perspective  More records Standard scaling problem  More attributes Different algorithms needed for 1000 vs. 10 attributes  More subject domains New techniques needed Joining of separate fields Algorithms should be domain-independent Need for experts does not scale well  Twice as many data sets Twice as many domain experts?? Ignore domain knowledge?  No! Formulate it systematically

Some Answers to Challenges  Large data quantity (Thesis) Many records  P-Tree concept and its generalization to non-spatial data Many attributes  Algorithm that defies curse of dimensionality  New techniques / Joining separate fields Mining data on a graph Outlook: Mining data with time dependence

Challenge 1: Many Records  Typical question How many records satisfy given conditions on attributes?  Typical answer In record-oriented database systems  Database scan: O(N) Sorting / indexes?  Unsuitable for most problems  P-Trees Compressed bit-column-wise storage Bit-wise AND replaces database scan

P-Trees: Compression Aspect

P-Trees: Ordering Aspect  Compression relies on long sequences of 0 or 1  Images Neighboring pixels are probably similar Peano-ordering  Other data? Peano-ordering can be generalized Peano-order sorting

Peano-Order Sorting

Impact of Peano-Order Sorting  Speed improvement especially for large data sets  Less than O(N) scaling for all algorithms

So Far  Answer to challenge 1: Many records P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan) Introduced effective generalization to non-spatial data (thesis)  Challenge 2: Many attributes Focus: Classification Curse of dimensionality Some algorithms suffer more than others

Curse of Dimensionality  Many standard classification algorithms E.g., decision trees, rule-based classification For each attribute 2 halves: relevant  irrelevant How often can we divide by 2 before small size of “relevant” part makes results insignificant?  Inverse of Double number of rice grains for each square of the chess board  Many domains have hundreds of attributes Occurrence of terms in text mining Properties of genes

Possible Solution  Additive models Each attribute contributes to a sum Techniques exist (statistics)  Computationally intensive  Simplest: Naïve Bayes x (k) is value of k th attribute Considered additive model  Logarithm of probability additive

Semi-Naïve Bayes Classifier  Correlated attributes are joined Has been done for categorical data  Kononenko ’91, Pazzani ’96  Previously: Continuous data discretized  New (thesis) Kernel-based evaluation of correlation

Results  Error decrease in units of standard deviation for different parameter sets  Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)

So Far  Answer to challenge 1: More records Generalized P-tree structure  Answer to challenge 2: More attributes Additive algorithms Example: Kernel-based semi-naïve Bayes  Challenge 3: More subject domains Data on a graph Outlook: Data with time dependence

Standard Approach to Data Mining  Conversion to a relation (table) Domain knowledge goes into table creation Standard table can be mined with standard tools  Does that solve the problem? To some degree, yes But we can do better

“Everything should be made as simple as possible, but not simpler” Albert Einstein

Claim: Representation as single relation is not rich enough  Example: Contribution of a graph structure to standard mining problems Genomics  Protein-protein interactions WWW  Link structure Scientific publications  Citations Scientific American 05/03

Data on a Graph: Old Hat?  Common Topics Analyze edge structure  Google  Biological Networks Sub-graph matching  Chemistry Visualization  Focus on graph structure  Our work Focus on mining node data Graph structure provides connectivity

Protein-Protein Interactions  Protein data From Munich Information Center for Protein Sequences (also KDD-cup 02) Hierarchical attributes  Function  Localization  Pathways Gene-related properties  Interactions From experiments Undirected graph

Questions  Prediction of a property (KDD-cup 02: AHR*) Which properties in neighbors are relevant? How should we integrate neighbor knowledge?  What are interesting patterns? Which properties say more about neighboring nodes than about the node itself? But not: *AHR: Aryl Hydrocarbon Receptor Signaling Pathway

AHR Possible Representations  OR-based At least one neighbor has property Example: Neighbor essential true  AND-based All neighbors have property Example: Neighbor essential false  Path-based (depends on maximum hops) One record for each path Classification: weighting? Association Rule Mining: Record base changes essential AHR essential AHR not essential

Association Rule Mining  OR-based representation  Conditions Association rule involves AHR Support across a link greater than within a node Conditions on minimum confidence and support Top 3 with respect to support: (Results by Christopher Besemann, project CSci 366) AHR  essential AHR  nucleus (localization) AHR  transcription (function)

Classification Results  Problem (especially path-based representation) Varying amount of information per record Many algorithms unsuitable in principle  E.g., algorithms that divide domain space  KDD-cup 02 Very simple additive model Based on visually identifying relationship Number of interacting essential genes adds to probability of predicting protein as AHR

KDD-Cup 02: Honorable Mention NDSU Team

Outlook: Time-Dependent Data  KDD-cup 03 Prediction of citations of scientific papers Old: Time-series prediction New: Combination with similarity-based prediction

Conclusions and Outlook  Many exciting problems in data mining  Various challenges Scaling of existing algorithms (more records) Different properties in algorithms become relevant (more attributes) Identifying and solving new domain- independent challenges (more subject areas)  Examples of general structural components that apply to many domains Graph-structure Time-dependence Relationships between attributes

Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU.

Similar presentations

Presentation on theme: "Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU.

Similar presentations

Presentation on theme: "Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU."— Presentation transcript:

Similar presentations

About project

Feedback