1
Data Mining and Data Warehousing of Many-to-Many Relationships and some Applications William Perrizo Dept of Computer Science North Dakota State Univ.
2
Why Mine Data? Parkinson's Law of Data: data expands to fill the available storage (and then some). Disk-storage version of Moore's Law: Capacity ∝ 2^(t / 18 months), i.e., available storage doubles every 18 months!
3
More's Law = More's Less: The more volume one has, the less information one has. (AKA: Shannon's Canon) A simple illustration: which phone book is more helpful?
BOOK-1: Smith 234-9816; Jones 231-7237
BOOK-2: Smith 234-9816; Smith 231-7237; Jones 234-9816; Jones 231-7237
4
We all have volumes of data, and soon we will have volumes more! The EROS Data Center (EDC) in the USA archives Earth Observing System satellite data for the US Government; they expect 10 petabytes by 2005. The Sloan Digital Sky Survey (aggregated astronomical data) will exceed that by many orders of magnitude. Sensor networks will collect unheard-of data volumes. The WWW!! Micro-arrays, gene chips, and genome sequencing successes are creating potentially life-saving data at a torrid pace. Homeland security data is voluminous and vital (no matter where your homeland happens to be), but the information MUST be teased out of it. That's where data mining comes in!
5
Data Mining and Querying: Most people have data, but want information. Sometimes they get that information using simple query engines, provided they know exactly what they want and how to ask for it; otherwise, data mining is required. But in fact data mining is useful anyway, because there is almost always a wealth of useful information in your data that you cannot query because you don't know it's there. There is a whole spectrum of techniques for getting information from data. Even on the query end, much work remains to solve the problem of delivering standard workload answers quickly (D. DeWitt, ACM SIGMOD '02). On the data mining end, we have barely scratched the surface, but those scratches have in some cases already made a HUGE difference: the difference between becoming the biggest corporation in the world and filing for bankruptcy (Walmart vs. Kmart). [Spectrum diagram: from standard querying (SQL SELECT-FROM-WHERE; complex queries with nesting, EXISTS) through simple searching and aggregating (fuzzy queries, search engines, BLAST searches; OLAP rollup, drilldown, slice/dice) to machine learning and data mining (supervised learning: classification, regression; unsupervised learning: clustering, association rule mining; data prospecting?).]
6
What is Data Mining? Querying asks specific questions and expects specific answers. Data mining "goes into the MOUNTAIN of DATA and returns information gems" (but also, likely, much fool's gold; relevance and interestingness analyses serve as assays to pick out the valuable information gems).
7
Outline Motivation of 3 challenges More records (rows) More attributes (columns) New subject domains Some answers to the challenges Thesis work Generalized P-Tree structure Kernel-based semi-naïve Bayes classification KDD-cup 02/03 and with Csci 366 students Data with graph relationship Outlook: Data with time dependence
8
Examples More records Many stores save each transaction Data warehouses keep historic data Monitoring network traffic Micro sensors / sensor networks More attributes Items in a shopping cart Keywords in text Properties of a protein (multi-valued categorical) New subject domains Data mining hype increases audience
9
Algorithmic Perspective More records Standard scaling problem More attributes Different algorithms needed for 1000 vs. 10 attributes New subject domains New techniques needed Joining of separate fields Algorithms should be domain-independent Need for experts does not scale well Twice as many data sets Twice as many domain experts?? Ignore domain knowledge? No! Formulate it systematically
10
Some Answers to Challenges Large data quantity (Thesis) Many records P-Tree concept and its generalization to non-spatial data Many attributes Algorithm that defies curse of dimensionality New techniques / Joining separate fields Mining data on a graph Outlook: Mining data with time dependence
11
Challenge 1: Many Records Typical question How many records satisfy given conditions on attributes? Typical answer In record-oriented database systems Database scan: O(N) Sorting / indexes? Unsuitable for most problems P-Trees Compressed bit-column-wise storage Bit-wise AND replaces database scan
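The bit-column idea on this slide can be sketched in a few lines. This is only an illustration of the principle behind P-trees (bit-column storage plus bitwise AND instead of a record scan), not the compressed quadrant-tree structure itself; the function names are made up for the example.

```python
def bit_columns(values, bits):
    """Decompose integer values into per-bit-position columns.
    Column b holds, as a Python int bitmask, which records have bit b set."""
    cols = [0] * bits
    for row, v in enumerate(values):
        for b in range(bits):
            if (v >> b) & 1:
                cols[b] |= 1 << row
    return cols

def count_equal(cols, target, bits, n_records):
    """Count records equal to `target` using only bitwise ANDs and a popcount,
    replacing the O(N) database scan of a record-oriented system."""
    all_rows = (1 << n_records) - 1
    mask = all_rows  # start with every record selected
    for b in range(bits):
        col = cols[b]
        mask &= col if (target >> b) & 1 else (~col & all_rows)
    return bin(mask).count("1")

values = [5, 3, 5, 7, 5, 1]
cols = bit_columns(values, bits=3)
print(count_equal(cols, 5, bits=3, n_records=len(values)))  # 3
```

In the real P-tree, each bit column is additionally stored as a compressed tree of pure-0/pure-1 runs, so the AND operates on the compressed form.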
12
P-Trees: Compression Aspect
13
P-Trees: Ordering Aspect Compression relies on long sequences of 0 or 1 Images Neighboring pixels are probably similar Peano-ordering Other data? Peano-ordering can be generalized Peano-order sorting
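One common concrete form of Peano-style ordering is bit interleaving (Morton/Z-order). As a hedged sketch of the idea, not the thesis implementation: sorting records by an interleaved key keeps spatially close points close in the one-dimensional record order, producing the long runs of 0s and 1s that compress well.

```python
def z_order_key(x, y, bits=8):
    """Interleave the bits of (x, y) into a single Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)      # x bits in even positions
        key |= ((y >> b) & 1) << (2 * b + 1)  # y bits in odd positions
    return key

points = [(3, 5), (0, 0), (1, 0), (0, 1), (7, 7)]
points.sort(key=lambda p: z_order_key(*p))
print(points)  # [(0, 0), (1, 0), (0, 1), (3, 5), (7, 7)]
```

For non-spatial data the same trick applies after choosing an attribute order, which is the generalization the slide alludes to.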
14
Peano-Order Sorting
15
Impact of Peano-Order Sorting: speed improvement, especially for large data sets; better-than-O(N) (sub-linear) scaling for all algorithms.
16
So Far Answer to challenge 1: Many records P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan) Introduced effective generalization to non-spatial data (thesis) Challenge 2: Many attributes Focus: Classification Curse of dimensionality Some algorithms suffer more than others
17
Curse of Dimensionality: Many standard classification algorithms (e.g., decision trees, rule-based classification) split each attribute into two halves: relevant and irrelevant. How often can we divide by 2 before the shrinking size of the "relevant" part makes results insignificant? (This is the inverse of doubling the number of rice grains for each square of the chessboard.) Many domains have hundreds of attributes: occurrence of terms in text mining, properties of genes.
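The halving argument above can be made concrete: if each of d attributes keeps only the "relevant" half of the data, the surviving fraction is 2**-d. The record count below is an arbitrary illustration.

```python
def surviving_fraction(n_attributes):
    """Fraction of records left after halving once per attribute."""
    return 0.5 ** n_attributes

n_records = 1_000_000
for d in (10, 20, 30):
    # with 20 attributes, a million records already shrink below one
    print(d, n_records * surviving_fraction(d))
```

So even a modest 20-attribute problem exhausts a million-record data set, which is why halving-style algorithms suffer while additive models do not.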
18
Possible Solution: additive models, in which each attribute contributes to a sum. Techniques exist (in statistics) but are computationally intensive. The simplest is naïve Bayes, where x^(k) is the value of the k-th attribute. It is considered an additive model because the logarithm of the probability is additive: log P(c | x) = log P(c) + Σ_k log P(x^(k) | c) + const.
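The additivity can be seen directly in a minimal categorical naïve Bayes sketch. This is an illustration, not the thesis classifier; the Laplace smoothing and the toy weather data are additions for the example.

```python
import math
from collections import Counter, defaultdict

def train(records, labels):
    """Count class priors and per-(attribute, class) value frequencies."""
    classes = Counter(labels)
    cond = defaultdict(Counter)  # (attr index, class) -> value counts
    for rec, c in zip(records, labels):
        for k, v in enumerate(rec):
            cond[(k, c)][v] += 1
    return classes, cond

def log_score(rec, c, classes, cond, alpha=1.0):
    """log P(c) + sum_k log P(x^(k) | c): one additive term per attribute."""
    total = sum(classes.values())
    score = math.log(classes[c] / total)
    for k, v in enumerate(rec):
        counts = cond[(k, c)]
        # Laplace smoothing so unseen values don't give log(0)
        score += math.log((counts[v] + alpha) /
                          (classes[c] + alpha * (len(counts) + 1)))
    return score

records = [("sun", "hot"), ("sun", "mild"), ("rain", "mild"), ("rain", "cold")]
labels = ["out", "out", "in", "in"]
classes, cond = train(records, labels)
pred = max(classes, key=lambda c: log_score(("sun", "hot"), c, classes, cond))
print(pred)  # "out"
```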
19
Semi-Naïve Bayes Classifier Correlated attributes are joined Has been done for categorical data Kononenko ’91, Pazzani ’96 Previously: Continuous data discretized New (thesis) Kernel-based evaluation of correlation
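The slides do not spell out the kernel-based correlation test, so the following is only a hedged guess at the flavor of such a check: estimate the joint density of two continuous attributes with a Gaussian product kernel and compare it against the product of the marginal kernel estimates; where the joint estimate consistently exceeds the independence baseline, the attributes are candidates for joining. All names and the toy data are invented for the sketch.

```python
import math

def kde1(xs, x, h=0.5):
    """1-D Gaussian kernel density estimate at point x."""
    return sum(math.exp(-((x - xi) / h) ** 2 / 2) for xi in xs) / (
        len(xs) * h * math.sqrt(2 * math.pi))

def kde2(pairs, x, y, h=0.5):
    """2-D Gaussian product-kernel density estimate at (x, y)."""
    return sum(math.exp(-(((x - a) / h) ** 2 + ((y - b) / h) ** 2) / 2)
               for a, b in pairs) / (len(pairs) * h * h * 2 * math.pi)

# Perfectly correlated toy data: y = x
pairs = [(i / 10, i / 10) for i in range(20)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

# On the diagonal, the joint density exceeds the independence baseline
ratio = kde2(pairs, 1.0, 1.0) / (kde1(xs, 1.0) * kde1(ys, 1.0))
print(ratio > 1.0)  # True
```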
20
Results Error decrease in units of standard deviation for different parameter sets Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)
21
So Far Answer to challenge 1: More records Generalized P-tree structure Answer to challenge 2: More attributes Additive algorithms Example: Kernel-based semi-naïve Bayes Challenge 3: New subject domains Data on a graph Outlook: Data with time dependence
22
Standard Approach to Data Mining Conversion to a relation (table) Domain knowledge goes into table creation Standard table can be mined with standard tools Does that solve the problem? To some degree, yes But we can do better
23
“Everything should be made as simple as possible, but not simpler” Albert Einstein
24
Claim: Representation as single relation is not rich enough Example: Contribution of a graph structure to standard mining problems Genomics Protein-protein interactions WWW Link structure Scientific publications Citations Scientific American 05/03
25
Data on a Graph: Old Hat? Common Topics Analyze edge structure Google Biological Networks Sub-graph matching Chemistry Visualization Focus on graph structure Our work Focus on mining node data Graph structure provides connectivity
26
Protein-Protein Interactions Protein data From Munich Information Center for Protein Sequences (also KDD-cup 02) Hierarchical attributes Function Localization Pathways Gene-related properties Interactions From experiments Undirected graph
27
Questions: Prediction of a property (KDD-cup 02: AHR*). Which properties of neighbors are relevant? How should we integrate neighbor knowledge? What are interesting patterns, i.e., which properties say more about neighboring nodes than about the node itself? (*AHR: Aryl Hydrocarbon Receptor signaling pathway)
28
Possible Representations: OR-based: at least one neighbor has the property (example: neighbor essential = true). AND-based: all neighbors have the property (example: neighbor essential = false). Path-based (depends on the maximum number of hops): one record for each path; for classification, how to weight? For association rule mining, the record base changes. [Diagram: an AHR node linked to essential and not-essential neighbors.]
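The OR-based and AND-based representations above reduce to a one-line aggregation per node. A minimal sketch with an invented four-node toy graph (not the MIPS protein data):

```python
# Boolean node property, e.g. "essential", on an undirected graph
essential = {"A": True, "B": False, "C": True, "D": False}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]

# Build the adjacency structure
neighbors = {n: set() for n in essential}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# OR-based: at least one neighbor has the property
or_feature = {n: any(essential[m] for m in neighbors[n]) for n in essential}
# AND-based: every neighbor has the property
and_feature = {n: all(essential[m] for m in neighbors[n]) for n in essential}

print(or_feature)   # {'A': True, 'B': True, 'C': True, 'D': True}
print(and_feature)  # {'A': False, 'B': True, 'C': False, 'D': True}
```

The path-based representation would instead emit one record per path up to the hop limit, multiplying the record base, which is the weighting problem the slide raises.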
29
Association Rule Mining (OR-based representation). Conditions: the association rule involves AHR; support across a link is greater than support within a node; minimum confidence and support thresholds hold. Top 3 rules with respect to support: AHR and essential, AHR and nucleus (localization), AHR and transcription (function). (Results by Christopher Besemann, project CSci 366.)
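Support and confidence, the two thresholds the slide applies, can be computed directly over boolean records. The tiny record set below is illustrative only, not the actual protein-interaction results.

```python
# Toy boolean records; 1 = the item holds for that record
records = [
    {"AHR": 1, "essential": 1},
    {"AHR": 1, "essential": 1},
    {"AHR": 0, "essential": 1},
    {"AHR": 0, "essential": 0},
]

def support(items):
    """Fraction of records in which every listed item holds."""
    return sum(all(r[i] for i in items) for r in records) / len(records)

def confidence(lhs, rhs):
    """Conditional frequency of rhs given lhs: support(lhs+rhs)/support(lhs)."""
    return support(lhs + rhs) / support(lhs)

print(support(["AHR", "essential"]))          # 0.5
print(confidence(["essential"], ["AHR"]))     # 2/3
```

In the OR-based graph setting, "support across a link" applies the same counts to record pairs joined by an edge rather than to single records.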
30
Classification Results: A problem (especially for the path-based representation) is the varying amount of information per record; many algorithms are unsuitable in principle, e.g., algorithms that divide the domain space. KDD-cup 02: a very simple additive model, based on a visually identified relationship: the number of interacting essential genes adds to the probability of predicting a protein as AHR.
31
KDD-Cup 02: Honorable Mention NDSU Team
32
Outlook: Time-Dependent Data KDD-cup 03 Prediction of citations of scientific papers Old: Time-series prediction New: Combination with similarity-based prediction
33
Conclusions and Outlook Many exciting problems in data mining Various challenges Scaling of existing algorithms (more records) Different types of algorithms gain importance (more attributes) Identifying and solving new challenges in a domain-independent way (new subject areas) Examples of general structural components that apply to many domains Graph-structure Time-dependence Relationships between attributes Software engineering aspects Software design of scientific applications Rows vs. columns