Vertical Data


In data processing, you run up against two curses immediately.
Curse of cardinality: solutions don’t scale well with respect to record volume. "Files are too deep!"
Curse of dimensionality: solutions don’t scale with respect to attribute dimension. "Files are too wide!"
The curse of cardinality is a problem in both the horizontal and vertical data worlds. In the horizontal data world it was disguised as the “curse of slow joins”: we decompose relations to get a good design (e.g., 3rd normal form), but then we pay for that by requiring many slow joins to get the answers we need.

Techniques to address these curses:
- Horizontal Processing of Vertical Data (HPVD), instead of the ubiquitous Vertical Processing of Horizontal (record-oriented) Data (VPHD).
- Parallelizing the processing engine:
  - Parallelize the software engine on clusters of computers.
  - Parallelize the greyware engine on clusters of people (i.e., enable visualization and use the web...).
Again, we need better techniques for data analysis, querying and mining because of:
- Parkinson’s Law: data volume expands to fill available data storage.
- Moore’s law (as applied here to storage): available storage doubles every 9 months!

A few HPVD successes:
1. Precision Agriculture
Yield prediction using Remotely Sensed Imagery (RSI). The dataset consists of an aerial photograph (RGB TIFF image taken ~July) and a synchronized crop yield map taken at harvest; thus, 4 feature attributes (B, G, R, Y) and ~100,000 pixels.
[Figure: TIFF image and yield map]
Producers are able to analyze the color intensity patterns from aerial and satellite photos taken in mid-season to predict yield (find associations between electromagnetic reflection and yield). E.g., “hi_green & low_red → hi_yield”. That is very intuitive.
A stronger association, “hi_NIR & low_red → hi_yield” (found through HPVD data mining), allows producers to take and query mid-season aerial photographs for low_NIR & high_red grid cells and, where low yield is anticipated, apply (top dress) additional nitrogen.
Can producers use Landsat images of China to predict wheat prices before planting?
2. Infestation Detection (e.g., Grasshopper Infestation Prediction, again involving RSI)
Grasshoppers cause significant economic loss each year. Early infestation prediction is key to damage control. Pixel classification on remotely sensed imagery holds much promise for early detection.
Pixel classification (signaturing) has many, many applications: pest detection, flood monitoring, fire detection, wetlands monitoring, …
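A rule such as “hi_green & low_red → hi_yield” can be scored by counting pixels. The sketch below is only an illustration: the function name rule_stats, the threshold of 128, and the four sample pixels are all assumptions, and support and confidence are the usual association-rule measures rather than anything the slides specify.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Dict[str, int]  # band name -> intensity (hypothetical 0..255 scale)


def rule_stats(
    pixels: List[Pixel],
    antecedent: Callable[[Pixel], bool],
    consequent: Callable[[Pixel], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(pixels)
    matches_a = [p for p in pixels if antecedent(p)]
    matches_both = [p for p in matches_a if consequent(p)]
    support = len(matches_both) / n if n else 0.0
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return support, confidence


# Hypothetical sample: G = green, R = red, Y = yield, all made-up values.
pixels = [
    {"G": 200, "R": 40, "Y": 210},
    {"G": 190, "R": 60, "Y": 180},
    {"G": 220, "R": 30, "Y": 90},
    {"G": 50, "R": 200, "Y": 60},
]

THRESHOLD = 128  # assumed cut-off between "low" and "hi"

support, confidence = rule_stats(
    pixels,
    antecedent=lambda p: p["G"] >= THRESHOLD and p["R"] < THRESHOLD,
    consequent=lambda p: p["Y"] >= THRESHOLD,
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

This version scans records one at a time (the VPHD style); the point of the later slides is that the same counts can be obtained by ANDing vertical bit slices instead.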

3. Sensor Network Data
Micro and nano scale sensor blocks are being developed for sensing:
- Biological agents
- Chemical agents
- Motion detection
- Coatings deterioration
- RF-tagging of inventory (RFID tags for Supply Chain Mgmt)
- Structural materials fatigue
There will be trillions++ of individual sensors creating mountains of data, which can be data mined using HPVD (maybe it shouldn't be called a success yet?).

4. A Sensor Network Application: CubE for Active Situation Replication (CEASR)
Nano-sensors are dropped into the situation space. Wherever a threshold level is sensed (chem, bio, thermal...), a ping is registered in a compressed structure (P-tree; detailed definition coming up) for that location. Each energized nano-sensor transmits a ping (location is triangulated from the ping).
The single compressed structure (P-tree) containing all the information is transmitted to the cube, where the pattern is reconstructed (uncompress, display). These locations are then translated to 3-dimensional coordinates at the display, and the corresponding voxel on the display lights up. The soldier sees a replica of the sensed situation prior to entering the space.
This is the expendable, one-time, cheap sensor version. A more sophisticated CEASR device could sense and transmit the intensity levels, lighting up the display voxel with the same intensity.
Using Alien Technology’s Fluidic Self-assembly (FSA) technology, clear plexiglass laminates are joined into a cube, with an embedded nano-LED at each voxel.
[Figure: situation space with nano-sensors, carrier, and CEASR cube]

5. Anthropology Application
Digital Archive Network for Anthropology (DANA): analyze, query and mine anthropological artifacts (shape, color, discovery location, …).

What has spawned these successes? (i.e., What is Data Mining?)
Querying is asking specific questions for specific answers.
Data Mining is finding the patterns that exist in data (going into MOUNTAINS of raw data for the information gems hidden in that mountain of data).
[Figure: the data mining process, with these labels]
- Raw data: must be cleaned of missing items, outliers, noise, errors
- Data Warehouse: cleaned, integrated, read-only, periodic, historical database
- Task-relevant Data Selection: feature extraction, tuple selection
- Data Mining: Classification, Clustering, Rule Mining
- Pattern Evaluation and Assay; visualization
- Loop backs
- Smart files

Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
- Standard querying: SQL SELECT FROM WHERE; complex queries (nested, EXISTS, ...)
- Searching and aggregating: FUZZY query, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice, ...)
- Machine learning: supervised learning (classification, regression); unsupervised learning (clustering)
- Data mining: association rule mining, data prospecting, fractals, …
Even on the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02). On the data mining end, the surface has barely been scratched. But even those scratches have had a great impact. For example, one of the early scratchers became the biggest corporation in the world, while a non-scratcher had to file for bankruptcy protection (Walmart vs. KMart).
HPVD approach: vertical, compressed data structures, Predicate-trees or Peano-trees (Ptrees in either case) [1], processed horizontally (most DBMSs process horizontal data vertically). Ptrees are data-mining-ready, compressed data structures, which attempt to address the curse of cardinality and the curse of dimensionality.
[1] Ptree technology is patented by North Dakota State University.

Predicate trees (Ptrees)
Given a table R(A1 A2 A3 A4) structured into horizontal records (which are traditionally processed vertically: VPHD), vertically project each attribute, R[A1] R[A2] R[A3] R[A4]; then vertically project each bit position of each attribute, giving the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43; then compress each bit slice into a basic 1-dimensional Ptree, P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43.
[Figure: table R in base 10 and base 2, its twelve bit slices, and the corresponding 1-dimensional Ptrees]
Top-down construction of the 1-dimensional Ptree of R11, denoted P11: record the truth of the universal predicate pure1 in a tree, recursively on halves (1/2^1 subsets), until purity is achieved. E.g., compression of R11 into P11 goes as follows:
1. Whole is pure1? false → 0
2. Left half pure1? false → 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0
5. Right half of right half pure1? true → 1
VPHD: to find the number of occurrences of a given tuple, scan the horizontally structured records vertically (the count in the example is 2).
HPVD: to find the number of occurrences of the same tuple, AND these basic Ptrees (next slide).
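The top-down construction can be sketched as a short recursive function. This is an illustration under stated assumptions, not the patented implementation: the node layout (flag, children), the function name build_ptree, and the sample bit slice 0 0 0 0 1 0 1 1 are all hypothetical, and the slice length is assumed to be a power of two.

```python
from typing import Optional, Sequence, Tuple

# A node is (pure1_flag, children). children is None when the branch ends
# (the segment is pure1 or pure0); otherwise it is a (left, right) pair.
Node = Tuple[int, Optional[tuple]]


def build_ptree(bits: Sequence[int]) -> Node:
    """Record the truth of the pure1 predicate recursively on halves."""
    if all(bits):
        return (1, None)  # pure1: flag is 1, branch ends
    if not any(bits):
        return (0, None)  # pure1 is false, but the segment is pure0: branch ends
    mid = len(bits) // 2  # mixed segment: pure1 is false, recurse on halves
    return (0, (build_ptree(bits[:mid]), build_ptree(bits[mid:])))


# Hypothetical bit slice standing in for R11.
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
print(p11)
```

With this sample slice the recursion makes the same five decisions as the numbered steps: the whole is not pure1, the left half is pure0 and ends, the right half is mixed, its left half is mixed, and its right half is pure1.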

To count occurrences of the tuple 7,0,1,4 in R(A1 A2 A3 A4), take its bit pattern 111 000 001 100 and AND the basic Ptree wherever the bit is 1 and the complemented Ptree (P') wherever the bit is 0:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
[Figure: bit slices R11 ... R43 of R[A1] R[A2] R[A3] R[A4], the Ptrees P11 ... P43, and the resulting AND tree]
Annotations on the AND tree:
- This 0 makes the entire left branch 0.
- These 0s make this node 0.
- These 1s and these 0s (which when complemented are 1s) make this node 1.
The 2^1-level has the only 1-bit, so the 1-count = 1*2^1 = 2.
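The counting step can be illustrated on uncompressed bit slices. The sketch below is a simplification: it stores each slice as a Python integer used as a bit vector rather than as a compressed Ptree, and the table contents and the names bit_slices and count_tuple are assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def bit_slices(table: Sequence[Tuple[int, ...]], width: int = 3) -> Dict[Tuple[int, int], int]:
    """Vertically project every bit position of every attribute.

    Slice (a, b) holds bit b (1 = high-order) of attribute a; bit i of the
    returned integer corresponds to row i of the table.
    """
    slices = {}
    for a in range(len(table[0])):
        for b in range(width):
            vec = 0
            for row_idx, row in enumerate(table):
                if (row[a] >> (width - 1 - b)) & 1:
                    vec |= 1 << row_idx
            slices[(a + 1, b + 1)] = vec
    return slices


def count_tuple(table: Sequence[Tuple[int, ...]], target: Tuple[int, ...], width: int = 3) -> int:
    """Count rows equal to target by ANDing slices (or their complements)."""
    slices = bit_slices(table, width)
    all_rows = (1 << len(table)) - 1
    result = all_rows
    for a, value in enumerate(target):
        for b in range(width):
            s = slices[(a + 1, b + 1)]
            if (value >> (width - 1 - b)) & 1:
                result &= s              # target bit is 1: use P
            else:
                result &= ~s & all_rows  # target bit is 0: use the complement P'
    return bin(result).count("1")


# Hypothetical 8-row table with 3-bit attributes A1..A4.
table = [
    (2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
    (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4),
]
print(count_tuple(table, (7, 0, 1, 4)))
```

On compressed Ptrees the same AND would be done node by node, so that pure0 branches can be skipped without visiting their rows.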

Top-down construction of basic P-trees is best for understanding; bottom-up construction is much faster (one pass across the data). Bottom-up construction of the 1-dimensional P-tree P11 is done using an in-order tree traversal, collapsing pure siblings as we go. For example: these siblings are pure0, so collapse!
[Figure: bit slices R11 ... R43 and the partially built P11]
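A bottom-up build can be sketched as repeated pairwise merging of siblings. Note the assumptions: this version merges level by level rather than by the in-order traversal the slide mentions, the node layout (flag, children) and the name build_bottom_up are hypothetical, and the slice length must be a power of two.

```python
from typing import List, Optional, Sequence, Tuple

# A node is (pure1_flag, children); children is None for a pure segment.
Node = Tuple[int, Optional[tuple]]


def build_bottom_up(bits: Sequence[int]) -> Node:
    """Build the tree from the leaves up, collapsing pure siblings."""
    level: List[Node] = [(b, None) for b in bits]  # each single bit is pure
    while len(level) > 1:
        merged: List[Node] = []
        for i in range(0, len(level), 2):
            left, right = level[i], level[i + 1]
            both_pure = left[1] is None and right[1] is None
            if both_pure and left[0] == right[0]:
                merged.append((left[0], None))      # pure siblings collapse
            else:
                merged.append((0, (left, right)))   # mixed parent: pure1 is false
        level = merged
    return level[0]


# Hypothetical bit slice standing in for R11.
p11 = build_bottom_up([0, 0, 0, 0, 1, 0, 1, 1])
print(p11)
```

Each bit is touched once at the leaf level and every later step only looks at already-built nodes, which is the sense in which the bottom-up build needs a single pass over the data.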

Thank you.