Incrementally Maintaining Classification using an RDBMS

Slides:

Advertisements

Similar presentations

Systematic Data Selection to Mine Concept Drifting Data Streams Wei Fan IBM T.J.Watson.

Advertisements

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.

Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.

Maintaining Sliding Widow Skylines on Data Streams.

Parallel Databases By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.

Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.

CMSC 414 Computer (and Network) Security Lecture 13 Jonathan Katz.

Study of Hurricane and Tornado Operating Systems By Shubhanan Bakre.

Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]

Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams Presented by: Andy Mason and Sheng Zhong Ahmed M.Ayad and Jeffrey.

Locality-Aware Request Distribution in Cluster-based Network Servers 1. Introduction and Motivation --- Why have this idea? 2. Strategies --- How to implement?

Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.

Online Magazine Bryan Ng. Goal of the Project Product Dynamic Content Easy Administration Development Layered Architecture Object Oriented Adaptive to.

Spatial Indexing I Point Access Methods.

Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.

CS Instance Based Learning1 Instance Based Learning.

Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.

Chapter 7 Indexing Objectives: To get familiar with: Indexing

1 Project 7: Huffman Code. 2 Extend the most recent version of the Huffman Code program to include decode information in the binary output file and use.

1 Physical Data Organization and Indexing Lecture 14.

Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.

Department of Computer Science, University of Waikato, New Zealand Geoffrey Holmes, Bernhard Pfahringer and Richard Kirkby Traditional machine learning.

CSCE Database Systems Chapter 15: Query Execution 1.

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models Kai-Wei Chang and Dan Roth Experiment Settings Block Minimization.

An Optimal Cache-Oblivious Priority Queue and Its Applications in Graph Algorithms By Arge, Bender, Demaine, Holland-Minkley, Munro Presented by Adam Sheffer.

1 Online algorithms Typically when we solve problems and design algorithms we assume that we know all the data a priori. However in many practical situations.

MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.

DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.

Spring 2003 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang

Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.

Approximate NN queries on Streams with Guaranteed Error/performance Bounds Nick AT&T labs-research Beng Chin Ooi, Kian-Lee Tan, Rui National.

Spring 2004 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2004 Yanyong Zhang

Bootstrapped Optimistic Algorithm for Tree Construction

DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.

for all Hyperion video tutorial/Training/Certification/Material Essbase Optimization Techniques by Amit.

Offering a Precision- Performance Tradeoff for Aggregation Queries over Replicated Data Paper by Chris Olston, Jennifer Widom Presented by Faizaan Kersi.

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.

COURSE DETAILS SPARK ONLINE TRAINING COURSE CONTENT

S. Sudarshan CS632 Course, Mar 2004 IIT Bombay

Module 11: File Structure

CS 540 Database Management Systems

Updating SF-Tree Speaker: Ho Wai Shing.

Parallel Databases.

Lecture 16: Data Storage Wednesday, November 6, 2006.

Chapter 11: Storage and File Structure

CSE-291 Cloud Computing, Fall 2016 Kesden

Database System Implementation CSE 507

Spatial Indexing I Point Access Methods.

Entity Framework By: Casey Griffin.

FAST Administration Training

Database Management Systems (CS 564)

Basic Concepts in Data Management

James B. Orlin Presented by Tal Kaminker

CS 440 Database Management Systems

Introduction to Database Systems

A Framework for Clustering Evolving Data Streams

View and Index Selection Problem in Data Warehousing Environments

The BIRCH Algorithm Davitkov Miroslav, 2011/3116

Chapter 13: Data Storage Structures

Chapter 12 Query Processing (1)

Indexing 4/11/2019.

Evaluation of Relational Operations: Other Techniques

Chapter 13: Data Storage Structures

Chapter 13: Data Storage Structures

Lecture-Hashing.

Morteza Kheirkhah University College London

Presentation transcript:

Incrementally Maintaining Classification using an RDBMS Presented by: Noah Golmant October 3, 2016

HAZY “An end-to-end system for imprecision management”

Goals: Integrate classification models into run-time operation with RDBMS Incorporate new training examples in a real-time, dynamic environment How? Model-based Views Incremental Maintenance

Goals: Integrate classification models into run-time operation with RDBMS Incorporate new training examples in a real-time, dynamic environment How? Model-based Views Incremental Maintenance

Model-based Views Expose statistical computations through relational views Standard SQL semantics: Queries for updates, inserts, and deletes Triggers to propagate updates

Model-based Views CREATE CLASSIFICATION VIEW Labeled_Papers KEY id -- (id, class) ENTITIES FROM Papers KEY id -- (id, title, ...) LABELS FROM Paper_Area LABEL l -- (label) EXAMPLES FROM Example_Papers KEY id LABEL l -- (id, label) FEATURE FUNCTION tf_bag_of_words T(id, l)

Model-based Views Eager Approach: maintain V as a materialized view where updates to class labels occur immediately after a model update. Lazy approach: in response to read of an input id, read feature vector and label using current model. Incremental model maintenance should improve both of these approaches for: Single Entity read All Members read Update

Goals: Integrate classification models into run-time operation with RDBMS Incorporate new training examples in a real-time, dynamic environment How? Model-based Views Incremental Maintenance

Incremental Maintenance At round i, we receive new examples to update a materialized view . HAZY divides this into two problems: How to perform an incremental step to update the old model to with (preferably low) cost . Decide when to reorganize to obtain a model by training on the whole dataset with a fixed cost .

Incremental Maintenance At round i, maintain a materialized view: Where: Let s be the last round at which HAZY reorganized the model.

Incremental Maintenance Let be a round after the “anchor” s. Only need to reclassify tuples satisfying:

Reorganization - The Skiing Strategy

Reorganization - The Skiing Strategy You are going skiing for an unknown number of days d. Every day you choose between renting a pair of skis for $1 and buying the pair for $10. When should you purchase the skis? Minimize the ratio between what you would pay using some decision strategy and what you would pay if you knew d.

Reorganization - The Skiing Strategy Choose some . At each round i, accumulate a total cost: Reorganize when: Then reset the accumulated cost to 0. This is a 2-approximation of the optimal strategy, and is optimal among all online, deterministic strategies**. Reorganization: update the model, re-cluster on t.eps, and rebuild indices. **Assuming reorganizing more recently does not raise the cost.

Architectural Optimizations In-memory architecture: Maintain classification view in memory, discard when memory needs to be revoked. Cluster the data (on t.eps) in memory. We only need to persist the entities and training examples since everything else can be re-computed. Hybrid architecture: Maintain buffer for entities. Cache t.eps if we can’t store all the entities in memory.

Questions, Issues How does this fit with the semantics of streaming systems? Not “black box” enough: Featurization still exposed (vs. LASER source nodes). Lazy and eager approaches have different semantics. Scalability - balancing throughput with feature length, dataset size. Drift, dataset size and the monotonicity assumption.

Questions, Issues Pushing incremental maintenance through the model training process: SGD is orders of magnitude slower on larger datasets when performed in HAZY system vs. a hand-coded C file. Can this be improved without bulk-loading? Can we take advantage of epsilon clustering during training?