
Presented by: Ashkan Malekloo Fall 2015

 Type: Demonstration paper (VLDB 2015)  Authors: Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu

 Dirty data  Data cleaning is often specific to the domain, the dataset, and the eventual analysis; analysts report spending upwards of 80% of their time on data cleaning problems  Key questions:  What to extract  How to clean the data  Whether that cleaning will significantly change results

 While the extraction operation can be represented at a logical level by its input and output schemas, there is a huge space of possible physical implementations of the logical operators:  Rule-based  Learning-based  Crowd-based  Or a combination of the three  Suppose we select a crowd-based operator as our extraction method  There are still many parameters that influence the quality of the output:  the number of crowd workers  the amount each worker is paid

 ETL (Extract-Transform-Load)  Constraint-driven tools  Wrangler  OpenRefine  Crowd-based tools

 Wisteria is a system designed to support the iterative development and optimization of data cleaning plans end to end  It allows users to specify declarative data cleaning plans  Wisteria phases:  Sampling  Recommendation  Crowd latency
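
To make the logical/physical separation in these slides concrete, here is a minimal Python sketch of a logical extraction operator with swappable physical implementations and crowd parameters. All names, parameters, and the trivial rule are illustrative assumptions, not Wisteria's actual API:

```python
# Hypothetical sketch of a logical cleaning operator with swappable
# physical implementations (rule-based, learning-based, crowd-based).
# Not Wisteria's actual API.

def rule_based_extract(record):
    # Trivial rule: the text before the first comma is the surname.
    return record.split(",")[0].strip()

def crowd_based_extract(record, num_workers=3, pay_cents=5):
    # Placeholder: a real operator would post `record` as a task to
    # `num_workers` crowd workers paid `pay_cents` each, then aggregate
    # their answers (e.g., by majority vote).
    raise NotImplementedError("requires a crowdsourcing backend")

class ExtractOperator:
    """Logical operator: fixed input/output schema, pluggable physical impl."""
    def __init__(self, impl, **params):
        self.impl, self.params = impl, params

    def run(self, records):
        return [self.impl(r, **self.params) for r in records]

# Swapping the physical implementation leaves the logical plan unchanged.
extract = ExtractOperator(rule_based_extract)
print(extract.run(["Haas, Daniel", "Krishnan, Sanjay"]))  # ['Haas', 'Krishnan']
```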

Presented by: Ashkan Malekloo Fall 2015

 Type: Demonstration paper (VLDB 2015)  Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik

 In large enterprises, data discovery is a common problem faced by users who need to find relevant information in relational databases:  Finding tables that are relevant  Finding out whether a table is truly relevant

 The paper's sampling involves:  29 databases  639 tables  4,216 data columns

 Many frequently used column names are very generic:  Name  Id  Description  Field  Code  Column  These generic column names are useless for helping users find tables that have the data they need.

 A system that automatically generates candidate keywords to annotate columns of database tables by mining spreadsheets  Spreadsheets are more readable

 A method to automatically extract tables from a corpus of enterprise spreadsheets.  A method for identifying and ranking relevant column annotations, and an efficient technique for calculating it.  An implementation of our method, and an experimental evaluation that shows its efficiency and effectiveness.
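
As a hedged illustration of the overall idea (not the paper's actual scoring method), candidate annotations for a generically named database column can be ranked by how strongly spreadsheet headers co-occur with the column's values:

```python
# Toy corpus and scoring; the paper's actual ranking method is richer.
from collections import Counter

# Spreadsheet corpus: (header, sample of cell values seen under it).
spreadsheet_columns = [
    ("employee_name", {"alice", "bob", "carol"}),
    ("product_name",  {"widget", "gadget"}),
    ("employee_name", {"bob", "dave"}),
]

def candidate_annotations(db_values, corpus):
    """Score spreadsheet headers by value overlap with a database column."""
    scores = Counter()
    for header, cells in corpus:
        scores[header] += len(db_values & cells)
    return scores.most_common()

# A database column generically named "Name" gets descriptive candidates.
db_column = {"alice", "bob", "widget"}
print(candidate_annotations(db_column, spreadsheet_columns))
# [('employee_name', 3), ('product_name', 1)]
```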

Type: Demonstration paper  Authors: Manas Joglekar, Hector Garcia-Molina (Stanford), Aditya Parameswaran (University of Illinois)  Presented by: Siddhant Kulkarni  Term: Fall 2015

 Drill down -> data exploration  Drawbacks of the traditional drill-down operation:  Too many distinct values  One column at a time  Drilling down several columns simultaneously presents too many value combinations
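
A toy illustration of these drawbacks, with made-up data rather than the paper's operator: a plain group-by drill-down returns one group per distinct value, which quickly overwhelms the user:

```python
# Made-up sales rows; not the paper's operator, just the failure mode.
from collections import Counter

rows = [("NY", "p%03d" % i) for i in range(500)] + [("SF", "p001")] * 20

# Drilling down on `store` is manageable...
print(len(Counter(store for store, _ in rows)))  # 2 groups
# ...but drilling down on `product` floods the user with distinct values,
print(len(Counter(prod for _, prod in rows)))    # 500 groups
# and drilling down both columns simultaneously is even worse.
print(len(Counter(rows)))                        # 501 groups
```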

 Related work:  Interpretable and informative explanations of outcomes  User-adaptive exploration of multidimensional data  User-cognizant multidimensional analysis  Discovery-driven exploration of OLAP data cubes

Type: Demonstration paper  Authors: Tobias Müller, Torsten Grust (Universität Tübingen, Tübingen, Germany)  Presented by: Siddhant Kulkarni  Term: Fall 2015

 Given a query:  Record its control-flow decisions (WHY-PROVENANCE)  Record its data access locations (WHERE-PROVENANCE)  Without actual data values (VALUE-LESS!), determine the why-origin and where-origin of a query

 DETERMINE THE I/O DEPENDENCIES OF REAL LIFE SQL QUERIES

 Step 1: Convert the SQL query into Python code  Step 2: Apply program slicing  Step 3: Apply abstract interpretation  Demo with PostgreSQL
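
A minimal sketch of what Step 1 might produce, with hand-written logging standing in for the paper's actual slicing and abstract-interpretation machinery:

```python
# Hand-written analogue of:  SELECT name FROM emp WHERE salary > 1000
# The explicit logging below is an illustrative stand-in for the paper's
# why-/where-provenance machinery, not its actual output.

emp = [{"name": "ann", "salary": 900}, {"name": "bob", "salary": 1500}]
why, where, result = [], [], []

for i, row in enumerate(emp):
    where.append((i, "salary"))     # cell read -> where-provenance
    taken = row["salary"] > 1000    # control-flow decision -> why-provenance
    why.append((i, taken))
    if taken:
        where.append((i, "name"))   # only read when the branch is taken
        result.append(row["name"])

print(result)  # ['bob']
print(why)     # [(0, False), (1, True)]
print(where)   # [(0, 'salary'), (1, 'salary'), (1, 'name')]
```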

Presented by: Omar Alqahtani Fall 2015

Authors: Nicole Bidoit (Université Paris-Sud / Inria), Melanie Herschel (Universität Stuttgart), Katerina Tzompanaki (Université Paris-Sud / Inria)

 Explanations to Why-Not questions can be:  Data-based explanations  Query-based explanations  Mixed explanations

The Explain and Fix Query platform (EFQ) enables users to execute queries, express a Why-Not question, and ask for:  Query-based explanations to the Why-Not question (Why-Not answer polynomials)  Query refinements that produce the desired results  A cost model for ranking the refinements
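
To make "query-based explanations" concrete, a toy sketch (not EFQ itself) can report, for a missing expected tuple, which predicates of a conjunctive selection rejected it and are therefore candidates for refinement:

```python
# Toy why-not explainer for a conjunctive selection query.
# The predicates and the missing tuple are illustrative, not EFQ's model.

predicates = {
    "price < 100": lambda t: t["price"] < 100,
    "stock > 0":   lambda t: t["stock"] > 0,
}

def why_not(missing_tuple, predicates):
    """Return the predicates that reject the missing tuple."""
    return [name for name, pred in predicates.items() if not pred(missing_tuple)]

missing = {"item": "lamp", "price": 120, "stock": 3}
print(why_not(missing, predicates))  # ['price < 100'] -> candidate to relax
```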

Presented by: Ranjan Fall 2015

Databases? Spreadsheets? Problem?

 A spreadsheet contains course assignment scores and eventual grades for students in rows 1–1000, columns 1–10 of one sheet, and demographic information for the same students in rows 1–1000, columns 1–20 of another sheet.  The user wants to understand the impact of assignment grades on the course grade, e.g., for students having std_points > 90 in at least one assignment.  The user wants to plot the average grade by demographic group (undergrad, MS, PhD).  The course management software outputs actions performed by students into a relational database or a CSV file; there is no easy way for the user to study this data within the spreadsheet, as the data is continuously added.

 Spreadsheets and databases differ in:  Schema  Addressing  Modifications  Computation: spreadsheets support value-at-a-time formulae for derived computation, while databases support arbitrary SQL queries operating on groups of tuples at once.

DataSpread supports: a) analytic queries that reference data on the spreadsheet as well as data in other database relations; b) importing or exporting data from the relational database; c) keeping data in the front end and the back end in sync during modifications at either end.
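
Point (c) amounts to two-way change propagation between the sheet and the database. A minimal sketch under obvious assumptions (an in-memory cell dictionary as the front end, SQLite as the back end; the schema is hypothetical):

```python
# Minimal two-way sync between a spreadsheet-like front end (a dict of
# cells) and a relational back end (SQLite). Schema and cell mapping are
# hypothetical, chosen only to show propagation in both directions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cells (row INTEGER, col INTEGER, val TEXT,"
           " PRIMARY KEY (row, col))")
sheet = {}

def set_cell(row, col, val):
    """Front-end edit: update the sheet, then push to the back end."""
    sheet[(row, col)] = val
    db.execute("INSERT OR REPLACE INTO cells VALUES (?, ?, ?)", (row, col, val))

def refresh_from_db():
    """Back-end change: refresh the front-end cells."""
    sheet.clear()
    for r, c, v in db.execute("SELECT row, col, val FROM cells"):
        sheet[(r, c)] = v

set_cell(1, 1, "A+")                                                # sheet -> DB
db.execute("UPDATE cells SET val = 'B' WHERE row = 1 AND col = 1")  # DB -> sheet
refresh_from_db()
print(sheet)  # {(1, 1): 'B'}
```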

 a) Using spreadsheets to mimic relational database functionality achieves the expressivity of SQL, but cannot leverage the scalability of databases.  b) Using databases to mimic spreadsheet functionality achieves the scalability of databases, but does not support the ad-hoc tabular management provided by spreadsheets.  c) Using a spreadsheet interface for querying data provides an intuitive interface, but loses the expressivity of SQL as well as ad-hoc data management capabilities.

Overall, the aforementioned demonstration scenarios will convince attendees that the DataSpread system offers a valuable hybrid between spreadsheets and databases, retaining the ease of use of spreadsheets and the power of databases.

Presented by: Zohreh Raghebi Fall 2015

 Bilegsaikhan Naidan (Norwegian University of Science and Technology, Trondheim, Norway)  Leonid Boytsov (Carnegie Mellon University, Pittsburgh, PA, USA)  Eric Nyberg (Carnegie Mellon University, Pittsburgh, PA, USA)

 Nearest-neighbor search is a fundamental operation employed in many applied areas, such as pattern recognition, computer vision, and multimedia retrieval  Given a query data point q, the goal is to identify the nearest (neighbor) data point x  A natural generalization is k-NN search, where we aim to find the k closest points  The most studied instance of the problem is exact nearest-neighbor search in vector spaces, where the distance function is an actual metric

 Exact methods work well only in low-dimensional metric spaces  Experiments showed that exact methods can rarely outperform a sequential scan when dimensionality exceeds ten  This is the well-known phenomenon called "the curse of dimensionality"  Approximate search methods can be much more efficient than exact ones, but this comes at the expense of reduced search accuracy  The quality of approximate searching is often measured using recall: the average fraction of true neighbors returned by a search method

 The approach is based on the idea that if we rank a set of reference points, called pivots, with respect to their distances from a given point, the pivot rankings produced by two nearby points should be similar  In these methods, every data point is represented by a ranked list of pivots sorted by distance to that point  Such ranked lists are called permutations  The distance between permutations is a good proxy for the distance between the original points  However, a comprehensive evaluation involving a diverse set of large metric and non-metric data sets has been lacking  We survey permutation-based methods for approximate k-nearest-neighbor search

 Queries are answered by examining only a tiny subset of data points whose permutations are similar to the permutation of the query  Converting the vector of distances to pivots into a permutation entails information loss, but this loss is not necessarily detrimental:  our preliminary experiments showed that using permutations instead of vectors of original distances results in slightly better retrieval performance  Permutation methods are especially attractive when:  (1) the distance function is expensive (or the data resides on disk)  (2) the indexing costs of k-NN graphs are unacceptably high  (3) there is a need for a simple, but reasonably efficient, implementation that operates on top of a relational database
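
The core pipeline described above (rank pivots per point, compare permutations, then refine candidates with the real distance) fits in a few lines. This is a generic sketch with random data and the Spearman footrule, not the paper's tuned implementations:

```python
# Generic permutation-based approximate k-NN sketch (random toy data).
import random

random.seed(0)
dim, n_pivots = 8, 4

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

points = [[random.random() for _ in range(dim)] for _ in range(1000)]
pivots = random.sample(points, n_pivots)

def permutation(p):
    """Rank of each pivot by distance from p (the point's 'permutation')."""
    order = sorted(range(n_pivots), key=lambda i: dist(p, pivots[i]))
    rank = [0] * n_pivots
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

index = [permutation(p) for p in points]

def knn_approx(q, k=5, n_candidates=50):
    """Filter by Spearman footrule on permutations, refine with dist()."""
    qperm = permutation(q)
    def footrule(perm):
        return sum(abs(a - b) for a, b in zip(perm, qperm))
    cand = sorted(range(len(points)), key=lambda j: footrule(index[j]))[:n_candidates]
    return sorted(cand, key=lambda j: dist(q, points[j]))[:k]

print(knn_approx([0.5] * dim))  # ids of 5 approximate nearest neighbors
```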

 Tamraparni Dasu (AT&T Labs–Research)  Vladislav Shkapenyuk (AT&T Labs–Research)  Divesh Srivastava (AT&T Labs–Research) Presented by: Zohreh Raghebi

 Data are being collected and analyzed today at an unprecedented scale  Data errors (or glitches) in many domains, such as medicine and finance, can have severe consequences  We need to develop data quality management systems to effectively detect and correct glitches in the data  Data errors can arise throughout the data lifecycle:  from data entry, through storage and data integration, to analysis

 Much of the data quality effort in database research has focused on detecting and correcting errors once the data has been collected  This is surprising, since data entry time offers the first opportunity to detect and correct errors  We address this problem in our paper and describe principled techniques for online data quality monitoring in a dynamic feed environment  While there has been significant focus on collecting and managing data feeds, it is only now that attention is turning to their quality

 Our goal is to alert quickly when feed behavior deviates from expectations  Data feed management systems (DFMSs) have recently emerged to provide reliable, continuous data delivery to databases and data-intensive applications that need to perform real-time correlation and analysis  In prior work we presented the Bistro DFMS, which is deployed at AT&T Labs  It is responsible for the real-time delivery of over 100 different raw feeds, distributing data to several large-scale stream warehouses.

 Bistro uses a publish-subscribe architecture to efficiently process incoming data from a large number of data publishers, identify logical data feeds, and reliably distribute these feeds to remote subscribers  FIT naturally fits into this DFMS architecture:  both as a subscriber of data and metadata feeds  and as a publisher of learned statistical models and identified outliers  We propose novel enhancements to the publish-subscribe approach to incorporate data quality modules into the DFMS architecture

 Early detection of errors by FIT enables data administrators to quickly remedy any problems with incoming feeds  FIT's online feed monitoring can naturally detect errors from two distinct perspectives:  (i) errors in the data feed processes (e.g., missing or delayed delivery of files in a feed), by continuously analyzing the DFMS metadata feed  (ii) significant changes in the distributions of the data records present in the feeds (e.g., erroneously switching from packets/second to bytes/second in a measurement feed), by continuously analyzing the contents of the data feeds.
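
A toy flavor of perspective (i), assuming a per-interval file-count series from the metadata feed and a simple z-score alert; FIT's actual statistical models are more sophisticated:

```python
# Toy file-count monitor over a feed's metadata; this only conveys the
# flavor of alerting, not FIT's actual models.
from statistics import mean, stdev

history = [100, 98, 103, 101, 99, 102, 97, 100]  # files per interval

def alert(observed, history, threshold=3.0):
    """Flag an interval whose file count deviates strongly from history."""
    mu, sigma = mean(history), stdev(history)
    z = (observed - mu) / sigma
    return abs(z) > threshold, round(z, 1)

print(alert(42, history))   # missing/late files -> (True, large negative z)
print(alert(101, history))  # normal delivery   -> (False, small z)
```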

Presented by: Shahab Helmi Fall 2015

Authors: Publication:  VLDB 2015 Type:  Industrial Paper

What was done in this paper?  The first attempt to implement three basic DP architectures in a deployed telecommunication (telco) big data platform for data mining applications (churn prediction). What is DP?  Differential Privacy (DP) is an anonymization technique. What is anonymization?  A privacy protection technique that removes or replaces the explicitly sensitive identifiers (IDs) of customers, such as the identification number or mobile phone number, by random mapping or encryption mechanisms in the DB, and provides the sanitized dataset, without any ID information, to data mining services.
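
For readers unfamiliar with DP: the textbook Laplace mechanism releases a numeric query result with noise scaled to sensitivity/epsilon, where epsilon is the privacy budget discussed below. A standard sketch, not the platform's implementation:

```python
# Textbook Laplace mechanism; a standard sketch, not the platform's code.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with Laplace(sensitivity / epsilon) noise.
    A smaller privacy budget epsilon means more noise, stronger privacy."""
    return true_value + np.random.laplace(0.0, sensitivity / epsilon)

# A counting query ("how many churners?") has sensitivity 1: adding or
# removing one customer changes the count by at most 1.
true_count = 12345
for eps in (0.01, 0.1, 1.0):
    print(eps, round(laplace_mechanism(true_count, 1.0, eps)))
```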

Who is a churner?  A person who quits the service! Customer churn is one of the biggest challenges in the telco industry. What is a telco big data platform?  Telco big data platforms record billions of customers' communication behaviors over years, worldwide. Mining this big data to improve customers' experience for higher profits has become one of the important tasks for telco operators.

 Implementation of DP in a telco big data platform: the Data Publication, Separated, and Hybridized architectures.  Extensive experimental results on big data:  The influence of the privacy budget parameter on different DP implementations with industrial big data  The trade-off between accuracy and privacy budget  The performance of the three basic DP architectures in churn prediction  How the volume and variety of big data affect performance  A comparison of DP implementation performance between a simple decision tree and the relatively complicated random forest classifier in churn prediction

 Findings:  All DP architectures have a relative accuracy loss of less than 5% with a weak privacy guarantee, and more than 15% (up to 30%) with a strong privacy guarantee.  Among the three basic DP architectures, the Hybridized architecture performs best.  Prediction error:  increases with the number of features  decreases with the growth of the training data volume

Anonymization techniques, such as k-anonymity:  DP is currently the strongest privacy protection technique; it does not need any background-knowledge assumption about attackers, who can be assumed to have maximum knowledge. DP has been studied in different scenarios:  Histogram queries  Statistical geospatial data queries  Frequent itemset mining  Crowdsourcing …

 Dataset: collected from one of the biggest telco operators in China; 9 consecutive months of behavior records for more than 2 million prepaid customers, from 2013 to 2014 (around 2M users).  Experiments: checking the effect of the following properties on churn prediction accuracy:  Privacy budget parameter  Number of features  Training data volume

 AUC: Area Under the ROC Curve  ROC is a graphical plot that illustrates the performance of a binary classifier system. [Wikipedia]  The effect of the number of features on prediction accuracy (1M training records)
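
Given churn labels and predicted scores, AUC can be computed with a standard scikit-learn call; the labels and scores below are made up for illustration:

```python
# Standard scikit-learn AUC computation; labels and scores are made up.
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = churner, 0 = not
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]   # predicted churn scores

# AUC = probability that a random churner is scored above a random
# non-churner; 1.0 is perfect, 0.5 is random guessing.
print(roc_auc_score(y_true, y_score))
```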

 The effect of training data volume on prediction accuracy

 Decision Trees vs. Random Forests

Presented by: Shahab Helmi Fall 2015

Authors: Publication:  VLDB 2015 Type:  Demonstration Paper

 Data analysts often engage in data exploration tasks to discover interesting data patterns, without knowing exactly what they are looking for (exploratory analysis).  Users try to make sense of the underlying data space by navigating through it; the process includes a great deal of experimentation with queries, backtracking on the basis of query results, and revision of results at various points in the process.  When the data size is huge, finding the relevant subspace and the relevant results can take very long.

AIDE is an automated data exploration system that:  Steers the user toward interesting data areas based on her relevance feedback on database samples.  Aims to identify all database objects that match the user's interest with high efficiency.  Relies on a combination of machine learning techniques and sample selection algorithms to provide effective data exploration results as well as highly interactive performance over databases of large sizes.
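
The skeleton below captures the flavor of such steering: label a sample, fit a classifier, then request labels where the model is least certain. The interest region and all parameters are illustrative assumptions, not AIDE's actual algorithms:

```python
# Skeleton of a feedback-driven exploration loop; not AIDE's algorithms.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
data = rng.uniform(0, 10, size=(5000, 2))   # stand-in "database"

def user_interest(x):
    # Hypothetical oracle standing in for the user's relevance feedback.
    return float(x[0] > 6 and x[1] < 3)

idx = rng.choice(len(data), size=30, replace=False)
X = [data[i] for i in idx]
y = [user_interest(data[i]) for i in idx]
X += [np.array([8.0, 1.0]), np.array([1.0, 8.0])]  # ensure both classes
y += [1.0, 0.0]

for _ in range(5):
    model = SVC(probability=True).fit(X, y)
    proba = model.predict_proba(data)[:, 1]
    # Steer: request labels where the model is least certain.
    uncertain = np.argsort(np.abs(proba - 0.5))[:10]
    X += [data[j] for j in uncertain]
    y += [user_interest(data[j]) for j in uncertain]

print("predicted relevant points:", int(model.predict(data).sum()))
```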

Datasets:  AuctionMark: information on auction items and their bids; 1.77 GB.  Sloan Digital Sky Survey: a scientific data set generated by digital surveys of stars and galaxies; large data size and complex schema; 1 GB–100 GB.  US housing and used cars: available through the DAIDEM Lab. System implementation:  Java: ML, clustering, and classification algorithms, such as SVM, k-means, and decision trees  PostgreSQL