A Research Sampler dex.html.


Philosophy Research should be fun -- good puzzles, interesting algorithms. Research should be useful -- work with real users whenever possible. Implementation should be fast (I use a very powerful programming environment that I expect my students to learn)

Thesis Philosophy Ideal thesis should have an interesting algorithm with analysis, an implementation, and users. Of the 15 theses I have supervised, 13 follow this model. The other two were pure systems theses.

Current Research Topics Time series analysis: finding correlations and bursts. Query by humming. AQuery: a database for ordered data (such as time series). Computational biology: data analysis, visualization, proteomics.

Online Pattern Discovery Sensor-less: Pairs-trading in stock trading (find highly correlated pairs in n log n time) Sensor-full: Gamma Ray Detection in astrophysics (burst detection over a large number of window sizes in almost linear time) Dennis Shasha (joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, Tyler Neylon, Xin Zhang and Prof Richard Cole)

Goal of this work Time series are important in so many applications – biology, medicine, finance, music, physics, … A few fundamental operations occur all the time: burst detection, correlation, pattern matching. Extend functionality for music and science.

StatStream (VLDB, 2002): Example Stock price streams –The New York Stock Exchange (NYSE) –50,000 securities (streams); 100,000 ticks (trade and quote) Pairs Trading, a.k.a. Correlation Trading Query: “Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?”

StatStream (VLDB, 2002): Example XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …

Online Detection of High Correlation Goal: given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time. Real time –high update frequency of the data stream –fixed response time, online

Online Detection of High Correlation Correlated!

StatStream: Algorithm Naive algorithm –N : number of streams –w : size of sliding window –space O(N) and time O(N^2 w) vs. space O(N^2) and time O(N^2). Suppose that the streams are updated every second. –With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes. Our Approach –Use Discrete Fourier Transform to approximate correlation –Use grid structure to filter out unlikely pairs –Our approach can monitor 10,000 streams with a delay of 2 minutes.
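A minimal sketch of why a few DFT coefficients can approximate correlation, under assumptions that are not from the slides: each window is normalized to zero mean and unit length, so that correlation equals 1 - d^2/2 for Euclidean distance d, and a unitary DFT is used so that distances are preserved. Keeping only the first k coefficients can only underestimate d, so the result is an upper bound on the correlation that can discard pairs without false negatives. The function names are hypothetical, the grid structure is not shown, and this is not the StatStream code.

```python
import cmath
import math


def normalize(window):
    """Shift to zero mean and scale to unit Euclidean length."""
    n = len(window)
    mean = sum(window) / n
    centered = [v - mean for v in window]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        raise ValueError("constant window has no defined correlation")
    return [v / norm for v in centered]


def dft_coeffs(x, k):
    """Coefficients 1..k of the unitary DFT (coefficient 0 is zero after centering)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(1, k + 1)
    ]


def exact_correlation(a, b):
    """Pearson correlation: inner product of the normalized windows."""
    return sum(p * q for p, q in zip(normalize(a), normalize(b)))


def correlation_upper_bound(a, b, k):
    """Upper bound on the correlation using only k DFT coefficients per window."""
    n = len(a)
    if len(b) != n or not 0 < 2 * k < n:
        raise ValueError("need equal lengths and 0 < k < n/2")
    ca = dft_coeffs(normalize(a), k)
    cb = dft_coeffs(normalize(b), k)
    # Real input makes coefficient n-f the conjugate of coefficient f,
    # so each kept coefficient accounts for two terms of the squared distance.
    d2_lower = 2.0 * sum(abs(p - q) ** 2 for p, q in zip(ca, cb))
    return 1.0 - d2_lower / 2.0
```

A pair whose upper bound is already below the query threshold (0.9 in the example query) can be skipped without computing the exact correlation.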

Empirical Study : Speed Our algorithm is parallelizable.

Sketches : Random Projection Correlation between time series of stock returns –Since most stock price time series are close to random walks, their return time series are close to white noise –DFT/DWT can’t capture approximate white noise series because there is no clear trend (too many frequency components). Solution : Sketches (a form of random landmark) –Sketch pool: a matrix of random variables drawn from a stable distribution –Sketches : the random projection of all time series to lower dimensions by multiplication with the same matrix –The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.
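The random projection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the pool entries are standard Gaussians (the Gaussian is a stable distribution), the sketch distance is rescaled by the square root of the sketch size, and the names `make_sketch_pool`, `sketch`, and `sketch_distance` are hypothetical.

```python
import math
import random


def make_sketch_pool(k, n, seed=0):
    """k x n matrix of independent standard Gaussian variables."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]


def sketch(series, pool):
    """Project a length-n series to k dimensions using the shared pool."""
    return [sum(r * v for r, v in zip(row, series)) for row in pool]


def sketch_distance(s1, s2):
    """Estimate of the Euclidean distance between the original series."""
    k = len(s1)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)) / k)


def euclidean_distance(x, y):
    """Exact distance, for comparison with the estimate."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Every series must be multiplied by the same pool; the estimate is random, and its expected relative error shrinks as the sketch size k grows.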

Burst Detection

Burst Detection: Applications Discovering intervals with unusually large numbers of events. –In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. Might last milliseconds or days… –In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate some network anomaly. Exact duration is unknown. –In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).

Bursts across different window sizes in Gamma Rays Challenge : to discover not only the time of the burst, but also the duration of the burst.

Elastic Burst Detection: Problem Statement Problem: Given a time series of positive numbers x_1, x_2, ..., x_n, and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size such that their sums are above the thresholds: –all 0<w<n, 0<m<n-w, such that x_m + x_{m+1} + … + x_{m+w-1} ≥ f(w) Brute force search : O(n^2) time Our shifted wavelet tree (SWT): O(n+k) time. –k is the size of the output, i.e. the number of windows with bursts

Burst Detection: Data Structure and Algorithm –Define threshold for node of size 2^k to be threshold for window of size 1 + 2^(k-1)
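A simplified sketch of the filter-then-search idea behind the shifted wavelet tree, compared against brute force. Assumptions not taken from the slides: the data are nonnegative, f is nondecreasing in w, window starts are 0-indexed, every window size from 1 to n is checked, node sums come from prefix sums rather than an incrementally built tree, and each level is tested against the threshold of the smallest window size it is responsible for (a conservative choice that cannot miss a burst). It illustrates the pruning but makes no claim to the O(n+k) bound, and the function names are hypothetical.

```python
def prefix_sums(xs):
    """sums[i] is the total of xs[:i]."""
    sums = [0]
    for v in xs:
        sums.append(sums[-1] + v)
    return sums


def brute_force_bursts(xs, f):
    """All (start, width) windows whose sum reaches f(width)."""
    n = len(xs)
    ps = prefix_sums(xs)
    return {
        (m, w)
        for w in range(1, n + 1)
        for m in range(0, n - w + 1)
        if ps[m + w] - ps[m] >= f(w)
    }


def swt_bursts(xs, f):
    """Same output, but nodes whose sum is too small are skipped entirely."""
    n = len(xs)
    ps = prefix_sums(xs)
    bursts = set()
    level, lo = 1, 1
    while lo <= n:
        size = 2 ** level        # node width at this level
        shift = size // 2        # neighbouring nodes overlap by half
        hi = min(shift + 1, n)   # largest width guaranteed to fit in a node
        for start in range(0, n, shift):
            end = min(start + size, n)
            if ps[end] - ps[start] < f(lo):
                continue         # no window of width lo..hi here can burst
            for w in range(lo, hi + 1):
                for m in range(start, end - w + 1):
                    if ps[m + w] - ps[m] >= f(w):
                        bursts.add((m, w))
        lo = shift + 2           # the next level takes over the larger widths
        level += 1
    return bursts
```

Any window of width at most 1 + 2^(k-1) lies entirely inside some node of size 2^k because the nodes at that level are shifted by half their width, which is what makes the node sum a safe filter.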

Empirical Study : Stock Price Spread Burst

Elastic Burst in two dimensions Population Distribution in the US

Summary Able to detect bursts of many different durations in essentially linear time. Can be used both for time series and for spatial searching. Can specify thresholds either with absolute numbers or with probability of hit. Algorithm is simple to implement and has low constants (code is available). Ok, it’s embarrassingly simple.

With a Little Help From My Warped Correlation Karen’s humming Match: Dennis’s humming Match: “What would you do if I sang out of tune?” Yunyue’s humming Match:

Related Work in Query by Humming Traditional method: String Matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99] –Music represented by string of pitch directions: U, D, S (degenerated interval) –Hum query is segmented into discrete notes, then converted to a string of pitch directions –Edit Distance between hum query and music score Problem –Very hard to segment the hum query –Partial solution: users are asked to hum articulately New Method : matching directly from audio [Mazzoni and Dannenberg 00] Problem –slowed down by DTW
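The traditional string-matching method can be sketched as below, assuming the notes have already been segmented into a list of pitch values (the hard step noted above). Reading U, D, S as up, down, and same is an assumption, and the helper names are hypothetical; the edit distance is the standard Levenshtein distance.

```python
def pitch_directions(notes):
    """Encode a note sequence as U (up), D (down), S (same) per step."""
    out = []
    for prev, cur in zip(notes, notes[1:]):
        if cur > prev:
            out.append("U")
        elif cur < prev:
            out.append("D")
        else:
            out.append("S")
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(
                prev_row[j] + 1,               # delete ca
                row[j - 1] + 1,                # insert cb
                prev_row[j - 1] + (ca != cb),  # substitute or match
            ))
        prev_row = row
    return prev_row[-1]
```

A hum and a score with a small edit distance between their direction strings are treated as a candidate match.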

Time Series Representation of Query An example hum query Note segmentation is hard! Segment this!

How to deal with poor hum queries? No absolute pitch –Solution: the average pitch is subtracted Incorrect tempo –Solution: Uniform Time Warping Inaccurate pitch intervals –Solution: return the k-nearest neighbors Local timing variations –Solution: Dynamic Time Warping
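The first two fixes can be sketched as a preprocessing step, assuming the hum and the candidate melody are plain lists of pitch samples. Uniform time warping is approximated here by stretching both series to a common length with nearest-index resampling; that resampling rule and the function names are assumptions for illustration.

```python
import math


def stretch(series, length):
    """Uniformly resample a series to the given length (nearest index)."""
    n = len(series)
    return [series[i * n // length] for i in range(length)]


def subtract_mean(series):
    """Remove the average pitch so the key of the hum does not matter."""
    mean = sum(series) / len(series)
    return [v - mean for v in series]


def utw_distance(a, b, length):
    """Euclidean distance after uniform stretching and mean removal."""
    xa = subtract_mean(stretch(a, length))
    xb = subtract_mean(stretch(b, length))
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(xa, xb)))
```

With this preprocessing, the same melody hummed in a different key and at half the tempo has distance zero to the original.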

Dynamic Time Warping Euclidean distance: sum of point-by-point distances DTW distance: allows stretching or squeezing the time axis locally
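A minimal sketch of the two distances using the textbook dynamic program, not code from the talk. It assumes squared point differences with a square root at the end and no warping-window constraint; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point distance; requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(a, b):
    """DTW distance: cheapest monotone alignment of a against b."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                cost[i - 1][j - 1],  # both advance together
            )
    return math.sqrt(cost[n][m])
```

The table has n x m entries, which is the cost behind the "slowed down by DTW" remark; for equal-length series the DTW distance is never larger than the Euclidean distance because the diagonal alignment is one of the allowed paths.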

Dynamic Time Warping

AQuery A Database System for Order Dennis Shasha joint work with Alberto Lerner

Idea Whatever can be done on a table can be done on an ordered table (arrable). Not vice-versa. AQuery – query language on arrables Expresses many queries easily Elegant new optimizations.

And Streams? AQuery has no special facilities for streaming data, but it is expressive enough. The idea for streaming data is to split each table into an indexed table holding old data and a buffer table holding recent data. Optimizer works over both transparently.

Computational Biology Collaborations with several groups at NYU (plant and worm), Duke, Yale. Growth area: biologists need us, but we have a lot to learn. Big issues: control experimental space, evaluate data, infer an active (rather than just paper) model – combinatorial design. Visualization.

Sungear Design Generalizes Venn diagrams to more than three factors. Visual outline is an ellipse having anchors on borders and vessels in the interior. Each vessel points to associated anchors. Linked views to hierarchies, lists, and graphs, so can simultaneously update data depending on user queries (selection events).

Venn Diagram: great for three factors

Sungear Principle “Sungear is stupid” Doesn’t care which kind of data it is representing, though there is built-in support for genes (because of links to GO and to cytoscape). Basic Sungear representation could be used to describe anything from yachting gear to demographics.

Summary Hard problems with practical motivation. Fun algorithms – not afraid of heuristics. Fast, maintainable, portable applications.