Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects. Dennis Shasha. Joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, and Alberto Lerner. {shasha, yunyue, xiaojian, zhihua}, Courant Institute, New York University.

Goal of this work Time series are important in so many applications – biology, medicine, finance, music, physics, … A few fundamental operations occur all the time: burst detection, correlation, pattern matching. Do them fast to make data exploration faster, real time, and more fun.

Sample Needs Pairs Trading in Finance: find two stocks that track one another closely. When they go out of correlation, buy one and sell the other. Match a person’s humming against a database of songs to help him/her buy a song. Find bursts of activity even when you don’t know the window size over which to measure. Query and manipulate ordered data.

Why Speed Is Important As processors speed up, algorithmic efficiency no longer matters … one might think. True if problem sizes stayed the same, but they don't. As processors speed up, sensors improve: satellites spew out a terabyte a day, magnetic resonance imagers give higher-resolution images, etc. And there is a desire for real-time response to queries.

Surprise, surprise More data, real-time response, and the increasing importance of correlation IMPLY that efficient algorithms and data management matter more than ever!

Corollary Important area, lots of new problems. Small advertisement: High Performance Discovery in Time Series (Springer 2004). At this conference.

Outline Correlation across thousands of time series. Query by humming: correlation + shifting. Burst detection: when you don't know the window size. AQuery: a query language for all of the above.

Real-time Correlation Across Thousands of Time Series

Scalable Methods for Correlation Compress streaming data into moving synopses. Update the synopses in constant time. Compare synopses in real time. Use transforms + simple data structures. (Avoid curse of dimensionality.)

GEMINI framework* *Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 1994.

StatStream (VLDB 2002): Example Stock price streams –The New York Stock Exchange (NYSE) –50,000 securities (streams); 100,000 ticks (trade and quote) Pairs Trading, a.k.a. Correlation Trading Query: "which pairs of stocks were correlated with a value of over 0.9 for the last three hours?" XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …

Online Detection of High Correlation Given tens of thousands of high-speed time series data streams, detect high-value correlation, both synchronized and time-lagged, over sliding windows in real time. Real time –high update frequency of the data stream –fixed response time, online

StatStream: Naïve Approach Goal: find the most highly correlated stream pairs over sliding windows –N: number of streams –w: size of sliding window –space O(N) and time O(N²w). Suppose that the streams are updated every second. –With a Pentium 4 PC, the exact computation can monitor only 700 streams, reporting results once every two minutes. –"Punctuated result model" – not continuous, but online.

StatStream: Our Approach –Use the Discrete Fourier Transform to approximate correlation, as in the GEMINI approach. –Every two minutes (the "basic window size"), update the DFT for each time series over the last hour (the "window size"). –Use a grid structure to filter out unlikely pairs. –Our approach can report highly correlated pairs among 10,000 streams for the last hour with a delay of 2 minutes. So, at 2:02, find highly correlated pairs between 1 PM and 2 PM; at 2:04, find highly correlated pairs between 1:02 and 2:02 PM, etc.

StatStream: Stream synoptic data structure Three-level time interval hierarchy –Time point, Basic window, Sliding window Basic window (the key to our technique) –The computation for basic window i must finish by the end of basic window i+1 –The basic window time is the system response time. Digests –Basic window digests: sum of DFT coefficients –Sliding window digests: sum of DFT coefficients
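
To make the hierarchy concrete, here is a minimal sketch (my illustration, not the authors' code) of how digests might be maintained: points accumulate into a basic window, and each completed basic window contributes a DFT digest to a fixed-length queue that summarizes the sliding window. The class name, the use of numpy's rfft, and the 1/sqrt(bw) scaling are assumptions for illustration.

```python
from collections import deque
import numpy as np

class SlidingDigest:
    """Three-level hierarchy: points -> basic windows -> sliding window.
    When a basic window fills, its DFT digest is appended and the oldest
    digest retires, so the sliding-window summary is updated once per
    basic window rather than once per time point."""

    def __init__(self, bw, n_basic, n_coef):
        self.bw, self.n_coef = bw, n_coef
        self.points = []                      # current basic window
        self.digests = deque(maxlen=n_basic)  # one digest per basic window

    def append(self, v):
        self.points.append(v)
        if len(self.points) == self.bw:
            d = np.fft.rfft(self.points)[:self.n_coef] / np.sqrt(self.bw)
            self.digests.append(d)            # deque drops the oldest
            self.points = []
```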

How general technique is applied Compress streaming data into moving synopses: Discrete Fourier Transform. Update the synopses in time proportional to number of coefficients: basic window idea. Compare synopses in real time: compare DFTs. Use transforms + simple data structures: grid structure.

Synchronized Correlation Uses Basic Windows Inner product of aligned basic windows. (figure: streams x and y partitioned into aligned basic windows within a sliding window) The inner product within a sliding window is the sum of the inner products in all the basic windows in the sliding window.

Approximate Synchronized Correlation Approximate with an orthogonal function family (e.g., DFT). (figure: basic windows x1…x8 and y1…y8 projected onto basis functions f1, f2, f3) The inner product of the time series is approximated by the inner product of the digests. The time and space complexity is reduced from O(b) to O(n). –b: size of basic window –n: size of the digests (n << b), e.g., 120 time points reduce to 4 digests
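
A minimal illustration of the digest idea (my sketch, not the paper's code): keep the first n unitary DFT coefficients of each basic window and approximate the inner product via Parseval's theorem, doubling the non-DC terms to stand in for the conjugate-symmetric half of a real series' spectrum.

```python
import numpy as np

def digest(x, n):
    # first n coefficients of the unitary DFT of a basic window
    return np.fft.rfft(x)[:n] / np.sqrt(len(x))

def approx_inner(dx, dy):
    # Parseval: the sum over all coefficients of X * conj(Y) equals x . y;
    # with only n low frequencies kept, double the non-DC terms
    t = (dx * np.conj(dy)).real
    return t[0] + 2.0 * t[1:].sum()

b, n = 120, 4
x = np.cumsum(np.random.randn(b))     # random-walk-like basic window
y = x + 0.5 * np.random.randn(b)
print(np.dot(x, y), approx_inner(digest(x, n), digest(y, n)))
```

For random-walk-like data the energy concentrates in the low frequencies, which is why a handful of coefficients suffices.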

Approximate Lagged Correlation Inner product with unaligned windows. The time complexity is reduced from O(b) to O(n²), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case.

Grid Structure (to avoid checking all pairs) The DFT coefficients yield a vector. High correlation => closeness in the vector space. –We can use a grid structure and look in the neighborhood; this will return a superset of the highly correlated pairs.
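
A toy version of the neighborhood search (illustrative only; the cell side eps and the function name are my choices): hash each digest vector into a grid cell and compare only vectors in the same or adjacent cells, which yields a candidate superset to verify exactly.

```python
from collections import defaultdict
from itertools import product
import numpy as np

def grid_candidates(digests, eps):
    """Bucket digest vectors into cells of side eps and pair up only
    vectors in the same or adjacent cells -- a superset of all pairs
    within distance eps, to be verified exactly afterwards."""
    cells = defaultdict(list)
    for i, v in enumerate(digests):
        cells[tuple(np.floor(v / eps).astype(int))].append(i)
    candidates = set()
    for cell, ids in cells.items():
        # 3**d neighbor cells; fine for the short digest vectors used here
        for nb in product(*[(c - 1, c, c + 1) for c in cell]):
            for i in ids:
                for j in cells.get(nb, []):
                    if i < j:
                        candidates.add((i, j))
    return candidates
```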

Empirical Study: Speed Our algorithm is parallelizable.

Empirical Study: Accuracy Approximation errors –A larger digest, a larger sliding window, and a smaller basic window give better approximation. –The approximation errors (error in the correlation coefficient) are small.

Sketches: Random Projection* Correlation between time series of stock returns –Since most stock price time series are close to random walks, their return time series are close to white noise. –DFT/DWT can't capture approximately white noise series because there is no clear trend (too many frequency components). Solution: Sketches (a form of random landmark) –Sketch pool: a list of random vectors drawn from a stable distribution. –Sketch: the list of distances from a data vector to the sketch pool. –The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee. *W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26, 1984. D. Achlioptas. "Database-friendly random projections". In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, 2001.
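
A minimal sketch-pool implementation (an assumption-laden illustration: ±1 entries as in Achlioptas's database-friendly projections, with 1/√k scaling; the stable-distribution variant would draw Gaussian entries instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketch_pool(k, w):
    """k random +/-1 vectors of window length w (the sketch pool)."""
    return rng.choice([-1.0, 1.0], size=(k, w))

def sketch(x, pool):
    """Sketch = scaled inner products with the pool. By the
    Johnson-Lindenstrauss lemma, distances between sketches
    approximate distances between the original windows."""
    return pool @ x / np.sqrt(pool.shape[0])

# toy check: sketch distance tracks the true Euclidean distance
w, k = 256, 80
x, y = np.random.randn(w), np.random.randn(w)
pool = make_sketch_pool(k, w)
print(np.linalg.norm(x - y),
      np.linalg.norm(sketch(x, pool) - sketch(y, pool)))
```

Note that the same pool must be used for every series being compared.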

Sketches : Intuition You are walking in a sparse forest and you are lost. You have an old-time cell phone without GPS. You want to know whether you are close to your friend. You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak etc. If your friend is at similar distances from several of these landmarks, you might be close to one another. The sketch is just the set of distances.

Sketches: Random Projection (figure: raw time series → inner product with random vectors → sketches)

(figure) The ratio of sketch distance to real distance (sliding window size = 256, sketch size = 80).

Empirical Study: Sketch on Price and Return Data DFT and DWT work well for prices (today's price is a good predictor of tomorrow's) but badly for returns ((today's price − yesterday's price) / today's price). Data length = 256; the first 14 DFT coefficients are used in the distance computation; the db2 wavelet is used with coefficient size = 16; the sketch size is 64.

Empirical Comparison: DFT, DWT and Sketch

Sketch Guarantees Note: sketches do not provide approximations of an individual time series window, but they help make comparisons. Johnson-Lindenstrauss Lemma: For any 0 < ε < 1 and any integer n, let k be a positive integer such that k ≥ 4 (ε²/2 − ε³/3)^(-1) ln n. Then for any set V of n points in R^d, there is a map f : R^d → R^k such that for all u, v ∈ V, (1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||². Further, this map can be found in randomized polynomial time.

Overcoming the curse of dimensionality* May need many random projections. Can partition sketches into disjoint pairs or triplets and perform comparisons on those. Each such small group is placed into an index. The algorithm must adapt to give the best results. *Idea from P. Indyk, N. Koudas, and S. Muthukrishnan. "Identifying representative trends in massive time series data sets using sketches". VLDB 2000.

(figure: inner products of time series X, Y, Z with random vectors r1, r2, r3, r4, r5, r6)

Grid structure

Further Performance Improvements –Suppose we have R random projections and a window size WS. –It might seem that we have to do R·WS work for each time point for each time series. –In ongoing work with colleague Richard Cole, we show that we can cut this down by use of convolution and an oxymoronic notion of "structured random vectors".* *Idea from Dimitris Achlioptas, "Database-friendly Random Projections", Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

How to compute the sketch efficiently We will not compute the inner product at each data point, which is expensive. Given that the random vectors' entries are +1/−1*, we explain our algorithm through an example. *Idea from Dimitris Achlioptas, "Database-friendly Random Projections", Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

Our algorithm Given a time series, compute its sketch for a window of size sw = 12. Partition the window into smaller basic windows of size bw = 4. The random vector within a basic window is r = (r1, …, r_bw), where each r_i = 1 or −1 with probability ½ each. At the cost of reducing randomization, we have another control vector b = (b1, …, b_{sw/bw}), where each b_j = 1 or −1 with probability ½ each; b_j determines whether basic window j is multiplied by −1 or 1. A final complete random vector may look like (1 1 1 1; −1 −1 −1 −1; 1 1 1 1), i.e., r = (1 1 1 1) and b = (1 −1 1).

Our algorithm continued Then convolve r with each basic window after padding with |bw| zeros. Here, to show the example, we take all r_i = 1. The animation shows convolution in action: conv1: (1 1 1 1) ∗ (x1, x2, x3, x4); conv2: (1 1 1 1) ∗ (x5, x6, x7, x8); conv3: (1 1 1 1) ∗ (x9, x10, x11, x12). The outputs of conv1 are the partial sums x4, x4+x3, x4+x3+x2, x4+x3+x2+x1, x3+x2+x1, x2+x1, x1 — the dot products of r with every alignment of the basic window.

Our algorithm continued (figure: the partial sums from consecutive basic windows bw1 (x1…x4) and bw2 (x5…x8) are aligned and added to form the sketches Sk1 through Sk4)

Our algorithm continued Summing up the corresponding items gives us the sought sketches as follows: Sk1 = (x1+x2+x3+x4); Sk2 = (x2+x3+x4) + (x5); Sk3 = (x3+x4) + (x5+x6); Sk4 = (x4) + (x5+x6+x7). After 3 such convolutions (note: because we have 3 basic windows), and then after an inner product with b, the sketch for the first sliding window comes into formation; that is, (Sk1 Sk5 Sk9) · (b1 b2 b3), where · is the inner product.
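
The following self-contained sketch (my reconstruction under the slide's assumptions, with hypothetical function names) builds the sketch of every sliding window from one convolution per series and checks the result against direct inner products with the expanded random vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def structured_sketches(x, sw, bw):
    """Sketch every length-sw sliding window of x against ONE structured
    random vector: r (+/-1, length bw) repeated sw//bw times, each copy
    signed by a control bit b_j. A single convolution gives r's dot
    product with every alignment of a basic window; combining the
    aligned partial sums with b yields all window sketches cheaply."""
    nb = sw // bw
    r = rng.choice([-1.0, 1.0], size=bw)
    b = rng.choice([-1.0, 1.0], size=nb)
    dots = np.convolve(x, r[::-1], mode="valid")   # dot(x[t:t+bw], r) for all t
    out = np.array([sum(b[j] * dots[t + j * bw] for j in range(nb))
                    for t in range(len(x) - sw + 1)])
    full = np.concatenate([bj * r for bj in b])    # the expanded random vector
    return out, full

# check against direct inner products with the expanded vector
x = np.random.randn(40)
sk, full = structured_sketches(x, sw=12, bw=4)
print(np.allclose(sk, [x[t:t + 12] @ full for t in range(len(sk))]))  # True
```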

Performance Naïve algorithm: for each datum and random vector, O(|sw|) integer additions. New algorithm: asymptotically, for each datum and random vector, (1) O(|sw|/|bw|) integer additions and (2) O(log |bw|) floating-point operations (use the FFT in computing convolutions).

Query by humming: Correlation + Shifting

Query By Humming You have a song in your head. You want to get it but don't know its title. If you're not too shy, you hum it to your friends or to a salesperson and you find it. They may grimace, but you get your CD.

With a Little Help From My Warped Correlation (audio demo: Karen's humming → match; Dennis's humming → match; Yunyue's humming → match) "What would you do if I sang out of tune?"

Related Work in Query by Humming Traditional method: string matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99] –Music represented by a string of pitch directions: U, D, S (degenerate interval). –The hum query is segmented into discrete notes, then into a string of pitch directions. –Edit distance between the hum query and the music score. Problem –It is very hard to segment the hum query. –Partial solution: users are asked to hum articulately. New method: matching directly from audio [Mazzoni and Dannenberg 00]. We use both.

Time Series Representation of Query (figure: an example hum query) Note segmentation is hard — segment this!

How to deal with poor hummers? No absolute pitch –Solution: the average pitch is subtracted. Inaccurate pitch intervals –Solution: return the k nearest neighbors. Incorrect overall tempo –Solution: Uniform Time Warping. Local timing variations –Solution: Dynamic Time Warping. Bottom line: timing variations take us beyond Euclidean distance.

Dynamic Time Warping Euclidean distance: the sum of point-by-point distances. DTW distance: allows stretching or squeezing the time axis locally.
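
For reference, a textbook O(nm) dynamic-programming implementation of DTW (a generic sketch, not the system's code):

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic-programming DTW distance between sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # stretch y, stretch x, or advance both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([0, 1, 1, 2, 3], [0, 1, 2, 2, 3]))  # small, despite local warps
```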

Envelope Transform using Piecewise Aggregate Approximation (PAA) [Keogh VLDB 02]

Envelope Transform using Piecewise Aggregate Approximation (PAA) Advantage of tighter envelopes –Still no false negatives, and fewer false positives.
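
To illustrate the envelope idea (a simplified sketch in the spirit of [Keogh VLDB 02]; the PAA averaging step is omitted and the names are my own): build upper/lower envelopes of a candidate under a warping radius, then use a cheap lower bound that can only under-estimate DTW, so filtering with it never discards a true match.

```python
import numpy as np

def envelope(y, radius):
    """Pointwise upper/lower envelope of array y under a warping radius."""
    lo = np.array([y[max(0, i - radius):i + radius + 1].min() for i in range(len(y))])
    hi = np.array([y[max(0, i - radius):i + radius + 1].max() for i in range(len(y))])
    return lo, hi

def lb_envelope(x, lo, hi):
    """Lower bound on DTW(x, y): counts only how far x strays outside
    y's envelope, so envelope filtering has no false negatives."""
    d = np.where(x > hi, x - hi, np.where(x < lo, lo - x, 0.0))
    return np.sqrt((d ** 2).sum())

y = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
lo, hi = envelope(y, radius=1)
print(lb_envelope(np.array([0.0, 2.0, 2.0, 2.0, 0.0]), lo, hi))
```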

Container-Invariant Envelope Transform Container-invariant: a transformation T for envelopes such that whenever a time series x lies within an envelope Env(y), the transformed series T(x) lies within the transformed envelope T(Env(y)). Theorem: if a transformation is container-invariant and lower-bounding, then the distance between a transformed time series x and the transformed envelope of y lower-bounds their DTW distance. (figure: feature space)

Framework to Scale Up for a Large Database (figure: pipeline — humming with 'ta' → segmented notes → note/duration sequence → keyword filter → boosted feature filter using database-statistics-based features → nearest-N search on DTW distance with transformed envelope filter on melody (notes) → top N matches → rhythm (duration) alignment verifier → top N' matches)

Improvement by Introducing Humming with 'ta'* Solves the problem of note segmentation. Compare humming with 'la' and 'ta'. *Idea from N. Kosugi et al., "A practical query-by-humming system for a large music database", ACM Multimedia 2000.

Improvement by Introducing Humming with 'ta' (2) Still use DTW distance to tolerate poor humming. Decrease the size of the time series by orders of magnitude, thus reducing the computation of the DTW distance.

Statistics-Based Filters* Low-dimensional statistical feature comparison –Low computation cost compared to DTW distance. –Quickly filter out true negatives. Example –Filter out candidates whose note length is much larger/smaller than the query's note length. More –Standard deviation of note value –Zero-crossing rate of note value –Number of local minima of note value –Number of local maxima of note value *Intuition from Erling Wold et al., "Content-based classification, search and retrieval of audio", IEEE Multimedia.

Boosting Statistics-Based Filters Characteristics of statistics-based filters –Quick but weak classifiers. –Do not guarantee no false negatives. –Ideal candidates for boosting. Boosting* –"An algorithm for constructing a 'strong' classifier using only a training set and a set of 'weak' classification algorithms". –"A particular linear combination of these weak classifiers is used as the final classifier, which has a much smaller probability of misclassification". *Cynthia Rudin et al., "On the Dynamics of Boosting", in Advances in Neural Information Processing Systems, 2004.

Verify Rhythm Alignment in the Query Result Nearest-N search uses only melody information. Will A. Arentz et al.* suggest combining rhythm and melody: –Results are generally better than using only melody information. –Not appropriate when the sum of several notes' durations in the query may correspond to the duration of one note in the candidate. Our method: –First use melody information for the DTW distance computation. –Merge durations appropriately based on the note alignment. –Reject candidates with bad rhythm alignment. *Will Archer Arentz, "Methods for retrieving musical information based on rhythm and pitch correlation", CSGSC 2003.

Query by Humming Demo 1,039 songs (73,051 note/duration sequences).

Burst detection: when window size is unknown

Burst Detection: Applications Discovering intervals with unusually large numbers of events. –In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. It might last milliseconds or days… –In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate a network anomaly. The exact duration is unknown. –In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).

Bursts across different window sizes in Gamma Rays Challenge: to discover not only the time of the burst, but also the duration of the burst.

Burst Detection: Challenge This is a single-stream problem. What makes it hard is that we look at multiple window sizes at the same time. The naïve approach is to handle one window size at a time.

Elastic Burst Detection: Problem Statement Problem: given a time series of positive numbers x_1, x_2, ..., x_n and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size whose sums are above the thresholds: –all w, m with 0 < w < n and 0 < m < n − w such that x_m + x_{m+1} + … + x_{m+w−1} ≥ f(w). Brute-force search: O(n²) time. Our shifted binary tree (SBT): O(n + k) time, –where k is the size of the output, i.e., the number of windows with bursts.

Burst Detection: Data Structure and Algorithm –Define the threshold for a node of size 2^k to be the threshold for a window of size 1 + 2^{k−1}.
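
A compact sketch of the idea (my illustration, not the released code): level k of the shifted binary tree holds sums of windows of size 2^k shifted by half their length, so overlapping nodes cover every subsequence in the size range assigned to that level. A node whose sum reaches the slide's threshold f(1 + 2^{k−1}) only flags a candidate, which a detailed search then confirms or denies.

```python
import numpy as np

def sbt_candidates(x, f, max_level):
    """Scan each level of the shifted binary tree over series x.
    Level k: windows of size 2**k shifted by 2**(k-1), so consecutive
    nodes overlap by half. A node whose sum reaches f(1 + 2**(k-1))
    is a candidate burst region to be verified exactly."""
    x = np.asarray(x, dtype=float)
    candidates = []
    for k in range(1, max_level + 1):
        size, shift = 2 ** k, 2 ** (k - 1)
        thresh = f(1 + shift)
        for start in range(0, len(x) - size + 1, shift):
            if x[start:start + size].sum() >= thresh:
                candidates.append((start, size))  # run detailed search here
    return candidates

# toy run with an illustrative threshold function f(w) = 4w
x = np.random.poisson(2.0, size=1024)
print(sbt_candidates(x, f=lambda w: 4.0 * w, max_level=6)[:5])
```

Because the series consists of positive numbers, a node's sum upper-bounds the sum of any window inside it, so true bursts are never missed; false alarms cost only verification work.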

Burst Detection: Example

(figure: example alarms in the shifted binary tree — a true alarm and a false alarm)

Burst Detection: Algorithm In linear time, determine whether any node in the SBT indicates an alarm. If so, do a detailed search to confirm (true alarm) or deny (false alarm) a real burst. In the on-line version of the algorithm, we need keep only the most recent node at each level.

False Alarms (requires work, but no errors)

Empirical Study: Gamma Ray Burst

Case Study: Burst Detection (1) Background and motivation: in astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual event burst may signal an event interesting to physicists. Technical overview: 1. The sky is partitioned into 1800 × 900 buckets. 2. Sliding window lengths are monitored from 0.1m to 39.81m. 3. The original code implements the naïve algorithm.

Case Study: Burst Detection (2) The challenges: 1. Vast amount of data –1800 × 900 time series, so any trivial overhead may accumulate into nontrivial expense. 2. Unavoidable overheads of data transformations –Data pre-processing such as fetching and storage requires much work. –SBT trees have to be built no matter how many sliding windows are to be investigated. –Thresholds are maintained over time due to the differing background noise. –A hit on one bucket affects its neighbors, as shown in the previous figure.

Case Study: Burst Detection (3) Our solutions: 1. Combine nearby buckets into one to save space and processing time. If any alarms are reported for the large bucket, go down to examine each small component (two-level detailed search). 2. Special implementation of the SBT tree –Build the SBT tree including only those levels covering the sliding windows. –Maintain a threshold tree for the sliding windows and update it over time. Fringe benefits: 1. Adding window sizes is easy. 2. More monitored sliding windows also benefit physicists.

Case Study: Burst Detection (4) Experimental results: 1. Benefits improve with more sliding windows. 2. Results are consistent across different data files. 3. The SBT algorithm runs 7 times faster than the current algorithm. 4. More improvement is possible if memory limitations are removed.

Extension to other aggregates The SBT can be used for any aggregate that is monotonic. –SUM, COUNT and MAX are monotonically increasing: the alarm condition is aggregate ≥ threshold. –MIN is monotonically decreasing: the alarm condition is aggregate < threshold. –Spread = MAX − MIN. Application in finance: –Stocks with a burst of trading or quote (bid/ask) volume (Hammer!). –Stock prices with high spread.

Empirical Study: Stock Price Spread Burst

Extension to high dimensions

Elastic Burst in two dimensions (figure: population distribution in the US)

Can discover numeric thresholds from a probability threshold. Suppose that the moving sum of a time series is a random variable drawn from a normal distribution. Let the number of bursts in the time series within sliding window size w be S_o(w) and its expectation be S_e(w). –S_e(w) can be computed from the historical data. Given a threshold probability p, we set the burst threshold f(w) for window size w such that Pr[S_o(w) ≥ f(w)] ≤ p.

Find threshold for Elastic Bursts Φ(x) is the standard normal CDF, which is symmetric around 0: Φ(−x) = 1 − Φ(x). Writing σ_w for the standard deviation of the moving sum, Pr[S_o(w) ≥ f(w)] = 1 − Φ((f(w) − S_e(w))/σ_w) ≤ p, and therefore f(w) = S_e(w) + σ_w Φ^{-1}(1 − p) = S_e(w) − σ_w Φ^{-1}(p). (figure: standard normal density with tail probability p at Φ^{-1}(p))
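
In code (a one-line sketch of the formula above; scipy's norm.ppf is the inverse CDF Φ^{-1}, and the example numbers are illustrative):

```python
from scipy.stats import norm

def burst_threshold(S_e, sigma, p):
    """f(w) = S_e(w) - sigma_w * Phi^{-1}(p): under the normal model,
    the moving sum exceeds this value with probability at most p."""
    return S_e - sigma * norm.ppf(p)

print(burst_threshold(100.0, 10.0, 0.01))  # ~123.3
```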

Summary of Burst Detection Able to detect bursts over many different window sizes in essentially linear time. Can be used both for time series and for spatial searching. Can specify thresholds either with absolute numbers or with a probability of a hit. The algorithm is simple to implement and has low constants (code is available). OK, it's embarrassingly simple.

AQuery: A Database System for Order