Presented by Arun Qamra


1 Mining the Stock Market: Which Measure is the Best?
Martin Gavrilov, Dragomir Anguelov, Piotr Indyk, Rajeev Motwani
Presented by Arun Qamra

2 Main Idea
There is a lot of interest in mining time series data,
but little work on identifying measures suited to a specific class of data sets.
This work attempts to:
  Study similarity measures suitable for stocks
  Evaluate the results

3 More specifically...
500 stocks, data for one year (S&P index, 1998)
Opening price for 252 days
Time series clustering to find similar stocks
Variety of similarity measures

4 Evaluation Technique
How do you evaluate clustering results?
Each stock is pre-assigned to a cluster/category
  102 clusters (based on industry)
  Abstracted into 62 super-clusters
This assignment is used as ground truth
Attempt to recreate this clustering

5 Feature Selection
Data Representation
Normalization
Dimensionality Reduction

6 Data Representation
Raw
  Each sequence (i.e. stock) is represented by a point in 252-dimensional space
First Derivative
  The i-th coordinate is equal to the difference between the i-th and (i+1)-th values of the sequence
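The first-derivative representation can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the sample prices, and the sign convention (next value minus current value) are assumptions. Note that a series of n values yields n - 1 differences.

```python
def first_derivative(prices):
    """Return the differences between consecutive values.

    Sign convention (an assumption): next value minus current value.
    A series of n values yields n - 1 differences.
    """
    return [b - a for a, b in zip(prices, prices[1:])]


prices = [10.0, 10.5, 10.25, 11.0]  # hypothetical opening prices
print(first_derivative(prices))
```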

7 Normalization
Standard Normalization
  The mean is subtracted from all coordinates, then the vector is divided by its L2 norm
Piecewise Normalization
  Split the sequence into 'windows'
  Each window is normalized separately
  Local similarities are taken into account
No Normalization
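The two normalization schemes might look like the following sketch. The function names are hypothetical, and two details are assumptions the slide does not settle: a constant window is mapped to zeros instead of dividing by zero, and a final window shorter than the window size is normalized on its own.

```python
import math


def standard_normalize(values):
    """Subtract the mean, then divide by the L2 norm of the centered vector."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # Constant sequence: return zeros rather than divide by zero (assumption).
        return centered
    return [c / norm for c in centered]


def piecewise_normalize(values, window):
    """Normalize each consecutive block of `window` points separately.

    The last block may be shorter than `window` (assumption).
    """
    out = []
    for start in range(0, len(values), window):
        out.extend(standard_normalize(values[start:start + window]))
    return out
```

With piecewise normalization, two stocks that move alike within a window produce the same values for that window even if their price levels differ widely, which is how local similarities come into play.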

8 Dimensionality Reduction
Principal Component Analysis
  Maps vectors into a lower-dimensional space
Aggregation
  Local fluctuations are insignificant
  Groups of B consecutive data points are replaced by their average
  Hence dimensionality is reduced by a factor of B
Fourier Transform
  The time series is represented by a few of its lowest frequencies
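Aggregation and the Fourier representation can be sketched with the standard library alone. The function names are hypothetical, and dividing the DFT coefficients by n is a scaling choice made here, not something the slide specifies. PCA is left out because it needs an eigendecomposition, which is better done with a numerical library.

```python
import cmath


def aggregate(values, b):
    """Replace each group of b consecutive points by its average."""
    groups = [values[i:i + b] for i in range(0, len(values), b)]
    return [sum(g) / len(g) for g in groups]


def lowest_frequencies(values, k):
    """Return the first k DFT coefficients (frequencies 0 .. k-1).

    Coefficients are divided by n, so coefficient 0 is the mean (assumption).
    """
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, v in enumerate(values)) / n
        for f in range(k)
    ]
```

For a 252-point series, `aggregate(series, 4)` gives 63 values, and `lowest_frequencies(series, 5)` gives 5 complex coefficients.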

9 Similarity Measure
Use Euclidean distance
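As a small illustration, the Euclidean distance between two feature vectors can be computed as follows. The function name is hypothetical; the vectors are assumed to have gone through the same representation, normalization, and reduction steps, so they have equal length.

```python
import math


def euclidean(x, y):
    """L2 distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```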

10 Clustering Method
Hierarchical Agglomerative Clustering
  Hierarchical classification of objects
  Done by a series of binary mergers
  At each step, merge the two clusters with the smallest maximum distance between their elements
  Any level in the hierarchy can be chosen, based on the required number of clusters
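The merge rule above (smallest maximum distance between elements of two clusters) is usually called complete linkage. A naive sketch, assuming Euclidean distance and Python 3.8+ for `math.dist`, is shown below. The function name and sample points are hypothetical, and the brute-force search is written for clarity, not speed.

```python
import math


def complete_linkage(points, k):
    """Merge clusters pairwise until k remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest maximum distance.
        _, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
print(complete_linkage(points, 3))
```

Stopping at different values of `k` corresponds to cutting the hierarchy at different levels.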

11 Comparing Clusterings
Similarity measure for comparing clusterings: (formula not captured in this transcript)
Note: Not symmetric
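The formula on this slide is not in the transcript. The sketch below assumes one measure that fits the note, and this text does not confirm it: each cluster in the first clustering is matched to its best-overlapping cluster in the second, scored as 2|A ∩ B| / (|A| + |B|), and the scores are averaged over the first clustering. All names are hypothetical.

```python
def cluster_sim(a, b):
    """Overlap between two clusters given as sets: 2|a & b| / (|a| + |b|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def clustering_sim(c, c_prime):
    """Average, over clusters in c, of the best match found in c_prime."""
    return sum(max(cluster_sim(a, b) for b in c_prime) for a in c) / len(c)


truth = [{1, 2}, {3}, {4}]   # hypothetical ground-truth clustering
found = [{1, 2}, {3, 4}]     # hypothetical clustering to compare
print(clustering_sim(truth, found), clustering_sim(found, truth))
```

Because the average runs over the clusters of the first argument only, swapping the arguments can change the result, which is consistent with the measure not being symmetric.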

12 Precision-Recall curves
Precision-recall curves are also used for evaluation
  To make observations independent of the clustering algorithm
For each stock S:
  Rank all other stocks S' by their distance from S
  Plot the percentage of relevant stocks among the i closest stocks against i
Average over all stocks
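The ranking procedure above can be sketched as follows. It assumes that a "relevant" stock is one with the same ground-truth category as S, that distance is Euclidean, and that `max_i` is smaller than the number of stocks. The function name and inputs are hypothetical.

```python
import math


def precision_curve(vectors, labels, max_i):
    """For i = 1 .. max_i, the average over stocks of the fraction of
    same-label stocks among the i nearest neighbours."""
    n = len(vectors)
    totals = [0.0] * max_i
    for s in range(n):
        # Rank every other stock by its distance from stock s.
        others = sorted(
            (t for t in range(n) if t != s),
            key=lambda t: math.dist(vectors[s], vectors[t]),
        )
        hits = 0
        for i, t in enumerate(others[:max_i], start=1):
            hits += labels[t] == labels[s]
            totals[i - 1] += hits / i
    return [total / n for total in totals]
```

Plotting the returned list against i = 1 .. max_i gives the averaged curve described on the slide, with no clustering step involved.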

13 Results: Dimensionality Reduction
Preprocessing causes dimensionality dispersal
  Raw data: reduced to 10 or 5 dimensions
  Raw data, normalized: reduced to 50
  First derivative (FD), normalized: reduced to 100

14 Results: Clustering
Normalization improves results
First derivative improves results
Best results for the combination of:
  Piecewise normalization (window 15)
  First derivative

15 Conclusion
This paper:
  Identifies mining techniques specifically useful for stock market data
  Evaluates them against real data
Further research is needed to understand the behavior of this class of data
  The effect of taking the first derivative is not well understood

