1
Mining the Stock Market: Which Measure is the Best?
Martin Gavrilov, Dragomir Anguelov, Piotr Indyk, Rajeev Motwani
Presented by Arun Qamra
2
Main Idea
- A lot of interest in mining time series data
- But little work on identifying measures suitable for specific classes of data sets
- This work attempts to
  - Study similarity measures suitable for stocks
  - Evaluate results
3
More specifically...
- 500 stocks, data for one year (S&P index, 1998)
- Opening price for 252 days
- Time series clustering to find similar stocks
- Variety of similarity measures
4
Evaluation Technique
- How do you evaluate clustering results?
- Each stock pre-assigned to a cluster/category
  - 102 clusters (based on industry)
  - Abstracted into 62 super-clusters
- Used as ground truth
- Attempt to recreate this clustering
5
Feature Selection
- Data Representation
- Normalization
- Dimensionality Reduction
6
Data Representation
- Raw
  - A point in 252-dimensional space represents a sequence, i.e. a stock
- First Derivative
  - The i-th coordinate is the difference between the i-th and (i+1)-th values of the sequence
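As a minimal sketch of the first-derivative representation described on this slide: the function name, the sample prices, and the sign convention (next value minus current value) are assumptions made for illustration, since the slide only says "difference".

```python
def first_derivative(prices):
    """Return consecutive differences prices[i+1] - prices[i].

    The sign convention is an assumption; the slide only says "difference".
    The result has one coordinate fewer than the input.
    """
    return [nxt - cur for cur, nxt in zip(prices, prices[1:])]


# Made-up opening prices, not data from the study.
prices = [10.0, 10.5, 10.25, 11.0]
print(first_derivative(prices))  # [0.5, -0.25, 0.75]
```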
7
Normalization
- Standard Normalization
  - Mean subtracted from all coordinates, then vector divided by its L2 norm
- Piecewise Normalization
  - Split sequence into ‘windows’
  - Each window normalized separately
  - Local similarities taken into account
- No Normalization
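The two normalizations above can be sketched roughly as follows. The function names are hypothetical, and two details are assumptions the slide does not specify: a constant window is left as all zeros, and a final window shorter than the others is normalized on its own.

```python
import math


def standard_normalize(seq):
    """Subtract the mean, then divide by the L2 norm of the centered vector."""
    mean = sum(seq) / len(seq)
    centered = [x - mean for x in seq]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        # Assumption: a constant sequence stays all zeros (avoids dividing by 0).
        return centered
    return [x / norm for x in centered]


def piecewise_normalize(seq, window):
    """Split seq into consecutive windows and normalize each one separately."""
    out = []
    for start in range(0, len(seq), window):
        # Assumption: a shorter trailing window is normalized by itself.
        out.extend(standard_normalize(seq[start:start + window]))
    return out


# Made-up values: the two halves differ only in scale, so after piecewise
# normalization with window 3 they become (approximately) identical.
print(piecewise_normalize([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], window=3))
```

Because each window is rescaled independently, two stocks that move alike within a window look similar even if their overall price levels drift apart, which is the "local similarities" point on the slide.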
8
Dimensionality Reduction
- Principal Component Analysis
  - Maps vectors into a lower-dimensional space
- Aggregation
  - Local fluctuations insignificant
  - Groups of B consecutive data points replaced by their average
  - Hence dimensionality reduced by a factor of B
- Fourier Transform
  - Time series represented by a few of its lowest frequencies
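A rough sketch of two of the reductions listed above, aggregation and the Fourier transform, using only the standard library. The function names are hypothetical, the averaging of a shorter trailing group is an assumption, and the transform is a plain, unnormalized discrete Fourier transform chosen for illustration. PCA is left out because it needs a linear algebra library.

```python
import cmath


def aggregate(seq, b):
    """Replace each group of b consecutive points by its average."""
    out = []
    for start in range(0, len(seq), b):
        group = seq[start:start + b]
        # Assumption: a shorter trailing group is averaged over its own length.
        out.append(sum(group) / len(group))
    return out


def lowest_frequencies(seq, k):
    """Return the k lowest-frequency DFT coefficients of seq (unnormalized)."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq))
        for f in range(k)
    ]


series = [float(i % 7) for i in range(252)]  # made-up 252-point series
print(len(aggregate(series, 4)))             # 252 / 4 = 63 dimensions
print(len(lowest_frequencies(series, 10)))   # 10 complex coefficients
```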
9
Similarity Measure
- Use Euclidean distance
10
Clustering Method
- Hierarchical Agglomerative Clustering
  - Hierarchical classification of objects
  - Done by a series of binary mergers
  - Each merger joins the two clusters with the smallest maximum distance between their elements
  - Any level in the hierarchy can be chosen based on the required number of clusters
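The merge rule described here (smallest maximum distance between elements of two clusters) is commonly called complete linkage. A naive sketch, assuming Euclidean distance between feature vectors and stopping once a requested number of clusters remains; the function names and sample points are hypothetical, and the brute-force search is chosen for clarity rather than speed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, num_clusters):
    """Merge clusters pairwise until num_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Made-up 2-D points standing in for (much longer) stock vectors.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(complete_linkage(points, 3))  # [[0, 1], [2, 3], [4]]
```

Stopping at a chosen cluster count corresponds to cutting the hierarchy at one level, as the last bullet describes.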
11
Comparing Clusterings
- Similarity measure for comparing clusterings: [formula not preserved in this transcript]
- Note: not symmetric
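The slide's formula is missing from this transcript, so the following is an illustration only and not necessarily the measure on the slide. It assumes one common construction that is also not symmetric: score a pair of clusters by 2|A ∩ B| / (|A| + |B|), match each cluster of the first clustering to its best counterpart in the second, and average those best scores. All names and the sample data are hypothetical.

```python
def cluster_sim(a, b):
    """Overlap score of two clusters: 2|A & B| / (|A| + |B|), between 0 and 1."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


def clustering_sim(clustering, other):
    """Average, over clusters in `clustering`, of the best match in `other`.

    Assumed form, for illustration. Swapping the arguments can change the
    result, so the measure is not symmetric.
    """
    return sum(
        max(cluster_sim(c, o) for o in other) for c in clustering
    ) / len(clustering)


# Made-up tickers and groupings.
truth = [{"A", "B"}, {"C"}, {"D"}]
found = [{"A", "B"}, {"C", "D"}]
print(clustering_sim(truth, found))  # about 0.778
print(clustering_sim(found, truth))  # about 0.833
```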
12
Precision-Recall Curves
- Precision-recall curves also used for evaluation
- To make observations independent of the clustering algorithm
- For each stock S
  - Rank all other stocks S’ based on distance
  - Plot the percentage of relevant stocks among the i closest stocks against i
- Average over all stocks
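The procedure above can be sketched as below. It assumes that a "relevant" stock is one with the same ground-truth category as S and that distance is Euclidean; the function names and toy data are hypothetical, and the result is returned as a list of fractions rather than plotted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relevance_curve(vectors, labels):
    """Average fraction of same-label stocks among the i nearest, for each i.

    Entry i-1 of the result corresponds to the i closest stocks.
    """
    n = len(vectors)
    totals = [0.0] * (n - 1)
    for s in range(n):
        # Rank every other stock by its distance to stock s.
        ranked = sorted(
            (j for j in range(n) if j != s),
            key=lambda j: euclidean(vectors[s], vectors[j]),
        )
        hits = 0
        for i, j in enumerate(ranked, start=1):
            if labels[j] == labels[s]:  # assumption: relevant = same category
                hits += 1
            totals[i - 1] += hits / i
    return [t / n for t in totals]  # average over all stocks


# Made-up 1-D "stocks" in two categories.
vectors = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = ["tech", "tech", "energy", "energy"]
print(relevance_curve(vectors, labels))
```

Since the curve depends only on the distance ranking, it evaluates the representation and distance measure without involving any clustering algorithm.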
13
Results: Dimensionality Reduction
- Preprocessing causes dimensionality dispersal
  - Raw data, reduced to 10 or 5 dimensions
  - Raw data, normalized, reduced to 50
  - FD (first derivative), normalized, reduced to 100
14
Results: Clustering
- Normalization improves results
- Derivative improves results
- Best results for the combination of
  - Piecewise Normalization (window 15)
  - First Derivative
15
Conclusion
- This paper
  - Identifies mining techniques specifically useful for stock market data
  - Evaluates against real data
- Further research is needed to understand the behavior of this class of data
  - The effect of taking the first derivative is not well understood