Presented by Arun Qamra

Mining the Stock Market: Which Measure is the Best?
Martin Gavrilov, Dragomir Anguelov, Piotr Indyk, Rajeev Motwani
Presented by Arun Qamra

Main Idea
There is a lot of interest in mining time series data, but little work on identifying which measures suit a specific class of data sets.
This work attempts to study similarity measures suitable for stocks and to evaluate the results.

More specifically
Data: 500 stocks from the S&P index, one year of data (1998), opening price for each of 252 days.
Task: time series clustering to find similar stocks.
A variety of similarity measures is compared.

Evaluation Technique
How do you evaluate clustering results?
Each stock is pre-assigned to a cluster (category): 102 clusters based on industry, abstracted into 62 super-clusters.
This assignment is used as ground truth, and the experiments attempt to recreate it.

Feature Selection
Data representation
Normalization
Dimensionality reduction

Data Representation
Raw: a point in 252-dimensional space represents a sequence, i.e. a stock.
First derivative: the i-th coordinate is equal to the difference between the i-th and (i+1)-th values of the sequence.
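The first-derivative representation can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the sample prices are made up, and the sign convention (next value minus current value) is an assumption, since the slide only says "difference". Note that a 252-point sequence yields 251 differences.

```python
def first_derivative(series):
    """Return consecutive differences of a price series.

    Assumption: each coordinate is (next value - current value); the slide
    does not state which way the difference is taken.
    """
    return [nxt - cur for cur, nxt in zip(series, series[1:])]


# Made-up opening prices, for illustration only.
prices = [10.0, 10.5, 10.25, 11.0]
print(first_derivative(prices))  # → [0.5, -0.25, 0.75]
```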

Normalization
Standard normalization: the mean is subtracted from all coordinates, then the vector is divided by its L2 norm.
Piecewise normalization: the sequence is split into ‘windows’ and each window is normalized separately, so local similarities are taken into account.
No normalization.
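The two normalization schemes might look roughly like this. It is a sketch under assumptions the slide does not settle: windows are taken as non-overlapping, a trailing window may be shorter than the rest, and a constant window is left as all zeros instead of dividing by a zero norm. The function names are illustrative.

```python
import math


def standard_normalize(vec):
    """Subtract the mean, then divide by the L2 norm of the centered vector."""
    mean = sum(vec) / len(vec)
    centered = [v - mean for v in vec]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        # Assumption: a constant sequence stays all zeros.
        return centered
    return [c / norm for c in centered]


def piecewise_normalize(vec, window):
    """Normalize each consecutive, non-overlapping window separately."""
    out = []
    for start in range(0, len(vec), window):
        out.extend(standard_normalize(vec[start:start + window]))
    return out


# Two windows with the same shape at different price levels normalize
# to the same values, which is the "local similarity" effect.
print(piecewise_normalize([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], window=3))
```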

Dimensionality Reduction
Principal Component Analysis: maps vectors into a lower-dimensional space.
Aggregation: local fluctuations are treated as insignificant, so each group of B consecutive data points is replaced by its average; dimensionality is reduced by a factor of B.
Fourier Transform: the time series is represented by a few of its lowest frequencies.
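Aggregation and Fourier truncation are simple enough to sketch with the standard library alone (PCA is left out here because it needs a linear algebra routine). The names below are illustrative, and the 1/n scaling of the Fourier coefficients is a choice made for this sketch, not something the slides specify.

```python
import cmath


def aggregate(series, b):
    """Replace each group of b consecutive points by its average."""
    groups = [series[i:i + b] for i in range(0, len(series), b)]
    return [sum(g) / len(g) for g in groups]


def lowest_frequencies(series, k):
    """Return the k lowest-frequency DFT coefficients (scaled by 1/n)."""
    n = len(series)
    coeffs = []
    for f in range(k):
        total = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(series))
        coeffs.append(total / n)
    return coeffs


print(aggregate([1, 2, 3, 4, 5, 6], 2))   # → [1.5, 3.5, 5.5]
print(len(aggregate(list(range(252)), 4)))  # → 63
```

With B = 4, a 252-day sequence shrinks to 63 coordinates, matching the "reduced by a factor of B" statement.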

Similarity Measure
Euclidean distance is used.

Clustering Method
Hierarchical Agglomerative Clustering: a hierarchical classification of objects, built by a series of binary mergers.
At each step, the two clusters merged are those with the smallest maximum distance between any two elements, one from each cluster.
Any level in the hierarchy can be chosen, based on the required number of clusters.
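The "smallest max distance" merge rule is what is usually called complete linkage. A naive sketch of that procedure follows; it recomputes all distances at every step, so it is only meant to show the logic, not to be efficient. The function names and sample points are made up, and stopping at a requested number of clusters is one way of "choosing a level" in the hierarchy.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, num_clusters):
    """Merge clusters pairwise until num_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    target = max(1, num_clusters)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest pair.
                d = max(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(complete_linkage(points, 3))  # → [[0, 1], [2, 3], [4]]
```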

Comparing Clusterings
Similarity measure for comparing clusterings: [formula not preserved in this transcript]
Note: the measure is not symmetric.
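Since the formula itself is missing from this transcript, the following is only a sketch of one measure with the stated property, and it is an assumption that it corresponds to the one on the slide: for each cluster in the first clustering, take its best overlap score 2|A ∩ B| / (|A| + |B|) against the clusters of the second, then average over the clusters of the first. Averaging over the first clustering only is what makes such a measure asymmetric.

```python
def cluster_overlap(a, b):
    """Overlap score between two clusters: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


def clustering_similarity(g, g_prime):
    """Average, over clusters of g, of the best overlap with any cluster of g_prime.

    Assumed form of the measure; not confirmed by the transcript.
    """
    return sum(max(cluster_overlap(c, d) for d in g_prime) for c in g) / len(g)


g = [{1, 2}, {3}, {4}]
g_prime = [{1, 2}, {3, 4}]
print(clustering_similarity(g, g_prime))   # 7/9, about 0.778
print(clustering_similarity(g_prime, g))   # 5/6, about 0.833
```

The two directions give different values for the same pair of clusterings, which illustrates the "not symmetric" note.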

Precision-Recall curves
Precision-recall curves are also used for evaluation, to make the observations independent of the clustering algorithm.
For each stock S, rank all other stocks S’ by distance from S.
Plot the percentage of relevant stocks among the i closest stocks against i.
Average over all stocks.
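The ranking procedure can be sketched as follows. This assumes that a stock counts as "relevant" to S when it shares S's ground-truth category, and that distance is Euclidean; the function names, vectors, and labels are made up for illustration. The result is the fraction (not percentage) of relevant stocks at each cutoff i, averaged over all stocks.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def precision_curve(vectors, labels, max_i):
    """Average fraction of same-label items among the i nearest, for i = 1..max_i.

    Requires max_i <= len(vectors) - 1.
    """
    n = len(vectors)
    totals = [0.0] * max_i
    for s in range(n):
        # Rank every other item by distance from item s.
        ranked = sorted((t for t in range(n) if t != s),
                        key=lambda t: euclidean(vectors[s], vectors[t]))
        hits = 0
        for i in range(max_i):
            if labels[ranked[i]] == labels[s]:
                hits += 1
            totals[i] += hits / (i + 1)
    return [t / n for t in totals]


vectors = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = ["a", "a", "b", "b"]
print(precision_curve(vectors, labels, 3))
```

In this toy example each item's nearest neighbor shares its label, so the curve starts at 1.0 and falls as items from the other category enter the ranking.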

Results: Dimensionality Reduction
Preprocessing causes dimensionality dispersal.
Raw data: reduced to 10 or 5 dimensions.
Raw data, normalized: reduced to 50 dimensions.
First derivative (FD), normalized: reduced to 100 dimensions.

Results: Clustering
Normalization improves results.
Taking the first derivative improves results.
Best results come from combining piecewise normalization (window 15) with the first derivative.

Conclusion
This paper identifies mining techniques specifically useful for stock market data and evaluates them against real data.
Further research is needed to understand the behavior of this class of data; the effect of taking the first derivative is not well understood.
