Market analysis for the S&P500 Giulio Genovese Tuesday, December 11 2007.

Data collected and used Historical prices have been downloaded from finance.yahoo.com for the 500 most influential stocks in the United States, those in the S&P500 index. Each company is assigned a label corresponding to the sector in which it operates. Yahoo identifies eight sectors plus a ninth, Conglomerates, used for companies that own divisions in different and separate businesses. Historical prices for each stock can be downloaded through the URL http://ichart.finance.yahoo.com/table.csv?s=SYMBOL where SYMBOL is the ticker associated with the company in the stock market. The data contain daily closing prices starting from January 1st, 1962, or from a later date for companies that went public after that date. The following sectors were listed on Yahoo Finance: Basic Materials (53 stocks), Conglomerates (7 stocks), Consumer Goods (62 stocks), Financial (93 stocks), Healthcare (40 stocks), Industrial Goods (40 stocks), Services (93 stocks), Technology (82 stocks), Utilities (30 stocks).
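As a small illustration of the download scheme described above, the sketch below only builds the per-ticker URL from the endpoint quoted in the text; it does not fetch anything. The function name `price_url` and the example tickers are assumptions made for illustration, not part of the original project.

```python
from urllib.parse import urlencode

# Endpoint exactly as quoted in the slide text.
BASE_URL = "http://ichart.finance.yahoo.com/table.csv"


def price_url(symbol: str) -> str:
    """Return the historical-price URL for one ticker symbol (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'s': symbol})}"


# Example tickers chosen only for illustration.
urls = {symbol: price_url(symbol) for symbol in ["IBM", "GE"]}
```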

Goal of the project We want to measure how well stock prices reflect the sector division given by the Yahoo website. To this end we apply the k-means clustering algorithm to the 500 stocks in the S&P500 index and investigate how closely price variations follow the market sectorization.

Data preprocessing As a first step we compute the log of the closing price for every day. Then we compute the daily return, defined as the difference between the current day's log price and the previous day's log price. For every day we subtract the average return across all the stocks, taken as the general market return for that day. With the series of adjusted returns we compute the 500x500 matrix of correlations among stocks, using the daily returns from January 1st, 2000 up to December 7th, 2007. To visualize the result we apply principal component analysis to the correlation matrix and look at how the stocks appear when projected onto the first two eigenvectors of the matrix.
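The preprocessing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the original MATLAB code: it assumes a complete price table with one row per day and one column per stock (no missing days), and the function name and the synthetic prices are invented for the example.

```python
import numpy as np


def market_adjusted_correlation(prices: np.ndarray) -> np.ndarray:
    """Correlation matrix of market-adjusted daily log returns.

    prices: array of shape (n_days, n_stocks) with positive closing prices.
    """
    log_prices = np.log(prices)
    # Daily return: today's log price minus yesterday's log price.
    returns = np.diff(log_prices, axis=0)
    # Market return for each day: average return across all stocks.
    market = returns.mean(axis=1, keepdims=True)
    adjusted = returns - market
    # Columns are stocks, so rowvar=False gives an (n_stocks, n_stocks) matrix.
    return np.corrcoef(adjusted, rowvar=False)


# Synthetic prices for five hypothetical stocks over 250 days.
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 5)), axis=0))
C = market_adjusted_correlation(prices)
```

In practice stocks listed after the start of the window would need extra handling, which this sketch leaves out.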

Geometric interpretation Think of the correlation matrix C as the matrix of scalar products (a kernel) of the stocks regarded as vectors. Using the eigenvector decomposition we get C=VDV', where D is a diagonal matrix of eigenvalues and V is the matrix whose columns are the corresponding eigenvectors. Consider now W=V*sqrt(D). Then C=WW', and therefore C can be seen as the matrix of scalar products among the rows of W. Since the elements on the diagonal of C are all 1's, all the rows of W are unit vectors. Each of these unit vectors will from now on represent one of our stocks.
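The factorization C=WW' and the unit length of the rows of W can be checked numerically. The sketch below uses a small correlation matrix built from random data as a stand-in for the stock correlations; the variable names follow the text, while the data are purely illustrative.

```python
import numpy as np

# Stand-in correlation matrix from random data (6 hypothetical stocks).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
C = np.corrcoef(X, rowvar=False)

# Symmetric eigendecomposition: C = V diag(eigvals) V'.
eigvals, V = np.linalg.eigh(C)
# Guard against tiny negative eigenvalues caused by round-off.
eigvals = np.clip(eigvals, 0.0, None)

# W = V * sqrt(D): broadcasting scales column j of V by sqrt(eigvals[j]).
W = V * np.sqrt(eigvals)

reconstruction_error = np.abs(W @ W.T - C).max()
row_norms = np.linalg.norm(W, axis=1)
```

Each row of `W` has norm 1 because its squared norm is the corresponding diagonal entry of `C`.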

PCA of the correlation matrix As you can see, the first two eigenvectors already make it possible to tell some of the sectors apart. A better separation requires more eigenvectors, and such a plot cannot be visualized on screen.

Eigenvalues of the matrix The eigenvalues after the second are not significantly smaller than the first two. This is a sign that projecting onto only the first two eigenvectors loses many of the metric properties carried by the correlation matrix.
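One way to quantify how much a two-dimensional projection keeps is the share of each eigenvalue in the total. Because a correlation matrix has ones on its diagonal, its eigenvalues sum to the number of stocks. The sketch below is illustrative only: the helper name `explained_share` and the random data are assumptions, not part of the original analysis.

```python
import numpy as np


def explained_share(C: np.ndarray) -> np.ndarray:
    """Eigenvalues of a symmetric matrix in descending order, as fractions of their sum."""
    eigvals = np.linalg.eigvalsh(C)[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()


# Stand-in correlation matrix from random data (8 hypothetical stocks).
rng = np.random.default_rng(1)
C = np.corrcoef(rng.normal(size=(300, 8)), rowvar=False)

share = explained_share(C)
top_two = share[:2].sum()  # fraction retained by the first two eigenvectors
```

A slowly decaying `share` means that `top_two` is small and a two-dimensional plot discards most of the structure.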

Using k-means to cluster If we exclude the 7 stocks in the Conglomerates sector, that is, those stocks that do not belong to any specific sector, then we can try to cluster the remaining stocks using k-means with the number of groups set to 8. To make the process supervised we set the "seed" centroids for k-means to the centroids of the 8 groups of stocks indicated by Yahoo.
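The seeding idea can be sketched as follows. This is a minimal version under stated assumptions: plain Lloyd iterations with squared Euclidean distance, integer sector labels 0..k-1 with at least one stock per label, and toy two-dimensional data. The original MATLAB implementation may differ in any of these details, and the function name `seeded_kmeans` is hypothetical.

```python
import numpy as np


def seeded_kmeans(X: np.ndarray, labels: np.ndarray, n_iter: int = 100):
    """k-means whose initial centroids are the means of the labelled groups.

    X: (n_points, n_dims) vectors; labels: integers in 0..k-1, each used at least once.
    """
    k = int(labels.max()) + 1
    # Seed centroids: centroid of each labelled group.
    centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    assign = labels.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign, centroids


# Toy data: two tight groups, with two points given the "wrong" label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
true_groups = np.array([0] * 20 + [1] * 20)
sector = true_groups.copy()
sector[0], sector[25] = 1, 0
assign, centroids = seeded_kmeans(X, sector)
```

Seeding removes the dependence on random initialization, so the resulting clusters can be compared with the sectors label by label.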

k-means results Setting the "seed" centroids has been an essential step, since k-means can produce a plethora of different outputs depending on its initialization. The cluster classification looks very similar to the sector classification. Can we quantify that?

Accuracy of classification The picture shows how the stocks were classified by the k-means method compared with how they are classified by sector. Overall, 334 out of 493 stocks have been classified correctly, that is, the accuracy of the classification is 67.75%.
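The accuracy figure is simply the fraction of stocks whose cluster matches their sector. A minimal sketch, with a hypothetical helper `accuracy` and the two counts taken from the text:

```python
def accuracy(predicted, actual) -> float:
    """Fraction of items whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Counts reported in the text: 334 correct out of 493 stocks.
reported = 334 / 493
print(f"{reported:.2%}")  # prints 67.75%
```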

Conclusions At first sight a classification accuracy of 67.75% might seem low. However, we have to remember that there were 8 clusters, and therefore a uniformly random classification would have yielded an accuracy of only 12.5%. Mahalanobis distance cannot be applied, since the space in which the stock vectors are embedded has dimension greater than the size of the clusters, so the within-cluster covariance matrices are singular and cannot be inverted. Figures were generated using MATLAB. Source code and data are freely available at the following address: http://mlab01.dartmouth.edu/finance/
