Analysis of Complex Systems
John Sherwood
Period 2
Abstract
My project uses data mining techniques on the internet to gather enough information for a genetic algorithm to perform trend analysis of a complex system, e.g. the stock market.
Scope
The most fundamental element of my program is establishing a correlation between news about a company and its stock, and the price of the stock itself. To do this, a huge amount of data on both stock prices and company news must be converted into a quantitative format and then extensively analyzed.
Expected Results
At the very least, I expect to produce a useful genetic algorithm that, given a list of independent and dependent data, can generate equations that establish a tentative correlation. While the extremely chaotic nature of this specific application may prevent quantitative success in this instance, I do expect success in general terms.
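
To make the idea of "generating equations from independent and dependent data" concrete, here is a minimal sketch of a genetic algorithm. It is a simplification, not the project's actual algorithm: it assumes the equation has the fixed form y = a*x + b and evolves only the two coefficients. The function names, population size, mutation width, and sample data are all illustrative assumptions.

```python
import random

def mean_squared_error(coeffs, xs, ys):
    """Fitness measure: average squared gap between a*x + b and the observed y."""
    a, b = coeffs
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def evolve(xs, ys, generations=200, pop_size=30, seed=0):
    """Evolve (a, b) pairs toward a low-error fit. All parameters are assumptions."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    population = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population unchanged (elitism).
        population.sort(key=lambda c: mean_squared_error(c, xs, ys))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            # Crossover: slope from one parent, intercept from the other.
            # Mutation: small Gaussian noise on each coefficient.
            children.append((p1[0] + rng.gauss(0, 0.1), p2[1] + rng.gauss(0, 0.1)))
        population = survivors + children
    return min(population, key=lambda c: mean_squared_error(c, xs, ys))

# Made-up sample data following y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
best = evolve(xs, ys)
print("best coefficients:", best, "error:", mean_squared_error(best, xs, ys))
```

Because the better half of each generation is carried over unchanged, the best error never gets worse from one generation to the next. A fuller version would evolve the form of the equation as well as its coefficients.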
Others' Work
Due to the very lucrative nature of a program that could predict the stock market:
- Many have tried
- All have failed
Procedures
Differs for each part of the program (Data Mining, Analysis):
- Determination of data parsing sequences to extract information from HTML (Hypertext Markup Language)
- Quantitative tests of success
- XML parser
- Data classification
- Trial and error tests
- Evaluation algorithms
- Discriminant generation
Design
Several program segments:
- Data mining algorithms
  - Price data logger
  - News parser
- Data analysis algorithms
  - Heuristic Generator
  - Equation Regression
    - Genetic
- Command Shell: unifies elements of the program
- Graph Generator: written in PHP to build PNG graphs of data
Program Tests
- XML parser tests prove successful in parsing properly formatted XML/XHTML, and sufficiently successful in parsing malformed XHTML
- Stock price logging is working perfectly
- Generalized equations work but are semi-timeframe specific
Algorithms
Different program segments use different algorithms:
- Data mining algorithm
- Discriminant (Heuristic) Generation algorithm
- XML parsing algorithm
- Equation Refinement algorithm
Data Mining
- Based on the XML parser, which converts XHTML code to programming objects
- New algorithm allows for parsing of Google Finance pages, following links to other sites for more data
- Different algorithms are required to parse different websites for information
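
The link-following step can be sketched as below. This is not the project's own parser: it substitutes Python's standard `html.parser` module, which tolerates unclosed tags, and it only collects the links a crawler would visit next without fetching anything. The `LinkCollector` class, the `extract_links` function, and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones so they can be followed.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html_text, base_url):
    """Return all outgoing links found in html_text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links

# Hypothetical page fragment with an unclosed <p> and an anchor lacking href.
page = '<p><a href="/news/1">Story</a> <a href="http://example.com/x">Other</a><a>none</a>'
print(extract_links(page, "http://example.org/finance"))
```

A real miner would still need site-specific rules to decide which of the collected links point at news content.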
XML Parser
Two potential paradigms:
- Iterative: uses a set of flags to determine what action to take with each character
- Recursive: splits the XML document into sets of tags and processes each tag's child elements
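
The recursive paradigm can be illustrated with a minimal sketch. It assumes a very small XML subset (no attributes, comments, or self-closing tags) and is not the project's parser; the `parse_element` function, the dictionary layout of a node, and the sample document are illustrative assumptions.

```python
def parse_element(text, pos=0):
    """Parse the element starting at text[pos]; return (node, position after it)."""
    if text[pos] != "<":
        raise ValueError(f"expected '<' at position {pos}")
    close = text.index(">", pos)  # raises ValueError if the tag is never closed
    tag = text[pos + 1 : close].strip()
    node = {"tag": tag, "text": "", "children": []}
    pos = close + 1
    end_marker = f"</{tag}>"
    while True:
        if text.startswith(end_marker, pos):
            # Found this element's own closing tag: the element is complete.
            return node, pos + len(end_marker)
        if pos >= len(text):
            raise ValueError(f"missing closing tag for <{tag}>")
        if text[pos] == "<":
            # A nested tag: recurse to process the child element.
            child, pos = parse_element(text, pos)
            node["children"].append(child)
        else:
            # Plain text up to the next tag.
            next_tag = text.find("<", pos)
            if next_tag == -1:
                raise ValueError(f"missing closing tag for <{tag}>")
            node["text"] += text[pos:next_tag].strip()
            pos = next_tag

# Hypothetical document; "ABC" is a made-up ticker symbol.
doc = "<quote><symbol>ABC</symbol><price>12.5</price></quote>"
root, end = parse_element(doc)
print(root["tag"], [child["tag"] for child in root["children"]])
```

This strict version rejects mismatched tags with a `ValueError`, so a parser meant for malformed XHTML would need recovery rules on top of it.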
Problems
- Malformed XHTML forces extensive testing of the data mining for each new site
- The huge variety in formatting across websites makes it difficult to ensure that only news data is mined
Results and Conclusions
The equations generated by my equation refiner are accurate within the timeframe they are generated for, and become less accurate as the distance in time from that timeframe grows, whether into the past or the future. This implies that the effect of news information on stock prices is not constant but depends on the timeframe being considered.
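
One way to check for this kind of timeframe dependence is to fit an equation on one window of time and measure its error on later windows. The sketch below uses purely synthetic data, constructed so that the true relationship drifts over time. It is not the project's data or its equation refiner, and the function names and the simple slope-only model are assumptions.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the model y = a*x (line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def window_error(a, xs, ys):
    """Mean squared error of y = a*x on one window of data."""
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic series: the true slope grows with time t, so no single slope fits all t.
times = range(30)
xs = [1 + (t % 3) for t in times]
ys = [(1 + 0.1 * t) * x for t, x in zip(times, xs)]

# Fit only on the first window (t = 0..9).
a = fit_slope(xs[0:10], ys[0:10])

# Evaluate on the fitting window and on two later windows.
errors = [window_error(a, xs[s:s + 10], ys[s:s + 10]) for s in (0, 10, 20)]
for start, err in zip((0, 10, 20), errors):
    print(f"window t={start}..{start + 9}: error {err:.3f}")
```

With this synthetic drift, the error is smallest on the fitting window and grows for each later window, which matches the pattern described above.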