Alex Meng, Chunshi Jin, Elliott Conant, Jonathan Fung
◦ What is Over9k?
◦ Architecture
◦ Crawler
◦ Postprocessor
◦ Extractor
◦ Web Service
◦ Summary
Original goal: a system to predict a stock's future volatility from news and other information gathered from the Internet.
Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.
Web crawler: Nutch
Domains we crawl:
◦ … (6 total)
Nutch's Successes
Nutch's Failures
Components:
◦ NBClassifier: classifies articles using Naive Bayes
◦ DateParser: parses dates using regular expressions
◦ PageGetter: retrieves training data from RSS feeds
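To make the NBClassifier idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the project's NBClassifier: the class name, the whitespace tokenizer, and the labels "financial" and "other" are assumptions chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical stand-in for the slide's NBClassifier, not its actual code.
    """

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrence count
        self.vocabulary = set()                  # every word seen during training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase whitespace tokenization is good enough for a sketch.
        return text.lower().split()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = self.tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayesSketch()
classifier.train("shares fell after earnings report", "financial")
classifier.train("team wins championship game", "other")
print(classifier.classify("earnings report shares"))
```

Working in log space avoids numeric underflow on long articles, and the add-one smoothing keeps a single unseen word from zeroing out a class.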
Tried several systems for information extraction (IE):
◦ GATE
◦ OpenCalais
◦ CRF++
OpenCalais:
◦ Web service. Easy to use.
◦ Not extensible. No machine learning process.
◦ Has usage quotas.
GATE:
◦ ANNIE (A Nearly-New Information Extraction system): tokenizer, sentence splitter, POS tagger, gazetteer, NE (named-entity recognition)
◦ JAPE: GATE's rule engine.
◦ Extensible with JAPE, which is easy to use because of its regex-like syntax. Behavior is almost deterministic.
◦ High precision for defined patterns; low recall when sentences do not match any defined pattern.
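The precision/recall trade-off of hand-written rules can be illustrated with a small sketch. This is not JAPE, which matches over annotations rather than raw characters. It is a Python regular-expression analogue, and the acquisition rule, the `NAME` pattern, and the function name are all hypothetical.

```python
import re

# Hypothetical company-name pattern: one or more capitalized words.
NAME = r"[A-Z][A-Za-z0-9&]*(?:\s[A-Z][A-Za-z0-9&]*)*"

# Hypothetical rule: "<Acquirer> acquired|acquires|bought|buys <Target>".
ACQUISITION_RULE = re.compile(
    rf"(?P<acquirer>{NAME})\s(?:acquired|acquires|bought|buys)\s(?P<target>{NAME})"
)


def extract_acquisitions(sentence):
    """Return (acquirer, target) pairs for every match of the rule."""
    return [
        (match.group("acquirer"), match.group("target"))
        for match in ACQUISITION_RULE.finditer(sentence)
    ]


# The active-voice sentence fits the rule and is extracted.
print(extract_acquisitions("Google acquired YouTube in 2006."))
# The passive-voice sentence describes the same event but fits no rule,
# so nothing is extracted.
print(extract_acquisitions("YouTube was bought by Google."))
```

A sentence that fits the rule is extracted reliably, while a paraphrase of the same event is missed until someone writes another rule for it.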
CRF++:
◦ Needs tools to preprocess content: HTML to text; POS tagging / NE (Stanford NLP library); extracting other features when necessary; converting files to the train/test format CRF++ requires.
◦ A template file defines the dependencies between features and labels.
◦ Needs a large training set.
◦ Labeling the training set is laborious.
◦ Fairly good precision/recall. "Intelligence" may emerge.
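The conversion step can be sketched as follows. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and an empty line between sentences. The function name, the choice of a POS tag as the only extra feature, and the BIO-style labels are assumptions for illustration, not the project's actual feature set. The template string shows the general shape of a CRF++ template, where `%x[row,col]` refers to a column of a token at a relative position.

```python
def to_crfpp_format(sentences):
    """Render sentences as CRF++ training text.

    sentences: list of sentences, each a list of (token, pos_tag, label) tuples.
    Columns are tab-separated and sentences are separated by an empty line.
    """
    blocks = []
    for sentence in sentences:
        lines = ["\t".join(columns) for columns in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Illustrative template: unigram features on the previous, current, and next
# token (column 0) and the current POS tag (column 1), plus a bigram feature
# over adjacent output labels.
EXAMPLE_TEMPLATE = """\
# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]

# Bigram
B
"""

# Hypothetical labeled data with BIO-style labels.
training_sentences = [
    [("Google", "NNP", "B-ORG"), ("rose", "VBD", "O")],
    [("Apple", "NNP", "B-ORG"), ("fell", "VBD", "O")],
]
print(to_crfpp_format(training_sentences))
```

Test data uses the same column layout, so the same function can produce both files as long as every line has the same number of columns.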
Technologies used:
◦ YUI Toolkit
◦ PHP
◦ Apache
◦ CSS
◦ JavaScript
Layout description
A realistic goal is critical.
The right tools are important.
Communication is key.
Future improvements:
◦ Controlled crawling
◦ Improve feature extraction quality: POS tagger, NE, etc.
◦ Develop a model to predict volatility
Q&A Thanks!