1
Alex Meng, Chunshi Jin, Elliott Conant, Jonathan Fung
2
◦ What is Over9k?
◦ Architecture
◦ Crawler
◦ Postprocessor
◦ Extractor
◦ Web Service
◦ Summary
3
Original goal: a system that predicts a stock’s future volatility from news and other information gathered from the Internet.
Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.
5
Web crawler: Nutch
Domains we crawl:
◦ www.cnbc.com
◦ www.reuters.com
◦ www.marketwatch.com
◦ … (6 total)
Nutch’s Successes
Nutch’s Failures
6
Components:
◦ NBClassifier: classifies articles using Naive Bayes
◦ DateParser: parses dates using regular expressions
◦ PageGetter: retrieves training data from RSS feeds
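To illustrate the kind of classification NBClassifier performs, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the project's actual code: the class name `NaiveBayes`, the labels `finance` and `other`, and the training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Deliberately simple: lowercase and split on whitespace.
        return text.lower().split()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, invented for this sketch.
clf = NaiveBayes()
clf.train("shares fell after the company cut its earnings forecast", "finance")
clf.train("the bank raised interest rates and stocks dropped", "finance")
clf.train("the team won the final match in extra time", "other")
clf.train("the new film opens in theaters this weekend", "other")

print(clf.classify("company shares dropped after earnings"))
```

Working in log space avoids numeric underflow on long articles, and the smoothing keeps a single unseen word from zeroing out a class.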
7
Tried several systems for information extraction (IE):
◦ GATE
◦ OpenCalais
◦ CRF++
8
OpenCalais:
◦ Web service. Easy to use.
◦ Not extensible. No machine learning process.
◦ Has usage quotas.
GATE:
◦ ANNIE (A Nearly-New Information Extraction system): Tokenizer, Sentence Splitter, POS Tagger, Gazetteer, NE
◦ JAPE: GATE’s rule engine.
◦ Extensible with JAPE. Easy to use thanks to its regex-like syntax. Behavior is almost deterministic.
◦ High precision for defined patterns; low recall when sentences do not match any defined pattern.
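The precision/recall trade-off of rule-based extraction can be sketched with a plain Python regular expression. This is not JAPE syntax and not the project's rules: the acquisition pattern, the `COMPANY` expression, and the company names are hypothetical, chosen only to show how a hand-written pattern behaves.

```python
import re

# Hypothetical approximation of a company name: one or more capitalized words.
COMPANY = r"[A-Z][A-Za-z&.]*(?: [A-Z][A-Za-z&.]*)*"

# Hypothetical rule: "<buyer> acquires|acquired|will acquire|to acquire <target>".
ACQUISITION_RULE = re.compile(
    rf"(?P<buyer>{COMPANY}) "
    rf"(?:acquires|acquired|will acquire|to acquire) "
    rf"(?P<target>{COMPANY})"
)


def extract_acquisitions(text):
    """Return (buyer, target) pairs for every sentence span matching the rule."""
    return [(m.group("buyer"), m.group("target"))
            for m in ACQUISITION_RULE.finditer(text)]


# The sentence shape the rule was written for is matched.
print(extract_acquisitions("Acme Corp acquired Widget Inc for $2 billion."))

# The same event in a phrasing the rule does not cover is missed.
print(extract_acquisitions("Widget Inc was bought by Acme Corp."))
```

The second call returns an empty list: the rule only fires on phrasings its author anticipated, which is where the low recall comes from.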
9
CRF++
◦ Needs tools to preprocess content: HTML to text; POS tagging/NE (Stanford NLP library); extraction of other features when necessary; conversion of files to the train/test format required by CRF++.
◦ A template file defines the dependencies between features and labels.
◦ Needs a large training set.
◦ Labeling the training set is laborious.
◦ Fairly good precision/recall. “Intelligence” may emerge.
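The conversion step can be sketched as follows. CRF++ reads one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The function name `to_crfpp_rows`, the BIO labels, the `ORG` entity type, and the sample sentence are assumptions for illustration, and the template is only an example of the `%x[row,col]` syntax, not the one the project used.

```python
def to_crfpp_rows(sentences):
    """Render tagged sentences as CRF++ column-format text.

    Each sentence is a list of (token, pos_tag, label) tuples.
    """
    lines = []
    for sentence in sentences:
        for token, pos, label in sentence:
            lines.append(f"{token}\t{pos}\t{label}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)


# Example feature template. %x[row,col] selects a column value relative to the
# current token: row is the offset in tokens, col is the column index.
TEMPLATE = """# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram
B
"""

# Hypothetical labeled sentence (token, POS tag, BIO label).
sample = [[("Acme", "NNP", "B-ORG"), ("shares", "NNS", "O"), ("fell", "VBD", "O")]]
print(to_crfpp_rows(sample))
```

Files produced this way would then be passed to the CRF++ command-line tools, for example `crf_learn template train.data model` for training and `crf_test -m model test.data` for tagging.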
10
Technologies used:
◦ YUI Toolkit
◦ PHP
◦ Apache
◦ CSS
◦ JavaScript
Layout description
11
A realistic goal is critical.
The right tools are important.
Communication is key.
Future improvements:
◦ Controlled crawling
◦ Improve feature extraction quality: POS tagger, NE, etc.
◦ Develop a model to predict volatility
12
Q&A Thanks!