Fintan: The Amazing Fish of Knowledge
…filtering out the blogosphere so you don’t have to!
Overview
  Description
  Demo
  Pipeline
  Problems
  Future work
  Questions
What is Fintan?
  Provides a news-aggregation service, similar to Digg and Reddit, built on blog entries.
  Presents topic-based clusters of entries.
  Algorithmically ranks clusters based on the ranks of their entries and on votes.
1: Retrieving data
  Spinn3r crawls >10M blogs on the web
  Offers its data free for academic use
  Use the Spinn3r API to collect blog entries
  Marshal data into Hadoop formats
  Contributed code back to Spinn3r
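The slide does not say which Hadoop formats were used, so the following is only a minimal sketch of the marshalling step under one assumption: each entry becomes a single "key, tab, JSON value" text line, a layout that line-oriented Hadoop jobs can read. The field names (permalink, title, body) and both function names are hypothetical and are not taken from the Spinn3r API.

```python
import json

def to_record(entry):
    """Serialize one blog entry as a single 'key<TAB>json' line.

    The field names used here (permalink, title, body) are assumptions
    made for illustration only.
    """
    key = entry["permalink"]
    if "\t" in key or "\n" in key:
        raise ValueError("key must not contain tabs or newlines")
    # json.dumps escapes embedded tabs and newlines, so the record
    # always stays on one line with exactly one separator tab.
    value = json.dumps(
        {"title": entry["title"], "body": entry["body"]}, sort_keys=True
    )
    return key + "\t" + value

def from_record(line):
    """Inverse of to_record: rebuild the entry dict from one line."""
    key, value = line.rstrip("\n").split("\t", 1)
    entry = json.loads(value)
    entry["permalink"] = key
    return entry
```

Keeping the permalink as the key means a later job can group or deduplicate entries by URL without parsing the JSON payload.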
2: Syntax Tree Clustering
  O(n) nodes to suffixes
  O(n^2) operations to corpus data
  Several tactics used:
    Get rid of useless nodes
    Eliminate stop words from prefixes
    Break trees apart by prefix and distribute
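Two of the tactics above, dropping stop words from prefixes and splitting the work by prefix, can be illustrated with a small sketch. This is one possible reading of the slide, not the project's actual code: "prefix" is taken to mean the first word of a suffix, the stop-word list is illustrative, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

# Illustrative stop-word list; the real list is not given in the slides.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def suffixes(tokens):
    """Yield every suffix of a token list; n tokens give n suffixes."""
    for i in range(len(tokens)):
        yield tokens[i:]

def useful_suffixes(tokens):
    """Skip suffixes whose first word is a stop word."""
    for s in suffixes(tokens):
        if s[0] not in STOP_WORDS:
            yield s

def partition_by_prefix(docs, num_workers):
    """Group suffixes by first word and assign each group to one worker.

    crc32 is used instead of the built-in hash() because string hashes
    are randomized per Python process, which would make the assignment
    unstable across machines.
    """
    shards = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for s in useful_suffixes(tokens):
            prefix = s[0]
            worker = zlib.crc32(prefix.encode("utf-8")) % num_workers
            shards[worker][prefix].append((doc_id, s))
    return shards
```

Because all suffixes sharing a first word land on the same worker, each worker can build its part of the tree independently of the others.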
3: To ranked SQL
  Bridges the clustering and the user interface
  Determines algorithmic ranking
    Original idea: PageRank with voting
    Clusters scored based on entries
    Entries ranked by reputation and date
  MapReduce job to convert to SQL statements
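The ranking formula itself is not given in the slides, so the sketch below only illustrates the shape of this step under stated assumptions: an entry's score is its blog reputation damped by its age in days, a cluster's score is the sum of its entry scores plus its vote count, and the result is emitted as an SQL INSERT statement. The formula, the table name clusters, its columns, and all function names are hypothetical.

```python
from datetime import date

def entry_score(reputation, published, today):
    """Assumed scoring rule: reputation damped by age in days."""
    age_days = max((today - published).days, 0)
    return reputation / (1.0 + age_days)

def cluster_score(entries, votes, today):
    """Assumed rule: sum of entry scores plus the number of user votes."""
    return votes + sum(
        entry_score(e["reputation"], e["published"], today) for e in entries
    )

def sql_quote(text):
    """Quote a string literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

def to_sql(cluster_id, label, score):
    """Render one cluster as an INSERT into a hypothetical clusters table."""
    return "INSERT INTO clusters (id, label, score) VALUES ({:d}, {}, {:.4f});".format(
        cluster_id, sql_quote(label), score
    )
```

In a MapReduce setting, a reducer keyed by cluster id could call cluster_score on the entries it receives and write the output of to_sql as its result line. Where a live database connection is available, parameterized queries are a safer choice than building statements as strings.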
4: User Interface
  Aim to keep it simple and intuitive
  Written in Ruby on Rails
  Tracking user actions:
    User votes
    User comments
    Clickthroughs
    Cluster views
  Future: personalization
Problems
  Quality of clusters
  Runtime of clusters
  Classification
  Ranking
Future Work
  Real-time updates
  Personalization
  Faster clustering
  Blog reputation system
Questions?