1
Fintan: The Amazing Fish of Knowledge…
…filtering out the blogosphere so you don’t have to!
2
Overview
Description
Demo
Pipeline
Problems
Future work
Questions
3
What is Fintan?
Provides a news-aggregating service, similar to Digg and Reddit, based on blog entries.
Presents topic-based clusters of entries.
Algorithmically ranks clusters based on the ranks of their entries and on votes.
4
1: Retrieving data
Spinn3r crawls >10M blogs on the web
Offers their data free for academic use
Use their API to collect blog entries
Marshal data into Hadoop formats
Contributed code back to Spinn3r
5
2: Syntax Tree Clustering
O(n) nodes to suffixes
O(n^2) operations to corpus data
Pipeline
Several tactics used:
  Get rid of useless nodes
  Eliminate stop words from prefixes
  Break trees apart by prefix and distribute
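The stop-word and split-by-prefix tactics above can be illustrated with a small sketch. This is not the Fintan implementation: it replaces the tree with a flat phrase index, and the stop-word list, the `max_len` cap, and the function names are all assumptions made for illustration. It reads "eliminate stop words from prefixes" as "skip phrases that begin with a stop word", which is only one possible interpretation.

```python
from collections import defaultdict

# Illustrative stop-word list (an assumption, not taken from the slides).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}


def phrases(tokens, max_len=3):
    """Yield phrases of up to max_len words, skipping stop-word prefixes."""
    for i in range(len(tokens)):
        if tokens[i] in STOP_WORDS:
            continue  # a phrase starting with a stop word carries little topic signal
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            yield tuple(tokens[i:j])


def cluster_by_phrase(entries, min_docs=2, max_len=3):
    """Map each shared phrase to the set of entry ids that contain it."""
    phrase_to_docs = defaultdict(set)
    for doc_id, text in entries.items():
        for phrase in phrases(text.lower().split(), max_len):
            phrase_to_docs[phrase].add(doc_id)
    # Phrases seen in fewer than min_docs entries are dropped as useless nodes.
    return {p: d for p, d in phrase_to_docs.items() if len(d) >= min_docs}


def partition_by_prefix(clusters):
    """Group phrases by first word, so each group could go to a separate worker."""
    parts = defaultdict(dict)
    for phrase, docs in clusters.items():
        parts[phrase[0]][phrase] = docs
    return dict(parts)
```

Grouping by first word is what would let each partition be processed independently, for example as the key of a MapReduce job.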
6
3: To ranked SQL
Bridges the clustering and user interface
Determines algorithmic ranking
Original idea: PageRank with voting
Clusters scored based on entries
Entries ranked by reputation and date
MapReduce job to convert to SQL statements
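A minimal sketch of how such a ranking step might look, under stated assumptions. The slides do not give the scoring formula, so the exponential date decay, the `half_life_days` parameter, the logarithmic vote bonus, and the `clusters` table layout are all hypothetical choices made for illustration.

```python
import math
import sqlite3
from datetime import datetime, timezone


def entry_score(reputation, published, now, half_life_days=2.0):
    """Blog reputation, halved every half_life_days of entry age (assumed formula)."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return reputation * 0.5 ** (age_days / half_life_days)


def cluster_score(entry_scores, votes=0):
    """Sum of the entry scores plus a damped bonus for user votes (assumed formula)."""
    return sum(entry_scores) + math.log1p(votes)


def to_sql(cluster_id, score):
    """Return a parameterized INSERT for a hypothetical `clusters` table."""
    return ("INSERT INTO clusters (id, score) VALUES (?, ?)", (cluster_id, score))


if __name__ == "__main__":
    now = datetime(2009, 1, 3, tzinfo=timezone.utc)
    published = datetime(2009, 1, 1, tzinfo=timezone.utc)
    scores = [entry_score(4.0, published, now), entry_score(1.0, now, now)]

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, score REAL)")
    db.execute(*to_sql(1, cluster_score(scores, votes=0)))
    print(db.execute("SELECT id, score FROM clusters").fetchall())
```

Returning the statement together with its parameters, rather than a formatted string, keeps entry text from being interpolated into SQL.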
7
4: User Interface
Aim to keep it simple & intuitive
Written in RoR
Tracking user actions:
  User votes
  User comments
  Clickthroughs
  Cluster views
Future: Personalization
8
Problems
Quality of clusters
Runtime of clusters
Classification
Ranking
9
Future Work
Real-time updates
Personalization
Faster clustering
Blog reputation system
10
Questions?