Fintan The Amazing Fish of Knowledge…


Fintan: The Amazing Fish of Knowledge… filtering out the blogosphere so you don’t have to!

Overview: Description, Demo, Pipeline, Problems, Future work, Questions.

What is Fintan? Provides a news aggregating service similar to Digg and Reddit based on blog entries. Presents topic-based clusters of entries. Algorithmically ranks clusters based on ranks of the entries and votes.

1: Retrieving data. Spinn3r crawls >10M blogs on the web. Offers its data free for academic use. Use its API to collect blog entries. Marshal data into Hadoop formats. Contributed code back to Spinn3r.
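As a rough illustration of the marshalling step, the sketch below turns one blog entry into a tab-separated key/value line, the text layout that Hadoop Streaming reads by default. The slides do not say which Hadoop formats Fintan used, and the entry fields (`url`, `title`, `body`, `published`) are hypothetical, not Spinn3r's actual schema.

```python
import json


def to_hadoop_line(entry):
    """Serialize one blog entry as 'key<TAB>value' text.

    The key is the entry URL and the value is the remaining fields as
    JSON. Field names are assumptions for illustration only.
    """
    key = entry["url"]
    value = {k: v for k, v in entry.items() if k != "url"}
    # json.dumps escapes tabs and newlines inside strings, so the line
    # contains exactly one raw tab: the key/value separator.
    return key + "\t" + json.dumps(value, sort_keys=True)


def from_hadoop_line(line):
    """Inverse of to_hadoop_line, e.g. for use inside a mapper."""
    key, value = line.rstrip("\n").split("\t", 1)
    entry = json.loads(value)
    entry["url"] = key
    return entry
```

Keeping each record on a single line means a text-based job can split the input anywhere on a line boundary without cutting an entry in half.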

2: Syntax Tree Clustering. O(n) nodes to suffixes. O(n^2) operations to corpus data. Pipeline. Several tactics used: get rid of useless nodes; eliminate stop words from prefixes; break trees apart by prefix and distribute.
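The last two tactics can be sketched as follows. This is a minimal illustration, not Fintan's code: it assumes word-level suffixes of each entry's text, uses a tiny hypothetical stop word list, and groups suffixes by their first word so that each group could be handed to a separate worker (for example, one reducer per prefix).

```python
from collections import defaultdict

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}


def word_suffixes(doc_id, text):
    """Yield (first_word, (doc_id, suffix_words)) for each word-level suffix.

    Suffixes that start with a stop word are skipped, so no subtree is
    ever rooted at a stop word.
    """
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        yield word, (doc_id, tuple(words[i:]))


def partition_by_prefix(docs):
    """Group suffixes by first word so each group can be processed separately."""
    partitions = defaultdict(list)
    for doc_id, text in docs.items():
        for first_word, record in word_suffixes(doc_id, text):
            partitions[first_word].append(record)
    return dict(partitions)


def shared_prefixes(partitions, min_docs=2):
    """Return first words whose partition spans at least min_docs documents."""
    return sorted(
        word for word, records in partitions.items()
        if len({doc_id for doc_id, _ in records}) >= min_docs
    )
```

Partitions that span several documents are the candidates for clusters; partitions seen in only one document can be dropped early, which is one plausible reading of "get rid of useless nodes".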

3: To ranked SQL. Bridges the clustering and user interface. Determines algorithmic ranking. Original idea: PageRank with voting. Clusters scored based on entries. Entries ranked by reputation and date. MapReduce job to convert to SQL statements.
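One way this stage could look is sketched below. The slides give no formula, so everything here is an assumption: an entry's score is its blog reputation halved every `HALF_LIFE_DAYS`, a cluster's score is the sum of its entry scores plus its votes, and the `clusters` table with `id`, `label`, and `score` columns is hypothetical.

```python
import math

HALF_LIFE_DAYS = 2.0  # assumed decay rate, not taken from the slides


def entry_score(reputation, age_days):
    """Reputation discounted by age: the score halves every HALF_LIFE_DAYS."""
    return reputation * math.pow(0.5, age_days / HALF_LIFE_DAYS)


def cluster_score(entries, votes):
    """Sum of entry scores plus user votes (an assumed combination)."""
    return sum(entry_score(rep, age) for rep, age in entries) + votes


def sql_quote(text):
    """Quote a string literal by doubling single quotes (standard SQL)."""
    return "'" + text.replace("'", "''") + "'"


def to_sql(cluster_id, label, entries, votes):
    """Map step: turn one scored cluster into an INSERT statement."""
    score = cluster_score(entries, votes)
    return (
        "INSERT INTO clusters (id, label, score) VALUES "
        f"({int(cluster_id)}, {sql_quote(label)}, {score:.4f});"
    )
```

Because each cluster is converted independently, the conversion fits a map-only job. Quote doubling covers standard SQL string literals only; where a live database connection is available, parameterized queries are the safer choice.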

4: User Interface. Aim to keep it simple & intuitive. Written in Ruby on Rails (RoR). Tracking user actions: user votes, user comments, clickthroughs, cluster views. Future: personalization.

Problems: Quality of clusters. Runtime of clusters. Classification. Ranking.

Future Work: Real-time updates. Personalization. Faster clustering. Blog reputation system.

Questions?