Publish-Subscribe Approach to Social Annotation of News

Presentation transcript:

Top-k Publish-Subscribe for Social Annotation of News
Alex Shraer
Joint work with: Maxim Gurevich (RelateIQ), Marcus Fontoura, Vanja Josifovski (Google)
Work done while authors were at Yahoo! Research

News & Social Updates

News Annotation
Goal: Annotate each story with the k most related tweets
Challenges:
– Automatic matching, based on the content of story & tweet
– Real time: continuously update annotations
– Serving latency: avoid delay in serving the news page
– High scale: billions of page views per day, hundreds of millions of tweets per day, tens of thousands of stories per day

Real-time Index Approach
Maintain a tweet index in real time
For every page view on the media site, query this index with the content of the story as the query
Problems:
– Long queries, serving time affected
– The index is queried and updated very frequently
– Caching techniques almost unusable
Not scalable!
[Diagram: each new tweet (hundreds of millions per day) updates the Tweet Index; each page view (billions per day) sends the story to the Tweet Index and gets back the top-k tweets]

Our solution: Top-K Publish-Subscribe
Treat stories as subscriptions, tweets as published items
A new item triggers a subscription only if it is among the top-k matching items published so far
[Diagram: a new story updates the Story Index; a new tweet queries the Story Index and updates the story to top-k tweets map; a page view looks up its story in the map and gets the top-k tweets]
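The "triggers only if it is among the top-k" rule can be sketched as a bounded min-heap per story. This is a minimal illustration, not the authors' implementation: the class name `TopKAnnotations` and its methods are hypothetical, and scores are assumed to be precomputed floats.

```python
import heapq
from collections import defaultdict


class TopKAnnotations:
    """Hypothetical story -> top-k tweets map, one bounded min-heap per story."""

    def __init__(self, k):
        self.k = k
        # story_id -> list of (score, tweet_id), maintained as a min-heap
        self._heaps = defaultdict(list)

    def offer(self, story_id, tweet_id, score):
        """Try to add a tweet; return True only if it enters the story's top-k."""
        heap = self._heaps[story_id]
        if len(heap) < self.k:
            heapq.heappush(heap, (score, tweet_id))
            return True
        if score > heap[0][0]:  # heap[0] is the worst tweet currently annotating
            heapq.heapreplace(heap, (score, tweet_id))
            return True
        return False

    def top_k(self, story_id):
        """Serving path: a plain lookup, best tweet first."""
        return [tweet for _, tweet in sorted(self._heaps.get(story_id, []), reverse=True)]


annotations = TopKAnnotations(k=2)
annotations.offer("story-1", "tweet-a", 0.4)
annotations.offer("story-1", "tweet-b", 0.9)
annotations.offer("story-1", "tweet-c", 0.1)  # rejected: worse than both current tweets
annotations.offer("story-1", "tweet-d", 0.6)  # accepted: evicts tweet-a
print(annotations.top_k("story-1"))
```

All matching work happens when a tweet arrives; a page view only reads the map, which is why serving stays off the expensive path.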

Real-Time Indexing vs. Top-k Pub-Sub

               Real-time indexing       Publish-Subscribe                    Gain
Computation    1B × 50ms = 50B ms       100M × 10ms + 1B × 1ms = 2B ms       ×25
Serving time   50ms                     1ms                                  ×50
#cores         ~600                     24                                   ×25

Real-time indexing: 1B pageviews/day => ~600 pageviews/50ms
Publish-Subscribe: 100M tweets/day => ~12 tweets/10ms (Story Index); 1B pageviews/day => ~12 pageviews/1ms (Top-k map); #cores = 24
[Diagram labels: 10K, 100M, 1B pageviews; 50ms, 10ms, 1ms; Story Index; Story to top-k tweets map]
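The figures on this slide can be re-derived with a few lines of arithmetic. The sketch below assumes the per-operation costs quoted on the slide (50ms per index query, 10ms per tweet, 1ms per map lookup) and load spread evenly over the day; the helper `busy_cores` is a hypothetical name.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def busy_cores(events_per_day, ms_per_event):
    """Average number of cores kept busy by an evenly spread event stream."""
    return events_per_day * ms_per_event / MS_PER_DAY


pageviews = 1_000_000_000  # per day
tweets = 100_000_000       # per day

# Total computation per day, in milliseconds
indexing_work_ms = pageviews * 50
pubsub_work_ms = tweets * 10 + pageviews * 1

# Cores needed to keep up with the load
indexing_cores = busy_cores(pageviews, 50)
pubsub_cores = busy_cores(tweets, 10) + busy_cores(pageviews, 1)

print(indexing_work_ms // pubsub_work_ms)  # computation ratio
print(round(indexing_cores), round(pubsub_cores))
```

The exact values come out slightly below the slide's numbers (about 579 and 23 cores), because the slide rounds to ~600 pageviews per 50ms and to 12 + 12 = 24.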

Standard IR Index and Algorithms
Posting list for term t: a list of partial scores, one for each document containing the term t
For a query q, here with terms t1, t3, t4:
– Go over the posting lists for t1, t3, t4
– Collect partial scores; when done we have fully scored documents w.r.t. the query q
– Return the k documents with maximal score
[Figure: posting lists for terms t1 to t5, each listing the documents (s1, s3, s4, ...) that contain the term]
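The retrieval loop described on this slide is essentially term-at-a-time (TAAT) scoring. A minimal sketch, assuming the index is a plain dict from term to a list of (document id, partial score) pairs; the partial scores below are made up for illustration.

```python
import heapq
from collections import defaultdict


def taat_top_k(index, query_terms, k):
    """Term-at-a-time scoring: sum partial scores per document, return the k best."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, partial in index.get(term, []):
            scores[doc_id] += partial
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Illustrative posting lists (values are not from the slides)
index = {
    "t1": [("s1", 0.5), ("s3", 0.2), ("s4", 0.1)],
    "t3": [("s3", 0.4), ("s8", 0.3)],
    "t4": [("s4", 0.6), ("s7", 0.2)],
}

top2 = taat_top_k(index, ["t1", "t3", "t4"], k=2)
print([doc_id for doc_id, _ in top2])
```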

Story Index and Top-k Pub-Sub Algorithms
Posting list for term t: a list of partial scores, one for each story containing the term t
For a tweet, here with terms t1, t3, t4:
– Go over the posting lists for t1, t3, t4
– Collect partial scores; when done we have fully scored stories w.r.t. the tweet
– For every story s with score(s, tweet) > 0, attempt to insert the tweet into the annotation set of s
– Compare score(s, tweet) to the scores of the k tweets currently annotating s
[Figure: posting lists for terms t1 to t5, each listing the stories (s1, s3, s4, ...) that contain the term]
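In the pub-sub direction the roles flip: the tweet acts as the query and the result is a set of stories to update rather than one top-k list. A sketch without skipping, assuming `worst_score` maps each story to the score of its worst current annotation, with stories holding fewer than k tweets simply absent; the function name and data are hypothetical.

```python
from collections import defaultdict


def stories_to_update(story_index, tweet_terms, worst_score):
    """Score every story against the tweet; keep those whose top-k the tweet would enter."""
    scores = defaultdict(float)
    for term in tweet_terms:
        for story_id, partial in story_index.get(term, []):
            scores[story_id] += partial
    return {
        story_id: score
        for story_id, score in scores.items()
        if score > 0 and score > worst_score.get(story_id, 0.0)
    }


# Illustrative posting lists and thresholds (values are not from the slides)
story_index = {
    "t1": [("s1", 0.5), ("s3", 0.2), ("s4", 0.1)],
    "t3": [("s3", 0.4), ("s8", 0.3)],
    "t4": [("s4", 0.6), ("s7", 0.2)],
}
worst_score = {"s1": 0.7, "s3": 0.3, "s4": 0.9}

print(sorted(stories_to_update(story_index, ["t1", "t3", "t4"], worst_score)))
```

Here s1 and s4 are fully scored only to be rejected, which is the wasted work that skipping is meant to avoid.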

Our contribution
Method to convert efficient IR algorithms into efficient top-k pub-sub algorithms
– Demonstrated on 4 standard IR algorithms: TAAT, Buckley & Lewit, DAAT, WAND

Key for Efficiency: Skipping
IR algorithms skip most of the posting lists:
– Compute an upper bound on the score gain in all remaining posting lists
– If the upper bound is not enough to change the result set, the remaining lists can be skipped
Can't use this for pub-sub: instead of 1 result set we have to update many
μ_s: score of the worst tweet annotating a story s
Skipping condition when processing a tweet: can skip s only if the upper bound on score(tweet, s) ≤ μ_s
Use a segment tree per posting list to skip segments of the list that satisfy the skipping condition
Overhead: ~1.6% of index size
[Figure: posting list of t4 over stories s1 to s5, annotated with the score of the worst tweet annotating each story]
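One way to picture the segment-tree idea: store the minimum μ_s of each segment of a posting list, and descend only into segments whose minimum is below the tweet's upper bound. The sketch below is an assumption about the layout, not the structure from the talk; the class `MinSegmentTree` and its methods are hypothetical.

```python
import math


class MinSegmentTree:
    """Min-tree over the mu values of one posting list (array-based, power-of-two size)."""

    def __init__(self, mu_values):
        self.size = 1
        while self.size < max(len(mu_values), 1):
            self.size *= 2
        # Padding leaves hold +inf so they are always skipped
        self.tree = [math.inf] * (2 * self.size)
        for i, mu in enumerate(mu_values):
            self.tree[self.size + i] = mu
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = min(self.tree[2 * node], self.tree[2 * node + 1])

    def update(self, pos, mu):
        """Record a new worst-annotation score for the story at position pos."""
        node = self.size + pos
        self.tree[node] = mu
        node //= 2
        while node >= 1:
            self.tree[node] = min(self.tree[2 * node], self.tree[2 * node + 1])
            node //= 2

    def candidates(self, upper_bound):
        """Positions with mu < upper_bound; subtrees with min mu >= upper_bound are skipped."""
        found, stack = [], [1]
        while stack:
            node = stack.pop()
            if self.tree[node] >= upper_bound:
                continue  # every story below this node satisfies the skipping condition
            if node >= self.size:
                found.append(node - self.size)
            else:
                stack.append(2 * node + 1)
                stack.append(2 * node)  # pushed last so the left child is visited first
        return found


# mu for stories s1..s5 in one posting list (illustrative values)
tree = MinSegmentTree([0.9, 0.2, 0.8, 0.1, 0.7])
print(tree.candidates(0.5))  # positions of stories that still need scoring
tree.update(1, 0.6)          # the story at position 1 got a better worst tweet
print(tree.candidates(0.5))
```

As annotation sets fill up with good tweets, the μ values rise and larger segments can be skipped for a typical tweet.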

Score(story, tweet)
– Content-based matching (cosine similarity, BM25)
– Time-based decay factor: after every fixed time interval the score is divided by 2
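A decayed content score of this kind could look as follows. This is a sketch under assumptions: cosine similarity over sparse term-weight dicts stands in for the content match, and the half-life of one hour is a made-up placeholder, since the slides do not give the interval.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def decayed_score(story_vec, tweet_vec, age_seconds, half_life_seconds=3600.0):
    """Content score halved once per half-life (half_life_seconds is a hypothetical value)."""
    return cosine(story_vec, tweet_vec) * 0.5 ** (age_seconds / half_life_seconds)


story = {"election": 2.0, "debate": 1.0}
tweet = {"election": 1.0, "debate": 0.5}
print(decayed_score(story, tweet, age_seconds=0))
print(decayed_score(story, tweet, age_seconds=3600))
```

Because every tweet decays at the same rate, the relative order of two tweets for a given story does not change over time; only fresher tweets arrive with an advantage.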

Test Collection
100K articles from a single day
– Each article has a title, abstract and main body
35M tweets from the same day, containing only ASCII chars
– 24K/minute

Fraction of related tweets that actually matter
We measured: 38 new tweets related to the average story per minute
For 100K stories: 3.8M tweets / minute
– This would be the number of invalidations in real-time indexing with caching
– Many (expensive) queries of the Tweet Index or, alternatively, stale annotations
Fraction of related tweets that actually become annotations: 5 orders of magnitude less!
Important to efficiently identify the stories a tweet will actually annotate

Skipping: 10x reduction in processing time
[Plot: processing time of our algorithm with skipping vs. our algorithm without skipping]

Summary
Annotating news stories with social updates in real time
– Top-k pub-sub: stories indexed as subscriptions, tweets are events
– Scalable, fast annotation serving
– Low-latency tweet processing, off the critical serving path!
Method to convert top-k retrieval algorithms to top-k pub-sub
– Demonstrated using 4 popular algorithms
– Skipping works: up to 10x latency reduction
Can use top-k pub-sub for top stories, caching for others
Many potential applications
– Examples: alerts, personalized news feed, etc.

Thank you! Alex Shraer