Client Behavior and Feed Characteristics of RSS Presented by Sukumar Manduva Nageswari Vallabhaneni
Why This Presentation Previously we dealt with the system architecture, event-notification, and content-filtering algorithms used by RSS. What about fundamental aspects such as workload and how clients actually use the system?
Topics Introduction: Publish-Subscribe Systems; Experiment at Cornell University. Measurement Methodology: Passive Logging; Active Polling. Survey Results: Feed Characteristics; Update Characteristics; Client Behavior.
INTRODUCTION Pub-Sub Systems A pub-sub system consists of subscribers, publishers, and an event-delivery infrastructure. The infrastructure matches published events against subscribers' interests. Pub-sub systems fall into two classes based on how subscribers specify their interest: topic based and content based.
Pub-Sub System [Figure: publishers (CNN, BBC, NGC) send events to a notification service, which delivers event notifications to subscribers (S1, S2, S3).]
Topic Based Pub-Sub Systems Also known as subject-based, group-based, or channel-based event filtering. A subscriber subscribes to a particular channel and receives all events published to that channel, e.g. Sports, Stock Market. Topics can be hierarchical, e.g. Sports/basketball, Stock Market/BOA.
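A minimal sketch of topic-based delivery with hierarchical topics. The `TopicBroker` class and its method names are hypothetical, and the rule that a subscription to a parent topic also covers its subtopics is an assumption made for illustration, not something the slides specify.

```python
from collections import defaultdict


class TopicBroker:
    """Toy topic-based broker (hypothetical); a parent topic covers its subtopics."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # topic -> set of subscriber names

    def subscribe(self, subscriber, topic):
        self._subscribers[topic].add(subscriber)

    def publish(self, topic, event):
        """Return {subscriber: event} for subscribers of the topic or any ancestor."""
        parts = topic.split("/")
        recipients = set()
        for depth in range(1, len(parts) + 1):
            ancestor = "/".join(parts[:depth])
            recipients |= self._subscribers.get(ancestor, set())
        return {name: event for name in sorted(recipients)}


broker = TopicBroker()
broker.subscribe("S1", "Sports")
broker.subscribe("S2", "Sports/basketball")
broker.subscribe("S3", "Stock Market/BOA")
delivered = broker.publish("Sports/basketball", "final score posted")
```

Here `delivered` contains S1 and S2 but not S3, since S3 subscribed to an unrelated channel.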
Content Based Pub-Sub System Gives subscribers more flexibility and power by letting them query over the contents of an event, e.g. notify me of news about cricket from cricinfo if the score is greater than 350.
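The cricket example can be sketched as a predicate evaluated against each event. The `ContentBroker` class and the event fields (`source`, `sport`, `score`) are hypothetical names chosen for illustration; a real system would use its own subscription language rather than Python callables.

```python
class ContentBroker:
    """Toy content-based broker (hypothetical): subscriptions are predicates over events."""

    def __init__(self):
        self._subscriptions = []  # list of (subscriber name, predicate)

    def subscribe(self, subscriber, predicate):
        self._subscriptions.append((subscriber, predicate))

    def publish(self, event):
        """Return the names of subscribers whose predicate matches the event."""
        return [name for name, predicate in self._subscriptions if predicate(event)]


broker = ContentBroker()
broker.subscribe(
    "S1",
    lambda e: e.get("source") == "cricinfo"
    and e.get("sport") == "cricket"
    and e.get("score", 0) > 350,
)

high = broker.publish({"source": "cricinfo", "sport": "cricket", "score": 362})
low = broker.publish({"source": "cricinfo", "sport": "cricket", "score": 280})
```

Only the first event is delivered to S1; the second fails the score condition even though the topic is the same.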
INTRODUCTION Experiments at Cornell University: A 45-day study of about 10,000 feeds. Analyzed feed characteristics, update characteristics, and client behavior. [Figure: a tracer at the Cornell University CS Dept captures RSS requests and responses exchanged with feed servers such as CNN, BBC, and NGC.]
Measurement Methodology Passive Logging: Tracer software captures TCP packets, reassembles the flows, and logs the RSS requests/responses from each reassembled flow. Trace length: 45 days. Number of clients: 158. Number of feeds: 667. Number of requests: 61,935.
Measurement Methodology Active Polling: Actively polled 99,714 RSS feeds for 84 hours, gathering a snapshot of each feed at every poll. Polling period: 84 hours. Number of feeds: 99,714. Number of snapshots: 3,682,043. Bytes received: 57 GB.
Analyzing Study Results Feed Characteristics: popularity distribution; content size; format and version. Update Characteristics: intervals; changes involved in an update; correlation between feed size and update. Client Behavior: polling; subscription patterns.
Feed Characteristics Feed Popularity: We measure popularity in two ways: 1. The number of requests received for each RSS feed. 2. The number of clients who subscribed to each RSS feed.
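Both popularity measures can be computed from the same request log. The sketch below assumes the log has already been reduced to `(client, feed)` pairs and treats any client that requested a feed as a subscriber of it; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def feed_popularity(request_log):
    """Rank feeds by request count and by number of distinct clients."""
    requests = Counter()
    clients = defaultdict(set)
    for client, feed in request_log:
        requests[feed] += 1
        clients[feed].add(client)
    by_requests = requests.most_common()
    by_subscribers = sorted(
        ((feed, len(names)) for feed, names in clients.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return by_requests, by_subscribers


# Made-up log: one client polls "cnn" often, three clients poll "bbc" once each.
log = [("c1", "cnn")] * 4 + [("c2", "bbc"), ("c3", "bbc"), ("c4", "bbc")]
by_requests, by_subscribers = feed_popularity(log)
```

With this sample the two rankings disagree: "cnn" leads by requests while "bbc" leads by subscribers, which is why both measures are worth reporting.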
Feed Characteristics Feeds Ranked by Number of Requests:
Feed Characteristics Feeds Ranked by Number of Subscribers:
Feed Characteristics Feed Format and Version: Format: 98% are RSS feeds and 2% are Atom feeds. Version:
Feed Characteristics Feed Size: The feed size is calculated as the average over all snapshots of the feed. 80% of feeds < 10 KB. Median = 5.8 KB. 99% of feeds < 100 KB.
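A rough sketch of how such size statistics could be derived from snapshot sizes. The function names and the sample numbers are made up for illustration, and 1 KB is assumed to mean 1024 bytes.

```python
from statistics import mean, median

KB = 1024  # assumption: 1 KB = 1024 bytes


def feed_sizes(snapshot_sizes_by_feed):
    """Feed size = average size in bytes over all snapshots of that feed."""
    return {
        feed: mean(sizes)
        for feed, sizes in snapshot_sizes_by_feed.items()
        if sizes
    }


def fraction_below(sizes, limit):
    """Fraction of feeds whose size is strictly below the limit."""
    sizes = list(sizes)
    return sum(1 for size in sizes if size < limit) / len(sizes)


# Made-up snapshot sizes in bytes for three feeds.
snapshots = {"a": [4000, 6000], "b": [8000], "c": [120000, 80000]}
sizes = feed_sizes(snapshots)
median_size = median(sizes.values())
share_under_10kb = fraction_below(sizes.values(), 10 * KB)
```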
Update Characteristics The nature of RSS updates can be determined from the hourly snapshots gathered through polling. An update is counted as valid only if a valid snapshot precedes it.
Update Characteristics [Figure: timeline of hourly snapshots (1 hr duration each), labeled valid or invalid, showing a feed change and no-change intervals.]
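The validity rule for updates can be sketched as follows. This assumes one snapshot per hour, with `None` standing in for an invalid snapshot (for example, a failed poll), and counts a change only when both the current snapshot and the one immediately before it are valid. The function name and the sample data are hypothetical.

```python
def count_valid_updates(snapshots):
    """Count content changes between consecutive valid hourly snapshots.

    snapshots: list of feed contents in polling order; None marks an invalid snapshot.
    """
    updates = 0
    for previous, current in zip(snapshots, snapshots[1:]):
        if previous is None or current is None:
            continue  # no valid preceding snapshot, so the change cannot be timed
        if current != previous:
            updates += 1
    return updates


# Made-up series: the change to "v2" follows an invalid snapshot and is not counted.
hourly = ["v1", "v1", None, "v2", "v3", "v3"]
valid_updates = count_valid_updates(hourly)
```

In this series only the change from "v2" to "v3" counts, because it is the only change bracketed by two valid snapshots taken one hour apart.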
Update Characteristics Update Rate:
Update Characteristics Update Size:
Issues with Polling Constant polling by clients places a significant bandwidth burden on RSS servers. RSS 2.0 supports the TTL, SkipDays, and SkipHours elements, which tell clients when not to poll. Sending clients only the data that actually changed would save about 93.2% of bandwidth, since an update changes only 6.8% of the content on average.
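A minimal sketch of a client that honors these RSS 2.0 hints before polling. The function name `may_poll` and the sample feed are hypothetical; the element names `ttl`, `skipHours/hour`, and `skipDays/day` come from RSS 2.0, where `ttl` is in minutes and skip hours are given in GMT.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def may_poll(feed_xml, last_poll, now):
    """Return False if ttl, skipHours, or skipDays says the client should not poll now."""
    channel = ET.fromstring(feed_xml).find("channel")
    ttl = channel.findtext("ttl")
    if ttl is not None and now - last_poll < timedelta(minutes=int(ttl)):
        return False  # cached copy is still within its time to live
    now_utc = now.astimezone(timezone.utc)
    skip_hours = {int(hour.text) for hour in channel.findall("skipHours/hour")}
    if now_utc.hour in skip_hours:
        return False
    skip_days = {day.text.strip() for day in channel.findall("skipDays/day")}
    if DAYS[now_utc.weekday()] in skip_days:
        return False
    return True


# Made-up feed: cache for 60 minutes, skip hour 3 (GMT) and Sundays.
FEED = (
    '<rss version="2.0"><channel><title>Example</title><ttl>60</ttl>'
    "<skipHours><hour>3</hour></skipHours>"
    "<skipDays><day>Sunday</day></skipDays></channel></rss>"
)
last = datetime(2005, 10, 19, 10, 0, tzinfo=timezone.utc)
too_soon = may_poll(FEED, last, datetime(2005, 10, 19, 10, 30, tzinfo=timezone.utc))
allowed = may_poll(FEED, last, datetime(2005, 10, 19, 11, 30, tzinfo=timezone.utc))
```

These hints reduce how often clients poll; they do not reduce the size of each response, which is what sending only the changed data addresses.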
Update Characteristics Correlations between Feed Size & Update Rate:
Update Characteristics Correlations between Feed Size & Update Size:
Client Behavior Polling Frequency: Auto-clients poll at a fixed rate (default 60 min); manual clients poll as needed.
Client Behavior Subscriptions:
Conclusion We discussed the factors to consider when constructing a pub-sub system in the future, and how the architecture can influence performance by saving bandwidth and reducing workload.