Published by Virgil Lindsey. Modified over 9 years ago.
1
1 Insights into Access Patterns of Internet Media Systems Lei Guo
2
2 Video applications are mainstream
Video traffic is doubling every 3 to 4 months
YouTube: more than 100 M daily streams since July 2006
Media content on the Internet
No. 3 until now: 1. Google, 2. Yahoo!, 3. YouTube
3
3 Existing media delivery systems
[Figure: media delivery over a P2P network, a CDN, and wireless links]
High CPU and bandwidth consumption, or unsatisfactory QoS
Caching for performance improvement
– A lot of research studies
– Commercial products
Access patterns play an important role
– Exploit temporal locality
– High performance with cost effective hardware
Seldom used in reality
4
4 Zipf model of Internet traffic patterns
Zipf distribution (power law)
– Characterizes the property of scale invariance
– Heavy tailed, scale free
80-20 rule
– Income distribution: 80% of social wealth is owned by 20% of people (Pareto law)
– Web traffic: 80% of Web requests access 20% of pages (Breslau, INFOCOM’99)
System implications
– Objectively caching the working set in proxy
– Significantly reduce network traffic
Reference rank distribution: $y_i \propto i^{-\alpha}$, with $\alpha$: 0.6~0.8
– i: rank of objects
– y_i: number of references
[Figure: log y versus log i, a straight line with slope $-\alpha$ and a heavy tail]
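The concentration implied by a Zipf-like rank distribution can be checked numerically. The following is a minimal sketch, assuming a hypothetical population of 10,000 objects and an exponent of 0.8 (the upper end of the range on the slide); the function names are illustrative and not from the presentation.

```python
def zipf_counts(n_objects, alpha):
    """Unnormalized Zipf-like reference counts: y_i proportional to i^(-alpha)."""
    return [rank ** -alpha for rank in range(1, n_objects + 1)]


def top_share(counts, fraction):
    """Share of all references that go to the top `fraction` of ranked objects."""
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


# Assumed population size and exponent, chosen only for illustration.
counts = zipf_counts(n_objects=10_000, alpha=0.8)
for fraction in (0.01, 0.10, 0.20):
    share = top_share(counts, fraction)
    print(f"top {fraction:.0%} of objects -> {share:.1%} of references")
```

The share captured by the top 20% depends on both the exponent and the population size, so the 80-20 figure is a rule of thumb rather than something this sketch reproduces exactly.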
5
5 Does Internet media traffic follow Zipf’s law?
Web media systems
– Chesire, USITS’01: Zipf-like
– Cherkasova, NOSSDAV’02: non-Zipf
VoD media systems
– Acharya, MMCN’00: non-Zipf
– Yu, EUROSYS’06: Zipf-like
Live streaming and IPTV systems
– Veloso, IMW’02: Zipf-like
– Sripanidkulchai, IMC’04: non-Zipf
P2P media systems
– Gummadi, SOSP’03: non-Zipf
– Iamnitchi, INFOCOM’04: Zipf-like
6
6 Inconsistent media access pattern models
Still based on the Zipf model (heuristic assumptions)
– Zipf with exponential cutoff
– Zipf-Mandelbrot distribution
– Generalized Zipf-like distribution
– Two-mode Zipf distribution
– Fetch-at-most-once effect
– Parabolic fractal distribution
– …
All case studies
– Based on one or two workloads
– Different from, or even in conflict with, each other
An insightful understanding is essential to guide
– Content delivery system design
– Internet resource provisioning
– Performance optimization
7
7 Outline Motivation and objectives Stretched exponential of Internet media traffic Dynamics of access patterns in media systems Caching implications Conclusion and future work
8
8 Workload summary
16 workloads in different media systems (nearly all workloads available on the Internet)
– Web, VoD, P2P, and live streaming
– Both client side and server side
Different delivery techniques (all major delivery techniques)
– Downloading, streaming, pseudo streaming
– Overlay multicast, P2P exchange, P2P swarming
Data set characteristics (data sets of different scales)
– Workload duration: 5 days - two years
– Number of users: 10^3 - 10^5
– Number of requests: 10^4 - 10^8
– Number of objects: 10^2 - 10^6
9
9 Stretched exponential distribution
Media reference rank follows stretched exponential distribution (verified by Chi-square test)
Reference rank distribution: $y_i^c = -a \log i + b$
– i: rank of media objects (N objects)
– y_i: number of references
– c: stretch factor
– Fat head and thin tail in log-log scale
– Straight line with slope -a in log x - y^c scale (SE scale)
[Figure: the same rank distribution in log-log scale (fat head, thin tail) and in SE scale (straight line with slope -a and intercept b)]
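The "straight line in SE scale" property suggests a simple way to estimate the parameters from ranked reference counts. The sketch below is illustrative only: the slides state that parameters were obtained by maximum likelihood, whereas this code swaps in a least-squares grid search over c, picking the value for which y^c is most linear in log(rank). It also assumes natural logarithms and, for the synthetic check, that the least popular object has exactly one reference (so b = 1 + a log N); all function names are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for ys ~ slope * xs + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def fit_stretched_exponential(counts, c_grid=None):
    """Grid-search the stretch factor c that makes y^c most linear in log(rank)."""
    counts = sorted(counts, reverse=True)
    log_ranks = [math.log(i) for i in range(1, len(counts) + 1)]
    if c_grid is None:
        c_grid = [k / 100 for k in range(5, 101)]  # c from 0.05 to 1.00
    best = None
    for c in c_grid:
        slope, intercept, r2 = linear_fit(log_ranks, [y ** c for y in counts])
        if best is None or r2 > best["r2"]:
            best = {"c": c, "a": -slope, "b": intercept, "r2": r2}
    return best


# Synthetic check: build counts from known parameters, then recover them.
true_c, true_a, n = 0.2, 0.25, 5000
true_b = 1 + true_a * math.log(n)  # assumption: the rank-N object has one reference
synthetic = [(true_b - true_a * math.log(i)) ** (1 / true_c) for i in range(1, n + 1)]
fit = fit_stretched_exponential(synthetic)
print(fit)
```

On real logs the counts are integers with many ties in the tail, so a least-squares fit like this weights the tail heavily; that is one reason a likelihood-based estimate may be preferable.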
10
10 Set 1: Web media systems (server logs)
Workloads: ST-SVR-01 (15 MB), *HPC-98 (14 MB), *HPLabs-99 (120 MB)
– HPC-98: enterprise streaming media server logs of HP corporation (29 months)
– HPLabs: logs of video streaming server for employees in HP Labs (21 months)
– ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months)
Plot conventions
– x: rank of media object (log scale)
– y: number of references to the object (left y axis: powered scale y^c; right y axis: log scale)
– Parameters: maximum likelihood method
– R^2: coefficient of determination (1 means a perfect fit)
[Figure: rank distributions with fat head and thin tail in log scale; annotated c = 0.22, R^2 = 0.977]
11
11 Set 2: Web media systems (req packets)
Workloads: ST-CLT-05 (4.5 MB), PS-CLT-04 (1.5 MB), ST-CLT-04 (2 MB)
– PS-CLT-04: HTTP downloading/pseudo streaming requests, 9 days
– ST-CLT-04: RTSP/MMS on-demand streaming requests, 9 days
– ST-CLT-05: RTSP/MMS on-demand streaming requests, 11 days
All collected from a big cable network hosted by a large ISP
– x: rank of media object (log scale)
– y: number of references to the object (powered scale y^c and log scale)
[Figure: rank distributions with fat head and thin tail in log scale]
12
12 Set 3: VoD media systems
Workloads: *mMoD-98 (125 MB), *CTVoD-04 (300 MB), IFILM-06 (2.25 MB), YouTube-06 (3.4 MB)
– mMoD-98: logs of a multicast Media-on-Demand video server, 194 days
– CTVoD-04: streaming server logs of a large VoD system by China Telecom, 219 days, reported as Zipf in EUROSYS’06
– IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure)
– YouTube-06: cumulative number of requests to YouTube video clips, obtained by crawling the web pages that publish the data
[Figure: rank distributions in powered scale y^c and log scale, x axis in log scale; fat head and thin tail in log scale]
13
13 Set 4: P2P media systems
Workloads: BT-03 (636 MB), *KaZaa-02 (300 MB), *KaZaa-03 (5 MB)
– KaZaa-02: large video files (> 100 MB; files smaller than 100 MB are intentionally removed) transferred in the KaZaa network, collected in a campus network, 203 days
– KaZaa-03: music files, movie clips, and movie files downloaded in the KaZaa network, 5 days, reported as Zipf in INFOCOM’04
– BT-03: 48 days of BitTorrent file downloading (large video and DVD images) recorded by two tracker sites
[Figure: rank distributions with fat head and thin tail]
14
14 Set 5: Live streaming and movie pictures
Workloads: IMDB-06, Akamai-03, Movie-02
– Akamai-03: server logs of live streaming media collected from Akamai CDN, 3 months, reported as two-mode Zipf in IMC’04
– Movie-02: US movie box office ticket sales of year 2002
– IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web site
[Figure: rank distributions with fat head and thin tail]
15
15 Why was Zipf observed before?
Media traffic is driven by user requests
Intermediate systems may affect traffic pattern
– Effect of extraneous traffic (ads video insertion)
– Filtering effect due to caching (through a proxy)
Biased measurements may cause Zipf observation
[Figure: clients reach the media server through a proxy with a cache; an ad server injects additional traffic]
16
16 Extraneous media traffic
Ad and logo video are pushed to clients mandatorily
[Figure: a meta file link on the web server points to the streaming media server and the ads server; each play delivers an ads clip, a logo clip, and the video program]
17
17 Effects of extraneous traffic on reference rank distributions
Extraneous traffic does not represent user access patterns
– High request rate (high popularity)
– High total number of requests
Not necessarily Zipf with extraneous traffic
– Extraneous traffic changes
– Always SE without extraneous traffic
Small object sizes, small traffic volume
[Figure: reference rates of program, ads, and logo objects in 2004 and 2005 (2004: 2 objects; 2005: merged into 1 object). With extraneous traffic the rank distribution is non-Zipf in one year and Zipf in the other; without extraneous traffic it is SE in both 2004 and 2005]
18
18 Caching effect
Web workload: caching can cause a “flattened head” in log-log scale
Stretched exponential is not caused by caching effect
Local replay events can be traced by commercial streaming media protocols
– Play summary
– Cache validation
[Figure: log y versus log i for Zipf filtered by a Web cache, compared with stretched exponential; data sources are server logs and a packet sniffer for the client side workload]
19
19 Why media access pattern is not Zipf
“Rich-get-richer” and Zipf / power law
– Pareto law: income distribution
Web access is Zipf
– Popular pages can attract more users
– Yahoo! has ranked No. 1 for more than six years
– Pages are updated to stay popular
– Zipf-like for long duration
Media access is not
– Popularity decreases with time exponentially
– Media objects are immutable
– Rich-get-richer not present
– Non-Zipf in long duration
[Figure: CCDF of requests (log scale, 10^0 to 10^3) versus time after object birth (0 to 200 days) for a BitTorrent media file; raw data with linear fit]
[Figure: number of distinct weekly top N popular objects in 16 weeks, Web versus video. The top 1 Web object never changes (1 distinct object); the top 1 video object changes every week (16 distinct objects)]
20
20 Outline Motivation and objectives Stretched exponential of Internet media traffic Dynamics of access patterns in media systems Caching implications Conclusion and future work
21
21 Dynamics of access patterns in media systems
Media reference rank distribution in log-log scale
– Different systems have different access patterns
– The distribution evolves over time in a system (NOSSDAV’02)
All follow stretched exponential distribution
– Stretch factor: c
– Minus of slope: a
Dynamics and physical meanings
– Media file sizes
– Aging effects of media objects
– Deviation from the Zipf model
[Figure: # of references in y^c scale versus rank in log scale; straight line with slope -a; c: stretch factor]
22
22 Stretch factors of different systems
23
23 Stretch factor and media file size
File size vs. stretch factor c
– 0 – 5 MB: c <= 0.2
– 5 – 100 MB: 0.2 ~ 0.3
– > 100 MB: c >= 0.3
c increases with file size, regardless of the underlying system
Other issues
– Different encoding rates (file length in time is a better metric)
– Different content type: entertainment, educational, business
Stretch factor c reflects the view time of a video object
[Figure: stretch factor c versus file size, with EDU and BIZ workloads marked]
24
24 Stretched exponential parameters over time
In a media system over time t
– Constant request rate λ_req
– Constant object birth rate λ_obj
– Constant median file size
Stretch factor c is a time invariant constant
Parameter a increases with time t
Popularity decreases exponentially
[Figure: # of requested objects over time. Objects created in [0, t) grow with slope λ_obj; objects created in (-∞, 0] contribute ~ O(log t)]
[Figure: # of references in y^c scale versus rank in log scale; slope: -a; c: stretch factor]
25
25 Evolution of media reference rank distribution
Model confirmed: a increases with time, and a converges to a constant
– Slow down (1): caused by objects created before workload collection
– Speed up (2)
– t0: system upgrade
[Figure: reference rank distribution (slope = -a) and parameter a over time for Web media, VoD (IFILM), and P2P (BT)]
26
26 Deviation from the Zipf model
In which case is SE close to Zipf?
– a increases with c (c < 2)
– a increases with λ_req / λ_obj
– a increases with t
Deviation increases with time
Big file workloads have larger deviation
[Figure: SE and Zipf curves in log-log scale (reference # versus rank) with points O, E, and F marked; the deviation is defined from the segment lengths |EF| and |OE| and grows with parameter a; big files deviate more]
27
27 Deviation from Zipf for different sized movie videos
– Movie trailer (IFILM): 2.25 MB, deviation = 0.25
– VoD (CTVoD): 300 MB, deviation = 0.45
– US movie (income): movie theater, deviation = 0.62
Big/long media files have large deviation
– Large stretch factor c
– High λ_req / λ_obj (movies have a low production rate)
[Figure: reference # (log) versus rank (log) for the three workloads]
28
28 Example: YouTube Video Measurements in IMC’07
80 days in a campus network: Zipf
– Short time, old objects dominant: N’(t) >> λ_obj t
– In campus: small request rate λ_req
– N: small # of objects
– Small a log N, small deviation
All downloads for all time: non-Zipf
– No “old” objects: N’(t) = 0
– Entire YouTube: huge request rate λ_req
– N: all objects
– Large a log N, large deviation
Two aspects of the same model: stretched exponential
29
29 Outline Motivation and objectives Stretched exponential of Internet media traffic Dynamics of access patterns in media systems Caching implications Conclusion and future work
30
30 Caching analysis methodology
Analyze media caching for short term workload
– Distribution is stationary (evolution takes time)
– Requests are independent
– Object has unit size
Intuition from the distribution shape
– Zipf-like: highly concentrated requests
– Stretched exponential: less concentrated requests
[Figure: log y versus log i for Zipf-like and stretched exponential distributions]
31
31 Modeling caching performance
Media caching is far less efficient than Web caching
Parameter selection
– Zipf: typical Web workload (α = 0.8)
– SE: typical streaming media workload (c = 0.2, a = 0.25, same as ST-CLT-05)
Asymptotic analysis for small cache size k (k << N)
[Figure: cache hit ratio versus normalized cache size for Zipf (Web) and SE (media)]
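The comparison on this slide can be approximated numerically under the stated assumptions (independent requests, unit-size objects), where a cache holding the k most popular objects has a hit ratio equal to their share of all requests. This is a rough sketch rather than the asymptotic analysis from the talk: it assumes a hypothetical population of N = 10,000 objects, natural logarithms, and b = 1 + a log N (that is, the least popular object has one reference), none of which are stated on the slide.

```python
import math


def zipf_weights(n, alpha):
    """Zipf-like popularity: weight of rank i is i^(-alpha)."""
    return [i ** -alpha for i in range(1, n + 1)]


def se_weights(n, c, a):
    """Stretched exponential popularity: y_i^c = -a*log(i) + b."""
    b = 1 + a * math.log(n)  # assumption: the rank-n object has exactly one reference
    return [(b - a * math.log(i)) ** (1 / c) for i in range(1, n + 1)]


def ideal_hit_ratio(weights, cache_fraction):
    """Hit ratio of a cache that permanently holds the most popular objects."""
    k = max(1, round(len(weights) * cache_fraction))
    return sum(weights[:k]) / sum(weights)


N = 10_000  # hypothetical number of objects
zipf = zipf_weights(N, alpha=0.8)      # Web parameters from the slide
se = se_weights(N, c=0.2, a=0.25)      # media parameters from the slide

for fraction in (0.001, 0.01, 0.1):
    print(
        f"cache = {fraction:.1%} of objects: "
        f"Zipf {ideal_hit_ratio(zipf, fraction):.3f}, "
        f"SE {ideal_hit_ratio(se, fraction):.3f}"
    )
```

For small caches the stretched exponential workload yields the lower hit ratio in this sketch, which is consistent with the slide's claim; the exact gap depends on N and on the assumed value of b.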
32
32 Potential of long term media caching
Short term: requests are dominated by old objects, which dilutes concentration
Long term: requests are dominated by new objects
– Request concentration: significantly increased
– Request correlation: old objects can be replaced when unpopular
To achieve maximal concentration
– Very long time (months to years) and huge amount of storage
– Client-server based proxy systems cannot scale
– Peer-to-peer systems are scalable for this purpose
Object popularity decreases exponentially
[Figure: CDF of object ages and CDF of requests in 11 days, split into old objects and new objects]
33
33 Applications of media access pattern model Resource allocation for large scale media distribution –CDN systems –Peer-to-peer systems Benchmarking the performance of media servers –Synthetic workload generation –SPECweb benchmark: pending Traffic engineering and management of the Internet –Analyze flow patterns of video traffic –Throttle illegal digital movie downloading Analyzing user request patterns for video search –Better relevance ranking –Search intent analysis
34
34 Outline Motivation and objectives Stretched exponential of Internet media traffic Dynamics of access patterns in media systems Caching implications Conclusion and future work
35
35 Concluding Remarks
Media access patterns do not fit the Zipf model
Media access patterns are stretched exponential
Our findings imply that
– Client-server based proxy systems are not effective for delivering media content
– P2P systems are most suitable for this purpose
We provide an analytical basis for the effectiveness of P2P media content delivery infrastructure
36
36 Future work: User behavior models for social networks
Internet systems are user-driven
– User networks and cloud services on the Internet
– User behavior models: Zipf-like, stretched exponential, or others?
Social networks are powered by user contributions
– User generated content (UGC)
– Video sharing, blog, bookmark sharing, question/answer, Wikipedia, …
Understanding UGC creation patterns in social networks
– Common users vs. core users
– Validate the correctness of user provided knowledge
Identifying spam content in social networks
– Distinguish normal users and spam users
– Evaluate the “health” of a social network
Efficient resource management in the underlying supporting system
37
37 Future work: User behavior models for search engine systems
Typical Internet cloud services
User behavior models for
– Keyword selection
– Query sequence, page click through
– Query frequency, page dwelling time
Analyze user search intent
– Better rank boosting algorithms and query re-writing functions
– Serve more accurate information
Identify search engine abuse and attacks
Improve search engine performance
– Mining co-occurrence patterns of search keywords
– Exploiting inter-query locality on search index
38
38 Other related projects
Internet measurement and modeling
– The stretched exponential distribution of Internet media access patterns, PODC 2008
– Delving into Internet streaming media delivery: A quality and resource utilization perspective, IMC 2006
– Measurements, analysis, and modeling of BitTorrent-like systems, IMC 2005
– Analysis of multimedia workloads with implications for Internet streaming, WWW 2005
Internet media systems
– ASAP: an AS-aware peer-relay protocol for high quality VoIP with low overhead, ICDCS 2006
– DISC: Dynamic Interleaved Segment Caching for interactive streaming, ICDCS 2005
– PROP: a scalable and reliable P2P assisted proxy streaming system, ICDCS 2004
39
39 Other related projects (cont.)
Wireless Internet
– PSM-throttling: Minimizing energy consumption for bulk data communications in WLANs, ICNP 2007
– Cooperative Relay Service in a Wireless LAN, JSAC 2007
– SCAP: Smart caching in wireless access points to improve P2P streaming, ICDCS 2007
– Exploiting idle communication power to improve wireless network performance and energy efficiency, INFOCOM 2006
Social networks and Web 2.0
– Tag-based social interest discovery, WWW 2008
– Understanding instant messaging traffic characteristics, ICDCS 2007
Peer-to-peer and overlay networks
– LightFlood: Minimizing redundant messages and maximizing scope of peer-to-peer search, TPDS 2008
– Fast and low cost search schemes by exploiting localities in P2P networks, JPDC 2005
40
40 Thank you!
41
41 Segment caching performance
Segment hit ratio ≈ byte hit ratio
– Much lower than object hit ratio
LRU is less efficient for media caching
– Less request concentration
– Larger working set
Segment LFU is even worse
– Sequential order in an object not captured
Conventional replacement policies are not efficient
(i): minimum cache size to hold object i
[Figure: LRU efficiency for ST-CLT-04 and ST-CLT-05]
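The effect of lower request concentration on LRU can be illustrated with a small simulation. This sketch works at the object level with independent requests, not at the segment level analyzed on the slide, and every number in it (2,000 objects, a 20-object cache, 50,000 requests, the random seed) is a hypothetical choice; the SE intercept again assumes b = 1 + a log N with natural logarithms.

```python
import math
import random
from collections import OrderedDict


def lru_hit_ratio(requests, cache_size):
    """Replay a request sequence through an LRU cache of unit-size objects."""
    cache = OrderedDict()
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)         # mark as most recently used
        else:
            cache[obj] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used object
    return hits / len(requests)


def sample_requests(weights, n_requests, rng):
    """Draw independent requests; object i is picked with probability proportional to weights[i]."""
    return rng.choices(range(len(weights)), weights=weights, k=n_requests)


N, CACHE, REQUESTS = 2000, 20, 50_000  # hypothetical workload dimensions
rng = random.Random(2008)              # fixed seed so runs are repeatable

zipf = [i ** -0.8 for i in range(1, N + 1)]
b = 1 + 0.25 * math.log(N)             # assumption: the rank-N object has one reference
se = [(b - 0.25 * math.log(i)) ** (1 / 0.2) for i in range(1, N + 1)]

zipf_hit = lru_hit_ratio(sample_requests(zipf, REQUESTS, rng), CACHE)
se_hit = lru_hit_ratio(sample_requests(se, REQUESTS, rng), CACHE)
print(f"LRU hit ratio, Zipf workload: {zipf_hit:.3f}")
print(f"LRU hit ratio, SE workload:   {se_hit:.3f}")
```

Because requests are drawn independently, the simulation ignores the request correlation and partial access that real streaming workloads exhibit, so it only isolates the concentration effect.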
42
42 Ongoing and future work
Media traffic is driven by user requests
– How users request media objects
– How media access patterns affect system performance
Social networks are powered by user contributions
– User generated content (UGC)
– Video sharing, blog, bookmark sharing, question/answer, …
How do users contribute UGC in social networks?
How do UGC creation patterns affect the growth and popularity of a social network?
43
43 UGC creation patterns in online social networks
UGC creation patterns in social networks are stretched exponential
– x: user rank
– y: number of posts
[Figure: rank distributions for Yahoo! Blog (article posting) and Yahoo! Blog (photo posting)]
44
44 UGC creation patterns in online social networks (continued)
Media request is content consumption
– Stretch factor vs. file size (length): ↑
UGC creation is content production
– Stretch factor vs. effort to create a UGC object: ↓
[Figure: rank distributions for Yahoo! Delicious (bookmark) and Yahoo! Answer (answer)]
45
45 Segment-based streaming media caching
Streaming media are often partially accessed
– Segment caching is efficient
“Ideal” segment reference rank distribution
– M segments per object, no partial access
– Two mode SE distribution: same c, same a in the 2nd mode
[Figure: object and segment rank distributions; the segment curve is shifted by M and has slope = -a in its second mode]
46
46 Segment-based streaming media caching (cont.)
Segmentation: 5 seconds of media data
Segments rank distribution: two-mode SE
– Same stretch factor c
– Smaller a than object rank distribution
Less temporal locality
[Figure: segment rank distributions for ST-SVR-01, ST-CLT-04, and ST-CLT-05]
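Turning play events into segment references, as the 5-second segmentation implies, might look like the following. This is a minimal sketch: the event format (object id, start second, end second) and the sample events are hypothetical, and the slides do not describe how the original logs were processed.

```python
import math
from collections import Counter

SEGMENT_SECONDS = 5  # segmentation unit stated on the slide


def segment_ids(start_s, end_s, segment_s=SEGMENT_SECONDS):
    """Indices of the fixed-length segments touched by playing [start_s, end_s)."""
    if end_s <= start_s:
        return range(0)
    first = math.floor(start_s / segment_s)
    last = math.ceil(end_s / segment_s)  # exclusive upper bound
    return range(first, last)


def segment_reference_counts(plays):
    """Count references per (object, segment) from (object_id, start_s, end_s) play events."""
    counts = Counter()
    for object_id, start_s, end_s in plays:
        for seg in segment_ids(start_s, end_s):
            counts[(object_id, seg)] += 1
    return counts


# Hypothetical play events: (object id, start second, end second).
plays = [("clip-a", 0, 12), ("clip-a", 0, 4), ("clip-b", 5, 10)]
counts = segment_reference_counts(plays)
ranked = sorted(counts.values(), reverse=True)  # segment reference rank distribution
print(ranked)  # [2, 1, 1, 1]
```

Sorting the per-segment counts in descending order gives the segment reference rank distribution; with partial accesses, early segments of an object collect more references than later ones.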