Expediting Peer-to-Peer Simulation using GPU. Di Niu, Zhengjun Feng. Apr. 14th, 2009.


Expediting Peer-to-Peer Simulation using GPU Di Niu, Zhengjun Feng Apr. 14th, 2009

Outline
- An Introduction to P2P Content Distribution
- Overview of P2P Algorithms and Simulation
- GPU Algorithm Design
- Performance Evaluation
- Summary & Future Work
- Discussion

Introduction
Current-generation P2P content distribution (BitTorrent):
- A large file is broken into k blocks
- N participating peers form a random topology
- Neighboring peers "gossip" the data blocks
- Each peer has a buffer map indicating which blocks it holds: a binary array of size k, where B[x] = 1 means the peer has block x
- Seed: a peer with all data blocks
- Downloader: a peer with a subset of the data blocks
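The buffer-map bookkeeping above can be modeled in a few lines of Python (a CPU-side reference sketch; the block count and helper names are illustrative, not from the slides):

```python
K = 8  # number of blocks the file is split into (illustrative)

def make_peer(blocks=()):
    """Buffer map: a binary array of size K, B[x] = 1 iff the peer holds block x."""
    b = [0] * K
    for x in blocks:
        b[x] = 1
    return b

def is_seed(buf):
    """A seed holds every data block."""
    return all(buf)

seed = make_peer(range(K))
downloader = make_peer([0, 3])
```

With these helpers, `is_seed(seed)` is true while `is_seed(downloader)` is false.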

P2P Algorithms: RUB & GRF
- Random Useful Block (RUB): in each round, each peer selects a random neighbor and transmits a random useful block.
- Global Rarest First (GRF): in each round, each peer selects a random neighbor and transmits the useful block that is rarest, as tracked by a global counter per block: Count[x] = 7 means there are 7 copies of block x in the network.
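The two selection rules can be sketched side by side in Python (a reference model; function names are illustrative):

```python
import random

def useful_blocks(sender, receiver):
    """Blocks the sender holds that the receiver still lacks."""
    return [x for x in range(len(sender)) if sender[x] and not receiver[x]]

def pick_rub(sender, receiver, rng=random):
    """RUB: pick any useful block uniformly at random."""
    u = useful_blocks(sender, receiver)
    return rng.choice(u) if u else None

def pick_grf(sender, receiver, count):
    """GRF: pick the useful block with the fewest copies, per the global Count array."""
    u = useful_blocks(sender, receiver)
    return min(u, key=lambda x: count[x]) if u else None
```

For sender = [1, 1, 1, 0], receiver = [1, 0, 0, 0], and count = [3, 1, 2, 5], pick_grf returns 1 (the rarest useful block), while pick_rub returns either 1 or 2.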

Simulating P2P Algorithms
- Simulation methods: synchronized, unsynchronized
- Each run starts with one seed and ends when all N peers are seeds

RUB Design on GPU, Without Shared Memory
Each thread handles a peer:
- Step 1: generate a random number
- Step 2: get the random target peer
- Step 3: compare the data block matrices -> count of differing blocks
- Step 4: generate another random number
- Step 5: compare the data block matrices again -> index of the block to send
- Step 6: transmit the data block to the target peer
- Step 7: count the seeds
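Steps 1-7 above amount to the following per-round logic. This sequential Python sketch mirrors what the per-peer threads compute; the serial loop sidesteps the concurrent-write conflicts a GPU kernel must handle, so it is a reference model, not the CUDA design:

```python
import random

def rub_round(buffers, neighbors, rng=random):
    """One synchronized round: each peer sends one random useful block
    to one randomly chosen neighbor; returns the seed count (step 7)."""
    K = len(buffers[0])
    # Steps 1-2: every peer draws its random target peer.
    targets = [rng.choice(neighbors[p]) for p in range(len(buffers))]
    for p, q in enumerate(targets):
        # Steps 3-5: compare buffer maps and pick a random useful block.
        useful = [x for x in range(K) if buffers[p][x] and not buffers[q][x]]
        if useful:
            # Step 6: transmit the block to the target peer.
            buffers[q][rng.choice(useful)] = 1
    # Step 7: count the seeds.
    return sum(all(b) for b in buffers)
```

Repeating `rub_round` until it returns N simulates one run from one seed to all-seeds.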

RUB Design on GPU, With Shared Memory (store the buffer differences)
Each thread handles a peer:
- Step 1: generate a random number
- Step 2: get the random target peer
- Step 3: compare the data block matrices; save the differing block indices to SMEM
- Step 4: generate another random number
- Step 5: get the data block index from SMEM
- Step 6: transmit the data block to the target peer
- Step 7: count the seeds

RUB Design on GPU: Shared Memory Bank Conflict Avoidance
- Block size: 64 threads
- Data type: USHORT (2 bytes)
- 16 KB / 64 threads = 256 B/thread = 128 data elements/thread (= 64 integers/thread)
- To avoid bank conflicts: 126 data elements/thread (= 63 integers/thread)
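The arithmetic on this slide checks out directly, assuming G80-class hardware (16 KB of shared memory per block, 16 banks of 32-bit words); the stride reasoning for why 63 beats 64 is my gloss, not stated on the slide:

```python
from math import gcd

SMEM_BYTES = 16 * 1024   # shared memory per thread block
BLOCK_SIZE = 64          # threads per block
USHORT = 2               # bytes per data element
BANKS = 16               # shared-memory banks (32-bit words)

per_thread_bytes = SMEM_BYTES // BLOCK_SIZE    # 256 B per thread
per_thread_data = per_thread_bytes // USHORT   # 128 USHORTs per thread
per_thread_ints = per_thread_data // 2         # 64 integers per thread

# Why 63 integers instead of 64: with a per-thread stride of 64 words,
# every thread in a half-warp lands in the same bank (64 is a multiple
# of 16); a stride of 63 words is coprime with 16, so the accesses
# spread across all banks.
conflict_free_64 = gcd(64, BANKS) == 1   # False: serialized accesses
conflict_free_63 = gcd(63, BANKS) == 1   # True: conflict-free
```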

GRF on GPU: Algorithm 1
Each thread handles one sender:
1. Choose a random neighbor to upload to
2. Compare the buffer difference between the sender and the receiver
3. Find the rarest block by checking Count (in global memory)
4. Update Count
5. Update the receiver's buffer
A separate kernel is used to update Count; this resolves the conflict of two peers transmitting the same block to the same receiver.
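The separate Count-update kernel can be modeled as a two-pass round: record every transmission first, then commit the buffer and Count updates once, so two senders pushing the same block to the same receiver count as a single delivery (a Python reference sketch with illustrative names):

```python
import random

def grf_round(buffers, neighbors, count, rng=random):
    """Pass 1 (per-sender threads): pick a receiver and the rarest useful block.
    Pass 2 (the separate 'kernel'): commit deduplicated updates to buffers and Count."""
    sends = []
    for p in range(len(buffers)):
        q = rng.choice(neighbors[p])
        useful = [x for x in range(len(count)) if buffers[p][x] and not buffers[q][x]]
        if useful:
            sends.append((q, min(useful, key=lambda x: count[x])))
    for q, x in set(sends):  # duplicate sends collapse to one delivery
        buffers[q][x] = 1
        count[x] += 1
```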

GRF on GPU: Algorithm 2
Each thread block acts as a sender; each thread handles one data block of that sender:
- Step 1: thread 0 in each sender selects a neighbor and stores the receiver in shared memory
- Step 2: buffer comparison: each block of the sender is compared against the corresponding block of the receiver, one thread per block
- Step 3: find the block with the smallest Count
- Step 4: thread 0 in each sender transmits the rarest block to the receiver
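Step 3 is effectively a parallel min-reduction over Count across the threads of one sender's block. A serial Python model of that tree reduction (candidate indices and names are illustrative):

```python
def rarest_block(candidates, count):
    """Tree-style min-reduction: pairs of (Count[x], x) are merged
    stride by stride, as the per-block threads would do in shared memory."""
    vals = [(count[x], x) for x in candidates]
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = min(vals[i], vals[i + stride])
        stride *= 2
    return vals[0][1]
```

Ties break toward the lower block index, since tuples compare element by element.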

Evaluation: RUB Speedup
- With SMEM: up to 8x speedup
- Without SMEM: up to 4.3x speedup

Evaluation: RUB Speedup
- With SMEM: 2x to 4.5x speedup
- Without SMEM: 1.2x to 2.7x speedup

Evaluation: GRF
- Each thread handles a peer: 7x speedup
- Each thread handles a data block: 21x speedup

Evaluation: GRF
- Each thread handles a peer: 8x speedup
- Each thread handles a data block: 21x speedup

Evaluation: GRF Speedup

Evaluation: GRF Speedup
- For data block/thread: grid size = 16384, block size = number of data blocks
- For peer/thread: grid size = number of peers / 512, block size = 512

Summary
RUB:
- Without SMEM: up to 4x speedup
- With shared memory: up to 8x speedup
GRF:
- Each thread handles a peer: up to 8x speedup
- Each thread handles a block: up to 21x speedup

Future Work
RUB:
- Design RUB with "one thread per data block"
- Difficulty: randomly selecting one thread among many parallel threads
GRF:
- Handle more data blocks (> 512)
- Let each thread handle multiple data blocks of a sender

Discussion Thanks! Any questions?