Slide 1: GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management
Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha
The University of North Carolina at Chapel Hill & Microsoft Research

Slide 2: Sorting
"I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!"
- Don Knuth

Slide 3: Sorting
Well studied in:
- High performance computing
- Databases
- Computer graphics
- Programming languages
- ...
Central to Google's MapReduce algorithm, and a SPEC benchmark routine.

Slide 4: Massive Databases
- Terabyte data sets are common: Google sorts more than 100 billion terms in its index; over 1 trillion records in the indexed web.
- Database sizes are increasing rapidly: maximum database sizes grow roughly 3x per year.
- Processor improvements are not keeping pace with this information explosion.

Slide 5: CPU vs. GPU
[System diagram: CPU (3 GHz) with 2 x 1 MB cache and 2 GB of system memory, connected over AGP (512 MB aperture) or a PCI-E bus (4 GB/s) to a GPU (690 MHz) with 512 MB of video memory.]

Slide 6: External Memory Sorting
- Performed on terabyte-scale databases with limited main memory.
- A two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]:
  - Phase 1: partition the input file into large data chunks and write each out as a sorted chunk, known as a "run".
  - Phase 2: merge the runs to generate the final sorted file.
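To make the two-phase scheme concrete, here is a minimal single-machine Python sketch. The record size, run size, and use of temporary files are illustrative assumptions, not the GPUTeraSort implementation (which runs the phase-1 sort on the GPU):

```python
import heapq, os, tempfile

RECORD = 100          # bytes per record (assumed, SortBenchmark-style)
RUN_SIZE = 100 << 20  # bytes sorted per run in phase 1 (assumed)

def phase1_make_runs(infile):
    """Phase 1: read large chunks, sort each in memory, write sorted runs."""
    runs = []
    with open(infile, 'rb') as f:
        while True:
            chunk = f.read(RUN_SIZE)
            if not chunk:
                break
            records = [chunk[i:i + RECORD] for i in range(0, len(chunk), RECORD)]
            records.sort()                       # in-memory sort of one run
            with tempfile.NamedTemporaryFile(delete=False) as run:
                run.write(b''.join(records))
                runs.append(run.name)
    return runs

def phase2_merge(runs, outfile):
    """Phase 2: k-way merge of the sorted runs into the final sorted file."""
    def records(path):
        with open(path, 'rb') as f:
            while True:
                rec = f.read(RECORD)
                if not rec:
                    return
                yield rec
    with open(outfile, 'wb') as out:
        for rec in heapq.merge(*(records(r) for r in runs)):
            out.write(rec)
    for r in runs:
        os.unlink(r)
```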

Slide 7: External Memory Sorting
- Performance is governed mainly by I/O.
- Salzberg analysis: given the main memory size M and the file size N, if the I/O read size per run in phase 2 is T, external memory sorting achieves efficient I/O performance when the run size R in phase 1 satisfies R ≈ √(T·N).

Slide 8: External Memory Sorting
[Figure: the same analysis, highlighting the file size N.]

Slide 9: External Memory Sorting
[Figure: the same analysis, highlighting the run size R.]

Slide 10: External Memory Sorting
[Figure: the same analysis, highlighting the per-run read size T.]

Slide 11: Salzberg Analysis
- If N = 100 GB and T = 2 MB, then R ≈ 230 MB.
- Large-data sorting on CPUs can achieve high I/O performance by sorting large runs.

Slide 12: Massive Data Handling on CPUs
- Sorting requires random memory accesses, but CPU caches are small (< 2 MB).
- Random memory access can be slower than even sequential disk access, shifting the bottleneck from I/O to memory.
- The memory-to-compute gap is widening.
- External memory sorting on CPUs can therefore have low performance, due to either high memory latency from cache misses or low I/O performance.
- Sorting is hard!

Slide 13: Graphics Processing Units (GPUs)
- Commodity processors for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Low-latency memory pipeline
- Programmable
- High growth rate

Slide 14: GPU: Commodity Processor
GPUs ship in cell phones, laptops, consoles (e.g., the PSP), and desktops.

Slide 15: Graphics Processing Units (GPUs)
- Commodity processors for graphics applications
- Massively parallel vector processors: 10x more operations per second than CPUs
- High memory bandwidth
- Low-latency memory pipeline
- Programmable
- High growth rate

Slide 16: Parallelism on GPUs
Graphics FLOPS: GPU ~1.3 TFLOPS vs. CPU ~25.6 GFLOPS.

Slide 17: Graphics Processing Units (GPUs)
- Commodity processors for graphics applications
- Massively parallel vector processors
- High memory bandwidth: 10x more memory bandwidth than CPUs; better hides memory latency
- Programmable
- High growth rate

Slide 18: Graphics Pipeline
[Pipeline diagram: vertex (programmable fp32 vertex processing), polygon setup, culling, and rasterization, then programmable per-pixel fp32 math, per-pixel texturing with fp16 blending, and Z-buffer, fp16 blending, anti-aliasing (MRT), all backed by memory at 56 GB/s.]
The shallow pipeline and high bandwidth hide memory latency.

Slide 19: Non-Graphics Pipeline Abstraction
[Diagram: the same pipeline recast for general-purpose work: programmable MIMD processing (fp32), programmable SIMD processing (fp32) over lists, SIMD "rasterization", data fetch with fp16 blending, and predicated writes with fp16 blend and multiple outputs, backed by memory.]
Courtesy: David Kirk, Chief Scientist, NVIDIA.

Slide 20: Graphics Processing Units (GPUs)
- Commodity processors for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Low-latency memory pipeline
- Programmable
- High growth rate

Slide 21: Technology Trends: CPU and GPU
[Chart: log of relative processing power over time. CPU lines (leading edge, mainstream desktop, value, mobile; clock rates 0.8 GHz to 4.4 GHz, once projected toward 31 GHz) follow the Moore's Law trajectory and are constrained by cooling cost; the GPU line grows roughly 3x every 18 months, driven by graphics requirements for an enhanced experience, before settling onto the Moore's Law trajectory.]

Slide 22: Architecture of Phase 1: GPUTeraSort

Slide 23: GPUs for Sorting: Issues
- No support for arbitrary writes: optimized CPU algorithms do not map; new algorithms (sorting networks) are required.
- Lack of support for general data types.
- Out-of-core algorithms are needed because GPU memory is limited.
- Difficult to program.

Slide 24: General Sorting on GPUs
- Sorting networks have no data dependencies, so they exploit the high parallelism of GPUs.
- To handle large keys, use a bitonic radix sort: perform a bitonic sort on the 4 most significant bytes (MSB) of the keys on the GPU, find sorted records whose 4-byte prefixes are equal, then proceed to the next 4 bytes within those groups, and so on (see the sketch below).
- This can handle keys of any length.
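A hedged CPU-side sketch of this key-prefix recursion; Python's built-in sort stands in for the GPU bitonic pass, and the function names are illustrative:

```python
def sort_long_keys(records, key_of, offset=0, width=4):
    """Sort `records` on `width` key bytes starting at `offset`, then
    recursively re-sort any group of records whose prefixes tie on the
    next `width` bytes. On the GPU each pass would be a bitonic sort of
    the 4-byte prefixes; here list.sort() stands in for that pass."""
    records.sort(key=lambda r: key_of(r)[offset:offset + width])
    i = 0
    while i < len(records):
        prefix = key_of(records[i])[offset:offset + width]
        j = i + 1
        while j < len(records) and key_of(records[j])[offset:offset + width] == prefix:
            j += 1
        # Only recurse when several records tie and their keys have more bytes.
        if j - i > 1 and len(prefix) == width:
            group = records[i:j]
            sort_long_keys(group, key_of, offset + width, width)
            records[i:j] = group
        i = j
```

For example, `sort_long_keys(recs, key_of=lambda r: r[:10])` sorts 10-byte keys with passes over bytes 0-3, then 4-7, then 8-9 of each tied group.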

Slide 25: GPU-Based Sorting Networks
- Represent the data as 2D arrays.
- Multi-stage algorithm; each stage involves multiple steps.
- In each step:
  1. Compare each array element against exactly one other element at a fixed distance.
  2. Perform a conditional assignment (MIN or MAX) at each element location.
A minimal sketch of such a network follows.
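As an illustration of one such step, and of a full network built from them, here is a small Python sketch of a bitonic sorting network. It assumes a power-of-two input length and is a CPU reference, not the GPU shader code:

```python
def bitonic_step(a, step, block):
    """One network step: every element is compared with exactly one partner
    at distance `step`, and a MIN or MAX is written at each position. The
    access pattern is independent of the data."""
    n = len(a)
    out = a[:]
    for i in range(n):
        partner = i ^ step                 # fixed-distance partner
        ascending = (i & block) == 0       # direction of this block
        lo, hi = min(a[i], a[partner]), max(a[i], a[partner])
        if i < partner:
            out[i] = lo if ascending else hi
        else:
            out[i] = hi if ascending else lo
    return out

def bitonic_sort(a):
    """Full multi-stage network: for block = 2, 4, ..., n run steps of
    distance block/2, block/4, ..., 1."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    block = 2
    while block <= n:
        step = block >> 1
        while step:
            a = bitonic_step(a, step, block)
            step >>= 1
        block <<= 1
    return a
```

Each `bitonic_step` touches every element exactly once with a data-independent pattern, which is the property that lets a step map well onto a single pass over the 2D array on the GPU.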

Slide 26: [Flash animation of the sorting network omitted (46 MB).]

Slide 27: 2D Memory Addressing
- GPUs are optimized for 2D representations.
- Map 1D arrays to 2D arrays.
- Minimum and maximum regions map to row-aligned or column-aligned quads.

Slide 28: 1D - 2D Mapping
[Figure: the MIN and MAX regions of one sorting-network step laid out in the 2D array.]

Slide 29: 1D - 2D Mapping
[Figure: the MIN regions.] The mapping effectively reduces the number of instructions per element.
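A small sketch of the index mapping, assuming power-of-two compare distances (as bitonic networks use) and a power-of-two 2D width; the function names are illustrative:

```python
def to_2d(i, width):
    """Map a 1D index to (row, column) in a 2D array of the given width."""
    return divmod(i, width)

def partners_share_row(i, step, width):
    """For a power-of-two compare distance `step` and power-of-two `width`,
    element i and its partner i ^ step land in the same row when step < width
    and in the same column when step >= width, so the MIN and MAX writes of
    one step cover row- or column-aligned rectangles of the 2D layout."""
    (r1, c1), (r2, c2) = to_2d(i, width), to_2d(i ^ step, width)
    return r1 == r2
```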

Slide 30: Sorting on GPU: Pipelining and Parallelism
[Figure: input vertices, then texturing, caching, and 2D quad comparisons, then sequential writes.]

Slide 31: Comparison with GPU-Based Algorithms
3-6x faster than prior GPU-based sorting algorithms!

Slide 32: GPU vs. High-End Multi-Core CPUs
- 2-2.5x faster than Intel's high-end processors, running hand-optimized CPU code from Intel Corporation.
- A single GPU's performance is comparable to a high-end dual-core Athlon.

Slide 33: Super-Moore's Law Growth
- Peak performance: 50 GB/s of memory bandwidth on a single GPU, effectively hiding memory latency at 15 GOP/s.
- Download URL:
- Covered in Slashdot and Tom's Hardware news headlines.

Slide 34: Implementation & Results
- Pentium IV PC ($170)
- NVIDIA 7800 GT ($270)
- 2 GB RAM ($152)
- 9 x 80 GB SATA disks ($477)
- SuperMicro motherboard & SATA controller ($325)
- Windows XP
- PC costs $1,469

Slide 35: Implementation & Results
- Indy SortBenchmark: 10-byte random string keys, 100-byte records.
- Sorted the maximum amount of data in 644 seconds.
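For reference, records of this shape can be mimicked with a small generator. This is a stand-in for the benchmark's official gensort tool, not a replica of it, and the payload format is an assumption:

```python
import os

def make_records(n):
    """Yield n Indy-style records: a 10-byte random key plus a 90-byte
    payload, 100 bytes per record."""
    for i in range(n):
        key = os.urandom(10)
        payload = b"%088d\r\n" % i          # 88 digits + CRLF = 90 bytes
        yield key + payload
```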

Slide 36: Overall Performance
Faster and more scalable than dual Xeon processors (3.6 GHz)!

Slide 37: Performance/$
- 1.8x faster than the current terabyte sorter.
- World's best price-to-performance sorting system.

Slide 38: Analysis: I/O Performance
[Chart: peak sequential disk throughput in MB/s. Salzberg analysis suggests a 100 MB run size.]

Slide 39: Analysis: I/O Performance
Pentium IV: 25 MB run size (to reduce memory latency); less work, but only 75% I/O efficient. (Salzberg analysis: 100 MB run size.)

Slide 40: Analysis: I/O Performance
Dual 3.6 GHz Xeons: 25 MB run size (to reduce memory latency); more cores and less work, but only 85% I/O efficient. (Salzberg analysis: 100 MB run size.)

Slide 41: Analysis: I/O Performance
7800 GT: 100 MB run size; ideal work, and 92% I/O efficient with a single CPU. (Salzberg analysis: 100 MB run size.)

Slide 42: Task Parallelism
- Performance is limited by I/O and memory, so sorting is overlapped with the other stages (see the sketch below).
- Sorting 100 MB on the GPU is more than 3x faster than the reorder or sequential I/O stages.
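A minimal sketch of how the stages can be overlapped with a small pipeline. The thread-and-queue structure is an assumption for illustration; in GPUTeraSort the sort stage runs on the GPU while the CPU handles I/O and reordering:

```python
import threading, queue

def pipeline(read_chunk, sort_chunk, write_chunk):
    """Overlap reads, sorting, and writes of successive runs.

    read_chunk() yields unsorted chunks, sort_chunk() sorts one chunk
    (the GPU's job in GPUTeraSort), write_chunk() writes one sorted run.
    Queues of depth 1 keep all three stages busy at once."""
    to_sort, to_write = queue.Queue(maxsize=1), queue.Queue(maxsize=1)

    def sorter():
        while True:
            chunk = to_sort.get()
            if chunk is None:
                to_write.put(None)
                return
            to_write.put(sort_chunk(chunk))

    def writer():
        while True:
            run = to_write.get()
            if run is None:
                return
            write_chunk(run)

    threads = [threading.Thread(target=sorter), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for chunk in read_chunk():
        to_sort.put(chunk)
    to_sort.put(None)
    for t in threads:
        t.join()
```

With reads, sorting, and writes in flight simultaneously, the disks stay busy while the previous run is being sorted.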

Slide 43: Why GPU-like Architectures for Large Data Management?
[Chart: CPU performance growth reaching a plateau (the data management performance crisis), contrasted with the GPU growth curve.]

Slide 44: Advantages
- Exploits the high memory bandwidth of GPUs: higher memory performance than CPU-based algorithms.
- High I/O performance due to large run sizes.

Slide 45: Advantages
- Offloads work from CPUs; CPU cycles are well utilized for resource management.
- A scalable solution for large databases.
- Best performance/price solution for terabyte sorting.

Slide 46: Limitations
- May not work well on variable-sized keys or almost-sorted databases.
- Requires programmable GPUs (GPUs manufactured after 2003).

Slide 47: Conclusions
- Designed new sorting algorithms on GPUs that handle wide keys and long records.
- The memory-efficient sorting algorithm achieves roughly 10x higher memory performance, with a peak of 50 GB/s and 15 GOP/s on a single GPU.

Slide 48: Conclusions
- A novel external memory sorting algorithm that scales and achieves peak I/O performance on CPUs.
- Best performance/price solution: the world's fastest sorting system.
- High performance growth rate: improving 2-3x per year.

Slide 49: Future Work
- Designed high performance/price solutions, but the high wattage and cooling requirements of CPUs and GPUs remain a concern.
- To exploit GPUs, we need easy-to-use programming APIs; promising directions include BrookGPU, Microsoft Accelerator, Sh, etc.
- Scientific libraries utilizing the high parallelism and memory bandwidth: routines for LU, QR, SVD, FFT, etc.; a BLAS library on GPUs; eventually, GPU-LAPACK and Matlab routines.

Slide 50: GPUFFTW
- N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear).
- 4x faster than IMKL on high-end quad cores.
- Download URL:
- SlashDot headlines, May 2006.

Slide 51: GPU Roadmap
- GPUs are becoming more general purpose: fewer limitations in the Microsoft DirectX 10 API (better and more consistent floating-point support, integer instruction support, more programmable stages, etc.).
- Significant advances in performance.
- GPUs are being widely adopted in commercial applications, e.g., Microsoft Vista.

Slide 52: Call to Action
- Don't put all your eggs in the multi-core basket: if you want TeraOps, go where they are; if you want memory bandwidth, go where the memory bandwidth is.
- The CPU-GPU gap is widening.
- The Microsoft Xbox is 1/2 TeraOP today. [Figure annotations: 40 GOPS, 40 GB/s.]

Slide 53: Acknowledgements
Research sponsors: Army Research Office; Defense Advanced Research Projects Agency; National Science Foundation; Naval Research Laboratory; Intel Corporation; Microsoft Corporation (Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou); NVIDIA Corporation; RDECOM.

Slide 54: Acknowledgements
David Tuft (UNC); the UNC Systems, GAMMA, and Walkthrough groups.

Slide 55: Thank You
Questions or comments?