Overview of MSR External Research Earth, Energy, and MSR Environmental Ecosystem Conceptual Model Projects: Trident, GrayWulf, Dryad and DryadLINQ
Research locations:
● Redmond, Washington (Sept, 1991)
● San Francisco, California (Jun, 1995)
● Cambridge, United Kingdom (July, 1997)
● Beijing, China (Nov, 1998): MSR Asia
● Silicon Valley, California (July, 2001)
● Bangalore, India (Jan, 2005): MSR India
● Cambridge, Massachusetts (July, 2008): MSR New England
● Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
● Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
● Developing advanced technologies and services to support every stage of the research process
● Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology
● Core Computer Science
● Earth, Energy & Environment
● Education & Scholarly Communication
● Health & Wellbeing
● Advanced Research Tools and Services
● Community and Geographic Outreach
Earth, Energy & Environment
● Visualizing and Experiencing E³ Data + Information: Provide a unique experience to reduce time to insight and knowledge through visualizing data and information
● Accessible Data: Ensure E³ data (remote and local sensing) is easily accessible and consumable in the scientist's domain
● Enabling Scientific Collaboration: Look at new ways to enable collaboration in scientific virtual organizations
[Figure: conceptual model diagram. Labels: Data, Analysis, Insight, Publish, Knowledge, Action; Inform, Communicate, Decide, Implement]
Each of these potentially impacts the technology, user interface, and API design:
● I want to visualize ocean processes and share my analysis.
● I want to do this more than once and get exactly the same answer.
● I want to do this more than once, but don’t care if I get exactly the same answer.
● I’m only going to do this once and don’t care about keeping the data or the results long term (but I need to remember the inputs).
● I want to store the data in
● I want full provenance to validate a result, OPM compliant.
● I want to use my own provenance management system.
● Each group may wish a different UI (no WF), or authoring tool.
● I only want NCAR, MBARI, etc. data because I trust it.
● I know that Jon really wants my results to drive his model and I want to share my workflow and executables.
● Visually program workflows.
● Libraries of activities and workflows, to save and reuse workflows.
● Abstract parallelism for HPC, to test on desktop and then run on cluster.
● Automatic provenance capture, for all workflows and data products.
● Integrated data storage and access: researchers can store data in a SQL database, in local files, or in the cloud (Microsoft SDS, Amazon S3).
● Reproducible research.
[Screenshot panels: Composition Space, Activity Library, Workflow Library, Data Options & Sharing]
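The automatic provenance capture listed above can be illustrated with a minimal Python sketch. This is not Trident's API (Trident builds on Windows Workflow Foundation); `provenance_log`, `fingerprint`, `tracked`, and `mean_temperature` are hypothetical names. The sketch only records a content hash of each activity's inputs and output, which is enough to check whether a rerun with the same inputs gave the same answer.

```python
import hashlib
import json
from functools import wraps

# Hypothetical in-memory provenance store; a real system would persist this.
provenance_log = []

def fingerprint(value):
    """Return a stable SHA-256 hash of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def tracked(activity):
    """Wrap a workflow activity so each call appends a provenance record."""
    @wraps(activity)
    def wrapper(*args, **kwargs):
        result = activity(*args, **kwargs)
        provenance_log.append({
            "activity": activity.__name__,
            "inputs": fingerprint({"args": args, "kwargs": kwargs}),
            "output": fingerprint(result),
        })
        return result
    return wrapper

@tracked
def mean_temperature(readings):
    """Toy activity: average a list of sensor readings."""
    return sum(readings) / len(readings)

# Running the same activity twice on the same inputs yields identical records.
first = mean_temperature([11.5, 12.0, 12.5])
second = mean_temperature([11.5, 12.0, 12.5])
```

Comparing the two records in `provenance_log` is one simple way to test the "more than once and get exactly the same answer" requirement; a fuller provenance model would also record software versions, parameters, and data sources.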
Pan-STARRS (Astronomy)
● One of the largest visible light telescopes
● Four unit telescopes acting as one
● One Gigapixel per telescope
● Survey entire visible universe in 1 week
● Catalog solar system, moving objects/asteroids
● ps1sc.org: Univ. Hawaii, Johns Hopkins, …
● 1 PB of raw image data/year
● 2.5 TB image data | 1000 images | 150 M detections / night
● 30 TB of processed data per year
● 5.5 Billion celestial objects
● 350 Billion detections
● The largest astronomy DB in the world! And the platform to build it upon!
Table: telescope comparison
Columns: Telescope | Telescope diameter (m) | Effective collecting area (m²) [A] | Solid angle subtended by field of view (deg²) [D] | Nominal image quality (arcsec) [Q] | The survey power [AD/Q²] | Status
Rows: UH 2.2-m/PFCam | Palomar/QUEST | CFHT/Megacam (Active) | Subaru/Suprime-Cam (Active) | Pan-STARRS | DMT/LSST
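The survey power figure of merit from the table header, AD/Q², and the nightly-to-yearly data volume can be worked through in a few lines of Python. The telescope values below are hypothetical inputs chosen for easy arithmetic, not rows from the table, and the yearly figure assumes observing every night.

```python
def survey_power(collecting_area_m2, field_of_view_deg2, image_quality_arcsec):
    """Survey power A * D / Q**2, using the column definitions from the table."""
    return collecting_area_m2 * field_of_view_deg2 / image_quality_arcsec ** 2

# Hypothetical telescope, NOT a row from the table:
# A = 2 m^2, D = 3 deg^2, Q = 0.5 arcsec.
example_power = survey_power(2.0, 3.0, 0.5)

# Nightly image volume quoted on the slide, scaled to a full year.
TB_PER_NIGHT = 2.5
tb_per_year = TB_PER_NIGHT * 365  # upper bound: assumes observing every night
```

Under that assumption the yearly volume comes to roughly 0.9 PB, in line with the "1 PB of raw image data/year" figure. Note that halving Q (sharper images) quadruples the survey power, since Q enters squared.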
GrayWulf: software & hardware design principles for data intensive science
● Enhances the Beowulf model with storage co-located with commodity HPC nodes
● Databases for fast queries on index
● High sequential I/O bandwidth for varying query patterns
● Scale out instead of scale up
● The GrayWulf name pays tribute to Jim Gray, who was actively involved in defining these design principles.
[Figure: GrayWulf architecture. Components: Shared Compute Resources; Shared Queryable Data Store; Configuration Management, Health and Performance Monitoring; Operator User Interface; User Interface; Data Valet User Interface; Valet Workflow; User Workflow; User Storage; Data Valet Queryable Data Store; User Queryable Data Store. Legend: Data Flow, Control Flow]
● Cluster - Scheduling & Monitoring: Windows HPC 2008 Cluster
● Database - Shared Domain DBs & User MyDBs: SQL Server 2008
● Trident Workflow Workbench: Windows Workflow Foundation, Composer, Registry, Provenance/Logging
● Common data management library
● Domain specific user interfaces: Scientists, Data Valets, System Operations
● 3000 node cluster
● 12,000 cores (36 x cycles/sec)
● 48 terabytes of RAM
● 9 petabytes of persistent storage
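Dividing the cluster totals above by the node count gives a rough per-node picture. This is a back-of-the-envelope sketch that assumes identical nodes and decimal units (1 TB = 1000 GB, 1 PB = 1000 TB); neither assumption is stated on the slide.

```python
# Cluster totals as quoted on the slide.
NODES = 3000
CORES = 12_000
RAM_TB = 48
STORAGE_PB = 9

# Per-node averages, assuming a homogeneous cluster and decimal units.
cores_per_node = CORES / NODES
ram_gb_per_node = RAM_TB * 1000 / NODES
storage_tb_per_node = STORAGE_PB * 1000 / NODES
```

That works out to 4 cores, 16 GB of RAM, and 3 TB of storage per node on average.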
● Continuously deployed since 2006
● Running on >> 10^4 machines
● Sifting through > 10 PB of data daily
● Runs on clusters > 3000 machines
● Handles jobs with > 10^5 processes each
● Used by >> 100 developers
● Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
● Automatic plan generated by DryadLINQ
● Automatic distributed execution by Dryad
● Programmer writes sequential C#, VB,… code
  – System figures out the data-parallelism
  – Manages execution, traditional parallel-DB tricks
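The idea of writing a sequential-looking query and letting the system parallelize it can be sketched in Python. This is only an illustration of the concept, not DryadLINQ's API (DryadLINQ programs are written in .NET languages): the `Query` class and `run_partitioned` function are hypothetical, and local threads stand in for cluster machines.

```python
from concurrent.futures import ThreadPoolExecutor

class Query:
    """Records a chain of operators instead of running them immediately."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def where(self, predicate):
        return Query(self.steps + (("where", predicate),))

    def select(self, projection):
        return Query(self.steps + (("select", projection),))

    def apply(self, records):
        """Run the recorded plan sequentially over one collection."""
        for kind, fn in self.steps:
            if kind == "where":
                records = [r for r in records if fn(r)]
            else:
                records = [fn(r) for r in records]
        return list(records)

def run_partitioned(query, partitions):
    """Apply the same plan to every partition in parallel, then concatenate."""
    with ThreadPoolExecutor() as pool:
        pieces = list(pool.map(query.apply, partitions))
    return [record for piece in pieces for record in piece]

# The programmer's view: one sequential-looking query.
plan = Query().where(lambda x: x % 2 == 0).select(lambda x: x * x)

data = list(range(10))
partitions = [data[0:4], data[4:7], data[7:10]]  # stand-ins for machines

sequential = plan.apply(data)
distributed = run_partitioned(plan, partitions)
```

Because `where` and `select` act on one record at a time, running the plan per partition and concatenating gives the same result as running it over the whole data set. Operators such as grouping or joins would need a data exchange between partitions, which this sketch leaves out.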
A radical approach to programming at scale
● Nodes talk to each other as little as possible (shared nothing)
● Programmer is not allowed to communicate between nodes
● Data is spread throughout machines in advance; computation happens where it’s stored
● Master program divvies up tasks based on the location of data: it schedules each task on the machine where its data resides, or at least in the same rack, detects worker failures and restarts tasks, load balances, runs redundant executions, etc…
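The locality preference described above (same machine, else same rack, else anywhere) can be sketched as a small placement function. This is a toy model under stated assumptions, not Dryad's scheduler: `place_task`, the cluster layout, and the slot counts are all hypothetical, and failure detection, restarts, and redundant execution are left out.

```python
def place_task(block, replicas, rack_of, free_slots):
    """Pick a machine for the task that reads `block`.

    Preference order: a machine holding the data, then a machine in the
    same rack as a replica, then any machine with a free slot.
    """
    holders = replicas[block]
    candidates = [m for m in holders if free_slots.get(m, 0) > 0]
    if not candidates:
        holder_racks = {rack_of[m] for m in holders}
        candidates = [m for m in sorted(free_slots)
                      if free_slots[m] > 0 and rack_of[m] in holder_racks]
    if not candidates:
        candidates = [m for m in sorted(free_slots) if free_slots[m] > 0]
    if not candidates:
        return None  # cluster is full; the caller must retry later
    chosen = candidates[0]
    free_slots[chosen] -= 1
    return chosen

# Hypothetical cluster: four machines in two racks, one task slot each.
rack_of = {"m1": "rackA", "m2": "rackA", "m3": "rackB", "m4": "rackB"}
free_slots = {"m1": 1, "m2": 1, "m3": 1, "m4": 1}
# All three blocks live on m1, so only the first task can run data-local.
replicas = {"block0": ["m1"], "block1": ["m1"], "block2": ["m1"]}

placements = [place_task(b, replicas, rack_of, free_slots)
              for b in ["block0", "block1", "block2"]]
```

In this setup the first task lands on the machine holding the data, the second falls back to another machine in the same rack, and the third goes to whichever machine still has a free slot.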
● The goal of the analysis is to execute a set of analysis functions on a collection of data files produced by high-energy physics experiments
● Histogramming of events from a large data set (TBs)
● A DryadLINQ program provides an easy way to distribute the computation across the cluster
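The histogramming pattern can be sketched as a per-file partial histogram followed by a merge. This is a minimal Python illustration of the pattern, not the DryadLINQ program from the talk: the bin width, the sample values, and the function names are hypothetical, and a thread pool stands in for the cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BIN_WIDTH = 10.0  # hypothetical bin width, in whatever units the events use

def histogram_file(event_values):
    """Map step: bin the events of one file into a partial histogram."""
    return Counter(int(value // BIN_WIDTH) for value in event_values)

def merge(partials):
    """Reduce step: add partial histograms bin by bin."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Stand-ins for data files; real inputs would be read from storage.
files = [
    [1.0, 12.5, 14.0],
    [3.0, 25.0],
    [11.0, 29.9, 8.0],
]

# Each file is histogrammed independently, so the work parallelizes cleanly.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(histogram_file, files))

histogram = merge(partials)
```

Only the small partial histograms move between workers, not the event data itself, which is why this kind of analysis fits the shared-nothing model well.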
Broad academic/research release of Dryad and DryadLINQ (binary for now, source release in planning), with tutorials, programming guides, sample code, libraries, and a community site.
© 2009 Microsoft Corporation. All rights reserved.