The Future of Scientific Computing at Harvard Alyssa A. Goodman Professor of Astronomy Director, Initiative in Innovative Computing
“The Heavy Red Bag” How can computers advance (my) science?
A new collaborative scientific initiative at Harvard.
Computational challenges are common across scientific disciplines
How to:
Acquire, transmit, organize, and query new kinds of data?
Apply distributed computing resources to solve complex problems?
Derive meaningful insight from large datasets?
Share, integrate and analyze knowledge across geographically dispersed researchers?
Visually represent scientific results so as to maximize understanding?
Opportunity to collaborate and apply insights from one field to another
Filling the “Gap” between Science and Computer Science
Scientific disciplines: Increasingly, core problems in science require computational solution. Typically hire/“home grow” computationalists, but often lack the expertise or funding to go beyond the immediate pressing need.
Computer Science departments: Focused on finding elegant solutions to basic computer science challenges. Often see specific, “applied” problems as outside their interests.
“Workflow” & “Continuum”
Workflow Examples (Astronomy | Public Health)
“Collect”: Telescope | Microscope, Stethoscope, Survey
COLLECT: “National Virtual Observatory”/COMPLETE | CDC Wonder
“Analyze”: Study the density structure of a star-forming glob of gas | Find a link between one factory’s chlorine runoff & disease
ANALYZE: Study the density structure of all star-forming gas in… | Study the toxic effects of chlorine runoff in the U.S.
“Collaborate”: Work with your student
COLLABORATE: Work with 20 people in 5 countries, in real-time
“Respond”: Write a paper for a Journal.
RESPOND: Write a paper, the quantitative results of which are shared globally, digitally.
IIC contact: AG, FAS Workflow
Workflow a.k.a. The Scientific Method (in the Age of High-Speed Networks, Fast Processors, Mass Storage, and Miniature Devices) IIC contact: Matt Welsh, FAS
Workflow: The Harvard Virtual Brain
Establishing a Harvard-wide Neuroscience Infrastructure
Faculty of Arts and Sciences: Harvard College, Division of Engineering
Harvard School of Public Health
Faculty of Medicine: Harvard Medical School, Affiliated Teaching Hospitals
Data Acquisition: MRI, PET, Microscopy, etc.
Distributed Data Storage
Data Processing: Analysis, Visualization, Integration, etc.
Information Access: Query, Statistical Analysis, Knowledge Management, etc.
Harvard IIC
IIC contact: David Kennedy, HMS/MGH
New technologies for measurement and simulation are transforming the “workflow.”
Biomedicine, pre-genomics: Manual/low throughput; Solitary; Limited by two hands; Analog
Biomedicine, genomics era: High throughput; Automated/networked; Highly scalable; Digital
Continuum “Pure” Discipline Science (e.g. Galileo) “Pure” Computer Science (e.g. Turing) “Computational Science” Missing at Most Universities
Workflow & Continuum For any particular scientific investigation: Where does, and could, “computational science” make improvements in this cycle?
Harvard Public Health “NOW” (Oct. 2004) "In the past, experiments did not involve such large data sets," observed Dyann Wirth, professor of infectious diseases in the Department of Immunology and Infectious Diseases and member of the advisory group for the core. "There has been a dramatic change in the past five to 10 years in the amount and availability of genomic data [or the DNA sequences themselves] and functional genomic data, [or the sequences’ purpose]." In the past five years alone, the genomes of humans, rats, and the malaria parasite Plasmodium falciparum have been published, for example. Dyann Wirth "One of the purposes of bioinformatics is to reduce the number of experiments that need to be done to achieve reliable information," said L.J. Wei, professor of biostatistics in the Department of Biostatistics and member of the advisory group for the core. "However, an issue right now is that there are huge data sets that can be run through different kinds of software programs, ending up with many data points. Unless we understand and use bioinformatics well, we may not even know which of those data points are important." L.J. Wei
Filling the “computational science” gap: IIC
Problem-driven approach …focusing effort on solving problems that will have greatest impact & educational value
Collaborative projects …combining disciplinary knowledge with computer science expertise
Interdisciplinary effort …to ensure that best practices are shared across fields and that new tools and methodologies will be broadly applicable
Links with industry …to draw on and learn from experience in applied computation
Institutional funding …to ensure effort is directed towards key needs and not driven solely by narrow priorities of funding agencies
IIC at Harvard
Numerical Simulation of Star Formation
Bate, Bonnell & Bromm 2002 (UKAFF)
MHD turbulence gives “t=0” conditions; Jeans mass = 1 M_sun
50 M_sun, 0.38 pc, n_avg = 3 x 10^5 ptcls/cc; forms ~50 objects
T = 10 K; SPH, no B or movie = 1.4 free-fall times
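The movie length on this slide is quoted in free-fall times. As a rough sketch, the free-fall time of a uniform-density sphere is t_ff = sqrt(3*pi / (32*G*rho)), and the Python below evaluates it for a given particle number density. The mean mass per particle (2.33 hydrogen masses, a common illustrative value for molecular gas) and the function names are assumptions, not taken from the slide or from the cited simulation.

```python
import math

G_CGS = 6.674e-8            # gravitational constant, cm^3 g^-1 s^-2
M_H_G = 1.6735e-24          # mass of a hydrogen atom, g
SECONDS_PER_YEAR = 3.156e7  # approximate length of a year in seconds


def free_fall_time_years(n_per_cc, mean_mass_in_m_h=2.33):
    """Free-fall time of a uniform sphere: t_ff = sqrt(3*pi / (32*G*rho)).

    n_per_cc: particle number density in cm^-3.
    mean_mass_in_m_h: ASSUMED mean mass per particle, in hydrogen masses.
    """
    rho = n_per_cc * mean_mass_in_m_h * M_H_G  # mass density, g cm^-3
    t_ff_seconds = math.sqrt(3.0 * math.pi / (32.0 * G_CGS * rho))
    return t_ff_seconds / SECONDS_PER_YEAR


def movie_duration_years(n_per_cc, n_free_fall_times=1.4):
    """Physical time spanned by a movie lasting a given number of free-fall times."""
    return n_free_fall_times * free_fall_time_years(n_per_cc)


if __name__ == "__main__":
    n_avg = 3e5  # particles per cc, as quoted on the slide
    print("t_ff (yr):", free_fall_time_years(n_avg))
    print("movie (yr):", movie_duration_years(n_avg))
```

The result scales as the inverse square root of the density, so the answer is sensitive to which density (and which mean particle mass) is adopted.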
Simulations & Public Health
Goal: Statistical Comparison of “Real” and “Synthesized” Star Formation Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002.
Measuring Motions: Molecular Line Maps
Alves, Lada & Lada 1999 Radio Spectral-Line Survey Radio Spectral-line Observations of Interstellar Clouds
Velocity from Spectroscopy: Telescope, Spectrometer, Observed Spectrum (Intensity vs. "Velocity"). All thanks to Doppler.
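The step from an observed spectrum to a "velocity" axis is the Doppler shift. A minimal sketch in Python, assuming the radio convention v = c * (nu_rest - nu_obs) / nu_rest; the function names are hypothetical, and the 13CO rest frequency is an approximate value included only for illustration.

```python
C_KM_S = 299_792.458  # speed of light in km/s

# Approximate rest frequency of the 13CO J=1-0 line in GHz (ASSUMPTION for
# illustration; consult a spectral-line catalog before using on real data).
NU_REST_13CO_GHZ = 110.201


def radio_velocity_km_s(nu_obs, nu_rest):
    """Line-of-sight velocity (km/s) in the radio convention.

    Positive values mean the source is receding (observed frequency is lower
    than the rest frequency). Both frequencies must share the same unit.
    """
    return C_KM_S * (nu_rest - nu_obs) / nu_rest


def observed_frequency(v_km_s, nu_rest):
    """Inverse of radio_velocity_km_s: frequency at which a line is observed."""
    return nu_rest * (1.0 - v_km_s / C_KM_S)


def frequency_axis_to_velocity(freqs, nu_rest):
    """Convert a spectrometer's frequency channels to a velocity axis."""
    return [radio_velocity_km_s(f, nu_rest) for f in freqs]


if __name__ == "__main__":
    nu = observed_frequency(10.0, NU_REST_13CO_GHZ)
    print(nu, radio_velocity_km_s(nu, NU_REST_13CO_GHZ))
```

The radio convention is linear in frequency, which is why spectrometer channels map directly onto evenly spaced velocity channels; other conventions (optical, relativistic) differ slightly at high velocities.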
Barnard’s Perseus; COMPLETE/FCRAO W(13CO)
IRAS N_dust; H-alpha emission, WHAM/SHASSA Surveys (see Finkbeiner 2003); HH; 2MASS/NICER Extinction
“Astronomical Medicine” Excerpts from Junior Thesis of Michelle Borkin (Harvard College); IIC Contacts: AG (FAS) & Michael Halle (HMS/BWH/SPL)
IC 348
“Astronomical Medicine”
After “Medical Treatment” Before “Medical Treatment”
3D Slicer Demo (available after talk) IIC contacts: Michael Halle & Ron Kikinis
IIC: Five Research Branches
Visualization: Physically meaningful combination of diverse data types.
Distributed Computing: e-Science aspects of large collaborations. Sharing of data and computational resources and tools in real-time.
Databases/Provenance: Management, and rapid retrieval, of data. “Research reproducibility” …where did the data come from? How?
Analysis & Simulations: Development of efficient algorithms. Cross-disciplinary comparative tools (e.g. statistical).
Instrumentation: Improved data acquisition. Novel hardware approaches (e.g. GPUs, sensors).
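The provenance questions on this slide ("where did the data come from? How?") can be made concrete with a small record attached to each dataset. The Python below is only a sketch under stated assumptions: the record fields, function names, and example values are hypothetical and do not describe any IIC system.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data, source, steps):
    """Build a minimal provenance record for a blob of data (bytes).

    source: free-text description of where the data came from (hypothetical).
    steps: ordered list of processing steps applied so far (hypothetical).
    """
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact bytes
        "source": source,
        "steps": list(steps),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def matches(data, record):
    """Check that data is byte-for-byte what the record describes."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


if __name__ == "__main__":
    raw = b"placeholder measurement bytes"
    rec = provenance_record(raw, "hypothetical instrument", ["calibrate", "regrid"])
    print(json.dumps(rec, indent=2))
    print("unchanged:", matches(raw, rec))
```

Hashing the bytes answers "is this the same data?", while the source and steps fields answer "where from, and how?"; a real system would also need to record software versions and parameters.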
IIC: Innovative Organizational Model
Culture: No “class” distinctions made between teaching and non-teaching faculty, scientists and engineers, artists and designers working in the visualization program
Staffing: Highly accomplished academics and senior experts whose careers have been primarily in industry, working together
Promotion/career path: Criteria for promotion will give equal weight to scholarly activities and to technological invention
How IIC will Function: Overview
Project selection; Project execution; Dissemination of knowledge
Identify and fund projects that are likeliest to have the greatest and broadest impact
Pursue projects in a way that will yield best outcome, enable shared learning, etc.
IIC Objectives: Enable new research for specific scientific discipline; Generate new computational tools for broader application
Project Selection
Project proposals. Role: Submit proposal in response to call for ideas. Who participates: Any Harvard researcher (e.g., in genomics, fluid dynamics, epidemiology, neuroscience, nanoscience, comp bio, chemical biology, optics, geology, astronomy, quantum mechanics, etc.)
Program Advisory Committee. Role: Evaluate/rank proposals for scientific merit: should this be a priority for IIC? Who participates: Harvard researchers representing broad interests of IIC stakeholders, plus IIC Director & Dir. of Research
IIC Management Team. Role: Evaluate/prioritize proposals according to technical feasibility, assess resource needs. Who participates: Consists of IIC Director, Dirs. of Res. & Adm/Ops, Heads of IIC branches
Project Execution
IIC Project Teams A, B, C, etc.: each has a Project Manager, Discipline scientists, and IIC staff
Project Manager: Responsible for project execution and metrics for tracking progress/performance; interfaces with IIC branch heads
Discipline scientists: Scientists who “own” the problem and are committed to working with IIC staff to tackle it
IIC staff: IIC staff scientists assigned to work on project by relevant IIC branch heads. The same IIC staff member may serve on multiple IIC project teams
Dissemination of Knowledge: Seminars/colloquia; Publications; Knowledge management system; Communities of practice; Scientific journals; IIC white papers; Internal... External…; New tools; IIC process
Education is central to IIC’s mission At Harvard: Undergraduate & graduate courses focused on “data-intensive science” New graduate certificate program, within existing Ph.D. programs Research opportunities at undergraduate, graduate, and postdoctoral levels Beyond Harvard: New museum, highlighting the kind of science done at the IIC
IIC organization: research and education
Provost; Assoc Provost; Dean, Physical Sciences; IIC Director
Dir of Research; Dir of Admin & Operations; Dir of Education & Outreach
Assoc Dir, Instrumentation; Assoc Dir, Visualization; Assoc Dir, Analysis & Simulation; Assoc Dir, Databases/Data Provenance; Assoc Dir, Distributed Computing
Project 1 (Proj Mgr 1); Project 2 (Proj Mgr 2); Project 3 (Proj Mgr 3); Etc.
CIO (systems); Knowledge mgmt; Education & Outreach staff
IIC organization: admin and operations
Provost; Assoc Provost; Dean, Physical Sciences; IIC Director
Dir of Research; Dir of Admin & Operations; Dir of Education & Outreach
Admin; Finance; Development; Facilities; HR
Note: admin roles expected to be played by 1-2 staff members at outset; staff will grow with overall IIC growth
IIC: Examples
Visualization: Physically meaningful combination of diverse data types.
Distributed Computing: e-Science aspects of large collaborations. Sharing of data and computational resources and tools in real-time.
Databases/Provenance: Management, and rapid retrieval, of data. “Research reproducibility” …where did the data come from? How?
Analysis & Simulations: Development of efficient algorithms. Cross-disciplinary comparative tools (e.g. statistical).
Instrumentation: Improved data acquisition. Novel hardware approaches (e.g. GPUs, sensors).
Visualization: 3D Slicer (BWH Surgical Planning Lab) IIC contacts: Michael Halle & Ron Kikinis
IIC contact: Felice Frankel (MIT) Work: Garstecki/Whitesides (FAS) “Image and Meaning” (Visualization)
Distributed Computing: Semantics, Ontologies IIC Contact: Tim Clark (HMS/MGH)
Distributed Computing & Large Databases: Large Synoptic Survey Telescope
Optimized for time domain: scan mode, deep mode
7 square degree field
6.5m effective aperture
24th mag in 20 sec
> 5 Tbyte/night
Real-time analysis
Simultaneous multiple science goals
IIC contact: Christopher Stubbs (FAS)
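To put "> 5 Tbyte/night" in perspective, the arithmetic below converts a nightly volume into an average sustained rate and a multi-year total. It is a back-of-the-envelope sketch: the 10-hour night, observing on every night of the year, and decimal terabytes are assumptions, not figures from the slide.

```python
BYTES_PER_TB = 1e12  # decimal terabyte (ASSUMPTION; the slide does not specify)
BYTES_PER_MB = 1e6


def average_rate_mb_per_s(tb_per_night, hours_per_night):
    """Average data rate in MB/s needed to absorb one night's data as it is taken."""
    seconds = hours_per_night * 3600.0
    return tb_per_night * BYTES_PER_TB / BYTES_PER_MB / seconds


def cumulative_volume_tb(tb_per_night, nights_per_year, years):
    """Total raw volume in TB accumulated over a survey of the given length."""
    return tb_per_night * nights_per_year * years


if __name__ == "__main__":
    # 5 TB/night is the slide's lower bound; 10 h and 365 nights are assumed.
    print("MB/s:", average_rate_mb_per_s(5, 10))
    print("TB in 10 years:", cumulative_volume_tb(5, 365, 10))
```

Under these assumptions, 5 TB/night corresponds to roughly 139 MB/s sustained and 18,250 TB over ten years; because the slide gives a lower bound, both numbers are floors rather than estimates.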
Relative optical survey power based on A = 270 LSST design
Astronomy: LSST, SDSS, 2MASS, MACHO, DLS. High Energy Physics: BaBar, Atlas, RHIC.
First year of operation
Run-time data rate to storage (MB/sec): 5000 Peak, 500 Avg (zero-suppressed); 6*; 540*; 120* (’03); 250* (’04)
Daily average data rate (TB/day): (’03); 10 (’04)
Annual data store (TB): (’03); 500 (’04)
Total data store capacity (TB): 20,000 (10 yrs) ,000100,000 (10 yrs) 10,000 (10 yrs)
Peak computational load (GFLOPS): 140, ,000100,0003,000
Average computational load (GFLOPS): 140, ,000100,0003,000
Data release delay acceptable: 1 day moving, 3 months static; 2 months; 6 months; 1 year; 6 hrs (trans), 1 yr (static); 1 day (max), <1 hr (typ); Few days; 100 days
Real-time alert of event: 30 sec; none; <1 hour; 1 hr; none
Type/number of processors: TBD; 1GHz Xeon; MHz Sparc; MHz Sparc; MHz Pentium 5 Mixed/ GHz/ 10,000 Pentium/ 2500
Analysis & Simulations Figure based on work of Padoan, Nordlund, Juvela, et al. Excerpt from realization used in Padoan & Goodman 2002.
Analysis & Simulations: Neural Net Models of Intelligence Does Speed of Convergence in Neural Nets Predict Scores on Measures of “General Intelligence”? Select from the lower 8 the one that completes the pattern in the top 9 IIC contact: Stephen Kosslyn (Psychology)
(Easier) Analysis of Large Data Sets: Mendelian Disease Genes
Can a biologist get from here to there? Without programming?
From: “Hello world”; large data files (reformat, merge, and filter)
To: OMIM on the genome, the location of every known disease gene on the human genome (figure axes: Position (MB) vs. Chromosome)
IIC contact: Eitan Rubin (FAS/CGR)
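For contrast, here is roughly what "reformat, merge, and filter" looks like when it does require programming. This is a minimal Python sketch using only the standard library; the file layouts, column names, gene names, and disease labels are hypothetical placeholders, not OMIM's actual format or content.

```python
import csv
import io

# Hypothetical toy inputs in two different layouts (tab- vs comma-separated).
POSITIONS_TSV = (
    "gene\tchromosome\tposition_mb\n"
    "GENE_A\t1\t12.5\n"
    "GENE_B\t2\t189.0\n"
    "GENE_C\t1\t101.2\n"
)
DISEASE_CSV = (
    "gene,disease\n"
    "GENE_A,placeholder disease 1\n"
    "GENE_C,placeholder disease 2\n"
)


def read_table(text, delimiter):
    """Reformat: parse delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def disease_genes_on_chromosome(positions, diseases, chromosome):
    """Merge the two tables on gene name, then filter to one chromosome."""
    disease_by_gene = {row["gene"]: row["disease"] for row in diseases}
    merged = []
    for row in positions:
        if row["gene"] in disease_by_gene and row["chromosome"] == chromosome:
            merged.append({
                "gene": row["gene"],
                "chromosome": row["chromosome"],
                "position_mb": float(row["position_mb"]),
                "disease": disease_by_gene[row["gene"]],
            })
    # Sort by position so the output reads like a map along the chromosome.
    return sorted(merged, key=lambda r: r["position_mb"])


if __name__ == "__main__":
    positions = read_table(POSITIONS_TSV, "\t")
    diseases = read_table(DISEASE_CSV, ",")
    for row in disease_genes_on_chromosome(positions, diseases, "1"):
        print(row)
```

Even this toy version needs parsing, a join, type conversion, and sorting, which is the barrier the slide is pointing at.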
Instrumentation IIC contact: Matt Welsh, FAS
IIC: Mission The Initiative in Innovative Computing (IIC) will make Harvard a world leader in the innovative and creative use of computational resources to address forefront scientific problems. We will focus on developing capabilities that are applicable to multiple disciplines, by undertaking specific, well-defined projects, thereby developing tools and approaches that can be generalized and shared. We will foster the flow of ideas and inventions along the continuum from basic science to scientific computation to computational science to computer science. We will train a next generation of creative and computationally capable scientists, build linkages to industry, and communicate with the public at large.
Why Here?
Diverse group of senior faculty and accomplished scientists…
…spanning a wide range of relevant disciplines, e.g., Computer science; Physics, Chemistry, Astronomy, Statistics, Biology, Medicine, etc.; Psychology, Graphic Design
…with backgrounds in both academia and industry…
…deeply committed to the vision of a collaborative approach to solving the most compelling computing challenges facing scientists today
Who are IIC’s “competitors”?
Caltech Center for Advanced Scientific Computing Research
Computation Institute at the University of Chicago
Cornell Theory Center
MIT Media Lab
Scientific Computing and Imaging Institute (University of Utah)
UK National eScience Center of the Universities of Glasgow and Edinburgh
IIC is unique in its collaborative, comprehensive, interdisciplinary approach
IIC will evolve over three phases (Phase I, Phase II, Phase III)
Timing
IIC staffing level (combo of new faculty, senior scientists, admin staff): Total ~25 to ~100
Number of projects: ~3 to ~15
Educational mission (New courses offered, Outreach programs): New courses to museum
Other key milestones: Evaluation schedule (internal, external committees)
Challenges in “Phase I” (Startup): Result of “Allston” Science & Technology Task Force; IIC intended to be a “University” (not a single school) initiative; FAS Constraints; Faculty Appointments; Non-Faculty Appointments; Startup Space; “Chicken-and-Egg” Problem with Recruiting; Good, but not certain, Funding Prospects; Role of DEAS Computer Science