Cyberinfrastructure for e-Education and e-Research (e-Science)
Cyberinfrastructure Days, New Mexico Highlands University, Las Vegas NM, March 10-11 2008
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org
e-moreorlessanything
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ (from the inventor of the term, John Taylor, Director General of Research Councils UK, Office of Science and Technology)
e-Science is about developing tools and technologies that allow scientists to do ‘faster, better or different’ research
Similarly, e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world
This generalizes to e-moreorlessanything, including e-DigitalLibrary, e-NationalSecurity, e-HavingFun and e-Education
A deluge of data of unprecedented and inevitable size must be managed and understood
People (virtual organizations), computers and data (including sensors and instruments) must be linked via hardware and software networks
Applications, Infrastructure, Technologies
This field is confused by inconsistent use of terminology; I use the following definitions
Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are technologies
Grids could be everything (Broad Grids implementing some sort of managed web) or reserved for specific architectures like OGSA or Web Services (Narrow Grids)
These technologies combine and compete to build electronic infrastructures termed e-infrastructure or Cyberinfrastructure
e-moreorlessanything is an emerging application area of broad importance that is hosted on these infrastructures (e-infrastructure or Cyberinfrastructure)
e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything
What is Cyberinfrastructure
Cyberinfrastructure is (from NSF) infrastructure that supports distributed science (e-Science): data, people, computers
Clearly the core concept is more general than Science
It exploits Internet technology (Web 2.0), adding (via Grid technology) management, security, supercomputers etc.
It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with highish latency (milliseconds) between nodes
Parallel is needed to get high performance on individual large simulations, data analysis etc.; the problem must be decomposed
The New Mexico Encanto supercomputer is an excellent parallel resource
The distributed aspect integrates already distinct components; it is especially natural for data
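A rough sketch of why the latency gap between the parallel and distributed aspects matters for a decomposed simulation. The model and all numbers are assumptions for illustration only (1 ms of computation per step, one blocking message per step, 5 microseconds versus 5 milliseconds of latency); they are not measurements of any system named in the slides.

```python
def efficiency(compute_seconds: float, latency_seconds: float) -> float:
    """Fraction of each step spent computing when every step waits for one message."""
    return compute_seconds / (compute_seconds + latency_seconds)


# Assumed, illustrative values (not measurements).
STEP_COMPUTE = 1e-3         # 1 ms of computation per step
PARALLEL_LATENCY = 5e-6     # microsecond-scale latency between nodes
DISTRIBUTED_LATENCY = 5e-3  # millisecond-scale latency between nodes

parallel = efficiency(STEP_COMPUTE, PARALLEL_LATENCY)
distributed = efficiency(STEP_COMPUTE, DISTRIBUTED_LATENCY)
print(f"parallel: {parallel:.3f}, distributed: {distributed:.3f}")
```

Under these assumptions the tightly coupled step stays above 99% efficient at microsecond latency but falls below 20% at millisecond latency, which is one way to see why large individual simulations want a parallel machine while loosely coupled components tolerate distribution.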
Underpinnings of Cyberinfrastructure
Distributed software systems are being “revolutionized” by developments from e-commerce, e-Science and the consumer Internet
There is rapid progress in technology families termed “Web services”, “Grids” and “Web 2.0”
The emerging distributed system picture is of distributed services with advertised interfaces but opaque implementations, communicating by streams of messages over a variety of protocols
Complete systems are built by combining either services or predefined/pre-existing collections of services to achieve new capabilities
As well as the Internet/Communication revolutions (distributed systems), multicore chips will likely be hugely important (parallel systems)
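A minimal sketch of the "services with advertised interfaces but opaque implementations" picture, assuming messages are plain dictionaries and a service is any function from message to message. The two services and the `compose` helper are hypothetical names invented for this illustration; real systems would exchange the messages over a network protocol rather than by function call.

```python
from typing import Callable, Dict, List

# The advertised interface: a service maps one message to another.
# Callers never see how a service is implemented.
Message = Dict[str, object]
Service = Callable[[Message], Message]


def unit_converter(msg: Message) -> Message:
    """Hypothetical service: add a Kelvin temperature to a Celsius reading."""
    return {**msg, "kelvin": float(msg["celsius"]) + 273.15}


def threshold_flagger(msg: Message) -> Message:
    """Hypothetical service: flag readings above the freezing point of water."""
    return {**msg, "above_freezing": float(msg["kelvin"]) > 273.15}


def compose(services: List[Service]) -> Service:
    """Build a new service by passing each message through a chain of services."""
    def composite(msg: Message) -> Message:
        for service in services:
            msg = service(msg)
        return msg
    return composite


# A "complete system" built by combining existing services.
melt_monitor = compose([unit_converter, threshold_flagger])
result = melt_monitor({"station": "demo", "celsius": 1.5})
print(result)
```

Because `compose` returns something with the same interface as its parts, a collection of services can itself be reused as a single service in a larger system.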
Virtual Observatory Astronomy Grid: Integrate Experiments
[Image panels: Radio; Far-Infrared; Visible; Dust Map; Visible + X-ray; Galaxy Density Map]
Example: Setting up a Polar CI-Grid
The ice at the North and South Poles is melting, with potentially huge environmental impact
As a result of MSI meetings, I am working with MSI ECSU in North Carolina and Kansas University to design and set up a Polar Grid (Cyberinfrastructure)
This is a network of computers, sensors (on robots and satellites), data and people aimed at understanding the science of ice sheets and the impact of global warming
We have changed the 100,000 year glacier cycle into a ~50 year cycle; the field has increased dramatically in importance and interest
It is a good area to get involved in, as there is not so much established work
Computing and Cyberinfrastructure: TeraGrid
TeraGrid resources include more than 250 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks
TeraGrid is coordinated at the University of Chicago, working with the Resource Provider sites: Indiana University, Oak Ridge National Laboratory, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center, Purdue University, San Diego Supercomputer Center, Texas Advanced Computing Center, University of Chicago/Argonne National Laboratory, and the National Center for Atmospheric Research
[Map labels: Grid Infrastructure Group (UChicago); UW; UC/ANL; PSC; NCAR; PU; NCSA; IU; UNC/RENCI; Caltech; ORNL; USC/ISI; SDSC; TACC. Legend: Resource Provider (RP); Software Integration Partner]
Large Hadron Collider, CERN, Geneva: 2008 Start
pp collisions at $\sqrt{s} = 14$ TeV, $L = 10^{34}\ \mathrm{cm^{-2}\,s^{-1}}$
27 km tunnel in Switzerland & France
Experiments: CMS, TOTEM, ATLAS, ALICE (HI), LHCb (B-physics); pp, general purpose; HI
5000+ Physicists, 250+ Institutes, 60+ Countries
Physics: Higgs, SUSY, Extra Dimensions, CP Violation, QG Plasma, … the Unexpected
Challenges: analyze petabytes of complex data cooperatively; harness global computing, data & network resources
Environmental Monitoring Sensor Grid at Clemson
Sensor Grids Can be Fun
Note that a sensor is any time-dependent source of information, and a fixed source of information is just a broken sensor
SAR Satellites
Environmental Monitors
Nokia N800 pocket computers
RFID tags and readers
GPS Sensors
Lego Robots
RSS Feeds
Audio/video: web-cams
Presentation of teacher in distance education
Text chats of students
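The definition above (a sensor is any time-dependent source of information, and a fixed source is just a broken sensor) can be sketched as follows. The two example sources and their numbers are hypothetical and stand in for the devices listed on the slide.

```python
from typing import Callable, List, Tuple

Reading = Tuple[float, float]  # (time in seconds, value)


def sample(source: Callable[[float], float], times: List[float]) -> List[Reading]:
    """Treat any function of time as a sensor and read it at the given times."""
    return [(t, source(t)) for t in times]


def looks_broken(readings: List[Reading]) -> bool:
    """A source whose value never changes carries no time-dependent information."""
    return len({value for _, value in readings}) <= 1


def gps_latitude(t: float) -> float:
    """Hypothetical GPS on a slowly moving robot."""
    return 35.59 + 0.001 * t


def stuck_light_sensor(t: float) -> float:
    """Hypothetical light sensor that reports the same value forever."""
    return 42.0


times = [0.0, 1.0, 2.0, 3.0]
gps_ok = not looks_broken(sample(gps_latitude, times))
light_broken = looks_broken(sample(stuck_light_sensor, times))
print(gps_ok, light_broken)
```

The same `sample` interface would apply equally to an RSS feed, a web-cam or a student text chat, which is the sense in which all the items on the slide count as sensors.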
The Sensors on the Fun Grid
[Photo labels: Laptop for PowerPoint; 2 Robots used; Lego Robot; GPS; Nokia N800; RFID Tag; RFID Reader]
Data from the Robot RFID Sensors Data from GPS geolocates other sensors Sensor Data from Lego Light sensor plus videocams from N800 carried as payload on Lego RFID Reader sees many tags
BIRN Bioinformatics Research Network
The People in Cyberinfrastructure
Web 2.0 can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids
I expect more resources like MyExperiment from the UK, SciVee from SDSC and Connotea from Nature that offer Flickr, YouTube, Facebook and Second Life type capabilities optimized for science
The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience
In particular, the distance collaboration aspects of such Cyberinfrastructure can level the playing field; you do not have to be at Harvard etc. to succeed
e.g. ECSU in the CReSIS NSF Science and Technology Center
Navajo Tech can access TeraGrid Science Gateways
SciVee: share videos etc.
Connotea: share links/comments
All have tags
MSI-CIEC Web 2.0 Research Matching Portal
Portal supporting tagging and linkage of Cyberinfrastructure resources: NSF (soon other agencies) solicitations and awards; feeds such as SciVee and NSF; researchers on NSF awards; user and friends; TeraGrid allocations
Search for linked people, grants etc.
Could also be used to support matching of students and faculty for REUs etc.
[Screenshots: MSI-CIEC Portal Homepage; Search Results]
The social process of science 2.0
[Diagram labels: Virtual Learning Environment; Undergraduate Students; Graduate Students; scientists; experimentation; Digital Libraries; Technical Reports; Reprints; Peer-Reviewed Journal & Conference Papers; Preprints & Metadata; Local Web Repositories; Certified Experimental Results & Analyses; Data, Metadata; Provenance; Workflows; Ontologies]
Data and Cyberinfrastructure
DIKW: Data Information Knowledge Wisdom transformation
Applies to e-Science, Distributed Business Enterprise (including outsourcing), Military Command and Control and general decision support
(SOAP or just RSS) messages transport information expressed in a semantically rich fashion between sources and services that enhance and transform information so that the complete system provides the DIKW transformation
Semantic Web technologies like RDF and OWL might help us to have rich expressivity, but they might be too complicated
We are meant to build application-specific information management/transformation systems for each domain
Each domain has Specific Services/Standards (for APIs and Information, such as KML and GML for Geographical Information Systems) and will use Generic Services (like R for datamining) and Generic Standards (such as RDF, WSDL)
Standards made before consensus or not observant of technology progress are dubious
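A small sketch of the "just RSS" case: a source publishes readings as items in a minimal RSS 2.0 feed, and a filter service parses the feed and passes on only the readings of interest. It uses Python's standard `xml.etree.ElementTree`; the feed contents, the `example.org` link, the station names and the threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET


def make_feed(readings):
    """Source: wrap (station, value) readings as items in a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Demo sensor feed"
    ET.SubElement(channel, "link").text = "http://example.org/feed"  # placeholder URL
    ET.SubElement(channel, "description").text = "Hypothetical readings"
    for station, value in readings:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = station
        ET.SubElement(item, "description").text = str(value)
    return ET.tostring(rss, encoding="unicode")


def filter_feed(xml_text, threshold):
    """Filter service: keep only stations whose value exceeds the threshold."""
    root = ET.fromstring(xml_text)
    kept = []
    for item in root.iter("item"):
        value = float(item.findtext("description"))
        if value > threshold:
            kept.append((item.findtext("title"), value))
    return kept


feed = make_feed([("station-a", 0.2), ("station-b", 3.7)])
alerts = filter_feed(feed, threshold=1.0)
print(alerts)
```

Here the message format is generic (RSS) while the meaning of the threshold is domain specific, mirroring the split between generic standards and domain-specific services described above.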
Information and Cyberinfrastructure
[Diagram: Raw Data, Data, Information, Knowledge, Wisdom, Decisions. Labels: Sensor or Data Interchange Service (SS); Filter Service (fs); Filter Cloud; Discovery Cloud; Compute Cloud; Storage Cloud; Database; Inter-Service Messages; Another Service; Another Grid; Traditional Grid with exposed services; Portal]
APEC Cooperation for Earthquake Simulation
ACES is an eight-year-long collaboration among scientists interested in earthquake and tsunami prediction
iSERVO is the infrastructure to support the work of ACES
SERVOGrid is a (completed) US Grid that is a prototype of iSERVO
http://www.quakes.uq.edu.au/ACES/
Chartered under APEC, the Asia Pacific Economic Cooperation of 21 economies
Grid of Grids: Research Grid and Education Grid
[Diagram labels: Repositories, Federated Databases; Sensors, Streaming Data; Field Trip Data; Database; Sensor Grid; Database Grid; SERVOGrid; Discovery Services; GIS Grid; Compute Grid; Customization Services, From Research to Education; Data Filter Services; Research Simulations; Analysis and Visualization Portal; Education Grid; Computer Farm; Research; Education; ?]
Grid Workflow Datamining in Earth Science
Work with Scripps Institute
Grid services controlled by workflow process real-time data from ~70 GPS sensors in Southern California
[Diagram labels: NASA GPS; Earthquake; Real Time; Archival; Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS)]
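A toy sketch of a streaming workflow with the same stage order as the slide (streaming data support, data checking, transformation, datamining). It is not the actual services: every function name and number is hypothetical, and the Hidden Markov datamining step is replaced here by a plain threshold detector as a stand-in.

```python
from typing import Iterable, Iterator, Optional, Tuple


def stream(raw: Iterable[Optional[float]]) -> Iterator[Optional[float]]:
    """Streaming data support: deliver readings one at a time."""
    yield from raw


def check(readings: Iterable[Optional[float]]) -> Iterator[float]:
    """Data checking: drop missing readings."""
    for reading in readings:
        if reading is not None:
            yield reading


def transform(readings: Iterable[float], reference: float) -> Iterator[float]:
    """Transformation: convert positions to displacements from a reference."""
    for reading in readings:
        yield reading - reference


def detect(displacements: Iterable[float], threshold: float) -> Iterator[Tuple[int, float]]:
    """Stand-in for the datamining stage: report displacements beyond a threshold."""
    for index, displacement in enumerate(displacements):
        if abs(displacement) > threshold:
            yield index, displacement


# Hypothetical position readings from one station, with one dropout.
raw = [10.0, 10.01, None, 10.02, 10.5]
events = list(detect(transform(check(stream(raw)), reference=10.0), threshold=0.1))
print(events)
```

Each stage consumes and produces a stream, so stages can be swapped or inserted without changing their neighbours, which is the property a workflow engine relies on when it wires services together.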
Grid Workflow Data Assimilation in Earth Science
Grid services triggered by abnormal events and controlled by workflow process real-time data from radar and high-resolution simulations for tornado forecasts
[Figure: Typical graphical interface to service composition]
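A minimal sketch of services triggered by abnormal events, assuming each radar reading is a single number and "abnormal" means exceeding a threshold. The class, the two services and the 50.0 threshold are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


class EventTriggeredWorkflow:
    """Run registered services only when a reading is judged abnormal."""

    def __init__(self, is_abnormal: Callable[[float], bool]) -> None:
        self._is_abnormal = is_abnormal
        self._services: List[Callable[[float], str]] = []

    def register(self, service: Callable[[float], str]) -> None:
        self._services.append(service)

    def process(self, reading: float) -> List[str]:
        if not self._is_abnormal(reading):
            return []  # normal data: nothing is launched
        return [service(reading) for service in self._services]


def run_high_resolution_simulation(reading: float) -> str:
    """Hypothetical service standing in for a forecast simulation."""
    return f"simulation started for reflectivity {reading:.1f}"


def notify_forecasters(reading: float) -> str:
    """Hypothetical notification service."""
    return f"forecasters notified of reflectivity {reading:.1f}"


# Assumed threshold for an abnormal radar reading.
workflow = EventTriggeredWorkflow(is_abnormal=lambda r: r >= 50.0)
workflow.register(run_high_resolution_simulation)
workflow.register(notify_forecasters)

log: List[str] = []
for reading in [22.0, 31.5, 55.0]:
    log.extend(workflow.process(reading))
print(log)
```

Only the third reading launches anything, so expensive resources such as a high-resolution simulation are requested on demand rather than kept running.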