Challenges and New Trends in Data Intensive Science
Panel at Data-aware Distributed Computing (DADC) Workshop
HPDC Boston June
Geoffrey Fox
Community Grids Laboratory, School of Informatics
Indiana University
HPDC DADC Panel Questions I
What do you think is the most challenging research problem in data-intensive computing today? What will it be five years from now?
– Same problem as for any distributed computing arena
   – Should we call it clouds?
   – What is a sustainable architecture and associated software?
   – What are the special requirements of scientific data processing?
   – How do they relate across sensors, “classic” astronomy/particle physics and “scholarly studies”?
   – How do these relate to commercial systems, which perhaps look most like “scholarly studies”?
   – Is there any chance today’s 100 idiosyncratic data and metadata systems will survive?
   – The 5-year projection depends a lot on the evolution of compute, data and network systems/hardware
The demand for data-intensive computing has also attracted the attention of the federal funding agencies. NSF, DOE and other agencies are introducing new programs focusing on data-intensive computing, such as DataNet, INTEROP, CLuE, SciDAC Data Centers etc. Can we say data-intensive computing will be the next big/hot thing? How long do you think this trend will continue?
– Note that 30 years ago I was solving the LHC problem on a few hundred 1600 BPI tapes (around 10 gigabytes of data). I used identical analysis software (as did those from 15 years before), but the volume of data was a factor of a million lower
– HPCC Initiative shifted emphasis to simulation in 1990
– DataNet: note that the raw data and software were thrown away 12 years ago (without asking me; both were “saved” on data repository chipstore), but I do have the results of the analysis
– So the problem is not so new; the scale is new
– I want to stress that agencies should keep a good balance between research and deployment; I am not so sure the community has done very well on deployment
HPDC DADC Panel Questions II
How do you think the new Petaflop-scale systems and multi-core machines will affect data-intensive science?
– Multi-core will be very important as the high-performance distributed data analysis system for the plethora of distributed data
– Not clear what the computational complexity of important data mining algorithms will be
– Unclear if Petaflops are very relevant; many smaller machines are better, and we are not allowed to use Petaflops in this fashion
Could you comment on the vision of your institution on the "next generation data-intensive science & discovery"?
– Very important in the usual sciences: Biology, Chemistry, Physics
– Indiana University has no engineering school, so social science or “scholarly knowledge” is especially interesting
– Education and Training: clearly identified as a serious problem
– Propose (as chair of informatics department) certificates/minors in X-Informatics (a wrinkle on Computational Science)
   – X = Geography, Library, Health, Business, Media, Sports
   – Already have a full set of offerings in Bio, Chem, Security, Music, Social, HCI …