3DAPAS/ECMLS panel Dynamic Distributed Data Intensive Analysis Environments for Life Sciences: June 8 2011 San Jose Geoffrey Fox, Shantenu Jha, Dan Katz,


1 3DAPAS/ECMLS panel Dynamic Distributed Data Intensive Analysis Environments for Life Sciences: June 8 2011 San Jose Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

2 Discussion Topics?
Programming methods: languages vs. frameworks (advantages and disadvantages of each)
Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?
Does Life Sciences want clouds or supercomputers?
Data model for Life Sciences; is it dynamic? What is size? What is Access pattern? Is it Shared or Individual?
How important is data security and privacy?

3 Programming methods: languages vs. frameworks for data intensive/Life Science areas
SaaS offers “Blast etc.” on demand
Role of PGAS, Data parallel compilers (like Chapel), i.e. mainstream HPC high level approaches
Nodes v. Cores v. GPUs
– Do hybrid programming models have special features?
MapReduce v. MPI
Distributed environments like SAGA
Data parallel analysis languages like Pig Latin, Sawzall
Role of databases like SciDB and SQL based analysis
– See DryadLINQ
Is R (cloud R, parallel R) critical?
– What about Excel, Matlab …

4 Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?
This is related to privacy and programming model questions
Is data stored in central resources?
Does data have co-located compute resources (cloud)?
If co-located, are data and compute on same cluster?
– How is data spread out over disks on nodes?
Or is data in a storage system supporting wide area file system shared by nodes of cloud?
Or is data in a database (SciDB, SkyServer)?
Or is data in an object store like OpenStack?
What kind of middleware exists, or needs to be developed, to enable effective compute-data movement? Or is it just a run-time scheduling problem?
What are performance issues, and how do we get data there for dynamic data such as that produced by sequencers?

5 Data model for Life Sciences; is it dynamic? What is size? What is Access pattern? Is it Shared or Individual?
Is it a few large centers?
Is it a distributed set of repositories containing say all data from a particular lab?
– Or both of the above?
– How to manage and present stream of new data?
The world created ~1000 exabytes of data this year
– How much will Life Sciences create?
Relative importance of large shared data centers versus instrumental or computer generated individually owned data?
Is Data replication important?
Storage model
– Files, objects, databases?
How often are the different types of data read (presumably written once!)?
– Which data is most important? Raw or processed to some level?
What is metadata challenge?

6 Does Life Sciences want Clouds or Supercomputers?
Clouds are cost effective and elastic for varying need
Supercomputers support low latency (MPI) parallel applications
Clouds main commercial offering; supercomputers main academic large scale computing solution
– Also Open Science Grid, EGI ...
Cost (time) of transporting data from sequencers and repositories to analysis engines (clouds)
– Will NLR or Internet2 link to clouds? They do to TeraGrid
What can LS data-intensive community learn from the HEP community? E.g., will the HEP approach of community-wide "workload management systems" and VOs work?
What is the role of Campus Clusters/resources in genomic data-sharing?
No history of large cloud budgets in federal grants

7 How important is data security and privacy?
Human Genome processing cannot use most cost effective solutions which will be shared resources such as public clouds
– Commercial, military applications
What other research applications have such concerns?
– Analysis of copyrighted material such as digital books
Partly technical; partly policy issue of establishing a trusted approach
– Companies accept off site paper storage? See recent hacking attacks such as Sony network, Gmail
How important is fault tolerance/autonomic computing?

