Biology MDS and Clustering Results


MapReduce and Clouds for Science
http://salsahpc.indiana.edu/
Indiana University Bloomington
Geoffrey Fox, Judy Qiu, SALSA Group

The SALSA project (salsahpc.indiana.edu) investigates new programming models for parallel multicore computing and Cloud/Grid computing. It aims to develop and apply parallel and distributed cyberinfrastructure to support large-scale data analysis. We illustrate this with a project for the life sciences: clustering of biology Alu and metagenomics sequences; a study of the usability and performance of different Cloud approaches; an iterative MapReduce runtime, Twister, to support complex data analysis algorithms for scientific applications; and engagement of undergraduate students in new programming models using Dryad and TPL through class, REU, and minority outreach programs.

Processing/Visualizing DNA Sequencing Pipeline

There is a data deluge throughout science, and all areas need analysis pipelines or workflows to propel the data from instruments through various stages to scientific discovery, often aided by visualization. It is well known that these pipelines typically offer natural data parallelism that can be implemented within many different frameworks. We chose to look at the MapReduce frameworks because they stem from the commercial information retrieval field, which is perhaps currently the world's most demanding data analysis problem. Exploiting commercial approaches offers a good chance of achieving high-quality, robust environments, and MapReduce has a mixture of commercial and open source implementations. This figure illustrates results from our research of a pipeline mode to provide services on demand (Software as a Service, SaaS) for genomics.

Biology MDS and Clustering Results

Alu Families: This visualizes results of Alu repeats from Chimpanzee and Human Genomes. Young families (green, yellow) are tight clusters.

Metagenomics: This visualizes results of clustering and dimension reduction to 3D of 30000 gene sequences from an environmental sample.

Usability and Performance of Different Cloud/MapReduce Models

We have demonstrated that clouds offer attractive computing paradigms for loosely coupled scientific applications. Higher-level models include Dryad and Hadoop, which we find are easier to use than EC2 and Azure (less setup and fewer lines of code). The cost effectiveness of cloud data centers, combined with the comparable performance reported here, suggests that loosely coupled science applications will increasingly be implemented on clouds and that using MapReduce will offer convenient user interfaces with little overhead. Earlier studies have shown that MPI is similar in performance to Hadoop and Dryad.

Undergraduate Research Experiences

The IU HBCU STEM Summer Scholar Institute is an eight-week program that provides opportunities for minority students to engage in continuous, substantive research and to work with researchers of our group on active projects. Funded by NSF, a team of STEM summer scholars from North Carolina A&T has joined the Community Grids Lab and is involved in research activities with the SALSA project, which is funded by Microsoft Research.

Twister (MapReduce++)

Twister (MapReduce++) supports iterative MapReduce computations and allows MapReduce to achieve higher performance, perform faster data transfers, and reduce the time it takes to process vast sets of data for data mining and machine learning applications. The open source code supports streaming communication and long-running processes.
http://www.iterativemapreduce.org/
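
The iterative MapReduce pattern that Twister is described as supporting can be illustrated with a small, single-process sketch. The code below is not Twister's API (Twister is a separate runtime; see the URL above); it is a hypothetical plain-Python example that uses k-means clustering to show the general shape of the computation: the data partitions stay fixed across iterations, each iteration runs a map step over every partition and a reduce step that merges the partial results, and a driver loop repeats until the cluster centers stop moving. The function names (map_partition, combine, iterative_kmeans) and parameters (max_iter, tol) are assumptions made for illustration.

```python
from functools import reduce


def map_partition(points, centers):
    """Map step: for one data partition, assign every point to its nearest
    center and return per-center partial coordinate sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def combine(left, right):
    """Reduce step: merge the partial sums and counts of two partitions."""
    (sums1, counts1), (sums2, counts2) = left, right
    sums = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(sums1, sums2)]
    counts = [a + b for a, b in zip(counts1, counts2)]
    return sums, counts


def iterative_kmeans(partitions, centers, max_iter=50, tol=1e-9):
    """Driver loop: repeat map + reduce over the same (static) partitions,
    passing only the small, changing set of centers between iterations."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        partials = [map_partition(part, centers) for part in partitions]
        sums, counts = reduce(combine, partials)
        # A center with no assigned points keeps its previous position.
        new_centers = [
            [s / n for s in row] if n else list(old)
            for row, n, old in zip(sums, counts, centers)
        ]
        # Largest squared movement of any center in this iteration.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(new_centers, centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration
```

Calling iterative_kmeans with a list of point partitions and a list of initial centers returns the final centers and the number of iterations used. In a distributed runtime the map calls would run in parallel on long-running workers that keep their partitions in memory; here they run sequentially in a list comprehension purely to keep the sketch self-contained.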