NCGAS provides dedicated access to memory-rich supercomputers customized for genomics studies, including Mason and other XSEDE systems

Presentation transcript:

NCGAS is a national service center funded by the National Science Foundation’s Advances in Biological Informatics (ABI) program to provide scientists access to software and supercomputers for genomics research.

A Pervasive Technology Institute (pti.iu.edu) Center

Thomas G. Doak, Le-Shin Wu, Craig A. Stewart, Robert Henschel, William K. Barnett

NCGAS provides:
- A specific goal is to provide dedicated access to memory-rich supercomputers customized for genomics studies, including Mason and other XSEDE systems.
- Distributions of hardened versions of popular codes.
- Initially, nucleated around genome assembly software such as:
  - de Bruijn graph methods: SOAPdenovo, Velvet, ABySS
  - consensus methods: Celera, Arachne 2
- Expanding to other areas as users are recruited: now moving into phylogenetics and metagenomics.
- We’re especially interested in helping smaller institutions.
- Funded only since Nov. 2011, NCGAS is actively seeking users!

Current participating institutions:
- IU’s Mason, an HP ProLiant DL580 G7: 10GE interconnect; quad-socket nodes (8-core Xeon L7555, 1.87 GHz base frequency; 32 cores per node; 512 GByte of memory per node!); rated at TFLOPs (G-HPL benchmark)
- Texas Advanced Computing Center (TACC)
- San Diego Supercomputer Center (SDSC), e.g. DASH

NCGAS will support software running at IU, TACC and SDSC, as well as other supercomputers available as part of XSEDE, with the goal of creating a single allocation system that will transparently access all appropriate clusters.

NCGAS will further campus bridging integration.

Early Users:

Metagenomics Sequence Analysis
Yuzhen Ye Lab (IU Bloomington School of Informatics)
- Environmental sequencing: sampling DNA sequences directly from the environment.
- Since the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis that involves only one species.
- Assembling metagenomic sequences and deriving genes from the dataset.
- Dynamic programming to optimally map consecutive contigs from the assembly.
- Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required to perform the dynamic programming algorithm so that the task can be completed in polynomial time.

Genome Assembly and Annotation
Michael Lynch Lab (IU Bloomington, Department of Biology)
Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication. This project has also been performing RNAseq on each genome, which is currently used to aid in genome annotation and subsequently to detect expression differences between paralogs. The assembler used is based on an overlap-layout-consensus method instead of a de Bruijn graph method (like some of the newer assemblers). It is more memory intensive because it requires performing pairwise alignments between all pairs of reads. The annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load millions of RNAseq and EST reads and map them back to the genome.

Genome Informatics for Animals and Plants
Genome Informatics Lab (IU Bloomington Department of Biology)
This project is to find genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology. These improvements are applied to newly deciphered genomes for an environmental sentinel animal, the water flea (Daphnia), the agricultural pest insect pea aphid, the evolutionarily interesting jewel wasp (Nasonia), and the chocolate plant (Theobroma cacao), which will bring genomics to sustainable agriculture of cacao. Large-memory compute systems are needed for biological genome and gene transcript assembly because assembly of genomic DNA or gene RNA sequence reads (in billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the data set. These programs build graph matrices of sequence alignments in memory.

Imputation of Genotypes and Sequence Alignment
Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular Genetics)
Studies complex disorders by using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole-genome and whole-exome sequencing. Requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures. More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition the chromosomes into segments. This increases the accuracy and speed of genotype imputation, allowing for improved evaluation of detailed within-study results as well as communication and collaboration (including meta-analysis) using the disease study results with other researchers.

Daphnia Population Genomics
Michael Lynch Lab (IU Bloomington Department of Biology)
This project involves the whole-genome shotgun sequences of over 20 diploid genomes with genome sizes >200 Megabases each. With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome. The genome assembly of millions of small reads often requires excessive memory use, for which we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.
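To give a feel for the scale behind the Daphnia figures above (over 20 genomes, more than 200 Megabases each, over 30x coverage) and the all-pairs read alignment mentioned for overlap-layout-consensus assembly, here is a rough back-of-the-envelope sketch. The function name and the 100 bp read length are assumptions made for illustration and do not come from the poster; the genome count, genome size, and coverage are the lower bounds quoted in the text.

```python
from math import comb


def sequencing_workload(n_genomes, genome_size_bp, coverage, read_length_bp):
    """Estimate raw data volume for a multi-genome sequencing project."""
    # Coverage is the average number of times each base is sequenced.
    bases_per_genome = genome_size_bp * coverage
    reads_per_genome = bases_per_genome // read_length_bp
    return {
        "total_bases": bases_per_genome * n_genomes,
        "reads_per_genome": reads_per_genome,
        # Naive overlap detection compares every read with every other read:
        # n choose 2 candidate pairs per genome.
        "naive_read_pairs_per_genome": comb(reads_per_genome, 2),
    }


# Lower bounds from the text: 20 genomes, 200 Mb each, 30x coverage.
# The 100 bp read length is a hypothetical value, not a figure from the poster.
workload = sequencing_workload(20, 200_000_000, 30, 100)
print(f"{workload['total_bases']:.2e} bases in total")
print(f"{workload['reads_per_genome']:.2e} reads per genome")
print(f"{workload['naive_read_pairs_per_genome']:.2e} naive read pairs per genome")
```

The pair count is an upper bound for a naive all-against-all comparison; practical assemblers typically use indexing to avoid examining most pairs, but the quadratic growth helps explain why such assemblies are described here as memory intensive.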