Bioinformatics for DNA-seq and RNA-seq experiments

Presentation transcript:

Bioinformatics for DNA-seq and RNA-seq experiments Li-San Wang Department of Pathology and Laboratory Medicine Penn Institute for Biomedical Informatics Penn Genome Frontiers Institute University of Pennsylvania Perelman School of Medicine Thank you for having me here.

Next Generation Sequencing Technology
Generates billions of short DNA reads, each on the order of 100 nt, in a week
Costs < $5K to resequence a human genome
Illumina Hi-Seq 2000: runs 2 flow cells (300 Gb each) in ~1 week, enough to sequence 6 genomes
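The throughput figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes a haploid human genome of about 3.1 Gb (a figure not given on the slide) and treats the whole yield as usable, so the result is an upper bound on the average coverage rather than a measured value.

```python
# Back-of-the-envelope check of the slide's throughput numbers.
# Assumption (not from the slide): a haploid human genome is ~3.1 Gb.

def total_reads(yield_gb, read_length_nt=100):
    """Number of reads needed to produce `yield_gb` gigabases of sequence."""
    return yield_gb * 1e9 / read_length_nt

def mean_coverage(yield_gb, genomes, genome_size_gb=3.1):
    """Average fold coverage per genome if the yield is split evenly."""
    return yield_gb / (genomes * genome_size_gb)

run_yield_gb = 2 * 300          # two flow cells, 300 Gb each
reads = total_reads(run_yield_gb)
coverage = mean_coverage(run_yield_gb, genomes=6)

print(f"{reads:.1e} reads per run, about {coverage:.0f}x per genome")
```

Under these assumptions one run yields several billion reads and roughly 30x coverage for each of the six genomes, which is consistent with the "billions of reads" and "6 genomes" figures above.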

Applications of NGS
DNA-Seq: resequence genomes to identify variations associated with diseases and traits
RNA-Seq: study gene expression activity
ChIP-Seq and DNase-Seq: measure protein-DNA interactions and modifications
… and many other types of protocols

Central Dogma: DNA → RNA → Protein → Phenotypes

RNA-Seq: RNA; library prep (reverse transcription and DNA fragmentation); sequencing and analysis. Images: Illumina

High read heterogeneity along RNA transcripts
We need to dig deeper:
Secondary structures
Functional classes
Modifications (non-standard nucleotides)
Visualization
… and many other questions
What actually happens is a lot more complicated than we thought. Coverage is highly heterogeneous: some regions are more expressed than others.

SAVoR: RNA-seq visualization
Fan Li, Paul Ryvkin, Micah Childress, Otto Valladares, Brian Gregory*, Li-San Wang*. SAVoR: a server for sequencing annotation and visualization of RNA structures. Nucleic Acids Research, 2012.
HAMR: Detect RNA modification using RNA-seq
Paul Ryvkin, Yuk Yee Leung, Micah Childress, Otto Valladares, Isabelle Dragomir, Brian Gregory*, and Li-San Wang*. HAMR: High throughput Annotation of Modified Ribonucleotides. RNA, in press, 2013.
CoRAL: Use small RNA-seq to annotate non-coding RNA function classes
Yuk Yee Leung, Paul Ryvkin, Lyle Ungar, Brian Gregory*, Li-San Wang*. CoRAL: Predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Research, 2013.
RNA-Seq-Fold: Use pairing-informative RNA-seq protocols to estimate secondary structures (in progress)

SAVoR: web-based visualization of RNA-seq data in a structural context http://tesla.pcbi.upenn.edu/savor/ RNA-seq data + secondary structure = SAVoR plots! Li et al., NAR 2012

Log-ratio of dsRNA-seq to ssRNA-seq read coverage along the At2g04390.1 transcript.
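The quantity plotted on this slide can be computed per position from two coverage vectors. The sketch below is illustrative only: the slide does not state the log base, whether a pseudocount is used, or how the libraries are normalized, so the base-2 logarithm and the pseudocount of 1 are assumptions, and the coverage values are made up.

```python
import math

def log_ratio(ds_coverage, ss_coverage, pseudocount=1.0):
    """Per-position log2 ratio of dsRNA-seq to ssRNA-seq read coverage.

    The pseudocount (an assumption, not from the slides) avoids division
    by zero and log(0) at positions with no reads in one library.
    """
    if len(ds_coverage) != len(ss_coverage):
        raise ValueError("coverage vectors must have the same length")
    return [
        math.log2((ds + pseudocount) / (ss + pseudocount))
        for ds, ss in zip(ds_coverage, ss_coverage)
    ]

# Toy coverage values along a short stretch of a transcript (made up).
ds = [30, 12, 0, 7]
ss = [2, 12, 15, 0]
ratios = log_ratio(ds, ss)
print([round(r, 2) for r in ratios])
```

Positive values mark positions with relatively more double-stranded signal, negative values positions with relatively more single-stranded signal.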

Modified RNA – Motivation: Sites with unusual mismatch patterns in RNA-seq
Example sites (labeled 1, 2, 3, 3a in the figure):
A in actual sequence; C/G/T are due to 1% base calling error rate
A/C SNP; G/T are due to 1% error rate
G/T ratio too far away from 1:1; heterozygotes cannot explain
A and C rates are too high for base calling error

Observed nucleotide pattern at a known m2G site in an alanine tRNA

tRNA modifications: a tRNA-modifying protein converts guanosine (G) to N-2-methylguanosine (m2G). [Figure: chemical structures of G and m2G with numbered ring positions.] Watson-Crick pairing edge has been modified

Detecting modified RNAs: reverse transcription (RT) behaves differently when the Watson-Crick edge is modified

Statistical model for HAMR
Null hypotheses:
H01: homozygous reference, low base calling error
H02: heterozygote, low base calling error
In both cases, there should be at most two nucleotides with high frequencies
Test: maximum-likelihood (ML) ratio test
Annotation: naïve Bayes model on non-reference allele frequencies
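To make the two null hypotheses concrete, here is a minimal likelihood-ratio sketch for a single site. It is not the published HAMR implementation: the multinomial error model, the fixed 1% error rate, and the function names are all assumptions made for illustration. Under H01 the reference base has probability 1 - e and each other base e/3; under H02 two bases each have probability 1/2 - e/3 and the remaining two e/3.

```python
import math
from itertools import combinations

BASES = "ACGT"

def log_lik(counts, probs):
    """Multinomial log-likelihood, dropping the constant that cancels in ratios."""
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

def lr_statistic(counts, ref, error_rate=0.01):
    """2 * (logL of observed frequencies - logL of the better-fitting null).

    counts: dict mapping base -> read count at the site
    ref:    reference base at the site
    A large value means neither null hypothesis explains the pattern.
    """
    n = [counts.get(b, 0) for b in BASES]
    total = sum(n)
    if total == 0:
        return 0.0
    e3 = error_rate / 3
    unconstrained = log_lik(n, [c / total for c in n])
    # H01: homozygous reference plus base-calling error.
    h01 = log_lik(n, [1 - error_rate if b == ref else e3 for b in BASES])
    # H02: heterozygote plus base-calling error; try every pair of alleles.
    h02 = max(
        log_lik(n, [0.5 - e3 if b in pair else e3 for b in BASES])
        for pair in combinations(BASES, 2)
    )
    return 2 * (unconstrained - max(h01, h02))

ordinary = lr_statistic({"A": 990, "C": 4, "G": 3, "T": 3}, ref="A")
unusual = lr_statistic({"A": 600, "C": 10, "G": 300, "T": 90}, ref="A")
print(f"ordinary site: {ordinary:.2f}, unusual site: {unusual:.2f}")
```

The first site looks like a homozygous reference with sequencing error and gives a statistic near zero. The second has three frequent nucleotides, so neither null fits and the statistic is large. How the statistic is thresholded is left open here, since the slides do not specify it.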

Results: Statistical analysis of known modification sites shows that this idea works with high specificity

Known modifications predicted to affect RT Detected modifications predicted to affect RT

Our data Yeast dataset

Classification accuracy
Train on human tRNA data, test on yeast tRNA data
Precursor   Classes                    Observations   Accuracy
A           m1A|m1I|ms2i6A, i6A|t6A    187            98%
G           m1G, m2G|m22G              86             79%
U           D, Y                       17             96%

Modifications in other RNAs
Scan the entire smRNA transcriptome for candidate modified sites
* Uniquely mapped reads in 4 libraries
* Removed sites corresponding to read ends
* Removed sites corresponding to known SNPs
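The three filters listed above can be sketched as a single pass over candidate sites. The `Site` record and its field names are hypothetical, chosen only to illustrate the filtering logic. The slides do not describe how sites are represented or how "read end" is defined.

```python
from typing import NamedTuple

class Site(NamedTuple):
    """Hypothetical candidate site record (field names are illustrative)."""
    chrom: str
    pos: int
    unique_reads: int    # uniquely mapped reads covering the site
    at_read_end: bool    # True if the mismatches sit at read ends

def filter_sites(sites, known_snps, min_unique_reads=1):
    """Keep sites supported by uniquely mapped reads, away from read ends,
    and not overlapping a known SNP position."""
    return [
        s for s in sites
        if s.unique_reads >= min_unique_reads
        and not s.at_read_end
        and (s.chrom, s.pos) not in known_snps
    ]

candidates = [
    Site("chr1", 100, unique_reads=25, at_read_end=False),
    Site("chr1", 250, unique_reads=0, at_read_end=False),   # no unique support
    Site("chr2", 40, unique_reads=12, at_read_end=True),    # read-end artifact
    Site("chr3", 77, unique_reads=30, at_read_end=False),   # known SNP
]
known_snps = {("chr3", 77)}

kept = filter_sites(candidates, known_snps)
print(kept)
```

Storing known SNPs as a set of (chromosome, position) pairs keeps each lookup constant-time, which matters when scanning a whole transcriptome.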

HAMR: High-Throughput Annotation of Modified Ribonucleotides Ryvkin et al., RNA, 2013 http://tesla.pcbi.upenn.edu/hamr/ Please contact us if you are interested!

RNA-seq is more than an expensive digital gene expression microarray
NGS algorithms and experimental protocols should integrate tightly
Bioinformatics scientists and bench scientists

DNA-Seq: find genetic variations linked to traits and diseases
All individuals differ slightly from one another
Single nucleotide polymorphisms (SNPs) are the most common form of variation
Other types: indels, copy number variations, rearrangements
Genetic polymorphisms may lead to different phenotypes and diseases
Trisomy 21: Down syndrome
Substitution 1624G>T in the CFTR gene changes glycine 542 to a stop codon (G542X), which leads to cystic fibrosis

Alzheimer’s Disease Sequencing Project (ADSP)
Announced in Feb. 2012
Participants:
* NIA, NHGRI
* ADGC and CHARGE
* Large-Scale Genome Sequencing and Analysis Centers (Broad/Baylor/WashU)
* NACC (phenotype) and NCRAD (sample)
* NIAGADS (data coordinating center)
* NCBI dbGaP/SRA
Design: 584 WGS / 11,000 WES (>300TB data)
WGS data of 584 samples available from our ADSP data portal
Visit the ADSP website www.niagads.org/adsp to learn about the study design, apply for data access, and download data
Photo from http://nihrecord.od.nih.gov/newsletters/2012/03_02_2012/story5.htm

Computational Challenges in Analyzing DNA-Seq Data
Mapping: align 100 to 1,000 billion reads to the reference genome with good sensitivity
Variant calling: call SNPs and structural variants reliably
Association: find susceptibility variants by association tests
Interpretation: interpret the effect of variants
Data management: query, store, and distribute on the order of 100 TB of data
And that’s just for one project!

Cloud computing using Amazon EC2
Can run hundreds of cores on Amazon EC2 easily
Can share data and programs easily
Very good security
Steep learning curve: pre-configured workflows/environments are needed so that users can run analyses easily on Amazon
Storing data is very expensive: $0.1/GB-month, or $1,200/TB-year
Glacier is 10 times cheaper but also much slower
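The storage figures on this slide follow from simple multiplication. The sketch below uses the slide's quoted price of $0.1 per GB-month and assumes 1 TB = 1,000 GB. These are the prices quoted in the talk, not current Amazon pricing, and the 300 TB project size is taken from the ADSP slide as an illustration.

```python
def annual_storage_cost(terabytes, usd_per_gb_month=0.10, gb_per_tb=1000):
    """Yearly storage cost in USD at a flat per-GB monthly price."""
    return terabytes * gb_per_tb * usd_per_gb_month * 12

per_tb_year = annual_storage_cost(1)                          # standard storage
glacier_tb_year = annual_storage_cost(1, usd_per_gb_month=0.01)  # 10x cheaper
project_year = annual_storage_cost(300)                       # a 300 TB project

print(f"standard: ${per_tb_year:,.0f}/TB-year")
print(f"glacier:  ${glacier_tb_year:,.0f}/TB-year")
print(f"300 TB:   ${project_year:,.0f}/year")
```

At these prices a single 300 TB project would cost several hundred thousand dollars a year to keep in standard storage, which is why the slower archival tier is attractive for raw data that is rarely read.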

DNA Resequencing Analysis Workflow (DRAW)
Phases:
Mapping
Realignment, dedup, uniq, base quality recalibration
Variant detection
Coverage, QC metrics
Tools: BWA, GATK, Picard, Samtools
Easy to run: invoke phases with five commands, no need to mouse-click like crazy
Memory request based on data size
Supports Sun Grid Engine for cluster computing
Modular architecture, job monitoring, job dependency, auditing, error checking
Runs on Amazon EC2, $582/FC
We are migrating all our NGS pipelines to the DRAW architecture
I want to go back to the workflow of how we processed sequencing data. I divide the workflow into three phases; there are of course a lot more steps. Different software packages are used, such as BWA for mapping and GATK for variant detection. Running through those programs is straightforward. The challenge, however, is the sheer amount of data. For example, a flow cell from an Illumina Hi-Seq typically gives 300 Gb of data. It is nearly impossible to process that much data without a high-performance computing cluster. You just can’t sit there, wait for a process to finish, start the next one, and do this for 30 samples each time. This is where our pipeline comes in: it generates the commands for submitting jobs to a computing cluster, which streamlines and automates the entire process.
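The idea of generating cluster submission commands with job dependencies can be sketched as follows. This is not DRAW's actual interface: the phase script names (`mapping.sh` and so on), the memory rule, and the job naming scheme are hypothetical. The only real tool assumed is Grid Engine's `qsub` with its `-N` (job name), `-hold_jid` (wait for another job) and `-l` (resource request) options. The `h_vmem` resource name is commonly used but depends on how the site's cluster is configured.

```python
import math
import shlex

# Hypothetical phase scripts, one per workflow phase (not DRAW's real names).
PHASES = ["mapping", "realignment", "variant_detection"]

def memory_request_gb(input_size_gb, floor_gb=4):
    """Hypothetical rule: request a quarter of the input size, at least 4 GB."""
    return max(floor_gb, math.ceil(input_size_gb * 0.25))

def build_commands(sample, input_size_gb):
    """Return one qsub command per phase, each held until the previous finishes."""
    mem = memory_request_gb(input_size_gb)
    commands = []
    previous_job = None
    for phase in PHASES:
        job_name = f"{phase}_{sample}"
        cmd = ["qsub", "-N", job_name, "-l", f"h_vmem={mem}G"]
        if previous_job is not None:
            # Grid Engine starts this job only after the named job completes.
            cmd += ["-hold_jid", previous_job]
        cmd += [f"{phase}.sh", sample]
        commands.append(shlex.join(cmd))
        previous_job = job_name
    return commands

cmds = build_commands("S01", input_size_gb=40)
for c in cmds:
    print(c)
```

Because every command is generated up front and chained with `-hold_jid`, all phases for all samples can be submitted at once and the scheduler enforces the order, so nobody has to wait for one step to finish before launching the next.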

NIA Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS)
Portal to AD genetics studies funded by NIA
Portal for ADSP data
Portal for other large-scale AD sequencing projects (>2,000 whole genomes, >400TB raw data) being developed
Software (DRAW+SneakPeek) and other resources
Sign up for a user account and news alerts at www.niagads.org

Lab members Chiao-Feng Lin Otto Valladares Tianyan Hu Fanny Leung Amanda Partch Mugdha Khaladkar Dan Laufer Micah Childress John Malamon Yih-Chi Hwang Fan Li Paul Ryvkin Mitchell Tang Alex Amlie-Wolf Pavel Kuksa

Acknowledgements Schellenberg lab Gerard Schellenberg Evan Geller Laura Cantwell Gregory Lab Brian Gregory Qi Zheng Isabelle Dragomir Jamie Yang Sandeep Jain CNDR/ADC John Trojanowski Virginia Lee Vivianna Van Deerlin Steven Arnold Terry Schuck Robert Greene Pathology and Lab Medicine PSOM/CHOP David Roth Nancy Spinner Dimitrios Monos Jennifer Morrisette Robert Daber Laura Conlin Ellen Tsai Avni Santani Zissimos Mourelatos Support: Penn Institute on Aging PGFI Alzheimer’s Foundation CurePSP foundation NIH: NIA/NIGMS/NIMH/NHGRI Mingyao Li John Hogenesch Nancy Zhang Sampath Kannan Lyle Ungar Sarah Tishkoff Maja Bucan Chris Stoeckert Arupa Ganguly Kate Nathanson Alice Chen-Plotkin Travis Unger