Published by Meryl Fletcher. Modified over 9 years ago.
1
Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis
Han Liang, Ph.D.
Department of Bioinformatics and Computational Biology
Rice University
2
Outline
History
NGS Platforms
Applications
Bioinformatics Analysis
Challenges
3
Central Dogma
4
Sanger sequencing
DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags
5
Sanger vs NGS
‘Sanger sequencing’ has been the only DNA sequencing method for 30 years, but…
…there is a hunger for even greater sequencing throughput and more economical sequencing technology…
NGS can process millions of sequence reads in parallel rather than 96 at a time (at 1/6 of the cost)
Objections: fidelity, read length, infrastructure cost, handling large volumes of data
6
Platforms
Roche/454 FLX: 2004
Illumina Solexa Genome Analyzer: 2006
Applied Biosystems SOLiD™ System: 2007
Helicos HeliScope™: recently available
Pacific Biosciences SMRT: launching 2010
7
Quickly reduced Cost
8
Three Leading Sequencing Platforms
Roche 454
Illumina Solexa
Applied Biosystems SOLiD
9
The general experimental procedure
Wang et al. Nature Reviews Genetics 2009
10
454 bead microreactor
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
12
Illumina (Solexa) Bridge amplification
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
13
SOLiD color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
14
Comparison of existing methods
15
Real Data – nucleotide space
Solexa
@SRR :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR :8:1:325:773 length=33
@SRR :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR :8:1:409:432 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
@SRR :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
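The records on this slide follow the four-line FASTQ layout: header, sequence, `+` line, quality string. A minimal Python sketch of reading such records and converting the quality characters to Phred scores might look like the following. It assumes the Phred+33 encoding used in SRA-distributed files, which the slide does not state, and the names `read_fastq`, `phred_scores` and `error_probability` are illustrative only. The example record is the second one on the slide, with its header kept as it appears there.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Second record from the slide; the quality string is 31 'I' characters then '07'.
example = (
    "@SRR :8:1:409:432 length=33\n"
    "AAGTTATGAAATTGTAATTCCAATATCGTAAGC\n"
    "+SRR :8:1:409:432 length=33\n"
    + "I" * 31 + "07\n"
)

records = list(read_fastq(StringIO(example)))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), scores[-3:])
```

Under this encoding `I` corresponds to Q40 (an error probability of 1 in 10,000), while the trailing `0` and `7` correspond to Q15 and Q22, so the last two base calls of this read are much less reliable than the rest.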
16
Real Data – color space
SOLiD Data
>1_24_47_F3
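In color space, each digit of a SOLiD read encodes the transition between two adjacent bases rather than a single base. The sketch below illustrates the standard two-base encoding, in which the color equals the XOR of the 2-bit base codes (A=0, C=1, G=2, T=3). The function names and the toy sequences are illustrative assumptions, not data from the slide.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def encode_color_space(seq, primer="T"):
    """Encode a nucleotide sequence as primer base + color digits (0-3)."""
    full = primer + seq
    colors = [
        str(BASES.index(a) ^ BASES.index(b))  # color = XOR of adjacent base codes
        for a, b in zip(full, full[1:])
    ]
    return primer + "".join(colors)


def decode_color_space(cs_read):
    """Decode primer base + color digits back to nucleotides (primer excluded)."""
    prev = cs_read[0]
    decoded = []
    for color in cs_read[1:]:
        base = BASES[BASES.index(prev) ^ int(color)]
        decoded.append(base)
        prev = base  # each base depends on the previously decoded base
    return "".join(decoded)


encoded = encode_color_space("ACGT")
print(encoded)                      # T3131
print(decode_color_space(encoded))  # ACGT
print(decode_color_space("T3031"))  # AATG: one wrong color shifts every later base
```

Because each base is decoded from the previous one, a single color error changes all downstream bases when a read is naively converted. This is one reason SOLiD reads are typically aligned in color space rather than translated first.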
17
Data output difference among the three platforms
Nucleotide space vs. color space
Length of short reads: 454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)
18
Applications with “Digital output”
De novo genome assembly
Genome re-sequencing
RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
ChIP-Seq (protein-DNA interaction)
Epigenetic profiling
19
Ancient Genomes Resurrected
Degraded state of the sample
mtDNA sequencing
Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
Problems: contamination by modern human DNA and co-isolation of bacterial DNA
21
Elucidating DNA-protein interactions through chromatin immunoprecipitation sequencing
Key part in regulating gene expression
ChIP: technique to study DNA-protein interactions
Recently, genome-wide ChIP-based studies of DNA-protein interactions
Readout of ChIP-derived DNA sequences on NGS platforms
Insights into transcription factor/histone binding sites in the human genome
Enhances our understanding of gene expression in the context of specific environmental stimuli
22
Discovering noncoding RNAs
ncRNA presence in the genome is difficult to predict by computational methods with high certainty because of its evolutionary diversity
Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, and with complex disease set or severity
Enhances the annotation of sequenced genomes (making the impact of mutations more interpretable)
23
Metagenomics
Characterizing the biodiversity found on Earth
The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches.
Examples: ocean, acid mine site, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual
24
Defining variability in many human genomes
Common variants have not yet completely explained complex disease genetics; rare alleles also contribute
Also structural variants: large and small insertions and deletions
Accelerating biomedical research
25
Epigenomic variation
Enables the analysis of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development.
Enhanced potential to combine the results of different experiments: correlative analyses of genome-wide methylation, histone binding patterns and gene expression, for example.
26
Integrating Omics
Mutation discovery
Protein-DNA interaction
Copy number variation
mRNA expression
microRNA expression
Alternative splicing
Kahvejian et al. 2008
27
Data Analysis Flow
SOLiD machine: raw data
Central server: basic processing (decoding, filtering and mapping)
Local machine: downstream analysis
28
Short Read Mapping
DNA-Resequencing
BLAST-like approach
RNA-Seq
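A BLAST-like approach to short read mapping can be thought of as seed-and-extend: index the reference by short k-mers, look up exact seed matches from the read, then verify each candidate position by counting mismatches. The following is a minimal sketch of that idea under simplifying assumptions (ungapped comparison, forward strand only, a toy reference). It is not the aligner used in the talk, and all names and parameters are illustrative.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) for placements within the mismatch limit."""
    hits = set()
    # Non-overlapping seeds: a read with few mismatches keeps at least one exact seed.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # implied start of the whole read on the reference
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


genome = "GATTACAGATTTCCGGAAACCCTTTGGG"  # toy reference, not real data
index = build_index(genome, k=4)
read = "TACCGGAAACCC"  # genome[10:22] with one substitution at the second base
print(map_read(read, genome, index, k=4))  # [(10, 1)]
```

The first seed (`TACC`) contains the mismatch and finds nothing, but the other two seeds both point to position 10, where the full read matches with one mismatch.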
31
Read length and pairing
ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
Short reads are problematic because short sequences do not map uniquely to the genome.
Solution #1: Get longer reads.
Solution #2: Get paired reads.
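The effect of read length on mapping uniqueness can be illustrated by counting how many windows of a given length occur exactly once in a reference. The sketch below uses a tiny repetitive toy sequence chosen for illustration; the function name `unique_fraction` is hypothetical, and real genomes would need a far more memory-efficient approach.

```python
from collections import Counter


def unique_fraction(genome, read_len):
    """Fraction of start positions whose read_len-mer occurs exactly once."""
    n_windows = len(genome) - read_len + 1
    if n_windows <= 0:
        raise ValueError("read length exceeds genome length")
    counts = Counter(genome[i:i + read_len] for i in range(n_windows))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / n_windows


toy = "ACACACGT"  # toy sequence with a short tandem repeat
for length in (2, 4, 6):
    print(length, round(unique_fraction(toy, length), 3))
```

In this toy case the uniquely placeable fraction rises from 2/7 at length 2 to 3/5 at length 4 and reaches 1.0 at length 6, which mirrors the point of Solution #1. Paired reads help in a related way: a mate anchored in unique sequence constrains where its repetitive partner can lie.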
32
Post-alignment Analysis
DNA-Seq: SNP calling
RNA-Seq: Quantifying gene expression level
33
Concepts
The reference genome: hg19 (GRCh37)
Main assembly: Chr1-22, X, and Y
3,095,677,412 bp
Target region: exome
Ensembl: 85.3 million bp (2.94%)
RefSeq: Million (2.34%)
CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant exons
34
Target Coverage
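Target coverage is usually summarized by two numbers: mean depth (the average number of reads covering each target base) and breadth (the fraction of target bases covered at or above some minimum depth). A minimal sketch, assuming half-open `(start, end)` intervals on a single chromosome and hypothetical toy coordinates, could be:

```python
def target_depth(target, reads):
    """Per-base depth over a half-open target interval from half-open read intervals."""
    t_start, t_end = target
    depth = [0] * (t_end - t_start)
    for r_start, r_end in reads:
        lo = max(r_start, t_start)  # clip the read to the target region
        hi = min(r_end, t_end)
        for pos in range(lo, hi):   # empty when the read misses the target
            depth[pos - t_start] += 1
    return depth


def summarize(depth, min_depth=1):
    """Return (mean depth, fraction of bases with depth >= min_depth)."""
    mean = sum(depth) / len(depth)
    breadth = sum(1 for d in depth if d >= min_depth) / len(depth)
    return mean, breadth


# Toy example: a 10 bp target and three aligned reads, one of them off-target.
depth = target_depth((100, 110), [(95, 105), (102, 108), (120, 130)])
print(depth)             # [1, 1, 2, 2, 2, 1, 1, 1, 0, 0]
print(summarize(depth))  # (1.1, 0.8)
```

The per-base loop keeps the sketch readable; for exome-sized targets such as those listed on the previous slide, an interval-sweep or array-based method would be the practical choice.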
36
SOLiD color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
37
SNP calling
39
Array-based High-throughput Dataset
40
Limitations of hybridization-based approach
Reliance on existing knowledge about genome sequence
Background noise and a limited dynamic detection range
Cross-experiment comparison is difficult
Requires complicated normalization methods
Wang et al. Nature Reviews Genetics 2009
41
Quantifying gene expression using RNA-Seq data
RPKM: Reads Per Kilobase of exon length per Million mapped reads
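As a worked example of the definition, RPKM divides the number of reads mapped to a gene by its exon length in kilobases and by the total number of mapped reads in millions, which is equivalent to multiplying by 10^9 and dividing by the product of exon length in bp and total mapped reads. The sketch below uses hypothetical numbers, and the function name is illustrative.

```python
def rpkm(reads_in_gene, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon length per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # (reads / (length / 1e3)) / (total / 1e6) == reads * 1e9 / (length * total)
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp of exon, 20 million mapped reads in the sample.
print(rpkm(500, 2_000, 20_000_000))  # 12.5
```

Normalizing by both exon length and sequencing depth is what makes values comparable between genes of different sizes and between samples sequenced to different depths.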
42
Large Dynamic Range
Mortazavi et al. Nature Methods 2008
43
High Reproducibility
Mortazavi et al. Nature Methods 2008
44
High Accuracy
Wang et al. Nature 2008
45
Advantages of RNA-Seq
Not limited to the existing genomic sequence
Very low (if any) background signal
Large dynamic detection range
High reproducibility
High accuracy
Requires less sample
Low cost per base
Wang et al. Nature Reviews Genetics 2009
46
Huge amount of data!
For a typical RNA-Seq SOLiD run:
~2 TB of image files
~120 GB of text files for downstream analysis
~75 M short reads per sample
Efficient methods for data storage and management are needed
47
Considerable sequencing error
High-quality image analysis for base calling
48
Genome alignment and assembly: time consuming and memory demanding
To perform genome mapping for SOLiD data:
32-Opteron HP DL785 with 128 GB of RAM
12~14 hours per sample
High-performance parallel computing
49
Bioinformatics Challenges
Efficient methods to store, retrieve and process huge amounts of data
Reducing errors in image analysis and base calling
Fast and accurate methods for genome alignment and assembly
New algorithms for downstream analyses
50
Experimental Challenges
Library fragmentation
Strand specificity
Wang et al. Nature Reviews Genetics 2009
51
Question & Answer
Han Liang
E-mail: hliang1@mdanderson.org
Tel: