Published by William Fox. Modified over 9 years ago.
1
Genome sequencing Haixu Tang School of Informatics
2
Cell: fundamental working units of every living system
3
A cell works like a self-contained car factory
All information needed to replicate itself and carry out its functions is stored inside the cell
Car encyclopedia (genome)
–The human genome consists of 23 pairs of long double-stranded DNA molecules (chromosomes), around 3 billion base pairs long in total
–Almost every cell in the human body contains the same genome
Machinery: proteins and other molecules
–Make (synthesize) components and assemble them into new offspring
–Guide its functions (car factory & driving school)
4
Overview of a cell
Genome = Encyclopedia
Chromosomes = Volumes
Genes = Chapters
Almost every cell in an organism contains the same complete encyclopedia. But each cell may make different types of cars (cells) with different functions (e.g. SUV or convertible), based on the SAME handbook containing the information to make all types of cars (cells).
5
Genome and genes
Genome: an organism’s genetic material (car encyclopedia)
Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA (chapters on how to make the components of a car, or how to use and drive a car)
6
Basic operations in a cell (Central Dogma) A gene is expressed in two steps 1)Transcription: RNA synthesis 2)Translation: Protein synthesis
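The two steps can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand of a gene, and `CODON_TABLE` holds only the handful of standard genetic code entries needed for the example.

```python
# Tiny subset of the standard genetic code, enough for the example below.
CODON_TABLE = {
    "AUG": "M",                          # methionine, also the start codon
    "GCU": "A", "GCC": "A",              # alanine
    "UGG": "W",                          # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_strand):
    """Step 1: the mRNA reads like the coding strand with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Step 2: read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = not in the subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("atggcttggtaa")
protein = translate(mrna)
print(mrna, protein)
```

A real translator would use the full 64-codon table and would locate the start codon rather than assume the gene begins at position 0.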
7
Cells have a way to modify / expand their handbook
Modification / expansion of the encyclopedia: genetic mutation (genotypes)
New car models: phenotypes
Model design: natural selection / evolution
8
Genome: long double-stranded DNA
The structure and the four genomic letters code for all living organisms: Adenine (A), Guanine (G), Thymine (T), and Cytosine (C), which pair A-T and C-G on complementary strands.
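Because of the A-T and C-G pairing, one strand determines the other. The sketch below computes the partner strand; the helper names `complement` and `reverse_complement` are illustrative, and the reversal reflects the fact that the two strands run in opposite directions.

```python
# Translation table implementing the A-T and C-G pairing rules.
PAIRING = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(strand):
    """Base-by-base partner of a strand (same left-to-right order)."""
    return strand.translate(PAIRING)

def reverse_complement(strand):
    """The partner strand written in its own reading direction."""
    return complement(strand)[::-1]

partner = reverse_complement("AATGC")
print(partner)
```

Taking the reverse complement twice returns the original strand, which is a convenient sanity check.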
9
Genome sequence: four letters alphabet text aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg
10
Gene: a paragraph from a genome sequence aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg
11
DNA Sequencing: History
Sanger method (1977): labeled ddNTPs terminate DNA copying at random points.
Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C).
Both methods generate labeled fragments of varying lengths that are then separated by electrophoresis.
12
Sanger Method: Generating a Read
1. Start at primer (restriction site)
2. Grow DNA chain
3. Include ddNTPs
4. Reaction stops at all possible points
5. Separate products by length, using gel electrophoresis
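The five steps above can be imitated with a toy simulation. This is a sketch under simplifying assumptions, not a model of the chemistry: `template` stands for the strand being synthesized, `p_stop` is a made-up per-base probability of incorporating a labeled ddNTP, and the function names are hypothetical.

```python
import random

def sanger_fragments(template, n_molecules=2000, p_stop=0.1, seed=0):
    """Grow many copies from the primer; each copy may stop at a random base.

    Returns a set of (fragment_length, label_of_last_base) pairs.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for length, base in enumerate(template, start=1):
            if rng.random() < p_stop:          # a ddNTP was incorporated here
                fragments.add((length, base))  # chain terminates, label is kept
                break
    return fragments

def read_gel(fragments):
    """Sort fragments by length (electrophoresis) and read off the labels."""
    return "".join(base for _, base in sorted(fragments))

template = "TAGATTACACAG"
recovered = read_gel(sanger_fragments(template))
print(recovered)
```

With enough molecules every length is represented, so reading the labels from shortest to longest fragment spells out the sequence.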
13
Automatic DNA sequencing
14
Electrophoresis Diagrams
15
Shotgun sequencing
Automated sequencing can accurately (about 1% error rate) determine DNA sequences only 500-800 bases long, so the DNA is cut many times at random (shotgun) and the pieces are sequenced.
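A minimal simulation of shotgun reads, assuming reads are substrings drawn at uniformly random positions with lengths in the 500-800 base range quoted above. The function name `shotgun_reads` and the random test genome are illustrative only; sequencing errors and the reverse strand are not modeled.

```python
import random

def shotgun_reads(genome, n_reads, min_len=500, max_len=800, seed=0):
    """Draw n_reads random substrings of the genome."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)  # uniform start position
        reads.append(genome[start:start + length])
    return reads

# A made-up 5,000-base genome, used only to exercise the function.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = shotgun_reads(genome, n_reads=50)
print(len(reads), min(map(len, reads)), max(map(len, reads)))
```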
16
Fragment Assembly reads ?
17
Fragment assembly: a jigsaw puzzle
18
Solution: try all pairs of pieces
19
Fragment Assembly reads
20
Gaps and Contigs Gap Contig 1 Contig 2
21
Read Coverage (assuming uniform distribution of reads)
Length of genomic segment: L
Number of reads: n
Length of each read: l
Coverage: $C = n l / L$
How much coverage is enough?
Lander-Waterman model: $P(Y = y) = \frac{C^{y} e^{-C}}{y!}$
P(Y=0): C=10 results in 1 gapped region per 1,000,000 nucleotides
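The numbers on this slide can be checked with a short calculation. The sketch assumes a read length of 500 bases (the low end of the range quoted earlier) and uses the standard Lander-Waterman estimate that the expected number of gaps is about n e^{-C} when the minimum detectable overlap is ignored; the function names are illustrative.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """C = n l / L."""
    return n_reads * read_len / genome_len

def prob_covered_y_times(y, c):
    """Poisson probability that a given base is covered by exactly y reads."""
    return c ** y * math.exp(-c) / math.factorial(y)

def expected_gaps(n_reads, c):
    """Expected number of gaps, ignoring the minimum overlap length."""
    return n_reads * math.exp(-c)

L = 1_000_000   # length of the genomic segment
l = 500         # assumed read length
n = 20_000      # number of reads needed for 10-fold coverage

C = coverage(n, l, L)
p_uncovered = prob_covered_y_times(0, C)
gaps = expected_gaps(n, C)
print(C, p_uncovered, gaps)
```

At C = 10 the probability that a base is left uncovered is e^{-10}, roughly 4.5 in 100,000, and the expected number of gaps is a little under one, which matches the "1 gapped region per 1,000,000 nucleotides" figure.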
22
Poisson Distribution
23
Fragment Assembly Cover region with >7-fold redundancy Overlap reads and extend to reconstruct the original DNA sequence reads
24
Fragment assembly: Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs
Consensus: derive the DNA sequence and correct read errors (e.g. ..ACGATTACAATAGGTT..)
25
Overlap Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
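A sketch of the dynamic programming step: an overlap alignment in which the alignment may start anywhere in the first read but must begin at the start of the second read and end at the end of the first. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and `overlap_score` is a hypothetical name.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a suffix of `a` against a prefix of `b`.

    Returns (score, j), where j is the length of the prefix of `b` used.
    """
    m = len(b)
    # Row 0: nothing of `a` consumed yet, so a prefix of `b` costs gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 stays 0: skipping a prefix of `a` is free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The overlap must reach the end of `a` but may stop anywhere in `b`.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

score, prefix_len = overlap_score("TAGATTACACAGATTAC", "ACAGATTACTGA")
print(score, prefix_len)
```

Here the last nine bases of the first read match the first nine bases of the second exactly. The table costs time proportional to the product of the read lengths, which is why a cheap filter is applied before running it on a pair.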
26
Overlapping Reads
Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer
Extend to full alignment; throw away if not >95% similar
[Figure: example alignment of two overlapping reads sharing the segment TAGATTACACAGATTAC]
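The filtration step can be sketched as follows. Note one substitution: instead of sorting all k-mers, this version groups them with a hash-based index, which finds the same pairs. The function name `candidate_pairs` is hypothetical, the example uses k = 5 only because the toy reads are short, and reverse-complement matches are ignored.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=24):
    """Return the set of read-index pairs (i, j), i < j, sharing a k-mer."""
    index = defaultdict(set)                 # k-mer -> ids of reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["TAGATTACACAG", "TACACAGATTAC", "GGGGCCCCGGGG"]
pairs = candidate_pairs(reads, k=5)
print(pairs)
```

Only the pairs returned here would be passed on to the full alignment and the similarity cutoff.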
27
Layout Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA
28
Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting
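Weighted voting over the columns of a multiple alignment might look like the sketch below. The conventions are assumptions made for the example: every row is a string in alignment coordinates, a space means the read does not cover that column, "-" is an alignment gap, and `weights` would typically come from base quality values (equal weights here).

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Pick the highest-weighted symbol in each column of the alignment."""
    if weights is None:
        weights = [1.0] * len(rows)
    width = max(len(row) for row in rows)
    result = []
    for col in range(width):
        votes = defaultdict(float)
        for row, weight in zip(rows, weights):
            if col < len(row) and row[col] != " ":   # read covers this column
                votes[row[col]] += weight
        if votes:
            # sorted() makes ties resolve alphabetically, i.e. deterministically
            winner = max(sorted(votes), key=votes.get)
            if winner != "-":                        # a winning gap emits nothing
                result.append(winner)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
sequence = consensus(rows)
print(sequence)
```

The second read's missing A and its T in place of C are both outvoted, which is how the consensus step corrects individual read errors.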
29
Consensus
A consensus sequence is derived from a profile of the assembled fragments
A sufficient number of reads is required to ensure a statistically significant consensus
Reading errors are corrected
30
Challenges in Fragment Assembly
Repeats: a major problem for fragment assembly
More than 50% of the human genome consists of repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
Green and blue fragments are interchangeable when assembling repetitive DNA
31
Repeat Types
Low-Complexity DNA (e.g. ATATATATACATA…)
Microsatellite repeats $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons
–SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies)
–LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies)
–LTR retroposons: Long Terminal Repeats (~700 bp) at each end
Gene families: genes duplicate and then diverge
Segmental duplications: very long, very similar copies
32
Triazzle: A Fun Example The puzzle looks simple. BUT there are repeats!!!
33
Human genome project
300,000 Bacterial Artificial Chromosomes (BACs, each ~$10^5$ bp long)
“Sequence-ready” map
Shotgun sequencing of each BAC
Order BACs to get the genome sequence
34
Hierarchical genome sequencing Repeat BACs Reads
35
Can whole genome shotgun (WGS) sequencing be done on large eukaryotic genomes, e.g. the human genome?
Cut the genome many times at random (shotgun) into genomic segments
Get two reads (mate pairs), each ~500 bp, from each segment (double-barreled)
36
Double barreled sequencing may resolve repeats Repeat Green and blue fragments are interchangeable when assembling repetitive DNA } Mate pairs
37
Repeats, Errors, and Contig Lengths
Repeats shorter than the read length are OK
Repeats whose copies differ by more base pairs than the sequencing error rate would explain are OK
To resolve a repeat in the genome, try to:
–Increase clone length of mate pairs
38
Layout
Do two aligned fragments really overlap, or are they from two copies of a repeat?
Solution: repeat masking – hide the repeats!!!
Masking results in a high rate of misassembly (up to 20%)
39
Merge Reads into Contigs Merge reads up to potential repeat boundaries repeat region
40
Merge Reads into Contigs Ignore “hanging” reads, when detecting repeat boundaries sequencing error repeat boundary??? b a
41
Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if supported by at least 2 links
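A sketch of the linking rule, assuming the threshold on this slide means at least two mate-pair links between a pair of contigs. Each link is represented as a pair of contig names (one per mate); the function name `supercontigs` is hypothetical, and the order and orientation of contigs within a supercontig are not computed.

```python
from collections import Counter

def supercontigs(contigs, mate_links, min_links=2):
    """Group contigs joined by at least `min_links` mate-pair links."""
    # Count links per unordered contig pair, ignoring mates in the same contig.
    support = Counter(tuple(sorted(link)) for link in mate_links
                      if link[0] != link[1])
    parent = {c: c for c in contigs}         # union-find forest

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]    # path halving
            c = parent[c]
        return c

    for (a, b), count in support.items():
        if count >= min_links:               # enough evidence: connect them
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())

contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
         ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
scaffolds = supercontigs(contigs, links)
print(scaffolds)
```

The single link between c2 and c3 is not trusted, since one stray mate pair can come from a chimeric clone or a repeat, so the example yields two supercontigs.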
42
Link Contigs into Supercontigs Fill gaps in supercontigs with masked reads
43
Scaffolds
44
Finishing: chromosome walking (filling in gaps in assembly)
45
WGS of human genome 2001 Two assemblies of initial human genome sequences published –International Human Genome project –Celera Genomics: WGS approach