Genome sequencing Haixu Tang School of Informatics.


1 Genome sequencing Haixu Tang School of Informatics

2 Cell: fundamental working units of every living system

3 A cell works like a self-contained car factory. All information needed to replicate itself and carry out its functions is stored inside the cell → car encyclopedia (genome) –The human genome consists of 23 pairs of long double-stranded DNA molecules (chromosomes), about 3 billion base pairs long in total. –Almost every cell in the human body contains the same genome. Machinery: proteins and other molecules –Make (synthesize) components and assemble them into new offspring –Guide its functions (car factory & driving school)

4 Overview of a cell Genome = Encyclopedia Chromosomes = Volumes Genes = Chapters Almost every cell in an organism contains the same complete encyclopedia. But each cell may make different types of cars (cells) with different functions (e.g. SUV or convertible), based on the SAME handbook containing the information to make all types of cars (cells).

5 Genome and genes Genome: an organism’s genetic material (car encyclopedia). Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA (chapters on how to make the components of a car, or how to use and drive a car).

6 Basic operations in a cell (Central Dogma) A gene is expressed in two steps 1)Transcription: RNA synthesis 2)Translation: Protein synthesis

7 Cells have a way to modify / expand their handbook Modification / expansion of the encyclopedia → genetic mutation → genotypes New car models (phenotypes) Model design: natural selection / evolution

8 Genome: long double stranded DNA The structure and the four genomic letters code for all living organisms: Adenine (A), Guanine (G), Thymine (T), and Cytosine (C), which pair A-T and C-G on complementary strands.
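The A-T / C-G pairing rule means either strand determines the other. A minimal sketch of that idea follows; the function name `reverse_complement` and the example string are illustrative and not from the slides. It reads the opposite strand in its own 5'-to-3' direction, which is why the sequence is reversed as well as complemented.

```python
# Base-pairing rule from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("AATGCATG"))  # CATGCATT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.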

9 Genome sequence: four letters alphabet text aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

10 Gene: a paragraph from a genome sequence aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

11 DNA Sequencing: History Sanger method (1977): labeled ddNTPs terminate DNA copying at random points. Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C). Both methods generate labeled fragments of varying lengths that are then separated by electrophoresis.

12 Sanger Method: Generating a Read 1. Start at primer (restriction site) 2. Grow DNA chain 3. Include ddNTPs 4. Reaction stops at all possible points 5. Separate products by length, using gel electrophoresis

13 Automatic DNA sequencing

14 Electrophoresis Diagrams

15 Shotgun sequencing Automated sequencing can accurately (1% error rate) determine DNA sequences 500-800 bases long. The DNA is therefore cut many times at random (shotgun) into fragments short enough to read.
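The random cutting can be imitated in a few lines. This is only a toy sketch: the function name `shotgun_reads`, its parameters and the fixed random seed are assumptions made for illustration. It samples error-free substrings and ignores the 1% error rate, strand orientation and cloning bias.

```python
import random


def shotgun_reads(genome, n_reads, min_len=500, max_len=800, seed=0):
    """Sample n_reads random substrings of the genome (idealised, error-free reads)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reads = []
    for _ in range(n_reads):
        # Read length is drawn from the 500-800 base range quoted on the slide,
        # capped at the genome length for very short toy genomes.
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads


if __name__ == "__main__":
    rng = random.Random(1)
    genome = "".join(rng.choice("ACGT") for _ in range(10_000))
    reads = shotgun_reads(genome, n_reads=100)
    print(len(reads), min(map(len, reads)), max(map(len, reads)))
```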

16 Fragment Assembly reads ?

17 Fragment assembly: a jigsaw puzzle

18 Solution: try all pairs of pieces

19 Fragment Assembly reads

20 Gaps and Contigs Gap Contig 1 Contig 2

21 Read Coverage (assuming uniform distribution of reads) Length of genomic segment: $L$. Number of reads: $n$. Length of each read: $l$. Coverage: $C = n l / L$. How much coverage is enough? Lander-Waterman model: $P(Y = y) = \frac{C^{y} e^{-C}}{y!}$. For $P(Y = 0)$: $C = 10$ results in 1 gapped region per 1,000,000 nucleotides.
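The coverage arithmetic can be checked numerically. The sketch below reads Y as the number of reads covering a given position, and uses the standard Lander-Waterman approximation that the expected number of gaps is about n·e^(-C) when the minimum detectable overlap is ignored. The read length of 500 is an assumption taken from the 500-800 base range quoted earlier, and the function names are illustrative.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """C = n * l / L."""
    return n_reads * read_len / genome_len


def poisson_pmf(y, c):
    """P(Y = y) = c^y * e^(-c) / y! for a Poisson variable with mean c."""
    return c ** y * math.exp(-c) / math.factorial(y)


def expected_gaps(n_reads, c):
    """Approximate expected number of gaps: n * e^(-c) (overlap threshold ignored)."""
    return n_reads * math.exp(-c)


if __name__ == "__main__":
    L, l, n = 1_000_000, 500, 20_000  # assumed example values
    c = coverage(n, l, L)
    print(f"coverage      = {c:.1f}")
    print(f"P(Y = 0)      = {poisson_pmf(0, c):.2e}")
    print(f"expected gaps = {expected_gaps(n, c):.2f}")
```

With these assumed values the coverage is 10 and the expected number of gaps comes out just under one per million nucleotides, in line with the figure on the slide.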

22 Poisson Distribution

23 Fragment Assembly Cover region with >7-fold redundancy. Overlap reads and extend to reconstruct the original DNA sequence.

24 Fragment assembly: Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs Consensus: derive the DNA sequence and correct read errors

25 Overlap Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
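One way to sketch the dynamic-programming step is a suffix-prefix ("overlap") alignment. Skipping a prefix of the first read is free, the alignment must start at the beginning of the second read, and it must end at the end of the first read. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not parameters from the slides.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment of a suffix of `a` against a prefix of `b`.

    Returns (score, overlap_len), where overlap_len is the number of
    characters of `b` included in the overlap.
    """
    m = len(b)
    # Row 0: no characters of `a` used yet, so any prefix of `b` costs gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)  # column 0 stays 0: skipping a prefix of `a` is free
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The alignment must consume `a` to its end, so scan only the last row.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


if __name__ == "__main__":
    print(overlap_alignment("TAGATTACACAGATTAC", "CAGATTACTGA"))  # (8, 8)
```

The table has len(a) × len(b) cells, which is why the filtration step matters: it avoids running this on every pair of reads.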

26 Overlapping Reads Sort all k-mers in reads (k ~ 24). Find pairs of reads sharing a k-mer. Extend to full alignment; throw away if not >95% similar. [Figure: two reads aligned base by base over the shared sequence TAGATTACACAGATTAC]
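The filtration step can be sketched with a k-mer index. Note the substitution: the slide sorts all k-mers, whereas this sketch groups them in a hash table, which finds the same candidate pairs. The function name `candidate_pairs` is illustrative, and the demo uses k = 4 only so the toy reads stay short.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=24):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for read_id, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


if __name__ == "__main__":
    reads = ["TAGATTACA", "TTACACAG", "GGGGCCCC"]
    print(candidate_pairs(reads, k=4))  # {(0, 1)}
```

Only the pairs returned here would go on to the full alignment and the 95% similarity check.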

27 Layout Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

28 Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting

29 Consensus A consensus sequence is derived from a profile of the assembled fragments A sufficient number of reads are required to ensure a statistically significant consensus. Reading errors are corrected
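Weighted voting can be sketched column by column. This sketch assumes the layout step has already placed each read at an offset in the contig. The `(offset, sequence, weight)` tuple format and the function name are assumptions made for illustration, and alignment gaps inside reads are not handled.

```python
from collections import Counter, defaultdict


def consensus(placed_reads):
    """Weighted majority vote per column.

    placed_reads: iterable of (offset, sequence, weight) tuples, where
    offset is the read's start position in the contig. Columns that no
    read covers are skipped, so reads are assumed to tile the contig.
    """
    votes = defaultdict(Counter)  # position -> base -> accumulated weight
    for offset, seq, weight in placed_reads:
        for i, base in enumerate(seq):
            votes[offset + i][base] += weight
    return "".join(votes[pos].most_common(1)[0][0] for pos in sorted(votes))


if __name__ == "__main__":
    reads = [
        (0, "TAGATTACAC", 1.0),
        (3, "ATTACACAGA", 1.0),
        (0, "TAGTTTACAC", 1.0),  # read error at position 3 (T instead of A)
        (2, "GATTAC", 1.0),
    ]
    print(consensus(reads))  # TAGATTACACAGA
```

The single erroneous base is outvoted three to one, which is the sense in which enough reads make the consensus reliable. In practice the weights could come from base quality scores.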

30 Challenges in Fragment Assembly Repeats: a major problem for fragment assembly. More than 50% of the human genome consists of repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) Green and blue fragments are interchangeable when assembling repetitive DNA

31 Repeat Types Low-Complexity DNA (e.g. ATATATATACATA…) Microsatellite repeats $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons/retrotransposons –SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies) –LINE: Long Interspersed Nuclear Elements, ~500 - 5,000 bp long, 200,000 copies –LTR retroposons: Long Terminal Repeats (~700 bp) at each end Gene families: genes duplicate and then diverge Segmental duplications: very long, very similar copies

32 Triazzle: A Fun Example The puzzle looks simple. BUT there are repeats!!!

33 Human genome project 300,000 Bacterial Artificial Chromosomes (BACs, each ~$10^5$ bp long) → “sequence-ready” map Shotgun sequencing of each BAC Order BACs to get the genome sequence

34 Hierarchical genome sequencing Repeat BACs Reads

35 Can whole genome shotgun (WGS) sequencing be done on large eukaryotic genomes, e.g. the human genome? Cut the genome many times at random (shotgun) into genomic segments. Get two reads (mate pairs), each ~500 bp, from each segment (double-barreled).

36 Double barreled sequencing may resolve repeats Green and blue fragments are interchangeable when assembling repetitive DNA [Figure: a repeat region spanned by mate pairs]

37 Repeats, Errors, and Contig Lengths Repeats shorter than the read length are OK. Repeats with more base pair differences than the sequencing error rate are OK. To resolve a repeat in the genome, try to: –Increase clone length of mate pairs

38 Layout Do two aligned fragments really overlap, or are they from two copies of a repeat? Solution: repeat masking – hide the repeats!!! Masking results in a high rate of misassembly (up to 20%)

39 Merge Reads into Contigs Merge reads up to potential repeat boundaries repeat region

40 Merge Reads into Contigs Ignore “hanging” reads when detecting repeat boundaries [Figure: reads a and b; a disagreement between reads may be a sequencing error or a repeat boundary]

41 Link Contigs into Supercontigs Find all links between unique contigs Connect contigs incrementally, if ≥ 2 links
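The linking rule can be sketched as follows, assuming the threshold on the slide means at least two mate-pair links. Each mate pair is reduced to the pair of contigs its two reads landed in, and contigs joined by enough links are grouped with a small union-find. The names `supercontigs` and `min_links` and the input format are hypothetical. The sketch only groups contigs; it does not work out their order, orientation or gap sizes.

```python
from collections import Counter


def supercontigs(contigs, mate_links, min_links=2):
    """Group contigs connected by at least `min_links` mate-pair links."""
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair; mates inside one contig add nothing.
    counts = Counter(tuple(sorted(pair)) for pair in mate_links if pair[0] != pair[1])
    for (a, b), n in counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c4", "c3")]
    print(supercontigs(["c1", "c2", "c3", "c4"], links))
    # [['c1', 'c2'], ['c3', 'c4']]
```

Requiring two links rather than one guards against a single chimeric or misplaced mate pair joining unrelated contigs.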

42 Link Contigs into Supercontigs Fill gaps in supercontigs with masked reads

43 Scaffolds

44 Finishing: chromosome walking (filling in gaps in assembly)

45 WGS of human genome 2001 Two assemblies of initial human genome sequences published –International Human Genome project –Celera Genomics: WGS approach

