DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)


1 DNA Sequencing: Some Terminology
insert: a fragment that was incorporated in a circular genome and can be copied (cloned)
vector: the circular genome (host) that incorporated the fragment
BAC: Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb
read: a 500-900 bp long word that comes out of a sequencing machine
coverage: the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing: the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
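The coverage definition can be made concrete with the usual estimate: number of reads times read length, divided by the length of the target. This is a minimal sketch; the function name and the numbers are illustrative assumptions, not values from the slides.

```python
def average_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of reads covering a position: N * L / G."""
    return num_reads * read_length / target_length

# Hypothetical example: 2,000 reads of 500 bp drawn from a 100 kb BAC insert.
print(average_coverage(2000, 500, 100_000))  # 10.0
```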

2 The Walking Method
1. Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
2. Sequence some “seed” clones
3. “Walk” from seeds using clone-ends to pick library clones that extend left & right

3 Walking: An Example

4 Walking off a Single Seed
Cycle time to process one clone: 1-2 months
1. Grow clone
2. Prepare & shear DNA
3. Prepare shotgun library & perform shotgun
4. Assemble in a computer
5. Close remaining gaps
A mammalian genome would need 15,000 walking steps!
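The figure of 15,000 steps is consistent with dividing a roughly 3 Gbp mammalian genome by a 200 kb clone. That derivation is an assumption, not something the slide states. A small sketch of the arithmetic, with a hypothetical helper name:

```python
import math

def walking_steps(genome_bp: int, clone_bp: int) -> int:
    """Sequential steps if each step adds one clone's worth of new sequence."""
    return math.ceil(genome_bp / clone_bp)

steps = walking_steps(3_000_000_000, 200_000)
print(steps)  # 15000

# At 1-2 months per step, a single-seed walk would take on the order of
# a thousand years or more.
print(steps * 1 / 12, steps * 2 / 12)
```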

5 Walking off several seeds in parallel
Few sequential steps
Additional redundant sequencing
In general, can sequence a genome in ~5 walking steps, with <20% redundant sequencing
[Figure panels: Efficient, Inefficient]

6 Using Two Libraries
Most inefficiency comes from closing a small gap with a much larger clone
Solution: Use a second library of small clones

7 Whole Genome Shotgun Sequencing
[Figure labels: genome; cut many times at random; plasmids (2 – 10 Kbp); cosmids (40 Kbp); forward-reverse paired reads; known dist; ~500 bp]

8 Advantages & Disadvantages of different sequencing strategies
Physical Mapping
  ADV. Easy assembly
  DIS. Build physical map
Whole Genome Shotgun (WGS)
  ADV. No mapping
  DIS. Difficult to assemble and resolve repeats
Walking: combines some advantages of both
Other possible method: Shotgun sequencing of 10x BACs without any mapping
  ADV. Can re-sequence hard regions
  DIS. Too many shotgun libraries

9 Fragment Assembly (in whole-genome shotgun sequencing)

10 Fragment Assembly
Given N reads… where N ~ 6 million…
We need to use a linear-time algorithm
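To see why comparing every read against every other read is out of reach at this scale, it helps to count the pairs. A quick sketch; the helper name is illustrative.

```python
def all_pairs(n: int) -> int:
    """Number of unordered read pairs, n choose 2."""
    return n * (n - 1) // 2

# About 1.8e13 pairwise comparisons for N = 6 million reads.
print(all_pairs(6_000_000))
```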

11 Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence: ..ACGATTACAATAGGTT..
Some Terminology
read: a 500-900 bp long word that comes out of the sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig

12 1. Find Overlapping Reads
Reads and their 9-letter words:
aaactgcagtacggatct: aaactgcag aactgcagt … tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac … ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct … tactacaca
List of (word, read, orient., pos.) in read order:
aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca
The same (word, read, orient., pos.) list sorted by word:
aaactgcag aactgcagt acggatcta actgcagta cccaaactg cggatctac ctactacac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc tacggatct tactacaca

13 1. Find Overlapping Reads
Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer
Extend to full alignment – throw away if not >97% similar
[Figure: two reads matched along the shared word TAGATTACACAGATTAC, with the alignment extended on both sides]
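The sort-and-group idea can be sketched in a few lines. This is a simplified illustration, not any assembler's actual code: it handles forward orientation only, skips the extension to a full alignment, and uses the three example reads from the previous slide with k = 9. The function names are hypothetical.

```python
from itertools import combinations, groupby

def kmer_records(reads, k):
    """Yield (word, read_index, position) for every k-mer of every read."""
    for r, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            yield (read[pos:pos + k], r, pos)

def candidate_pairs(reads, k):
    """Sort all k-mers, then report pairs of reads that share a k-mer."""
    records = sorted(kmer_records(reads, k))
    pairs = set()
    # After sorting, identical words are adjacent, so one pass finds them.
    for _, group in groupby(records, key=lambda rec: rec[0]):
        read_ids = sorted({r for _, r, _ in group})
        pairs.update(combinations(read_ids, 2))
    return pairs

reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(sorted(candidate_pairs(reads, 9)))  # [(0, 1), (0, 2)]
```

Only the candidate pairs would then go on to the more expensive full alignment step.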

14 1. Find Overlapping Reads
One caveat: repeats
A k-mer that appears N times initiates N^2 comparisons
ALU: 1,000,000 times
Solution: Discard all k-mers that appear more than c × Coverage times (c ~ 10)
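The repeat filter amounts to counting k-mer occurrences and dropping those above the threshold c × Coverage. A minimal sketch follows; the function name and the toy reads are assumptions made for illustration.

```python
from collections import Counter

def overrepresented_kmers(reads, k, coverage, c=10):
    """Return the k-mers that occur more than c * coverage times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    threshold = c * coverage
    return {word for word, n in counts.items() if n > threshold}

# Toy input with a tiny threshold (c * coverage = 2) so the effect is visible.
toy_reads = ["acgtacgtacgt", "ttttacgt"]
print(sorted(overrepresented_kmers(toy_reads, k=4, coverage=1, c=2)))
# ['acgt', 'tacg']
```

Words in the returned set would be skipped when seeding overlaps, which avoids the quadratic blow-up from high-copy repeats.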

15 1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

16 1. Find Overlapping Reads (cont’d)
Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Column with a mismatch, before correction: C: 20, C: 35, T: 30, C: 35, C: 40
After correction: C: 20, C: 35, C: 0, C: 35, C: 40
Column with a gap, before correction: A: 15, A: 25, A: 40, A: 25, -
After correction: A: 15, A: 25, A: 40, A: 25, A: 0
Score alignments
Accept alignments with good scores
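The two quality-score columns on this slide suggest a simple rule: the base with the largest summed quality wins, and any corrected entry is given quality 0. The sketch below implements that reading. Always correcting toward the winner, with no minimum margin, is an assumption made for brevity, and the function name is hypothetical.

```python
from collections import defaultdict

def correct_column(column):
    """column: list of (base, quality) pairs, one per aligned read; '-' is a gap.

    Returns the column with every dissenting entry replaced by the
    quality-weighted majority base and given quality 0.
    """
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    winner = max(totals, key=totals.get)
    return [(b, q) if b == winner else (winner, 0) for b, q in column]

mismatch = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
gap = [("A", 15), ("A", 25), ("A", 40), ("A", 25), ("-", 0)]
print(correct_column(mismatch))
print(correct_column(gap))
```

Assigning quality 0 to a corrected base keeps it from adding weight to later votes.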

17 2. Merge Reads into Contigs
Merge reads up to potential repeat boundaries
[Figure labels: repeat region, Unique Contig, Overcollapsed Contig]

18 2. Merge Reads into Contigs
Overlap graph:
  Nodes: reads r_1 … r_n
  Edges: overlaps (r_i, r_j, shift, orientation, score)
Remove transitively inferrable overlaps
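Removing transitively inferrable overlaps can be illustrated on a stripped-down graph. In this sketch an edge carries only a shift (orientation and score are omitted), and an edge i→k counts as inferrable when some j gives i→j and j→k with shifts that add up. The quadratic loop is for clarity only, and all names are hypothetical.

```python
def remove_transitive(overlaps):
    """overlaps: dict {(i, j): shift}, where read j starts `shift` bases after read i.

    Returns a copy without edges that follow from two other edges.
    """
    inferrable = set()
    for (i, j), s_ij in overlaps.items():
        for (j2, k), s_jk in overlaps.items():
            if j2 == j and overlaps.get((i, k)) == s_ij + s_jk:
                inferrable.add((i, k))
    return {edge: s for edge, s in overlaps.items() if edge not in inferrable}

graph = {(0, 1): 5, (1, 2): 4, (0, 2): 9, (2, 3): 6}
print(remove_transitive(graph))  # {(0, 1): 5, (1, 2): 4, (2, 3): 6}
```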

19 Overlap graph after forming contigs

20 Repeats, errors, and contig lengths
Repeats shorter than read length are OK
Repeats with more base pair diffs than sequencing error rate are OK
To make the genome appear less repetitive, try to:
  Increase read length
  Decrease sequencing error rate
Role of error correction: discards ~90% of single-letter sequencing errors
decreases error rate → decreases effective repeat content → increases contig length

21 2. Merge Reads into Contigs
Ignore non-maximal reads
Merge only maximal reads into contigs
[Figure label: repeat region]

22 2. Merge Reads into Contigs
Ignore “hanging” reads when detecting repeat boundaries
[Figure labels: sequencing error, repeat boundary???, a, b]

23 2. Merge Reads into Contigs
Insert non-maximal reads whenever unambiguous
[Figure labels: ?????, Unambiguous]

24 3. Link Contigs into Supercontigs
Normal density
Too dense → Overcollapsed
Inconsistent links → Overcollapsed?

25 3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
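The linking rule can be sketched as counting mate-pair links per contig pair and joining pairs that reach the threshold, assumed here to be at least 2 links. The sketch only groups contigs into supercontigs; ordering and orientation are left out, and the function name and input format are assumptions.

```python
from collections import Counter

def supercontigs(num_contigs, mate_links, min_links=2):
    """Group contigs joined by at least `min_links` mate-pair links.

    mate_links: iterable of (contig_a, contig_b), one entry per mate pair
    whose two reads fall in different unique contigs.
    """
    counts = Counter(tuple(sorted(link)) for link in mate_links)
    parent = list(range(num_contigs))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for contig in range(num_contigs):
        groups.setdefault(find(contig), []).append(contig)
    return sorted(groups.values())

links = [(0, 1), (1, 0), (1, 2), (2, 3), (2, 3), (3, 2)]
print(supercontigs(4, links))  # [[0, 1], [2, 3]]
```

Contigs 1 and 2 stay apart in this example because a single link could come from a chimeric or misplaced read.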

26 3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs

27 4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
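Both consensus rules can be compared on a toy alignment. In this sketch, rows are equal-length strings with '-' for a gap, each base has a quality that serves as its voting weight, and every column is assumed to be covered by at least one read. The input format and names are assumptions made for illustration.

```python
from collections import defaultdict

def consensus(rows, quals, rule="vote"):
    """Derive a consensus from aligned rows and per-base qualities.

    rule="vote": sum qualities per letter in each column (weighted voting).
    rule="max":  take the letter carrying the single highest quality.
    Columns whose winner is a gap contribute nothing to the output.
    """
    out = []
    for col in range(len(rows[0])):
        score = defaultdict(int)
        for row, q in zip(rows, quals):
            base = row[col]
            if rule == "vote":
                score[base] += q[col]
            else:
                score[base] = max(score[base], q[col])
        best = max(score, key=score.get)
        if best != "-":
            out.append(best)
    return "".join(out)

rows = ["ACGT", "ACTT", "ACTT"]
quals = [[40] * 4, [25] * 4, [25] * 4]
print(consensus(rows, quals))              # ACTT
print(consensus(rows, quals, rule="max"))  # ACGT
```

The example is built so the two rules disagree: two medium-quality T's outvote one high-quality G, while the maximum-quality rule keeps the G.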

28 Some Assemblers
PHRAP: Early assembler, widely used, good model of read errors. Overlap O(n^2) -> layout (no mate pairs) -> consensus
Celera: First assembler to handle large genomes (fly, human, mouse). overlap -> layout -> consensus
Arachne: Public assembler (mouse, several fungi). overlap -> layout -> consensus
Phusion: overlap -> clustering -> PHRAP -> assemblage -> consensus
Euler: indexing -> Euler graph -> layout by picking paths -> consensus

29 Quality of assemblies Celera’s assemblies of human and mouse

30 Quality of assemblies—mouse

31

32 Quality of assemblies—rat

33 History of WGA
1982: λ-virus, 48,502 bp
1995: H. influenzae, 1 Mbp
2000: fly, 100 Mbp
2001 – present: human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
1997, Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”

34 Next few lectures
More on alignments
Large-scale global alignment – Comparing entire genomes
  Suffix trees, sparse dynamic programming
  MUMmer, Avid, LAGAN, Shuffle-LAGAN
Multiple alignment – Comparing proteins, many genomes
  Scoring, Multidimensional-DP, Center-Star, Progressive alignment
  CLUSTALW, TCOFFEE, MLAGAN
Gene recognition
  Gene recognition on a single genome
    GENSCAN – An HMM for gene recognition
  Cross-species comparison-based gene recognition
    TWINSCAN – An HMM
    SLAM – A pair-HMM

