Download presentation
Presentation is loading. Please wait.
1
CS296-5 Genomes, Networks, and Cancer
2
Course Organization Seminar style Grading: CS requirements:
10% participation 30% paper reviews (3) 20% presentations (1-2) 40% final project Extra credit: 5% for report on visiting seminars (pre-arranged): Maximum 2 Feb. 1, 4pm (Beerenwinkel), Feb.15, 4pm (E. Myers) CS requirements: PhD, Area B (Algorithms), ScM, Theory or Practice Meeting time
3
Biology 101 Central Dogma
4
Human Genome Sequenced 2000-2003, ???
Now what?
5
Course Topics Genomes Networks Cancer
6
Course Topics Genomes Genome Assembly Genome Rearrangements Networks
Cancer
7
Comparative Genomic Architectures: Mouse vs Human Genome
Humans and mice have similar genomes, but their genes are ordered differently ~245 rearrangements Reversals Fusions Fissions Translocation
8
Genome rearrangements
Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) What are the similarity blocks and how to find them? What is the architecture of the ancestral genome? What is the evolutionary scenario for transforming one genome into the other?
9
Reversals 1 3 2 4 10 5 6 8 9 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 Blocks represent conserved genes.
10
Reversals 1 2 3 9 10 8 4 7 5 6 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.
11
Reversals and Breakpoints
1 2 3 9 10 8 4 7 5 6 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 The reversion introduced two breakpoints (disruptions in order).
12
Reversal Distance Problem
Goal: Given two permutations, find the shortest series of reversals that transforms one into another Input: Permutations p and s Output: A series of reversals r1,…rt transforming p into s, such that t is minimum t - reversal distance between p and s d(p, s) - smallest possible value of t, given p and s
13
Course Topics Genomes Networks Cancer Network Alignment
Network Integration
14
Regulatory Network
15
Protein-Protein Interaction (PPI) Network
16
Protein-Protein Interaction Network?
Proteins are nodes Interactions are edges Edges may have weights Yeast PPI network H. Jeong et al. Lethality and centrality in protein networks. Nature 411, 41 (2001)
17
Network Alignment Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp , 2006
18
The Network Alignment Problem
Given: k different interaction networks belonging to different species, Find: Conserved sub-networks within these networks Conserved defined by protein sequence similarity (node similarity) and interaction similarity (network topology similarity)
19
Course Topics Genomes Networks Cancer
Measuring mutation in cancer genomes Modeling cancer progression
20
Tumor Genomes Chromosomal aberrations
Mutation and selection Compromised genome stability Chromosomal aberrations Structural: translocations, inversions, fissions, fusions. Copy number changes: gain and loss of chromosome arms, segmental duplications/deletions.
21
Tumor Genome Architecture
Maybe time for a “Cancer Genome Project”?? What are detailed architectures of tumor genomes? What sequence of rearrangements produce these architectures?
22
Array Comparative Genomic Hybridization (aCGH)
23
Next: Tumor Genome → Phenotype Gene Networks and Pathways
Integration of Multiple Data Sources ESP and copy number Mutation Expression (mRNA and miRNA) Binding (ChIP-chip) Pathways Epigenetics activation repression Duplicated genes Deleted genes
24
Additional Topics: Rearrangements in Genetics
Interaction b/w genome rearrangements and single nucleotide polymorphisms (SNPs) Detecting Rearrangements Under Selection Population Substructure
25
DNA Sequencing
26
DNA sequencing How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
27
DNA Sequencing – Overview
1975 Gel electrophoresis Predominant, old technology by F. Sanger Whole genome strategies Physical mapping Walking Shotgun sequencing Computational fragment assembly The future—new sequencing technologies Pyrosequencing, single molecule methods, … Assembly techniques Future variants of sequencing Resequencing of humans Microbial and environmental sequencing Cancer genome sequencing 2015
28
DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA
Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~500 letters at a time
29
DNA Sequencing – vectors
Shake DNA fragments Known location (restriction site) Vector Circular genome (bacterium, plasmid) + =
30
Different types of vectors
Size of insert Plasmid 2,000-10,000 Can control the size Cosmid 40,000 BAC (Bacterial Artificial Chromosome) 70, ,000 YAC (Yeast Artificial Chromosome) > 300,000 Not used much recently
31
DNA Sequencing – gel electrophoresis
Start at primer (restriction site) Grow DNA chain Include dideoxynucleoside (modified a, c, g, t) Stops reaction at all possible points Separate products with length, using gel electrophoresis
32
Electrophoresis diagrams
33
Challenging to read answer
34
Challenging to read answer
35
Challenging to read answer
36
Reading an electropherogram
Filtering Smoothening Correction for length compressions A method for calling the letters – PHRED PHRED – PHil’s Read EDitor (by Phil Green) Several better methods exist, but labs are reluctant to change
37
Output of PHRED: a read A read: 500-700 nucleotides
A C G A A T C A G …A …21 Quality scores: -10log10Prob(Error) Reads can be obtained from leftmost, rightmost ends of the insert Double-barreled sequencing: (1990) Both leftmost & rightmost ends are sequenced, reads are paired
38
Method to sequence longer regions
genomic segment cut many times at random (Shotgun) Get one or two reads from each segment ~500 bp ~500 bp
39
Reconstructing the Sequence (Fragment Assembly)
reads Cover region with ~7-fold redundancy (7X) Overlap reads and extend to reconstruct the original genomic region
40
Definition of Coverage
Length of genomic segment: L Number of reads: n Length of each read: l Definition: Coverage C = n l / L How much coverage is enough? Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides
41
Challenges in Fragment Assembly
Repeats: A major problem for fragment assembly > 50% of human genome are repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) Repeat Green and yellow fragments are interchangeable when assembling repetitive DNA
42
Triazzle: A Fun Example
The puzzle looks simple BUT there are repeats!!! The repeats make it very difficult. Try it – only $7.99 at
43
Repeat Types Bacterial genomes: 5% Mammals: 50% Repeat types:
Low-Complexity DNA (e.g. ATATATATACATA…) Microsatellite repeats (a1…ak)N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons SINE (Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, 106 copies LINE (Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies LTR retroposons (Long Terminal Repeats (~700 bp) at each end) cousins of HIV Gene Families genes duplicate & then diverge (paralogs) Recent duplications ~100,000-long, very similar copies
44
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT 3x109 nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions
45
What can we do about repeats?
Two main approaches: Cluster the reads Link the reads
46
What can we do about repeats?
Two main approaches: Cluster the reads Link the reads
47
What can we do about repeats?
Two main approaches: Cluster the reads Link the reads
48
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT 3x109 nucleotides A R B ARB, CRD or ARD, CRB ? C R D
49
Sequencing and Fragment Assembly
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT 3x109 nucleotides
50
Strategies for whole-genome sequencing
Hierarchical – Clone-by-clone Break genome into many long pieces Map each long piece onto the genome Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat Online version of (1) – Walking Start sequencing each piece with shotgun Construct map as you go Example: Rice genome Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog
51
Whole Genome Shotgun Sequencing
cut many times at random plasmids (2 – 10 Kbp) forward-reverse paired reads known dist cosmids (40 Kbp) ~500 bp ~500 bp
52
We need to use a linear-time algorithm
Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm
53
Fragment Assembly Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”) Until late 1990s the shotgun fragment assembly of human genome was viewed as intractable problem
54
Shortest Superstring Problem
Problem: Given a set of strings, find a shortest string that contains all of them Input: Strings s1, s2,…., sn Output: A string s that contains all strings s1, s2,…., sn as substrings, such that the length of s is minimized Complexity: NP – complete Note: this formulation does not take into account sequencing errors
55
Shortest Superstring Problem: Example
56
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
57
Overlap Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
58
Overlapping Reads Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer Extend to full alignment – throw away if not >95% similar TACA TAGT || TAGATTACACAGATTAC T GA TAGA | || ||||||||||||||||| TAGATTACACAGATTAC
59
Overlapping Reads and Repeats
A k-mer that appears N times, initiates N2 comparisons For an Alu that appears 106 times 1012 comparisons – too much Solution: Discard all k-mers that appear more than t Coverage, (t ~ 10)
60
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA
61
Find Overlapping Reads
Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A correlated errors— probably caused by repeats disentangle overlaps replace T with C TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA In practice, error correction removes up to 98% of the errors TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA
62
Layout Repeats are a major challenge
Do two aligned fragments really overlap, or are they from two copies of a repeat? Solution: repeat masking – hide the repeats!!! Masking results in high rate of misassembly (up to 20%) Misassembly means alot more work at the finishing step
63
Merge Reads into Contigs
repeat region Unique Contig Overcollapsed Contig We want to merge reads up to potential repeat boundaries
64
Repeats, errors, and contig lengths
Repeats shorter than read length are OK Read that spans across a repeat disambiguates order of flanking regions Repeats with more base pair diffs than sequencing error rate are OK We throw overlaps between two reads in different copies of the repeat To make the genome appear less repetitive, try to: Increase read length Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors decreases error rate decreases effective repeat content increases contig length
65
Link Contigs into Supercontigs
Normal density Too dense Overcollapsed Inconsistent links Overcollapsed?
66
Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs Connect contigs incrementally, if 2 links supercontig (aka scaffold)
67
Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
68
Consensus A consensus sequence is derived from a profile of the assembled fragments A sufficient number of reads is required to ensure a statistically significant consensus Reading errors are corrected
69
Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments Derive each consensus base by weighted voting (Alternative: take maximum-quality letter)
70
Some Assemblers PHRAP Celera Arachne Phusion Euler
Early assembler, widely used, good model of read errors Overlap O(n2) layout (no mate pairs) consensus Celera First assembler to handle large genomes (fly, human, mouse) Overlap layout consensus Arachne Public assembler (mouse, several fungi) Phusion Overlap clustering PHRAP assemblage consensus Euler Indexing Euler graph layout by picking paths consensus
71
EULER - A New Approach to Fragment Assembly
Traditional “overlap-layout-consensus” technique has a high rate of mis-assembly EULER uses the Eulerian Path approach borrowed from “sequencing by hybridization” (SBH) Fragment assembly without repeat masking can be done in linear time with greater accuracy
72
l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset of all possible (n – l + 1) l-mers in a string s of length n The order of individual elements in Spectrum ( s, l ) does not matter For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}
73
l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset of all possible (n – l + 1) l-mers in a string s of length n The order of individual elements in Spectrum ( s, l ) does not matter For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}
74
The SBH Problem Goal: Reconstruct a string from its l-mer composition
Input: A set S, representing all l-mers from an (unknown) string s Output: String s such that Spectrum ( s,l ) = S
75
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG } H ATG AGG TGC TCC GTC GGT GCA CAG ATG C A G G T C C Path visited every VERTEX once
76
SBH: Hamiltonian Path Approach
A more complicated graph: S = { ATG TGG TGC GTG GGC GCA GCG CGT }
77
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1: ATGCGTGGCA Path 2: ATGGCGTGCA
78
SBH: Eulerian Path Approach
S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT } Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT, CA, CG } Edges correspond to l – mers from S AT GT CG CA GC TG GG Path visited every EDGE once
79
SBH: Eulerian Path Approach
S = { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths: GT CG GT CG AT TG GC AT TG GC CA CA GG GG ATGGCGTGCA ATGCGTGGCA
80
Euler Theorem A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing edges: in(v)=out(v) Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
81
Euler Theorem: Proof Eulerian → balanced
for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v)=out(v) Balanced → Eulerian ???
82
Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.
83
Algorithm for Constructing an Eulerian Cycle (cont’d)
b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
84
Algorithm for Constructing an Eulerian Cycle (cont’d)
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).
85
Euler Theorem: Extension
Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced.
86
Overlap Graph: Hamiltonian Approach
Each vertex represents a read from the original sequence. Vertices from repeats are connected to many others. Repeat Find a path visiting every VERTEX exactly once: Hamiltonian path problem
87
Overlap Graph: Eulerian Approach
Repeat Placing each repeat edge together gives a clear progression of the path through the entire sequence. Find a path visiting every EDGE exactly once: Eulerian path problem
88
Multiple Repeats Can be easily constructed with any number of repeats
89
Building Repeat Graph Problem: Construct the repeat graph from a collection of reads. Solution: Break the reads into smaller pieces. ?
90
Building Repeat Graph Reads are constructed from an original sequence in lengths that allow biologists a high level of certainty. They are then broken again into k-mers
91
Construction of Repeat Graph (cont’d)
Error correction in reads: “consensus first” approach to fragment assembly. Makes reads (almost) error-free BEFORE the assembly even starts. Using reads and mate-pairs to simplify the repeat graph (Eulerian Superpath Problem).
92
Approaches to Fragment Assembly
Find a path visiting every VERTEX exactly once in the OVERLAP graph: Hamiltonian path problem NP-complete: algorithms unknown
93
Approaches to Fragment Assembly (cont’d)
Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem Linear time algorithms are known
94
Minimizing Errors If an error exists in one of the 20-mer reads, the error will be perpetuated among all of the smaller pieces broken from that read.
95
Minimizing Errors (cont’d)
However, that error will not be present in the other instances of the 20-mer read. So it is possible to eliminate most point mutation errors before reconstructing the original sequence.
96
Genomes Sequenced
97
Some new sequencing technologies
98
Nanopore Sequencing http://www.mcb.harvard.edu/branton/index.htm
Figure 1 . A nanopore sensor for sequencing DNA. A channel or nanopore in an insulating membrane separates two ionic solution-filled compartments. In response to a voltage bias (labeled “ - ” and “+”) across the membrane, ssDNA molecules (yellow) in the “-” compartment are driven, one at a time, into and through the channel. Embedded in the membrane, an electrically connected nanotube (orange) that abuts on the nanopore serves as a sensor to identify the nucleotides in the translocating DNA molecules. Elevated temperatures and denaturants maintain the DNA in an unstructured, single-stranded form. The underlying principle of nanopore sequencing is that a single-stranded DNA or RNA molecule can be electrophoretically driven through a nano-scale pore in such a way that the molecule traverses the pore in strict linear sequence, as illustrated in Figure 1. Because a translocating molecule partially obstructs or blocks the nanopore, it alters the pore's electrical properties 1 .
99
Nanopore Sequencing—Assembly
Resulting reads are likely to look different than Sanger reads: Long (perhaps 10,000bp-1,000,000bp) High error rate (perhaps 10% – 30%) Two colors? A/ CTG AT/ CG AG/ CT How can we assemble under such conditions?
100
Polony Sequencing "Polonies" are tiny colonies of DNA, about one micron in diameter, grown on a glass microscope slide (the word itself is a contraction of "polymerase colony"). To create them, researchers first pour a solution containing chopped-up DNA onto the slide. Adding an enzyme called polymerase causes each piece to copy itself repeatedly, creating millions of polonies, each dot containing only copies of the original piece of DNA. The polonies are then exposed to a series of chemically-labeled probes that light up when run through a scanning machine, identifying each nucleotide base in the strand of code, much as dusting with powder allows crime-scene investigators to bring up fingerprints on a surface. Prior to sequencing, dsDNA is denatured and unbound copy strands are washed away. - Covalently linked template strands allow for washing.
101
Polony sequencing—Assembly
? Resulting reads are likely to look different than Sanger reads: Short (currently 100 to 200 bp) Low error rates, except in homopolymeric runs (AAA…, CCC…, etc) Currently, not known how to do paired reads on a chip. Maybe very soon!
102
Some future directions for sequencing
Personalized genome sequencing Find your ~1,000,000 single nucleotide polymorphisms (SNPs) Find your rearrangements Goals: Link genome with phenotype Provide personalized diet and medicine (???) designer babies, big-brother insurance companies Timeline: Inexpensive sequencing: Genotype–phenotype association: 2010-??? Personalized drugs: ???
103
Some future directions for sequencing
2. Environmental sequencing Find your flora: organisms living in your body External organs: skin, mucous membranes Gut, mouth, etc. Normal flora: >200 species, >trillions of individuals Flora–disease, flora–non-optimal health associations Timeline: Inexpensive research sequencing: today Research & associations within next 10 years Personalized sequencing Find diversity of organisms living in different environments Hard to isolate Assembly of all organisms at once
104
Some future directions for sequencing
Organism sequencing Sequence a large fraction of all organisms Deduce ancestors Reconstruct ancestral genomes Synthesize ancestral genomes Clone—Jurassic park! Study evolution of function Find functional elements within a genome How those evolved in different organisms Find how modules/machines composed of many genes evolved
105
Sources http://bioalgorithms.info
Serafim Batzoglou (Sequencing slides)
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.