1
BIONF/BENG 203: Functional Genomics
Gene expression. Trey Ideker and Vineet Bafna. TA: Ngo Vu
2
The dynamic nature of the cell
The molecules in the body, RNA and proteins, are constantly turning over:
- New ones are 'created' through transcription and translation
- Proteins are modified post-translationally
- Active molecules interact with each other in functional networks
- 'Old' molecules are degraded
3
Measuring Molecular States
A key part of functional genomics is to observe (changes in) molecular states via the abundance of functional molecules:
- Expressed transcripts: microarray hybridization or RNA-seq to 'count' the number of copies of each RNA
- Expressed proteins: mass spectrometry is used to 'count' the number of copies of each protein
4
Expression analysis and the perturbation matrix
Preprocessing functional genomics data: our goal is to create a perturbation matrix.
- Rows are molecules (transcripts, proteins, peptides, ...)
- Columns are experiments (samples, or perturbations)
- Entries are normalized abundance values
Week 1: creating the matrix for transcripts. Week 3: creating the matrix for proteins. Weeks 4-7: analysis of expression data.
6
RNA-seq: ~200M reads
7
NGS and expression. Sample RNA and sequence it via the RNA-seq protocol.
Map fragments to the genome, then normalize and create an expression matrix: rows are transcripts, columns are samples/experiments.
8
NGS and expression. Input: m bp sequenced from the sample (usually as short reads), and a database of length n. Output: the mapping coordinates of each short read.
9
The computational challenge
A single Illumina sequencing run can produce billions (~3×10^9) of reads of length ~100bp. An RNA-seq sample for mammals can be as large as 200M reads. Each read must be aligned to the reference human assembly (3Gb).
10
Alignment. [Figure: RNA-seq reads aligned to the human reference]
Recall that local alignment of two strings of sizes n and m requires ~nm steps. Typically n = 3×10^9 and m ≅ 2×10^10.
11
On the other hand... While full alignment is prohibitively expensive, the situation changes if we match without any errors (indels/substitutions): typically n+m steps suffice. Two ideas:
- Preprocess and index the queries (sampled reads) in O(m) steps, then scan the database in O(n) steps for ALL queries.
- Preprocess and index the database in O(n) steps, then search each query in time proportional to its length.
12
Reconciling: can we use the ideas of exact matching to speed up mapping?
13
The Pigeonhole principle
True or false? No two persons in San Diego have exactly the same number of hairs (not counting bald people).
15
The Pigeonhole principle of Combinatorics
If there are n pigeonholes and n+1 pigeons, then any assignment will place at least 2 pigeons in some pigeonhole.
16
Observation 1 Applying the pigeonhole principle
Suppose we are looking for a database string with greater than 90% identity to the query (length 100), i.e., at most 9 mismatches. Partition the query into 10 substrings of size 10. At least one of them must match the database string exactly.
17
Expected number of exact matches is small!
Idea: do an exact match for keywords of small size (e.g., k = 30), and a full alignment only around the exact matches. The pigeonhole principle suggests that the true mapping will not be missed. What about speed?
Expected number of random matches = mn(0.25)^k. For n = 3×10^9, m = 2×10^11, k = 30, the expected number of matches is ≈ 520.
Number of computations = time for scanning + time for alignment. Here the time for alignment is negligible, so the total time is dominated by the scan, roughly O(n+m).
18
String matching to speed up computations
Idea: find exact matches of query substrings in the database, and do an alignment only near those exact matches. [Figure: RNA-seq reads seeded against the human reference]
19
Fast mapping. All mapping algorithms use the same basic strategy:
- Use exact matching to identify database locations where a query string might match
  - Speed up by indexing the database (e.g., BWA)
  - Speed up by indexing the query strings (e.g., BLAST)
  - Some tools also do approximate matching
- Use fast alignment techniques to quickly extend the alignments; alignment is applied only to the few locations where an exact match has occurred
- Engineering plays as important a role as algorithms
20
Exploiting NGS properties
In most NGS technologies, the sequence has very high quality at the beginning of the read. In BWA, for example, at most 2 errors are allowed in the first 32 bp.
21
Indexing sequence databases
22
Indexing sequence databases
String hash tables. [Diagram: hash table mapping each k-mer to its positions in the database]
Pros: fast search time, O(k) per query string.
Cons: large memory requirement, O(4^k + m); for k = 30, 4^30 ≈ 10^18.
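A minimal sketch of such a hash-table index (toy strings; real mappers typically pack k-mers into integers and store positions more compactly):

```python
from collections import defaultdict

def build_kmer_index(db: str, k: int) -> dict:
    """Map every length-k substring of db to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i+k]].append(i)
    return index

# Usage: each query k-mer is found in O(k) expected time (the cost of hashing it),
# at the price of one table entry per database position.
db = "acaacgacaacg"
index = build_kmer_index(db, k=3)
print(index["caa"])   # -> [1, 7]
```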
23
Suffix trees: the trie of all suffixes of the string (here acaacg$).
Pros: low theoretical memory requirement, O(m), as redundancies get merged into the trie; fast search (linear in the size of the query).
Cons: the memory requirement is large in practice (the constant in O(m) is ~60); suffix arrays might help a bit.
[Diagram: suffix trie of acaacg$]
24
The BW transform: a clever, memory-efficient index.
Add a special symbol $ (here: acaacg$). Rotate m times to get m strings:
acaacg$, caacg$a, aacg$ac, acg$aca, cg$acaa, g$acaac, $acaacg
25
The BW transform. Rotate m times to get m strings, sort lexicographically, and use the last column (B) together with the first-column positions:

Row  Sorted rotation  B[i]  Occ(a,i)  Occ(c,i)  Occ(g,i)
0    $acaacg          g     0         0         1
1    aacg$ac          c     0         1         1
2    acaacg$          $     0         1         1
3    acg$aca          a     1         1         1
4    caacg$a          a     2         1         1
5    cg$acaa          a     3         1         1
6    g$acaac          c     3         2         1
26
Lexicographic sort. Lexicographic = dictionary sort. Note the connection to the suffix tree: the sorted rotations are ordered exactly as the corresponding suffixes appear in the trie. [Figure: sorted rotations of acaacg$ beside the suffix trie]
27
Maintaining the first column.
For every symbol σ, define Next(σ) and Prev(σ) in alphabet order ($ < a < c < g): Next(a) = c, Prev(a) = $.
Note that the first column is sorted, so all we need to keep is the index of the first occurrence of each symbol: Pos($) = 0, Pos(a) = 1, Pos(c) = 4, Pos(g) = 6.
Memory used: O(|Σ|) (alphabet size).
28
The BW transform (index on the last column).
B[i]: the last symbol of the i-th sorted rotation; e.g., B[3] = a.
Space requirement: 2n bits (assuming 2 bits per symbol).
29
The BWT properties I.
In each row of the sorted rotation matrix, the symbol in the last column immediately precedes the symbol in the first column in the actual string.
30
BWT property 2: the LF property.
The i-th occurrence of a symbol in the last column and the i-th occurrence of the same symbol in the first column correspond to the same character of the original string.
31
LF property proof.
Number each symbol by its occurrence in the string.
Lemma: σ_i appears before σ_j in the first column iff σ_i appears before σ_j in the last column.
Proof: suppose σ_i appears before σ_j in the first column, and let σ_i x and σ_j y denote the corresponding rows. As the first symbol is the same, x precedes y lexicographically, so the row beginning with x precedes the row beginning with y: Loc(x) < Loc(y). But B[Loc(x)] = σ_i and B[Loc(y)] = σ_j, so σ_i also appears before σ_j in the last column.
32
Using the BW transform to query.
Given a word q, does q exist in the database string D (represented by its BW transform)? All occurrences of q appear as prefixes of consecutive rows of the sorted rotation matrix, so we only need to find the range (F, L) of those rows. We proceed recursively on the symbols of q.
33
Matching single-symbol strings.
If |q| = 1 then:
  F = Pos(q)
  L = Pos(Next(q)) - 1
34
The general case. Let q = σw. Recursively, let F = first position (row) of w and L = last position of w. Can we find the first and last positions of σw?
35
The general case. Consider the occurrences of σ in the first column, starting at Pos(σ). Clearly the rows beginning with σw lie within this range, but we cannot tell where. If we knew two values, we'd be done:
- o1: the number of σ-rows that precede the σw-rows
- o2: the number of occurrences of σw
36
The general case. Next, consider the occurrences of σ in the BW transform (last column).
Claim: o2 is the number of rows that start with w and end in σ (proof: BWT property 1).
Claim: o1 is the number of occurrences of σ in the last column before row F, the first row of the w-range (proof: BWT property 2, the LF property).
37
The BW transform: the final data structure.
Keep Pos, the BW transform B, and Occ(σ,i) = the number of occurrences of σ in B[0]...B[i].
Space = 2n + |Σ| n log n bits (2n + 4n log n bits for DNA).
Now o1 = Occ(σ, F-1) and o2 = Occ(σ, L) - o1.
38
Exact matching of a string.

GetRange(σw):                  // example: σw = ac, so σ = a and w = c
  (F, L) = GetRange(w)         // here F = 4, L = 5 (rows starting with c)
  o1 = Occ[σ, F-1]             // here o1 = Occ[a,3] = 1
  p1 = Occ[σ, L]               // here p1 = Occ[a,5] = 3
  F1 = Pos[σ] + o1             // = 1 + 1 = 2
  L1 = Pos[σ] + p1 - 1         // = 1 + 3 - 1 = 3
  return (F1, L1)              // rows 2 and 3 are exactly those starting with ac

Sorted rotations of acaacg$:
0. $acaacg  1. aacg$ac  2. acaacg$  3. acg$aca  4. caacg$a  5. cg$acaa  6. g$acaac
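A runnable toy version of the whole scheme. Building the BWT from sorted rotations is quadratic and only suitable for illustration; production FM-indexes construct the BWT via suffix arrays and sample the Occ table instead of storing it fully:

```python
def bwt_index(text: str):
    """Build a toy FM-index: BWT via sorted rotations, plus Pos and Occ tables."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = [row[-1] for row in rotations]
    pos = {}                                       # Pos[s]: first row starting with s
    for i, row in enumerate(rotations):
        pos.setdefault(row[0], i)
    occ = {s: [0] * len(bwt) for s in set(bwt)}    # Occ[s][i]: count of s in bwt[0..i]
    counts = {s: 0 for s in set(bwt)}
    for i, s in enumerate(bwt):
        counts[s] += 1
        for t in occ:
            occ[t][i] = counts[t]
    return bwt, pos, occ

def get_range(query: str, pos, occ, n):
    """Backward search: rows of the sorted rotation matrix prefixed by query.
    (Assumes every query symbol occurs in the text.)"""
    f, l = 0, n - 1                                # range for the empty suffix: all rows
    for s in reversed(query):
        o1 = occ[s][f - 1] if f > 0 else 0
        p1 = occ[s][l]
        if o1 == p1:                               # s cannot extend the match
            return None
        f, l = pos[s] + o1, pos[s] + p1 - 1
    return f, l

bwt, pos, occ = bwt_index("acaacg")
print(get_range("ac", pos, occ, len(bwt)))         # -> (2, 3), as in the slide
```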
39
The BW transform. With some tricks, the BW transform becomes a memory-efficient data structure for querying exact matches. It has many other properties.
40
The Pigeonhole principle revisited.
Suppose we are looking to match a 96bp string with up to 5 errors. How would we use exact matching so as to guarantee sensitivity? Break the string into 6 pieces of 16bp each: the 5 errors can spoil at most 5 pieces, so at least one piece matches exactly, and sensitivity is 100%.
41
The Pigeonhole principle revisited.
Break the string into 6 pieces of 16bp each; sensitivity is 100%. What about speed? The number of random hits is 3×10^9 × 2×10^10 × 4^-16 ≈ 1.4×10^10. If we did an alignment (~10^4 steps) around each hit, the total computation would be ~1.4×10^14 steps. If we could choose larger words, we would gain speed: for 25-mers, the number of random hits is only ~5.3×10^4. To maintain the speed-sensitivity tradeoff, should we instead look for approximate matches?
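A quick check of this arithmetic under the uniform-random-DNA assumption:

```python
n, m = 3e9, 2e10              # database and total query length from the slides
for k in (16, 25):
    hits = m * n * 0.25 ** k  # expected random k-mer hits
    print(k, f"{hits:.3g}")   # 16 -> 1.4e+10, 25 -> 5.33e+04
# Around each 16-mer hit, a ~1e4-step alignment gives ~1.4e14 total steps.
# 25-mer seeds make the alignment cost negligible, but a 96bp read with 5
# errors can no longer be split into 6 exact-match pieces of length 25.
```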
42
Approximate matching. Consider the query aag: find all matches with at most one mismatch. Consider the suffix tree first. [Suffix trie of acaacg$]
43
Approximate matching. We do a BFS over the trie, maintaining the number of errors incurred in reaching each node. Worst-case time is O(w^2).
44
Speedup 1: branch and bound.
Goal: match q with ≤ z errors.
Pre-compute D[i]: the minimum number of errors needed to match the i-suffix of the query anywhere in the database.
Suppose that during the BFS we reach a node with e errors after matching the first i-1 symbols. If e + D[i] > z, no match is possible and we can prune this branch.
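A sketch of the D computation, mirrored to suffixes as defined above. The substring test is done naively here with Python's `in`; BWA performs it with the FM-index, and the query/database strings below are hypothetical:

```python
def compute_D(query: str, db: str) -> list:
    """Lower bound D[i] on the errors needed to match query[i:] (the i-suffix)
    somewhere in db. Scan the suffix from the right; each time the running
    segment stops occurring in db, at least one error is unavoidable, so the
    bound increases and the segment resets (the BWA-style heuristic bound)."""
    n = len(query)
    D = [0] * (n + 1)          # D[n] = 0 for the empty suffix
    errors, end = 0, n
    for i in range(n - 1, -1, -1):
        if query[i:end] not in db:
            errors += 1        # an error must fall inside this segment
            end = i            # restart with an empty segment
        D[i] = errors
    return D

# During the BFS with z allowed errors, a node reached with e errors after
# matching the first i-1 query symbols is pruned whenever e + D[i] > z.
print(compute_D("aagcg", "acaacg"))   # -> [1, 1, 1, 0, 0, 0]
```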
45
Speedup 2. BWA requires a tighter match in the prefix (the high-quality region): in the first 32 bp (of ~70bp queries), at most 2 errors are allowed. Possible approach: quickly check every suffix of the query for a match with 0 or 1 errors, and set D[i] to 0 or 1 accordingly. Other BWA heuristics: score indels and mismatches appropriately (differently).
46
Fast search for exact matching
There are two strategies:
- Build an index on the reference: O(n) preprocessing, O(m) search, plus the time to create alignments. Examples: suffix trees, suffix arrays, Burrows-Wheeler transform.
- Build an automaton on the queries and scan the genome with it: O(m) preprocessing, O(n) search, plus the time to create alignments. Example: Aho-Corasick tries.
47
Dictionary matching.
Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE. Database: POTASTPOTATO.
Q: Given k words (word s_i has length l_i) and a database of size n, find all matches to these words in the database string. How fast can this be done?
48
Dictionary matching & string matching.
How fast can you do it if you only had one word, of length m? Trivial algorithm: O(nm) time. With O(m) preprocessing: O(n) search time.
Dictionary matching: trivial algorithm O((l_1+l_2+l_3+...)n). Using a keyword tree: O(l_p n), where l_p is the length of the longest pattern. Aho-Corasick: O(n) after O(l_1+l_2+...) preprocessing.
We will consider the most general case.
49
Direct algorithm. [Illustration: naively matching POTATO against the database POTASTPOTATO, restarting after each mismatch]
Observations: when we mismatch, we (should) know something about where the next match will be, and we (should) also know something about the other patterns in the dictionary.
50
The trie automaton. Construct an automaton A from the dictionary (1: POTATO, 2: POTASSIUM, 3: TASTE):
- A[v,x] describes the transition from node v to a node w upon reading symbol x; e.g., A[u,'T'] = v and A[u,'S'] = w.
- There is a special root node r.
- Some nodes are terminal, labeled with the index of the dictionary word that ends there.
[Trie diagram over POTATO/POTASSIUM/TASTE]
51
An O(l_p n) algorithm for keyword matching.
Start with the first position in the database ('start') and the root node:
- If the transition succeeds: increment the 'current' pointer and move to the new node; if the node is terminal, report success.
- Else: retract the 'current' pointer, increment the 'start' pointer, move back to the root, and repeat.
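A runnable sketch of this naive scheme, using the slides' dictionary and database:

```python
def trie_match_naive(text, patterns):
    """The O(lp * n) scheme above: try every start position, walking the
    keyword trie with the 'current' pointer until a transition fails."""
    trie = {}
    for i, pat in enumerate(patterns):            # build the keyword trie
        node = trie
        for ch in pat:
            node = node.setdefault(ch, {})
        node["#"] = i                             # terminal marker
    for start in range(len(text)):                # the 'start' pointer
        node, current = trie, start
        while current < len(text) and text[current] in node:
            node = node[text[current]]            # successful transition
            current += 1                          # the 'current' pointer
            if "#" in node:
                print(patterns[node["#"]], "found at position", start)

trie_match_naive("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
# -> POTATO found at position 6  (0-indexed)
```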
52
[Illustration: scanning the database POTASTPOTATO with the trie automaton]
53
Idea for improving the time
Suppose we have partially matched pattern i (the match starts at position l and has reached position c), but fail subsequently. If some other pattern j is to match starting within this region, then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i (the part already matched).
54
Failure function. Every node v corresponds to a string s_v that is a prefix of some pattern. Define F[v] to be the node u such that s_u is the longest proper suffix of s_v that is also a node of the trie. If we fail to match at v, we jump to F[v] and continue matching from there. Let lp[v] = |s_u|. [Trie diagram with nodes n1...n10]
55
Illustration. In the trie for POTATO/POTASSIUM/TASTE: what is F(n10)? What is F(n5)? F(n3)? lp(n10)? [Trie diagram]
56
Illustration: database POTASTPOTATO; c = 1, l = 1 (start at the root). [Trie diagram]
57
Illustration: c = 2, l = 1.
58
Illustration: c = 6, l = 1 (POTAS matched).
59
Illustration: c = 6, l = 3 (failure link followed; l jumps forward to c - lp[v]).
60
Illustration: c = 7, l = 3.
61
Illustration: c = 7, l = 7 (failure links lead back toward the root).
62
Illustration: c = 8, l = 7.
63
Illustration: c = 12, l = 7 (POTATO matched at positions 7-12).
64
Time analysis. In each step, either c is incremented or l is incremented (when a failure link is followed, l jumps forward, since lp[v] < c - l). Neither pointer is ever decremented, and neither exceeds n, so the total time is ≤ 2n.
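A compact, runnable sketch of the full automaton: the keyword trie plus failure links built level by level with BFS, searched with the two-pointer argument above:

```python
from collections import deque

def build_automaton(patterns):
    """Aho-Corasick automaton: keyword trie plus failure links built by BFS."""
    goto, fail, out = [{}], [0], [set()]            # node 0 is the root r
    for i, pat in enumerate(patterns):              # 1. insert each pattern
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto[v][ch] = len(goto)
                goto.append({}); fail.append(0); out.append(set())
            v = goto[v][ch]
        out[v].add(i)                               # terminal node, labeled i
    queue = deque(goto[0].values())                 # 2. failure links, by level
    while queue:
        v = queue.popleft()
        for ch, w in goto[v].items():
            queue.append(w)
            f = fail[v]
            while f and ch not in goto[f]:          # follow failure links F[.]
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            out[w] |= out[fail[w]]                  # inherit patterns ending here
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = build_automaton(patterns)
    v = 0
    for c, ch in enumerate(text):                   # c never moves backward: O(n)
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        for i in out[v]:
            print(patterns[i], "ends at position", c)

search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
# -> POTATO ends at position 11  (0-indexed)
```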
65
Reviewing: steps for mapping reads.
1. Build an index on the reads (e.g., an Aho-Corasick trie) or on the database (e.g., the BW transform).
2. Search for exact matches to k-mers (k ~ 25).
3. When an exact match is found, extend it using a Smith-Waterman alignment.
4. Report matches with good scores.
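Step 3 is a standard local alignment. A minimal score-only Smith-Waterman sketch (the scoring parameters are illustrative):

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Local alignment score in O(len(a) * len(b)) time. In read mapping this
    is applied only to short windows around seed hits, so the quadratic cost
    is paid on ~100bp regions, not on the whole genome."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + score, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("acaacg", "acatacg"))  # -> 4: "aca", one gap, then "acg"
```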
66
Normalizing NGS for measuring RNA expression
The mapping gives us raw read counts at each locus. What is an appropriate measure of expression? We must normalize for the number of reads sequenced, as well as for the length of the gene. FPKM: fragments per kilobase (of transcript) per million mapped reads. For gene g with r_g mapped fragments, length l_g, and R total mapped fragments: FPKM_g = r_g / ((l_g/10^3)(R/10^6)).
67
Transcripts per million
TPM_g = 10^6 · (r_g/l_g) / Σ_h (r_h/l_h). Why is TPM better? TPM values sum to 10^6 in every sample, so they can be compared across samples as proportions; FPKM totals vary from sample to sample.
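A small sketch of both normalizations (the counts and lengths are hypothetical):

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths)]

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

counts, lengths = [100, 200, 700], [1000, 2000, 7000]
print(fpkm(counts, lengths))   # ~100000.0 for every gene
print(tpm(counts, lengths))    # ~333333.3 for every gene
# The TPM column always sums to 1e6, so values are comparable across samples;
# FPKM column sums vary with the library's composition.
```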
68
Normalizing to number of reads
There is huge variation in true abundance values: the top 10% of expressed genes contribute 60% of the reads. Small changes in high-abundance genes lead to large (spurious) changes in the expression values of low-abundance genes. Perhaps it is better to normalize by the 75th percentile. Why is an upper percentile better than the total?
69
Normalization and differential gene expression
Example (gene g has a constant 100 reads; two other genes gain reads in sample 2):

                       Sample 1   Sample 2
gene g                 100        100
high-abundance gene    250K       +25K
another gene           25K        +2.5K
Total reads            1M         1.0275M
gene g, per million    100        97.4
75%ile-based total     300K       302.5K
gene g, normalized     33.3       33.02

With total-count normalization, gene g appears to change even though its abundance is constant; with the 75th-percentile total, its value is nearly stable.
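A quick check of the table's arithmetic (the per-million and per-10^5 scale factors are chosen to match the slide's units):

```python
g = 100                                    # gene g's reads, constant across samples
total1 = 1_000_000
total2 = total1 + 25_000 + 2_500           # two other genes gain reads
print(g / total1 * 1e6, g / total2 * 1e6)  # 100.0 vs ~97.3: a spurious change
q1, q2 = 300_000, 302_500                  # 75th-percentile-based totals
print(g / q1 * 1e5, g / q2 * 1e5)          # ~33.3 vs ~33.1: nearly unchanged
```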
70
The two problems of RNA-seq quantification
The sub-problems of RNA-seq-based quantification:
1. Identify the possible isoforms
2. Map reads to each isoform
3. Compute the abundance of each isoform
The problems are inter-related: the possible isoforms are determined by the reads mapping to them; isoform abundance is determined by the number of reads mapping to each isoform; and isoform abundance dictates the number of reads that should map to it.
71
Multiple isoforms. Assume that each of the exons is 500bp.
We are given a collection of reads mapping to a gene locus g, representing multiple isoforms. How do we get transcript-level expression? Some tools (e.g., Cufflinks) pre-compute the possible transcripts based on the reads.
72
Multiple isoforms. Assume that each of the exons is 500bp.
We have 2 transcripts: t1 of length 1.5K (exons 1, 2, 3) and t2 of length 1K (exons 1, 3). Let γ1 = Expression(t1) and γ2 = Expression(t2).
73
Multiple isoforms and bias. [Diagram: locus g with exons e1, e2, e3 and read counts n1, n2, n3]
Data: mapped reads. Model: assume all exons are equally long, all reads are equally likely, and no read crosses an exon boundary. Let read R map to e1 (an exon shared by t1 and t2). Pr(R | γ) = ? A read from t1 falls in each of t1's 3 exons with probability 1/3, and a read from t2 in each of its 2 exons with probability 1/2, so Pr(R | γ) = γ1/3 + γ2/2 (treating γ_t as the fraction of reads generated by t).
74
Modeling. In this example we look only at fragments mapping to a specific locus, so only the parameter γ is relevant. Most tools that perform transcript-level inference apply a heuristic to pre-compute the possible transcripts.
75
Multiple spliced isoforms. [Diagram: locus g with exons e1, e2, e3; reads R1, R2; counts n1, n2, n3]
Data D: mapped read raw counts. Model: the likelihood of D as a function of γ.
77
Multiple isoforms. Differentiating the log-likelihood and setting the derivative to zero yields the ML solution for γ.
78
Suppose we had 3 transcripts.
[Diagram: locus g with exons e1, e2, e3; transcripts t1 (γ1), t2 (γ2), t3 (γ3); observed counts n1 = 8, n2 = 5, n3 = 5; exon-choice probabilities such as 1/3 and 1/2]
79
To compute the ML estimate, Roberts et al. use an optimization method: coordinate descent.
80
Coordinate descent method. Hold all coordinates but one fixed and minimize with respect to the free coordinate; the objective is minimized when its partial derivative with respect to that coordinate is zero. Iterate over the coordinates until convergence.
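As an illustration of iteratively fitting γ, here is a minimal sketch on the toy two-isoform locus from the earlier slides (t1 = exons {1,2,3}, t2 = exons {1,3}, equal-length exons). It uses EM updates rather than the coordinate descent of Roberts et al., and the read counts are hypothetical:

```python
def em_isoforms(n1, n2, n3, iters=200):
    """EM for the toy model: reads in exon e2 can only come from t1; reads in
    the shared exons e1 and e3 are split fractionally between t1 and t2.
    A read from t1 lands in each of t1's 3 exons with probability 1/3; a read
    from t2 lands in each of its 2 exons with probability 1/2. g1 and g2 are
    the fractions of reads generated by t1 and t2 (transcript fractions would
    additionally divide by transcript length)."""
    g1 = 0.5
    for _ in range(iters):
        # E-step: t1's share of the reads in the shared exons e1 and e3
        share1 = (g1 / 3) / (g1 / 3 + (1 - g1) / 2)
        t1_reads = (n1 + n3) * share1 + n2
        t2_reads = (n1 + n3) * (1 - share1)
        # M-step: update the read fractions
        g1 = t1_reads / (t1_reads + t2_reads)
    return g1, 1 - g1

print(em_isoforms(n1=30, n2=10, n3=30))   # -> approximately (0.43, 0.57)
```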
82
Solve for γ1 in terms of γ2
83
Length bias. The bias toward a location is computed as a function of the sequence and of the position relative to the 3' or 5' end of the transcript.
84
Modeling. Use the empirical fragment length distribution D.
The parameter to be estimated is the expression value ρ_t for all transcripts t:
- G: set of loci; β_g = relative abundance of locus g
- γ_t = relative abundance of t within its locus (multiple spliced isoforms exist at any locus)
- ρ_t = β_g · γ_t
- F: set of fragments; X_g: set of fragments mapping to locus g
- b_t(i,j): probability that the fragment's 5' and 3' ends fall at positions i and j, given that the fragment arises from transcript t
85
A generative model for a fragment. Consider a fragment f that maps to positions (i,j) of transcript t at locus g. [Plate diagram: f depends on (i,j) and t via the parameters γ_t, β_g, the bias b, and the length distribution D]
86
Likelihood. Assuming that fragments are generated independently, the likelihood is the product over all fragments of the probability of each fragment under the generative model. The ML parameters are computed using gradient descent or other optimization methods.
87
Estimating parameters
We have to estimate ρ_t = β_g γ_t, and also D and b_t, for all loci g and transcripts t. The length distribution D can be measured empirically. Note that we cannot estimate the bias unless we have a gold standard where ρ is known, and we cannot estimate ρ unless we know the bias. Roberts et al. therefore use a 2-step iteration to get an ML estimate:
1. Use a uniform bias to get an initial estimate of ρ
2. Use the initial ρ to estimate the bias
3. Re-estimate ρ
88
Correlation between quantitative PCR and RNA-seq
Roberts et al., Genome Biology, 2011.
89
Correcting across platforms
90
Some issues with the Cufflinks model
The initial choice of isoforms is parsimony-based, which may not allow for the discovery of novel isoforms. iReckon allows for the simultaneous discovery of novel isoforms and their quantification (Mezlini et al., Genome Research 2013). [Figure: precision/recall on simulated data; Precision = TP/(TP+FP), Recall = TP/(TP+FN)]
91
Conclusion. The processing of mapped RNA-seq reads allows us to generate one column of the transcript abundance matrix.