CS CM124/224 & HG CM124/224 DISCUSSION SECTION (JUN 6, 2013) TA: Farhad Hormozdiari.


Reminder
- Final Review

Agenda
- Re-sequencing
- Sequence Mapping Coverage
- Tumor Genome Reconstruction

Re-sequencing
- I want to sequence my genome (that is, know my DNA sequence). How?
- There are several sequencing technologies; one is called next generation sequencing.
  - It is cheaper than other sequencing technologies.
  - It generates many short reads from my genome.
  - A short read is a short DNA segment from my genome of length 30bp ~ ?
- Re-sequencing is mapping these short reads to a known DNA sequence (called the reference genome).
  - It assumes that my genome is very close to the reference genome.
  - It requires short reads from my genome and a reference genome (constructed with other sequencing technologies).
- Why don't we just use the sequencing technologies that were used to construct the reference genome?
  - Because they are more expensive and take more time.

Problems with Re-sequencing
- Repeated sequences in the reference genome: reads from the target map to multiple positions.
- Insertions, deletions, or inversions in the target genome (or any target sequence that differs significantly from the reference genome): reads do not map to any position.
- Reads may contain random errors.
  - Solution: collect many reads that map to the same position.
  - Coverage: the number of times each position is mapped by different reads. For example, 10x coverage means that on average 10 reads map to the same position.
  - We then take the consensus among the reads that map to the same position.
  - Since only a few reads have an error at that position, the consensus corrects read errors. The more reads we have at a position (higher coverage), the less likely an incorrect prediction becomes.
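The consensus step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the reads have already been mapped, the input is the list of bases that the reads report at one position, and the function name consensus_base is hypothetical rather than part of any tool mentioned in the slides.

```python
from collections import Counter

def consensus_base(bases_at_position):
    """Return the most frequent base among the reads covering one position.

    bases_at_position: string or list of bases (e.g. "AAGAA"), one per read.
    Ties are broken by first occurrence, which is an arbitrary choice here.
    """
    if not bases_at_position:
        raise ValueError("no reads cover this position")
    counts = Counter(bases_at_position)
    base, _count = counts.most_common(1)[0]
    return base

# Five reads cover the position; one of them carries a sequencing error (G).
print(consensus_base("AAGAA"))
```

With higher coverage the single erroneous read is outvoted more easily, which is the point made in the last bullet above.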

Consensus Algorithm for SNP calling

Problems with Re-sequencing
- Example: if the per-read error rate is e and we predict the consensus sequence, what is the error rate when the coverage is 3?
- We make an incorrect prediction if two of the three reads, or all three reads, have an error at the same position.
  - Probability that all 3 reads have an error: $e^3$
  - Probability that exactly 2 out of 3 reads have an error: $3e^2(1-e)$
  - Consensus error rate: $e^3 + 3e^2(1-e) = 3e^2 - 2e^3$
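The same calculation can be generalized to other coverage values. The sketch below is illustrative only: it assumes read errors are independent, that erroneous reads agree with each other (the pessimistic case used in the example above), and that a tie at even coverage does not count as an error. The function name consensus_error_rate is hypothetical.

```python
from math import comb

def consensus_error_rate(e, coverage):
    """Probability that a strict majority of the reads at a position are wrong.

    e: per-read error rate at the position (between 0 and 1).
    coverage: number of reads covering the position.
    """
    majority = coverage // 2 + 1  # smallest number of bad reads that outvotes the rest
    return sum(
        comb(coverage, k) * e**k * (1 - e) ** (coverage - k)
        for k in range(majority, coverage + 1)
    )

# Coverage 3 reproduces the worked example: e^3 + 3e^2(1 - e).
e = 0.01
print(consensus_error_rate(e, 3), e**3 + 3 * e**2 * (1 - e))
```

Under these assumptions the consensus error rate falls quickly as coverage grows, which is why higher coverage makes an incorrect prediction less likely.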

Sequence Mapping Coverage
- If a genome has length N (for human, about 3,000,000,000) and the total length of all sequence reads collected is M, the coverage (a ratio) is defined as M/N.
- Coverage is often written with an "x", for example 10x or 20x coverage.
- 10x coverage means that on average 10 reads map to the same position; the number at any particular position of the genome can be less than, more than, or exactly 10.
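As a small worked example of the M/N definition, the sketch below computes the overall coverage from a read count and read length, and the number of reads needed for a target coverage. The read length of 100bp and the read counts are hypothetical numbers chosen for illustration, and the function names are not from the slides.

```python
from math import ceil

def overall_coverage(num_reads, read_length, genome_length):
    """Coverage M/N, where M = num_reads * read_length and N = genome_length."""
    return num_reads * read_length / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads whose total length reaches the target coverage."""
    return ceil(target_coverage * genome_length / read_length)

N = 3_000_000_000  # approximate human genome length used in the slides
print(overall_coverage(300_000_000, 100, N))  # coverage for 300 million 100bp reads
print(reads_needed(20, 100, N))               # reads needed for 20x coverage
```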

Coverage Example
- Assume we have 3x overall coverage.
- [Figure: reads aligned to a genome; the marked positions are covered by 4 reads, 1 read, 3 reads, and 2 reads.]
- We assume that the coverage (the number of reads at a specific position of the genome) follows the Poisson distribution whose mean is the overall coverage (e.g. 3x).

Poisson Distribution
- A discrete probability distribution that gives the probability of a number of (rare) events given a known mean
- Only one parameter: λ, the mean of the distribution
- Probability mass function: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \ldots$
- Mean = λ
- Variance = λ

Poisson Distribution to Sequence Coverage
- Overall coverage = λ
- Probability that exactly X reads span a certain position (the fraction of the genome with coverage equal to X): dpois(X, λ)
- Probability that X or fewer reads span a certain position (the fraction of the genome with coverage less than or equal to X): ppois(X, λ)
- The smallest coverage X such that at least a fraction Y of the genome has coverage X or less: qpois(Y, λ), where Y is a probability between 0 and 1
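The slides refer to the R functions dpois, ppois, and qpois. For readers without R at hand, the sketch below gives plain Python stand-ins with the same names and argument order. They are illustrative re-implementations under the Poisson coverage assumption, not the R code itself, and they are meant for modest values of λ.

```python
from math import exp, lgamma, log

def dpois(x, lam):
    """P(coverage == x) for a Poisson distribution with mean lam > 0."""
    if x < 0:
        return 0.0
    # Computed in log space to avoid overflow in lam**x and x!.
    return exp(x * log(lam) - lam - lgamma(x + 1))

def ppois(x, lam):
    """P(coverage <= x)."""
    return sum(dpois(k, lam) for k in range(0, x + 1))

def qpois(p, lam):
    """Smallest x such that P(coverage <= x) >= p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1 in this sketch")
    x = 0
    cumulative = dpois(0, lam)
    while cumulative < p:
        x += 1
        term = dpois(x, lam)
        if term == 0.0 and x > lam:
            break  # further terms underflow; stop rather than loop forever
        cumulative += term
    return x

lam = 3  # 3x overall coverage, as in the coverage example
print(dpois(3, lam))    # fraction of the genome covered by exactly 3 reads
print(ppois(2, lam))    # fraction of the genome covered by 2 or fewer reads
print(qpois(0.5, lam))  # smallest coverage reached or undercut by at least half the genome
```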

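Diploid Coverage
- Humans are diploid: they carry two copies of each chromosome.
- Each read comes from one of the two copies at random.
- [Figure: reads are generated from both copies and mapped to the reference; in the example, 2 reads come from the 1st chromosome and 3 reads from the 2nd chromosome.]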

Diploid Coverage
- Assume a position in the reference genome is covered by Y reads.
- The number of those Y reads that come from the first chromosome follows the binomial distribution with success probability 0.5, so the probability that exactly X of them do is dbinom(X, Y, 0.5).
- This is the same as the probability of observing X heads when we toss a fair coin Y times.
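A Python stand-in for R's dbinom makes the coin-toss analogy concrete. The function below keeps the name and argument order used in the slides, but it is only an illustrative re-implementation.

```python
from math import comb

def dbinom(x, size, prob):
    """P(exactly x successes in size independent trials with success probability prob)."""
    if x < 0 or x > size:
        return 0.0
    return comb(size, x) * prob**x * (1 - prob) ** (size - x)

# 5 reads cover a position: probability that exactly 2 come from the 1st chromosome.
print(dbinom(2, 5, 0.5))
```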

Diploid Coverage
- Given that Y reads map to a specific position of the reference genome, what is the probability of having at least X reads (coverage) from each chromosome?
- Let's assume Y = 10 and X = 3.
- We want to add the following probabilities:
  - 3 reads from the 1st chromosome and 7 reads from the 2nd chromosome
  - 4 reads from the 1st chromosome and 6 reads from the 2nd chromosome
  - 5 reads from the 1st chromosome and 5 reads from the 2nd chromosome
  - 6 reads from the 1st chromosome and 4 reads from the 2nd chromosome
  - 7 reads from the 1st chromosome and 3 reads from the 2nd chromosome
- dbinom(3,10,0.5)+dbinom(4,10,0.5)+dbinom(5,10,0.5)+dbinom(6,10,0.5)+dbinom(7,10,0.5)
- Or, written as a single sum: $\sum_{k=X}^{Y-X} \binom{Y}{k}\, 0.5^{Y}$
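The five-term sum can be checked numerically. The sketch below uses a Python stand-in for dbinom and a hypothetical helper, prob_each_chromosome_at_least, that sums over every split of the Y reads leaving at least X reads for each chromosome.

```python
from math import comb

def dbinom(x, size, prob):
    """P(exactly x successes in size trials); stand-in for R's dbinom."""
    if x < 0 or x > size:
        return 0.0
    return comb(size, x) * prob**x * (1 - prob) ** (size - x)

def prob_each_chromosome_at_least(x, y):
    """P(both chromosomes get at least x of the y reads at a position)."""
    # k reads from the 1st chromosome leaves y - k for the 2nd, so k runs from x to y - x.
    return sum(dbinom(k, y, 0.5) for k in range(x, y - x + 1))

print(prob_each_chromosome_at_least(3, 10))  # 912/1024 = 0.890625
```

If Y is smaller than 2X the range is empty and the probability is 0, since the reads cannot be split so that each chromosome gets X of them.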

Another Diploid Coverage
- We assume that the overall coverage is λ.
- What is the probability of having at least X coverage for each chromosome over the whole genome?
- First, compute the probability of having coverage i at a specific position of the genome given the overall coverage λ: dpois(i, λ)
- Given coverage i at a specific position, what is the probability of having at least X coverage for each chromosome? $\sum_{k=X}^{i-X} \mathrm{dbinom}(k, i, 0.5)$
- Then, given the overall coverage λ, what is the probability of having i reads (coverage) at a specific position and at least X coverage for each chromosome? $\mathrm{dpois}(i, \lambda) \cdot \sum_{k=X}^{i-X} \mathrm{dbinom}(k, i, 0.5)$

Another Diploid Coverage
- So far we have only computed the probability when there are i reads (coverage i) at a specific position.
- The minimum value of i is 2X (2 times X).
  - We want at least X coverage for each chromosome, so we need at least 2X coverage at the position.
  - For example, if we want at least 5 coverage for each chromosome, we need at least 10 reads mapped to that position of the reference genome.
- The maximum value of i is infinity.
- Hence i runs from 2X to infinity, and we sum the probabilities over all values of i: $\sum_{i=2X}^{\infty} \mathrm{dpois}(i, \lambda) \sum_{k=X}^{i-X} \mathrm{dbinom}(k, i, 0.5)$
- Note that this is a nested (double) loop, not the product of two separate loops.
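The nested loop can be sketched as follows. This is an illustration, not the course's reference solution: dpois and dbinom are Python stand-ins for the R functions, the function name is hypothetical, and the infinite outer sum is truncated at an assumed cutoff i_max, which is adequate only when λ is far below the cutoff.

```python
from math import comb, exp, lgamma, log

def dpois(x, lam):
    """P(coverage == x) for a Poisson distribution with mean lam > 0."""
    return exp(x * log(lam) - lam - lgamma(x + 1))

def dbinom(x, size, prob):
    """P(exactly x successes in size trials with success probability prob)."""
    return comb(size, x) * prob**x * (1 - prob) ** (size - x)

def prob_min_coverage_each_chromosome(x, lam, i_max=200):
    """P(each chromosome has coverage >= x) at a position, given overall coverage lam.

    i_max truncates the sum to infinity (an assumption of this sketch).
    """
    total = 0.0
    for i in range(2 * x, i_max + 1):      # outer loop: total coverage i at the position
        inner = 0.0
        for k in range(x, i - x + 1):      # inner loop: k reads from the 1st chromosome
            inner += dbinom(k, i, 0.5)
        total += dpois(i, lam) * inner     # inner sum depends on i, hence the nesting
    return total

print(prob_min_coverage_each_chromosome(1, 3))   # at least 1x per chromosome at 3x overall
print(prob_min_coverage_each_chromosome(5, 30))  # at least 5x per chromosome at 30x overall
```

The inner sum is recomputed for every i because its upper limit i - X changes with i, which is why the two loops cannot be evaluated separately and multiplied.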

Diploid Coverage Examples

Tumor Genome Reconstruction
- We have the reference genome (known).
- We have the tumor genome (unknown).
- We have paired-end reads from the tumor genome and map them to the (known) reference genome.
- By observing how those paired-end reads map to the reference genome, we can reconstruct the tumor genome.
- Parts of the tumor genome can be the same as the reference genome, while other regions may be inverted, duplicated, or translocated.

CS/HG 124/224 http://genetics.cs.ucla.edu/cs124 Jae Hoon Sul

Why do we care about rearrangement?
- Chronic myelogenous leukemia

Tumor Genome – same as reference
- [Figure: a given paired-end read from the tumor genome (unknown) is mapped to the reference genome (known).]
- The read from the tumor genome maps normally to the reference genome.
  - "Normally" means the gap between the two ends of the paired-end read is the same (or similar) in both the tumor and reference genomes.
- Hence, the region spanned by this paired-end read is the same in both the tumor and reference genomes.

Tumor Genome – Duplication
- Assume a region of the reference genome is duplicated in the tumor genome.
- [Figure: region A in the reference genome (known) and in the tumor genome (unknown), with paired-end reads #1 and #2.]
- Read #1 maps normally to the reference genome.
- However, read #2 does not map normally:
  - There is a big gap between the two ends of the paired-end read.
  - The order of the two ends is also different (read #1: green is on the left side of the region; read #2: red is on the left side of the region).
- Hence, we can conclude that region A is duplicated in the tumor genome.

Tumor Genome – Clarification
- We are given this: [Figure: A]
- From this information, we want to reconstruct this: [Figure: A A]

Tumor Genome – Inversion
- Assume a region of the reference genome is inverted in the tumor genome.
- [Figure: regions B and C in the reference genome (known) and in the tumor genome (unknown), with paired-end reads #1 and #2.]
- Read #2 does not map normally:
  - There is a big gap between the two ends of the paired-end read.
  - The direction of the green arrow is reversed when read #2 maps to the reference genome.
- We can conclude that region B is inverted in the tumor genome.

Tumor Genome – Translocation
- Assume a region of the reference genome is translocated in the tumor genome.
- [Figure: regions B and C in the reference genome (known) and in the tumor genome (unknown), with paired-end reads #1, #2, and #3.]
- Reads #2 and #3 do not map normally:
  - There is a big gap between the two ends of the paired-end read.
- We can conclude that regions B and C are translocated in the tumor genome.
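The cues used in the last four slides (gap size, order of the two ends, and direction of the ends) can be collected into one small rule-of-thumb classifier. This is a deliberately simplified sketch: the PairedEndMapping type, the classify function, the expected gap, and the tolerance are all hypothetical, and a real analysis would combine evidence from many reads rather than judge a single pair.

```python
from dataclasses import dataclass

@dataclass
class PairedEndMapping:
    first_pos: int       # reference position where the first (green) end maps
    second_pos: int      # reference position where the second (red) end maps
    first_forward: bool  # True if the first end maps in its expected direction
    second_forward: bool # True if the second end maps in its expected direction

def classify(mapping, expected_gap, tolerance):
    """Label one paired-end mapping using the cues described in the slides."""
    gap = abs(mapping.second_pos - mapping.first_pos)
    normal_gap = abs(gap - expected_gap) <= tolerance
    normal_order = mapping.first_pos < mapping.second_pos
    normal_direction = mapping.first_forward and mapping.second_forward

    if normal_gap and normal_order and normal_direction:
        return "same as reference"
    if not normal_direction:
        return "inversion"       # an end maps in the opposite direction
    if not normal_order:
        return "duplication"     # the two ends map in swapped order
    return "translocation"       # only the gap is abnormal

pairs = [
    PairedEndMapping(1000, 1300, True, True),
    PairedEndMapping(5000, 1000, True, True),
    PairedEndMapping(1000, 9000, True, False),
    PairedEndMapping(1000, 9000, True, True),
]
for pair in pairs:
    print(classify(pair, expected_gap=300, tolerance=50))
```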