CSE182-L18 Population Genetics

Perfect Phylogeny
Assume an evolutionary model in which no recombination takes place, only mutation.
The evolutionary history is explained by a tree in which every mutation is on an edge of the tree (each site mutates once).
Removing that edge splits the tree into two sub-trees: all the species in one sub-tree contain a 0 at that site, and all species in the other contain a 1.
Such a tree is called a perfect phylogeny.
How can one reconstruct such a tree?

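The 4-gamete condition
A column $i$ partitions the set of species into two sets: $i_0$ (species with a 0 in column $i$) and $i_1$ (species with a 1 in column $i$).
A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
Example: column $i$ below is heterogeneous w.r.t. {A, D, E}. Here $i_0$ = {A, B, C} and $i_1$ = {D, E, F}.
Species i
A       0
B       0
C       0
D       1
E       1
F       1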

4-Gamete Condition
–There exists a perfect phylogeny if and only if, for every pair of columns $(i,j)$, column $j$ is homogeneous w.r.t. $i_0$ or w.r.t. $i_1$.
–Equivalently: there exists a perfect phylogeny if and only if, for every pair of columns $(i,j)$, the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur.
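
The second form of the condition translates directly into a pairwise check. The following is a minimal sketch, assuming the data is a list of 0/1 rows (one row per species); the function names are illustrative and not from the lecture.

```python
from itertools import combinations

def violates_four_gamete(matrix, i, j):
    """True if columns i and j together show all four rows (0,0),(0,1),(1,0),(1,1)."""
    gametes = {(row[i], row[j]) for row in matrix}
    return len(gametes) == 4

def passes_four_gamete_test(matrix):
    """True if no pair of columns contains all four gametes."""
    n_cols = len(matrix[0]) if matrix else 0
    return not any(violates_four_gamete(matrix, i, j)
                   for i, j in combinations(range(n_cols), 2))

# Rows are species A..E, columns are sites 1..5 (the example matrix of these slides).
example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(passes_four_gamete_test(example))
```

The check costs one pass over the rows per column pair, which is fine for small matrices.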

4-gamete condition: proof
(only if) Every perfect phylogeny satisfies the 4-gamete condition: the edge carrying mutation $i$ splits the species into $i_0$ and $i_1$, and depending on which side of that edge the mutation $j$ occurs, either $i_0$ or $i_1$ must be homogeneous for $j$.
(if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

An algorithm for constructing a perfect phylogeny
We will consider the case where 0 is the ancestral state and 1 is the mutated state. This restriction will be removed later.
In any tree, each node (except the root) has a single parent.
–It is sufficient to construct a parent for every node.
In each step, we add a column and refine some of the nodes containing multiple children.
Stop when all columns have been considered.

Inclusion Property
For any pair of columns $i,j$
–$i < j$ if and only if $i_1 \supseteq j_1$
Note that if $i<j$ then the edge containing $i$ is an ancestor of the edge containing $j$.

Example
Site 1 2 3 4 5
A    1 1 0 0 0
B    0 0 1 0 0
C    1 1 0 1 0
D    0 0 1 0 1
E    1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent.

Sort columns
Sort columns according to the inclusion property (note that the columns of the example are already sorted).
This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order.
In the example, columns 1 to 5 read 10101, 10100, 01010, 00100, 00010 from top to bottom, which is already decreasing.
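
A minimal sketch of this sorting step, assuming the matrix is a list of 0/1 rows with species A..E as rows; `sort_columns` is an illustrative name, not part of the lecture.

```python
def sort_columns(matrix):
    """Return column indices in decreasing order of their binary value.

    Each column is read top to bottom as a binary number, with the first
    row as the most significant bit.
    """
    n_cols = len(matrix[0])

    def value(j):
        return int("".join(str(row[j]) for row in matrix), 2)

    return sorted(range(n_cols), key=value, reverse=True)

example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(sort_columns(example))  # the example is already sorted: [0, 1, 2, 3, 4]
```

Indices are 0-based here, so index 0 corresponds to column 1 of the slides.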

Add first column
In adding column $i$
–Check each edge and decide on which side each species belongs.
–Finally, add a node if you can resolve a clade.
[Figure: the tree after adding column 1 of the example. A new node u below the root r groups the species with a 1 in column 1 (A, C, E).]

Adding other columns
Add the other columns on edges using the ordering property.
[Figure: the final tree for the example, with root r, leaves A to E, and mutations 1 to 5 labelling its edges.]
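
The whole rooted construction can be sketched as a parent map. This is one possible implementation under stated assumptions, not necessarily the procedure drawn on the slides: 0 is ancestral, the node names `m1`..`m5` (the node just below the edge carrying that mutation) are hypothetical, and leaves hang off the deepest mutation they carry.

```python
def build_perfect_phylogeny(matrix, names):
    """Return {node: parent} for a rooted perfect phylogeny, or None if none exists."""
    n_cols = len(matrix[0])
    # Decreasing binary value with the first row as most significant bit
    # is the same as decreasing lexicographic order of the column.
    order = sorted(range(n_cols),
                   key=lambda j: [row[j] for row in matrix],
                   reverse=True)
    parent = {}
    for row, name in zip(matrix, names):
        current = "r"  # start at the root
        for j in order:
            if row[j] == 1:
                node = f"m{j + 1}"
                # Every species carrying mutation j must reach it from the same parent.
                if parent.setdefault(node, current) != current:
                    return None
                current = node
        parent[name] = current
    return parent

example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
tree = build_perfect_phylogeny(example, "ABCDE")
print(tree)
```

For the example this places E directly below mutation 1, A below mutation 2, C below mutation 4, B below mutation 3 and D below mutation 5.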

Unrooted case Switch the values in each column, so that 0 is the majority element. Apply the algorithm for the rooted case
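
The relabelling step can be sketched as follows, assuming a list of 0/1 rows; `to_majority_zero` is an illustrative name. Columns with an exact tie are left unchanged, which is an arbitrary choice made here.

```python
def to_majority_zero(matrix):
    """Swap 0 and 1 in every column where 1 is the strict majority value."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    return [[1 - v if flip[j] else v for j, v in enumerate(row)]
            for row in matrix]

data = [
    [1, 0],
    [1, 1],
    [1, 0],
    [0, 0],
]
print(to_majority_zero(data))
```

After this step the rooted algorithm can be applied to the relabelled matrix.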

Summary: No recombination leads to correlation between sites
The different sites are linked. A 1 in position 8 implies a 0 in position 5, and vice versa.
The history of a population can be expressed as a tree.
The tree can be constructed efficiently.
[Figure: example haplotypes with positions 3, 5 and 8 marked.]

Recombination
A tree is not sufficient, as a sequence may have 2 parents.
Recombination leads to violation of the 4-gamete property.
Recombination leads to loss of correlation between columns.
Example: the parents 00000000 and 11111111 can recombine to produce 00011111.

Studying recombination A tree is not sufficient as a sequence may have 2 parents Recombination leads to loss of correlation between columns How can we measure recombination?

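Linkage (Dis)-equilibrium (LD)
Extensive recombination
–Haplotypes (A,B): 00, 01, 00, 00, 11, 10, 10, 10
–Pr[(A,B) = (0,1)] = 1/8 = 0.125, which equals Pr[A=0] · Pr[B=1] = 0.5 · 0.25
–Linkage equilibrium
No recombination
–Haplotypes (A,B): 01, 01, 00, 00, 10, 10, 10, 10
–Pr[(A,B) = (0,1)] = 2/8 = 0.25, which differs from Pr[A=0] · Pr[B=1] = 0.125
–Linkage disequilibrium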

Measuring LD
Consider two bi-allelic sites A and B, with values 0 and 1. Let
–$p_1$ = Pr[individual has allele 1 at site A]
–$q_1$ = Pr[individual has allele 1 at site B]
–$P_{11}$ = Pr[individual has allele 1 at both sites A and B]
Linkage disequilibrium: $D = |P_{11} - p_1 q_1| = |P_{01} - p_0 q_1| = \dots$
If $D = 0$, the sites are uncorrelated (they are in linkage equilibrium).
If $D \gg 0$, the sites are highly correlated (they have high LD).
Other measures exist, but they all measure similar quantities.
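
As a small illustration of the definition, $D$ can be estimated from observed haplotypes by replacing probabilities with sample frequencies. This is a sketch; the function name and the toy samples are made up for the example.

```python
def linkage_disequilibrium(haplotypes):
    """Estimate D = |P11 - p1*q1| from a list of (a, b) allele pairs in {0, 1}."""
    n = len(haplotypes)
    p1 = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at site A
    q1 = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at site B
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)

# All four haplotypes equally frequent: the sites are independent.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Only 00 and 11 occur: the sites are perfectly correlated.
linked = [(0, 0), (1, 1), (0, 0), (1, 1)]

print(linkage_disequilibrium(independent))
print(linkage_disequilibrium(linked))
```

With two alleles per site, 0.25 is the largest value this measure can take.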

LD can be used to map disease genes
LD decays with distance from the disease allele.
By plotting LD, one can shortlist the region containing the disease gene.
[Figure: marker alleles 011001011001 shown against the labels DNNDDNDNNDDN, with LD plotted along the sequence.]

Population sub-structure can cause problems in disease gene mapping

Population sub-structure can increase LD
Consider two populations that were isolated and evolving independently. They might have different allele frequencies in some regions.
Pick two regions that are far apart (LD is very low, close to 0).
Pop. A: $p_1 = 0.1$, $q_1 = 0.9$, $P_{11} = 0.1$, $D = 0.01$
Pop. B: $p_1 = 0.9$, $q_1 = 0.1$, $P_{11} = 0.1$, $D = 0.01$
[Figure: two-site haplotypes such as 0..1 and 1..0 in each population.]

Recent ad-mixing of populations
If the populations came together recently (e.g. African and European populations), artificial LD might be created.
Pop. A+B: $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = |0.1 - 0.25| = 0.15$
$D = 0.15$ (instead of 0.01), a 15-fold increase.
This spurious LD might lead to false associations.
Other genetic events can also cause LD to arise, and one needs to be careful.
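
The arithmetic behind the admixture example can be checked with a few lines. This sketch assumes the two populations are mixed in equal proportions, which is what the pooled values $p_1 = q_1 = 0.5$ imply; the helper names are illustrative.

```python
def ld(p1, q1, p11):
    """D = |P11 - p1*q1| for given allele and haplotype frequencies."""
    return abs(p11 - p1 * q1)

def mix(a, b, weight_a=0.5):
    """Frequencies in a pooled sample: a weighted average of the two populations."""
    return {key: weight_a * a[key] + (1 - weight_a) * b[key] for key in a}

pop_a = {"p1": 0.1, "q1": 0.9, "p11": 0.1}
pop_b = {"p1": 0.9, "q1": 0.1, "p11": 0.1}
pooled = mix(pop_a, pop_b)  # assumption: equal mixing proportions

d_a = ld(**pop_a)
d_b = ld(**pop_b)
d_pooled = ld(**pooled)
print(d_a, d_b, d_pooled)
```

Each population alone has $D$ close to 0.01, while the pooled sample has $D$ close to 0.15.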

Determining population sub-structure
Given a mix of people, can you sub-divide them into ethnic populations?
Turn the ‘problem’ of spurious LD into a clue.
–Find markers that are too far apart to show LD.
–If they do show LD (correlation), that shows the existence of multiple populations.
–Sub-divide them into populations so that LD disappears.

Determining population sub-structure
Same example as before: the two markers are too far apart to show any LD, yet they do show LD.
However, if you split the individuals so that all 0..1 are in one population and all 1..0 are in another, LD disappears.

Iterative algorithm for population sub-structure
Assume that there are 2 sub-populations, A and B.
Randomly partition the individuals into two.
Select an individual x, and compute the probabilities Pr(x|A) and Pr(x|B).
Assign the individual to A with probability
–Pr(x|A) / (Pr(x|A) + Pr(x|B))
Continue.

Iterative algorithm for population sub-structure
Define
–$N$ = number of individuals (each has a single chromosome)
–$k$ = number of sub-populations
–$Z \in \{1..k\}^N$ is a vector giving the sub-population of each individual
–$Z_i = k'$ means individual $i$ is assigned to population $k'$
–$X_{i,j}$ = allelic value for individual $i$ in position $j$
–$P_{k,j,l}$ = frequency of allele $l$ at position $j$ in population $k$

Example
Consider the following assignment, in which the first 10 individuals are in population 1 and the last 10 are in population 2:
Z = 11111111112222222222
$P_{1,1,0} = 0.9$, $P_{2,1,0} = 0.1$

Goal X is known. P, Z are unknown. The goal is to estimate Pr(P,Z|X) Various learning techniques can be employed. Here a Bayesian (MCMC) scheme is employed. We will only consider a simplified version

Algorithm: Structure
Iteratively estimate
–$(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \dots, (Z^{(m)},P^{(m)})$
After ‘convergence’, $Z^{(m)}$ is the answer.
Iteration
–Guess $Z^{(0)}$
–For $m = 1,2,\dots$
–Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
–Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
How is this sampling done?

Example
Choose Z at random, so each individual is assigned to one of 2 populations, for example
$Z^{(0)}$ = 12211212121221121221
Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$.
Simply count
–$N_{k,j,l}$ = number of people in population $k$ which have allele $l$ in position $j$
–$p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$, where $N_{k,j,*} = \sum_l N_{k,j,l}$

Example
$N_{k,j,l}$ = number of people in population $k$ which have allele $l$ in position $j$
$p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$
$N_{1,1,0} = 4$, $N_{1,1,1} = 6$
$p_{1,1,0} = 4/10$, $p_{1,2,0} = 4/10$
Thus, we can sample $P^{(m)}$.
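
The counting step can be sketched as below. This follows the simplified "just count" version described here rather than a full Bayesian sampler; the data are hypothetical haplotypes chosen only to reproduce the counts on this slide, and the 0.5 fallback for an empty population is an assumption of the sketch.

```python
from collections import Counter

def estimate_frequencies(X, Z, k):
    """Return freqs[pop][site][allele] = N_{k,j,l} / N_{k,j,*} by simple counting.

    X: list of haplotypes (tuples of 0/1), Z: population label (1..k) per haplotype.
    """
    n_sites = len(X[0])
    freqs = {}
    for pop in range(1, k + 1):
        members = [x for x, z in zip(X, Z) if z == pop]
        freqs[pop] = []
        for j in range(n_sites):
            counts = Counter(x[j] for x in members)
            total = len(members)
            # Assumption: an empty population gets uniform frequencies.
            freqs[pop].append({l: counts[l] / total if total else 0.5
                               for l in (0, 1)})
    return freqs

# Hypothetical data: 10 individuals per population, two sites each.
X = ([(0, 1)] * 4 + [(1, 0)] * 4 + [(1, 1)] * 2 +
     [(0, 0)] * 4 + [(0, 1)] * 2 + [(1, 1)] * 2 + [(1, 0)] * 2)
Z = [1] * 10 + [2] * 10
freqs = estimate_frequencies(X, Z, k=2)
print(freqs[1][0][0], freqs[1][1][0])  # p_{1,1,0} and p_{1,2,0}
```

Sites are 0-indexed in the code, so `freqs[1][0][0]` corresponds to $p_{1,1,0}$.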

Sampling Z
$\Pr[Z_1 = 1]$ = Pr[“01” belongs to population 1]?
We know that, within a population, the positions should be in linkage equilibrium and therefore independent.
Pr[“01” | population 1] = $p_{1,1,0} \cdot p_{1,2,1} = (4/10)(6/10) = 0.24$
Pr[“01” | population 2] = $p_{2,1,0} \cdot p_{2,2,1} = (6/10)(4/10) = 0.24$
$\Pr[Z_1 = 1] = 0.24/(0.24+0.24) = 0.5$

Sampling
Suppose that, during the iteration, there is a bias, for example
Z = 11121212112221221221
Then, in the next step of sampling Z, we will do the right thing:
Pr[“01” | pop. 1] = $p_{1,1,0} \cdot p_{1,2,1} = 0.7 \cdot 0.7 = 0.49$
Pr[“01” | pop. 2] = $p_{2,1,0} \cdot p_{2,2,1} = 0.3 \cdot 0.3 = 0.09$
$\Pr[Z_1 = 1] = 0.49/(0.49+0.09) \approx 0.84$
$\Pr[Z_6 = 1] = 0.49/(0.49+0.09) \approx 0.84$
Eventually all “01” will become one population, and all “10” will become a second population.
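
The update of a single label can be sketched as follows, assuming two populations and independent sites within each population. The function names and the `biased` frequency table are illustrative; the table simply encodes the 0.7/0.3 values used on this slide.

```python
import random

def haplotype_likelihood(x, site_freqs):
    """Pr(x | population), multiplying per-site allele frequencies."""
    prob = 1.0
    for allele, freq in zip(x, site_freqs):
        prob *= freq[allele]
    return prob

def prob_population_1(x, freqs):
    """Pr(x|1) / (Pr(x|1) + Pr(x|2)); assumes the denominator is non-zero."""
    a = haplotype_likelihood(x, freqs[1])
    b = haplotype_likelihood(x, freqs[2])
    return a / (a + b)

def sample_label(x, freqs, rng=random):
    """Draw a new population label (1 or 2) for haplotype x."""
    return 1 if rng.random() < prob_population_1(x, freqs) else 2

# freqs[pop][site][allele], with the biased values from the slide.
biased = {
    1: [{0: 0.7, 1: 0.3}, {0: 0.3, 1: 0.7}],
    2: [{0: 0.3, 1: 0.7}, {0: 0.7, 1: 0.3}],
}
print(prob_population_1((0, 1), biased))  # 0.49 / 0.58
print(sample_label((0, 1), biased))
```

By symmetry, the haplotype “10” is assigned to population 2 with the same probability, which is why the initial bias tends to reinforce itself over iterations.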

Population Structure
377 locations (loci) were sampled in 1000 people from 52 populations.
6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2002): Africa, Eurasia, East Asia, America, Oceania.

Other topics
Protein Sequence Analysis, Sequence Analysis, Gene Finding, Assembly, ncRNA, Genomic Analysis / Pop. Genetics

ncRNA gene finding
The gene is transcribed but not translated.
What are the clues to non-coding genes?
–Look for signals selecting the start of transcription and translation. Many non-coding genes (e.g. tRNAs) are transcribed by Pol III.
–Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure.
Structure: given a sequence, what is the structure into which it can fold with minimum energy?

tRNA structure

RNA structure: Basics
Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U.
The complementary bases form pairs.
Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.

RNA structure: pseudoknots Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots
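
The difference between nested and crossing pairs can be made concrete with a short check. This is a sketch, assuming base pairs are given as pairs of 0-based sequence positions; `crossing_pairs` is an illustrative name.

```python
def crossing_pairs(pairs):
    """Return every two base pairs (i, j), (k, l) that cross, i.e. i < k < j < l."""
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    found = []
    for idx, (i, j) in enumerate(ordered):
        for k, l in ordered[idx + 1:]:
            if i < k < j < l:
                found.append(((i, j), (k, l)))
    return found

# A simple hairpin: all pairs are nested inside one another.
hairpin = [(0, 9), (1, 8), (2, 7)]
# A pseudoknot: the second pair starts inside the first and ends outside it.
pseudoknot = [(0, 5), (3, 8)]

print(crossing_pairs(hairpin))
print(crossing_pairs(pseudoknot))
```

A structure with an empty result is non-crossing; any reported pair indicates a pseudoknot.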

Transcript profiling
A static picture of the cell is insufficient.
Each cell is continuously active:
–Genes are being transcribed into RNA
–RNA is translated into proteins
–Proteins are post-translationally modified and transported
–Proteins perform various cellular functions
Can we probe the cell dynamically?
Topics: gene regulation, proteomic profiling, pathways

Gene expression
The expression of transcripts and proteins in the cell is not static. It changes in response to signals.
The expression can be measured using microarrays.
What causes the change in expression?

Transcriptional machinery
RNA polymerase II (Pol II) scans the genome, initiating transcription and terminating it.
The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.

TF binding
Other transcription factors (TFs) interact with the core machinery and upstream DNA to provide specificity.
TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements.
The enhancer elements may be located many kb upstream of the core promoter.
[Figure: upstream elements and transcription factors.]

TF binding sites
TF binding sites are a weak signal (about 10 bp with 5 bp conserved).
If two genes are co-regulated, they are likely to share binding sites.
Discovery of binding site motifs is an important research problem.
Example sites upstream of genes g1 to g5: TGAGGAG, TCAGGAG, TCAGGTG, TGAGGTG, TCAGGTG
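
The five example sites can be summarised by per-position counts, a consensus and the fully conserved positions. This is a minimal sketch, assuming the sites are already aligned and of equal length; real motif discovery also has to locate the sites, which is not attempted here.

```python
from collections import Counter

sites = ["TGAGGAG", "TCAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]

def count_matrix(seqs):
    """One Counter of bases per aligned position."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in count_matrix(seqs))

def conserved_positions(seqs):
    """0-based positions where every sequence has the same base."""
    return [i for i, c in enumerate(count_matrix(seqs)) if len(c) == 1]

print(consensus(sites))            # TCAGGTG
print(conserved_positions(sites))  # [0, 2, 3, 4, 6]
```

Five of the seven positions are identical across all sites, which illustrates how few conserved bases such a signal carries.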

http://www.gene-regulation.com/pub/databases.html#transfac

Discovering TF binding sites Identification of these TF binding sites/switches is critical. Requires identification of co-regulated genes (genes containing the same set of switches). How do we find co-regulated genes?

Idea 1: Use orthologous genes from different species
1. The species are too close (e.g. humans and chimps). Binding and non-binding sites are both conserved.
ACGGCAGCTCGCCGCCGCGC
||||| |  ||||||| |||
ACGGC-GGGCGCCGCCCCGC
2. The species are distant. Binding sites are conserved but not other sequence.
ACGGCAGCTCGCCGCCGC-C
|  || |  ||||||| |
AGTGC-GGGCGCCGCCTCAT
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
ACGGC-GC-TCGCCGCCGCGC
|   |    | || |    |
AT-ACGAAGTAGCGG-ATGGT

Idea2: Microarray Expression level of all genes

Pathways
Proteins interact to transduce signals, catalyze reactions, etc.
The interactions can be captured in a database.
Queries on this database are about looking for interesting sub-graphs in a large graph.

Summary
Biological databases cannot be understood without understanding the data, and the tools for querying and accessing these data.
While database technology (XML, relational and OO databases, text formats) is used to store this data, its use is (often) transparent for bioinformatics people.
In this course, we looked at various data-streams, and pointed to databases that store these data-streams.
Nucleic Acids Research brings out a database issue every January (e.g. the January 2005 issue).
