CSE182-L18 Population Genetics

Perfect Phylogeny: Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary history is then explained by a tree in which every mutation lies on an edge of the tree. For each mutation, all the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. How can one reconstruct such a tree?

The 4-gamete condition: A column i partitions the set of species into two sets, $i_0$ and $i_1$. A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. Ex: for column i with A=0, B=0, C=0, D=1, E=1, F=1, we have $i_0$ = {A,B,C} and $i_1$ = {D,E,F}, and i is heterogeneous w.r.t. {A,D,E}.

4-Gamete Condition:
– There exists a perfect phylogeny if and only if, for every pair of columns (i,j), j is homogeneous w.r.t. $i_0$ or w.r.t. $i_1$.
– Equivalently, there exists a perfect phylogeny if and only if, for every pair of columns (i,j), the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.
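
The pairwise check stated above can be sketched directly. This is a minimal illustration, assuming the data is a list of rows (one per species) with entries 0 or 1; the function name `violates_four_gamete` is made up for this example.

```python
from itertools import combinations

def violates_four_gamete(matrix):
    """Return the first column pair (i, j) in which all four gametes occur, else None."""
    if not matrix:
        return None
    n_cols = len(matrix[0])
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct (value in column i, value in column j) rows.
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return (i, j)
    return None

compatible = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(violates_four_gamete(compatible))   # no violating pair
print(violates_four_gamete(conflicting))  # columns 0 and 1 conflict
```

The check compares every pair of columns, so it is quadratic in the number of columns; that is fine for a sketch but not the most efficient test.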

4-gamete condition: proof. (only if) Every perfect phylogeny satisfies the 4-gamete condition: depending on which edge the mutation j occurs on, column j must be homogeneous on either $i_0$ or $i_1$. (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

An algorithm for constructing a perfect phylogeny: We first consider the case where 0 is the ancestral state and 1 is the mutated state; this restriction is lifted later (see the unrooted case). In any tree, each node (except the root) has a single parent, so it is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop when all columns have been considered.

Inclusion Property: For any pair of columns i,j, define i < j if and only if $i_1 \subseteq j_1$. Note that if i < j, then the edge containing j is an ancestor of the edge containing i.

Example: Initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent.

Sort columns: Sort the columns according to the inclusion property (in the example, the columns are already sorted). This can be achieved by reading each column as the binary representation of a number (most significant bit in row 1) and sorting in decreasing order, so that a column whose set of 1s contains another column's set of 1s comes first.

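Add first column: In adding column i:
– Check each edge and decide on which side of it each species belongs.
– Finally, add a node if you can resolve a clade.
[Figure: the tree before and after adding the first column, with root r, a new internal node u, and leaves A to E.]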

Adding other columns: Add the other columns on edges using the ordering property. [Figure: the resulting tree with root r and leaves A to E.]
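
One possible way to code the rooted construction is sketched below. It follows the sort-then-attach idea of the slides in the spirit of Gusfield's well-known formulation, and is not necessarily the exact procedure shown in the figures. The function name, the `"r"` label for the root, and the example matrix are assumptions made for illustration; 0 is taken to be the ancestral state.

```python
def perfect_phylogeny(matrix, names):
    """Return (column_parent, leaf_parent), or None if no perfect phylogeny exists.

    column_parent[j] is the column whose edge lies directly above the edge of
    column j ("r" means the root). leaf_parent[name] is the lowest edge above
    that species. Columns are referred to by their original 0-based index.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])

    def value(j):
        # Read column j as a binary number, first row = most significant bit.
        return int("".join(str(matrix[i][j]) for i in range(n_rows)), 2)

    # Sort in decreasing order; skip all-zero and duplicate columns.
    seen, order = set(), []
    for j in sorted(range(n_cols), key=value, reverse=True):
        v = value(j)
        if v and v not in seen:
            seen.add(v)
            order.append(j)

    column_parent = {}
    last = ["r"] * n_rows  # lowest edge placed so far above each species
    for j in order:
        carriers = [i for i in range(n_rows) if matrix[i][j] == 1]
        parents = {last[i] for i in carriers}
        if len(parents) != 1:  # carriers sit in different clades: conflict
            return None
        column_parent[j] = parents.pop()
        for i in carriers:
            last[i] = j
    leaf_parent = {name: last[i] for i, name in enumerate(names)}
    return column_parent, leaf_parent

# Hypothetical 5-species, 5-site example (not the matrix from the slides).
M = [[1, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0]]
print(perfect_phylogeny(M, ["A", "B", "C", "D", "E"]))
```

Storing only a parent pointer per column edge and per species matches the observation that it is sufficient to construct a parent for every node.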

Unrooted case: Switch the values in each column where needed, so that 0 is the majority element. Then apply the algorithm for the rooted case.
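
The column-switching step can be sketched as follows. The function name is hypothetical, and leaving exact ties unflipped is an assumption; the slides do not say how ties are handled.

```python
def to_majority_zero(matrix):
    """Flip every column in which 1s strictly outnumber 0s, so 0 becomes the majority."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # A column is flipped only when more than half of its entries are 1.
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    return [[1 - row[j] if flip[j] else row[j] for j in range(n_cols)]
            for row in matrix]

print(to_majority_zero([[1, 0], [1, 0], [1, 1], [0, 0]]))
```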

Summary: No recombination leads to correlation between sites; the different sites are linked. In the example, a 1 in position 8 implies a 0 in position 5, and vice versa. The history of a population can be expressed as a tree. The tree can be constructed efficiently.

Recombination: A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to violation of the 4-gamete property. Recombination leads to loss of correlation between columns.

Studying recombination: A tree is not sufficient, as a sequence may have 2 parents. Recombination leads to loss of correlation between columns. How can we measure recombination?

Linkage (Dis)-equilibrium (LD): [Figure: two panels of haplotypes at sites A and B.]
– Extensive recombination: Pr[A,B = (0,1)] = 0.125. Linkage equilibrium.
– No recombination: Pr[A,B = (0,1)] = 0.25. Linkage disequilibrium.

Measuring LD: Consider two bi-allelic sites A and B, with values 0 and 1. Let
– $p_1$ = Pr[individual has allele 1 at site A]
– $q_1$ = Pr[individual has allele 1 at site B]
– $P_{11}$ = Pr[individual has allele 1 at both sites A and B]
Linkage disequilibrium: $D = |P_{11} - p_1 q_1| = |P_{01} - p_0 q_1| = \dots$
If D = 0, the sites are uncorrelated (they are in linkage equilibrium). If D >> 0, the sites are highly correlated (they have high LD). Other measures exist, but they all measure similar quantities.
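
As a small illustration of the definition of D, the sketch below estimates it from a sample of two-site haplotypes. The function name and the input layout (a list of `(a, b)` pairs with values 0 or 1) are assumptions for this example.

```python
def linkage_disequilibrium(haplotypes):
    """Estimate D = |P11 - p1*q1| from a list of (allele at A, allele at B) pairs."""
    n = len(haplotypes)
    p1 = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at site A
    q1 = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at site B
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return abs(p11 - p1 * q1)

# All four haplotypes equally common: the sites are independent.
print(linkage_disequilibrium([(0, 0), (0, 1), (1, 0), (1, 1)]))
# Only 00 and 11 occur: the sites are fully linked.
print(linkage_disequilibrium([(0, 0), (1, 1)]))
```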

LD can be used to map disease genes: LD decays with distance from the disease allele. By plotting LD along the region, one can shortlist the region containing the disease gene. [Figure: individuals labeled D and N, with LD plotted along the region.]

Population sub-structure can cause problems in disease gene mapping

Population sub-structure can increase LD: Consider two populations that were isolated and evolving independently. They might have different allele frequencies in some regions. Pick two regions that are far apart (LD is very low, close to 0).
– Pop. A: $p_1 = 0.1$, $q_1 = 0.9$, $P_{11} = 0.1$, $D = 0.01$
– Pop. B: $p_1 = 0.9$, $q_1 = 0.1$, $P_{11} = 0.1$, $D = 0.01$

Recent ad-mixing of population: If the populations came together recently (Ex: African and European populations), artificial LD might be created.
– Pop. A+B: $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = |0.1 - 0.25| = 0.15$
D = 0.15 (instead of 0.01), a 15-fold increase. This spurious LD might lead to false associations. Other genetic events can also cause LD to arise, and one needs to be careful.
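
The arithmetic behind the admixture example can be checked with a few lines. This sketch assumes the two populations are mixed in equal proportions, which is what the pooled frequencies of 0.5 suggest; the helper names `ld` and `mix` are made up here.

```python
def ld(p1, q1, p11):
    """D = |P11 - p1*q1| for given allele and haplotype frequencies."""
    return abs(p11 - p1 * q1)

def mix(a, b, weight_a=0.5):
    """Pool two populations; every frequency is a weighted average (assumed equal mixing)."""
    return {key: weight_a * a[key] + (1 - weight_a) * b[key] for key in a}

pop_a = {"p1": 0.1, "q1": 0.9, "p11": 0.1}
pop_b = {"p1": 0.9, "q1": 0.1, "p11": 0.1}
pooled = mix(pop_a, pop_b)

print(round(ld(**pop_a), 4), round(ld(**pop_b), 4))  # D within each population
print(round(ld(**pooled), 4))                        # D after pooling
```

The pooled D is large even though each population on its own is close to linkage equilibrium, which is the effect the slide describes.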

Determining population sub-structure: Given a mix of people, can you sub-divide them into ethnic populations? Turn the ‘problem’ of spurious LD into a clue.
– Find markers that are too far apart to show LD.
– If they do show LD (correlation), that shows the existence of multiple populations.
– Sub-divide them into populations so that LD disappears.

Determining Population sub-structure: Same example as before. The two markers are too far apart to show any LD, yet they do show LD. However, if you split the individuals so that all 0..1 are in one population and all 1..0 are in another, LD disappears.

Iterative Algorithm for Population Substructure: Assume that there are 2 sub-populations, A and B.
– Randomly partition the individuals into two.
– Select an individual x, and compute the probabilities Pr(x|A) and Pr(x|B).
– Assign the individual to A with probability Pr(x|A) / (Pr(x|A) + Pr(x|B)), and to B otherwise.
– Continue.

Iterative algorithm for population sub-structure: Define
– N = number of individuals (each has a single chromosome)
– k = number of sub-populations
– $Z \in \{1..k\}^N$ is a vector giving the sub-population: $Z_i = k'$ means individual i is assigned to population k'
– $X_{i,j}$ = allelic value for individual i in position j
– $P_{k,j,l}$ = frequency of allele l at position j in population k

Example: Consider the following assignment: $P_{1,1,0} = 0.9$, $P_{2,1,0} =$

Goal: X is known. P and Z are unknown. The goal is to estimate Pr(P,Z|X). Various learning techniques can be employed. Here a Bayesian (MCMC) scheme is employed. We will only consider a simplified version.

Algorithm: Structure. Iteratively estimate $(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \dots, (Z^{(m)},P^{(m)})$. After ‘convergence’, $Z^{(m)}$ is the answer. Iteration:
– Guess $Z^{(0)}$
– For m = 1, 2, ..
  – Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
  – Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
How is this sampling done?
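
The loop above can be sketched in a few lines of Python. This is a simplified illustration and not the actual Structure program: P is set by counting (with a pseudocount added here as an assumption, to avoid zero frequencies) rather than drawn from a posterior, populations are numbered from 0, and the function name and parameters are hypothetical.

```python
import random

def structure_sketch(X, k=2, iterations=200, seed=0, pseudocount=1.0):
    """Alternate between estimating allele frequencies P and resampling assignments Z."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    Z = [rng.randrange(k) for _ in range(n)]  # guess Z^(0) at random
    for _ in range(iterations):
        # Step 1: allele frequencies per population from counts under the current Z.
        P = [[[pseudocount, pseudocount] for _ in range(m)] for _ in range(k)]
        for i in range(n):
            for j in range(m):
                P[Z[i]][j][X[i][j]] += 1
        for pop in P:
            for counts in pop:
                total = counts[0] + counts[1]
                counts[0] /= total
                counts[1] /= total
        # Step 2: resample each Z_i, treating positions as independent given the population.
        for i in range(n):
            weights = []
            for pop in range(k):
                w = 1.0
                for j in range(m):
                    w *= P[pop][j][X[i][j]]
                weights.append(w)
            Z[i] = rng.choices(range(k), weights=weights)[0]
    return Z

# Toy data: five "01" haplotypes and five "10" haplotypes.
X = [[0, 1]] * 5 + [[1, 0]] * 5
print(structure_sketch(X))
```

Because the procedure is random, different seeds can give different labelings, and the population labels themselves are interchangeable.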

Example: Choose Z at random, so each individual is assigned to one of 2 populations (see example). Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$. Simply count: $N_{k,j,l}$ = number of people in population k who have allele l in position j, and set $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$, where $N_{k,j,*} = \sum_l N_{k,j,l}$.

Example: $N_{k,j,l}$ = number of people in population k who have allele l in position j, and $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$. Here $N_{1,1,0} = 4$ and $N_{1,1,1} = 6$, so $p_{1,1,0} = 4/10$. Similarly, $p_{1,2,0} = 4/10$. Thus, we can sample $P^{(m)}$.
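
The counting step can be written out as below. The function name and the toy data are assumptions (the data is not the example from the slide figure), and populations and positions are numbered from 0 rather than 1.

```python
def allele_frequencies(X, Z, k):
    """Return p[pop][j][l] = N[pop][j][l] / N[pop][j][*] for biallelic (0/1) data."""
    m = len(X[0])
    counts = [[[0, 0] for _ in range(m)] for _ in range(k)]
    for x, z in zip(X, Z):
        for j, allele in enumerate(x):
            counts[z][j][allele] += 1
    freqs = []
    for pop in counts:
        # An empty population gets frequency 0.0 everywhere (a choice made for this sketch).
        freqs.append([[c / (c0 + c1) if (c0 + c1) else 0.0 for c in (c0, c1)]
                      for c0, c1 in pop])
    return freqs

# Toy assignment: population 0 holds four "01" and six "10" haplotypes.
X = [[0, 1]] * 4 + [[1, 0]] * 6
Z = [0] * 10
p = allele_frequencies(X, Z, k=2)
print(p[0][0])  # frequencies of alleles 0 and 1 at the first position
```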

Sampling Z: Pr[$Z_1$ = 1] = Pr[“01” belongs to population 1]? We assume that, within a population, the positions are in linkage equilibrium and independent.
– Pr[“01” | Population 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = (4/10)(6/10) = 0.24
– Pr[“01” | Population 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = (6/10)(4/10) = 0.24
– Pr[$Z_1$ = 1] = 0.24 / (0.24 + 0.24) = 0.5

Sampling: Suppose that, during the iteration, there is a bias. Then, in the next step of sampling Z, we will do the right thing:
– Pr[“01” | pop. 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = 0.7 · 0.7 = 0.49
– Pr[“01” | pop. 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = 0.3 · 0.3 = 0.09
– Pr[$Z_1$ = 1] = 0.49 / (0.49 + 0.09) ≈ 0.85
– Pr[$Z_6$ = 1] = 0.49 / (0.49 + 0.09) ≈ 0.85
Eventually all “01” will become one population, and all “10” will become a second population.
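
The two worked cases (no bias, and a bias of 0.7 versus 0.3) can be reproduced with a short function. The name `assignment_probabilities` and the nested-list layout `P[pop][position][allele]` are assumptions for this sketch; indices start at 0, so the slide's population 1 is `P[0]`.

```python
def assignment_probabilities(x, P):
    """Pr[Z = pop | x, P] for one haplotype x, assuming independent positions."""
    likelihoods = []
    for pop in P:
        w = 1.0
        for j, allele in enumerate(x):
            w *= pop[j][allele]  # frequency of the observed allele at position j
        likelihoods.append(w)
    total = sum(likelihoods)
    return [w / total for w in likelihoods]

# Frequencies from the two worked cases, written as P[pop][position][allele].
unbiased = [[[0.4, 0.6], [0.4, 0.6]], [[0.6, 0.4], [0.6, 0.4]]]
biased = [[[0.7, 0.3], [0.3, 0.7]], [[0.3, 0.7], [0.7, 0.3]]]

print(assignment_probabilities((0, 1), unbiased))
print(assignment_probabilities((0, 1), biased))
```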

Population Structure: 377 loci were sampled in about 1000 people from 52 populations. 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions: Africa, Eurasia, East Asia, America, and Oceania (Rosenberg et al., Science 2002).

Other topics: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly; ncRNA; Genomic Analysis / Pop. Genetics.

ncRNA gene finding: The gene is transcribed but not translated. What are the clues to non-coding genes?
– Look for signals selecting start of transcription and translation. Some non-coding genes (for example, tRNA genes) are transcribed by Pol III.
– Non-coding genes have structure. Look for genomic sequences that fold into an RNA structure.
Structure: Given a sequence, what is the structure into which it can fold with minimum energy?

tRNA structure

RNA structure: Basics. Key: RNA is single-stranded. Think of a string over 4 letters: A, C, G, and U. The complementary bases form pairs. Base-pairing defines a secondary structure. The base-pairing is usually non-crossing.
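
To make the idea of a non-crossing secondary structure concrete, here is a sketch of a Nussinov-style dynamic program that maximizes the number of complementary, non-crossing base pairs. It is a simpler stand-in for minimum-energy folding, not an energy model. The function name, the `min_loop` parameter, and the restriction to A-U and G-C pairs (no G-U wobble pairs) are assumptions made for this example.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of non-crossing A-U / G-C pairs (Nussinov-style DP)."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    # best[i][j] = best pair count for the substring seq[i..j]; 0 when too short.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j stays unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(max_base_pairs("GGGAAACCC"))  # a small hairpin: three G-C pairs around an AAA loop
```

Pseudoknots (crossing pairs) are outside what this recurrence can represent, since each pair splits the sequence into an inside and an outside that are solved independently.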

RNA structure: pseudoknots. Sometimes, unpaired bases in loops form ‘crossing pairs’. These are pseudoknots.

Transcript profiling: A static picture of the cell is insufficient. Each cell is continuously active:
– Genes are being transcribed into RNA
– RNA is translated into proteins
– Proteins are post-translationally modified and transported
– Proteins perform various cellular functions
Can we probe the cell dynamically? [Figure labels: Gene Regulation, Proteomic profiling, Pathways.]

Gene expression: The expression of transcripts and proteins in the cell is not static. It changes in response to signals. The expression can be measured using microarrays. What causes the change in expression?

Transcriptional machinery: RNA polymerase II (Pol II) scans the genome, initiating transcription and terminating it. The same machinery is used for every gene, so while Pol II is required, it is not sufficient to confer specificity.

TF binding: Other transcription factors (TFs) interact with the core machinery and upstream DNA to provide specificity. TFs bind to TF binding sites, which are clustered in upstream enhancer and promoter elements. The enhancer elements may be located many kb upstream of the core promoter. [Figure labels: Upstream elements, Transcription factors.]

TF binding sites: TF binding sites are weak signals (about 10 bp with 5 bp conserved). If two genes are co-regulated, they are likely to share binding sites. Discovery of binding site motifs is an important research problem. Example sites for genes g1 to g5:
– g1: TGAGGAG
– g2: TCAGGAG
– g3: TCAGGTG
– g4: TGAGGTG
– g5: TCAGGTG
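
As a first step toward a motif description, the five example sites can be summarized by per-position counts and a consensus string. This is only a sketch of the counting idea, assuming the sites are already aligned and of equal length; the function names are made up here, and real motif finders (e.g. MEME) do much more.

```python
from collections import Counter

sites = ["TGAGGAG", "TCAGGAG", "TCAGGTG", "TGAGGTG", "TCAGGTG"]

def position_counts(sites):
    """One Counter of nucleotides per alignment column."""
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most frequent nucleotide at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))

counts = position_counts(sites)
print(consensus(sites))
print([len(c) == 1 for c in counts])  # True where a position is fully conserved
```

In this example five of the seven positions are identical in all five sites, and the other two vary between two nucleotides.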

Discovering TF binding sites: Identification of these TF binding sites/switches is critical. It requires identification of co-regulated genes (genes containing the same set of switches). How do we find co-regulated genes?

Idea 1: Use orthologous genes from different species.
1. The species are too close (Ex: humans and chimps). Binding and non-binding sites are both conserved.
    ACGGCAGCTCGCCGCCGCGC
    ||||| |  ||||||| |||
    ACGGC-GGGCGCCGCCCCGC
2. The species are distant. Binding sites are conserved, but not other sequence.
    ACGGCAGCTCGCCGCCGC-C
    |  || |  ||||||| |
    AGTGC-GGGCGCCGCCTCAT
3. The species are very distant. Even binding sites are not conserved. The genes have alternative regulators.
    ACGGC-GC-TCGCCGCCGCGC
    |   |    | || |    |
    AT-ACGAAGTAGCGG-ATGGT

Idea 2: Microarray. Expression level of all genes.

Pathways: Proteins interact to transduce signals, catalyze reactions, etc. The interactions can be captured in a database. Queries on this database are about looking for interesting sub-graphs in a large graph.

Summary: Biological databases cannot be understood without understanding the data and the tools for querying and accessing these data. While database technology (XML, relational and OO databases, text formats) is used to store these data, its use is (often) transparent to bioinformatics people. In this course, we looked at various data streams and pointed to databases that store them. Nucleic Acids Research brings out a database issue every January (e.g., the January 2005 issue).