Intro to Comp Genomics Lecture 3: Genomic features and patterns.

Stages: RNA-based genomes, ribosome, proteins, genetic code, DNA-based genomes, membranes, diversity! ? ?
Timeline (BYA = billion years ago):
- 3.4-3.8 BYA: fossils??
- 3.2 BYA: good fossils
- 3 BYA: methanogenesis
- 2.8 BYA: photosynthesis
- 1.7-1.5 BYA: eukaryotes
- 0.55 BYA: Cambrian explosion
- 0.44 BYA: jawed vertebrates
- 0.4 BYA: land plants
- 0.14 BYA: flowering plants
- 0.10 BYA: mammals

Eukaryotes: bikonts and unikonts

Unikonts: one flagellum at some developmental stage
- Fungi
- Animals
- Animal parasites
- Amoebas
Bikonts: ancestrally two flagella
- Green plants
- Red algae
- Ciliates, Plasmodium
- Brown algae
- More amoebae
Strange biology!
A big-bang phylogeny: speciations across a short time span?
Ambiguity, and not much hope of really resolving it

Vertebrates
- Sequenced-genomes phylogeny
- Fossil-based, large-scale phylogeny

Primates
[Figure: primate phylogeny of marmoset, macaque, baboon, gibbon, orangutan, gorilla, chimp and human, with branch labels 0.5%, 0.8%, 1.2%, 1.5%, 3% and 9%]

Yeasts

Genome Size
- Human: 2.8 Gb
- Fly: 130 Mb
- Arabidopsis: 115 Mb
- Plasmodium: 22 Mb
- S. cerevisiae / S. pombe: 12 Mb
- E. coli: 4.6 Mb

Why larger genomes?
Selfish DNA
- Larger genomes are a result of the proliferation of selfish DNA
- Proliferation stops only when it becomes too deleterious
Bulk DNA
- Genome content is a consequence of natural selection
- A larger genome is needed to allow a larger cell size, a larger nuclear membrane, etc.

Why smaller genomes?
Metabolic cost: maybe cells lose excess DNA for energetic efficiency
- But DNA is only 2-5% of the dry mass
- No correlation between genome size and replication time in prokaryotes
- Replication is much faster than transcription (10-20 times in E. coli)

Mutational balance
Balance between deletions and insertions
- May be different between species
- Different balances may have evolved
In flies and yeast (laboratory evolution)
- 4-fold more 4 kb spontaneous insertions
In mammals
- More small deletions than insertions

Mutational hazard
No loss of function for inert DNA
- But is it truly not functional?
Gain-of-function mutations are still possible:
- Transcription
- Regulation
Differences in population size may make DNA purging more effective for prokaryotes and small eukaryotes
Differences in regulatory sophistication may make the DNA mutational hazard less of a problem for metazoans

Genome Structural features: centromeres/telomeres
[Figure labels: Rat (partly acrocentric), Human]
Centromeres are essential and universally important for proper cell division, but diverge greatly among species
- S. cerevisiae: 100 bp centromere
- S. pombe: repetitive (~50 kb)
- Mouse: one arm is degenerate
- Human: both arms contain genes
- Pericentromeric regions: more repeats
Telomeres are critical for genome maintenance
- Subtelomeric regions: also repetitive, and rearranged
- May be key to nuclear structure?

Substitution rates and stationary distributions
A simple Markov chain:
P(X=A;t+1) = P(X=A;t)P(A|A) + P(X=C;t)P(A|C) + P(X=G;t)P(A|G) + P(X=T;t)P(A|T)
We represent the change using the transition probability matrix, a 4x4 matrix with rows and columns indexed by A, C, G, T.
Starting from P(C;t=0) = 1 and running for a long time, what would you expect P(A) to be?
Fixed point: the stationary distribution, which one more step of the chain leaves unchanged, i.e. P(X=a) = sum over b in {A, C, G, T} of P(X=b)P(a|b) for every nucleotide a.
More later in the course (and in the second term)
Differences in substitution rates result in major changes to the stationary distribution
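
The "run for a long time" question can be tried numerically. The sketch below is only an illustration: the substitution probabilities in `P` are made-up values (not taken from the lecture) in which C->T and G->A are more frequent than the other changes, and `step` and `stationary` are hypothetical helper names.

```python
NUCS = "ACGT"

# Hypothetical per-step substitution probabilities, P[old][new]; each row sums to 1.
# Illustrative values only: C->T and G->A are made more frequent than other changes.
P = {
    "A": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    "C": {"A": 0.01, "C": 0.94, "G": 0.01, "T": 0.04},
    "G": {"A": 0.04, "C": 0.01, "G": 0.94, "T": 0.01},
    "T": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
}

def step(dist, P):
    """One step of the chain: P(X=a; t+1) = sum over b of P(X=b; t) * P(a | b)."""
    return {a: sum(dist[b] * P[b][a] for b in NUCS) for a in NUCS}

def stationary(P, start, tol=1e-12, max_steps=100000):
    """Iterate from `start` until one more step no longer changes the distribution."""
    dist = dict(start)
    for _ in range(max_steps):
        nxt = step(dist, P)
        if max(abs(nxt[a] - dist[a]) for a in NUCS) < tol:
            return nxt
        dist = nxt
    return dist

start = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}  # P(C; t=0) = 1
pi = stationary(P, start)
print({a: round(pi[a], 3) for a in NUCS})
```

With these made-up rates the chain settles at roughly 36% each for A and T and 14% each for C and G, even though it starts from all C: a modest asymmetry in substitution rates is enough to shift the stationary composition strongly.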

Nucleotide composition: human vs. mouse

CpG islands
[Diagram: a CpG dinucleotide mutating to TG (CA on the opposite strand)]
- High methylation: deamination and slow correction on top of normal mutation, leaving a small #CpGs
- Low methylation (+ selection??): normal mutation only, leaving a normal #CpGs; these are the CpG islands (mainly at promoters)
- CpG islands: %(G+C) > 0.5 and %CpG/(%G*%C) > 0.6, for a “long” genomic interval
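
The CpG island criterion can be written down directly. This is a minimal sketch: the function names are hypothetical, the percentages are treated as fractions between 0 and 1, and the 200 bp default for a “long” interval is an assumption, since the slide does not give a length.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0, 0.0
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    # Fraction of the n-1 dinucleotide positions that are CG
    f_cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG") / (n - 1)
    ratio = f_cpg / (f_c * f_g) if f_c > 0 and f_g > 0 else 0.0
    return f_c + f_g, ratio

def looks_like_cpg_island(seq, min_len=200):
    """Apply %(G+C) > 0.5 and %CpG/(%G*%C) > 0.6; min_len is an assumed cutoff."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > 0.5 and ratio > 0.6
```

In practice the test would be applied to windows sliding along the genome rather than to a single string.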

K-mer distribution
- Specialized proteins can bind DNA in a sequence-specific fashion
- Genomes can therefore control the affinity of each region to a large set of DNA-binding proteins
- DNA binding sites are typically short (<20 bp)
- Multiple binding sites at different affinities participate in regulation
- The frequency of k-mer DNA words in the genome is called the k-spectrum of the genome
- The k-spectrum is complex, due to multiple effects
[Figure: axis "Distance from insert" (ticks 1, 2, 3); series A, C, G, T]
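
Counting the k-spectrum of a sequence takes only a few lines. A minimal sketch, with `k_spectrum` as a hypothetical helper name; words containing anything other than A, C, G, T (for example N) are skipped.

```python
from collections import Counter

def k_spectrum(seq, k):
    """Count every overlapping k-mer made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip words containing N or other symbols
            counts[word] += 1
    return counts

spectrum = k_spectrum("ACGTACGTNACG", 3)
print(spectrum.most_common(3))
```

Dividing each count by the total number of counted words turns the counts into the k-mer frequencies the slide refers to.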

Genomic information: Protein coding genes

Defining and detecting genes
Predictions: using probabilistic models
- HMM-based, using landmark features
- Different between prokaryotes and eukaryotes
- Modest success, and only work for “classical genes” (protein coding)
RNA based
- Sequencing RNAs from different tissues
- Mapping to the genome using standard alignment algorithms
- Comparing to known protein sequence databases
- Gold standard
EST based
- Sequencing expressed sequence tags (Unigene)
- Clustering and defining criteria for coverage
- Combine with RNA/predictions
Comparative
- Aligning gene models from one species to another based on sequence homology
- Effective for uncharacterized genomes, but of limited accuracy
Databases:
- RefSeq: validated transcripts, high confidence, but incomplete
- UCSC: knownGene (combining multiple sources), half conservative
- Ensembl: combining multiple sources
- Model organisms: SGD, FlyBase, WormBase

Introns/Exons

Genomic information: the gene repertoire is evolving by duplication and loss

Strand asymmetry Polak and Arndt

Structure meets information: HOX clusters as an example
- Hox genes are important developmental regulators
- Present in linear clusters, preserving order
- Their expression is frequently coordinated with the gene order
- 4 HOX clusters are present in the human genome
- Additional gene clusters: protocadherins, olfactory receptors, MAGE genes, zinc fingers
- Additional smaller groups of related regulators are co-located

miRNA clusters

Repeats: selfish DNA
Repetitive elements in the human genome:

Class          Genome fraction   Copies
LINEs          20.4%             868,000 (only ~100 active!!)
SINEs          13.1%             1,558,000 (70% Alu)
LTR elements   8.3%              443,000
Transposons    2.8%              294,000

Retrotransposition via RNA

Repeats: short tandems, satellites
- DNA-based transposons do not involve an RNA intermediate, and are quite rare.
- Satellite DNA is duplicated by replication slippage, which is enhanced for specific sequences. It is abundant near telomeres and centromeres.
- Some of these are still a mystery.
- Retrotransposition is generally sloppy and noisy, so elements die out quickly.
- Element proliferation appears in evolutionary bursts.

Pseudogenes
- Genes that are becoming inactive due to mutations are called pseudogenes
- mRNAs that jump back into the genome are called processed pseudogenes (they therefore lack introns)

Linear correlation
Figure: Wikipedia
- For normalized vectors: [equation not preserved]
- Easy to compute in one pass
- Hard to say if meaningful
- Assuming binormality: [equation not preserved]
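
The slide's own formulas did not survive, so the sketch below falls back on the textbook definitions rather than on the lecture's notation: the Pearson coefficient accumulated from running sums in a single pass, and the usual t statistic, which follows a Student t distribution with n-2 degrees of freedom when the two variables are independent and bivariate normal. The function names are hypothetical.

```python
import math

def pearson_one_pass(xs, ys):
    """Pearson correlation from running sums accumulated in a single pass."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    cov = sxy - sx * sy / n      # n times the covariance
    var_x = sxx - sx * sx / n    # n times the variance of x
    var_y = syy - sy * sy / n    # n times the variance of y
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); undefined when |r| = 1."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

The running-sum form is convenient for streaming over a genome, but it can lose precision when the values are large compared with their spread; centering the data first avoids that.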

Spearman correlation
- Linear correlation is biased whenever the behavior is non-linear.
- Linear correlation is extremely sensitive to outliers.
- Non-parametric statistics transform all values to their rank statistic (by sorting).
- Ties are broken using “mid-ranking”.
- Computing the correlation on the rank statistics gives the Spearman correlation.
- Spearman values of independent variables are distributed just like linear correlation.
- The p-value is computed accordingly.
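
The rank transform with mid-ranking, followed by an ordinary linear correlation, can be sketched as below. The helper names are hypothetical, and `pearson` is the plain two-pass textbook formula.

```python
import math

def mid_ranks(values):
    """Ranks starting at 1; tied values all receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain linear correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman correlation: linear correlation of the mid-ranks."""
    return pearson(mid_ranks(xs), mid_ranks(ys))
```

For a monotone but non-linear relation such as y = x^3 on positive x, `spearman` returns 1 while `pearson` stays below 1, which is the bias the first bullet describes.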

Studying trend lines
- Strong correspondence between variables can be observed without any Spearman or Pearson correlation (can you think of an example?)
- One can try using a parametric transformation to fix such a problem
- In any case, looking at the data carefully and computing trend lines is essential
Generating statistics on trend lines:
- Fixed-span bins
- Fixed-size bins
- Sliding window
- Weighted sliding window
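
Two of the binning schemes can be sketched as follows. The example data, y = x^2 on a range symmetric around zero, is one possible answer to the question above: by symmetry it has no Pearson or Spearman correlation, yet the trend line shows a clear U shape. The function names and the example are illustrative assumptions, not part of the lecture.

```python
def fixed_size_bins(xs, ys, size):
    """Sort by x, then average consecutive groups of `size` points."""
    pts = sorted(zip(xs, ys))
    trend = []
    for i in range(0, len(pts), size):
        chunk = pts[i:i + size]
        mean_x = sum(p[0] for p in chunk) / len(chunk)
        mean_y = sum(p[1] for p in chunk) / len(chunk)
        trend.append((mean_x, mean_y))
    return trend

def fixed_span_bins(xs, ys, lo, hi, n_bins):
    """Split [lo, hi] into n_bins equal-width intervals and average y in each."""
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        if lo <= x <= hi:
            b = min(int((x - lo) / width), n_bins - 1)  # x == hi goes in the last bin
            sums[b] += y
            counts[b] += 1
    # Report (bin center, mean y) for every non-empty bin
    return [(lo + (b + 0.5) * width, sums[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]

xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]
trend = fixed_size_bins(xs, ys, 20)
spans = fixed_span_bins(xs, ys, -5, 5, 5)
```

Fixed-size bins keep the same number of points, and so comparable noise, in every bin; fixed-span bins keep the x resolution constant but can leave sparse regions with very few points.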

Auto-correlation
- Computing the correlation between point x and point x-d for different d's
- Provides clues to the different scales of correlation in the data
- For example:
  - Fragment lengths on arrays
  - Nucleosomes
- Related method: Fast Fourier Transform (FFT)
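
A minimal sketch of the auto-correlation profile, using a synthetic signal with a period of 4 positions (an illustrative stand-in, not real data); `autocorrelation` is a hypothetical helper name.

```python
def autocorrelation(signal, d):
    """Linear correlation between signal[i] and signal[i - d] over all valid i."""
    a = signal[d:]
    b = signal[:len(signal) - d]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

signal = [0, 1, 0, -1] * 25  # synthetic signal with period 4
profile = {d: autocorrelation(signal, d) for d in range(1, 9)}
```

Here the profile peaks at d = 4 and d = 8 and is most negative at d = 2 and d = 6, which is how a periodic scale in the data shows up.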

Testing difference between samples
Model based:
- Assume normality, compute p-value
- Assume binomial distribution
- Assume Poisson distribution
Direct comparison:
- t-test to compare means
- ANOVA
- Chi-square to compare contingency tables
- Kolmogorov-Smirnov
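
As one example of a direct comparison, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical cumulative distributions. The sketch below computes only the statistic, not its p-value; `ks_statistic` is a hypothetical name.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:  # the largest gap is always attained at an observed value
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

The statistic compares whole distributions, so it can pick up differences in spread or shape that a comparison of means would miss.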

Multi-variate cross correlation
Pairwise correlation matrix
- Plotting, but in which order?
Multiple testing should be controlled for
- Bonferroni's union bound
- False discovery rate (FDR)
Control one/few parameters
- Pros: results are robust if done right
- Cons: fewer stats
Normalize one/few parameters
Clustering or model-based approaches (see later in the course)
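
Both corrections fit in a few lines. The slide names FDR without a specific procedure; the sketch below uses the Benjamini-Hochberg step-up procedure, which is a choice made here rather than something stated in the lecture. The function names and the default alpha = 0.05 are likewise assumptions.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of tests that pass the union bound: p <= alpha / number of tests."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of tests accepted by the Benjamini-Hochberg FDR procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
strict = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
```

On this made-up list Bonferroni keeps only the first test while Benjamini-Hochberg keeps the first two. The gap between the two grows quickly when many tests are truly non-null, as is likely with 256*256/2 correlations.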

Your Task
Preparations:
- Download chromosome 17 of the human genome
- Download the UCSC knownGene table
- Compute the moving average G+C content for bins of 500 bp; decide on 8 G+C content bins.
Modeling:
- Build a 3-order, G+C content dependent Markov model from all of the non-exonic sequences
Analysis:
- Compute the expected frequency of 4-mers in your genomic bins
- Compute the observed/expected ratio for all 256 4-mers on all genomic bins. Compute correlations between them.
- Develop a p-value (using any reasonable method) for the difference between observed and expected k-mer frequencies; report the most significant p-values you discovered.

Preparations, in detail:
- Links to data in the wiki
- File format of the knownGene table at the UCSC site
- Reasonable G+C content bins: balance the bin span and the bin count (having very few cases in a bin will make your subsequent model unrealistic; having a large span of G+C in a bin would make your model inaccurate)
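
One possible shape for the preparation step is sketched below. It makes several assumptions that are not in the task description: the windows are consecutive and non-overlapping rather than a true moving average, the 8 bins are cut at quantiles so that each holds about the same number of windows (one way to balance bin count against bin span), and all function names are hypothetical.

```python
def gc_per_window(seq, window=500):
    """G+C fraction of each consecutive, non-overlapping window."""
    out = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        acgt = sum(chunk.count(c) for c in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        out.append(gc / acgt if acgt else None)  # None: the window has no A, C, G or T
    return out

def quantile_edges(values, n_bins=8):
    """Bin edges chosen so that each bin holds about the same number of windows."""
    vals = sorted(v for v in values if v is not None)
    return [vals[len(vals) * i // n_bins] for i in range(1, n_bins)]

def gc_bin(value, edges):
    """Index (0 .. len(edges)) of the G+C bin that contains `value`."""
    return sum(value >= e for e in edges)
```

Quantile edges give narrow bins where most of the genome lies and wide bins in the tails, so it is worth checking the G+C span of the extreme bins before trusting the model built on them.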

Your Task: Modeling, in detail
- Count k-mers for each G+C bin.
- Transform the counts to a 3-order Markov model.
- Use fixed-size, non-overlapping bins of 20 kb for testing the observed/expected statistic.
- Ignore repeat-masked sequence (lower-case characters).
- If you believe a different bin size would work better, go for it (think of the expected number of each of the 256 k-mers: in a 20 kb bin it is ~100, since 20,000/256 is about 78, if there is no masked sequence).
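
Turning 4-mer counts into a 3-order Markov model amounts to normalizing each count by the total for its 3-letter prefix. The sketch below handles a single G+C bin and would be run once per bin; the pseudocount is an addition made here to avoid zero probabilities, and the function names are hypothetical.

```python
from collections import Counter

def count_4mers(seq):
    """Count overlapping 4-mers, skipping lower-case (repeat-masked) and N positions."""
    counts = Counter()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if set(word) <= set("ACGT"):  # upper-case A, C, G, T only
            counts[word] += 1
    return counts

def markov3_from_counts(counts, pseudocount=1.0):
    """P(x4 | x1 x2 x3) = count(x1 x2 x3 x4) / sum over y of count(x1 x2 x3 y)."""
    model = {}
    for ctx in {word[:3] for word in counts}:
        total = sum(counts[ctx + y] + pseudocount for y in "ACGT")
        model[ctx] = {y: (counts[ctx + y] + pseudocount) / total for y in "ACGT"}
    return model

counts = count_4mers("ACGTACGAacgtACGT")
model = markov3_from_counts(counts, pseudocount=0.0)
```

In the toy sequence the lower-case stretch contributes no counts, and the prefix ACG is followed by T twice and by A once, giving P(T | ACG) = 2/3.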

Your Task: Analysis, in detail
- The expected number of appearances of a 4-mer given only the G+C content: [equation not preserved]
- The expected number of appearances of a 4-mer given the 3-order Markov model and the G+C content: [equation not preserved]
- The obs/exp ratio should be studied in log scale, carefully handling low values
- Computing 256*256/2 correlations of ~1000 values should not be difficult. We will use these correlations later.
- The p-value can be computed using a Z-score
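
The formulas for the two expectations are missing from the transcript, so the sketch below is one plausible reading rather than the lecture's definition. It assumes that, given only G+C, bases are independent with P(G) = P(C) = gc/2 and P(A) = P(T) = (1-gc)/2; that, given the Markov model, the expected count is the observed count of the 3-letter prefix times P(last base | prefix); and that counts have Poisson-like variance (variance equal to the mean) for the Z-score. All function names and the pseudocount are hypothetical.

```python
import math

def expected_gc_only(word, gc, n_positions):
    """Expected count if bases were independent with the given G+C fraction."""
    p = 1.0
    for base in word:
        p *= gc / 2 if base in "GC" else (1 - gc) / 2
    return n_positions * p

def expected_markov(prefix_count, p_last_given_prefix):
    """Expected count from the observed 3-mer prefix count and P(x4 | x1 x2 x3)."""
    return prefix_count * p_last_given_prefix

def log_ratio(observed, expected, pseudocount=1.0):
    """log2(obs/exp), with a pseudocount so that low counts do not blow up."""
    return math.log2((observed + pseudocount) / (expected + pseudocount))

def z_score_pvalue(observed, expected):
    """Z-score and two-sided normal p-value, assuming variance equal to the mean."""
    z = (observed - expected) / math.sqrt(expected)
    return z, math.erfc(abs(z) / math.sqrt(2))
```

For instance, a 4-mer expected 100 times but observed 120 times gets Z = 2 and a two-sided p-value of about 0.046, before any correction for testing 256 words in roughly 1000 bins.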
