Gene Expression in that cell at that time

Presentation transcript:

Gene Expression in that cell at that time
Expressed genes are those that have been transcribed.
A gene expression profile of a cell is a snapshot of which genes are expressed in that cell at the time the sample was taken.
Knowing which genes are expressed in a cell allows:
- identification of new genes or transcripts
- comparison of expression profiles between samples
Main motives: disease, development, dynamic responses

Differential Gene Expression
The expression profile of genes in one sample versus another.
Different cells, tissues, disease states, developmental stages, culture conditions, etc. can be compared.
Measure both, subtract the overlap, obtain the difference, interpret it.
- Pay due attention to controls, both negative and positive
- Pay due attention to the range of variability within samples
High throughput and higher throughput

Differential Gene Expression Workflow
1. Formulate the biological questions
2. Experimental design: which platform, which controls, how many replicates
3. Run the experiment
4. Image processing (by machine)
5. Low-level analysis: data preprocessing (normalisation step)
6. High-level analysis: data analysis
7. Reach biological conclusions: interpretation of results

High Throughput Methods - Advantages
- Fast: a lot of data produced quickly
- Comprehensive: entire genomes in one experiment
- Easy: submit RNA samples to a core facility
- Cost: getting cheaper still

High Throughput Methods - Disadvantages
- Cost: some researchers can’t afford appropriate numbers of controls and replicates
- RNA: the final product of gene expression is a protein, but these methods measure RNA
- Significance: how do you filter out non-coding RNA, or transcripts that are not translated?
- Quality: artifacts from image analysis and data analysis
- Control: not enough attention to experimental design; more collaboration with computational scientists is needed

Measuring Differential Gene Expression
Quantification of mRNA transcripts:
- EST libraries
- SAGE libraries
- Microarray technology
- High-throughput RNA-seq technology: transcriptome sequencing (quantification of mRNA transcripts)

EST Libraries
Expressed Sequence Tags
Single sequencing reads from cDNA libraries:
- 250 (earlier) to 800 (later) bases long
- usually from the 5’ or 3’ end, according to the cloning strategy
- an indication of which mRNAs are in that cell at that time
Highly expressed genes = many ESTs
Low expression genes = fewer (or no detectable) ESTs
Can miss very low level transcripts, and some transcript variants
OK for quantification, with known inaccuracies inherent in the method
Not OK for discovery of rare transcripts: too much noise from common transcripts
dbEST at NCBI

SAGE Libraries
Serial Analysis of Gene Expression
A 14 bp fragment (tag) is enough to uniquely identify a transcript.
Make a cDNA library and cut it down to one 14 bp tag per transcript.
Ligate the tags into long concatemers, separated by a marker, and sequence them.
The output is a quantifiable list of short tags denoting the presence of a gene and how much of it is there.
Useful for comparing transcriptomes and for discovery of new genes or transcripts.
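
The tag-counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker sequence `CATG`, the two example tags and the function name `count_sage_tags` are made up for the example and are not taken from the slides.

```python
from collections import Counter

TAG_LENGTH = 14   # tag size quoted on the slide
MARKER = "CATG"   # assumed marker sequence separating tags (illustrative)


def count_sage_tags(concatemer, marker=MARKER, tag_length=TAG_LENGTH):
    """Split a sequenced concatemer on the marker and count each tag."""
    pieces = concatemer.split(marker)
    # Keep only pieces of the expected tag length; anything else is noise.
    return Counter(p for p in pieces if len(p) == tag_length)


# Two hypothetical 14 bp tags; tag_a is ligated three times, tag_b once.
tag_a = "ACGTACGTACGTAC"
tag_b = "TTGGCCAATTGGCC"
concatemer = MARKER.join([tag_a, tag_b, tag_a, tag_a])

tag_counts = count_sage_tags(concatemer)
print(tag_counts.most_common())
```

The more often a tag appears, the more abundant its transcript was in the sample.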

Microarray Technology
Probe
- Single stranded lawn of DNA probes attached to a membrane on a microarray chip
- Usually 20-30 bp of a unique sequence from a gene
- Often more than one probe represents a gene
Target (hybridization extract)
- Total cDNA extracted from the biological sample and labeled with fluorescent dye
Targets hybridise to the probes that have complementary sequences.
Intensity of hybridization is measured as an indication of the presence of that gene in the biological sample.

Uses of microarrays
Changes in gene expression levels (one or two colours)
- Probes are ssDNA (cDNA or oligos)
- Target is labeled cDNA derived from mRNA
Genomic gains and losses (two colours): CGH (Comparative Genomic Hybridization)
- Probes are ssDNA (oligos)
- Target is labeled DNA derived from genomic DNA
Genomic SNPs (one colour)
- Probes are short genomic sequences containing SNPs

The Benefits of GEO and MAML
- By storing vast amounts of data on gene expression profiles derived from multiple experiments using varied criteria and conditions, GEO will aid in the study of functional genomics: the development and application of global experimental approaches to assess gene function.
- GEO will facilitate the cross-validation of data obtained using different techniques and technologies and will help set benchmarks and standards for further gene expression studies.
- By making the information stored in GEO publicly available, the fields of bioinformatics and functional genomics will be both promoted and advanced.
- That such experimental data should be freely accessible to all is consistent with NCBI's legislative mandate and mission: to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease.

Holdus, Stavrum, Petersen and Stansberg 2008

Image Processing
This is computerized: you just see the final result in a spreadsheet.
The software scans the array and quantitates the signal values, i.e. converts fluorescence intensity to a digital value.

Data Preprocessing
You choose the parameters, the software does the work.
Background subtraction: eliminates background noise
Normalization: this step takes care of
- unequal quantity of starting sample
- difference in labeling efficiency
- difference in detection efficiency
- systematic biases, etc.
It brings all samples into a similar range of distribution.
Statistical QC: removes low quality samples and probesets

Detection of Significantly Differentially Expressed Genes
Statistical tests
- Student’s t test for two conditions/groups (control vs treated), i.e. a comparison of the means and standard deviations of two bell shaped curves, based on a t-statistic, testing the null hypothesis that both samples came from the same distribution.
- ANOVA (control vs treatment 1 vs treatment 2), i.e. ANalysis Of VAriance: tests the null hypothesis that the means of three or more groups are the same. Based on the F-statistic, the ratio of the variance among the group means to the variance within the groups.
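
As a rough illustration of the t-statistic mentioned above, here is a minimal Python sketch of the pooled (equal-variance) two-sample form. The expression values and the function name `student_t` are invented for the example; turning the statistic into a p-value needs the t distribution from a statistics package and is left out.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled two-sample t-statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical log-intensities for one gene, four replicates per condition.
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.5, 6.1]

t_stat, dof = student_t(treated, control)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

A large absolute t value means the difference between the group means is large relative to the variability within the groups.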

Detection of Significantly Differentially Expressed Genes
2-way ANOVA (e.g. 2 cell lines, 2 treatments): the two-way ANOVA compares the mean differences between groups that have been split on two independent variables, called factors. Its primary purpose is to understand whether there is an interaction between the two independent variables in their effect on the dependent variable.
All these methods produce p-values, which assess the probability of obtaining the result by chance.
Problem: what happens if we have many such tests?

Detection of Significantly Differentially Expressed Genes
Multiple testing problem: say you have a set of hypotheses that you wish to test simultaneously.
Let’s consider a case where you have 20 hypotheses to test, and a significance level of α = 0.05.
What’s the probability of observing at least one significant result just due to chance?
P(at least one significant result) = 1 − P(no significant results) = 1 − (1 − 0.05)^20 ≈ 0.64
We have a 64% chance of finding at least one significant result purely by chance.
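
The arithmetic above is easy to check, and it shows how quickly the problem grows with the number of tests. The sketch below assumes the tests are independent; the function name is made up for the example.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one significant result by chance) for independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 20, 100, 10000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

With thousands of genes tested at once, at least one false positive is practically guaranteed unless the threshold is corrected.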

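Detection of Significantly Differentially Expressed Genes
Correction methods, for significance level α and N tests:
- Bonferroni (very conservative): the significance threshold becomes α/N
- FDR (False Discovery Rate): order the p-values from smallest to largest and check whether the kth ordered p-value is larger than (k × α)/N; the tests up to the largest k for which it is not larger are called significant
- q-value: the FDR counterpart of the p-value, i.e. the expected proportion of false positives among all tests called significant at that threshold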
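
Both corrections can be sketched in plain Python. The FDR rule with the (k × α)/N threshold is the Benjamini-Hochberg step-up procedure; the p-values and function names below are invented for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Call a test significant only if p <= alpha / N."""
    n = len(pvalues)
    return [p <= alpha / n for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up FDR control: find the largest rank k with p_(k) <= k * alpha / N."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / n:
            cutoff_rank = rank
    significant = [False] * n
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni:", sum(bonferroni(pvals)), "significant")
print("FDR (BH):  ", sum(benjamini_hochberg(pvals)), "significant")
```

On these made-up values Bonferroni keeps one gene and the FDR procedure keeps two, which illustrates why Bonferroni is described as very conservative.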

Detection of Significantly Differentially Expressed Genes
Fold change
The difference in the intensity of a sample vs the control or another sample, indicative of a difference in the level of expression of the gene.
Threshold: > 2, or > 1.6 in some cases
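
A fold-change filter might look like the sketch below. Working on the log2 scale makes up- and down-regulation symmetric; the pseudocount of 1 (to avoid dividing by zero) and the function names are assumptions of this example, not part of the slides.

```python
from math import log2


def log2_fold_change(sample, control, pseudocount=1.0):
    """log2 ratio of sample to control; the pseudocount is an assumed safeguard."""
    return log2((sample + pseudocount) / (control + pseudocount))


def passes_threshold(sample, control, fold=2.0, pseudocount=1.0):
    """True if the gene changes by more than `fold` in either direction."""
    return abs(log2_fold_change(sample, control, pseudocount)) > log2(fold)


print(passes_threshold(399, 99))            # 4-fold up
print(passes_threshold(99, 399))            # 4-fold down
print(passes_threshold(149, 99))            # 1.5-fold up, below a 2-fold cutoff
print(passes_threshold(169, 99, fold=1.6))  # 1.7-fold up, above a 1.6-fold cutoff
```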

Then clustering

Then clustering
In differential gene expression, you are looking for genes that behave differently between one sample and another, either up- or down-regulated.
Once you get your DE gene set, you group the genes according to similar expression, and the outliers become more obvious.
Clustering methods are similar to those of phylogenetics, but without the evolutionary weightings, i.e. they work on distance matrices.
More downstream analysis later in the course.
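
The distance matrix that clustering starts from can be sketched as follows. This example uses 1 minus the Pearson correlation as the distance, which is one common choice and an assumption here; the gene names and expression profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def distance_matrix(profiles):
    """Pairwise correlation distance (1 - r) between all genes."""
    names = list(profiles)
    return {(a, b): 1 - pearson(profiles[a], profiles[b]) for a in names for b in names}


# Hypothetical expression of three DE genes across four samples.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],   # same trend as geneA
    "geneC": [4, 3, 2, 1],   # opposite trend
}
dist = distance_matrix(profiles)
print(round(dist[("geneA", "geneB")], 3), round(dist[("geneA", "geneC")], 3))
```

Genes with a small distance (geneA and geneB here) would end up in the same cluster; a clustering algorithm then merges the closest groups step by step.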

RNA-seq
Same concept as sequencing ESTs and counting SAGE tags, but does not stop at short segments and tags.
What is being sequenced is the cDNA made from the mRNA component.
Sequencing of the whole transcriptome of a sample (NGS), and comparing it against the whole transcriptome of another sample.
Costly and informative, but the bioinformatics is not yet fully sorted out: when does a lot of data become too much data?

Finding the real transcripts

It’s all about the alignment
First, you align your reads to a reference genome or genomic region (or assemble the reads de novo): BWA, Bowtie2, etc.
Then you use a splice-aware aligner, such as TopHat or STAR, to refine the alignments according to coding sequences (exons) using known and/or predicted splice junctions.

Quantifying reads per gene
Your aim is to count sequence reads per gene.
When mapping reads to the genome:
- Filter out rRNA, tRNA, mitochondrial RNA, etc.
- Filter out (or in!) non-coding RNA
- Deal with alternative splicing
- Deal with overlapping genes and pseudogenes
- Short reads mean many short overlaps at one end or the other of intron gaps
- Allele-specific gene expression
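
A toy version of counting reads per gene, including the overlapping-gene problem from the list above, could look like this. It is a deliberately simplified sketch: coordinates are half-open intervals, strand and splicing are ignored, and the gene names, read positions and the `__ambiguous` / `__no_feature` labels are all hypothetical.

```python
def count_reads_per_gene(reads, genes):
    """Assign each read to the single gene it overlaps.

    reads: list of (chrom, start, end)
    genes: dict of gene name -> (chrom, start, end)
    """
    counts = {name: 0 for name in genes}
    counts["__ambiguous"] = 0    # read overlaps more than one gene
    counts["__no_feature"] = 0   # read overlaps no annotated gene
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["__ambiguous"] += 1
        else:
            counts["__no_feature"] += 1
    return counts


genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),   # inside geneA only
    ("chr1", 460, 510),   # in the region where geneA and geneB overlap
    ("chr1", 600, 650),   # inside geneB only
    ("chr2", 10, 60),     # different chromosome
    ("chr1", 950, 1000),  # intergenic
]
read_counts = count_reads_per_gene(reads, genes)
print(read_counts)
```

How ambiguous reads are handled (discarded, split, or assigned statistically) is a design decision that differs between counting tools.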

Some Solutions
- Can create a library of transcripts and map reads to transcripts (still some ambiguity for multiple isoforms) [limited, few (if any) use this method]
- Can create a library of splice junctions (spanning intron gaps) [Illumina CASAVA uses this method]
- Can predict transcripts from genome-mapped RNA-seq reads plus known splice junctions plus predicted splice junctions [TopHat]
- Can do de novo assembly of new transcripts from reads [Trinity]
c.f. S. Brown, NYU

Normalization
Coverage is not exactly the same for each sample.
Problem: need to scale RNA counts per gene to total sample coverage
Solution: divide the counts by the total number of reads, in millions
Problem: longer genes have more reads, which gives a better chance to detect DE
Solution: divide the counts by gene length, in kilobases
Result = RPKM, and later FPKM (Reads/Fragments Per Kilobase per Million)
c.f. S. Brown, NYU
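
The two scaling steps translate directly into code. A minimal sketch, with invented numbers and function names:

```python
def cpm(count, total_reads):
    """Counts per million mapped reads (corrects for sequencing depth)."""
    return count / (total_reads / 1e6)


def rpkm(count, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count / (gene_length_bp / 1e3) / (total_reads / 1e6)


# Hypothetical gene: 500 reads, 2,000 bp long, in a sample of 10 million mapped reads.
print(cpm(500, 10_000_000))          # 50.0
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

FPKM uses the same formula but counts fragments (read pairs) instead of individual reads.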

Better Normalization
FPKM assumes:
- the total amount of RNA per cell is constant
- most genes do not change expression
FPKM is invalid if there are a few very highly expressed genes that have a dramatic change in expression (they dominate the pool of reads).
Many now use “Quantile” normalization.
New normalization methods are currently being published.
Different normalization methods give different results.
c.f. S. Brown, NYU

Better Normalization
Quantile normalization: making distributions identical in statistical properties.
[Figure: genes × arrays matrix. Steps shown: assign ranks, rearrange columns, rank values, assign values]
c.f. S. Brown, NYU
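
The rank-and-replace steps of quantile normalization can be sketched in plain Python as below. This is a simplified illustration: tied values are not averaged, and the three example arrays are invented.

```python
def quantile_normalize(columns):
    """Give every sample (column) the same distribution of values.

    columns: list of samples, each a list of values for the same genes in the same order.
    """
    n_genes = len(columns[0])
    # Sort each sample, then average across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(columns) for r in range(n_genes)
    ]
    normalized = []
    for col in columns:
        # Gene indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]  # replace the value by the mean for its rank
        normalized.append(out)
    return normalized


# Hypothetical values for four genes on three arrays.
arrays = [[5, 2, 3, 4], [4, 1, 3, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(arrays)
for column in normalized:
    print([round(v, 2) for v in column])
```

After normalization every array contains exactly the same set of values; only their assignment to genes differs.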

Statistics of Differential Gene Expression
mRNA levels are variable in cells/tissues/organisms over time/treatment/tissue etc.
Need enough replicates to separate biological variability from experimental variability.
If there is high experimental variability, then variance within replicates will be high and statistical significance for DE will be difficult to find.
The best methods to discover DE are coupled with sophisticated approaches to normalization.
Very low expressing genes are tricky: FPKM < 1
c.f. S. Brown, NYU

Gene Expression Analysis
Databases:
- GEO from NCBI
- ArrayExpress from EBI
Commercial software: GeneSpring GX, CLC Bio, many others
Free software: mostly R based
Not being scared of statistics is an advantage.
New methods and algorithms are continually being published.
Routine experiments are routine; innovative methods need more care.
The really tricky part is the interpretation of the results.

https://github.com/ccsstudentmentors/tutorials/wiki/CCS-Student-Mentors---Tutorials

Suggested additional reading: