Then clustering.

Slides:



Advertisements
Similar presentations
RNA-Seq as a Discovery Tool
Advertisements

Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Ab initio gene prediction Genome 559, Winter 2011.
Peter Tsai Bioinformatics Institute, University of Auckland
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Introduction to BioInformatics GCB/CIS535
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
RNA-seq Analysis in Galaxy
Pathway Informatics 6 th July, 2015 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services Health Sciences Library System University of.
Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
 2 Outline  Review of major computational approaches to facilitate biological interpretation of  high-throughput microarray  and RNA-Seq experiments.
Genome Informatics 2005 ~ 220 participants 1 keynote speaker: David Haussler 47 talks 121 posters.
Gene Set Enrichment Analysis (GSEA)
Networks and Interactions Boo Virk v1.0.
RNAseq analyses -- methods
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Introduction to RNAseq
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.
Motif Search and RNA Structure Prediction Lesson 9.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Tools in Bioinformatics Ontologies and pathways. Why are ontologies needed? A free text is the best way to describe what a protein does to a human reader.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Microarray Technology and Data Analysis Roy Williams PhD Sanford | Burnham Medical Research Institute.
Pathway Informatics 30 th March, 2016 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services Health Sciences Library System University.
Centralizing Bioinformatics Services: Analysis Pipelines, Opportunities, and Challenges with Large- scale –Omics, and other BigData High-Performance Computing.
Gene Expression in that cell at that time
RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
David Amar, Tom Hait, and Ron Shamir
Regulation of Gene Expression
bacteria and eukaryotes
Pathway Informatics 16th August, 2017
The Transcriptional Landscape of the Mammalian Genome
Networks and Interactions
Genomics A Systematic Study of the Locations, Functions and Interactions of Many Genes at Once.
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
Pathway Analysis June 13, 2017.
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
Tutorial 6 : RNA - Sequencing Analysis and GO enrichment
GO : the Gene Ontology & Functional enrichment analysis
Detect alternative splicing
Gene expression.
Statistical Testing with Genes
Gene Expression in that cell at that time
Very important to know the difference between the trees!
Functional Annotation of the Horse Genome
Genesets and Enrichment
Reference based assembly
From: TopHat: discovering splice junctions with RNA-Seq
Advanced PGDB Editing: Regulation GO Terms
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Identification and Characterization of pre-miRNA Candidates in the C
Gene expression analysis
Pathway Informatics December 5, 2018 Ansuman Chattopadhyay, PhD
KEY CONCEPT Entire genomes are sequenced, studied, and compared.
Maximize read usage through mapping strategies
The Omics Dashboard.
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
Statistical Testing with Genes
Presentation transcript:

Then clustering

Then clustering In differential gene expression, you are looking for genes that behave differently between one sample and another, either up- or down- Once you get your DE gene set, you group the genes according to similar expression, and the outliers become more obvious Clustering methods similar to those of phylogenetics, but without the evolutionary weightings, ie distance matrices More downstream analysis later in the course

RNA-seq Same concept as sequencing ESTs and counting SAGE tags, but does not stop at short segments and tags. What is being sequenced is the cDNA from the mRNA component. Sequencing of whole transcriptome of a sample (NGS), and comparing it against the whole transcriptome of another sample. Costly, informative, bioinformatics not yet fully sorted out- when does a lot of data become too much data?

Finding the real transcripts

It’s all about the alignment First, you align your reads to a reference genome or genomic region (or assemble the reads de novo) BWA, Bowtie2, etc Then you use a splice-aware aligner, such as TopHat or STAR, to refine the aligments according to coding sequences (exons) using known and/or predicted splice junctions

Quantifying reads per gene Your aim is to count sequence reads per gene When mapping reads to genome: Filter out rRNA, tRNA, mitRNA, etc Filtering out (or in!) non-coding RNA Deal with alternative splicing Deal with overlapping genes, pseudogenes Small reads mean many short overlaps at one end or the other of intron gaps Allele specific gene expression

Some Solutions Can create a library of transcripts and map reads to transcripts (still have some ambiguity for multiple isoforms) [limited, few (if any) use this method] Can create a library of splice-junctions (span intron gaps) [Illumina CASAVA uses this method] Can predict transcripts from genome mapped RNA-seq reads plus known splice junctions plus predicted splice junctions [TopHat] Can do de novo assembly of new transcripts from reads [Trinity] c.f. S. Brown, NYU

Normalization Coverage is not exactly the same for each sample Problem: Need to scale RNA counts per gene to total sample coverage Solution – divide counts per million reads Problem: Longer genes have more reads, gives better chance to detect DE Solution – divide counts by gene length Result = RPKM and later FRKM (Reads/Fragments Per KB per Million) c.f. S. Brown, NYU

Better Normalization FPKM assumes: Total amount of RNA per cell is constant Most genes do not change expression FPKM is invalid if there are a few very highly expressed genes that have dramatic change in expression (dominate the pool of reads) Many now use “Quantile” normalization New normalization methods currently being published Different normalization methods give different results c.f. S. Brown, NYU

Better Normalization quantile normalization: making distributions identical in statistical properties arrays genes rearrange columns assign ranks arrays genes rank values assign values c.f. S. Brown, NYU

Statistics of Differential Gene Expression mRNA levels are variable in cells/tissues/organisms over time/treatment/tissue etc. Need enough replicates to separate biological variability from experimental variability If there is high experimental variability, then variance within replicates will be high, statistical significance for DE will be difficult to find. Best methods to discover DE are coupled with sophisticated approaches to normalization Very low expressing genes are tricky: FPKM<1 c.f. S. Brown, NYU

Gene Expression Analysis Databases: GEO from NCBI ArrayExpress from EBI Commercial software: GeneSpring GX, CLC Bio, many others Free: Mostly R based Not being scared of statistics is an advantage New methods and algorithms continually being published Routine experiments are routine, innovative methods more care The really tricky part is the interpretation of the results

https://github.com/ccsstudentmentors/tutorials/wiki/CCS-Student-Mentors---Tutorials

Suggested additional reading:

relies on information in pathways to build a network Systems Biology Genome all genes Transcriptome all expressed genes (mRNA) Proteome all translated transcripts (proteins) Metabolome all small molecules Systems biology relies on information in pathways to build a network

Integrating Data Bioinformatics algorithms (seek to) allow large scale mapping of a cell’s known pathways to create a network model. input from various experiments into one model Integrate this model to database updates, literature updates and experimental inputs, bring in added value add user-friendly visualization tools, make it look pretty and you have a powerful setup for understanding how a system works.

Major Problem Integrating data from different sources highlighted differences in Terminology Measurements Experimental platforms Rigour of cross-referencing citations Databases show inconsistencies in pathway size and annotations, number and identity of of nodes.

Controlled Vocabulary High level organisation of data under umbrella definitions

The Gene Ontology For each gene or gene product: Cellular compartment Molecular function(s) Biological process Thus you can start building a picture that can be compared across different species and data types. It sort of works: important leads that have proven true Down side: some terms are “part-of” other terms, conditional functions, species groups have their own jargon…

Biological Pathways A series of actions that lead to a certain product or change in the cell. Within a pathway, proteins can be activated, deactivated to various degrees, or completely inactivated. e.g. metabolioc, signalling, gene regulation pathways Reasonably well defined, knowledge on mechanisms incomplete, but it exists. Databases available.

Main Types of Pathways Metabolic pathways (biochemistry) A series of chemical reactions that converts one type of chemical to another. eg glycolysis converts glucose to pyruvate Signaling pathways (molecular biology) A series of binding events, sometimes involving chemical actions (eg phosphorylation), that relay a message. eg MAP-kinase cascade Gene regulation networks (molecular biology/genetics) A series of binding and chemical events that regulate gene expression. eg transcription factor binding, histone deacetylation

Databases abound Most trusted and most commonly used is KEGG Also well cited is the Reactome database BioCarta has nice diagrams There are others; some are species or disease specific (Pathguide has a list, may be slightly outdated)

Biological Networks A network forms when proteins from several pathways work together. Networks link many pathways into a model. e.g. A change in signaling leads to gene regulation which leads to a new (or more or less or somehow different) metabolic product. Delving into the unknown, data driven, “given me the results of these different analyses and let me see what comes up”.

Metabolic Networks – human (KEGG)

Networks Most work has been in protein-protein interactions. Inferring biological networks, connecting the dots from pathway to pathway, or uncovering links previously not found between protein subsets. Starting with pathway analysis of gene expression…

Pathway Analysis of Gene Expression Data = Gene Set Enrichment Analysis (GSEA) Computational methods of data mining the GO, KEGG, and other databases with gene lists that result from differential gene expression experiments. Given a gene list, how can we classify the genes within it how many are involved in metabolic or signaling pathways? how many are found in the nucleolus? GO which other genes do my genes interact with? are there any transcription factors in my list? what is the probability that this classification is correct? how do our results compare to the literature?

Given a gene list: Which groups of genes are over- or under- represented in my list of genes? Does my gene list contain a larger than by chance number of cancer-related genes? Does my gene list contain a larger than by chance number of transcription factors? Does my gene list contain a smaller than by chance number of genes that affect calcium signaling?

GSEA Need to start with a gene list, not single genes Main algorithm 1.) Ranks all genes N using a signal-to-noise metric, e.g. , where A and B are conditions (or a different measure of correlation). 2.) scans pathway S (with Ns genes) and lists against the top- and bottom- scorers in the gene set to determine over- or under- representation, adjusting the scores as it goes along.

GSEA Creates a running sum statistic calculating an enrichment score ES running through the ranked list of all genes: If gene j is not in set S (pathway of interest), then subtract from ES If gene j is in set S, then add , where The maximum sum over the whole list is the Enrichment Score ES

GSEA Estimates a P-value, then uses a correction for multiple testing (normalisation, then FDR) to obtain a q-value.

Free Tools for Pathway Analysis There are “hundreds” of tools and algorithms out there - For complicated investigations, may need a specialist to go through the options and work out a process best suited to your data. GSEA-P (Subramanian, 2005, http://www.ncbi.nlm.nih.gov/pubmed/17644558) Pathway Commons G:profiler DAVID Reactome GenSensor Suite (microarray only) Flink (NCBI)

Licensed Software for Pathway Analysis Inexpensive Tools Pathway Studio from Elsevier Pathway analysis and literature text mining algorithm More Expensive Tools MetaCore from GeneGo metabolic, signaling and gene regulation maps; can integrate your own data IPA from Ingenuity