Biostatistics: Methods and Applications

Prof. Weidong Tian Tel: Office: 2320 East Guanghua Building

Applications
- Microarray data analysis
- RNA-seq data analysis
- Gene set enrichment analysis

Bioconductor
Bioconductor is an open-source, open-development software project for the analysis of bioinformatic and genomic data. The project was started in the fall of 2001 and includes 24 core developers in the US, Europe, and Australia. Bioconductor provides:
- software, data, and documentation (vignettes);
- training materials from short courses;
- a mailing list.

Installation of Bioconductor
The latest instructions for installing Bioconductor packages are available on the Download page. To install Bioconductor packages, execute the following commands from the R console:
source("
biocLite() # Installs the default set of Bioconductor packages.
biocLite(c("made4", "Heatplus")) # Installs additional packages from Bioconductor.
source(" # Sources the getBioC.R installation script, which works the same way as biocLite.R but includes a larger list of default packages.
getBioC() # Installs the getBioC.R default set of Bioconductor packages.
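In Bioconductor 3.8 and later, `biocLite()` has been replaced by the CRAN package BiocManager. A minimal sketch of the equivalent installation, where `missing_packages()` is a hypothetical helper introduced here only for illustration:

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("made4", "Heatplus")
to_install <- missing_packages(wanted)

# Installation needs network access, so it is only attempted interactively.
if (interactive() && length(to_install) > 0) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
}
```

Checking for missing packages first avoids reinstalling packages that are already present.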

Bioconductor software packages
Software packages are subdivided into seven categories. Each contains a long list of contributed packages.

Bioconductor annotation packages
There are over 1,800 Bioconductor annotation packages. These packages provide annotation for the genes represented on microarrays.

Basic workflow for gene expression data analysis

An example for gene expression data analysis
In Mycobacterium tuberculosis, three sigma factor genes respond to heat shock (`sigB`, `sigE` and `sigH`). Two of them (`sigB` and `sigE`) also respond to SDS exposure. In this work, the authors characterize a `sigE` mutant of M. tuberculosis H37Rv. The `sigE` mutant strain was more sensitive than the wild-type strain to heat shock, SDS and various oxidative stresses. The corresponding dataset in the GEO database, GSE8664, contains three conditions and 15 arrays in total.

Bioconductor packages used in this analysis
- GEOquery: bridge between GEO and Bioconductor
- arrayQualityMetrics: quality reports for data in Bioconductor microarray data containers
- impute: imputation for microarray data (currently KNN only)
- limma: data analysis, linear models and differential expression for microarray data

Install Bioconductor packages
source(" cLite.R")
install.packages("XML")
biocLite("GEOquery") ## needed to get the arrays from GEO
biocLite("arrayQualityMetrics") ## needed for array quality analysis
biocLite("impute") ## needed to fill in the NA values
biocLite("limma") ## needed for normalization

Load Bioconductor packages
library(GEOquery)
library(arrayQualityMetrics)
library(impute)
library(limma)

Step 1: Get expression data from GEO
gse <- getGEO("GSE8664", GSEMatrix=TRUE)[[1]] ## get the Series Matrix directly from the GEO database
gse <- getGEO(filename="GSE8664_series_matrix.txt") ## alternatively, read the Series Matrix from a locally saved file

Step 2: quality assessment of microarray data
In the fig directory,

Step 3: Get data matrix, select probesets for use
First, check the variables associated with gse.

Extract the expression data matrix from gse and select the probesets that have gene annotation.
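For an ExpressionSet such as gse, the Biobase accessors `exprs()` and `fData()` return the expression matrix and the probe annotation table. The sketch below shows the selection step on toy stand-ins for those two objects. The annotation column name `ORF`, the gene names and the helper `select_annotated()` are assumptions made for illustration, not taken from the GSE8664 platform.

```r
# Toy stand-ins for exprs(gse) and fData(gse).
expr <- matrix(c(1.2, NA, 0.4, 2.1, 0.9, 1.5, 0.3, 0.8),
               nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))
annot <- data.frame(ID = c("p1", "p2", "p3", "p4"),
                    ORF = c("geneA", "", NA, "geneD"),  # "ORF" is an assumed column name
                    stringsAsFactors = FALSE)

# Keep only probesets whose annotation field is present and non-empty,
# and label the retained rows by gene.
select_annotated <- function(expr, annot, column = "ORF") {
  gene <- annot[[column]][match(rownames(expr), annot$ID)]
  keep <- !is.na(gene) & nzchar(gene)
  out <- expr[keep, , drop = FALSE]
  rownames(out) <- gene[keep]
  out
}

expr_annotated <- select_annotated(expr, annot)
dim(expr_annotated)
```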

Step 4: Fill in NA values and perform normalization

Use impute.knn() to fill in the NA values and normalizeBetweenArrays() to normalize the arrays.
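A minimal sketch of this step on simulated data, assuming the impute and limma packages are installed. The wrapper `impute_and_normalize()`, the choice of `k = 10` and the quantile method are assumptions for illustration; the settings used on the slides are not stated.

```r
# Impute missing values with KNN, then quantile-normalize between arrays.
impute_and_normalize <- function(mat, k = 10) {
  imputed <- impute::impute.knn(mat, k = k)$data
  limma::normalizeBetweenArrays(imputed, method = "quantile")
}

# Simulated log-expression matrix (200 probes x 6 arrays) with 20 missing values.
set.seed(1)
mat <- matrix(rnorm(1200, mean = 8), nrow = 200,
              dimnames = list(paste0("probe", 1:200), paste0("array", 1:6)))
mat[sample(length(mat), 20)] <- NA

# Only run when both Bioconductor packages are available.
if (requireNamespace("impute", quietly = TRUE) &&
    requireNamespace("limma", quietly = TRUE)) {
  norm <- impute_and_normalize(mat)
  boxplot(mat, main = "Before")
  boxplot(norm, main = "After")
}
```

After quantile normalization the per-array boxplots should line up, which is what a before/after comparison is meant to show.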

Figure: data before and after normalization.

Step 5: Identify differentially expressed genes

Alternatively, use limma to identify differentially expressed genes
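A sketch of both routes on simulated two-group data (the real dataset has three conditions, so this is a simplification). The gene-by-gene t-test stands in for the first approach as an assumption, since the slides do not name the test used. The limma part runs only if the package is installed.

```r
# Simulated data: 100 genes x 6 arrays, first 5 genes shifted in the mutant.
set.seed(42)
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(paste0("g", 1:100), paste0("a", 1:6)))
group <- factor(rep(c("wt", "mut"), each = 3), levels = c("wt", "mut"))
expr[1:5, group == "mut"] <- expr[1:5, group == "mut"] + 4

# Gene-by-gene Welch t-test with Benjamini-Hochberg adjustment (base R).
p_t <- apply(expr, 1, function(x) {
  t.test(x[group == "mut"], x[group == "wt"])$p.value
})
fdr_t <- p.adjust(p_t, method = "BH")

# limma alternative: one linear model per gene with moderated t-statistics.
if (requireNamespace("limma", quietly = TRUE)) {
  design <- model.matrix(~ group)   # coefficient 2 = mut vs. wt
  fit <- limma::eBayes(limma::lmFit(expr, design))
  top <- limma::topTable(fit, coef = 2, number = 10, adjust.method = "BH")
  print(top)
}
```

With only a few arrays per group, the moderated statistics from limma borrow variance information across genes, which is the usual reason to prefer it over per-gene t-tests.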

Step 6: Unsupervised sample clustering
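A minimal base R sketch of unsupervised sample clustering on simulated data. Hierarchical clustering with a correlation-based distance and average linkage is one common choice and is assumed here; the slides do not say which method was used.

```r
set.seed(7)
# Simulated data: 50 genes x 6 samples, two sample groups (A and B)
# with different sets of up-regulated genes.
expr <- matrix(rnorm(300), nrow = 50,
               dimnames = list(NULL, paste0(rep(c("A", "B"), each = 3), 1:3)))
expr[26:50, 1:3] <- expr[26:50, 1:3] + 3
expr[1:25, 4:6] <- expr[1:25, 4:6] + 3

# Distance between samples: 1 - Pearson correlation of expression profiles.
d <- as.dist(1 - cor(expr))
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
plot(hc, main = "Sample clustering")
```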

Step 7: Supervised sample classification
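A minimal base R sketch of supervised classification on simulated data. A nearest-centroid classifier with leave-one-out cross-validation is used here as a simple stand-in; it is an assumption, not the classifier from the slides, and `nearest_centroid()` is a helper written for this example.

```r
set.seed(11)
# Simulated data: 40 genes x 10 samples, two known classes.
labels <- factor(rep(c("wt", "mut"), each = 5))
expr <- matrix(rnorm(400), nrow = 40)
expr[1:10, labels == "mut"] <- expr[1:10, labels == "mut"] + 3

# Assign a sample to the class whose mean profile (centroid) is closest.
nearest_centroid <- function(train, train_labels, sample) {
  classes <- levels(train_labels)
  dists <- vapply(classes, function(cl) {
    centroid <- rowMeans(train[, train_labels == cl, drop = FALSE])
    sqrt(sum((sample - centroid)^2))
  }, numeric(1))
  classes[which.min(dists)]
}

# Leave-one-out cross-validation: hold out each sample in turn.
predicted <- vapply(seq_len(ncol(expr)), function(i) {
  nearest_centroid(expr[, -i, drop = FALSE], labels[-i], expr[, i])
}, character(1))

accuracy <- mean(predicted == as.character(labels))
table(predicted, truth = labels)
```

Holding each sample out before computing the centroids keeps the accuracy estimate from being inflated by testing on training data.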

RNA-seq

Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. Predefined gene sets include KEGG pathways, GO classifications, chromosome bands, and protein complexes (support is provided in the GSEABase, Category, GOstats and topGO packages). There are three basic methods for gene set enrichment analysis:
- Hypergeometric testing
- Simple GSEA using Z-score and permutation
- GSEA using linear models
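A sketch of the Z-score approach with a permutation test, on simulated data. The gene-set statistic used here, the sum of per-gene t-statistics divided by the square root of the set size, is one common definition and is assumed rather than taken from the slides. `row_t()` and `set_z()` are helpers written for this example.

```r
set.seed(3)
# Simulated data: 200 genes x 12 samples; genes 1-20 form the gene set
# and are shifted upward in the second group.
group <- rep(c(0, 1), each = 6)
expr <- matrix(rnorm(200 * 12), nrow = 200)
gene_set <- 1:20
expr[gene_set, group == 1] <- expr[gene_set, group == 1] + 1

# Per-gene two-sample t-statistics (pooled variance), computed row-wise.
row_t <- function(x, g) {
  a <- x[, g == 1, drop = FALSE]
  b <- x[, g == 0, drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  sp2 <- ((na - 1) * apply(a, 1, var) + (nb - 1) * apply(b, 1, var)) /
    (na + nb - 2)
  (rowMeans(a) - rowMeans(b)) / sqrt(sp2 * (1 / na + 1 / nb))
}

# Gene-set Z-score: sum of the t-statistics divided by sqrt(set size).
set_z <- function(x, g, set) sum(row_t(x, g)[set]) / sqrt(length(set))

z_obs <- set_z(expr, group, gene_set)

# Permutation null: shuffle the sample labels and recompute the Z-score.
z_perm <- replicate(500, set_z(expr, sample(group), gene_set))
p_perm <- (sum(abs(z_perm) >= abs(z_obs)) + 1) / (length(z_perm) + 1)
```

Permuting sample labels rather than genes preserves the correlation between genes in the set, which a plain normal approximation for the Z-score ignores.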

Install and load Bioconductor packages
source("
biocLite("GOstats") ## Tools for manipulating GO and microarrays
biocLite("hgu95av2.db") ## Affymetrix Human Genome U95 Set annotation data (chip hgu95av2)
biocLite("Biobase") ## Base functions for Bioconductor
library(GOstats)
library(hgu95av2.db)
library(Biobase)

Hypergeometric testing
Basic concept: Suppose there are N balls in an urn, of which n are white and m are black. Drawing k balls out of the urn without replacement, how many black balls do we expect to get? What is the probability of getting x black balls?
Hypergeometric testing is used to test for under- and over-representation of GO terms. It takes three inputs:
- Gene universe, N.
- GO categories (genes categorized by GO terms).
- A list of interesting genes (e.g. differentially expressed genes).
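The urn questions can be answered directly with the hypergeometric functions in base R. The counts below are hypothetical and chosen only to illustrate the calculation for a single GO term.

```r
# Hypothetical counts (not taken from any dataset):
N <- 10000   # genes in the universe
m <- 200     # genes annotated to the GO term ("black balls")
k <- 300     # interesting (differentially expressed) genes drawn
x <- 15      # interesting genes that carry the GO term

# Expected number of annotated genes among the k drawn.
expected <- k * m / N

# Probability of exactly x annotated genes.
p_exact <- dhyper(x, m, N - m, k)

# Over-representation: P(X >= x); under-representation: P(X <= x).
p_over  <- phyper(x - 1, m, N - m, k, lower.tail = FALSE)
p_under <- phyper(x, m, N - m, k)
```

Note the `x - 1` in the upper-tail call: `phyper(..., lower.tail = FALSE)` returns P(X > q), so the observed count itself has to be included explicitly.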

Problem with current gene set analysis tools
Pathway members: PLD, JAK, Rac/cdc42, DAG, STAT, Ras-GTP, IP3, ERK12, PKC, JNKs, Raf, MEK4, ..., Grb2, Shc
If our goal in performing pathway analysis is to understand the underlying biological mechanism governing gene expression variation, is it OK to ignore the intrinsically different biological roles of individual genes in a pathway?

The non-equivalence of genes in pathway - p53 as an example
p53 is a tumor suppressor protein that in humans is encoded by the TP53 gene. p53 has been described as "the guardian of the genome", the "guardian angel gene", and the "master watchman", referring to its role in conserving stability by preventing genome mutation. (Wikipedia) p53 has been annotated to be involved in many pathways.

Pathways annotated with p53:
- P53 HYPOXIA PATHWAY
- P53 SIGNALING PATHWAY
- CHEMICAL PATHWAY
- G1 PATHWAY
- ATM PATHWAY
- THYROID CANCER
- …
- GLIOMA
- MAPK SIGNALING PATHWAY
- STABILIZATION OF P53
- BASAL CELL CARCINOMA
- MELANOMA
- HUNTINGTONS DISEASE
- CELL CYCLE CHECKPOINTS
Is p53 equally important in these pathways?

How to measure the non-equivalence of genes in pathway?
How to apply the gene non-equivalence for pathway analysis?

How to measure the non-equivalence of genes in a pathway
Our hypothesis: Genes playing core roles in a pathway are likely to have more functional associations with genes inside the pathway than with genes outside the pathway. Genes playing marginal roles in a pathway show the opposite pattern.

Random distribution
- Xi: number of functional associations in the pathway
- Mi: number of functional associations in the genome
- K: number of genes in the pathway
- N: number of genes in the genome
Derived quantities: expected association, raw weight, adjusted weight
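One way to read the random-distribution idea, offered as an assumption rather than as the authors' definition: if the Mi associations of gene i fell at random among the other N - 1 genes, the number landing on the other K - 1 pathway genes would follow a hypergeometric distribution. The sketch below computes the expected count under that model and a simple observed/expected ratio as a candidate raw weight. The function names and the ratio itself are hypothetical, and the adjusted weight is not attempted because its definition is not given.

```r
# Assumed model: associations distributed at random over the other N - 1 genes.
expected_assoc <- function(M, K, N) M * (K - 1) / (N - 1)

# Assumed raw weight: observed associations in the pathway over expected.
raw_weight <- function(X, M, K, N) X / expected_assoc(M, K, N)

# Chance of seeing at least X in-pathway associations under the random model.
p_enriched <- function(X, M, K, N) {
  phyper(X - 1, K - 1, N - K, M, lower.tail = FALSE)
}

# Toy example: a gene with 100 associations genome-wide, 5 of them
# inside a 51-gene pathway, in a genome of 20001 genes.
expected_assoc(M = 100, K = 51, N = 20001)
raw_weight(X = 5, M = 100, K = 51, N = 20001)
p_enriched(X = 5, M = 100, K = 51, N = 20001)
```

Under this reading, a core gene has many more in-pathway associations than expected and gets a ratio well above 1, while a marginal gene gets a ratio near or below 1.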

How to define functional associations
Direct associations
- Protein-protein interactions (PPI)
- TF-DNA interactions
Indirect associations
- Functional similarity (co-existence in pathways)
- Co-expression
We use PPI, functional similarity and co-expression.

gNet: the sum of the three types of associations

The weighted P53 Hypoxia pathway

P53 weights differently in different pathways

How to apply gene weights for pathway analysis?

Gene Association Network-based Pathway Analysis (GANPA)
Expression data -> Gene statistic -> Pathway statistic -> Multiple pathway comparison
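A sketch of how gene weights could enter the pathway statistic. It assumes that MeanAbs is the mean of the absolute gene statistics in a pathway and that W-MeanAbs is the corresponding weighted mean; both definitions are inferred from the names and are not confirmed by the slides. The statistics and weights below are made up.

```r
# Assumed definitions, inferred from the method names.
mean_abs <- function(stat) mean(abs(stat))
w_mean_abs <- function(stat, w) sum(w * abs(stat)) / sum(w)

# Hypothetical gene statistics (e.g. t-statistics) for one pathway,
# with hypothetical gene weights; the first two genes play core roles.
stat <- c(3.0, -2.5, 0.2, -0.1)
w <- c(4, 3, 1, 1)

mean_abs(stat)
w_mean_abs(stat, w)
```

With equal weights the two statistics coincide, so any difference between them comes from where in the pathway the expression changes occur.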

Weighted vs. non-weighted pathway analysis
Three datasets:
- P53 dataset: cancer cell lines, P53 WT vs. P53 Mut
- Asthma dataset: airway epithelial samples, 7 healthy vs. 9 asthma children
- Breast cancer datasets: normal tissues vs. cancer tissues

P53 dataset, FDR: 0.15
- Apoptosis-related pathways
- Cell cycle-related pathways
- P53-related pathways

HSP27 pathway in p53 dataset

Asthma dataset, FDR: 0.05 (MeanAbs vs. W-MeanAbs)

“Basigin Interactions” in asthma

“Pyruvate Metabolism” in asthma
Basigin group Pyruvate group

“VEGF Pathway” in asthma
Subunit-encoding genes and intra-protein associations cause weighting bias

The multi-subunit proteins in genome

A refined gene weighting strategy that considers multi-subunit proteins

“VEGF Pathway” in asthma
before refinement after refinement

Rank in W-MeanAbs with new weights
Asthma dataset, FDR: 0.05 (MeanAbs, W-MeanAbs, and rank in W-MeanAbs with new weights)

GANPA’s reproducibility
Another way to evaluate the accuracy of a method is to see whether its significant pathways are reproducible across datasets of the same study. We use 3 breast cancer datasets to evaluate MeanAbs and W-MeanAbs with new weights. First, get the top pathways in each dataset (a series of cutoffs was tried); then identify those present in all 3 datasets.
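The counting procedure can be sketched as follows. The ranked pathway lists are hypothetical placeholders, and `consistent_top()` is a helper written for this example.

```r
# Hypothetical ranked pathway lists (most significant first), one per dataset.
ranked <- list(
  dataset1 = c("P1", "P2", "P3", "P4", "P5", "P6"),
  dataset2 = c("P2", "P1", "P5", "P3", "P6", "P4"),
  dataset3 = c("P3", "P2", "P6", "P1", "P4", "P5")
)

# Pathways that appear among the top `cutoff` entries of every dataset.
consistent_top <- function(ranked, cutoff) {
  Reduce(intersect, lapply(ranked, head, n = cutoff))
}

# Number of consistent pathways over a series of cutoffs.
cutoffs <- 1:6
n_consistent <- vapply(cutoffs, function(k) {
  length(consistent_top(ranked, k))
}, integer(1))
```

Running the same count for each method and comparing the resulting curves shows which method yields more pathways shared by all datasets at a given cutoff.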

Number of pathways consistent across three datasets

Limitations of GANPA
- No improvement for non-functional gene sets
- May give no improvement for pathways with equally important genes
- Functional association networks may not be available for some organisms

New development of GANPA
- GO annotation-derived functional association network
- Predicted GO annotations are included
- More powerful than before
- Readily applicable to any organism with GO annotations

Final exam: Jan 7, 12:00 AM (Friday midnight) to Jan 8, 8:00 AM (Sunday morning)
