1
Biostatistics: Methods and Applications
Prof. Weidong Tian Tel: Office: 2320 East Guanghua Building
2
Applications
- Microarray data analysis
- RNA-seq data analysis
- Gene set enrichment analysis
3
Bioconductor
Bioconductor is an open-source, open-development software project for the analysis of bioinformatic and genomic data. The project was started in the fall of 2001 and includes 24 core developers in the US, Europe, and Australia.
Bioconductor provides:
- software, data, and documentation (vignettes);
- training materials from short courses;
- a mailing list.
4
Installation of Bioconductor
The latest instructions for installing Bioconductor packages are available on the Download page. To install Bioconductor packages, run the following commands from the R console:
    source("…")                        # Sources the biocLite.R installation script.
    biocLite()                         # Installs the default set of Bioconductor packages.
    biocLite(c("made4", "Heatplus"))   # Installs additional packages from Bioconductor.
    source("…")                        # Sources the getBioC.R installation script, which works the same way as biocLite.R but includes a larger list of default packages.
    getBioC()                          # Installs the getBioC.R default set of Bioconductor packages.
Note that biocLite() is a legacy installer: since Bioconductor 3.8, packages are installed with BiocManager::install().
5
Bioconductor software packages
Software packages are subdivided into seven categories. Each contains a long list of contributed packages.
6
Bioconductor annotation packages
There are over 1,800 Bioconductor annotation packages. These packages provide annotation for the genes on microarrays.
7
Basic workflow for gene expression data analysis
8
An example for gene expression data analysis
In Mycobacterium tuberculosis, there are three sigma factor genes that respond to heat shock (`sigB`, `sigE` and `sigH`). Two of them (`sigB` and `sigE`) also respond to SDS exposure. In this work, the authors characterize a `sigE` mutant of M. tuberculosis H37Rv. The `sigE` mutant strain was more sensitive than the wild-type strain to heat shock, SDS and various oxidative stresses. The corresponding dataset in the GEO database, GSE8664, contains three conditions and 15 arrays in total.
9
Bioconductor packages used in this analysis
- GEOquery: bridge between GEO and Bioconductor
- arrayQualityMetrics: quality reports for data in Bioconductor microarray data containers
- impute: imputation for microarray data (currently KNN only)
- limma: data analysis, linear models and differential expression for microarray data
12
Install Bioconductor packages
    source("…cLite.R")
    install.packages("XML")
    biocLite("GEOquery")             ## needed to get arrays from GEO
    biocLite("arrayQualityMetrics")  ## needed for array quality analysis
    biocLite("impute")               ## needed to fill in NA values
    biocLite("limma")                ## needed for normalization
13
Load Bioconductor packages
    library(GEOquery)
    library(arrayQualityMetrics)
    library(impute)
    library(limma)
14
Step 1: Get expression data from GEO
    gse <- getGEO("GSE8664", GSEMatrix = TRUE)[[1]]        ## get the Series Matrix directly from the GEO database
    gse <- getGEO(filename = "GSE8664_series_matrix.txt")  ## or read the Series Matrix from a locally saved file
17
Step 2: quality assessment of microarray data
In the fig directory,
26
Step 3: Get data matrix, select probesets for use
First, check the variables associated with gse.
27
Extract the expression data matrix from gse and select the probesets that have a gene annotation.
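The code for this step is not part of the transcript. The sketch below shows one way it could look, using a small toy ExpressionSet in place of the real gse object. The annotation column name `ORF` and all toy values are assumptions made for illustration; the real platform annotation for GSE8664 may use a different column.

```r
# Toy stand-in for the GSE8664 ExpressionSet (hypothetical data and column names).
library(Biobase)

set.seed(1)
expr <- matrix(rnorm(12), nrow = 4,
               dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3)))
pheno <- data.frame(condition = c("wt", "wt", "mutant"),
                    row.names = colnames(expr))
# "ORF" is an assumed annotation column; two probes have no gene annotation.
feat <- data.frame(ORF = c("Rv0001", "", "Rv0003", NA),
                   row.names = rownames(expr), stringsAsFactors = FALSE)

eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pheno),
                      featureData = AnnotatedDataFrame(feat))

varLabels(eset)   # sample-level variables stored with the object
head(fData(eset)) # probe-level annotation

# Keep only probesets whose annotation is neither missing nor empty.
annotated <- !is.na(fData(eset)$ORF) & fData(eset)$ORF != ""
mat <- exprs(eset)[annotated, , drop = FALSE]
```

With the real data, `eset` would be replaced by the gse object returned by getGEO().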
29
Step 4: Fill in NA values and perform normalization
31
Use impute.knn() to fill in the NA values and normalizeBetweenArrays() to perform the normalization.
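A minimal sketch of this step on simulated data, assuming the expression matrix has probes in rows and arrays in columns. The toy matrix, the simulated shift and the choice of `method = "quantile"` are assumptions; the settings used on the original slide are not in the transcript.

```r
library(impute)  # impute.knn()
library(limma)   # normalizeBetweenArrays()

set.seed(1)
# Hypothetical matrix: 100 probes x 6 arrays, with an array-level shift.
x <- matrix(rnorm(600, mean = 8), nrow = 100,
            dimnames = list(paste0("probe", 1:100), paste0("array", 1:6)))
x[, 4:6] <- x[, 4:6] + 1
x[sample(length(x), 15)] <- NA   # simulate missing values

# impute.knn() returns a list; the completed matrix is in $data.
imputed <- impute.knn(x)$data

# Quantile normalization makes the value distributions comparable across arrays.
normalized <- normalizeBetweenArrays(imputed, method = "quantile")

round(apply(imputed, 2, median), 2)     # medians differ before normalization
round(apply(normalized, 2, median), 2)  # medians agree after normalization
```

Side-by-side boxplots of `imputed` and `normalized` give the usual before/after view of the arrays.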
33
Figure panels: before, after.
34
Step 5: Identify differentially expressed genes
36
Alternatively, use limma to identify differentially expressed genes
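The limma code from the slide is not in the transcript. As a rough sketch, a two-group comparison with limma usually follows the pattern below. The simulated matrix, the group labels `wildtype` and `mutant`, and the two-group design are assumptions; the real dataset has three conditions.

```r
library(limma)

set.seed(2)
# Hypothetical data: 200 genes x 8 arrays, the first 10 genes truly changed.
group <- factor(rep(c("wildtype", "mutant"), each = 4),
                levels = c("wildtype", "mutant"))
expr <- matrix(rnorm(200 * 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("array", 1:8)))
expr[1:10, group == "mutant"] <- expr[1:10, group == "mutant"] + 5

# One column per group, then a contrast for mutant versus wild type.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- lmFit(expr, design)
contrast <- makeContrasts(mutant - wildtype, levels = design)
fit2 <- eBayes(contrasts.fit(fit, contrast))

# Top-ranked genes with Benjamini-Hochberg adjusted p-values.
top <- topTable(fit2, number = 10, adjust.method = "BH")
top[, c("logFC", "P.Value", "adj.P.Val")]
```

For a design with three conditions, makeContrasts() would take one contrast per pairwise comparison of interest.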
39
Step 6: Unsupervised sample clustering
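The clustering code and figures for this step are not in the transcript. One common approach, shown here only as an illustrative sketch on simulated data, is hierarchical clustering of the arrays with a correlation-based distance, using base R. The data, the distance and the linkage method are all assumptions.

```r
set.seed(3)
# Hypothetical data: 50 genes x 6 arrays, two groups of three arrays.
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(NULL, paste0("array", 1:6)))
expr[, 1:3] <- expr[, 1:3] + rnorm(50, sd = 3)  # expression pattern shared by group 1
expr[, 4:6] <- expr[, 4:6] + rnorm(50, sd = 3)  # expression pattern shared by group 2

# Distance between arrays: 1 - Pearson correlation of their expression profiles.
d <- as.dist(1 - cor(expr))
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and inspect the assignment.
clusters <- cutree(hc, k = 2)
clusters
```

Calling plot(hc) draws the dendrogram for visual inspection.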
43
Step 7: Supervised sample classification
45
RNA-seq
55
Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. Predefined gene sets include KEGG pathways, GO classifications, chromosome bands, and protein complexes (supported by the GSEABase, Category, GOstats and topGO packages).
There are three basic methods for GSEA:
- Hypergeometric testing
- Simple GSEA using Z-scores and permutation
- GSEA using linear models
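To make the second method concrete, here is a minimal base R sketch of a Z-score GSEA with a permutation p-value. It assumes the commonly used aggregate z = sum(t) / sqrt(set size) over per-gene t-statistics and permutes the sample labels. The simulated data and the helper names `gene_t` and `set_z` are hypothetical; the slides do not show which exact variant was used.

```r
set.seed(4)
n_genes <- 200
labels <- rep(c(0, 1), each = 6)              # two biological states
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes)
gene_set <- 1:20                              # hypothetical gene set
expr[gene_set, labels == 1] <- expr[gene_set, labels == 1] + 1

# Per-gene Welch t-statistic (state 1 versus state 0).
gene_t <- function(expr, labels) {
  a <- expr[, labels == 0, drop = FALSE]
  b <- expr[, labels == 1, drop = FALSE]
  (rowMeans(b) - rowMeans(a)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
}

# Gene set score: sum of t-statistics scaled by the square root of the set size.
set_z <- function(t_stats, gene_set) {
  sum(t_stats[gene_set]) / sqrt(length(gene_set))
}

z_obs <- set_z(gene_t(expr, labels), gene_set)

# Null distribution: recompute the score after shuffling the sample labels.
z_perm <- replicate(500, set_z(gene_t(expr, sample(labels)), gene_set))
p_perm <- (sum(abs(z_perm) >= abs(z_obs)) + 1) / (length(z_perm) + 1)
```

Permuting sample labels rather than gene labels keeps the correlation between genes in the set intact, which is the usual reason for preferring it.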
56
Install and load Bioconductor packages
    source("…")
    biocLite("GOstats")      ## Tools for manipulating GO and microarrays
    biocLite("hgu95av2.db")  ## Affymetrix Human Genome U95 Set annotation data (chip hgu95av2)
    biocLite("Biobase")      ## Base functions for Bioconductor
    library(GOstats)
    library(hgu95av2.db)
    library(Biobase)
57
Hypergeometric testing
Basic concept: Suppose there are N balls in an urn, of which n are white and m are black. If we draw k balls from the urn without replacement, how many black balls do we expect to get? What is the probability of getting x black balls?
Hypergeometric testing is used to test for under- and over-representation of GO terms. It takes three inputs:
- The gene universe, N.
- GO categories (genes categorized by GO terms).
- A list of interesting genes (for example, differentially expressed genes).
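The urn questions can be answered directly with base R's hypergeometric functions. The numbers below (N = 20, m = 8, k = 5, x = 3) are made up for illustration. In the gene setting, N would be the size of the gene universe, m the number of genes annotated to a GO term, k the number of interesting genes, and x the number of interesting genes annotated to that term.

```r
N <- 20       # balls in the urn
m <- 8        # black balls
n <- N - m    # white balls
k <- 5        # balls drawn without replacement

# Expected number of black balls among the k drawn.
expected_black <- k * m / N

# Probability of drawing exactly x black balls, for x = 0..k.
x <- 0:k
prob <- dhyper(x, m, n, k)

# Over-representation p-value: P(X >= 3).
# phyper() returns P(X <= q), so use q = 3 - 1 with the upper tail.
p_over <- phyper(3 - 1, m, n, k, lower.tail = FALSE)

# Under-representation p-value: P(X <= 3).
p_under <- phyper(3, m, n, k)
```

Note the off-by-one in the upper tail: passing 3 instead of 3 - 1 would give P(X > 3), not P(X >= 3).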
Problem with current gene set analysis tools
Pathway: PLD, DAG, IP3, PKC, Raf, Grb2, Shc, Ras-GTP, ERK12, JNKs, Rac/cdc42, JAK, STAT, MEK4, ...
If our goal in performing pathway analysis is to understand the underlying biological mechanism governing gene expression variation, is it OK to ignore the intrinsically different biological roles of individual genes in a pathway?
63
The non-equivalence of genes in pathway - p53 as an example
p53 is a tumor suppressor protein that in humans is encoded by the TP53 gene. p53 has been described as "the guardian of the genome", the "guardian angel gene", and the "master watchman", referring to its role in conserving stability by preventing genome mutation. (Wikipedia) p53 has been annotated to be involved in many pathways.
64
The non-equivalence of genes in pathway - p53 as an example
P53 HYPOXIA PATHWAY; P53 SIGNALING PATHWAY; CHEMICAL PATHWAY; G1 PATHWAY; ATM PATHWAY; THYROID CANCER; …; GLIOMA; MAPK SIGNALING PATHWAY; STABILIZATION OF P53; BASAL CELL CARCINOMA; MELANOMA; HUNTINGTONS DISEASE; CELL CYCLE CHECKPOINTS
Is p53 equally important in these pathways?
67
How to measure the non-equivalence of genes in pathway?
How to apply the gene non-equivalence for pathway analysis?
68
How to measure the non-equivalence of genes in a pathway
Our hypothesis: Genes playing core roles in a pathway are likely to have more functional associations with genes inside the pathway than with genes outside it. Genes playing marginal roles in a pathway show the opposite pattern.
70
How to measure the non-equivalence of genes in a pathway
Random distribution
- Xi: number of functional associations in the pathway
- Mi: number of functional associations in the genome
- K: number of genes in the pathway
- N: number of genes in the genome
Expected association, raw weight, adjusted weight
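The formulas for the expected association and the two weights did not survive in the transcript, so the sketch below is not GANPA's weighting scheme. It only illustrates the random-distribution idea under one stated assumption: if gene i's Mi partners were spread at random over the other N - 1 genes, the number falling among the K - 1 other pathway genes would follow a hypergeometric distribution. The gene names, the counts and the observed/expected ratio are hypothetical.

```r
N <- 20000   # genes in the genome
K <- 50      # genes in the pathway

# Hypothetical counts: X = associations inside the pathway, M = in the genome.
genes <- data.frame(gene = c("core_gene", "marginal_gene"),
                    X = c(30, 2),
                    M = c(120, 150))

# Expected within-pathway associations if partners were placed at random.
genes$expected <- genes$M * (K - 1) / (N - 1)

# Illustrative observed/expected ratio (an assumption, not the slide's raw weight).
genes$ratio <- genes$X / genes$expected

# Probability of seeing at least X within-pathway associations by chance.
genes$p_enriched <- phyper(genes$X - 1, K - 1, N - K, genes$M, lower.tail = FALSE)

genes
```

Under this reading, a gene with many more within-pathway associations than expected behaves like a core gene in the sense of the hypothesis, and a gene close to the expectation like a marginal one.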
71
How to define functional associations
Direct associations
- Protein-protein interactions (PPI)
- TF-DNA interactions
Indirect associations
- Functional similarity (co-existence in pathways)
- Co-expression
We use PPI, functional similarity and co-expression.
72
gNet: the sum of the three types of associations
73
The weighted P53 Hypoxia pathway
74
P53 weights differently in different pathways
75
How to apply gene weights for pathway analysis?
78
Gene Association Network-based Pathway Analysis (GANPA)
Expression data → Gene statistic → Pathway statistic → Multiple pathway comparison
79
Weighted vs. non-weighted pathway analysis
Three groups of datasets:
- P53 dataset: cancer cell lines, P53 WT vs. P53 Mut
- Asthma dataset: airway epithelial samples, 7 healthy vs. 9 asthmatic children
- Breast cancer datasets (three datasets): normal tissues vs. cancer tissues
80
P53 dataset, FDR: 0.15
Apoptosis-related pathways; cell cycle-related pathways; P53-related pathways
81
HSP27 pathway in p53 dataset
82
Asthma dataset, FDR: 0.05. MeanAbs vs. W-MeanAbs
83
“Basigin Interactions” in asthma
84
“Pyruvate Metabolism” in asthma
Basigin group Pyruvate group
85
“VEGF Pathway” in asthma
Subunit-encoding genes and intra-protein associations cause weighting bias
86
The multi-subunit proteins in genome
87
A refined gene weighting strategy that takes multi-subunit proteins into account
88
“VEGF Pathway” in asthma
Figure panels: before refinement, after refinement.
89
Rank in W-MeanAbs with new weights
Asthma dataset, FDR: 0.05. Columns: MeanAbs, W-MeanAbs, rank in W-MeanAbs with new weights
90
GANPA’s reproducibility
Another way to evaluate the accuracy of a method is to see whether its significant pathways are reproducible across datasets from the same kind of study. We use 3 breast cancer datasets to evaluate MeanAbs and W-MeanAbs with the new weights. First, get the top pathways in each dataset (a series of cutoffs was tried); then identify those present in all 3 datasets.
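The counting procedure described above can be sketched in a few lines of base R. The pathway names, the rankings and the cutoffs below are made up for illustration, and `consistent_at` is a hypothetical helper; the real analysis would use the ranked pathway lists produced by each method on each breast cancer dataset.

```r
# Hypothetical pathway rankings (most significant first) in three datasets.
ranked <- list(
  dataset1 = c("A", "B", "C", "D", "E", "F"),
  dataset2 = c("B", "A", "E", "C", "F", "D"),
  dataset3 = c("A", "C", "B", "F", "D", "E")
)

# Pathways that appear among the top `cutoff` entries of every dataset.
consistent_at <- function(ranked, cutoff) {
  Reduce(intersect, lapply(ranked, head, n = cutoff))
}

# Try a series of cutoffs and count the consistent pathways at each one.
cutoffs <- c(2, 3, 4)
n_consistent <- sapply(cutoffs, function(k) length(consistent_at(ranked, k)))

data.frame(cutoff = cutoffs, n_consistent = n_consistent)
```

Running this once per method gives the per-cutoff counts that can then be compared between MeanAbs and W-MeanAbs.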
91
Number of pathways consistent across three datasets
93
Limitations of GANPA
- No improvement for non-functional gene sets
- May give no improvement for pathways with equally important genes
- Functional association networks may not be available for some organisms
96
New development of GANPA
- Functional association network derived from GO annotations
- Predicted GO annotations are included
- More powerful than before
- Readily applicable to any organism with GO annotations
97
Final exam: Jan 7, 12:00 AM (Friday midnight) to Jan 8, 8:00 AM (Sunday morning)