Biostatistics: Methods and Applications


1 Biostatistics: Methods and Applications
Prof. Weidong Tian Tel: Office: 2320 East Guanghua Building

2 Applications
Microarray data analysis
RNA-seq data analysis
Gene set enrichment analysis

3 Bioconductor
Bioconductor is an open-source, open-development software project for the analysis of bioinformatic and genomic data. The project was started in the fall of 2001 and includes 24 core developers in the US, Europe, and Australia. Bioconductor provides:
- software, data, and documentation (vignettes);
- training materials from short courses;
- a mailing list.

4 Installation of Bioconductor
The latest instructions for installing Bioconductor packages are available on the Download page. To install Bioconductor packages, run the following commands from the R console:
source("…")                        # Sources the biocLite.R installation script.
biocLite()                          # Installs the default set of Bioconductor packages.
biocLite(c("made4", "Heatplus"))    # Installs additional packages from Bioconductor.
source("…")                        # Sources the getBioC.R installation script, which works the same way as biocLite.R but includes a larger list of default packages.
getBioC()                           # Installs the getBioC.R default set of Bioconductor packages.
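The biocLite() route reflects the Bioconductor of the time. Recent Bioconductor releases install packages through the BiocManager package from CRAN instead. A minimal sketch of that route; the helper name install_bioc is hypothetical and only wraps the two calls:

```r
# Hypothetical helper (not part of the slides): install Bioconductor
# packages through BiocManager, the replacement for biocLite().
install_bioc <- function(pkgs) {
  # BiocManager itself is an ordinary CRAN package.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Installs the requested packages for the Bioconductor release
  # that matches the running R version.
  BiocManager::install(pkgs)
}

# Example call (commented out so that nothing is installed by accident):
# install_bioc(c("made4", "Heatplus"))
```

Wrapping the calls in a function keeps the block safe to source; the actual installation only happens when the function is called.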

5 Bioconductor software packages
Software packages are subdivided into seven categories. Each contains a long list of contributed packages.

6 Bioconductor annotation packages
There are over 1,800 Bioconductor annotation packages. These packages provide annotation for the genes represented on microarrays.

7 Basic workflow for gene expression data analysis

8 An example for gene expression data analysis
In Mycobacterium tuberculosis, three sigma factor genes respond to heat shock (`sigB`, `sigE` and `sigH`). Two of them (`sigB` and `sigE`) also respond to SDS exposure. In this work, the authors characterize a `sigE` mutant of M. tuberculosis H37Rv. The `sigE` mutant strain was more sensitive than the wild-type strain to heat shock, SDS and various oxidative stresses. The corresponding dataset in the GEO database, GSE8664, contains three conditions and 15 arrays in total.

9 Bioconductor packages used in this analysis
GEOquery: bridge between GEO and Bioconductor
arrayQualityMetrics: reports for data in Bioconductor microarray data containers
impute: imputation for microarray data (currently KNN only)
limma: data analysis, linear models and differential expression for microarray data

10

11

12 Install Bioconductor packages
source("…cLite.R")
install.packages("XML")
biocLite("GEOquery")             ## needed to get the arrays from GEO
biocLite("arrayQualityMetrics")  ## needed for array quality analysis
biocLite("impute")               ## needed to fill in NA values
biocLite("limma")                ## needed for normalization

13 Load Bioconductor packages
library(GEOquery)
library(arrayQualityMetrics)
library(impute)
library(limma)

14 Step 1: Get expression data from GEO
gse <- getGEO("GSE8664", GSEMatrix=TRUE)[[1]]         ## get the Series Matrix directly from the GEO database
gse <- getGEO(filename="GSE8664_series_matrix.txt")   ## or get the Series Matrix from a locally saved file

15

16

17 Step 2: quality assessment of microarray data
In the fig directory,

18

19

20

21

22

23

24

25

26 Step 3: Get data matrix, select probesets for use
First, check the variables associated with gse.

27 Extract the expression data matrix from gse and select the probesets with gene annotation
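The code for this slide is not in the transcript. As a rough sketch, assuming gse is an ExpressionSet as returned by getGEO(), the Biobase accessors exprs() and fData() give the expression matrix and the probeset annotation. The example below uses small simulated stand-ins so it runs without GEO access; the annotation column name gene_symbol is hypothetical, since column names differ between GEO platforms.

```r
# Simulated stand-ins so the sketch runs without GEO access.
# With a real ExpressionSet one would use:
#   expr  <- Biobase::exprs(gse)   # probesets x arrays matrix
#   annot <- Biobase::fData(gse)   # per-probeset annotation table
set.seed(1)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("probe", 1:6), paste0("array", 1:4)))
annot <- data.frame(
  gene_symbol = c("sigE", "", "sigB", NA, "sigH", ""),  # hypothetical column name
  row.names   = rownames(expr),
  stringsAsFactors = FALSE
)

# Keep only probesets that carry a non-empty gene annotation.
has_gene      <- !is.na(annot$gene_symbol) & annot$gene_symbol != ""
expr_selected <- expr[has_gene, , drop = FALSE]

print(dim(expr_selected))
```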

28

29 Step 4: Fill in NA values and perform normalization

30

31 Use impute.knn() to fill in the NA values and normalizeBetweenArrays() to perform the normalization
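The slide's own code is not in the transcript, so the following is only a sketch of the two calls on simulated data. It assumes the impute and limma packages are installed and skips the step otherwise. The value k = 10 and the "quantile" method are assumptions, not settings taken from the slide.

```r
# Simulated log-expression matrix (200 probesets x 6 arrays) with some NAs.
set.seed(1)
expr <- matrix(rnorm(200 * 6, mean = 8), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("array", 1:6)))
expr[sample(length(expr), 20)] <- NA

have_pkgs <- requireNamespace("impute", quietly = TRUE) &&
  requireNamespace("limma", quietly = TRUE)

if (have_pkgs) {
  # impute.knn() returns a list; the completed matrix is in $data.
  filled <- impute::impute.knn(expr, k = 10)$data
  # Quantile normalization aligns the intensity distributions of the
  # arrays (the method is an assumption, not taken from the slide).
  normalized <- limma::normalizeBetweenArrays(filled, method = "quantile")
  print(apply(normalized, 2, median))
}
```

Comparing per-array boxplots of the matrix before and after normalization is a common way to check the effect.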

32

33 Before After

34 Step 5: Identify differentially expressed genes

35

36 Alternatively, use limma to identify differentially expressed genes
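A minimal sketch of a limma analysis on simulated data, since the slide's code is not in the transcript. It assumes limma is installed and, for brevity, compares only two groups (GSE8664 has three conditions). The group labels and the spiked-in genes are made up for illustration.

```r
set.seed(1)
# Simulated log-expression: 100 genes x 6 arrays, 3 wild-type and 3 mutant.
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("array", 1:6)))
group <- factor(rep(c("wt", "mut"), each = 3), levels = c("wt", "mut"))
expr[1:5, group == "mut"] <- expr[1:5, group == "mut"] + 4  # spiked-in genes

have_limma <- requireNamespace("limma", quietly = TRUE)

if (have_limma) {
  design <- model.matrix(~ group)        # intercept + mut-vs-wt effect
  fit    <- limma::lmFit(expr, design)   # one linear model per gene
  fit    <- limma::eBayes(fit)           # moderated t-statistics
  # Rank genes by evidence for the mut-vs-wt coefficient (BH-adjusted p-values).
  top <- limma::topTable(fit, coef = "groupmut", number = 10,
                         adjust.method = "BH")
  print(top)
}
```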

37

38

39 Step 6: Unsupervised sample clustering
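The slides for this step are not in the transcript. One common approach, shown here as a sketch on simulated data, is hierarchical clustering of the samples with a correlation-based distance. The distance measure, the linkage method and the simulated group structure are all assumptions.

```r
set.seed(1)
# Simulated data: 50 genes x 6 samples from two conditions.
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50),
                               c(paste0("wt", 1:3), paste0("mut", 1:3))))
expr[11:20, 1:3] <- expr[11:20, 1:3] + 5   # genes high in wild type
expr[1:10, 4:6]  <- expr[1:10, 4:6] + 5    # genes high in mutant

# Distance between samples: 1 - Pearson correlation of their profiles.
sample_dist <- as.dist(1 - cor(expr))
tree <- hclust(sample_dist, method = "average")

# Cut the dendrogram into two clusters and list the assignment per sample.
clusters <- cutree(tree, k = 2)
print(clusters)
```

Calling plot(tree) draws the dendrogram, which is the usual way to inspect such a clustering.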

40

41

42

43 Step 7: Supervised sample classification

44

45 RNA-seq

46

47

48

49

50

51

52

53

54

55 Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. Predefined gene sets include KEGG pathways, GO classifications, chromosome bands, and protein complexes (provided in the GSEABase, Category, GOstats and topGO packages). There are three basic methods for GSEA:
- Hypergeometric testing
- Simple GSEA using Z-scores and permutation
- GSEA using linear models

56 Install and load Bioconductor packages
source("…")
biocLite("GOstats")      ## Tools for manipulating GO and microarrays
biocLite("hgu95av2.db")  ## Affymetrix Human Genome U95 Set annotation data (chip hgu95av2)
biocLite("Biobase")      ## Base functions for Bioconductor
library(GOstats)
library(hgu95av2.db)
library(Biobase)

57 Hypergeometric testing
Basic concept: Suppose an urn holds N balls, of which n are white and m are black (N = n + m). If we draw k balls without replacement, how many black balls do we expect to get? What is the probability of getting exactly x black balls?
Hypergeometric testing checks for under- and over-representation of GO terms. It takes three inputs:
- The gene universe, N.
- GO categories (genes categorized by GO terms).
- A list of interesting genes (e.g., differentially expressed genes).
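The urn question can be answered directly with base R's hypergeometric functions. The sketch below uses made-up counts throughout; the variable names for the GO example (universe, in_term, interesting, overlap) are illustrative only.

```r
# Urn from the slide: N balls in total, m black and n white.
m <- 30    # black balls
n <- 70    # white balls
N <- m + n
k <- 20    # balls drawn without replacement

expected_black <- k * m / N            # mean of the hypergeometric distribution
# dhyper(x, m, n, k): probability of x balls of the counted colour (black here).
p_exactly_5    <- dhyper(5, m, n, k)

# The same probability written out with binomial coefficients.
p_by_hand <- choose(m, 5) * choose(n, k - 5) / choose(N, k)

# Over-representation test for one GO term, with made-up counts:
universe    <- 10000  # genes in the gene universe
in_term     <- 200    # universe genes annotated with the GO term
interesting <- 300    # interesting (e.g. differentially expressed) genes
overlap     <- 15     # interesting genes annotated with the term

# P(X >= overlap): chance of seeing at least this many annotated genes.
p_over <- phyper(overlap - 1, in_term, universe - in_term, interesting,
                 lower.tail = FALSE)
print(c(expected = expected_black, p5 = p_exactly_5, p_over = p_over))
```

For under-representation, the lower tail phyper(overlap, ...) is used instead.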

58

59

60 Problem with current gene set analysis tools
Pathway

61 Problem with current gene set analysis tools
Pathway PLD JAK Rac/cdc42 DAG STAT Ras-GTP IP3 ERK12 PKC JNKs Raf MEK4 ... Grb2 Shc

62 Problem with current gene set analysis tools
Pathway PLD DAG IP3 PKC Raf Grb2 Shc Ras-GTP ERK12 JNKs Rac/cdc42 JAK STAT MEK4 ...
If our goal in performing pathway analysis is to understand the underlying biological mechanism governing gene expression variation, is it OK to ignore the intrinsically different biological roles of individual genes in a pathway?

63 The non-equivalence of genes in pathway - p53 as an example
p53 is a tumor suppressor protein that in humans is encoded by the TP53 gene. p53 has been described as "the guardian of the genome", the "guardian angel gene", and the "master watchman", referring to its role in conserving stability by preventing genome mutation. (Wikipedia) p53 has been annotated to be involved in many pathways.

64 The non-equivalence of genes in pathway - p53 as an example
P53 HYPOXIA PATHWAY; P53 SIGNALING PATHWAY; CHEMICAL PATHWAY; G1 PATHWAY; ATM PATHWAY; THYROID CANCER; GLIOMA; MAPK SIGNALING PATHWAY; STABILIZATION OF P53; BASAL CELL CARCINOMA; MELANOMA; HUNTINGTONS DISEASE; CELL CYCLE CHECKPOINTS
Is p53 equally important in these pathways?

65 The non-equivalence of genes in pathway - p53 as an example

66 The non-equivalence of genes in pathway - p53 as an example

67 How to measure the non-equivalence of genes in pathway?
How to apply the gene non-equivalence for pathway analysis?

68 How to measure the non-equivalence of genes in a pathway
Our hypothesis: Genes playing core roles in a pathway are likely to have more functional associations with genes inside the pathway than with genes outside it. Genes playing marginal roles show the opposite pattern.

69 How to measure the non-equivalence of genes in a pathway

70 How to measure the non-equivalence of genes in a pathway
Random distribution
Xi: number of functional associations in the pathway
Mi: number of functional associations in the genome
K: number of genes in the pathway
N: number of genes in the genome
Expected association
Raw weight
Adjusted weight
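The formulas for the expected association and the two weights did not survive in the transcript. The sketch below only illustrates the idea under an assumed null model: if the Mi partners of gene i were a random draw from the other N - 1 genes, the number falling inside the pathway would follow a hypergeometric distribution. The observed-to-expected ratio is an illustrative stand-in and is not the raw or adjusted weight from the slide. All numbers are made up.

```r
# Toy numbers; all values are made up for illustration.
N <- 20000                                     # genes in the genome
K <- 50                                        # genes in the pathway
M <- c(geneA = 400, geneB = 400, geneC = 40)   # associations in the genome (Mi)
X <- c(geneA = 12,  geneB = 1,   geneC = 4)    # associations in the pathway (Xi)

# Null model (an assumption): the Mi partners of gene i are a random draw
# from the other N - 1 genes, K - 1 of which belong to the pathway.
expected <- M * (K - 1) / (N - 1)

# Illustrative stand-in for a weight: observed / expected associations.
# This is NOT the raw or adjusted weight formula from the slide.
ratio <- X / expected

# Probability of at least Xi in-pathway partners under the null model.
p_enriched <- phyper(X - 1, K - 1, N - K, M, lower.tail = FALSE)

print(round(cbind(expected, ratio, p_enriched), 4))
```

In this toy example geneA and geneB have the same number of partners genome-wide, but only geneA is strongly connected inside the pathway, which is the core versus marginal distinction in the hypothesis.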

71 How to define functional associations
Direct associations
- Protein-protein interactions (PPI)
- TF-DNA interactions
Indirect associations
- Functional similarity (co-existence in pathways)
- Co-expression
We use PPI, functional similarity and co-expression.

72 gNet: the sum of the three types of associations

73 The weighted P53 Hypoxia pathway

74 P53 weights differently in different pathways

75 How to apply gene weights for pathway analysis?

76 Gene Association Network-based Pathway Analysis (GANPA)
Expression data Gene statistic Gene statistic Pathway statistic Multiple pathway comparison

77 Gene Association Network-based Pathway Analysis (GANPA)
Expression data Gene statistic Pathway statistic Pathway statistic Multiple pathway comparison

78 Gene Association Network-based Pathway Analysis (GANPA)
Expression data Gene statistic Pathway statistic Multiple pathway comparison
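The flow from expression data to a pathway statistic can be sketched on simulated data. Several assumptions are made here: the gene statistic is a two-sample t-statistic, MeanAbs is the mean of |t| over the pathway genes, W-MeanAbs is a weighted mean of |t| using gene weights, and significance comes from permuting sample labels. The weighted form, the weights and the pathway are hypothetical and may differ from GANPA's exact definitions.

```r
set.seed(1)
n_genes <- 200
expr  <- matrix(rnorm(n_genes * 10), nrow = n_genes,
                dimnames = list(paste0("g", 1:n_genes), NULL))
group <- rep(c(0, 1), each = 5)
expr[1:5, group == 1] <- expr[1:5, group == 1] + 2      # 5 truly changed genes

pathway <- paste0("g", 1:20)                            # hypothetical pathway
weights <- setNames(rep(c(3, 1), c(5, 15)), pathway)    # hypothetical gene weights

# Gene statistic: Welch two-sample t-statistic per gene (rows of expr).
gene_stat <- function(expr, group) {
  apply(expr, 1, function(v) unname(t.test(v[group == 1], v[group == 0])$statistic))
}

# Pathway statistics: plain and weighted mean of |t| over the pathway genes.
mean_abs   <- function(t, genes) mean(abs(t[genes]))
w_mean_abs <- function(t, genes, w) sum(w[genes] * abs(t[genes])) / sum(w[genes])

t_obs <- gene_stat(expr[pathway, ], group)
obs   <- c(MeanAbs  = mean_abs(t_obs, pathway),
           WMeanAbs = w_mean_abs(t_obs, pathway, weights))

# Null distribution: recompute both statistics after permuting sample labels.
perm <- replicate(200, {
  t_perm <- gene_stat(expr[pathway, ], sample(group))
  c(mean_abs(t_perm, pathway), w_mean_abs(t_perm, pathway, weights))
})
p_values <- (rowSums(perm >= obs) + 1) / (ncol(perm) + 1)
print(rbind(statistic = obs, p_value = p_values))
```

Repeating this for every pathway and adjusting the p-values (for example with p.adjust) would correspond to the multiple pathway comparison step.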

79 Weighted vs. non-weighted pathway analysis
P53 dataset: cancer cell lines, P53 WT vs. P53 Mut
Asthma dataset: airway epithelial samples, 7 healthy children vs. 9 children with asthma
Breast cancer datasets: three datasets, normal tissues vs. cancer tissues

80 P53 dataset, FDR: 0.15
Apoptosis-related pathways
Cell cycle-related pathways
P53-related pathways

81 HSP27 pathway in p53 dataset

82 Asthma dataset, FDR: 0.05
MeanAbs W-MeanAbs

83 “Basigin Interactions” in asthma

84 “Pyruvate Metabolism” in asthma
Basigin group Pyruvate group

85 “VEGF Pathway” in asthma
Subunit-encoding genes and intra-protein associations cause weighting bias

86 The multi-subunit proteins in genome

87 A refined gene weighting strategy that considers multi-subunit proteins

88 “VEGF Pathway” in asthma
before refinement after refinement

89 Rank in W-MeanAbs with new weights
Asthma dataset, FDR: 0.05
MeanAbs W-MeanAbs

90 GANPA’s reproducibility
Another way to evaluate the accuracy of a method is to check whether its significant pathways are reproducible across datasets of the same study. We use 3 breast cancer datasets to evaluate MeanAbs and W-MeanAbs with the new weights. First, get the top pathways in each dataset (a series of cutoffs was tried); then identify those present in all 3 datasets.
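The counting procedure can be sketched as follows. The p-values are simulated, the cutoffs are arbitrary examples, and the helper n_consistent is a hypothetical name, so this only illustrates the bookkeeping and not the actual evaluation.

```r
set.seed(1)
pathways <- paste0("pathway", 1:100)

# Hypothetical p-values for the same 100 pathways in 3 datasets;
# the first 10 pathways are made consistently significant.
p_values <- replicate(3, {
  p <- runif(100)
  p[1:10] <- p[1:10] / 1000
  setNames(p, pathways)
}, simplify = FALSE)

# Number of pathways ranked within the top `cutoff` of every dataset.
n_consistent <- function(p_list, cutoff) {
  top_sets <- lapply(p_list, function(p) names(sort(p))[seq_len(cutoff)])
  length(Reduce(intersect, top_sets))
}

cutoffs    <- c(10, 20, 30, 50)
consistent <- sapply(cutoffs, function(k) n_consistent(p_values, k))
print(setNames(consistent, cutoffs))
```

Running the same count for the rankings of two methods, cutoff by cutoff, gives the kind of comparison described above.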

91 Number of pathways consistent across three datasets

92

93 Limitations of GANPA
No improvement for non-functional gene sets
May show no improvement for pathways with equally important genes
Functional association networks may not be available for some organisms

94

95

96 New development of GANPA
GO annotation-derived functional association network
Predicted GO annotations are included
More powerful than before
Readily applicable to any organism with GO annotations

97 Final exam Jan 7 12:00 Am (Friday midnight) - Jan 8 8:00 Am (Sunday morning)

