Understanding protein lists from proteomics studies
Bing Zhang
Department of Biomedical Informatics, Vanderbilt University



A typical comparative shotgun proteomics study (Li et al., JPR, 2010)
[Figure: the study output is a list of IPI protein identifiers]
BCHM352, Spring 2011

Omics technologies generate gene/protein lists
Genomics
  - Genome Wide Association Study (GWAS)
Transcriptomics
  - mRNA profiling: microarrays, serial analysis of gene expression (SAGE), RNA-Seq
  - Protein-DNA interaction: chromatin immunoprecipitation
Proteomics
  - Protein profiling: LC-MS/MS
  - Protein-protein interaction: yeast two-hybrid, affinity pull-down/LC-MS/MS

Sample files
Sample files can be downloaded from (URL missing):
  - Significant proteins: hnscc_sig_proteins.txt
  - Significant proteins with log fold change: hnscc_sig_withLogRatio.txt
  - All proteins identified in the study: hnscc_all_proteins.txt

Understanding a protein list
Level I
  - What are the proteins/genes behind the IDs, and what do we know about their functions?
Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
Level III
  - How do the proteins work together to form a network?

Level one: information retrieval
Looking proteins up through a database query interface:
  - One protein at a time
  - Time consuming
  - Information is local and isolated
  - Hard to automate the information retrieval process

BioMart: a batch information retrieval system
  - BioMart is a query-oriented data management system.
  - Batch information retrieval for complex queries
  - Particularly suited for providing "data mining"-like searches of complex descriptive data, such as those related to genes and proteins
  - Open source and can be customized
  - Originally developed for the Ensembl genome databases
  - Adopted by many other projects, including UniProt, InterPro, Reactome, and the Pancreatic Expression Database (a complete list and access to the tools were linked on the original slide)

BioMart: basic concepts
A BioMart query has three parts: a dataset, one or more filters, and one or more attributes.

Example request, from Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”

As a BioMart query: from all human genes (dataset), select those with the listed UniProt IDs (filter), and retrieve their GO annotations (attribute).

Ensembl BioMart analysis
Choose dataset
  - Choose database: Ensembl Genes 61
  - Choose dataset: Homo sapiens genes (GRCh37)
Set filters
  - Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)
  - Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)
  - Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)
  - Region: filter for genes in a specific chromosome region (e.g. chr1 1: or 11q13)
  - Others
Select output attributes
  - Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name
  - External data: Gene Ontology, IDs in other databases
  - Expression: anatomical system, development stage, cell type, pathology
  - Protein domains: SMART, PFAM, InterPro, etc.
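The dataset / filter / attribute choices above can also be written down as a BioMart XML query, which is what a scripted (non-interactive) request submits. The following is a minimal sketch that only builds the query text and does not contact any server. The dataset name `hsapiens_gene_ensembl`, the filter name `uniprotswissprot`, and the attribute names are assumptions that vary between Ensembl releases, and the two UniProt accessions are illustrative only; check the names offered by the BioMart instance you use.

```python
from xml.etree import ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string.

    dataset    -- dataset name (assumed, e.g. "hsapiens_gene_ensembl")
    filters    -- dict mapping filter name -> list of values
    attributes -- list of attribute names to return as output columns
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "1",           # include a header row
        "uniqueRows": "1",       # collapse duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple values for one filter are passed as a comma-separated list.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical request: GO annotations for a list of UniProt accessions.
xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",                     # assumed dataset name
    filters={"uniprotswissprot": ["P04637", "P38398"]},  # assumed filter name
    attributes=["ensembl_gene_id", "go_id", "name_1006"],  # assumed attribute names
)
print(xml_query)
```

Keeping the query as data (a dict of filters and a list of attributes) makes it easy to rerun the same retrieval for a different protein list.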

Ensembl BioMart: query interface
[Screenshot labels: Choose dataset; Set filters; Select output attributes; Count; Results; Perl API; Help]

Ensembl BioMart: sample output
[Screenshot label: Export all results to a file]

Understanding a protein list
Level I
  - What are the proteins/genes behind the IDs, and what do we know about their functions?
Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
Level III
  - How do the proteins work together to form a network?

Re-organization of the information
[Figure labels: Differential genes; Functional category]

Enrichment analysis
Is a functional group (e.g. cell cycle) significantly associated with the experimental question?
  - All identified proteins (1733)
  - Filter for significant proteins: differentially expressed protein list (260 proteins), e.g. MMP9, SERPINF1, A2ML1, F2, FN1, LYZ, TNXB, FGG, MPO, FBLN1, THBS1, HDLBP, GSN, FBN1, CA2, P11, CCL21, FGB, ...
  - Functional group: extracellular space (83 proteins)
  - Compare the number of annotated proteins observed in the differential list with the number expected at random
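The "observed versus random" comparison can be made concrete with the counts on this slide. The sketch below assumes the 83 extracellular-space proteins are counted among the 1733 identified proteins; the slide does not give the observed overlap, so the ratio is left as a function of it.

```python
m = 1733  # all identified proteins (the reference set)
n = 260   # differentially expressed proteins
j = 83    # proteins annotated to "extracellular space" (assumed to be within the 1733)

# If the 260 proteins were drawn at random from the 1733, the expected number
# falling in the functional group is n * j / m.
expected = n * j / m


def enrichment_ratio(observed, expected):
    """Ratio of observed to expected overlap; values above 1 suggest enrichment."""
    return observed / expected


print(f"expected overlap by chance: {expected:.2f}")
```

An observed overlap well above the expected value (about 12 proteins under these assumptions) is what the significance test is then asked to quantify.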

Enrichment analysis: hypergeometric test

                          Significant proteins   Non-significant proteins   Total
  Proteins in the group   k                      j-k                        j
  Other proteins          n-k                    m-n-j+k                    m-j
  Total                   n                      m-n                        m

Hypergeometric test: given a total of m proteins, of which j are in the functional group, if we pick n proteins at random, what is the probability of having k or more proteins from the group?
Zhang et al. Nucleic Acids Res. 33:W741, 2005
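The question on this slide corresponds to the upper tail of the hypergeometric distribution. A minimal standard-library sketch, using the slide's symbols m, j, n, and k, is given below; it is an illustration of the test, not the code used by any particular tool. The value k = 30 in the usage line is hypothetical, since the observed overlap is not stated in the slides.

```python
from math import comb


def hypergeom_enrichment_p(m, j, n, k):
    """P(X >= k) when n proteins are drawn at random, without replacement,
    from m proteins of which j belong to the functional group.

    P(X = i) = C(j, i) * C(m - j, n - i) / C(m, n)
    """
    upper = min(n, j)  # the overlap cannot exceed either set size
    tail = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, upper + 1))
    return tail / comb(m, n)  # exact integer arithmetic until the final division


# Counts from the enrichment slide; k = 30 is a hypothetical observed overlap.
p = hypergeom_enrichment_p(m=1733, j=83, n=260, k=30)
print(f"P(X >= 30) = {p:.3g}")
```

Because one such test is run per functional group, the resulting p-values are normally adjusted for multiple testing before being interpreted.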

Commonly used functional groups
Gene Ontology
  - Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  - Three organizing principles: molecular function, biological process, and cellular component
Pathways
  - KEGG
  - Pathway Commons
  - WikiPathways
Cytogenetic bands

WebGestalt: Web-based Gene Set Analysis Toolkit
  - 8 organisms
  - 132 ID types
  - 73,986 functional groups
Zhang et al. Nucleic Acids Res. 33:W741, 2005
Duncan et al. BMC Bioinformatics. 11(Suppl 4):P10

WebGestalt analysis
1. Select the organism of interest.
2. Upload a gene/protein list in txt format, one ID per row. Optionally, a value can be provided for each ID; in that case, put the ID and its value in the same row, separated by a tab. Then pick the ID type that corresponds to the list of IDs.
3. Categorize the uploaded ID list based on GO Slim (a simplified version of Gene Ontology that focuses on high-level classifications).
4. Analyze the uploaded ID list for enrichment in various biological contexts. Select an appropriate predefined reference set or upload your own; if a customized reference set is uploaded, its ID type also needs to be selected. Then select the analysis parameters (e.g. significance level, multiple test adjustment method).
5. Retrieve the enrichment results by opening the respective results files. You may also open and/or download a TSV file, or download the zipped results to a directory on your desktop.
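The upload format described above (one ID per row, optionally followed by a tab and a value) is simple enough to check before submitting. The sketch below reads such a file into (ID, value) pairs; the function name and the placeholder IDs are hypothetical and not part of WebGestalt.

```python
import io


def parse_id_list(handle):
    """Read 'ID' or 'ID<TAB>value' rows; the value is None when absent."""
    entries = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank rows
        fields = line.split("\t")
        protein_id = fields[0].strip()
        has_value = len(fields) > 1 and fields[1].strip()
        value = float(fields[1]) if has_value else None
        entries.append((protein_id, value))
    return entries


# Placeholder IDs; a real file such as hnscc_sig_withLogRatio.txt would hold
# protein IDs with their log ratios.
example = io.StringIO("ID_A\t1.8\nID_B\t-2.3\nID_C\n")
for protein_id, value in parse_id_list(example):
    print(protein_id, value)
```

A row whose second column is not numeric raises a ValueError here, which is a cheap way to catch a stray header line or a space-separated file before upload.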

WebGestalt: ID mapping
Input list
  - 260 significant proteins identified in the HNSCC study (hnscc_sig_withLogRatio.txt)
Mapping result
  - Total number of user IDs: 260
  - User IDs unambiguously mapped to Entrez IDs: 229
  - Unique user Entrez IDs: 224
  - The enrichment analysis is based on the unique IDs.
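The three counts in the mapping result shrink for two different reasons: some user IDs have no single target ID, and several user IDs can map to the same Entrez ID. The toy sketch below shows one plausible way such counts arise; the IDs and the counting logic are assumptions for illustration and are not taken from WebGestalt.

```python
def summarize_mapping(user_ids, mapping):
    """Count total, unambiguously mapped, and unique target IDs.

    mapping -- dict: user ID -> set of target (e.g. Entrez) IDs
    """
    # Keep only user IDs that map to exactly one target ID.
    unambiguous = {
        uid: next(iter(mapping[uid]))
        for uid in user_ids
        if len(mapping.get(uid, ())) == 1
    }
    return {
        "total": len(user_ids),
        "unambiguously_mapped": len(unambiguous),
        # Different user IDs may share one target ID, so count distinct targets.
        "unique_targets": len(set(unambiguous.values())),
    }


# Toy data: P1 and P2 share a target, P4 is ambiguous, P5 is unmapped.
toy_ids = ["P1", "P2", "P3", "P4", "P5"]
toy_map = {"P1": {"101"}, "P2": {"101"}, "P3": {"102"}, "P4": {"103", "104"}}
print(summarize_mapping(toy_ids, toy_map))
```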

WebGestalt: GOSlim classification
[Figure panels: Biological process; Cellular component; Molecular function]

WebGestalt: top 10 enriched GO biological processes
Reference list: CSHL2010_hnscc_all_proteins.txt

WebGestalt: top 10 enriched WikiPathways

Understanding a protein list
Level I
  - What are the proteins/genes behind the IDs, and what do we know about their functions?
Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
Level III
  - How do the proteins work together to form a network?

Resources
  - GeneMANIA
  - STRING
  - Genes2Networks (URL truncated: enes2networks/)

Understanding a protein list: summary
Level I
  - What are the proteins/genes behind the IDs, and what do we know about their functions?
  - Tool: BioMart
Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
  - Tool: WebGestalt
  - Related tools: DAVID, GenMAPP
Level III
  - How do the proteins work together to form a network?
  - Tool: GeneMANIA
  - Related tools: Cytoscape, STRING, Genes2Networks, Ingenuity, Pathway Studio