PROMPT: Protein Mapping and Comparison Tool, by Thorsten Schmidt and Dmitrij Frishman

Presentation transcript:

PROMPT: Protein Mapping and Comparison Tool
By Thorsten Schmidt and Dmitrij Frishman
Free for academic use. Website (Binary + Source)

Motivation

Past: sparse data available → single pairwise comparisons
Present and future: high-throughput technologies → weighing large protein datasets against each other
  - Differences between individuals
  - Differences between populations

Hundreds of questions:
  - Do Germans drive faster than Americans?
  - Is one gene group significantly enriched in certain functional categories?
  - Do GroEL-dependent proteins prefer certain structural folds?

Input

Supported formats (with the support marks as they appear in the transcript; their column alignment is not recoverable):
  - FASTA: xx
  - GenBank: xxx
  - EMBL: xx
  - Swiss-Prot: xxxx
  - UniProt XML: xxxx
  - Generic XML: xx

Table columns: folder with multiple files; file with a single (protein) entry; file with multiple (protein) entries; list of identifiers; analyse annotations

Additionally: the Generic XML input allows importing any numeric or nominal data.

[Diagram: PROMPT layers]

User input: Protein set A and Protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML), loaded as Dataset A and Dataset B
Input layer: user input, retrieval, parsing, caching
Processing layer: mapping, comparison, statistical testing
Presentation layer (results): view within PROMPT, figure plotting, export, spreadsheet import


Statistical tests

Help is available for each test and its parameters. Although you can apply any test manually, in most cases appropriate tests are performed automatically.

Built-in help

Case study: SCOP fold comparison
GroEL-dependent substrates vs. lysate

Background:
  - Around 200 proteins in E. coli depend on the GroEL chaperone for folding.
Question:
  - What distinguishes the GroEL-dependent proteins?
Data:
  - PEDANT genome from clu1.gsf.de: E. coli K12 (updated version)
  - Assignment threshold 1 E-4 for SCOP folds

Symbolic Frequency Comparison
(Symbolic), (Symbolic)

Fraction relative to the number of proteins with annotations in each set
P-value: * < 0.05 (the thresholds for ** and *** are missing in the transcript)
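The slide does not say which test produces the P-values for symbolic features. As a minimal sketch, assuming a one-sided hypergeometric (Fisher-type) enrichment test, the calculation for one category could look like the following. The function name and the counts in the usage line are hypothetical and are not taken from PROMPT.

```python
from math import comb


def enrichment_p_value(hits_a, size_a, hits_b, size_b):
    """One-sided hypergeometric tail probability.

    hits_a / size_a: annotated proteins in set A carrying the category,
    and the number of annotated proteins in set A. Likewise for set B.
    Returns P(X >= hits_a) when size_a proteins are drawn at random from
    the pooled sets. This is an assumed test, not PROMPT's documented one.
    """
    total = size_a + size_b
    total_hits = hits_a + hits_b
    upper = min(size_a, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, size_a - k)
        for k in range(hits_a, upper + 1)
    )
    return tail / comb(total, size_a)


# Hypothetical counts: 3 of 3 proteins in A carry the fold, 0 of 3 in B.
p = enrichment_p_value(3, 3, 0, 3)
significant = p < 0.05  # the only threshold that survives in the transcript
```

Fractions are taken relative to annotated proteins only, as the slide states, so proteins without any annotation should be excluded from `size_a` and `size_b` before calling the function.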

Case study: Comparison of pI distributions

Question:
  - Do the proteins of E. coli and H. pylori differ with respect to their isoelectric points?
Data:
  - Protein sequences of H. pylori and E. coli
  - The pI is calculated by PROMPT automatically (as are many other sequence-based properties)
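How PROMPT computes the pI is not described in the talk. One common approach, sketched here under stated assumptions, is to find the pH at which the net charge from the Henderson-Hasselbalch equation is zero, using bisection. The pKa table below is an assumed, commonly used set of values; other tables give slightly different results.

```python
# Assumed pKa values (not taken from PROMPT).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisection on pH in [0, 14]; net charge decreases as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0
```

Applying `isoelectric_point` to every sequence of both organisms yields the two numeric distributions that are then compared.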

Numeric Distribution Comparison
(Numeric), (Numeric)

Statistical tests:
  - Kolmogorov-Smirnov test
  - Mann-Whitney test
  - Chi-square test
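To make two of the listed tests concrete, here is a minimal sketch of their test statistics only (no p-values), written independently of PROMPT. The function names are illustrative.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    largest_gap = 0.0
    for value in xs + ys:
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        largest_gap = max(largest_gap, abs(cdf_x - cdf_y))
    return largest_gap


def mann_whitney_u(xs, ys):
    """Mann-Whitney U for the first sample: the number of (x, y) pairs
    with x > y, counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The Kolmogorov-Smirnov statistic is sensitive to any difference in the shape of the two distributions, whereas Mann-Whitney mainly detects a shift in location, which is one reason a tool might offer both.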

Case study: Protein length and hydrophobicity

Question:
  - Is there any relationship between protein length and hydrophobicity in membrane proteins?
Data:
  - Two multi-FASTA files with amino acid sequences:
      membrane.fasta contains all membrane* proteins of E. coli
      fullgenome.fasta contains all proteins of E. coli
    *) all proteins with more than 6 membrane-spanning regions predicted by TMHMM 2.0
  - The GRAVY (grand average of hydropathy) value and many other computable properties are calculated from the sequence by PROMPT automatically
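As an illustration of what "calculated from the sequence" involves, the sketch below reads a multi-FASTA file and computes length and GRAVY per protein. It uses the Kyte-Doolittle hydropathy scale, which is the usual basis for GRAVY; that PROMPT uses exactly this scale is an assumption, and the helper names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def read_fasta(lines):
    """Parse FASTA text lines into {identifier: sequence}.
    The identifier is the first word after '>'."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


def gravy(sequence):
    """Mean hydropathy over all residues that are on the scale."""
    values = [KYTE_DOOLITTLE[r] for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no scorable residues")
    return sum(values) / len(values)


def length_and_gravy(lines):
    """Return {identifier: (length, GRAVY)} for every FASTA record."""
    return {name: (len(seq), gravy(seq)) for name, seq in read_fasta(lines).items()}
```

With a file on disk, `length_and_gravy(open("membrane.fasta"))` would produce the paired values for each protein in the membrane set.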

Numeric Correlation
(Numeric x Numeric)

New research result: the longer membrane proteins are, the less hydrophobic they are.

[Figure: two scatter plots. Panel A: all E. coli proteins. Panel B: membrane proteins only. X-axis: protein length. Y-axis: hydrophobicity (GRAVY value). Pearson coefficient -0.69; p-value 2.8 E-54.]
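The Pearson coefficient quoted on the slide measures the linear relationship between paired values, here length and GRAVY of the same protein. A minimal standalone sketch of the calculation (not PROMPT's own code):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A value near -1, such as the -0.69 reported, indicates that hydrophobicity tends to fall as length grows; the accompanying p-value then states how unlikely such a coefficient would be without any true relationship.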

[Diagram: mapping workflow]

User input: Protein set A and Protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML), each given either as IDs + sequences or as IDs only
Retrieval: for IDs-only input, sequences are retrieved automatically (web services, DB query)
PROMPT: with IDs + sequences available for both A and B, compare A and B by BLAST and find equivalent sequences
Results: mapped identifiers between Set A and Set B (the slide's example uses ID1, ID2, ID3 and ID5, and marks entries with no equivalent)

Data Import and Mapping

BLAST parameter dialog

View Mapping Results

Mapping filtering

Choose the correct assignments in one of two ways:
  - Manually, e.g. using expert knowledge
  - With an automatic filter and user-specific parameters, e.g.
      Select SUBJECT_ID where IDENTITY>99 and MISMATCHES<5

Further manual processing, e.g. save GIs to a text file

Generic XML file: a symbolic property holds the mapping information
  VFDB1  GI_1234
  VFDB3  GI_3456
  …
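The automatic filter on the slide can be mirrored in a few lines. This is a sketch assuming each BLAST hit is available as a dictionary; SUBJECT_ID, IDENTITY and MISMATCHES come from the slide's filter expression, while the QUERY_ID field, the function name, the VFDB2 and GI_2345 identifiers and all numeric values are hypothetical.

```python
def filter_hits(hits, min_identity=99.0, max_mismatches=5):
    """Keep hits with IDENTITY > min_identity and MISMATCHES < max_mismatches.
    If several hits of one query pass, keep the one with the highest identity.
    Returns {query identifier: subject identifier}."""
    best = {}
    for hit in hits:
        if hit["IDENTITY"] > min_identity and hit["MISMATCHES"] < max_mismatches:
            current = best.get(hit["QUERY_ID"])
            if current is None or hit["IDENTITY"] > current["IDENTITY"]:
                best[hit["QUERY_ID"]] = hit
    return {query: hit["SUBJECT_ID"] for query, hit in best.items()}


# Hypothetical hit table; only the two passing pairs match the slide's example.
hits = [
    {"QUERY_ID": "VFDB1", "SUBJECT_ID": "GI_1234", "IDENTITY": 100.0, "MISMATCHES": 0},
    {"QUERY_ID": "VFDB2", "SUBJECT_ID": "GI_2345", "IDENTITY": 87.5, "MISMATCHES": 31},
    {"QUERY_ID": "VFDB3", "SUBJECT_ID": "GI_3456", "IDENTITY": 99.6, "MISMATCHES": 2},
]
mapping = filter_hits(hits)
```

Queries without a passing hit, like the hypothetical VFDB2, simply drop out of the mapping, which corresponds to the "no equivalent" case in the mapping results.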

Case studies summary

Example | Type of data used | PROMPT method
FunCat distribution in Human (*) | (Symbolic) | Symbolic feature frequencies
SCOP fold enrichment of GroEL-dependent substrates | (Symbolic), (Symbolic) | Symbolic feature comparison of two sets
Fold bias of virulence factor proteins (*) | (Symbolic) subset of (Symbolic) | Symbolic feature enrichment in subset vs. set
pI comparison of H. pylori and E. coli | (Numeric), (Numeric) | Numeric feature comparison
Protein length and hydrophobicity | (Numeric x Numeric) | Numeric feature correlation
Essentiality and protein abundance (*) | (Symbolic x Numeric) | Numeric distribution within categories

Note: "x" means corresponding data pairs, e.g. here two values describing the same protein.
(*) not shown in this talk

As the generic XML input allows the processing of any kind of nominal or numeric data, PROMPT can be applied to nearly any problem domain.

Scripting

Ways to run scripts:
  - Interactive console
  - Stream (e.g. from a pipeline)
  - File

Scripting commands:
  - BeanShell (simplified Java)
  - Or full Java code

Advantages:
  - Run Java code directly → no compilation necessary
  - All PROMPT classes are available from the scripts
  - "Classpath hell" is a thing of the past

Just call: ./prompt.sh Filename.java

Conclusions

PROMPT can map, compare and analyse protein sets:
  - Easy to use interactively
  - Large-scale batch processing
  - Automatic or manual testing for significance
  - Helps avoid reinventing the wheel
  - Graphical visualisations that highlight results
  - Generic → applicable even beyond bioinformatics

Dig our data gold mine efficiently

Acknowledgements

Dmitrij Frishman
Hans-Werner Mewes
All MIPSies and Lehrstuhl people for valuable discussions