Centre for Integrative Bioinformatics VU. Master Course Sequence Alignment, Lecture 11: Database searching issues (2)



Example

C; family: zinc finger -- CCHH-type
C; class: small
C; reordered by kitschorder 1.0a
C; reordered by kitschorder 1.0a
C; last update 7/9/98
>P1;1zaa1
structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:
RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK*
>P1;1zaa2
structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:
PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK*
>P1;1zaa3
structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:
PFACDI--CGRKFARSDERKRHT-KI-HLR--*
>P1;1ard
structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:
RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK*
>P1;1znf
structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:
YKCGL--CERSFVEKSALSRHQ-RV-HKN--*
>P1;2drp2
structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:
NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---*
>P1;3znf
structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:
RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-*
>P1;5znf
structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:
KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*

You can also look at superposed structures.

Sensitivity and Specificity – medical world

                 Disease present    Disease absent    Total
Test positive    TP = 9,990         FP = 990          TP+FP = 10,980
Test negative    FN = 10            TN = 989,010      FN+TN = 989,020
Total            10,000             990,000           1,000,000

TP = true positive, FP = false positive, FN = false negative, TN = true negative.

Positive predictive value = TP/(TP+FP) = 9,990/(9,990+990) = 91%
Negative predictive value = TN/(FN+TN) = 989,010/(10+989,010) = 99.999%
Sensitivity = TP/(TP+FN) = 9,990/(9,990+10) = 99.9%
Specificity = TN/(FP+TN) = 989,010/(990+989,010) = 99.9%
Pre-test probability = (TP+FN)/(TP+FP+FN+TN) = 10,000/1,000,000 = 1% (in this case, the prevalence)
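The quantities on this slide can be recomputed from the four cells of the table. The following is a minimal sketch; the function name `diagnostic_metrics` and the dictionary keys are illustrative choices, not part of the lecture.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return the standard test-performance measures from a 2x2 table."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (fp + tn),    # true negative rate
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (fn + tn),            # negative predictive value
        "prevalence": (tp + fn) / total,  # pre-test probability
    }


# Counts taken from the slide's example population of 1,000,000 people.
metrics = diagnostic_metrics(tp=9990, fp=990, fn=10, tn=989010)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

Although sensitivity and specificity are both 99.9%, the positive predictive value is only about 91% because the disease is rare (1% prevalence). The same effect matters in database searching, where true homologs are a small fraction of the database.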

Structure-based function prediction

SCOP is a protein structure classification database in which proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities.

SCOP hierarchy – the top level: 11 classes

[Figure panels: all-alpha protein, coiled-coil protein, all-beta protein, alpha-beta protein, membrane protein]

SCOP hierarchy – the second level: 800 folds

SCOP hierarchy – the third level: 1294 superfamilies

SCOP hierarchy – the fourth level: 2327 families

Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold:
– Proteins predicted to be in the same SCOP family are orthologous
– Proteins predicted to be in the same SCOP superfamily are homologous
– Proteins predicted to be in the same SCOP fold are structurally analogous

[Figure labels: folds, superfamilies, families]
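The rule on this slide can be written as a lookup on the deepest SCOP level that two proteins share. This is only a sketch of the slide's interpretation: the function names and the classification tuples below are hypothetical examples, and it assumes each protein has already been assigned a (class, fold, superfamily, family) label by some prediction method.

```python
SCOP_LEVELS = ("class", "fold", "superfamily", "family")

# Interpretation as stated on the slide, keyed by deepest shared level.
RELATIONSHIP = {
    "family": "orthologous",
    "superfamily": "homologous",
    "fold": "structurally analogous",
}


def deepest_shared_level(a, b):
    """Return the deepest SCOP level at which two classifications agree."""
    shared = None
    for level, x, y in zip(SCOP_LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared


def inferred_relationship(a, b):
    """Map the deepest shared level to the relationship named on the slide."""
    return RELATIONSHIP.get(deepest_shared_level(a, b), "no relationship inferred")


# Hypothetical classifications, written as (class, fold, superfamily, family).
p1 = ("a", "a.1", "a.1.1", "a.1.1.1")
p2 = ("a", "a.1", "a.1.1", "a.1.1.2")
p3 = ("a", "a.1", "a.1.2", "a.1.2.1")

print(inferred_relationship(p1, p1))  # same family
print(inferred_relationship(p1, p2))  # same superfamily, different family
print(inferred_relationship(p1, p3))  # same fold, different superfamily
```

Proteins that share only a class, or nothing at all, fall through to "no relationship inferred", since the slide makes no claim at that level.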

Note: the numbers do not add up in every profile column, because only a selection of the sequences in the MSA and of the amino acids represented in the profile is shown.
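A small sketch of why the displayed totals can fall short: in a full profile every column sums to the number of aligned sequences, but if only some residue types are displayed, the visible counts no longer add up. The toy alignment and the choice of displayed residues are hypothetical and are not the lecture's example.

```python
from collections import Counter

# Hypothetical toy alignment (4 sequences, 4 columns; '-' is a gap).
msa = ["ACD-", "ACE-", "GCDK", "ACDK"]


def column_counts(alignment):
    """Count every symbol (residues and gaps) in each alignment column."""
    return [Counter(column) for column in zip(*alignment)]


profile = column_counts(msa)

# Full profile: each column total equals the number of sequences.
totals_full = [sum(counts.values()) for counts in profile]

# Displayed subset: only residues A, C and D are shown in the profile.
shown = "ACD"
totals_shown = [sum(counts[r] for r in shown) for counts in profile]

print(totals_full)   # [4, 4, 4, 4]
print(totals_shown)  # [3, 4, 3, 0]
```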


Conserved hypotheticals

>P00001 Conserved hypothetical

A substantial fraction of the genes in sequenced genomes encode 'conserved hypothetical' proteins, i.e. proteins that are found in organisms from several phylogenetic lineages but have not been functionally characterized.