Function preserves sequences

Presentation transcript:

Function preserves sequences
Christophe Roos - MediCel ltd, christophe.roos@medicel.fi
Molecular evolution: mutations change sequences, function preserves sequences.
Part 3: sequence databases & comparisons
Similarity is the result of conservation or convergent evolution; it has a reason for being there.

The public biological databases
EMBL, GenBank or DDBJ for DNA (emblnew holds the daily updates and is merged into the main database four times a year).
SwissProt or PIR for proteins (also Trembl, tremblnew, remtrembl).
PDB for structures.
All are in flat-file format, yet quite informative and convertible.
Fasta format is a 'universal' sequence format: the first line starts with '>' followed by free text, and the sequence starts on the second line (50 or 60 characters per line). Use the first line for the name or the Accession Number (AC).
A wealth of databases exists; these are some of the most central ones.
Christophe Roos - 3/6 Sequence databases & comparison Spring 2002
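The Fasta layout described above (a '>' header line followed by sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not a full parser: the function name `parse_fasta` and the two demo records are made up for illustration, and real files may need extra handling (duplicate names, comment lines, very large inputs).

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: free text, typically a name or accession number.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            records[name].append(line)
    # Join the 50- or 60-character chunks back into one string per record.
    return {n: "".join(chunks) for n, chunks in records.items()}


# Hypothetical demo input, not taken from any database.
demo = ">seq1 demo record\nACGT\nTTGA\n>seq2\nMK\n"
sequences = parse_fasta(demo)
```

Keying the result on the whole header line keeps the sketch short; in practice one would usually split off the accession number as the key.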

Database homes
The European database home is in Hinxton, Cambridge, UK: European Bioinformatics Institute - EBI (http://www.ebi.ac.uk). Access is through the Sequence Retrieval System, SRS.
The American database home is in Bethesda, Maryland, near Washington DC: National Center for Biotechnology Information - NCBI (http://www.ncbi.nlm.nih.gov). Access is through Entrez.
Both centers exchange their data on a daily basis; however, there are differences in annotations, consistency, speed and quality. There is also a Japanese database provider, DDBJ.
Public databases, corporate databases, private databases, merged databases, primary databases, derived databases, databases with more errors or less accuracy... There is one for every taste. Go to EBI or NCBI to start with.

A look at one entry from EMBL (part 1/3)
From the EBI home page, take the SRS link under Databases or go directly to http://srs.ebi.ac.uk . Choose the database from the Top Page (in this case EMBL) and ask for a standard or extended query page (buttons at left). Use the field specifications and enter the query string. To find out which fields are available and what they mean, follow the links from the EBI main page: Databases > EMBL > Documentation > User manual.

A look at one entry from EMBL (part 2/3)
Each line type in a database entry is described in http://www.ebi.ac.uk/embl/Documentation/User_manual/format.html .
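To make the line-oriented layout concrete: in the EMBL flat-file format each line starts with a two-letter code (for example ID, AC, DE, FT, SQ), 'XX' lines act as spacers, sequence lines start with blanks, and '//' ends the entry. The sketch below groups the lines of one entry by their code. The function name `group_embl_lines` and the tiny entry are hypothetical, and the sketch ignores many details covered in the user manual.

```python
def group_embl_lines(entry_text):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = {}
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # '//' terminates the entry
        code = line[:2]
        data = line[2:].strip()
        if code == "XX" or not data:
            continue  # spacer or empty line
        # Sequence data lines have a blank line code.
        key = "sequence" if code == "  " else code
        fields.setdefault(key, []).append(data)
    return fields


# Hypothetical miniature entry, invented for illustration only.
demo_entry = (
    "ID   X1; demo\n"
    "AC   X1;\n"
    "XX\n"
    "DE   one\n"
    "DE   two\n"
    "SQ   Sequence 4 BP;\n"
    "     acgt 4\n"
    "//\n"
)
demo_fields = group_embl_lines(demo_entry)
```

Keeping each code's lines in a list preserves multi-line fields such as the description (DE) or the feature table (FT) in their original order.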

A look at one entry from EMBL (part 3/3)
The feature table of the entry contains several linked items, such as the exon assembly (mRNA) and the coding sequence (CDS). There are also cross-references to other databases.
Some degree of integration is available: some databases cross-refer to other databases. For example, the protein databases may refer to DNA databases, since the DNA holds the coding sequences.

A look at one entry from SwissProt
The eyeless gene: a master regulatory gene in eye formation
Use SRS to find the fruit fly protein for the eyeless gene. Search for 'Drosophila melanogaster' in the organism field and 'eyeless' in the description field, both selectable on the query page. Two entries are listed (April 2002): pax6_drome and O96791, the latter being only a fragment of the full protein.

The effect of the eyeless gene
The eyeless gene is a master regulatory gene in eye formation.
When it is absent, no eyes are formed.
When it is present where it should not be, it induces eye formation.
[Figure panels: Normal; Absent; Overexpressed in antennae and wings]
The eyeless gene has been conserved during evolution: it is used in the formation of eyes as varied as the insect faceted eye and the octopus, owl and human eye. It is responsible for turning on the eye development program. When it is mutated, no eyes are formed (in the human, no iris is formed). When its expression is artificially induced where it is normally not expressed, it induces eye-like development.

A look at one entry from SwissProt
Part 2: the annotations about the function and location

A look at one entry from SwissProt
Part 3: The feature table and the amino acid sequence

A look at one entry from SwissProt
The eyeless gene is also called PAX6 and can be found in several species: birds, mammals, reptiles, fish, invertebrates.
Use SRS to find the other PAX6 proteins. The SwissProt databases list them for chicken, quail, human, mouse, frog, fishes and fruit fly, so they can be expected to occur in many more species in real life. Extract the sequences by choosing 'Save'. Keep all sequences (except the first, which is only a fragment) for future use in sequence comparison.

Sequence comparison - Why?
Function by analogy: if sequences are conserved, their function is probably also conserved.
Functional domains: if some parts of the sequences are more conserved than others, there must be an underlying biological reason for it.
Establishing relationships/differences in function: by quantifying sequence relationships it is possible to estimate the function of novel genes.
Establishing relationships between species.
Pairwise sequence comparison is the primary means of linking biological function to a sequence and of propagating known information from one sequence to another. This applies to individual sequences as well as to whole genomes.

Sequence comparison – how?
Compare two sequences of similar length
Compare two sequences of very different length
Compare several sequences
Allow gaps or not?
Scoring: yes-no or good-intermediate-bad
The best or all above a threshold?
Knowledge-based single-sequence analysis for sequence characteristics, pairwise sequence comparisons, sequence-based searching and multiple sequence alignments: the criteria for comparison vary and the metrics must be formalised.
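One common way to formalise a pairwise comparison that allows gaps is global alignment by dynamic programming (the Needleman-Wunsch approach). The slides do not name a specific algorithm, so the sketch below is only an illustration of the idea; the function name and the score values (match +1, mismatch -1, gap -2 per position) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Two short demo sequences of different length.
demo_score = global_alignment_score("GACGGATTAG", "GATCGGAATAG")
```

Only the score is computed here; recovering the alignment itself would require keeping the full table and tracing back through it. A local comparison would use a different recurrence that never lets the score drop below zero.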

Sequence comparison – metrics
Example alignment (figure labels: gap, match, mismatch):
GA-CGGATTAG
GATCGGAATAG
The scoring matrix:
The score for a match
The penalty for a mismatch
The penalty for the insertion of a gap (gap-open)
The penalty for elongating a gap (gap-length)
Local or global similarities?
We must define what is biologically similar and what is not. It is also good to grade the scale into more than two values.
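As an illustration of how these four quantities combine, the sketch below scores an already aligned pair of sequences column by column. The numeric values (match +1, mismatch -1, gap-open -2, gap-extension -1) and the function name `score_alignment` are assumptions chosen for the example, not values given in the slides. Here the first position of a gap costs the gap-open penalty and every further position of the same gap costs the extension penalty; other conventions exist.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two aligned sequences of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    open_gap_row = None  # which row, if any, has a gap running right now
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening one.
            score += gap_extend if open_gap_row == row else gap_open
            open_gap_row = row
        else:
            score += match if x == y else mismatch
            open_gap_row = None
    return score


# The alignment shown on the slide: 9 matches, 1 mismatch, 1 gap of length 1.
slide_score = score_alignment("GA-CGGATTAG", "GATCGGAATAG")
```

With these assumed values the slide's alignment scores 9 - 1 - 2 = 6. A real protein comparison would replace the single match/mismatch pair with a substitution matrix that grades each amino acid pair, which is the "more than two values" scale mentioned above.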