Sequence and structure databanks can be divided into many different categories. One of the most important distinctions is whether a databank is supervised by a gatekeeper.

Presentation transcript:

Sequence and structure databanks can be divided into many different categories. One of the most important distinctions is between the following two.

Supervised databanks with a gatekeeper
Examples: Swiss-Prot, RefSeq (at NCBI)
Entries are checked for accuracy.
+ more reliable annotations
-- frequently out of date

Repositories without a gatekeeper
Examples: GenBank, EMBL, TrEMBL
Everything is accepted.
+ everything is available
-- many duplicates
-- poor reliability of annotations

Description of Group B Streptococcus Pan-genome
Genome comparisons of 8 closely related GBS strains
Tettelin, Fraser et al., PNAS 2005 Sep 27;102(39)

Method

Bacterial Core
Genes that are shared among all Bacteria
Bit score cutoff: 50.0 (~10E-4)

$$f(x) = A_1 e^{-k_1 x} + A_2 e^{-k_2 x} + A_3 e^{-k_3 x} + \text{Plateau}$$
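As a minimal sketch of how "genes shared among all Bacteria" could be counted with the bit score cutoff of 50.0, the Python below intersects the gene families that have a qualifying hit in every genome. The data layout (best bit score per family per genome), the function names, and the toy values are assumptions for illustration, not the procedure used for the slides.

```python
from typing import Dict, Set

BIT_SCORE_CUTOFF = 50.0  # cutoff quoted on the slide


def families_with_hit(hits: Dict[str, float],
                      cutoff: float = BIT_SCORE_CUTOFF) -> Set[str]:
    """Return the gene families whose best hit in one genome reaches the cutoff."""
    return {family for family, score in hits.items() if score >= cutoff}


def core_families(best_hits_per_genome: Dict[str, Dict[str, float]]) -> Set[str]:
    """Return the families that have a qualifying hit in every genome."""
    per_genome = [families_with_hit(h) for h in best_hits_per_genome.values()]
    if not per_genome:
        return set()
    return set.intersection(*per_genome)


# Hypothetical toy data: best bit score of each family in each genome.
toy_hits = {
    "genome_A": {"famA": 210.0, "famB": 75.5, "famC": 48.0},
    "genome_B": {"famA": 190.0, "famB": 51.0},
    "genome_C": {"famA": 300.0, "famB": 62.0, "famC": 95.0},
}

print(sorted(core_families(toy_hits)))  # famC fails: below cutoff in A, absent in B
```

Repeating this while genomes are added one at a time would give the kind of shrinking core-size curve that the decay function is meant to describe.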

Genes without homologs

$$f(x) = A_1 e^{-k_1 x} + A_2 e^{-k_2 x} + A_3 e^{-k_3 x} + A_4 e^{-k_4 x} + A_5 e^{-k_5 x} + \text{Plateau}$$
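The sum-of-exponentials form above can be evaluated with a few lines of Python. This is only a sketch: the function name is made up here and the parameter values in the usage line are arbitrary placeholders, not the fitted values from the slides.

```python
import math
from typing import Sequence


def multi_exp_decay(x: float,
                    amplitudes: Sequence[float],
                    rates: Sequence[float],
                    plateau: float) -> float:
    """Evaluate f(x) = sum_i A_i * exp(-k_i * x) + plateau."""
    if len(amplitudes) != len(rates):
        raise ValueError("need one rate constant per amplitude")
    return plateau + sum(a * math.exp(-k * x) for a, k in zip(amplitudes, rates))


# Placeholder parameters (hypothetical): two components and a plateau of 3.
print(multi_exp_decay(0.0, [1.0, 2.0], [0.5, 0.1], 3.0))  # at x = 0: A_1 + A_2 + plateau
```

For large x every exponential term vanishes, so the function approaches the plateau; that is the quantity the slides read as the number of new genes still expected per additional genome.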

Decomposed function

Core
Essential genes (replication, energy, homeostasis)
~116 gene families

Extended Core
Set of genes that define groups or species (symbiosis, photosynthesis)
~17,060 gene families

Accessory Pool
Genes that can be used to distinguish strains or serotypes (mostly genes of unknown function)
~114,800 gene families uncovered so far

[Figure: gene frequency in individual genomes, by category (Core, Extended Core, Accessory Pool); values shown: 76.6%, 3.8%, 19.6%]

$$f(x) = A_1 e^{-k_1 x} + A_2 e^{-k_2 x} + A_3 e^{-k_3 x} + A_4 e^{-k_4 x} + A_5 e^{-k_5 x} + \text{Plateau}$$

x: number of genomes added
Fitted parameters: 1/k_1 = …, 1/k_2 = 2.3, 1/k_3 = …, 1/k_4 = …, 1/k_5 = …; A_1 = …, A_2 = …, A_3 = …, A_4 = …, A_5 = …

Kézdy-Swinbourne Plot
[Figure: novel genes after looking in x + ∆x genomes plotted against novel genes after looking in x genomes; ~230 novel genes per genome]

A Kézdy-Swinbourne plot can be used to estimate the value that a decay function approaches as time (here x) goes to infinity.

Assume the simple decay function $f(x) = K + A e^{-kx}$; then $f(x+\Delta x) = K + A e^{-k(x+\Delta x)}$. Eliminating $A$ gives

$$f(x+\Delta x) = e^{-k\Delta x} f(x) + K', \qquad K' = K\,(1 - e^{-k\Delta x}).$$

In a plot of $f(x+\Delta x)$ against $f(x)$, the points fall on a line with slope $e^{-k\Delta x}$. For $x \to \infty$, both $f(x)$ and $f(x+\Delta x)$ approach the same constant: $f(x) \to K$, $f(x+\Delta x) \to K$ (see the definition of the decay function), so the line meets the diagonal $f(x+\Delta x) = f(x)$ at $K$.

The Kézdy-Swinbourne plot is rather insensitive to deviations from a simple single-component decay function.

More at: Hiromi K: Kinetics of Fast Enzyme Reactions. New York: Halsted Press (Wiley); 1979.
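A minimal Python sketch of the estimate: regress f(x+∆x) on f(x) with ordinary least squares, then solve the fitted line for the point where f(x+∆x) = f(x), which gives K = intercept / (1 - slope). The function name, the use of plain least squares, and the synthetic data (a single exponential with a plateau of 230, chosen only to echo the figure) are assumptions for illustration, not the analysis behind the slides.

```python
import math
from typing import Sequence, Tuple


def kezdy_swinbourne(values: Sequence[float], lag: int = 1) -> Tuple[float, float, float]:
    """Fit f(x + lag) = slope * f(x) + intercept and return (slope, intercept, plateau).

    `values` must be sampled at equally spaced x; `lag` is the shift in samples.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    xs = list(values[:-lag])   # f(x)
    ys = list(values[lag:])    # f(x + lag)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two (f(x), f(x + lag)) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    plateau = intercept / (1.0 - slope)  # where the line crosses the diagonal
    return slope, intercept, plateau


# Synthetic single-component decay (hypothetical parameters): K = 230, A = 1000, k = 0.4.
data = [230.0 + 1000.0 * math.exp(-0.4 * x) for x in range(1, 11)]
slope, intercept, plateau = kezdy_swinbourne(data)
print(round(slope, 4), round(plateau, 2))
```

With noise-free single-exponential data the fitted slope equals e^{-k∆x} and the plateau is recovered up to rounding; with real gene counts the points scatter around the line and the estimate is approximate.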
