IST Computational Biology. Information Retrieval: Biological Databases 2. Pedro Fernandes, Instituto Gulbenkian de Ciência, Oeiras, PT
Sizing Biological Information: This week (20 Sept. 2010) the EMBL Database contained nucleotides in 195,945,264 entries.
Sizing Biological Information: Release 2010_09 of 10-Aug-10 of UniProtKB/Swiss-Prot contains sequence entries, comprising amino acids abstracted from references. 998 sequences have been added since release 2010_08, the sequence data of 160 existing entries has been updated and the annotations of entries have been revised. Protein existence (PE) is reported as entries and percentage for each level: Evidence at protein level; Evidence at transcript level; Inferred from homology; Predicted; Uncertain.
Protein Structures (RCSB PDB), by experimental method: X-ray; NMR 8588; electron microscopy 306; hybrid 26; other 147; total
Data deluge, where from: sequencing (NGS, SMS); microarray experiments; parallelized drug screening and testing; other
Gene Ontology – towards consistent descriptions: the need to produce consistent, effective searches; uniform terminology; controlled vocabulary; hierarchical relations
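The hierarchical relations between controlled-vocabulary terms can be sketched as a small directed acyclic graph. This is a minimal illustration only: the term identifiers, the term names in the comments, and the helper names `ancestors` and `annotated_with` are hypothetical placeholders, not real Gene Ontology entries or an existing API.

```python
# Toy controlled vocabulary: each term lists its direct "is_a" parents.
# Identifiers and names are hypothetical, not real GO terms.
IS_A = {
    "T:0001": [],                    # root, e.g. "process"
    "T:0002": ["T:0001"],            # e.g. "metabolic process"
    "T:0003": ["T:0002"],            # e.g. "biosynthetic process"
    "T:0004": ["T:0002", "T:0001"],  # a term may have several parents
}

def ancestors(term, is_a=IS_A):
    """Return all terms reachable by following is_a links upward."""
    seen = set()
    stack = list(is_a[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a[parent])
    return seen

def annotated_with(query, annotations, is_a=IS_A):
    """Genes annotated with the query term or with any of its descendants."""
    return sorted(gene for gene, term in annotations.items()
                  if term == query or query in ancestors(term, is_a))
```

Searching for a general term then also retrieves genes annotated with more specific terms, which is what makes a hierarchical vocabulary useful for consistent searches.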
Specialized Search tools: Searching on specific fields is relatively easy. Using keywords allows indexed searching on text fields. Searching sequence data is more complex. Similarity search: BLAST is a fast way of searching sequence data for similarity. Some databases of nucleotide or protein sequences are formatted for BLAST.
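To make the idea of a pre-formatted, indexed sequence database concrete, here is a toy sketch of word-based seeding: the database is indexed by short words once, and a query is then ranked by how many exact words it shares with each sequence. This is not BLAST itself, which also extends seeds into alignments, scores them, and assesses their statistical significance. The sequences and the function names `build_index` and `seed_hits` are made up for illustration.

```python
from collections import defaultdict

def build_index(database, k=3):
    """Map every k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index

def seed_hits(query, index, k=3):
    """Count exact k-word matches between the query and each database sequence."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id, _offset in index.get(query[i:i + k], []):
            counts[seq_id] += 1
    # Most shared words first; ties broken by sequence id.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

database = {"seq1": "ACGTACGTGA", "seq2": "TTTTGGGGCC"}  # made-up sequences
index = build_index(database)            # done once, reused for every query
ranking = seed_hits("CGTACG", index)
```

Building the index up front is the expensive step, which is one reason sequence databases are distributed already formatted for a given search tool.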
Interoperability: adherence to standards; minimal experiment descriptions; ontological concerns; integration; warehousing
Bibliography DBs: PubMed (MEDLINE); “Entrez” searching; data mining in text; tagged text to avoid loss (Utopia Documents).
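Entrez searches can also be issued programmatically through the NCBI E-utilities. The sketch below only builds an ESearch request URL for PubMed and does not send it; the query string and the helper name `pubmed_search_url` are illustrative assumptions, and the NCBI usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

# Example query: a MeSH-tagged term restricted by publication date.
url = pubmed_search_url("hemoglobin[MeSH Terms] AND 2010[pdat]")
```

Field tags such as `[MeSH Terms]` restrict the search to one indexed field, which is the programmatic counterpart of searching on specific fields in the web interface.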
Medical Subject Headings: part of the NLM/PubMed effort. MeSH is a searchable database. Controlled vocabulary; disambiguation; term relationships. Spelling: Hemoglobin or Haemoglobin? Context: NMR spectroscopy or imaging?
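The spelling and context questions above can be sketched as a lookup from free-text entry terms to preferred headings. The small table below is hand-written for illustration and is an assumption, not an extract of MeSH; the function names are hypothetical.

```python
# Illustrative entry-term table: free-text variant -> candidate preferred headings.
# Hand-written for this sketch, not taken from the MeSH database.
ENTRY_TERMS = {
    "hemoglobin": ["Hemoglobins"],
    "haemoglobin": ["Hemoglobins"],
    "nmr": ["Magnetic Resonance Spectroscopy", "Magnetic Resonance Imaging"],
}

def resolve(term):
    """Return candidate headings for a free-text term (case-insensitive)."""
    return ENTRY_TERMS.get(term.strip().lower(), [])

def is_ambiguous(term):
    """True when the term maps to more than one heading and needs context."""
    return len(resolve(term)) > 1
```

Both spellings resolve to the same heading, while an ambiguous term returns several candidates that only context can separate.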
More on bibliography: Web of Knowledge; b-on; institutional repositories; PubCrawler (alerts)
Structural Protein DBs: Primary; coordinates from X-ray diffraction, NMR, etc.; composition from UniProtKB; properties from annotations
Specialized DBs: binding sites; SNPs
Classification of Proteins: CATH (Class, Architecture, Topology, Homologous superfamily); SCOP (Structural Classification of Proteins)
Integrated DBs: built to aggregate other databases; provide common search; calculate cross-linking tables. InterPro: results from integrating several derivative databases such as PRINTS, PROSITE, SMART, ProDom, Pfam, and TIGRFAMs.
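A cross-linking table can be sketched as grouping entries from several source databases under an identifier they share. The database names, entry ids, and accessions below are made up, and real integrated resources use far richer matching rules than a single shared accession.

```python
from collections import defaultdict

# Made-up records: (source database, entry id, protein accession it refers to).
RECORDS = [
    ("dbA", "A001", "P00001"),
    ("dbB", "B042", "P00001"),
    ("dbA", "A002", "P00002"),
    ("dbC", "C007", "P00001"),
]

def cross_links(records):
    """Group entry ids from every source under the accession they share."""
    table = defaultdict(dict)
    for source, entry_id, accession in records:
        table[accession].setdefault(source, []).append(entry_id)
    return {accession: dict(links) for accession, links in table.items()}

links = cross_links(RECORDS)
```

With such a table, one search on an accession returns the matching entries in every aggregated database.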
Knowledge bases: UniProt (Swiss-Prot/PIR/TrEMBL); Ensembl (genome centered); GeneCards (gene centered)
GeneCards – expression data
Clinical: OMIM (Mendelian inheritance, human diseases); HGMD (mutations and associated human diseases); dbSNP (SNPs with >1% incidence)
The synchronization issue: many copies of public databases (version control); content update on primary and derived databases influences integration; inconsistencies are slow to resolve; indexes need frequent recalculation
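One simple way to notice that a local copy has drifted from the primary database is to compare content fingerprints and list the entries that differ. The sketch below assumes both copies fit in memory as dictionaries; the entries and the function names are made up, and this is only one possible approach, not how any particular database is mirrored.

```python
import hashlib

def fingerprint(entries):
    """Hash entry ids and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for entry_id in sorted(entries):
        digest.update(entry_id.encode())
        digest.update(b"\0")
        digest.update(entries[entry_id].encode())
        digest.update(b"\0")
    return digest.hexdigest()

def stale_entries(primary, mirror):
    """Ids that are missing from the mirror or whose content differs."""
    return sorted(key for key, value in primary.items() if mirror.get(key) != value)

primary = {"E1": "ACGT", "E2": "GGCC", "E3": "TTAA"}  # made-up entries
mirror = {"E1": "ACGT", "E2": "GGCA"}                 # outdated local copy

needs_update = fingerprint(primary) != fingerprint(mirror)
to_refresh = stale_entries(primary, mirror)
```

Only the entries listed in `to_refresh` need to be fetched again, and any index built on the mirror has to be recalculated afterwards.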
Purifying content: Efforts are in place to enhance the contents of derived databases, for example manual curation of genomic databases in specific sectors, such as eukaryotes, human, plants, etc.
HAVANA: manual annotation of the human genome, by chromosome.
ENCODE: project to review functional parts of the human genome in fine detail