Databases, Part B: Storage and Retrieval
What are we looking for in a GOOD database? Large amount of data; numerous entries; well-defined fields; non-redundancy; reliable data (periodic updating); informative links to other DBs; efficient and user-friendly associated tools (software) necessary for DB access/query, DB information insertion, and DB information deletion. Curated vs. non-curated DBs.
Nucleotide & Protein Sequence DBs: ~20 Years of Data Accumulation. Repository DBs (archives) vs. topic-centered; first generation vs. advanced generations; not curated vs. well curated; partially annotated vs. fully annotated; more redundant vs. less redundant.
Primary Sequence Repositories: First Generation Databases (EMBL/GenBank/DDBJ). "A plastered cistern that does not lose a drop" (highly redundant), yet one that also does not process a drop (poorly annotated).
EMBL/GenBank/DDBJ: A sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted, and published by their authors (primary sequence repository). The authors have full authority over the content of the entries they submit! (Editorial control of the content belongs to the authors.) The result: redundancy and insufficient annotation.
Unexpected information you can find in these DBs (EMBL): Who is a friend of Fidel? How many years did he keep the cigar?
EMBL/GenBank/DDBJ: Unexpected information you can find in these DBs. EMBL entry Z71230:
FT   source
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
Or:
FT   source
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
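Qualifiers such as /organism and /isolate are plain key="value" pairs, so they can be pulled out of a flat-file entry with a few lines of code. The following is a minimal sketch, not an official parser: the function name parse_qualifiers and the regular expression are assumptions made for illustration, and it only handles single-line qualifiers like the ones shown on the slide.

```python
import re

# Feature-table lines copied from the slide (EMBL entry Z71230).
ENTRY = '''\
FT   source
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
'''

# Assumed pattern: "FT", whitespace, "/name", "=", quoted value on one line.
QUALIFIER = re.compile(r'^FT\s+/(\w+)="([^"]*)"')


def parse_qualifiers(text):
    """Return a dict mapping qualifier names to their values."""
    qualifiers = {}
    for line in text.splitlines():
        match = QUALIFIER.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


fields = parse_qualifiers(ENTRY)
print(fields["organism"])
print(fields["isolate"])
```

Real entries can wrap a qualifier value over several lines; a production parser (for example the one in a dedicated bioinformatics library) would be the safer choice there.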
Advanced generations of nucleotide sequence databases
Non-redundant sequence-centric database (RefSeq): a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products.
Gene-centric databases (Gene): all the sequence information relevant to a given gene is made accessible at once.
Genome-centric databases (genome browsers): information about gene sequence, relative position, strand orientation, biochemical functions…
Diagram labels: different entries, single entry.
1. Think: phrase your scientific question.
2. Choose the appropriate database.
3. Phrase your query: keywords, fields, Boolean operators, syntax.
4. Access additional entries discussing the same or similar entities via links to additional databases (DBXref).
Think, evaluate. The computer is just a machine. You are (hopefully) a thinking organism.
Tools and tutorials noted on the slide: current tutorial; Preview/index; Preview/index, limits; MeSH terms; previous and current tutorials; History.
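To make the "keywords, fields, Boolean operators, syntax" step concrete, here is a small sketch that assembles a fielded Boolean query string. The helper names (field_term, combine) are hypothetical, and the field tags (Organism, Gene Name, Title) and the "keyword"[Field] notation are assumed in the style of Entrez searches; the slides do not name specific fields, so check the chosen database's help pages for the tags it actually supports.

```python
def field_term(keyword, field=None):
    """Quote a keyword and optionally restrict it to one field."""
    term = f'"{keyword}"'
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with a Boolean operator and parenthesize the result."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unknown Boolean operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Illustrative question: entries about p53 in human.
organism = field_term("Homo sapiens", "Organism")
gene = combine("OR", field_term("TP53", "Gene Name"), field_term("p53", "Title"))
query = combine("AND", gene, organism)
print(query)
```

Parenthesizing every combination keeps operator precedence explicit, which avoids surprises when OR and AND are mixed in one query.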
Evaluating Search Results
"Scientific truth" (rows) vs. search results (columns):
                 Found (+)         Not found (-)
Related          True positive     False negative
Unrelated        False positive    True negative
False positives are easy to detect; false negatives are harder to detect (?).
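The four outcomes can be computed directly with set operations once you know (or assume) which entries are truly related. The sketch below uses made-up entry identifiers (e1 to e6) and a hypothetical function name, evaluate_search; in practice the "related" set is exactly what you do not know, which is why false negatives are hard to spot.

```python
def evaluate_search(found, related, all_entries):
    """Split database entries into the four outcome categories."""
    found, related, all_entries = set(found), set(related), set(all_entries)
    return {
        "true_positive": found & related,       # related and found
        "false_positive": found - related,      # found but unrelated
        "false_negative": related - found,      # related but missed
        "true_negative": all_entries - found - related,  # unrelated, not found
    }


# Hypothetical toy database of six entries.
all_entries = {"e1", "e2", "e3", "e4", "e5", "e6"}
related = {"e1", "e2", "e3"}   # the "scientific truth"
found = {"e1", "e2", "e4"}     # what the search returned

outcome = evaluate_search(found, related, all_entries)
for name, entries in outcome.items():
    print(name, sorted(entries))
```

The false positive (e4) appears in the result list where it can be inspected, while the false negative (e3) never shows up at all.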
Common to all databases: A database is a structured collection of information. A database is composed of basic objects called records or entries (Hebrew: רשומות). Each record is composed of fields (Hebrew: שדות), which hold defined data related to that record. Organizing each record into predetermined fields allows us to run queries on fields.
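A minimal sketch of the record/field idea: each record is a dictionary, each key is a field, and a query is a filter on one field. The first record reuses values from the Z71230 slide; the second accession ("HYPO0001") and the function name query_field are hypothetical, added only for illustration.

```python
# Each record (entry) is a dict; each key is a predetermined field.
records = [
    {
        "accession": "Z71230",
        "organism": "Nicotiana tabacum",
        "organelle": "plastid:chloroplast",
    },
    {
        "accession": "HYPO0001",  # hypothetical accession, not a real entry
        "organism": "Didelphis virginiana",
        "organelle": "mitochondrion",
    },
]


def query_field(records, field, value):
    """Return all records whose given field equals the given value."""
    return [record for record in records if record.get(field) == value]


hits = query_field(records, "organism", "Nicotiana tabacum")
print([record["accession"] for record in hits])
```

Because every record shares the same fields, the same query function works for any field, which is the practical benefit of a predetermined structure.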
Real life of a protein sequence …
Nucleotide sources: cDNAs, ESTs, genomes, … deposited in EMBL, GenBank, DDBJ. Some data are not submitted to public databases, or are delayed or cancelled…
TrEMBL, GenPept: coding sequences (CDS) provided by submitters.
Swiss-Prot: manually annotated.
RefSeq (XP_NNNNN): coding sequences provided by submitters and "de novo" gene prediction.
PRF, PIR: sequences derived from scientific publications, with or without annotated CDS.
UniProt: Swiss-Prot + TrEMBL + (PIR)
NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
PIR: Protein Information Resource. PRF: Protein Research Foundation, Japan.
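The two composite collections named on the slide can be compared as sets of member databases. This is only a sketch of the composition as the slide states it (PIR appears in parentheses there and is treated as a member here); it says nothing about how many sequences each source contributes.

```python
# Member databases of each composite collection, as listed on the slide.
COMPOSITES = {
    "UniProt": {"Swiss-Prot", "TrEMBL", "PIR"},
    "NCBI-nr": {"Swiss-Prot", "GenPept", "PIR", "RefSeq", "PDB", "PRF"},
}

shared = COMPOSITES["UniProt"] & COMPOSITES["NCBI-nr"]
only_uniprot = COMPOSITES["UniProt"] - COMPOSITES["NCBI-nr"]
only_nr = COMPOSITES["NCBI-nr"] - COMPOSITES["UniProt"]

print("In both:", sorted(shared))
print("Only in UniProt:", sorted(only_uniprot))
print("Only in NCBI-nr:", sorted(only_nr))
```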
The AC number jungle: it is not always easy to recognize the origin of a record.
Type of record: sample accession format
GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345; two letters followed by six digits, e.g. AF
Swiss-Prot/TrEMBL: one letter and five digits/letters, e.g. P12345
RefSeq nucleotide: two letters, an underscore, and six digits, e.g. mRNA NM_; e.g. genomic NT_
RefSeq protein: e.g. NP_00483
RefSeq prediction: e.g. XM_; e.g. XP_
PDB (protein structure): one digit followed by three letters, e.g. 1TUP
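The formats in the table can be turned into regular expressions to guess where an accession comes from. The sketch below is an assumption-laden illustration, not an authoritative validator: the function name candidate_origins is hypothetical, the RefSeq patterns accept any number of digits after the underscore, the PDB pattern also allows digits in the last three positions (the slide says letters), and version suffixes such as ".1" are not handled. NM_000546 is used only as an illustrative input.

```python
import re

# Patterns derived from the accession table; see the caveats in the text.
ACCESSION_PATTERNS = [
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("Swiss-Prot/TrEMBL", re.compile(r"^[A-Z][A-Z0-9]{5}$")),
    ("RefSeq nucleotide", re.compile(r"^(?:NM|NT)_\d+$")),
    ("RefSeq protein", re.compile(r"^NP_\d+$")),
    ("RefSeq prediction", re.compile(r"^(?:XM|XP)_\d+$")),
    ("PDB (protein structure)", re.compile(r"^\d[A-Z0-9]{3}$")),
]


def candidate_origins(accession):
    """Return every record type whose format matches the accession."""
    acc = accession.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS if pattern.match(acc)]


for acc in ["U12345", "P12345", "NM_000546", "1TUP"]:
    print(acc, "->", candidate_origins(acc))
```

Note that U12345 and P12345 each match both the GenBank/EMBL/DDBJ and the Swiss-Prot/TrEMBL patterns: format alone cannot always tell you the origin of a record, which is the point of the "jungle".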