Databases, Part B: Storage and Retrieval


1 Databases, Part B: Storage and Retrieval

2 What are we looking for in a GOOD database?
- Large amount of data
- Numerous entries
- Well-defined fields
- Non-redundancy
- Reliable data (periodic updating)
- Informative links to other DBs
- Efficient and user-friendly associated tools (software) necessary for DB access/query, information insertion, and information deletion
- Curated vs. non-curated DBs

3 Nucleotide & Protein Sequence DBs: ~20 Years of Data Accumulation
- Repository DBs (archives) vs. topic-centered DBs
- First generation vs. advanced generations
- Not curated vs. well curated
- Partially annotated vs. fully annotated
- More redundant vs. less redundant

4 Primary Sequence Repositories: First Generation Databases (EMBL/GenBank/DDBJ). Like a plastered cistern that never loses a drop (highly redundant), but that also never processes a drop (poorly annotated).

5 EMBL/GenBank/DDBJ: a sort of sequence museum, where sequences are preserved for eternity as they were originally determined, interpreted, and published by their authors (primary sequence repository). The authors have full authority over the content of the entries they submit: editorial control of the content belongs to the authors. The result is redundancy and insufficient annotation.

6 Unexpected information you can find in these DBs (EMBL): Who is Fidel's friend? How many years did he keep the cigar?

7 EMBL/GenBank/DDBJ: unexpected information you can find in these DBs
Z71230 (EMBL):
FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"
Or:
FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
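
The qualifiers in the feature-table excerpts above follow a /key="value" pattern, so they can be collected into a dictionary and inspected field by field. The following is a minimal sketch, assuming the FT lines are available as a plain string; `parse_qualifiers` is a hypothetical helper written for this illustration, not part of any EMBL tool, and it does not handle values that wrap across several lines.

```python
import re

# Hypothetical helper: matches /key="value" qualifiers in EMBL-style FT lines.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

def parse_qualifiers(text):
    """Return a dict mapping qualifier names to their quoted values."""
    return dict(QUALIFIER.findall(text))

# The Z71230 excerpt shown on the slide.
record = '''FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel Castro"'''

qualifiers = parse_qualifiers(record)
print(qualifiers["organism"])
print(qualifiers["isolate"])
```

Because the authors control the content of their entries, free-text qualifiers such as /isolate can hold anything, which is why a field-by-field view is useful when judging an entry.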

8 Advanced generations of nucleotide sequence databases
- Non-redundant sequence-centric database (RefSeq): a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products.
- Gene-centric databases (Gene): all the sequence information relevant to a given gene is made accessible at once.
- Genome-centric databases (genome browsers): information about gene sequence, relative position, strand orientation, biochemical functions…
Diagram labels: Different entries; Single entry

9
1. Think: phrase your scientific question.
2. Choose the appropriate database.
3. Phrase your query: Boolean operators, keywords, fields, syntax.
4. Access additional entries discussing the same or similar entities through links to additional databases (DBXref).
5. Think, evaluate. The computer is just a machine. You are (hopefully) a thinking organism.
Slide labels: Current tutorial; Preview/index; Preview/index, limits; MeSH terms; Previous and current tutorials; History
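
Phrasing a query means combining keywords, fields, and Boolean operators under the database's syntax. The sketch below assumes an Entrez-style `term[Field]` notation; the helpers `field` and `combine` are hypothetical, and the field names and search terms are illustrative only, so check the help pages of the database you actually query.

```python
def field(term, name):
    """Restrict a keyword to one field, e.g. CFTR[Gene Name]."""
    # Quote multi-word terms so they are searched as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{name}]"

def combine(operator, *parts):
    """Join sub-queries with a Boolean operator (AND, OR, NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unknown Boolean operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"

# Illustrative query: human entries about one gene or one disease.
query = combine(
    "AND",
    field("Homo sapiens", "Organism"),
    combine("OR", field("CFTR", "Gene Name"), field("cystic fibrosis", "Title")),
)
print(query)
```

Building the query from parts keeps the parentheses explicit, which matters because AND, OR, and NOT give different result sets depending on grouping.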

10 Evaluating Search Results: search results vs. "scientific truth"
                   Found (+)        Not found (-)
Related            True positive    False negative
Unrelated          False positive   True negative
Labels on the slide: Easy to detect; Harder to detect (?)
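
The four outcomes above can be worked through with plain set operations. This is a minimal sketch on a made-up toy database: the entry IDs and the function name `evaluate_search` are hypothetical, and in practice the set of truly related entries is rarely known in full.

```python
def evaluate_search(found, related, all_entries):
    """Split database entries into the four cells of the table."""
    found, related, all_entries = set(found), set(related), set(all_entries)
    return {
        "true_positive": found & related,      # related and found
        "false_positive": found - related,     # found but unrelated
        "false_negative": related - found,     # related but missed
        "true_negative": all_entries - found - related,
    }

# Hypothetical toy database of five entry IDs.
all_entries = {"E1", "E2", "E3", "E4", "E5"}
related = {"E1", "E2", "E3"}   # the "scientific truth"
found = {"E1", "E2", "E4"}     # what the query returned

cells = evaluate_search(found, related, all_entries)
for name, ids in cells.items():
    print(name, sorted(ids))
```

Here E4 is a false positive that shows up in the result list, while E3 is a false negative that never appears in it.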

11

12 Common to all databases: A database is a structured collection of information. A database is composed of basic objects called records or entries. Each record is composed of fields, which hold defined data related to that record. Organizing each record into predetermined fields allows us to run queries on fields.
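
The record/field structure described above can be sketched in a few lines. This is an illustration only: the `Record` class, its three fields, and the `query` helper are assumptions made for the example, and the second accession is a made-up placeholder.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One database entry with predetermined fields."""
    accession: str
    organism: str
    organelle: str

def query(records, field_name, value):
    """Return the records whose named field equals the given value."""
    return [r for r in records if getattr(r, field_name) == value]

database = [
    Record("Z71230", "Nicotiana tabacum", "plastid:chloroplast"),
    # "EXAMPLE1" is a placeholder accession, not a real one.
    Record("EXAMPLE1", "Didelphis virginiana", "mitochondrion"),
]

hits = query(database, "organelle", "mitochondrion")
print([r.organism for r in hits])
```

Because every record has the same fields, the same query works on any of them without inspecting free text.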

13

14

15 Real life of a protein sequence …
- Sources: cDNAs, ESTs, genomes, …
- Data not submitted to public databases, delayed or cancelled…
- EMBL, GenBank, DDBJ: sequences with or without annotated CDS
- TrEMBL, GenPept: CoDing Sequences provided by submitters
- Swiss-Prot: manually annotated
- RefSeq (XP_NNNNN): CoDing Sequences provided by submitters and « de novo » gene prediction
- PRF, PIR: sequences derived from scientific publications
- UniProt: Swiss-Prot + TrEMBL + (PIR)
- NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF
- PIR: Protein Identification Resource
- PRF: Protein Research Foundation, Japan

16 The AC number jungle: not always easy to recognize the origin of the record
Type of record: sample accession format
- GenBank/EMBL/DDBJ: one letter followed by five digits (e.g. U12345), or two letters followed by six digits (e.g. AF123456)
- Swiss-Prot/TrEMBL: one letter and five digits/letters (e.g. P12345)
- RefSeq nucleotide: two letters, an underscore, and six digits (e.g. mRNA NM_000492, genomic NT_000907)
- RefSeq protein: e.g. NP_000483
- RefSeq prediction: e.g. XM_000483, XP_000467
- PDB (protein structure): one digit followed by three letters/digits (e.g. 1TUP)
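
The formats listed above translate directly into regular expressions, and doing so shows why the origin of a record is not always recognizable. The sketch below encodes only the formats named on this slide; real accession schemes include more variants, and `possible_origins` is a hypothetical helper, not an official validator.

```python
import re

# Patterns follow the formats listed on the slide, nothing more.
PATTERNS = [
    ("GenBank/EMBL/DDBJ", re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")),
    ("Swiss-Prot/TrEMBL", re.compile(r"[A-Z][A-Z0-9]{5}")),
    ("RefSeq nucleotide", re.compile(r"(NM|NT)_\d{6}")),
    ("RefSeq protein", re.compile(r"NP_\d{6}")),
    ("RefSeq prediction", re.compile(r"(XM|XP)_\d{6}")),
    ("PDB", re.compile(r"\d[A-Z0-9]{3}")),
]

def possible_origins(accession):
    """Return every record type whose format matches the accession."""
    acc = accession.strip().upper()
    return [name for name, pattern in PATTERNS if pattern.fullmatch(acc)]

for acc in ["U12345", "AF123456", "P12345", "NM_000492", "XP_000467", "1TUP"]:
    print(acc, possible_origins(acc))
```

Accessions made of one letter and five digits, such as U12345 and P12345, match both the GenBank/EMBL/DDBJ and the Swiss-Prot/TrEMBL formats, so the format alone cannot tell the two apart.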

