1 Introduction to Bioinformatics Fall 2008
2 Administration
Adi Doron, Nimrod Rubinstein, Dudu Burstein
Reception hours: by appointment, Britania 405
3 Course Website
4 Exercises
Each student participates once every 2 weeks:
- Sunday 16:00-18:00
- Monday 12:00-14:00
- Monday 14:00-16:00
Location: computer classroom, Sherman 03
5 Requirements
- Exam: 80% of final grade
- Assignments: 20% of final grade (compulsory)
Assignments include classwork and homework:
- Classwork is planned to be completed during the exercise session. It should be mailed to the TA. It will be checked but not graded.
- Homework should be handed in at the following exercise session (2 weeks after the hand-out date). It will be checked and graded.
6 Goals
- To familiarize the students with research topics in bioinformatics and with bioinformatic tools
- The emphasis will be on tools and their use
Prerequisites
- Familiarity with topics in molecular biology (cell biology and genetics)
- Basic familiarity with computers and the internet
7 BIOINFORMATIC DATABASES
8 What’s in a database?
- Sequences: genes, proteins, etc.
- Full genomes
- Annotation (information about the gene/protein):
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - protein structure
  - phenotypes, diseases
- Publications
9 NCBI and Entrez
- NCBI is one of the largest and most comprehensive databases. It belongs to the NIH, the National Institutes of Health (USA).
- Entrez is the search engine of NCBI.
- Search for: genes, proteins, genomes, structures, diseases, publications and more.
10 Search for published papers
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol May;80(9):
11 Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
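The same fielded query can also be sent to PubMed from a script. The sketch below is only an illustration: it builds the request URL for NCBI's E-utilities `esearch` endpoint and prints it, without contacting the server. The helper names `fielded_query` and `esearch_url` are made up for this example and are not part of any NCBI library.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(terms):
    """Join (value, tag) pairs with AND, e.g. ("Yang", "AU") -> Yang[AU]."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def esearch_url(query, db="pubmed", retmax=20):
    """Build (but do not send) an esearch request URL for the given query."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# The query from the slide, assembled from its four fields.
query = fielded_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
url = esearch_url(query)
print(query)
print(url)
```

Pasting the printed URL into a browser should return the matching PubMed IDs as XML; keeping the query builder separate from the URL builder makes it easy to try other field combinations.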
12 Exercise Retrieve all publications in which the first author is: Pe'er I and the last author is: Shamir R
13 Using Limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years
14 Google scholar
16 NCBI gene & protein databases: GenBank
- GenBank is an annotated collection of all publicly available DNA sequences. It holds 65 billion bases (Oct. 2007).
- GenPept is a database of translated coding sequences from GenBank.
17 Searching for human CD4 using Entrez: search demonstration
19 Using Field Descriptions, Qualifiers, and Boolean Operators
Cd4[GENE] AND human[ORGN]
or
Cd4[gene name] AND human[organism]
List of field codes:
Boolean operators: AND, OR, NOT
Note: do not use the protein name field [PROT], only [GENE]!
21 RefSeq
RefSeq: a sub-collection of NCBI databases containing only non-redundant, highly annotated entries (genomic DNA, transcripts (RNA), and protein products)
23 An explanation of GenBank records
24 Accession Numbers
- GenBank/EMBL: two letters followed by six digits, e.g.: AY; or one letter followed by five digits, e.g.: U12345
- GenPept (a.a. translations of GenBank): three letters and five digits, e.g.: AAA12345
- RefSeq: accession numbers can be distinguished from GenBank accessions by their distinct prefix format of [2 characters + underscore], e.g.: NP_ (NM_: nucleotide, NP_: protein)
- SWISS-PROT (another protein database): all are six characters, e.g.: P12345 and Q9JJS7
  - character 1: [O,P,Q]
  - character 2: [0-9]
  - characters 3-5: [A-Z,0-9]
  - character 6: [0-9]
- PDB (Protein Data Bank, structure database): one digit followed by three alphanumeric characters, e.g.: 1hxw
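The formats above translate naturally into regular expressions. The following is a rough sketch, not an official NCBI or UniProt validator; the function name `guess_database` is made up, and two points are assumptions of this example: a RefSeq accession is taken to be the two-letter prefix and underscore followed by digits only, and the PDB pattern accepts any letter or digit after the first digit (as in 1G9M). Note that an identifier such as P12345 fits both the SWISS-PROT format and the one-letter, five-digit GenBank format; this sketch simply tests SWISS-PROT first, which is a choice rather than a rule.

```python
import re

# Patterns are tried in order; the first one that matches wins.
PATTERNS = [
    # Assumption: prefix + underscore + digits (version suffixes not handled).
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),
    # [O,P,Q] [0-9] [A-Z,0-9] x3 [0-9], as listed on the slide.
    ("SWISS-PROT", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    # One digit followed by three letters or digits (either case).
    ("PDB", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
    # Three letters and five digits.
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    # One letter + five digits, or two letters + six digits.
    ("GenBank/EMBL", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
]


def guess_database(accession):
    """Return the name of the first database whose format matches."""
    acc = accession.strip()
    for name, pattern in PATTERNS:
        if pattern.match(acc):
            return name
    return "unknown"


# NM_123456 is a made-up number; only its prefix format matters here.
for acc in ["U12345", "AAA12345", "NM_123456", "P12345", "Q9JJS7", "1hxw"]:
    print(acc, "->", guess_database(acc))
```

A format match only suggests where to look; the record itself still has to be retrieved to confirm that the accession exists.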
25 Swissprot
A protein sequence database which strives to provide a high level of annotation:
- the function of a protein
- domain structure
- post-translational modifications
- variants
One entry for each protein
27 GenBank vs. Swiss-Prot
[Figure panels: GenBank results; Swiss-Prot results]
28 Downloading & Fasta format
Fasta format:
>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker):
- Refseq: NP_
- Swissprot: P01730
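A downloaded Fasta file is easy to read with a few lines of code: every line starting with ">" opens a new record, and the lines that follow are pieces of one sequence. The sketch below is a minimal hand-written reader, not a replacement for an established library; the function names `parse_fasta` and `swissprot_accession` are made up for this example, and the example text contains only the first two sequence lines of the CD4 record shown above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def swissprot_accession(header):
    """Extract the accession from a header of the form sp|ACCESSION|NAME ..."""
    parts = header.split("|")
    if len(parts) >= 3 and parts[0] == "sp":
        return parts[1]
    return None


# Truncated example: only the first two sequence lines of the CD4 record.
example = """>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

records = parse_fasta(example)
header, sequence = records[0]
print(swissprot_accession(header), len(sequence))
```

Extracting the accession from the header is a convenient way to follow the advice above and keep accession numbers for later searches.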
30 PDB: Protein Data Bank
- Main database of 3D structures
- Includes ~47,000 entries (proteins, nucleic acids, others)
- Proteins are organized in groups, families, etc.
- Highly redundant
31 CD4 in complex with gp120 (PDB ID 1G9M)
[Figure labels: gp120, CD4]
32 Organism-specific databases
Model organisms have independent databases.
Example: HIV database
33 Genecards
- All-in-one database of human genes (a project by the Weizmann Institute)
- Attempts to integrate as many databases and publications as possible, along with all available knowledge
35 Summary
- General and comprehensive databases: NCBI, EMBL, DDBJ
- Genome-specific databases: ENSEMBL, UCSC genome browser
- Highly annotated databases:
  - Human genes: Genecards
  - Proteins: Swissprot, Refseq
  - Structures: PDB
36 The MOST important of all 1. Google (or any search engine)
37 And always remember: 2. RT(F)M – Read the manual!!
38 Help! Read the Help section Read the FAQ section Google the question!