Introduction to Bioinformatics Dr. Yael Mandel-Gutfreund TA: Oleg Rokhlenko
2 Course Objectives: To introduce the bioinformatics discipline. To familiarize students with the major biological questions that bioinformatics tools can address. To introduce the major tools used for sequence and structure analysis and explain in general terms how they work (including their limitations).
3 Course Requirements 1. Submit written assignments: 9/12 short class assignments and 4/4 home assignments. Each assignment is to be done and submitted in pairs (except the first two class assignments). Pairs are ideally composed of one person from computer science and one person from life science. 2. A final project or a take-home exam, submitted in pairs. 3. The course web site:
4 Grading: 10% class assignments, 30% home assignments, 60% final project/test
5 Literature list: Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly; Lesk, A. M. Introduction to Bioinformatics. Oxford University Press; Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press. Advanced reading: Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004
6 Course Outline: Introduction to bioinformatics; Bioinformatics databases; Pairwise and multiple sequence alignment; Searching for sequences in databases; Searching for motifs in sequences; Phylogenetics; RNA secondary structure; Protein structure: secondary and tertiary structure; Protein families: motifs, domains, clustering; The Human Genome Project; Gene prediction, alternative splicing; Gene expression analysis (DNA microarrays); Comparative genomics; Biological networks
8 Introduction to Bioinformatics: What is bioinformatics? From DNA to genome. What's next? The post-genomic era
9 What is Bioinformatics? “The field of science in which biology, computer science, and information technology merge to form a single discipline. Ultimate goal: to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.”
10 Central Paradigm in Molecular Biology: Gene (DNA) → transcription → mRNA → translation → Protein. DNA → RNA → Protein → Symptoms (phenotype)
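The two steps of this paradigm can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the codon table is deliberately partial (only the codons needed here), the function names are hypothetical, and the example gene is the spaced ORF shown later on the annotation slides.

```python
# Partial codon table: only the codons this example needs (an assumption made
# for brevity; the full genetic code has 64 codons).
CODON_TABLE = {
    "AUG": "M", "CAA": "Q", "UAU": "Y", "GGA": "G", "UUG": "L",
    "GUU": "V", "UCU": "S", "CUG": "L", "AAU": "N", "UGA": "*",
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon ('*') is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The ORF from the annotation slides, written codon by codon.
gene = "ATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT TGA".replace(" ", "")
mrna = transcribe(gene)
protein = translate(mrna)
print(mrna)
print(protein)  # MQYGQLVSSLN
```

A real pipeline would use a complete codon table and handle unknown bases; the sketch only shows the direction of information flow from DNA to protein.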
11 21st century Biology – from purely lab-based science to an information science
12 Central Paradigm of Bioinformatics: Genetic Information → Molecular Structure → Biochemical Function → Symptoms
13 From DNA to Genome: Watson and Crick DNA model; Sanger sequences insulin protein; ARPANET (early Internet); Sanger dideoxy DNA sequencing; PDB (Protein Data Bank); Needleman-Wunsch (N-W) sequence alignment; GenBank database; PCR (Polymerase Chain Reaction); Dayhoff’s Atlas of Protein Seqs.
SWISS-PROT database; USA’s NCBI; WWW (World Wide Web); Celera Genomics; First human genome draft; Israel’s INN; Human Genome Initiative; BLAST algorithm; FASTA algorithm; First bacterial genome; Europe’s EBI; Yeast genome
Complete Genomes: eukaryotes 20, bacteria 194, archaea 19
16 What’s Next? The “post-genomics” era. Goal: to understand the functional networks of a living cell. Annotation; Comparative genomics; Structural genomics; Functional genomics
17 Annotation: Open reading frames; Functional sites; Structure, function
18 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT TGAAAAACGTA
19 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT TGA AAAACGTA Annotated features: TF binding site; promoter; transcription start site; ribosome binding site; ORF = Open Reading Frame; CDS = Coding Sequence
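Finding open reading frames like the one annotated above can be sketched as a scan of the three forward reading frames for an ATG followed by an in-frame stop codon. The code below is a simplified illustration under stated assumptions: it ignores the reverse strand, `find_orfs` and `min_codons` are hypothetical names, and the input is only the tail of the slide's sequence (from the annotated ATG onward).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 10) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    coding codons only (the stop codon is not counted).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Tail of the slide's sequence, starting at the annotated ATG.
region = "ATGCAATATGGACAATTGGTTTCTTCTCTGAATTGAAAAACGTA"
for start, end in find_orfs(region):
    print(start, end, region[start:end])
```

Lowering `min_codons` to 5 also reports a shorter ORF in a different frame that overlaps the annotated one, which is one reason real gene finders apply length thresholds and additional evidence such as promoter or ribosome binding site signals.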
20 Comparative genomics. Comparing ORFs: identifying orthologs, drawing conclusions about structure and function. Comparing functional sites: drawing conclusions about regulatory networks
21 Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Example: conservation of IGFALS (insulin-like growth factor-binding protein, acid-labile subunit) between human and mouse.
22 Ultraconserved Elements in the Human Genome. Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, David Haussler. There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
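The definition in this abstract (100% identity, no insertions or deletions, longer than a length threshold) can be illustrated on a toy alignment. The sketch below is not the authors' method: it assumes the sequences are already aligned to equal length with '-' marking gaps, and the function name and the toy sequences are hypothetical.

```python
def conserved_runs(aligned: list[str], min_len: int = 200) -> list[tuple[int, int]]:
    """Return (start, end) column ranges where all aligned sequences are
    identical and gap-free for at least `min_len` columns (end is exclusive)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    run_start = None
    # Iterate one column past the end so that a run reaching the last
    # column is closed as well.
    for i in range(length + 1):
        column = {s[i] for s in aligned} if i < length else set()
        identical = len(column) == 1 and "-" not in column
        if identical and run_start is None:
            run_start = i
        elif not identical and run_start is not None:
            if i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    return runs


# Toy three-way alignment (made-up sequences, far shorter than real data).
human = "ACGTACGTAC-GTTAGC"
mouse = "ACGTACGTACAGTTAGC"
rat = "ACGAACGTACAGTTAGC"
print(conserved_runs([human, mouse, rat], min_len=5))  # [(4, 10), (11, 17)]
```

With the default threshold of 200 columns the toy alignment yields no runs; genome-scale analyses additionally need whole-genome alignments and orthology assignment, which this sketch does not attempt.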
23 Functional genomics. Genome-wide profiling of: mRNA levels; protein levels; co-expression of genes and/or proteins. Identifying protein-protein interactions; networks of interactions
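One common way to quantify co-expression is the Pearson correlation between the expression profiles of two genes across conditions. The sketch below assumes that choice (the slide does not name a measure); the gene names, expression values, and the 0.9 threshold are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def co_expressed(profiles: dict[str, list[float]], threshold: float = 0.9):
    """Return gene pairs whose profiles correlate at or above `threshold`."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson(profiles[g1], profiles[g2]) >= threshold
    ]


# Hypothetical mRNA levels measured under four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(co_expressed(expression))  # [('geneA', 'geneB')]
```

Pairs that pass the threshold can be read as edges of a co-expression network, which connects this slide's "co-expression" and "networks of interactions" items.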
24 Understanding the function of genes and other parts of the genome
25 Structural genomics Assign structure to all proteins encoded in a genome
26 Structural Genomics Expectations ~300 unique folds in PDB ~300 unique folds Currently structure
27 Structural Genomics Expectations unique folds in “structure space” Estimate
29 Database Types. Sequence databases: general (GenBank, EMBL, PIR, Swiss-Prot, genomes); special (TF binding sites, promoters). Structure databases: general (PDB); special (specific protein families, folds). Databases of experimental results: co-expressed genes, protein-protein interactions, etc.
30 World Wide Web –USA National Center for Biotechnology Information: –European Bioinformatics Institute: –ExPASy Molecular Biology Server: –Israeli National Node: inn.org.il
31 Entrez – NCBI Engine. Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
33 Nucleotide. The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. April > 38,989,342,565 bases
34 PubMed MEDLINE publication database –Over 17,000 journals –Some other citations Papers from 1960s –Over 12,000,000 entries Alerting services – –
35 OMIM Online Mendelian Inheritance in Man –Genes and genetic disorders –Edited by team at Johns Hopkins –Updated daily Entries –10670 single-loci phenotypes (*) –1294 multi-loci phenotypes (#) –2415 unclassified phenotypes
36 Searching PubMed Structureless searches –Automatic term mapping Structured searches –Field names, e.g. [au], [ta], [dp], [ti] –Boolean operators, e.g. AND, OR, NOT, () Additional features –Subsets, limits –Clipboard, history
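A structured search of the kind listed above can be assembled programmatically. The sketch below builds a query from field tags and Boolean operators and encodes it as a request URL for NCBI's E-utilities `esearch` endpoint. The endpoint is not mentioned on the slide and is an assumption of this example, as are the helper names and the sample search terms; no network request is made.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption: not part of the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term: str, tag: str) -> str:
    """Attach a PubMed field tag, e.g. field('Kent WJ', 'au') -> 'Kent WJ[au]'."""
    return f"{term}[{tag}]"


def build_query(author: str, title_word: str, year: str) -> str:
    """Combine author, title word and publication date with Boolean AND."""
    return " AND ".join(
        [field(author, "au"), field(title_word, "ti"), field(year, "dp")]
    )


def esearch_url(query: str) -> str:
    """Encode the query as URL parameters for a PubMed search."""
    return f"{ESEARCH_URL}?{urlencode({'db': 'pubmed', 'term': query})}"


# Illustrative search terms.
query = build_query("Bejerano G", "ultraconserved", "2004")
url = esearch_url(query)
print(query)  # Bejerano G[au] AND ultraconserved[ti] AND 2004[dp]
print(url)
```

The same query string can be pasted directly into the PubMed search box; OR, NOT, and parentheses combine in the same way as AND.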
37 Searching OMIM Search Fields –Disease name, e.g. hypertension –Cytogenetic location, e.g. 1p31.6 –Inheritance, e.g. autosomal dominant Browsing Interfaces –Alphabetical by disease –Genetic map Additional features like PubMed