Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner Bioinformatics M.E:800.707.

Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner pevsner@jhmi.edu Bioinformatics M.E:800.707

Bethany Drehmanbethfoxglove@gmail.com Cheng Ran (Lisa) Huanghuangchengran@gmail.com Teaching assistants!

People with very diverse backgrounds in biology Some people with backgrounds in computer science and biostatistics Most people (will) have a favorite gene, protein, or disease Who is taking this course?

To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI To focus on the analysis of DNA, RNA and proteins To introduce you to the analysis of genomes To combine theory and practice to help you solve research problems What are the goals of the course?

Themes throughout the course Textbooks Web sites Literature references Gene/protein families Computer labs

Textbook The course textbook has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley, 2003). The seven lectures in this course correspond closely to chapters. An electronic version is available on the Welch Library website. A few copies will be available on reserve at Welch Library, and the library has six more copies. I recommend several other bioinformatics texts: Baxevanis and Ouellette David Mount Durbin et al.

Visit http://www.welch.jhu.edu Search for: “bioinformatics” in “ebook titles”

Web sites The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google “moodle bioinformatics”) --This site contains the powerpoints for each lecture. including color and black & white versions --The weekly quizzes are here --You can ask questions via the forum The textbook website is: http://www.bioinfbook.org This has powerpoints, URLs, etc. organized by chapter

Literature references You are encouraged to read original source articles (posted on moodle). They will enhance your understanding of the material. Readings are optional but recommended.

Themes throughout the course: gene/protein families We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course. Globins including hemoglobin and myoglobin carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts including --sequence alignment --gene expression --protein structure --phylogeny --homologs in various species

Computer labs There are three computer labs. I STRONGLY encourage you to bring a laptop to class. Also, the seven weekly quizzes function as a computer lab: to solve the questions, you may need to go to a website and use databases or software.

Grading 60% moodle quizzes (best six out of seven). Quizzes are taken at the moodle website, and are due one week after the relevant lecture 40% final exam Tuesday, January 12 (in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time.

Google “moodle bioinformatics” to get here; Click “Introduction to Bioinformatics” to sign in; The enrollment key is…

Outline for the course 1. Accessing information about DNA and proteinsNov. 17 2. Pairwise alignmentNov. 24 3. BLASTDec. 1 LAB #1 of 3Dec. 1 4. Multiple sequence alignmentDec. 8 5. Molecular phylogeny and evolutionDec. 15 LAB #2 of 3Dec. 15 6. ProteomicsDec. 22 7. Gene expression: microarraysJan. 5 LAB #3 of 3Jan. 5 Final examJan. 12

Outline for today Definition of bioinformatics Overview of the NCBI website Accessing information about DNA and proteins --Definition of an accession number --Four ways to find information on proteins and DNA Access to biomedical literature Pairwise alignment: introduction

Interface of biology and computers Analysis of proteins, genes and genomes using computer algorithms and computer databases Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects. What is bioinformatics?

On bioinformatics “Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.”

On bioinformatics “However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.” Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

Tool-users Tool-makers bioinformatics public health informatics medical informatics infrastructure databases algorithms

Three perspectives on bioinformatics The cell The organism The tree of life Page 4

DNARNAphenotypeprotein Page 5

Time of development Body region, physiology, pharmacology, pathology Page 5

After Pace NR (1997) Science 276:734 Page 6

DNARNAphenotypeprotein

Growth of GenBank Year Base pairs of DNA (billions) Sequences (millions) 198219861990199419982002 Fig. 2.1 Page 17

Growth of GenBank + Whole Genome Shotgun (1982-November 2008) Number of sequences in GenBank (millions) 0 50 100 150 200 250 198219871992199720022007 Base pairs of DNA in GenBank (billions) Base pairs in GenBank + WGS (billions)

DNARNAprotein Central dogma of molecular biology genometranscriptomeproteome Central dogma of bioinformatics and genomics

DNARNA cDNA ESTs UniGene phenotype genomic DNA databases protein sequence databases protein Fig. 2.2 Page 20

GenBankEMBLDDBJ There are three major public DNA databases The underlying raw DNA sequences are identical Page 16

GenBankEMBLDDBJ Housed at EBI European Bioinformatics Institute There are three major public DNA databases Housed at NCBI National Center for Biotechnology Information Housed in Japan Page 16

The Trace Archive at NCBI contains over 2 billion traces 11/08

Taxonomy at NCBI: ~200,000 species are represented in GenBank http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi11/08

The most sequenced organisms in GenBank Homo sapiens 13.1 billion bases Mus musculus 8.4b Rattus norvegicus 6.1b Bos taurus5.2b Zea mays 4.6b Sus scrofa3.6b Danio rerio 3.0b Oryza sativa (japonica) 1.5b Strongylocentrotus purpurata1.4b Nicotiana tabacum 1.1b Updated 11-6-08 GenBank release 168.0 Excluding WGS, organelles, metagenomics Table 2-2 Page 18

National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24

www.ncbi.nlm.nih.gov Fig. 2.5 Page 25

Fig. 2.5 Page 25

PubMed is… National Library of Medicine's search service 16 million citations in MEDLINE links to participating online journals PubMed tutorial (via “Education” on side bar) Page 24

Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

Entrez is a search and retrieval system that integrates NCBI databases Page 24

BLAST is… Basic Local Alignment Search Tool NCBI's sequence similarity search tool supports analysis of DNA and protein databases 100,000 searches per day Page 25

OMIM is… Online Mendelian Inheritance in Man catalog of human genes and genetic disorders created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI Page 25

Books is… searchable resource of on-line books Page 26

TaxBrowser is… browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms Page 26

Structure site includes… Molecular Modelling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST) Page 26

Outline for today Definition of bioinformatics Overview of the NCBI website Accessing information about DNA and proteins --Definition of an accession number --Five ways to find information on proteins and DNA Access to biomedical literature Pairwise alignment: introduction

Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4): X02775GenBank genomic DNA sequence NT_030059Genomic contig Rs7079946dbSNP (single nucleotide polymorphism) N91759.1An expressed sequence tag (1 of 170) NM_006744RefSeq DNA sequence (from a transcript) NP_007635RefSeq protein AAC02945GenBank protein Q28369SwissProt protein 1KT7Protein Data Bank structure record protein DNA RNA Page 27

Five ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) [5] UCSC Genome Browser Page 27

5 ways to access protein and DNA sequences [1] Entrez Gene with RefSeq Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635) Page 27

From the NCBI home page, type “beta globin” and hit “Go” revised 11/08 Fig. 2.7 Page 29

revised Fig. 2.7 Page 29

By applying limits, there are now fewer entries

revised Fig. 2.8 Page 30 Entrez Gene (top of page) Note that links to many other HBB database entries are available

Entrez Gene (middle of page)

Entrez Gene (middle of page, continued)

Entrez Gene (bottom of page): RefSeqs

Entrez Gene (bottom of page): non-RefSeq accessions

Fig. 2.9 Page 32

FASTA format: versatile, compact with >one header line followed by a string of nucleotides or amino acids in the single letter code Fig. 2.10 Page 32

What is an accession number? An accession number is label that used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples: X02775GenBank genomic DNA sequence NT_030059Genomic contig Rs7079946dbSNP (single nucleotide polymorphism) N91759.1An expressed sequence tag (1 of hundreds) NM_006744RefSeq DNA sequence (from a transcript) NP_007635RefSeq protein AAC02945GenBank protein Q28369SwissProt protein 1KT7Protein Data Bank structure record protein DNA RNA Page 27

NCBI’s important RefSeq project: best representative sequences RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats: Complete genomeNC_###### Complete chromosomeNC_###### Genomic contigNT_###### mRNA (DNA format)NM_###### e.g. NM_006744 ProteinNP_###### e.g. NP_006735 Page 29-30

Accession MoleculeMethodNote AC_123456 GenomicMixedAlternate complete genomic AP_123456 ProteinMixedProtein products; alternate NC_123456 GenomicMixedComplete genomic molecules NG_123456 GenomicMixedIncomplete genomic regions NM_123456 mRNAMixedTranscript products; mRNA NM_123456789 mRNAMixedTranscript products; 9-digit NP_123456 ProteinMixedProtein products; NP_123456789 ProteinCurationProtein products; 9-digit NR_123456 RNAMixedNon-coding transcripts NT_123456 GenomicAutomatedGenomic assemblies NW_123456 GenomicAutomatedGenomic assemblies NZ_ABCD12345678 GenomicAutomatedWhole genome shotgun data XM_123456 mRNAAutomatedTranscript products XP_123456 ProteinAutomatedProtein products XR_123456 RNAAutomatedTranscript products YP_123456 ProteinAuto. & CuratedProtein products ZP_12345678 ProteinAutomatedProtein products NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences

DNARNA complementary DNA (cDNA) protein UniGene Fig. 2.3 Page 23

UniGene: unique genes via ESTs Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21

Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1 Fig. 2.3 Page 23

Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10

Cluster sizes in UniGene (human) Cluster size (ESTs) Number of clusters 1  40,300 218,500 3-418,000 5-813,400 9-168,100 17-325,200  500-10001,900  1000-4000940  4000-16,00074  16,000-65,0008 UniGene build 216, 11/08 16000:70000[ESTC]

UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on January 5 (gene expression). Page 31

Ensembl to access protein and DNA sequences Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

click human

enter RBP4

ExPASy to access protein and DNA sequences ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System) Visit http://www.expasy.ch/ Page 33

Fig. 2.11 Page 33

[1] Visit http://genome.ucsc.edu/, click Genome Browser [2] Choose organisms, enter query (beta globin), hit submit

Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34

Searching for HIV-1 pol: Following the “genome” link yields a manageable five results

Example of how to access sequence data: HIV-1 pol For the Entrez query: hiv-1 pol there are about 80,000 nucleotide or protein records (and >200,000 records for a search for “hiv-1”), but these can easily be reduced in two easy steps: --specify the organism, e.g. hiv-1[organism] --limit the output to RefSeq! Page 34

only 1 RefSeq over 200,000 nucleotide entries for HIV-1

Examples of how to access sequence data: histone query for “histone”# results protein records21847 RefSeq entries7544 RefSeq (limit to human)1108 NOT deacetylase697 At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein. 8-12-06

Outline for today Definition of bioinformatics Overview of the NCBI website Accessing information about DNA and proteins --Definition of an accession number --Four ways to find information on proteins and DNA Access to biomedical literature Pairwise alignment: introduction

PubMed at NCBI to find literature information

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >18 million records dating back to 1950s. Page 35 Updated 11-08

MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature. Page 35

PubMed search strategies Try the tutorial (“education” on the left sidebar) Use boolean queries (capitalize AND, OR, NOT) lipocalin AND disease Try using “limits” Try “Links” to find Entrez information and external resources Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/ Page 35

lipocalin AND disease (60 results) lipocalin OR disease (1,650,000 results) lipocalin NOT disease (530 results) 1 AND 2 1 OR 2 1 NOT 2 1 1 1 2 2 2 Fig. 2.12 Page 34

true positive “globin” is found “globin” is not found “globin” is present “globin” is absent Article contents: Search result: true negative false negative ( article discusses globins ) false positive ( article does not discuss globins )

WelchWeb is available at http://www.welch.jhu.edu

Welch Medical Library liasons to the basic sciences WelchWeb is available at http://www.welch.jhu.edu

November 24, 2008 Pairwise sequence alignment Jonathan Pevsner, Ph.D. Bioinformatics Johns Hopkins M.E:440.707

Outline: pairwise alignment Overview and examples Definitions: homologs, paralogs, orthologs Assigning scores to aligned amino acids: Dayhoff’s PAM matrices Alignment algorithms: Needleman-Wunsch, Smith-Waterman Statistical significance of pairwise alignments

Pairwise alignments in the 1950s  -corticotropin (sheep) Corticotropin A (pig) ala gly glu asp asp glu asp gly ala glu asp glu Oxytocin Vasopressin CYIQNCPLG CYFQNCPRG

Early example of sequence alignment: globins (1961) H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190:670-672, 1961. myoglobin  globins:

It is used to decide if two proteins (or genes) are related structurally or functionally It is used to identify domains or motifs that are shared between proteins It is the basis of BLAST searching (next week) It is used in the analysis of genomes Pairwise sequence alignment is the most fundamental operation of bioinformatics

Pairwise alignment: protein sequences can be more informative than DNA protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties codons are degenerate: changes in the third position often do not alter the amino acid that is specified protein sequences offer a longer “look-back” time DNA sequences can be translated into protein, and then used in pairwise alignments

DNA can be translated into six potential proteins 5’ CAT CAA 5’ ATC AAC 5’ TCA ACT 5’ GTG GGT 5’ TGG GTA 5’ GGG TAG Pairwise alignment: protein sequences can be more informative than DNA 5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’ 3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Pairwise alignment: protein sequences can be more informative than DNA Many times, DNA alignments are appropriate --to confirm the identity of a cDNA --to study noncoding regions of DNA --to study DNA polymorphisms --example: Neanderthal vs modern human DNA Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240 |||||||| |||| |||||| ||||| | ||||||||||||||||||||||||||||||| Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

retinol-binding protein 4 (NP_006735)  -lactoglobulin (P02754) Page 42

Outline: pairwise alignment Overview and examples Definitions: homologs, paralogs, orthologs Assigning scores to aligned amino acids: Dayhoff’s PAM matrices Alignment algorithms: Needleman-Wunsch, Smith-Waterman Statistical significance of pairwise alignments

Pairwise alignment The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Definitions

Homology Similarity attributed to descent from a common ancestor. Definitions Page 42

Homology Similarity attributed to descent from a common ancestor. Definitions Identity The extent to which two (nucleotide or amino acid) sequences are invariant. Page 44 RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59 + K++ + ++ GTW++MA + L + A glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55

Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Definitions: two types of homology Page 43

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs. common carp zebrafish rainbow trout teleost African clawed frog chicken mouse rat rabbitcowpig horse human 10 changes Page 43

Paralogs: members of a gene (protein) family within a species apolipoprotein D retinol-binding protein 4 Complement component 8 prostaglandin D2 synthase neutrophil gelatinase- associated lipocalin 10 changes Lipocalin 1 Odorant-binding protein 2A progestagen- associated endometrial protein Alpha-1 Microglobulin /bikunin Page 44

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise alignment of retinol-binding protein 4 and  -lactoglobulin Page 46

Similarity The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. Identity The extent to which two sequences are invariant. Conservation Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico- chemical properties of the original residue. Definitions

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise alignment of retinol-binding protein and  -lactoglobulin Identity (bar) Page 46

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise alignment of retinol-binding protein and  -lactoglobulin Somewhat similar (one dot) Very similar (two dots) Page 46

Pairwise alignment The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Definitions Page 47

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise alignment of retinol-binding protein and  -lactoglobulin Internal gap Terminal gap Page 46

Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. In BLAST, it is rarely necessary to change gap values from the default. Gaps

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP. ||| |. |... | :.||||.:| : 1...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: |.|. || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP || ||. | :.|||| |..| 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP. | | | : ||. | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin Pairwise alignment of retinol-binding protein and  -lactoglobulin

1.MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48 :: || || ||.||.||..| :|||:.|:.| |||.||||| 1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47..... 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98 |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || | 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97..... 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148 ||||||:||| ||:|| ||||||::||||| ||: ||||..||||| | 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147..... 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199 |||:||| | || || |||| :..|:|.|| : | |:|: 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192 Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)

43210 Pairwise sequence alignment allows us to look back billions of years ago (BYA) Origin of life Origin of eukaryotes insects Fungi/animal Plant/animal Earliest fossils Eukaryote/ archaea Page 48

fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases Page 49

~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1 LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo. VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass. VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD... complement component 8 Multiple sequence alignment of human lipocalin paralogs Page 49

Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner Bioinformatics M.E:800.707.

Similar presentations

Presentation on theme: "Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner Bioinformatics M.E:800.707."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner Bioinformatics M.E:800.707.

Similar presentations

Presentation on theme: "Introduction to Bioinformatics Monday, November 17, 2008 Jonathan Pevsner Bioinformatics M.E:800.707."— Presentation transcript:

Similar presentations

About project

Feedback