Bioinformatics and Data Warehousing 1)Introduction to Bioinformatics 2)FASTA File Format 3)Searching Gene Sequences (BLAST) 4)Data Management in Biomedical.

Presentation transcript:

Bioinformatics and Data Warehousing
1) Introduction to Bioinformatics
2) FASTA File Format
3) Searching Gene Sequences (BLAST)
4) Data Management in Biomedical Informatics
Michael Kane, Ph.D., Computer & Information Technology, Bindley Bioscience Center, Purdue University

DNA is Information Storage

“Zipped Files” Decompression “Executable Files”

DNA is Double Stranded – One strand is the “coding strand” and the other strand is there to stabilize the DNA sequence when not in use. Double-stranded DNA is very durable in our environment.

CAGGACCATGGAACTCAGCGTCCTCCTCTTCCTTGCACTCCTCACAGGACTCTTGCTACT CCTGGTTCAGCGCCACCCTAACACCCATGACCGCCTCCCACCAGGGCCCCGCCCTCTG CCCCTTTTGGGAAACCTTCTGCAGATGGATAGAAGAGGCCTACTCAAATCCTTTCTGAG GTTCCGAGAGAAATATGGGGACGTCTTCACGGTACACCTGGGACCGAGGCCCGTGGTC ATGCTGTGTGGAGTAGAGGCCATACGGGAGGCCCTTGTGGACAAGGCTGAGGCCTTCT CTGGCCGGGGAAAAATCGCCATGGTCGACCCATTCTTCCGGGGATATGGTGTGATCTTT GCCAATGGAAACCGCTGGAAGGTGCTTCGGCGATTCTCTGTGACCACTATGAGGGACTT CGGGATGGGAAAGCGGAGTGTGGAGGAGCGGATTCAGGAGGAGGCTCAGTGTCTGAT AGAGGAGCTTCGGAAATCCAAGGGGGCCCTCATGGACCCCACCTTCCTCTTCCAGTCC ATTACCGCCAACATCATCTGCTCCATCGTCTTTGGAAAACGATTCCACTACCAAGATCAA GAGTTCCTGAAGATGCTGAACTTGTTCTACCAGACTTTTTCACTCATCAGCTCTGTATTCG GCCAGCTGTTTGAGCTCTTCTCTGGCTTCTTGAAATACTTTCCTGGGGCACACAGGCAA GTTTACAAAAACCTGCAGGAAATCAATGCTTACATTGGCCACAGTGTGGAGAAGCACCG TGAAACCCTGGACCCCAGCGCCCCCAAGGACCTCATCGACACCTACCTGCTCCACATG GAAAAAGAGAAATCCAACGCACACAGTGAATTCAGCCACCAGAACCTCAACCTCAACA CGCTCTCGCTCTTCTTTGCTGGCACTGAGACCACCAGCACCACTCTCCGCTACGGCTTC CTGCTCATGCTCAAATACCCTCATGTTGCAGAGAGAGTCTACAGGGAGATTGAACAGGT GATTGGCCCACATCGCCCTCCAGAGCTTCATGACCGAGCCAAAATGCCATACACAGAGG CAGTCATCTATGAGATTCAGAGATTTTCCGACCTTCTCCCCATGGGTGTGCCCCACATTG TCACCCAACACACCAGCTTCCGAGGGTACATCATCCCCAAGGACACAGAAGTATTTCTC ATCCTGAGCACTGCTCTCCATGACCCACACTA

THEREDCAT_HSDKLSD_WASNOTHOTBUT_WKKNASDN KSAOJ.ASDNALKS_WASWET_ASDFLKSDOFIJEIJKNAW DFN_ANDMAD_WERN.JSNDFJN_YETSAD_MNSFDGPOIJ D_BUTTHEFOX_SDKMFIDSJIR.JER_GOTWET_JSN.DFOI AMNJNER_ANDATEHIM.

Start with a thin 2 x 4 lego block… Add a 2 x 2 lego block… Add a 2 x 3 lego block… Add a 2 x 4 lego block…

What are the comparative genome sizes of humans and other organisms being studied?

| Organism | Estimated size | Estimated gene number | Average gene density | Chromosome number |
|---|---|---|---|---|
| Homo sapiens (human) | 2900 million bases | ~30,000 | 1 gene per 100,000 bases | 46 |
| Rattus norvegicus (rat) | 2750 million bases | ~30,000 | 1 gene per 100,000 bases | 42 |
| Mus musculus (mouse) | 2500 million bases | ~30,000 | 1 gene per 100,000 bases | 40 |
| Drosophila melanogaster (fruit fly) | 180 million bases | 13,600 | 1 gene per 9,000 bases | 8 |
| Arabidopsis thaliana (plant) | 125 million bases | 25,500 | 1 gene per 4000 bases | 5 |
| Caenorhabditis elegans (roundworm) | 97 million bases | 19,100 | 1 gene per 5000 bases | 6 |
| Saccharomyces cerevisiae (yeast) | 12 million bases | 6300 | 1 gene per 2000 bases | 16 |
| Escherichia coli (bacteria) | 4.7 million bases | 3200 | 1 gene per 1400 bases | 1 |
| H. influenzae (bacteria) | 1.8 million bases | 1700 | 1 gene per 1000 bases | 1 |

Genome size does not correlate with evolutionary status, nor is the number of genes proportional to genome size.

FASTA File Format

>gi| |emb|CAA | myosin-IF [Homo sapiens]
QEKLTSRKMDSRWGGRSESINVTLNVEQAAYTRDALAKGLYARLFDFLVEAINRAMQKPQEEYSIGVLDI
YGFEIFQKNGFEQFCINFVNEKLQQIFIELTLKAEQEEYVQEGIRWTPIQYFNNKVVCDLIENKLSPPGI
MSVLDDVCATMHATGGGADQTLLQKLQAAVGTHEHFNSWSAGFVIHHYAGKVSYDVSGFCERNRDVLFSD
LIELMQSSDQAFLRMLFPEKLDGDKKGRPSTAGSKIKKQANDLVATLMRCTPHYIRCIKPNETKHARDWE
ENRVQHQVEYLGLKENIRVRRAGFAYRRQFAKFLQRYAILTPETWPRWRGDERQGVQHLLRAVNMEPDQY
QMGSTKVFVKNPESLFLLEEVRERKFDGFARTIQKAWRRHVAVRKYEEMREEASNILLNKKERRRNSINR
NFVGDYLGLEERPELRQFLGKKERVDFADSVTKYDRRFKPIKRDLILTPKCVYVIGREKMKKGPEKGPVC
EILKKKLDIQALRGVSLSTRQDDFFILQEDAADSFLESVFKTEFVSLLCKRFEEATRRPLPLTFSDTLQF
RVKKEGWGGGGTRSVTFSRGFGDLAVLKVGGRTLTVSVGDGLPKNSKPTGKGLAKGKPRRSSQAPTRAAP
GAPQGMDRNGAPLCPQGGAPCPLEKFIWPRGHPQASPALRPHPWDASRRPRARPPSEHNTEFLNVPDQGM
AGMQRKRSVGQRPVPVGRPKPQPRTHGPRCRALYQYVGQDVDELSFNVNEVIEILMEDPSGWWKGRLHGQ
EGLFPGNYVEKI

TinySeq XML

X Homo sapiens Homo sapiens partial mRNA for myosin-IF 2711 CAGGAGAAGCTGACCAGCCGCAAGATGGACAGCCGCTGGGGCGGGCGCAGCGAGTCCATCAATGT……

FASTA File Format DATABASE (DATA WAREHOUSE)

>GENE NUMBER ONE
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER TWO
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER THREE
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER FOUR
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER FIVE
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER SIX
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER SEVEN
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
>GENE NUMBER TWENTY MILLION
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
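A FASTA database like the one above is plain text: each record is a ">" header line followed by one or more sequence lines. As a minimal sketch (the function name `parse_fasta` and the two-record sample are illustrative assumptions, not part of the slides), such a file could be read into header/sequence pairs like this:

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical sample: the first record is wrapped over two lines.
example = """\
>GENE NUMBER ONE
AGCTGCTAGTAGAGTCGCTC
GGATAGGACTGCTAGCTGC
>GENE NUMBER TWO
AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC
"""

records = dict(parse_fasta(example.splitlines()))
for name, seq in records.items():
    print(name, len(seq))
```

Because the parser is a generator, it can also be fed an open file handle, so a very large database does not have to be held in memory all at once.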

DYNAMIC PROGRAMMING and SEQUENCE SEARCHES

'Dynamic programming' is an efficient programming technique for solving certain combinatorial problems. It is particularly important in bioinformatics because it is the basis of the sequence alignment algorithms used to compare protein and DNA sequences. In this application, dynamic programming gives a spectacular efficiency gain over a purely recursive algorithm, because each subproblem is solved once and its result is reused.

Don't expect much enlightenment from the etymology of the term 'dynamic programming,' though. Dynamic programming was formalized in the early 1950s by mathematician Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning, thus 'dynamic' and 'programming' (note: nothing in particular to do with computer programming). Bellman especially liked 'dynamic' because "it's impossible to use the word dynamic in a derogatory sense"; he figured dynamic programming was "something not even a Congressman could object to."
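To make the efficiency gain concrete, here is a small sketch that counts how many times an alignment-scoring function is evaluated with and without remembering earlier results. The scoring scheme (match = 1, mismatch = 0, gap = 0) and the helper name `alignment_score` are assumptions chosen for illustration, and memoization via `functools.lru_cache` stands in for the table that a dynamic programming algorithm fills.

```python
from functools import lru_cache


def alignment_score(a: str, b: str, memoize: bool):
    """Return (best score, number of times the scoring body ran)."""
    calls = 0

    def score(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        if i == len(a) or j == len(b):
            return 0  # only gaps remain, and gaps cost nothing here
        pair = 1 if a[i] == b[j] else 0  # assumed match/mismatch scores
        return max(
            score(i + 1, j + 1) + pair,  # align a[i] with b[j]
            score(i + 1, j),             # gap opposite a[i]
            score(i, j + 1),             # gap opposite b[j]
        )

    if memoize:
        # Rebinding the name makes the recursive calls go through the
        # cache, so each (i, j) subproblem is computed only once.
        score = lru_cache(maxsize=None)(score)

    return score(0, 0), calls


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
naive_score, naive_calls = alignment_score(seq1, seq2, memoize=False)
memo_score, memo_calls = alignment_score(seq1, seq2, memoize=True)
print("purely recursive:", naive_score, "calls:", naive_calls)
print("with reuse:      ", memo_score, "calls:", memo_calls)
```

Both versions return the same score, but the version that reuses results evaluates at most one call per cell of a (len(seq1) + 1) by (len(seq2) + 1) table, while the purely recursive version revisits the same subproblems over and over.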

The following is an example of global sequence alignment using the Needleman-Wunsch technique. For this example, the two sequences to be globally aligned are:

G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

Initialization Step

Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step
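The matrix itself did not survive in this transcript, so the following is only a sketch of how the fill step could be computed for the two example sequences. The scores match = 1 and mismatch = 0 are assumptions (the slides state only that gaps carry no penalty), and `fill_matrix` is a hypothetical helper name.

```python
def fill_matrix(seq1: str, seq2: str, match: int = 1, mismatch: int = 0, gap: int = 0):
    """Fill a Needleman-Wunsch score matrix (seq1 across, seq2 down)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]

    # Initialization: with gap = 0 the first row and column stay at 0.
    for j in range(cols):
        m[0][j] = j * gap
    for i in range(rows):
        m[i][0] = i * gap

    # Fill: each cell is the best of the three moves that can reach it.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq2[i - 1] == seq1[j - 1] else mismatch
            m[i][j] = max(
                m[i - 1][j - 1] + pair,  # diagonal: align the two letters
                m[i - 1][j] + gap,       # from above: gap in seq1
                m[i][j - 1] + gap,       # from the left: gap in seq2
            )
    return m


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
matrix = fill_matrix(seq1, seq2)

print("    " + " ".join(seq1))
for label, row in zip(" " + seq2, matrix):
    print(label, " ".join(str(v) for v in row))
print("best global alignment score:", matrix[-1][-1])
```

The bottom-right cell holds the score of the best global alignment; under these assumed scores it simply counts the matched letters.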

Traceback Step

(Seq #1) A
         |
(Seq #2) A

(Seq #1) T A
           |
(Seq #2) _ A

(Seq #1) T T A
             |
(Seq #2) _ _ A

G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A

_ G A A T T C A G T T A
  | |     | |   |     |
G G A _ _ T C _ G _ _ A

There are multiple solutions to this alignment, and most dynamic programming algorithms print out only a single solution.
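The traceback can be sketched as a walk from the bottom-right cell of the score matrix back to the top-left, choosing at each cell a move that explains its score. The code below is a minimal illustration under assumed scores (match = 1, mismatch = 0, gap = 0); the function name `needleman_wunsch` and the tie-breaking order (diagonal first, then a gap in sequence #2) are choices made here, so it reports one optimal alignment that need not be identical to either alignment shown above.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1, mismatch: int = 0, gap: int = 0):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Matrix fill step.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback step: rebuild the alignment from the end to the start.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])   # letter of seq1 opposite a gap
            bottom.append("_")
            i -= 1
        else:
            top.append("_")           # letter of seq2 opposite a gap
            bottom.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned1, aligned2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
bars = "".join("|" if a == b else " " for a, b in zip(aligned1, aligned2))
print(" ".join(aligned1))
print(" ".join(bars))
print(" ".join(aligned2))
print("score:", best)
```

Changing the order in which the three moves are tested selects a different member of the set of equally good alignments, which is why different programs may print different answers for the same input.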

BLAST (Basic Local Alignment Search Tool)

Why is BLAST so fast? Because the database records are pre-indexed by every possible 11-letter word (4^11 = 4,194,304 possible words).

Index entries: .. GTCGTAGTCG ATCGTAGTCG CTCGTAGTCG .

Steps:
1) Find all the 11-letter words in your query sequence, plus a few variations.
2) Look these up in the 11-letter-word index.
3) Retrieve all sequences containing those words.
4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
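Steps 1 to 3 can be illustrated with a toy word index. This is a simplified sketch rather than BLAST's actual implementation: it indexes only exact 11-letter words (the "few variations" of step 1 are left out), stops before the extension in step 4, and the names `build_word_index` and `find_seeds` as well as the tiny database are hypothetical.

```python
from collections import defaultdict

WORD_SIZE = 11
print(4 ** WORD_SIZE)  # number of possible 11-letter DNA words: 4194304


def build_word_index(records: dict, k: int = WORD_SIZE) -> dict:
    """Map every k-letter word to the (record name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in records.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query: str, index: dict, k: int = WORD_SIZE) -> list:
    """Look up each k-letter word of the query; return (name, query pos, db pos) hits."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((name, qpos, dpos))
    return hits


# Hypothetical two-record database and a query sharing 14 letters with GENE A.
database = {
    "GENE A": "AGCTGCTAGTAGAGTCGCTCGGATAGGACTGCTAGCTGC",
    "GENE B": "TTTTTTTTTTTTTTTTTTTTTTTT",
}
index = build_word_index(database)      # done once, ahead of any query
query = "CCCC" + "GTAGAGTCGCTCGG" + "CCCC"
seeds = find_seeds(query, index)
candidates = {name for name, _, _ in seeds}
print(seeds)
print(candidates)
```

The index is built once, so each query costs only a handful of dictionary lookups before the slower, rigorous alignment is run on the few candidate records. Hits that share the same offset between query position and database position lie on one diagonal, which is where an extension step would start.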

>UNKNOWN GENE SEQUENCE AGGACCATGGAACTCAGCGTCCTCCTCTTCCTTGCACTCCTCACAGGACTCTTGCTACTC CTGGTTCAGCGCCACCCTAACACCCATGACCGCCTCCCACCAGGGCCCCGCCCTCTGCC CCTTTTGGGAAACCTTCTGCAGATGGATAGAAGAGGCCTACTCAAATCCTTTCTGAGGTT CCGAGAGAAATATGGGGACGTCTTCACGGTACACCTGGGACCGAGGCCCGTGGTCATGC TGTGTGGAGTAGAGGCCATACGGGAGGCCCTTGTGGACAAGGCTGAGGCCTTCTCTGGC CGGGGAAAAATCGCCATGGTCGACCCATTCTTCCGGGGATATGGTGTGATCTTTGCCAAT GGAAACCGCTGGAAGGTGCTTCGGCGATTCTCTGTGACCACTATGAGGGACTTCGGGAT GGGAAAGCGGAGTGTGGAGGAGCGGATTCAGGAGGAGGCTCAGTGTCTGATAGAGGAG CTTCGGAAATCCAAGGGGGCCCTCATGGACCCCACCTTCCTCTTCCAGTCCATTACCGCC AACATCATCTGCTCCATCGTCTTTGGAAAACGATTCCACTACCAAGATCAAGAGTTCCTG AAGATGCTGAACTTGTTCTACCAGACTTTTTCACTCATCAGCTCTGTATTCGGCCAGCTGT TTGAGCTCTTCTCTGGCTTCTTGAAATA

[Diagram labels: Genomic Database (DATA WAREHOUSE), with sample record ">GENE Agtgctcgatagatcg ctcgcata…"; New Gene Sequences (1 per second!); RESULTS; Results DB; Gene Cloning DB]