CSE 5290: Algorithms for Bioinformatics
Fall 2009
Suprakash Datta
datta@cs.yorku.ca
Office: CSEB 3043
Phone: 416-736-2100 ext 77875
Course page: http://www.cs.yorku.ca/course/5290
4/29/2019 CSE 5290, Fall 2009

My research
– Computer networks …
– Clustering of biological data, e.g.
  • Flow cytometry data
  • Microarray data
– Genomic signal processing
  • Convert biological sequences to numerical sequences and apply signal processing tools
  • e.g. exon prediction, retroviral insertions

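As a rough illustration of "convert biological sequences to numerical sequences and apply signal processing tools", the sketch below uses one common mapping, the binary indicator representation (one 0/1 sequence per nucleotide), and evaluates the discrete Fourier transform at the frequency corresponding to period 3. The slide does not say which mapping or transform is actually used, so both are assumptions here; the function names and the toy sequence are made up for illustration.

```python
import cmath


def indicator_sequences(dna):
    """Map a DNA string to four binary sequences, one per nucleotide."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ACGT"}


def spectral_power(dna, k):
    """Power at DFT index k, summed over the four indicator sequences."""
    n = len(dna)
    total = 0.0
    for seq in indicator_sequences(dna).values():
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(seq))
        total += abs(coeff) ** 2
    return total


# Toy input (invented for illustration): a perfectly 3-periodic sequence.
toy = "ATG" * 10
n = len(toy)
period3 = spectral_power(toy, n // 3)        # index n/3 corresponds to period 3
neighbour = spectral_power(toy, n // 3 + 1)  # nearby frequency for comparison
print(period3 > neighbour)
```

Protein-coding regions tend to show a stronger period-3 component than non-coding regions, which is why a peak at this frequency is used as a signal in exon prediction. Real sequences are far noisier than the toy input, so practical methods slide a window along the genome and compare the peak against a background level.
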
Administrivia
– Lectures: Tue-Thu 1:00 - 2:30 pm (Ross S 537)
– Office hours: Wed 1-4 pm, or by appointment
– TA: none
– Webpage: http://www.cs.yorku.ca/course/5290
  • All announcements and handouts will be published on the webpage (check often for updates)
– Textbook: An Introduction to Bioinformatics Algorithms, Neil C. Jones and Pavel A. Pevzner, MIT Press, August 2004

Administrivia – contd.
– Grading (described in more detail on the webpage)
  • Midterms: 30%
  • Homework: 30%
  • Project: 40%
– Grades will be on ePost
– Project details are on the webpage

Course objectives
– Familiarity with computational problems in Biology
– Applying algorithmic ideas
– Understand real-life computational challenges
– Improve understanding of algorithms

What I expect from you
– Some familiarity with undergraduate algorithms
– Interest in computational problems
– Willingness to pick up a little Biology
– Active interest in your project and assignments

What is bioinformatics?
– No consensus!
  • Genomics
  • Proteomics
  • Evolutionary biology
  • Clinical trial informatics
  • Epidemiology?
  • Medical image processing?
  • Artificial life?
– From http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html:
  “Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline.”

Why Bioinformatics?
– Make an impact!
– Interdisciplinary work
– Work with real data sets
– Use algorithmic skills

Biological (genomic) data ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAACTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAATCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTGCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCATGGCAGTGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAATAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCACCAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAGTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGAAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGCGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATT
TGTTCCATTGCTTTGTCAAATGGATCATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATTAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACACAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCCACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT 4/29/2019 CSE 5290, Fall 2009
Annotated data 4/29/2019 CSE 5290, Fall 2009 ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCACTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGAGTTGTATGTATTTGGCCTTATGTAGCTCGCGCCCGTTCGAGATAAGGATGTTTCTAGAAATCCGTAAAGATATAGAGATGTACACACATCTACATTTGTAACTCTATTTATAGTTAGAAACTTGTCCTCGAGGTCTCTCTATAAACCTTTTTGTACGTCCATAAATGTGGAAATCTACCGATCTTTTTGTCTCCGTATATAGGGAAACAGCGTTTTCGCTATACCTGGGTACAAACAGAGTTTTGTAGCTCCACGTTTCGCTGTCTCGTTCCGTGGAGCCCTGGGGGTCCTTAGACATATACTCTTTTTACATAGTTGGATGGGGGTCTGTACCTAGTTAGCTCTAGCTCGAGACAGCGATAGAGAATTTTGTATACTTGTCCGTTTACGTGTACCCGGCGATCTGTCTATGTCTATGGATAGCCACGTTTATGTCGTTTTGTAGGTCGTTGTATATCGATATATAGAGCGCGGATAATTAGGTAGGTCGACCGCGCTGTGGCTCTATCTCTAGTTATTTGTAGGTCGATGTGTAGATGTAATTCTAGCTGGACATCCATACCTACCTGTGTTTGTAGGTATTTCCATAAAACCACAGCGATGTTTGTAGAAAACGCGCGCTACCCCTACACCGCTATATACATAATATATCTCTGTACAAAGATGTATAGAGATAAAGACACAGTTCGAAACCTATCGACTTGGACAAACAGTTGTTTATTTTTAAGTCGCTCGACCGAACTAGTTACACCGAGATCGATTTGTTTCTCTATACACCTCTCTCTGTGGAGAAACAGAGCGAGAAGTAGATTTCGAGAAGCCACCGGGACAATTACAGAAAGCGGTAGATTTACATACAAAGAAGGAGACTTATCGATACACATAGAGGTATCGATAACGATGTATACCTACATCCAGCTCCATACCTAAAGGTAGAAAGACATGTGTCGACATGTTTACGTTTAGATATGGACCTAGATATGTCTACGGACACGAGACTTACACGTCTAGATAGTTGTGTATTTCTCTCTGGACAGATCTGTAAAGGTACGTCTACATGTCGATATCGGTCTGTCGGTAGATAAAAATCTATATTTAGACCGATAACTAGGTCTCGATGTCGTTCTCTAACGATGGACCTGTAGACCGAAAAAGAACTTTTTGTTTTCCACAA
GTCTAGACTTTTTTGGGTCTAGACACTTGTGGAATTCGAGATAGGGCTCTCCCTCTGGCTCTATAGATCGACGGGTATCGCTATCGAGACGTGGGTCGAGATGGGTATCTCGCTATACATGGATTTCCAACCTTGTAGGTGTCTCTCGAAGGCGGTAGGGACACAAAATAGCTGTAGCTACAACTACGTATCGATACATAAAGAGCTACAAATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTACGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCATACCGCCAGGTACGTACGTATACAGAAATACATGTATCTGTGGATATCCGTACATCGAGCCACATATCCCTTTAACTGGCGAAATATACTTATACCGAAAATTAGAGGGAACGCGGTATATGTACGACCGACACAATGAAACTAGATTGCGTAATTTCTAGTGTAAACAAATATGGCTATCTAAATGTCTCTAGGTACATCGAAAGAAAGTTACATATATTTAAATCGATAACTACGTAGATGGGTTTCTAGTTGTAGAGCGACAAATCTCGAAAGCTCTTTTTGGAGAGGTAGATATATAGTATATATCGCTGTCGAAGTATACAAATATCTACTTCGATAACTAACCAACGGTATCGGTCTAGAAAAGTCTCGCCAGGTCCGTAAACAAAGAGGTACATAACGAGACCGGTGGGTGTTTCGGTATACACTTGTGGGTATCGAGACATGTATGTTTGTGTGTAACTATATCCAAGGTCTTTGTGGACTTGTAGAGGTGTATCTGTCGATATTTACGTC 4/29/2019 CSE 5290, Fall 2009
Importance of algorithms
– Compare human vs. mouse (blocks of 1,000 nucleotides)
  • 3,000,000 × 3,000,000 comparisons, each 1,000 × 1,000 operations (with dynamic programming)
  • At 1 trillion operations per second, it would take 104 days
– Search all regulatory motifs of length 20 (11^20) in the human genome
  • 426 years

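The 104-day figure follows directly from the numbers on the slide, as the short Python check below shows. The 426-year figure is reproduced only under an assumption that is not stated on the slide: that each of the 11^20 candidate motifs costs about 20 operations (one per position), at the same rate of 1 trillion operations per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25
OPS_PER_SECOND = 10**12  # "1 trillion operations per second" from the slide

# Human vs. mouse: 3,000,000 x 3,000,000 block comparisons,
# each filling a 1,000 x 1,000 dynamic-programming table.
comparisons = 3_000_000 * 3_000_000
ops_per_comparison = 1_000 * 1_000
alignment_days = (comparisons * ops_per_comparison
                  / OPS_PER_SECOND / SECONDS_PER_DAY)

# Motif search: 11^20 candidate motifs of length 20.
# ASSUMPTION (not on the slide): about 20 operations per candidate.
motif_candidates = 11**20
assumed_ops_per_candidate = 20
motif_years = (motif_candidates * assumed_ops_per_candidate
               / OPS_PER_SECOND / SECONDS_PER_DAY / DAYS_PER_YEAR)

print(f"alignment: about {alignment_days:.0f} days")
print(f"motif search: about {motif_years:.0f} years")
```

Either way, the point of the slide stands: brute force is out of reach at genome scale, so the choice of algorithm matters more than the speed of the machine.
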
Clustering flow cytometry data
– 1 million vectors
– Each of length 25 (real numbers)
– Need quick output!
– Results should be biologically meaningful!

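A back-of-the-envelope sketch of why "quick output" is demanding at this scale. It assumes 64-bit floating-point storage and a clustering method that needs every pairwise distance; the slide names neither, so both are assumptions for illustration.

```python
n_vectors = 1_000_000  # from the slide
dims = 25              # from the slide
BYTES_PER_VALUE = 8    # ASSUMPTION: 64-bit floats

# Raw size of the data set in megabytes.
data_megabytes = n_vectors * dims * BYTES_PER_VALUE / 10**6

# Number of distinct pairs if a method compares every vector with every other.
pairwise_distances = n_vectors * (n_vectors - 1) // 2

# Each Euclidean distance touches all `dims` coordinates.
pairwise_coordinate_ops = pairwise_distances * dims

print(f"data size: {data_megabytes:.0f} MB")
print(f"pairwise distances: {pairwise_distances:,}")
print(f"coordinate operations: {pairwise_coordinate_ops:,}")
```

The data itself fits comfortably in memory, but an all-pairs approach needs roughly half a trillion distance computations, which is why methods that avoid the full distance matrix are attractive here.
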
R: introduction
– Why R?
  • Lots of available libraries (statistics, machine learning, …)
  • Very good visualization capability
  • Free
  • Multiplatform
  • Easy to publish code
  • Biologists use it!

R – contd.
– Grew out of a popular statistics package
– Used extensively by statisticians and computational biologists
– Lots of resources (see class web page)
– Some similarities with MatLab

R – strengths and weaknesses
– Strengths
  • Allows very quick testing of ideas
  • Libraries available for most purposes
  • Allows integration with C code
– Weaknesses
  • Not as efficient as MatLab on matrix operations
  • Not very good at handling large data sets

Next class
– Ch 3 of text
– In the meantime…
  • Read Ch 1 and 2 on your own
  • Get familiar with R