Applying principles of computer science in a biological context

Presentation transcript:

BioInformatics: Applying principles of computer science in a biological context. Introduce myself. Summer research and senior thesis in the field of bioinformatics.

Outline: Biological Background Information; Problem Description; My Project; Previous Work; Senior Thesis.

Biological Data Sets: raw DNA sequences, macromolecular structures, genomes, protein sequences. Bioinformatics is a broad, new field, and I am dealing with only one very specific topic in it, not all of bioinformatics by any means. Macromolecular structures are the 3D structures of molecules. A genome is the complete set of genes of an organism. I am focusing on protein sequences.

What is a Protein Sequence? A string of amino acids, each represented by a single letter. There are 20 different amino acids. Typical proteins are about 300 amino acids long. Example: … I L V K M U T A N K V K M U … An amino acid is a chemical compound. Next slide: amino acid examples.

Examples of Amino Acids: I stands for isoleucine. There are only 20 amino acids.

Importance of Protein Sequences: Compare two or more sequences to determine similarities in their functions. A multiple alignment maximizes areas of similarity and serves as input for my analysis; web-based programs are available. Why are protein sequences important? If we have a protein from one species of fruit fly and a similar protein from another fruit fly, we can compare the proteins by comparing their sequences. The comparison is done through multiple alignments, for example with ClustalW.

Multiple Alignment Example: Shaded areas show regions of exact match. A dash is placed in the smaller protein sequence to achieve the alignment. Here are two protein sequences that we want to compare. If we line them up from left to right, two places exactly match. Inserting a "don't care" amino acid in the second sequence better matches the two sequences: now 3 places exactly match. We can get different degrees of similarity depending on how the sequences are aligned, and a multiple alignment maximizes the degree of similarity. After completing the multiple alignment process, we are left with a group of aligned protein sequences where a blank in a column represents either an exact match or a "don't care" position. This group of aligned sequences is called a primer. Redundancies in each column are then removed.
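The match counting described above can be sketched in a few lines of Python. This is only an illustration: the slide's actual sequences are not in the transcript, so the two sequences below are made up, and the function name count_matches is hypothetical. A dash stands for the "don't care" position and never counts as a match.

```python
def count_matches(a: str, b: str) -> int:
    """Count aligned columns that hold the same residue; a dash never matches."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


# Made-up sequences, chosen only to mirror the "two matches, then three" story.
first = "ILKVM"
second_ungapped = "ILVA"
second_gapped = "IL-VA"  # one "don't care" position inserted

print(count_matches(first, second_ungapped))
print(count_matches(first, second_gapped))
```

Lining the sequences up from left to right gives two matching columns here; inserting the dash raises that to three, which is the effect the slide illustrates.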

Evaluating Usefulness of a Primer: Similar does not always mean useful. Why? There are different ways of creating amino acids. Amino acids are coded by nucleotide triplets made of A, T, C and G. Triplet = codon.

Degeneracy Example: To determine the usefulness of a primer, we first write down the codons that can create each amino acid in the sequence. For instance, I is made from 3 different codons, and L is made from 6. We then see how many combinations of nucleotide triplets we could form using the codons. For I, each triplet has an A in the first column, so only an A can be used in the first slot. Only a T is found in the 2nd slot, and an A, C or T is in the 3rd. This means that only 3 combinations can be created out of I's codons. The more important case is L: it turns out that 8 different combinations can be made from the original 6 codons. The number at the bottom is called the degeneracy. Each amino acid has a degeneracy, and we use it to calculate a total degeneracy for the entire primer. The number at the bottom right is the degeneracy for the entire primer, and lower is better. Why? Lower means real similarity, that is, similarity at the nucleotide level. The fewer combinations that could be generated from the codons, the tighter the restriction on how the amino acid was originally formed. Thus, regions with lots of M's would have low degeneracy and therefore a high probability of usefulness.
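The per-amino-acid calculation above can be sketched as follows. The codon lists for I, L and M come from the standard genetic code; the function names are hypothetical. One assumption is made loudly: the transcript does not say how the per-amino-acid values are combined into the total, so the sketch multiplies them, which is the usual convention for degenerate primers but may not be what the project does.

```python
from math import prod

# Subset of the standard genetic code (DNA codons), enough for the slide's example.
CODONS = {
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "M": ["ATG"],
}


def degeneracy(amino_acid: str) -> int:
    """Multiply the number of distinct nucleotides seen in each of the 3 codon slots."""
    codons = CODONS[amino_acid]
    return prod(len({codon[slot] for codon in codons}) for slot in range(3))


def primer_degeneracy(primer: str) -> int:
    """ASSUMPTION: total degeneracy is the product of the per-amino-acid values."""
    return prod(degeneracy(amino_acid) for amino_acid in primer)


print(degeneracy("I"), degeneracy("L"), degeneracy("M"))
print(primer_degeneracy("ILM"))
```

For L the slots contain {T, C}, {T} and {A, C, G, T}, giving 2 x 1 x 4 = 8 combinations from only 6 codons, which matches the slide. M has a single codon, so it contributes a factor of 1.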

Current Methods: Client is Biology Professor Steven Horton. Manual search for primers. Manual calculation of degeneracy.

Requirements: Automate the task of finding primers. Automate the degeneracy calculation. Record and organize results. Analyze data to make predictions (pattern matching, data mining). Software engineering groups worked on the first 2 requirements.

Summer Research: Analyzed solutions made by the Software Engineering class in Spring 2003. Combined the good design features from each project. Made a prototype in Java.

Senior Thesis, Fall Term: Finished the prototype. Multiple window design. Made the algorithm more efficient using dynamic programming. As far as the interface goes, the aim is to give Professor Horton the same tools he has with pen and paper. This involves a multiple window design: Professor Horton will be able to select a specific primer in the main window and bring up another window to examine the primer more closely. Any number of primer windows can be open at a time. Another goal for the fall is to make the algorithm as efficient as possible. Primers of length k can be derived from primers of length k-1, and this property allows us to use dynamic programming to generate primers. An ongoing objective of the project for me is to become more familiar with the biological concepts that relate to protein functions. To achieve this, regular meetings with Professor Horton will be scheduled. Understanding the biology behind primer generation will help me ask more intelligent questions of Professor Horton in order to expand the capabilities of the system.
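The transcript does not show the actual algorithm, so the following Python sketch only illustrates the stated property that a result for length k can be built from a result for length k-1. It assumes, hypothetically, that a candidate primer is a contiguous window of residues and that its degeneracy is the product of per-residue degeneracies; the names RESIDUE_DEGENERACY and window_degeneracies are invented for the example.

```python
# Per-residue degeneracies taken from the slide's example (I = 3, L = 8, M = 1).
RESIDUE_DEGENERACY = {"I": 3, "L": 8, "M": 1}


def window_degeneracies(sequence: str, max_len: int) -> dict[int, list[int]]:
    """table[k][i] is the degeneracy of the window of length k starting at index i.

    ASSUMPTION: a primer is a contiguous window and its degeneracy is a product.
    Each row is derived from the previous row instead of being recomputed.
    """
    table = {1: [RESIDUE_DEGENERACY[residue] for residue in sequence]}
    for k in range(2, max_len + 1):
        shorter = table[k - 1]
        # Extend the window of length k-1 starting at i by one residue on the right.
        table[k] = [
            shorter[i] * table[1][i + k - 1]
            for i in range(len(sequence) - k + 1)
        ]
    return table


table = window_degeneracies("MILM", 3)
print(table[2])
print(table[3])
```

Each entry for length k costs one multiplication because the entry for length k-1 is reused, rather than multiplying k values from scratch for every window.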

The Prototype: Click Find Primers after setting preferences.

Primer List Window

Inspection Window

Senior Thesis, Winter Term: Incorporated a system to record and analyze the results produced from finding primers. Utilized data mining tools. Learn more about what is done with the primers after they are generated from the protein sequences. Analyzing the results involves looking for patterns in the records. I intend to aim my research in the direction of data mining and pattern matching. Based on the analysis, we may be able to refine the algorithm.

Data Mining, Association Rules: Rules have the form LHS → RHS. Interpretation: if every item in LHS occurs, then it is likely that all of the items in RHS will also occur. Example: LHS = protein sequence A contains primers 1, 2 & 3; RHS = protein sequence A contains primers 4 & 5. Application: find association rules based on Horton's data collected about primers.
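How "likely" a rule is can be measured with the standard support and confidence definitions from association rule mining. The Python sketch below uses made-up records, since Horton's data is not in the transcript, and the function names are hypothetical. Each record maps a protein sequence to the set of primer numbers found in it.

```python
def support(records: dict[str, set[int]], items: set[int]) -> float:
    """Fraction of protein sequences whose primer set contains every item."""
    hits = sum(1 for primers in records.values() if items <= primers)
    return hits / len(records)


def confidence(records: dict[str, set[int]], lhs: set[int], rhs: set[int]) -> float:
    """Among sequences that satisfy LHS, the fraction that also satisfy RHS."""
    lhs_support = support(records, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies in this data
    return support(records, lhs | rhs) / lhs_support


# Hypothetical data for illustration only.
records = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 4, 5},
    "C": {1, 2, 3},
    "D": {2, 4},
}

print(support(records, {1, 2, 3}))
print(confidence(records, {1, 2, 3}, {4, 5}))
```

In this made-up data, three of the four sequences contain primers 1, 2 and 3, and two of those three also contain primers 4 and 5, so the rule holds for two thirds of the sequences it applies to.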

Protein Database

The End