Published by 정신 육. Modified over 5 years ago.
1
BioInformatics
Applying principles of computer science in a biological context

Introduce myself: summer research and senior thesis in the field of bioinformatics.
2
Outline
Biological Background Information
Problem Description
My Project
  Previous Work
  Senior Thesis
3
Biological Data Sets
Raw DNA sequences
Macromolecular structures
Genomes
Protein Sequences

This is a broad, new field. I'm only dealing with one very specific topic in bioinformatics, not all of it by any means. Macromolecular structures: 3D structures of molecules. Genomes: the complete genetic material of an organism. I am focusing on protein sequences.
4
What is a Protein Sequence?
A string of amino acids, each represented by a single letter
There are 20 different amino acids
Typical proteins are about 300 amino acids long
… I L V K M U T A N K V K M U …

Amino acid = chemical compound. Next slide: amino acid examples.
5
Examples of Amino Acids

For example, I stands for isoleucine. There are only 20 amino acids.
6
Importance of Protein Sequences
Compare two or more sequences
Determine similarities in their functions
Multiple Alignment
  Serves as input for my analysis
  Web-based programs available
  Maximizes areas of similarity

Why are protein sequences important? If we have a protein from one species of fruit fly and a similar protein from another species of fruit fly, we can compare the proteins by comparing their sequences. The comparison is done through multiple alignments. ClustalW is one program that does this.
7
Multiple Alignment Example
Shaded areas show regions of exact match.
A dash is placed in the smaller protein sequence to achieve the alignment.

Here are two protein sequences that we want to compare. If we line them up from left to right, two places match exactly. Inserting a “don’t care” amino acid in the second sequence matches the two sequences better: now 3 places match exactly. We can get different degrees of similarity depending on how the sequences are aligned, and a multiple alignment maximizes the degree of similarity. After completing the multiple alignment process, we are left with a group of aligned protein sequences where a blank in a column represents either an exact match or a don’t care position. This group of aligned sequences is called a primer. Redundancies in each column are then removed.
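The effect of inserting a dash can be sketched in a few lines of Python. This is only an illustration: the two sequences are made up for the example (they are not the ones on the slide), and `count_matches` is a hypothetical helper, not part of ClustalW or of the prototype.

```python
def count_matches(seq_a: str, seq_b: str) -> int:
    """Count columns where two equal-length aligned sequences agree.

    A dash ("-") marks an inserted gap and never counts as a match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")


# Hypothetical sequences, chosen only to mirror the 2-versus-3 story above.
longer = "ILVKM"
left_to_right = "ILKA-"   # shorter sequence padded at the end
with_gap = "IL-KA"        # same sequence with a dash inserted after L

print(count_matches(longer, left_to_right))  # matches without the inserted dash
print(count_matches(longer, with_gap))       # matches with the inserted dash
```

A real multiple alignment program searches over all possible dash placements; this sketch only scores two placements that were chosen by hand.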
8
Evaluating Usefulness of a Primer
Similar does not always mean useful
Why?
Different ways of creating amino acids
Amino acids coded by nucleotide triplets
  A, T, C, G
  Triplet = Codon
9
Degeneracy Example

To determine the usefulness of a primer, we first write down the codons that can create each amino acid in the sequence. For instance, I is made from 3 different codons, and L is made from 6. We then see how many combinations of nucleotide triplets we could form using those codons. For I, each triplet has an A in the first column, so only an A can be used in the first slot. Only a T is found in the 2nd slot, and an A, C or T in the 3rd. This means that only 3 combinations can be created out of I’s codons. The more important case is L. Its codons allow 2 nucleotides in the first slot, 1 in the second and 4 in the third, so 8 different combinations can be made from the original 6 codons.

The number at the bottom is called the degeneracy. Each amino acid has a degeneracy, and we use it to calculate a total degeneracy for the entire primer. The number at the bottom right is the degeneracy for the entire primer, and lower is better. Why? Lower means real similarity, that is, similarity at the nucleotide level. The fewer combinations that could be generated from the codons, the tighter the restriction on how the amino acid was originally formed. Thus, regions with lots of M’s would have low degeneracy and a high probability of usefulness.
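The per-amino-acid count described above can be sketched as follows. The codon lists for I, L and M come from the standard genetic code. Combining the per-amino-acid values by multiplication to get the total for a primer is an assumption here, since the notes do not state the exact formula, and the function names are illustrative only.

```python
from math import prod

# Standard genetic code (DNA alphabet) for three amino acids.
CODONS = {
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "M": ["ATG"],
}


def amino_acid_degeneracy(codons):
    """Multiply the number of distinct nucleotides seen in each of the 3 slots."""
    return prod(len({codon[slot] for codon in codons}) for slot in range(3))


def primer_degeneracy(primer, table=CODONS):
    """Total degeneracy of a primer, ASSUMED to be the product over its residues."""
    return prod(amino_acid_degeneracy(table[aa]) for aa in primer)


print(amino_acid_degeneracy(CODONS["I"]))  # 1 * 1 * 3
print(amino_acid_degeneracy(CODONS["L"]))  # 2 * 1 * 4
print(primer_degeneracy("ILM"))            # product of the three values
```

A full version would need the codon lists for all 20 amino acids; only three are included to keep the sketch short.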
10
Current Methods
Client: Biology Professor Steven Horton
Manual search of primers
Manual calculation of degeneracy
11
Requirements
Automate the task of finding primers
Automate degeneracy calculation
Record and organize results
Analyze data to make predictions
  Pattern Matching
  Data mining

Software engineering groups worked on the first 2 requirements.
12
Summer Research
Analyzed solutions made by Software Engineering class in Spring 2003
Combined the good design features from each project
Made a prototype in Java
13
Senior Thesis Fall Term
Finished the prototype
Multiple window design
Made algorithm more efficient using dynamic programming

As far as the interface goes, the aim is to give Horton the same tools he has with pen and paper. This involves a multiple window design. Professor Horton will be able to select a specific primer in the main window and bring up another window to examine the primer more closely. Any number of primer windows can be open at a time.

Another goal for the fall is to make the algorithm as efficient as possible. Primers of length k can be derived from primers of length k-1. This property allows us to use dynamic programming to generate primers.

An ongoing objective of the project for me is to become more familiar with the biological concepts that relate to protein functions. To achieve this, regular meetings with Professor Horton will be scheduled. Understanding the biology behind primer generation will help me ask more intelligent questions of Professor Horton and expand the capabilities of the system.
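One way the "length k from length k-1" property could be used is sketched below. This is not the prototype's algorithm, which the notes do not describe. It assumes that a primer is a contiguous window of residues and that its total degeneracy is the product of the per-residue degeneracies, so each window of length k reuses the stored value for the window of length k-1 that starts at the same position. The name `window_degeneracies` is hypothetical.

```python
def window_degeneracies(residue_degeneracy, max_len):
    """Build table[k][i]: total degeneracy of the window of length k starting at i.

    Each row is derived from the previous row with one multiplication per
    window, instead of recomputing the whole product from scratch.
    """
    n = len(residue_degeneracy)
    table = {1: list(residue_degeneracy)}
    for k in range(2, max_len + 1):
        shorter = table[k - 1]
        table[k] = [
            shorter[i] * residue_degeneracy[i + k - 1]  # extend by one residue
            for i in range(n - k + 1)
        ]
    return table


# Per-residue degeneracies for a hypothetical stretch "ILMM" (I=3, L=8, M=1).
table = window_degeneracies([3, 8, 1, 1], max_len=4)
for length, values in table.items():
    print(length, values)
```

With the table in hand, candidate primers of any length can be filtered by a degeneracy threshold without recomputing products.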
14
The Prototype
Click Find Primers after setting preferences.
15
Primer List Window
16
Inspection Window
17
Senior Thesis Winter Term
Incorporated a system to record and analyze results produced from finding primers
Utilized data mining tools

Learn more about what is done with the primers after they are generated from the protein sequences. Analyzing the results involves looking for patterns in the records. I intend to aim my research in the direction of data mining and pattern matching. Based on the analysis, we may be able to refine the algorithm.
18
Data Mining: Association Rules
Have the form LHS → RHS
Interpretation: If every item in LHS occurs, then it is likely that all of the items in RHS will also occur
Example:
  LHS = protein sequence A contains primers 1, 2 & 3
  RHS = protein sequence A contains primers 4 & 5
Application: Find Association Rules based on Horton’s data collected about primers
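How "likely" a rule is can be measured with the usual support and confidence definitions from association rule mining. The sketch below applies them to made-up records, where each record is the set of primer numbers found in one protein sequence. The data and the function names are hypothetical and are not taken from Horton's data.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record) / len(records)


def confidence(records, lhs, rhs):
    """Among records containing all of LHS, the fraction that also contain all of RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [record for record in records if lhs <= record]
    if not with_lhs:
        return 0.0  # the rule never applies in this data
    return sum(1 for record in with_lhs if rhs <= record) / len(with_lhs)


# Hypothetical data: primers found in four protein sequences.
records = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 5},
    {1, 2, 3},
    {2, 4},
]

print(support(records, {1, 2, 3, 4, 5}))     # how often the whole rule is seen
print(confidence(records, {1, 2, 3}, {4, 5}))  # how often RHS follows LHS
```

A rule is usually kept only if both values exceed chosen thresholds; picking those thresholds is a separate design decision.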
19
Protein Database
20
The End