Approximate Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.

Midterm
Focus on understanding, not memorization
Three types of questions
  – Direct test of your knowledge from the class/slides (30%)
  – Thinking questions (40%)
  – Problems (30%)
Date: Tuesday, October 14

Sample direct question
What is mRNA? Name two processes where it plays a role. [2 sentences]
Answer: mRNA is a complementary copy of the DNA in a gene. It is involved in transcription (from DNA to mRNA) and translation (from mRNA to proteins via tRNA).

Sample thinking question
A biologist has discovered a method that quickly reports the total number of each type of nucleotide in the DNA inside the nucleus of a cell. Using this method, he reports that for Rattus norvegicus the distribution is 22% adenine, 29% cytosine, 18% guanine, and 31% thymine. Explain why his results cannot possibly be correct. [1-2 sentences]

Answer
Since the analysis is supposedly performed over the entire nucleus, it should include both complementary strands of each chromosome. Because A pairs with T and C pairs with G, the number (and percentage) of A and T should be the same, and likewise for C and G. The reported 22% A vs. 31% T and 29% C vs. 18% G violate this.
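The pairing argument in this answer can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function names, the example strand, and the 1-point tolerance are illustrative assumptions.

```python
from collections import Counter

# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def double_strand_counts(strand):
    """Count nucleotides over a strand together with its complementary strand."""
    complement = "".join(COMPLEMENT[base] for base in strand)
    return Counter(strand + complement)

def is_plausible(percentages, tolerance=1.0):
    """True if %A is close to %T and %C is close to %G (tolerance is an assumed value)."""
    return (abs(percentages["A"] - percentages["T"]) <= tolerance
            and abs(percentages["C"] - percentages["G"]) <= tolerance)

# Any single strand, however skewed, gives balanced counts once both strands are included.
counts = double_strand_counts("AACGTTTG")
print(counts["A"] == counts["T"], counts["C"] == counts["G"])

# The distribution reported in the question fails the check.
reported = {"A": 22, "C": 29, "G": 18, "T": 31}
print(is_plausible(reported))
```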

Sample problem question
The Longest Common Subsequence (LCS) problem is as follows: Given strings S and T, find the longest string R that is a subsequence of both S and T. Which algorithm among the ones we discussed in class can be applied without modifications to solve this problem? State the appropriate parameters for that algorithm so that it works for the LCS problem. [2-5 sentences]

Answer
The LCS problem is the same as local alignment if we never allow a mismatch, score all matches with the same positive number, and do not penalize indels. The DP algorithm discussed in class will then find the longest (because matches improve the score) part of each string that can be matched exactly while freely deleting characters from each string, i.e., the LCS.
The corresponding parameters are σ(x,x) = 1 (any positive number will work), σ(x,y) = -∞ for different non-space x and y, and σ(x,-) = σ(-,x) = 0.
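As an illustration of this answer, here is a minimal sketch of a local-alignment style DP with exactly those scores (match = 1, mismatch = -∞, indel = 0). It is not the code from class; the function name and the traceback details are assumptions made for the example.

```python
NEG_INF = float("-inf")

def lcs_by_local_alignment(s, t):
    """Local-alignment DP with sigma(x,x)=1, sigma(x,y)=-inf, sigma(x,-)=sigma(-,x)=0."""
    n, m = len(s), len(t)
    # score[i][j] = best local alignment score for s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = 1 if s[i - 1] == t[j - 1] else NEG_INF
            score[i][j] = max(0,                          # local alignment may restart
                              score[i - 1][j - 1] + sigma,  # match (mismatch is forbidden)
                              score[i - 1][j],              # free deletion from s
                              score[i][j - 1])              # free deletion from t
    # With free indels the score never decreases, so the optimum sits at (n, m).
    i, j, matched = n, m, []
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and score[i][j] == score[i - 1][j - 1] + 1:
            matched.append(s[i - 1])
            i, j = i - 1, j - 1
        elif score[i - 1][j] == score[i][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(matched))

print(lcs_by_local_alignment("AGGTAB", "GXTXAYB"))
```

The alignment score at the bottom-right cell equals the LCS length, and the matched columns of the traceback spell out one LCS.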

How high is O(nm)?
Suppose we are matching a given protein (300 amino acids) with SwissProt
Current SwissProt stats (September 2008)
  – 397,500 entries
  – 143 million amino acids (360 amino acids on average)
Need 397,500 × 360 × 300 (≈ 42.9 billion) time units and 360 × 300 (= 108,000) space units (all multiplied by a constant)
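The estimate above is a direct product of the slide's figures. A small sketch of the arithmetic, with variable names chosen for the example:

```python
entries = 397_500        # SwissProt entries (September 2008, from the slide)
avg_entry_length = 360   # average amino acids per entry
query_length = 300       # amino acids in the query protein

# One O(nm) comparison per entry, so time scales with entries * n * m.
time_units = entries * avg_entry_length * query_length
# Only one DP table is needed at a time, so space scales with n * m.
space_units = avg_entry_length * query_length

print(f"time:  {time_units:,} units")
print(f"space: {space_units:,} units")
```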

Two settings
Multiple comparisons, short sequences
  – Parallelization
One or more comparisons, very long sequences (e.g., DNA)
  – Space used to be the critical factor because of limits in the size of direct-access memory
  – Modern machines can handle the space requirements for almost all comparisons, so time is now the important factor

Heuristic alignment
O(nm) may still be too much for long sequences
Rather than finding the true optimal alignment, follow a heuristic approach that is likely to produce a good alignment
  – “good” is generally not as good as optimal
  – sometimes a high-scoring alignment will be completely missed

Basis for heuristic alignment
What if there were no gaps?
  – Efficient algorithms exist for aligning
  – Knuth-Morris-Pratt algorithm, O(n+m) time and space
Any good alignment likely has one or more regions with exact matches and no gaps
Find such hot spots and proceed from there
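For reference, the gap-free exact matching mentioned here can be sketched with a standard Knuth-Morris-Pratt search. This is a textbook-style sketch rather than code from the lecture; the function names are chosen for the example.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern instead of rescanning the text.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]   # keep going to find overlapping occurrences
    return matches

print(kmp_search("ACGTACGTAC", "ACGTAC"))
```

Each text character is examined once and the fallback steps are amortized, which is where the O(n+m) time bound comes from.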

FASTA – Step 1
FASTA = Fast Alignment
Find all words of length k that exactly match between the two sequences (hot spots)
To avoid O(nm) complexity
  – Construct a hash table for one of the strings, where the keys are the possible words and the values are their starting positions
  – |Σ|^k such strings, O(kn) complexity
  – Match the second string in O(km) time
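The hash-table step can be sketched as follows. This is a simplified illustration of the idea on the slide, not the actual FASTA implementation; the function names and the (i, j) output format are assumptions.

```python
from collections import defaultdict

def build_word_index(seq, k):
    """Map each length-k word of seq to the list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)   # slicing a word costs O(k), hence O(kn) overall
    return index

def find_hot_spots(s, t, k):
    """Return (i, j) pairs such that s[i:i+k] == t[j:j+k]."""
    index = build_word_index(s, k)
    hot_spots = []
    for j in range(len(t) - k + 1):
        # One O(k) lookup per position of t; .get avoids adding empty entries.
        for i in index.get(t[j:j + k], ()):
            hot_spots.append((i, j))
    return hot_spots

print(find_hot_spots("ACGTAC", "TACG", 2))
```

The lookups take O(km) time, but note that the number of reported hot spots itself can be large when the sequences are highly repetitive.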

Word size effects
As k becomes larger
  – The algorithm becomes linearly slower
  – The algorithm takes exponentially more space
  – It is more difficult to find exact matches, hence the algorithm becomes more selective and less sensitive
Typical values for k are 2 for proteins and 4-6 for DNA
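To make the exponential space growth concrete, the number of possible words is |Σ|^k. A small sketch, assuming alphabet sizes of 20 for proteins and 4 for DNA and a hypothetical helper name:

```python
def possible_words(alphabet_size, k):
    """Number of distinct words of length k over an alphabet of the given size."""
    return alphabet_size ** k

# Assumed alphabet sizes: 20 amino acids, 4 nucleotides.
for k in (1, 2, 3):
    print(f"protein, k={k}: {possible_words(20, k):,} possible words")
for k in (4, 5, 6):
    print(f"DNA,     k={k}: {possible_words(4, k):,} possible words")
```

At the typical settings quoted above, the word space stays small: 400 words for proteins at k = 2 and at most 4,096 for DNA at k = 6.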