Recap Don’t forget to – pick a paper and – me See the schedule to see what’s taken –

Presentation transcript:

Agenda Questions for you (10 minutes) Overview (40 minutes) – chromosomes – sequence comparison – string matching – alignment Quiz (25+ minutes)

Questions for you List two different functions performed by genes? What is the length of the human genome? Why is the double-helix/base-pairing so important?

Questions for you Protein sequences are composed of a chain of what? How many different amino acids are found in proteins? Proteins always form in a helix shape (True or False)?

Questions that would stump Dr. B. What is the lower limit on the length of a functional protein? – – – – 100 What is the upper limit on the length of proteins found in cells – 100’s – 1000’s – ’s

Questions that would stump Dr. B. What is the average length of a human gene? – 300 – 3000 – 30,000 Approximately how many genes are in the human genome? – 400 – 4000 – 40,000 – 400,000 – 4,000,000

Remember this picture? [Figure: a DNA strand drawn as a chain of repeating acid and sugar units, each sugar carrying one base (A, T, G, or C)]

Chromosomes A chromosome is a DNA molecule and its associated proteins. The 3,000,000,000 nucleotide human genome is divided among – 22 pairs of autosomes and – 1 pair of sex chromosomes Together, the 23 pairs of chromosomes carry all the hereditary information of an organism.

Chromosomes

DNA Sequence Comparison Overview There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

Whole Genome Comparison Problem: Exactly how similar are two different genomes? Given a set of genomes – which two are most similar – which two are least similar

Whole Genome Comparison Ranking a set of genomes based on similarity gives us clues about heredity and evolution. [Figure: genomes G1 to G5 with a table of pairwise similarity ranks; resulting order G2, G5, G4, G3, G1]

Whole Genome Comparison Solution: Design a metric, something you can measure or compute, that accurately quantifies similarity.
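
To make "which two are most similar, which two are least similar" concrete, here is a minimal sketch that scores every pair of genomes with a pluggable metric and sorts the pairs. The metric used here, `positional_identity`, is only a placeholder chosen for brevity (it is not a metric from these slides), and the genome names and sequences are made up for illustration.

```python
from itertools import combinations


def positional_identity(a, b):
    """Placeholder metric: fraction of positions where the two sequences agree.

    Positions past the end of the shorter sequence count as disagreements.
    """
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / length


def rank_pairs(genomes, metric=positional_identity):
    """Return (score, name_a, name_b) for every pair, most similar first."""
    scored = [
        (metric(genomes[p], genomes[q]), p, q)
        for p, q in combinations(sorted(genomes), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical toy genomes, far shorter than anything real.
genomes = {"G1": "ACGTACGT", "G2": "ACGTACGA", "G3": "TTTTTTTT"}
ranking = rank_pairs(genomes)
most_similar = ranking[0]
least_similar = ranking[-1]
```

Swapping in a different `metric` function changes the ranking without touching `rank_pairs`, which is the point of designing the metric separately.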

Whole Genome Comparison But what does it really mean for two genomes to be similar? Obviously, if two genomes match exactly, then they are similar. But what’s more important: – rough, overall similarity, or – exact, local similarity? A picture will explain.

Whole Genome Comparison Exact matching genomes GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA

Whole Genome Comparison Rough overall similarity GCTTACTTAGACAAGTCGCTGATCATGCTATGCA GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA 2 Mismatched pairs 4 unmatched nucleotides

Whole Genome Comparison Exact local similarities TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
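
One simple way to see the exact local similarity in the two sequences above is to look for their longest common substring. This is a sketch for illustration only, not the method used in the course; the function name is an assumption.

```python
def longest_common_substring(a, b):
    """Return the longest run of symbols that appears contiguously in both a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run inside a
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


seq1 = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
seq2 = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"
shared = longest_common_substring(seq1, seq2)
```

The two sequences differ at both ends but share one long exact stretch in the middle, which is the situation the slide illustrates.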

Whole Genome Comparison The first metric: Edit Distance The number of edit operations needed to make the two sequences equal Edit Distance was previously used in – Spell checkers – Approximate database searching

Edit Distance 3 edit operations 1. delete a symbol 2. insert a symbol 3. modify a symbol Modify = delete + insert, so a modify counts as two edit operations.
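
Because a modify costs the same as a delete plus an insert, this version of edit distance can be computed from the longest common subsequence (LCS): every symbol outside the LCS is either deleted from one sequence or inserted into the other. The sketch below assumes that reading; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of single-symbol deletions and insertions turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


# A single modified symbol costs two operations under this definition.
cost_of_one_modify = indel_distance("A", "T")
```

Note that this differs from the more common Levenshtein distance, where a substitution counts as a single operation.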

Edit Distance What is the edit distance between these two sequences? Note: edit distance implies the minimum number of basic edit operations needed to make the strings equal. ERICWASABIGNERD ERICSTILLISANERD ERICWASABIGNERD (5 deletions) ERICSTILLISANERD (6 deletions)

Edit Distance ERICWASABIGNERD (15 symbols) ERICSTILLISANERD (16 symbols) ERICWASABIGNERD (5 deletions) ERICSTILLISANERD (6 deletions) Metrics – Matches 10 / Smaller Sequence 15 ≈ 67% – (Symbols 31 – Edits 11) / Symbols 31 ≈ 65%
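
The two metrics can be written as plain arithmetic on the sequence lengths and the number of matched symbols. This sketch reads the second metric as the fraction of symbols left untouched by the edits, (symbols − edits) / symbols, which is an assumption about the slide's intent; the function and variable names are illustrative.

```python
def similarity_metrics(len_a, len_b, matches):
    """Return (match_ratio, edit_ratio) for two sequences sharing `matches` symbols."""
    symbols = len_a + len_b
    edits = symbols - 2 * matches  # every unmatched symbol is one deletion or insertion
    match_ratio = matches / min(len_a, len_b)
    edit_ratio = (symbols - edits) / symbols
    return match_ratio, edit_ratio


# ERICWASABIGNERD (15 symbols) vs ERICSTILLISANERD (16 symbols), 10 matched symbols.
match_ratio, edit_ratio = similarity_metrics(15, 16, 10)
```

The first metric divides by the shorter sequence only, while the second divides by the combined length, so unmatched material in the longer sequence lowers the second metric but not the first.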

Edit Distance There are problems with edit distance It doesn’t properly reward exact local similarity – which is often a true sign of biological similarity Similar organisms often share a lot of similar genes But may have a few genes that don’t match at all Biologists need a metric that can reflect this type of situation

Edit Distance Another problem: two organisms might have almost identical DNA, except one has extra segments. Metrics – Matches 99 / Smaller Sequence 100 = 99% – (Symbols 250 – Edits 50) / Symbols 250 = 80%

Edit Distance How is it possible that two metrics based on the same principle (edit distance) could produce such different results? Metrics – Matches 99 / Smaller Sequence 100 = 99% – (Symbols 250 – Edits 50) / Symbols 250 = 80%

Recall There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

Gene Search Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism. They want to know: is this segment a gene? Advantage: Genes are similar across different organisms. Two organisms that perform the exact same function are likely to have nearly identical genes for it.

Gene Search Solution: Take your newly sequenced segment And search all the previously sequenced genomes. Find segments (in other genomes) that highly match your segment. Advantage: – Other genomes are marked-up – Segments that are known to be genes are labeled – If your segment matches a known gene then BAM! – You’ve found a gene in a previously un-sequenced organism.

Gene Search Obviously, you want to search for a segment that is highly similar to your target segment. However, this type of comparison is completely different than whole genome comparison What is the fundamental difference?

Gene Search vs. Whole Genome Comparison Whole genome comparison considers sequences in their entirety – Two sequences – Beginning to End

Gene Search vs. Whole Genome Comparison Gene search doesn’t consider the entire search sequence when evaluating similarity Two sequences – Target (the segment you sequenced) – Search Sequence (possibly a genome)

Gene Search You want to find a sub-segment of the search sequence that highly matches the target sequence. The entire search sequence is analyzed But in evaluating similarity, we don’t need to consider the search sequence in its entirety Looking for localized similarity
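
As a deliberately simplified sketch of "localized similarity", the code below slides the target along the search sequence and reports the window with the most matching positions. It ignores insertions and deletions entirely, so it is only an illustration of the idea, not a real gene-search algorithm; the function name and the toy sequences are assumptions.

```python
def best_ungapped_match(target, search):
    """Return (start, matches) for the window of `search` that best matches `target`.

    Only same-length windows are compared, position by position (no gaps).
    """
    if not target or len(target) > len(search):
        raise ValueError("target must be non-empty and no longer than search")
    best_start, best_matches = 0, -1
    for start in range(len(search) - len(target) + 1):
        window = search[start:start + len(target)]
        matches = sum(t == s for t, s in zip(target, window))
        if matches > best_matches:
            best_start, best_matches = start, matches
    return best_start, best_matches


# Hypothetical toy data: the search sequence holds a near-copy of the target.
start, matches = best_ungapped_match("GATTACA", "TTGATCACAGG")
```

The whole search sequence is scanned, but the score depends only on the best window, which mirrors the distinction drawn between gene search and whole genome comparison.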

Gene Search How do you even know that your newly sequenced segment is a gene? Perhaps only part of it is a gene and the rest is junk.

Gene Search Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence. Writing an algorithm to find such matches is hard

Gene Search Writing such algorithms required coordination between 1. Biologists – Who have some clues about true biological similarity 2. And Computer Scientists – Who have some clues about what problems can be solved efficiently and reliably.

Recall There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

Next Class Motif discovery (computer science perspective) Alignment (the technique used to measure similarity) – Global alignment – Local alignment – Scoring matrices

Homework Pick a paper! me. Read pages