Mark Vorster Supervisor: Prof Philip Machanick. Research Overview Goal  Aid bioinformaticians in research by providing a tool which can identify similar.

Presentation transcript:

Mark Vorster Supervisor: Prof Philip Machanick

Research Overview

Goal
• Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences in order to infer homogeneity, in a timely manner.

Reason for problems
• Large data sets
• Days of processing
• No existing specific tools

Sections: Research Overview, Bioinformatics, String Matching, Discussion, Questions

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."
Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."
Oxford English Dictionary

History of Bioinformatics and Genetics
• Watson, Crick, Wilkins and Franklin
• Discrete abstraction
  • Adenine – Thymine
  • Guanine – Cytosine

[Figure: DNA double helix, labelled: one helical turn = 3.4 nm, sugar-phosphate backbone, base, hydrogen bonds]

Sequence Analysis and Sequence Alignment
• Sequence Alignment
  • Global alignment is expensive
  • Assumption: sequences are already globally aligned
• Alignment Differences
  • Original:    TGAGCACCT
  • Insertion:   TGACGCACCT (C inserted after TGA)
  • Deletion:    TGA_CACCT (G deleted)
  • Replacement: TGATCACCT (G replaced by T)
• Phylogenetic inference

FASTA File Format
• Leading '>'
• Sequence identifier
• Description or comment
• A number of lines of genetic code
• Other symbols

Example:
>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
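The format described on this slide can be read with a few lines of code. The following is a minimal sketch, not the tool's actual parser: the names `FastaRecord`, `parse_fasta` and `SLIDE_EXAMPLE` are hypothetical, and it assumes the identifier is the first whitespace-delimited word after '>' with the rest of the header line treated as the description.

```python
from typing import Iterable, Iterator, NamedTuple


class FastaRecord(NamedTuple):
    identifier: str
    description: str
    sequence: str


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one record per '>' header, joining its sequence lines."""
    identifier = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if identifier is not None:
                yield FastaRecord(identifier, description, "".join(chunks))
            header = line[1:].split(maxsplit=1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        yield FastaRecord(identifier, description, "".join(chunks))


# The example from the slide, used here as test input.
SLIDE_EXAMPLE = """\
>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
"""

for record in parse_fasta(SLIDE_EXAMPLE.splitlines()):
    print(record.identifier, len(record.sequence))
```

Writing the parser as a generator means a large file can be streamed one record at a time rather than loaded whole, which matters at the data-set sizes the talk describes.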

Approximate String Matching Algorithm
• Nested loops are inefficient
• Dynamic programming
  • Takes into account all previous information
  • Improved to O(n²), where n is the number of bases in the shorter sequence
• Goal: find the closest match between two strings, or the minimum number of differences

Approximate String Matching Algorithm

D[i][j] is the minimum of:
• MatchCost = D[i-1][j-1], if p_i = t_j
• ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
• InsertCost = D[i-1][j] + 1
• DeleteCost = D[i][j-1] + 1

Boundary conditions: D[0][j] = 0 and D[i][0] = i

Each cell D[i][j] depends on its three neighbours D[i-1][j-1], D[i-1][j] and D[i][j-1].
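The recurrence above can be sketched in code as follows. This is an illustrative implementation rather than the author's: the function name `approximate_match` is hypothetical, the comparison is case-sensitive, and only two columns of D are kept in memory because each column depends only on the previous one.

```python
from typing import Tuple


def approximate_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (fewest differences, end position in text) for the best
    approximate occurrence of pattern anywhere in text."""
    m = len(pattern)
    # Column j = 0: D[i][0] = i (pattern prefix against empty text).
    prev = list(range(m + 1))
    best_cost, best_end = prev[m], 0
    for j, t_char in enumerate(text, start=1):
        # D[0][j] = 0: a match may start at any position in the text.
        curr = [0] * (m + 1)
        for i, p_char in enumerate(pattern, start=1):
            # MatchCost or ReviseCost, both taken from D[i-1][j-1].
            diagonal = prev[i - 1] + (0 if p_char == t_char else 1)
            insert_cost = curr[i - 1] + 1  # D[i-1][j] + 1
            delete_cost = prev[i] + 1      # D[i][j-1] + 1
            curr[i] = min(diagonal, insert_cost, delete_cost)
        # The last row holds the cost of a full-pattern match ending at j.
        if curr[m] < best_cost:
            best_cost, best_end = curr[m], j
        prev = curr
    return best_cost, best_end


# Pattern and text as they appear in the transcript of the example slides.
print(approximate_match("happy", "Haveahsppyday"))
```

For the transcript's example, the best match is "hsppy", which differs from "happy" by a single replacement.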

Approximate String Matching Algorithm

Example: the pattern "happy" (rows: NULL, h, a, p, p, y) is matched against the text "Haveahsppyday" (columns). The first column is initialised to h = 1, a = 2, p = 3, p = 4, y = 5.

Approximate String Matching Algorithm

Example table: text "Haveahsppyday" across the top (t_j), pattern "happy" down the side (p_i). Cell D[i][j] is computed from D[i-1][j-1], D[i-1][j] and D[i][j-1]:
• MatchCost = D[i-1][j-1], if p_i = t_j
• ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
• InsertCost = D[i-1][j] + 1
• DeleteCost = D[i][j-1] + 1

For the example cell:
• MatchCost = N/A
• ReviseCost = 3
• InsertCost = 2
• DeleteCost = 4
Min = 2

Approximate String Matching Algorithm

[Example table: text "Haveahsppyday" across the top, pattern "happy" (NULL, h, a, p, p, y) down the side; cell values not preserved in the transcript.]
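Since the cell values of the example table did not survive in the transcript, the sketch below regenerates them from the recurrence and boundary conditions given earlier in the talk (D[0][j] = 0, D[i][0] = i). The function names `build_table` and `format_table` are hypothetical, the comparison is assumed to be case-sensitive, and "-" stands in for the NULL row and column.

```python
from typing import List


def build_table(pattern: str, text: str) -> List[List[int]]:
    """Fill the full table D, with pattern on the rows and text on the columns."""
    rows, cols = len(pattern) + 1, len(text) + 1
    table = [[0] * cols for _ in range(rows)]  # row 0 stays 0: D[0][j] = 0
    for i in range(1, rows):
        table[i][0] = i  # D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            if pattern[i - 1] == text[j - 1]:
                diagonal = table[i - 1][j - 1]      # MatchCost
            else:
                diagonal = table[i - 1][j - 1] + 1  # ReviseCost
            insert_cost = table[i - 1][j] + 1
            delete_cost = table[i][j - 1] + 1
            table[i][j] = min(diagonal, insert_cost, delete_cost)
    return table


def format_table(pattern: str, text: str, table: List[List[int]]) -> str:
    """Render the table as text; '-' marks the NULL row and column."""
    col_labels = ["-"] + list(text)
    row_labels = ["-"] + list(pattern)
    lines = ["  " + " ".join(f"{c:>2}" for c in col_labels)]
    for label, row in zip(row_labels, table):
        lines.append(f"{label} " + " ".join(f"{v:>2}" for v in row))
    return "\n".join(lines)


table = build_table("happy", "Haveahsppyday")
print(format_table("happy", "Haveahsppyday", table))
```

Under these assumptions, the cell for the first "p" of the pattern against the first "a" of the text (i = 3, j = 2) has ReviseCost 3, InsertCost 2 and DeleteCost 4, giving a minimum of 2. These are the same costs as in the talk's worked cell, although the transcript does not say which cell that was.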

Approximate String Matching Algorithm

Changes
• D[i][0] = i, if p_i = t_0
• D[i][0] = i + 1, if p_i ≠ t_0
• D[0][j] = j, if p_0 = t_j
• D[0][j] = j + 1, if p_0 ≠ t_j
• Additional stop case for mismatch
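The effect of these changes is that both sequences are compared from their first base and a comparison can be abandoned once it is clearly a mismatch. The sketch below illustrates that idea under stated assumptions and is not the author's code. It uses the conventional empty-prefix boundary (D[0][j] = j, D[i][0] = i) instead of the slide's index-0 form. The threshold parameter `max_diff` and the function name `bounded_edit_distance` are hypothetical, since the slides do not say how the stop case is defined.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, max_diff: int) -> Optional[int]:
    """Edit distance between a and b, both anchored at their first character.

    Returns None as soon as the distance is known to exceed max_diff.
    """
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_diff:
        return None
    prev = list(range(len(b) + 1))  # D[0][j] = j
    for i, a_char in enumerate(a, start=1):
        curr = [i] + [0] * len(b)   # D[i][0] = i
        for j, b_char in enumerate(b, start=1):
            curr[j] = min(
                prev[j - 1] + (a_char != b_char),  # match or revise
                prev[j] + 1,                       # insert
                curr[j - 1] + 1,                   # delete
            )
        # Stop case: the final distance is never below the smallest value
        # in any row, so this pair cannot come in under the threshold.
        if min(curr) > max_diff:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_diff else None


print(bounded_edit_distance("TACGGACGGT", "TACGAAGGGA", max_diff=3))
print(bounded_edit_distance("TACGGACGGT", "TACGAAGGGA", max_diff=2))
```

The early exit is safe because the smallest value in a row of D never decreases from one row to the next.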

Approximate String Matching Algorithm

[Example table: sequence TACGGACGGT across the top, sequence TACGAAGGGA down the side; cell values not preserved in the transcript.]

Discussion
• Grouping algorithm
• Scale of the problem
  • 400 to 800 bases per sequence
  • Tens of thousands of sequences
• Assumptions:
  • Sequences are globally aligned
  • Sequences begin at the same place

Example Grouping

Seq[336]  HK2QS7R01AXRJ6  Seq[218]  Seq[38]   Seq[235]  Seq[89]   …
Seq[382]  HK2QS7R01BR4Q9  Seq[173]
Seq[180]  HK2QS7R01ABFDP  Seq[339]  Seq[289]  Seq[491]  Seq[319]  …
Seq[269]  HK2QS7R01AZHD7  Seq[402]  Seq[112]  Seq[203]  Seq[137]  …
Seq[210]  HK2QS7R01BMNQ4  Seq[364]
Seq[270]  HK2QS7R01AZFOG  Seq[388]  Seq[441]
Seq[442]  HK2QS7R01ADASO  Seq[426]  Seq[233]  Seq[374]  Seq[416]  …
…
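The slides do not describe how the groups are formed. One plausible reading of the listing above is that each row starts with a representative sequence followed by the sequences judged similar to it. The sketch below shows a simple greedy scheme of that kind, purely as an assumption: `group_sequences`, `edit_distance` and the threshold `max_diff` are hypothetical names and choices, not the author's grouping algorithm.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Plain edit distance with both strings anchored at their start."""
    prev = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, b_char in enumerate(b, start=1):
            curr[j] = min(
                prev[j - 1] + (a_char != b_char),
                prev[j] + 1,
                curr[j - 1] + 1,
            )
        prev = curr
    return prev[-1]


def group_sequences(seqs: List[str], max_diff: int) -> Dict[int, List[int]]:
    """Greedy grouping: map each representative's index to its members.

    A sequence joins the first group whose representative is within
    max_diff differences; otherwise it starts a new group.
    """
    groups: Dict[int, List[int]] = {}
    for idx, seq in enumerate(seqs):
        for leader in groups:
            if edit_distance(seqs[leader], seq) <= max_diff:
                groups[leader].append(idx)
                break
        else:
            groups[idx] = []  # no close representative: new group
    return groups


# Made-up toy sequences, for illustration only.
toy = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "ACGAACGT", "TTTTCCCG"]
print(group_sequences(toy, max_diff=1))
```

A scheme like this compares each sequence only against group representatives, so it needs fewer comparisons than all pairs when groups are large, at the price of results that depend on input order.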

Results
• O(n²), where n is the number of sequences
• Comparisons for n sequences = (n-1)n/2
• ~1600 comparisons per second
• ~8.6 hours for the sequence set (down from 10 days)
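The running time follows from the comparison count and the comparison rate. The sketch below works through that arithmetic. The value n = 10 000 is an assumption, because the transcript does not preserve the exact number of sequences; the function names are hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n sequences: (n - 1) * n / 2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, comparisons_per_second: float) -> float:
    """Time for all pairwise comparisons at a fixed rate, in hours."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600


n = 10_000   # assumed data-set size, not stated in the transcript
rate = 1600  # comparisons per second, from the slide
print(pairwise_comparisons(n), "comparisons")
print(f"{estimated_hours(n, rate):.2f} hours")
```

With these inputs the estimate is a little under 8.7 hours, in line with the ~8.6 hours reported. Because the count grows quadratically, doubling the number of sequences roughly quadruples the running time.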
