Mark Vorster Supervisor: Prof Philip Machanick. Research Overview Goal  Aid bioinformaticians in research by providing a tool which can identify similar.


1 Mark Vorster Supervisor: Prof Philip Machanick

2 Research Overview
Goal
- Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences in order to infer homogeneity, in a timely manner.
Reason for problems
- Large data sets
- Days of processing
- No existing specific tools
Bioinformatics | String Matching | Discussion | Research Overview | Questions

3 Bioinformatics
"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."
Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."
Oxford English Dictionary

4 History of Bioinformatics and Genetics
- 1953 - Watson, Crick, Wilkins and Franklin.
- Discrete abstraction
  - Adenine – Thymine
  - Guanine – Cytosine
[Figure: DNA double helix, labelled with the sugar-phosphate backbone, a base, the hydrogen bonds, and "One helical turn = 3.4 nm". Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]

5 Sequence Analysis and Sequence Alignment
- Sequence alignment
- Global alignment is expensive
- Assumption: sequences are already globally aligned
- Phylogenetic inference

Alignment differences (original sequence: TGAGCACCT)
- Insertion:   TGA C GCACCT
- Deletion:    TGA_CACCT
- Replacement: TGA T CACCT

6 FASTA File Format
- Leading '>'
- Sequence identifier
- Description or comment
- A number of lines of genetic code
- Other symbols

>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
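
The FASTA layout on this slide can be read with a few lines of code. The following is a minimal sketch, not the tool from the talk: the function name `parse_fasta` is hypothetical, the identifier is assumed to be the first word after '>', and the "other symbols" bullet is not handled.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its sequence, joined across lines.

    Assumption: the identifier is the first whitespace-delimited word after
    the leading '>'; the rest of the header line is a description and is ignored.
    """
    parts: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split(maxsplit=1)
            current_id = header[0] if header else ""
            parts[current_id] = []
        elif current_id is not None:
            parts[current_id].append(line)  # one line of genetic code
    return {name: "".join(chunks) for name, chunks in parts.items()}


example = """>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
>NextSequence description or comment
ACGCCTGATTACCTGC
"""
records = parse_fasta(example.splitlines())
print(records["SequenceName"])  # CCGGAATACCTAGGACGCCTTCATCCCCCGCC
```

Joining the lines of each record gives one string per sequence, which is the form a string-matching routine needs.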

7 Approximate String Matching Algorithm
- Nested loops are inefficient
- Dynamic programming
- Takes into account all previous information
- Improved to O(n^2), where n is the number of bases in the shorter sequence
- Goal: find the closest match between two strings, or the minimum number of differences

8 Approximate String Matching Algorithm
D[i][j] is the minimum of:
- MatchCost = D[i-1][j-1], if p_i = t_j
- ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
- InsertCost = D[i-1][j] + 1
- DeleteCost = D[i][j-1] + 1
Boundary conditions: D[0][j] = 0 and D[i][0] = i

Cells used to compute D[i][j]:
D[i-1][j-1]  D[i-1][j]
D[i][j-1]    D[i][j]
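
The recurrence above can be sketched directly in code. This is an illustrative version under stated assumptions, not the author's implementation: the names `approx_match_table` and `best_match` are hypothetical, the pattern p indexes the rows, the text t indexes the columns, and matching is case-sensitive.

```python
from typing import List, Tuple


def approx_match_table(pattern: str, text: str) -> List[List[int]]:
    """Fill D using the slide's rules, with D[0][j] = 0 and D[i][0] = i."""
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]  # row 0 stays all zeros
    for i in range(1, rows):
        D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)  # MatchCost / ReviseCost
            insert_cost = D[i - 1][j] + 1
            delete_cost = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


def best_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (fewest differences, text column where that match ends)."""
    last_row = approx_match_table(pattern, text)[-1]
    cost = min(last_row)
    return cost, last_row.index(cost)


print(best_match("happy", "Have a hsppy day"))  # (1, 12)
```

Because the top row is all zeros, a match may start anywhere in the text; the smallest value in the bottom row is the number of differences of the closest match. For the slides' example, "happy" against "Have a hsppy day", that value is 1, ending at the "y" of "hsppy".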

9 Approximate String Matching Algorithm
Text across the top: "Have a hsppy day" (␣ marks a space). Pattern down the side: "happy".

         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1
a     2
p     3
p     4
y     5

10 Approximate String Matching Algorithm
Table partly filled in; the next cell to compute is marked "?".

         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1
a     2  2  1  2  2  2  1
p     3  3  2  2  3  3  ?
p     4  4  3  3  3  4
y     5  5  4  4  4  4

Rules (p_i is the pattern character of row i, t_j the text character of column j):
- MatchCost = D[i-1][j-1], if p_i = t_j
- ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
- InsertCost = D[i-1][j] + 1
- DeleteCost = D[i][j-1] + 1

For the marked cell:
- MatchCost = N/A
- ReviseCost = 3
- InsertCost = 2
- DeleteCost = 4
-> Min = 2

11 Approximate String Matching Algorithm
         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1  1  0  1  1  1  1
a     2  2  1  2  2  2  1  2  1  1  2  2  2
p     3  3  2  2  3  3  2  2  2  2  1  2  3
p     4  4  3  3  3  4  3  3  3  3  2  1  2
y     5  5  4  4  4  4  4  4  4  4  3  2  1

12 Approximate String Matching Algorithm
Changes
- D[i][0] = i, if p_i = t_0
- D[i][0] = i + 1, if p_i ≠ t_0
- D[0][j] = j, if p_0 = t_j
- D[0][j] = j + 1, if p_0 ≠ t_j
- Additional stop case for mismatch
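
A minimal sketch of how the changed boundary conditions might look in code, assuming the inner recurrence is unchanged and that rows and columns are now indexed from the first character of each sequence. The name `aligned_distance_table` is hypothetical, and the "additional stop case for mismatch" is not specified on the slide, so it is left out here.

```python
from typing import List


def aligned_distance_table(p: str, t: str) -> List[List[int]]:
    """Table for two non-empty sequences assumed to begin at the same place.

    Row i / column j correspond to p[i] / t[j]; there is no separate NULL row.
    """
    rows, cols = len(p), len(t)
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i if p[i] == t[0] else i + 1  # changed first column
    for j in range(cols):
        D[0][j] = j if p[0] == t[j] else j + 1  # changed first row
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = D[i - 1][j - 1] + (0 if p[i] == t[j] else 1)
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


table = aligned_distance_table("TACGAAGGGA", "TACGGACGGT")
print(table[0])      # [0, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(table[1][:7])  # [2, 0, 1, 2, 3, 4, 5]
```

With these boundaries the first row and column grow with distance from the start, so a match can no longer begin at an arbitrary position in the other sequence. The two printed rows agree with the TACGGACGGT table on the next slide.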

13 Approximate String Matching Algorithm
      T  A  C  G  G  A  C  G  G  T
T     0  2  3  4  5  6  7  8  9  9
A     2  0  1  2  3  4  5
C     3  1  0  1  2  3  4
G     4  2  1  0  1  2  3
A     5  3  2  1  1  1  2
A     6  4  3  2  2  1  2
G     7  5  4  3  2  2  2
G     8  6  5  4  3  3  3
G     9  7  6  5  4  4  4
A    10  8  7  6  5  4  5

14 Discussion
- Grouping algorithm
- Scale of the problem
  - 400 – 800 bases per sequence
  - Tens of thousands of sequences
- Assumptions:
  - Sequences are globally aligned
  - Sequences begin at the same place

15 Example Grouping
Seq[336]  HK2QS7R01AXRJ6  Seq[218]  Seq[38]   Seq[235]  Seq[89]   …
Seq[382]  HK2QS7R01BR4Q9  Seq[173]
Seq[180]  HK2QS7R01ABFDP  Seq[339]  Seq[289]  Seq[491]  Seq[319]  …
Seq[269]  HK2QS7R01AZHD7  Seq[402]  Seq[112]  Seq[203]  Seq[137]  …
Seq[210]  HK2QS7R01BMNQ4  Seq[364]
Seq[270]  HK2QS7R01AZFOG  Seq[388]  Seq[441]
Seq[442]  HK2QS7R01ADASO  Seq[426]  Seq[233]  Seq[374]  Seq[416]  …
…         …

16 Results
- O(n^2), where n is the number of sequences.
- ~1600 comparisons per second.
- 10000 sequences: ~8.6 hours (down from 10 days).
- Comparisons for n sequences = (n-1)n/2
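
The figures on this slide can be checked with a short calculation. The sketch below only reproduces the arithmetic; the function names are hypothetical and the rate of 1600 comparisons per second is taken from the slide as given.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n sequences: (n-1)n/2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, comparisons_per_second: float = 1600.0) -> float:
    """Running time in hours at a fixed comparison rate."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600.0


print(pairwise_comparisons(10_000))       # 49995000
print(round(estimated_hours(10_000), 2))  # 8.68
```

About 50 million comparisons at 1600 per second come to roughly 8.7 hours, in line with the ~8.6 hours quoted. Since the count grows with n^2, doubling the number of sequences roughly quadruples the running time.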

17 Questions

