Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some given substrings (with length K) in that string. For.


1 Lab 6 Problem 1: DNA

2 DNA Given a string with length N, determine the number of occurrences of some given substrings (with length K) in that string. For instance, String: ACGTAC (N = 6), Substring: AC (K = 2). Answer: AC occurs 2 times in ACGTAC.

3 DNA A substring is a consecutive part of a string. Note that AG is not a substring of ACGTAC.

4 Brute-force Algorithm For each query, iterate through the entire string. For each position in the string, check the substring starting there against the query, and increment the count on a match.

5 DNA (70%)
for (int i = 0; i + K <= N; i++) { // only start where a full length-K window fits
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) { // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++; // all K characters matched at position i
}

6 DNA (70%) Solution: for every query, check the substring of length K starting at each index i. One query takes O(N·K), so Q queries take O(Q·N·K).

7 DNA (100%) Java HashTable

8 DNA (100%) Key: substring. Value: number of occurrences of that substring. Iterate through the string once to populate the hash table: O(N·K). Each query then takes constant time.

9 DNA (100%) ACGTAC Store the substrings as key. AC, CG, GT, TA.

10 DNA (100%) We will have: occur[AC] = 2, occur[CG] = 1, occur[GT] = 1, occur[TA] = 1.
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}

11 DNA (100%) After we have built the table, we can answer a query in O(1) by searching the hash table with the query as the key.
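The hash table approach above can be sketched as follows. This is a minimal illustration, assuming java.util.HashMap as the table; the class name DnaCounter and the method names buildTable and query are hypothetical and do not come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class DnaCounter {
    // Count every length-k window of text. Building costs O(N*K) because
    // each substring() call copies k characters.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // A query is a single lookup; substrings that never occur count as 0.
    static int query(Map<String, Integer> occur, String pattern) {
        return occur.getOrDefault(pattern, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = buildTable("ACGTAC", 2);
        System.out.println(query(occur, "AC")); // 2
        System.out.println(query(occur, "AG")); // 0
    }
}
```

The table is built once and shared by all Q queries, which is what removes the factor Q from the O(Q·N·K) brute-force cost.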

12 Alternative What if we do not have Java Hash Table API?

13 DNA – V2 Implement our own hash table! Since K is very small, we can use a simple hash function and an array as the table.

14 DNA-V2 Hash function? First, we map A to 1, C to 2, G to 3, T to 4. (we only have A, C, G, and T in DNA sequence).

15 DNA-V2 ACGTAC We only need to store the number related to the substring. AC = 12, CG = 23, GT = 34, TA = 41.

16 DNA-V2 We will have: occur[12] = 2, occur[23] = 1, occur[34] = 1, occur[41] = 1.
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}

17 DNA (100%) After we have built the table, we can answer a query in O(K): calculate the hash value X of the substring in that query, then output the value in occur[X].
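A possible sketch of the array-based version, using the A = 1, C = 2, G = 3, T = 4 mapping from the slides and reading the K codes as one decimal number (so AC = 12). The class and method names are hypothetical. The sketch assumes K is small enough that an array of 10^K counters fits in memory, and that every query has exactly length K.

```java
class DnaArrayTable {
    // Slide mapping: A -> 1, C -> 2, G -> 3, T -> 4.
    static int code(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA letter: " + c);
        }
    }

    // Read k letters starting at 'start' as a decimal number, e.g. "AC" -> 12.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + code(s.charAt(start + j));
        }
        return h;
    }

    // One counter per possible k-digit number (assumes small k).
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) size *= 10;
        int[] occur = new int[size];
        for (int i = 0; i + k <= text.length(); i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // O(K) per query: hash the pattern, then index the array.
    static int query(int[] occur, String pattern) {
        return occur[hash(pattern, 0, pattern.length())];
    }
}
```

Because distinct substrings of the same length map to distinct numbers, this table has no collisions. Most of the 10^K slots stay unused, which is the price of the simple decimal encoding.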

18 Problem 2: Find Substring

19 Find Substring Given 2 strings, output for each query substring: 0 if it is in neither string, 1 if it is only in string 1, 2 if it is only in string 2, 3 if it is in both strings.

20 Find Substring (70%) Check the existence of a substring in both strings to determine the answer. You might notice that this problem is very similar to the DNA problem, i.e. a substring is in a string if its number of occurrences is greater than 0. It can be solved using the same technique as DNA (70%).

21 Find Substring (100%) It is possible to reuse the solution for DNA If the number of occurrences of a substring in a given string > 0, it means that we can find the substring in the string. You need 2 tables, one for the first string and another one for the second string

22 Find Substring (100%) For example, we have 2 strings, i.e. ACGTAC and ACTGCA Use the same technique as the one in DNA

23 Find Substring (100%) After we have built the tables, we can answer a query in O(1), e.g. check occur1.get("AC") and occur2.get("AC").
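The two-table idea might look like the sketch below. It assumes, as in the DNA problem, that all queries share one length K; the class FindSubstring and its methods are hypothetical names, and java.util.HashMap stands in for the hash table.

```java
import java.util.HashMap;
import java.util.Map;

class FindSubstring {
    // Same table as in the DNA problem: length-k substring -> occurrence count.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // 0: in neither string, 1: only in string 1, 2: only in string 2, 3: in both.
    static int answer(Map<String, Integer> occur1, Map<String, Integer> occur2, String query) {
        int result = 0;
        if (occur1.getOrDefault(query, 0) > 0) result += 1;
        if (occur2.getOrDefault(query, 0) > 0) result += 2;
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> occur1 = buildTable("ACGTAC", 2);
        Map<String, Integer> occur2 = buildTable("ACTGCA", 2);
        System.out.println(answer(occur1, occur2, "AC")); // 3
        System.out.println(answer(occur1, occur2, "CG")); // 1
        System.out.println(answer(occur1, occur2, "CT")); // 2
        System.out.println(answer(occur1, occur2, "AA")); // 0
    }
}
```

Adding 1 for the first string and 2 for the second produces exactly the four output codes, so no separate case analysis is needed.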

24 Incantation-E

25 Task Find an interval (continuous section) that ◦ contains all incantations ◦ has minimal total length {acer, wei, wei, acer, acer, jing, acer, wei}

26 Idea Maintain the interval using a queue ◦ Step 1: Initially empty: {[] acer, wei, wei, acer, acer, jing, acer, wei} ◦ Step 2: While the queue does not contain all words, add words at the back of the queue → {[acer, wei, wei, acer, acer, jing], acer, wei}

27 Idea ◦ Step 3: While the front of the queue is redundant, pop it out, and update the minimum total length → {acer, wei, [wei, acer, acer, jing], acer, wei}, min = 15 ◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue, and go to Step 3 ◦ Final answer: {acer, wei, wei, acer, acer, [jing, acer, wei]}, min = 11

28 Time Complexity: O(N). How to check whether the first word in the queue is redundant? ◦ Use hashing to store each word’s number of occurrences in the queue. The front word is redundant if its count is greater than 1.
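The queue steps can be sketched as below. This is one possible reading of the slides, not the reference solution: it assumes "all incantations" means every distinct word in the list, and that total length is the sum of the word lengths (which matches min = 15 and min = 11 in the example). It also pops redundant front words as soon as they appear rather than waiting until the queue first covers all words, which gives the same minimum. The names Incantation and minTotalLength are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Incantation {
    // Returns the minimal total length of a contiguous run of words that
    // contains every distinct word; Integer.MAX_VALUE for an empty list.
    static int minTotalLength(String[] words) {
        Set<String> all = new HashSet<>(Arrays.asList(words));
        Map<String, Integer> inQueue = new HashMap<>(); // word -> count inside the queue
        Deque<String> queue = new ArrayDeque<>();
        int total = 0;                 // sum of word lengths currently in the queue
        int best = Integer.MAX_VALUE;

        for (String w : words) {
            // Add the next word at the back of the queue.
            queue.addLast(w);
            inQueue.merge(w, 1, Integer::sum);
            total += w.length();

            // The front word is redundant if it occurs again later in the queue.
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                total -= front.length();
            }

            // Counts never drop to 0, so the map size is the number of
            // distinct words currently covered by the queue.
            if (inQueue.size() == all.size()) {
                best = Math.min(best, total);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"acer", "wei", "wei", "acer", "acer", "jing", "acer", "wei"};
        System.out.println(minTotalLength(words)); // 11
    }
}
```

Each word enters and leaves the queue at most once and every hash lookup is expected constant time, which is where the O(N) bound comes from.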

