

Lab 6 Problem 1: DNA

DNA Given a string of length N, determine the number of occurrences of some given substrings (each of length K) in that string. For instance, String: ACGTAC (N = 6) Substring: AC (K = 2) Answer: AC occurs 2 times in ACGTAC.

DNA A substring is a consecutive part of a string. Note that AG is not a substring of ACGTAC.

Brute-force Algorithm For each query, iterate through the entire string. At each position, compare the substring of length K starting there with the query, and increment the count on a match.

DNA (70%)
for (int i = 0; i < N - K + 1; i++) { // stop early so that i + j never runs past the end of text
    boolean found = true;
    for (int j = 0; j < K; j++) {
        if (text[i + j] != pattern[j]) { // character mismatch
            found = false;
            break;
        }
    }
    if (found) counter++;
}

DNA (70%) Solution: For every query, we check the substring of length K starting at each index i. We can answer one query in O(NK). Hence, with Q queries, the time complexity is O(QNK).
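A runnable sketch of the 70% approach, assuming the text and queries are plain Java Strings rather than char arrays. The class name DnaBruteForce and the method name countOccurrences are illustrative and not from the lab.

```java
class DnaBruteForce {
    // Counts how many times pattern occurs in text (overlapping matches included).
    // Each call costs O(N * K), so Q queries cost O(Q * N * K).
    static int countOccurrences(String text, String pattern) {
        int n = text.length();
        int k = pattern.length();
        int counter = 0;
        for (int i = 0; i + k <= n; i++) {
            // regionMatches compares text[i .. i+k-1] with pattern[0 .. k-1]
            if (text.regionMatches(i, pattern, 0, k)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String text = "ACGTAC";
        String[] queries = {"AC", "CG", "AG"};
        for (String q : queries) {
            System.out.println(q + ": " + countOccurrences(text, q));
        }
    }
}
```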

DNA (100%) Java hash table (java.util.Hashtable or java.util.HashMap)

DNA (100%) Key: substring. Value: number of occurrences of that substring. Iterate through the string once to populate the hash table: O(NK). Each query then takes constant time (hashing the key costs O(K), and K is small).

DNA (100%) Example: ACGTAC. Store the substrings as keys: AC, CG, GT, TA.

DNA (100%) We will have: occur[AC] = 2, occur[CG] = 1, occur[GT] = 1, occur[TA] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}

DNA (100%) After we have built the table, we can answer a query in O(1) by looking it up in the hash table, using the query as the key.
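The table-building loop and the lookup can be sketched with java.util.HashMap as follows. This is one possible reading of the slides, not the lab's reference solution; the names DnaHashMap, buildTable and query are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class DnaHashMap {
    // Builds the table: key = substring of length k, value = number of occurrences.
    // One pass over the text; each substring() call copies k characters, so O(N * K).
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i < text.length() - k + 1; i++) {
            // merge inserts 1 for a new key, or adds 1 to the existing count
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Answers one query; a substring that never occurs is absent from the map, so default to 0.
    static int query(Map<String, Integer> occur, String pattern) {
        return occur.getOrDefault(pattern, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> occur = buildTable("ACGTAC", 2);
        System.out.println("AC: " + query(occur, "AC"));
        System.out.println("AG: " + query(occur, "AG"));
    }
}
```

The table is built once and reused for all Q queries, which is where the saving over the brute-force version comes from.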

Alternative What if we do not have the Java hash table API?

DNA-V2 Implement our own hash table! Since K is very small, we can use a simple hash function and an array as the table.

DNA-V2 Hash function? First, we map A to 1, C to 2, G to 3, and T to 4 (a DNA sequence contains only A, C, G, and T).

DNA-V2 Example: ACGTAC. We only need to store the number corresponding to each substring, obtained by reading the mapped digits as a decimal number: AC = 12, CG = 23, GT = 34, TA = 41.

DNA-V2 We will have: occur[12] = 2, occur[23] = 1, occur[34] = 1, occur[41] = 1
for (int i = 0; i < N - K + 1; i++) {
    occur[hash(i, K)]++; // increment the count of the substring of length K starting at index i
}

DNA (100%) After we have built the table, we can answer a query in O(K) by calculating the hash value X of the substring in that query and outputting the value in occur[X].
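A minimal sketch of the array-based table, following the decimal mapping from the slides (A = 1, C = 2, G = 3, T = 4). The class and method names are illustrative, and the array size of 10^K is an assumption chosen so that every K-digit hash value is a valid index. It also assumes every query has exactly length K.

```java
class DnaArrayHash {
    // Maps a DNA base to its digit, as in the slides.
    static int code(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("not a DNA base: " + c);
        }
    }

    // Hash of the substring of length k starting at index start: its digits read as a decimal number.
    static int hash(String s, int start, int k) {
        int h = 0;
        for (int j = 0; j < k; j++) {
            h = h * 10 + code(s.charAt(start + j));
        }
        return h;
    }

    // Builds the occurrence table; size 10^k covers every possible k-digit hash (assumed, k is small).
    static int[] buildTable(String text, int k) {
        int size = 1;
        for (int j = 0; j < k; j++) {
            size *= 10;
        }
        int[] occur = new int[size];
        for (int i = 0; i < text.length() - k + 1; i++) {
            occur[hash(text, i, k)]++;
        }
        return occur;
    }

    // Answers one query in O(K): hash the query, then read the table.
    static int query(int[] occur, String pattern) {
        return occur[hash(pattern, 0, pattern.length())];
    }

    public static void main(String[] args) {
        int[] occur = buildTable("ACGTAC", 2);
        System.out.println("occur[12] = " + occur[12]);
        System.out.println("AG: " + query(occur, "AG"));
    }
}
```

Because distinct substrings always produce distinct numbers, this hash function has no collisions. Most of the 10^K slots stay unused, since only the digits 1 to 4 ever appear; a base-4 encoding would be denser, at the cost of no longer matching the numbers shown on the slides.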

Problem 2: Find Substring

Find Substring Given 2 strings and a query substring: Output 0 if the substring is in neither string. Output 1 if the substring is only in string 1. Output 2 if the substring is only in string 2. Output 3 if the substring is in both strings.

Find Substring (70%) Check the existence of the substring in both strings to determine the answer. You might notice that this problem is very similar to the DNA problem: a substring is in a string if its number of occurrences is greater than 0. It can be solved using the same technique as DNA (70%).

Find Substring (100%) It is possible to reuse the solution for DNA. If the number of occurrences of a substring in a given string is greater than 0, the substring can be found in that string. You need 2 tables, one for the first string and another for the second string.

Find Substring (100%) For example, we have 2 strings, e.g. ACGTAC and ACTGCA. Use the same technique as the one in DNA to build a table for each.

Find Substring (100%) After we have built the tables, we can answer a query in O(1). E.g. check occur1.get("AC") and occur2.get("AC").
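The two-table idea can be sketched as below. The names FindSubstring, buildTable and answer are hypothetical, and the sketch assumes all queries have the same length K, as in the DNA problem; the slides do not state this explicitly.

```java
import java.util.HashMap;
import java.util.Map;

class FindSubstring {
    // Same table as in the DNA problem: substring of length k -> number of occurrences.
    static Map<String, Integer> buildTable(String text, int k) {
        Map<String, Integer> occur = new HashMap<>();
        for (int i = 0; i < text.length() - k + 1; i++) {
            occur.merge(text.substring(i, i + k), 1, Integer::sum);
        }
        return occur;
    }

    // Returns 0 (in neither), 1 (only in string 1), 2 (only in string 2) or 3 (in both).
    static int answer(Map<String, Integer> occur1, Map<String, Integer> occur2, String query) {
        int result = 0;
        if (occur1.getOrDefault(query, 0) > 0) {
            result += 1; // found in string 1
        }
        if (occur2.getOrDefault(query, 0) > 0) {
            result += 2; // found in string 2
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> occur1 = buildTable("ACGTAC", 2);
        Map<String, Integer> occur2 = buildTable("ACTGCA", 2);
        for (String q : new String[] {"AC", "CG", "CT", "AA"}) {
            System.out.println(q + ": " + answer(occur1, occur2, q));
        }
    }
}
```

Adding 1 for the first string and 2 for the second yields exactly the four required outputs without a chain of if/else branches.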

Incantation-E

Task Find an interval (contiguous section) that
◦ contains all incantations
◦ has minimal total length
Example: {acer, wei, wei, acer, acer, jing, acer, wei}

Idea Maintain the interval using a queue
◦ Step 1: Initially the queue is empty: {[] acer, wei, wei, acer, acer, jing, acer, wei}
◦ Step 2: While the queue does not contain all words, add words at the back of the queue: {[acer, wei, wei, acer, acer, jing], acer, wei}

Idea
◦ Step 3: While the word at the front of the queue is redundant, pop it and update the minimum total length: {acer, wei, [wei, acer, acer, jing], acer, wei}, min = 15
◦ Step 4: If the end of the list has not been reached, add the next word at the back of the queue and go to Step 3
◦ Final answer: {acer, wei, wei, acer, acer, [jing, acer, wei]}, min = 11

Time Complexity: O(N). How to check whether the first word in the queue is redundant?
◦ Use hashing to store each word's number of occurrences in the queue. The front word is redundant if it occurs more than once in the queue.
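The four steps can be sketched as a single pass with a queue and a hash table of counts. Two points are assumptions, since the slides do not spell them out: "all incantations" is taken to mean every distinct word in the list, and "total length" is taken to be the sum of the word lengths (which matches min = 15 and min = 11 in the example). The names Incantation and minTotalLength are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Incantation {
    // Returns the minimal total word length of a contiguous interval containing every distinct word.
    static int minTotalLength(String[] words) {
        if (words.length == 0) {
            return 0;
        }
        Set<String> required = new HashSet<>(Arrays.asList(words)); // assumed: all distinct words
        Deque<String> queue = new ArrayDeque<>();                   // the current interval
        Map<String, Integer> inQueue = new HashMap<>();             // word -> occurrences in the queue
        int total = 0;                                              // sum of word lengths in the queue
        int best = Integer.MAX_VALUE;

        for (String w : words) {
            // Steps 2 and 4: add the next word at the back of the queue
            queue.addLast(w);
            inQueue.merge(w, 1, Integer::sum);
            total += w.length();

            // Step 3: the front word is redundant if it occurs more than once in the queue
            while (inQueue.get(queue.peekFirst()) > 1) {
                String front = queue.pollFirst();
                inQueue.merge(front, -1, Integer::sum);
                total -= front.length();
            }

            // Counts never drop to zero, so the map size is the number of distinct words in the queue
            if (inQueue.size() == required.size()) {
                best = Math.min(best, total);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] words = {"acer", "wei", "wei", "acer", "acer", "jing", "acer", "wei"};
        System.out.println(minTotalLength(words)); // prints 11
    }
}
```

Each word is added to the queue once and removed at most once, and every hash table operation takes expected constant time, which gives the O(N) bound.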