First discussion section agenda

Presentation transcript:

First discussion section agenda
- Introductions
- HW1 context/advice/questions
- General programming tips
- Suggestions for future topics

Introductions
Who am I?
- 4th year Genome Sciences student
- Trapnell Lab
- Macrophage polarization
- Single-cell genomics
- Changes in gene expression/accessibility over time
- Python/R/Java
Who are you?
- Department
- Programming experience
- Language of choice?

HW1
Assignment: find the longest exactly matching subsequence (a contiguous run of bases) between two bacterial genome sequences using suffix arrays.
Due: 11:59pm on Sunday, January 14th

HW1
Genome A (N bases): AATGC… …GGA
Genome B (M bases): CTTAT… …ACC
Rev. complement A (N bases): TCC… …GCATT
Rev. complement B (M bases): GGT… …ATAAG
- Reverse complementation is explained in slide 37 of the biological review.
Goal: find the longest subsequence in A (or the reverse complement of A) with an exact match in B (or the reverse complement of B).
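As a minimal sketch of reverse complementation, assuming the input contains only uppercase A, C, G and T (the function name `reverse_complement` is illustrative and not part of the assignment):

```python
# Translation table mapping each base to its complement.
# Assumption: sequences contain only uppercase A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the result.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AATGC"))  # GCATT
print(reverse_complement("CTTAT"))  # ATAAG
```

The two printed values agree with the sequence ends shown on the slide (…GCATT for A and …ATAAG for B).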

HW1
One approach: create a single combined sequence containing both genome sequences and their reverse complements, separated by the null character (\0).
Genome A (N bases): AATGC…GGA
Rev. comp. A (N bases): TCC…GCATT
Genome B (M bases): CTTAT…ACC
Rev. comp. B (M bases): GGT…ATAAG
NOTE: sequences and their reverse complements should only be stored in memory ONCE. Do NOT store separate copies of each substring sequence (or you will run out of memory).
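One possible way to build such a combined sequence is sketched below. The order (A, reverse complement of A, B, reverse complement of B) follows the diagram on the next slide. The helper names `build_combined` and `b_start` are assumptions made for illustration. Recording where genome B begins lets later steps tell which organism a position belongs to.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement; assumes uppercase A/C/G/T only."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_combined(genome_a, genome_b):
    """Join A, rc(A), B, rc(B), each followed by a null character.

    Returns the combined string and the index where genome B starts
    (hypothetical bookkeeping, used to tell the two organisms apart).
    """
    parts = [
        genome_a,
        reverse_complement(genome_a),
        genome_b,
        reverse_complement(genome_b),
    ]
    combined = "".join(part + "\0" for part in parts)
    # A and rc(A) each occupy len(genome_a) bases plus one terminator.
    b_start = 2 * (len(genome_a) + 1)
    return combined, b_start


combined, b_start = build_combined("AATGC", "CTTAT")
print(b_start)  # 12
```

Everything before `b_start` comes from organism A and everything from `b_start` onward comes from organism B, so a single integer comparison identifies the source of any suffix.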

HW1
1. Create a list of pointers to the suffixes of the genome sequences and their reverse complements.
Combined sequence: AATGC…GGA \0 TCC…GCATT \0 CTTAT…ACC \0 GGT…ATAAG \0
Pointers: p1 p2 p3 … pN+1 pN+2 pN+3 …
Each pointer refers to the location in the sequence where a suffix starts.

HW1
2. When looking at two pointers, compare them based on the sequence of the substring that they point to.
Combined sequence: AATGC…GGA \0 TCC…GCATT \0 CTTAT…ACC \0 GGT…ATAAG \0
Pointers: p1 p2 p3 … pN+1 pN+2 pN+3 …

HW1
3. Sort the suffixes lexicographically.
Combined sequence: AATGC…GGA \0 TCC…GCATT \0 CTTAT…ACC \0 GGT…ATAAG \0
Pointers: p1 p2 p3 … pN+1 pN+2 pN+3 …
REMEMBER: pointers are just references to locations in a string. They are NOT substrings, but we sort them based on the substrings they point to.
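Steps 1 to 3 can be sketched in Python as follows, treating integer indices as the "pointers". The comparison walks the shared text character by character, so no suffix is ever copied. This is an illustrative sketch for small inputs, not a tuned solution: a pure-Python comparator will likely be slow on full genomes. The names `compare_suffixes` and `build_suffix_array` are hypothetical.

```python
from functools import cmp_to_key


def compare_suffixes(text, i, j):
    """Compare the suffixes starting at i and j without copying them.

    Returns a negative, zero, or positive number, like a C comparator.
    """
    n = len(text)
    while i < n and j < n:
        if text[i] != text[j]:
            return -1 if text[i] < text[j] else 1
        i += 1
        j += 1
    # One suffix ran out first; the shorter suffix sorts first.
    return (i < n) - (j < n)


def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order."""
    pointers = range(len(text))  # step 1: one pointer per suffix
    comparator = cmp_to_key(lambda i, j: compare_suffixes(text, i, j))  # step 2
    return sorted(pointers, key=comparator)  # step 3


print(build_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Only the list of integers is rearranged by the sort. The text itself stays in memory once, which is the point of the REMEMBER note above.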

HW1
4. Find matching subsequences using the list of sorted suffixes.
Combined sequence: AATGC…GGA \0 TCC…GCATT \0 CTTAT…ACC \0 GGT…ATAAG \0
4a. Compare the subsequences associated with the first two pointers and find the matching subsequence.
Example: AATGC… …GGA \0 compared with AAG: match sequence is “AA”.
4b. Continue comparing pairs of pointers, keeping track of those with the longest match sequences.
Example: ATGC… …GGA \0 compared with AAG: match sequence is “A”.
4c. If there are multiple match sequences that have the maximum length, report all of them. If a match sequence appears in more than one location, report all occurrences of the match sequence.
NOTE: we want the longest matching subsequence that appears at least once in both organisms. (The slide diagram labels three adjacent pairs of sorted pointers NO, YES, YES.)

HW1 tips
- Plan out your algorithm with pseudocode
- Think about what comparisons you need (and don’t need) to make
- Get comfortable with pointers
- Think about how to store inputs
- Think about how to store results (and intermediate results)
- Try to format your output to match the template
- Start early (especially if using Python)
- If submitting an incomplete assignment, demonstrate the parts that do work (e.g. can read in file, works on test data but not on full dataset, etc.)

Programming tips: style
The more readable your code is:
- The easier it will be for me to help
- The more useful it will be to you later (especially if you TA this class)
Tips for readability:
- Intuitive and meaningful variable/function names
- Comments
  - Outlining the general structure of the program/key points of the implemented algorithm
  - Clarifying any tricky/unintuitive lines of code
- Simplicity over performance optimization (until it becomes necessary)
Please make an effort to match the output template!

Programming tips: testing
- Create small, easily verified test cases
  - Try to cover any edge cases you can think of
- Print intermediate output
  - Is the processed data as expected?
- Write incrementally, test as you go
- Assertion statements can be helpful
- Check against expectations
  - Do your results make sense?
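One way to apply these tips is to keep a slow but obviously correct reference implementation and check it on tiny inputs with assertions. The brute-force function below is an illustrative assumption, not part of the assignment. It ignores reverse complements and is only practical for very short strings.

```python
def brute_force_longest_common(a, b):
    """Return (length, set of longest substrings of a that also occur in b).

    Tries every substring of a, longest first. Far too slow for genomes,
    but easy to verify by hand on small test cases.
    """
    for length in range(min(len(a), len(b)), 0, -1):
        found = {
            a[i:i + length]
            for i in range(len(a) - length + 1)
            if a[i:i + length] in b
        }
        if found:
            return length, found
    return 0, set()


# Small, hand-checkable cases, including edge cases.
assert brute_force_longest_common("AATGC", "TTGCA") == (3, {"TGC"})
assert brute_force_longest_common("ACGT", "ACGT") == (4, {"ACGT"})  # identical
assert brute_force_longest_common("AAAA", "CCCC") == (0, set())  # no match
assert brute_force_longest_common("", "ACGT") == (0, set())  # empty input
print("all checks passed")
```

The same small cases can then be fed to the suffix-array version, and any disagreement points to a bug before the full dataset is ever loaded.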

Programming tips: efficiency
- Remove unnecessary operations from loops
- Slow comparisons mean slow sorting
- Profiling tools
  - line_profiler (Python)
  - gprof, valgrind (C/C++) [valgrind also identifies memory leaks]
  - JProfiler (Java)
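The sketch below illustrates the first tip and shows one way to profile a function. It uses `cProfile` and `pstats` from the Python standard library rather than the line_profiler tool named on the slide, purely to keep the example free of third-party dependencies. The function names are made up for the example.

```python
import cProfile
import io
import pstats

GC_BASES = frozenset("GC")  # built once, outside any loop


def count_gc_slow(seq):
    """Count G/C bases, rebuilding the lookup set on every iteration."""
    count = 0
    for base in seq:
        if base in set("GC"):  # unnecessary work repeated inside the loop
            count += 1
    return count


def count_gc_fast(seq):
    """Count G/C bases using the lookup set built once at import time."""
    count = 0
    for base in seq:
        if base in GC_BASES:
            count += 1
    return count


def profile_report(func, seq):
    """Run func(seq) under cProfile and return the stats table as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(seq)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


if __name__ == "__main__":
    toy = "GATTACA" * 100_000
    print(profile_report(count_gc_slow, toy))
    print(profile_report(count_gc_fast, toy))
```

Comparing the two reports shows how much time the repeated set construction costs. The same reasoning applies to a suffix comparison function, which is called many times during a sort.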

Suggestions for future discussion topics?
- BLAST/multiple alignment
- Additional applications of HMMs (GENSCAN)
- Dynamic Bayesian Networks
- Frequentist vs. Bayesian statistics, probabilities vs. likelihoods
- Dynamic programming
- More programming tips
- Plotting with ggplot (R)
- Machine learning
- Other suggestions?