Loops continued and coding efficiently Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.

Slides:



Advertisements
Similar presentations
Repetition Statements Perform the same task repeatedly Allow the computer to do the tedious, boring things.
Advertisements

Dictionaries (aka hash tables or hash maps) Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
ThinkPython Ch. 10 CS104 Students o CS104 n Prof. Norman.
For loops Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
While loops.
While loops Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Introduction to Python
CSC 221: Computer Programming I Fall 2001  conditional repetition  repetition for: simulations, input verification, input sequences, …  while loops.
Designing Algorithms Csci 107 Lecture 4. Outline Last time Computing 1+2+…+n Adding 2 n-digit numbers Today: More algorithms Sequential search Variations.
Introduction to working with Loops  2000 Prentice Hall, Inc. All rights reserved. Modified for use with this course. Introduction to Computers and Programming.
Chapter 5: Control Structures II (Repetition)
CS150 Introduction to Computer Science 1
1 MATERI PENDUKUNG JUMP Matakuliah: M0074/PROGRAMMING II Tahun: 2005 Versi: 1/0.
Handling Errors Introduction to Computing Science and Programming I.
Designing Algorithms Csci 107 Lecture 4.
An Object-Oriented Approach to Programming Logic and Design Chapter 7 Arrays.
1 Lab Session-III CSIT-120 Spring 2001 Revising Previous session Data input and output While loop Exercise Limits and Bounds GOTO SLIDE 13 Lab session.
Fibonacci Problem Solving and Thinking in Engineering Programming H. James de St. Germain.
Loops and Iteration Chapter 5 Python for Informatics: Exploring Information
Instructor: Chris Trenkov Hands-on Course Python for Absolute Beginners (Spring 2015) Class #005 (April somthin, 2015)
COMPE 111 Introduction to Computer Engineering Programming in Python Atılım University
Lecture Set 5 Control Structures Part D - Repetition with Loops.
Lecture 8: Choosing the Correct Loop. do … while Repetition Statement Similar to the while statement Condition for repetition only tested after the body.
Logic Structure - focus on looping Please use speaker notes for additional information!
Control Structures Week Introduction -Representation of the theory and principles of structured programming. Demonstration of for, while,do…whil.
S2008Final_part1.ppt CS11 Introduction to Programming Final Exam Part 1 S A computer is a mechanical or electrical device which stores, retrieves,
CPS120 Introduction to Computer Programming The Programming Process.
9-1 COBOL for the 21 st Century Nancy Stern Hofstra University Robert A. Stern Nassau Community College James P. Ley University of Wisconsin-Stout (Emeritus)
© Copyright 1992–2004 by Deitel & Associates, Inc. and Pearson Education Inc. All Rights Reserved. 1 Flow Control (for) Outline 4.1Introduction 4.2The.
Control Structures CPS120: Introduction to Computer Science Lecture 5.
Copyright 2003 Scott/Jones Publishing Standard Version of Starting Out with C++, 4th Edition Chapter 5 Looping.
Programming Logic and Design Fourth Edition, Comprehensive Chapter 8 Arrays.
1 Chapter 9. To familiarize you with  Simple PERFORM  How PERFORM statements are used for iteration  Options available with PERFORM 2.
Computer Programming Control Structure
(C)opyright 2000 Scott/Jones Publishers Introduction to Flowcharting.
Counting Loops.
WHAT IS THE VALUE OF X? x = 0 for value in [3, 41, 12, 9, 74, 15] : if value < 10 : x = x + value print x.
1 Standard Version of Starting Out with C++, 4th Brief Edition Chapter 5 Looping.
Python Mini-Course University of Oklahoma Department of Psychology Day 3 – Lesson 10 Iteration: Loops 05/02/09 Python Mini-Course: Day 3 - Lesson 10 1.
Topic: Control Statements. Recap of Sequence Control Structure Write a program that accepts the basic salary and allowance amount for an employee and.
Lecture 7: Menus and getting input. switch Multiple-selection Statement switch Useful when a variable or expression is tested for all the values it can.
LECTURE # 8 : REPETITION STATEMENTS By Mr. Ali Edan.
9. ITERATIONS AND LOOP STRUCTURES Rocky K. C. Chang October 18, 2015 (Adapted from John Zelle’s slides)
Copyright 2006 Addison-Wesley Brief Version of Starting Out with C++ Chapter 5 Looping.
Flow Control in Imperative Languages. Activity 1 What does the word: ‘Imperative’ mean? 5mins …having CONTROL and ORDER!
Introduction to Computing Systems and Programming Programming.
Computer Program Flow Control structures determine the order of instruction execution: 1. sequential, where instructions are executed in order 2. conditional,
For Loop GCSE Computer Science – Python. For Loop The for loop iterates over the items in a sequence, which can be a string or a list (we will discuss.
Follow up from lab See Magic8Ball.java Issues that you ran into.
Chapter 6: Loops.
while Repetition Structure
For loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Repetition Structures Chapter 9
Lesson 05: Iterations Class Chat: Attendance: Participation
Control Structures: Part 2
Python - Loops and Iteration
While loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
CPS120: Introduction to Computer Science
Lesson 05: Iterations Topic: Introduction to Programming, Zybook Ch 4, P4E Ch 5. Slides on website.
Chapter 8 The Loops By: Mr. Baha Hanene.
Iteration: Beyond the Basic PERFORM
Iteration: Beyond the Basic PERFORM
IST256 : Applications Programming for Information Systems
While loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Other ISAs Next, we’ll first we look at a longer example program, starting with some C code and translating it into our assembly language. Then we discuss.
Other ISAs Next, we’ll first we look at a longer example program, starting with some C code and translating it into our assembly language. Then we discuss.
While loops Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Introduction to Computer Science
CPS120: Introduction to Computer Science
Presentation transcript:

loops continued and coding efficiently Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Review x += y # adds y to the value of x x *= y # multiplies x by the value y x -= y # subtracts y from the value of x Increment operator sys.exit() # exit program immediately Explicit program exit Use to terminate when something is wrong - best to use print to provide user feedback before exit

Smart loop use if you don't know how many times you want to loop, use a while loop (indeterminate). e.g. finding all matches in a sequence e.g. looping through a list until you reach some list value lastVal = "Gotterdammerung" i = 0 while i < len(someList) and someList[i] != lastVal: i += 1 if i == len(someList): print lastVal, "not found" print "World not ended yet"

Smart loop use Read a file and print the first ten lines import sys infile = open(sys.argv[1], "r") lineList = infile.readlines() counter = 0 for line in lineList: counter += 1 if (counter > 10): break print line infile.close() Does this work? YES NO Is it good code?

What if the file has a million lines? (not uncommon in bioinformatics) import sys infile = open(sys.argv[1], "r") lineList = infile.readlines() counter = 0 for line in lineList: counter += 1 if (counter > 10): break print line infile.close() this statement reads all million lines!! import sys infile = open(sys.argv[1], "r") for counter in range(10): print infile.readline() infile.close() How about this instead? this version reads only the first ten lines, one at a time

import sys infile = open(sys.argv[1], "r") counter = 0 while counter < 10: print infile.readline() counter += 1 infile.close() This while loop does the same thing: The original readlines() approach not only takes much longer on large files it also has to store ALL the data in memory. I ran original version and efficient version on a very large file. Original version ran for 45 seconds and crashed when it ran out of memory. Improved version ran successfully in << 1 sec.

What if the file has fewer than ten lines? import sys infile = open(sys.argv[1], "r") for counter in range(10): print infile.readline() infile.close() It prints blank lines repeatedly - not ideal hint - when readline() reaches the end of a file, it returns "" but a blank line in the middle of a file returns "/n" import sys infile = open(sys.argv[1], "r") for counter in range(10): line = infile.readline() if len(line) == 0: break print line infile.close() Improved version: test for end of file

Memory allocation efficiency index = 0 curIndex = 0 while True: curIndex = hugeString[index:].find(query) if curIndex == -1: break index += curIndex print index index += 1 # move past last match First version makes a NEW large string in memory every time through the loop - slow! Second version uses the same string every time but starts search at different points in memory. Ran 10x to 1000x faster in test searches. index = 0 while True: index = hugeString.find(query, index) if index == -1: break print index index += 1 # move past last match be wary if you are splitting strings a lot

…. hugeString in memory search 1 …. … hugeString in memory search 1 search 2 search 3 search 4 hugeString.find(queryString, index) To figure out where to start this search, the computer just adds index to the position in memory of the 0 th byte of hugeString and starts the search there. …. new hugeString in memory copy bytes search 2 etc. hugeString[index:]

Sequential splitting of file contents Many problems in text or sequence parsing can employ this strategy: First, split file content into chunks (lines or fasta sequences etc.) Second, from each chunk extract the needed data This can be repeated - split each chunk into subchunks, extract needed data from subchunks. import sys lineList = open(sys.argv[1], "r").readlines() for line in lineList: fieldList = line.strip().split("\t") for field in fieldList: How many levels of splitting does this do? 2

General points for improving your code Write compactly as long as it is clear to read Consider whether you do things that are unnecessary (e.g. reading all the lines in a file when you don't need to) Consider user feedback if something unexpected arises (we will learn how to do this more elegantly soon). Don't waste memory by keeping information you don't need to use.

Sample problem #1 Write a program read-N-lines.py that prints the first N lines from a file (or all lines if the file has fewer than N lines), where N is the first argument and filename is the second argument. Be sure it handles very short and very long files correctly and efficiently. >python read-N-lines.py 7 file.txt this file has five lines >

Solution #1 import sys infile = open(sys.argv[2], "r") max = int(sys.argv[1]) counter = 0 while counter < max: line = infile.readline() if len(line) == 0: # we reached end of file break print line.strip() counter += 1

Sample problem #2 Write a program find-match.py that prints all the lines from the file cf_repmask.txt (linked from the web site) in which the 11 th text field exactly matches "CfERV1", with the number of lines matched and the total number of file lines at the end. Make the file name, the search term, and the field number command-line arguments. The file is an annotation of all the repeat sequences known in the dog genome. It is ~4.5 million lines long. Each line has 17 tab- delimited text fields. You will know you got it right if the example match count is 1,168. (If you use the smaller file cfam_repmask2.txt, the count should be 184) Your program should run in about seconds.

import sys if (len(sys.argv) != 4): print("USAGE: three arguments expected") sys.exit() query = sys.argv[1] # get the search term fnum = int(sys.argv[2]) - 1 # get the field number hitCount = 0 # initialize hit and line counts lineCount = 0 f = open(sys.argv[3]) # open the file for line in f: # for each line in file lineCount += 1 fields = line.split('\t') if fields[fnum] == query: # test for match print line.strip() hitCount += 1 f.close() print hitCount, "matches,", lineCount, "lines" Solution #2

Remark - in Solution #2 it is a bad idea to read all the lines at once with f.readlines(). Even though the problem requires you to read every line in the file, the best solution uses minimal memory because it never stores more than one line at a time.

Challenge problem 1 Extend sample problem 2 so that there is an optional 4 th argument that specifies a minimum genomic length to report a match. In the file, fields 7 and 8 are integers that indicate the genomic start and end positions of the repeat sequence. You should get 341 matches for the query "CfERV1" and a minimum genomic length of (If you use the smaller file cfam_repmask2.txt, there should be 63 matches)

import sys if (len(sys.argv) < 4): print("USAGE: at least three arguments expected") sys.exit() query = sys.argv[1] fnum = int(sys.argv[2]) - 1 minSpan = 0 # set a default so that any match passes if len(sys.argv) == 5: minSpan = int(sys.argv[4]) hitCount = 0 lineCount = 0 f = open(sys.argv[3]) for line in f: lineCount += 1 fields = line.split('\t') if fields[fnum] == query: span = int(fields[7]) - int(fields[6]) if span >= minSpan: print line.strip() hitCount += 1 f.close() print hitCount, "matches,", lineCount, "lines" Solution to challenge problem 1

Challenge problem 2 Modify sample problem 2 so that the number of matches and number of lines prints BEFORE the specific matches.

import sys if (len(sys.argv) != 4): print("USAGE: three arguments expected") sys.exit() query = sys.argv[1] fnum = int(sys.argv[2]) - 1 lineCount = 0 matchLines = [] # initialize the list to hold match lines f = open(sys.argv[3]) for line in f: lineCount += 1 fields = line.split('\t') if fields[fnum] == query: matchLines.append(line.strip()) # put line in list f.close() print len(matchLines), "matches,", lineCount, "lines" for line in matchLines: print line The trick is simple - make a list that will hold the matched lines, rather than printing them as you go. Print the list at the end. (note also that matchLines implicitly gives the number of matched lines) One possible problem is that, if the number of matched lines is huge, you could run out of memory.