Pipelines! CTB 6/15/13

A pipeline view of the world

Each computational step is one or more commands: Trimmomatic → fastx → Velvet.
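For example, the assembly step is one conceptual step but runs as two commands (a sketch with illustrative parameters; real usage belongs in the Velvet documentation):

    velveth assembly.d 31 -fastq -short reads.qc.fq    # build the k-mer hash (here k=31)
    velvetg assembly.d                                  # build contigs from the hash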

The breakdown into steps is dictated by input/output…
– In: reads; out: reads
– In: reads; out: contigs

The breakdown into steps is driven by input/output and “concept”
– In: reads; out: reads. Trimmomatic OR scythe OR …
– In: reads; out: reads. FASTX OR sickle OR ConDeTri OR …
– In: reads; out: contigs. Velvet OR SGA OR …

Generally, I don’t include diagnostic steps as part of the main “flow”.

…but there isn’t exactly a standard :)

What is a pipeline, anyway?
– Conceptually: a series of data-in/data-out steps.
– Practically: a series of commands that load data, process it, and save it back to disk.
  – This is generally true in bioinformatics.
  – You can also have programs that do multiple steps, which involves less disk “traffic” (see the example below).
– Actually: a bunch of UNIX commands.
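To see the disk-traffic point concretely, compare writing an intermediate file to disk with combining the steps via a pipe (a sketch using the FASTX toolkit; parameters are illustrative):

    # two separate steps; the intermediate file lands on disk:
    fastq_quality_filter -Q33 -q 30 -p 50 -i reads.fq -o reads.filt.fq
    gzip reads.filt.fq

    # the same work in one combined step; no intermediate file on disk:
    fastq_quality_filter -Q33 -q 30 -p 50 -i reads.fq | gzip > reads.filt.fq.gz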

“Shell scripting”
– The shell (bash, csh, etc.) is specialized for exactly this: running commands.
– Shell “scripting” is putting together a series of commands – “scripting” actions to be run.
– The line between scripting and programming is fuzzy:
  – Scripting generally involves less complex organization.
  – Scripting is typically done within a single file.

Writing a shell script: it’s just a series of shell commands, in a file.

trim-and-assemble.sh:

    # trim adapters
    Trimmomatic …

    # shuffle reads together
    Interleave.py …

    # trim bad reads
    fastx_trimmer …

    # run velvet
    velveth …
    velvetg …
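Filled in, trim-and-assemble.sh might look like the following (a sketch: the file names and parameters are invented, Interleave.py’s invocation is assumed, and each tool’s real arguments should come from its documentation):

    #!/bin/bash
    set -e  # stop at the first failing command

    # trim adapters (Trimmomatic, paired-end mode; adapters.fa is hypothetical)
    java -jar trimmomatic.jar PE reads_1.fq reads_2.fq \
        reads_1.paired.fq reads_1.unpaired.fq \
        reads_2.paired.fq reads_2.unpaired.fq \
        ILLUMINACLIP:adapters.fa:2:30:10

    # shuffle the paired reads together into one interleaved file
    Interleave.py reads_1.paired.fq reads_2.paired.fq > reads.interleaved.fq

    # trim bad reads
    fastx_trimmer -Q33 -l 70 -i reads.interleaved.fq -o reads.qc.fq

    # run velvet on the interleaved, quality-trimmed reads
    velveth assembly.d 31 -fastq -shortPaired reads.qc.fq
    velvetg assembly.d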

Back to pipelines: automated pipelines are good things.
– Encode each and every step in a script.
– Provide all the details, including parameters.
– Explicit: each command is present.
– Reusable: you can easily tweak a parameter, then re-run & re-evaluate (see the sketch below).
– Communicable: you can give it to a lab mate, your PI, etc. Minimizes confusion as to what you actually did :)
– Automated: start & walk away from long-running pipelines.
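One simple way to make parameter tweaking painless is to collect the tweakable values at the top of the script (a minimal sketch; the variable names and values are illustrative):

    #!/bin/bash
    # tweakable parameters live in one place
    K=31          # velvet k-mer size
    TRIM_LEN=70   # keep only the first TRIM_LEN bases of each read

    fastx_trimmer -Q33 -l $TRIM_LEN -i reads.fq -o reads.qc.fq
    velveth assembly.k$K $K -fastq -short reads.qc.fq
    velvetg assembly.k$K

Re-running with a different k is now a one-line change, and the output directory name records which k produced which assembly.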

Why pipelines? Automation:
– Convenience
– Reuse
– Reproducibility
Pipelines encode knowledge in an explicit & executable computational representation.

Reproducibility
– Most groups can’t reproduce their own results 6 months later; other groups don’t even have a chance.
– This limits both reusability and bug finding/tracking/fixing.
– Reproducibility is about both convenience and correctness.

Some nonobvious corollaries
– Each processing step from the raw data onwards is interesting, so you need to provide close-to-raw data.
– Making the figures is part of the pipeline; but Excel cannot be automated.
– Keeping track of exactly which version of the pipeline script generated the results now becomes a problem…

This is what version control is about.
– Version control gives you an explicit way to track, mark, and annotate changes to collections of files. (Git is one such system; see the example below.)
– In combination with Web sites like github.com, you can:
  – View changes and files online
  – Download specific marked versions of files
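A minimal git workflow for a pipeline script might look like this (the repository URL is a placeholder, and it assumes an empty repository already exists on github.com):

    # put the pipeline under version control
    git init
    git add trim-and-assemble.sh
    git commit -m "pipeline used for the first round of results"

    # mark the exact version used to generate a set of results
    git tag results-2013-06-15

    # share it on github
    git remote add origin https://github.com/USER/pipeline.git
    git push origin master --tags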

An actual pipeline: the results in our digital normalization paper are about 80% automated.
– Raw data
– Single command to go from raw data to fully processed data (a hypothetical sketch follows)
– Single IPython Notebook to go from raw data to figures
– (Super special) single command to go from figures + paper source to submission PDF
– Figures & text are tied to a specific version of our pipeline => 100% reproducible
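What might that top-level “single command” look like? A hypothetical sketch, not the actual script from the paper; every name below is invented:

    #!/bin/bash
    # run-everything.sh – one command from raw data to results
    set -e                        # abort on the first error

    bash trim-and-assemble.sh     # raw data -> fully processed data
    python make-figures.py        # processed data -> figures
    make submission.pdf           # figures + paper source -> submission PDF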

IPython Notebook

This morning
Let’s automate read trimming, mapping, & mismatch calculation!
– Write a script (a sketch follows below); run it on a subset of reads
– Write a notebook => figures
– Put it in version control, post it to github
A quick tour of github:
– Forking, cloning, editing, pushing back
Encoding assembly
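The trimming/mapping/mismatch script might be shaped like this (a sketch: bowtie2 is one possible mapper, the file names and parameters are illustrative, and the mismatch count is read from the standard SAM NM:i: tag):

    #!/bin/bash
    set -e

    # 1. work on a subset of reads first (100,000 reads = 400,000 FASTQ lines)
    head -n 400000 reads.fq > subset.fq
    fastx_trimmer -Q33 -l 70 -i subset.fq -o subset.qc.fq

    # 2. map the trimmed reads against a reference
    bowtie2-build reference.fa reference
    bowtie2 -x reference -U subset.qc.fq -S mapped.sam

    # 3. sum the per-read mismatch counts (the NM:i: tag) over all mapped reads
    awk '!/^@/ { for (i = 12; i <= NF; i++)
                   if ($i ~ /^NM:i:/) { split($i, a, ":"); sum += a[3] } }
         END { print sum, "mismatches" }' mapped.sam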

Tips & tricks
– Develop a systematic naming scheme for files => easier to investigate results.
– Work with a small data set first & develop the pipeline; then, once it’s working, apply it to the full data set.
– Put in friendly “echo” commands.
– Advanced: use loops and wildcards to write generic processing steps (see the sketch below).
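For that last tip, a generic processing step over many files might look like this (a minimal sketch, assuming the input files live in data/*.fastq):

    # trim every FASTQ file in data/, announcing each one as we go
    for fq in data/*.fastq
    do
        echo "trimming $fq ..."
        fastx_trimmer -Q33 -l 70 -i "$fq" -o "${fq%.fastq}.qc.fastq"
    done
    echo "all files trimmed."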