Published by Opal Byrd. Modified over 7 years ago.
1
Python Pipelining: An Introduction to Building Data Analysis Pipelines (& Hacking Graduate School)
Presented by: Kevin Dick
2
Presentation Outline
30 Minutes ::
    Setup the Environment
    Brief introduction to Python
    Motivating the Development of Analysis Pipelines
    Abstraction in Programming
60 Minutes :: Workshop
    [20 mins] PHASE I : Basic Functional Pipeline
    [20 mins] PHASE II : Scaling the Analysis
    [20 mins] PHASE III :
Timeline: Intro, P1, P2, P3
3
Preamble :: Getting the Environment Setup
You need R:
13
Abstraction in Programming
Program > Function/Module, which is made up of:
    Function Definition
    Variable Definition
    Control/Logic Operations
    Return Statement
14
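Abstraction in Programming
The four parts of a function (Function Definition, Variable Definition, Control/Logic Operations, Return Statement):

    def functionName():          # Function Definition
        variable_1 = None        # Variable Definition
        variable_2 = None
        variable_n = None
        theStuff = None
        cond = True
        for v in [1, 10]:        # Control/Logic Operations
            pass                 # do stuff
            if cond:
                pass             # stuff
        return theStuff          # Return Statement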
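To make the four parts concrete, here is a minimal sketch of a small function that uses each of them. The function name, its parameters and the sample sequences are hypothetical and chosen only for illustration; they are not part of the workshop code.

```python
def count_long_sequences(sequences, min_length=5):   # Function Definition
    """Return how many sequences have at least min_length characters."""
    count = 0                                         # Variable Definition
    for seq in sequences:                             # Control/Logic Operations
        if len(seq) >= min_length:
            count += 1
    return count                                      # Return Statement


# Hypothetical input: three short amino-acid strings.
print(count_long_sequences(["MKV", "MKVLAT", "MSTNPKPQR"]))  # prints 2
```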
15
Abstraction in Programming
Program: Function/Module 1, …, Main Function
16
Abstraction in Programming
Pipeline: Program 1, Program 2, …, Program n
(Slides 17 to 21 build up the following points one at a time.)
Data Flow: General direction of data manipulation. The results of one program generally become the input to the next.
Modularity: Programs can easily be swapped in or out (in the figure, Program 2 is replaced by Program 2.1).
Optimizability: Diverse tools can be used for specific problems (in the figure, one program is written in C/C++).
Reproducibility: Automating the work is highly desirable (figure labels: Developer, Scientific Community).
Software Variety: Software from diverse platforms can be incorporated.
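The data-flow and modularity ideas can be sketched in a few lines of Python, with plain functions standing in for the programs. All function names and the sample text below are hypothetical; in a real pipeline each step would typically be a separate script or tool.

```python
def load(text):
    """Program 1: split raw text into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_lengths(lines):
    """Program 2: map each line to its length."""
    return [len(line) for line in lines]


def count_vowels(lines):
    """Program 2.1: a drop-in replacement for Program 2."""
    return [sum(ch in "aeiou" for ch in line.lower()) for line in lines]


def summarize(values):
    """Program n: reduce the values to a single total."""
    return sum(values)


def run_pipeline(data, programs):
    """Feed the output of each program into the next one."""
    for program in programs:
        data = program(data)
    return data


raw = "alpha\nbeta\n\ngamma\n"
total_length = run_pipeline(raw, [load, count_lengths, summarize])
total_vowels = run_pipeline(raw, [load, count_vowels, summarize])
print(total_length, total_vowels)  # prints 14 6
```

Swapping `count_lengths` for `count_vowels` changes one entry in the list and nothing else, which is the modularity point in miniature.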
22
Abstraction in Programming
PIPELINE: PROGRAM, …, PROGRAM n, driven by a PYTHON SCRIPT EXECUTING THE PIPELINE
PROGRAM: Module 1, Module 2, …, Main
MODULE: Function Definition, Variable Definition, Control/Logic Operations, Return Statement
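One way a Python script can execute a pipeline is to launch each program as a separate process and pass the output of one to the next. The sketch below uses the standard-library `subprocess.run`; the two "programs" are hypothetical one-line stand-ins run through the current Python interpreter, not the workshop scripts.

```python
import subprocess
import sys


def run_step(args, stdin_text=""):
    """Run one program and return its standard output as text."""
    result = subprocess.run(
        args,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program fails
    )
    return result.stdout


# Hypothetical stand-in programs, each run as its own process.
PROGRAM_1 = [sys.executable, "-c", "print('acgt')"]
PROGRAM_2 = [
    sys.executable,
    "-c",
    "import sys; print(sys.stdin.read().strip().upper())",
]


def main():
    """Run Program 1, then feed its output to Program 2."""
    out_1 = run_step(PROGRAM_1)
    out_2 = run_step(PROGRAM_2, stdin_text=out_1)
    return out_2.strip()


if __name__ == "__main__":
    print(main())
```

Because each step is just a command line, a step written in another language (a compiled C/C++ binary, an R script) could be slotted in the same way.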
23
Hacking your Graduate Studies
“If it's online, it's available…”
Python is great for pulling information out of web pages! We can follow the BioGrid link to open that page and parse out the values of interest.
Similarly, the IntAct page is more up to date!
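As a rough sketch of parsing values out of a page, the example below uses the standard-library `html.parser`. The HTML snippet and the `count` class are invented for illustration and are not the real BioGrid or IntAct markup; a real script would first download the page (for example with `urllib.request.urlopen`) and would need to match the actual page structure.

```python
from html.parser import HTMLParser


class CountParser(HTMLParser):
    """Collect the integer inside every <span class="count"> element."""

    def __init__(self):
        super().__init__()
        self._in_count = False
        self.counts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "count") in attrs:
            self._in_count = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_count = False

    def handle_data(self, data):
        if self._in_count and data.strip():
            self.counts.append(int(data.strip()))


# Hypothetical page content (assumption, not a real database page).
SAMPLE_PAGE = """
<html><body>
  <p>Interactors: <span class="count">42</span></p>
  <p>Publications: <span class="count">7</span></p>
</body></html>
"""

parser = CountParser()
parser.feed(SAMPLE_PAGE)
print(parser.counts)  # prints [42, 7]
```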
24
Building a Simple Pipeline
[Pipeline flowchart. Data files shown: .fasta, .txt, three .png files, .pdf. Legend :: DATA, DECISION, CODE, FLOW]
25-31
Building a Simple Pipeline
[Flowchart built up one step at a time, STEP 1 through STEP 7. Legend :: DATA, DECISION, CODE, FLOW]
33
Building a Simple Pipeline
PHASE I
Decisions: Pick an Organism; Pick a Single Protein
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW
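The "Pick a Single Protein" decision presupposes that the proteome file can be read. A minimal FASTA reader might look like the sketch below; the function name and the two sample records are hypothetical, and this is not the code of the workshop's get_proteome.py.

```python
def read_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA-formatted lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line    # sequences may span several lines
    return records


# Hypothetical proteome with two made-up proteins.
SAMPLE = """>protein_A example
MKVLAT
GGS
>protein_B example
MSTNPKPQR
""".splitlines()

proteome = read_fasta(SAMPLE)
single_protein = next(iter(proteome))   # "pick a single protein"
print(single_protein, len(proteome[single_protein]))
```

With a real file, the same function could be called as `read_fasta(open(path))`, since a file handle is also an iterable of lines.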
34
Building a Simple Pipeline
PHASE II
Decisions: Pick an Organism
Scaling: APPLY TO ALL PROTEINS
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW
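Going from one protein to all proteins is mostly a matter of wrapping the single-protein step in a loop and writing one row per protein. In the sketch below, `get_interactor_count` is a made-up stand-in that returns a dummy number; the real step would query the interaction databases. The tab-separated layout of the output is also an assumption.

```python
import csv
import io


def get_interactor_count(protein_id):
    """Stand-in for the real lookup; returns a dummy value (assumption)."""
    return len(protein_id)


def count_all(protein_ids, out_handle):
    """Apply the single-protein step to every protein, one row per protein."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    for protein_id in protein_ids:
        writer.writerow([protein_id, get_interactor_count(protein_id)])


# An in-memory buffer stands in for a file such as inter_count_TAXID.txt.
buffer = io.StringIO()
count_all(["P1", "P22", "P333"], buffer)
print(buffer.getvalue(), end="")
```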
35
Building a Simple Pipeline
PHASE III
Scaling: APPLY TO ALL PROTEINS; APPLY TO ALL ORGANISMS
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW
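Scaling to all organisms is easiest when the TAX ID is a parameter rather than a hard-coded value. The sketch below takes TAX IDs on the command line with `argparse` and derives the per-organism file names shown in the flowchart. The helper names are hypothetical, and the default list contains only the TAX IDs that appear on the organism slide of this deck.

```python
import argparse

# TAX IDs given on the organism slide (organisms without a listed ID are omitted).
TAX_IDS = {
    "Mus musculus": 10090,
    "Caenorhabditis elegans": 6239,
    "Homo sapiens": 9606,
    "Drosophila melanogaster": 7227,
    "Arabidopsis thaliana": 3702,
}


def output_names(tax_id):
    """Build the per-organism file names used throughout the pipeline."""
    return {
        "proteome": f"Proteome_{tax_id}.fasta",
        "counts": f"inter_count_{tax_id}.txt",
        "report": f"report_{tax_id}.pdf",
    }


def parse_args(argv=None):
    """Accept zero or more TAX IDs; default to every organism in TAX_IDS."""
    parser = argparse.ArgumentParser(description="Run the pipeline per organism.")
    parser.add_argument(
        "tax_ids", nargs="*", type=int, default=sorted(TAX_IDS.values())
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Return the report name for each requested organism."""
    args = parse_args(argv)
    return [output_names(tax_id)["report"] for tax_id in args.tax_ids]


if __name__ == "__main__":
    for report in main():
        print(report)
```

For example, `main(["9606"])` yields only `report_9606.pdf`, while calling it with an empty argument list covers every organism in the table.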
36
Pick Your Starter Pokémon ::
Mus musculus, TAX ID :: 10090
Caenorhabditis elegans, TAX ID :: 6239
Escherichia coli, TAX ID ::
Homo sapiens, TAX ID :: 9606
Drosophila melanogaster, TAX ID :: 7227
Saccharomyces cerevisiae, TAX ID ::
Arabidopsis thaliana, TAX ID :: 3702
37
Take Away Lessons
PHASE I ::
    Combat Data Veracity :: Pipelines are useful for aggregating data from multiple sources.
PHASE II ::
    Abstraction :: Wrapping a section in a loop can easily scale the outcome.
    Determining Bottlenecks :: Estimating the runtime of each section of code helps locate bottlenecks.
PHASE III ::
    Intelligent Design / Amortization :: Think about scaling when designing your pipelines; more work upfront has large payoffs later. EX :: Specifying TAX_ID as a parameter in all the scripts really simplifies scaling the pipeline to all organisms.
    Offlining Data :: Identify the areas of your pipeline that are severe bottlenecks. Calls to external servers are really bad, so work offline when you can.
    Replicability :: This pipeline is highly replicable for anyone looking to understand your work, and it allows the community to build upon it!
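Two of these lessons, timing each section and offlining data, can be sketched together. Below, a small context manager records how long a block takes, and a cache helper stores a result on disk so the slow call happens only once. `slow_lookup` is a made-up stand-in for a call to an external server, and the JSON cache format is an assumption.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def cached(path, fetch):
    """Return the JSON stored at path, calling fetch() only if it is missing."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    value = fetch()
    with open(path, "w") as handle:
        json.dump(value, handle)
    return value


def slow_lookup():
    """Stand-in for a request to an external server (assumption)."""
    time.sleep(0.05)
    return {"interactors": 3}


cache_file = os.path.join(tempfile.mkdtemp(), "lookup.json")

with timed("first call (online)"):
    first = cached(cache_file, slow_lookup)
with timed("second call (offline)"):
    second = cached(cache_file, slow_lookup)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

The second call reads the cached file instead of repeating the slow lookup, so the printed timings make the bottleneck visible.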
38
Common Bugs
Errors when trying to compile the LaTeX document might be the result of errors in previous steps.
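One way to catch such upstream errors early is to check that every expected intermediate file exists and is non-empty before compiling the report. The helper name and the demo files below are hypothetical; the file names merely follow the TAXID pattern from the flowchart.

```python
import os
import tempfile


def missing_outputs(paths):
    """Return the paths that do not exist or are empty files."""
    return [
        path
        for path in paths
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]


# Demo in a temporary directory: one good file, one empty file, one absent file.
workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "inter_count_9606.txt")
empty = os.path.join(workdir, "binary_9606.png")
absent = os.path.join(workdir, "biogrid_9606.png")

with open(good, "w") as handle:
    handle.write("P1\t2\n")
open(empty, "w").close()

problems = missing_outputs([good, empty, absent])
if problems:
    print("Fix these before compiling the report:")
    for path in problems:
        print("  ", os.path.basename(path))
```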