Python Pipelining An Introduction to Building Data Analysis Pipelines


1 Python Pipelining: An Introduction to Building Data Analysis Pipelines (& Hacking Graduate School)
Presented by: Kevin Dick
LECTURE WEBPAGE
MKatzzz

2 Presentation Outline
30 Minutes ::
Setup the Environment
Brief Introduction to Python
Motivating the Development of Analysis Pipelines
Abstraction in Programming
60 Minutes :: Workshop
[20 mins] PHASE I : Basic Functional Pipeline
[20 mins] PHASE II : Scaling the Analysis
[20 mins] PHASE III

3 Preamble :: Getting the Environment Setup
You need R:

13 Abstraction in Programming
Program > Function/Module > Function Definition, Variable Definition, Control/Logic Operations, Return Statement

14 Abstraction in Programming
def functionName():          # Function Definition
    variable_1 = None        # Variable Definition
    variable_2 = None
    variable_n = None
    for v in [1, 10]:        # Control/Logic Operations
        pass                 # do stuff
        if cond:
            pass             # stuff
    return theStuff          # Return Statement
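
As a concrete, runnable counterpart to the skeleton above, here is a minimal sketch that labels the same four parts in a small function. The function name `count_long_sequences`, its parameters, and the sample data are hypothetical and chosen only for illustration.

```python
def count_long_sequences(sequences, min_length=5):   # Function Definition
    """Count how many sequences are at least min_length characters long."""
    count = 0                                        # Variable Definition
    for seq in sequences:                            # Control/Logic Operations
        if len(seq) >= min_length:
            count += 1
    return count                                     # Return Statement


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the workshop.
    example = ["MKV", "MKVLAA", "MSTNPKPQR"]
    print(count_long_sequences(example))
```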

15 Abstraction in Programming
Program: Function/Module 1, …, Main Function

16 Abstraction in Programming
Pipeline: Program 1, Program 2, …, Program n

17 Abstraction in Programming
Pipeline: Program 1, Program 2, …, Program n
Data Flow: General direction of data manipulation. The results of one program generally become the input to the next.

18 Abstraction in Programming
Pipeline: Program 1, Program 2.1, …, Program n
Modularity: Can easily swap programs in and out.

19 Abstraction in Programming
Pipeline: Program 1, Program 2 (C/C++), …, Program n
Optimizability: Can use diverse tools for specific problems.

20 Abstraction in Programming
Pipeline: Program 1, Program 2, …, Program n
Reproducibility: Automating the work is highly desirable (Developer, Scientific Community).

21 Abstraction in Programming
Pipeline: Program 1, Program 2, …, Program n
Software Variety: Can incorporate software across diverse platforms.
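
The data-flow and modularity ideas can be sketched in a few lines of Python, under the assumption that each "program" is modelled as a plain function. All function names and the sample input below are hypothetical; in a real pipeline the stages would typically be separate scripts or external tools.

```python
def read_records(text):
    """Program 1: split raw text into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_characters(records):
    """Program 2: map each record to its length."""
    return [len(record) for record in records]


def count_unique_characters(records):
    """Program 2.1: a drop-in replacement for Program 2."""
    return [len(set(record)) for record in records]


def summarize(counts):
    """Program n: reduce the counts to a single summary value."""
    return sum(counts)


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one (the data flow)."""
    for stage in stages:
        data = stage(data)
    return data


if __name__ == "__main__":
    raw = "AAB\nABCD\n\nAA\n"
    print(run_pipeline(raw, [read_records, count_characters, summarize]))
    # Modularity: swap Program 2 for Program 2.1 without touching the rest.
    print(run_pipeline(raw, [read_records, count_unique_characters, summarize]))
```

Because every stage takes one input and returns one output, swapping a stage only requires that the replacement accept and return the same kind of data.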

22 Abstraction in Programming
PYTHON SCRIPT EXECUTING THE PIPELINE > PIPELINE > PROGRAM, …, PROGRAM n > Module 1, Module 2, Main > Function Definition, Variable Definition, Control/Logic Operations, Return Statement
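
One way a Python script can execute a pipeline of separate programs is with the standard-library `subprocess` module. The sketch below is only an illustration: the two stand-in "programs" are tiny Python one-liners started as separate processes, and the helper names `run_program` and `run_pipeline` are hypothetical.

```python
import subprocess
import sys


def run_program(argv, stdin_text=""):
    """Run one external program and return its standard output as text."""
    result = subprocess.run(
        argv,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program fails
    )
    return result.stdout


def run_pipeline(programs, initial_input=""):
    """Run the programs in order, passing each output to the next input."""
    data = initial_input
    for argv in programs:
        data = run_program(argv, data)
    return data


if __name__ == "__main__":
    # Stand-in programs: any executable (R, C/C++, shell) could go here.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    reverse = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read()[::-1])"]
    print(run_pipeline([upper, reverse], "acgt"))
```

Using `check=True` stops the pipeline at the first failing program, so a later step does not silently run on missing or partial input.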

23 Hacking your Graduate Studies
“If it’s online, it’s available…”
Python is great for pulling information out of web pages!
We can follow the BioGrid link to open that page and parse out the values of interest!
Similarly, the IntAct page is more up to date!
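
A minimal sketch of parsing values out of a web page, using only the standard-library `html.parser` module. The actual markup of the BioGrid and IntAct pages is not reproduced here: `SAMPLE_PAGE`, its labels, and its numbers are invented for illustration, and the example assumes the values of interest sit in a simple two-column table. Downloading the page (for example with `urllib.request`) is left out so the example runs offline.

```python
from html.parser import HTMLParser

# Invented stand-in for a downloaded page; real pages will look different.
SAMPLE_PAGE = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Interactors</td><td>12</td></tr>
  <tr><td>Publications</td><td> 3 </td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            if not self.rows:          # tolerate a <td> outside any <tr>
                self.rows.append([])
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_counts(html):
    """Return {label: integer count} from rows with exactly two <td> cells."""
    parser = TableCellParser()
    parser.feed(html)
    return {row[0].strip(): int(row[1]) for row in parser.rows if len(row) == 2}


if __name__ == "__main__":
    print(parse_counts(SAMPLE_PAGE))
```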

24 Building a Simple Pipeline
[Flowchart. Data files: .fasta, .txt, .png, .png, .png, .pdf. Legend :: DATA, DECISION, CODE, FLOW]

25–31 Building a Simple Pipeline
[Flowchart walkthrough: STEP 1 through STEP 7. Legend :: DATA, DECISION, CODE, FLOW]

33 Building a Simple Pipeline
PHASE I
Decisions: Pick an Organism, Pick a Single Protein
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW
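
The Phase I flow can be sketched as an ordered plan of commands plus the file names each run should leave behind. This is only a sketch under loud assumptions: the slide does not show how the scripts are invoked, so the command-line form (each script taking the TAX ID as its only argument) and the execution order (the order in which the scripts are listed) are both hypothetical, and the single-protein choice is not modelled. The helper names are invented.

```python
def phase_one_commands(tax_id, python="python"):
    """Return the Phase I steps as argv lists.

    ASSUMPTION: every script accepts the TAX ID as its only argument and
    the scripts run in the order they are listed on the slide.
    """
    scripts = [
        "get_proteome.py",
        "get_interactor_count.py",
        "plot_inter_count.py",
        "compile_report.py",
    ]
    return [[python, script, str(tax_id)] for script in scripts]


def expected_files(tax_id):
    """File names from the slide with TAXID replaced by the actual TAX ID."""
    templates = [
        "Proteome_TAXID.fasta",
        "inter_count_TAXID.txt",
        "r_file_TAXID.r",
        "binary_TAXID.png",
        "biogrid_TAXID.png",
        "intact_TAXID.png",
        "report_TAXID.pdf",
    ]
    return [name.replace("TAXID", str(tax_id)) for name in templates]


if __name__ == "__main__":
    # Dry run: print the plan instead of executing anything.
    for argv in phase_one_commands(9606):
        print(" ".join(argv))
    for name in expected_files(9606):
        print(name)
```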

34 Building a Simple Pipeline
PHASE II
Decisions: Pick an Organism
Loop: APPLY TO ALL PROTEINS
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW
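
Applying a single-protein step to all proteins amounts to wrapping it in a loop over the proteome file. The sketch below assumes the proteome is a standard FASTA file whose header lines start with `>` and that the first word of each header is the protein identifier; the function names and the placeholder lookup are hypothetical.

```python
def read_fasta_ids(lines):
    """Yield the identifier (first word after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


def count_for_all(protein_ids, count_one):
    """Apply a single-protein function to every protein (the Phase II loop)."""
    return {protein_id: count_one(protein_id) for protein_id in protein_ids}


if __name__ == "__main__":
    # Invented miniature proteome; a real run would iterate over an open file.
    fasta = [">P1 first protein", "MKV", ">P2 second protein", "MSTN", "PKPQ"]
    # Placeholder for the real per-protein lookup: here just the ID length.
    print(count_for_all(read_fasta_ids(fasta), len))
```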

35 Building a Simple Pipeline
PHASE III
Loops: APPLY TO ALL PROTEINS, APPLY TO ALL ORGANISMS
Scripts: get_proteome.py, get_interactor_count.py, plot_inter_count.py, compile_report.py, _alert_simple.py
Files: report_template.pdf, r_file_TAXID.r, Proteome_TAXID.fasta, inter_count_TAXID.txt, binary_TAXID.png, biogrid_TAXID.png, intact_TAXID.png, report_TAXID.pdf
Legend :: DATA, DECISION, CODE, FLOW

36 Pick Your Starter Pokémon ::
Mus musculus TAX ID :: 10090
Caenorhabditis elegans TAX ID :: 6239
Escherichia coli TAX ID ::
Homo sapiens TAX ID :: 9606
Drosophila melanogaster TAX ID :: 7227
Saccharomyces cerevisiae TAX ID ::
Arabidopsis thaliana TAX ID :: 3702
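
Once the TAX ID is a parameter, scaling from one organism to many is a loop over a table of organisms. The sketch below uses only the TAX IDs that appear in the list above and omits the organisms whose ID is not shown; `run_for_all_organisms` and `fake_pipeline` are hypothetical names, and `fake_pipeline` merely stands in for the real per-organism pipeline.

```python
# TAX IDs as listed above; organisms without a listed ID are left out.
ORGANISMS = {
    "Mus musculus": 10090,
    "Caenorhabditis elegans": 6239,
    "Homo sapiens": 9606,
    "Drosophila melanogaster": 7227,
    "Arabidopsis thaliana": 3702,
}


def run_for_all_organisms(organisms, run_one):
    """Call run_one(tax_id) for each organism and collect results by name."""
    results = {}
    for name, tax_id in organisms.items():
        results[name] = run_one(tax_id)
    return results


if __name__ == "__main__":
    def fake_pipeline(tax_id):
        """Hypothetical stand-in that only names the report it would build."""
        return "report_{}.pdf".format(tax_id)

    for name, report in run_for_all_organisms(ORGANISMS, fake_pipeline).items():
        print(name, "->", report)
```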

37 Take Away Lessons
PHASE I ::
Combat Data Veracity :: Pipelines are useful for aggregating data from multiple sources.
PHASE II ::
Abstraction :: Wrapping a section in a loop can easily scale the outcome.
Determining Bottlenecks :: Estimating the runtime of each section of code helps determine bottlenecks.
PHASE III ::
Intelligent Design / Amortization :: Think about scaling when designing your pipelines; more work upfront has large payoffs later. EX :: Specifying TAX_ID as a parameter in all the scripts really simplifies scaling the pipeline to all organisms.
Offlining Data :: Identify areas in your pipeline that are severe bottlenecks. Calls to external servers are really bad; offline when you can.
Replicability :: This pipeline is highly replicable for anyone looking to understand your work and allows the community to build upon it!
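
Two of these lessons, determining bottlenecks and offlining data, can be sketched with the standard library alone. The helpers `timed` and `cached_fetch` are hypothetical names, and `slow_lookup` is an invented stand-in for a call to an external server. The cache assumes that each key is safe to use as a file name and that the fetched value can be stored as JSON.

```python
import json
import os
import tempfile
import time


def timed(label, func, *args, timings=None):
    """Run func(*args) and record its wall-clock duration under label."""
    start = time.perf_counter()
    result = func(*args)
    if timings is not None:
        timings[label] = time.perf_counter() - start
    return result


def cached_fetch(key, fetch, cache_dir):
    """Return fetch(key), keeping a copy on disk so later runs stay offline."""
    path = os.path.join(cache_dir, "{}.json".format(key))
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    value = fetch(key)
    with open(path, "w") as handle:
        json.dump(value, handle)
    return value


if __name__ == "__main__":
    calls = []

    def slow_lookup(key):
        """Hypothetical stand-in for a call to an external server."""
        calls.append(key)
        time.sleep(0.05)
        return {"id": key, "interactors": 3}

    timings = {}
    with tempfile.TemporaryDirectory() as cache_dir:
        timed("first run", cached_fetch, "P1", slow_lookup, cache_dir,
              timings=timings)
        timed("second run", cached_fetch, "P1", slow_lookup, cache_dir,
              timings=timings)
    print(timings)      # the second run reads from disk instead of waiting
    print(len(calls))   # how many times the external lookup actually ran
```

Timing each stage separately shows where the runtime goes, and caching the slowest external calls removes them from every run after the first.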

38 Common Bugs
Errors when trying to compile the LaTeX document might be the result of errors in previous steps.

