Development of an interactive pipeline for genome-wide association analysis
Falola Damilare & Adigun Taiwo – Covenant University Bioinformatics Research – Nigeria (dare.falola@cu.edu.ng & taiwo.adigun@covenantuniversity.edu.ng)
WACREN e-Research Hackfest – Lagos (Nigeria)
Outline
Background information
Scientific Problem, Aim and benefits
Computational & Data model
Implementation strategy
Typical user action workflow
Summary and Conclusion
Background Information
Tailoring healthcare and treatment therapies to individual patients, based on their genetic make-up and other biological features, is becoming increasingly essential in today's clinical practice.
Genome-Wide Association Studies (GWAS) have been applied extensively to uncover genetic variations, known as Single Nucleotide Polymorphisms (SNPs), and genes related to different diseases, traits and clinical symptoms.
Background Information
A genome-wide association study involves:
collecting many unrelated individuals with and without a specific trait or disease;
using high-throughput genotyping technologies to assay hundreds of thousands of SNPs in those individuals;
relating the genotyped SNPs to clinical conditions and measurable traits, using appropriate statistical techniques (e.g. chi-square test, logistic regression), to find which SNPs might be associated with the disease.
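As an illustration of the chi-square association step, the sketch below tests a single SNP by comparing allele counts between cases and controls. It is a minimal example, not the pipeline's own code: the function name `allelic_chi_square` and the counts used in the usage note are hypothetical, and a real analysis would normally rely on an established tool.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are alternate and reference allele counts.
    Returns the chi-square statistic and its p-value.
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column of the table needs a non-zero total")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `allelic_chi_square(30, 70, 10, 90)` compares a made-up SNP whose alternate allele is seen 30 times in 100 case alleles and 10 times in 100 control alleles. In a genome-wide scan the same test is repeated for every SNP, so the resulting p-values need a multiple-testing correction.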
Background Information
Typical GWAS workflow
Scientific Problem, Aim and benefits
A typical GWAS analysis requires numerous complex commands in several languages, which complicates the work of researchers.
State-of-the-art GWAS data analysis needs large computing and storage resources, which might not be available to most researchers in Africa or other developing countries.
Aim
The aim of this project is to develop and implement an e-infrastructure that provides state-of-the-art GWAS analysis to local researchers. The tool will bundle all the tools the analysis requires.
Benefits
Users can focus mainly on the research problem because the analysis process is presented as a black box, which should lead to better and more accurate research results.
The solution also adds user interactivity: better visualization of results, swift comparison of results from different types of analysis, and management of several projects.
Computational & Data model
Typical user action workflow
The main users of the system are public health or medical researchers, scientists, and bioinformaticians who have genotype and phenotype data to upload.
Data can be uploaded as a raw-intensity file, for analysis starting at the first phase; in PLINK format, for analysis starting at the second phase; or as a list of significant SNPs, for the third phase.
A typical GWAS analysis involves three main phases: SNP chip genotype calling, association testing, and post-GWAS analysis.
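The mapping from upload type to starting phase can be sketched as a small lookup. This is only an illustration: the labels `raw_intensity`, `plink` and `snp_list` and the names `Phase` and `entry_phase` are hypothetical and do not come from the project.

```python
from enum import Enum


class Phase(Enum):
    GENOTYPE_CALLING = 1      # SNP chip genotype calling
    ASSOCIATION_TESTING = 2   # association testing
    POST_GWAS = 3             # post-GWAS analysis


# Hypothetical labels for the three kinds of upload described on the slide.
ENTRY_PHASE_BY_UPLOAD = {
    "raw_intensity": Phase.GENOTYPE_CALLING,
    "plink": Phase.ASSOCIATION_TESTING,
    "snp_list": Phase.POST_GWAS,
}


def entry_phase(upload_type):
    """Return the phase at which analysis starts for a given kind of upload."""
    try:
        return ENTRY_PHASE_BY_UPLOAD[upload_type]
    except KeyError:
        raise ValueError(f"unsupported upload type: {upload_type!r}") from None
```

Keeping the mapping in one table means a new input format could be supported by adding a single entry.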
Typical user action workflow
Phase I includes four stages: initial quality control, genotype calling, post-calling quality control, and conversion to PLINK file format.
Phase II includes four steps: quality control, population stratification correction, association testing, and result visualization.
Phase III involves annotating the biologically significant markers that were associated with the disease phenotype in Phase II.
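One possible way to hold the phases and their stages is an ordered table from which an execution plan is derived. The stage names below are taken from the slide. The names `PIPELINE` and `execution_plan` are hypothetical, and the sketch assumes that all later phases run after the chosen starting phase.

```python
# Phase number -> ordered list of stages, as listed on the slide.
PIPELINE = {
    1: [
        "initial quality control",
        "genotype calling",
        "post-calling quality control",
        "conversion to PLINK format",
    ],
    2: [
        "quality control",
        "population stratification correction",
        "association testing",
        "result visualization",
    ],
    3: ["annotation of significant markers"],
}


def execution_plan(start_phase):
    """List (phase, stage) pairs to run, beginning at start_phase."""
    if start_phase not in PIPELINE:
        raise ValueError(f"unknown phase: {start_phase!r}")
    return [
        (phase, stage)
        for phase in sorted(PIPELINE)
        if phase >= start_phase
        for stage in PIPELINE[phase]
    ]
```

Calling `execution_plan(2)` skips genotype calling, which matches a user who uploads data that is already in PLINK format.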
Implementation strategy
Back-end
Each sub-stage of every phase has been implemented as standalone Bash, Perl and R scripts and Java source code.
The business logic of the system will be implemented using Java technologies, including Servlets and JavaServer Pages (JSP).
The scripts for each phase will be parallelized using the process input and output declarations of the Nextflow DSL (Domain-Specific Language).
Complex stages such as population stratification will be placed in separate Nextflow pipeline scripts.
The Java API for RESTful Web Services (JAX-RS) and JavaScript Object Notation (JSON) will be used to give developers programmatic access to the web application.
FutureGateway will be used to provide access to distributed computing resources such as grid, cloud and HPC systems.
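To give a feel for JSON-based programmatic access, the sketch below builds a job-submission payload that a client might send to the web application. It is purely illustrative: the field names `project`, `entry_phase` and `input_files`, and the function `build_job_request`, are assumptions and not the project's actual API.

```python
import json


def build_job_request(project, entry_phase, input_files):
    """Serialize a hypothetical GWAS job request as a JSON string."""
    if entry_phase not in (1, 2, 3):
        raise ValueError("entry_phase must be 1, 2 or 3")
    if not input_files:
        raise ValueError("at least one input file is required")
    payload = {
        "project": project,                # assumed field: project name
        "entry_phase": entry_phase,        # assumed field: phase to start from
        "input_files": list(input_files),  # assumed field: uploaded file names
    }
    # sort_keys makes the output deterministic, which simplifies testing.
    return json.dumps(payload, sort_keys=True)
```

A client could send the resulting string as the body of an HTTP request and parse the JSON reply the same way with `json.loads`.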
Implementation strategy
Front-end
Datasets will be uploaded to a storage element via FTP or the Globus Online APIs for Java.
gLibrary will be used to manage metadata about the data.
HTML5 and JavaScript will be used for UI design.
The interface will be styled with Cascading Style Sheets (CSS) and made mobile-responsive using CSS3 @media queries.
The database will be built on MySQL, a relational database management system (RDBMS).
Summary and conclusions
This solution makes GWAS analysis easier to perform by requiring only a limited understanding of its computational needs from researchers. This allows them to focus mainly on the research problem and give a better biological interpretation of the results.
Special Appreciation to Abayomi Mosaku, Bruce Becker and Mario Torrisi
Thank you!
sci-gaia.eu info@sci-gaia.eu