GWAS: Installing and Testing

GWAS: Installing and Testing
Dustin Landers & Troy Kling

Introduction to GWAS
[Diagram: Genotype and Trait Data -> GWAS TOOL -> Knowledge about how genotypes relate to traits]
GWAS tools (e.g. PLINK, FaST-LMM) are used to identify how markers (regions of the DNA sequence) relate to some trait. For example, what changes in the DNA sequence will translate to increased plant height?
GWAS tools take relational data about the different markers we have on the DNA sequence (these are called Single Nucleotide Polymorphisms, or SNPs) and use it to model changes in a quantitative or categorical trait.

Troy’s work so far
Installing Genome-wide Association Studies tools on Atmosphere and the Discovery Environment (DE).
Working mostly with GWAS packages in R, e.g. SKAT, aml, BATools.
Installing a new tool that uses an R package requires writing a wrapper script for it. Wrappers for R packages can be broken down into three main chunks:
1. Grab command-line arguments.
2. Execute an association test on user-supplied inputs.
3. Return the results.
The wrapper script creation process can be tedious and time-consuming.
Designing software to automate the creation of wrapper scripts for R packages. My new project, called wrapR, takes the name of an R package and automatically generates a wrapper script for each function within that package. These wrapper scripts are ready to be chained together and executed from Atmosphere or the DE.
Surprisingly, teaching R how to interpret different types of input is the most difficult part.
Simple/Complex dichotomy.
Applications to Artificial Neural Networks and Machine Learning.
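The three chunks of a wrapper described above can be sketched as follows. This is only an illustration in Python, not the wrapR output (the real wrappers are R scripts). The flag names, the CSV layout, and the use of a plain Pearson correlation as a stand-in association test are all assumptions made for the example.

```python
import argparse
import csv
import math


def parse_args(argv):
    """Chunk 1: grab command-line arguments (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of an association-test wrapper")
    parser.add_argument("--genotypes", required=True,
                        help="CSV, one row per sample, one 0/1/2 column per SNP")
    parser.add_argument("--trait", required=True,
                        help="CSV, one trait value per row, same sample order")
    parser.add_argument("--out", required=True, help="path for the results CSV")
    return parser.parse_args(argv)


def pearson(x, y):
    """Pearson correlation; returns None when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def run_association(genotypes, trait):
    """Chunk 2: run a stand-in association test on every SNP column."""
    n_snps = len(genotypes[0])
    results = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        results.append((j, pearson(column, trait)))
    return results


def write_results(results, path):
    """Chunk 3: return the results, here as a two-column CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["snp_index", "correlation"])
        for index, r in results:
            writer.writerow([index, "NA" if r is None else f"{r:.6f}"])


def main(argv):
    """Chain the three chunks: parse arguments, run the test, write the output."""
    args = parse_args(argv)
    with open(args.genotypes, newline="") as handle:
        genotypes = [[float(v) for v in row] for row in csv.reader(handle)]
    with open(args.trait, newline="") as handle:
        trait = [float(row[0]) for row in csv.reader(handle)]
    write_results(run_association(genotypes, trait), args.out)
```

Calling `main(["--genotypes", "g.csv", "--trait", "t.csv", "--out", "results.csv"])` would run the whole chain. Keeping the three chunks as separate functions is what makes a wrapper like this easy to generate mechanically, since only the middle chunk changes from one package function to the next.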

Dustin’s problems and what he’s done so far
How do we judge how well a tool works? Run a known-truth data set through the tool and examine the output. But any one test is atypical, so how do we run lots of known-truth data sets through a tool?
Obvious problems:
Problem 1) Realistic data sets are massive (our Syngenta ped-map pairs are around 1.5 gigabytes each!).
Problem 2) What are the best ways to summarize information from a single run?
Problem 3) How do we make this easy so that everyone will do it?

So in recap…
[Diagram: Genotype and Trait Data -> GWAS TOOL -> Knowledge about how genotypes relate to traits]
Both Troy’s work and mine have involved the middle part of this diagram.
Troy’s work has been in developing ways to easily integrate new tools into the iPlant CI.
Dustin’s work has been in developing ways to easily test new tools on the iPlant CI.
Notice that both of these statements involve the word “easily”. That’s because iPlant is interested in infrastructure, and we believe that ease of use will encourage people to use it!

So what has Dustin done so far?
Created two different tools: Aggregate and Validate.
Validate accepts a folder-wide input and returns performance metrics (it is public on the Discovery Environment).
Aggregate is more of a data management tool (it’s a standalone executable) that accesses your iPlant Data Store and allows you to aggregate massive amounts of outputs with relative ease. It’s basically a formalization of what would otherwise be a bash scripting process using curl or the like.
Basically, we discovered that any tester would need to do a lot of scripting, and we want to cut back on that as much as possible.
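To make the idea of performance metrics from known-truth data concrete, here is a minimal sketch in Python. It is not Validate’s actual code or output format. The function names, the p-value dictionary layout, and the choice of metrics and significance threshold are assumptions for illustration.

```python
def performance_metrics(p_values, causal_snps, alpha=0.05):
    """Compare a tool's calls against the known truth for one run.

    p_values: dict mapping SNP id -> p-value reported by the tool.
    causal_snps: set of SNP ids that truly affect the simulated trait.
    alpha: significance threshold used to turn p-values into calls (assumed).
    """
    tp = fp = tn = fn = 0
    for snp_id, p in p_values.items():
        called = p < alpha
        is_causal = snp_id in causal_snps
        if called and is_causal:
            tp += 1
        elif called:
            fp += 1
        elif is_causal:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios are reported as None rather than raising.
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def summarize_runs(metric_dicts, key):
    """Average one metric over many known-truth runs, skipping undefined values."""
    values = [m[key] for m in metric_dicts if m[key] is not None]
    return sum(values) / len(values) if values else None
```

Because any one test is atypical, a helper like `summarize_runs` is the interesting part: the per-run metrics only become informative once they are averaged over a whole folder of known-truth outputs.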

Where we want Validate to go next
The clear next step for us is to somehow integrate the whole process, meaning supplying simulations, running the tools, and being able to support a larger breadth of analyses in a single pass.
Also, to make sure we are including all the *right* kinds of analyses.
A particular example follows.

Our recent job overlap
Troy installed GEMMA.
Dustin needed to user-test Validate and Aggregate.
Late last year, Dustin tested PLINK and FaST-LMM and wrote a report outlining the results.
So where does GEMMA fall in this line-up?

* Indicates population structure

We noticed that GEMMA excluded certain SNPs from analysis automatically. Actually, excluding SNPs from the analysis is common if those SNPs have low minor allele frequency. But in some cases, researchers may not want to exclude SNPs on this basis alone… What is an acceptable cut-off?
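The effect of a minor allele frequency (MAF) cut-off can be sketched in a few lines of Python. This is an illustration only and does not reproduce GEMMA’s filtering. The 0/1/2 genotype coding, the function names, and the cut-off value in the usage note are assumptions.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP.

    genotypes: per-sample counts (0, 1 or 2) of one allele in diploid samples,
    with None marking a missing call. Returns None if nothing was observed.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return None
    freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is rarer.
    return min(freq, 1 - freq)


def filter_by_maf(snps, cutoff):
    """Split SNPs into kept and excluded sets for a given MAF cut-off.

    snps: dict mapping SNP id -> list of genotypes.
    Returns (kept, excluded), each a dict mapping SNP id -> MAF.
    """
    kept, excluded = {}, {}
    for snp_id, genotypes in snps.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf >= cutoff:
            kept[snp_id] = maf
        else:
            excluded[snp_id] = maf
    return kept, excluded
```

Running `filter_by_maf(snps, cutoff)` for several cut-off values and counting how many SNPs land in `excluded` each time is a quick way to see how much of a data set a given threshold would discard before any association test is run.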

These are the kinds of questions Validate intends to provide answers to!
We think every researcher should be thinking about these things, but it’s understandable if they don’t. Having the proper infrastructure already in place to provide these kinds of analyses is essentially the point.

Why did we show you this?
We are playing the role of both analysts and developers. We have to understand what makes a tool work better than other tools.
In this case, we recently spotted that how a tool handles minor allele frequency is extremely important. For example, FaST-LMM doesn’t need to remove SNPs with low MAF and still performs better than most tools. In fact, FaST-LMM returns SNP effect size information for every single.

How can we improve Validate and wrapR?
Any thoughts or questions?
Additional performance metrics (we are calling them performetrics) to include?
Is there a better way?
Is there something we are missing?

Why the better estimates and reductions in standard errors?
We have a simple demonstration of this. To show why this happens, we simulate 500 SNPs with allele frequencies ranging from 0.0000001 to 0.05.
We then simulate a quantitative trait. If the allele isn’t present then y~N(0,1); if it is then y~N(10,1). This is about equivalent to a heritability value of 0.8.
Then we try to predict the trait using the SNP in 500 different models, and record the estimates and the standard errors.
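The simulation described above can be sketched in Python as follows. This is a reconstruction under stated assumptions, not the presenters’ code. The sample size per SNP, the evenly spaced allele frequencies, the binary carrier coding, and the use of simple least-squares regression are all assumed, since the slide does not specify them.

```python
import math
import random


def simulate_snp(n_samples, freq, rng):
    """Draw carrier status and a trait: y~N(0,1) without the allele, y~N(10,1) with it."""
    carrier = [1 if rng.random() < freq else 0 for _ in range(n_samples)]
    trait = [rng.gauss(10.0 if c else 0.0, 1.0) for c in carrier]
    return carrier, trait


def fit_simple_regression(x, y):
    """Least-squares fit of y on x; returns (slope, standard error) or None.

    None is returned when x has no variance (e.g. nobody carries the allele),
    because the effect cannot be estimated in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if n < 3 or sxx == 0:
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, std_err


def run_simulation(n_snps=500, n_samples=1000, low=1e-7, high=0.05, seed=0):
    """Fit one model per SNP and record (frequency, fit) pairs.

    n_samples and the even spacing of frequencies are assumptions.
    """
    rng = random.Random(seed)
    results = []
    for i in range(n_snps):
        freq = low + (high - low) * i / max(n_snps - 1, 1)
        x, y = simulate_snp(n_samples, freq, rng)
        results.append((freq, fit_simple_regression(x, y)))
    return results
```

Sorting the output of `run_simulation()` by frequency and comparing the standard errors at the rare and common ends shows the pattern the slide refers to: with very few carriers the effect is either not estimable at all or estimated with a much larger standard error.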

Why the better estimates?