Introduction to Data Formats and Tools


Data formats for GWAS and Plink
Shaun Aron, Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand

Workflow: measure intensities, genotype calling, variant QC, sample QC, association

Tools
Genotype calling tools (iDAT files to genotype reports): Genome Studio, genCall, zCall
Variant and sample QC: Plink, EigenSoft, R
Imputation tools: Impute2, PBWT, online
Association testing: Plink, GEMMA

Genotype calling

Genotype calls to Plink
Some genotyping software exports data directly in Plink format.
Otherwise, existing tools and scripts can convert genotyping reports to Plink format, or you can write the conversion yourself.

Plink
Plink is a standard tool for manipulating and analyzing genotype data.
It works with standard data formats and can convert between them.
It was developed and optimized for working with biallelic SNP data.
Plink online manual: https://www.cog-genomics.org/plink2

Data formats
PED format uses two files: the PED file holds sample/individual information and the MAP file holds SNV information.
The PED file has no header. Its first 6 columns are defined:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1 = male, 2 = female, any other value = unknown)
Phenotype (-9 = missing, 1 = control, 2 = case, or a quantitative trait value)
These columns are followed by the allele calls for each variant, two columns per variant. Different allele encodings are possible.
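
The PED layout described above can be illustrated with a small parser. This is a minimal sketch, not part of Plink: the `PedRecord` type, the `parse_ped_line` function and the sample line are hypothetical names and data made up for illustration.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    fid: str
    iid: str
    pat_id: str
    mat_id: str
    sex: str
    phenotype: str
    genotypes: list  # one (allele1, allele2) tuple per variant


def parse_ped_line(line: str) -> PedRecord:
    """Split one whitespace-delimited PED line into 6 fixed columns plus allele pairs."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a PED line needs at least 6 columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("allele calls must come in pairs")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


# Hypothetical sample: family FAM1, individual IND1, parents not in the
# dataset (0), male (1), case (2), genotyped at two SNPs.
record = parse_ped_line("FAM1 IND1 0 0 1 2 A A G T")
print(record.iid, record.sex, record.phenotype, record.genotypes)
```

Each variant contributes two columns, so a PED line for N variants has 6 + 2N fields.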

PED format columns: FID, IID, PATID, MATID, SEX, PHENO, alleles for SNP 1, alleles for SNP 2, ...

MAP file columns: Chr, SNP ID, genetic distance (morgans), BP position
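
As a rough illustration of the four MAP columns, the sketch below parses MAP lines into dictionaries. The function name `parse_map_line` and the two example variants are hypothetical.

```python
def parse_map_line(line: str) -> dict:
    """Parse one 4-column MAP line: chromosome, SNP ID, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, bp_position = line.split()
    return {
        "chrom": chrom,
        "snp_id": snp_id,
        "genetic_distance": float(genetic_distance),  # in morgans
        "bp_position": int(bp_position),
    }


# Hypothetical two-variant MAP file content.
map_text = """1 rs0001 0 1000
1 rs0002 0 2500"""
variants = [parse_map_line(line) for line in map_text.splitlines()]
print([v["snp_id"] for v in variants])
```

The order of the MAP lines matches the order of the allele pairs in each PED line.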

Data storage
Storing thousands of samples and millions of SNPs in plain text format requires a large amount of space.
Plink therefore prefers to work with binary versions of the PED files, which compress the PED and MAP files and reduce their size.

Binary PED format
FAM file: one row per individual; the first 6 columns of the PED file.
BIM file: one row per SNP; the MAP file columns plus the two alleles for that SNP.
BED file: the genotype calls for every individual at every SNP, i.e. the rest of the PED file, in binary format.
The FAM and BIM files are human readable, while the BED file is not.
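
To show why the BED file is not human readable, here is a minimal decoding sketch. It assumes the variant-major layout that Plink 1.x writes by default: a 3-byte header, then 2 bits per genotype with four samples packed into each byte, lowest bits first. The function name, the genotype labels and the one-byte example are hypothetical.

```python
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # third byte 0x01 marks variant-major layout
GENOTYPE_CODES = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_bed(data: bytes, n_samples: int) -> list:
    """Return one list of genotype labels per variant."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if data[:3] != BED_MAGIC:
        raise ValueError("not a variant-major .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # each variant block is padded to whole bytes
    body = data[3:]
    if len(body) % bytes_per_variant != 0:
        raise ValueError("file size does not match the sample count")
    variants = []
    for start in range(0, len(body), bytes_per_variant):
        block = body[start:start + bytes_per_variant]
        calls = []
        for sample in range(n_samples):
            two_bits = (block[sample // 4] >> (2 * (sample % 4))) & 0b11
            calls.append(GENOTYPE_CODES[two_bits])
        variants.append(calls)
    return variants


# Hypothetical file: 3 samples, 1 variant, packed into a single byte.
decoded = decode_bed(BED_MAGIC + bytes([0b00111000]), n_samples=3)
print(decoded)
```

The sample count comes from the FAM file and the variant count from the BIM file, which is why the three files are always used together.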

FAM file columns: FID, IID, PATID, MATID, SEX, PHENO
BIM file columns: Chr, SNP ID, genetic distance (morgans), BP position, SNP alleles
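
The relationship between the text and binary filesets can be sketched by deriving FAM and BIM rows from PED and MAP lines. This is an illustration only, not how Plink is implemented. It assumes a missing allele code of 0 and lists the less frequent allele first in the BIM row; `ped_map_to_fam_bim` and the example data are hypothetical.

```python
from collections import Counter

MISSING = "0"  # missing allele code assumed for this sketch


def ped_map_to_fam_bim(ped_lines, map_lines):
    """Return (fam_rows, bim_rows) as lists of space-joined text rows."""
    fam_rows = []
    allele_counts = [Counter() for _ in map_lines]
    for line in ped_lines:
        fields = line.split()
        fam_rows.append(" ".join(fields[:6]))  # FAM = first 6 PED columns
        calls = fields[6:]
        if len(calls) != 2 * len(map_lines):
            raise ValueError("PED allele columns do not match the MAP file")
        for snp_index, counts in enumerate(allele_counts):
            for allele in calls[2 * snp_index:2 * snp_index + 2]:
                if allele != MISSING:
                    counts[allele] += 1

    bim_rows = []
    for map_line, counts in zip(map_lines, allele_counts):
        major_first = [allele for allele, _ in counts.most_common()]
        if len(major_first) > 2:
            raise ValueError("more than two alleles observed at one SNP")
        minor_first = major_first[::-1]
        # Pad with the missing code when fewer than two alleles were observed.
        a1, a2 = [MISSING] * (2 - len(minor_first)) + minor_first
        bim_rows.append(" ".join(map_line.split() + [a1, a2]))  # BIM = MAP + alleles
    return fam_rows, bim_rows


ped_lines = ["FAM1 IND1 0 0 1 2 A A G T", "FAM2 IND2 0 0 2 1 A C G G"]
map_lines = ["1 rs0001 0 1000", "1 rs0002 0 2500"]
fam_rows, bim_rows = ped_map_to_fam_bim(ped_lines, map_lines)
print(fam_rows)
print(bim_rows)
```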

Other formats
Plink reads various other data formats and can convert them into Plink format.

Plink basics

Plink basics
Plink is command-line based.
Call it using the plink command.

Plink basics
Flags are used for the different operations.
For example, --file tells Plink the prefix of the input files and that they are in PED format: --file hapmap1
Your current directory should contain your data in PED format: hapmap1.map and hapmap1.ped.
Try it now.

Plink basics
Output files have the prefix plink by default. Use the --out flag to specify your own name.
To convert explicitly to binary format, use the --make-bed flag.

Plink basics
Examine your newly generated files and identify what each row and column denotes.
Remember that you cannot open the .bed file; it is not human readable.
To read in a file in binary PED format, use the --bfile flag.
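
The flags introduced so far (--file, --bfile, --make-bed, --out) can be combined as in the sketch below, which only assembles and prints the command lines and does not run Plink. The helper `plink_command` and the output prefixes hapmap1_bin and hapmap1_copy are hypothetical choices for illustration.

```python
import shlex


def plink_command(input_prefix, out_prefix, binary_input=False, make_bed=False):
    """Assemble the argument list for a Plink call without running it."""
    args = ["plink", "--bfile" if binary_input else "--file", input_prefix]
    if make_bed:
        args.append("--make-bed")
    args += ["--out", out_prefix]
    return args


# Convert the text fileset hapmap1.ped/.map to a binary fileset.
convert = plink_command("hapmap1", "hapmap1_bin", make_bed=True)
# Read the binary fileset back in and write it out again under a new prefix.
copy_binary = plink_command("hapmap1_bin", "hapmap1_copy", binary_input=True, make_bed=True)

print(shlex.join(convert))
print(shlex.join(copy_binary))
```

Passing such a list to `subprocess.run` would execute the command, provided a plink executable is on the PATH.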

Run through exercise 2

Plink commands
Command            Action
--recode           Transform between formats
--freq             Generate allele frequency statistics
--vcf              Read in a file in VCF format
--keep [file]      Retain the samples listed in the specified file
--remove [file]    Remove the samples listed in the specified file
--extract [file]   Keep the SNPs listed in the specified file
--exclude [file]   Remove the SNPs listed in the specified file
--pheno [file]     Read phenotypes from the specified file

Plink filtering
You may need to extract specific parts of a complete dataset: specific SNPs or individuals.
Either can be extracted directly on the command line or by using a file with a specific format.

Plink filtering
For individual filtering, create a file with the FID and IID of the individuals you want to keep or remove.
For SNP filtering, create a file with the SNP IDs you would like to extract or exclude.
In both cases you would most likely generate a new dataset.

Plink filtering
Sample file: --keep, --remove
SNP file: --extract, --exclude
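
What the four file-based filters select can be sketched on FAM and BIM rows held in memory. This mimics the selection logic only and is not Plink's implementation. The function names and the example rows are hypothetical.

```python
def read_sample_ids(text):
    """Parse a sample file: one 'FID IID' pair per line."""
    return {tuple(line.split()[:2]) for line in text.splitlines() if line.strip()}


def filter_samples(fam_rows, sample_ids, keep=True):
    """keep=True mimics --keep; keep=False mimics --remove."""
    return [row for row in fam_rows if (tuple(row.split()[:2]) in sample_ids) == keep]


def filter_snps(bim_rows, snp_ids, extract=True):
    """extract=True mimics --extract; extract=False mimics --exclude."""
    return [row for row in bim_rows if (row.split()[1] in snp_ids) == extract]


fam_rows = ["FAM1 IND1 0 0 1 2", "FAM2 IND2 0 0 2 1", "FAM3 IND3 0 0 1 1"]
bim_rows = ["1 rs0001 0 1000 C A", "1 rs0002 0 2500 T G"]
sample_ids = read_sample_ids("FAM1 IND1\nFAM3 IND3\n")
snp_ids = {"rs0002"}

kept = filter_samples(fam_rows, sample_ids, keep=True)
removed = filter_samples(fam_rows, sample_ids, keep=False)
extracted = filter_snps(bim_rows, snp_ids, extract=True)
excluded = filter_snps(bim_rows, snp_ids, extract=False)
print(kept, removed, extracted, excluded)
```

Samples are matched on the FID and IID pair together, so two individuals sharing an IID in different families stay distinct.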

Pheno file
Phenotypes can be stored in the PED or FAM file.
In some instances it is useful to store them in a separate file.
Use the --pheno flag followed by a file with the following columns: FID, IID, PHENO
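
The effect of a separate phenotype file can be sketched as a lookup that overwrites the phenotype column of each FAM row. This is an illustration under assumptions: individuals absent from the phenotype file are set to the missing code -9 here, and `apply_pheno_file` and the example rows are hypothetical.

```python
def apply_pheno_file(fam_rows, pheno_text, missing="-9"):
    """Replace column 6 of each FAM row with the value from a 'FID IID PHENO' file."""
    phenotypes = {}
    for line in pheno_text.splitlines():
        if line.strip():
            fid, iid, pheno = line.split()[:3]
            phenotypes[(fid, iid)] = pheno

    updated = []
    for row in fam_rows:
        fields = row.split()
        # Individuals not listed in the phenotype file get the missing code.
        fields[5] = phenotypes.get((fields[0], fields[1]), missing)
        updated.append(" ".join(fields))
    return updated


fam_rows = ["FAM1 IND1 0 0 1 -9", "FAM2 IND2 0 0 2 -9", "FAM3 IND3 0 0 1 2"]
pheno_text = "FAM1 IND1 2\nFAM2 IND2 1\n"
updated_rows = apply_pheno_file(fam_rows, pheno_text)
print(updated_rows)
```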

Plink filtering
Another useful flag is --filter, which uses the same file format as the phenotype file.
Plink also has some built-in filtering flags: --filter-cases, --filter-controls, --filter-males, --filter-females

Run through exercise 3.1 – 3.11

Selection based on criteria
Flags are defined to select samples or SNPs based on specific criteria.
You will come across these again in the QC section of the course.

Plink filtering
Command             Action
--hwe [threshold]   Remove variants with HWE p < threshold
--missing           Compute per-sample and per-variant missingness
--check-sex         Compare reported sex with sex inferred from X chromosome genotypes
--genome            Compute relatedness based on IBD
--maf [threshold]   Remove variants with MAF < threshold
--mind [value]      Remove individuals with missing data above value
--geno [value]      Remove SNPs with missing data above value

Criteria selection flags
--mind 0.02: all individuals with more than 2% missing data are removed.
--geno 0.04: all SNPs with a call rate of less than 96% are removed.
--maf 0.01: all SNPs with a minor allele frequency of less than 1% are removed.
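
The three thresholds can be illustrated on a toy genotype table. This sketch applies the sample filter first and then the two SNP filters on the remaining samples; that ordering, the missing allele code 0, the function names and the data are assumptions for illustration. The thresholds are loosened because the toy dataset is tiny.

```python
from collections import Counter

MISSING = "0"  # missing allele code assumed for this sketch


def missing_rate(calls):
    """Fraction of (allele1, allele2) calls that are missing."""
    if not calls:
        return 1.0
    return sum(1 for pair in calls if MISSING in pair) / len(calls)


def minor_allele_frequency(calls):
    """Frequency of the rarer allele among non-missing calls (0.0 if monomorphic)."""
    counts = Counter(a for pair in calls if MISSING not in pair for a in pair)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())


def apply_qc(genotypes, snp_ids, mind, geno, maf):
    """Return (kept sample IDs, kept SNP IDs) after the three filters."""
    kept = {s: calls for s, calls in genotypes.items() if missing_rate(calls) <= mind}
    kept_snps = []
    for index, snp_id in enumerate(snp_ids):
        column = [calls[index] for calls in kept.values()]
        if missing_rate(column) <= geno and minor_allele_frequency(column) >= maf:
            kept_snps.append(snp_id)
    return list(kept), kept_snps


genotypes = {
    "S1": [("A", "A"), ("G", "T"), ("C", "C")],
    "S2": [("A", "C"), ("G", "G"), ("C", "C")],
    "S3": [("0", "0"), ("0", "0"), ("C", "C")],
    "S4": [("A", "A"), ("0", "0"), ("C", "C")],
}
snp_ids = ["rs0001", "rs0002", "rs0003"]
kept_samples, kept_snps = apply_qc(genotypes, snp_ids, mind=0.5, geno=0.4, maf=0.1)
print(kept_samples, kept_snps)
```

In this toy data S3 is dropped for missing two of three genotypes, and rs0003 is dropped because every remaining call is the same allele.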

Go through exercise 3.11 – 3.22

Merging datasets
Plink has built-in tools for merging datasets.
Merging is not a straightforward process, but it is useful for population studies.
Datasets need to be from the same genome build, have SNPs called on the same strand, etc.
Section 4 deals with how to merge data successfully.

Association testing
Plink provides a number of association testing approaches.
--assoc with case/control values in the phenotype column of your PED/FAM file or specified phenotype file: runs a simple chi-squared association test.
--assoc with quantitative values in the phenotype column of your PED/FAM file or specified phenotype file: runs a regression analysis.
--linear: runs a linear regression for a quantitative trait, allowing for the inclusion of covariates.
Additional options adjust for multiple testing and run permutation testing.
This will be covered in more detail in the course.
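
The simple chi-squared test for a case/control trait can be illustrated with a 2x2 table of allele counts. This sketch uses the textbook 1-degree-of-freedom allelic test and is not taken from Plink's source. The function name and the allele counts are hypothetical.

```python
import math


def allelic_chi_squared(case_a1, case_a2, control_a1, control_a2):
    """Chi-squared statistic and p-value (1 df) for a 2x2 allele count table."""
    n = case_a1 + case_a2 + control_a1 + control_a2
    denominator = (
        (case_a1 + case_a2)
        * (control_a1 + control_a2)
        * (case_a1 + control_a1)
        * (case_a2 + control_a2)
    )
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a1 * control_a2 - case_a2 * control_a1) ** 2 / denominator
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 50 cases and 50 controls, so 100 alleles per group.
chi2, p_value = allelic_chi_squared(case_a1=30, case_a2=70, control_a1=20, control_a2=80)
print(round(chi2, 3), round(p_value, 3))
```

In a genome-wide scan the same test is repeated for every SNP, which is why the adjustments for multiple testing matter.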

Run through sections 4 and 5