The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.

The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools
Stephen Tetreault
Department of Mathematics and Computer Science
Rhode Island College
Providence, RI

Single Nucleotide Polymorphisms
- DNA sequence variation when a single nucleotide in the genome differs
- SNPs are the majority of genetic variation
- 1.4 million SNPs in a human genome
- Two haploid genomes differing at 1 SNP per 1,331 bp
- SNPs are crucial in the effort to personalize medicine

1000 Genomes Project
- International consortium to create the most complete catalog of human genetic variation
- Sequencing uses next-generation sequencing technology (e.g. Solexa, 454, SOLiD), which is faster and less expensive
- 3 steps of the project:
  - Detailed scanning of six participants
  - Less detailed scan of 180 participants
  - Partial scans of 1000 participants

1000 Genomes Project
- 1000 Genomes Project Goals:
  - Discover genetic variants (SNPs, copy-number variants, indels)
  - Identify frequencies of the variant alleles and identify their haplotype backgrounds

Project Focus
- Learning about the current state of sequencing tools
- Learning how to use these tools and understanding the raw data
- Creating a program to extract the SNPs from the raw data and to calculate simple variant frequencies
- More advanced data analysis, to be discussed in the Future Work section

Data and Tools
- 1000 Genomes Project: ftp://ftp-trace.ncbi.nih.gov/1000genomes/
- MAQ
- SAMtools

Sequencing
- MAQ maps short reads to reference sequences and calls genotypes from the alignment
- MAQ maps a read to the position where the sum of quality values of mismatched nucleotides is minimal
- Issues with MAQ:
  - Very long run-time
  - Limited computing power slowed the program down
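The placement rule described for MAQ can be illustrated with a toy brute-force sketch. This is not MAQ's implementation: a real aligner does not scan every offset, and the function names, reference, read, and quality values below are hypothetical. The example only shows why weighting mismatches by base quality matters: two low-quality mismatches can be preferred over one high-quality mismatch.

```python
def mismatch_quality_sum(reference, read, quals, offset):
    """Sum the qualities of read bases that disagree with the reference at this offset."""
    window = reference[offset:offset + len(read)]
    return sum(q for ref_base, base, q in zip(window, read, quals) if ref_base != base)


def best_placement(reference, read, quals):
    """Return (offset, score) with the smallest summed quality of mismatched bases.

    Brute force over every ungapped offset; ties go to the leftmost offset.
    """
    if len(read) != len(quals):
        raise ValueError("read and quality list must have the same length")
    last_offset = len(reference) - len(read)
    if last_offset < 0:
        raise ValueError("read is longer than the reference")
    scored = (
        (mismatch_quality_sum(reference, read, quals, offset), offset)
        for offset in range(last_offset + 1)
    )
    score, offset = min(scored)
    return offset, score


# Hypothetical toy data: the middle two bases of the read have low quality.
reference = "TCGTAAAT"
read = "ACGT"
quals = [40, 2, 2, 40]

# Offset 0 has one mismatch at a quality-40 base (score 40);
# offset 4 has two mismatches at quality-2 bases (score 4) and wins.
print(best_placement(reference, read, quals))  # (4, 4)
```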

Sequencing
- SAMtools was the alternative to MAQ.
- It proved faster because it could use BAM (Binary SAM) files, which are prealigned partial scans of the participant data.
- MAQ had to align FASTA and FASTQ files, then convert the MAP file into a consensus file for SNP calling.
- Like MAQ, SAMtools supports SNP calling.
- The SAMtools pileup function describes the base information at each chromosomal position.

Sequencing
- The SAMtools pileup function describes the base information at each chromosomal position.

Project Data
- The raw output of SAMtools pileup with consensus calling contains the following fields for each position: chromosome, position, reference base, consensus base, consensus quality score, SNP quality score, maximum mapping quality score, number of reads mapped, read bases, and base qualities.
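As a minimal sketch, one line of this output could be turned into a typed record as follows. The code assumes the ten fields are tab-separated and appear in the order listed on the slide; the record type, field names, and the sample line are hypothetical.

```python
from typing import NamedTuple


class PileupRecord(NamedTuple):
    """One position of pileup output with consensus calling (field order assumed)."""
    chrom: str
    pos: int
    ref_base: str
    consensus_base: str
    consensus_quality: int
    snp_quality: int
    max_mapping_quality: int
    depth: int            # number of reads mapped over this position
    read_bases: str
    base_qualities: str


def parse_pileup_line(line):
    """Parse one tab-separated line; reject lines that do not have ten fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    chrom, pos, ref, cons, cons_q, snp_q, map_q, depth, bases, quals = fields
    return PileupRecord(
        chrom, int(pos), ref, cons,
        int(cons_q), int(snp_q), int(map_q), int(depth),
        bases, quals,
    )


# Hypothetical example line, not taken from the project data.
record = parse_pileup_line("1\t10583\tG\tA\t36\t36\t60\t5\tAaAaA\tIIIII\n")
print(record.pos, record.ref_base, record.consensus_base, record.snp_quality)
```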

Phred Quality Scores
- The consensus quality score and the SNP quality are Phred quality scores.
- High accuracy of Phred scores helps ensure reliable SNP calling
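A Phred score Q is conventionally defined from an error probability P as Q = -10 log10(P), so a score of 20 corresponds to a 1 in 100 chance that the call is wrong, and 30 to 1 in 1,000. The short sketch below converts between the two; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability in (0, 1] to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```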

Finding Higher Quality SNPs
- Look at the number of reads covering the position with the SNP and discard SNPs covered by three or fewer reads.
- Consensus quality is important, but SNP quality is more important. Discard any SNP with a SNP quality score lower than 20.
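The two rules above amount to a simple predicate. The sketch below is one way to express them; the function and constant names are hypothetical, and the thresholds are the ones stated on the slide (more than three reads, SNP quality of at least 20).

```python
MIN_DEPTH = 4          # "three or fewer reads" are discarded, so four is the minimum
MIN_SNP_QUALITY = 20   # SNPs with a quality score lower than 20 are discarded


def passes_quality_filters(depth, snp_quality,
                           min_depth=MIN_DEPTH,
                           min_snp_quality=MIN_SNP_QUALITY):
    """Return True if a candidate SNP has enough coverage and a high enough SNP quality."""
    return depth >= min_depth and snp_quality >= min_snp_quality


print(passes_quality_filters(depth=3, snp_quality=50))   # False: too few reads
print(passes_quality_filters(depth=8, snp_quality=19))   # False: SNP quality below 20
print(passes_quality_filters(depth=4, snp_quality=20))   # True
```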

A Program for Extracting SNPs
- Read in raw data line by line
- Check for a SNP of high quality:
  - Differing reference and consensus base
  - SNP quality score of 20 or higher
- Insert the SNP as an object into an array list (stored in order of position)
- Keep counts for variant frequency and update them when a SNP is found
- Keep a count of the number of SNPs per 100,000 bases throughout chromosome 1
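The steps on this slide can be sketched as follows. This is not the author's program, which is not shown in the slides; it is a minimal Python reconstruction that assumes tab-separated input with position, reference base, consensus base, and SNP quality in columns 2, 3, 4, and 6, and input already sorted by position. All names are hypothetical. If the consensus column holds an ambiguity code rather than a plain base, this sketch counts it as-is.

```python
from collections import Counter

WINDOW_SIZE = 100_000
MIN_SNP_QUALITY = 20


def extract_snps(lines, window_size=WINDOW_SIZE, min_snp_quality=MIN_SNP_QUALITY):
    """Collect high-quality SNPs, substitution counts, and SNP counts per window."""
    snps = []                  # (position, reference base, consensus base), in input order
    substitutions = Counter()  # (reference base, consensus base) -> count
    per_window = Counter()     # window index -> number of SNPs in that window

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        pos = int(fields[1])
        ref_base = fields[2].upper()
        consensus_base = fields[3].upper()
        snp_quality = int(fields[5])

        # A SNP needs a consensus base that differs from the reference
        # and a SNP quality of at least the threshold.
        if ref_base == consensus_base or snp_quality < min_snp_quality:
            continue

        snps.append((pos, ref_base, consensus_base))
        substitutions[(ref_base, consensus_base)] += 1
        # Positions are 1-based, so window 0 covers positions 1..window_size.
        per_window[(pos - 1) // window_size] += 1

    return snps, substitutions, per_window


# Hypothetical input lines, not taken from the project data.
sample = [
    "1\t100\tA\tG\t40\t40\t60\t8\tGgGgGgGg\tIIIIIIII",
    "1\t200\tC\tC\t40\t0\t60\t8\t........\tIIIIIIII",
    "1\t100001\tT\tC\t40\t15\t60\t8\tCcCc....\tIIIIIIII",
    "1\t100002\tT\tC\t40\t30\t60\t8\tCcCcCcCc\tIIIIIIII",
]
snps, substitutions, per_window = extract_snps(sample)
print(snps)
print(dict(substitutions))
print(dict(per_window))
```

In practice the `lines` argument would be an open file handle, so the data is still read line by line without loading a whole chromosome into memory.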

Results
- Comparing variant frequencies:
  - Base changes of A to G and of T to C were the most frequently occurring variations
  - The base change of C to G was the least frequently occurring

Results
- The number of SNPs occurring per 100,000 bases throughout chromosome 1 for participant NA07048

Results
- The number of SNPs occurring per 100,000 bases for chromosome 1 of participant NA
- The SNPs appear more clustered together in frequency when compared to NA07048.

Conclusion
- Initial complications in data access and slow progress with MAQ were overcome.
- SAMtools proved faster, and thus more efficient, at processing the sequence data and calling SNPs when using the prealigned partial BAM files

Future Work
- fastPHASE is a program used for estimating missing genotypes and for reconstructing haplotypes.
- Add advanced data analysis to the program by calling genotypes from the reads and running fastPHASE to obtain the corresponding haplotypes.
- For each SNP position on chromosome 1 of an individual, look at the reads mapped over that position and at the bases they report there, to determine whether the SNP is heterozygous or homozygous
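The last item could start from the read bases column of the pileup output. The sketch below is a simple counting heuristic, not a genotype caller: real callers weigh base and mapping qualities in a probabilistic model. It assumes the pileup read-base encoding ("." and "," for a match to the reference, letters for mismatches, "^" plus one character at a read start, "$" at a read end, "+" or "-" followed by a length and bases for indels). The function names and the 20% allele-fraction threshold are hypothetical.

```python
from collections import Counter


def count_bases(read_bases, ref_base):
    """Count the bases that the reads report at one position of a pileup line."""
    counts = Counter()
    ref_base = ref_base.upper()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2  # read start marker; the next character encodes mapping quality
        elif ch in "+-":
            # Indel: sign, decimal length, then that many inserted/deleted bases.
            i += 1
            digits = ""
            while i < len(read_bases) and read_bases[i].isdigit():
                digits += read_bases[i]
                i += 1
            i += int(digits) if digits else 0
        else:
            if ch in ".,":
                counts[ref_base] += 1       # read agrees with the reference
            elif ch.upper() in "ACGT":
                counts[ch.upper()] += 1     # mismatch on either strand
            # "$", "*", and N/n are ignored in this sketch
            i += 1
    return counts


def classify_genotype(read_bases, ref_base, min_fraction=0.2):
    """Label a position by how many alleles reach min_fraction of the reads."""
    counts = count_bases(read_bases, ref_base)
    total = sum(counts.values())
    if total == 0:
        return "no call"
    alleles = sorted(base for base, n in counts.items() if n / total >= min_fraction)
    if len(alleles) == 1:
        return "homozygous " + alleles[0]
    if len(alleles) == 2:
        return "heterozygous " + "/".join(alleles)
    return "ambiguous"


# Hypothetical read-base strings for a position whose reference base is A.
print(classify_genotype(".,.,GgGg", "A"))   # heterozygous A/G
print(classify_genotype("GgGGg^Ig$", "A"))  # homozygous G
```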

Acknowledgment
- Thank you to Professor Yufeng Wu, Jin Zhang, the Computer Science and Engineering Department at the University of Connecticut, and the National Science Foundation for making this project and the Bio-Grid REU possible.