Ssaha_pileup - a SNP/indel detection pipeline from new sequencing data

Zemin Ning
Sequence Assembly & Analysis

Outline of the Talk
Newly released ssaha2 version 2.1
Alignment presentation: cigar format
Processing of single and paired-end reads
Transcriptome reads
Ssaha_pileup consensus
Solexa homopolymers
Indel detection
Future work

Alignment Presentation – “cigar” Format
CIGAR format (Stabenau et al. 2004) is a compact representation of gapped alignments. It takes a little more work to parse at a later stage, but it uses relatively little space.
Command:
./ssaha2 -output cigar subject query.fastq
Example output:
query_name subject_name M 57 D 1 M 36 I 2 M 26
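The operation string at the end of a cigar line can be tallied to see how many bases the alignment spans on each sequence. The following is a minimal Python sketch, not part of ssaha2: the function name is hypothetical, and it assumes the operations arrive as space-separated letter/length pairs (as in the example above) and that I consumes only the query while D consumes only the subject.

```python
def cigar_spans(ops):
    """Return (query_bases, subject_bases) covered by a cigar operation string.

    ASSUMPTIONS: space-separated "letter length" pairs, e.g. "M 57 D 1 M 36";
    M consumes both sequences, I only the query, D only the subject.
    """
    tokens = ops.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected letter/length pairs")
    query = subject = 0
    for op, length in zip(tokens[0::2], tokens[1::2]):
        n = int(length)
        if op == "M":      # match or mismatch: advance both sequences
            query += n
            subject += n
        elif op == "I":    # insertion: extra bases in the query
            query += n
        elif op == "D":    # deletion: bases present only in the subject
            subject += n
        else:
            raise ValueError(f"unexpected operation: {op}")
    return query, subject


print(cigar_spans("M 57 D 1 M 36 I 2 M 26"))  # (121, 120)
```

Under these assumptions the example alignment covers 121 query bases and 120 subject bases.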

Pipeline of ssaha_pileup
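Figure: pipeline flowchart. Inputs: sequencing reads and reference fasta. Alignment: ssaha2. Single-end (SE) reads: ssaha_cigar. Paired-end (PE) reads: ssaha_pairs, ssaha_clean. Intermediate: uniquely placed cigar read file. Programs and outputs: ssaha_pileup (SNP file, pileup/cons), ssaha_indel (indel file).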

Mapping Score in ssaha2
The read mapping score is used to assess how repetitive the read is in the genome. In the cigar file, a line beginning cigar::50 has a mapping score Smap = 50. The score is defined in terms of:
R: read length;
Smax: maximum alignment score (Smith-Waterman) of the hits on the genome;
Smax2: second-best alignment score of the hits on the genome.
Example: a read of 30 bases has a few hits on the genome. Best hit: exact match, Smax = 30. Second-best hit: one base mismatch, Smax2 = 29. The mapping score for this read is Smap = 10.
Figure labels: Read, Reference; 29, 21, 30, 25, 14, 27.

SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
Smap: read mapping score, from 0 (repeat) to 50 (unique);
Fq: base quality factor, Fq = 1 if Q >= 30, Fq = 0.5 if Q < 30;
N: number of reads covering the location.
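The exact formula is not given here, so the following Python sketch is only one plausible reading of the description: each read covering the position contributes its mapping score Smap weighted by the base quality factor Fq, and the contributions are summed over the N reads. The function names and the plain product Smap x Fq are assumptions for illustration, not the confirmed ssaha_pileup formula.

```python
def quality_factor(q):
    """Fq as described for Solexa reads: 1 if Q >= 30, otherwise 0.5."""
    return 1.0 if q >= 30 else 0.5


def snp_score(reads):
    """Sum Smap * Fq over (smap, base_quality) pairs covering one position.

    ASSUMPTION: the weighting is a plain product; the real formula may differ.
    """
    return sum(smap * quality_factor(q) for smap, q in reads)


# Three covering reads: unique with high quality, unique with low quality,
# and a repeat-placed read (Smap = 0) that contributes nothing.
print(snp_score([(50, 35), (50, 20), (0, 40)]))  # 75.0
```

In this reading, reads placed in repeats add nothing to the score however good their base quality, which matches the stated range of Smap from 0 (repeat) to 50 (unique).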

Transcriptome Reads with “-trans 1”
Figure labels: Short Reads, Intron, Reference Sequence.

C. elegans Transcriptome Assembly
Solexa reads:
Number of reads: 3,565,445
Transcriptome size: ,933,960 bp
Read length: 35
Estimated read coverage: 3.1X
Number of covered bases: 15,774,889 bp
Number of single coverage: 5,422,688
Assembly features:
Number of contigs:
Total assembled bases: Mb
N50 contig size:
Largest contig:
Averaged contig size:
Contig coverage: %
Contig extension errors: 0
Mis-assembly errors: 0

Read Mapping vs De novo Assembly
Figure panels: Chromosome I, Chromosome V.

Pileup/cons File Format

SNP output File Format

Indels or Homopolymers ?

Homopolymers at High Depth

Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
Number of reads: ,647,985
Genome size: Mbp
Read length: 36
Read coverage: 40x
Number of uniquely placed PE reads: 24,303,362
Percentage of placed PE reads: 94.5%
Number of uniquely placed SE reads: 23,229,651
Percentage of placed SE reads: 90.6%
Detection results:
Number of deletions: ,816
Number of detected deletions: 5,668 (97.5%)
Number of false positives: (2.3%)
Number of insertions: ,816
Number of detected insertions: 5,458 (93.8%)
Number of false positives: 15 (0.26%)

SNP/indel Detection Human Chromosome X
Simulated Solexa reads:
Number of reads: m
Chromosome size: Mbp
Read length: bp
Read coverage: 45x
Number of uniquely placed reads: m
Percentage of placed reads: 75.0%
Detection results:
                        Total     Hom       Hez
Number of SNPs:         97,895    30,286    67,609
Percentage in dbSNP:    83.7%     91.7%     80.1%
Number of deletions: 5,597
Number of insertions: 4,294
                        1_base    2_base    3_base
Deletion length:
Insertion length:

Availability
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
More information: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme

Acknowledgements: Yong Gu, Ben Blackburne, Hannes Ponstingl, Tony Cox, Quan Long and other ssaha users
