Published by Marion Underwood. Modified over 6 years ago.
1
Ssaha_pileup - a SNP/indel detection pipeline from new sequencing data
Zemin Ning, Sequence Assembly & Analysis
2
Outline of the Talk:
Newly released ssaha2 version 2.1
Alignment presentation - cigar format
Processing of single and paired-end reads
Transcriptome reads
Ssaha_pileup consensus
Solexa homopolymers
Indel detection
Future work
3
Alignment Presentation – “cigar” Format
CIGAR format (Stabenau et al 2004) is a compact representation of the patterns of gapped alignments. It makes the alignment more complex to parse at a later stage, but it takes relatively little space.
Command:
./ssaha2 -output cigar subject query.fastq
Example output (fields as shown on the slide):
query_name subject_name M 57 D 1 M 36 I 2 M 26
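To make the operation string concrete, here is a minimal sketch that reads the operations from the example (M 57 D 1 M 36 I 2 M 26) and totals the bases consumed on each side. The function names are hypothetical, and the convention used (M consumes both sequences, I consumes query bases only, D consumes subject bases only) is an assumption that should be checked against the ssaha2 readme before relying on it.

```python
from typing import List, Tuple


def parse_cigar_ops(ops: str) -> List[Tuple[str, int]]:
    """Split a string such as 'M 57 D 1 M 36 I 2 M 26' into (operation, length) pairs."""
    tokens = ops.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating operation and length tokens")
    pairs = []
    for op, length in zip(tokens[0::2], tokens[1::2]):
        if op not in ("M", "I", "D"):
            raise ValueError(f"unexpected operation: {op!r}")
        pairs.append((op, int(length)))
    return pairs


def aligned_lengths(pairs: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Return (query bases, subject bases) covered by the alignment.

    Assumed convention: M consumes both, I consumes query only, D consumes subject only.
    """
    query = sum(n for op, n in pairs if op in ("M", "I"))
    subject = sum(n for op, n in pairs if op in ("M", "D"))
    return query, subject


ops = parse_cigar_ops("M 57 D 1 M 36 I 2 M 26")
query_len, subject_len = aligned_lengths(ops)
print(query_len, subject_len)
```

Under the assumed convention, the example alignment covers 121 query bases and 120 subject bases.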
4
Pipeline of ssaha_pileup
[Flow diagram: Pipeline of ssaha_pileup]
Inputs: sequencing reads; reference fasta
Alignment: ssaha2
SE reads: ssaha_cigar
PE reads: ssaha_pairs, ssaha_clean
Intermediate: uniquely placed cigar read file
ssaha_pileup: SNP file; pileup/cons
ssaha_indel: indel file
5
Mapping Score in ssaha2
The read mapping score is used to assess how repetitive the read is in the genome. In the cigar file, cigar::50 means that Smap = 50 is the mapping score, where:
R = read length;
Smax = maximum alignment score (Smith-Waterman) of the hits on the genome;
Smax2 = second-best alignment score of the hits on the genome.
Example: a read of 30 bases has a few hits on the genome.
Best hit: exact match, with Smax = 30;
Second-best hit: one base mismatch, with Smax2 = 29.
The mapping score for this read is Smap = 10.
[Figure: read and reference, with values 29, 21, 30, 25, 14, 27]
6
SNP Confidence Score in ssaha2
The SNP score is calculated as the sum of weighted read mapping scores, combined with base quality. For Solexa reads:
Smap: read mapping score, from 0 (repeat) to 50 (unique);
Fq: base quality factor, with Fq = 1 if Q >= 30 and Fq = 0.5 if Q < 30;
N: number of reads covering the location.
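The exact formula is not reproduced in this transcript, so the following is only one possible reading of the description: each of the N reads covering a site contributes its mapping score Smap weighted by the base quality factor Fq, and the contributions are summed. The function names and the plain unnormalised sum are assumptions, not the ssaha_pileup implementation.

```python
from typing import Iterable, Tuple


def base_quality_factor(q: int) -> float:
    """Fq as defined on the slide: 1 if Q >= 30, otherwise 0.5."""
    return 1.0 if q >= 30 else 0.5


def snp_score_sketch(reads: Iterable[Tuple[int, int]]) -> float:
    """Sum Smap * Fq over the reads covering one site.

    `reads` holds (Smap, Q) pairs, one per covering read.
    ASSUMPTION: an unnormalised sum; the real formula may scale or cap it.
    """
    return sum(smap * base_quality_factor(q) for smap, q in reads)


# Three hypothetical reads: unique with high quality, unique with low quality, repeat.
example_reads = [(50, 35), (50, 20), (0, 40)]
print(snp_score_sketch(example_reads))
```

In this reading, a read placed in a repeat (Smap = 0) adds nothing to the score, however good its base quality is.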
7
Transcriptome Reads with “-trans 1”
[Figure labels: short reads; intron; reference sequence]
8
C.elegans Transcriptome Assembly
Solexa reads:
Number of reads: 3,565,445
Transcriptome size: ,933,960 bp
Read length: 35
Estimated read coverage: 3.1X
Number of covered bases: 15,774,889 bp
Number of single coverage: 5,422,688
Assembly features:
Number of contigs:
Total assembled bases: Mb
N50 contig size:
Largest contig:
Averaged contig size:
Contig coverage: %
Contig extension errors: 0
Mis-assembly errors: 0
9
Read Mapping vs De novo Assembly
[Figure panels: Chromosome I; Chromosome V]
10
Pileup/cons File Format
11
SNP output File Format
12
Indels or Homopolymers ?
13
Indels or Homopolymers ?
14
Indels or Homopolymers ?
15
Homopolymers at High Depth
16
Indel Detection: P. falciparum 3D7 Simulations
Simulated Solexa reads:
Number of reads: ,647,985
Genome size: Mbp
Read length: 36
Read coverage: 40x
Number of uniquely placed PE reads: 24,303,362
Percentage of placed PE reads: 94.5%
Number of uniquely placed SE reads: 23,229,651
Percentage of placed SE reads: 90.6%
Detection results:
Number of deletions: ,816
Number of detected deletions: 5,668 (97.5%)
Number of false positives: (2.3%)
Number of insertions: ,816
Number of detected insertions: 5,458 (93.8%)
Number of false positives: 15 (0.26%)
17
SNP/indel Detection Human Chromosome X
Simulated Solexa reads:
Number of reads: m
Chromosome size: Mbp
Read length: bp
Read coverage: 45x
Number of uniquely placed reads: m
Percentage of placed reads: 75.0%
Detection results:
                        Total     Hom       Hez
Number of SNPs:         97,895    30,286    67,609
Percentage in dbSNP:    83.7%     91.7%     80.1%
Number of deletions:    5,597
Number of insertions:   4,294
                        1_base    2_base    3_base
Deletion length:
Insertion length:
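The SNP figures can be cross-checked with a little arithmetic: the Hom and Hez counts should add up to the total, and the total dbSNP percentage should be the count-weighted average of the two per-class percentages. The sketch below does that check. The function name is hypothetical, and the per-class percentages are the rounded values shown on the slide, so the result is only expected to agree to about one decimal place.

```python
from typing import Sequence


def pooled_percentage(counts: Sequence[int], percentages: Sequence[float]) -> float:
    """Count-weighted average of per-class percentages."""
    total = sum(counts)
    hits = sum(c * p / 100.0 for c, p in zip(counts, percentages))
    return 100.0 * hits / total


hom, hez = 30_286, 67_609  # Hom and Hez SNP counts from the slide
print(hom + hez)           # compare with the reported total of 97,895

pooled = pooled_percentage([hom, hez], [91.7, 80.1])
print(round(pooled, 1))    # compare with the reported 83.7%
```

Both checks agree with the reported totals, which suggests the Total column is the simple pooling of the Hom and Hez columns.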
18
Availability
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
More information:
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup-readme
19
Acknowledgements: Yong Gu, Ben Blackburne, Hannes Ponstingl, Tony Cox, Quan Long and other ssaha users