Published by Brandon Pope. Modified over 9 years ago.
1
The cSRA file format
2
example.csra
- A cSRA file contains a serialized file structure.
- It is a read-only archive file format, similar to a tar file.
- The tool “kar” can extract the directories and files.
- All tools in the sra-toolkit can access the data inside directly, without extracting it first.
- cSRA compresses the sequence data by replacing aligned bases with references to the reference data.
3
example.csra: PRIMARY_ALIGNMENT, SECONDARY_ALIGNMENT, SEQUENCE, REFERENCE
- A cSRA archive can contain up to the 4 tables listed above.
- The SEQUENCE table is mandatory.
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables can be missing (if there is no aligned data in the archive).
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables depend on the REFERENCE table.
- An archive that has ALIGNMENT tables but no REFERENCE table is broken.
- The SECONDARY_ALIGNMENT table alone can be missing (if the archive does not contain secondary aligned data).
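The table rules above can be written down as a small consistency check. This is only a sketch of the rules as stated on the slide; the function name `check_csra_tables` and the wording of the messages are made up for illustration and are not part of the sra-toolkit.

```python
# Hypothetical helper: checks a list of table names against the rules on this slide.
ALIGNMENT_TABLES = {"PRIMARY_ALIGNMENT", "SECONDARY_ALIGNMENT"}


def check_csra_tables(tables):
    """Return a list of problems (empty if the table set is consistent)."""
    present = set(tables)
    problems = []
    # Rule: the SEQUENCE table is mandatory.
    if "SEQUENCE" not in present:
        problems.append("SEQUENCE table is mandatory")
    # Rule: alignment tables depend on the REFERENCE table.
    if present & ALIGNMENT_TABLES and "REFERENCE" not in present:
        problems.append("ALIGNMENT tables present but REFERENCE table missing")
    return problems


print(check_csra_tables(["PRIMARY_ALIGNMENT", "REFERENCE", "SEQUENCE"]))  # consistent
print(check_csra_tables(["PRIMARY_ALIGNMENT", "SEQUENCE"]))               # broken archive
```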
4
example.csra: PRIMARY_ALIGNMENT, SECONDARY_ALIGNMENT, SEQUENCE, REFERENCE
The “vdb-dump” tool can display which tables are inside a cSRA archive:
$ vdb-dump example.csra -E
>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE
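If the table listing needs to be consumed by a script, the output shown on this slide can be parsed with a short regular expression. The sketch below works on the text printed above; it assumes the `tbl #N: NAME` layout stays as shown and does not call vdb-dump itself. The name `parse_table_enum` is hypothetical.

```python
import re

# Matches lines such as "tbl #1: PRIMARY_ALIGNMENT" (layout taken from the slide).
TABLE_LINE = re.compile(r"tbl #(\d+):\s*(\S+)")


def parse_table_enum(text):
    """Return the table names listed in the output of 'vdb-dump <archive> -E'."""
    return [name for _number, name in TABLE_LINE.findall(text)]


sample = """>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE
"""
print(parse_table_enum(sample))  # ['PRIMARY_ALIGNMENT', 'REFERENCE', 'SEQUENCE']
```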
5
example.csra: what each table holds
- SEQUENCE: “reassembles” most of the data that came from the original spot
- PRIMARY_ALIGNMENT: information about where and how a primary alignment occurs
- SECONDARY_ALIGNMENT: information about where and how secondary alignments occur
- REFERENCE: contains the reference locally or points to an external reference
6
This slide shows how the SEQUENCE table points to data in the primary (PRIM.) and secondary (SEC.) alignment tables, with three different use cases:
- Spot A has 2 reads: both are primary aligned.
- Spot B has 2 reads: the 1st read has a primary and a secondary alignment; the 2nd read is primary aligned only.
- Spot C has 2 reads: the 1st read is primary aligned; the 2nd read is not aligned.
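The three use cases can be modeled with a tiny data structure. This is an illustrative sketch only: the `Read` class, the field names and all row-id values are invented here, and the slides do not say how an unaligned read is marked in the real columns (the sketch uses `None`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Read:
    # Hypothetical row-id in the PRIMARY_ALIGNMENT table; None means "not aligned".
    primary_id: Optional[int] = None
    # Hypothetical row-ids in the SECONDARY_ALIGNMENT table.
    secondary_ids: List[int] = field(default_factory=list)


def describe(read):
    """Classify a read the way the slide describes spots A, B and C."""
    if read.primary_id is None:
        return "not aligned"
    if read.secondary_ids:
        return "primary and secondary"
    return "primary only"


spots = {
    "A": [Read(primary_id=1), Read(primary_id=2)],
    "B": [Read(primary_id=3, secondary_ids=[1]), Read(primary_id=4)],
    "C": [Read(primary_id=5), Read()],
}
for name, reads in spots.items():
    print(name, [describe(r) for r in reads])
```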
7
The following slides explain the columns in the cSRA file format. The more important columns are highlighted and all other columns support those
8
SEQUENCE table columns (part 1)
Legend: usually a static column (same value for all rows in the table)
- ALIGNMENT_COUNT: vector of integers, how many alignments per read
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CMP_BASE_COUNT: how many unaligned bases are in the whole table
- CMP_READ: compressed read, only the unaligned reads
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CSREAD: READ column translated into color-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, to say that the sequence was produced in color-space
- FIXED_SPOT_LEN: flag, set if all reads have the same length
- MAX_SPOT_ID: id of the last spot in the table
- MIN_SPOT_ID: id of the first spot in the table
- NAME: name of the spot, generated from the row-id
- PLATFORM: name of the platform used to sequence the table
- PRIMARY_ALIGNMENT_ID: pointer back to the primary alignment table, through the row-id in the primary alignment table
- QUALITY: stored quality values, in the direction in which it was sequenced
9
SEQUENCE table columns (part 2)
Legend: usually a static column (same value for all rows in the table)
- READ: assembled or stored bases, in the direction it was sequenced
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_SEG: vector of integer pairs, one for each read [zero-based start offset of read, length of read]
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- SIGNAL_LEN: compatibility with SRA, lengths of recorded signal
- SPOT_COUNT: how many spots are in the table ( == MAX_SPOT_ID )
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_ID: row-id of each spot (1-based)
- SPOT_LEN: how many bases are in this spot
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
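As described above, READ holds all bases of a spot, while READ_START and READ_LEN give the zero-based offset and length of each read, and READ_SEG pairs the two. A minimal sketch of how a spot could be cut into its reads from these columns follows; the function names and the sample values are invented for illustration.

```python
def split_spot(read, read_start, read_len):
    """Cut the bases of one spot into its reads using READ_START / READ_LEN."""
    if len(read_start) != len(read_len):
        raise ValueError("READ_START and READ_LEN must have one entry per read")
    return [read[start:start + length] for start, length in zip(read_start, read_len)]


def read_seg(read_start, read_len):
    """Build READ_SEG-style pairs: (zero-based start offset, length) per read."""
    return list(zip(read_start, read_len))


# Made-up spot with two reads of 4 and 6 bases.
spot = "ACGTACGTAA"
print(split_spot(spot, [0, 4], [4, 6]))  # ['ACGT', 'ACGTAA']
print(read_seg([0, 4], [4, 6]))          # [(0, 4), (4, 6)]
```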
10
PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT table columns (part 1)
Legend: usually a static column (same value for all rows in the table)
- ALIGN_ID: row-id
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CIGAR_LONG: long form of the cigar-string
- CIGAR_SHORT: short form of the cigar-string
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, to say that the sequence was produced in color-space
- EDIT_DISTANCE: number of mismatches
- GLOBAL_REF_START: global position in the reference table
- HAS_MISMATCH: bitfield of mismatches
- HAS_REF_OFFSET: bitfield of offsets in the reference, used to represent indels
- LABEL: label of this alignment, for future compatibility to represent multi-ploid alignment
- LABEL_LEN: length of the label-part to be used
- LABEL_START: start offset of the label-part to be used
- MAPQ: mapping quality
11
PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT table columns (part 2)
Legend: usually a static column (same value for all rows in the table)
- MATE_ALIGN_ID: row-id of the mate of this read (if any)
- MATE_CIGAR_LONG: long form of the cigar-string of the mate
- MATE_CIGAR_SHORT: short form of the cigar-string of the mate
- MATE_EDIT_DISTANCE: number of mismatches in the mate
- MATE_REF_ID: row-id in the reference table for the mate
- MATE_REF_LEN: mate alignment length in reference coordinates
- MATE_REF_NAME: reference-name to which the mate is aligned
- MATE_REF_ORIENTATION: orientation of the mate
- MATE_REF_POS: mate position on the reference
- MAX_SPOT_ID: id of the last spot in the table
- MIN_SPOT_ID: id of the first spot in the table
- MISMATCH: base values of the mismatches
- MISMATCH_QUAL: qualities of the mismatches
- NAME: auto-generated name of the alignment from row-id
- PLATFORM: name of the platform used to sequence the table
- QUALITY: quality of the aligned sequence in the direction of the reference
12
PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT table columns (part 3)
- RAW_READ: original sequence read in the direction of sequencing
- RD_FILTER: vector of flags, one for each read, compatibility with SRA
- READ: sequence read in the direction of the reference
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- REF_ID: row-id in the reference table
- REF_LEN: length of alignment in reference coordinates
- REF_NAME: name of the reference
- REF_OFFSET: orientation of original sequence to the reference
- REF_POS: position on the reference to the start of alignment
- REF_READ: chunk of the reference on which alignment is projected
- REF_SEQ_ID: sequence id of the reference
- REF_START: offset in the row-id of the reference where alignment starts
- REF_TABLE: name of the reference table
13
PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT table columns (part 4)
Legend: usually a static column (same value for all rows in the table)
- SAM_FLAGS: flags to be used in SAM-format
- SAM_QUALITY: quality converted to ASCII representation from sequence-row-id
- SEQ_NAME: auto-generated name of the sequence from sequence-row-id
- SEQ_READ_ID: read-id of sequence being aligned
- SEQ_SPOT_ID: sequence spot id
- SPOT_COUNT: how many spots are in the table ( == MAX_SPOT_ID )
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_LEN: how many bases are in this spot
- TEMPLATE_LEN: size of the template
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
14
REFERENCE table columns (part 1)
Legend: usually a static column (same value for all rows in the table)
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CGRAPH_HIGH: maximum depth of coverage in this chunk
- CGRAPH_INDELS: total number of indels in this chunk
- CGRAPH_LOW: minimum depth of coverage in this chunk
- CGRAPH_MISMATCHES: total number of mismatches between sequence and this chunk
- CIRCULAR: flag if this reference is circular
- CMP_BASE_COUNT: number of bases stored locally
- CMP_READ: locally stored reference
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CSREAD: READ column translated into color-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, to say that the sequence was produced in color-space
- LABEL: description of this chunk
- LABEL_LEN: length of description
- LABEL_START: start offset of description
15
REFERENCE table columns (part 2)
Legend: usually a static column (same value for all rows in the table)
- MAX_SEQ_LEN: maximum size for the chunks in this table
- MAX_SPOT_ID: id of the last chunk in this table
- MIN_SPOT_ID: id of the first chunk in this table
- NAME: name of the sequence, equivalent to what BAM uses in the reference-sequence-name field
- NAME_RANGE: technical column, used for index lookup internally
- PRIMARY_ALIGNMENT_IDS: list of row-ids from the primary alignment table which start their alignment in this chunk
- QUALITY: stores the quality of the reference, auto-generated when not available
- RD_FILTER: vector of flags, one for each read, compatibility with SRA
- READ: the sequence of the reference, merges remote and local reference into one column
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- SECONDARY_ALIGNMENT_IDS: list of row-ids from the secondary alignment table which start their alignment in this chunk
- SEQ_ID: id of remotely stored sequence, used as a key to find the sequence
16
REFERENCE table columns (part 3)
- SEQ_LEN: the length of the chunk from the remotely stored sequence
- SEQ_START: the start of this chunk on the remote sequence
- SPOT_COUNT: number of spots
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_ID: row-id of current chunk
- SPOT_LEN: length of this chunk, used for compatibility with SRA
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
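The REFERENCE table stores a reference as a series of chunks (one row per chunk, at most MAX_SEQ_LEN bases each), and the alignment tables point into it with REF_ID (row-id of the chunk) and REF_START (offset inside that chunk). The sketch below shows how a position on a reference could be mapped to such a pair. It rests on assumptions the slides do not state: every chunk except possibly the last holds exactly MAX_SEQ_LEN bases, positions are zero-based, and `first_row` is the row-id of the first chunk of that reference. The function name and the numbers are hypothetical.

```python
def locate_on_reference(position, max_seq_len, first_row=1):
    """Map a zero-based position on one reference to (chunk row-id, offset in chunk).

    Assumes fixed-size chunks of max_seq_len bases, starting at row-id first_row.
    """
    if position < 0 or max_seq_len <= 0:
        raise ValueError("position must be >= 0 and max_seq_len must be > 0")
    chunk_index, offset = divmod(position, max_seq_len)
    return first_row + chunk_index, offset


# Made-up numbers: chunks of 5000 bases, reference starting at row 1.
print(locate_on_reference(12345, 5000))  # (3, 2345)
```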
17
The following slides show how the sequences are reconstructed from the data stored in cSRA. Play the PowerPoint slides to see the full animation effect
18
Case: MISMATCH
Reference sequence: AGTACGC
Columns involved: HAS_MISMATCH, HAS_REF_OFFSET, MISMATCH, REF_OFFSET
Values shown on the slide: A 0 0 C 0 0 G 0 0 A 1 0 A A 0 0 C 0 0 G 0 0 A
19
Case: INSERT
Reference sequence: AGTACGC
Columns involved: HAS_MISMATCH, HAS_REF_OFFSET, MISMATCH, REF_OFFSET
Values shown on the slide: A 0 0 C 0 0 G 0 0 A 1 0 A T 0 1 A 0 0 C 0 0 A
20
Case: DELETE
Reference sequence: AGTACGC
Columns involved: HAS_MISMATCH, HAS_REF_OFFSET, MISMATCH, REF_OFFSET
Values shown on the slide: A 0 0 C 0 0 G 0 0 A 0 1 C 0 0 G 0 0 T +1
21
Case: COMBINED
Reference sequence: AGTACGC
Columns involved: HAS_MISMATCH, HAS_REF_OFFSET, MISMATCH, REF_OFFSET
Values shown on the slide: A 0 0 A 1 0 G 0 0 A 1 1 A T 0 1 C 0 0 G 0 0 A A A A +1
22
Case: SOFTCLIP
Reference sequence: AGTACGC
Columns involved: HAS_MISMATCH, HAS_REF_OFFSET, MISMATCH, REF_OFFSET
Values shown on the slide: 1 1 1 0 G 0 0 T 0 0 T A 0 0 1 0 1 0 ATA -2 TATA
Note on the slide: defined by ref_pos
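The reconstruction cases above can be sketched as a short loop over the read positions. This is one plausible reading of the slides, not the toolkit's actual code. It assumes that HAS_MISMATCH and HAS_REF_OFFSET carry one flag per read base, that MISMATCH and REF_OFFSET are consumed left to right whenever the matching flag is set, and that a REF_OFFSET is applied to the reference position before the base is looked up. It also assumes that a one-base insert is followed by an offset of -1 and a one-base delete is encoded as +1. The reference string `ACGTACG` and all flag vectors below are made up for illustration.

```python
def restore_read(ref, has_mismatch, has_ref_offset, mismatch, ref_offset, ref_pos=0):
    """Rebuild a read from a reference plus the four alignment columns (sketch)."""
    mismatches = iter(mismatch)
    offsets = iter(ref_offset)
    out = []
    for is_mismatch, is_offset in zip(has_mismatch, has_ref_offset):
        if is_offset:
            # Jump on the reference: positive skips bases (delete),
            # negative steps back (after an insert).
            ref_pos += next(offsets)
        if is_mismatch:
            # The base differs from the reference and is stored explicitly.
            out.append(next(mismatches))
        else:
            # The base is taken from the reference.
            out.append(ref[ref_pos])
        ref_pos += 1
    return "".join(out)


REF = "ACGTACG"  # illustrative reference, chosen for this sketch

# MISMATCH: 4th base is A instead of T.
print(restore_read(REF, [0, 0, 0, 1, 0, 0, 0], [0] * 7, "A", []))                     # ACGAACG
# INSERT: an extra A before the T, then the reference position steps back by 1.
print(restore_read(REF, [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0], "A", [-1]))     # ACGATAC
# DELETE: the T is skipped with an offset of +1.
print(restore_read(REF, [0] * 6, [0, 0, 0, 1, 0, 0], "", [1]))                        # ACGACG
```

Only the mismatching and inserted bases are stored explicitly in this scheme; everything else is looked up in the reference, which is where the compression described at the start of the presentation comes from.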
23
The next slides show the conversion between exploded file structure (created by the loader) and the kar format
24
exploded storage
- more storage space used
- many directories and files
- read- and writable
static kar storage
- less storage space used
- only one file
- read only
25
Converting between exploded storage and static kar storage
- exploded storage to kar file: kar -c karfile_to_create -d path_of_exploded_storage
- kar file to exploded storage: kar -x karfile_to_extract_from -d path_to_be_created
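When the conversion is driven from a script, the two kar invocations can be built as argument lists. The sketch below only assembles the command lines shown on this slide and prints them; the helper names are hypothetical, and actually running the commands (for example with `subprocess.run(cmd, check=True)`) assumes that `kar` is installed and on the PATH.

```python
import shlex


def kar_create_cmd(archive, directory):
    """Command line to pack an exploded directory into a kar file (-c / -d as on the slide)."""
    return ["kar", "-c", archive, "-d", directory]


def kar_extract_cmd(archive, directory):
    """Command line to unpack a kar file into a directory (-x / -d as on the slide)."""
    return ["kar", "-x", archive, "-d", directory]


# Placeholder names taken from the slide; nothing is executed here.
print(shlex.join(kar_create_cmd("karfile_to_create", "path_of_exploded_storage")))
print(shlex.join(kar_extract_cmd("karfile_to_extract_from", "path_to_be_created")))
```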
26
Difference between SRA and cSRA formats
SRA
- One table
- Containing one submission
- Available as exploded storage or as kar-file
- Self-contained, no need for external files to extract data
cSRA
- Up to 4 tables
- Containing one BAM-file
- Available as exploded storage or as kar-file
- Requires external / remote files to extract all data
27
How to use vdb-dump to inspect a cSRA-archive ( part 1 )
What tables are in the cSRA-archive?
$ vdb-dump example.csra -E
>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE
What columns are available in a table?
$ vdb-dump example.csra -T SEQUENCE -o
ALIGNMENT_COUNT (U8)
BASE_COUNT (U64)
BIO_BASE_COUNT (U64)
CMP_BASE_COUNT (U64)
CMP_READ (INSDC:dna:text)
COLOR_MATRIX (U8)
CSREAD (INSDC:color:text)
CS_KEY (INSDC:dna:text)
CS_NATIVE (bool)
FIXED_SPOT_LEN (INSDC:coord:len)
MAX_SPOT_ID (INSDC:SRA:spotid_t)
…
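The column listing has a regular `NAME (TYPE)` shape, so it can be turned into a name-to-type mapping with a few lines. The sketch below parses text copied from this slide; it assumes the layout stays as shown and does not run vdb-dump. The name `parse_column_enum` is hypothetical.

```python
import re

# Matches entries such as "CMP_READ (INSDC:dna:text)" (layout taken from the slide).
COLUMN_ENTRY = re.compile(r"([A-Z][A-Z0-9_]*)\s+\(([^)]+)\)")


def parse_column_enum(text):
    """Return {column name: type} from the output of 'vdb-dump <archive> -T <table> -o'."""
    return dict(COLUMN_ENTRY.findall(text))


sample = """ALIGNMENT_COUNT (U8)
BASE_COUNT (U64)
CMP_READ (INSDC:dna:text)
CS_NATIVE (bool)
"""
columns = parse_column_enum(sample)
print(columns["CMP_READ"])  # INSDC:dna:text
```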
28
How to use vdb-dump to inspect a cSRA-archive ( part 2 )
How to restrict the output to certain columns?
$ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY
READ: CAGGGCGGGCAGCGGGCCTGCCCCCCACCCCCGCGCCCCATGACCCGC…
QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, …
READ: AGGACACAATTACAAGGTGCTGGCCCAACTACTTTCAGTGTACCGTCT…
QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, …
How to restrict the row-range of the output?
$ vdb-dump example.csra -T SEQUENCE -R 10-20 -C READ
READ: TGATCCATCAGCATCGGCCTCCCAAAGTGCTGGGATTACAGGTGT...
READ: AGCCAGGCGTGGTGGTGCGACCCTGTAATCCCAGCTACTTGGGAG...
READ: TAGTGGAGGCCGGCGCAGGAACAGGTTGAACAGTCTACCCTCCCT...
READ: ACTCCAGCCTGGGCAACAGAGCAAGATTCTGACTCAAAAAAAAAA...
READ: TTCTTTCTAAGACAGGGTCTCACTCTGTCGCCCAGGCTGGAGTGC...
READ: TTTCTTTCTCTCTCTCTCTCTTTTTTTTTTTTTTTGAGACAGGGT...
…
29
How to use vdb-dump to inspect a cSRA-archive ( part 3 )
How many rows are in a table?
$ vdb-dump example.csra -T SEQUENCE -r
id-range: first-row = 1, row-count = 51863105
How to create tab-separated output?
$ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY -f tab
CATGTGACTGAACTCTTCACCCCAGTC	30, 30, 30, 30, 30, 30
AAGAGATCCGACATCAAGTGCCCACCT	30, 30, 30, 30, 30, 30
CTCTGTCTCTGCCCCCAGCATCACATT	30, 30, 30, 30, 30, 30
TCCCACAGCTTTAATCACCATCTAAAA	30, 30, 30, 30, 30, 30
TGACTCCCACCTTCACTCTCCCATGTC	30, 30, 30, 30, 30, 30
How to output phred33-quality?
$ vdb-dump example.csra -T SEQUENCE -C '(INSDC:quality:text:phred_33)QUALITY'
???????????????5???????5???????5?????????+????5
???????????????????????????????????????????????
???????????5?+???55+55????5?+??5?55???5+??5++5+
+?+?+++555+55+?++555?+??++++55++55+5?+?++?5?5++
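The relation between the numeric QUALITY values and the phred33 text shown above is a fixed offset: Phred+33 writes each quality value as the ASCII character with code value + 33. A minimal sketch of the conversion in both directions (the function names are made up):

```python
def to_phred33(qualities):
    """Encode numeric quality values as a Phred+33 text string."""
    return "".join(chr(q + 33) for q in qualities)


def from_phred33(text):
    """Decode a Phred+33 text string back into numeric quality values."""
    return [ord(ch) - 33 for ch in text]


print(to_phred33([30, 30, 20, 10]))  # ??5+
print(from_phred33("??5+"))          # [30, 30, 20, 10]
```

Under this encoding, the `?` characters in the output correspond to quality 30, `5` to quality 20 and `+` to quality 10.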
30
General BAM Alignment Process
Diagram elements: REFERENCE, PRIMARY ALIGNMENT, SECONDARY ALIGNMENT, SEQUENCE, READS, unaligned READS
The reference feeds into the primary alignment table, which in turn feeds data into the sequence table. The secondary alignment table takes data from the reference and the sequence tables to form the alignment data. The sequence table can also include unaligned reads.
31
Complete Genomics BAM Alignment Process
Diagram elements: REFERENCE, PRIMARY ALIGNMENT, SECONDARY ALIGNMENT, SEQUENCE, READS, unaligned READS, ALLELES, EVIDENCE ALIGNMENT, EVIDENCE INTERNALS