
The cSRA file format

example.csra
- A cSRA-file contains a serialized file-structure.
- It is a read-only archive file format, similar to a tar-file.
- The tool “kar” can extract the directories and files.
- All tools in the sra-toolkit can access the data inside directly, without prior extraction.
- cSRA compresses the sequence data by replacing aligned base pairs with reference data.

example.csra: PRIMARY_ALIGNMENT, SECONDARY_ALIGNMENT, SEQUENCE, REFERENCE
- The cSRA-archive contains the 4 tables listed above.
- The SEQUENCE table is mandatory for the archive.
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables can be missing (if there is no aligned data in the archive).
- The PRIMARY_ALIGNMENT and SECONDARY_ALIGNMENT tables depend on the REFERENCE table.
- An archive that has ALIGNMENT tables but no REFERENCE table is broken.
- The SECONDARY_ALIGNMENT table can be missing (if the archive does not contain secondary aligned data).
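The table rules on this slide can be written down as a small consistency check. The following is only a sketch in Python: the function name `check_csra_tables` and the wording of the messages are hypothetical, and it works on a plain list of table names rather than on a real archive.

```python
def check_csra_tables(tables):
    """Return a list of problems for a collection of cSRA table names.

    Hypothetical helper: it encodes only the rules stated on the slide.
    """
    present = set(tables)
    problems = []

    # Rule: the SEQUENCE table is mandatory.
    if "SEQUENCE" not in present:
        problems.append("SEQUENCE table is mandatory")

    # Rule: alignment tables depend on the REFERENCE table.
    alignment_tables = {"PRIMARY_ALIGNMENT", "SECONDARY_ALIGNMENT"}
    if present & alignment_tables and "REFERENCE" not in present:
        problems.append("ALIGNMENT tables require a REFERENCE table")

    return problems


print(check_csra_tables(["PRIMARY_ALIGNMENT", "REFERENCE", "SEQUENCE"]))
print(check_csra_tables(["PRIMARY_ALIGNMENT", "SEQUENCE"]))
```

An archive with only a SEQUENCE table passes the check, which matches the case of an archive without aligned data.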

The “vdb-dump” tool can display which tables are inside a cSRA archive:

    $ vdb-dump example.csra -E
    >>> enumerating the tables of database >example.csra<
    tbl #1: PRIMARY_ALIGNMENT
    tbl #2: REFERENCE
    tbl #3: SEQUENCE
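If the table listing needs to be consumed by a script, the `tbl #N: NAME` lines shown above can be picked out with a regular expression. This is a minimal sketch in Python that works on captured text only; the function name `parse_table_listing` is hypothetical, and the sample text is the output shown on the slide.

```python
import re


def parse_table_listing(text):
    """Extract table names from lines of the form 'tbl #1: PRIMARY_ALIGNMENT'."""
    return re.findall(r"tbl #\d+:\s*(\S+)", text)


sample = """>>> enumerating the tables of database >example.csra<
tbl #1: PRIMARY_ALIGNMENT
tbl #2: REFERENCE
tbl #3: SEQUENCE
"""

print(parse_table_listing(sample))
```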

example.csra
- SEQUENCE: “reassembles” most of the data that came from the original spot
- PRIMARY_ALIGNMENT: information about where and how a primary alignment occurs
- SECONDARY_ALIGNMENT: information about where and how secondary alignments occur
- REFERENCE: contains the reference locally or points to an external reference

This slide shows how the SEQUENCE table points to data in the primary (PRIM.) and secondary (SEC.) alignment tables, with three different use cases:
- Spot A has 2 reads: both are primary aligned.
- Spot B has 2 reads: the 1st read has a primary and a secondary alignment; the 2nd read is primary aligned only.
- Spot C has 2 reads: the 1st read is primary aligned; the 2nd read is not aligned.
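The three use cases can be modelled with one vector of primary alignment row-ids per spot, one entry per read. The sketch below is in Python; the row-id values are made up, and treating 0 as "this read has no primary alignment" is an assumption made for the illustration, not something stated on the slide.

```python
# Hypothetical PRIMARY_ALIGNMENT_ID vectors, one entry per read.
# Assumption: 0 marks a read without a primary alignment.
spots = {
    "A": [1, 2],  # both reads primary aligned
    "B": [3, 4],  # both reads primary aligned (read 1 also has a secondary alignment)
    "C": [5, 0],  # 2nd read not aligned
}


def classify_spot(primary_ids):
    """Describe a spot by how many of its reads have a primary alignment."""
    aligned = sum(1 for row_id in primary_ids if row_id != 0)
    if aligned == len(primary_ids):
        return "fully aligned"
    if aligned == 0:
        return "unaligned"
    return "partially aligned"


for name, ids in spots.items():
    print(name, classify_spot(ids))
```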

The following slides explain the columns in the cSRA file format. The more important columns are highlighted and all other columns support those

SEQUENCE (part 1)
- ALIGNMENT_COUNT: vector of integers, how many alignments per read
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CMP_BASE_COUNT: how many unaligned bases are in the whole table
- CMP_READ: compressed read, only the unaligned reads
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CSREAD: READ column translated into color-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, says that the sequence was produced in color-space
- FIXED_SPOT_LEN: flag, set if all reads have the same length
- MAX_SPOT_ID: id of the last spot in the table
- MIN_SPOT_ID: id of the first spot in the table
- NAME: name of the spot, generated from the row-id
- PLATFORM: name of the platform used to sequence the table
- PRIMARY_ALIGNMENT_ID: pointer back to the primary alignment table, through the row-id in the primary alignment table
- QUALITY: stored quality values, in the direction in which it was sequenced
Legend: usually a static column (same value for all rows in the table)

SEQUENCE (part 2)
- READ: assembled or stored bases, in the direction it was sequenced
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_SEG: vector of integer-pairs, one for each read [zero-based start offset of read, length of read]
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- SIGNAL_LEN: compatibility with SRA, lengths of recorded signal
- SPOT_COUNT: how many spots are in the table ( == MAX_SPOT_ID )
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_ID: row-id of each spot (1-based)
- SPOT_LEN: how many bases are in this spot
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
Legend: usually a static column (same value for all rows in the table)
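READ_START and READ_LEN together say where each read sits inside the spot's READ string. A minimal sketch in Python, with made-up values and a hypothetical function name `split_spot`:

```python
def split_spot(read, read_start, read_len):
    """Cut a spot's READ string into its reads.

    read_start holds zero-based start offsets, read_len the read lengths,
    one entry per read (as described for READ_START and READ_LEN).
    """
    return [read[start:start + length] for start, length in zip(read_start, read_len)]


# Made-up spot with two reads of 4 bases each.
print(split_spot("ACGTTTGA", read_start=[0, 4], read_len=[4, 4]))
```

READ_SEG carries the same information as pairs, so `zip(read_start, read_len)` here plays the role of the READ_SEG vector.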

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT (part 1)
- ALIGN_ID: row-id
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CIGAR_LONG: long form of the cigar-string
- CIGAR_SHORT: short form of the cigar-string
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, says that the sequence was produced in color-space
- EDIT_DISTANCE: number of mismatches
- GLOBAL_REF_START: global position in the reference table
- HAS_MISMATCH: bitfield of mismatches
- HAS_REF_OFFSET: bitfield of offsets in the reference, used to represent indels
- LABEL: label of this alignment, for future compatibility to represent multi-ploid alignment
- LABEL_LEN: length of the label-part to be used
- LABEL_START: start offset of the label-part to be used
- MAPQ: mapping quality
Legend: usually a static column (same value for all rows in the table)

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT (part 2)
- MATE_ALIGN_ID: row-id of the mate of this read (if any)
- MATE_CIGAR_LONG: long form of the cigar-string of the mate
- MATE_CIGAR_SHORT: short form of the cigar-string of the mate
- MATE_EDIT_DISTANCE: number of mismatches in the mate
- MATE_REF_ID: row-id in the reference-table in the mate
- MATE_REF_LEN: mate alignment lines in reference coordinates
- MATE_REF_NAME: reference-name to which the mate is aligned
- MATE_REF_ORIENTATION: orientation of the mate
- MATE_REF_POS: mate position on the reference
- MAX_SPOT_ID: id of the last spot in the table
- MIN_SPOT_ID: id of the first spot in the table
- MISMATCH: base values of the mismatches
- MISMATCH_QUAL: qualities of the mismatches
- NAME: auto-generated name of the alignment from row-id
- PLATFORM: name of the platform used to sequence the table
- QUALITY: quality of the aligned sequence in the direction of the reference
Legend: usually a static column (same value for all rows in the table)

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT (part 3)
- RAW_READ: original sequence read in the direction of sequencing
- RD_FILTER: vector of flags, one for each read, compatibility with SRA
- READ: sequence read in the direction of the reference
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- REF_ID: row-id in the reference table
- REF_LEN: length of alignment in reference coordinates
- REF_NAME: name of the reference
- REF_OFFSET: orientation of original sequence to the reference
- REF_POS: position on the reference to the start of alignment
- REF_READ: chunk of the reference on which alignment is projected
- REF_SEQ_ID: sequence id of the reference
- REF_START: offset in the row-id of the reference where alignment starts
- REF_TABLE: name of the reference table

PRIMARY_ALIGNMENT / SECONDARY_ALIGNMENT (part 4)
- SAM_FLAGS: flags to be used in SAM-format
- SAM_QUALITY: quality converted to ascii presentation from sequence-row-id
- SEQ_NAME: auto-generated name of the sequence from sequence-row-id
- SEQ_READ_ID: read-id of sequence being aligned
- SEQ_SPOT_ID: sequence spot id
- SPOT_COUNT: how many spots are in the table ( == MAX_SPOT_ID )
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_LEN: how many bases are in this spot
- TEMPLATE_LEN: size of the template
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
Legend: usually a static column (same value for all rows in the table)

REFERENCE (part 1)
- BASE_COUNT: how many bases are in the whole table
- BIO_BASE_COUNT: how many bases (excluding adapters) are in the whole table
- CGRAPH_HIGH: maximum depth of coverage in this chunk
- CGRAPH_INDELS: total number of indels in this chunk
- CGRAPH_LOW: minimum depth of coverage in this chunk
- CGRAPH_MISMATCHES: total number of mismatches between sequence and this chunk
- CIRCULAR: flag, set if this reference is circular
- CMP_BASE_COUNT: number of bases stored locally
- CMP_READ: locally stored reference
- COLOR_MATRIX: static field to describe the translation between color-space and base-space
- CSREAD: READ column translated into color-space
- CS_KEY: key for translation between color-space and base-space
- CS_NATIVE: flag, says that the sequence was produced in color-space
- LABEL: description of this chunk
- LABEL_LEN: length of description
- LABEL_START: start offset of description
Legend: usually a static column (same value for all rows in the table)

REFERENCE (part 2)
- MAX_SEQ_LEN: maximum size for the chunks in this table
- MAX_SPOT_ID: id of the last chunk in this table
- MIN_SPOT_ID: id of the first chunk in this table
- NAME: name of the sequence, equivalent to what BAM uses in the reference-sequence-name field
- NAME_RANGE: technical column, used for index lookup internally
- PRIMARY_ALIGNMENT_IDS: list of row-ids from the primary alignment table which start their alignment in this chunk
- QUALITY: stores the quality of the reference, auto-generated when not available
- RD_FILTER: vector of flags, one for each read, compatibility with SRA
- READ: the sequence of the reference, merges remote and local reference into one column
- READ_FILTER: vector of flags, one for each read, compatibility with SRA
- READ_LEN: vector of integers, one for each read, length of each read
- READ_START: vector of integers, one for each read, zero-based start offset of read
- READ_TYPE: vector of flags, one for each read, tells if the read is biological or adapter and the direction it was sequenced
- SECONDARY_ALIGNMENT_IDS: list of row-ids from the secondary alignment table which start their alignment in this chunk
- SEQ_ID: id of remotely stored sequence, used as a key to find the sequence
Legend: usually a static column (same value for all rows in the table)

REFERENCE (part 3)
- SEQ_LEN: the length of the chunk from the remotely stored sequence
- SEQ_START: the start of this chunk on the remote sequence
- SPOT_COUNT: number of spots
- SPOT_GROUP: describes grouping in the reads, equivalent to read-group in BAM
- SPOT_ID: row-id of current chunk
- SPOT_LEN: length of this chunk, used for compatibility with SRA
- TRIM_LEN: length of the section not subject to be trimmed
- TRIM_START: start of the section not subject to be trimmed
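The REFERENCE columns describe a reference that is stored as a series of chunks, each at most MAX_SEQ_LEN bases long, with SEQ_START and SEQ_LEN locating the chunk on the full sequence. The Python sketch below only illustrates that idea: the function name `chunk_reference` is hypothetical, the sizes are made up, and the use of zero-based start offsets is an assumption (the slides do not state the coordinate base of SEQ_START).

```python
def chunk_reference(total_len, max_seq_len):
    """List the chunks needed to cover a reference of total_len bases.

    Assumption: SEQ_START is treated as a zero-based offset in this sketch.
    """
    chunks = []
    for row_id, start in enumerate(range(0, total_len, max_seq_len), start=1):
        chunks.append({
            "SPOT_ID": row_id,  # row-id of the chunk (1-based)
            "SEQ_START": start,
            "SEQ_LEN": min(max_seq_len, total_len - start),
        })
    return chunks


# Made-up sizes: a 12000-base reference cut into chunks of at most 5000 bases.
for chunk in chunk_reference(12000, 5000):
    print(chunk)
```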

The following slides show how the sequences are reconstructed from the data stored in cSRA. Play the PowerPoint slides to see the full animation effect

Reconstruction cases shown on the animated slides. Every slide uses the reference sequence AGTACGC and shows, for each base of the read, its HAS_MISMATCH and HAS_REF_OFFSET flags, together with the MISMATCH and REF_OFFSET values:
- case MISMATCH: one base is flagged HAS_MISMATCH = 1; MISMATCH = A
- case INSERT: one base is flagged HAS_MISMATCH = 1 and one base is flagged HAS_REF_OFFSET = 1; MISMATCH = A
- case DELETE: one base is flagged HAS_REF_OFFSET = 1; REF_OFFSET = +1
- case COMBINED: bases flagged HAS_MISMATCH, HAS_REF_OFFSET, or both; REF_OFFSET = +1
- case SOFTCLIP: REF_OFFSET = -2; start defined by ref_pos
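The reconstruction idea can be sketched in Python: walk along the read, copy each base from the reference unless HAS_MISMATCH is set (then take the next value from MISMATCH), and shift the reference position by the next REF_OFFSET value wherever HAS_REF_OFFSET is set. This is an illustration built from the flags named on the slides, not the toolkit's implementation; the function name `restore_read` and the exact flag vectors in the examples are assumptions.

```python
def restore_read(ref, has_mismatch, has_ref_offset, mismatch, ref_offset):
    """Rebuild a read from a reference chunk and the per-base flags.

    ref            -- reference bases, starting at the alignment position
    has_mismatch   -- one 0/1 flag per read base
    has_ref_offset -- one 0/1 flag per read base
    mismatch       -- bases to use where has_mismatch is 1, in order
    ref_offset     -- shifts to apply where has_ref_offset is 1, in order
    """
    mismatches = iter(mismatch)
    offsets = iter(ref_offset)
    ref_pos = 0
    read = []
    for mm, ro in zip(has_mismatch, has_ref_offset):
        if ro:
            # Skip (positive) or revisit (negative) reference bases.
            ref_pos += next(offsets)
        if mm:
            read.append(next(mismatches))
        else:
            read.append(ref[ref_pos])
        ref_pos += 1
    return "".join(read)


REF = "AGTACGC"

# Mismatch at the third base: T on the reference, A in the read.
print(restore_read(REF, [0, 0, 1, 0, 0, 0, 0], [0] * 7, "A", []))

# Deletion of the reference T: an offset of +1 before the third read base.
print(restore_read(REF, [0] * 6, [0, 0, 1, 0, 0, 0], "", [1]))
```

Only the mismatching bases and the offsets need to be stored per alignment; everything else comes from the reference, which is where the compression comes from.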

The next slides show the conversion between exploded file structure (created by the loader) and the kar format

exploded storage:
- more storage space used
- many directories and files
- readable and writable

static kar storage:
- less storage space used
- only one file
- read only

exploded storage to static kar storage:

    kar -c karfile_to_create -d path_of_exploded_storage

static kar storage to exploded storage:

    kar -x karfile_to_extract_from -d path_to_be_created
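When these conversions are scripted, the two command lines can be assembled as argument lists. The Python sketch below only builds the lists from the flags shown on the slide (`-c`, `-x`, `-d`); the helper names and the file names are hypothetical, and nothing is executed.

```python
def kar_create_cmd(archive, exploded_dir):
    """Argument list for packing an exploded directory into a kar file."""
    return ["kar", "-c", archive, "-d", exploded_dir]


def kar_extract_cmd(archive, target_dir):
    """Argument list for unpacking a kar file into a new directory."""
    return ["kar", "-x", archive, "-d", target_dir]


print(kar_create_cmd("example.csra", "example_exploded"))
print(kar_extract_cmd("example.csra", "example_extracted"))
```

A list built this way can be passed to `subprocess.run(cmd, check=True)` on a machine where the sra-toolkit is installed.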

Difference between SRA and cSRA formats

SRA:
- One table
- Contains one submission
- Available as exploded storage or as kar-file
- Self-contained, no need for external files to extract data

cSRA:
- Up to 4 tables
- Contains one BAM-file
- Available as exploded storage or as kar-file
- Requires external / remote files to extract all data

How to use vdb-dump to inspect a cSRA-archive (part 1)

What tables are in the cSRA-archive?

    $ vdb-dump example.csra -E
    >>> enumerating the tables of database >example.csra<
    tbl #1: PRIMARY_ALIGNMENT
    tbl #2: REFERENCE
    tbl #3: SEQUENCE

What columns are available in a table?

    $ vdb-dump example.csra -T SEQUENCE -o
    ALIGNMENT_COUNT (U8)
    BASE_COUNT (U64)
    BIO_BASE_COUNT (U64)
    CMP_BASE_COUNT (U64)
    CMP_READ (INSDC:dna:text)
    COLOR_MATRIX (U8)
    CSREAD (INSDC:color:text)
    CS_KEY (INSDC:dna:text)
    CS_NATIVE (bool)
    FIXED_SPOT_LEN (INSDC:coord:len)
    MAX_SPOT_ID (INSDC:SRA:spotid_t)
    …

How to use vdb-dump to inspect a cSRA-archive (part 2)

How to restrict the output to certain columns?

    $ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY
    READ: CAGGGCGGGCAGCGGGCCTGCCCCCCACCCCCGCGCCCCATGACCCGC…
    QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, …
    READ: AGGACACAATTACAAGGTGCTGGCCCAACTACTTTCAGTGTACCGTCT…
    QUALITY: 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, …

How to restrict the row-range of the output?

    $ vdb-dump example.csra -T SEQUENCE -R -C READ
    READ: TGATCCATCAGCATCGGCCTCCCAAAGTGCTGGGATTACAGGTGT...
    READ: AGCCAGGCGTGGTGGTGCGACCCTGTAATCCCAGCTACTTGGGAG...
    READ: TAGTGGAGGCCGGCGCAGGAACAGGTTGAACAGTCTACCCTCCCT...
    READ: ACTCCAGCCTGGGCAACAGAGCAAGATTCTGACTCAAAAAAAAAA...
    READ: TTCTTTCTAAGACAGGGTCTCACTCTGTCGCCCAGGCTGGAGTGC...
    READ: TTTCTTTCTCTCTCTCTCTCTTTTTTTTTTTTTTTGAGACAGGGT...
    …

How to use vdb-dump to inspect a cSRA-archive (part 3)

How many rows are in a table?

    $ vdb-dump example.csra -T SEQUENCE -r
    id-range: first-row = 1, row-count =

How to create tab-separated output?

    $ vdb-dump example.csra -T SEQUENCE -C READ,QUALITY -f tab
    CATGTGACTGAACTCTTCACCCCAGTC	30, 30, 30, 30, 30, 30
    AAGAGATCCGACATCAAGTGCCCACCT	30, 30, 30, 30, 30, 30
    CTCTGTCTCTGCCCCCAGCATCACATT	30, 30, 30, 30, 30, 30
    TCCCACAGCTTTAATCACCATCTAAAA	30, 30, 30, 30, 30, 30
    TGACTCCCACCTTCACTCTCCCATGTC	30, 30, 30, 30, 30, 30

How to output phred33-quality?

    $ vdb-dump example.csra -T SEQUENCE -C '(INSDC:quality:text:phred_33)QUALITY'
    ???????????????5???????5???????5?????????+????5
    ???????????????????????????????????????????????
    ???????????5?+???55+55????5?+??5?55???5+?? ?+? ?++555?+?? ?+?++?5?5++
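The phred33 text relates to the numeric QUALITY values by the usual Phred+33 encoding: each quality value is stored as the ASCII character with code value + 33, so a quality of 30 prints as `?`. A minimal Python sketch of both directions (the function names are hypothetical):

```python
def to_phred33(qualities):
    """Encode numeric quality values as a Phred+33 ASCII string."""
    return "".join(chr(q + 33) for q in qualities)


def from_phred33(text):
    """Decode a Phred+33 ASCII string back into numeric quality values."""
    return [ord(ch) - 33 for ch in text]


print(to_phred33([30, 30, 20, 10]))  # prints ??5+
print(from_phred33("?5+"))           # prints [30, 20, 10]
```

The characters `?`, `5` and `+` seen in the vdb-dump output therefore correspond to quality values 30, 20 and 10.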

General BAM Alignment Process
[Diagram: REFERENCE, PRIMARY ALIGNMENT, SECONDARY ALIGNMENT, SEQUENCE, READS, unaligned READS]
The reference feeds into the primary alignment table, which in turn feeds data into the sequence table. The secondary alignment table takes data from the reference and the sequence tables to form the alignment data. The sequence table can also include unaligned reads.

Complete Genomics BAM Alignment Process
[Diagram: REFERENCE, PRIMARY ALIGNMENT, SECONDARY ALIGNMENT, SEQUENCE, READS, unaligned READS, ALLELES, EVIDENCE ALIGNMENT, EVIDENCE INTERNALS]