Data Workflow Overview

Genomics High-Throughput Facility
● Genome Analyzer IIx

Institute for Genomics and Bioinformatics

Computation Resources
● ~800 processors
● Sun Grid Engine

Storage Capacity
● ~100TB (secured)
● Fast drives
● 30TB for HTS

Public Web Servers
● HTTP, FTP
● Dedicated hosts
● User accounts

HTS: 700GB/day
Bandwidth: 10Gb/s

USER
● Sample Analysis Requests (via web interface)
● Analysis Results (FTP server)
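To put the throughput and capacity figures in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 1000 GB), that the 30TB HTS allocation receives the full 700GB/day, and that the 10Gb/s link runs at its nominal rate with no protocol overhead. None of these assumptions is stated on the slide, and the function names are illustrative only.

```python
# Figures from the slide; the unit convention (1 TB = 1000 GB) is an assumption.
HTS_OUTPUT_GB_PER_DAY = 700
HTS_STORAGE_TB = 30
BANDWIDTH_GBIT_PER_S = 10


def days_until_full(storage_tb, output_gb_per_day):
    """Days of sequencing output that fit in the given storage."""
    return storage_tb * 1000 / output_gb_per_day


def transfer_seconds(size_gb, bandwidth_gbit_per_s):
    """Ideal transfer time: gigabytes -> gigabits, divided by the line rate."""
    return size_gb * 8 / bandwidth_gbit_per_s


print(round(days_until_full(HTS_STORAGE_TB, HTS_OUTPUT_GB_PER_DAY), 1))  # 42.9
print(transfer_seconds(HTS_OUTPUT_GB_PER_DAY, BANDWIDTH_GBIT_PER_S))     # 560.0
```

Under these assumptions the HTS allocation holds roughly six weeks of output, and one day of data would cross the link in under ten minutes at the nominal rate.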
Data Analysis Workflow

IMAGES (2-4 TB)
    Image Analysis: Firecrest
INTENSITIES (GB)
    Base Calling: Bustard
BASE CALLS (GB)
    Synthesis: Gerald
SEQUENCES + SCORES (20/30 GB)
    Alignment: ELAND + Reference Genome
GENOME ALIGNMENT (>100 GB)
    Read Counting: Casava VDC
READ COUNTS
    Sample-Specific Analysis, Visualization… e.g. Genome alignment, RNAseq, CHIPseq analysis

Downloadable files for HTS users: FASTQ files
Sequences, Scores

ATATTCTTATATAAAAATATAATTATTTTAATATTTGGTCCTTTCGTACTAAAATAT
+HWUSI-EAS1562_0001:8:1:1119:18138#0/1
AGAAAGCTTTGAAAATTATGTATACGCCTCGTAAGCCCAGTCCAAAGTCAAGACCA
+HWUSI-EAS1562_0001:8:1:1119:13476#0/1
a_^`a`_a[[NOONN__V__`Y^`^X]R[]]]]]Q```Y````__`^W`YVUPR]]

● Sequence identifier
● Raw sequence
● Phred base calling quality scores (0 to 62 encoded using ASCII 64 to 126)
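The quality encoding described on this slide (scores 0 to 62 stored as ASCII 64 to 126) can be decoded with a minimal sketch like the one below. The function names are illustrative. The conversion to an error probability uses the standard Phred definition P = 10^(-Q/10), which the slide does not spell out.

```python
PHRED_OFFSET = 64  # from the slide: score 0 maps to ASCII 64, score 62 to ASCII 126
MAX_SCORE = 62


def decode_quality(quality):
    """Turn a quality string into a list of integer Phred scores."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]
    for ch, score in zip(quality, scores):
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"character {ch!r} is outside the ASCII 64-126 range")
    return scores


def error_probability(score):
    """Standard Phred relation: probability that the base call is wrong."""
    return 10 ** (-score / 10)


# First ten characters of the quality line shown on the slide.
print(decode_quality("a_^`a`_a[["))  # [33, 31, 30, 32, 33, 32, 31, 33, 27, 27]
```

A score of 30 corresponds to an error probability of about 1 in 1000, so most of the leading bases in this read are high-confidence calls.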
Genome Alignment (ELAND)

HWUSI-EAS1562_0001:8:1:1119:18138#0/1 ATATTCTTATATAAAAATATAATTATTTTAATATTTGGTCCTTTCGTACTAAAATAT U chr1.fa F 23G
HWUSI-EAS1562_0001:8:1:1119:13476#0/1 AGAAAGCTTTGAAAATTATGTATACGCCTCGTAAGCCCAGTCCAAAGTCAAGACCA U chr12.fa F

● Sequence identifier
● Raw sequence
● Type of match
● Number of exact/1-error/2-error matches
● Chromosome/Position/Direction
● Substitution
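As an illustration of how the fields listed on this slide might be read programmatically, here is a hedged sketch of a line parser. It assumes tab-separated columns in the order the slide labels them. Actual ELAND output may contain additional columns, so anything after the direction field is kept only if it looks like a substitution such as 23G. The `ElandHit` type and `parse_eland_line` function are hypothetical names, not part of any Illumina tool.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Assumed layout (tab-separated), following the field labels on the slide:
# id, sequence, match type, three match counts, chromosome, position,
# direction, then any substitution fields such as "23G".
SUBSTITUTION = re.compile(r"\d+[ACGTN]")


class ElandHit(NamedTuple):
    read_id: str
    sequence: str
    match_type: str
    match_counts: Tuple[int, ...]  # exact, 1-error, 2-error; empty if absent
    chromosome: Optional[str]
    position: Optional[int]
    direction: Optional[str]
    substitutions: List[str]


def parse_eland_line(line: str) -> ElandHit:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least id, sequence and match type")
    # Non-numeric placeholders in the count columns are skipped.
    counts = tuple(int(f) for f in fields[3:6] if f.isdigit())
    located = len(fields) >= 9 and fields[7].isdigit()
    return ElandHit(
        read_id=fields[0],
        sequence=fields[1],
        match_type=fields[2],
        match_counts=counts,
        chromosome=fields[6] if located else None,
        position=int(fields[7]) if located else None,
        direction=fields[8] if located else None,
        substitutions=[f for f in fields[9:] if SUBSTITUTION.fullmatch(f)],
    )


# Hypothetical line: the read ID, sequence, chr1.fa, F and 23G come from the
# slide; the match type "U1", the counts and the position 12345 are invented.
sample = "\t".join([
    "HWUSI-EAS1562_0001:8:1:1119:18138#0/1",
    "ATATTCTTATATAAAAATATAATTATTTTAATATTTGGTCCTTTCGTACTAAAATAT",
    "U1", "0", "1", "0", "chr1.fa", "12345", "F", "23G",
])
hit = parse_eland_line(sample)
print(hit.chromosome, hit.position, hit.direction, hit.substitutions)
```

Reads without a reported location (fewer than nine columns) come back with `None` for chromosome, position and direction, so downstream code can filter them out explicitly.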
Read Counts (Casava VDC)

● Matches with Genes, Exons, Splice junctions
● Columns: Chromosome, Gene, Matches
● Files for visualization (GenomeStudio)
● Genome alignment, Gene expression, RNAseq and CHIPseq analysis