Introduction to the processing of raw data
Giuseppe D'Auria, FISABIO, Valencia
Norwich, September 2014
Data Storage: Size ranges
Sanger sequencing: datasets in the order of thousands of sequences
454: datasets in the order of hundreds of thousands of sequences
Illumina: datasets in the order of millions of sequences
SOLiD: datasets in the order of xxx millions of sequences
Data Storage: Backup
We spend much more money on sequencing than on securing the data we obtain!
Think about backups.
Our PC/server: Time Machine, rsync, cron, etc.
(Figure labels: Few euros, PC, Daily; Few euros, PC, Weekly)
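As a minimal sketch of the backup idea above, a project folder can be mirrored to a second location with rsync. The helper name `backup_project` and all folder names are hypothetical placeholders, not paths from the course machine, and the function falls back to `cp -a` when rsync is not installed.

```shell
# Mirror a project folder into a backup location.
# backup_project is a hypothetical helper name, not an existing tool.
backup_project() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    if command -v rsync >/dev/null 2>&1; then
        # -a keeps permissions, timestamps and symbolic links;
        # on later runs only new or changed files are transferred
        rsync -a "$src"/ "$dest"/
    else
        # Fallback when rsync is not installed: plain archive copy
        cp -a "$src"/. "$dest"/
    fi
}

# Demo in a throw-away directory (all names are placeholders)
demo=$(mktemp -d)
mkdir -p "$demo/project/original_data"
echo ">read1" > "$demo/project/original_data/dataset1.fasta"
backup_project "$demo/project" "$demo/backup/project"
ls -R "$demo/backup"
```

Saved as a script, a call like this could be scheduled daily or weekly with cron, in the spirit of the slide; the schedule and the destination disk are choices left to each lab.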
Data Storage: Disk structure
(Figure: folder names shown on the disk: tmp, arg1, biblio, 20XX, Data new, Final1, Analysis new, Analysis new2, Analysis new 3, Final2, Final, backup, backup2, data, data2, tmp)
Data Storage: Disk structure
AVOID COPYING, AND COPYING, AND SECURITY COPYING, AND COPYING AGAIN data that is not useful.
Better to use symbolic links that just point to the big data files you need:
> ln -s TARGET LINK_NAME
(Figure: Project Folder, with labels Analysis, References, Original Sequence data, Filtered sequences, TXT, Analysis 1, Analysis 1.1, Analysis 1.2)
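The linking advice above can be tried safely in a temporary directory. This is only a sketch: the folder names (`Sequences`, `project`, `original_data`, `passed`, `analysis`) and the tiny stand-in file are assumptions chosen for illustration.

```shell
# Work in a throw-away directory so nothing real is touched
work=$(mktemp -d)

# Stand-in for the big raw data (hypothetical file name and content)
mkdir -p "$work/Sequences"
printf '>read1\nACGT\n' > "$work/Sequences/dataset1.fasta"

# Project skeleton: originals, filtered reads, analyses
mkdir -p "$work/project/original_data" "$work/project/passed" "$work/project/analysis"

# Link instead of copy: ln -s TARGET LINK_NAME
ln -s "$work/Sequences/dataset1.fasta" "$work/project/original_data/dataset1.fasta"

# The listing shows the link with an arrow to its target,
# and the link can be read like the real file
ls -l "$work/project/original_data"
cat "$work/project/original_data/dataset1.fasta"
```

Using an absolute path for TARGET is the safer habit, because a relative target is resolved from the directory that holds the link, not from the directory where `ln` was run.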
The system: Windows or Linux?
Both allow good bioinformatics analysis.
Linux is more stable for massive data-crunching analysis, and it is FREE.
Most software works on both systems, but several tools work exclusively on Linux.
Windows is not FREE.
The best setup for bioinformatics (just my personal advice): a Linux desktop system (Ubuntu, Fedora) + a virtual machine (VirtualBox).
Data Formats: FASTA and QUAL

QUALITY (headers only; quality scores not shown)
>G12OEMT03CWVU
>G12OEMT03DH3XQ
>G12OEMT03DD28C
>G12OEMT03DGQ
>G12OEMT03C0MSF

FASTA
>G12OEMT03CWVU1
AGAGTTTGATCATGGCTCAGGATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGGAG
CCTTCGGGCTTCGACCGGCGTACGGGTGCGTAACG
>G12OEMT03DH3XQ
AGAGTTTGATCATGGCTCAGTGCCAGCCGCCGCGGGAGCGCATTAG
>G12OEMT03DD28C
AGAGTTTGATCCTGGCTCAGGGTGGTCATATGTTTGGAATTGGTGCCAGCCGCCGCGGGAGCGCATT
AG
>G12OEMT03DGQ48
AGAGTTTGATCATGGCTCAGGAGGTGCCAGCAGCCGCGGAGCGCATTAG
>G12OEMT03C0MSF
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCTGAA
GCTTGGCGCTTGCACCGAGCGGATG
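A quick way to get a feel for a FASTA file is to count its header lines and sum the sequence lengths. The sketch below uses two of the records from the slide, written to a temporary file; the variable names are arbitrary.

```shell
# Two records from the slide, written to a temporary file
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>G12OEMT03DH3XQ
AGAGTTTGATCATGGCTCAGTGCCAGCCGCCGCGGGAGCGCATTAG
>G12OEMT03DD28C
AGAGTTTGATCCTGGCTCAGGGTGGTCATATGTTTGGAATTGGTGCCAGCCGCCGCGGGAGCGCATT
AG
EOF

# Number of sequences = number of header lines (lines starting with '>')
n_seqs=$(grep -c '^>' "$fasta")
echo "sequences: $n_seqs"

# Length of each sequence; wrapped sequence lines are summed per record
lengths=$(awk '/^>/ { if (id != "") print id, len; id = substr($0, 2); len = 0; next }
               { len += length($0) }
               END { if (id != "") print id, len }' "$fasta")
echo "$lengths"
```

The same two commands work unchanged on a real file such as dataset1.fasta.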
Data Formats: SFF - Standard Flowgram Format

>G12OEMT03CWZL8
Run Prefix: R_2011_05_03_06_02_36_
Region #: 3
XY Location: 1078_3006
Run Name: R_2011_05_03_06_02_36_FLX _Administrator_RUN19
Analysis Name: D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons
Full Path: /data/R_2011_05_03_06_02_36_FLX _Administrator_RUN19/D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons/
Read Header Len: 32
Name Length: 14
# of Bases: 518
Clip Qual Left: 16
Clip Qual Right: 397
Clip Adap Left: 0
Clip Adap Right: 0
Flowgram:
Flow Indexes:
Bases:
gactacgagtagactCCATTTGATTCGAATGTCTGTTGGCGTAGGATTTCGGAGAGCACGTTTGCGATACGCGTATCTGCTGCTCCGCGGAAAGAATTTAAAAACCGGTGAAATTACGCAGGATGTGCGTGAAGAGAATCTGAGAAT
TTTCAAAGAATCTTTAGACATGGTAACCAATCTCAATAACTGGCATGCCTTCATGAATCTTTTTGCTTCTGCAGGCTATTTGAAAGGCAGCCTGGTGGCATCATCCAATGCGGTAGTTTTCAGCTATGTTTTATATCTGATCGGAA
AATATGAGTATAAAGTATCGTCTGTTGAACTTCAGAAATTATTCGTAAATGGTATTTTTATGTCTACGTATTACTGGTATTTTATACGGGTATCTACAGAATCAgaggttagaaaactagtttgctgatttgcgagatgtccatcatgcagatgaattcgtatc
atatctgaattctgttatcggcaaccgtatttaacggatgacttactttgtttattcgtcg
Quality Scores:
Output formats: FASTQ
Each record has four lines: SequenceID, Sequence, Optional (+), Quality.
Output (fragments of the example shown on the slide):
1:N:0:ATTTCT
ATCTGACCGCCGCATTTGATGCAGTAAATTATTTATATGAGCAAGGGCATA
1:N:0:ATTCCT
1:Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
1:N:0:ATTCCT
TTCAGTTTGTGATGTGCGACGATGGTTCGCTCANGCGNCTNNNGTTCTGCG
+
1:N:0:ATTCCT
CTCCACACTAACAATACCGTTCCCCAGGTGGTATCGCCAGNNCAGTAGAGC
1:N:0:ATTCCT
GCCGCCCAGCTGAAAAACATCATCATGCTGATCNNNANTNNNNNAGGCAGA
Output formats: FASTQ
@EAS139:136:FC706VJ:2:2104:15343: :Y:18:ATCACG
1:Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
Header fields, in order:
EAS139: unique instrument name
136: run id
FC706VJ: flowcell id
2: flowcell lane
2104: tile number within the flowcell lane
15343: 'x'-coordinate of the cluster
'y'-coordinate of the cluster
1: the mate member of a pair
Y: Y if the read fails filter (read is bad), N otherwise
18: control bits
ATCACG: index sequence
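The header fields listed above can be pulled apart with plain shell string handling. This is a sketch under one loud assumption: the y-coordinate is not legible in the slide text, so the value 197393 used here is a made-up placeholder that only completes the example.

```shell
# Illustrative header; the y-coordinate 197393 is a made-up placeholder
header='@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'

# Split at the space: cluster identifier first, read description second
cluster=${header%% *}
cluster=${cluster#@}        # drop the leading '@'
description=${header#* }

# Split each part at ':' into named fields
IFS=: read -r instrument run flowcell lane tile x y <<EOF
$cluster
EOF
IFS=: read -r mate filtered control index <<EOF
$description
EOF

echo "instrument=$instrument run=$run flowcell=$flowcell lane=$lane tile=$tile x=$x y=$y"
echo "mate=$mate filtered=$filtered control=$control index=$index"
```

A read with filtered=Y failed the instrument's own filter, so a field like this is a natural first criterion when discarding reads.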
Output formats: FASTQ quality encodings
(Figure: ASCII character ranges used by each quality encoding)
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
Example record:
1:Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
$Q_{\text{phred}} = -10 \log_{10}(e)$, where $e$ is the estimated probability of a base being wrong.
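To make the formula concrete, the sketch below decodes single quality characters under the Phred+33 encoding (Sanger and Illumina 1.8+) and inverts the formula to get the error probability, $e = 10^{-Q/10}$. The helper names `phred33_to_q` and `q_to_error` are hypothetical.

```shell
# Phred score of one quality character under the Phred+33 encoding.
# Helper names are hypothetical, chosen for this sketch.
phred33_to_q() {
    # printf with a leading quote prints the character's ASCII code
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

# Error probability from Q, by inverting Q = -10 log10(e)
q_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

# Decode three characters taken from the example quality string
for c in C F '#'; do
    q=$(phred33_to_q "$c")
    echo "$c  Q=$q  e=$(q_to_error "$q")"
done
```

For a Phred+64 file the same arithmetic applies with 64 subtracted instead of 33, which is why knowing the encoding of a dataset matters before any quality filtering.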
Output formats
Illumina (Solexa): FastQ
SOLiD: FastQ
454: Fasta + Qual, FastQ, SFF (Standard Flowgram Format)
Now we can go to our VirtualBox machine:
Quality assessment and sequence filtering
Project definition and folder structuring
Open the Virtual Machine
Double-click the VirtualBox icon.
If not already imported: follow me.
Turn on your virtual machine embo2013.
Some basic Linux commands
Linux is case-sensitive: upper case and lower case are different!
Some basic Linux commands

# Take a look at the sequences
cd data/Sequences
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Go back one folder
cd ..
# Create the project folder
mkdir project
# Change directory to "project"
cd project
# Create the original_data directory
mkdir original_data
# Create the filtered data directory
mkdir passed
# Link data from the Sequences folder in /home/embo/Sequences
ln -s /home/embo/Sequences/* original_data/
# Go to the original_data folder
cd original_data
# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
Quality assessment

# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Convert FASTA + QUAL to FASTQ
prinseq-lite.pl -fasta dataset1.fasta -qual dataset1.fasta.qual -out_format 3 -out_good dataset1
# Generate the graph data file for the reports
prinseq-lite.pl -fastq dataset1.fastq -graph_data dataset1.gd -graph_stats ld,gc,qd,de
ls -ltr
# Obtain the reports
prinseq-graphs-noPCA.pl -i dataset1.gd -o dataset1 -html_all
ls -ltr
firefox dataset1.html &
# Go to the filtered data directory
cd ../passed
# Trim low-quality ends (3' side) and write the reads that pass
prinseq-lite.pl -fastq ../original_data/dataset1.fastq -trim_qual_type mean -trim_qual_step 1 -trim_qual_window 20 -trim_qual_right 30 -out_good passed -out_format 3
# Generate the graph data file for the reports
prinseq-lite.pl -fastq passed.fastq -graph_data passed.gd -graph_stats ld,gc,qd,de,da,sc
# Obtain the reports
prinseq-graphs-noPCA.pl -i passed.gd -o passed -html_all
firefox passed.html &
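Independently of the PRINSEQ reports, a rough sanity check of the trimming step is to compare the read count and the mean quality before and after. The sketch below is not part of PRINSEQ: `fastq_summary` is a hypothetical helper that assumes four-line FASTQ records and Phred+33 qualities, and the two-read demo file is invented.

```shell
# fastq_summary is a hypothetical helper: it reports the number of reads
# and the mean Phred+33 quality, assuming four lines per FASTQ record.
fastq_summary() {
    awk '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 0 {
            # every fourth line is a quality string
            reads++
            n = length($0)
            for (i = 1; i <= n; i++) {
                total += ord[substr($0, i, 1)] - 33
                bases++
            }
        }
        END {
            if (bases == 0) { print "reads=0 mean_q=NA"; exit }
            printf "reads=%d mean_q=%.1f\n", reads, total / bases
        }
    ' "$1"
}

# Invented two-read demo file: "I" encodes Q40 and "5" encodes Q20
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGT
+
5555
EOF

fastq_summary "$fq"
```

Running `fastq_summary ../original_data/dataset1.fastq` and `fastq_summary passed.fastq` from the passed folder would then show whether the mean quality went up after trimming, as one would expect.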
For INTREPID and BRAVE people
Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but usage on Windows has grown rapidly.
ActivePerl is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows.
perl (small 'p') is the program used to interpret the Perl language.
For INTREPID and BRAVE people II
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Thank you again for your attention