1
Introduction to the processing of raw data
Giuseppe D'Auria, FISABIO, Valencia
Norwich, September 2014
2
Data Storage: Size ranges
Sanger sequencing: datasets on the order of thousands of sequences
Illumina: datasets on the order of millions of sequences
SOLiD: datasets on the order of xxx million sequences
454: datasets on the order of hundreds of thousands of sequences
3
Data Storage: Backup
Our PC/server: daily backup to an inexpensive PC (a few euros), weekly backup to an inexpensive PC (a few euros)
Tools: Time Machine, rsync, cron, etc.
We spend much more money on sequencing than on securing the data we obtain! Think about backup.
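As a minimal sketch of the daily/weekly backup idea, the function below mirrors a data folder into a backup folder with rsync, falling back to cp when rsync is not installed. The function name `backup_dir`, the script path and the cron schedule shown in the comments are illustrative assumptions, not part of the original slides.

```shell
# Mirror the contents of a source folder into a backup folder.
# backup_dir is a hypothetical helper name; adapt the paths to your own system.
backup_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    if command -v rsync >/dev/null 2>&1; then
        # -a keeps permissions, timestamps and symbolic links;
        # the trailing slash copies the folder contents, not the folder itself
        rsync -a "$src"/ "$dest"/
    else
        # Fallback when rsync is not available
        cp -a "$src"/. "$dest"/
    fi
}

# Example cron entries (edit with `crontab -e`); the script path is an assumption:
#   0 2 * * *   /home/user/bin/backup.sh /data /mnt/backup_daily    # every day at 02:00
#   0 3 * * 0   /home/user/bin/backup.sh /data /mnt/backup_weekly   # every Sunday at 03:00
```

Keeping the daily and weekly copies on different disks means a single disk failure cannot destroy both.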
4
Data Storage: Disk structure
(Figure: a disk cluttered with folders such as data, 20XX, Data, new, data2, Final, Final, arg1, biblio, tmp, tmp, Analysis, new, new2, Analysis, new 3, Final1, backup, backup2, Final2)
5
Data Storage: Disk structure
Project folder:
Original sequence data
Filtered sequences
Analysis: Analysis 1, Analysis 1.1, Analysis 1.1.1, Analysis 1.2
References
TXT
AVOID copying, copying again, "security" copying and copying yet again data that is not useful.
It is better to use symbolic links that just point to the big data files you need:
> ln -s TARGET LINK_NAME
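The symbolic-link advice can be tried safely in a scratch directory. The sketch below creates one small stand-in data file and links it into two analysis folders. All folder and file names here are hypothetical.

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
mkdir -p "$workdir/original_data" "$workdir/analysis1" "$workdir/analysis2"

# A small stand-in for a big sequence file
printf '>read1\nACGT\n' > "$workdir/original_data/dataset1.fasta"

# ln -s TARGET LINK_NAME: each link points to the single original file
ln -s "$workdir/original_data/dataset1.fasta" "$workdir/analysis1/dataset1.fasta"
ln -s "$workdir/original_data/dataset1.fasta" "$workdir/analysis2/dataset1.fasta"

# The links are listed with an arrow to their target
ls -l "$workdir/analysis1" "$workdir/analysis2"

# Reading through a link gives the original content
cat "$workdir/analysis1/dataset1.fasta"
```

Giving TARGET as an absolute path keeps the link valid wherever the link itself is placed. Deleting a link does not delete the original file.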
6
The system: Windows or Linux? Linux or Windows?
Both allow good bioinformatics analysis.
Linux is more stable for massive data-crunching analyses, and it is FREE.
Windows is not FREE.
Most software works on both systems, but several programs work exclusively on Linux.
The best setup for bioinformatics (just my personal advice): a Linux desktop system (Ubuntu, Fedora) + a virtual machine (VirtualBox)
7
Data Formats: FASTA and QUAL
FASTA
>G12OEMT03CWVU1
AGAGTTTGATCATGGCTCAGGATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGAGGAGCCTTCGGGCTTCGACCGGCGTACGGGTGCGTAACG
>G12OEMT03DH3XQ
AGAGTTTGATCATGGCTCAGTGCCAGCCGCCGCGGGAGCGCATTAG
>G12OEMT03DD28C
AGAGTTTGATCCTGGCTCAGGGTGGTCATATGTTTGGAATTGGTGCCAGCCGCCGCGGGAGCGCATTAG
>G12OEMT03DGQ48
AGAGTTTGATCATGGCTCAGGAGGTGCCAGCAGCCGCGGAGCGCATTAG
>G12OEMT03C0MSF
AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCTGAAGCTTGGCGCTTGCACCGAGCGGATG
QUALITY
>G12OEMT03CWVU1
>G12OEMT03DH3XQ
>G12OEMT03DD28C
>G12OEMT03DGQ48
>G12OEMT03C0MSF
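A quick way to get a feel for a FASTA file is to list each read with its length. The following is a minimal awk sketch. The function name `fasta_lengths` is made up for illustration, and it assumes that every header line starts with `>`.

```shell
# Print "read_id <TAB> length" for every record in a FASTA file.
# Sequences split over several lines are summed.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # finish the previous record
            id = substr($0, 2)                # header without the leading ">"
            len = 0
            next
        }
        { len += length($0) }                 # sequence line
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Usage:                fasta_lengths dataset1.fasta
# Number of reads only: grep -c '^>' dataset1.fasta
```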
8
Data Formats: SFF - Standard Flowgram Format
>G12OEMT03CWZL8
Run Prefix: R_2011_05_03_06_02_36_
Region #: 3
XY Location: 1078_3006
Run Name: R_2011_05_03_06_02_36_FLX _Administrator_RUN19
Analysis Name: D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons
Full Path: /data/R_2011_05_03_06_02_36_FLX _Administrator_RUN19/D_2011_05_04_03_47_28_NAVELINA_signalProcessingAmplicons/
Read Header Len: 32
Name Length:
# of Bases:
Clip Qual Left: 16
Clip Qual Right: 397
Clip Adap Left: 0
Clip Adap Right: 0
Flowgram:
Flow Indexes:
Bases: gactacgagtagactCCATTTGATTCGAATGTCTGTTGGCGTAGGATTTCGGAGAGCACGTTTGCGATACGCGTATCTGCTGCTCCGCGGAAAGAATTTAAAAACCGGTGAAATTACGCAGGATGTGCGTGAAGAGAATCTGAGAATTTTCAAAGAATCTTTAGACATGGTAACCAATCTCAATAACTGGCATGCCTTCATGAATCTTTTTGCTTCTGCAGGCTATTTGAAAGGCAGCCTGGTGGCATCATCCAATGCGGTAGTTTTCAGCTATGTTTTATATCTGATCGGAAAATATGAGTATAAAGTATCGTCTGTTGAACTTCAGAAATTATTCGTAAATGGTATTTTTATGTCTACGTATTACTGGTATTTTATACGGGTATCTACAGAATCAgaggttagaaaactagtttgctgatttgcgagatgtccatcatgcagatgaattcgtatcatatctgaattctgttatcggcaaccgtatttaacggatgacttactttgtttattcgtcg
Quality Scores:
9
Output formats: FASTQ
Each record has four lines: SequenceID, Sequence, Optional (+), Quality
@AAII-ZZ123:123:ABCDEFGHT:4:1101:1885:2240 1:N:0:ATTTCT
ATCTGACCGCCGCATTTGATGCAGTAAATTATTTATATGAGCAAGGGCATA
+
@AAII-ZZ123:123:ABCDEFGHT:4:1101:1969:2247 1:N:0:ATTCCT
TAAACGCCCGCAGTTGCGATCCCAGGTGCATGACAGAGGCAATAAACCCGA
@CCFFFFFHHHHHJJJJJIJJJJIJIFHHIIJIJIJIJIIIIIJIJJIEHH
@EAS139:136:FC706VJ:2:2104:15343: :Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
@AAII-ZZ123:123:ABCDEFGHT:4:1101:2226:2183 1:N:0:ATTCCT
TTCAGTTTGTGATGTGCGACGATGGTTCGCTCANGCGNCTNNNGTTCTGCG
CCCFFEFFHHHHHGHGGIIIJIJJJGIJIIJIJ#07B#-7###--;CHIJH
@AAII-ZZ123:123:ABCDEFGHT:4:1101:2094:2194 1:N:0:ATTCCT
CTCCACACTAACAATACCGTTCCCCAGGTGGTATCGCCAGNNCAGTAGAGC
@AAII-ZZ123:123:ABCDEFGHT:4:1101:2544:2173 1:N:0:ATTCCT
GCCGCCCAGCTGAAAAACATCATCATGCTGATCNNNANTNNNNNAGGCAGA
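Because a FASTQ record takes four lines (SequenceID, Sequence, Optional, Quality), a file can be sanity-checked with a few lines of awk. The sketch below is an illustration that assumes sequences are not wrapped over several lines. `check_fastq` is a hypothetical name.

```shell
# Check the four-line FASTQ layout and print the number of reads.
# Returns a non-zero status if a problem is found.
check_fastq() {
    awk '
        NR % 4 == 1 && !/^@/  { print "line " NR ": SequenceID must start with @"; bad = 1 }
        NR % 4 == 2           { seqlen = length($0) }
        NR % 4 == 3 && !/^\+/ { print "line " NR ": third line must start with +"; bad = 1 }
        NR % 4 == 0 && length($0) != seqlen {
            print "line " NR ": quality and sequence lengths differ"; bad = 1
        }
        END {
            if (NR % 4 != 0) { print "file does not end on a complete record"; bad = 1 }
            print int(NR / 4) " reads"
            exit bad + 0
        }
    ' "$1"
}

# Usage: check_fastq dataset1.fastq
```

The check relies on line position rather than on the leading `@`, because a quality line may itself start with `@`.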
10
Output formats: FASTQ SequenceID
@EAS139:136:FC706VJ:2:2104:15343: :Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
Fields of the SequenceID line, in order:
Unique instrument name
Run id
Flowcell id
Flowcell lane
Tile number within the flowcell lane
'x'-coordinate of the cluster
'y'-coordinate of the cluster
The mate member of a pair
Y if the read fails filter (read is bad), N otherwise
Control bits
Index sequence
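The field order of the SequenceID line can be made explicit by splitting it on `:` and on the space. This is only a sketch: `parse_fastq_id` and the output labels are hypothetical names, and the header used in the example call is one of the complete SequenceID lines from the FASTQ example slide.

```shell
# Split an Illumina-style SequenceID line into its named fields.
parse_fastq_id() {
    printf '%s\n' "$1" | awk '{
        sub(/^@/, "", $1)      # drop the leading "@"
        split($1, a, ":")      # part before the space
        split($2, b, ":")      # part after the space
        print "instrument="   a[1]
        print "run_id="       a[2]
        print "flowcell_id="  a[3]
        print "lane="         a[4]
        print "tile="         a[5]
        print "x="            a[6]
        print "y="            a[7]
        print "pair_member="  b[1]
        print "filtered="     b[2]
        print "control_bits=" b[3]
        print "index="        b[4]
    }'
}

parse_fastq_id "@AAII-ZZ123:123:ABCDEFGHT:4:1101:1885:2240 1:N:0:ATTTCT"
```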
11
Output formats: FASTQ Quality
@EAS139:136:FC706VJ:2:2104:15343: :Y:18:ATCACG
GGAGTTTCATTACAATTTATATATTTAAAGAGGNNNANGNNNNNGACTGAA
+
CCCFFFFFHGHHFIJIJJBHHIDHJIFHEFEEG###1#1#####00?DGFH
Qphred = -10 log10(e)
e = estimated probability of a base being wrong
Quality encodings:
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
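The relation Qphred = -10 log10(e) can be checked by hand on single quality characters. In this small sketch the helper names `phred_score` and `phred_error` are hypothetical. The default offset of 33 matches the Sanger and Illumina 1.8+ encodings, and 64 can be passed for Illumina 1.3+/1.5+. Solexa scores use a different formula and are not covered here.

```shell
# Phred score of one quality character (default offset 33; pass 64 as second argument if needed)
phred_score() {
    code=$(printf '%d' "'$1")      # ASCII code of the character
    echo $(( code - ${2:-33} ))
}

# Error probability for a Phred score: e = 10^(-Q/10)
phred_error() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

phred_score C      # "C" has ASCII code 67, so this prints 34
phred_score '#'    # "#" has ASCII code 35, so this prints 2
phred_error 30     # prints 0.001, i.e. 1 wrong base in 1000
```

Read this way, the run of `#` characters in the example quality line marks bases with a very high error probability.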
12
Output formats
454: FASTA + QUAL, FASTQ, SFF (Standard Flowgram Format)
Illumina (Solexa): FASTQ
SOLiD: FASTQ
Project definition and folder structuring
Quality assessment and sequence filtering
Now we can go to our VirtualBox machine...
13
Open the Virtual Machine
Double-click the VirtualBox icon.
If the machine is not already imported: follow me.
Turn on your virtual machine.
embo2013
14
Some basic Linux commands
Upper case and lower case are different!
15
Some basic Linux commands
# Take a look at the sequences
cd data/Sequences
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Go back one folder
cd ..
# Create the project folder
mkdir project
# Change directory to "project"
cd project
# Create the original_data directory
mkdir original_data
# Create the filtered data directory
mkdir passed
# Link data from the Sequences folder in /home/embo/Sequences
ln -s /home/embo/Sequences/* original_data/
# Go to the original_data folder
cd original_data
# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
16
Quality assessment
# Take a look at the folder
ls -ltr
less dataset1.fasta
less dataset1.fasta.qual
# Convert FASTA + QUAL to FASTQ
prinseq-lite.pl -fasta dataset1.fasta -qual dataset1.fasta.qual -out_format 3 -out_good dataset1
# Generate the graph data file used for the reports
prinseq-lite.pl -fastq dataset1.fastq -graph_data dataset1.gd -graph_stats ld,gc,qd,de
# Obtain the reports
prinseq-graphs-noPCA.pl -i dataset1.gd -o dataset1 -html_all
firefox dataset1.html &
# Go to the filtered data directory
cd ../passed
# Trim low-quality ends and generate the graph data file for the reports
prinseq-lite.pl -fastq ../original_data/dataset1.fastq -trim_qual_type mean -trim_qual_step 1 -trim_qual_window 20 -trim_qual_right 30 -out_good passed -out_format 3
prinseq-lite.pl -fastq passed.fastq -graph_data passed.gd -graph_stats ld,gc,qd,de,da,sc
prinseq-graphs-noPCA.pl -i passed.gd -o passed -html_all
firefox passed.html &
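To see what a FASTA + QUAL to FASTQ conversion does in principle, the awk sketch below pairs each sequence with its quality values and encodes them as Phred+33 characters. It is not how prinseq-lite is implemented. It assumes one-line sequences, space-separated quality values, and records in the same order in both files. `fasta_qual_to_fastq` is a hypothetical name.

```shell
# Merge a FASTA file ($1) and its QUAL file ($2) into FASTQ on standard output.
fasta_qual_to_fastq() {
    awk -v qualfile="$2" '
        /^>/ {
            id = substr($0, 2)
            getline qheader < qualfile        # skip the matching QUAL header
            next
        }
        {
            getline qline < qualfile          # space-separated quality values
            n = split(qline, q, " ")
            qual = ""
            for (i = 1; i <= n; i++) qual = qual sprintf("%c", q[i] + 33)
            print "@" id
            print $0
            print "+"
            print qual
        }
    ' "$1"
}

# Usage: fasta_qual_to_fastq dataset1.fasta dataset1.fasta.qual > converted.fastq
```

For real data the prinseq-lite.pl command remains the better choice, since it also handles wrapped sequences and validates the input.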
17
Quality assessment
18
For INTREPID and BRAVE people
Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but usage on Windows has grown rapidly.
ActivePerl is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows.
perl (small 'p') is the program used to interpret the Perl language.
20
For INTREPID and BRAVE people II
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
21
Thank you again for your attention..........