Pyrosequencing for Metagenomics: accessing and organizing raw data
Giuseppe D’Auria
FISABIO, Valencia
Norwich, September 2014
Practice workflow

We will start from a single sff (standard flowgram format) file containing a metagenome experiment and a metatranscriptome experiment, labelled by two MIDs (Multiplex Identifiers).

1) Organize data and folders
2) Extract fasta and quality files belonging to each dataset
3) Recruitment protocol by MUMmer
4) Assembly protocol via MIRA
5) Searching for rRNAs
6) Cluster 16S rRNA
7) Annotate 16S rRNA
8) Search for tRNA
Practice workflow, current step: Organize data and folders; extract fasta and quality files belonging to each dataset
Extracting MIDs

Identify MIDs and separate the Fasta and Fasta Quality files.

SFF → sff_extract → FASTA file + Fasta Qual → bin_fasta_on_mid_primers.pl → one FASTA and one QUALITY file per MID (FASTA-Mid1 / QUALITY-Mid1, FASTA-Mid2 / QUALITY-Mid2, ..., FASTA-MidX / QUALITY-MidX)

Exercise 2
1) Use sff_extract to extract sequences from the sff file. The -c parameter removes adaptor sequences so that the MIDs can be identified.
2) Use bin_fasta_on_mid_primers.pl to separate the MIDs.
Open the terminal

# Go to data folder
cd data
# Create project2 folder
mkdir project2
# Go to project2 folder
cd project2
# Link SFF file
ln -s ~/data/Sequences/dataset2.sff ~/data/project2/dataset2.sff
# Extract FASTA and QUALITY from sff
sff_extract -c -A dataset2.sff
# Sort reads by MIDs
bin_fasta_on_mid_primers.pl -r dataset2.fasta -q dataset2.fasta.qual -m ../Sequences/mids.fas -b out

Resulting files:
out_midi_CCAACC: Metagenome
out_midi_CGCCAT: Metatranscriptome
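The idea behind sorting reads by MID can be pictured with a toy example. This is only a minimal sketch in plain awk, not the logic of bin_fasta_on_mid_primers.pl itself. It assumes that each read sits on a single line and begins with its MID, it reuses the two tags that appear in the output file names (CCAACC and CGCCAT), and the toy file names and read IDs are hypothetical. Removing the MID from the binned read is a choice made for this sketch.

```shell
# Toy input: four single-line reads (hypothetical IDs and sequences).
workdir=$(mktemp -d)
cat > "$workdir/toy.fasta" <<'EOF'
>read1
CCAACCACGTACGTAA
>read2
CGCCATTTGACCGGTA
>read3
CCAACCGGGTTTAAAC
>read4
TTTTTTACGTACGTAC
EOF

# Assumption: a read belongs to a MID when it starts with that MID.
# Reads matching no MID go to an "unassigned" bin.
awk -v out="$workdir" '
  BEGIN { mids["CCAACC"] = 1; mids["CGCCAT"] = 1 }
  /^>/ { header = $0; next }                 # remember the header line
  {
    bin = "unassigned"; seq = $0
    for (m in mids) {
      if (index($0, m) == 1) {               # MID found at position 1
        bin = m
        seq = substr($0, length(m) + 1)      # drop the MID from the read
      }
    }
    printf "%s\n%s\n", header, seq > (out "/toy_" bin ".fasta")
  }
' "$workdir/toy.fasta"

# Number of reads per bin
grep -c '^>' "$workdir"/toy_*.fasta
```

On real data the quality file has to be split and trimmed in the same way, which is why a dedicated script is used in the exercise.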
Open the terminal

# Create Metagenome folder
mkdir metage
# Create Metatranscriptome folder
mkdir metatra
# Move project files in folders
mv out_midi_CCAACC.fasta* metage/
mv out_midi_CGCCAT.fasta* metatra/
# Go to Metagenome folder
cd metage
# Take a look at the folder
ls -ltr
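After moving the files, a quick sanity check is to confirm that the fasta file and its quality file hold the same number of records. The sketch below shows the idea on two toy files. The helper name count_records and the toy file names are hypothetical; in the exercise the same function could be pointed at out_midi_CCAACC.fasta and out_midi_CCAACC.fasta.qual.

```shell
count_records() {
  # Count FASTA-style header lines (lines starting with ">").
  grep -c '^>' "$1"
}

# Toy files standing in for a fasta file and its quality file.
demo=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\n' > "$demo/toy.fasta"
printf '>r1\n40 40 38 35\n>r2\n30 31 32 33\n' > "$demo/toy.fasta.qual"

n_seq=$(count_records "$demo/toy.fasta")
n_qual=$(count_records "$demo/toy.fasta.qual")

if [ "$n_seq" -eq "$n_qual" ]; then
  echo "OK: $n_seq reads, each with a quality record"
else
  echo "Mismatch: $n_seq sequences vs $n_qual quality records"
fi
```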
Practice workflow, current step: Recruitment protocol by MUMmer
Mapping and recruitment graph

Open the terminal

# Link file to simpler name
ln -s out_midi_CCAACC.fasta metage.fas
# Mapping of reads on reference genome
# Obtaining mapping coordinates
nucmer --prefix=recruit ../../References/reference.fasta metage.fas --coords
# Obtaining mapping image (postscript)
mummerplot recruit.delta -R ../../References/reference.fasta -Q metage.fas --coverage --postscript -p recruit
# Visualizing mapping
evince recruit.ps &
Practice workflow, current step: Assembly protocol via MIRA
# Linking metagenome file for assembly
ln -s out_midi_CCAACC.fasta metage_in.454.fasta
ln -s out_midi_CCAACC.fasta.qual metage_in.454.fasta.qual
ln -s ../dataset2.xml metage_traceinfo_in.454.xml
# Start de novo assembly
mira --project=metage --job=denovo,genome,draft,454 454_SETTINGS -LR:ft=fasta
# Go to results folder
cd metage_assembly
cd metage_d_results
# Take a look at the results
tablet metage_out.ace &

tablet: assembly viewer
Practice workflow, current step: Searching for rRNAs
# Go back to the project folder, then to the Metatranscriptome folder
cd ../../../
cd metatra
# Link needed files
ln -s out_midi_CGCCAT.fasta metatra.fas
# Searching for 16S sequences
rna_hmm3.py -i metatra.fas -m ssu -o metatra_16S -L
# Extract 16S sequences from the 16S table
extract_sequences_by_list.pl -f metatra.fas -t metatra_16S -c 0 -o -d 1

extract_sequences_by_list.pl: one of my Perl scripts
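Pulling sequences out of a fasta file from a table of hits can be illustrated with a short awk sketch. It is not the code of extract_sequences_by_list.pl. It assumes that the read ID is the first whitespace-separated column of the table and the first word of the fasta header, and every file name and read ID below is hypothetical.

```shell
tmp=$(mktemp -d)

# Toy fasta file with three reads.
cat > "$tmp/toy.fas" <<'EOF'
>readA some description
ACGTACGT
>readB
GGGGCCCC
>readC
TTTTAAAA
EOF

# Toy hit table: read ID in the first column.
printf 'readA\tssu\nreadC\tssu\n' > "$tmp/hits.tab"

awk '
  NR == FNR { wanted[$1] = 1; next }     # first file: remember the listed IDs
  /^>/ { id = substr($1, 2)              # header: strip ">" from the first word
         keep = (id in wanted) }
  keep                                   # print header and sequence lines of kept reads
' "$tmp/hits.tab" "$tmp/toy.fas" > "$tmp/hits.fasta"

cat "$tmp/hits.fasta"
```

Reading the table first and the fasta file second keeps the whole operation to a single pass over the sequences, which matters when the fasta file is large.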
Practice workflow, current step: Cluster 16S rRNA
Clustering

# Filtering out chimeras
#ChimeraSlayer.pl --query_FASTA 16S.list.fasta
# Clustering 16S sequences
cdhit -i 16S.list.fasta -o 16Sc90s90 -c 0.9 -s 0.9 -bak 1
cd-hit_translate.pl 16Sc90s90.bak.clstr > 16S.tab

cd-hit_translate.pl: another of my Perl scripts
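Turning a cluster report into a table can be sketched as follows. This is not cd-hit_translate.pl, and it reads the standard CD-HIT .clstr layout (a ">Cluster N" line followed by one line per member, with the representative marked by "*") rather than the .bak.clstr file used in the exercise. The toy content and file names are hypothetical.

```shell
cl=$(mktemp -d)

# Toy cluster report in the standard .clstr layout.
printf '>Cluster 0\n0\t1500nt, >seq1... *\n1\t1490nt, >seq2... at +/98.50%%\n>Cluster 1\n0\t1400nt, >seq3... *\n' > "$cl/toy.clstr"

# Output columns: cluster number, representative sequence, number of members.
awk '
  function emit() { if (cluster != "") printf "%s\t%s\t%d\n", cluster, rep, size }
  /^>Cluster/ { emit(); cluster = $2; rep = ""; size = 0; next }
  {
    size++
    name = $3
    sub(/^>/, "", name)          # drop the leading ">"
    sub(/\.\.\.$/, "", name)     # drop the trailing "..."
    if ($NF == "*") rep = name   # "*" marks the cluster representative
  }
  END { emit() }
' "$cl/toy.clstr" > "$cl/clusters.tab"

cat "$cl/clusters.tab"
```

A table of this kind makes it easy to sort clusters by size and to pick one representative 16S sequence per cluster for annotation.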
Practice workflow, current step: Annotate 16S rRNA
# 16S assignment by RDP Classifier
java -jar ~/Software/rdp_classifier_2.2/rdp_classifier-2.2.jar -q 16S.remain.fasta -o 16S_rdp -f fixrank
Practice workflow, current step: Search for tRNA
# Searching for tRNAs
tRNAscan-SE -B 16S.remain.fasta > tRNAs.tab
# Extract tRNA sequences from the tRNAs table
extract_sequences_by_list.pl -f 16S.remain.fasta -t tRNAs.tab -c 0 -o tRNAs -d 1

extract_sequences_by_list.pl: another of my Perl scripts
Running out of physical limits
For INTREPID and BRAVE people
Perl is a scripting language widely used for system administration and programming on the World Wide Web. It originated in the UNIX community and has a strong UNIX slant, but usage on Windows has grown rapidly. ActivePerl is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows. perl (small 'p') is the program used to interpret the Perl language.
For INTREPID and BRAVE people II
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Thank you again for your attention