Workshop on Microbiome and Health
Hands-on: Metagenomics data types, statistics and quality control
Esteban Pérez Wohlfeil & Oswaldo Trelles
{estebanpw, ortrelles}@uma.es
Computer Architecture Department, University of Malaga, Spain
Faculte des Sciences; Universite Sidi Mohamed Ben Abdellah
2017
Global Agenda
Contents and time distribution (1h 00m)
- Getting to know our Virtual Machine
- Interacting with and exploring the metagenomic samples
- Quality control step using QTrim
- Running a sequence comparison with a reference database using BLAST
Getting to know the Virtual Machine
The provided VM runs Ubuntu 16.04 and has several software packages preinstalled to facilitate the hands-on session:
- BLAST suite: BLASTn, BLASTx, BLASTp, …
- MEGAN
- QTrim
- Trimmomatic
- METAGECKO
- EMBOSS toolkit
- RStudio with R 3.3.3
- Several scripts: FastQ to Fasta converter, spreadsheets, plotting tools, etc.
Log in as the “metagenomics-pipeline” user with the password: student
The examples that we will use during the hands-on session are located in /home/student/Documents/Example
These examples include:
- Folder 454: lean_TS1.fastq and obese_TS19.fastq
- Folder calc: spreadsheet to calculate differential abundance
- Folder database: a reference database containing several genomes commonly found in the human gastrointestinal system
- Folder results: empty folder to store processed files
Exploratory analysis
A metagenome can be seen as a long signal (often incomplete and noisy) that requires processing before anything significant can be detected. Although it is not mandatory, we strongly recommend always taking a look at the samples before starting a processing pipeline.
In your Virtual Machine, start by opening a terminal: click on the black command prompt icon on the left side of the screen.
We can execute commands in the terminal just as if we were double-clicking on programs. Let's first open one of our metagenomes by running the less command with the lean_TS1.fastq file as its argument. This will open the lean_TS1.fastq metagenome. Does it look OK? To navigate through the metagenome, use the arrow keys. To exit, just press q.
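To make the question "does it look OK?" a bit more concrete, here is a minimal Python sketch that checks the record layout of a FastQ file. It assumes the common four-line layout (header starting with @, sequence, separator starting with +, quality string of the same length as the sequence). The function name read_fastq and the two example reads are made up for illustration; they are not part of the VM.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FastQ lines.

    Assumption: every record spans exactly four lines (no wrapped sequences).
    """
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = list(islice(stripped, 4))
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FastQ record at end of input")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Made-up example with two short reads (not taken from the workshop samples).
example = """@read1
ACGTACGT
+
IIIIHHHH
@read2
GGCC
+
FFFF
""".splitlines()

for name, seq, qual in read_fastq(example):
    print(name, len(seq))
```

To try it on a real sample, an open file handle can be passed instead of the example list, since a file handle is also an iterable of lines.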
(Skip this if you already know it.) The terminal is a powerful tool to manage files. A few commands always come in handy:

Command                                  Description
cp <file to copy> <destination>          Copies a file to another place
mv <file to move> <location to move>     Moves a file to another location
rm <file to delete>                      Deletes a file
less <file to read>                      Reads a text file in the terminal
ls                                       Displays the contents of the folder
pwd                                      Shows the current working directory
cd <folder to enter>                     Enters a folder. Use cd ../ to go back one level
Now we will check the distribution of read lengths to make sure that there are no outliers. First convert the sample from FastQ to Fasta, and then run the script exploratory.sh with the new Fasta file as its argument. This will generate a .png image with a histogram of the read-length distribution.
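The two steps above (conversion and length histogram) can be sketched in Python as follows. This is not the converter script or exploratory.sh shipped with the VM, whose contents are not shown here; it is an illustrative stand-in that assumes four-line FastQ records and prints a text histogram instead of writing a .png. All function names and the demo reads are hypothetical.

```python
from collections import Counter


def fastq_to_fasta(fastq_lines):
    """Yield Fasta lines from four-line FastQ records (assumed layout)."""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # "@name" becomes ">name"
        elif i % 4 == 1:
            yield line  # keep the sequence, drop the "+" and quality lines


def read_lengths(fasta_lines):
    """Return the length of every sequence in an iterable of Fasta lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return lengths


def length_histogram(lengths, bin_width=50):
    """Count reads per length bin; keys are the lower bound of each bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Made-up reads of length 120, 130 and 260 (not the workshop data).
demo_fastq = [
    "@r1", "A" * 120, "+", "I" * 120,
    "@r2", "C" * 130, "+", "I" * 130,
    "@r3", "G" * 260, "+", "I" * 260,
]
bin_width = 50
fasta = list(fastq_to_fasta(demo_fastq))
hist = length_histogram(read_lengths(fasta), bin_width)
for start in sorted(hist):
    print(f"{start:>5}-{start + bin_width - 1:<5} {'#' * hist[start]}")
```

A text histogram keeps the sketch free of plotting dependencies; the same counts could be passed to any plotting tool.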
Are there any outliers? Does the distribution make sense given the kind of sequencer the data comes from? Will it be different after the quality control step?
[Figure: read-length histogram before QC]
We can also check the average length, the number of reads and the maximum length by opening the file that was generated automatically.
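For reference, the three figures mentioned above can be computed from a list of read lengths with a few lines of Python. The sketch below also flags possible outliers with the interquartile-range rule; that rule is our own assumption for illustration, since the slides do not define what counts as an outlier, and the lengths are made up.

```python
import statistics


def summarize(lengths):
    """Return the number of reads, the average length and the maximum length."""
    if not lengths:
        raise ValueError("no reads to summarize")
    return {
        "reads": len(lengths),
        "mean_length": statistics.mean(lengths),
        "max_length": max(lengths),
    }


def flag_outliers(lengths, k=1.5):
    """Return lengths outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule is only one possible definition of an outlier.
    """
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [n for n in lengths if n < low or n > high]


# Made-up read lengths with one suspiciously long read.
demo_lengths = [100, 105, 110, 115, 120, 125, 130, 135, 900]
print(summarize(demo_lengths))
print(flag_outliers(demo_lengths))
```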
Quality Control Step
Now we will perform the quality control step to trim and filter impurities in the samples. These range from adapters to errors introduced during the sequencing process. Let's filter and trim both samples: lean_TS1.fastq and obese_TS19.fastq. To do so, execute the following commands in the terminal:

    python ~/Qtrim/QTrim_v1_1/QTrim_v1_1.py -m 26 -fastq $DATA/454/lean_TS1.fastq -o $DATA/454/lean_TS1.trimmed.fastq

And also:

    python ~/Qtrim/QTrim_v1_1/QTrim_v1_1.py -m 26 -fastq $DATA/454/obese_TS19.fastq -o $DATA/454/obese_TS19.trimmed.fastq
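To give an idea of what quality-based trimming does to a single read, here is a minimal Python sketch. It is not QTrim's algorithm and makes no claim about how QTrim interprets its options; it simply removes bases from the 3' end until the mean quality of the remaining read reaches a threshold. The Phred+33 quality encoding, the default threshold of 26 (borrowed from the value in the commands above purely as an example), the minimum length and all function names are assumptions.

```python
def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def trim_3prime(seq, quality, min_mean=26, min_length=50, offset=33):
    """Trim the 3' end until the mean quality is at least min_mean.

    Returns (sequence, quality) for the kept part, or None when the
    trimmed read is shorter than min_length and should be discarded.
    Illustrative only; this is not the method implemented by QTrim.
    """
    scores = phred_scores(quality, offset)
    end = len(scores)
    total = sum(scores)
    while end > 0 and total / end < min_mean:
        end -= 1
        total -= scores[end]  # drop the last remaining base
    if end < min_length:
        return None
    return seq[:end], quality[:end]


# Made-up read: four high-quality bases followed by four low-quality ones.
print(trim_3prime("ACGTACGT", "IIII####", min_mean=26, min_length=4))
print(trim_3prime("ACGT", "####", min_mean=26, min_length=4))
```

Raising min_mean removes more bases and discards more reads; lowering it keeps more of each read at the cost of lower average quality.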
Remember to convert both trimmed files to Fasta format so that we can run them through our pipeline. This will generate the Fasta files ready to be processed.
Now run the exploratory analysis on the new trimmed lean_TS1.trimmed.fasta file and compare the previous plot with the new one. Are there any outliers? Does the distribution make sense given the kind of sequencer the data comes from? Is it any different after the quality control step?
[Figure: read-length histograms before QC and after QC]
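Besides comparing the two plots by eye, the read lengths before and after QC can be compared bin by bin. The Python sketch below is one possible way to do this; the function names, the bin width and the example lengths are all made up for illustration and do not come from the workshop samples.

```python
def length_counts(lengths, bin_width=100):
    """Count reads per length bin; keys are the lower bound of each bin."""
    counts = {}
    for n in lengths:
        start = (n // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return counts


def compare_runs(before, after, bin_width=100):
    """Return (bin_start, reads_before, reads_after) rows, sorted by bin."""
    b = length_counts(before, bin_width)
    a = length_counts(after, bin_width)
    return [
        (start, b.get(start, 0), a.get(start, 0))
        for start in sorted(set(b) | set(a))
    ]


# Made-up read lengths before and after a trimming step.
before_qc = [90, 250, 260, 480, 900]
after_qc = [250, 255, 470]

print(f"{'bin':>6} {'before':>7} {'after':>6}")
for start, n_before, n_after in compare_runs(before_qc, after_qc):
    print(f"{start:>6} {n_before:>7} {n_after:>6}")
```

Bins that lose all their reads after QC (here the shortest and the longest one) are the first places to look when asking whether the outliers were removed.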
Notes
- Quality control should be parameterized according to the library preparation used for sequencing and the sequencing instrument
- Strong biological knowledge is needed
- Still, a filtering process will usually improve quality
- The -m parameter can be adjusted for more or less filtering