MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC.

MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC

Underwstanding the BBC cluster What is the cluster? Many individual computers controlled by a "head node" The head node is what you log onto by default using SSH It is bad etiquette to run things off the head node Can slow down the entire system

Using the cluster - method 1 When you SSH in, use the " qlogin " command to take you to a subordinate note Running programs here will not disrupt the head node You need to stay connected to the network until all of your programs are completed

Checking a qlogin job Use the terminal command " top " Shows all processes running on your node kill a process by pressing " k " and then entering its PID when prompted

Using the cluster - method 2 Use the command " qsub " combined with a shell script e.g., qsub script.sh shell is a programming language commonly used for controlling actual processes The BBC has example scripts for you to modify: http://bioinformatics.uconn.edu/understanding- the-bbc-cluster-and-sge/ http://bioinformatics.uconn.edu/understanding- the-bbc-cluster-and-sge/ This method allows you to walk away once your script is running

qsub bash script #!/bin/bash ############################################################# ##### TEMPLATE SGE SCRIPT - BLAST EXAMPLE ################### ##### /common/sge_templates/template_single.sh ############## ############################################################# # Specify the name of the data file to be used INPUTFILENAME="test.fasta" # Name the directory (assumed to be a direct subdir of $HOME) from which the file is listed PROJECT_SUBDIR="test" # Specify name to be used to identify this run #$ -N blastp_job # Email address (change to yours) #$ -M bioinformatics@uconn.edu # Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=abortion #$ -m besa

qsub bash script # Specify that bash shell should be used to process this script #$ -S /bin/bash # Specify working directory (created on compute node used to do the work) WORKING_DIR="/scratch/$USER/$PROJECT_SUBDIR-$JOB_ID" # If working directory does not exist, create it # The -p means "create parent directories as needed" if [ ! -d "$WORKING_DIR" ]; then mkdir -p $WORKING_DIR fi # Specify destination directory (this will be subdirectory of your home directory) DESTINATION_DIR="$HOME/$PROJECT_SUBDIR/$JOB_ID-$INPUTFILENAME" # If destination directory does not exist, create it # The -p in mkdir means "create parent directories as needed" if [ ! -d "$DESTINATION_DIR" ]; then mkdir -p $DESTINATION_DIR fi

qsub bash script # navigate to the working directory cd $WORKING_DIR # Get script and input data from home directory and copy to the working directory cp $HOME/$PROJECT_SUBDIR/$INPUTFILENAME./test.fasta cp $HOME/template_single.sh. # Specify the output file #$ -o $JOB_ID.out # Specify the error file #$ -e $JOB_ID.err # Run the program blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 - num_descriptions 5 -out my-results # copy output files back to your home directory cp * $DESTINATION_DIR # clear scratch directory rm -rf $WORKING_DIR

Checking a qsub job Use " qstat " to understand the status of your job Shows jobs waiting to be executed Monitor a running job's status using qstat -j Retrieve information about a completed job using qaact -j

Cluster etiquette Never run something on the head node! Always check that your processes will run correctly before starting a large task! Best strategy: run commands individually using a reduced input dataset If using a loop to execute multiple commands, only go through a single iteration (e.g., use die )

The first part of Assignment #4 Write a perl script that subsamples the first ~10000 reads of your input fasta file(s) Allows you to do quick troubleshooting Can be modified later to examine the effect of sampling depth

SRA "Sequence Read Archive" http://www.ncbi.nlm.nih.gov/sra The part of NCBI that holds raw sequencing data Generally, this is where you need to put your raw data when you publish genomic research

A SRA record

SRA run browser

For kicks… Go to http://www.ncbi.nlm.nih.gov/srahttp://www.ncbi.nlm.nih.gov/sra Search "Escherichia coli MG1655" Note different results Different sequencing platforms Note mutant strains! Try "Escherichia coli MG1655 Pacbio"

Downloading SRA data Possible to do from web browser, but transferring large files is cumbersome Better: use NCBI's SRA Toolkit on the BBC server to perform file conversion while downloading /opt/bioinformatics/sratoolkit.2.3.5-2- centos_linux64/bin/fastq-dump --split- files SRR826450 Decompresses files, splits paired ends into separate files

The fastq file format 4 lines per sequence: Line 1: begins with "@", followed by sequence ID Line 2: raw sequence data Line 3: begins with "+", may have sequence ID Line 4: Phred quality score for each position, in ASCII @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 ASCII from low (left) to high (right): !"#$%&'()*+,-./0123456789:; ?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ http://en.wikipedia.org/wiki/FASTQ_format

Phred scores Developed by old program "Phred" during human genome project, adopted as standard throughout field Phred score = -10log(P(base call error)) e.g., Phred score of 10 = 90% base call accuracy Phred score of 20 = 99% base call accuracy Phred score of 30 = 99.9% base call accuracy Phred score of 40 = 99.99% base call accuracy etc.

FastQC - QC for raw reads FastQC is the most common software to understand the quality of raw sequencing reads http://www.bioinformatics.babraham.ac.uk/proj ects/fastqc/http://www.bioinformatics.babraham.ac.uk/proj ects/fastqc/ Runs using a java applet Using the server, we have to run via command line

FastQC screenshot Starts into specific figures Summary stats What it thinks of your run quality - NOT HARD AND FAST RULES!!

FastQC - Per base quality Blue line: mean Red line: median Boxes: 25- 75% range Whiskers: 10-90% range Phred score

FastQC - Per read quality Highlights systematic problems e.g., a region of flowcell is problematic

FastQC - Per base sequence content Unbiased sequences should have the same content across all bases Will show biases if some sequence is hugely overrepresent ed e.g., adapter contamination e.g., biased fragmentation

FastQC - Per Sequence GC content Unbiased sequencing should have a normally distributed %GC content Deviations may indicated contamination e.g., adapter e.g., two species with different %GC contents

FastQC - Per base N count Ns indicate that the base caller could not determine a base at that position Global N abundance generally correlates with sequence quality

FastQC - Sequence length Some methods yield non- uniform read lengths e.g., Pacbio (shown) Illumina will only show one uniform value here

FastQC - Duplicate sequences An unbiased library should have few duplicates A few duplicates may indicate saturated template sequencing High duplication may indicate adapter contamination or enrichment bias

FastQC - Kmer content Tests for kmers enriched as a certain read position Graphs 6 worst, tabulates the rest Can indicate sequencing/libr ary bias Can indicate contamination by one sequence, e.g., primers, adapters

FastQC - Overrepresented sequences May indicate how read diversity is limited, e.g., adapter/primer contamination May be biological, e.g., repeats

FastQC - Adapter content Specifically looks for known adapter/primer contamination Indicates reads are longer than insert size

FastQC - Per tile sequence quality Shows flowcell tiles that are particularly error-prone Illumina data only, and only if positional metadata is included with reads

Running FastQC on the server Very simple: $ fastqc Produces a.html file as output Transfer the html to your computer and open it using your favorite web browser

Getting rid of adapters using Trimmomatic Trimmomatic is a standard method to remove adapter contamination http://www.usadellab.org/cms/?page=trimmom atichttp://www.usadellab.org/cms/?page=trimmom atic Bolger et al. 2014 Bioinformatics btu170

Running Trimmomatic Admittedly, a complex syntax java -jar PE [-threads ]... java -jar /opt/bioinformatics/Trimmomatic- 0.32/trimmomatic-0.32.jar PE SRR826450_1.fastq SRR826450_2.fastq output_forward_paired.fq output_forward_unpaired.fq output_reverse_paired.fq output_reverse_unpaired.fq ILLUMINACLIP:/opt/bioinformatics/Trimmomatic- 0.32/adapters/TruSeq3-PE.fa:2:30:10

Assignment #4 Download these two E. coli K-12 MG1655 genome sequencing reads from NCBI SRA: SRR826450, SRR956947 What are the differences? Write script to subsample fastq files Analyze your input data using fastqc If appropriate, adapter trim using Trimmomatic

MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC.

Similar presentations

Presentation on theme: "MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC.

Similar presentations

Presentation on theme: "MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC."— Presentation transcript:

Similar presentations

About project

Feedback