Using command line tools to process sequencing data
Tapio Vuorenmaa, Krista Kokki
This hands-on session:
  Part 1: Bedtools
  Part 2: DEMO: Galaxy with the command line
  Part 3: Bedtools exercises
Part 1: Bedtools
  A command line tool that makes a wide range of genomics tasks easy
  Easy to use, which makes it a good practice example
  The focus is on using the command line and its parameters, not on the tool itself
  Example command: "intersect"
Bedtools - intersect
  "Do the features in my two sets overlap with each other?"
  Files: "genes.bed" and "markers.bed" (in the exercises)
  Both files are in .bed format
.bed format?
  A flexible way to define the data lines displayed in an annotation track
  One of the file types the Genome Browser uses
  Three required fields: chromosome, start position on the chromosome, end position on the chromosome
  Additional optional fields: name, score, strand, etc.
  Our "data" is simplified
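To make the three required fields concrete, here is a minimal sketch that writes a toy, tab-separated .bed file and checks each line with awk. The file name, coordinates, and feature names are made up for illustration, and the check is only a rough sanity test (it allows start equal to end), not a full format validator.

```shell
set -eu

workdir=$(mktemp -d)
bed="$workdir/example.bed"   # hypothetical file name

# Toy BED lines: chromosome, start, end, plus an optional name field.
printf 'chr1\t100\t200\tgeneA\n' >  "$bed"
printf 'chr1\t300\t450\tgeneB\n' >> "$bed"
printf 'chr2\t50\t120\tgeneC\n'  >> "$bed"

# Count lines with fewer than three fields, non-numeric coordinates,
# or a start position larger than the end position.
bad_lines=$(awk -F'\t' '
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($2 + 0) > ($3 + 0) { bad++ }
    END { print bad + 0 }
' "$bed")

echo "malformed lines: $bad_lines"
```

A count of 0 means every line has at least the three required fields in a usable form.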
Our "data" (figure panels: Chromosome 1, Chromosome 2)
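Back to intersect
  Question: Do my ChIP-seq peaks overlap?
  The basic command:
    bedtools intersect -a genes.bed -b markers.bed > result.bed
  Parts of the command:
    bedtools: the program
    intersect: the command
    -a genes.bed: the first file
    -b markers.bed: the second file
    > result.bed: redirect the output to a file (optional)
  Optional parameters can be added to the command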
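What "overlap" means can be sketched without bedtools. The example below builds two toy files (their contents are invented, only the file names come from the exercises) and uses awk to list the feature pairs that share at least one base. It assumes BED-style coordinates, where the start is 0-based and the end is not included. It is a naive illustration of the idea, not how bedtools works internally. If bedtools is installed, the basic command from the slide is run as well.

```shell
set -eu

workdir=$(mktemp -d)

# Toy input files (made-up coordinates).
printf 'chr1\t100\t200\tgeneA\nchr1\t300\t450\tgeneB\nchr2\t50\t120\tgeneC\n' > "$workdir/genes.bed"
printf 'chr1\t150\t160\tm1\nchr1\t190\t310\tm2\nchr2\t500\t510\tm3\n' > "$workdir/markers.bed"

# Load the markers first, then test every gene against every marker.
# Two features overlap when they are on the same chromosome and
# each one starts before the other one ends.
overlaps=$(awk -F'\t' '
    NR == FNR { chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0; name[NR] = $4; n = NR; next }
    {
        for (i = 1; i <= n; i++)
            if ($1 == chrom[i] && ($2 + 0) < stop[i] && start[i] < ($3 + 0))
                print $4 "\t" name[i]
    }
' "$workdir/markers.bed" "$workdir/genes.bed")

printf '%s\n' "$overlaps"

# The real tool, only if it is available on this machine.
if command -v bedtools >/dev/null 2>&1; then
    bedtools intersect -a "$workdir/genes.bed" -b "$workdir/markers.bed" > "$workdir/result.bed"
fi
```

With this toy data, geneA overlaps m1 and m2, geneB overlaps m2, and geneC overlaps nothing.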
Optional parameters
  -wa  Display the original feature for each overlap.
  -u   Display each original feature only once if it has any overlap.
  -s   Only display overlaps found on the same strand.
  -c   Count the number of overlaps for each feature in the first file.
  -v   Complement: display the features that do not overlap.
  -S   Only display overlaps found on the opposite strand.
  And many more. See your bedtools cheat sheet.
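As a rough illustration of what -c and -v report, the sketch below counts the overlaps per gene with awk on invented toy data. A feature with a count of 0 is the kind of feature -v keeps, and a feature with a count above 0 is the kind -u reports once. This only mimics the idea. The real bedtools output has its own column layout, and the guarded commands at the end show the actual options if bedtools is installed.

```shell
set -eu

workdir=$(mktemp -d)

# Toy input files (made-up coordinates).
printf 'chr1\t100\t200\tgeneA\nchr1\t300\t450\tgeneB\nchr2\t50\t120\tgeneC\n' > "$workdir/genes.bed"
printf 'chr1\t150\t160\tm1\nchr1\t190\t310\tm2\nchr2\t500\t510\tm3\n' > "$workdir/markers.bed"

# For each gene, count how many markers overlap it.
counts=$(awk -F'\t' '
    NR == FNR { chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0; n = NR; next }
    {
        hits = 0
        for (i = 1; i <= n; i++)
            if ($1 == chrom[i] && ($2 + 0) < stop[i] && start[i] < ($3 + 0))
                hits++
        print $4 "\t" hits
    }
' "$workdir/markers.bed" "$workdir/genes.bed")

printf '%s\n' "$counts"

# Genes without any overlapping marker.
no_overlap=$(printf '%s\n' "$counts" | awk -F'\t' '$2 == 0 { print $1 }')
echo "no overlap: $no_overlap"

# The real options, only if bedtools is available on this machine.
if command -v bedtools >/dev/null 2>&1; then
    bedtools intersect -a "$workdir/genes.bed" -b "$workdir/markers.bed" -c
    bedtools intersect -a "$workdir/genes.bed" -b "$workdir/markers.bed" -v
fi
```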
That’s not all there is to it
  There are a number of other things you can do with bedtools, such as:
    coverage
    merge
    cluster
Part 2: DEMO: Galaxy with the command line
  Step 1. Converting the .sra file into fastq
    fastq-dump NAME.sra --offset 33
    --offset 33: quality conversion (to get the quality score from the ASCII code)
    Output: .fastq file
  Step 2. Quality report of your data
    fastqc NAME.fastq
    Output: .html file
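The quality conversion behind the offset of 33 can be tried directly in the shell. This small sketch takes one quality character (the character I is just an example), looks up its ASCII code, and subtracts 33 to get the quality score.

```shell
set -eu

qual_char='I'                            # one character from a FASTQ quality line

# A leading quote in the argument makes printf print the character code.
ascii=$(printf '%d' "'$qual_char")
phred=$((ascii - 33))                    # subtract the offset

echo "$qual_char -> ASCII $ascii -> Phred $phred"
# prints: I -> ASCII 73 -> Phred 40
```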
DEMO: Galaxy with the command line
  Step 3. Trimming (optional)
    fastx_trimmer -i NAME.fastq -o TRIMMED_NAME.fastq -f 1 -l 50
    -i: input file
    -o: output file
    -f: first base to keep
    -l: last base to keep
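To see what "first base to keep" and "last base to keep" do to a read, the sketch below applies the same idea with awk to a one-read toy file. It assumes simple four-line FASTQ records and made-up read content. It is an illustration of the trimming idea, not a replacement for fastx_trimmer.

```shell
set -eu

workdir=$(mktemp -d)

# One toy read: identifier, sequence, separator, quality line.
printf '@read1\nACGTACGTAC\n+\nIIIIIHHHHH\n' > "$workdir/toy.fastq"

first=1   # plays the role of -f
last=5    # plays the role of -l

# Lines 2 and 4 of every record hold the sequence and the qualities;
# both are cut to the same range so they stay the same length.
awk -v first="$first" -v last="$last" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, first, last - first + 1); next }
    { print }
' "$workdir/toy.fastq" > "$workdir/toy_trimmed.fastq"

trimmed_seq=$(sed -n '2p' "$workdir/toy_trimmed.fastq")
trimmed_qual=$(sed -n '4p' "$workdir/toy_trimmed.fastq")
echo "$trimmed_seq $trimmed_qual"
# prints: ACGTA IIIII
```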
DEMO: Galaxy with the command line
  Step 4. Quality filtering
    fastq_quality_filter -i TRIMMED_NAME.fastq -o QFILT_NAME.fastq -q 10 -p 100
    -i: input file
    -o: output file
    -q: minimum quality score to keep
    -p: minimum % of bases that must have at least the -q quality
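How -q and -p work together can be sketched with awk on two invented reads. A read is kept only if the required percentage of its bases reaches the minimum quality. The sketch assumes four-line FASTQ records and an ASCII offset of 33, and it only illustrates the rule, not the tool's actual implementation.

```shell
set -eu

workdir=$(mktemp -d)

# Two toy reads: "@bad" contains one low-quality base ("#" is quality 2).
printf '@good\nACGT\n+\nIIII\n@bad\nACGT\n+\nII#I\n' > "$workdir/toy.fastq"

min_quality=10    # plays the role of -q
min_percent=100   # plays the role of -p

awk -v q="$min_quality" -v p="$min_percent" '
    # Lookup table from printable character to ASCII code.
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Remember the four lines of the current record.
    { rec[NR % 4] = $0 }
    # On the quality line, count the bases that reach the minimum quality.
    NR % 4 == 0 {
        len = length($0); ok = 0
        for (j = 1; j <= len; j++)
            if (ord[substr($0, j, 1)] - 33 >= q) ok++
        if (len > 0 && ok * 100 >= p * len)
            print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
    }
' "$workdir/toy.fastq" > "$workdir/toy_filtered.fastq"

kept_ids=$(awk 'NR % 4 == 1' "$workdir/toy_filtered.fastq")
echo "kept: $kept_ids"
# prints: kept: @good
```

With -p 100, a single base below the threshold is enough to drop the whole read.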
DEMO: Galaxy with the command line
  Step 5. Removing quality information (identical reads are collapsed into a single FASTA entry)
    fastx_collapser -i QFILT_NAME.fastq -o COLPS_NAME.fasta
    -i: input file
    -o: output file
DEMO: Galaxy with the command line
  Step 6. Mapping
    bowtie //hg19 -f COLPS_NAME.fasta --best -v 2 -m 3 -k 1 -S output.sam
    //hg19: genome
    -f: query input files are in FASTA format (f = fasta)
    --best: results in best-to-worst order
    -v 2: no more than 2 mismatches
    -m 3: do not report alignments for reads that have more than 3 reportable alignments
    -k 1: report up to 1 valid alignment
    -S: output in SAM format
    Output: .sam
DEMO: Galaxy with the command line
  Finally: Samtools
  The Genome Browser doesn’t use the .sam format (the output from mapping), so .sam must be converted to .bam:
    samtools view -bS NAME.sam > NAME.bam
    -bS: SAM to BAM (when the .sam file has a header)
    Output: .bam
  Now you can visualize the result in the Genome Browser.
So, why the command line?
  Remote access: You can easily access and operate other computers, such as the computing servers, using the command line.
  Speed: Programs are (usually) controlled by parameters. Once you learn to use these commands and parameters, they are very quick to use.
  Control: Pipelining and redirection let users perform powerful tasks with a single command line.
  Automation: With scripting, users can create sequences of program tasks that execute automatically without further user interaction.
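The automation point can be sketched as a small script that repeats the demo steps for several samples. The sample names and the run helper are hypothetical. By default the script only prints the commands (a dry run), so it can be tried without the tools installed. Mapping and the .bam conversion are left out here because they depend on the local genome index and on output redirection.

```shell
set -eu

# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print or run the command given as arguments.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1 to 5 of the demo for one sample name.
process_sample() {
    name=$1
    run fastq-dump "$name.sra" --offset 33
    run fastqc "$name.fastq"
    run fastx_trimmer -i "$name.fastq" -o "TRIMMED_$name.fastq" -f 1 -l 50
    run fastq_quality_filter -i "TRIMMED_$name.fastq" -o "QFILT_$name.fastq" -q 10 -p 100
    run fastx_collapser -i "QFILT_$name.fastq" -o "COLPS_$name.fasta"
}

# Made-up sample names; replace them with your own.
for sample in sampleA sampleB; do
    process_sample "$sample"
done
```

Saved to a file, this could be started once on a computing server and left to work through every sample in the list.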
Don’t be scared of the command line! www.uef.fi