Published by Antony Tate. Modified over 8 years ago.
1
High Throughput Sequence (HTS) data analysis
1. Storage and retrieval of HTS data.
2. Representation of HTS data.
3. Visualization of HTS data.
4. Discovering genomic patterns from HTS data.
2
Practice: log into UFHPC / Linux server.
Mac users, type in a terminal:
$ ssh username@gator.hpc.ufl.edu
If you do not have an HPC account:
$ ssh gms6014@159.178.28.30
Windows users, open in PuTTY: gator.hpc.ufl.edu or 159.178.28.30
3
Practice: navigate HPC
Connect and log into the system with PuTTY.
Switch to scratch: “$> cd /scratch/lfs/username”
Make a directory for the course: “$> mkdir GMS6014”
Type “ls” or “ls -l” to verify the folder.
Enter the folder by typing “$> cd GMS6014”
4
Practice: Retrieve HTS files from GEO
Practice: Decompress archive files.
Download the two HTS archive files with “$> wget -c URL”
First load the “sra” module: “$> module load sra”
Decompress both .sra files in the directory: “$> fastq-dump.2 *.sra”
5
Large Data Set Analysis. Hardware considerations:
1.) Data storage.
FASTA record of a protein (1,000 aa): ~1 KB
Human proteome, or Chromosome 21: ~50 MB
Human genome: ~1.5 GB
HTS transcriptome analysis (4 samples @ 40 million reads each), original and derived data sets: ~200 GB
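The storage figure for the transcriptome experiment can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes 100-base reads and roughly 60 bytes of header per record; both values and the function name are illustrative assumptions and do not come from the slide.

```python
def fastq_size_bytes(n_reads, read_len=100, header_len=60):
    """Approximate uncompressed FASTQ size for n_reads records.

    Each record has four lines: '@' header, bases, '+' line, qualities.
    read_len and header_len are assumed values, not from the slides.
    """
    per_record = (header_len + 1) + (read_len + 1) + 2 + (read_len + 1)
    return n_reads * per_record


samples = 4
reads_per_sample = 40_000_000
raw = samples * fastq_size_bytes(reads_per_sample)
print(f"raw FASTQ: ~{raw / 1e9:.0f} GB")
```

Under these assumptions the raw reads alone come to about 42 GB. Alignment files repeat every read together with its mapping information, so the derived data sets can easily exceed the raw reads in size.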
6
Large Data Set Analysis. Hardware considerations:
2.) Processors and RAM.
Comparison: tblastn of 5 protein sequences against a 1.2 GB genome takes ~15 sec of CPU time. Mapping a single Illumina run of 10 M reads to the human genome takes ~15,000 CPU sec (> 4 hours).
When RAM < data size, the computer will slow to a crawl.
7
Large Data Set Analysis. Hardware considerations:
3.) The operating system determines the availability of tools.
Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC. It offers easy control and automation.
Most tools are portable to Mac OS X, but they often require recompiling the source code.
8
Navigating the Linux command line environment:
User rights: a program cannot run unless you have the rights to read/write/execute the file.
Basic commands to survive: cd, ls -l, cp, mv, pwd, chmod, etc.
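To make the idea of user rights concrete, here is a minimal sketch that inspects and changes permissions from Python, mirroring what "ls -l" and "chmod u+x" do on the command line. The file name script.sh and the helper describe_access are hypothetical, chosen only for illustration.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Return which of read/write/execute the current user has on path."""
    return {
        "read": os.access(path, os.R_OK),
        "write": os.access(path, os.W_OK),
        "execute": os.access(path, os.X_OK),
    }


# Create a throwaway script, then set the owner's execute bit,
# which is what `chmod u+x script.sh` does on the command line.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "script.sh")  # hypothetical file name
    with open(script, "w") as fh:
        fh.write("#!/bin/sh\necho hello\n")
    before = describe_access(script)
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)
    new_mode = os.stat(script).st_mode
    after = describe_access(script)

print("before chmod:", before)
print("after chmod: ", after)
print("mode string: ", stat.filemode(new_mode))
```

The mode string printed at the end uses the same rwx notation as the first column of "ls -l".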
9
High Throughput DNA-Sequencing (HTS) data analysis
1. Sources and representation of HTS data.
2. Visualization of HTS data.
3. Discovering genomic patterns from HTS data.
4. Integrated data analysis and hypothesis-generating exploration.
10
Source of HTS data
Your own (sequencing service).
Public databases, such as NCBI/GEO.
Major genomic/epigenomic projects, such as ENCODE (ENCyclopedia Of DNA Elements), the Cancer Genome Project, etc.
Other internet sources.
11
Retrieving HTS data
Retrieving HTS data from the web using wget.
Uploading data to and downloading data from UFHPC (check the HPC instructions).
12
Recording sequence information – sequence file format
FASTA format: suitable for a single gene or genomic region (pre-genomic era).
> Gene_name or accession, (other info)
ACTGGGTTTATGACGTGTCATGCATGCA
ATGTAGCTAGATGCTAGCTAGATGCTAG
CTAGATGCTA….
A defined format is necessary for computers to identify and process the information.
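Because the format is defined, a few lines of code are enough to read it. The following is a minimal sketch of a FASTA reader; the function name parse_fasta is an assumption made for this example, and established libraries offer more robust parsers.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>Gene_name (other info)
ACTGGGTTTATGACGTGTCATGCATGCA
ATGTAGCTAGATGCTAGCTAGATGCTAG
CTAGATGCTA
""".splitlines()

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The sequence lines of one record are joined into a single string, so the line wrapping in the file does not matter.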
13
Recording sequence reads from the machine – FASTQ
FASTA:
>My_sequence
AATTACGCGCGATACGAT
FASTQ:
@My_sequence
AATTACGCGCGATACGAT
+My_sequence
efcfffffcfeeYBBsdf (quality)
Recording the quality assessment allows filtering based on sequence quality.
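The relationship between the two formats can be sketched in code. The reader below assumes the common four-line FASTQ layout (no wrapped sequence lines); the function name parse_fastq is hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record: '@' header, bases, '+' line, qualities.
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError(f"truncated record near {header!r}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


fastq = [
    "@My_sequence",
    "AATTACGCGCGATACGAT",
    "+My_sequence",
    "efcfffffcfeeYBBsdf",
]

# Dropping the quality line turns each FASTQ record into a FASTA record.
for name, seq, qual in parse_fastq(fastq):
    print(f">{name}\n{seq}")
```

The length check reflects the rule that every base has exactly one quality character.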
14
Paint the sequence reads to the genome
HTS reads:
@reads_1
AATTACGCGCGATACGAT
+
efcfffffcfeeYBBsdf
@reads_2
ACCGAGGCGCGTATGTCT
+
efcfffffcfeeYBBsea
….
@reads_1,000,001
Mapping tools: ELAND (Illumina), Bowtie, etc.
Output: the corresponding location on the genome.
ChIP-Seq; RNA-Seq
De novo assembly of genomes, chromatin conformation, genomic abnormality, etc.
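What "painting" a read onto the genome means can be shown with a toy example. The sketch below finds exact matches of a read on either strand by brute force. It only illustrates the idea and is not how ELAND or Bowtie work internally; the reference string is made up for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def map_read(read, reference):
    """Return (position, strand) for every exact match of read.

    Positions are 0-based offsets into reference. A '-' hit means the
    reverse complement of the read matches the reference at that offset.
    """
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)
    return hits


# Made-up reference: reads_1 sits on the forward strand,
# reads_2 on the reverse strand.
reference = "GGG" + "AATTACGCGCGATACGAT" + "CCC" + "AGACATACGCGCCTCGGT" + "AAA"
reads = {"reads_1": "AATTACGCGCGATACGAT", "reads_2": "ACCGAGGCGCGTATGTCT"}

for name, read in reads.items():
    print(name, map_read(read, reference))
```

Real aligners also tolerate mismatches and gaps, and they rely on a precomputed index of the genome rather than scanning it for every read.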
15
Recording sequence and quality information
FASTQ format = FASTA + Quality
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQabdefghadfda
Two identification lines (@, +) for each sequence.
The identification line format depends on the specific sequencing platform.
The quality line uses characters that represent integer values.
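How quality characters represent integers can be sketched as follows. The example assumes the Phred+33 encoding, in which "!" stands for a score of 0. Some older Illumina pipelines used an offset of 64 instead, so the offset is left as a parameter. The cutoff of 20 in passes_filter is an arbitrary illustrative choice.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def passes_filter(quality, min_mean=20, offset=33):
    """Keep a read when its mean score reaches min_mean (assumed cutoff)."""
    scores = phred_scores(quality, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


print(phred_scores('!"#$%'))
print(error_probability(20))
print(passes_filter("IIIII"), passes_filter("!!!!!"))
```

A score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.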
16
HTS data file
Sequence and quality information are recorded as multi-FASTQ files.
For efficient storage and transmission, they are converted into SRA (Sequence Read Archive) format.
Observe: convert the SRA file to FASTQ.
“$ fastq-dump.2 path_to_sra_file”
17
Representation of (HTS) data – BED (Browser Extensible Data) file
Chrom.  Start     End       Name  Score  Strand
chr2    10000192  10000217  U0    0      +
chr2    10000227  10000252  U1    0      -
chr2    10000310  10000335  U2    0      +
chr3    10000496  10000521  U1    0      -
chr2    10000556  10000581  U2    0      +
With the completion of the genome, there is no need to record the base pair identity (if it is the same as the reference genome).
Detailed description of genomic data formats: http://genome.ucsc.edu/FAQ/FAQformat.html
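A BED file is simple enough to read with a few lines of code. The sketch below parses the six columns shown on the slide and relies on the BED convention that Start is 0-based and End is exclusive, so the interval length is End minus Start. The names BedRecord and parse_bed are assumptions made for this example.

```python
from collections import Counter, namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed(lines):
    """Parse six-column BED lines into BedRecord tuples."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 6 or line.startswith("#"):
            continue
        if fields[0] in ("track", "browser"):
            continue
        chrom, start, end, name, score, strand = fields[:6]
        records.append(
            BedRecord(chrom, int(start), int(end), name, int(score), strand)
        )
    return records


bed_text = """chr2 10000192 10000217 U0 0 +
chr2 10000227 10000252 U1 0 -
chr2 10000310 10000335 U2 0 +
chr3 10000496 10000521 U1 0 -
chr2 10000556 10000581 U2 0 +
"""

records = parse_bed(bed_text.splitlines())
per_chrom = Counter(rec.chrom for rec in records)
lengths = {rec.end - rec.start for rec in records}
print(per_chrom, lengths)
```

Each record stores only coordinates and a strand, which is why a mapped read takes far less space in BED than in FASTQ.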
18
HTS data – map to genome
“bwa” and “bowtie” are the two most popular programs; both implement a similar strategy based on the Burrows-Wheeler Transform.
Both can benefit from multiple processors.
Map the reads to hg19:
bowtie2 -x hg19 -U SRR1186251.fastq -p 2 -S Input.sam
bowtie2 -x hg19 -U SRR1186252.fastq -p 2 -S P53ChIP.sam
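The Burrows-Wheeler Transform itself fits in a few lines. The sketch below is a naive textbook version that sorts all rotations of the text. It is meant only to show what the transform does and that it is reversible; the aligners build compressed indexes on top of the transform with far more efficient algorithms.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler Transform of text (naive version)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    # Sort every rotation of the text and keep the last column.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text from its transform (naive version)."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort again.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, and it can be inverted without any extra information. Those two properties are what make it useful for building a compact, searchable index of a genome.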
19
ChIP-Seq – identifying TF binding sites
MACS: Model-based Analysis of ChIP-Seq
Practice: identifying peaks
macs2 callpeak -t P53ChIP.sam -c Input.sam -f SAM -g hs -n P53_GM00011 -B
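The intuition behind peak calling is that reads from the ChIP sample pile up at binding sites compared with the input control. The toy sketch below counts reads in fixed windows and reports windows with a high ChIP/control ratio. It is a deliberately simplified illustration and does not reproduce the statistical model that MACS uses; the window size, fold cutoff, pseudocount, and read positions are all made-up values.

```python
from collections import Counter


def window_counts(positions, window=200):
    """Count read start positions per fixed-size window."""
    return Counter(pos // window for pos in positions)


def enriched_windows(chip, control, window=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, fold) for windows enriched in the ChIP sample."""
    chip_counts = window_counts(chip, window)
    ctrl_counts = window_counts(control, window)
    # Scale the control so both samples have the same total read count.
    scale = len(chip) / max(len(control), 1)
    hits = []
    for idx in sorted(chip_counts):
        expected = ctrl_counts.get(idx, 0) * scale + pseudocount
        fold = chip_counts[idx] / expected
        if fold >= min_fold:
            hits.append((idx * window, (idx + 1) * window, round(fold, 2)))
    return hits


# Made-up read start positions on a single chromosome.
chip = [1010, 1020, 1050, 1100, 1130, 1150, 1190, 1199, 5000, 9000]
control = [100, 700, 1100, 2500, 3300, 5000, 5100, 6400, 7000, 9000]

print(enriched_windows(chip, control))
```

In this example only the window from 1000 to 1200 is reported, because eight of the ten ChIP reads fall there while the control has a single read in the same window.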