Topics
  read length distribution
  genome coverage

Practical Biocomputing Week 12

Insert length distribution
  map reads to reference (bowtie2)
  select reads where both mates map with high quality (samtools)
  python
    calculate mean, standard deviation
    plot histogram
  example: [figure not reproduced]

insert_size.py

    """=================================================================================================
    insert_size.py

    Calculate insert size based on a SAM file of mapped reads. To get only high quality mapped
    read pairs use the samtools command

        samtools view -q 20 -f 0x82 SRR bam > SRR mapped

    SAM format is (all one line, whitespace separated fields)
    SRR AT1G S150M = NCT...CAG #A<...FJF AS:i: XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: YS:i: YT:Z:CP

    the insert length (TLEN) is field 8 counting from zero, the last field before the sequence

    Michael Gribskov April 2018
    ================================================================================================="""
    import sys

    map = None
    try:
        map = open(sys.argv[1], 'r')
    except OSError:
        print('unable to open input file ({})'.format(sys.argv[1]))
        exit(1)

    # first test: just echo the first few lines
    nread = 0
    for line in map:
        nread += 1
        print(line)
        if nread > 10:
            break

    print('{} reads read from {}'.format(nread, sys.argv[1]))
    exit(0)

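The -f 0x82 filter in the docstring requires two SAM FLAG bits at once: 0x2 (read mapped in a proper pair) and 0x80 (second read of the pair), so each properly paired fragment is counted once. The sketch below shows the same bit test in Python. The names FLAG_BITS, decode_flag and has_all_bits are made up for illustration, and the flag values 163 and 99 are just typical examples, not data from the course files.

```python
# SAM FLAG bits (subset), as defined in the SAM specification
FLAG_BITS = {
    0x1: 'paired',
    0x2: 'proper_pair',
    0x4: 'unmapped',
    0x8: 'mate_unmapped',
    0x10: 'reverse',
    0x20: 'mate_reverse',
    0x40: 'first_in_pair',
    0x80: 'second_in_pair',
}


def decode_flag(flag):
    """Return the names of the bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def has_all_bits(flag, required=0x82):
    """True if every bit in required is set (the test samtools view -f applies)."""
    return (flag & required) == required


names = decode_flag(163)          # 163 = 0x1 + 0x2 + 0x20 + 0x80
kept = has_all_bits(163)          # second in pair, properly paired
dropped = has_all_bits(99)        # 99 = 0x1 + 0x2 + 0x20 + 0x40, first in pair
```

Selecting only one mate matters because both mates of a pair report the same insert size (with opposite sign), so keeping both would count every fragment twice.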
insert_size.py histogram
  boilerplate from example
  difficult to decipher error

    import sys
    import matplotlib.mlab as mlab
    import matplotlib.pyplot as plt

    map = None
    try:
        map = open(sys.argv[1], 'r')
    except OSError:
        print('unable to open input file ({})'.format(sys.argv[1]))
        exit(1)

    nread = 0
    lendata = []
    for line in map:
        nread += 1
        field = line.split()
        print('{}\t{}'.format(field[0], field[8]))
        lendata.append(field[8])        # still a string, this is what causes the error below
        if nread > 1000:
            break

    print('\n{} reads read from {}'.format(nread, sys.argv[1]))

    n, bins, patches = plt.hist(lendata, bins=100, normed=1, facecolor='blue', alpha=0.75)
    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)
    plt.show()

    exit(0)

    Traceback (most recent call last):
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/insert_size.py", line 38, in <module>
        n, bins, patches = plt.hist(lendata, bins=100, normed=1, facecolor='blue', alpha=0.75)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3081, in hist
        stacked=stacked, data=data, **kwargs)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/__init__.py", line 1898, in inner
        return func(ax, *args, **kwargs)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 6180, in hist
        if len(xi) > 0:
    TypeError: len() of unsized object

insert_size.py
  it turns out that the data vector (lendata) must be floats; they were actually strings
  now I get a plot but there's nothing in it
    if I run in the debugger I get something different: the histogram appears, then disappears when the plt.axis() command runs
    duh, in the example the data is in a different range, and axis() sets the plot limits

    nread = 0
    lendata = []
    for line in map:
        nread += 1
        field = line.split()
        print('{}\t{}'.format(field[0], field[8]))
        lendata.append(float(field[8]))     # convert the string to a number
        if nread > 1000:
            break

    print('\n{} reads read from {}'.format(nread, sys.argv[1]))

    # density=True replaces normed=1, which newer Matplotlib versions no longer accept
    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue', alpha=0.75)
    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    # limits copied from the example; they do not fit this data, so the plot looks empty
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)
    plt.show()
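Both problems on this slide come from treating the example's boilerplate as if it fit the new data. A minimal sketch of the two fixes, assuming the TLEN values arrive as strings from line.split(): convert them to floats, and derive the x limits from the data instead of copying constants. The function names to_lengths and axis_limits and the four sample values are made up for illustration.

```python
def to_lengths(fields):
    """Convert TLEN strings taken from SAM lines to floats."""
    return [float(f) for f in fields]


def axis_limits(data, pad=0.05):
    """Return (xmin, xmax) spanning the data, widened by a fraction pad on each side."""
    lo, hi = min(data), max(data)
    margin = (hi - lo) * pad
    return lo - margin, hi + margin


lendata = to_lengths(['250', '310', '190', '275'])
xmin, xmax = axis_limits(lendata)
```

With limits computed this way, plt.xlim(xmin, xmax) could stand in for the hardwired plt.axis() call; simply deleting the plt.axis() line and letting Matplotlib autoscale is the other obvious option.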

insert_size.py
  it works!
  problems
    I have some negative lengths
    I want the bars to have a black outline

    insert = float(field[8])
    insert = abs(insert)        # TLEN is negative for the rightmost mate of a pair
    lendata.append(insert)

    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue',
                                edgecolor='black', linewidth=0.25, alpha=0.75)

insert_size.py all 15.7 M reads

insert_size.py bells and whistles
  calculate mean and standard deviation

    import statistics as stat
    lenmean = stat.mean(lendata)
    lensd = stat.stdev(lendata)

  write mean and standard deviation on plot
  draw mean line on plot

    # the following is for adding annotation to the figure
    # must do it before plotting the histogram
    fig = plt.figure()
    ax = fig.add_subplot(111)

    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue',
                                edgecolor='black', linewidth=0.25, alpha=0.75)
    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    plt.grid(True, linestyle='-', linewidth=0.1)
    # transform=ax.transAxes places the text in axes coordinates (0 to 1), not data coordinates
    plt.text(0.02, 0.9,
             'mean: {:.1f}\nstandard deviation: {:.1f}'.format(lenmean, lensd),
             fontsize=10, transform=ax.transAxes)
    plt.axvline(lenmean, color='red', linewidth=1.5)
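As a small check of the statistics step, the sketch below runs the same two calls on five made-up insert lengths and builds the annotation string used in plt.text(). Note that statistics.stdev() is the sample standard deviation (n - 1 in the denominator); statistics.pstdev() is the population version, and with millions of reads the two are practically identical.

```python
import statistics as stat

# made-up insert lengths, for illustration only
lendata = [250.0, 310.0, 190.0, 275.0, 225.0]

lenmean = stat.mean(lendata)
lensd = stat.stdev(lendata)       # sample standard deviation, n - 1 denominator

label = 'mean: {:.1f}\nstandard deviation: {:.1f}'.format(lenmean, lensd)
```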

read_dist.py plot the read distribution on a reference sequence
  uses SAM file again
  make an array that corresponds to the sequence
  increment the positions based on the beginning position of the read (POS) and the alignment (CIGAR)

    """=================================================================================================
    Plot the distribution of reads on a reference sequence based on mapped reads in a SAM file

    SAM format is (all one line, whitespace separated fields)
    SRR AT1G S150M = NCT...CAG #A<...FJF AS:i: XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: YS:i: YT:Z:CP

    SRR          QNAME   read name
                 FLAG    mapping bit flags
    AT1G         RNAME   reference sequence name
                 POS     leftmost position of mapped read
                 MAPQ    mapping quality
    1S150M       CIGAR   alignment
    =            RNEXT   name of mate/next read
                 PNEXT   position of mate/next read
                 TLEN    inferred insert size
    NCT...CAG    SEQ     sequence
    #A<...FJF    QUAL    quality

    the remaining columns are application specific
    AS:i: XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: YS:i: YT:Z:CP
    ================================================================================================="""
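Indexing fields by number (field[3], field[5], field[8]) is easy to get wrong, so one option is to give the eleven mandatory SAM columns names. This is only a sketch: SamRecord and parse_sam_line are hypothetical names, and the record below is invented rather than taken from the course data.

```python
from collections import namedtuple

# the eleven mandatory SAM columns, in file order
SamRecord = namedtuple(
    'SamRecord',
    'qname flag rname pos mapq cigar rnext pnext tlen seq qual')


def parse_sam_line(line):
    """Split one SAM alignment line and convert the numeric columns to int."""
    f = line.rstrip('\n').split('\t')
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


# invented record, for illustration only
line = 'read1\t163\tchr1\t100\t42\t5M\t=\t200\t105\tACGTA\tIIIII\n'
rec = parse_sam_line(line)
```

After parsing, rec.pos, rec.cigar and rec.tlen replace field[3], field[5] and field[8].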

read_dist.py

    import sys

    #
    # main
    if __name__ == '__main__':
        map = None
        try:
            map = open(sys.argv[1], 'r')
        except OSError:
            print('unable to open input file ({})'.format(sys.argv[1]))
            exit(1)

        nread = 0
        seq = []
        # assume that the mapped read begins at POS
        # use the CIGAR string to increment counts in the seq array
        for line in map:
            nread += 1
            field = line.split()
            # print('{}\t{}'.format(field[0], field[8]))
            pos = int(field[3])
            cigar = field[5]

            if nread > 1000:
                break

read_dist.py CIGAR string, examples
  151M: all 151 bases aligned (M covers both matching and mismatching bases)
  75S57M19S: 75 soft clipped, 57 aligned, 19 soft clipped (151 bases)
  48S102M1: 48 soft clipped, 102 aligned, 1 soft clipped
  8S 82M 8I 2M 1I 36M 14S: 8 soft clipped, 82 aligned, 8 base insertion in read, 2 aligned, 1 base insertion in read, 36 aligned, 14 soft clipped

  CIGAR codes
    M   alignment match (can be a sequence match or mismatch)
    I   insertion to the reference
    D   deletion from the reference
    N   skipped region from the reference
    S   soft clipping (clipped sequences present in SEQ)
    H   hard clipping (clipped sequences NOT present in SEQ)
    P   padding (silent deletion from padded reference)
    =   sequence match
    X   sequence mismatch
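A CIGAR string is a run of (length, code) pairs, so it can be split with one regular expression. The sketch below is one possible way to do it (parse_cigar and CIGAR_OP are made-up names). It also checks the last example above: the operations that use up read bases (M, I, S, = and X) should add up to the 151 base read length.

```python
import re

# one operation = a run of digits followed by a single CIGAR code
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')


def parse_cigar(cigar):
    """Return a list of (length, code) tuples for a CIGAR string."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


ops = parse_cigar('8S82M8I2M1I36M14S')

# M, I, S, = and X consume bases of the read
read_length = sum(n for n, op in ops if op in 'MIS=X')
```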

read_dist.py CIGAR codes, mostly M and S, a few D or I
  Actions
    S   do not count and do not advance the position: soft-clipped bases are not aligned, and POS already refers to the first aligned base
    M   count each position as covered and advance the position
    I   (insertion in read) ignore, do not advance the position
    D   (deletion in read) do not count, advance the position

  CIGAR codes
    M   alignment match (can be a sequence match or mismatch)
    I   insertion to the reference
    D   deletion from the reference
    N   skipped region from the reference
    S   soft clipping (clipped sequences present in SEQ)
    H   hard clipping (clipped sequences NOT present in SEQ)
    P   padding (silent deletion from padded reference)
    =   sequence match
    X   sequence mismatch
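According to the SAM specification, the operations that consume reference positions are M, D, N, = and X; S, I, H and P do not, and POS is the position of the first reference-consuming base. Under that rule the stretch of reference covered by a read can be computed directly from its CIGAR string. The helper names reference_span and reference_end below are hypothetical, and the start position 1000 is an arbitrary example.

```python
import re

# CIGAR codes that advance the position on the reference (SAM specification)
CONSUMES_REFERENCE = set('MDN=X')


def reference_span(cigar):
    """Number of reference positions covered by an alignment."""
    span = 0
    for n, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar):
        if op in CONSUMES_REFERENCE:
            span += int(n)
    return span


def reference_end(pos, cigar):
    """Last reference position (1-origin, inclusive) covered by a read mapped at pos."""
    return pos + reference_span(cigar) - 1


span = reference_span('8S82M8I2M1I36M14S')    # 82 + 2 + 36
end = reference_end(1000, '75S57M19S')        # 57 aligned bases starting at 1000
```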

read_dist.py

    # first version: one branch per CIGAR code
    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''               # reset the digit buffer for the next operation
            if char == 'M':
                # matching positions
                for j in range(pos, pos + i):
                    seq[j] += 1
                m += i
                pos += i
            elif char == 'S':
                # soft clipped positions: not aligned, POS is the first aligned base
                pass
            elif char == 'I':
                # insertions in read, do nothing
                pass
            elif char == 'D':
                # deletion in read, increment pos
                pos += i
            else:
                # must be a character we don't care about,
                # we'll just ignore these for now
                pass

        return m
    # end of add_cigar

    # shorter version: only M and D advance the position
    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''
            if char == 'M':
                # matching positions
                for j in range(pos, pos + i):
                    seq[j] += 1
                m += i
            elif char != 'D':
                # must be a character we don't care about,
                # we'll just ignore these for now
                # S and I also get dealt with here
                continue

            # only M and D fall through to here; D does nothing else
            pos += i

        return m
    # end of add_cigar

read_dist.py problem with seq list since I did not initialize it
  it's ugly if I hardwire an arbitrary large value such as 50 M
    changes with the genome
    could read from command line
    I just have to know what the biggest sequence might be
  maybe I can do it on the fly?

    if __name__ == '__main__':
        map = None
        try:
            map = open(sys.argv[1], 'r')
        except OSError:
            print('unable to open input file ({})'.format(sys.argv[1]))
            exit(1)

        nread = 0
        seq = []
        # assume that the mapped read begins at POS
        # use the CIGAR string to increment counts in the seq array
        bases_mapped = 0
        for line in map:
            nread += 1
            field = line.split()
            # print('{}\t{}'.format(field[0], field[8]))
            pos = int(field[3])
            cigar = field[5]
            bases_mapped += add_cigar(seq, pos, cigar)

            if nread > 1000:
                break

        print('{} bases mapped'.format(bases_mapped))
        exit(0)

    Traceback (most recent call last):
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/read_dist.py", line 94, in <module>
        bases_mapped += add_cigar( seq, pos, cigar)
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/read_dist.py", line 52, in add_cigar
        seq[pos] += 1
    IndexError: list index out of range
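Another way to avoid hardwiring a size, offered here only as an alternative to growing the list on the fly: if the SAM file still has its header (samtools view prints it only when -h is given), each @SQ line carries the reference name in the SN tag and its length in the LN tag. A minimal sketch, with the hypothetical name reference_lengths and an invented three-line file:

```python
def reference_lengths(lines):
    """Collect {reference name: length} from the @SQ lines of a SAM header."""
    lengths = {}
    for line in lines:
        if not line.startswith('@'):
            break                           # header lines come before the alignments
        if line.startswith('@SQ'):
            tags = dict(f.split(':', 1) for f in line.rstrip('\n').split('\t')[1:])
            lengths[tags['SN']] = int(tags['LN'])
    return lengths


# invented header and alignment, for illustration only
sam = ['@HD\tVN:1.6\tSO:coordinate\n',
       '@SQ\tSN:Chr4\tLN:5000\n',
       'read1\t163\tChr4\t100\t42\t5M\t=\t200\t105\tACGTA\tIIIII\n']

lengths = reference_lengths(sam)
seq = [0] * (lengths['Chr4'] + 1)           # index 0 unused, SAM positions are 1-origin
```

This only works when the header is present, so the on-the-fly approach remains the more general fix.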

read_dist.py fixing the array overrun problem
  detecting the problem
    check the current end of the array versus the new alignment

    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''               # reset the digit buffer for the next operation
            if char == 'M':
                # matching positions
                # check to make sure seq list is big enough, if not add some more elements
                if pos + i + 1 > len(seq):
                    extend_list(seq, pos + i)

                for j in range(pos, pos + i):
                    m += 1
                    seq[j] += 1

            elif char != 'D':
                # must be a character we don't care about, we'll just ignore these for now
                # S and I get dealt with here
                continue

            # only M and D fall through to here
            # M and D increment the position, D does nothing else
            pos += i

        return m
    # end of add_cigar

read_dist.py fixing the array overrun problem
  extend_list() function

    def extend_list(arr, end, init=0):
        """
        extend the existing list arr by adding indices from the current end of the list to the
        specified end pos (the new last index in list)

        :param arr: list
        :param end: last index to create
        :param init: value to initialize elements with
        :return: int, new list size
        """
        begin = len(arr)
        arr += [init for k in range(begin, end + 1)]

        return len(arr)
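A short usage check of extend_list(), repeated here so the example runs on its own. It shows the two properties the caller relies on: the list is grown in place (so counts already stored survive), and asking for an end that is already covered changes nothing.

```python
def extend_list(arr, end, init=0):
    """Grow arr in place so that index end exists; return the new length."""
    begin = len(arr)
    arr += [init for k in range(begin, end + 1)]
    return len(arr)


seq = []
size_first = extend_list(seq, 4)      # indices 0..4 now exist
seq[3] += 1                           # store a count
size_second = extend_list(seq, 2)     # already long enough, nothing added
size_third = extend_list(seq, 7)      # indices 5..7 added, seq[3] is kept
```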

read_dist.py in testing, I find a few SAM lines cause problems
  filter them out (main program)

    bases_mapped = 0
    for line in map:
        if line.startswith('@'):
            # skip header lines (SAM header lines begin with @)
            continue

        field = line.split()
        # print('{}\t{}'.format(field[0], field[8]))
        pos = int(field[3])
        mapq = int(field[4])        # convert, otherwise the comparison with 0 never succeeds
        cigar = field[5]

        # filter some lines
        if mapq == 0 or cigar == '*':
            continue

        nread += 1
        bases_mapped += add_cigar(seq, pos, cigar)

        # the limit is missing from the slide text; MAX_READS is a placeholder
        if nread > MAX_READS:
            break

    print('{} bases mapped'.format(bases_mapped))
    exit(0)
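The filtering rules can also be collected in one small predicate, which keeps the main loop short and makes the rules easy to test. This is a sketch under the assumptions stated on the slide (skip header lines, MAPQ of 0, and a CIGAR of '*'); keep_alignment and min_mapq are made-up names, and the test lines are invented.

```python
def keep_alignment(line, min_mapq=1):
    """Return True if a SAM line should be counted in the read distribution."""
    if line.startswith('@'):
        return False                  # header line
    field = line.split('\t')
    if len(field) < 11:
        return False                  # truncated or malformed line
    mapq = int(field[4])
    cigar = field[5]
    return mapq >= min_mapq and cigar != '*'


# invented lines, for illustration only
good = 'read1\t163\tChr4\t100\t42\t5M\t=\t200\t105\tACGTA\tIIIII\n'
ambiguous = 'read2\t163\tChr4\t100\t0\t5M\t=\t200\t105\tACGTA\tIIIII\n'
unaligned = 'read3\t77\t*\t0\t42\t*\t*\t0\t0\tACGTA\tIIIII\n'
header = '@SQ\tSN:Chr4\tLN:5000\n'

results = [keep_alignment(x) for x in (good, ambiguous, unaligned, header)]
```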

read_dist.py
  the seq array is pretty sparse so I wrote a function to compress it

    def condense_seq(seq):
        """
        condense the seq array by removing consecutive positions with the same value. This allows
        easier printing. For plotting, each interval ends at the indicated position.

            pos1, val1
            pos2, val2
            ...

        SAM files use a 1 origin, so the first interval is 1 to pos1 with value val1, the second
        is pos1 + 1 to pos2 with val2, etc.

        :param seq: list of int, coverage at each base
        :return: list of tuples, (pos, val)
        """
        compressed = []
        cover = seq[1]
        for pos in range(2, len(seq)):
            if seq[pos] != cover:
                compressed.append((pos - 1, cover))
                cover = seq[pos]

        # the last interval is still open when the loop ends, so close it here
        compressed.append((len(seq) - 1, cover))

        return compressed

  annotation for the plotted region (reverse strand)

    Chr4   gene    ID=AT4G00290;Name=AT4G00290
    Chr4   mRNA    ID=AT4G ;Parent=AT4G00290
    Chr4   3'UTR   ID=AT4G00290:three_prime_UTR:1;Parent=AT4G ;Name=AT4G00290:three_prime_UTR:1
    Chr4   exon    ID=AT4G00290:exon:7;Parent=AT4G ;Name=AT4G00290:exon:7
    Chr4   exon    ID=AT4G00290:exon:6;Parent=AT4G ;Name=AT4G00290:exon:6
    Chr4   exon    ID=AT4G00290:exon:5;Parent=AT4G ;Name=AT4G00290:exon:5
    Chr4   exon    ID=AT4G00290:exon:4;Parent=AT4G ;Name=AT4G00290:exon:4
    Chr4   exon    ID=AT4G00290:exon:3;Parent=AT4G ;Name=AT4G00290:exon:3
    Chr4   exon    ID=AT4G00290:exon:2;Parent=AT4G ;Name=AT4G00290:exon:2
    Chr4   5'UTR   ID=AT4G00290:five_prime_UTR:2;Parent=AT4G ;Name=AT4G00290:five_prime_UTR:2
    Chr4   exon    ID=AT4G00290:exon:1;Parent=AT4G ;Name=AT4G00290:exon:1
    Chr4   5'UTR   ID=AT4G00290:five_prime_UTR:1;Parent=AT4G ;Name=AT4G00290:five_prime_UTR:1
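The same run-length idea can be written with itertools.groupby, which groups consecutive equal values. This is an alternative sketch, not the function from the slide: condense_runs is a made-up name, the coverage list is invented, and the version here emits every interval including the last one.

```python
from itertools import groupby


def condense_runs(seq):
    """Return (end position, coverage) for each run of equal values in seq[1:]."""
    compressed = []
    pos = 0                               # seq[0] is unused, positions are 1-origin
    for cover, run in groupby(seq[1:]):
        pos += len(list(run))             # this run ends at position pos
        compressed.append((pos, cover))
    return compressed


# invented coverage: positions 1-2 -> 0, 3-4 -> 2, 5 -> 3, 6-7 -> 0
seq = [0, 0, 0, 2, 2, 3, 0, 0]
intervals = condense_runs(seq)
```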

read_dist.py

    import matplotlib.pyplot as plt
    from matplotlib.ticker import MultipleLocator, FormatStrFormatter

    majorlocator = MultipleLocator(1000)
    minorlocator = MultipleLocator(100)
    majorformatter = FormatStrFormatter('%d')

    fig, ax = plt.subplots(1, 1, figsize=(15, 3))
    # seq is indexed by SAM position (1 origin, index 0 unused), so the index is the x value
    pos = list(range(len(seq)))
    ticks = [i for i in range(3000, 8000) if i % 100 == 0]
    ax.fill(pos, seq, linewidth=0.75)

    # plt.yscale('log')
    plt.xticks(ticks)
    plt.xlim(3000, 8000)
    plt.ylim(0, 150)
    ax.xaxis.set_major_locator(majorlocator)
    ax.xaxis.set_major_formatter(majorformatter)
    ax.xaxis.set_minor_locator(minorlocator)

    ax.set(xlabel='position', ylabel='Read Count', title='Read Distribution')
    ax.grid()

    plt.show()

read_dist.py

    Chr4   exon   ID=AT4G00020:exon:27;Parent=AT4G ;Name=BRCA2(IV):exon:27
    Chr4   CDS    ID=AT4G00020:CDS:27;Parent=AT4G ;Name=BRCA2(IV):CDS:27
    Chr4   exon   ID=AT4G00020:exon:26;Parent=AT4G ;Name=BRCA2(IV):exon:26
    Chr4   CDS    ID=AT4G00020:CDS:25;Parent=AT4G ;Name=BRCA2(IV):CDS:25
    Chr4   CDS    ID=AT4G00020:CDS:22;Parent=AT4G ;Name=BRCA2(IV):CDS:22
    Chr4   CDS    ID=AT4G00020:CDS:21;Parent=AT4G ;Name=BRCA2(IV):CDS:21
    Chr4   CDS    ID=AT4G00020:CDS:19;Parent=AT4G ;Name=BRCA2(IV):CDS:19
    Chr4   CDS    ID=AT4G00020:CDS:18;Parent=AT4G ;Name=BRCA2(IV):CDS:18
    Chr4   CDS    ID=AT4G00020:CDS:17;Parent=AT4G ;Name=BRCA2(IV):CDS:17
    Chr4   CDS    ID=AT4G00020:CDS:16;Parent=AT4G ;Name=BRCA2(IV):CDS:16
    Chr4   CDS    ID=AT4G00020:CDS:15;Parent=AT4G ;Name=BRCA2(IV):CDS:15
    Chr4   CDS    ID=AT4G00020:CDS:14;Parent=AT4G ;Name=BRCA2(IV):CDS:14
    Chr4   CDS    ID=AT4G00020:CDS:13;Parent=AT4G ;Name=BRCA2(IV):CDS:13
    Chr4   CDS    ID=AT4G00020:CDS:12;Parent=AT4G ;Name=BRCA2(IV):CDS:12
    Chr4   CDS    ID=AT4G00020:CDS:11;Parent=AT4G ;Name=BRCA2(IV):CDS:11
    Chr4   CDS    ID=AT4G00020:CDS:10;Parent=AT4G ;Name=BRCA2(IV):CDS:10
    Chr4   CDS    ID=AT4G00020:CDS:8;Parent=AT4G ;Name=BRCA2(IV):CDS:8

read_dist.py With exons marked

    # add exon locations
    exon = [('Chr4', 'CDS', 4127, 4149), ('Chr4', 'CDS', 4227, 4438), ('Chr4', 'CDS', 4545, 4749),
            ('Chr4', 'CDS', 4839, 4901), ('Chr4', 'CDS', 4977, 5119), ('Chr4', 'CDS', 5406, 5588),
            ('Chr4', 'CDS', 5657, 5855), ('Chr4', 'CDS', 6605, 6676), ('Chr4', 'CDS', 6760, 6871),
            ('Chr4', 'CDS', 6975, 7056), ('Chr4', 'CDS', 7144, 7194), ('Chr4', 'CDS', 7294, 7375),
            ('Chr4', 'CDS', 7453, 7638), ('Chr4', 'CDS', 7712, 7813), ('Chr4', 'CDS', 7914, 7947)]

    # axhline() takes its x range as fractions of the axis width (0 to 1), so the exon
    # positions are rescaled. The constants in the slide text are missing; here they are
    # taken from the current x limits, so run this after plt.xlim() has been set.
    left, right = plt.xlim()
    span = right - left
    for e in exon:
        begin = (e[2] - left) / span
        end = (e[3] - left) / span
        plt.axhline(5.0, begin, end, color='black', linewidth=6.0)
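The rescaling is needed because plt.axhline() expects xmin and xmax as fractions of the axis width (0 at the left edge, 1 at the right), not as sequence positions. A minimal sketch of the conversion, using the plot limits 3000 and 8000 from the plt.xlim() call in the plotting code and the first exon in the list; axes_fraction is a made-up helper name.

```python
def axes_fraction(x, xmin, xmax):
    """Convert a data coordinate x to a fraction (0 to 1) of the axis from xmin to xmax."""
    return (x - xmin) / (xmax - xmin)


# first CDS in the exon list, plotted on an axis running from 3000 to 8000
begin = axes_fraction(4127, 3000, 8000)
end = axes_fraction(4149, 3000, 8000)
```

If the rescaling feels awkward, Matplotlib's hlines() function takes xmin and xmax in data coordinates, so exon positions could be passed to it unchanged.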
