Practical Biocomputing 2018 Week 12

Topics
  read length distribution
  genome coverage
Insert length distribution
  map reads to reference (bowtie2)
  select reads where both mates map with high quality (samtools)
  python
    calculate mean, standard deviation
    plot histogram
    example: https://matplotlib.org/1.2.1/examples/pylab_examples/histogram_demo.html
insert_size.py

    """=================================================================================================
    insert_size.py

    Calculate insert size based on a SAM file of mapped reads. To get only high quality mapped read
    pairs use the samtools command

        samtools view -q 20 -f 0x82 SRR5295840.bam > SRR5295840.mapped

    SAM format is (all one line, whitespace separated fields)
    SRR5295840.120 163 AT1G07250.1 101 44 1S150M = 347 398 NCT...CAG #A<...FJF AS:i:300 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:300 YT:Z:CP

    the insert length is field 8 (counting from 0), the last field before the sequence

    Michael Gribskov     1 April 2018
    ================================================================================================="""
    import sys

    map = None
    try:
        map = open(sys.argv[1], 'r')
    except OSError:
        print('unable to open input file ({})'.format(sys.argv[1]))
        sys.exit(1)

    # first test: echo the first few lines to check that the file is read correctly
    nread = 0
    for line in map:
        nread += 1
        print(line)
        if nread > 10:
            break

    print('{} reads read from {}'.format(nread, sys.argv[1]))
    sys.exit(0)
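To make the field numbering concrete, here is a small sketch that splits the example SAM line from the docstring and labels the first eleven columns. The helper parse_sam_line and the list SAM_COLUMNS are illustrative names, not part of insert_size.py; the column names are those used by the SAM format.

```python
# Names of the 11 mandatory SAM columns, in file order (index 0 to 10).
SAM_COLUMNS = ['QNAME', 'FLAG', 'RNAME', 'POS', 'MAPQ', 'CIGAR',
               'RNEXT', 'PNEXT', 'TLEN', 'SEQ', 'QUAL']


def parse_sam_line(line):
    """Hypothetical helper: return a dict of the mandatory columns plus the optional tags."""
    field = line.split()
    record = dict(zip(SAM_COLUMNS, field))      # zip stops after the 11 named columns
    record['TAGS'] = field[len(SAM_COLUMNS):]   # everything after QUAL
    return record


# the example line from the docstring (sequence and quality are abbreviated there)
example = ('SRR5295840.120 163 AT1G07250.1 101 44 1S150M = 347 398 '
           'NCT...CAG #A<...FJF AS:i:300 XN:i:0 XM:i:0 XO:i:0 XG:i:0 '
           'NM:i:0 MD:Z:150 YS:i:300 YT:Z:CP')

record = parse_sam_line(example)
print(SAM_COLUMNS.index('TLEN'), record['TLEN'])
```

Counting from 0, TLEN sits at index 8, which is why the script reads field[8].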
insert_size.py
  histogram boilerplate from example
  difficult to decipher error

    import sys
    import matplotlib.mlab as mlab
    import matplotlib.pyplot as plt

    map = None
    try:
        map = open(sys.argv[1], 'r')
    except OSError:
        print('unable to open input file ({})'.format(sys.argv[1]))
        sys.exit(1)

    nread = 0
    lendata = []
    for line in map:
        nread += 1
        field = line.split()
        print('{}\t{}'.format(field[0], field[8]))
        lendata.append(field[8])
        if nread > 1000:
            break

    print('\n{} reads read from {}'.format(nread, sys.argv[1]))

    n, bins, patches = plt.hist(lendata, bins=100, normed=1, facecolor='blue', alpha=0.75)

    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)

    plt.show()

    sys.exit(0)

  the error

    Traceback (most recent call last):
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/insert_size.py", line 38, in <module>
        n, bins, patches = plt.hist(lendata, bins=100, normed=1, facecolor='blue', alpha=0.75)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3081, in hist
        stacked=stacked, data=data, **kwargs)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/__init__.py", line 1898, in inner
        return func(ax, *args, **kwargs)
      File "/apps/rhel6/Anaconda/4.4.0-py36/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 6180, in hist
        if len(xi) > 0:
    TypeError: len() of unsized object
insert_size.py
  it turns out that the data vector (lendata) must be floats
    actually they were strings
  now I get a plot but there's nothing in it, but if I run in the debugger I get something different
    the plot disappears when the plt.axis() command runs
    duh, in the example the data is in a different range
    axis() sets the plot limits

    nread = 0
    lendata = []
    for line in map:
        nread += 1
        field = line.split()
        print('{}\t{}'.format(field[0], field[8]))
        lendata.append(float(field[8]))         # convert the string to a number
        if nread > 1000:
            break

    print('\n{} reads read from {}'.format(nread, sys.argv[1]))

    # density=True replaces normed=1, which current matplotlib no longer accepts
    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue', alpha=0.75)

    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    # these limits were copied from the example and do not fit the insert lengths;
    # this is the call that empties the plot, remove it or use limits that match the data
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)

    plt.show()
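Both problems on this slide can be checked without drawing anything. The sketch below shows what happens when the values are still strings, and one way to take the x-axis limits from the data instead of copying them from the example. The helper data_limits and the three sample values are made up for illustration.

```python
def data_limits(values, pad=0.05):
    """Hypothetical helper: x-axis limits that enclose the data, with a small margin."""
    lo, hi = min(values), max(values)
    margin = (hi - lo) * pad
    return lo - margin, hi + margin


raw = ['398', '1000', '250']           # what line.split() returns: strings
print(max(raw))                        # compares text character by character, not numbers

lendata = [float(x) for x in raw]      # convert before plotting or doing arithmetic
print(max(lendata))
print(data_limits(lendata))
```

The pair returned by data_limits() could be passed to plt.xlim() in place of the hardwired plt.axis() call, leaving the y limits to matplotlib.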
insert_size.py
  it works!
  problems
    I have some negative lengths
    I want the bars to have a black outline

    insert = float(field[8])
    insert = abs(insert)                # use the unsigned length
    lendata.append(insert)

    # density=True replaces normed=1, which current matplotlib no longer accepts
    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue',
                                edgecolor='black', linewidth=0.25, alpha=0.75)
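The negative lengths are expected: in the SAM format TLEN is signed, positive for the leftmost mate of a pair and negative for the rightmost. A minimal sketch follows, where insert_length is an illustrative name and the second line is a made-up mate record, not data from the run.

```python
def insert_length(line):
    """Return the unsigned insert length (TLEN, field 8) from one SAM line."""
    field = line.split()
    return abs(int(field[8]))


# leftmost mate: the example line from the docstring, optional tags left off
left = 'SRR5295840.120 163 AT1G07250.1 101 44 1S150M = 347 398 NCT...CAG #A<...FJF'
# rightmost mate: made up for illustration, same pair with the sign of TLEN reversed
right = 'SRR5295840.120 83 AT1G07250.1 347 44 150M = 101 -398 NCT...CAG #A<...FJF'

print(insert_length(left), insert_length(right))
```

Taking the absolute value gives the same length whichever mate of the pair is read.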
insert_size.py
  all 15.7 M reads
insert_size.py
  bells and whistles
    calculate mean and standard deviation

    import statistics as stat

    lenmean = stat.mean(lendata)
    lensd = stat.stdev(lendata)

    write mean and standard deviation on plot
    draw mean line on plot

    # the following is for adding annotation to the figure
    # must do it before plotting the histogram
    fig = plt.figure()
    ax = fig.add_subplot(111)

    # density=True replaces normed=1, which current matplotlib no longer accepts
    n, bins, patches = plt.hist(lendata, bins=100, density=True, facecolor='blue',
                                edgecolor='black', linewidth=0.25, alpha=0.75)

    plt.xlabel('Length')
    plt.ylabel('Probability')
    plt.title('Library Insert Length')
    plt.grid(True, linestyle='-', linewidth=0.1)
    # transform=ax.transAxes places the text in axis fractions (0 to 1), not data units
    plt.text(0.02, 0.9, 'mean: {:.1f}\nstandard deviation: {:.1f}'.format(lenmean, lensd),
             fontsize=10, transform=ax.transAxes)
    plt.axvline(lenmean, color='red', linewidth=1.5)
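A small check of the statistics calls, using three made-up insert lengths. stat.stdev() is the sample standard deviation (divides by n - 1); stat.pstdev() is the population version (divides by n). With millions of reads the two are practically the same.

```python
import statistics as stat

lendata = [380.0, 400.0, 420.0]     # made-up insert lengths

lenmean = stat.mean(lendata)
lensd = stat.stdev(lendata)         # sample standard deviation, divides by n - 1
lenpsd = stat.pstdev(lendata)       # population standard deviation, divides by n

# the same label text that is written on the plot
label = 'mean: {:.1f}\nstandard deviation: {:.1f}'.format(lenmean, lensd)
print(label)
print('population standard deviation: {:.1f}'.format(lenpsd))
```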
read_dist.py
  plot the read distribution on a reference sequence
  uses SAM file again
  make an array that corresponds to the sequence
  increment the positions based on the beginning position of the read (POS) and the alignment (CIGAR)

    """=================================================================================================
    Plot the distribution of reads on a reference sequence based on mapped reads in a SAM file

    SAM format is (all one line, whitespace separated fields)
    SRR5295840.120 163 AT1G07250.1 101 44 1S150M = 347 398 NCT...CAG #A<...FJF AS:i:300 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:300 YT:Z:CP

    SRR5295840.120    QNAME    read name
    163               FLAG     mapping bit flags
    AT1G07250.1       RNAME    reference sequence name
    101               POS      leftmost position of mapped read
    44                MAPQ     mapping quality
    1S150M            CIGAR    alignment
    =                 RNEXT    name of mate/next read
    347               PNEXT    position of mate/next read
    398               TLEN     inferred insert size
    NCT...CAG         SEQ      sequence
    #A<...FJF         QUAL     quality

    the remaining columns are application specific
    AS:i:300 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:150 YS:i:300 YT:Z:CP
    ================================================================================================="""
read_dist.py

    import sys

    # --------------------------------------------------------------------------------------------------
    # main
    if __name__ == '__main__':
        map = None
        try:
            map = open(sys.argv[1], 'r')
        except OSError:
            print('unable to open input file ({})'.format(sys.argv[1]))
            sys.exit(1)

        nread = 0
        seq = []

        # assume that the mapped read begins at POS
        # use the CIGAR string to increment counts in the seq array
        for line in map:
            nread += 1
            field = line.split()
            # print('{}\t{}'.format(field[0], field[8]))
            pos = int(field[3])
            cigar = field[5]

            if nread > 1000:
                break
read_dist.py
  CIGAR string, examples
    151M                       151 aligned bases, no clipping (M includes mismatches)
    75S57M19S                  75 soft clipped (not aligned), 57 aligned, 19 soft clipped (151 bases)
    48S102M1                   48 soft clipped, 102 aligned, 1 not aligned
    8S 82M 8I 2M 1I 36M 14S    8 soft clipped, 82 aligned, 8 base insertion in read, 2 aligned,
                               1 base insertion in read, 36 aligned, 14 soft clipped

  CIGAR codes
    M    alignment match (can be a sequence match or mismatch)
    I    insertion to the reference
    D    deletion from the reference
    N    skipped region from the reference
    S    soft clipping (clipped sequences present in SEQ)
    H    hard clipping (clipped sequences NOT present in SEQ)
    P    padding (silent deletion from padded reference)
    =    sequence match
    X    sequence mismatch
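The examples above can be checked with a short tokenizer. This sketch uses a regular expression rather than the character-by-character loop developed on the following slides; cigar_ops and read_length are illustrative names. The operations counted toward the read length (M, I, S, = and X) are the ones whose bases are present in SEQ.

```python
import re

# one CIGAR operation: a run length followed by a single operation code
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')

# operations whose bases are present in the read sequence (SEQ)
READ_OPS = 'MIS=X'


def cigar_ops(cigar):
    """Split a CIGAR string into (length, code) tuples."""
    return [(int(n), code) for n, code in CIGAR_OP.findall(cigar)]


def read_length(cigar):
    """Number of read bases accounted for by the CIGAR string."""
    return sum(n for n, code in cigar_ops(cigar) if code in READ_OPS)


for cigar in ('151M', '75S57M19S', '8S82M8I2M1I36M14S'):
    print(cigar, cigar_ops(cigar), read_length(cigar))
```

Each of the three complete examples accounts for 151 read bases.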
read_dist.py
  CIGAR codes, mostly M and S, a few D or I
  Actions
    S    increment sequence positions but do not count as a match
    M    increment sequence positions and count as a match
    I    (insertion in read) ignore, do not increment position
    D    (deletion in read) ignore, increment sequence position
  note: advancing on S follows the assumption that the read begins at POS; in the SAM
  specification POS is the first aligned base, so a leading soft clip lies before POS
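For comparison, here is a sketch of the position bookkeeping as the SAM specification defines it: POS is the first reference base used by the alignment, and only M, D, N, = and X advance along the reference, while S, I, H and P do not. This differs from the action listed for S on the slide, which advances the position. The name aligned_blocks is illustrative.

```python
import re

CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')


def aligned_blocks(pos, cigar):
    """Return (first, last) reference intervals, 1-based and inclusive, covered by aligned bases."""
    blocks = []
    for n, code in CIGAR_OP.findall(cigar):
        n = int(n)
        if code in 'M=X':
            # aligned bases: covered on the reference, then move past them
            blocks.append((pos, pos + n - 1))
            pos += n
        elif code in 'DN':
            # deleted or skipped reference bases: move past them without counting
            pos += n
        # S, I, H and P use no reference bases, so pos stays where it is
    return blocks


print(aligned_blocks(101, '1S150M'))
print(aligned_blocks(1000, '8S82M8I2M1I36M14S'))
print(aligned_blocks(1, '10M2D5M'))
```

For the example read (POS 101, CIGAR 1S150M) this gives a single block from 101 to 250.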
read_dist.py

    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''                       # start a new number for the next operation
            if char == 'M':
                # matching positions: count the bases and add coverage for pos .. pos+i-1
                m += i
                for j in range(pos, pos + i):
                    seq[j] += 1
                pos += i
            elif char == 'S':
                # soft clipped positions
                pos += i
            elif char == 'I':
                # insertions in read, do nothing
                pass
            elif char == 'D':
                # deletion in read, increment pos
                pos += i
            else:
                # must be a character we don't care about,
                # we'll just ignore these for now
                pass

        return m
    # end of add_cigar

  the same function with the M, S, and D branches sharing the position update

    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''                       # start a new number for the next operation
            if char == 'M':
                # matching positions
                m += i
                for j in range(pos, pos + i):
                    seq[j] += 1
            elif char not in 'SD':
                # must be a character we don't care about,
                # we'll just ignore these for now
                # I also gets dealt with here
                continue

            # M, S, and D increment the position, S and D do nothing else
            pos += i

        return m
    # end of add_cigar
read_dist.py
  problem with seq list since I did not initialize it
    it's ugly if I hardwire an arbitrary large value such as 50 M
      changes with the genome
      could read from command line
      I just have to know what the biggest sequence might be
    maybe I can do it on the fly?

    if __name__ == '__main__':
        map = None
        try:
            map = open(sys.argv[1], 'r')
        except OSError:
            print('unable to open input file ({})'.format(sys.argv[1]))
            sys.exit(1)

        nread = 0
        seq = []

        # assume that the mapped read begins at POS
        # use the CIGAR string to increment counts in the seq array
        bases_mapped = 0
        for line in map:
            nread += 1
            field = line.split()
            # print('{}\t{}'.format(field[0], field[8]))
            pos = int(field[3])
            cigar = field[5]
            bases_mapped += add_cigar(seq, pos, cigar)

            if nread > 1000:
                break

        print('{} bases mapped'.format(bases_mapped))

        sys.exit(0)

  the error

    Traceback (most recent call last):
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/read_dist.py", line 94, in <module>
        bases_mapped += add_cigar( seq, pos, cigar)
      File "/scratch/snyder/m/mgribsko/biocomputing/utils/read_dist.py", line 52, in add_cigar
        seq[pos] += 1
    IndexError: list index out of range
read_dist.py
  fixing the array overrun problem
    detecting the problem
      check the current end of the array versus the new alignment

    def add_cigar(seq, pos, cigar):
        istr = ''
        m = 0
        for char in cigar:
            if char.isdigit():
                istr += char
                continue

            i = int(istr)
            istr = ''                       # start a new number for the next operation
            if char == 'M':
                # matching positions
                # check to make sure seq list is big enough, if not add some more elements
                if pos + i + 1 > len(seq):
                    extend_list(seq, pos + i + 10000)

                # an M run of i bases covers positions pos .. pos+i-1
                for j in range(pos, pos + i):
                    m += 1
                    seq[j] += 1

            elif char not in 'SD':
                # must be a character we don't care about, we'll just ignore these for now
                # I gets dealt with here
                continue

            # only M, S, and D fall through to here
            # M, S, and D increment the position, S and D do nothing else
            pos += i

        return m
    # end of add_cigar
read_dist.py
  fixing the array overrun problem
    extend_list() function

    def extend_list(arr, end, init=0):
        """---------------------------------------------------------------------------------------------
        extend the existing list arr by adding indices from the current end of the list to the
        specified end pos (the new last index in list)

        :param arr: list
        :param end: last index to create
        :param init: value to initialize elements with
        :return: int, new list size
        ---------------------------------------------------------------------------------------------"""
        begin = len(arr)
        arr += [init for k in range(begin, end + 1)]

        return len(arr)
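A short usage sketch for extend_list(), following the same check that add_cigar() makes before it touches the list. The function is restated here in a compact form so the block runs on its own; the read position and length are made-up values.

```python
def extend_list(arr, end, init=0):
    """Grow arr in place so that index end exists; return the new length."""
    arr.extend([init] * (end + 1 - len(arr)))   # a negative count adds nothing
    return len(arr)


seq = []
pos, i = 101, 150                 # a 150 base match starting at position 101

# same test as in add_cigar: make sure the list is long enough before indexing
if pos + i + 1 > len(seq):
    extend_list(seq, pos + i + 10000)

for j in range(pos, pos + i):
    seq[j] += 1

print(len(seq), sum(seq))
```

Adding 10000 extra elements each time means the list is extended once for many reads rather than once per read.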
read_dist.py
  in testing, I find a few SAM lines cause problems
    filter them out (main program)

        bases_mapped = 0
        for line in map:
            if line.startswith('@'):
                # skip header lines
                continue

            field = line.split()
            # print('{}\t{}'.format(field[0], field[8]))
            pos = int(field[3])
            mapq = int(field[4])            # convert, a string never equals 0
            cigar = field[5]

            # filter some lines: zero mapping quality or no alignment
            if mapq == 0 or cigar == '*':
                continue

            nread += 1
            bases_mapped += add_cigar(seq, pos, cigar)

            if nread > 100000:
                break

        print('{} bases mapped'.format(bases_mapped))

        sys.exit(0)
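The filter can be tried on its own as a small function. This is a sketch: keep_line and its min_mapq parameter are illustrative names, and apart from the example read used on the earlier slides the test lines are made up.

```python
def keep_line(line, min_mapq=1):
    """Hypothetical filter: True if a SAM line should be counted."""
    if line.startswith('@'):
        return False                      # header line
    field = line.split()
    mapq = int(field[4])                  # MAPQ is text in the file, compare it as a number
    cigar = field[5]
    return mapq >= min_mapq and cigar != '*'


lines = [
    '@HD\tVN:1.0\tSO:unsorted',                                                        # header
    'SRR5295840.120 163 AT1G07250.1 101 44 1S150M = 347 398 NCT...CAG #A<...FJF',     # good
    'read1 77 * 0 0 * * 0 0 ACGT IIII',                                                # unmapped
    'read2 163 Chr4 500 0 4M = 700 204 ACGT IIII',                                     # MAPQ 0
]

for line in lines:
    print(keep_line(line), line.split()[0])

print('0' == 0)      # why the conversion matters: a string is never equal to an int
```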
read_dist.py
  the seq array is pretty sparse so I wrote a function to compress it

  output (position, coverage)

    125166 35     125169 34     125173 37     125174 39
    125176 40     125179 38     125180 37     125184 39
    125185 40     125190 42     125191 43     125192 44
    125193 48     125194 49     125195 50     125197 52
    125202 56     125204 57     125210 60     125213 61
    125214 62     125215 73     125216 76     125217 77
    125218 84     125219 85     125220 92     125222 91
    125223 93     125224 95     125226 97     125227 99
    125228 100    125229 102    125236 114    125238 110
    125240 112    125241 116    125243 118    125249 116
    125250 121    125252 124    125253 127    125255 125
    125258 126    125261 125    125265 126    125270 125
    125272 123    125290 124    125294 122    125295 121
    125298 119    125302 117    125304 118    125305 117
    125308 4      125342 3      125368 4      125401 3
    125422 4      125452 3      125492 2      125551 1
    127052 0      127136 1      127181 0      127238 1
    128630 0      128631 2      128633 6      128637 7
    128650 9      128654 10     128655 14     128659 16
    128660 21     128661 31     128662 33     128664 35
    128665 36     128667 43     128668 44     128670 46
    128671 48     128672 55     128673 59     128674 60

    def condense_seq(seq):
        """---------------------------------------------------------------------------------------------
        condense the seq array by removing consecutive positions with the same value. This allows
        easier printing. For plotting, each interval ends at the indicated position.
            pos1, val1
            pos2, val2
            ...

        SAM files use a 1 origin, so the first interval is 1 to pos1 with value val1, the second is
        pos1 + 1 to pos2 with val2, etc.

        :param seq: list of int, coverage at each base
        :return: list of tuples, (pos, val)
        ---------------------------------------------------------------------------------------------"""
        compressed = []
        if len(seq) < 2:
            # nothing at position 1 or beyond
            return compressed

        cover = seq[1]
        for pos in range(2, len(seq)):
            if seq[pos] != cover:
                compressed.append((pos - 1, cover))
                cover = seq[pos]

        # the last interval runs to the end of the list
        compressed.append((len(seq) - 1, cover))

        return compressed

  annotation for this region (reverse strand)

    Chr4  gene   122851  125591  ID=AT4G00290;Name=AT4G00290
    Chr4  mRNA   122851  125591  ID=AT4G00290.1;Parent=AT4G00290
    Chr4  3'UTR  122851  123096  ID=AT4G00290:three_prime_UTR:1;Parent=AT4G00290.1;Name=AT4G00290:three_prime_UTR:1
    Chr4  exon   122851  123207  ID=AT4G00290:exon:7;Parent=AT4G00290.1;Name=AT4G00290:exon:7
    Chr4  exon   123362  123583  ID=AT4G00290:exon:6;Parent=AT4G00290.1;Name=AT4G00290:exon:6
    Chr4  exon   123730  123837  ID=AT4G00290:exon:5;Parent=AT4G00290.1;Name=AT4G00290:exon:5
    Chr4  exon   123966  124066  ID=AT4G00290:exon:4;Parent=AT4G00290.1;Name=AT4G00290:exon:4
    Chr4  exon   124271  124548  ID=AT4G00290:exon:3;Parent=AT4G00290.1;Name=AT4G00290:exon:3
    Chr4  exon   124627  125304  ID=AT4G00290:exon:2;Parent=AT4G00290.1;Name=AT4G00290:exon:2
    Chr4  5'UTR  125301  125304  ID=AT4G00290:five_prime_UTR:2;Parent=AT4G00290.1;Name=AT4G00290:five_prime_UTR:2
    Chr4  exon   125477  125591  ID=AT4G00290:exon:1;Parent=AT4G00290.1;Name=AT4G00290:exon:1
    Chr4  5'UTR  125477  125591  ID=AT4G00290:five_prime_UTR:1;Parent=AT4G00290.1;Name=AT4G00290:five_prime_UTR:1
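A worked example of the run-length idea behind condense_seq(), on a short made-up coverage list. The function is restated so the block runs on its own. One point in this sketch is an assumption about the intended behavior: it also writes out the final run, which the loop by itself never reaches.

```python
def condense_seq(seq):
    """Return (last position, coverage) for each run of equal coverage; index 0 is unused."""
    compressed = []
    if len(seq) < 2:
        return compressed

    cover = seq[1]
    for pos in range(2, len(seq)):
        if seq[pos] != cover:
            # the run that just ended stops at the previous position
            compressed.append((pos - 1, cover))
            cover = seq[pos]

    # assumption in this sketch: also report the final run, up to the end of the list
    compressed.append((len(seq) - 1, cover))
    return compressed


# index:  0  1  2  3  4  5  6  7     (index 0 is unused, SAM positions start at 1)
seq =    [0, 0, 0, 2, 2, 2, 1, 0]

for pos, cover in condense_seq(seq):
    print('{}\t{}'.format(pos, cover))
```

Positions 1 to 2 have coverage 0, 3 to 5 have coverage 2, position 6 has coverage 1, and position 7 has coverage 0.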
read_dist.py

    import matplotlib.pyplot as plt
    from matplotlib.ticker import MultipleLocator, FormatStrFormatter

    majorlocator = MultipleLocator(1000)
    minorlocator = MultipleLocator(100)
    majorformatter = FormatStrFormatter('%d')

    fig, ax = plt.subplots(1, 1, figsize=(15, 3))
    # seq is indexed by the 1-based SAM position, so the list index is the x coordinate
    pos = list(range(len(seq)))
    ticks = [i for i in range(3000, 8000) if i % 100 == 0]
    ax.fill(pos, seq, linewidth=0.75)
    # plt.yscale('log')
    plt.xticks(ticks)
    plt.xlim(3000, 8000)
    plt.ylim(0, 150)

    ax.xaxis.set_major_locator(majorlocator)
    ax.xaxis.set_major_formatter(majorformatter)
    ax.xaxis.set_minor_locator(minorlocator)

    ax.set(xlabel='position', ylabel='Read Count', title='Read Distribution')
    ax.grid()

    plt.show()
read_dist.py

    Chr4  exon  4127  4149  ID=AT4G00020:exon:27;Parent=AT4G00020.2;Name=BRCA2(IV):exon:27
    Chr4  CDS   4127  4149  ID=AT4G00020:CDS:27;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:27
    Chr4  exon  4227  4438  ID=AT4G00020:exon:26;Parent=AT4G00020.2;Name=BRCA2(IV):exon:26
    Chr4  CDS   4227  4438  ID=AT4G00020:CDS:25;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:25
    Chr4  CDS   4545  4749  ID=AT4G00020:CDS:22;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:22
    Chr4  CDS   4839  4901  ID=AT4G00020:CDS:21;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:21
    Chr4  CDS   4977  5119  ID=AT4G00020:CDS:19;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:19
    Chr4  CDS   5406  5588  ID=AT4G00020:CDS:18;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:18
    Chr4  CDS   5657  5855  ID=AT4G00020:CDS:17;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:17
    Chr4  CDS   6605  6676  ID=AT4G00020:CDS:16;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:16
    Chr4  CDS   6760  6871  ID=AT4G00020:CDS:15;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:15
    Chr4  CDS   6975  7056  ID=AT4G00020:CDS:14;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:14
    Chr4  CDS   7144  7194  ID=AT4G00020:CDS:13;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:13
    Chr4  CDS   7294  7375  ID=AT4G00020:CDS:12;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:12
    Chr4  CDS   7453  7638  ID=AT4G00020:CDS:11;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:11
    Chr4  CDS   7712  7813  ID=AT4G00020:CDS:10;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:10
    Chr4  CDS   7914  7947  ID=AT4G00020:CDS:8;Parent=AT4G00020.2;Name=BRCA2(IV):CDS:8
read_dist.py
  With exons marked

    # add exon locations
    exon = [('Chr4', 'CDS', 4127, 4149), ('Chr4', 'CDS', 4227, 4438), ('Chr4', 'CDS', 4545, 4749),
            ('Chr4', 'CDS', 4839, 4901), ('Chr4', 'CDS', 4977, 5119), ('Chr4', 'CDS', 5406, 5588),
            ('Chr4', 'CDS', 5657, 5855), ('Chr4', 'CDS', 6605, 6676), ('Chr4', 'CDS', 6760, 6871),
            ('Chr4', 'CDS', 6975, 7056), ('Chr4', 'CDS', 7144, 7194), ('Chr4', 'CDS', 7294, 7375),
            ('Chr4', 'CDS', 7453, 7638), ('Chr4', 'CDS', 7712, 7813), ('Chr4', 'CDS', 7914, 7947)]

    # axhline() takes its x limits as fractions of the axis width (0 to 1), not as positions,
    # so convert each exon to a fraction of the plotted range 3000 to 8000
    span = 8000 - 3000
    for e in exon:
        begin = (e[2] - 3000) / span
        end = (e[3] - 3000) / span
        plt.axhline(5.0, begin, end, color='black', linewidth=6.0)
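The conversion from sequence position to axis fraction can be checked without drawing anything. In this sketch axis_fraction is an illustrative helper name; the limits are the ones passed to plt.xlim() and the two intervals are the first and last CDS from the list above.

```python
def axis_fraction(x, xmin, xmax):
    """Convert a data coordinate to a fraction of the x-axis (0 = left edge, 1 = right edge)."""
    return (x - xmin) / (xmax - xmin)


xmin, xmax = 3000, 8000           # the limits set with plt.xlim()
exon = [('Chr4', 'CDS', 4127, 4149), ('Chr4', 'CDS', 7914, 7947)]

for e in exon:
    begin = axis_fraction(e[2], xmin, xmax)
    end = axis_fraction(e[3], xmin, xmax)
    print('{}-{}: {:.4f} to {:.4f}'.format(e[2], e[3], begin, end))
```

The fractions are only right while the x limits stay at 3000 and 8000. matplotlib's hlines() takes its x limits in data coordinates, which avoids the conversion.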