Yuval Hart, Weizmann 2010©
1 Introduction to Matlab & Data Analysis
Tutorial 13: That’s all, Folks!
Please change to the directory E:\Matlab (cd E:\Matlab;)
From the course website ( ) download: tFinal.zip
2 Outline
Parsing files
Efficient programming - vectorization (Profiling)
Correlation coefficients
Passing extra parameters
Image plotting
Curve Fitting & Optimization
Figure handling
3 “Rotation in 60 minutes”
4 Rotation in 60 minutes: During the past month you’ve measured promoter activity of 20 genes. Your PI wants you to present your results at the next group meeting.
5 To Do List
Get the gene sequences from the GenBank and FASTA files and calculate their GC content
Display all correlation coefficients of the measured promoter activity (PA) and its relation to GC content
For the 4 highest genes, find how the correlation decays with distance from the initial gene in the pathway
7 GenBank file format
8 Step 1: get data from files
Get the DNA sequence from the FASTA file:
% Extract gene sequences from the fasta file
fid_fasta_data = fopen(fnamefasta,'r');
% check that the file was opened correctly
if fid_fasta_data<0
    error('FASTA file name is not correct, please issue file name again');
end
celledFasta=textscan(fid_fasta_data, '%c'); % '%c' is a single character
fclose(fid_fasta_data);
fasta=celledFasta{1}'; % fasta is a char array of the sequence
9 Step 1: get data from files
Put the entire file in a cell array, split into rows:
% open the gene file
fid_gene_input=fopen(fnamegene,'r');
% check that the file was opened correctly
if fid_gene_input<0
    error('GenBank File name is not correct, please issue file name again');
end
% parse the file such that every row of the file is inside a cell element
celledData=textscan(fid_gene_input,'%s','delimiter','\n');
fclose(fid_gene_input);
% remove all white space from the beginning and end of rows
celledDataTrim=strtrim(celledData{1});
10 Step 2: Get gene names and sequence positions from GenBank data
Get the gene names and sequence locations from the GenBank file:
% line format is 'CDS pos1..pos2' or 'CDS complement(pos1..pos2)'
% so CDSposition holds the tokens (pos1, pos2) from each matching row
CDSposition=regexp(celledDataTrim,'^CDS\s+(?:complement\()*(\d+)\.\.(\d+)','tokens');
% indCDS is a logical index of all occurrences of CDS
indCDS=~cellfun('isempty',CDSposition);
% gene name is one row below the CDS info, so shift the index down by one row
% (the column shift has no effect on a column vector)
indGene=circshift(indCDS,[1 1]);
% since the pattern was already matched, only need to check whether there is
% a complement or not. indComplement indicates if it is a complement or
% a regular sequence
indComplement=~cellfun('isempty',regexp(celledDataTrim(indCDS),'complement'));
11 Step 2: Get gene names and sequence positions from GenBank data
Get the gene names and sequence locations from the GenBank file:
% List of genes corresponding to the CDS found
geneNames=regexp(celledDataTrim(indGene),'gene="(\w+)"','tokens');
geneNames=[geneNames{:}]; % flatten the per-row level: each element is now {'name'}
geneNames=[geneNames{:}]; % flatten the token level: a cell array of strings, as ismember needs
% Consider only cell elements that had 'CDS' in them
onlyCDSposition=CDSposition(indCDS);
% Flatten the tokens cell array such that each element of onlyCDSposition
% is a {position 1 (start of gene), position 2 (end of gene)} pair
onlyCDSposition=[onlyCDSposition{:}];
% concatenates as two columns and not in a single row
% (try cat(2,onlyCDSposition{:}))
CDSpositionStartEndCelled=cat(1,onlyCDSposition{:});
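The token extraction and flattening can be tried on a few rows before running it on a whole file. A minimal sketch, assuming four made-up, already trimmed feature rows (the rows and the variable names tok, isCDS, pairs, posCell and posNum are for illustration only):

```matlab
% Hypothetical, already trimmed GenBank feature rows (illustration only)
rows = {'CDS             337..2799'; ...
        '/gene="thrA"'; ...
        'CDS             complement(5683..6459)'; ...
        '/gene="yaaA"'};

% Same pattern as on the slide: two captured tokens, start and end position
tok = regexp(rows, '^CDS\s+(?:complement\()*(\d+)\.\.(\d+)', 'tokens');
isCDS = ~cellfun('isempty', tok);   % logical column, true for the CDS rows

% Flatten one level, then stack the {start,end} pairs as rows
pairs = [tok{isCDS}];               % 1x2 cell, each element a 1x2 cell of strings
posCell = cat(1, pairs{:});         % 2x2 cell array of strings
posNum = str2double(posCell);       % 2x2 numeric matrix of positions
```

Each row of posNum is one gene, with the start position in column 1 and the end position in column 2.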
12 Step 2: Get gene names and sequence positions from GenBank data
Get the index of only the genes we are interested in (found in genePool):
% indGeneList specifies all occurrences of genes in the file that are in the
% "pool"/desired list
indGeneList=ismember(geneNames,genePool);
13 Step 3: Attach every gene name with its DNA sequence
Use the indices to build an array of gene sequences and calculate GC content:
% Numeric start/end positions (assumed step: the slides use
% CDSpositionStartEndNum but do not show this conversion)
CDSpositionStartEndNum=str2double(CDSpositionStartEndCelled);
% Initialize gene list index
j=0;
% Note: i is the index of the vector searched (serial number of the gene in
% the GenBank list), j is the index of the specified genes, e.g. there could
% be only 2 genes but their serial numbers in the GenBank file are 151 and 352,
% therefore i=[151 352] but j=[1 2]
seq=cell(1,sum(indGeneList));
GCcontent=cell(1,sum(indGeneList));
for i=find(indGeneList==1)
    j=j+1;
    % get the sequence from the fasta data by the start and end positions
    seq{j}=fasta(CDSpositionStartEndNum(i,1):CDSpositionStartEndNum(i,2));
    % GCcontent is the fraction of G or C in the sequence
    GCcontent{j}=length(regexp(seq{j},'[GC]'))/length(seq{j});
end
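The GC content of a single sequence can also be computed without regexp, using a logical comparison. A minimal sketch on a made-up sequence (the sequence and the names gcByRegexp and gcByMean are illustrative assumptions):

```matlab
seq = 'ATGCGCGATTACGC';   % hypothetical sequence, upper case assumed

% Approach from the slide: count the regexp matches of G or C
gcByRegexp = length(regexp(seq, '[GC]')) / length(seq);

% Vectorized alternative: mean of a logical mask is the fraction of G or C
gcByMean = mean(seq == 'G' | seq == 'C');
```

Both give the same fraction. Like the slide code, the comparison is case sensitive, so a lower-case sequence would need upper(seq) first.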
14 Step 3: Attach every gene name with its DNA sequence
Build the structure with all needed fields:
% Build the structure Genes with the desired genes and their data:
% name, startPosition, endPosition, sequence, complement (1/0), GCcontent
% This is also the way to preallocate for structures:
% Genes(1,sum(indGeneList))=struct( 'name', [], 'complement', [], 'sequence',[],...
%     'StartPosition',[],'EndPosition',[],'GCcontent',1);
Genes=struct('name',geneNames(indGeneList),...
    'complement', num2cell(indComplement(indGeneList)'),...
    'StartPosition',CDSpositionStartEndCelled(indGeneList,1)',...
    'EndPosition',CDSpositionStartEndCelled(indGeneList,2)',...
    'sequence',seq,'GCcontent',GCcontent);
a=Genes;
Note: struct() assigns values one by one to the elements of a structure array only when the field values are given as cell arrays
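The difference between passing a cell array and passing a value wrapped in an extra pair of braces shows up on a small example. A minimal sketch with made-up gene names and values (names, gc, S and T are hypothetical):

```matlab
names = {'geneA', 'geneB'};   % hypothetical gene names
gc    = {0.48, 0.53};         % hypothetical GC content values

% Cell array inputs are distributed one by one: S is a 1x2 struct array
S = struct('name', names, 'GCcontent', gc);

% Wrapping the cell in another cell stores it whole: T is a 1x1 struct
T = struct('name', {names});
```

Here S(2).name is 'geneB', while T.name holds the entire cell array.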
15 Profiling
Compare runs of these two files: GetGenesData.m and pars_gb_file.m
What are the pitfalls of each one? (Hint: efficiency vs. memory usage.)
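A quick way to see the cost of a non-vectorized loop is to time it with tic and toc; for per-line detail, MATLAB's profiler (profile on, run the code, profile viewer) can be used instead. A minimal sketch on made-up data (it does not use the two course files, whose contents are not shown here):

```matlab
n = 1e5;
x = rand(1, n);

% Loop version: the output grows one element at a time (no preallocation)
tic;
yLoop = [];
for k = 1:n
    yLoop(k) = x(k)^2 + 1; %#ok<AGROW>
end
tLoop = toc;

% Vectorized version: one element-wise operation on the whole array
tic;
yVec = x.^2 + 1;
tVec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
```

The timings depend on the machine and MATLAB version, so they are printed rather than stated here.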
17 Calculate and plot Correlation Matrix
Load the list of genes and measurements:
% Input:
% the measurements mat file contains:
% geneList - a cell array of the gene names
% measurements - a matrix with one row per gene (20 genes), each row holding
%                that gene's measurements at 1001 time points
% GenesGCcontent - a vector of the genes' GC content values
load measurements
18 Plot GC content and mean PA dependence
Plot mean PA vs. GC content with the correlation coefficient:
figure(1);
corrGCvsPA=corrcoef(ScaledGCcontent,MeanPA);
plot(ScaledGCcontent,MeanPA,'or','MarkerSize',8,'LineWidth',2);
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set the plot to full screen
title(sprintf('Mean Promoter Activity vs. GCcontent, Correlation is %2.4f',...
    corrGCvsPA(1,2)),'FontSize',14);
xlabel('Scaled GC content [% deviation from 0.5]','FontSize',14);
ylabel('Mean Promoter Activity [a.u.]','FontSize',14);
hold on;
19 Plot GC content and mean PA dependence
Plot the fit results on top of the previous graph:
% Check for a linear fit to the curve
fittedfunc=polyfit(ScaledGCcontent,MeanPA',1);
plot(ScaledGCcontent,polyval(fittedfunc,ScaledGCcontent),'r','LineWidth',2);
% Smooth the data and then fit to a polynomial:
SmoothPA=smooth(ScaledGCcontent,MeanPA,0.25,'rloess');
% plot the smoothed data set (robust smoothing, shown in blue)
plot(ScaledGCcontent,SmoothPA,'ob','MarkerSize',8,'LineWidth',2);
Smofittedfunc=polyfit(ScaledGCcontent,SmoothPA',1);
plot(ScaledGCcontent,polyval(Smofittedfunc,ScaledGCcontent),'b','LineWidth',2);
text(0.05,2.1,['\leftarrow', sprintf('y= %2.2f x+%2.2f', fittedfunc(1),fittedfunc(2))],...
    'HorizontalAlignment','left','FontSize',18,'Color',[1 0 0]); % See text properties
text(-0.11,4,['\leftarrow',sprintf('y= %2.2f x+%2.2f',Smofittedfunc(1),Smofittedfunc(2))],...
    'HorizontalAlignment','left','FontSize',18,'Color',[0 0 1]); % See text properties
20 Plot GC content and mean PA dependence
Note: smoothing the data can reduce the effect of outliers on the fit
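The polyfit/polyval pair used for the linear fits can be checked on data with a known answer. A minimal sketch on made-up, noise-free points (x, y, p and yhat are illustrative names, not the course data):

```matlab
% Synthetic points on the line y = 3x + 2 (illustration only)
x = linspace(-0.1, 0.1, 20);
y = 3*x + 2;

p = polyfit(x, y, 1);    % p(1) is the slope, p(2) is the intercept
yhat = polyval(p, x);    % fitted values at the same x positions
```

This is the same coefficient order that the text annotations print: the first element multiplies x and the second is the constant term.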
21 Calculate and plot Correlation Matrix
Calculate and display the correlation matrix:
figure(2);
% note that corrcoef works on columns, so we need to transpose measurements
% calculate the correlation matrix of all gene measurements
corrMat=corrcoef(measurements');
colormap('hot'); % set color scheme, popular choices are also: 'jet','hsv'
imagesc(corrMat); % creates the image, data is scaled to span the full colormap range
colorbar; % also plots the color bar in the figure
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set the plot to full screen
% set the ticks to be the gene names and present them at the top of the figure
set(gca,'XTick',1:20,'XTickLabel',geneList,'FontSize',12,'XAxisLocation','top')
% set the ticks to be the gene names
set(gca,'YTick',1:20,'YTickLabel',geneList,'FontSize',12)
title('Gene correlations','FontSize',16);
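Why the transpose is needed can be seen on a small matrix where the expected correlations are obvious. A minimal sketch with three made-up "genes" stored as rows (t, M and C are illustrative names):

```matlab
t = linspace(0, 1, 50);
M = [t; 2*t + 1; -t];   % 3 hypothetical genes (rows) x 50 time points

% corrcoef treats each column as a variable, so transpose to get genes as columns
C = corrcoef(M');       % 3x3 matrix, one row and column per gene
```

Rows 1 and 2 are perfectly correlated and rows 1 and 3 are perfectly anti-correlated, so C(1,2) is 1 and C(1,3) is -1 up to rounding. Without the transpose the result would be a 50x50 matrix of correlations between time points.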
22 Calculate and plot Correlation Matrix
[Figure: the displayed correlation matrix]
23 Calculate and plot Correlation Matrix
If the genes are not ordered, we first need to group the correlations from high to low (here the rows are shuffled to simulate unordered data):
GenesAmount=size(measurements,1);
measurementsPermuted=measurements(randperm(GenesAmount),:);
corrMatPerm=corrcoef(measurementsPermuted');
colormap('hot'); % set color scheme, popular choices are also: 'jet','hsv'
imagesc(corrMatPerm);
% Now we want to group them together by the mean correlation of
% each gene with all other genes:
MeanCorrMatPerm=mean(corrMatPerm);
[sortedCorr, indPerm]=sort(MeanCorrMatPerm,'descend');
imagesc(corrMatPerm(indPerm,indPerm));
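The reordering step applies the same permutation to rows and columns, which keeps the matrix symmetric with ones on the diagonal. A minimal sketch on a made-up 3x3 correlation matrix (C, idx and Csorted are illustrative names):

```matlab
% Hypothetical correlation matrix of 3 genes (illustration only)
C = [1.0 0.2 0.9;
     0.2 1.0 0.1;
     0.9 0.1 1.0];

% Order the genes by their mean correlation with all genes, highest first
[~, idx] = sort(mean(C), 'descend');

% Apply the same order to rows and columns
Csorted = C(idx, idx);
```

The column means are 0.70, 0.43 and 0.67, so the order is gene 1, gene 3, gene 2. Sorting by mean correlation is a simple heuristic rather than a full clustering method.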
26 Step 1: initialize and set parameters
Set figure parameters and external fit parameters of the curves:
figure(3);
set(gcf,'units','normalized','outerposition',[0 0 1 1]); % set the plot to full screen
% we want to check if a vertical displacement helps, so we add the variable initDis,
% which is part of the fitting function formula
initDis=-0.1;
GenesAmount=size(measurements,1);
27 Step 2: Fit correlations to the desired function
Use an anonymous function to pass extra parameters, and fit using lsqcurvefit:
for i=1:numGenesToPlot
    % assign the current correlation values: row i, columns after the diagonal
    correl=corrMat(i,(1+i):end);
    % the function passed to lsqcurvefit can have only two inputs,
    % yet we use three: fitting parameters, x values and initial displacement
    paramfunc %definition of the anonymous function
    c0=[.7 0.1]; % initial values for the fit search
    XdataPoints=(1+i):GenesAmount;
    options = optimset('TolFun',1e-8,'GradObj','on'); % default TolFun=1e-6
    % lsqcurvefit(function name,init guess,xdata,ydata,lower bound,upper bound,options)
    ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);
end
28 Step 2: Fit correlations to the desired function
Use an anonymous function to pass extra parameters, and fit using lsqcurvefit:
function y_hat=FittingCurveExpGuess(c,x,init)
% This assumes an exponentially decreasing curve
y_hat=init+c(1)*exp(c(2).*x);

initDis=-0.1;
c0=[.7 0.1]; % initial values for the fit search
paramfunc %def. of the anonymous function
ExpParam=lsqcurvefit(paramfunc,c0,XdataPoints,correl,[0 -1],[1 1],options);
% lsqcurvefit arguments, in order: function name, initial guess, x data, y data,
% lower bound, upper bound
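The definition of paramfunc is not preserved on these slides. One plausible form, given the description of a two-input function that carries a third parameter, is an anonymous function that captures initDis. The sketch below is an assumption, not the course code: the model is written as an anonymous function so the snippet is self-contained, the data are synthetic, and the base function fminsearch is swapped in for lsqcurvefit so that no toolbox is needed.

```matlab
% Three-input model with the same formula as FittingCurveExpGuess
model = @(c, x, init) init + c(1)*exp(c(2).*x);

initDis = -0.1;                            % extra parameter, fixed before fitting
paramfunc = @(c, x) model(c, x, initDis);  % two-input wrapper that captures initDis

% Synthetic, noise-free data generated from known parameters (illustration only)
x = 1:10;
cTrue = [0.8 -0.3];
y = paramfunc(cTrue, x);

% Sum of squared errors, minimized with fminsearch instead of lsqcurvefit
sse = @(c) sum((paramfunc(c, x) - y).^2);
opts = optimset('TolX', 1e-10, 'TolFun', 1e-12, ...
                'MaxFunEvals', 5000, 'MaxIter', 5000);
cFit = fminsearch(sse, [0.7 -0.1], opts);
```

The value of initDis is captured when paramfunc is created, so changing initDis afterwards does not affect the wrapper. Unlike lsqcurvefit, fminsearch does not take lower and upper bounds.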
29 Step 3: Plot the correlation data and fit
Plot with dots, each subplot with its own gene names and curve fit parameters:
for i=1:numGenesToPlot
    % missing parts on previous slides…
    % Plot the correlation graph with the found parameters:
    subplot(numGenesToPlot,1,i);
    plot(XdataPoints,correl,'ob',...
        XdataPoints,initDis+ExpParam(1)*exp((XdataPoints).*ExpParam(2)),'r','LineWidth',2);
    set(gca,'XTick',XdataPoints,'XTickLabel',geneList(XdataPoints),'FontSize',12);
    set(gca,'YLim',[0 max(correl)+0.1]);
    title(sprintf(['%s Correlation Data, Fit parameters: c1=%2.2f, c2=%2.2f, ',...
        'Displacement=%2.2f '],geneList{i},ExpParam(1),ExpParam(2),initDis),'FontSize',14);
end
30 Step 3: Plot the correlation data and fit
[Figure: correlation data and fitted curves, one subplot per gene]
32 Best of Luck in the Group Meeting! (and the exam)
33 What did we learn?
Matlab syntax
Array manipulation, Cells, Structures
Programming: Functions
Writing efficient code
Files & strings manipulation
Data analysis and Signal Processing
35 This is the end, my friend, the end "Louis, I think this is the beginning of a beautiful friendship."