STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology STAT115 Lab 3 PART I Homework Q8 The Dot Matrix Method.

1 STAT115 Lab 3, Part I: Homework Q8, The Dot Matrix Method

2 The Dot Matrix Method
A general way to see similarities in pairwise comparisons:
– Gets you started thinking about sequence alignment in general.
– Provides a ‘Gestalt’ of all possible alignments between two sequences.
– To begin, I will use a very simple identity scoring function (1 for a match, 0 for a mismatch) without any windowing.
– As you will see later today, more complex scoring functions are normally used in sequence analysis (especially with amino acid sequences).
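
The identity scoring described on this slide can be sketched in a few lines of R. This is only an illustration, not the course's own code: the helper name `dot_matrix` is hypothetical, and the test sequence SEQUENCEANALYSISPRIMER is an assumption inferred from the words the slides mention (“ANA”, “SIS”, ‘PRIMER’).

```r
# Minimal sketch: 1/0 identity dot matrix, no windowing.
# dot_matrix is a hypothetical helper name, not a base R function.
dot_matrix <- function(seq1, seq2) {
  a <- strsplit(seq1, "")[[1]]   # split each sequence into single letters
  b <- strsplit(seq2, "")[[1]]
  m <- outer(a, b, "==") * 1L    # 1 where the letters match, 0 otherwise
  dimnames(m) <- list(a, b)      # label rows and columns with the letters
  m
}

s <- "SEQUENCEANALYSISPRIMER"    # assumed example sequence
m <- dot_matrix(s, s)            # intra-sequence comparison
m[1:8, 1:8]                      # top-left corner: the main diagonal is all 1s
```

Passing `which(m == 1, arr.ind = TRUE)` to `plot()` would draw the matrix as a dotplot, with rows as one sequence and columns as the other.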

3 Since this is a comparison of a sequence with itself (an intra-sequence comparison), the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses on the main diagonal; they are “ANA” and “SIS.”

4 The biggest asset of dot matrix analysis is that it lets you visualize the entire comparison at once, without concentrating on any one ‘optimal’ region, giving you the ‘Gestalt’ of the whole thing.

5 Consider a ‘mutated’ inter-sequence comparison (again with the 1:0 match score function). Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel.
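
The register shift caused by an indel can be made visible numerically by counting the dots on each diagonal. The sketch below is an illustration under assumptions: both sequences are hypothetical (the slide's figure did not survive), with "YSIS" removed from the second one.

```r
# Hypothetical pair of sequences: the second one lacks "YSIS".
a <- strsplit("SEQUENCEANALYSISPRIMER", "")[[1]]
b <- strsplit("SEQUENCEANALPRIMER", "")[[1]]

m <- outer(a, b, "==")        # TRUE where letters match (rows = a, columns = b)

# Offset of every cell from the main diagonal: column index minus row index.
offset <- col(m) - row(m)

# Number of dots on each diagonal; long runs pile up on a few offsets.
dots_per_diagonal <- table(offset[m])
sort(dots_per_diagonal, decreasing = TRUE)
```

Before the deletion the matches sit on offset 0; after it they move to offset -4, which is the length of the removed segment.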

6 Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, or direct repeats. The ‘duplication’ is seen as a distinct column of diagonals; whenever you see either a row or a column of diagonals in a dotplot, you are looking at direct repeats.
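
A small check of how a direct repeat shows up, again as a sketch with hypothetical sequences (the second one carries an assumed duplication of "ANALYSIS"):

```r
# Hypothetical pair: "ANALYSIS" occurs once in a and twice in b.
a <- strsplit("SEQUENCEANALYSISPRIMER", "")[[1]]
b <- strsplit("SEQUENCEANALYSISANALYSISPRIMER", "")[[1]]

m <- outer(a, b, "==")                 # rows = a, columns = b

rows <- 9:16                           # positions of "ANALYSIS" in a
first_copy  <- all(m[cbind(rows, 9:16)])    # diagonal against the first copy in b
second_copy <- all(m[cbind(rows, 17:24)])   # diagonal against the second copy in b
c(first_copy, second_copy)
```

Both diagonal segments cover the same rows, so they line up next to each other. Whether that reads as a row or a column of diagonals depends on which sequence is drawn on which axis.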

7 Now consider a more complicated ‘mutation.’ Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the few sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer at the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.

8 Filtered Windowing
Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, that is, to the composition of the sequences themselves. How can we ‘clean up’ the plots so that this noise does not detract from our interpretations?
Consider a filtered windowing approach: a dot is placed only if some ‘stringency’ is met. Within a window of some defined size, a dot is placed at the middle of the window if, and only if, some defined criterion is met. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.

9 In this plot a window of size three and a stringency of two are used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).
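
One way to read the window-and-stringency rule is sketched below. It is an interpretation, not the software behind the slides: the window is assumed to run along the diagonal, the window size is assumed to be odd, windows are truncated at the edges of the matrix, and the function name and test sequence are hypothetical.

```r
# Sketch of filtered windowing on a 1/0 identity dot matrix.
# A dot is kept at (i, j) when at least `stringency` of the `window`
# diagonal positions centred on (i, j) are matches.
filtered_dot_matrix <- function(seq1, seq2, window = 3, stringency = 2) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  m <- outer(a, b, "==") * 1L
  half <- (window - 1) %/% 2            # assumes an odd window size
  k <- -half:half                       # steps along the diagonal
  out <- matrix(0L, nrow(m), ncol(m))
  for (i in seq_len(nrow(m))) {
    for (j in seq_len(ncol(m))) {
      ii <- i + k
      jj <- j + k
      inside <- ii >= 1 & ii <= nrow(m) & jj >= 1 & jj <= ncol(m)
      score <- sum(m[cbind(ii[inside], jj[inside])])  # matches in the window
      if (score >= stringency) out[i, j] <- 1L
    }
  }
  out
}

s <- "SEQUENCEANALYSISPRIMER"           # assumed example sequence
raw <- outer(strsplit(s, "")[[1]], strsplit(s, "")[[1]], "==") * 1L
filtered <- filtered_dot_matrix(s, s, window = 3, stringency = 2)
c(raw = raw[2, 5], filtered = filtered[2, 5])   # an isolated E/E dot is dropped
```

Raising the stringency towards the window size removes more noise but also hides short or imperfect matches.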

10 TUTORIAL I, LAB 3
Alejandro Quiroz-Zárate
Daniel Fernandez

11 A little history
R is a dialect of the S language
– 1976: John Chambers and others at Bell Labs create S
– 1991: R is created in New Zealand by Ross Ihaka and Robert Gentleman
– 1993: First announcement of R to the public
– 1995: Martin Mächler convinces Ihaka and Gentleman to use the GNU General Public License, making R FREE
– 1997: The R Core Group is formed, controlling the source code
– 2000: R version 1.0.0 is released
– 2011: R version 2.12.1 (current as of 16 December 2010)

12 Essentially we work with a 40-year-old technology!
R is divided into 2 parts
– The BASE system: what comes with the download from CRAN (Comprehensive R Archive Network)
– The packages that you download, based on your needs!!! Over 1000 packages on CRAN
– http://www.r-project.org/
Last but NOT least
– R is FREE!!!!!!

13 Outline
The Console and the Script
– Workspace management
Objects
– Classes and Mode
– Some Classes: Vectors, Matrices and data.frames
– Some Modes: Lists, strings
Loops and conditional statements
Functions
– R functions
– My own functions
Handling data
– Reading and writing!
Plotting!
Libraries
Exercises

14 Getting started
The Console
– Essentially where the commands are executed
The Script
– Where the code is written

15 An R session
[Screenshot labels: Type code here; Adjust/Extend code; Output appears]

16 Workspace Management
Before jumping into R, it is important to ask ourselves:
– Where am I?
  getwd()
– I want to be there…
  setwd("C:/")
– Who is with me?
  dir()   # lists all the files in the working directory
– Whom can I count on?
  ls()    # lists all the variables in the current session

17 Workspace Management (2)
Saving
– save(x,file="name.RData")   # saves specific objects
– save.image("name.RData")    # saves the whole workspace
Loading
– load("name.RData")
‘?function’ and ‘??function’
– ?  gets the documentation of the function
– ?? finds functions related to the query

18 R Objects
Almost all things in R are OBJECTS!
– Functions, datasets, results, etc. (but not graphs)
OBJECTS are classified by two criteria:
– MODE: how objects are stored in R
  character, numeric, logical, list, function, …
  To obtain the mode of an object: mode(object)
– CLASS: how objects are treated by functions
  vector, matrix, array, data.frame, factor, …
  To obtain the class of an object: class(object)

19 R Objects (2)
[Figure: a table with columns x1 to x6 and rows 1 to 8]
MODE: determined by the type of things stored (numbers, characters, Booleans, …)
– If only numbers: numeric
– If it is a mixture: list
CLASS: determined by how functions deal with the object
– If only numbers: matrix
– If it is a mixture: data.frame

20 Some classes
Vectors!!!
– x=c(10,5,3,6)
– Calculations on a vector are performed on each entry
  y=c(log(x),x,x^2)
– Vectors in an operation need not have the same length; the shorter one is recycled
  w=sqrt(x)+2
  z=c(pi,exp(1),sqrt(2))
  x+z   # works, with a warning because 4 is not a multiple of 3
– Logical vectors
  aux=x<7
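
The recycling behaviour on this slide is easy to check directly. The sketch below reuses the slide's vectors; `s` is an extra name introduced here, and `suppressWarnings()` is used only to keep the length-mismatch warning out of the output.

```r
x <- c(10, 5, 3, 6)

w <- sqrt(x) + 2                  # the single value 2 is recycled to length 4

z <- c(pi, exp(1), sqrt(2))       # length 3
s <- suppressWarnings(x + z)      # z is reused as pi, e, sqrt(2), pi
s

aux <- x < 7                      # logical vector: FALSE TRUE TRUE TRUE
x[aux]                            # logical vectors can select entries: 5 3 6
sum(aux)                          # TRUE counts as 1, so this counts entries below 7
```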

21 Some classes (2)
Matrices!!!
– x=1:8
– dim(x)=c(2,4)
– y=matrix(1:8,2,4,byrow=F)
– Operations are applied to each element
  x*x, max(x)
  x=matrix(1:28,ncol=4), y=7:10, so then x*y is…?
– Matrix multiplication
  y=matrix(1:8,ncol=2)
  y%*%t(y)
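
The difference between `*` and `%*%` can be checked with the slide's own values. In this sketch the result names `p`, `y2` and `g` are introduced only for illustration.

```r
x <- matrix(1:28, ncol = 4)    # 7 x 4, filled column by column
y <- 7:10

p <- x * y                     # element-wise: y is recycled down the columns of x
p[1:4, 1]                      # 1*7, 2*8, 3*9, 4*10
p[1, 2]                        # the 8th element of x meets y[4]: 8 * 10

y2 <- matrix(1:8, ncol = 2)    # 4 x 2
g <- y2 %*% t(y2)              # true matrix product: (4 x 2) times (2 x 4)
dim(g)
```

Because 28 is a multiple of 4, the recycling in `x * y` happens without a warning, which makes it easy to miss that `y` is not applied row by row.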

22 Some classes (3)
Extracting info
– y[1,] or y[,1]
Extending matrices
– cbind(y,seq(101,104))
– rbind(y,c(102,109))
apply is a useful function!
– apply(y,2,mean)
– apply(y,1,log)
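
A short sketch of what the two `apply` calls return, assuming `y` is the 4 x 2 matrix from the previous slide (the result names are introduced here):

```r
y <- matrix(1:8, ncol = 2)        # assumed: the 4 x 2 matrix from the previous slide

col_means <- apply(y, 2, mean)    # MARGIN = 2: one mean per column
col_means

row_logs <- apply(y, 1, log)      # MARGIN = 1: log of each row
dim(row_logs)                     # each row's result becomes a COLUMN, so 2 x 4

first_row <- y[1, , drop = FALSE] # drop = FALSE keeps the result a 1 x 2 matrix
dim(first_row)
```

When the applied function returns a vector, `apply` over rows hands back the transpose of what one might expect; `t(apply(y, 1, log))` restores the original orientation.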

23 Some classes (4)
data.frame!!!
– Creation: there are several ways to create a data frame
  1) logical=sample(c(T,F),size=20,replace=T)
     numeric=rnorm(20)
     my.df=data.frame(logical,numeric)
  2) test=matrix(rnorm(21),7,3)
     test=data.frame(test)
– class(my.df[1,])   # a single row is still a data.frame

24 A mode
Lists!!!
– A list is like a vector, but an element of a list can be an object of any type and structure
  x1=1:5
  x2=c(T,T,F,T,F)
  y=list(numbers=x1,questions=x2)
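
Getting elements back out of a list is where it differs most from a vector. The sketch below reuses the slide's list; the `label` element is a made-up addition for illustration.

```r
x1 <- 1:5
x2 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
y <- list(numbers = x1, questions = x2)

y$numbers            # access by name
y[["questions"]]     # same idea with double brackets

class(y[1])          # single brackets return a (shorter) list
class(y[[1]])        # double brackets return the element itself

y$label <- "demo"    # hypothetical extra element: lists can grow and mix types
mode(y)              # "list"
length(y)
```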

25 Loops and conditional statements
if
– Example
  a=9
  if(a<0) {
    print("Negative number")
  } else {
    print("Non-negative number")
  }

26 for
  z=rep(1,10)
  for (i in 2:10) {
    z[i]=z[i]+exp(1)*z[i-1]
  }
while
  n=0
  tmp=0
  while(tmp<100) {
    tmp=tmp+rbinom(1,10,0.5)
    n=n+1
  }
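
To see what the two loops compute, they can be run and compared with a vectorized form. This is a sketch: `closed_form` and the call to `set.seed()` are additions for illustration, the latter only so that the random draws are repeatable.

```r
# for: each entry is 1 plus e times the previous entry
z <- rep(1, 10)
for (i in 2:10) {
  z[i] <- z[i] + exp(1) * z[i - 1]
}

# The recurrence gives z[i] = 1 + e + e^2 + ... + e^(i-1),
# which is a cumulative sum and needs no loop.
closed_form <- cumsum(exp(0:9))
all.equal(z, closed_form)

# while: count how many Binomial(10, 0.5) draws are needed to reach 100
set.seed(1)            # added here so the run is repeatable
n <- 0
tmp <- 0
while (tmp < 100) {
  tmp <- tmp + rbinom(1, 10, 0.5)
  n <- n + 1
}
n                      # at least 10, since one draw adds at most 10
```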

27 Functions!
My own functions
– function.name=function(arg1,arg2,…,argN) {
    Body of the function
  }
– fun.plot=function(y,z) {
    y=log(y)*z-z^3+z^2
    plot(z,y)
  }
– z=seq(-11,10)
– y=seq(11,32)
– fun.plot(y,z)

28 Functions! (2)
The ‘...’ argument
– Can be used to pass arguments from one function to another without the need to specify them in the header
  fun.plot=function(y,z,...) {
    y=log(y)*z-z^3+z^2
    plot(z,y,...)
  }
  fun.plot(y,z,type="l",col="red")
  fun.plot(y,z,type="l",col="red",lwd=4)
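
The same mechanism works for any inner function, not just `plot`. A minimal sketch without graphics, where `center` is a hypothetical function that forwards its extra arguments to `mean`:

```r
# center() is a hypothetical example: anything passed after x
# is handed on to mean() unchanged through '...'.
center <- function(x, ...) {
  x - mean(x, ...)
}

v <- c(1, 2, NA, 3)
center(v)                            # mean() returns NA, so every entry is NA
centered <- center(v, na.rm = TRUE)  # na.rm travels through '...' to mean()
centered                             # -1  0 NA  1
```

The header of `center` never mentions `na.rm`, which is the point of the slide: the wrapper stays short while still exposing the options of the function it calls.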

29 Handling data
I/O
Reading files
– read.csv("filename.csv")    # reads csv files into a data.frame
– read.table("filename.txt")  # reads txt files in a table format into a data.frame
– scan(filename)              # not friendly for matrices or tables!!!
Writing to files
– write(x,file="filename")    # writes the object x to filename
– write.table(x,filename)     # writes the object x to filename in a table format
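
A round trip through a file shows how the reading and writing functions pair up. This is a sketch with made-up data; the data frame, the column names and the temporary file paths are all hypothetical.

```r
# Hypothetical data, written to temporary files so nothing is left behind
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)    # row.names = FALSE avoids an extra column
from_csv <- read.csv(csv_path)

txt_path <- tempfile(fileext = ".txt")
write.table(df, txt_path, row.names = FALSE)  # space-separated, header written by default
from_txt <- read.table(txt_path, header = TRUE)  # header = TRUE must be asked for here

all.equal(df, from_csv)
all.equal(df, from_txt)
```

Note the asymmetry: `read.csv` assumes a header line by default, while `read.table` does not.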

30 Plotting!
  x.data=rnorm(1000)
  y.data=x.data^3-10*x.data^2
  z.data=-0.5*y.data-90
  plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
  points(x.data,z.data,col="red")
  legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))

31 Plotting! (2)
You can export graphs in many formats
– To check the formats that are available in your R installation
  capabilities()
– png (width and height in pixels)
  png("Lab2_plot.png",width=520,height=440)
  plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
  points(x.data,z.data,col="red")
  legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
  dev.off()
– eps (width and height in inches, not pixels)
  postscript("Lab2_plot.eps",width=7,height=6)
  # plot as above, then close the device with dev.off()

32 Libraries!!
A library is a collection of R functions that together perform a specialized analysis or task.
Installing packages from CRAN
– install.packages("PackageName")
Loading libraries
– library(LibraryName)
Getting the documentation of a library
– library(help=LibraryName)
Listing all the available packages
– library()

33 Exercise 1 – Probability Transform
We know that [formula missing], and we want to know the probability associated with [formula missing].
(a) Plot the theoretical pdf and cdf of X.
(b) Generate 10,000,000 observations of the random variable X.
(c) Compute Y = 3X^5 + 4X^2 - 7.
(d) Estimate the probability that [formula missing].
(e) Plot the histogram and empirical CDF of Y.

34 Exercise 2 – The empire strikes back: GOOG versus BAIDU
Plot historical stock price time series using prices from Yahoo Finance.
(a) Download and install the tseries package.
(b) Include the tseries package as a library in your code.
(c) Use get.hist.quote to download GOOG and BAIDU historical data.
(d) Plot both time series in the same panel and add a legend to the plot.

35 Exercise 3 – Challenging Challenger
On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.

36 Question 3 – Challenging Challenger
(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?
(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?
(c) What is your conclusion? What do you think the scientists plotted before taking the decision to fly that day?
Just a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?

