Presentation is loading. Please wait.

Presentation is loading. Please wait.

Analyzing Local Properties Many local properties are important for the function of your protein –Hydrophobic regions are potential transmembrane domains.

Similar presentations


Presentation on theme: "Analyzing Local Properties Many local properties are important for the function of your protein –Hydrophobic regions are potential transmembrane domains."— Presentation transcript:

1 Analyzing Local Properties Many local properties are important for the function of your protein –Hydrophobic regions are potential transmembrane domains –Coiled-coiled regions are potential protein-interaction domains –Hydrophilic stretches are potential loops You can discover these regions –Using sliding-widow techniques (easy) –Using prediction methods such as hidden Markov Models (more sophisticated)

2 Sliding-window Techniques Ideal for identifying strong signals Very simple methods –Few artifacts –Not very sensitive Use ProtScale on www.expasy.org Make the window the same size as the feature you’re looking for

3 www.expasy.org/cgi-bin/protscale.pl

4 Hphob. / Eisenberg

5 Using TMHMM TMHMM is the best method for predicting transmembrane domains TMHMM uses an HMM Its principle is very different from that of ProtScale TMHMM output is a prediction

6 Searching for PROSITE Patterns Search your protein against PROSITE on ExPAsy –www.expasy.org/tools/scanprosite PROSITE motifs are written as patterns –Short patterns are not very informative by themselves –They only indicate a possibility –Combine them with other information to draw a conclusion Remember: Not everything is in PROSITE !

7 www.expasy.org/tools/scanprosite P12259

8 www.expasy.org/tools/scanprosite

9 Protein Domains Proteins are usually made of domains A domain is an autonomous folding unit Domains are more than 50 amino acids long It’s common to find these together: –A regulatory domain –A binding domain –A catalytic domain

10 www.ebi.ac.uk/InterProScan

11 www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

12

13 Secondary Structures Helix –Amino acid that twists like a spring Beta strand or extended –Amino acid forms a line without twisting Random coils –Amino acid with a structure neither helical nor extended –Amino-acid loops are usually coils

14 bioinf.cs.ucl.ac.uk/psipred//?program=psipred

15

16

17 Servers www.predictprotein.org cubic.bioc.columbia.edu/predictprotein www.sdsc.edu/predicprotein www.cbi.pku.edu.cn/predictprotein

18 www.rcsb.org

19

20 ncbi.nlm.nih.gov/BLAST

21 zhanglab.ccmb.med.umich.edu/I-TASSER/

22

23 http://zhanglab.ccmb.med.umich.edu/I-TASSER/output/S102840/

24 R-programming

25 Introduction R is: –a suite of operators for calculations on arrays, in particular matrices, –a large, coherent, integrated collection of intermediate tools for interactive data analysis, –graphical facilities for data analysis and display either directly at the computer or on hardcopy –a well developed programming language which includes conditionals, loops, user defined recursive functions and input and output facilities. The core of R is an interpreted computer language. –It allows branching and looping as well as modular programming using functions. –Most of the user-visible functions in R are written in R, calling upon a smaller set of internal primitives. –It is possible for the user to interface to procedures written in C, C++ or FORTRAN languages for efficiency, and also to write additional primitives.

26 R and statistics oPackaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors oStatistics: most packages deal with statistics and data analysis oState of the art: many statistical researchers provide their methods as R packages

27 Data Analysis and Presentation The R distribution contains functionality for large number of statistical procedures. –linear and generalized linear models –nonlinear regression models –time series analysis –classical parametric and nonparametric tests –clustering –smoothing R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.

28 R as a calculator > log2(32) [1] 5 > sqrt(2) [1] 1.414214 > seq(0, 5, length=6) [1] 0 1 2 3 4 5 > plot(sin(seq(0, 2*pi, length=100)))

29 Object orientation primitive (or: atomic) data types in R are: numeric (integer, double, complex) character logical function out of these, vectors, arrays, lists can be built.

30 Object orientation Object: a collection of atomic variables and/or other objects that belong together Example: a microarray experiment probe intensities patient data (tissue location, diagnosis, follow-up) gene data (sequence, IDs, annotation) Parlance: class: the “abstract” definition of it object: a concrete instance method: other word for ‘function’ slot: a component of an object

31 Object orientation Advantages: Encapsulation (can use the objects and methods someone else has written without having to care about the internals) Generic functions (e.g. plot, print) Inheritance (hierarchical organization of complexity) Caveat: Overcomplicated, baroque program architecture…

32 variables > a = 24 > b<-25 > sqrt(a+b) [1] 7 > a = "The dog ate my homework" > sub("dog","cat",a) [1] "The cat ate my homework" > a = (1+1==3) > a [1] FALSE numeric character string logical

33 variables > paste("X", "Y") > paste("X", "Y", sep = " + ") > paste("Fig", 1:4) > paste(c("X", "Y"), 1:4, sep = "", collapse = " + ")

34 x<-2.17 y<-as.character(x) z<-as.numeric(y) Help(as)

35

36 vectors, matrices and arrays vector: an ordered collection of data of the same type > a = c(1,2,3) > a*2 [1] 2 4 6 Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers In R, a single number is the special case of a vector with 1 element. Other vector types: character strings, logical

37 vectors, matrices and arrays matrix: a rectangular table of data of the same type example: the expression values for 10000 genes for 30 tissue biopsies: a matrix with 10000 rows and 30 columns. array: 3-,4-,..dimensional matrix example: the red and green foreground and background values for 20000 spots on 120 chips: a 4 x 20000 x 120 (3D) array.

38 Lists vector: an ordered collection of data of the same type. > a = c(7,5,1) > a[2] [1] 5 list: an ordered collection of data of arbitrary types. > doe = list(name="john",age=28,married=F) > doe$name [1] "john " > doe$age [1] 28 Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.

39 Data frames data frame: is like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. Example: > a localisation tumorsize progress XX348 proximal 6.3 FALSE XX234 distal 8.0 TRUE XX987 proximal 10.0 FALSE

40 id<-c("xx348", "xx234", "xx987") locallization<-c("proximal", "distal", "proximal") progress<-c(F, T, F) tumorsize<-c(6.3, 8.0, 10.0) results<-data.frame(id, locallization, tumorsize, progress) > results id locallization tumorsize progress 1 xx348 proximal 6.3 FALSE 2 xx234 distal 8.0 TRUE 3 xx987 proximal 10.0 FALSE

41 results<-edit(results)

42 > summary(results) id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00 >x<-summary(results) >x id locallization tumorsize progress xx234:1 distal :1 Min. : 6.30 Mode :logical xx348:1 proximal:2 1st Qu.: 7.15 FALSE:1 xx987:1 Median : 8.00 TRUE :2 Mean : 8.10 NA's :0 3rd Qu.: 9.00 Max. :10.00

43 Subsetting Individual elements of a vector, matrix, array or data frame are accessed with “[ ]” by specifying their index, or their name > results localisation tumorsize progress XX348 proximal 6.3 0 XX234 distal 8.0 1 XX987 proximal 10.0 0 > results[3, 2] [1] 10 > a["XX987", "tumorsize"] [1] 10 > results["XX987",] localisation tumorsize progress XX987 proximal 10 0

44 Subsetting > results localisation tumorsize progress XX348 proximal 6.3 0 XX234 distal 8.0 1 XX987 proximal 10.0 0 > results[c(1,3),] localisation tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 > results[c(T,F,T),] localisation tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 > results$localisation [1] "proximal" "distal" "proximal" > results $localisation=="proximal" [1] TRUE FALSE TRUE > results[ results$localisation=="proximal", ] localisation tumorsize progress XX348 proximal 6.3 0 XX987 proximal 10.0 0 subset rows by a vector of indices subset rows by a logical vector subset a column comparison resulting in logical vector subset the selected rows

45 results[2,] results[2,2] results[1:3,] results[c(1,3),] results[c(T,F,T),] x<-summary(results) x X[2,2]

46 x = c(1, 1, 2, 3, 5, 8) x[c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)] x[c(TRUE, FALSE)] x == 1 x[x == 1] x[x%2 == 0] y = c(1, 2, 3) y[]=3 y

47 Matrix a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows only one mode (numeric, character, complex, or logical) allowed can be created using matrix() x<-matrix(data=0,nr=2,nc=2) or x<-matrix(0,2,2)

48 Data Frame several modes allowed within a single data frame can be created using data.frame() L<-LETTERS[1:4] #A B C D x<-1:4 #1 2 3 4 data.frame(x,L) #create data frame attach() and detach() –the database is attached to the R search path so that the database is searched by R when it is evaluating a variable. –objects in the database can be accessed by simply giving their names

49 a=matrix(1:9, ncol = 3, nrow = 3) a b=matrix(c(TRUE, FALSE, TRUE), ncol = 3, nrow = 3) b x=1:10 y=11:20 z=matrix(c(x,y)) z z=matrix(c(x,y),nrow=2) z z=matrix(c(x,y),nrow=4) z

50 R code max(z) min(z) length(z) mean(z) sd(z) sum(z)

51 index=c(15,27,34,10,9) welcome=c(13,26,30,10,7) paper=c(2,1,3,0,1) days=c("mon", "tues", "wed", "thurs", "fri") filenames=c("index.html", "welcom.png", "paper.pdf") downloads=matrix(c(index,welcome,paper), nrow=5, dimnames=list(days,filenames)) downloads filesizes = c(1624, 23172, 1234065) downloads%*%filesizes image(as.matrix(downloads))

52 Factors A character string can contain arbitrary text. Sometimes it is useful to use a limited vocabulary, with a small number of allowed words. A factor is a variable that can only take such a limited number of values, which are called levels.

53 expression<-factor(c("over","under","over","unchanged","under","under")) levels(expression)

54 protein<-list("glucose oxidas", "1CF3", 63355) protein protein<-list(name="glucose oxidas", accession="1CF3", weight=63355) x<-c(16614, 50660, 6066, 6118) protein$GOIDs<-x protein

55 class(protein) length(protein) attributes(protein)

56 Working directory setwd("D:/data") x<-read.table("profiles.csv", sep=",", header=TRUE)

57 x<-read.table("http://www.bixsolutions.net/profiles.csv", sep=",", header=TRUE) matplot(x, type="l") matplot(x, type="l", xlab="fraction", ylab="quantity", col=1:6, lty=1:5, lwd=2) lty: line style lwd: line width

58 xmax<-apply(x, 2, max) xmax ymax<-apply(x, 1, max) ymax Apply the max function on columns (2) or rows (1) of matrix x

59 cummean = function(x){ n = length(x) y = numeric(n) z = c(1:n) y = cumsum(x) y = y/z return(y) } n = 10000 z = rnorm(n) x = seq(1,n,1) y = cummean(z) X11() plot(x,y,type= 'l',main= 'Convergence Plot') Apply the max function on columns (2) or rows (1) of matrix x


Download ppt "Analyzing Local Properties Many local properties are important for the function of your protein –Hydrophobic regions are potential transmembrane domains."

Similar presentations


Ads by Google