Continuous Representations of Time-Series Gene Expression Data Ziv Bar-Joseph, Georg Gerber, David K. Gifford MIT Laboratory for Computer Science J. Comput. Biol., 10, 2003
Outline – Splines – Estimating Unobserved Expression Values and Time Points – Model Based Clustering Algorithm for Temporal Data – Aligning Temporal Data – Results
Splines The word “spline” comes from the shipbuilding industry
Splines Splines are piecewise polynomials with boundary continuity and smoothness constraints. The typical way to represent a piecewise cubic curve :
Splines – With q cubic polynomial pieces there are 4q coefficients – So 4q equations are required to determine them – Interpolating splines
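The counting argument behind this slide can be written out as follows. This is the standard textbook form of an interpolating cubic spline, not a reconstruction of the slide's own equations; the symbols q (number of pieces), x_j (knots), D_j (data values) and the choice of natural boundary conditions are assumptions made for illustration.

```latex
% q cubic pieces on knots x_1 < x_2 < \dots < x_{q+1}: 4q unknown coefficients
y_j(t) = a_j t^3 + b_j t^2 + c_j t + d_j, \qquad x_j \le t \le x_{j+1}, \quad j = 1, \dots, q

% Interpolation at both ends of every piece: 2q equations
y_j(x_j) = D_j, \qquad y_j(x_{j+1}) = D_{j+1}, \qquad j = 1, \dots, q

% Continuity of first and second derivatives at the interior knots: 2(q-1) equations
y_j'(x_{j+1}) = y_{j+1}'(x_{j+1}), \qquad y_j''(x_{j+1}) = y_{j+1}''(x_{j+1}), \qquad j = 1, \dots, q-1

% Two boundary conditions (here: natural spline) bring the total to 4q
y_1''(x_1) = 0, \qquad y_q''(x_{q+1}) = 0
```

The first two groups give 2q + 2(q-1) = 4q - 2 equations, so two additional boundary conditions are needed before the 4q coefficients are determined.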
Splines B-spline – Written in terms of a set of normalized basis functions. Application to fitting curves to gene expression time-series data – The B-spline basis makes it convenient to obtain approximating or smoothing splines – Fewer basis coefficients than there are observed data points – Avoids overfitting
Splines The basis coefficients : – Interpreted geometrically as control points – The vertices of a polygon that control the shape of the spline but are not interpolated by the curve – The curve lies entirely within the convex hull of this controlling polygon. – Each vertex exerts only a local influence on the curve.
Splines
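– Within any knot interval x_i, S(t) is a polynomial of degree k-1 – S(t) has continuous derivatives of orders 1, 2, …, k-2 – For a given value of k : – Over the valid range of t, b_{i,k} ≥ 0, and every b_{i,k} has a single maximum; except for k = 1, 2, every b_{i,k} is a continuous, smooth curve. [Figure: basis functions b_{i,1}, b_{i,2}, b_{i,3} plotted against t over the knots x_i, …, x_{i+3}]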
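The basis-function properties listed above can be checked numerically. The sketch below uses the standard Cox-de Boor recursion, which is an assumption here since the slide's own definition was lost. The function name `bspline_basis`, the knot vector 0..7 and the sampling grid are all illustrative choices.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value at t of the order-k basis function b_{i,k}.

    Order k means polynomial degree k-1 (k=4 is cubic).
    """
    if k == 1:
        # Order 1: indicator function of the half-open knot interval.
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


knots = [0, 1, 2, 3, 4, 5, 6, 7]   # uniform knot vector (illustrative)
order = 4                           # cubic
n_basis = len(knots) - order        # four basis functions b_{0,4} .. b_{3,4}

# Non-negativity: sample every basis function over the whole knot range.
samples = [j / 20 for j in range(140)]
min_value = min(bspline_basis(i, order, t, knots)
                for i in range(n_basis) for t in samples)

# Where all four cubic basis functions overlap (3 <= t <= 4) they sum to 1.
sums_in_region = [sum(bspline_basis(i, order, 3 + j / 10, knots)
                      for i in range(n_basis)) for j in range(11)]

# Each basis function has a single peak; for b_{0,4} it sits at t = 2.
peak = bspline_basis(0, order, 2.0, knots)
```

Here `min_value` confirms non-negativity, `sums_in_region` shows that the basis functions sum to one where they all overlap, and `peak` is the maximum of a uniform cubic basis function.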
Splines A uniform knot vector is one in which the entries are evenly spaced – i.e. – The basis functions will be translates of each other, i.e. – For a periodic cubic B-spline (k=4), the equation specifying the curve :
B-splines – The B-spline will only be defined in the shaded region 3 ≤ t ≤ 4
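The following sketch illustrates the uniform-knot case and the control-point properties mentioned earlier (convex hull, local influence). It assumes the integer knot vector 0, 1, …, n+3 for n control points, so that the cubic curve is defined on 3 ≤ t ≤ n, and it assumes the standard Cox-de Boor recursion. The helper names and the coefficient values are hypothetical.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for the order-k basis function b_{i,k} at t."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def bspline_curve(coeffs, t, k=4):
    """y(t) = sum_i C_i * b_{i,k}(t) on the uniform knot vector 0..len(coeffs)+k-1."""
    knots = list(range(len(coeffs) + k))
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(coeffs))


knots8 = list(range(8))

# Uniform knots: every basis function is a shifted copy of the first one.
shift_gap = max(abs(bspline_basis(2, 4, t, knots8) - bspline_basis(0, 4, t - 2, knots8))
                for t in [2.0, 2.5, 3.3, 4.0, 5.1, 5.9])

# Four control points: the curve is only defined on 3 <= t <= 4.
coeffs = [2.0, -1.0, 0.5, 3.0]
curve = [bspline_curve(coeffs, 3 + j / 10) for j in range(11)]

# Outside that region the basis functions no longer sum to one.
partial_sum = sum(bspline_basis(i, 4, 1.5, knots8) for i in range(4))

# Local influence: with six control points, changing the first coefficient
# leaves the curve untouched for t >= 4, where b_{0,4} is zero.
base = [1.0, 2.0, 0.0, 3.0, 1.0, 2.0]
moved = [9.0] + base[1:]
tail_gap = max(abs(bspline_curve(base, t) - bspline_curve(moved, t))
               for t in [4.0, 4.5, 5.0, 5.5, 6.0])
```

Inside the valid region the curve values stay between the smallest and largest coefficient, which is the one-dimensional version of the convex-hull property.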
Estimating Unobserved Expression Values and Time Points To obtain a continuous-time formulation, use cubic B-splines – Obtain the value of the spline at a set of control points in the time series – Re-sample the curve to estimate expression values at any time point. Spline functions are not fit for each gene individually – Because of noise and missing values, this would lead to over-fitting. Instead, constrain the spline coefficients of co-expressed genes to have the same covariance matrix – Use other genes in the same class to estimate the missing values of a specific gene.
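To make the re-sampling step concrete, here is a minimal sketch of the simple per-gene case: a least-squares cubic B-spline fit with fewer coefficients than data points, evaluated at a time point that was not observed. This is deliberately the individual fit that the slide warns against for noisy data, not the class-constrained model. It assumes time has already been rescaled to the spline's valid range [3, n_coeffs] on integer knots. All function names and the synthetic data are hypothetical.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for the order-k basis function b_{i,k} at t."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0
    right = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_spline(times, values, n_coeffs, k=4):
    """Least-squares spline coefficients from the normal equations (S^T S) c = S^T y."""
    knots = list(range(n_coeffs + k))
    S = [[bspline_basis(i, k, t, knots) for i in range(n_coeffs)] for t in times]
    StS = [[sum(row[p] * row[q] for row in S) for q in range(n_coeffs)]
           for p in range(n_coeffs)]
    Sty = [sum(row[p] * y for row, y in zip(S, values)) for p in range(n_coeffs)]
    return solve_linear_system(StS, Sty), knots


def evaluate(coeffs, knots, t, k=4):
    """Re-sample the fitted curve at an arbitrary time t."""
    return sum(c * bspline_basis(i, k, t, knots) for i, c in enumerate(coeffs))


def true_profile(t):
    # Synthetic, noise-free expression profile used only for this example.
    return 0.5 * (t - 4) ** 3 - (t - 4) + 2


# Ten observations; t = 4.0 is deliberately left out as the "missing" time point.
times = [3.0, 3.2, 3.4, 3.6, 3.8, 4.2, 4.4, 4.6, 4.8, 5.0]
values = [true_profile(t) for t in times]

coeffs, knots = fit_spline(times, values, n_coeffs=5)
estimate = evaluate(coeffs, knots, 4.0)
```

Because the synthetic profile is itself a cubic, five coefficients are enough for the fit to reproduce it, and `estimate` agrees with `true_profile(4.0)` up to rounding. With real, noisy data the per-gene fit would be less reliable, which is the motivation for sharing information across co-expressed genes.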
Estimating Unobserved Expression Values and Time Points A probabilistic model of time-series expression data – Assume a set of genes are grouped together, using prior biological knowledge or a clustering algorithm
Estimating Unobserved Expression Values and Time Points
To learn the parameters of this model – Use the observed values and maximize the likelihood of the input data
Estimating Unobserved Expression Values and Time Points – Decompose the probability : If the values were observed, decompose the probability :
Estimating Unobserved Expression Values and Time Points – Use EM : E step : find the best estimate of the unobserved values using the current values of the parameters. M step : maximize the likelihood with respect to the parameters.
Model Based Clustering Algorithm for Temporal Data A new clustering algorithm that simultaneously solves the parameter estimation and class assignment problems – EM algorithm : E step, M step
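The alternation between class assignment (E step) and parameter estimation (M step) can be sketched with a much-simplified stand-in. Here each class has a mean profile at a fixed set of time points with a single shared noise variance. There are no splines and no gene-specific covariance, so this is not the model of the talk. Missing values are marked with `None`. All names, the initialization and the data are hypothetical.

```python
import math


def em_cluster(profiles, init_means, n_iter=50):
    """Simplified EM for a mixture of class mean profiles with isotropic noise."""
    n_classes = len(init_means)
    n_times = len(init_means[0])
    means = [m[:] for m in init_means]
    priors = [1.0 / n_classes] * n_classes
    var = 1.0
    resp = []
    for _ in range(n_iter):
        # E step: posterior class probabilities, using only the observed values.
        resp = []
        for y in profiles:
            log_w = []
            for j in range(n_classes):
                ll = math.log(priors[j])
                for t, v in enumerate(y):
                    if v is not None:
                        ll += (-0.5 * math.log(2 * math.pi * var)
                               - (v - means[j][t]) ** 2 / (2 * var))
                log_w.append(ll)
            top = max(log_w)
            w = [math.exp(l - top) for l in log_w]
            total = sum(w)
            resp.append([x / total for x in w])
        # M step: re-estimate class priors and class mean profiles.
        for j in range(n_classes):
            priors[j] = max(sum(r[j] for r in resp) / len(profiles), 1e-12)
            for t in range(n_times):
                num = sum(r[j] * y[t] for r, y in zip(resp, profiles) if y[t] is not None)
                den = sum(r[j] for r, y in zip(resp, profiles) if y[t] is not None)
                if den > 0:
                    means[j][t] = num / den
        # M step (continued): shared noise variance, floored to avoid collapse.
        sq_err, n_obs = 0.0, 0
        for r, y in zip(resp, profiles):
            for t, v in enumerate(y):
                if v is not None:
                    n_obs += 1
                    sq_err += sum(r[j] * (v - means[j][t]) ** 2 for j in range(n_classes))
        var = max(sq_err / n_obs, 1e-6)
    return resp, means


# Three rising and three falling profiles, two of them with a missing value.
profiles = [
    [0.1, 1.0, 2.1, 2.9],
    [0.0, None, 1.9, 3.1],
    [-0.1, 1.1, 2.0, 3.0],
    [3.0, 2.1, 0.9, 0.0],
    [2.9, 1.9, None, 0.1],
    [3.1, 2.0, 1.0, -0.1],
]
resp, means = em_cluster(profiles, init_means=[profiles[0], profiles[3]])
labels = [max(range(2), key=lambda j: r[j]) for r in resp]

# The class mean supplies an estimate for the value missing from gene 1.
missing_estimate = means[labels[1]][1]
```

The last line mirrors the idea of using other genes in the same class to estimate a missing value: gene 1 has no measurement at the second time point, so its class mean at that point is used instead.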
Aligning Temporal Data Assume we have two sets of time-series gene expression profiles – Splines for reference – Splines in the set to be warped A mapping – Linear transformation
Aligning Temporal Data The error of the alignment : – Averaged squared distance between the two curves – Takes into account the degree of overlap between the curves. The error for a set of genes S of size n – Find the parameters a and b that minimize this error
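The alignment error can be sketched numerically as follows. The slide's equations did not survive, so several choices here are assumptions: the linear mapping is written as T(s) = (s - b) / a from the warped time axis to the reference axis, the squared distance is integrated with a midpoint rule over the overlap of the two curves and divided by the overlap length, and a minimum overlap is required so that a tiny overlap cannot win. A plain grid search stands in for whatever optimizer is actually used. The curves and all names are hypothetical.

```python
import math


def alignment_error(ref, warped, ref_range, warped_range, a, b,
                    steps=200, min_overlap=1.0):
    """Averaged squared distance between warped(s) and ref(T(s)), T(s) = (s - b) / a."""
    t_min, t_max = ref_range
    s_min, s_max = warped_range
    # Overlap on the warped axis: T(s) must fall inside [t_min, t_max] (a > 0).
    alpha = max(s_min, a * t_min + b)
    beta = min(s_max, a * t_max + b)
    if beta - alpha < min_overlap:
        return math.inf
    h = (beta - alpha) / steps
    total = 0.0
    for j in range(steps):
        s = alpha + (j + 0.5) * h          # midpoint rule
        diff = warped(s) - ref((s - b) / a)
        total += diff * diff * h
    return total / (beta - alpha)          # normalize by the overlap length


def set_error(ref_curves, warped_curves, ref_range, warped_range, a, b):
    """Mean alignment error over a set of genes."""
    errors = [alignment_error(r, w, ref_range, warped_range, a, b)
              for r, w in zip(ref_curves, warped_curves)]
    return sum(errors) / len(errors)


def best_alignment(ref_curves, warped_curves, ref_range, warped_range,
                   a_values, b_values):
    """Grid search: returns (error, a, b) for the candidate with the smallest error."""
    candidates = ((set_error(ref_curves, warped_curves, ref_range, warped_range, a, b), a, b)
                  for a in a_values for b in b_values)
    return min(candidates)


# Synthetic example: the second set is the reference set stretched by 1.5 and shifted by 2.
a_true, b_true = 1.5, 2.0
reference = [math.sin, lambda t: math.cos(0.5 * t)]
warped = [lambda s, f=f: f((s - b_true) / a_true) for f in reference]

error, a_hat, b_hat = best_alignment(
    reference, warped,
    ref_range=(0.0, 10.0), warped_range=(2.0, 17.0),
    a_values=[1.0, 1.25, 1.5, 1.75, 2.0],
    b_values=[0.0, 1.0, 2.0, 3.0],
)
```

In this example the grid contains the true stretch and shift, so the search recovers them with zero error. In practice the curves would be the fitted splines of each gene, and a continuous optimizer could replace the grid.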
Results – 800 genes in Saccharomyces cerevisiae, in five groups – Unobserved data estimation
Results Clustering – Explore the effect of non-uniform sampling – Two synthetic curves :