Bioinformatics Tool Development Dong Xu Computer Science Department 109 Engineering Building West
Components of development: Identify a problem; Mathematical model; Algorithm; Software engineering; Application
Identify the problem What exactly is the problem? New ideas? Is the problem biologically important? Significance of the work? New problem or improvement? Improve accuracy or speed? Is the problem computationally solvable? Simulate a human quantum mechanically?
Mathematical Model What is the underlying math problem? Baseline information study Formulation Definition
Algorithm (1) Pick the right method Implementation Testing
Algorithm (2) Implementation Data structure/representation Language: C, C++, Perl, Java, Matlab? Unix/Linux or Windows? Modular programming (object-oriented) Style: should be user-oriented!
Algorithm (3) Debugging Tools: gdb, dbx, Visual C++ Logic? Toy cases Print intermediates
Algorithm (4) Testing and code refinement. Benchmark: select a good test set, jackknife… Internal test: application to real cases. Beta test: send to friendly users for initial tests.
Software Engineering (1) Suggestions: Easy to read (structured with comments) Avoid “spaghetti” code (goto) Easy to modify Portable to other machines Always think about computational complexity and clock cycles Use dynamic memory allocation
Software Engineering (2) Polynomial evaluation
  y = a+b*x+c*x**2.0+d*x**3.0+e*x**4.0+f*x**5.0     (42.3 s)
  y = a+b*x+c*x**2+d*x**3+e*x**4+f*x**5             (5.63 s)
  y = a+b*x+c*x*x+d*x*x*x+e*x*x*x*x+f*x*x*x*x*x     (3.15 s)
  x2 = x*x                                          (2.83 s)
  x4 = x2*x2
  y = a+b*x+c*x2+d*x*x2+e*x4+f*x*x4
  y = a+x*(b+x*(c+x*(d+x*(e+f*x))))                 (1.83 s)
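The last form on this slide is Horner's rule. A minimal sketch of it for a polynomial of any degree, assuming the coefficients are stored lowest order first; the function name `horner` and its parameter layout are illustrative choices, not taken from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate coeffs[0] + coeffs[1]*x + ... + coeffs[n-1]*x^(n-1)
   with Horner's rule: n-1 multiplications and n-1 additions,
   and no calls to a power function. */
double horner(const double *coeffs, size_t n, double x)
{
    assert(coeffs != NULL && n > 0);

    double y = coeffs[n - 1];            /* start from the highest-order term */
    for (size_t i = n - 1; i-- > 0; ) {  /* i runs n-2, n-3, ..., 0 */
        y = y * x + coeffs[i];
    }
    return y;
}
```

With coefficients {a, b, c, d, e, f} this computes the same value as y = a+x*(b+x*(c+x*(d+x*(e+f*x)))).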
Software Engineering (3) Precision: Big numbers Tiny numbers Iteration effects Machine dependent
$\mathrm{score} = 1 - (1-P_1)(1-P_2)(1-P_3)(1-P_4) = 1 - \exp\left[\log(1-P_1) + \log(1-P_2) + \log(1-P_3) + \log(1-P_4)\right]$
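Software Engineering (4) Precision: $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \dots + \frac{1}{M-1} + \frac{1}{M} \approx \log(M)$ as $M \to \infty$
Table: Forward sum | Backward sum | log(M), for M = 10^ and M = 10^ (exponents and table entries missing)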
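The forward and backward sums compared on this slide can be reproduced with a sketch like the one below. It assumes single-precision (`float`) accumulation, which the slide does not specify; the function names are illustrative. Adding the small terms first (backward) lets them accumulate before they meet a large running total, whereas the forward sum stops changing once 1/k falls below half the spacing of representable values near the total.

```c
#include <assert.h>

/* Sum 1 + 1/2 + ... + 1/m starting from the largest term. */
float harmonic_forward(long m)
{
    assert(m >= 1);
    float sum = 0.0f;
    for (long k = 1; k <= m; k++) {
        sum += 1.0f / (float)k;
    }
    return sum;
}

/* Same series, starting from the smallest term. */
float harmonic_backward(long m)
{
    assert(m >= 1);
    float sum = 0.0f;
    for (long k = m; k >= 1; k--) {
        sum += 1.0f / (float)k;
    }
    return sum;
}
```

Calling both with a large m (for instance ten million) and printing the results next to a double-precision sum shows how far each ordering drifts.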
Software Engineering (5) Loop optimization (1): C program
  for (i=0; i<1000; i++)            (78 msec)
    for (j=0; j<1000; j++)
      c[i][j] = c[i][j] + a[i][j] + b[i][j];
  for (j=0; j<1000; j++)            (1860 msec)
    for (i=0; i<1000; i++)
      c[i][j] = c[i][j] + a[i][j] + b[i][j];
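The gap between the two timings comes from memory layout: C stores two-dimensional arrays row by row, so the first loop order walks contiguous memory while the second jumps a full row between accesses. A self-contained sketch of both orders, assuming square matrices stored row-major in flat arrays (the function names and the flat layout are illustrative, not from the slides):

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, j) of an n x n matrix lives at index i*n + j. */

/* Inner loop over j: consecutive iterations touch adjacent memory. */
void add_row_order(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] += a[i * n + j] + b[i * n + j];
}

/* Inner loop over i: consecutive iterations are n elements apart. */
void add_column_order(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            c[i * n + j] += a[i * n + j] + b[i * n + j];
}
```

Both functions produce identical results; only the order of memory accesses differs, so any timing difference on a given machine is a cache effect.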
Software Engineering (6) Loop optimization (2):
  for (i=0; i<100000; i++)          (30 msec)
    x = x*a[i] + b[i];
  for (i=0; i<100000; i++)
    y = y*a[i] + b[i];
  for (i=0; i<100000; i++) {        (16 msec)
    x = x*a[i] + b[i];
    y = y*a[i] + b[i];
  }
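Merging the two loops is usually called loop fusion. A compilable sketch of both versions follows; the function names and the pointer-based interface are assumptions made for illustration. A plausible reason for the speedup is that the fused loop reads a[i] and b[i] once per iteration and pays the loop overhead once, though the actual gain depends on the machine and compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes over a and b: one recurrence per loop. */
void recur_separate(const double *a, const double *b, size_t n,
                    double *x, double *y)
{
    assert(a != NULL && b != NULL && x != NULL && y != NULL);
    double xv = *x, yv = *y;
    for (size_t i = 0; i < n; i++)
        xv = xv * a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        yv = yv * a[i] + b[i];
    *x = xv;
    *y = yv;
}

/* One pass: both recurrences share each load of a[i] and b[i]. */
void recur_fused(const double *a, const double *b, size_t n,
                 double *x, double *y)
{
    assert(a != NULL && b != NULL && x != NULL && y != NULL);
    double xv = *x, yv = *y;
    for (size_t i = 0; i < n; i++) {
        xv = xv * a[i] + b[i];
        yv = yv * a[i] + b[i];
    }
    *x = xv;
    *y = yv;
}
```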
Software Engineering (7) Compiler optimization switch: -O (often improves speed by 50%, depending on the machine); -O2 (same as -O on some machines): simple inline optimization; -O3 (-O4 on some machines): more complex optimizations designed to pipeline code, but may alter semantics
Software Engineering (8) Friendly user interface: graphics, Web, options, automation. Pipeline interface with other tools. Parallel computing: multiple machines (server/client), network query.
Applications Get feedback for adding new features Find good experimental collaborators From tools to papers Continuous bug reports
Summary Identify a problem: solvable, biologically important Mathematical model: formulation and definition Algorithm: rigorous method, fast implementation, and systematic testing Software Engineering: friendly user interface and integration of different tools Application: work with experimentalists