Parallel Algorithm for Multiple Genome Alignment Using Multiple Clusters Nova Ahmed, Yi Pan, Art Vandenberg Georgia State University SURA Cyberinfrastructure.


1 Parallel Algorithm for Multiple Genome Alignment Using Multiple Clusters Nova Ahmed, Yi Pan, Art Vandenberg Georgia State University SURA Cyberinfrastructure Workshop: Grid Application Planning & Implementation January 5-7, 2005 Southeastern Universities Research Association

2 Discussion Topics
- Sequence alignment problem
- Memory efficient algorithm
- Convergence toward collaboration
- System configurations
- Results (part 1, part 2)
- Conclusions
- Future work

3 Sequence alignment problem
- Sequences used to find biologically meaningful relationships among organisms
  - Evolutionary information
  - Determining diseases, causes, cures
  - Finding out information about proteins
- Problem is especially compute intensive for long sequences
  - Needleman and Wunsch (1970): optimal global alignment
  - Smith and Waterman (1981): optimal local alignment
  - Taylor (1987): multiple sequence alignment by pairwise alignment
  - BLAST trades off optimal results for faster computation
- Challenge: achieve optimal results without sacrificing speed

4 Memory efficient algorithm
- Based on pairwise algorithm
- Similarity Matrix generated to compare all sequence positions
- Observation that many “alignment scores” are zero value
- Similarity Matrix reduced by storing only non-zero elements
  - Row-column information stored along with value
  - Block of memory dynamically allocated as non-zero element found
  - Data structure used to access allocated blocks
- Parallelism introduced to reduce computation
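As a rough illustration of storing only the non-zero scores together with their row and column, here is a minimal Python sketch. The class name `SparseSimilarityMatrix`, its methods, and the helper `match_matrix` are hypothetical, and a dictionary stands in for the dynamically allocated memory blocks mentioned on the slide; the authors' actual data structure is not given in the transcript.

```python
from typing import Dict, Tuple


class SparseSimilarityMatrix:
    """Keeps only non-zero scores, keyed by (row, column)."""

    def __init__(self, n_rows: int, n_cols: int) -> None:
        self.n_rows = n_rows
        self.n_cols = n_cols
        self._cells: Dict[Tuple[int, int], int] = {}

    def set(self, row: int, col: int, score: int) -> None:
        # Zero scores are never stored; writing zero removes any old entry.
        if score != 0:
            self._cells[(row, col)] = score
        else:
            self._cells.pop((row, col), None)

    def get(self, row: int, col: int) -> int:
        # Any cell that was not stored is implicitly zero.
        return self._cells.get((row, col), 0)

    def stored_cells(self) -> int:
        return len(self._cells)


def match_matrix(x: str, y: str) -> SparseSimilarityMatrix:
    """Hypothetical helper: 1 where characters match, 0 (not stored) otherwise."""
    m = SparseSimilarityMatrix(len(y), len(x))
    for i, cy in enumerate(y):
        for j, cx in enumerate(x):
            m.set(i, j, 1 if cx == cy else 0)
    return m
```

With the example sequences from the next slide, `match_matrix("TGATGGAGGT", "GATAGG").stored_cells()` reports how many of the 60 compared positions actually need storage.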

5 Similarity Matrix Generation
- Alignment of DNA sequences:
  - Sequence X: TGATGGAGGT
  - Sequence Y: GATAGG
- 1 = matching; 0 = non-matching
- ss = substitution score; gp = gap score
- Generate Similarity Matrix: max score with respect to neighbors [equation not preserved in transcript]
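The slide's equation did not survive in the transcript, so the following is only a sketch of one common way to compute a "max score with respect to neighbors": the standard Smith-Waterman local-alignment recurrence. The function name `similarity_matrix` and the default gap score of -1 are assumptions; the match (1) and non-match (0) scores follow the slide.

```python
from typing import List


def similarity_matrix(x: str, y: str, match: int = 1,
                      mismatch: int = 0, gap: int = -1) -> List[List[int]]:
    """Fill a (len(y)+1) x (len(x)+1) score matrix, Smith-Waterman style.

    Assumed recurrence (not taken from the slide):
        h[i][j] = max(0,
                      h[i-1][j-1] + ss,   # diagonal neighbor + substitution score
                      h[i-1][j]   + gp,   # upper neighbor + gap score
                      h[i][j-1]   + gp)   # left neighbor + gap score
    """
    rows, cols = len(y) + 1, len(x) + 1
    h = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if y[i - 1] == x[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + ss,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
    return h


if __name__ == "__main__":
    # Print the matrix for the example sequences on the slide.
    for row in similarity_matrix("TGATGGAGGT", "GATAGG"):
        print(" ".join(f"{v:2d}" for v in row))
```

Clamping at zero is what produces the many zero-valued cells that the memory-efficient variant avoids storing.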

6 Trace sequences
- Back trace matrix to find sequence matches
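A back trace over a score matrix can be sketched as below. This assumes a Smith-Waterman style matrix (scores clamped at zero, gap score -1) and starts from the highest-scoring cell; the slide does not say which variant the authors use, and the names `fill_scores` and `trace_back` are hypothetical.

```python
from typing import List, Tuple


def fill_scores(x: str, y: str, match: int = 1,
                mismatch: int = 0, gap: int = -1) -> List[List[int]]:
    """Assumed Smith-Waterman fill; rows follow y, columns follow x."""
    h = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            ss = match if y[i - 1] == x[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + ss,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
    return h


def trace_back(x: str, y: str, match: int = 1,
               mismatch: int = 0, gap: int = -1) -> Tuple[str, str]:
    """Return one best local alignment as (aligned_x, aligned_y)."""
    h = fill_scores(x, y, match, mismatch, gap)
    # Start from the first cell holding the maximum score.
    bi, bj = 0, 0
    for i in range(len(h)):
        for j in range(len(h[0])):
            if h[i][j] > h[bi][bj]:
                bi, bj = i, j
    ax: List[str] = []
    ay: List[str] = []
    i, j = bi, bj
    # Walk back through the neighbor that produced each score until a zero.
    while h[i][j] > 0:
        ss = match if y[i - 1] == x[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + ss:      # came from the diagonal
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:       # came from above: gap in x
            ax.append("-"); ay.append(y[i - 1]); i -= 1
        else:                                    # came from the left: gap in y
            ax.append(x[j - 1]); ay.append("-"); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `trace_back("TGATGGAGGT", "GATAGG")` returns one best-scoring pair of aligned substrings for the example sequences; ties between equally good paths are broken in favor of the diagonal.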

7 Data structure
- Algorithm calculates only non-zero values
- Memory dynamically allocated as needed

8 Parallel distribution of multiple sequences
[Diagram: the sequence set is split into Sequences 1-6 and Sequences 7-12; Sequences 1-6 are further split into Seq 1-2, Seq 3-4, Seq 5-6]
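The splitting shown in the diagram (twelve sequences into two groups of six, each group into pairs) could be expressed roughly as follows. This is only a sketch of the partitioning step; the helper names `split_into_groups` and `pair_up` are hypothetical, and how the authors actually assign groups to processors is not stated on the slide.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_groups(items: Sequence[T], n_groups: int) -> List[List[T]]:
    """Split items into n_groups contiguous chunks whose sizes differ by at most one."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    base, extra = divmod(len(items), n_groups)
    groups: List[List[T]] = []
    start = 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(items[start:start + size]))
        start += size
    return groups


def pair_up(items: Sequence[T]) -> List[Tuple[T, ...]]:
    """Group neighbors into pairs; a trailing odd item forms a 1-tuple."""
    return [tuple(items[k:k + 2]) for k in range(0, len(items), 2)]


if __name__ == "__main__":
    sequence_ids = list(range(1, 13))          # sequences 1..12, as in the diagram
    for group in split_into_groups(sequence_ids, 2):
        print(group, "->", pair_up(group))
```

Each pair would then be handed to a worker for pairwise alignment, for example one MPI rank per pair.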

9 Convergence toward collaboration
- Algorithm implementation
  - Nova Ahmed, Masters CS student
  - Dr. Yi Pan, CS, graduate advisor
  - Shared memory system: Georgia State
  - Algorithm implementation and initial validation results
- NMI Integration Testbed program
  - Georgia State: Art Vandenberg, Victor Bolet, et al.
  - University of Alabama at Birmingham: Jill Gemmill, John-Paul Robinson, Pravin Joshi
- SURA NMI Testbed Grid
  - Looking for applications to demonstrate value

10 System configurations
- Shared memory: Georgia State
  - SGI Origin 2000
    - 24 x 250 MHz MIPS R10000 processors; 4 gigabytes total RAM
- Clusters: University of Alabama at Birmingham
  - Single Cluster
    - 8-node Beowulf cluster (each node 4 x 550 MHz Pentium III; 512 MB RAM)
  - Single Cluster Grid
    - Same 8-node Beowulf cluster with Globus Toolkit 3.0
  - Multi-Cluster
    - 2 additional grid-enabled clusters (small SMP systems)
- Multi-Cluster interconnect speed essentially 100 Mb/sec

11 Results, part 1
- Initial validation of algorithm on shared memory
- UAB cluster
  - As “relative comparison” to shared memory performance
- UAB grid-enabled cluster
  - To evaluate impact of grid middleware layer

12 Initial Validation: Shared Memory Machine
- Performance validates algorithm
  - Computation time decreases with increased number of processors
- Limitations
  - Memory: max sequence is 2000 x 2000
  - Processors: policy limits student to 12 processors
  - Not scalable

13 Results: UAB Clusters; Shared Memory*
- Increase genome lengths to 3000 (remove student limit on shared memory)
* NB: results comparing clusters with shared memory are relative; the systems are distinctly different.

14 Results: Grid-enabled cluster (Globus, MPICH)
- Advantages of grid-enabled cluster:
  - Longer sequences: up to 10,000 length tested
  - Scalable: can add new cluster nodes to the grid
  - Easier job submission: don’t need account on every node
  - Scheduling is easier: can submit multiple jobs at one time

15 Results, part 2
- Focus on clusters
  - UAB cluster
  - UAB grid-enabled cluster
  - Multi-clusters at UAB
- Multiple genome alignment, not just pairwise
- Sequence set from sequence library
  - Approx. 150 sequences ranging from 80,000 to 1,000,000 in length
- Globus Toolkit 3.0, MPICH-G2

16 Computation Time
[Chart: computation time vs. number of elements per processor]
- Using 9 processors in each config (cluster, grid cluster, multi-grid cluster)

17 Computation Time
[Chart: computation time]
- 9 processors available in multi-cluster
- 32 processors for other configs

18 Speed up (time 1 cpu / time n cpus)
[Chart: speed up]
- 9 processors available in multi-cluster
- 32 processors for other configs
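The speed-up ratio named in this slide's title is simple to compute. The sketch below also derives parallel efficiency (speed-up divided by processor count), a standard companion metric that is not mentioned on the slide. The function names are hypothetical, and the timings in the usage lines are placeholders, not measurements from this work.

```python
def speedup(time_one_cpu: float, time_n_cpus: float) -> float:
    """Speed-up as defined on the slide: time on 1 CPU / time on n CPUs."""
    if time_n_cpus <= 0:
        raise ValueError("time_n_cpus must be positive")
    return time_one_cpu / time_n_cpus


def efficiency(time_one_cpu: float, time_n_cpus: float, n_cpus: int) -> float:
    """Speed-up per processor (added here for illustration; not on the slide)."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return speedup(time_one_cpu, time_n_cpus) / n_cpus


if __name__ == "__main__":
    # Placeholder timings in seconds, chosen only to show the arithmetic.
    t1, t16 = 120.0, 15.0
    print("speed-up:", speedup(t1, t16))
    print("efficiency:", efficiency(t1, t16, 16))
```

An efficiency well below 1 suggests that communication or load imbalance is eating into the gain from extra processors, which matters when comparing the 9-processor multi-cluster runs with the 32-processor runs.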

19 Some Conclusions
- Having cluster nodes available via Testbed is beneficial
  - Enables access where resource not available locally
  - Empowers student investigation
- Grid capability demonstrated
  - Provides awareness and outreach vector
  - Nova Ahmed’s thesis defense engages other graduate students
  - Concrete “take away” that engages faculty/IT/student discussion
- Some interesting results
  - Hypothesis: multi-cluster may provide better results than one cluster
  - Research leads to understanding and learning, whatever the hypothesis result
- Ahmed et al., “Memory Efficient Pair-Wise Genome Alignment Algorithm - A Small-Scale Application with Grid Potential,” Proceedings Grid and Cooperative Computing - GCC 2004, Lecture Notes in Computer Science

20 Future Work
- Running across clusters at different sites
- Intelligent agent: submit to mixed environment (shared memory and/or clusters and/or …)
- Using BridgeCA for transparent access
- Optically connected clusters?
- Analysis of network factors (cf. Warren Matthews, GaTech, et al., end-to-end performance)

21 Questions / Contacts
Georgia State University
- Nova Ahmed: Nahmed2@student.gsu.edu
- Yi Pan: yipan@gsu.edu
- Art Vandenberg: avandenberg@gsu.edu

22 Acknowledgement
This work is supported in part by the NSF Middleware Initiative Cooperative Agreement No. ANI-0123937. Any opinions, findings, conclusions or recommendations expressed herein are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

