Development of a Chicken Unigene Database
Project No. 9

Mentors: Dr. Wellington Martins, Dr. Joan Burnside
Animal Science Dept., University of Delaware

Jianshan Tang, Ruoming Jin
Department of CIS, University of Delaware

Lilian Lacoste
DBI, French National School of Aeronautics and Space
Results
Phrap clustering result:
- Input: 17,090 ESTs, processed with Phrap
- Output: 9,205 clusters (2,815 contigs + 6,390 singlets)
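Second clustering method: using BLAST output
- Each contig has its own BLAST output (Contig 1: BLAST output 1, Contig 2: BLAST output 2)
- Filtering and parsing of each BLAST output
- Comparing the parsed outputs with a similarity function
- Storing the pairwise scores in a similarity matrix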
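The slide does not say which similarity function was used. As a minimal sketch of the comparing step, the example below assumes each parsed BLAST output is reduced to a set of hit identifiers and scores two contigs by the Jaccard index of those sets. The function names and the hit identifiers are hypothetical and serve only as illustration.

```python
from itertools import combinations


def jaccard(hits_a, hits_b):
    """Share of BLAST hit identifiers common to two contigs (assumed measure)."""
    a, b = set(hits_a), set(hits_b)
    if not a and not b:
        return 0.0  # two contigs without hits are treated as unrelated
    return len(a & b) / len(a | b)


def similarity_matrix(hits_by_contig):
    """Return {(contig_i, contig_j): score} for every unordered contig pair."""
    names = sorted(hits_by_contig)
    return {
        (i, j): jaccard(hits_by_contig[i], hits_by_contig[j])
        for i, j in combinations(names, 2)
    }


# Made-up hit identifiers, not taken from the project data.
hits = {
    "contig1": ["hitA", "hitB", "hitC"],
    "contig2": ["hitA", "hitB", "hitD"],
    "contig3": ["hitE"],
}
matrix = similarity_matrix(hits)
```

Pairs whose score exceeds a chosen threshold could then be treated as edges of a graph and passed to a clustering step.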
What's "gbc"?
Graph Based Clustering
- Clustering: the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
- Graph: the relations among the data can be expressed as a graph; if two nodes are related, an edge connects them
Applications in bioinformatics
- Protein sequence clustering
- EST clustering
- Many other applications
Objectives of "gbc"
- Support different input formats
- Efficiently support clustering of very large sparse graphs
- Be flexible for the user
How to use "gbc"
Output
- Cluster number and all the nodes that belong to the cluster
Clique clustering
- A clique is a completely connected subgraph
- Each maximal clique in the graph becomes a cluster
- Clusters may overlap
- Generally produces small but very tight clusters
Single-link clustering
- A maximal connected subgraph becomes a cluster
- Produces larger but weaker clusters
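The two clustering modes can be illustrated on a toy graph. This is a sketch under assumptions, not the "gbc" code: the graph is a plain dictionary of adjacency sets, single-link clustering is computed as connected components, and maximal cliques are enumerated with the basic Bron-Kerbosch recursion (no pivoting). All names are hypothetical.

```python
def single_link(adj):
    """Each maximal connected subgraph (connected component) is one cluster."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(component)
    return clusters


def maximal_cliques(adj):
    """Each maximal clique is one cluster (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend it: maximal
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: triangle a-b-c, a tail c-d, and an isolated node e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
components = sorted(sorted(c) for c in single_link(adj))
cliques = sorted(sorted(c) for c in maximal_cliques(adj))
```

On this graph, node c falls into two cliques, which shows the overlap, while single-link merges a, b, c and d into one larger, looser cluster.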
A little about the implementation
Two clustering algorithms
- Single-link
- Clique
Graph classes
- Efficiently support dense and sparse graphs
- Provide the same interface, so the clustering code needs no modification
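The idea of graph classes that share one interface might look like the sketch below. The class and method names are hypothetical and not the actual "gbc" classes. The sparse class stores adjacency sets, the dense class stores a boolean matrix, and the same traversal code runs on either.

```python
class SparseGraph:
    """Adjacency sets: memory grows with the number of edges."""

    def __init__(self, n):
        self._adj = [set() for _ in range(n)]

    def add_edge(self, u, v):
        self._adj[u].add(v)
        self._adj[v].add(u)

    def nodes(self):
        return range(len(self._adj))

    def neighbors(self, u):
        return iter(self._adj[u])


class DenseGraph:
    """Adjacency matrix: memory grows with the square of the node count."""

    def __init__(self, n):
        self._m = [[False] * n for _ in range(n)]

    def add_edge(self, u, v):
        self._m[u][v] = True
        self._m[v][u] = True

    def nodes(self):
        return range(len(self._m))

    def neighbors(self, u):
        return (v for v, linked in enumerate(self._m[u]) if linked)


def count_components(graph):
    """Uses only nodes() and neighbors(), so it accepts either class."""
    seen, count = set(), 0
    for start in graph.nodes():
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in graph.neighbors(node):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count


def build(graph_cls):
    """Build the same 5-node example graph with the given class."""
    g = graph_cls(5)
    for u, v in [(0, 1), (1, 2), (3, 4)]:
        g.add_edge(u, v)
    return g
```

Calling `count_components(build(SparseGraph))` and `count_components(build(DenseGraph))` exercises the same clustering code against both representations.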
Analysis program
- Reset BLAST output
- Change matrix threshold
- Reset semantics
- Run analysis
- New contig set
- Number of contigs
- Comparison algorithm
- Clustering algorithm
- Results output
- Analysis tools
- Process log output
Analysis tools: contig information
Display the BLAST output:
- sequence references
- sequence annotations
- percentage of matching base pairs
Display the list of contigs, sorted by their best matching percentage in the BLAST output
Analysis tool: EST selector
Display:
- frequency vs. length (in ESTs) of contigs
- list of ESTs in a contig
Allows selecting the best representative EST according to length and tissue type
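The slide does not state how length and tissue type are combined. One possible selection rule, shown here purely as an assumption, is to restrict the choice to ESTs from a preferred tissue when any exist and then take the longest. The record fields and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Est:
    name: str
    length: int   # sequence length in base pairs
    tissue: str   # tissue type of the source library


def best_representative(ests, preferred_tissue=None):
    """Longest EST from the preferred tissue, else the longest EST overall."""
    candidates = [e for e in ests if e.tissue == preferred_tissue] or list(ests)
    return max(candidates, key=lambda e: e.length)


# Made-up ESTs for one contig, for illustration only.
contig_ests = [
    Est("est1", 520, "liver"),
    Est("est2", 610, "spleen"),
    Est("est3", 480, "liver"),
]
overall = best_representative(contig_ests)
from_liver = best_representative(contig_ests, preferred_tissue="liver")
```

With no tissue preference the longest EST is chosen; with a preference, the longest EST from that tissue is chosen instead.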
First results
On a set of 400 contigs representing 1000 ESTs

Contig number: 79
Contig size: 743
Best matching fraction:
gb|AF |AF Gallus gallus Rad54b (RAD54B) mRNA, compl e-160
gb|BC |BC Homo sapiens, RAD54, S. cerevisiae, homol e-31
ref|XM_ | Homo sapiens RAD54, S. cerevisiae, homolog of, e-31
gb|AF |AF Homo sapiens RAD54B protein (RAD54B) mRNA e-31
ref|NM_ | Homo sapiens RAD54, S. cerevisiae, homolog of, e-31
emb|AL |HSM Homo sapiens mRNA; cDNA DKFZp434J1672 ( e-31
dbj|AP |AP Homo sapiens genomic DNA, chromosome 8q e-11
gb|AC |AC Homo sapiens chromosome 8, clone RP

Contig number: 133
Contig size: 740
Best matching fraction:
gb|AF |AF Gallus gallus Rad54b (RAD54B) mRNA, compl
gb|BC |BC Homo sapiens, RAD54, S. cerevisiae, homol e-44
ref|XM_ | Homo sapiens RAD54, S. cerevisiae, homolog of, e-44
gb|AF |AF Homo sapiens RAD54B protein (RAD54B) mRNA e-44
ref|NM_ | Homo sapiens RAD54, S. cerevisiae, homolog of, e-44
emb|AL |HSM Homo sapiens mRNA; cDNA DKFZp434J1672 ( e-44
dbj|AP |AP Homo sapiens genomic DNA, chromosome 8q e-11
gb|AC |CBRG45G04 Caenorhabditis briggsae cosmid G45G04, c
dbj|AB |AB Arabidopsis thaliana genomic DNA, chromo