Published by Spencer Greene. Modified over 9 years ago.
From ESTs to partial genomes
Alasdair Anthony, Ralf Schmid, James Wasmuth, John Parkinson and Mark Blaxter
Nematode Genomics, Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT

INTRODUCTION
● Expressed sequence tags (ESTs) offer a low-cost approach to gene discovery
● For a range of non-model organisms, ESTs represent the only sequence information available
● Using these data to create 'partial genomes' means the data can be interpreted in a genomic context
● To facilitate the creation of partial genomes, we have created a suite of software tools designed to form a complete EST pipeline
● The first tool in the pipeline, trace2dbest, processes raw chromatograms into high-quality sequence objects
● These sequences are then used to build a partial genome, using the PartiGene tool
● The partial genome is held in an SQL database, which can be made accessible through the web
● A further software tool, prot4EST, provides robust translation of the error-prone sequences

trace2dbest
● trace2dbest is an interactive utility for processing raw EST data
● The base-calling program phred [1] is used to produce a quality-scored sequence
● trace2dbest then performs a series of trimming steps
● cross_match is used to identify leading and trailing vector sequence
● Next, user-defined leader and adapter sequences are trimmed
● Poly(A) tails are identified based on user-defined parameters and trimmed

[Figure: trace2dbest workflow. Raw chromatogram; base calling (phred [1]); trimming; high-quality sequence + cDNA library information; dbEST EST file.]

Raw chromatogram sequence (as shown in the figure; the leading and trailing stretches are in lower case):
acatcgaatcgatacatgACGTAGCAGATCAGTAC
ATGATACACGTCGTCGTCTGCATGCTTGC
CACGTCCAGTTTGGCCATTAGTACGCCC
GCTGACCTGACTCTGACCATTGACCACT
GATGTCCATGATTccatgacatcttgatcgtgatcga

Example dbEST EST file:
TYPE: EST
STATUS: New
CONT_NAME: Blaxter ML
CITATION: Expressed Sequence Tags from the humus earthworm L. rubellus
LIBRARY: Earthworm Lambda Zap Express Library
EST#: Lr_adE_01H01_T3
CLONE: Lr_adE_01H01
SOURCE:
PCR_F: T3
PCR_B: T7PL
PLATE: 01
ROW: H
COLUMN: 01
SEQ_PRIMER: T3
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 478
DNA_TYPE: cDNA
PUBLIC:
PUT_ID: gb|AAA74396.1| cytochrome c oxidase subunit IV
COMMENT: Sequencing was performed in Edinburgh
SEQUENCE:
CCAACACCGTCATGTCCGGAGACACGACCATGTTCCCAGGTATCGCCGATCGTATGCAGA
AGGAGATCACGAGCATGGCTCCAAGCACGATGAAGATCAAGATCATCGCTCCACCCGAGC
GCAAGTACTCCGTATGGATCGGTGGGTCCATCCTGGCTTCCCTGTCCACCTTCCAGCAGA
TGTGGATCAGCAAGCAGGAGTACGACGAGTCCGGCCCATCCATCGTCCACAGGAAGTGCT
TCTAAATGCACCGCCGACAACGAGTTACCAAGGGCGACAGAAAGAACCCGCTAACGCGAG
CACACACACGCAAGCAAACACACAGCGTGCACGTACATACAACATCACACAACCCATCTC
TATGACTCACACACCTTTTCAACCGAACTTTATCCAAATTACGCAAACCGAAGTTTCGAT
TTTATTTCGTCCTTGTGGACACAAAAGTAATTTAAAAATCTCTGTACGCCTTAATTTGAG
GCTATAGTTTGCTTTTGTAACTTAAGGCGATCACAGATTCTAGATGCAATCGTGACTTTA
TATTTTACGATTTAT
||

PartiGene
● PartiGene represents the core of the partial genome creation process
● All ESTs from a particular species are clustered and assembled to form putative gene objects
● These genes can then be annotated and the information presented as a web-based resource

1. Collate sequences (dbEST)
● Sequences are downloaded from the public database
2. Cluster
● Sequences are clustered on the basis of similarity (BLAST) using CLOBB [2]
3. Assemble
● Clusters are assembled to form contigs using phrap (Green, P., unpublished)
4. Partial genome
[Figure: assembled contigs shown as Gene A, Gene B, Gene C.]
5. Annotation
● Translation (prot4EST)
● BLAST
● Under development: putative location, functional prediction, structure prediction, domain identification
6. Web front ends
● Nembase was created using PHP to submit queries to the PartiGene database
[Figure: example PartiGene HTML results output.]

prot4EST
● Poor sequence quality, identification of the coding region, and frame-shifts make EST translation problematic
● prot4EST integrates current translation solutions: BLASTX, DECoder [3] and ESTScan [4]
● Fully compatible with PartiGene

[Figure: prot4EST flowchart. Step labels: partial genome sequences; BLASTN against RNA database; BLASTX against mitochondrially encoded proteins; BLASTX against SWISSProt; parse results; join and extend HSPs; run ESTScan; run DECoder; identify longest ORF from six-frame translation; peptide prediction. Branch labels: RNA sequences; no match; fails filters; length and quality filters; >= 30 residues long; sequence similarity (E < e-8); sequence similarity (E < e-65).]

SUMMARY
● The PartiGene process has been used to create several species-specific databases, including nembase (http://www.nematodes.org) and lumbribase (http://www.earthworms.org)
● The software is freely available under a GNU license at http://nema.cap.ed.ac.uk/PartiGene
● The software is under continued development. SimiTri (a tool allowing phylogenetic comparisons) is due to be integrated into the pipeline soon. An additional module, annot8er, is also under development

Acknowledgments: The authors would like to thank Ann Hedley and the rest of the Environmental Genomics Data Centre team for their help. The project is funded by NERC.

References:
1. Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175-194
2. Parkinson, J., Guiliano, D.B. & Blaxter, M. (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics 3, 31
3. Fukunishi, Y. & Hayashizaki, Y. (2001) Amino-acid translation for cDNA with frame-shift error. Physiol. Genomics 5, 81-87
4. Iseli, C., Jongeneel, C.V. & Bucher, P. (1999) ESTScan: a program for detecting, evaluating and reconstructing potential coding regions in EST sequences. ISMB 7, 138-158

The Environmental Genomics Thematic Programme Data Centre
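
The prot4EST flowchart includes the step "Identify longest ORF from six frame translation" with a ">= 30 residues long" filter. The Python sketch below illustrates that idea only; it is not the prot4EST code. It assumes the standard genetic code, treats an ORF as the longest stop-free stretch in any of the six frames (no start codon is required, since ESTs are often partial), and translates codons containing ambiguous bases as "X". The function names and the `min_residues` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Yield (frame label, protein) for the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield f"{strand}{offset + 1}", translate(seq[offset:])


def longest_orf(dna, min_residues=30):
    """Return (frame, peptide) for the longest stop-free stretch, or None if too short.

    Assumption: ties are resolved in favour of the first frame examined.
    """
    best_frame, best_peptide = None, ""
    for frame, protein in six_frames(dna):
        for segment in protein.split("*"):
            if len(segment) > len(best_peptide):
                best_frame, best_peptide = frame, segment
    if len(best_peptide) < min_residues:
        return None
    return best_frame, best_peptide
```

Calling `longest_orf(sequence)` on an EST or consensus sequence returns the frame label and peptide, or `None` when no stop-free stretch reaches the length threshold. In the poster's flowchart this step appears alongside the similarity-based and model-based methods (BLASTX, ESTScan, DECoder).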