Part I: Identifying sequences with …
Speaker: S. Gaj
Date
Annotation
The best possible description available for a given sequence at the current time.
How to annotate? By combining:
– Alignment tools
– Databases
– Datamining (scripts)
Background
Microarrays
Introduction
Global alignment
– Optimal alignment between two sequences that contains as many characters of the query as possible.
– Ex: predicting evolutionary relationships between genes, …
Local alignment
– Optimal alignment between two sequences that identifies identical area(s).
– Ex: identifying key molecular structures (S-bonds, α-helices, …)
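The difference between the two alignment types can be sketched with a small dynamic-programming scorer. This is a minimal illustration and not taken from the slides: the function name `alignment_score`, the scoring values (match +1, mismatch -1, gap -2) and the example sequences are assumptions chosen for readability. The global variant follows the Needleman-Wunsch recurrence and the local variant the Smith-Waterman recurrence, both with a simple linear gap penalty.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the optimal alignment score of a and b.

    local=False: global alignment (every character of both sequences is used).
    local=True:  local alignment (best-scoring pair of sub-regions).
    All scoring values are illustrative assumptions.
    """
    # First row: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]


long_seq = "TTTTACGTACGT"   # hypothetical sequence with an unrelated prefix
short_seq = "ACGTACGT"      # hypothetical query

print(alignment_score(long_seq, short_seq))              # global: 0
print(alignment_score(long_seq, short_seq, local=True))  # local: 8
```

The global score is pulled down by the four gap positions needed to cover the unrelated prefix, while the local score only rewards the shared region, which is why a local search suits database queries of fragments.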
Introduction
Basic Local Alignment Search Tool
– Aligns an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
– Aim: obtaining structural or functional information on the unknown sequence.
BLAST
Programs
Different BLAST programs are available (rows: query type, columns: database type):

           Nucleic   Protein
Nucleic    BlastN    BlastX
Protein    -         BlastP

Usable criteria: E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …
Terms
– Query: sequence which will be aligned
– Subject: sequence present in the database
– Hit: alignment result
Common BLAST problems
BlastN: sequencing error

Clone seq    CGATAGCCCGCCAGGATAT
             ||||||||| |||||||||
mRNA        ACGATAGCCC-CCAGGATATA

Solution: low penalty for GOP and GEP = 1
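The effect of the gap penalty on this example can be sketched with a small local-alignment scorer. This is a hedged illustration only: `local_score` is a hypothetical helper, it uses a single linear gap penalty instead of separate GOP and GEP values, and the match/mismatch scores (+1/-1) are assumptions. The two sequences are the clone and mRNA fragments from the slide.

```python
def local_score(query, subject, gap_penalty, match=1, mismatch=-1):
    """Best local alignment score (Smith-Waterman recurrence, linear gap penalty).

    Simplification: real BLAST distinguishes gap opening (GOP) from gap
    extension (GEP); here every gap position costs `gap_penalty`.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        cur = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            cell = max(0, diag, prev[j] - gap_penalty, cur[j - 1] - gap_penalty)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


clone = "CGATAGCCCGCCAGGATAT"   # clone sequence containing one extra base
mrna = "ACGATAGCCCCCAGGATATA"   # corresponding mRNA fragment

low = local_score(clone, mrna, gap_penalty=1)    # cheap gaps: one gap bridges the error
high = local_score(clone, mrna, gap_penalty=10)  # expensive gaps: the alignment breaks up
print(low, high)
```

With a penalty of 1, the 18 matching bases are joined across a single gap (score 17). With a high penalty the gap is no longer worth opening, so the hit is reduced to a shorter ungapped stretch.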
Translation Problems
6-Frame translation

>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct

Frame 1: L A L * P S S Q H E G S H C S G A
Frame 2: * H S D L A V N M K A L I V L G
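A six-frame translation of the sequence above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the standard genetic code is assumed, and incomplete trailing codons are simply dropped. The frame labels ("+1" to "+3" for the forward strand, "-1" to "-3" for the reverse complement) are a naming choice, not something defined in the slides.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate all three reading frames on both strands."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


lysozyme_start = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
for frame, protein in six_frame_translation(lysozyme_start).items():
    print(frame, protein)
```

Frames +1 and +2 reproduce the two amino acid rows shown on the slide. Only frame +2 contains the start of the coding sequence (M K A L I V L G), which shows why a single inserted or deleted nucleotide shifts the correct protein into a different frame.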
Common BLAST problems
[Figure: Gene X with introns and exons; splicing to the full mRNA; translation of the mRNA]
Common BLAST problems
[Figure: mRNA divided into a coding region and a non-coding region, with clones derived from the mRNA]
BlastX against protein sequence: 3 possible hit situations
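Common BLAST problems
The 3 possible BlastX hit situations:
– Clone in the coding region: aligns with the protein in 1 of the 6 frames.
– Clone spanning the coding and the non-coding region: part perfect alignment.
– Clone in the non-coding region: yields no protein hit.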
Part II: Databases and annotation
Introduction
Primary database:
– DNA sequence (EMBL, GenBank, …)
– Amino acid sequence (SwissProt, PIR, …)
– Protein structure (PDB, …)
Secondary database:
– Derived from primary DB
– DNA sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)
Structure:
– Flat file databases
Databases
Primary Databases
EMBL:
– DNA sequence
– Human: nucleotides in entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admission.
Primary Databases
SwissProt:
– Amino acid sequence
– Human:
– Contains protein information
– SwissProt (EU), PIR (USA)
– Crosslinks to the most informative DBs (PDB, OMIM)
– Part of the UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated
Secondary Databases
TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment: SpTrEMBL → SwissProt
Secondary Databases
UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus sequence
– Species-specific
Secondary Databases
LocusLink:
– Curated sequences
– Descriptive information about genetic loci
RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, protein
– Stable reference for gene identification and characterization.
– High curation
Database Quality?
[Figure: DNA, mRNA and protein levels. EMBL entries pass Submitter and Database Manager; SwissProt entries pass Submitter, Database Manager and Curators.]
How to Annotate?
BlastN against random nucleotide DB
– ESTs
BlastN against structured nucleotide DB (UniGene, RefSeq)
– mRNA hits
– Sometimes not annotated at all
– Best information
Microarrays
Part III: Annotation Techniques
What do we have?
– Probe sequence
– Alignment tools (e.g. BLAST)
– Databases
!?! What to choose ?!?
Annotation
Possibilities?
1. Do it like everyone else does.
2. Make use of the curation properties of certain databases.
Goal: annotate as many genes with as much information as possible (e.g. SwissProt ID)
1st Approach - General
“Done by most array manufacturers”
Step-by-step approach:
– BLAST sequences against a nucleic database (preferably UniGene)
– Extract high quality (HQ) hits (>95%)
– For each HQ hit, search crosslinks.
– Find a well-described (SwissProt) ID for each sequence.
Annotation Techniques
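The steps of this first approach can be sketched as a small filter over tabular BLAST output. This is a hedged illustration, not the pipeline used in the talk: it assumes the standard 12-column tabular BLAST format (query, subject, percent identity, …, bit score), reads the ">95%" threshold as percent identity, and uses hypothetical reporter, UniGene and SwissProt identifiers (`rep1`, `UG_A`, `SP_A`) as well as hypothetical function names.

```python
def best_high_quality_hits(blast_tabular, min_identity=95.0):
    """Keep, per query, the best-scoring hit whose identity exceeds min_identity.

    Assumes 12 tab-separated columns per line: column 1 = query ID,
    column 2 = subject ID, column 3 = percent identity, column 12 = bit score.
    """
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, bitscore = float(fields[2]), float(fields[11])
        if identity <= min_identity:
            continue  # not a high quality (HQ) hit
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def annotate(hits, crosslinks):
    """Follow crosslinks from the hit ID to a SwissProt ID (None if absent)."""
    return {query: crosslinks.get(subject) for query, subject in hits.items()}


# Hypothetical example data: three reporters blasted against a UniGene-like DB.
example_output = "\n".join([
    "rep1\tUG_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t360",
    "rep1\tUG_B\t96.0\t150\t6\t0\t1\t150\t1\t150\t1e-60\t250",
    "rep2\tUG_C\t90.0\t180\t18\t0\t1\t180\t1\t180\t1e-40\t200",
    "rep3\tUG_D\t99.0\t220\t2\t0\t1\t220\t1\t220\t1e-110\t400",
])
unigene_to_swissprot = {"UG_A": "SP_A"}  # hypothetical crosslink table

hits = best_high_quality_hits(example_output)
print(annotate(hits, unigene_to_swissprot))
```

In this sketch `rep2` drops out at the identity filter and `rep3` keeps its UniGene hit but has no SwissProt crosslink, which mirrors the gap between UG-ID and SP-ID coverage that this approach can leave.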
1st Approach - Concept
2nd Approach - General
“Make use of present database curation”
Other way around:
– Use SwissProt to clean out EMBL
– Result: “cleaned” EMBL database (cEMBL) with direct SP crosslinks
– BLAST against cEMBL
– Extract high quality alignment hits (>95%)
– Convert EMBL ID to SP ID.
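The second approach can be sketched in the same spirit. This is only an illustration under stated assumptions: the data structures, function names and identifiers (`EMBL_1`, `SP_A`, `rep1`) are hypothetical, SwissProt records are assumed to list the EMBL entries they crosslink to, the ">95%" threshold is read as percent identity, and hits are assumed to arrive sorted best-first.

```python
def build_cleaned_embl(embl_sequences, swissprot_to_embl):
    """Keep only EMBL entries that a SwissProt record crosslinks to.

    embl_sequences:    {embl_id: nucleotide sequence}
    swissprot_to_embl: {sp_id: [embl_id, ...]} taken from SwissProt crosslinks
    Returns {embl_id: (sequence, sp_id)}, i.e. a "cleaned" EMBL (cEMBL).
    """
    embl_to_sp = {}
    for sp_id, embl_ids in swissprot_to_embl.items():
        for embl_id in embl_ids:
            embl_to_sp.setdefault(embl_id, sp_id)
    return {
        embl_id: (sequence, embl_to_sp[embl_id])
        for embl_id, sequence in embl_sequences.items()
        if embl_id in embl_to_sp
    }


def hits_to_swissprot(hits, cembl, min_identity=95.0):
    """Convert high quality cEMBL hits directly into SwissProt IDs.

    hits: iterable of (reporter, embl_id, percent_identity), best hit first.
    """
    annotation = {}
    for reporter, embl_id, identity in hits:
        if identity > min_identity and embl_id in cembl:
            annotation.setdefault(reporter, cembl[embl_id][1])
    return annotation


# Hypothetical example data.
embl = {"EMBL_1": "ACGT", "EMBL_2": "GGCC", "EMBL_3": "TTAA"}
swissprot = {"SP_A": ["EMBL_1"], "SP_B": ["EMBL_3"]}
cembl = build_cleaned_embl(embl, swissprot)

blast_hits = [("rep1", "EMBL_1", 99.0), ("rep2", "EMBL_3", 91.0), ("rep3", "EMBL_3", 97.5)]
print(hits_to_swissprot(blast_hits, cembl))
```

Because every sequence left in cEMBL already carries a SwissProt crosslink, a high quality hit translates into an SP ID in one step, at the cost of losing reporters whose only matches are uncurated EMBL entries.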
2nd Approach - Concept
Annotating Incyte Reporters
Total:
cEMBL-approach: (21,47%) SP-IDs
DM approach: (74,18%) UG-IDs, in which
– M = (34,9%) SP-IDs
– MR = (38,1%) SP-IDs
– MRH = (49,2%) SP-IDs
Results
Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Total: reporters
Old annotation: (97,6%) UG-IDs, in which
– Non-existing UG-IDs = (59,5%)
– M = (20,2%) SP-IDs
– MR = (21,8%) SP-IDs
– MRH = (26,9%) SP-IDs
Datamining approach: (88,9%) UG-IDs, in which
– M = (43,2%) SP-IDs
– MR = (38,1%) SP-IDs
– MRH = (60,1%) SP-IDs
Custom EMBL-approach: (30,2%) SP-IDs
Annotating Incyte Reporters
Combined methods, “Incyte Mouse UniGene 1” reporters
Total: reporters
No annotation: (11%) reporters
Annotated with SP-ID: (61,3%) reporters, of which
– (22,7%) identical SP-IDs
– 532 (5%) reporters with improved SP-IDs by EMBL-method
– 174 (1,8%) reporters with different mouse SP-IDs
– 5 reporters found only by EMBL-method
Conclusions
Annotation is much needed
Array sequences can point to different genes
Direct translation into protein is not the best option:
– Sequencing errors
– Addition or deletion of nucleotides
– 6-frame window
Public nucleotide databases are redundant:
– Sequencing errors
– Differences in sequence length
– Attachment of vector sequence
Questions? End