Probe Selection for Microarrays: Considerations and Pitfalls
Kay Hofmann, MEMOREC Stoffel GmbH, Cologne/Germany


Probe selection wish list
Probe selection strategy should ensure
• Biologically meaningful results (The truth...)
• Coverage, Sensitivity (... The whole truth...)
• Specificity (... And nothing but the truth)
• Annotation
• Reproducibility

Technology
Probe immobilization
• Oligonucleotide coupling: synthesis with linker, covalent coupling to surface
• Oligonucleotide photolithography
• ds-cDNA coupling: cDNA generated by PCR, nonspecific binding to surface
• ss-cDNA coupling: PCR with one modified primer, covalent coupling, 2nd strand removal
Spotting
• With contact (pin-based systems)
• Without contact (ink jet technology)

Technology-specific requirements
General
• Not too short (sensitivity, selectivity)
• Not too long (viscosity, surface properties)
• Not too heterogeneous (robustness)
• Degree of importance depends on method
Single strand methods (Oligos, ss-cDNA)
• Orientation must be known
• ss-cDNA methods are not perfect
• ds-cDNA methods don’t care

Probe selection approaches
[Diagram: approaches placed between the two axes Accuracy and Throughput. Labels: Selected Gene Regions, Selected Genes, Anonymous ESTs, Cluster Representatives]

Non-Selective Approaches
EST spotting
• Using clones from a library after sequencing
• Little justification, since sequence availability allows selection
Anonymous (blind) spotting
• Using clones from a library without prior sequencing
• Only clones with interesting expression pattern are sequenced
• Normalization of library highly recommended
• Typical uses:
  • HT-arrays of ‘exotic’ organisms or tissues
  • Large-scale verification of DD clones

Spotting of cluster representatives
Sequence Clustering
• For human / mouse / rat EST clones: public cluster libraries
  • Unigene (NCBI)
  • THC (TIGR)
• For custom sequence: clustering tools
  • STACK_PACK (SANBI)
  • JESAM (HGMP)
  • PCP (Paracel, commercial)

A benign clustering situation

In the absence of 5'-3' links: two clusters corresponding to one gene!

Overlap too short: three clusters corresponding to one gene!

Chimeric ESTs: one cluster corresponding to two genes!

Chimeric ESTs, continued
• Chimeric ESTs are quite common
• Chimeric ESTs are a major nuisance for array probe selection
• One of the fusion partners is usually a highly expressed mRNA
• Double-picking of chimeric ESTs can fool even cautious clustering programs
• Unigene contains several chimeric clusters
• The annotation of chimeric clusters is erratic
• Chimeric ESTs can be detected by genome comparison
• There is one particularly bad class of chimeric sequences that will be the subject of the exercises

How to select a cluster representative
• If possible, pick a clone with completely known sequence
• Avoid problematic regions
  • Alu-repeats, B1, B2 and other SINEs
  • LINEs
  • Endogenous retroviruses
  • Microsatellite repeats
• Avoid regions with high similarity to non-identical sequences
• In many clusters, orientation and position relative to ORF are unknown and cannot be selected for
• Test selected clone for sequence correctness
• Test selected clone for chimerism
• Some commercial providers offer sequence-verified UNIGENE cluster representatives
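One of the problematic regions named above, microsatellite repeats, can be flagged with a very simple pattern search. The following is a minimal sketch, not a replacement for a dedicated repeat-masking tool: the function name `find_microsatellites` and the thresholds (repeat units of 1 to 6 bases, at least 5 tandem copies) are assumptions chosen for illustration. SINEs, LINEs and retroviral sequences cannot be found this way and need a comparison against a repeat library.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return (start, end, unit) for each tandem repeat found in seq.

    Hypothetical helper: the unit-length range and min_repeats are
    illustrative assumptions, not values from the slides.
    """
    # A short unit (lazy match, so the shortest unit wins) followed by
    # at least (min_repeats - 1) further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]

if __name__ == "__main__":
    candidate = "GGT" + "CA" * 6 + "TTG"
    for start, end, unit in find_microsatellites(candidate):
        print(f"repeat of '{unit}' at {start}-{end}")
```

Coordinates are 0-based and end-exclusive, so they can be used directly to slice or mask the sequence.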

Selection of genes
• If possible, use all of them
• Biased selection
  • Selection by tissue
  • Selection by topic
  • Selection by visibility
  • Selection by known expression properties
• Selection from unbiased pre-screen
• Use sources of expression information
  • EST frequency
  • Published array studies
  • SAGE data

Selection of gene regions
[Figure labels: 5' UTR, ORF, 3' UTR]

Alternative polyadenylation

• Constitutive polyA heterogeneity
  • 3’-fragments: reduced sensitivity
  • No impact on expression ratio
• Regulated polyA heterogeneity
  • Fragment choice influences expression ratio
  • Multiple fragments necessary
• Detection of cryptic polyA signals
  • Prediction (AATAAA)
  • Polyadenylated ESTs
  • SAGE tags
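The prediction step listed above (searching for the AATAAA hexamer) can be sketched in a few lines. This is only an illustration under stated assumptions: `find_polya_signals` is a hypothetical helper name, and a hexamer match is merely a candidate signal. As the slide indicates, candidates should be backed by polyadenylated ESTs or SAGE tags before a fragment is chosen.

```python
import re

def find_polya_signals(seq, signal="AATAAA"):
    """Return 0-based start positions of each occurrence of signal in seq.

    Hypothetical helper for illustration; accepts DNA or RNA letters.
    """
    seq = seq.upper().replace("U", "T")
    # A lookahead is used so that overlapping occurrences are also reported.
    return [m.start()
            for m in re.finditer("(?=" + re.escape(signal) + ")", seq)]

if __name__ == "__main__":
    utr = "GCTAGCAATAAAGTCTGACCTTGGAATAAATTTGC"
    print(find_polya_signals(utr))
```

More than one hit in a 3' UTR is a hint that fragment choice may matter, which is the situation described under regulated polyA heterogeneity.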

Alternative splicing

• Constitutive splice form heterogeneity
  • Fragment in alternative exon: reduced sensitivity
  • No impact on expression ratio
• Regulated splice form heterogeneity
  • Fragment choice influences expression ratio
  • Multiple fragments necessary
• Detection of alternative splicing events
  • Hard/impossible to predict
  • EST analysis (beware of pre-mRNA)
  • Literature

Alternative promoter usage

Alternative promoter usage
• What is the desired readout?
  • If promoter activity matters most: multiple fragments
  • If overall mRNA level matters most: downstream fragment
• Detection of alternative promoter usage
  • Prediction difficult (possible?)
  • EST analysis
  • Literature

UDP-Glucuronosyltransferases UGT1A8 UGT1A7

Selection of gene regions
• Coding region (ORF)
  • Annotation relatively safe
  • No problems with alternative polyA sites
  • No repetitive elements or other funny sequences
  • Danger of close isoforms
  • Danger of alternative splicing
  • Might be missing in short RT products
• 3’ untranslated region
  • Annotation less safe
  • Danger of alternative polyA sites
  • Danger of repetitive elements
  • Less likely to cross-hybridize with isoforms
  • Little danger of alternative splicing
• 5’ untranslated region
  • Close linkage to promoter
  • Frequently not available

A checklist
• Pick a gene
• Try to get a complete cDNA sequence
• Verify sequence architecture (e.g. cross-species comparison)
• Mask repetitive elements (and vector!)
• If possible, discard 3’-UTR beyond first polyA signal
• Look for alternative splice events
• Use remaining region of interest for similarity searches
• Mask regions that could cross-hybridize
• Use the remaining region for probe amplification or EST selection
• When working with ESTs, use sequence-verified clones
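The last steps of the checklist amount to finding what is left after masking. A minimal sketch of that step follows, assuming the masking itself has already been done by other tools and that masked positions appear as `N` (hard masking) or as lowercase letters (soft masking). The helper names and the 200 bp minimum length are assumptions for illustration; the minimum is borrowed from the fragment size used in the exercises.

```python
import re

def unmasked_regions(masked_seq, min_len=200):
    """Return (start, end) of every unmasked run of at least min_len bases.

    Only uppercase A/C/G/T counts as unmasked, so both 'N' and
    lowercase (soft-masked) positions are treated as masked.
    """
    return [(m.start(), m.end())
            for m in re.finditer(r"[ACGT]+", masked_seq)
            if m.end() - m.start() >= min_len]

def best_probe_region(masked_seq, min_len=200):
    """Return the longest unmasked run, or None if none is long enough."""
    regions = unmasked_regions(masked_seq, min_len)
    if not regions:
        return None
    return max(regions, key=lambda r: r[1] - r[0])

if __name__ == "__main__":
    # Toy sequence: 240 bp clean, 50 bp masked, 120 bp clean,
    # 10 bp masked, 300 bp clean.
    seq = "ACGT" * 60 + "N" * 50 + "ACGT" * 30 + "N" * 10 + "ACGT" * 75
    print(unmasked_regions(seq))
    print(best_probe_region(seq))
```

Picking the longest run is only one possible policy; position relative to the ORF or the 3' end may matter more, as discussed on the gene-region slides.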

Exercises
1) Assume that you are interested in the p53 homolog p63, also known as Ket (TrEMBL: Q9UE10). What kind of fragment(s) would you use for expression analysis? Why?
2) The cytochrome P450 family is very important for toxicological microarray analysis, since most isoforms respond to different toxic compounds. Is it possible to design a cDNA fragment (minimal size 200 bp) that would be able to separate CYP2A6 and CYP2A7? What is the situation with CYP1A1 and CYP1A2? What region should be used?
3) Check whether probes for p53 (Swissprot: P53_HUMAN), p63 and p73 (P73_HUMAN) are available on the Affymetrix human 35K chip or the mouse 12K chip. Check whether there are sequence-verified clones available from Research Genetics.
4) Two (hypothetical) papers using different types of microarrays report very different results for the regulation of the thyroid receptor alpha-2 (Swissprot: THA2_HUMAN). Can you think of a possible explanation? What could you do to resolve this issue?

Tools for Exercises
1) Literature search with PubMed
2) Sequence search & retrieval (SwissProt, Entrez)
3) BLAST searches at SIB. Use a specific subdatabase! Mind the ‘repsim’ filter
4) Two-way sequence alignment
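For the isoform question in exercise 2, the output of a two-way alignment can be scanned for the window in which the two sequences differ most. The sketch below assumes the two sequences have already been aligned by an external tool and are given as equal-length strings with `-` for gaps. The function name `window_identity` is hypothetical, and the default window of 200 alignment columns simply mirrors the 200 bp minimum from the exercise (columns are not the same as bases when gaps are present).

```python
def window_identity(aln_a, aln_b, window=200):
    """Return (start, identity) for every window of aligned columns.

    aln_a and aln_b are rows of a pairwise alignment (same length,
    '-' for gaps). A column counts as identical only if both rows
    carry the same non-gap letter.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [a == b and a != "-"
               for a, b in zip(aln_a.upper(), aln_b.upper())]
    results = []
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column,
            # drop the one that fell out on the left.
            running += matches[start + window - 1] - matches[start - 1]
        results.append((start, running / window))
    return results

if __name__ == "__main__":
    a = "A" * 300
    b = "A" * 100 + "C" * 200
    # The window with the lowest identity is the best candidate
    # for a fragment that separates the two sequences.
    print(min(window_identity(a, b), key=lambda r: r[1]))
```

A low-identity window is only a starting point; the candidate fragment still has to pass the repeat and cross-hybridization checks from the checklist.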