Dr Andrew Harrison, Departments of Mathematical Sciences and Biological Sciences, University of Essex: Looking for signals in tens of thousands of GeneChips

Dr Andrew Harrison, Departments of Mathematical Sciences and Biological Sciences, University of Essex (harry@essex.ac.uk). Looking for signals in tens of thousands of GeneChips.

There are more than 10^5 GeneChip experiments in the public domain, which cost about $10^9 to produce. Extracting further information from this resource will be very cost-effective.

Microarray informatics at Essex University
Departments of Mathematical Sciences and Biological Sciences

Faculty (degrees in …)
Dr Andrew Harrison: Physics
Professor Graham Upton: Statistics
Dr Berthold Lausen: Statistics
+ Dr Hugh Shanahan (Royal Holloway): Physics

PhD students
Farhat Memon: Computer Science
Anne Owen: Mathematics
Fajriyah Rohmatul: Statistics

Alumni
Dr Jose Arteaga-Salas: Statistics
Dr Renata Camargo: Computer Science
Dr Caroline Johnston: Molecular Biology and Bioinformatics
Dr William Langdon: Computer Science and Physics
Dr Joanna Rowsell: Mathematics
Dr Olivia Sanchez-Graillet: Computer Science and Bioinformatics
Dr Maria Stalteri: Inorganic Chemistry and Bioinformatics
+ 4 former MSc students

Current MSc and UG students
Aleksandra Iljina: Statistics and Data Analysis
Lina Hamadeh: Statistics and Data Analysis
Madalina Ghita: Mathematics

There is a huge multiple-testing problem. m = log2(Fold Change), a = log2(Average Intensity). What can be learnt from comparing different experiments?

Perfect Match (PM), Mismatch (MM). The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene (Harrison, Johnston and Orengo, 2007, BMC Bioinformatics, 8: 195).
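The m and a quantities defined above can be computed per probe as in the following minimal sketch. The function name `ma_values` and its two arguments are illustrative, not from the talk. It takes "Average Intensity" to be the arithmetic mean of the two intensities, as the slide label suggests; many packages instead average the two log2 intensities, so treat that choice as an assumption.

```python
import math


def ma_values(intensity_x, intensity_y):
    """Return (m, a) for one probe measured in two conditions.

    m = log2(fold change), a = log2(average intensity).
    Assumption: the average is the arithmetic mean of the raw intensities.
    """
    m = math.log2(intensity_x / intensity_y)
    a = math.log2((intensity_x + intensity_y) / 2)
    return m, a


# A probe four times brighter in condition x than in condition y.
m, a = ma_values(400.0, 100.0)
print(m, a)
```

Plotting m against a for every probe shows how the spread of fold changes depends on intensity, which is one reason a single fold-change threshold applied to tens of thousands of probes gives many false positives.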

Some genes are represented by multiple probe-sets (Probe-set A, Probe-set B). If they are measuring the same thing, the signals should be up- and down-regulated together. Is that always true? No (Stalteri and Harrison, 2007, BMC Bioinformatics, 8:13).

Probes map to different exons. Alternative splicing may cause some exons to be upregulated and others to be downregulated.

Genes come in pieces. But exons do not. Multiple probes mapping to the same exon should measure the same thing.

We are studying the correlations in expression across >6,000 GeneChips (HG-U133A), sampling RNA from many tissues and phenotypes.

Figure: the correlations in log2 intensities between probes in probeset 208772_at on the HG-U133A array. The number in each square is the correlation ×10 (blue = low correlation, yellow = high correlation). Axis labels: probe order along the gene; average intensity in GEO. The correlation calculated for PM probes 9 and 11, the data in the earlier scatter plot, is reported as 8 (0.76 multiplied by 10 and rounded).
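The entries of such a heat map can be reproduced with the sketch below, which computes the Pearson correlation between every pair of probes across chips, multiplies by 10 and rounds. The names `pearson` and `heatmap_entries` and the toy data are illustrative; the talk does not say which correlation coefficient was used, so Pearson is an assumption here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def heatmap_entries(log2_intensities):
    """Return the matrix of round(10 * correlation) for all probe pairs.

    log2_intensities holds one list per probe, with one value per GeneChip.
    """
    n = len(log2_intensities)
    return [
        [round(10 * pearson(log2_intensities[i], log2_intensities[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): three probes measured on four chips.
probes = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in heatmap_entries(probes):
    print(row)

# The worked example from the slide: a correlation of 0.76 is shown as 8.
print(round(10 * 0.76))
```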

This probeset shows no coherent correlations amongst its probes.

Some probesets clearly have outliers.

Probes 1-11 all map to the same exon. This is a different probeset mapping to the same exon – there seems to be one outlier.

The outliers are correlated with each other!

Virtually all of the probes in the group have runs of guanines within their 25 bases:

TCCTGGACTGAGAAAGGGGGTTCCT
GAGACACACTGTACGTGGGGACCAC
GGTAGACTGGGGGTCATTTGCTTCC

There is little sequence similarity between the probes, and they are from probe-sets picking up different biology, yet they are correlated!
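Finding such probes is straightforward to automate. The sketch below reports the longest run of contiguous Gs in a probe sequence, using the three sequences from the slide as input. The function name `longest_g_run` is illustrative.

```python
from itertools import groupby


def longest_g_run(sequence):
    """Return the length of the longest run of contiguous Gs in a sequence."""
    return max(
        (len(list(run)) for base, run in groupby(sequence.upper()) if base == "G"),
        default=0,
    )


probes = [
    "TCCTGGACTGAGAAAGGGGGTTCCT",
    "GAGACACACTGTACGTGGGGACCAC",
    "GGTAGACTGGGGGTCATTTGCTTCC",
]
for seq in probes:
    print(seq, longest_g_run(seq))
```

Each of the three probes contains a run of at least four Gs.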

Comparing probes with runs of Gs.

Number of contiguous Gs    Mean correlation
3                          0.14
4                          0.42
5                          0.49
6                          0.62
7                          0.75

We are only looking at a small fraction of the entire probe, yet it is dominating the effects across all experiments.

All probes within a cell have the same sequence, so a run of guanines will result in closely packed DNA with just the right properties to form G-quadruplexes (Upton et al. 2008, BMC Genomics, 9, 613).

Figure: G-quadruplexes.

How do we deal with known outliers such as G-quadruplexes? What is the best way to calculate expression in the presence of outliers?

G-stacks bias which genes are reported to be clustered together within published experiments.

Probes containing GCCTCCC will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridization (Kerkhoven et al. 2008, PLoS ONE 3(4): e1980).
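Probes carrying this motif can be set aside before analysis. The sketch below is a plain substring check on the probe sequence exactly as the slide states it, with no handling of strand or reverse complement. The function name `split_by_motif`, the probe identifiers and the example sequences are all hypothetical.

```python
PRIMER_SPACER_MOTIF = "GCCTCCC"


def split_by_motif(probes, motif=PRIMER_SPACER_MOTIF):
    """Split a {probe_id: sequence} mapping into (flagged, clean) mappings.

    A probe is flagged when its sequence contains the motif.
    """
    flagged, clean = {}, {}
    for probe_id, sequence in probes.items():
        target = flagged if motif in sequence.upper() else clean
        target[probe_id] = sequence
    return flagged, clean


# Hypothetical probe sequences, for illustration only.
example = {"probe_1": "AAAGCCTCCCTTT", "probe_2": "ACGTACGTACGT"}
flagged, clean = split_by_motif(example)
print(sorted(flagged), sorted(clean))
```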

Log(magnitude) of averaged probe values, colour coded by size. Note the perimeter of bright-dark pairs. Cell (0,0) contains a probe which does not measure any biology.

Corner correlations (correlations with values in cell (0,0)). Numbers are correlations times 10 (red: greater than 0.8). Negative correlations appear as blanks. Filled circles indicate probes not listed in the CDF file. Large circles indicate correlations greater than 0.8.

Correlations with cell (0,0). Being in the opposite corner has not reduced the correlations of the interior row and column.

What is in the sheep pens? Figure panels: entries are log(mean(Intensity)); entries are correlations with cell (0,0). Sheep!

Many thousands of probes are correlated with each other simply because they are adjacent to bright probes. We believe that the focus of the scanner may be responsible – regions adjacent to bright spots will gain the same fraction of light. A comparison of many images at different levels of blurriness will appear to indicate that dark regions adjacent to bright regions are correlated in their intensities.
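The argument above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the analysis from the talk. Each image has its own blur fraction. Two dark cells each sit next to a bright cell and gain that fraction of the bright cell's light. A third dark cell has no bright neighbour. The intensities, the noise level and the range of blur are all made-up parameters.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_blur(n_images=500, bright=10000.0, max_blur=0.05, seed=0):
    """Return (corr between two dark cells near bright cells,
               corr between a dark cell near a bright cell and an isolated one).
    """
    rng = random.Random(seed)
    near_a, near_b, isolated = [], [], []
    for _ in range(n_images):
        # Assumption: one blur fraction per image, shared by every cell.
        blur = rng.uniform(0.0, max_blur)
        # Dark cells have independent signal plus light leaked from a neighbour.
        near_a.append(rng.gauss(100.0, 10.0) + blur * bright)
        near_b.append(rng.gauss(100.0, 10.0) + blur * bright)
        # A dark cell with no bright neighbour gains nothing from blur.
        isolated.append(rng.gauss(100.0, 10.0))
    return pearson(near_a, near_b), pearson(near_a, isolated)


corr_near, corr_isolated = simulate_blur()
print(corr_near, corr_isolated)
```

In this toy model the two dark cells next to bright cells are strongly correlated across images even though their underlying signals are independent, while the isolated dark cell is not.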

A CEL file contains information about the ID of the scanner as well as the date on which the image was scanned. How does the impact of blur change over time for each scanner? (Upton and Harrison, 2010, Stat Appl Genet Mol Biol, 9(1), Article 37)

How best to transform a DAT image into a CEL file? We are testing whether ideas from astronomy are applicable. We are checking whether the temporal patterns in scanner performance for human and other organisms are related.

Bioinformatix, Genomix, Mathematix, Physix, Statistix, Transcriptomix are needed in order to extract reliable information from Affymetrix GeneChips.

Thank you for your attention.
