Published by Wilfrid Parks. Modified over 6 years ago.
1
The Most Informative Spacing Statistic Identifies Biologically Relevant Patterns in Transcript Level Distributions
Stan Pounds
Department of Biostatistics, St. Jude Children’s Research Hospital
2
Unsupervised Gene Discovery
In cancer, abnormalities in the DNA of a gene impact its RNA expression levels and downstream biological processes. Affected genes have bimodal expression with modes for the affected and unaffected individuals. A gene may be silenced by deletion, activated by a genomic fusion, or have its expression increased by amplification (duplication) in a subset of individuals. These genomic alterations are very relevant to the development, classification, treatment, and prognosis of the disease.
3
Unsupervised Gene Discovery
Cancer biologists often wish to use only RNA expression data to identify these biologically relevant genes in a manner that is not informed (“biased”) by other data sources. They often rank genes by the value of a simple variability metric, such as the standard deviation (SD) or the median absolute deviation (MAD). The SD and MAD may not recognize bimodality in the expression of a biologically relevant gene. Statisticians may consider using the dip test for bimodality, but it may not be well-powered for biologically relevant bimodality.
4
Spacings
Let $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ represent a set of ordered scalar data values. The spacings (Pyke 1965) are the differences between consecutive ordered data values, defined as $d_{(1)} = x_{(2)} - x_{(1)}$, $d_{(2)} = x_{(3)} - x_{(2)}$, ..., $d_{(n-1)} = x_{(n)} - x_{(n-1)}$. A large spacing that is away from the tails can be an indicator of biologically relevant bimodality or outliers.
Pyke, R. (1965) Spacings. J. R. Stat. Soc., Series B, 27, 395–449.
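As a minimal sketch of the definition above, the spacings can be computed by sorting the values and differencing neighbors. The function name `spacings` and the toy data are illustrative assumptions, not part of the original slides.

```python
def spacings(values):
    """Return the n-1 differences between consecutive ordered values."""
    ordered = sorted(values)  # x_(1) <= x_(2) <= ... <= x_(n)
    # d_(i) = x_(i+1) - x_(i) for i = 1, ..., n-1
    return [upper - lower for lower, upper in zip(ordered, ordered[1:])]


# Hypothetical toy data: five unordered expression values.
example = [5.0, 1.0, 2.0, 9.0, 10.0]
print(spacings(example))  # [1.0, 3.0, 4.0, 1.0]
```

Here the largest spacing (4.0) separates the values 5.0 and 9.0.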
5
Spacings
[Figure: three dot plots, each showing seven ordered values x(1), ..., x(7) and the six spacings d(1), ..., d(6) between them.]
Example 1: Equal-sized spacings, uniform distribution. BORING!
Example 2: Large central spacing implies bimodality. EXCITING!
Example 3: Large tail spacing, interesting but maybe not as exciting.
Pawlikowska et al 2014, Bioinformatics, PubMed
6
Most Informative Spacing Test
The size and placement of a spacing are indicators of potential biological relevance. For each spacing, define the placement-weighted spacing as $v_{(i)} = 2\, d_{(i)}\, \frac{i(n-i)}{n^2}$. The most informative spacing is defined as $v_{\max} = \max_i v_{(i)}$. The size of each spacing is scaled by a weight determined by its placement, so spacings near the center of the distribution are given greater weight. The greatest scaled spacing is the most informative spacing.
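The following sketch illustrates the placement-weighted spacing, reading the slide's formula as 2 d(i) i(n−i)/n². That reading, the function names, and the toy data are assumptions made for illustration; this is not the authors' implementation.

```python
def placement_weighted_spacings(values):
    """Return v_(i) = 2 * d_(i) * i * (n - i) / n**2 for i = 1, ..., n-1."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed to form a spacing")
    weighted = []
    for i in range(1, n):
        d_i = ordered[i] - ordered[i - 1]  # d_(i) = x_(i+1) - x_(i), 1-based
        weighted.append(2.0 * d_i * i * (n - i) / n**2)
    return weighted


def most_informative_spacing(values):
    """Return v_max, the largest placement-weighted spacing."""
    return max(placement_weighted_spacings(values))


# Hypothetical data: the same gap of 10 placed centrally versus in a tail.
central_gap = [0, 1, 2, 3, 13, 14, 15, 16]
tail_gap = [0, 10, 11, 12, 13, 14, 15, 16]
print(most_informative_spacing(central_gap))  # 5.0
print(most_informative_spacing(tail_gap))     # 2.1875
```

With n = 8, a gap of 10 at the center (i = 4) scores 2·10·16/64 = 5.0, while the same gap at the tail (i = 1) scores 2·10·7/64 = 2.1875, which is the intended down-weighting of tail spacings.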
7
Example Application
Compute the SD, MAD, and MIST statistics for each gene in an RNA-seq expression matrix for 264 cases of T-cell acute lymphoblastic leukemia (T-ALL; Liu et al 2017). Rank genes according to the value of the SD, MAD, and MIST statistics. Evaluate the performance of SD, MAD, and MIST in terms of giving a high rank to genes with known genomic differences due to gender (X, Y) and genomic disease-related abnormalities observed in these 264 subjects.
Liu et al 2017, Nature Genetics, PMID
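The ranking step can be sketched on a toy expression matrix. Everything below is an illustrative assumption: the gene names and values are invented, the MAD is left unscaled, and the MIST statistic uses the reading 2 d(i) i(n−i)/n² of the slide's formula. It is not the analysis pipeline used for the T-ALL data.

```python
import statistics


def sd(values):
    """Sample standard deviation."""
    return statistics.stdev(values)


def mad(values):
    """Unscaled median absolute deviation from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def mist(values):
    """Largest placement-weighted spacing, max of 2*d_(i)*i*(n-i)/n**2."""
    ordered = sorted(values)
    n = len(ordered)
    return max(
        2.0 * (ordered[i] - ordered[i - 1]) * i * (n - i) / n**2
        for i in range(1, n)
    )


def rank_genes(expression, statistic):
    """Return gene names ordered from largest to smallest statistic."""
    return sorted(expression, key=lambda g: statistic(expression[g]), reverse=True)


# Hypothetical expression values for three made-up genes across 10 samples.
expression = {
    "GENE_SUBGROUP": [0, 0, 0, 0, 0, 0, 0, 0, 8, 8],   # high in 2 of 10 samples
    "GENE_SPREAD": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],     # evenly spread
    "GENE_FLAT": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0],
}

for name, statistic in [("SD", sd), ("MAD", mad), ("MIST", mist)]:
    print(name, rank_genes(expression, statistic))
```

In this toy example the gene expressed in only two samples has a MAD of zero and is ranked last by MAD, while SD and MIST both rank it first.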
8
Top Genes by Each Method
TLX1: T-cell Leukemia Homeobox 1. Involved in a gene fusion that defines a T-ALL disease subgroup.
NUDT11: Nudix Hydrolase 11. Located on chrX: gender effect. No published link to leukemia.
TMSB15A: Thymosin beta 15a. Located on chrX: gender effect. No published link to leukemia.
RPS4Y1: Ribosomal Protein S4 Y-linked 1. Located on chrY: gender effect. No published link to leukemia.
[Figure: panels labeled MIST, DIP, MAD, and SD showing the top-ranked gene for each method.]
9
XY Chromosome Genes
MIST and DIP most effectively identified genes involved in the true (but benign) gender differences.
10
T-ALL Fusion Genes
MIST most effectively identified fusion genes.
11
T-ALL Deleted Genes
MIST most effectively identified deleted genes.
12
T-ALL Amplified Genes
SD most effectively identified amplified genes.
13
An Amplified Gene
HIST3H2BB was ranked 6th by MAD and 12th by SD.
MIST did not capture this gene because there was no large spacing in the empirical distribution. DIP missed this gene because the trough was relatively high (looks pseudo-uniform).
14
Discussion
In cancer, some biologically relevant genes have a bimodal expression distribution due to underlying genomic alterations. SD and MAD have been used to discover such genes in an “unbiased” manner that is not informed by other data. SD and MAD are not explicitly formulated to detect bimodality, and thus may miss some important genes. The dip test is a classical procedure that was formulated to detect bimodality. More recently, MIST was developed to detect bimodality that is relevant to cancer biology.
15
Discussion
In a T-ALL example, MIST identified gender, fusion, and deleted genes as effectively as or more effectively than the other three methods. In the same example, SD and MAD identified amplified genes more effectively than MIST. No single statistical method was most effective at identifying all forms of bimodal expression exhibited by genes relevant to cancer biology. Further research is needed to develop a statistical method that robustly identifies multiple forms of bimodal expression.
16
Acknowledgements
Thanks to the enthusiastic financial support of St. Jude donors and the tireless efforts of our fundraisers. Thanks to St. Jude patients and families for their participation in research studies.
Charles Mullighan, St. Jude Pathology
Iwona Pawlikowska, Takeda Pharmaceuticals (formerly SJ)