Spanish Inquisition Final Project Week 2 - 4/29/09 Breast Cancer Gene Expression Data Leon Kay, Yan Tran, Chris Thomas
Weka Filtering
Used CFS with BestFirst search.
Reduced the number of attributes from 1544 to 125.
CFS stands for Correlation-based Feature Selection.
Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
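The hypothesis above is what the CFS merit score in [1] quantifies: for a subset of k features, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The following is a minimal sketch on toy data, not the Weka implementation. Using absolute Pearson correlation is an assumption made for brevity and is not necessarily the measure Weka's CFS evaluator applies; the names `pearson`, `cfs_merit`, `f1`, `f1_copy`, `f3`, and `y` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cfs_merit(features, target):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k(k-1) * r_ff).

    Assumption: |Pearson| stands in for the correlation measure.
    """
    k = len(features)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): f1 and f3 are uncorrelated with each other,
# and the target is their sum, so each is equally predictive of it.
f1 = [0, 0, 1, 1]
f3 = [0, 1, 0, 1]
f1_copy = list(f1)          # a perfectly redundant feature
y = [a + b for a, b in zip(f1, f3)]

merit_single = cfs_merit([f1], y)
merit_redundant = cfs_merit([f1, f1_copy], y)
merit_complementary = cfs_merit([f1, f3], y)
```

On this toy data, adding the redundant copy leaves the merit unchanged, while adding the complementary feature raises it, which is the behavior the hypothesis describes.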
CFS Algorithm - Searching
Any search algorithm can be plugged into CFS. The author describes three: forward selection, backward elimination, and best first.
All three are essentially greedy heuristic searches. Searching greedily avoids scoring every possible feature subset, of which there are 2^n for n features.
“Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
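The forward variant of the best-first search quoted above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Weka's code: the function `best_first_forward`, the parameter `max_stale`, and the toy correlation values in `toy_merit` are all hypothetical, and any subset-scoring function could be passed in place of `toy_merit`.

```python
import heapq
from itertools import combinations
from math import sqrt


def best_first_forward(n_features, score, max_stale=5):
    """Forward best-first search over feature subsets.

    Stops after `max_stale` consecutive expanded subsets fail to improve
    on the best subset found so far (five in the quoted description).
    """
    start = frozenset()
    best_subset, best_score = start, score(start)
    # heapq is a min-heap, so scores are negated; the sorted tuple breaks ties.
    open_heap = [(-best_score, (), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)   # most promising open subset
        improved = False
        for f in range(n_features):               # expand: add one feature
            if f in subset:
                continue
            child = subset | {f}
            if child in visited:
                continue
            visited.add(child)
            s = score(child)
            heapq.heappush(open_heap, (-s, tuple(sorted(child)), child))
            if s > best_score + 1e-12:
                best_subset, best_score = child, s
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Hypothetical toy correlations: features 0 and 1 are nearly redundant,
# feature 2 is barely predictive, feature 3 is predictive and independent.
R_CF = [0.7, 0.65, 0.1, 0.6]


def r_ff(i, j):
    return 0.95 if {i, j} == {0, 1} else 0.05


def toy_merit(subset):
    """CFS-style merit computed from the toy correlation tables above."""
    k = len(subset)
    if k == 0:
        return 0.0
    total_cf = sum(R_CF[f] for f in subset)
    if k == 1:
        return total_cf
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = sum(r_ff(i, j) for i, j in pairs) / len(pairs)
    return total_cf / sqrt(k + k * (k - 1) * mean_ff)


selected, merit = best_first_forward(4, toy_merit)
```

With these toy values the search settles on features 0 and 3, skipping the redundant feature 1 and the weakly predictive feature 2. Because the open list keeps every generated subset, the search can back up to an earlier candidate instead of committing to a single path, which is what distinguishes best first from plain forward selection.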
CFS Algorithm Visual Diagram [1]
Accuracy (Error Rate) of algorithms before and after applying CFS/BestFirst filtering
Columns: Before*, After**, Error Rate Reduction
Rows: J48, Bagging (J48), Boosting (J48), Random Forests, SMO (SVM)
* From Week 1: all 1544 attributes
** After applying CFS/BestFirst filtering: 125 attributes
ROC – Receiver Operating Characteristic
ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2]
“one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2]
The area under the ROC curve (AUC) therefore gives a single number for comparing classifiers: the closer a curve sits to the northwest corner, the larger its area.
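To make the ROC and AUC definitions concrete, here is a minimal Python sketch that sweeps a decision threshold over classifier scores to trace the curve, then integrates it with the trapezoid rule. The function names and the toy labels and scores are hypothetical. For a multi-class problem such as the breast cancer subtypes, one common approach (an assumption here) is to treat one class as positive and all others as negative, giving one AUC per class.

```python
def roc_points(labels, scores):
    """Return ROC points as (fp_rate, tp_rate), from (0, 0) to (1, 1).

    labels: 1 for positive, 0 for negative. scores: higher means more positive.
    Instances with tied scores are grouped into a single step.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point each time the threshold drops to a new score value.
        if prev_score is not None and score != prev_score:
            points.append((fp / neg, tp / pos))
        if label:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy example: three positives, three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
curve = roc_points(labels, scores)
area = auc(curve)   # 8 of the 9 positive/negative pairs are ranked correctly
```

The AUC equals the fraction of positive/negative pairs in which the positive instance receives the higher score, so a perfect ranking gives 1.0 and a random one gives about 0.5.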
ROC Data – Area under Curve
Columns (classifiers): J48, Bagging (J48), Boosting (J48), Random Forests, SMO (SVM)
Rows (subtypes): Basal-like, Claudin-low, HER2+/ER, Luminal A, Luminal B, Normal Breast-like
Example ROC – Random Forests
MeV Analysis Initial Hierarchical Clustering
Analyze the Cluster
FLJ13710 and GATA3
Expressed at low levels in basal-like samples.
Expressed at high levels in luminal samples.
GATA3
GATA3 levels are a known indicator of breast cancer prognosis (basal-like has a worse prognosis than luminal).
GATA3 is associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.
FLJ13710
Mentioned in a paper on finding prognostic signatures for breast cancer.
We couldn’t find any in-depth studies on this gene.
References
1) Mark Hall, “Correlation-based Feature Selection for Machine Learning”.
2) Tom Fawcett, “An introduction to ROC analysis”, doi: /j.patrec (enter into http://dx.doi.org/).
3) Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. cancer.com/content/7/1/49
4) Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”.
5) “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).”
6) Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.”