Classification and Regression Trees for GLAST Analysis: to IM or not to IM?
Toby Burnett
Data Challenge Meeting, 15 June 2003
T. Burnett GLAST Data Challenge Workshop

The problem
Bill is using IM classification and regression tree analysis for:
–Calorimeter validity
–PSF tail suppression
–Background suppression
IM is proprietary and rather expensive ($5K): only UW and UCSC have academic licenses ($500 single; $1K for 10)

Bill’s IM worksheet (PSFAnalysis_14)
[Figure: IM worksheet screenshot, with labels: Input tuple; Training region; Prediction tree; Analyze results]

The Trees: calculate 4 values with 11 nodes
Good calorimeter measurement [1 node]
Vertex vs. 1 track (thin and thick) [2 nodes]
Core vs. tail (thin/thick and vtx/1 trk) [4 nodes]
Prediction of recon direction error [4 nodes]
Example: a Good CAL/Bad Cal prediction node
CalTwrEdge =26.58, CalTwrEdge 3,611.48, CalTrackDoca>3.96, CalXtalRatio 1.76

Bill’s result*
* Flawed by G4 problems

A Solution
IM saves its results as XML files, which are easy to interpret
A new package, “classification”, defines a class classification::Tree that does the following:
–accepts a “lookup” object to obtain a pointer to the double associated with each named quantity
–parses the XML file, creating a tree for each prediction tree found
–returns a value from each tree
Merit creates and fills the new tuple variables, in a new class ClassificationTree, which:
–duplicates the logic defining the 4 categories
–evaluates each of the 4 variables
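
The three steps listed for classification::Tree (look up named quantities, parse the XML into trees, return a value from each tree) can be sketched in a few lines of Python. This is only an illustration of the idea, not the C++ package: the XML layout, the tree names goodCalProb and vertexProb, the leaf values, and the "first child if value < cut" convention are all invented for the sketch, since the real IM schema is not shown in these slides. A zero-argument getter stands in for the pointer to a double.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL layout: element names, tree names and leaf values are invented;
# only the variable names are borrowed from the slides.
EXAMPLE_XML = """
<trees>
  <tree name="goodCalProb">
    <node var="CalTwrEdge" cut="26.58">
      <leaf value="0.9"/>
      <leaf value="0.2"/>
    </node>
  </tree>
  <tree name="vertexProb">
    <node var="CalTrackDoca" cut="3.96">
      <node var="CalXtalRatio" cut="1.76">
        <leaf value="0.8"/>
        <leaf value="0.5"/>
      </node>
      <leaf value="0.1"/>
    </node>
  </tree>
</trees>
"""


class Tree:
    """One prediction tree; calling it evaluates the current event."""

    def __init__(self, element, lookup):
        self.name = element.get("name")
        self._root = self._build(element[0], lookup)

    def _build(self, el, lookup):
        if el.tag == "leaf":
            value = float(el.get("value"))
            return lambda: value
        # Resolve the named quantity once, at construction time.
        getter = lookup(el.get("var"))
        cut = float(el.get("cut"))
        below = self._build(el[0], lookup)   # taken when value < cut (assumed convention)
        above = self._build(el[1], lookup)   # taken otherwise
        return lambda: below() if getter() < cut else above()

    def __call__(self):
        return self._root()


def load_trees(xml_text, lookup):
    """Create one Tree per <tree> element found in the XML text."""
    root = ET.fromstring(xml_text)
    return {t.get("name"): Tree(t, lookup) for t in root.iter("tree")}


# The "tuple row" for the current event; the getters always read the latest values.
event = {"CalTwrEdge": 0.0, "CalTrackDoca": 0.0, "CalXtalRatio": 0.0}


def lookup(name):
    if name not in event:
        raise KeyError(name)
    return lambda: event[name]


trees = load_trees(EXAMPLE_XML, lookup)
event.update(CalTwrEdge=10.0, CalTrackDoca=2.0, CalXtalRatio=3.0)
results = {name: tree() for name, tree in trees.items()}
```

Because the getters are bound once when the trees are built, filling the event dictionary with the next event and calling each tree again is all that is needed per event, which mirrors the pointer-based lookup described above.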

Current Procedure
Bill releases an IM file.
I strip it down, removing nodes not required for analysis
–size reduced by half, to 500 KB
Rename it, and check it in to CVS as classification/xml/PSF_Analysis.xml
Create a tuple with merit, containing the new tuple quantities
Feed that tuple to this IM worksheet, which writes a new tuple with both versions

Results: the good
The comparisons used generated 100 MeV events at normal incidence
The vertex classification (used to select the vertex vs. 1-track direction estimate) agrees perfectly, as does the core vs. tail classification

Results: the bad
The output of the “regression tree” that predicts the PSF error shows two populations!
The agreement is rather poor for the “thin vertex” category; otherwise perfect.
An explanation: Bill generated two different trees from different data sets, of 1000 and 243 events. (The latter has only two nodes and can only generate 3 values.)
–The merit evaluation uses only the first tree
–The IM evaluation uses an average of the two trees
–Note that there are three branches
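
The effect of the two evaluation rules can be seen in a toy example. The sketch below is purely illustrative: the two "trees" are hand-written step functions with invented cuts and leaf values (the second has two nodes and three possible outputs, like the smaller tree described above), and a single integer input stands in for the real tuple variables.

```python
def tree_a(x):
    # Stand-in for the tree trained on the larger sample (invented cuts and values).
    if x < 5:
        return 0.1
    if x < 10:
        return 0.2
    if x < 15:
        return 0.3
    return 0.4


def tree_b(x):
    # Stand-in for the smaller tree: two nodes, so only three possible values.
    if x < 6:
        return 0.1
    if x < 12:
        return 0.2
    return 0.5


def first_tree_only(trees, x):
    """Evaluation rule that ignores every tree after the first."""
    return trees[0](x)


def average_of_trees(trees, x):
    """Evaluation rule that averages the predictions of all trees."""
    return sum(t(x) for t in trees) / len(trees)


trees = [tree_a, tree_b]
events = range(20)

# Split the events into those where the two rules agree and those where they do not.
agree, disagree = [], []
for x in events:
    delta = average_of_trees(trees, x) - first_tree_only(trees, x)
    (agree if abs(delta) < 1e-12 else disagree).append(x)
```

Wherever the two trees happen to give the same value the two rules coincide, and elsewhere the averaged prediction is shifted by half the difference between the trees. That is enough to produce separate populations in a scatter plot of one evaluation against the other.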

Results: the ugly
This is the comparison of the prediction for a good energy measurement
Again, Bill created two trees, which are apparently being averaged.

Observations
Fixing the “disagreement”
–Bill: will train only one tree
–me: average all the trees
Using IM to train the classification or regression trees
–The current procedure is exploratory
–If we decide to use these trees in the final analysis, they must be trained systematically
–Another possibility (idea from Tracy): use the classification/regression analysis in S-PLUS, which manages tree objects.

S-PLUS
No question about academic licenses ($100 per license at UW)
Linux version available
Open source alternative: R
Scriptable, also callable from C++
Supports the same classification and regression tree functions (we think!)

Fit a Regression or Classification Tree
DESCRIPTION:
Grows a tree object from a specified formula and data.
USAGE:
tree(formula, data=<<see below>>, weights=<<see below>>, subset=<<see below>>, na.action=na.fail, method="recursive.partition", control=<<see below>>, model=NULL, x=F, y=T, ...)
REQUIRED ARGUMENTS:
formula
a formula expression as for other regression models, of the form `response ~ predictors'.

Status
Work done by a summer student
–Explore classification tree with random x, y in (0,1); good = x<y; see validity plot at right
–Explore regression tree: feed it x, y=x^2, have it create a predictor for y.
In progress: direct comparison
–Choose the GoodCAL category: ifelse((EvtMcEnergySigma > -5. ), "GoodCAL","BadCal")
–Use IM (v2) to create a classification with the independent variables used by Bill.
–Write the results to a file for S-PLUS
Next steps:
–Run the same analysis in S-PLUS, compare
–Establish procedures to construct tree predictions with R or S-PLUS
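
The regression-tree exercise above (feed in x with y = x^2 and let the tree build a predictor for y) can be reproduced with a small recursive-partitioning sketch. This is a generic CART-style illustration in plain Python, not the S-PLUS or IM algorithm: the stopping parameters min_size and max_depth, the sample size, and the random seed are arbitrary choices made for the example.

```python
import random


def grow(points, min_size=20, depth=0, max_depth=4):
    """Grow a regression tree from (x, y) pairs by recursive partitioning.

    Returns ("leaf", mean_y) or ("node", cut, left_subtree, right_subtree).
    """
    n = len(points)
    ys = [y for _, y in points]
    mean = sum(ys) / n
    if depth >= max_depth or n < 2 * min_size:
        return ("leaf", mean)

    points = sorted(points)
    total = sum(ys)
    total_sq = sum(y * y for y in ys)
    left_sum = left_sq = 0.0
    best = None  # (summed squared error, split index)
    for i in range(1, n):
        y = points[i - 1][1]
        left_sum += y
        left_sq += y * y
        if i < min_size or n - i < min_size:
            continue  # keep at least min_size points on each side
        if points[i - 1][0] == points[i][0]:
            continue  # cannot cut between identical x values
        # Squared error about the mean on each side of the candidate cut.
        sse_left = left_sq - left_sum ** 2 / i
        sse_right = (total_sq - left_sq) - (total - left_sum) ** 2 / (n - i)
        if best is None or sse_left + sse_right < best[0]:
            best = (sse_left + sse_right, i)

    if best is None:
        return ("leaf", mean)
    i = best[1]
    cut = 0.5 * (points[i - 1][0] + points[i][0])
    return ("node", cut,
            grow(points[:i], min_size, depth + 1, max_depth),
            grow(points[i:], min_size, depth + 1, max_depth))


def predict(tree, x):
    """Walk the tree down to a leaf and return its mean."""
    while tree[0] == "node":
        _, cut, left, right = tree
        tree = left if x < cut else right
    return tree[1]


random.seed(1)
data = [(x, x * x) for x in (random.random() for _ in range(1000))]
tree = grow(data)
max_error = max(abs(predict(tree, k / 100) - (k / 100) ** 2) for k in range(101))
```

The prediction is a staircase approximation to the parabola, so max_error shrinks as max_depth is raised. The same comparison of predicted and true y is the kind of check that could be repeated with trees produced by IM, S-PLUS or R.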