RooFit – Open issues
W. Verkerke
Datasets
Current class structure
Data representation
– RooAbsData (abstract base class)
– RooDataSet (unbinned [weighted] data)
– RooDataHist (binned data)
Data storage
– RooAbsDataStore (abstract base class)
– RooTreeDataStore (TTree-based storage), used by both RooDataSet and RooDataHist
– RooCompositeDataStore, used by RooDataSet when combining external datasets with Link() rather than Import()
– Since there are 2 concrete implementations, most RooFit code is already adapted to the concept that the storage type is not necessarily tree-based (e.g. virtual copy construction through clone functions etc.)
Open issues in datasets - storage
Project: New STL vector-based storage implementation
– May be (much) faster than the TTree-based datastore
Work needed
– Develop new class RooVectorDataStore
– Inherits from RooAbsDataStore, implements the full functionality of RooTreeDataStore (including support for append/merge/rename operations and storing of 'cache' columns). Must be persistable, and must support string and category data types as well
– Workload: 3 days of work
– Once done, need to add a cmdline option to RooDataSet/Hist to use this alternate storage technique [easy]
– Workload: 0.5 days of work
– Add a new stressRooFit test module that exercises this type of storage
– Workload: 0.5 days of work
– Need to validate that RooCompositeDataStore works fine with RooVectorDataStores (should be OK)
– Workload: 0.5 days of work
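To make the vector-based idea concrete, here is a minimal sketch in plain C++ of a column-wise store behind an abstract storage interface. It is not RooFit code: the names AbsStore and VectorStore and all their methods are hypothetical stand-ins, only double-valued columns are handled, and persistence, category/string columns and cache columns are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an abstract storage interface (not RooAbsDataStore).
class AbsStore {
public:
  virtual ~AbsStore() = default;
  virtual void fill(const std::map<std::string, double>& row) = 0;
  virtual double get(std::size_t entry, const std::string& column) const = 0;
  virtual std::size_t numEntries() const = 0;
  virtual std::unique_ptr<AbsStore> clone() const = 0;  // virtual copy construction
};

// Column-wise storage: one contiguous std::vector<double> per column.
class VectorStore : public AbsStore {
public:
  explicit VectorStore(const std::vector<std::string>& columns) {
    for (const auto& name : columns) columns_[name];  // create empty columns
  }

  // Add one row; throws std::out_of_range if the row lacks a column,
  // so that all columns always keep the same length.
  void fill(const std::map<std::string, double>& row) override {
    for (const auto& col : columns_) row.at(col.first);  // validate first
    for (auto& col : columns_) col.second.push_back(row.at(col.first));
    ++size_;
  }

  double get(std::size_t entry, const std::string& column) const override {
    return columns_.at(column).at(entry);
  }

  std::size_t numEntries() const override { return size_; }

  std::unique_ptr<AbsStore> clone() const override {
    return std::unique_ptr<AbsStore>(new VectorStore(*this));
  }

  // Append all rows of another store with the same column layout.
  void append(const VectorStore& other) {
    assert(&other != this);  // self-append is not supported in this sketch
    for (const auto& col : columns_) other.columns_.at(col.first);  // validate layout
    for (auto& col : columns_) {
      const auto& src = other.columns_.at(col.first);
      col.second.insert(col.second.end(), src.begin(), src.end());
    }
    size_ += other.size_;
  }

private:
  std::map<std::string, std::vector<double>> columns_;
  std::size_t size_ = 0;
};
```

Looping over one column then touches contiguous memory, which is the property that would be expected to make such a store faster than a tree-based one; whether it actually is would need to be measured.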
Open issues in datasets - representation
Request for new mixed binned-unbinned data representation type
Work needed
– Fixed feature is a 'master category' variable that indexes the various data subsets.
– Write class RooMixedData to represent this.
– Need to work out the precise functionality and interface of such a class
   Several concepts of binned data are not available for unbinned data and vice versa (see next slide)
– Could make a class that only implements the common aspects (as defined in RooAbsData), but in practice it would only be usable as a read-only class. OK?
– Is (typed) access to the component representation needed, i.e. do you need to be able to see subset [i] as a RooDataHist or RooDataSet? This is not handled by the composite storage scheme, but could be added as a separate layer: RooMixedData owns multiple RooDataHist and RooDataSet objects that each own their own storage, and then links their storage objects to a RooCompositeDataStore for a unified view.
– Workload: ~1 week (depending on what design/interface issues will appear…)
Functionality of RooDataSet/RooDataHist

Operation             | RooDataHist                                              | RooDataSet
----------------------|----------------------------------------------------------|---------------------------------------------
add(RooArgSet)        | Increase weight of corresponding bin                     | Add data point
append(RooAbsData)    | Add all points                                           | Add all points
merge(RooDataSet)     | UNDEFINED                                                | Add columns from imported dataset
addColumn(RooAbsArg)  | UNDEFINED                                                | Add columns with values of given function
set(RooArgSet&,dbl)   | Set weight of given point to given value                 | UNDEFINED
binVolume(RooArgSet&) | Return volume of bin in given (subset of) dimensions     | UNDEFINED
weightError()         | Return error on given weight()                           | UNDEFINED
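The asymmetry in the table can be illustrated with a small sketch in plain C++. The classes BinnedData and UnbinnedData below are hypothetical one-dimensional stand-ins, not the RooFit classes: for the binned container add() increases a bin weight and set() and binWidth() are meaningful, while for the unbinned container add() stores a new point and there is no bin to set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-dimensional binned container.
class BinnedData {
public:
  BinnedData(std::size_t nBins, double lo, double hi)
      : lo_(lo), hi_(hi), weights_(nBins, 0.0) {
    assert(nBins > 0 && hi > lo);
  }

  // add(): increase the weight of the bin containing x.
  // Returns false if x falls outside [lo, hi).
  bool add(double x, double weight = 1.0) {
    if (x < lo_ || x >= hi_) return false;
    std::size_t bin = static_cast<std::size_t>((x - lo_) / binWidth());
    bin = std::min(bin, weights_.size() - 1);  // guard against rounding at the upper edge
    weights_[bin] += weight;
    return true;
  }

  // set(): overwrite the weight of a given bin (no unbinned equivalent).
  void set(std::size_t bin, double weight) { weights_.at(bin) = weight; }

  double weight(std::size_t bin) const { return weights_.at(bin); }
  double binWidth() const { return (hi_ - lo_) / weights_.size(); }
  std::size_t numBins() const { return weights_.size(); }

private:
  double lo_;
  double hi_;
  std::vector<double> weights_;
};

// Hypothetical one-dimensional unbinned container.
class UnbinnedData {
public:
  // add(): store a new data point.
  void add(double x) { points_.push_back(x); }
  std::size_t numEntries() const { return points_.size(); }

private:
  std::vector<double> points_;
};
```

A class restricted to the operations both containers share would expose little more than iteration and event counts, which is why a common-interface-only mixed class would in practice be read-only.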
Open issues in datasets - representation
Representation of number-counting data
Now
– Regular PDF: Gauss(x), with data RooDataSet(x) with N entries
– Extended PDF: Gauss(x)*Poisson(N), with data RooDataSet(x) with N entries
– Number-counting PDF: should be (in analogy) Poisson(N), with data RooCountingData( ) with N entries, but we don't have that.
– Can do: Poisson(N), with data RooDataSet(N) with 1 entry, but that doesn't (automatically) behave in the right way.
– Also requires some thinking on the PDF side…
– Two ways to go
Open issues in datasets - representation
Path #1 (Kyle proposal)
– Need to label (any) pdf explicitly as a 'number counting' pdf
– Effect is that generate() fills a dataset with 1 entry representing the event count, rather than N entries of a dummy observable where the dataset size represents the event count
– Possible issue: the special meaning of counting data is only clear in the context of the (labeled) pdf that generated it, unless the data is also labeled itself in some way. [ E.g. when calculating the total event count of a composite dataset, need to know if a RooDataSet with 1 entry counts as 1 or as N; similar issue when asking for the event count of a component dataset ]
Path #2 (My original proposal)
– Make a wrapper class that represents any pdf as a number-counting pdf, e.g. class RooCountingPdf, e.g. ws.factory("CountingPdf::Nexp(Poisson(Nobs,mu))") ;
– Net effect of the class is to redirect the output of RooAbsPdf::getVal() to RooAbsPdf::expectedEvents()
   Returns an object of type RooCountingData when generate() is called
– Requires writing a class RooCountingData, which can be extremely lightweight & fast (just contains 1 double)
– Adapt class RooMixedData to be able to also contain RooCountingData
– Data and pdf are both self-labeling in terms of interpretation. Should be straightforward to use this in existing RooFit code [ but need to check if there is code that assumes at least one 'observable' ]
Workload: either way 2-3 days
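The 'counts as 1 or as N' question can be sketched in plain C++. The types below (AbsData, UnbinnedData, CountingData) and the functions totalEvents and generateCount are hypothetical illustrations, not RooFit classes; the sketch only shows how self-labeling data lets a composite compute its total event count without consulting the pdf, and how generation of counting data reduces to a single Poisson draw.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Hypothetical common interface: every data object reports its own event count.
struct AbsData {
  virtual ~AbsData() = default;
  virtual double sumEntries() const = 0;
};

// Unbinned data: the event count is the number of stored points.
struct UnbinnedData : AbsData {
  std::vector<double> points;
  double sumEntries() const override {
    return static_cast<double>(points.size());
  }
};

// Counting data: a single stored double that *is* the event count.
struct CountingData : AbsData {
  explicit CountingData(double n) : count(n) {}
  double count;
  double sumEntries() const override { return count; }
};

// Total event count of a composite. No special-casing is needed because
// each component knows how to interpret its own content.
double totalEvents(const std::vector<std::unique_ptr<AbsData>>& parts) {
  double total = 0.0;
  for (const auto& part : parts) total += part->sumEntries();
  return total;
}

// Generating number-counting data: one Poisson draw with the expected yield,
// stored as one number instead of N dummy entries.
template <class Rng>
CountingData generateCount(double expected, Rng& rng) {
  std::poisson_distribution<long> poisson(expected);
  return CountingData(static_cast<double>(poisson(rng)));
}
```

With a plain dataset holding the count in a single row, totalEvents would need outside knowledge (a label on the data or the generating pdf) to avoid counting that component as one event.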
Conceptual issues with simultaneous pdf / data
Need more flexibility in mixing/matching different pdfs, e.g. sim[ F(x), G(y) | i ]
– Will work technically, but the fundamental issue is that the meaningful observables depend on the index i
– Unwanted side effects of the present construction:
   generate() will make a random y variable for the generation of F(x), and a random x variable for the generation of G(y).
   Datasets will always allocate entries for x and y for both dataset subsets (results in a waste of space, especially if x,y are binned)
Need several items to resolve this
– Composite datasets, where each subset only stores selected observables [ need: a mechanism to specify this ]
– A mechanism in RooSimultaneous::generate() to only generate the "relevant" observables for each state [ need: same mechanism to specify this ]
– Will need to change RooSimultaneous in any case to store its output in a composite datastore [ not done now ] to gain the needed flexibility
Conceptual issues with simultaneous pdf / data
Composite datasets will most likely be used only in conjunction with RooSimultaneous, so that p.d.f. is likely the most sensible point to make this interface, e.g.
   ws.factory("SIMUL::model[idx,a=pdfA(x),b=pdfB(y)]")
then internally modify RooSimultaneous::generate() to follow the instructions accordingly.
Also need new syntax to construct RooDataSets in this way
   RooDataSet ds("ds","ds",RooArgSet(x,y,i),Index(i), Import(dataA,"a",x), Import(dataB,"b",y)) ;
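The storage side of index-dependent observables can be sketched in plain C++ as a composite keyed by index label, where each subset keeps only its own columns. Subset, CompositeData and addSubset are hypothetical names chosen for illustration; addSubset plays a role loosely analogous to the Import(data,"label",observables) argument in the proposed constructor syntax, and nothing here reflects an existing RooFit interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical subset: stores only the observables relevant for one index state.
struct Subset {
  std::vector<std::string> observables;   // e.g. {"x"} for state "a"
  std::vector<std::vector<double>> rows;  // one value per observable per row
};

// Hypothetical composite dataset indexed by the state label of a category.
class CompositeData {
public:
  // Register the data for one index state; each row must match the
  // subset's own observable list, not a global one.
  void addSubset(const std::string& label, Subset subset) {
    for (const auto& row : subset.rows) {
      if (row.size() != subset.observables.size())
        throw std::invalid_argument("row width does not match observables");
    }
    subsets_[label] = std::move(subset);
  }

  // Total number of events over all index states.
  std::size_t numEntries() const {
    std::size_t n = 0;
    for (const auto& s : subsets_) n += s.second.rows.size();
    return n;
  }

  // Number of doubles actually stored: no padding for irrelevant observables.
  std::size_t storedValues() const {
    std::size_t n = 0;
    for (const auto& s : subsets_)
      n += s.second.rows.size() * s.second.observables.size();
    return n;
  }

  // Does the given index state carry the given observable?
  bool has(const std::string& label, const std::string& obs) const {
    const auto& names = subsets_.at(label).observables;
    for (const auto& name : names)
      if (name == obs) return true;
    return false;
  }

private:
  std::map<std::string, Subset> subsets_;
};
```

For three events in state "a" with observable x and two events in state "b" with observable y, this layout stores five values, where a flat layout that allocates both x and y for every event would store ten.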
Conceptual issues with simultaneous pdf / data
Once the concept of RooMixedData is implemented, can also think about the interface for binned-vs-unbinned datasets
– Construction 'by hand' follows trivially from the ctor
   RooMixedData ds("ds","ds",RooArgSet(x,y,i),Index(i), Import(dataA,"a",x), Import(dataB,"b",y)) ;
– When generating, binned-vs-unbinned is a 'preference' (you can always do it either way)
– Either specify it at generation time (requires a non-trivial interface), or encode the 'preference' inside a RooSimultaneous
   Still requires some creativity to be able to insert this preference spec in the factory
   Otherwise through the class interface: sim.setGenerateBinned("a",kTRUE) ;
Recap of data and simultaneous issues
Project 1
– Make RooVectorDataStore ~ 1 week. Easily factorized/delegated
Project 2
– Adjust RooDataSet/RooDataHist to accept index-dependent observables [ ~2-3 days ]
– Adjust RooSimultaneous to specify 'relevant' observables for each index [ 1 day ]
Project 3
– Make RooCountingData ~ 2-3 days
– Make RooMixedData ~ 2-4 days [ depending on difficulties ]
– Adjust RooSimultaneous to use these
Other issues
Workspaces
– Ability to rename named sets stored in datasets [ 1 hour ]
– Make EDIT() capable of removing terms in PROD terms [ 1 day ]
– Bug in RooHistPdf persistence [ 1-2 days ]
   Time consuming as it requires intervention in the RooAbsArg streamer
– Kyle reported 32/64-bit issues in persistence [ need example ] [ ?? ]
Pdf interface issues
– Port generateSimGlobal() to the generate() interface [ 1 day ]
– Make extendedTerm() return Double_t instead of Int_t to support Asimov datasets [ 0.5 day ]
– Common abstract interface for morphing operator PDFs [ ??? ]
Likelihood interface issues
– What normalization set is applied to constraint terms?
– Need a data/pdf combination scheme that allows detaching a dataset that has already died from an NLL
   Simplifies use of setData() in RooStats [ 1-2 days ]
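The problem of a likelihood that outlives its dataset can be illustrated in plain C++ with std::weak_ptr. This is a swapped-in standard-library technique chosen for illustration, not a description of how RooNLLVar attaches data; the class GaussNll, its unit-width Gaussian model and all method names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <vector>

using Data = std::vector<double>;

// Hypothetical negative log-likelihood (up to a constant) for a unit-width
// Gaussian with mean mu. It observes its dataset without owning it.
class GaussNll {
public:
  explicit GaussNll(std::shared_ptr<const Data> data) : data_(data) {}

  // Swap in a new dataset; works whether or not the old one still exists.
  void setData(std::shared_ptr<const Data> data) { data_ = data; }

  // True while the attached dataset is still alive.
  bool attached() const { return !data_.expired(); }

  // Returns NaN instead of touching freed memory if the dataset is gone.
  double evaluate(double mu) const {
    std::shared_ptr<const Data> data = data_.lock();
    if (!data) return std::nan("");
    double nll = 0.0;
    for (double x : *data) nll += 0.5 * (x - mu) * (x - mu);
    return nll;
  }

private:
  std::weak_ptr<const Data> data_;
};
```

In this sketch a toy loop could delete each generated dataset and call setData() with the next one, reusing the same likelihood object rather than recreating it.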
Addressing RooStats performance issues from RooFit side
Avoid need to (re)create likelihoods
– Modified data/pdf attachment scheme in RooNLLVar that allows datasets to be detached after they have been deleted
   Allows straightforward use of setData() in RooStats [ 2 days ]
Speeding up of dataset looping, creation and deletion
– Vector-based datasets [ ~1 week ]
Copy overhead of complex objects
– Complex defined as having >>100 nodes
– Several optimizations already applied on the RooFit side (hash tables etc. for reconnection lookup). Biggest speed gain most likely in the form of new classes that reduce the number of objects
   Collapse the construction of a pdf for N channels into a single object. Needs some details on use cases, but good progress likely possible in O(2-3 days)
Profiling of RooStats TLimit macro essential