Exploring Artificial Neural Networks to discover the Higgs boson at the LHC
Overview
Introduction
– The Standard Model and the mass problem
– Higgs search at the LHC (and ANNs)
ttH, H to bb and other channels
– Production process
– Decay
– Experimental signatures
– Background processes
ANNs: a possible solution (theory)
ANN development issues (on a simple 2-D classification problem, with results)
ANNs applied to Higgs data (results)
Summary
Introduction
The origin of mass is the last big unanswered question in the SM.
The Standard Model
– To make the physical equations of the SM gauge invariant we require new terms; these correspond directly to gauge bosons (e.g. the photon).
– Massless particles would preserve the SM's gauge symmetry (the easiest option, but not the case).
– The Higgs mechanism allows the generation of mass in the SM by breaking the gauge invariance of the vacuum (spontaneous symmetry breaking).
– It needs a further particle: the Higgs boson.
So the search for the Higgs is important to our understanding of particle interactions. It may be that nature has chosen another mass-generating mechanism, but whatever this mechanism is, it should show itself at the LHC.
Search at the Large Hadron Collider (LHC) (Higgs discovery is one of its main aims)
The Higgs mass is not predicted by the SM, but the production and decay rates can be predicted as a function of $m_H$.
– From LEP: 114.4 GeV < $m_H$ (SM).
– The LHC, with its detectors ATLAS and CMS (due to go online in 2007), will collide p-p at 14 TeV, giving a Higgs mass reach of about 1 TeV.
– High luminosity, but Higgs production is very rare: $10^{16}$ proton-proton interactions will occur per year, but fewer than 100,000 Higgs bosons will be produced.
– As well as the Higgs, the LHC hopes to find evidence for new physics such as Supersymmetry (SUSY). SUSY modifies the SM to include a whole new series of particles, the supersymmetric partners of all the particles known so far. It has many desirable features and mends some shortcomings of the SM. If SUSY is the theory, we do not know how many Higgs bosons we would see (minimum 5).
ttH, H to bb Channel
H production processes:
– Gluon fusion, gg to H, is the dominant Higgs production process (but it is difficult to separate the signal from the large QCD background).
– Associated production, ttH: lower cross section, but has leptonic final states.
Dominant decay mode at $m_H < 130$ GeV is H to bb.
(Figure labels: bb, WW, ZZ)
ttH, H to bb could account for half the Higgs discovery potential at ATLAS (Cammin)
Background:
– ttjj (most important, 94% after TDR analysis)
Full reconstruction of the final state is necessary to minimize combinatorial background and to discriminate signal from the large background.
TDR analysis has 3 steps:
Preselection
- 1 isolated lepton,
- At least 6 jets,
- Exactly 4 tagged as b-jets.
Reconstruction
- Reconstruction of 2 top quarks, minimise: $\Delta^2 = (m_{l\nu b} - m_t)^2 + (m_{jjb} - m_t)^2$
Cuts on the reconstructed t and H masses (where ANNs come in)
- Reconstructed top masses must be within 20 GeV of $m_t$.
- $m_{bb} = m_H \pm 30$ GeV
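The jet-pairing step above can be sketched in a few lines. This is only an illustration under stated assumptions, not the TDR code: the leptonic W four-vector is taken as already reconstructed, jets are plain `(E, px, py, pz)` tuples in GeV, the nominal top mass `M_TOP = 175.0` is an assumed value, and the function names are hypothetical.

```python
import itertools
import math

M_TOP = 175.0  # GeV; assumed nominal top mass for this sketch


def inv_mass(*vectors):
    """Invariant mass of the sum of four-vectors given as (E, px, py, pz)."""
    e, px, py, pz = (sum(v[i] for v in vectors) for i in range(4))
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against small negative rounding


def delta2(m_lnub, m_jjb, m_top=M_TOP):
    """Delta^2 = (m_lnub - m_t)^2 + (m_jjb - m_t)^2."""
    return (m_lnub - m_top) ** 2 + (m_jjb - m_top) ** 2


def best_pairing(w_lep, b_jets, light_jets):
    """Try every jet assignment and keep the one with the smallest Delta^2.

    Returns (delta2, leptonic b index, hadronic b index, light-jet index pair).
    """
    best = None
    for i_lep, i_had in itertools.permutations(range(len(b_jets)), 2):
        for j1, j2 in itertools.combinations(range(len(light_jets)), 2):
            m_lnub = inv_mass(w_lep, b_jets[i_lep])
            m_jjb = inv_mass(light_jets[j1], light_jets[j2], b_jets[i_had])
            d2 = delta2(m_lnub, m_jjb)
            if best is None or d2 < best[0]:
                best = (d2, i_lep, i_had, (j1, j2))
    return best
```

The two b-jets left unassigned by the winning pairing would then be the candidates for the H to bb decay.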
After this TDR analysis, significance $S/\sqrt{B} = 1.94$ (for a 120 GeV Higgs)
Could increase significance by:
– Better jet pairing
– Improving the 'final selection' (after t reconstruction, applied to events in $m_{bb} = m_H \pm 30$ GeV)
Applying ANNs is promising because it makes use of event topology, not just mass cuts. I.e. minimising $\Delta^2$ does not take into account additional information such as spatial differences between jets.
I looked at the final selection:
– Used 10 variables generated by Pythia (which gave separation between the signal and background distributions).
– Fed the variables into a neural network (to classify each event as signal or background).
ANNs
Artificial Neural Networks (ANNs) are computational modelling tools inspired by the biological nervous system.
Good at:
– generalization,
– non-linear mappings,
– learning by example.
We want to train the network with examples to recognise the right data (a classification task) and reject the rest.
(In theory ANNs perform better than cut-based methods because they can separate classes in feature space non-linearly, but training is difficult and optimisation is harder than for cut-based methods.)
(Figure: two plots in the (x1, x2) plane)
How do ANNs Work?
(Figures: a neural node; a neural network with inputs $x_k$, weights $w_{jk}$, hidden nodes $h_j$, weights $w_{ij}$ and outputs $o_i$)
Response function: $o_i = g\left(\sum_j w_{ij}\, g\left(\sum_k w_{jk} x_k\right)\right)$
This is non-linear, so the network is able to perform non-linear mappings.
Architecture and weight settings are what change the classification.
We want the network to output 1 for signal and 0 for all background.
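As a minimal sketch of the response function, assuming a sigmoid for g and no bias terms (the slide formula has none), the forward pass through one hidden layer could look like this. The function and argument names are illustrative only.

```python
import math


def sigmoid(a):
    """Logistic activation, assumed here as the non-linear function g."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hidden, w_out, g=sigmoid):
    """Evaluate o_i = g(sum_j w_ij * g(sum_k w_jk * x_k)).

    x        : list of inputs x_k
    w_hidden : w_hidden[j][k] = w_jk (input k -> hidden node j)
    w_out    : w_out[i][j]    = w_ij (hidden node j -> output i)
    Returns (h, o): hidden activations h_j and outputs o_i.
    """
    h = [g(sum(w_jk * x_k for w_jk, x_k in zip(row, x))) for row in w_hidden]
    o = [g(sum(w_ij * h_j for w_ij, h_j in zip(row, h))) for row in w_out]
    return h, o
```

For example, `forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0]])` evaluates a 2-2-1 network whose weights are all zero, so every node sits at the midpoint of the sigmoid.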
Weights are changed in proportion to the difference (error) between the target output and the actual network output for each example.
Minimize the summed square error function with respect to the weights: $E = \frac{1}{2} \sum_p \sum_i \left(o_i^{(p)} - t_i^{(p)}\right)^2$
The error is a function of all the weights and forms an irregular, multidimensional surface with many peaks, saddle points and minima.
The error is minimized by finding the set of weights that corresponds to the global minimum (i.e. outputs close to 1 for signal and close to 0 for background).
This is done with the gradient descent method (weights are incrementally updated in proportion to $-\partial E/\partial w_{ij}$).
(Figure: Error Surface)
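For reference, the gradients used by gradient descent follow from the chain rule. The pre-activations $a_i$ and $a_j$ are notation introduced here and are not on the original slide; g is any differentiable activation.

```latex
\begin{align}
a_j^{(p)} &= \sum_k w_{jk}\, x_k^{(p)}, & h_j^{(p)} &= g\big(a_j^{(p)}\big), \\
a_i^{(p)} &= \sum_j w_{ij}\, h_j^{(p)}, & o_i^{(p)} &= g\big(a_i^{(p)}\big), \\
\frac{\partial E}{\partial w_{ij}}
  &= \sum_p \big(o_i^{(p)} - t_i^{(p)}\big)\, g'\big(a_i^{(p)}\big)\, h_j^{(p)}, \\
\frac{\partial E}{\partial w_{jk}}
  &= \sum_p \Big[\sum_i \big(o_i^{(p)} - t_i^{(p)}\big)\, g'\big(a_i^{(p)}\big)\, w_{ij}\Big]\,
     g'\big(a_j^{(p)}\big)\, x_k^{(p)}.
\end{align}
```

For a sigmoid, $g'(a) = g(a)\,(1 - g(a))$, so both gradients can be computed from the node outputs alone.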
Summary of learning algorithm
1. Initialize $w_{ij}$ and $w_{jk}$ with random values.
2. Pick pattern p from the training set.
3. Present the input and calculate the output from: $o_i = g\left(\sum_j w_{ij}\, g\left(\sum_k w_{jk} x_k\right)\right)$
4. Update the weights according to:
   $w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}$
   $w_{jk}(t+1) = w_{jk}(t) + \Delta w_{jk}$
   where $\Delta w = -\eta\, \partial E/\partial w$ (and so on for extra hidden layers).
5. Repeat from step 2. When no change (within some accuracy) occurs, the weights are frozen and the network is ready to use on data it has never seen.
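The learning algorithm can be sketched as a short online backpropagation loop. This is an illustrative sketch, not the SNNS implementation: it assumes a sigmoid activation, adds a constant bias input to each layer, and runs for a fixed number of epochs instead of testing for convergence. All names and default values are hypothetical.

```python
import math
import random


def g(a):
    """Sigmoid activation (assumed)."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_jk, w_ij):
    """Return (inputs + bias, hidden + bias, outputs) for one pattern."""
    xb = x + [1.0]  # constant bias input (assumption)
    h = [g(sum(w * v for w, v in zip(row, xb))) for row in w_jk] + [1.0]
    o = [g(sum(w * v for w, v in zip(row, h))) for row in w_ij]
    return xb, h, o


def sse(patterns, w_jk, w_ij):
    """Summed square error E = 1/2 sum_p sum_i (o_i - t_i)^2."""
    return 0.5 * sum((o - t) ** 2
                     for x, target in patterns
                     for o, t in zip(forward(x, w_jk, w_ij)[2], target))


def train(patterns, n_hidden=2, eta=0.5, epochs=2000, seed=1):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # Step 1: random initial weights (last column is the bias weight).
    w_jk = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_hidden)]
    w_ij = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
            for _ in range(n_out)]
    for _ in range(epochs):
        # Step 2: present the patterns one at a time, in shuffled order.
        for x, target in rng.sample(patterns, len(patterns)):
            # Step 3: forward pass.
            xb, h, o = forward(x, w_jk, w_ij)
            # dE/da at each node, using g'(a) = g(a) * (1 - g(a)).
            d_out = [(o_i - t_i) * o_i * (1.0 - o_i)
                     for o_i, t_i in zip(o, target)]
            d_hid = [sum(d_out[i] * w_ij[i][j] for i in range(n_out))
                     * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # Step 4: w(t+1) = w(t) + dw with dw = -eta * dE/dw.
            for i in range(n_out):
                for j in range(n_hidden + 1):
                    w_ij[i][j] -= eta * d_out[i] * h[j]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    w_jk[j][k] -= eta * d_hid[j] * xb[k]
    return w_jk, w_ij
```

Calling `train(patterns, epochs=0)` returns the untrained starting weights for the same seed, which gives a baseline for `sse` before and after training.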
2-D problem
Initially looked at a simple ANN classification problem:
– Separate out a single point in a 2-D plane of randomly generated numbers.
Generated 2 sets of random numbers.
Fed the network (using SNNS; 2 inputs, 1 output) (show diagram) examples of signal and background data (desired output 1 and 0 respectively).
Used 300 patterns in both the training and validation sets. Background to signal ratio was 3 to 1.
Looked at various net architectures.
Results:
– Learning shown by error curves
– Projections show hyperplanes
– 3 hidden nodes solve the classification task fully (effectively, 1 hidden node is the equivalent of 1 linear hyperplane)
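To illustrate why three hidden nodes can be enough, here is a hand-built 2-3-1 network (the weights are chosen by hand for this sketch, not trained, and are not the weights found with SNNS). Each hidden node acts as one straight-line cut, and the output node fires only when all three cuts are passed, which encloses a triangle in the plane. The region, steepness and names are assumptions made for the example.

```python
import math


def g(a):
    """Sigmoid activation (assumed)."""
    return 1.0 / (1.0 + math.exp(-a))


# Each row is (w_x, w_y, bias): one hidden node = one line in the plane.
HIDDEN = [
    (50.0, 0.0, -15.0),    # on when x > 0.3
    (0.0, 50.0, -15.0),    # on when y > 0.3
    (-50.0, -50.0, 70.0),  # on when x + y < 1.4
]
# Output node: on only if all three hidden nodes are on (sum of h > 2.5).
OUT_WEIGHT, OUT_BIAS = 20.0, -50.0


def net(x, y):
    """Network output for a point (x, y); close to 1 inside the triangle."""
    h = [g(wx * x + wy * y + b) for wx, wy, b in HIDDEN]
    return g(OUT_WEIGHT * sum(h) + OUT_BIAS)
```

With only two hidden nodes the selected region would be an open wedge, so a point could not be fenced off on all sides.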
– Some error curves showed spiking behaviour, indicating inconsistent learning (updating of weights).
This was solved by adjusting some network parameters, which made learning more stable:
– Learning parameter, η.
– dmax.
– Shuffle option.
To get a deeper understanding of learning, also looked at weight and bias variables.
Using ANNs for Higgs search
Worked with data after reconstruction of top quarks.
Variables used:
– mbb: the invariant mass of the two b-jets assigned to the Higgs boson,
– Δη(tnear, bb): the difference in pseudorapidity between the bb-system and the reconstructed top quark nearest to it in ΔR,
– cos θ*(b,b): the cosine of the decay angle of the two b-jets from the Higgs boson in the rest frame of the bb-system,
– Δη(b,b): the difference in pseudorapidity between the two b-jets from the Higgs boson,
– mbb(1): the combination with the smallest invariant mass mbb out of the six combinations which are possible when selecting two b-jets out of four b-jets,
– mbb(2): the combination with the second smallest invariant mass mbb out of the six combinations which are possible when selecting two b-jets out of four b-jets,
– φt1 − φt2: the difference in phi between the reconstructed top quarks,
– pTt1 + pTt2: the sum of the transverse momenta of the reconstructed top quarks.
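Several of these variables are simple functions of the jet four-vectors. The sketch below shows how they might be computed, assuming `(E, px, py, pz)` tuples in GeV; the helper names are hypothetical and this is not the code used in the analysis.

```python
import itertools
import math


def inv_mass(a, b):
    """Invariant mass of two four-vectors (E, px, py, pz)."""
    e, px, py, pz = (a[i] + b[i] for i in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def pt(v):
    """Transverse momentum."""
    return math.hypot(v[1], v[2])


def pseudorapidity(v):
    """eta = asinh(pz / pT), equivalent to -ln tan(theta / 2)."""
    return math.asinh(v[3] / pt(v))


def delta_eta(a, b):
    """Difference in pseudorapidity between two objects."""
    return pseudorapidity(a) - pseudorapidity(b)


def delta_phi(phi1, phi2):
    """Difference in azimuthal angle, wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def sorted_mbb(b_jets):
    """All pairwise b-jet masses in ascending order.

    With four b-jets there are six combinations; element 0 corresponds
    to mbb(1) and element 1 to mbb(2).
    """
    return sorted(inv_mass(a, b) for a, b in itertools.combinations(b_jets, 2))
```

The sum of transverse momenta is then just `pt(t1) + pt(t2)` for the two reconstructed top four-vectors.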
(Figure: signal is shown in red)
(Only ttjj background used)
Rescaled data to [0,1].
Separated data into training and validation sets.
Used 1:1 for signal to background.
Looked at various architectures (1 and 2 hidden layers).
Weak generalization:
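The data preparation described here could be sketched as follows, assuming min-max rescaling per input variable and an equal number of signal and background events in each set. The function names, the 50/50 split and the seed are assumptions for the example.

```python
import random


def minmax_rescale(rows):
    """Rescale every column of `rows` linearly onto [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]


def balanced_split(signal, background, train_fraction=0.5, seed=0):
    """Build training and validation sets with signal:background = 1:1.

    Each returned pattern is (inputs, target), with target 1 for signal
    and 0 for background.
    """
    rng = random.Random(seed)
    n = min(len(signal), len(background))   # equal numbers of each class
    sig = rng.sample(signal, n)
    bkg = rng.sample(background, n)
    n_train = int(n * train_fraction)
    train = ([(x, 1) for x in sig[:n_train]]
             + [(x, 0) for x in bkg[:n_train]])
    val = ([(x, 1) for x in sig[n_train:]]
           + [(x, 0) for x in bkg[n_train:]])
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val
```

Rescaling keeps all inputs in the range where a sigmoid node responds, so no single variable dominates the weighted sums simply because of its units.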
Output for best architecture ( ) gave:
(Figure: signal is shown in red)
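A cut on the network output turns the classifier into an event selection, and the significance S/√B can then be scanned as a function of that cut. The sketch below assumes lists of network outputs for signal and background events, plus optional per-event weights to scale the simulated samples to expected yields. The names and the default cut grid are hypothetical.

```python
import math


def significance_scan(sig_outputs, bkg_outputs,
                      sig_weight=1.0, bkg_weight=1.0, cuts=None):
    """Return [(cut, S / sqrt(B)), ...] for events with output > cut.

    Cuts that leave no background are skipped, because S / sqrt(B)
    is undefined there.
    """
    if cuts is None:
        cuts = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95
    results = []
    for cut in cuts:
        s = sig_weight * sum(o > cut for o in sig_outputs)
        b = bkg_weight * sum(o > cut for o in bkg_outputs)
        if b > 0:
            results.append((cut, s / math.sqrt(b)))
    return results
```

The best working point would be `max(results, key=lambda r: r[1])`, although with small samples the high-cut end of the scan is dominated by statistical fluctuations in B.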
Summary
– Optimisation difficulties and solutions have been identified in net development.
– Some classification was achieved for the Higgs data.
– More work on the architecture could be needed (more data, lack of generalization).
s/√B as a function of the cut on the network output.
ANN development
Require training and validation sets.
Difficulties in optimisation:
– Several factors need to be considered:
  Architecture
  Learning parameters (i.e. step width, local minima)
  Data size
– Finding the optimum (of parameter settings) is largely trial and error (rules of thumb).
(Figure: training and testing error (e.g. SSE) versus number of hidden nodes or training cycles, with the optimum network marked)
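One common rule of thumb for locating the optimum network is to take the point where the validation (testing) error is lowest, before it starts rising again while the training error keeps falling. A minimal sketch, assuming a list of validation errors recorded once per training cycle (or per number of hidden nodes); the function names and the `patience` parameter are hypothetical.

```python
def best_cycle(val_errors):
    """Index of the lowest validation error (the earliest one on ties)."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])


def stop_cycle(val_errors, patience=3):
    """First cycle at which the validation error has failed to improve
    for `patience` consecutive cycles; the last cycle if that never happens.
    """
    best, best_i = float("inf"), -1
    for i, err in enumerate(val_errors):
        if err < best:
            best, best_i = err, i
        elif i - best_i >= patience:
            return i
    return len(val_errors) - 1
```

In practice the weights saved at `best_cycle` would be the ones kept, and `stop_cycle` only decides when to give up training.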