Mining Scientific Papers. DKB Meeting.

Sense mining of scientific papers

Scientific papers are highly structured texts and display specific properties related to their references, but also to their argumentative and rhetorical structure.

ATLAS Notes data mining workflow:

1. List of mapping tags:
– 8 TeV
– 20.3 fb-1
– 2012
– proton-proton
– PYTHIA8
– mc15_13TeV:mc15_13TeV.361091.Sherpa_CT10_WplvWmqq_SHv21_improved.merge.AOD.e4607_s2726_r7267_r6282/
– 309523
– etc.

2. Parse the PDF and convert its text and tables to TXT format

Maria Grigorieva, 01.04.2016

ATL-COM-PHYS-2013-1324.pdf

ATLAS Internal Notes PDFs are generated from LaTeX.

Required text representation of ATLAS Notes

### Title
Search for low-mass and high-mass Higgs-like diphoton resonances with the ATLAS detector at √s = 8 TeV
### Abstract
Abstract: we present the extraction of a 95% C.L. limit on the production cross-section of an additional narrow Higgs-like resonance decaying into two photons, in the 65-600 GeV invariant mass range. By restricting the results to a fiducial cross-section measurement, we provide model-independent limits that can be used to constrain theoretical models predicting additional Higgs-like particles with a narrow width. The dataset uses 20 fb⁻¹ of proton-proton collision data recorded with the ATLAS detector in 2012.
### 1 Introduction
Since the recent discovery of the H(126) Higgs boson at the LHC, the study of the Higgs sector has become an important objective of the ATLAS physics program. One aspect of this program is the study of the properties of the new boson, in particular its couplings to the W and Z gauge bosons, to investigate its role in the mechanism of Electroweak symmetry breaking and the generation of the SM particles masses. Another equally important objective is to investigate the possibility of an extended Higgs sector with additional states. Many models of beyond the Standard Model (BSM) physics require a second scalar particle at higher mass to fulfil unitarity conditions in the WW and ZZ scattering amplitudes at high-energy [1].
### 2 Overview of the Analysis
The analysis described in this note is based on the standard ATLAS H → γγ analysis [4], and retains its main features. In particular, the calibration hit method is still used, and not the multivariate method foreseen for the final run-1 publications. The presence of two well-identified photons is required, and the analysis searches for narrow signal peaks over a smooth background in the spectrum of their invariant mass, mγγ. Both the signal and background are described by templates given by analytic formulae. The signal template is obtained from MC.
#T# Table 1
###T# column titles
Process | Generator | Mass [GeV] | Nevents (×10³) | NWA
###T# Content
ggF VBF WH ZH ttH 13 DRAFT Generator PowHeg+Pythia8 PowHeg+Pythia8 Pythia Pythia Pythia Mass [GeV] 70-75 80-85-90-95-100-105- 110-115 120 125 130-135-140-145 150-160-170-180-190 200-220-240-260-280 300-320-340-360-380 400-420-440-460-480 500-520-540- 560-580 600-650-700-750-800 850-900-950-1000 70-75 80-85-90-100-105-110-115 120 125 130-140-150 200-300-400-500-600 700-800- 900-1000 70-75-80-85-90-100-105-110 115-120-125-130-140-150-200 300-400-500-600 700-800-900-1000 70-75-80-85-90-100-105-110 115-120-125-130-140-150-200 300-400-500-600 700-800-900-1000 70-75-80-85-90-100-105-110 115-120-125-130-140-150-200 300-400- 500-600 700-800-900-1000 Nevents (×10³) NWA 30 no 100 no 300 no 3000 no 100 no 30 yes 30 yes 30 yes 30 no 50 no 100 no 1000 no 50 no 30 yes 30 no 30 yes 30 no 30 yes 30 no 30 yes
###T# Title
Signal samples of the five Higgs production modes, available mass points and whether the NWA width is used or not.
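A text file in this representation can be split back into named sections by looking for the `###` heading markers. The following is only a minimal sketch: it assumes every marker starts its own line, the sample text is a shortened, hypothetical stand-in for a real note, and the function name `split_sections` is invented for illustration. Table markers (`#T#`, `###T#`) are not interpreted; their lines simply stay in the body of the current section.

```python
import re

# Hypothetical, shortened sample in the "###" heading format.
SAMPLE = """### Title
Search for diphoton resonances
### Abstract
We present a limit on the production cross-section.
### 1 Introduction
The study of the Higgs sector is an important objective.
"""

def split_sections(text):
    """Return {heading: body} for every line of the form '### heading'."""
    sections = {}
    heading = None
    for line in text.splitlines():
        # "###" must be followed by whitespace, so "###T# ..." does not match.
        match = re.match(r"^###\s+(.*\S)\s*$", line)
        if match:
            heading = match.group(1)
            sections[heading] = []
        elif heading is not None:
            sections[heading].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}

sections = split_sections(SAMPLE)
```

Keeping the headings as dictionary keys makes it possible to restrict later tag searches to, say, the abstract only.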

ATLAS Notes data mining workflow: PDF text extractors

– XPDF includes a PDF text extractor, a PDF-to-PostScript converter, and various other utilities (http://www.foolabs.com/xpdf/home.html).
– PDFMiner is a Python tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data (https://pypi.python.org/pypi/pdfminer/).
– PDFBox is a Java PDF library for manipulating existing documents and extracting content from them (http://pdfbox.apache.org/).
– PdfTextStream (https://www.snowtide.com/help/extracting-text-from-pdf-documents).
– RUTA Workbench (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.workbench):
  – rules to extract information;
  – speeds up and eases work with UIMA (https://uima.apache.org/).
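As a rough illustration of the first option, the Xpdf text extractor can be driven from Python through its `pdftotext` command-line program. This sketch assumes `pdftotext` is installed and on the PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are hypothetical, and whether the `-layout` option suits ATLAS notes is an open question rather than a tested result.

```python
import subprocess

def build_pdftotext_command(pdf_path, txt_path):
    """Build the argument list for Xpdf's pdftotext command-line tool."""
    # -layout tries to keep the physical page layout;
    # -enc selects the encoding of the output text file.
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]

def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)

# Only the command is built here; call pdf_to_text() to run the conversion.
command = build_pdftotext_command(
    "ATL-COM-PHYS-2013-1324.pdf", "ATL-COM-PHYS-2013-1324.txt"
)
```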

ATLAS Notes data mining workflow:

3. Mapping prepared text

### Title
Search for low-mass and high-mass Higgs-like diphoton resonances with the ATLAS detector at √s = 8 TeV
### Abstract
Abstract: we present the extraction of a 95% C.L. limit on the production cross-section of an additional narrow Higgs-like resonance decaying into two photons, in the 65-600 GeV invariant mass range. By restricting the results to a fiducial cross-section measurement, we provide model-independent limits that can be used to constrain theoretical models predicting additional Higgs-like particles with a narrow width. The dataset uses 20 fb⁻¹ of proton-proton collision data recorded with the ATLAS detector in 2012.
### 1 Introduction
Since the recent discovery of the H(126) Higgs boson at the LHC, the study of the Higgs sector has become an important objective of the ATLAS physics program. One aspect of this program is the study of the properties of the new boson, in particular its couplings to the W and Z gauge bosons, to investigate its role in the mechanism of Electroweak symmetry breaking and the generation of the SM particles masses. Another equally important objective is to investigate the possibility of an extended Higgs sector with additional states. Many models of beyond the Standard Model (BSM) physics require a second scalar particle at higher mass to fulfil unitarity conditions in the WW and ZZ scattering amplitudes at high-energy [1].
### 2 Overview of the Analysis
The analysis described in this note is based on the standard ATLAS H → γγ analysis [4], and retains its main features. In particular, the calibration hit method is still used, and not the multivariate method foreseen for the final run-1 publications. The presence of two well-identified photons is required, and the analysis searches for narrow signal peaks over a smooth background in the spectrum of their invariant mass, mγγ. Both the signal and background are described by templates given by analytic formulae. The signal template is obtained from MC.
#T# Table 1
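The mapping step can be pictured as a search for each known tag in the prepared text. The sketch below is an assumption about how such a search might look, not the project's actual implementation: the tag list is a subset of the example tags, the sample sentence is adapted from the note's title and abstract, and `find_tags` is a hypothetical helper. Tags are matched literally, with a guard so that "8 TeV" is not found inside "18 TeV".

```python
import re

# A few of the example mapping tags from the workflow's tag list.
MAPPING_TAGS = ["8 TeV", "2012", "proton-proton", "PYTHIA8"]

def find_tags(text, tags):
    """Return {tag: [start offsets]} for every tag that occurs in the text."""
    found = {}
    for tag in tags:
        # Literal match that must not be glued to a neighbouring word character.
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        offsets = [match.start() for match in re.finditer(pattern, text)]
        if offsets:
            found[tag] = offsets
    return found

text = (
    "Search for diphoton resonances with the ATLAS detector at √s = 8 TeV. "
    "The dataset uses proton-proton collision data recorded in 2012."
)
found = find_tags(text, MAPPING_TAGS)
```

Literal matching only works if the text has already been cleaned: a tag broken by a stray line break or space will be missed.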

ATLAS Notes data mining workflow:

4. Machine Learning for automatic ATLAS Notes mapping

Data Mining & Linguistic Preprocessing Software

Freeware:
– Natural Language Toolkit (http://www.nltk.org/): NLTK is a leading platform for building Python programs to work with human language data.
– GATE: a full-lifecycle open source solution for text processing (https://gate.ac.uk/).
– UIMA: Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user (https://uima.apache.org/).

Enterprise:
– IBM Watson Content Analytics (https://www.ibm.com/developerworks/ru/library/ba-watson-dictionary/, https://www.redbooks.ibm.com/redbooks/pdfs/sg247877.pdf).

Mining PDF Issues:

Choose the most suitable PDF parser (from the PDF text extractors listed above). Basic requirements:
– extract headers, titles and subtitles,
– extract text paragraphs,
– clean the text of waste elements (page and line numbers, the author list, the table of contents) and ignore figures, formulas, and so on,
– fix split words (for example, "data\nset") and split lines (for example, a whole dataset name should be placed on one line, without any "\n").
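Two of these requirements, dropping page or line numbers and putting each paragraph back on one line, can be sketched as follows. This is a simplified illustration under stated assumptions: a line containing only digits is treated as a page or line number (which would also discard a legitimate number-only line such as a run number), paragraphs are assumed to be separated by blank lines, and `clean_extracted_text` is a hypothetical name. Words split without a hyphen are not repaired here.

```python
import re

def clean_extracted_text(text):
    """Drop number-only lines and unwrap each paragraph onto one line."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d+", stripped):
            continue  # assumed to be a page or line number
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)

raw = (
    "The dataset uses collision data\n"
    "12\n"
    "recorded with the ATLAS detector.\n"
    "\n"
    "13\n"
    "The signal template is obtained from MC.\n"
)
cleaned = clean_extracted_text(raw)
```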

Known issues (tested on the xPdf tool)

Sentences that do not fit on one line are split by CRLF, and the spaces between words are not preserved.

Known issues (tested on the xPdf tool)

– Ideal dataset name representation: the whole name on one line.
– Broken dataset names: the name is split across two or more lines.
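One possible way to repair such names is to glue a line break back together whenever it falls inside something that looks like a dataset name. The sketch below rests on several assumptions that would need checking against real notes: names start with a project prefix of the form "mc" plus digits, contain only letters, digits, "_", "." and ":", and end with "/". The pattern `BROKEN_NAME`, the function `rejoin_dataset_names`, and the break position in the sample are illustrative. A complete name without a trailing "/" at the end of a line would be wrongly joined to the next word.

```python
import re

# Assumed shape of a dataset name fragment followed by a line break (LF or CRLF)
# and the continuation of the name on the next line.
BROKEN_NAME = re.compile(r"(\bmc\d+_[\w.:]+)\r?\n([\w.:]+/?)")

def rejoin_dataset_names(text):
    """Join dataset names that were split across two or more lines."""
    previous = None
    # Repeat until nothing changes, so names broken several times are fixed.
    while previous != text:
        previous = text
        text = BROKEN_NAME.sub(r"\1\2", text)
    return text

broken = (
    "Sample: mc15_13TeV.361091.Sherpa_CT10_WplvWmqq_SHv21_improved.merge.AOD.e460\n"
    "7_s2726_r7267_r6282/\n"
    "was used."
)
fixed = rejoin_dataset_names(broken)
```

The trailing "/" acts as the stop condition: once the joined name ends with it, the following line break is left alone.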
