Logical Structure Recovery in Scholarly Articles with Rich Document Features Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan.

Presentation transcript:

Logical Structure Recovery in Scholarly Articles with Rich Document Features Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan

Logical structure annotation in ForeCiteReader. The view shows the object navigation interface, currently focused on the list of figure captions. 10/27/2015 (slide 2)

Section navigation in the ForeCiteReader environment with generic sections

Overview
Methodology
– Problem Formulation
– Learning Model - CRF
– Approach overview
– Classification categories
Raw-text features
Rich document representation
Experiments
Further analysis

Problem Formulation
Two related subtasks:
Logical structure (LS) classification
– treat a scholarly document as an ordered collection of text lines
– label each text line with a semantic category, e.g. title, author, address
Generic section (GS) classification
– take the header of each section of text in a paper
– deduce the generic logical purpose of the section
→ Both are sequence labeling tasks, so both are modeled with a CRF

Learning Model - CRF
CRF in simplified form, where f stands for both state and transition functions
Equation labels: binary feature, state function, transition function
We use the CRF++ package
The input for line l_i to CRF++ has the form "value_1 … value_m category_i"
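The equation on this slide did not survive in the transcript. For reference only, the standard linear-chain CRF in simplified form is written below; this is an assumption about what the slide showed, not a recovered formula, and the symbols (x for the observed line sequence, y for the label sequence, n for the number of lines) are chosen here for illustration.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
```

Each binary feature f_j is either a state function s_j(y_i, x, i), which ignores the previous label, or a transition function t_j(y_{i-1}, y_i, x, i); the weights λ_j are learned from the labeled lines.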

Approach overview

Classification categories - example

Classification categories – full sets
Logical structure subtask, 23 categories: address, affiliation, author, bodyText, categories, construct, copyright, email, equation, figure, figureCaption, footnote, keywords, listItem, note, page, reference, sectionHeader, subsectionHeader, subsubsectionHeader, table, tableCaption, and title.
Generic section subtask, 13 categories: abstract, categories, general terms, keywords, introduction, background, relatedWork, methodology, evaluation, discussions, conclusions, acknowledgments, and references.

Raw-text features - LS
ParsCit token-level features + our line-level features:
– Location: relative position within the document
– Number: patterns of subsections, subsubsections, categories, footnotes
– Punctuation: patterns of emails & web links; bracket numbering → equation
– Length: 1token, 2token, 3token, 4token, 5+token → identifies the majority of lines as bodyText
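The line-level features listed above could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function names, the `Loc_`/`Punct_` feature strings, the number of location bins, and the regular expressions are all assumptions made for illustration, and lines are assumed to be non-empty.

```python
import re

# Hypothetical patterns; the slide does not specify the exact ones used.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EQ_NUM_RE = re.compile(r"\(\d+\)\s*$")  # bracketed number at end of line


def length_feature(line: str) -> str:
    """Bucket the token count as 1token .. 4token, or 5+token."""
    n = len(line.split())
    return f"{n}token" if n < 5 else "5+token"


def location_feature(index: int, total: int, bins: int = 8) -> str:
    """Quantize the relative position of a line into `bins` buckets (bin count assumed)."""
    return f"Loc_{min(index * bins // max(total, 1), bins - 1)}"


def punctuation_feature(line: str) -> str:
    """Flag email, web-link, and equation-number patterns."""
    if EMAIL_RE.search(line):
        return "Punct_email"
    if URL_RE.search(line):
        return "Punct_url"
    if EQ_NUM_RE.search(line):
        return "Punct_eqNumber"
    return "Punct_none"


def line_features(lines: list[str]) -> list[list[str]]:
    """One feature row per text line, in document order."""
    total = len(lines)
    return [
        [location_feature(i, total), length_feature(line), punctuation_feature(line)]
        for i, line in enumerate(lines)
    ]
```

The Number feature (subsection and footnote numbering patterns) is left out of the sketch; it would follow the same regular-expression approach.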

Raw-text features - GS
Naïve, yet effective features:
– Positions
– First and Second Words
– Whole Header
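A minimal sketch of how these three header features might be extracted, assuming the section headers arrive as a list in document order. The function name, the feature-string prefixes, and the choice to encode position as both a forward and a backward index are assumptions for illustration; the slide only names the feature groups.

```python
def header_features(headers: list[str]) -> list[list[str]]:
    """Build one feature row per section header.

    Position is encoded as a forward index and a backward index (an assumed
    encoding); the first word, second word, and whole header are lowercased.
    """
    rows = []
    n = len(headers)
    for i, header in enumerate(headers):
        words = header.lower().split()
        first = words[0] if words else "NONE"
        second = words[1] if len(words) > 1 else "NONE"
        whole = "-".join(words) if words else "NONE"
        rows.append([
            f"PosFwd_{i}",
            f"PosBwd_{n - 1 - i}",
            f"First_{first}",
            f"Second_{second}",
            f"Whole_{whole}",
        ])
    return rows
```

Joining the whole header with hyphens keeps it a single whitespace-free column, which matters because CRF++ splits columns on whitespace.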

Rich document representation – OCR output
Linearize the XML output into CRF features:
"Don't-Look-Now,-But-We've-Created-a-Bureaucracy. Loc_0 Align_left FontSize_largest Bold_yes Italic_no Picture_no Table_no Bullet_no"
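The linearization step can be sketched as follows. The XML schema here (a `<line>` element with lowercase attributes such as `loc` and `fontsize`) is hypothetical, since the transcript does not show the OCR output format; only the target feature string comes from the slide.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Column order taken from the example feature string on the slide.
FEATURE_ORDER = ["Loc", "Align", "FontSize", "Bold", "Italic", "Picture", "Table", "Bullet"]


def linearize(line_xml: str, label: Optional[str] = None) -> str:
    """Turn one OCR line element (hypothetical schema) into a CRF++-style row.

    The line text becomes a single hyphen-joined token, followed by one
    Name_value column per layout attribute, and the label last if given.
    """
    elem = ET.fromstring(line_xml)
    text = "-".join((elem.text or "").split())
    cols = [text]
    for name in FEATURE_ORDER:
        # Missing attributes fall back to "unknown" (an assumed convention).
        cols.append(f"{name}_{elem.get(name.lower(), 'unknown')}")
    if label is not None:
        cols.append(label)
    return " ".join(cols)


example = (
    "<line loc='0' align='left' fontsize='largest' bold='yes' italic='no' "
    "picture='no' table='no' bullet='no'>"
    "Don't Look Now, But We've Created a Bureaucracy.</line>"
)
row = linearize(example, label="title")
```

Passing `label=None` yields an unlabeled row, which is the shape needed at prediction time.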

Rich document representation – OCR features
Position
– Alignment: left, center, right & justified
– Location: within-page location
Format
– FontSize: quantized based on frequency, e.g. smaller, smaller, base, -2, -1, 0
– Bold
– Italic
Object
– Bullet
– Picture
– Table
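One plausible reading of "quantized based on frequency" is that the most frequent font size in a document is treated as the base size (0) and every other size is ranked relative to it. The sketch below implements that reading only; it is an assumption, not the authors' documented procedure, and the function name is hypothetical.

```python
from collections import Counter


def quantize_font_sizes(sizes: list[float]) -> list[int]:
    """Map raw font sizes to ranks relative to the most frequent (base) size.

    The base size maps to 0, the next smaller distinct size to -1, the next
    larger distinct size to +1, and so on. Assumes `sizes` is non-empty.
    """
    base = Counter(sizes).most_common(1)[0][0]
    distinct = sorted(set(sizes))
    base_rank = distinct.index(base)
    return [distinct.index(size) - base_rank for size in sizes]


ranks = quantize_font_sizes([10, 10, 10, 8, 14, 18, 10])
```

Ranking against the base size makes the feature comparable across documents that are typeset at different absolute sizes.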

Experiments - datasets
LS: 20 ACM, 10 CHI 2008, 10 ACL 2009 – fully labeled
GS: 211 ACM papers – headers labeled
Skewed data

Experiments – metrics
TP: number of correctly classified text lines (true positives); FN, FP, and TN are defined similarly (false negatives, false positives, true negatives)
Category-specific performance:
– F1 measure = 2 × P × R / (P + R); Precision P = TP / (TP + FP); Recall R = TP / (TP + FN)
Overall performance:
– Macro average: average of all category-specific F1 scores
– Micro average: percentage of correctly labeled lines
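The metrics above can be computed from aligned gold and predicted label sequences, as in this minimal sketch. The function name is hypothetical, and two choices are assumptions: the macro average is taken over every category that appears in either sequence, and undefined precision or recall (zero denominator) is scored as 0. The input is assumed to be non-empty.

```python
from collections import Counter


def evaluate(gold: list[str], pred: list[str]):
    """Return (per-category F1, macro average, micro average)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # line labeled correctly
        else:
            fp[p] += 1          # predicted category gains a false positive
            fn[g] += 1          # gold category gains a false negative

    f1 = {}
    for c in sorted(set(gold) | set(pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        total = precision + recall
        f1[c] = 2 * precision * recall / total if total else 0.0

    macro = sum(f1.values()) / len(f1)       # mean of category-specific F1
    micro = sum(tp.values()) / len(gold)     # fraction of correctly labeled lines
    return f1, macro, micro


f1, macro, micro = evaluate(
    ["title", "author", "bodyText", "bodyText"],
    ["title", "bodyText", "bodyText", "bodyText"],
)
```

In this small example the micro average (0.75) exceeds the macro average (0.6) because the missed category is rare, which is the same effect skewed data has on the two metrics.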

Experiments – LS results
LS_PC: baseline using only ParsCit features
LS_PC+RT: LS_PC + raw-text features
LS_PC+RT+RD: LS_PC+RT + rich document features (OCR)
LS_PC+RT+RD and LS_PC+RT both beat LS_PC by more than 10 F1 points
LS_PC+RT+RD < LS_PC+RT: minor degradation for four categories
LS_PC+RT+RD > LS_PC+RT: all other categories (many by more than 4 F1 points)
Large improvements for footnote and subsubsectionHeader

Experiments – GS results
GS_maxent: maximum entropy based system (Nguyen and Kan, 2007)
GS_CRF: our system
GS_CRF > GS_maxent in all categories except background
Large improvements for discussions

Further analysis – Text features
All features contribute to the final composite performance
Most influential: position

Further analysis – rich doc features
Format features contribute most to the macro average, while object features influence the micro average most
Format features help a wider spectrum of categories: paper metadata & section headers
Object features enhance fewer categories, but ones with a large amount of training data, e.g. list item, table

Further analysis – rich doc features
Most features improve both metrics; the exceptions are align & table, which trade off macro vs. micro average
Location, Font, and Bullet are the most effective features in their respective groups: position, format, and object

Error analysis - LS

Error analysis - GS
Whole header: fails when the header shares no tokens with any of the memoized training data instances → needs to use body text instead (future work)
Similar relative positions of consecutive headers: background vs. method, method vs. discussions, & discussions vs. conclusions
The dataset skew also has an impact: there are many method headers but far fewer background and discussions headers → many headers are mislabelled as method

Q & A Thank you!