Published by Simon Bryant. Modified over 9 years ago.
1
Keyphrase Extraction in Scientific Documents
Thuy Dung Nguyen and Min-Yen Kan
School of Computing, National University of Singapore
Slides available at http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
2
Thuy Dung Nguyen and Min-Yen Kan, ICADL 2007 (Hanoi, Vietnam)
Keyphrases!
To think about: Are tags keyphrases?
Credits: Amazon.com, ACM.org, IMDB.com
3
Using Keyphrases in DLs
Why are keyphrases important to digital libraries?
Navigation
– Searching: Better weighting for terms
– Browsing and Linking: Finding similar documents
Reading
– Highlighting
– Key Concepts
Helping to make the transition seamless between the two
Genex
4
Related Work
Generation
– Kim and Wilbur: statistical properties of distribution
– Tomokiyo and Hurst: Phraseness model
Selection
– GenEx (Turney)
– Kea (Frank et al.): just 3 features (TF×IDF, position, corpus frequency)
– Turney: selection not independent, use PMI
Assignment
– From Ontology (Medelyan & Witten), use graph features
5
Architecture
Key difference from previous works: centered on scientific publications. As such, adds two modules to capitalize on this limited domain.
Preprocessing
- Sentence delimiting
- POS tagging
- Stemming
Candidate Identification
- Simplex noun phrase detection
Basic Features
- TF×IDF
- Position
Morphological Features
- Suffix sequence
- POS sequence
- Acronym
Structural Features
- Section distribution vector
Diagram labels: Scientific publication, Plain text, HTML formatted output, Generic header mapping model, Keyphrase selection model, Keyphrases
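The two basic features can be sketched as follows. This is a minimal illustration, assuming a Kea-style TF×IDF (relative term frequency times log inverse document frequency) and a relative first-occurrence position. The slides do not give the exact formulas, so the function names, the natural-log base, and the token-list inputs are all assumptions.

```python
import math


def tf_idf(term_count, doc_length, docs_with_term, num_docs):
    """TF×IDF of one candidate (assumed formulation, not taken from the slides).

    term_count     -- occurrences of the candidate in this document
    doc_length     -- number of tokens in this document
    docs_with_term -- corpus documents containing the candidate (must be > 0)
    num_docs       -- total documents in the corpus
    """
    tf = term_count / doc_length
    idf = math.log(num_docs / docs_with_term)
    return tf * idf


def first_position(tokens, candidate):
    """Relative offset of the candidate's first occurrence (0.0 = document start).

    Both arguments are lists of tokens; returns None if the candidate is absent.
    """
    n = len(candidate)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == candidate:
            return i / len(tokens)
    return None
```

A candidate that appears in every corpus document gets a TF×IDF of zero, which is the intended effect: such terms do not distinguish one paper from another.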
6
1) Morphological Features
POS tags (used in previous work; e.g., GenEx)
– Used to identify candidates for simplex noun phrases (i.e., matching regex “(JJ|NN)* IN? NN”)
– Noun modifiers seem to be more productive than adjectival ones (e.g., “Additive”/NN vs. “Additional”/JJ)
Suffixes
– Sequences on modifiers and headwords (e.g., -ic, -al, -ive on modifiers; -ion, -ics, -ment on headword)
– More fine-grained than POS tagging
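A rough sketch of both ideas on this slide, under stated assumptions: the input is a list of (word, tag) pairs with Penn Treebank-style tags, every tag beginning with NN or JJ is collapsed to NN or JJ, and the suffix inventory is just the six examples listed above. The helper names are hypothetical, and how the original system encodes the suffix sequence is not stated in the slides.

```python
import re


def _tag_code(tag):
    """Collapse a POS tag to one character so the regex can run over a string."""
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):  # assumption: NNS, NNP, ... count as NN
        return "N"
    if tag == "IN":
        return "I"
    return "x"


# One-character-per-token version of the slide's pattern "(JJ|NN)* IN? NN".
NP_PATTERN = re.compile(r"[JN]*I?N")


def simplex_noun_phrases(tagged):
    """Return candidate phrases whose tag sequence matches the pattern."""
    codes = "".join(_tag_code(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


# Illustrative suffix inventory: only the examples named on the slide.
SUFFIXES = ("ics", "ment", "ion", "ive", "ic", "al")


def suffix_sequence(words):
    """Map each word to the first listed suffix it ends with, or "-" if none."""
    return [
        next((s for s in SUFFIXES if w.lower().endswith(s)), "-")
        for w in words
    ]
```

Because each token is reduced to a single character, match offsets in the code string are token indices, so no separate alignment step is needed.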
7
Morphological Features
Acronym candidate
– Binary feature: is the word an acronym?
– Using simple adjacent pattern matching of parenthesized text to candidates to their left / right
  - ICADL (Int’l Conf. on Asian Digital Libraries)
  - Int’l Conf. on Asian Digital Libraries (ICADL)
– Weaknesses:
  - Not comparable to state-of-the-art algorithms, and not meant to be
  - Not yet evaluated as a separate component
  - A finer-grained feature may be more useful
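The adjacent pattern matching could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' implementation: an acronym is taken to be an all-uppercase alphabetic token of at least two letters, and a long form is accepted when the initials of its capitalized words spell the acronym. All function names are hypothetical.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")


def initials(phrase):
    """Initial letters of the capitalized words in a phrase."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())


def is_acronym_token(token):
    """Assumed definition: two or more letters, all uppercase."""
    return len(token) >= 2 and token.isalpha() and token.isupper()


def find_acronyms(text):
    """Map each acronym to its long form, using text adjacent to parentheses."""
    found = {}
    for m in PAREN.finditer(text):
        inside = m.group(1).strip()
        before = text[:m.start()].split()
        if not before:
            continue
        if is_acronym_token(inside):
            # Pattern "long form (ACRONYM)": grow the window leftwards
            # until the initials spell the acronym.
            limit = min(len(before), 2 * len(inside))
            for n in range(1, limit + 1):
                candidate = " ".join(before[-n:])
                if initials(candidate) == inside:
                    found[inside] = candidate
                    break
        elif is_acronym_token(before[-1]) and initials(inside) == before[-1]:
            # Pattern "ACRONYM (long form)".
            found[before[-1]] = inside
    return found
```

The binary feature for a candidate word is then simply whether it appears as a key of the returned dictionary.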
8
Stemming
After other processing, case folding and stemming conflate candidates to obtain accurate phrase counts
– Use Lovins iterated stemmer
– Represent all stems using the most frequent form
Example: voxel (1), Voxels (2), voxelization (5), Voxelization (8)
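The conflation step can be illustrated with the voxel example. Note that `toy_stem` below is a hypothetical stand-in that strips a few suffixes; it is not the Lovins stemmer. Keeping differently cased surface forms as separate candidates for the representative is also an assumption.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for the Lovins stemmer: case-fold, strip a few suffixes."""
    w = word.lower()
    for suffix in ("ization", "ize", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def conflate(surface_counts, stem=toy_stem):
    """Group surface forms by stem.

    Returns {stem: (most frequent surface form, total count over all forms)}.
    """
    groups = defaultdict(Counter)
    for form, count in surface_counts.items():
        groups[stem(form)][form] += count
    return {
        key: (forms.most_common(1)[0][0], sum(forms.values()))
        for key, forms in groups.items()
    }


counts = {"voxel": 1, "Voxels": 2, "voxelization": 5, "Voxelization": 8}
print(conflate(counts))  # {'voxel': ('Voxelization', 16)}
```

All four forms share one stem, so the phrase count becomes 16 and the stem is displayed using its most frequent surface form.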
9
2) Structural Feature
Learning which sections are more productive for keyphrases
Sections: Abstract, Introduction, Related Work, Methods, Evaluation, Conclusion
10
Structural Features
Execution: create a feature vector of where a term logically appears
[Figure: example section vectors for Stem A and Stem B]
Caveat: Lots of unique headers in documents.
– Not helpful to say a candidate occurs in “Metadata Extraction Approaches”
– Change it to “Related Work”
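A section distribution vector might be built as in this sketch. It assumes raw occurrence counts per generic section and uses only the six section names shown on the previous slide; the full set of generic headers and the exact encoding (counts, binary flags, or proportions) are not given in the slides.

```python
# Assumption: illustrative subset of the generic headers, in document order.
GENERIC_SECTIONS = [
    "abstract",
    "introduction",
    "related work",
    "methods",
    "evaluation",
    "conclusion",
]


def section_vector(occurrences, sections=GENERIC_SECTIONS):
    """Count how often a stem occurs in each generic section.

    occurrences -- one generic section name per occurrence of the stem
    """
    counts = [0] * len(sections)
    for name in occurrences:
        counts[sections.index(name)] += 1
    return counts


# A stem seen once in the abstract and twice in the methods section.
print(section_vector(["abstract", "methods", "methods"]))  # [1, 0, 0, 2, 0, 0]
```

The input here already uses generic names, which is why document-specific headers have to be mapped to generic ones first.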
11
Mapping to Generic Section Headers
Method: also supervised machine learning
Map to 14 generic headers using four features:
1. Absolute section number (Section 3)
2. Relative position (Section 3 of 11 = 3 / (11-1) = .30)
3. Previous section header text
4. Current section header text
Performance (on a corpus of 1020 headers)
– Maximum Entropy: 92% accuracy
– Hidden Markov Model: 36% accuracy
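The four features could be extracted as in the sketch below. It assumes section numbering starts at 0, so that the slide's formula n / (N - 1) stays within [0, 1]; the slide does not say which indexing is used. The feature names and the "<start>" placeholder are hypothetical, and the classifier itself is not shown.

```python
def header_features(headers, index):
    """Features for the header at `index` in a document's ordered header list."""
    total = len(headers)
    return {
        "absolute_number": index,
        # Slide formula: section 3 of 11 gives 3 / (11 - 1) = 0.30.
        "relative_position": index / (total - 1) if total > 1 else 0.0,
        "previous_header": headers[index - 1].lower() if index > 0 else "<start>",
        "current_header": headers[index].lower(),
    }


headers = ["Abstract", "Introduction", "Background",
           "Metadata Extraction Approaches"] + [f"Section {i}" for i in range(4, 11)]
features = header_features(headers, 3)
print(features["relative_position"])  # 0.3
```

Including the previous header text gives the model some sequential context without requiring a full sequence model.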
12
Evaluation - Corpus Collection
No publicly available corpus of keyphrase assignments for scientific documents*. What to do?
So we collected our own. Freely available at: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
211 documents where text was extractable
– Superset of previous set
– Searched for “keywords general terms filetype:pdf”
* Consider citeulike.org?
13
Evaluation
120 documents with at least two sets of keyphrases
– One by original author
– One or more by student annotators
Accuracy by matching top ten extracted keyphrases versus the gold standard
– Standard P/R/F1
– Weighted average: use frequency of phrase in standard, 1 + ln(f)
Tested Naïve Bayes and Maximum Entropy
Using Kea features as the baseline
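Both evaluation schemes can be sketched as follows, assuming exact string matching between extracted and gold phrases and taking f to be the number of keyphrase sets that contain a phrase. How the weighted score is normalized is not stated in the slides, so only the weighted match total is computed. Function names are hypothetical.

```python
import math


def prf1(extracted, gold, k=10):
    """Precision, recall, and F1 of the top-k extracted phrases against the gold set."""
    top = extracted[:k]
    matched = len(set(top) & set(gold))
    precision = matched / len(top) if top else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def weighted_matches(extracted, gold_freq, k=10):
    """Sum of 1 + ln(f) over the top-k extracted phrases found in the gold standard.

    gold_freq -- {phrase: f}, with f >= 1 the number of keyphrase sets listing it
    """
    return sum(
        1 + math.log(gold_freq[p])
        for p in set(extracted[:k])
        if p in gold_freq
    )
```

A phrase chosen by a single annotator contributes 1, while one chosen by several contributes more, but only logarithmically so.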
14
Evaluation Results
Maximum Entropy did not work as well as NB
NB results show statistical significance at the .05 level for both evaluation schemes
[Chart: number of keywords matched; values shown: 3.03, 3.25, 3.61, 3.84]
15
Discussion
Assigned Keyphrases     | Kea Baseline  | Our System
Neural network          | Handover      | Clusters
3G network              | Soft handover | Soft handover
Soft handover (2)       | 3G            | Data
Cluster analysis        | Clusters      | 3G network
Self organizing map     | 3G network    | Interesting clusters
Hierarchical clustering | Cell          | Neural network
Errors:
– Still encourage longer phrase generation
– General words still appear (e.g., “data”, “cell”)
16
Conclusions
Current and Future Work
– Enlarge the keyphrase corpus
– Integrate tagging with keyphrases
– Deploy system into a scholarly digital library
Contributions: better keyphrase extraction
– Developed features specifically for scientific documents
– Developed mapping model for headers
– Created a corpus for keyphrase testing: http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus
Advertisement: We’re hiring postdocs in terminology extraction and semistructured document processing
17
End of Presentation
Backup slides follow
18
ICADL format
– 23-25 minutes for talk
– 5 minutes for questions
– 30 minutes in total