Published by Jesse Morrison. Modified over 9 years ago.
1
FF & FER
INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
Comparative Analysis of Automatic Term and Collocation Extraction
Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing
2
Overview
I. Introduction
– Reasons for extraction
II. Research
– Resources & tools
– Extracted lists
III. Evaluation
– Precision, recall, F-measure
IV. Conclusion
3
I. Introduction
Monolingual and multilingual resources
– Helpful
– Integrated
– Require human intervention
EU pre-accession activities
– Speed up + consistency
Used in further research and practice
4
List:
– Terms (Member State, European Union)
– Collocations (adopt a/the resolution, decided as follows)
– Multi-word units (depend on, well-being)
Term extraction process:
– Term extraction (term acquisition) - identification
– Term recognition - verification
5
II. Research
Resources
– 10 documents – legislation, Cro-Eng
Tools
– TermeX tool (FER) – list A
– SDL MultiTerm Extract + NooJ (FF) – list B
Reference list
– Evaluation – reference list
6
Reference list
470 terms and collocations
Unigrams excluded
Balance between lexical coverage, adequacy, practicality
– terms (NPs: 346/470)
– collocations (VPs)
7
Reference list
Contains:
– Terms (acquiring company, applicant country)
– Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
– Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
– Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)
8
List B
Language-independent, statistically based SDL MultiTerm Extract tool
– Frequency threshold set to 4
– Filtered by the list of stop-words -> 369 candidates
Language-dependent NooJ tool
– 36 local grammars -> 512 candidates
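The statistical step for List B (a frequency threshold of 4 followed by stop-word filtering) can be illustrated with a minimal sketch. This is not the SDL tool's actual algorithm, which the slides do not describe. The stop-word list, the 2- to 4-word n-gram range, and the rule that a candidate may not start or end with a stop word are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"the", "of", "a", "an", "to", "in", "and"}

def candidate_ngrams(tokens, n_max=4, min_freq=4, stop_words=STOP_WORDS):
    """Return n-grams (2..n_max words) that occur at least min_freq times
    and neither start nor end with a stop word (assumed filtering rule)."""
    counts = Counter()
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {
        " ".join(gram): freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in stop_words
        and gram[-1] not in stop_words
    }

# Toy input: one phrase repeated four times so that some n-grams reach the threshold.
tokens = ("the member state of the european union " * 4).split()
print(candidate_ngrams(tokens))
```

On real legislation text the input would be the tokenized corpus, and the surviving candidates would still need manual verification, as the term recognition step in the slides suggests.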
9
List A
TermeX
– Lexical association measures (AMs)
– 14 AMs (PMI, Dice, Chi-square, …)
– Lemmatization
– POS filtering
– Frequency threshold set to ?
10
List A
Extracted terms ranked by AM value
– 1816 candidates
AMs used:
– 2-grams – PMI
– 3-grams, 4-grams – heuristic extensions
Noun phrases only
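Ranking 2-grams by PMI can be sketched as follows. This uses the standard definition of pointwise mutual information with probabilities estimated from relative frequencies; it is not TermeX's implementation, and it leaves out the lemmatization, POS filtering, and 3-/4-gram heuristic extensions mentioned in the slides. The function name and the `min_freq` parameter are hypothetical.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_freq=1):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from relative frequencies in the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (x, y), freq in bigrams.items():
        if freq < min_freq:
            continue  # skip pairs below the frequency threshold
        p_xy = freq / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scored.append(((x, y), math.log2(p_xy / (p_x * p_y))))
    # Highest PMI first, mirroring a list "ranked by AM value"
    return sorted(scored, key=lambda item: item[1], reverse=True)

tokens = "the european union and the member state of the european union".split()
for pair, score in rank_bigrams_by_pmi(tokens, min_freq=2):
    print(" ".join(pair), round(score, 3))
```

PMI tends to favor rare pairs, which is one reason a frequency threshold is usually applied before ranking.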
11
Results
Evaluation
– F1-measure (precision, recall)
– True positives calculated by taking into account inflection (suffix stripping)

                List A   List B
No. of terms    1816     508
Valid terms     202      234
Precision (%)   11.56    47.37
Recall (%)      42.98    49.79
F1 (%)          18.22    48.55
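The reported recall and F1 values can be checked with a short sketch. Recall is recomputed as valid terms divided by the 470 reference entries, and F1 is the harmonic mean of the reported precision and recall. Precision is taken from the table as given rather than recomputed, since the slides do not state which denominator was used after suffix stripping. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

REFERENCE_SIZE = 470  # size of the reference list given in the slides

# Valid-term counts and precision/recall percentages as reported in the table
reported = {
    "List A": {"valid": 202, "precision": 11.56, "recall": 42.98},
    "List B": {"valid": 234, "precision": 47.37, "recall": 49.79},
}

for name, row in reported.items():
    recall = 100 * row["valid"] / REFERENCE_SIZE
    f1 = f1_score(row["precision"], row["recall"])
    print(f"{name}: recall = {recall:.2f} %, F1 = {f1:.2f} %")
```

Run as written, this gives 42.98 % and 18.22 % for List A and 49.79 % and 48.55 % for List B, matching the table.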
12
Results
List A unsatisfactory
– Low recall – misses verb phrases and terms consisting of more than 4 words
– Low precision – ranked list, can be improved with cut-off (true positives are better ranked)
List B modest
– Can be improved with lemmatization, definition of upper/lower cases, more detailed local grammar
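The effect of a cut-off on a ranked list can be sketched as follows: keeping only the top-k candidates raises precision when true positives cluster near the top, at the cost of recall. The candidate and reference lists below are toy data invented for illustration, not the study's lists, and exact string matching stands in for the suffix-stripping comparison used in the evaluation.

```python
def precision_recall_at_cutoff(ranked_candidates, reference, k):
    """Precision and recall when only the top-k ranked candidates are kept."""
    kept = ranked_candidates[:k]
    hits = sum(1 for term in kept if term in reference)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical ranked output (best first) and hypothetical reference list
ranked = ["european union", "member state", "of the",
          "applicant country", "in the", "such as"]
reference = {"european union", "member state",
             "applicant country", "entry into force"}

for k in (2, 4, 6):
    p, r = precision_recall_at_cutoff(ranked, reference, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Choosing k is a trade-off: a lower cut-off gives a cleaner list for human verification but leaves more reference terms unfound.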
13
Conclusion
Comparison of two hybrid approaches to term extraction
Human-created lists differ from extracted lists
– human knowledge, experience and intuition
Space for improvement – automatic extraction combined with human intervention
14
Thank you!