Download presentation
Presentation is loading. Please wait.
Published byMildred Fletcher Modified over 9 years ago
1
NLP superficial and lexic level1 Superficial & Lexical level 1 Superficial level What is a word Lexical level Lexicons How to acquire lexical information
2
NLP superficial and lexic level2 Superficial level 1 Textual pre-process Getting the document(s) Accessing a BD Accessing the Web (wrappers) Getting the textual fragments of a document Multimedia documents, Web pages,... Filtering out meta-information tags HTML, XML,...
3
NLP superficial and lexic level3 Superficial level 2 Text segmentation into paragraphs or sentences Tokenization orthographic vs grammatical word Multiword terms dates, formulas, acronyms, abbreviations, quantities (and units), idioms, Named entities NER, NEC, NERC Unknown word Language identification Beeferman et al, 1999 Ratnaparkhi, 1998 Bikel et al, 1999 Borthwick, 1999 Mikheev et al, 1999 Elworthy, 1999 Adams,Resnik, 1997
4
NLP superficial and lexic level4 Superficial level 3 Vocabulary size (V) Heap's Law V = KN K depends on the text 10 K 100 N total number of words depends on the language, for English 0.4 0.6 Vocabulary grows sublinealy but does not saturate tends to stabilize for 1Mb of text (150.000w) words Different words
5
NLP superficial and lexic level5 Superficial level 4 word tokens vs word types Statistical distribution of words in a document Obviously non uniform Most common words cover more than 50% of occurrences 50% of the words only occur once ~12% of the document is formed by word occurring less than 4 times.
6
NLP superficial and lexic level6 Superficial level 5 Zipf law: We sort the words occurring in a document by their frequency. The product of the frequency of a word (f) by its position (r) is aproximatelly constant
7
NLP superficial and lexic level7 Lexical level 1 Part of Speech (POS) Formal property of a word-type determining its acceptable uses in syntax. A POS can be seen as a class of words A word-type can own several POS, a word-token only one Plain categories open, many elements, neologisms, independent and semantically rich classes N, Adj, Adv, V Functional categories closed
8
NLP superficial and lexic level8 Lexical level 2 Repository of lexical information for human or computer use Two aspects to consider Representation of lexical information Acquisition of lexical information Lexicon
9
NLP superficial and lexic level9 Lexical level 3 Orthografic Transcription Phonetic Transcription Flexion model diathesis alternations, subcategorization frames LOVE VTR (OBJLIST: SN). LOVE CAT = VERB SUBCAT = Lexicon content
10
NLP superficial and lexic level10 POS Argument structure Semantic information dictionaries => definition lexicons => semantic types predefined in a hierarchy. Lexical Relations derivation Equivalence with other languages Lexical level 4
11
NLP superficial and lexic level11 Lexical level 5 Form attribute/value pairs, binarr or n-ary relations, coded values, open domain values… Multiple assignments One to many and many to one relations Contextual dependencies … Facets of features Mandatory or optional, cardinality, default values Grading Exact values, preferences, probabilistic assigments. Problems
12
NLP superficial and lexic level12 Lexical level 6 General purpose databases Textual databases Lexical databases OO formalisms OO databases Frames Unification-based formalisms Representation
13
NLP superficial and lexic level13 Lexical level 7 Dictionaries MRD Predefined internal structure Some degree of coding in some contents Internal relations (synonimy, hyponymy,...) (sometimes) restricted vocabulary Some sistematics on building definitions Lexical Information acquisition
14
NLP superficial and lexic level14 Lexical level 8 Colocations Argument structure. Frecuency information Context Grammatical Induction Probabilistic Analysis. Lexical relations Examples of use. Selectional Restrictions Nominal compounds Idioms,... Information present in corpora
15
NLP superficial and lexic level15 Lexical level 9 Raw corpus Horizontal or vertical Corpus Tagged corpora Parenthized corpora Treebanks Corpus typology
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.