Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.

Resources
Primary resources
  – Lexicons, structured vocabularies
  – Grammars (in widest sense)
  – Corpora
  – Treebanks
Secondary resources
  – Designed for a different purpose (happenstantial)
  – Require software to exploit them
Tools as resources
  – Generic software for use with primary resources
  – Generic software for exploiting secondary resources
Reusability

Primary resources (Lexical)
Lexicons
Dictionaries for NLP, more than just lists of words
Not like dictionaries for human use
  – Need information that for humans is “obvious”
  – Don’t need some information typically found in dictionaries for human use
  – But, see later
Reusability: “theory-neutral”

Primary resources (Lexical)
Structured vocabulary
Conceptual structure
  – WordNet
  – (Machine-readable) Roget’s Thesaurus
Ontologies
  – Structure reflects specialised domain
  – Defines vocabulary and conceptual relations
  – Vocabulary reflects reality
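A conceptual structure of this kind can be pictured as words linked by relations. The sketch below is a toy "is-a" (hypernym) hierarchy; the entries and the names `HYPERNYM`, `hypernym_chain` and `is_a` are invented for illustration and are not taken from WordNet or any real ontology.

```python
# Toy "is-a" hierarchy: each word maps to its more general term (hypernym).
# The entries are illustrative only, not real WordNet data.
HYPERNYM = {
    "spaniel": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return the word followed by its successively more general terms."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_a(word, concept):
    """True if `concept` is the word itself or one of its hypernyms."""
    return concept in hypernym_chain(word)
```

Here `is_a("spaniel", "animal")` holds because the chain runs spaniel, dog, mammal, animal, while `is_a("cat", "dog")` does not.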

Primary resources (Grammatical)
“Grammar” includes morphology and syntax
Rules etc. have to be written, usually by a linguist
Generic formalisms devised, somewhat independent of application, with associated implementations
Usually depend on (and “implement”) some (linguistic) theory
Reusability
  – Application independent
  – Direction-neutral (analysis vs synthesis)
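Direction-neutrality can be illustrated with a grammar held purely as data and used by two separate procedures, one for synthesis and one for analysis. This is a minimal sketch with an invented toy context-free grammar; it assumes the grammar has no left recursion, and the names `GRAMMAR`, `generate`, `expand`, `ends` and `recognise` are hypothetical.

```python
# A toy grammar as declarative data: symbol -> list of alternative right-hand sides.
# Anything that is not a key of GRAMMAR is treated as a word (terminal).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}


def generate(symbol):
    """Synthesis: yield every word sequence that `symbol` can expand to."""
    if symbol not in GRAMMAR:
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(rhs)


def expand(symbols):
    """Yield every word sequence for a sequence of symbols."""
    if not symbols:
        yield []
        return
    for head in generate(symbols[0]):
        for tail in expand(symbols[1:]):
            yield head + tail


def ends(symbol, words, start):
    """Analysis: positions where `symbol` can end if it starts at `start`."""
    if symbol not in GRAMMAR:
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for sym in rhs:
            positions = {e for p in positions for e in ends(sym, words, p)}
        result |= positions
    return result


def recognise(sentence):
    """True if the whole sentence can be analysed as an S."""
    words = sentence.split()
    return len(words) in ends("S", words, 0)
```

Both `generate` and `recognise` read the same `GRAMMAR` table, so changing the rules changes analysis and synthesis together without touching either procedure.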

Corpora
A corpus (pl. corpora) is a collection of texts
Use of the term usually implies some “value added”
  – Specific to a domain
  – Explicitly collected (“planned”)
  – With some information added as a result of analysis, e.g. POS tags
Illustrates usage
  – Word collocations
  – Grammatical constructions
Used to build applications by machine learning
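As a small illustration of how a corpus shows usage, the sketch below counts adjacent word pairs, the simplest evidence for collocations. The three-sentence "corpus" is invented, tokenisation is plain whitespace splitting, and `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Invented example sentences, standing in for a real corpus.
corpus = [
    "strong tea is served here",
    "she drinks strong tea",
    "a powerful computer",
]
counts = bigram_counts(corpus)
```

In this toy data the pair ("strong", "tea") occurs twice and ("powerful", "tea") never occurs; on a real corpus, raw counts would normally be combined with an association measure before calling a pair a collocation.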

British National Corpus
One of the most widely used corpora (esp. in Britain, but also elsewhere)
A balanced synchronic text corpus containing 100 million words (POS tagged)
Collected in late 1980s
90% text, 10% transcribed speech
Encoded according to TEI standards
Associated tools (mainly for searching), but many users write their own (e.g. in Perl)
http://www.natcorp.ox.ac.uk/

Examples of other corpora
Wall Street Journal corpus
  – 25m words from WSJ 1987
  – Parsed, indexed
North American Newstext corpus
  – 350m words of newswire text
  – Indexed but otherwise not annotated
ATIS (Air Travel Information System) corpus
  – Transcriptions of real dialogues
Various corpora collected for competitions
  – MUC, TREC, …

Parallel corpora
Bilingual and multilingual corpora
  – Texts and their associated “translations”
  – Need to be aligned to be useful
  – Useful for translation studies, and to build MT systems, as reference corpora, or as input to SMT
Major examples:
  – Canadian and Hong Kong Hansards
  – European parliament and legislation (Europarl)
  – Stuff from other bilingual countries
  – User documentation from big companies
  – Online newspapers with English (etc.) versions

Treebanks
In some cases corpora have been fully parsed (and verified)
Treebanks are a very rich resource, but generally highly theory-specific
Major example is Penn Treebank
  – Includes (selections from) WSJ, Brown, ATIS corpora
  – Ongoing
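Parsed corpora are commonly stored as labelled bracketings, one tree per sentence. The sketch below reads one such bracketing into nested tuples and recovers the word/tag pairs; the example sentence and its labels are made up, and `parse_brackets` and `tagged_leaves` are hypothetical names, not part of any treebank toolkit.

```python
def parse_brackets(text):
    """Turn '(LABEL child ...)' into a nested (label, children) tuple."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read()


def tagged_leaves(tree):
    """Collect (word, tag) pairs from the lowest level of the tree."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
        else:
            pairs.append((child, label))
    return pairs


tree = parse_brackets("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
```

Calling `tagged_leaves(tree)` recovers the POS-tagged sentence from the parse, which shows one sense in which a treebank is a richer resource than a tagged corpus: the tagged text can be derived from it, but not the other way round.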

Secondary resources: Lexical
Word lists aimed at human users can be useful
Notably dictionaries if available in machine-readable form, e.g. typesetters’ tapes
Since content is aimed at humans, needs sophisticated software to extract/convert information

Secondary resources: Corpora
Any collection of text can be turned into a corpus, in principle
Raw text useful for many purposes
Machine learning approaches
  – Language model can be learned statistically
Bilingual corpora much used for building statistical MT systems
  – Similarly, translation rules learned from the examples in the corpus
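To make "learned statistically" concrete, here is a minimal sketch of a bigram language model estimated from raw text by relative frequency, P(word | prev) = count(prev, word) / count(prev). It uses no smoothing, the training sentences are invented, and `train_bigram_model` is a hypothetical name.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])           # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: undefined without smoothing, reported as 0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Invented raw text standing in for a corpus.
prob = train_bigram_model(["the dog barks", "the dog sleeps", "the cat sleeps"])
```

With this data `prob("the", "dog")` is 2/3 and `prob("dog", "barks")` is 1/2. Unseen pairs get probability zero, which is why practical models add smoothing.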

Generic tools as resources
Important idea from computer science of separating algorithms from data
Distinguish:
  – Grammar rules and lexicon that it uses as data
  – Programs and user interfaces that use the data to process a given input
  – The algorithms underlying those programs
Danger of confusion: e.g. Brill’s tagger is software that you can use to tag text, but you have to “program” it (actually, train it) for a given language (actually, sublanguage)
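The distinction between a generic tool and the data it is trained on can be sketched with a much simpler tagger than Brill's. The code below is not Brill's algorithm: it is a most-frequent-tag baseline, chosen only because it is short. The training sentences, the tag names and the function `train_tagger` are all invented for illustration.

```python
from collections import Counter, defaultdict


def train_tagger(tagged_sentences, default_tag="NOUN"):
    """Generic procedure: learn each word's most frequent tag from the data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    model = {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

    def tag(tokens):
        # Unknown words fall back to the default tag.
        return [(token, model.get(token.lower(), default_tag)) for token in tokens]

    return tag


# The same software, "programmed" by two different (invented) training sets.
english = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
french = [[("le", "DET"), ("chien", "NOUN"), ("aboie", "VERB")]]

tag_english = train_tagger(english)
tag_french = train_tagger(french)
```

The code in `train_tagger` knows nothing about English or French; everything language-specific lives in the training data, so the same tool can be reused for another language or sublanguage by retraining it.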

Generic tools: reusability
Well-known principles of software engineering apply here:
  – Write software for a specific purpose, but try to make it as general as possible
  – Reusable for a different task
  – Reusable with different data
The same principles apply to data
  – Distinguish between static (declarative) information and what you do with it (procedural)
  – Since data is voluminous (especially lexical data), it is important to be as neutral as possible regarding different purposes, so that it can be reused
