DLLS Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale Funded by:
DLLS The BYU Data Extraction Group: Group of faculty (5) and students (15) from CS, Linguistics, SOAIS. Goal: ontology-based data extraction. NSF funding: CISE/IIS/IDM TIDIE. Website: papers, presentations, tools, demos.
DLLS Overview: Ontology-based extraction; Building knowledge sources; Jobs in linguistics (Sproat); Putting it all together; Some sample results
DLLS Ontologies and IE [diagram: Source, Target]
DLLS Document-based IE
DLLS Conceptual modeling (OSM) [diagram: Car has Year, Price, Make, Mileage, Model, Feature; PhoneNr is for Car; PhoneNr has Extension; participation constraints 1..* and *]
DLLS Recognition and Extraction
Car table (columns: Car, Year, Make, Model, Mileage, Price, PhoneNr); surviving cell values: Subaru SW $1900 (336) | Elantra (336) | HONDA ACCORD EX 100K (336)
Car-Feature table (columns: Car, Feature): 0001 Auto; 0001 AC; 0002 Black door; 0002 tinted windows; 0002 Auto; 0002 pb; 0002 ps; 0002 cruise; 0002 am/fm; 0002 cassette stereo; 0002 a/c; 0003 Auto; 0003 jade green; 0003 gold
DLLS Car-Ads Ontology (textual)
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}";
             context "([^\$\d]|^)[4-9]\d[^\d]";
             substitute "^" -> "19"; },
  …
End;
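The Year rule above can be read as a small pipeline, sketched here in Python. This is only an illustration under stated assumptions, not the group's implementation: it assumes the context pattern is located first, the extract pattern is then applied inside the context match, and the substitute step rewrites the extracted string. The function name `extract_years` is hypothetical.

```python
import re

# Patterns copied from the Year rule in the Car-Ads ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in a valid context.

    Assumed semantics: find each context match, pull the extract pattern
    out of it, then apply the substitution "^" -> "19".
    """
    years = []
    for ctx in CONTEXT.finditer(text):
        found = EXTRACT.search(ctx.group(0))
        if found:
            # substitute "^" -> "19": prefix the century to the two digits.
            years.append(re.sub(r"^", "19", found.group(0)))
    return years


if __name__ == "__main__":
    print(extract_years("'97 HONDA ACCORD EX, $1900"))
```

The context pattern is what keeps the price "$1900" from being read as a year: a candidate must not be preceded by a dollar sign or another digit.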
DLLS The data-frame library
Low-level patterns implemented as regular expressions
Match items such as addresses, phone numbers, names, etc.
Mileage matches [8]
  constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; },
  { extract "[1-9]\d{0,2}?,\d{3}";
    context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
    substitute "," -> ""; },
  { extract "[1-9]\d{0,2}?,\d{3}";
    context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
    substitute "," -> ""; },
  { extract "[1-9]\d{3,6}";
    context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b|les\b)"; },
  { extract "[1-9]\d{3,6}";
    context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
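As a rough illustration of how such a data frame might be applied, the following Python sketch runs the first two Mileage rules in order and normalizes the result to an integer. It rests on several assumptions: rules are tried in the order listed, matching is case-insensitive (suggested by the "[kK]" substitution), and a rule without a context is applied to the whole text. The names `RULES` and `extract_mileage` are hypothetical.

```python
import re

# (extract, context, (substitute_pattern, replacement)) triples taken from
# the first two Mileage rules. A context of None means the extract pattern
# is applied to the whole text (an assumption of this sketch).
RULES = [
    (r"\b[1-9]\d{0,2}k", None, (r"[kK]", "000")),
    (r"[1-9]\d{0,2}?,\d{3}", r"[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]", (r",", "")),
]


def extract_mileage(text):
    """Return the first mileage found as an int, or None if no rule fires."""
    for extract, context, (pattern, replacement) in RULES:
        scope = text
        if context is not None:
            ctx = re.search(context, text, re.IGNORECASE)
            if ctx is None:
                continue  # context not present, try the next rule
            scope = ctx.group(0)
        found = re.search(extract, scope, re.IGNORECASE)
        if found:
            # Apply the rule's substitution to normalize the value.
            return int(re.sub(pattern, replacement, found.group(0)))
    return None


if __name__ == "__main__":
    print(extract_mileage("HONDA ACCORD EX 100K"))
    print(extract_mileage("only 85,000 miles"))
```

As in the Year rule, the context pattern excludes values preceded by a dollar sign, so a price such as "$12,500" is not taken as a mileage.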
DLLS Lexicons: Repositories of enumerable classes of lexical information: FirstNames, LastNames, USstates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
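A lexicon lookup differs from a data-frame pattern in that it checks membership in an enumerated list instead of matching a regular expression. A minimal Python sketch follows; the `CAR_MAKES` entries are a made-up excerpt for illustration, and `find_lexicon_entries` is a hypothetical helper, not part of the group's tools.

```python
import re

# Hypothetical excerpt of a CarMakes lexicon, stored in lower case.
CAR_MAKES = {"honda", "subaru", "hyundai", "ford"}


def find_lexicon_entries(text, lexicon):
    """Return the alphabetic tokens of text that appear in the lexicon."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [token for token in tokens if token.lower() in lexicon]


if __name__ == "__main__":
    print(find_lexicon_entries("97 HONDA ACCORD EX 100K", CAR_MAKES))
```

A real lexicon would also need to handle multi-word entries, which this single-token sketch does not attempt.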
DLLS Accessing the output: Extracted information is stored in a relational database; results can be queried using SQL; a wide range of views is possible.
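To make the database step concrete, here is a small Python sketch that uses the standard sqlite3 module. The schema (a Car table plus a CarFeature table keyed by car id) is an assumption loosely modeled on the Car and Car-Feature tables shown earlier, and the sample rows are illustrative, not actual extraction output.

```python
import sqlite3


def build_db(cars, features):
    """Load extracted records into an in-memory database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Car (id TEXT PRIMARY KEY, make TEXT, model TEXT, price INTEGER)"
    )
    conn.execute("CREATE TABLE CarFeature (car_id TEXT, feature TEXT)")
    conn.executemany("INSERT INTO Car VALUES (?, ?, ?, ?)", cars)
    conn.executemany("INSERT INTO CarFeature VALUES (?, ?)", features)
    return conn


def cars_with_feature(conn, feature):
    """Return (id, make) pairs for cars that list the given feature."""
    rows = conn.execute(
        "SELECT c.id, c.make FROM Car c "
        "JOIN CarFeature f ON f.car_id = c.id "
        "WHERE f.feature = ? ORDER BY c.id",
        (feature,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    # Illustrative rows only; ids and values are assumptions.
    cars = [("0001", "Subaru", "SW", 1900), ("0003", "HONDA", "ACCORD EX", None)]
    features = [("0001", "Auto"), ("0001", "AC"), ("0003", "Auto")]
    conn = build_db(cars, features)
    print(cars_with_feature(conn, "Auto"))
```

Keeping multi-valued attributes such as Feature in their own table mirrors the one-to-many relationship in the ontology and keeps queries by feature simple.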
DLLS Finding jobs in linguistics: Linguistlist.org, LSA distribution lists (corpora, langage naturelle, CAAL/ACLA, etc.); usual commercial sites (monster.com, flipdog.com, dice.com); word-of-mouth sources
DLLS Sproat’s analysis: Random sample (224/2250) of LinguistList postings; development vs. research, academic vs. industrial. Linguists are most often (approx. 80% of the time) offered development jobs. Linguists are hired more for specific tasks (e.g. grammar, lexicon development) than for more general research-oriented tasks (e.g. creating new technological approaches).
DLLS The banner years [table: Year, Academia, Industry, % Industry; cell values not recoverable; last row: 2001 (mid)] Dramatic rise in 1999, 2000; steep drop-off since 2001; rising demand for technical, computational skills
DLLS Linguistic jobs ontology: Why? User-specifiable constraints. Somewhat closely follows existing ontologies (e.g. jobs, software).
DLLS Data frames and lexicons: Language names (Ethnologue); (sub)fields of linguistics (Linguistlist.org); tools, toolkits; software components, programming languages; linguistics-related job titles; activities; responsibilities; country names
DLLS The corpus: 3237 postings (LinguistList, Corpora, LN, WoM). Some noise (non-English, factored, program descriptions, attachments, etc.). Semi-automatic edits (boilerplate, publicity blurbs about institutions, etc.).
DLLS Sample output
DLLS Observations: 270 postings don’t contain linguist* (!). Demand for knowledge of English equals that for all other languages combined (G, F, S, J, C). A computer/computational background is required for almost 1/3 (1116). Noticeable amount of headhunting, particularly in the Seattle and DC areas.
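A count like the linguist* figure could be obtained with a simple wildcard check over the corpus. The Python sketch below assumes that linguist* means any word beginning with "linguist", matched case-insensitively; the sample postings are invented for illustration and `count_without_term` is a hypothetical helper.

```python
import re

# Assumed reading of the wildcard linguist*: any word starting with
# "linguist" (linguist, linguists, linguistics, ...), ignoring case.
LINGUIST = re.compile(r"\blinguist\w*", re.IGNORECASE)


def count_without_term(postings, pattern=LINGUIST):
    """Count the postings in which the pattern never occurs."""
    return sum(1 for posting in postings if not pattern.search(posting))


if __name__ == "__main__":
    # Invented sample postings, not taken from the corpus.
    sample = [
        "Computational Linguist wanted",
        "Speech engineer, ASR tuning",
        "Linguistics lecturer",
    ]
    print(count_without_term(sample))
```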
DLLS Programming languages
DLLS Popular subfields
DLLS Subfields (another perspective)
DLLS An engineering discipline? 160 linguistics jobs ending in “engineer” (e. = engineer). Software development cycle: research e., software design e., development e., software e., software quality e., linguistic test e., linguistic quality e., linguistic support e., user experience e., presales e., technical sales e. Specific subfields: web site e.; speech e., voice recognition e., speech recognition application e., ASR tuning e., audio e.; dialog e.; tools e.; AI e., NLP e.; knowledge e.; linguist e., natural language e.; staff e.; human factors e., user interface e.
DLLS Paradigms
DLLS Other observations: Often a job title is not even listed (!). More i18n of data frames (e.g. ph. #). Great need for (preferably hierarchical) lexical repositories related to linguistics: job titles; theoretical frameworks, subfields; typical linguist job activities; linguistic research/development venues.