A Linguist’s Search Engine Philip Resnik University of Maryland JHU Conference on Spatial Language and Spatial Cognition September 18, 2003
Acknowledgments
Collaborators
–Christiane Fellbaum (Princeton)
–Mari Broman Olsen (Microsoft)
Implementors
–Aaron Elkiss
–Rafi Khan, G. Craig Murray, Saurabh Khandelwal
Inspiration
–Steve Abney, Chris Manning, Mitch Marcus
This work is supported by NSF ITR grant IIS
Facing Variability in Linguistic Data Sapir: “Everyone knows that language is variable” Chomsky: “[C]rucial evidence comes from marginal constructions; for the tests of analyses often come from pushing the syntax to its limits, seeing how constructions fare at the margins of acceptability.” Student in linguistics talk (whispered to friend): “Does that sound ok to you?”
Traditional Linguistics with Naturally Occurring Data
Long tradition outside the generative mainstream (e.g., Oostdijk and de Haan, 1994)
Recent, labor-intensive efforts (e.g., Macfarland 1995)
Frequent back-of-napkin jottings
Grammars and Variability Sapir (1921): “All grammars leak.” Abney (1996): “[A]ttempting to eliminate unwanted readings... is like squeezing a balloon: every dispreference that is turned into an absolute constraint to eliminate undesired structures has the unfortunate side effect of eliminating the desired structure for some other sentence.”
Theoretical versus Empirical Einstein (1940): Science is the attempt to make the chaotic diversity of our sense-experience correspond to a logically uniform system of thought [in which] experience must be correlated with the theoretical structure… What we call physics comprises that group of natural sciences which base their concepts on measurements… [emphasis added]
Where might data come from?
Text collection efforts (British and American national corpora, LDC Gigaword corpora, CHILDES, Switchboard, etc.)
The World Wide Web
Shallow annotated corpora, e.g. part-of-speech in the Brown Corpus of American English
Deeper annotations, e.g. Penn Treebank
Even deeper: PropBank, FrameNet
What about tools?
Concordancing, KWIC (e.g., Wordsmith)
Treebanks and tgrep
Gsearch (Corley et al.)
Do-it-yourself parsing and search
Manning (2003): “…it remains fair to say that these tools have not yet made the transition to the Ordinary Working Linguist without considerable computer skills.”
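For readers who have not used a concordancer, the keyword-in-context (KWIC) idea mentioned above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Wordsmith or any tool named on the slide works; the function name `kwic` and its `width` parameter are hypothetical.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    Hypothetical helper for illustration: matching is case-insensitive and
    context is measured in characters, not words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

if __name__ == "__main__":
    sample = ("We consider Kim an acceptable candidate. "
              "They consider Kim quite acceptable.")
    for line in kwic(sample, "consider", width=20):
        print(line)
```

A character-window concordance like this searches strings only; searching on linguistic criteria such as syntactic structure requires parsed or annotated text.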
A Web Search Tool for the Ordinary Working Linguist
Must have linguist-friendly “look and feel”
Must minimize learning/ramp-up time
Must permit real-time interaction
Must permit large-scale searches
Must allow search on linguistic criteria
Must be reliable
Must evolve with real use
If you build it, they will come…
Pollard and Sag (1994); discussion in Manning (2003)
–(a) We consider Kim to be an acceptable candidate
–(b) We consider Kim an acceptable candidate
–(c) We consider Kim quite acceptable
–(d) We consider Kim among the most acceptable candidates
–(e) *We consider Kim as an acceptable candidate
–(f) *We consider Kim as quite acceptable
–(g) *We consider Kim as among the most acceptable candidates
–(h) *We consider Kim as being among the most acceptable candidates
Constructions The Xer the NP1 the Yer the NP2
Overnight collection: 9pm-6am
Objections Chomsky (1979): “You can also collect butterflies and make many observations. If you like butterflies, that’s fine; but such work must not be confounded with research, which is concerned to discover explanatory principles of some depth and fails if it does not do so.”
Manning (2003): “To go out on a limb for a moment, let me state my view: generative grammar has produced many explanatory hypotheses of considerable depth, but is increasingly failing because its hypotheses are disconnected from verifiable linguistic data... I would join Weinreich, Labov, and Herzog (1968, 99) in hoping that ‘a model of language which accommodates the facts of variable usage... leads to more adequate descriptions of linguistic competence.’”
Abney (1996): “The focus in computational linguistics has admittedly been on technology. But the same techniques promise progress on issues concerning the nature of language that have remained mysterious for so long. The time is ripe to apply them.”
Jackendoff: “[T]he reaction of some linguists to foundational discussion of the sort I engage in here is: ‘Do I and my students really have to think about this? I just want to be able to do good syntax (or phonology or whatever).’... Still, when you’re driving you don’t just look ten feet in front of the car.... [if] integration seems to call for alteration of the larger context, one should not shrink from the challenge.”
Thank you!