Identifying terms with similar meanings across corpora
Sahami and Heilman (two terms, one source):
  Kofi Annan / UN Secretary General
  Google(Kofi Annan) compared with Google(UN Secretary General)
My Project (one term, two sources):
  ForeignAffairs(Kofi Annan) compared with Google(Kofi Annan)
  BioDatabase(Python) compared with Google(Python)
[Architecture diagram. Components: Main Program, Google Search API, Web, Lucene, Pre-computed IDFs]
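The comparison step can be illustrated with a minimal sketch, loosely following the kernel in the Sahami and Heilman paper: each retrieved snippet becomes a normalized TF-IDF vector, the vectors are averaged into a normalized centroid, and two centroids are compared by inner product. Everything named here is an assumption made for illustration: the function names, the whitespace tokenizer, the `default_idf` fallback, and the toy snippets and IDF table. The paper's truncation of each vector to its highest-weighted terms is omitted, and the snippet lists stand in for whatever the Google Search API or a Lucene index would return.

```python
import math
from collections import Counter


def _normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vector(text, idf, default_idf=1.0):
    """Unit-length TF-IDF vector for one snippet.

    `idf` is a pre-computed term -> IDF table; terms missing from it
    fall back to `default_idf` (an assumption of this sketch).
    """
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return _normalize({t: c * idf.get(t, default_idf) for t, c in counts.items()})


def context_vector(snippets, idf):
    """Normalized centroid of the snippet vectors retrieved for one term."""
    centroid = Counter()
    for snippet in snippets:
        for term, weight in tfidf_vector(snippet, idf).items():
            centroid[term] += weight
    return _normalize(centroid)


def similarity(a, b):
    """Inner product of two unit vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


if __name__ == "__main__":
    # Hypothetical IDF table and snippets, standing in for real search results.
    idf = {"snake": 3.0, "reptile": 3.0, "code": 2.5, "language": 2.0}
    bio_snippets = ["python snake reptile", "large python snake"]
    web_snippets = ["python code language", "python programming language"]
    score = similarity(context_vector(bio_snippets, idf),
                       context_vector(web_snippets, idf))
    print(f"cross-corpus similarity for 'python': {score:.3f}")
```

A low score for the same term across two sources would suggest that the term carries a different dominant meaning in each; how to threshold that score is left open here.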
Best Results So Far: IMDB
“Apocalypse Now” and “Gothika” clearly identified as popular.
“The Body”, “Summer School”, “Antitrust” clearly identified as… overshadowed by other meanings.
Compound identification (actor names, etc.) would probably be a big help here.
References
Sahami, M. and Heilman, T. D. 2006. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). WWW '06. ACM Press, New York, NY, 377-386. DOI: http://doi.acm.org/10.1145/1135777.1135834