Career Opportunities for Linguists in Industry Vita Markman Computational Linguistics Engineer DIMG.

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

Linked In 101 Using Social Networking as Part of Your Job Search.
Speed dating Classification What you should know about dating Stephen Cohen Rajesh Ranganath Te Thamrongrattanarit.
Introduction to Computational Linguistics
Distant Supervision for Emotion Classification in Twitter posts 1/17.
Advanced Google Becoming a Power Googler. (c) Thomas T. Kaun 2005 How Google Works PageRank: The number of pages link to any given page. “Importance”
Computational Models of Discourse Analysis Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Methods in Computational Linguistics II Queens College Lecture 1: Introduction.
Engineering Village ™ ® Basic Searching On Compendex ®
An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.
EMNLP Industry Panel Comments © 2001, David A. Evans, Clairvoyance Corporation 1June 4, 2001 The Rubber and the Road Industrial Perspectives on NLP EMNLP.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Scalable Text Mining with Sparse Generative Models
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Multimedia Data Mining Arvind Balasubramanian Multimedia Lab (ECSS 4.416) The University of Texas at Dallas.
Chapter 5: Information Retrieval and Web Search
Quick Job Interview Guide Seven Steps to Acing Your Interview.
Multimedia Data Mining Arvind Balasubramanian Multimedia Lab The University of Texas at Dallas.
Introduction to machine learning
Forecasting with Twitter data Presented by : Thusitha Chandrapala MARTA ARIAS, ARGIMIRO ARRATIA, and RAMON XURIGUERA.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
(C) 2013 Logrus International Practical Visualization of ITS 2.0 Categories for Real World Localization Process Part of the Multilingual Web-LT Program.
Interview Tips.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Indonesian Social Media Monitoring Tools a Company Profile Indonesian Social Media Monitoring Tools a Company Profile version 2.0 See What People Say…
A Case Study in Success Online How to generate revenue through content marketing.
CAREERS IN LINGUISTICS OUTSIDE OF ACADEMIA CAREERS IN INDUSTRY.
Some studies on Vietnamese multi-document summarization and semantic relation extraction Laboratory of Data Mining & Knowledge Science 9/4/20151 Laboratory.
CSCI 347 – Data Mining Lecture 01 – Course Overview.
Creating an Online Professional Presence Using Social Media.
Introduction to Natural Language Processing Heshaam Faili University of Tehran.
The Five Step Sales Process The Five Step Sales Process Step One: Plan and Prepare May 11, 2011.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Job Search 101 Free Geek Instructor: Wayne Flower.
Aardvark Anatomy of a Large-Scale Social Search Engine.
1 Computational Linguistics Ling 200 Spring 2006.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Copyright (c) 2003 David D. Lewis (Spam vs.) Forty Years of Machine Learning for Text Classification David D. Lewis, Ph.D. Independent Consultant Chicago,
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Chapter 6: Information Retrieval and Web Search
Understanding The Semantics of Media Chapter 8 Camilo A. Celis.
Research Topics CSC Parallel Computing & Compilers CSC 3990.
Natural language processing tools Lê Đức Trọng 1.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
How Do We Find Information?. Key Questions  What are we looking for?  How do we find it?  Why is it difficult? “A prudent question is one-half of wisdom”
What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.
National Technical University of Ukraine “Kiev Polytechnic Institute” Heat and energy design faculty Department of automation design of energy processes.
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
Classification using Co-Training
By: WenHao Wu. A current situation that I have is that I cannot decide if a computer career is for me. I am considering any career in computers, but I.
AQUAINT Mid-Year PI Meeting – June 2002 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Introduction to Information Retrieval. What is IR? Sit down before fact as a little child, be prepared to give up every conceived notion, follow humbly.
Understanding unstructured texts via Latent Dirichlet Allocation Raphael Cohen DSaaS, EMC IT June 2015.
Team Hogwarts EED 515 – Dr. Raymond Brie Monday, 7pm CA2 CLASS PORTFOLIO.
Introducing Precictive Analytics
Machine Learning Best Practices with Alfresco & Activiti
Creating your online identity
CS6501 Advanced Topics in Information Retrieval Course Policy
Make Predictions Using Azure Machine Learning Studio
Create your Benner - intro
DATA ANALYTICS AND TEXT MINING
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Mining and Analyzing Data from Open Source Software Repository
Text Mining with JMP Pro 13: A Case Study
Text Analytics and Machine Learning Workshop
Presentation transcript:

Career Opportunities for Linguists in Industry Vita Markman Computational Linguistics Engineer DIMG

Vita Markman2 Linguists in Industry??!!! Overview I. What can a linguist do in industry? II. Some concrete examples of what linguists work on “in the real world” II. From school to work: what skills/knowledge one needs IV. Useful References V. Conclusion and Q&A

3 But Why? Because: ● Having options is good ● You may want to take a year off before going to grad school, yet work on something relevant ● It's really fun ● Research jobs are NOT confined to the academic setting ● Not all industry jobs are like from 'The Office Space'

4 Explosion of Text Data ● Explosion of linguistic data online ● Social media (twitter, facebook, blogosphere) ● Linguistic data is not easily amenable to analysis. It requires much processing and much insight into the nature and structure of language ● Modifying old NLP techniques and tools as well as inventing new ones: for example, no off-the-shelf parser can handle Twitter conversations.

5 A linguist in the industry? Examples: ● Information retrieval (esp. search engines that do semantic search) ● Voice recognition/generation – think 'google voice'! ● Text classification and text clustering ● Text mining – finding grains of useful info in unstructured text ● E-discovery ● Analyzing the language of social media (topic and sentiment extraction from short fragmented and noisy data)

6 Chat filtering: an example ● Computational linguistics application for on- line virtual worlds ● Ensuring safety of on-line chat environments ● Filtering chat for appropriate content ● Filtering is an example of a classification problem: classify text as ‘appropriate’ or ‘inappropriate’

Chat filtering: an example The problem of determining whether lines involve inappropriate content is similar to spam detection Simple word/phrase matches are not enough. It will be too aggressive and not be enough: some nefarious lines may pass through Most inappropriate talk is made up of completely innocent words!

Chat filtering People use innocent words to make inappropriate phrases People also find ways to say things with MixEdCAse or b,r,o.k.en w.o.r,ds that get around the filter We want: a general way of saying: “if you see MixED CaSE or br.ok.en word,s it is probably a sign of something bad” We also want: to capture inappropriate combinations of innocent words A solution: a model, much like the ones used for spam What is the general idea?

The idea behind filtering Look at the words and other features that make up appropriate and inappropriate chat, and ask how likely is a word to appear in inappropriate chat? Example: If a pair of words such as “stew pit” never appears in regular, appropriate chat, it is probably an indicator of something inappropriate. Having done that, we can ask for a new chat segment, is it likely to be inappropriate?

That said… Filtering is an example of text classification, used very broadly Classifying documents by content/topic requires a labeled set of data and a learning algorithm The difficult part is not the algorithm itself, but the fine-tuning of its parameters and manipulating and preprocessing the data This is where creative and analytical thinking becomes truly crucial and the job becomes really fun!

Another example: Twitter Clustering super-short twitter posts by topic Clustering = finding groups in unlabeled data This data is very noisy and fragmented lets see im on lates on monday so dont start till two and could get down from work mail me xxx7/16/ :50:26 AM Well i'm watching mate running at silly o'clock but i'll be free in the late afternoon 7/17/2009 8:18:12 AM Resistant to parsing and part of speech tagging Goal: arrive at K clusters, each representing a topic of the post! Problem: very little to go on!

Twitter Possible solutions: padding – adding more relevant words to each post Applying spellcheckers to reduce the noise in the data Removing various content-less function words, known as stop-words. Since even those posts that share a common topic, do not often share many words in common, look at the mutual contexts in which the words in posts appear This technique is known as Latent Semantic Analysis

13 From school to workplace ● What skills are needed for a future (relevant) career in computational linguistics ? – Statistics: basic understanding of sampling methods and probabilistic reasoning – Some linear algebra, esp. as it relates to matrix and vector manipulation – Some calculus – Machine learning algorithms that are used commonly in NLP such as Naïve Bayes, HMM, and Expectation Maximization – Some programming (python, java, C++)

14 From school to workplace cont'd How can these skills be learned? ● Formal instruction ● Self-teaching ● Taking a class on-line

Companies that employ linguists Google(Bay area, Los Angeles) Yahoo(Bay area, Los Angeles) IBM Microsoft (Seattle, WA) Smaller search engines and semantic search engines: Ask.com Autonomy(Bay area) H5(San Francisco) Cognition(Los Angeles) Companies that do Machine Translation (Systran, Language Weaver) Entertainment Companies (Los Angeles)

Learning while working Internships: Companies hire students to work and learn This can be an invaluable experience ! It can really supplement classroom learning via the actual application of the learned material Sometimes academic knowledge and practical application do not go hand-in-hand…

17 Resources ● Association for Computational Linguistics (Conference in Portland OR in June, 2011) ● KDD Nuggets – a website for data mining and knowledge discovery, contains useful links to tutorials, news, and jobs ● Dice.com – job website for technical jobs only ● NLTK.org – natural language toolkit in python. Can be used to try a 'do it yourself' document classification and clustering. ● NLP Group at Information Science Institute at USC, located in Marina del Rey – great talks to get an overview of what’s going on in the field ● Books and articles: ● Jurafsky and Martin The bible of computational linguistics ● Chris Manning 2008 – Information Retrieval ● Mitchel 1997 Machine Learning; ● WEKA – Data Mining free software and book (Witten and Frank 2005)

18 Conclusion ● As a linguist, you can do lots of interesting work in a non-academic setting ● You must supplement your knowledge of linguistics by some math and computer science knowledge and you are good to go! ● Bottom line: all you really need is to be open to learning new things, which is exactly why we all go to school in the first place! Thank you!