Text Analytics / NLP Charan Puvvala, Mirabel Technologies Inc.

Text Analytics / NLP Charan Puvvala, Mirabel Technologies Inc. Hyderabad Machine Learning Group

Outline
- Introduction
- Natural Language Processing
- Linguistic Ambiguities
- Typical NLP tasks
- Information Extraction
- Named Entity Recognition

80:20 - Pareto Principle
The principle was suggested by management thinker Joseph M. Juran, who named it after the Italian economist Vilfredo Pareto. Pareto observed that 80% of income in Italy was received by 20% of the Italian population. The assumption is that most of the results in any situation are determined by a small number of causes.
A different twist: 80% of the features get done in 20% of the time, and the remaining, laborious 20% of the features take 80% of the time.

Introduction - NLP definition
Natural Language Processing, word by word (definitions from the Merriam-Webster dictionary):
- Natural (adjective): existing in nature and not made or caused by people; coming from nature.
- Language (noun): the system of words and sounds that people use to express thoughts and feelings to each other.
- Process (verb): to subject to or handle through an established, usually routine, set of procedures.

Technology
Mostly solved:
- Spam detection: "Let’s go to Agra!" ✓ / "Buy Football…" ✗
- Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV
- Named entity recognition (NER): Einstein (PERSON) met with UN (ORG) officials in Princeton (LOC)
Making good progress:
- Sentiment analysis: "Best roast chicken in Hyderabad!" ✓ / "The waiter ignored us for 20 minutes." ✗
- Coreference resolution: “I voted for Charan because he was most aligned with my values” he said
- Word sense disambiguation (WSD): I need new batteries for my mouse.
- Parsing: I can see Alcatraz from the window!
- Machine translation (MT): 第13届上海国际电影节开幕… -> The 13th Shanghai International Film Festival…
- Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" -> Party, May 27 (add)
Still really hard:
- Question answering (QA): Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
- Paraphrase: XYZ acquired ABC yesterday <-> ABC has been taken over by XYZ
- Summarization: The Dow Jones is up; The S&P500 jumped; Housing prices rose -> Economy is good
- Dialog: "Where is Citizen Kane playing in NY?" "IMax Theatre at 7:30. Do you want a ticket?"

Introduction
Dave Bowman: Open the pod bay doors, HAL.
HAL: I’m sorry Dave, I’m afraid I can’t do that.
(Stanley Kubrick & Arthur C. Clarke, screenplay of 2001: A Space Odyssey)

Introduction - Basic concepts
- Syntax: "Open the pod bay doors, HAL." vs. "HAL, is the pod bay door open?"
- Lexical semantics: the meaning of component words
- Compositional semantics: knowledge of how components combine to form larger meanings. "Colorless green ideas sleep furiously." <-> "Dark green leaves rustle furiously."
- Pragmatics: "I’m sorry ..., I’m afraid I can’t" vs. "No, I won’t open the door."
- Phonetics & Phonology
- Morphology: producing contractions such as I’m, can’t
- Discourse conventions: engaging in structured conversation and using reference, e.g. the word "that" in "I’m sorry Dave, I’m afraid I can’t do that"
Syntax is about sentence formation and semantics about sentence interpretation; phonetics and phonology cover the field of sentence utterance.

[Figure: diagram relating Maths, Computer Science, Linguistics and NLP]

Ambiguities
Linguistic Ambiguities:
- Morphological Ambiguities
- Syntactic Ambiguities
- Semantic Ambiguities

Linguistic Ambiguities
Example: I made her duck
Five possible interpretations:
1. I cooked waterfowl/duck for her
2. I cooked waterfowl belonging to her
3. I created the (plaster?) duck she owns
4. I caused her to quickly lower her head or body
5. I waved my magic wand and turned her into waterfowl

Linguistic Ambiguities
Morphological ambiguity:
- duck: verb or noun
- her: dative pronoun or possessive pronoun
Syntactic ambiguity: "make" can be
- transitive: taking a single direct object (case 2)
- ditransitive: taking two objects, meaning the first object (her) got made into the second object (duck) (case 5)
- followed by a direct object and a verb, meaning that the object (her) got caused to perform the verbal action (duck) (case 4)
Semantic ambiguity: "make" can mean
- cook
- create
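The morphological ambiguity above can be made concrete with a small sketch. It assumes a toy lexicon (the tag names and the `LEXICON` dictionary are illustrative, not a real tag set) and simply enumerates every tag combination for the sentence; the syntactic and semantic ambiguity of "make" would multiply the readings further.

```python
from itertools import product

# Toy lexicon (an illustrative assumption, not a real tag set):
# each word maps to the categories it can take in this sentence.
LEXICON = {
    "I": ["PRONOUN"],
    "made": ["VERB"],
    "her": ["DATIVE_PRONOUN", "POSSESSIVE_PRONOUN"],
    "duck": ["NOUN", "VERB"],
}

def tag_sequences(sentence):
    """Return every combination of candidate tags for the sentence."""
    words = sentence.split()
    candidates = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*candidates)]

readings = tag_sequences("I made her duck")
for reading in readings:
    print(" ".join(f"{word}/{tag}" for word, tag in reading))
```

Two ambiguous words with two options each already give four tag sequences, which is why a tagger needs context to choose among them.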

Typical NLP Tasks - Simpler
- Tokenization (RegEx)
- Sentence splitting (RegEx)
- POS tagging (algorithms and tag sets)
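A minimal sketch of regex-based tokenization and sentence splitting, using only Python's standard `re` module. The two patterns are simplifying assumptions: they handle straight apostrophes in contractions, and they would wrongly split after abbreviations such as "Dr." when a capitalized word follows.

```python
import re

# A word (optionally with an internal apostrophe, e.g. "I'm") or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
# Split at whitespace that follows ., ! or ? and precedes an uppercase letter.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences at the boundaries matched by SENTENCE_RE."""
    return [s for s in SENTENCE_RE.split(text.strip()) if s]

def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence."""
    return TOKEN_RE.findall(sentence)

text = "Open the pod bay doors, HAL. I'm sorry Dave, I'm afraid I can't do that."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Keeping contractions as single tokens is a design choice; some tokenizers split "can't" into "ca" and "n't" instead.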

Typical NLP tasks - Complex
- Lemmatization / Stemming
- Syntactic Parsing
- Question Answering systems
- Topic Extraction
- NER
- Semantic Analysis
- Knowledge Graphs
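To illustrate the difference between the first two items, here is a toy sketch contrasting stemming (rule-based suffix stripping) with lemmatization (dictionary lookup). The suffix list and the `LEMMAS` dictionary are hypothetical; this is not the Porter algorithm or any library's implementation.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary; real lemmatizers rely on much larger lexicons.
LEMMAS = {"made": "make", "leaves": "leaf", "cooked": "cook", "ducks": "duck"}

def lemmatize(word):
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["cooked", "ducks", "leaves", "made"]:
    print(w, stem(w), lemmatize(w))
```

The stemmer turns "leaves" into the non-word "leav" and leaves "made" untouched, while the lookup returns the dictionary forms "leaf" and "make".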

Semi-structured information vs. unstructured information
How can this information be linked to a knowledge base automatically?

Information Extraction
Information Extraction (IE) systems:
- Find and understand limited relevant parts of text
- Gather information from many pieces of text
- Produce a structured representation of relevant information: relations (in the database sense) or a knowledge base
Goals:
- Organize information so that it is useful to people
- Put information in a semantically precise form that allows further inferences to be made by computer algorithms
Notes:
- Information extraction is a more general task. The goal is to get semantic information out of documents, especially web pages.
- It is defined as a scaled-down version of the loftier goal of Natural Language Understanding, which makes it more technologically manageable.
- We are often interested in particular relations (in the DB sense), e.g. scouring financial news for movements of executives: [person] [assumes/loses] [role] at [company]. The idea is to scour the text, find relation instances, pull them out, and put them in a database.
- Domain- and problem-specific customization is allowed.
- There are lots of potential applications: business and financial contexts, and biomedical contexts such as clinical medicine. There is a great deal of unstructured text about research and patients from which one would like to get structured information.
- This could lead to other automated information finding: trends, correlations, drug interactions, impact of some protein on expression of a gene, ...

Information Extraction
IE systems extract clear, factual information. Roughly: who did what to whom, when?
- E.g., gathering earnings, profits, board members, headquarters, etc. from company reports:
  "The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia."
  headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
- E.g., learning drug-gene product interactions from medical research literature
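The headquarters example above can be sketched as a hand-written extraction rule. The patterns below are assumptions tailored to this one sentence shape (a capitalized phrase after "headquarters of", and a place name after "is/are located in"); a real IE system would need many such rules or a learned model.

```python
import re

# A capitalized phrase such as "BHP Billiton Limited" (a rough assumption about organization names).
CAP_PHRASE = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"
# "headquarters of", optional lowercase words ("the combined"), then the organization.
ORG_RE = re.compile(r"headquarters of (?:[a-z]+ )*(?P<org>" + CAP_PHRASE + r")")
# "is/are located in" followed by comma-separated capitalized place names.
LOC_RE = re.compile(r"(?:is|are) located in (?P<loc>[A-Z][a-z]+(?:, [A-Z][a-z]+)*)")

def extract_headquarters(sentence):
    """Return (organization, location) pairs for a headquarters(org, loc) relation."""
    loc_match = LOC_RE.search(sentence)
    if loc_match is None:
        return []
    location = loc_match.group("loc")
    return [(m.group("org"), location) for m in ORG_RE.finditer(sentence)]

sentence = ("The headquarters of BHP Billiton Limited, and the global headquarters "
            "of the combined BHP Billiton Group, are located in Melbourne, Australia.")
for org, loc in extract_headquarters(sentence):
    print(f'headquarters("{org}", "{loc}")')
```

Each returned pair corresponds to one row of a `headquarters` relation in the database sense.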

Information Extraction
Between AD 1400 and 1450, China was a global superpower run by one family – the Ming dynasty – who established Beijing as the capital and built the Forbidden City.

Named Entities
Between AD 1400 and 1450, China was a global superpower run by one family – the Ming dynasty – who established Beijing as the capital and built the Forbidden City.
Named Entity Recognition (NER):
- China: /location/country
- Ming dynasty: /royalty/royal_line
- Beijing: /location/city
- Forbidden City: /location/city

Named Entities: Definition
- Named Entities: proper nouns, which refer to entities
- Named Entity Recognition: detecting boundaries of named entities (NEs)
- Named Entity Classification: assigning classes to NEs, such as PERSON, LOCATION, ORGANISATION, or fine-grained classes such as ROYAL LINE
- Named Entity Linking / Disambiguation: linking NEs to concrete entities in a knowledge base
Example: candidate entities for "China"
- China -> Location: Republic of China, country in East Asia
- China -> Person: China, Brazilian footballer born in 1964
- China -> Music: China, a 1979 album by Vangelis
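As a rough illustration of the linking step, the sketch below picks among the three "China" candidates by counting words shared between the mention's context and each candidate's description. The `CANDIDATES` structure and the overlap score are hypothetical simplifications; real linkers use far richer knowledge bases and features, and here a tie simply goes to the first candidate listed.

```python
import re

# Hypothetical mini knowledge base built from the three candidates on the slide.
CANDIDATES = {
    "China": [
        {"type": "Location", "description": "country in East Asia"},
        {"type": "Person", "description": "Brazilian footballer born in 1964"},
        {"type": "Music", "description": "a 1979 album by Vangelis"},
    ]
}

def words(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def link(mention, context):
    """Pick the candidate whose description shares the most words with the context."""
    context_words = words(context)
    return max(
        CANDIDATES[mention],
        key=lambda c: len(words(c["description"]) & context_words),
    )

print(link("China", "Vangelis released the album China in 1979")["type"])
print(link("China", "The Brazilian footballer China was born in 1964")["type"])
```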

Information Extraction: Methods
Possible methodologies:
- Rule-based approaches: write manual extraction rules
- Machine learning based approaches:
  - Supervised learning: manually annotate text, train a machine learning model
  - Unsupervised learning: extract language patterns, cluster similar ones
  - Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping)
- Gazetteer-based method: use an existing list of named entities
- Combination of the above
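Of these, the gazetteer-based method is the simplest to sketch. The example below assumes a tiny hand-made gazetteer whose entries and classes are taken from the Named Entities slide, and tags every exact occurrence in the example sentence. It performs no disambiguation, so an ambiguous name would always receive the same class.

```python
import re

# Hypothetical gazetteer; names and classes follow the Named Entities example slide.
GAZETTEER = {
    "China": "/location/country",
    "Ming dynasty": "/royalty/royal_line",
    "Beijing": "/location/city",
    "Forbidden City": "/location/city",
}

def tag_entities(text):
    """Return (start, end, name, class) for every gazetteer entry found in the text."""
    # Longest names first, so a longer entry wins over a shorter one it contains.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.start(), m.end(), m.group(), GAZETTEER[m.group()])
            for m in pattern.finditer(text)]

text = ("Between AD 1400 and 1450, China was a global superpower run by one family "
        "- the Ming dynasty - who established Beijing as the capital and built "
        "the Forbidden City.")
for start, end, name, label in tag_entities(text):
    print(f"{name}: {label}")
```

The character offsets give the entity boundaries (recognition) and the dictionary value gives the class (classification), which are the two steps that trained NERC systems learn from annotated data.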

NLP & ML Software
Natural Language Processing:
- GATE (general-purpose architecture; includes other NLP and ML software as plugins)
- Stanford NLP (Java)
- OpenNLP (Java)
- NLTK (Python)
Machine Learning:
- scikit-learn (Python; rich documentation, highly recommended!)
- Mallet (Java)
- WEKA (Java)
- Alchemy (graphical models, C++)
- FACTORIE (graphical models, Scala)
- CRFSuite (efficient implementation of CRFs; C/C++ with a Python interface)

NLP & ML Software
Ready-to-use NERC software:
- ANNIE (rule-based, part of GATE)
- Wikifier (based on Wikipedia)
- FIGER (based on Wikipedia, fine-grained Freebase NE classes)
Almost ready-to-use NERC software:
- CRFSuite (already includes a Python implementation for feature extraction; you just need to feed it training data, which you can also download)
Ready-to-use RE software:
- ReVerb (Open IE, extracts patterns for any kind of relation)
- MultiR (distant supervision, relation extractor trained on Freebase)
Web content extraction software:
- Boilerpipe (extracts the main text content from Web pages)
- Jsoup (traverses elements of Web pages individually, also allows extracting text)