Published by Kristina Acker. Modified over 6 years ago.
1
Text Analytics / NLP
Charan Puvvala, Mirabel Technologies Inc.
Hyderabad Machine Learning Group
2
Outline
- Introduction
- Natural Language Processing
- Linguistic Ambiguities
- Typical NLP tasks
- Information Extraction
- Named Entity Recognition
3
80:20 - Pareto Principle
- The principle was suggested by management thinker Joseph M. Juran.
- It was named after the Italian economist Vilfredo Pareto, who observed that 80% of income in Italy was received by 20% of the Italian population.
- The assumption is that most of the results in any situation are determined by a small number of causes.
- A different twist: 80% of the features are done in 20% of the time, and the remaining 20% of laborious features take 80% of the time.
4
Introduction - NLP definition
Natural Language Processing
- Natural (adjective): existing in nature and not made or caused by people; coming from nature. (Merriam-Webster dictionary)
- Language (noun): the system of words and sounds that people use to express thoughts and feelings to each other. (Merriam-Webster dictionary)
- Process (verb): to subject to or handle through an established, usually routine, set of procedures. (Merriam-Webster dictionary)
5
Language technology
Status categories on the slide: mostly solved, making good progress, still really hard.
- Sentiment analysis: "Best roast chicken in Hyderabad!" / "The waiter ignored us for 20 minutes."
- Question answering (QA): "Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?"
- Spam detection: ✓ "Let’s go to Agra!" / ✗ "Buy Football…"
- Coreference resolution: “I voted for Charan because he was most aligned with my values” he said
- Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
- Word sense disambiguation (WSD): "I need new batteries for my mouse."
- Part-of-speech (POS) tagging: "Colorless green ideas sleep furiously." -> ADJ ADJ NOUN VERB ADV
- Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" -> "Economy is good"
- Parsing: "I can see Alcatraz from the window!"
- Named entity recognition (NER): "Einstein met with UN officials in Princeton" -> PERSON, ORG, LOC
- Machine translation (MT): "第13届上海国际电影节开幕…" -> "The 13th Shanghai International Film Festival…"
- Dialog: "Where is Citizen Kane playing in NY?" / "IMax Theatre at 7:30. Do you want a ticket?"
- Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" -> calendar entry: Party, May 27 (add)
6
Introduction Dave Bowman: Open the pod bay doors, HAL.
HAL: I’m sorry Dave, I’m afraid I can’t do that.
(Stanley Kubrick & Arthur C. Clarke, screenplay of 2001: A Space Odyssey)
7
Introduction - Basic concepts
- Phonetics & Phonology
- Morphology: producing contractions such as I’m, can’t
- Syntax: "Open the pod bay doors, HAL." vs. "HAL, is the pod bay door open?"
- Lexical Semantics: the meaning of component words
- Compositional Semantics: knowledge of how components combine to form larger meanings, e.g. "Colorless green ideas sleep furiously." <-> "Dark green leaves rustle furiously."
- Pragmatics: "I’m sorry ..., I’m afraid I can’t" vs. "No, I won’t open the door."
- Discourse conventions: engaging in structured conversation, using reference, e.g. "that" in "I’m sorry Dave, I’m afraid I can’t do that"
Syntax is about sentence formation and semantics is about sentence interpretation; phonetics and phonology cover sentence utterance.
8
[Figure: NLP shown together with Maths, Computer Science and Linguistics]
10
Ambiguities
Linguistic Ambiguities
- Morphological Ambiguities
- Syntactic Ambiguities
- Semantic Ambiguities
11
Linguistic Ambiguities
Example: "I made her duck"
Five possible interpretations:
1. I cooked waterfowl/duck for her
2. I cooked waterfowl belonging to her
3. I created the (plaster?) duck she owns
4. I caused her to quickly lower her head or body
5. I waved my magic wand and turned her into waterfowl
12
Linguistic Ambiguities
- Morphological ambiguity:
  - duck: verb or noun
  - her: dative pronoun or possessive pronoun
- Syntactic ambiguity: make can be
  - transitive: taking a single direct object (interpretation 2)
  - ditransitive: taking two objects, meaning the first object (her) got made into the second object (duck)
  - taking a direct object and a verb, meaning that the object (her) got caused to perform the verbal action (duck)
- Semantic ambiguity: make can mean cook or create
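As a rough illustration of the word-level ambiguity described above, the sketch below enumerates every tag combination for "I made her duck". The lexicon, the tag names and the function `tag_sequences` are assumptions made for this example; they are not taken from any particular tagger or tag set.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the readings named on the slide.
LEXICON = {
    "I": ["PRON"],
    "made": ["VERB"],
    "her": ["PRON-DATIVE", "PRON-POSSESSIVE"],
    "duck": ["NOUN", "VERB"],
}


def tag_sequences(tokens, lexicon=LEXICON):
    """Return every possible (word, tag) sequence licensed by the lexicon."""
    options = [lexicon[token] for token in tokens]
    return [list(zip(tokens, tags)) for tags in product(*options)]


if __name__ == "__main__":
    for reading in tag_sequences("I made her duck".split()):
        print(reading)
```

Two ambiguous words with two readings each already give four tag sequences; the syntactic and semantic ambiguities of "make" multiply the readings further, which is why a tagger or parser has to pick among them using context.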
13
Typical NLP Tasks - Simpler
- Tokenization: RegEx
- Sentence Splitting: RegEx
- POS-Tagging: POS-tagging algorithms and tag sets
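The two regex-based steps can be sketched with Python's standard `re` module. The patterns below are deliberately simple assumptions for illustration (they will mis-split abbreviations such as "Dr." and do not handle typographic apostrophes); real tokenizers use far more rules.

```python
import re

# A token is a word (optionally with an apostrophe contraction) or a single
# punctuation character.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# A sentence boundary is whitespace that follows ., ! or ? and precedes a
# capital letter.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split text into sentences at the simple boundary pattern."""
    return [s for s in SENTENCE_BOUNDARY_RE.split(text.strip()) if s]


if __name__ == "__main__":
    text = "Open the pod bay doors, HAL. I'm sorry Dave. I can't do that."
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```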
14
Typical NLP tasks - Complex
- Lemmatization / Stemming
- Syntactic Parsing
- Question Answering systems
- Topic Extraction
- NER
- Semantic Analysis
- Knowledge Graphs
16
Semi-structured information
Unstructured information
How to link this information to a knowledge base automatically?
17
Information Extraction
Information Extraction (IE) systems
- Find and understand limited relevant parts of text
- Gather information from many pieces of text
- Produce a structured representation of relevant information:
  - Relations (in the database sense)
  - A knowledge base
Goals:
- Organize information so that it is useful to people
- Put information in a semantically precise form that allows further inferences to be made by computer algorithms
Notes:
- Defining information extraction: a more general task. Goal: get semantic information out of documents, especially web pages.
- It is defined as a scaled-down version of the loftier goal of Natural Language Understanding, and is more technologically manageable.
- We are often interested in learning about particular relations (in the database sense), e.g. scouring financial news for movements of executives: [person] [assumes/loses] [role] at [company]. The aim is to scour text, find relation instances, extract them and put them in a database.
- Domain- and problem-specific customization is allowed.
- Lots of potential applications: business/financial contexts; biomedical contexts and clinical medicine, where there is a great deal of unstructured text about research and patients from which one would like to get structured information.
- This could lead to other automated information finding: trends, correlations, drug interactions, impact of some protein on expression of a gene, ...
18
Information Extraction
IE systems extract clear, factual information
- Roughly: who did what to whom, when?
- E.g., gathering earnings, profits, board members, headquarters, etc. from company reports
  - "The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia."
  - headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
- E.g., learning drug-gene product interactions from medical research literature
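To make the headquarters example concrete, here is a minimal rule-based sketch that turns the sentence into a relation tuple. The regular expression and the names `HQ_PATTERN` and `extract_headquarters` are assumptions written for this one sentence shape; they are not how any particular IE system works, and a real extractor would need many more patterns or a learned model.

```python
import re

# Hand-written pattern: "headquarters of <Capitalised Name> ... is/are located
# in <Place>[, <Place>]". Capitalised word runs stand in for entity mentions.
HQ_PATTERN = re.compile(
    r"headquarters of (?P<org>[A-Z]\w*(?: [A-Z]\w*)*)"
    r".*?(?:is|are) located in "
    r"(?P<loc>[A-Z]\w*(?:, [A-Z]\w*)*)"
)

SENTENCE = (
    "The headquarters of BHP Billiton Limited, and the global headquarters "
    "of the combined BHP Billiton Group, are located in Melbourne, Australia."
)


def extract_headquarters(sentence):
    """Return a (relation, organisation, location) tuple, or None if no match."""
    match = HQ_PATTERN.search(sentence)
    if match is None:
        return None
    return ("headquarters", match.group("org"), match.group("loc"))


if __name__ == "__main__":
    print(extract_headquarters(SENTENCE))
```

Note that this pattern only captures the first organisation; the second mention (the combined BHP Billiton Group) would need an additional rule.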
19
Information Extraction
Between AD 1400 and 1450, China was a global superpower run by one family – the Ming dynasty – who established Beijing as the capital and built the Forbidden City.
20
Named Entities
Between AD 1400 and 1450, China was a global superpower run by one family – the Ming dynasty – who established Beijing as the capital and built the Forbidden City.
Named Entity Recognition (NER)
21
Named Entities
Between AD 1400 and 1450, China was a global superpower run by one family – the Ming dynasty – who established Beijing as the capital and built the Forbidden City.
Named Entity Recognition (NER)
- China: /location/country
- Ming dynasty: /royalty/royal_line
- Beijing: /location/city
- Forbidden City: /location/city
22
Named Entities: Definition
- Named Entities: proper nouns, which refer to entities
- Named Entity Recognition: detecting boundaries of named entities (NEs)
- Named Entity Classification: assigning classes to NEs, such as PERSON, LOCATION, ORGANISATION, or fine-grained classes such as ROYAL LINE
- Named Entity Linking / Disambiguation: linking NEs to concrete entities in a knowledge base, e.g. for the mention "China":
  - Location: Republic of China, country in East Asia
  - Person: China, Brazilian footballer born in
  - Music: China, a 1979 album by Vangelis
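One simple way to picture entity linking is to score each knowledge-base candidate by how many of its associated keywords appear in the mention's context. The sketch below does exactly that for the three "China" candidates; the knowledge base `KB`, its identifiers and keyword sets, and the function `link` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini knowledge base: candidates for each mention string, with
# hand-picked context keywords (assumptions made for this sketch).
KB = {
    "China": [
        {"id": "china_country", "type": "Location",
         "keywords": {"country", "asia", "beijing", "dynasty", "capital"}},
        {"id": "china_footballer", "type": "Person",
         "keywords": {"footballer", "brazilian", "club", "goal"}},
        {"id": "china_album", "type": "Music",
         "keywords": {"album", "vangelis", "track", "1979"}},
    ]
}


def link(mention, context, kb=KB):
    """Pick the candidate whose keywords overlap most with the context words."""
    candidates = kb.get(mention, [])
    if not candidates:
        return None
    words = set(re.findall(r"\w+", context.lower()))
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=lambda c: len(c["keywords"] & words))


if __name__ == "__main__":
    context = "China was run by the Ming dynasty, who established Beijing as the capital."
    print(link("China", context)["id"])
```

Keyword overlap is only a stand-in for the context models real linkers use, but it shows why the surrounding text, not the mention alone, decides which entity is meant.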
23
Information Extraction: Methods
Possible methodologies
- Rule-based approaches: write manual extraction rules
- Machine learning based approaches
  - Supervised learning: manually annotate text, train machine learning model
  - Unsupervised learning: extract language patterns, cluster similar ones
  - Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping)
- Gazetteer-based method: use existing list of named entities
- Combination of the above
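The gazetteer-based method is the easiest of these to sketch: look up known names in the text and return their classes. The example below is a minimal version under stated assumptions. The gazetteer entries reuse the classes from the Named Entities slide, while the function `tag_entities` and its output format are made up for illustration.

```python
import re

# Tiny gazetteer: entity name -> class (classes as given on the earlier slide).
GAZETTEER = {
    "China": "/location/country",
    "Ming dynasty": "/royalty/royal_line",
    "Beijing": "/location/city",
    "Forbidden City": "/location/city",
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, name, class) for every gazetteer entry found in text."""
    # Try longer names first so multi-word entries win over shorter overlaps.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sentence = (
        "Between AD 1400 and 1450, China was a global superpower run by one "
        "family - the Ming dynasty - who established Beijing as the capital "
        "and built the Forbidden City."
    )
    for start, end, name, label in tag_entities(sentence):
        print(f"{name}: {label} [{start}:{end}]")
```

A pure lookup like this cannot find names missing from the list or resolve ambiguous ones, which is why gazetteers are usually combined with rules or a trained model.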
24
NLP & ML Software
Natural Language Processing:
- GATE (general purpose architecture, includes other NLP and ML software as plugins)
- Stanford NLP (Java)
- OpenNLP (Java)
- NLTK (Python)
Machine Learning:
- scikit-learn (Python, rich documentation, highly recommended!)
- Mallet (Java)
- WEKA (Java)
- Alchemy (graphical models, C++)
- FACTORIE (graphical models, Scala)
- CRFSuite (efficient implementation of CRFs, C++ with Python bindings)
25
NLP & ML Software
Ready to use NERC (named entity recognition and classification) software:
- ANNIE (rule-based, part of GATE)
- Wikifier (based on Wikipedia)
- FIGER (based on Wikipedia, fine-grained Freebase NE classes)
Almost ready to use NERC software:
- CRFSuite (already includes Python implementation for feature extraction; you just need to feed it with training data, which you can also download)
Ready to use RE (relation extraction) software:
- ReVerb (Open IE, extracts patterns for any kind of relation)
- MultiR (distant supervision, relation extractor trained on Freebase)
Web Content Extraction software:
- Boilerpipe (extracts main text content from Web pages)
- Jsoup (traverses elements of Web pages individually, also allows text extraction)