
Text Analysis
Márius Šajgalík

Natural Language Processing
Speech segmentation, speech recognition, machine translation, OCR...

Text Analysis
Segmentation of text into sentences and words
Part-of-speech tagging
Named entity recognition
Word sense disambiguation
Parsing
Relation extraction
Morphological segmentation
Text simplification
Coreference resolution

StanfordCoreNLP
Segmentation of text into sentences and words
Part-of-speech tagging
Named entity recognition
Parse tree construction
Coreference resolution

Example
The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.

Part-of-speech tagging

Named entity recognition

Coreference resolution

Dependency parsing

Parse tree
(ROOT (S (NP (NP (DT The) (JJS strongest) (NN rain)) (VP (ADVP (RB ever)) (VBN recorded) (PP (IN in) (NP (NNP India))))) (VP (VP (VBD shut) (PRT (RP down)) ...

Usage
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreLabel> tokens = sentence.get(TokensAnnotation.class);
    for (CoreLabel token : tokens) {
        String word = token.get(TextAnnotation.class);
        String tag = token.get(PartOfSpeechAnnotation.class);
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
    Tree tree = sentence.get(TreeAnnotation.class);
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

StanfordCoreNLP
Bindings for multiple languages
Java, Python, Ruby, Clojure, JavaScript (node.js), C# (IKVM)

Similar tools
OpenNLP (also Java, an Apache project)
NLTK for Python

NLTK – word tokenization
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

NLTK – part-of-speech tagging
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

NLTK – named entity recognition
>>> entities = nltk.chunk.ne_chunk(tagged)
>>> entities
Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

NLTK – parse tree
>>> from nltk.corpus import treebank
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
>>> t.draw()

Gensim
Topic modelling toolkit (Python)
Supports TF-IDF, LSA, pLSA, LDA, HDP
Simple to use

TF-IDF
Term frequency – inverse document frequency
TF – the frequency of the word in the current document
IDF – the negative logarithm of the probability that the word occurs in a document (the same for all documents)
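The definition above can be sketched in a few lines of plain Python (the toy corpus is invented for illustration; Gensim computes the same weighting for you):

```python
import math

# toy corpus for illustration: three tokenized "documents"
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)             # frequency in the current document
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = -math.log(df / len(docs))             # negative log of occurrence probability
    return tf * idf

print(tf_idf("the", docs[0], docs))  # 0.0 -- "the" occurs in every document, so IDF = 0
print(tf_idf("cat", docs[0], docs))  # positive -- "cat" is more discriminative
```

Note that a word occurring in every document gets weight 0, which is exactly why stop words like "the" are suppressed by this scheme.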

Gensim – LDA
>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2'))
>>> print(mm)
MmCorpus(... documents, ... features, ... non-zero entries)

Gensim – LDA
>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
>>> lda.print_topics(20)

topic #0: 0.009*river *lake *island *mountain *area *park *antarctic *south *mountains *dam
topic #1: 0.026*relay *athletics *metres *freestyle *hurdles *ret *divisão *athletes *bundesliga *medals
topic #2: 0.002*were *he *court *his *had *law *government *police *patrolling *their
topic #3: 0.040*courcelles *centimeters *mattythewhite *wine *stamps *oko *perennial *stubs *ovate *greyish
topic #4: 0.039*al *sysop *iran *pakistan *ali *arab *islamic *arabic *saudi *muhammad

Gensim – LDA
There appears to be a lot of noise in your dataset. The first three topics in your list appear to be meta topics concerning the administration and cleanup of Wikipedia. These show up because you didn't exclude templates such as the cleanup template, some of which are included in most articles for quality control. The fourth and fifth topics clearly show the influence of bots that import massive databases of cities, countries, etc. and their statistics such as population, capita, etc. The sixth shows the influence of sports bots, and the seventh of music bots.

WordNet
Lexical database
Contains synsets
Nouns, verbs, adjectives, adverbs
Links between synsets
Antonyms, hypernyms, hyponyms and their instances, holonyms, meronyms

WordNet
dog, domestic dog, Canis familiaris
  => canine, canid
    => carnivore
      => placental, placental mammal, eutherian, ...
        => mammal
          => vertebrate, craniate
            => chordate
              => animal, animate being, beast, brute, creature, ...
                => ...
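In practice such chains are queried through WordNet itself (e.g. via nltk.corpus.wordnet), but as a minimal self-contained sketch the hypernym chain from the slide can be encoded as a toy mapping and walked upward:

```python
# the slide's hypernym chain for "dog", encoded as a toy dict (illustrative only)
hypernym = {
    "dog": "canine", "canine": "carnivore", "carnivore": "placental",
    "placental": "mammal", "mammal": "vertebrate",
    "vertebrate": "chordate", "chordate": "animal",
}

def hypernym_chain(word):
    """Follow hypernym links until the top of the hierarchy is reached."""
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

print(hypernym_chain("dog"))
# ['dog', 'canine', 'carnivore', 'placental', 'mammal', 'vertebrate', 'chordate', 'animal']
```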

Vector representation of words
Each word has a learned vector of real numbers that represent its various properties
(figure: example vectors for the words THE, TO, AND)

t-SNE

Similarity between words
Cosine similarity
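Cosine similarity compares the directions of two vectors, ignoring their lengths; a minimal implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Because word vectors with similar meanings point in similar directions, values close to 1 indicate semantically related words.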

Properties of word vectors
They capture many linguistic regularities
vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')
vector('king') - vector('man') + vector('woman') ~= vector('queen')
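The king/queen analogy can be reproduced with toy vectors (the 2-D vectors below are hand-picked purely for illustration; real models use hundreds of dimensions): take vector('king') - vector('man') + vector('woman') and return the nearest word by cosine similarity, excluding the query words themselves, as word2vec's analogy evaluation does.

```python
import math

# hand-picked toy 2-D "embeddings" (illustrative only)
vec = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [0.0, 1.0],
    "queen": [0.0, 2.0],
    "apple": [3.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Word closest to vec[a] - vec[b] + vec[c], query words excluded."""
    target = [x - y + z for x, y, z in zip(vec[a], vec[b], vec[c])]
    candidates = (w for w in vec if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vec[w], target))

print(analogy("king", "man", "woman"))  # queen
```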

Machine translation at Google
Given word vectors learned for several languages, translation amounts to finding a linear transformation from one vector space to the other
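A minimal sketch of that idea, with toy 2-D vectors and NumPy least squares standing in for the real multilingual embeddings (in the actual method the map is fit on a small bilingual dictionary of known word pairs):

```python
import numpy as np

# toy source-language word vectors (one per row), purely illustrative
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# pretend the target-language vectors are the source rotated by 90 degrees
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
Y = X @ R

# learn the linear map W minimizing ||X W - Y||^2 from the known pairs
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# "translate" a new source vector by applying W, then look up its nearest
# target-language word (nearest-neighbour lookup omitted here)
print(np.allclose(np.array([2.0, 3.0]) @ W, np.array([2.0, 3.0]) @ R))  # True
```

Since the training pairs here are generated by an exact linear map, least squares recovers it perfectly; with real embeddings the relation only holds approximately, which is why the fit is done by minimizing squared error over many dictionary pairs.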

Resources
StanfordCoreNLP
OpenNLP -
NLTK -
Gensim -
WordNet -
Word2vec -
Word vectors -
t-SNE -
Translation -