Text Analytics Solutions with Azure Machine Learning

Dinesh Asanka

Contact Me Blog:

About Me
MVP – Data Platform, since 2009.
Senior Architect – Technology, VirtusaPolaris Pvt Ltd.
B.Sc (Eng), MBA (IT), MSc (AI).
Reading for an MPhil at the University of Moratuwa, Sri Lanka.
Speaker at universities in Sri Lanka, conferences, and user groups.

#Azure #MachineLearning #Text Analytics

Machine Learning
Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to "learn" with data, without being explicitly programmed.

Azure
Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through a global network of Microsoft-managed data centers.

Text Analytics
Text analytics means examining text that was written by, or about, customers. You find patterns and topics of interest, and then take practical action based on what you learn.

Challenges in Text Analytics

Language Detection
The language detection algorithm can identify many different languages. It analyzes each row of text and assigns a probability score for each language. The language in the first result column is the one that received the highest score.

Preprocess Text
Removal of stop-words
Using regular expressions to search for and replace specific target strings
Lemmatization, which converts multiple related words to a single canonical form
Filtering on specific parts of speech
Case normalization
Removal of certain classes of characters, such as numbers, special characters, and sequences of repeated characters such as "aaaa"
Identification and removal of email addresses and URLs
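Several of the preprocessing steps listed above can be sketched with nothing more than regular expressions. The example below is a minimal illustration, not the Azure module itself: the stop-word list, the regular expressions, and the function name `preprocess` are all assumptions made for this sketch, and lemmatization and part-of-speech filtering are left out because they need a linguistic model.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "at", "me", "or"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")          # email addresses
REPEAT_RE = re.compile(r"(.)\1{2,}")            # 3+ repeats of one character
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # numbers and special characters


def preprocess(text):
    """Return a list of cleaned tokens for one piece of text."""
    text = text.lower()                  # case normalization
    text = URL_RE.sub(" ", text)         # remove URLs
    text = EMAIL_RE.sub(" ", text)       # remove email addresses
    text = REPEAT_RE.sub(r"\1", text)    # collapse runs such as "aaaa" to "a"
    text = NON_LETTER_RE.sub(" ", text)  # remove numbers and special characters
    return [t for t in text.split() if t not in STOP_WORDS]  # remove stop-words


sample = "The service is GREAAAT!!! Mail me at bob@example.com or see http://example.com/x 24/7"
print(preprocess(sample))  # ['service', 'great', 'mail', 'see']
```

The order of the steps matters in this sketch: URLs and email addresses are removed before special characters are stripped, otherwise their fragments would survive as ordinary tokens.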

Named Entity Recognition
Named entity recognition is an important area of research in machine learning and natural language processing. It can answer questions such as:
Does a tweet contain the name of a person? Does the tweet also provide their current location?
Which companies were mentioned in a news article?
Were specified products mentioned in complaints or reviews?
Input text: "Boston is a great place to live."
Module output: 0,Boston,0,6,LOC (article index, mention, offset, length, entity type)

Feature Hashing
Feature hashing works by converting unique tokens into integers. It operates on the exact strings that you provide as input and does not perform any linguistic analysis or preprocessing.
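The idea of turning tokens into integers can be sketched as follows. This is an illustration under stated assumptions: the hash function (a truncated MD5 from the standard library), the `n_bits` parameter, and the function names are chosen for this example and are not necessarily what the Azure module uses.

```python
import hashlib


def hash_token(token, n_bits=10):
    """Map a token to an integer index in the range [0, 2**n_bits)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << n_bits)


def hash_features(tokens, n_bits=10):
    """Build a fixed-length count vector from a list of tokens."""
    vector = [0] * (1 << n_bits)
    for token in tokens:
        vector[hash_token(token, n_bits)] += 1
    return vector


vector = hash_features(["good", "good", "bad"], n_bits=4)
print(len(vector), sum(vector))  # 16 3
```

Because the hash operates on exact strings, "Good" and "good" are treated as different tokens unless the text is case-normalized first. Distinct tokens can also collide on the same index; a larger `n_bits` makes that less likely at the cost of a longer vector.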

N-Grams
For N-grams, type a number that defines the maximum length of the n-grams to add to the training dictionary. An n-gram is a sequence of n words, treated as a unique unit.
N-grams = 1: Unigrams, or single words.
N-grams = 2: Bigrams, or two-word sequences, plus unigrams.
N-grams = 3: Trigrams, or three-word sequences, plus bigrams and unigrams.
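The effect of the N-grams setting can be shown with a short sketch. The function name `extract_ngrams` and its `max_n` parameter are hypothetical names for this example; the function returns every n-gram from length 1 up to the chosen maximum, matching the "plus unigrams" and "plus bigrams" behavior described above.

```python
def extract_ngrams(tokens, max_n):
    """Return all n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["boston", "is", "great"]
print(extract_ngrams(tokens, 2))
# ['boston', 'is', 'great', 'boston is', 'is great']
```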

Training & Testing

Feature Hashing Prediction

Extract N-Gram Features from Text
The module works by creating a dictionary of n-grams from a column of free text that you specify as input. The module applies various information metrics to the n-gram list to reduce data dimensionality and identify the n-grams that have the most information value.

Weight
Binary Weight: Assigns a binary presence value to the extracted n-grams. In other words, the value for each n-gram is 1 when it exists in the given document, and 0 otherwise.
TF Weight: Assigns a term-frequency score (TF) to the extracted n-grams. The value for each n-gram is its occurrence frequency in the given document.
IDF Weight: Assigns an inverse document frequency score (IDF) to the extracted n-grams. The value for each n-gram is the log of corpus size divided by its occurrence frequency in the whole corpus. That is: IDF = log(corpus_size / document_frequency).
TF-IDF Weight: Assigns a term frequency/inverse document frequency score (TF/IDF) to the extracted n-grams. The value for each n-gram is its TF score multiplied by its IDF score.
Graph Weight: Assigns a score to the extracted n-grams based on the TextRank graph ranking. TextRank is a graph-based ranking model for text processing. Graph-based ranking algorithms are essentially a way of deciding importance based on global information. For more information, see "TextRank: Bringing Order into Texts" by Rada Mihalcea and Paul Tarau.
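The first four weighting functions follow directly from the definitions above and can be sketched in a few lines. The function name `ngram_weights` is hypothetical, the natural logarithm is an assumption (the slide does not state the base), and Graph Weight is omitted because it needs a TextRank implementation.

```python
import math
from collections import Counter


def ngram_weights(docs):
    """docs is a list of token lists; returns one weight table per document."""
    corpus_size = len(docs)
    # document_frequency: number of documents that contain each n-gram
    document_frequency = Counter()
    for doc in docs:
        document_frequency.update(set(doc))
    idf = {t: math.log(corpus_size / df) for t, df in document_frequency.items()}

    tables = []
    for doc in docs:
        tf = Counter(doc)  # occurrence frequency within this document
        tables.append({
            t: {"binary": 1, "tf": tf[t], "idf": idf[t], "tfidf": tf[t] * idf[t]}
            for t in tf
        })
    return tables


docs = [["good", "good", "service"], ["bad", "service"]]
tables = ngram_weights(docs)
print(tables[0]["good"])
print(tables[0]["service"])
```

An n-gram that is absent from a document has no entry in that document's table, which corresponds to a binary weight of 0. In this example "service" occurs in every document, so its IDF and TF-IDF are both 0: a term that appears everywhere carries no discriminating information.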

Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is often used in natural language processing (NLP) to find texts that are similar. Another common term is topic modeling. This module takes a column of text, and generates these outputs:
The source text, together with a score for each category
A feature matrix, containing extracted terms and coefficients for each category
A transformation, which you can save and reapply to new text used as input
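To make the first two outputs concrete, here is a small topic-model sketch using collapsed Gibbs sampling, one common way to fit LDA. It is not the Azure module's implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions for illustration, and the saved transformation is not covered.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of token lists. Returns (doc_scores, term_scores)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                # counts per doc/topic
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each assignment conditioned on all the others.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Score for each category per text, and term coefficients per category.
    doc_scores = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    term_scores = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + v * beta) for w in vocab}
        for k in range(n_topics)
    ]
    return doc_scores, term_scores


docs = [["azure", "cloud", "azure"], ["tweet", "review", "tweet"], ["cloud", "azure"]]
doc_scores, term_scores = lda_gibbs(docs, n_topics=2)
for row in doc_scores:
    print([round(p, 2) for p in row])
```

Each row of `doc_scores` sums to 1 and plays the role of the per-category score for one text; each entry of `term_scores` is a distribution over the vocabulary for one category. Texts with similar score rows can be treated as similar.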

Feature | FREE | STANDARD
Price | Free | $9.99 per seat per month; $1 per studio experimentation hour
Azure subscription | Not required | Required
Max number of modules per experiment | 100 | Unlimited
Max experiment duration | 1 hour per experiment | Up to 7 days per experiment with a maximum of 24 hours per module
Max storage space | 10 GB | Unlimited - BYO
Read data from On-Premises SQL (Preview) | No | Yes
Execution/performance | Single node | Multiple nodes
Production Web API
SLA


