PolyAnalyst Web Report Training: Multilingual Analysis
Megaputer Intelligence, www.megaputer.com
© 2014 Megaputer Intelligence Inc.
Multilingual Data
Internet Usage by Language: The proportion of English texts has decreased significantly.
Growth in Language Usage: English text data is growing much more slowly than text data in other languages.
Lost in Translation: Garfield comic translated to Japanese and back. Source: http://garfieldlostintranslation.blogspot.com/
Lost in Translation: Original Garfield comic in English
PolyAnalyst Languages
European Languages: English, Spanish, French, German, Russian, Italian, Dutch, Polish, Portuguese, Turkish, Greek
Asian Languages: Chinese (Simplified & Traditional), Japanese, Korean
All-Language Functionalities
PDL Functions: case(), count(), empty(), except(), follow(), hcolor(), header(), macro(), near(), number(), paragraph(), pattern(), phrase(), regex(), soundex(), stem(), term(), wildcard()
Nodes: Bayes & SVM Classification, Distinct Texts, Keyword Extraction, Language Detection, Link Terms, Search Query, Spell Check
Language-Specific Functionalities
Text Classification Node: Dutch, French, German, Portuguese, Russian
negate() & possible() PDL functions: Chinese (Simplified)
English-Only Functionalities
Advanced Text Analysis Nodes: Sentiment Analysis Node, Entity Extraction Node
Semantic PDL Functions: antonym(), associate(), generalize(), hold(), part(), related(), singleroot(), thesaurus()
Case Study: Online Feedback for Mobile Chat Apps
Losing Information: Example 1
Turkish Feedback: “Surekli kullaniyorum.”
Machine Translation: “And a Scrambler.”
Actual Meaning: “I use it all the time.”
Losing Information: Example 2
Turkish Feedback: “Insanlarla arani aciyor okunmadigi halde okundu demesi ilginç.”
Machine Translation: “People say, interesting read, even though it hurt okunmadigi arani”
Actual Meaning: “It creates rifts between people it’s interesting that it says read even though it hasn’t been.”
Losing Information: Example 3
Turkish Feedback: “4 veriyorum çünku ses kalitesi iyi degil ugultulu ve gidiyor internet full oldugu halde duzeltme yapinn”
Machine Translation: “I'm not as good sound quality 4 because the buzzing and goes well with the internet full duzeltme yapinn”
Actual Meaning: “I give it a 4 because the sound quality isn’t good there’s buzzing and it cuts out even though the internet is full fix it”
Methodology: End-to-end data analysis
Data Loading, Data Cleansing, Data-Driven Analysis, Analyst-Driven Analysis, Visualizations
Dictionaries & Indexing
Dictionaries for each language are stored and accessed separately.
Each text analysis node accesses one set of dictionaries at a time.
The language is either determined during implicit indexing or assigned explicitly using the Index node.
Dictionary Manager
Text Analysis Node Properties
Index Node
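The lookup rule under Dictionaries & Indexing can be sketched in a few lines of Python. This is an illustration only, not PolyAnalyst code: the `DICTIONARIES` table, the helper names, and the sample entries are hypothetical, and the sketch assumes that an explicitly assigned language takes precedence over a detected one.

```python
from typing import Optional

# Hypothetical per-language dictionaries, stored and accessed separately.
DICTIONARIES = {
    "english": {"stopwords": {"the", "and", "it"}},
    "turkish": {"stopwords": {"ve", "bir", "bu"}},
}


def resolve_language(explicit: Optional[str], detected: Optional[str]) -> str:
    """Prefer an explicitly assigned language; otherwise use the detected one."""
    language = explicit or detected
    if language is None:
        raise ValueError("no language assigned or detected")
    if language not in DICTIONARIES:
        raise KeyError(f"no dictionaries installed for {language!r}")
    return language


def dictionaries_for(explicit: Optional[str] = None,
                     detected: Optional[str] = None) -> dict:
    """Return the single dictionary set a node would work with."""
    return DICTIONARIES[resolve_language(explicit, detected)]
```

Calling `dictionaries_for(detected="turkish")` returns only the Turkish set, which mirrors the "one set of dictionaries at a time" rule.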
Best Practices
Run Language Detection
Filter data by language
Run a separate analysis on each language-specific dataset, in that dataset's original language
Language Detection
Feedback Languages: Focus on English, Russian, Turkish, and Chinese
Separate Analyses per Language
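The best-practice flow (detect, filter, analyze separately) might look like the following Python sketch. It is not how PolyAnalyst's Language Detection node works: the stopword lists, `detect_language`, `split_by_language`, and `analyze` are hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict

# Toy stopword lists standing in for a real language detector (assumption).
STOPWORDS = {
    "english": {"the", "it", "is", "and", "i"},
    "turkish": {"ve", "bir", "bu", "için", "çok"},
}


def detect_language(text: str) -> str:
    """Guess the language by counting stopword hits; 'unknown' if none match."""
    tokens = text.lower().split()
    hits = {lang: sum(t in words for t in tokens)
            for lang, words in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"


def split_by_language(texts):
    """Detect each text's language and group the texts per language."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect_language(text)].append(text)
    return dict(groups)


def analyze(texts):
    """Placeholder for any per-language analysis; here, a record count."""
    return len(texts)


feedback = [
    "I love it and it is fast",
    "Bu uygulama çok iyi",
    "The update is great",
]
groups = split_by_language(feedback)
results = {lang: analyze(items) for lang, items in groups.items()}
print(results)
```

Each group is analyzed on its own, so no record passes through translation before analysis.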
English Keywords
Top 5 English Keywords: update, message, excellent, phone, love
Turkish Keywords
Top 5 Turkish Keywords: message, great, super, error, recommend
Keywords by Language
Common keywords across languages
Keywords Distinct to English: phone, version, crash, fix, voice, friend, chat
Keywords Distinct to Turkish: error, notification, time, storage, recommendation, single
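Comparing keyword lists across languages comes down to set operations. The Python sketch below uses only the top-5 lists given on the earlier slides, so its output is narrower than the distinct-keyword lists above, which presumably draw on the full keyword tables. Variable names are illustrative.

```python
# Top-5 keyword lists as reported on the slides.
english_top5 = {"update", "message", "excellent", "phone", "love"}
turkish_top5 = {"message", "great", "super", "error", "recommend"}

common = english_top5 & turkish_top5        # keywords shared by both languages
english_only = english_top5 - turkish_top5  # keywords distinct to English
turkish_only = turkish_top5 - english_top5  # keywords distinct to Turkish

print(sorted(common))
print(sorted(english_only))
print(sorted(turkish_only))
```

With these inputs the only shared keyword is "message"; the other four keywords on each side are distinct to that language.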
English Link Terms
Turkish Link Terms: package, storage, internet, enough, invalid, push, deliver, notification, send, memory, message, late, card
Analyst-Driven Taxonomy
For simultaneous highlighting in all languages:
Run the taxonomy separately on each language-specific dataset, using either one multilingual taxonomy built with <or> or separate language-specific taxonomies
Merge the scored results
Multilingual Taxonomy: Can run analyses in English, Chinese, and Russian
Merge Scored Results
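As a rough Python sketch of the score-then-merge idea: each language-specific dataset is scored with its own taxonomy, and the results are combined under shared category names. This is not PolyAnalyst's taxonomy or PDL syntax. The `TAXONOMIES` table, the trigger words, the sample records, and both function names are hypothetical, and matching is plain substring search.

```python
# Hypothetical language-specific taxonomies: category -> trigger words.
TAXONOMIES = {
    "english": {"Notifications": ["notification", "push"],
                "Stability": ["crash", "error"]},
    "turkish": {"Notifications": ["bildirim"],
                "Stability": ["hata"]},
}


def score(records, taxonomy):
    """Tag each record with every category whose trigger words it contains."""
    scored = []
    for text in records:
        lowered = text.lower()
        categories = sorted(
            name for name, words in taxonomy.items()
            if any(word in lowered for word in words)
        )
        scored.append({"text": text, "categories": categories})
    return scored


def score_and_merge(datasets):
    """Score each language-specific dataset with its own taxonomy, then merge."""
    merged = []
    for language, records in datasets.items():
        for row in score(records, TAXONOMIES[language]):
            merged.append({"language": language, **row})
    return merged


datasets = {
    "english": ["Push notification arrives late", "Crash after update"],
    "turkish": ["Bildirim gelmiyor", "Hata veriyor"],
}
merged = score_and_merge(datasets)
for row in merged:
    print(row["language"], row["categories"], row["text"])
```

Because the category names are shared across languages, the merged rows can be grouped by category and language in a single view.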
Multilingual Drill-Down: The drill-down contains matches in all 3 languages.
Examples: English, Chinese, Russian
OLAP: Topics by Language
Link Analysis: Topics by Language
Conclusion
PolyAnalyst allows you to run multilingual analyses in the original languages of the data:
Work with multilingual datasets
Work in 14 different languages
Identify language-specific characteristics
Get the most information out of the data
Less subjective; avoid errors in translation
Alternatives
Machine Translation API: Microsoft (current), SDL (upcoming)
Contacting Megaputer Questions?