Lexical: Words vs. Characters Syntactic and Stylistic

Predicting Foreign Language Usage from English-Only Social Media Posts
Svitlana Volkova, Stephen Ranshous, Lawrence Phillips
Data Sciences and Analytics Group, National Security Directorate

Task
  Predict foreign languages users speak exclusively from their English tweets

Hypothesis
  Lexical, semantic, syntactic and stylistic choices in English content have different predictive power on inferring non-English languages users speak on social media

Background
  Social Media (SM) is known for multi-cultural and multilingual interactions
  45% of codeswitching (CS) due to lexical need
  40% due to the choice of a topic
  Intra-sentential CS: mix languages inside the tweet
  Inter-sentential CS: mix languages across tweets
  Why? To express thoughts or feelings, to address a different audience, to attract attention, to emphasize a point

Data
  Multilingual user timelines via public Twitter API, 09/15–01/16
  27,335 users (27K), 12 Non-English (NE) languages
  6,036,085 tweets (6M): 3,718,212 (EN) and 2,317,873 (NE)
  L1 identification: ESL essays vs. informal tweets
  Cross-lingual transfer: the influence of NE languages on various levels of linguistic performance in EN
  Tweet IDs, user IDs and language enrichments available at

Approach
  Predictive signals: lexical (words, chars), semantic (embeddings), syntactic (PoS tags), stylistic (interactions)
  ADABOOST, LOGREG, RANDFOREST
  Deep learning model input:
    Content: word, character, and byte representations
    Graph: interactions as vectors
    Content + Graph: the above combination
    Content + Transfer: embeddings initialized using transfer learning (large Twitter dataset)

Semantic: Embeddings
  Twitter Glove   0.53  0.47  0.57
  Twitter NPMI    0.55  0.52
  News W2V        0.54  0.46  0.56
  Twitter W2V     0.58  0.67

Lexical: Words vs. Characters
  Word 3-grams    0.72  0.65
  Char 3-grams    0.42  0.60  0.59
  Char 5-grams    0.64  0.66

Syntactic and Stylistic
  Profile         0.48  0.41
  Style
  Syntax

Cross-Lingual Analysis: Syntax and Style

Language Complexity
  Tweet length and punctuation: Hindi↑ vs. Polish↓, French↓

Language Subjectivity
  Emoticons: Tagalog↑
  Repeated punctuation: Russian↑ vs. Polish↓
  Elongations: German↑ vs. Russian↓

Communication Behavior
  Hashtags: German↑, French↑, Hindi↓, Korean↓
  URLs: Russian↑ vs. Hindi↓
  Mentions: Portuguese↑, Italian↑, Hindi↑ vs. German↓

Syntactic Analysis
  Adjectives: Hindi↑ vs. German↓, Polish↓
  Determiners: Hindi↑, Korean↑ vs. German↓, Polish↓
  Nouns: Russian↑ vs. Polish↓
  Prepositions: Hindi↑ vs. Polish↓, Tagalog↓
  Pronouns: Korean↑, Portuguese↑ vs. Tagalog↑, German↓
  Adverbs: Korean↑ vs. German↓, Polish↓
  Conjunctions and verbs: Korean↑, Hindi↑ vs. German↓, Polish↓

ABOUT Pacific Northwest National Laboratory
  The Pacific Northwest National Laboratory, located in southeastern Washington State, is a U.S. Department of Energy Office of Science laboratory that solves complex problems in energy, national security, and the environment, and advances scientific frontiers in the chemical, biological, materials, environmental, and computational sciences. The Laboratory employs nearly 5,000 staff members, has an annual budget in excess of $1 billion, and has been managed by Ohio-based Battelle since 1965.

For more information on the science you see here, please contact:
  Svitlana Volkova
  Senior Scientist
  Pacific Northwest National Laboratory
  Richland, WA 99354
  (509)
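
The lexical signals named on the poster (word 3-grams, character 3-grams and 5-grams) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (char_ngrams, word_ngrams, user_features), the lowercasing, the whitespace tokenization, and the "c:"/"w:" key prefixes are all assumptions made for illustration. It only shows one plausible way to turn a user's English tweets into n-gram counts that a classifier could consume.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams (spaces included) in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def user_features(tweets, n_char=3, n_word=3):
    """Aggregate character and word n-gram counts over one user's tweets.

    Hypothetical helper: keys are prefixed with "c:" or "w:" so the two
    feature families cannot collide in the same dictionary.
    """
    feats = Counter()
    for tweet in tweets:
        for gram, count in char_ngrams(tweet.lower(), n_char).items():
            feats["c:" + gram] += count
        for gram, count in word_ngrams(tweet, n_word).items():
            feats["w:" + " ".join(gram)] += count
    return feats


if __name__ == "__main__":
    example = ["I love this song so much", "so cool"]
    features = user_features(example)
    print(features.most_common(5))
```

Character n-grams are taken across word boundaries here, so spaces become part of the features; whether the original work did the same is not stated in the transcript.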
