Ting Qian, Human Language Processing Lab, Brain and Cognitive Sciences



• Dr. T. Florian Jaeger
• My father
• My friends who have voluntarily given me their Chinglish essays
• People at the HLP lab

1) Meanwhile, Brent crude hit an all-time peak of $ before falling back.
2) Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected.
3) US light, sweet crude oil rose to a fresh high of $ before slipping back to $

• US light, sweet crude oil rose to a fresh high of $ before slipping back to $
• Meanwhile, Brent crude hit an all-time peak of $ before falling back.
• Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected.

• Humans as rational agents who optimize the flow of information in language production.
• If humans try to communicate in the most efficient way, they should produce language:
  ◦ Action: by putting less information into words or sentences with little prior context, and more later on.
  ◦ Goal: to ensure that the increase of information is uniform.

• Uniform Information Density (UID)

An engineering perspective
• The most efficient way of communicating through a noisy channel is to send information at a constant rate (Information Theory, Shannon 1948).

• No good models of the information of a sentence in context exist.
• Methods from natural language processing provide reasonably good estimates of the out-of-context information of sentences.

• Intuitively, less contextual information is available at the beginning of a discourse.
• If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners).
• The out-of-context information at the beginning of a discourse should therefore be lower than later in the discourse.
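The step from "uniform information in context" to "rising out-of-context information" can be written with a standard information-theoretic identity. The symbols are assumptions introduced here for illustration, not notation from the slides: $Y_i$ is the $i$-th sentence, $C_i$ is the context available before it, $H$ is entropy and $I$ is mutual information.

```latex
% In-context information = out-of-context information minus what the context already tells us
H(Y_i \mid C_i) = H(Y_i) - I(Y_i ; C_i)

% If H(Y_i | C_i) is held roughly constant (UID) while I(Y_i ; C_i) grows with i
% (more context accumulates), then the out-of-context term must grow as well:
H(Y_i) = \underbrace{H(Y_i \mid C_i)}_{\text{approx. constant}} + \underbrace{I(Y_i ; C_i)}_{\text{increasing in } i}
```

Only $H(Y_i)$ is estimated in practice, since it does not require a model of the context.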


• Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse.
• They found that:
  ◦ The information of sentences increases with sentence number in a discourse.
  ◦ The increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.

• Evaluate UID on Chinese written corpora by measuring information content.
• Evaluate UID on a Chinese English (Chinglish) corpus.
• Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?


• Four corpora are used:
  ◦ XIN – Beijing Xinhua News
  ◦ SINO – Taiwan Sinorama Magazine
  ◦ HK – Hong Kong News (too little data)
  ◦ VOA – Voice of America Chinese News
• We build n-gram language models to measure the (un)predictability of written Chinese sentences.

二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Word-by-word gloss: Twenty year ago , many Chinese family 's dream is have a piece telephone .

Trigrams:
二十 年 前
年 前 ,
前 , 许多
, 许多 中国
…
部 电话 。

P(二十 年 前) = 0.1%
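The trigram slide can be sketched in code as follows. This is a minimal illustration, not the language model used in the study: the function names are hypothetical, add-one smoothing is an assumption chosen for brevity, and the model is trained and scored on the same single sentence purely to keep the example self-contained.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Return all consecutive word trigrams of a segmented sentence."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def train(sentences):
    """Count trigrams and their two-word histories; also return vocabulary size."""
    tri, hist, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        vocab.update(tokens)
        for t in trigrams(tokens):
            tri[t] += 1
            hist[t[:2]] += 1
    return tri, hist, len(vocab)

def bits_per_trigram(tokens, tri, hist, vocab_size):
    """Average -log2 P(w3 | w1, w2), using add-one smoothing (an assumption)."""
    grams = trigrams(tokens)
    if not grams:
        raise ValueError("sentence needs at least three tokens")
    total = 0.0
    for t in grams:
        p = (tri[t] + 1) / (hist[t[:2]] + vocab_size)
        total += -math.log2(p)
    return total / len(grams)

# The example sentence from the slide, already word-segmented.
sentence = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()
tri, hist, vocab_size = train([sentence])
print(trigrams(sentence)[:2])
print(bits_per_trigram(sentence, tri, hist, vocab_size))
```

In a real setting the counts would come from a large training corpus and the scored sentences would be held out; higher values mean less predictable sentences.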

• Lexicalized part-of-speech n-gram

二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
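One plausible reading of the tagged sentence above is that each word is fused with its part-of-speech tag into a single token, and n-grams are then taken over those tokens. The sketch below follows that reading; it is an assumption about the representation, and the helper name is hypothetical.

```python
def lexicalized_tokens(tagged):
    """Fuse (word, tag) pairs into 'word_TAG' tokens for an n-gram model."""
    return [f"{word}_{tag}" for word, tag in tagged]

# First six tagged words of the slide's example sentence.
tagged = [("二十", "CD"), ("年", "M"), ("前", "LC"),
          (",", "PU"), ("许多", "CD"), ("中国", "NR")]

tokens = lexicalized_tokens(tagged)
# Trigrams over the fused tokens, built the same way as word trigrams.
pos_trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
print(pos_trigrams[0])
```

Because the tag is part of the token, the same word used with two different tags is counted as two distinct events.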

• With respect to an entire document:
  ◦ Sentence effect in a document
  ◦ Paragraph effect in a document


• With respect to the immediate containing domain of the linguistic unit in question.
• Predictors:
  1. Sentence position in paragraph
  2. Paragraph position in document
  3. Word position in sentence
• Multiple regression on the above three predictors.
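A multiple regression on the three predictors listed above can be sketched as an ordinary least-squares fit. Everything below is illustrative: the function names are hypothetical, the rows are toy values, and the response is synthetic (generated from known coefficients so the fit can be checked), not data from the study. The slides do not say which regression software or model variant was used.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Least squares via the normal equations; returns [intercept, b1, b2, b3]."""
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Toy rows: (sentence position in paragraph,
#            paragraph position in document,
#            word position in sentence)
rows = [(1, 1, 1), (2, 1, 3), (3, 2, 2), (1, 2, 4),
        (2, 3, 1), (3, 3, 5), (4, 1, 2)]
# Synthetic information content in bits, built from known coefficients.
y = [4.0 + 0.5 * s + 0.2 * p + 0.1 * w for s, p, w in rows]

coef = fit(rows, y)
print([round(c, 6) for c in coef])
```

A positive coefficient on a predictor corresponds to information content rising with that position while the other two are held fixed. The solver assumes the predictors are not collinear; a singular system would fail with a division by zero.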

Sentence position in paragraph

• Limited amount of context information available.
Information goes up and converges (after removal of early words)

二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Word-by-word gloss: Twenty year ago , many Chinese family 's dream is have a piece telephone .

Trigrams:
二十 年 前
年 前 ,
前 , 许多
, 许多 中国
…
部 电话 。

• We replicated Genzel & Charniak's study on Chinese corpora.
  ◦ A sentence effect within documents is not found.
  ◦ However:
    • The paragraph effect within documents is consistent with UID.
    • A sentence effect within paragraphs is also found.
    • Due to the size of the data, effects are observable only early in discourse (viable cut-offs are low).

• We are the first to look at the effect of word position within sentences.
  ◦ Information content increases with word position.
  ◦ Context estimation leads to early convergence.
• Does the increase of information only occur locally in Chinese?
  ◦ Current data seem to support this idea.

• Writing style? Could be.
  ◦ Chinese – Summarization & Expansion
  ◦ English – Narrative style

• A collection of English essays written by native Chinese speakers.
  ◦ Corpus of English as a Second Language (CESL)
• We trained a language model on the Brown Corpus (American English) and used the model to measure the information content of Chinese English sentences.

XIN: - p<0.001***
CESL: - p= *

• The average information content is much higher in Chinese English (8.2~8.4) than in Chinese (4.5~5.0).
• It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).

• Chinese, English, and Chinglish
  ◦ Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID.
  ◦ Further studies are needed to discover more properties of Chinglish.
• Possible reasons why Chinglish is harder to understand:
  ◦ Higher information content
  ◦ Again, writing style

Questions?