Evaluation methods How do we judge speech technology components and applications?

Why should we talk about evaluation?
It is, or should be, a central part of most, if not all, aspects of speech technology
The higher grades (A, B; as tested in the home exam assignments and the project) require a measure of evaluation

What is evaluation?
“the making of a judgment about the amount, number, or value of something” (Google)
“the systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards” (Wikipedia)

What is evaluation?
“The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards”
What does this mean?
- The method can be formalized, described in detail…
Why is this important?
- So that evaluations can be repeated,
- because we want to compare different systems,
- and verify evaluation results

What is evaluation?
“The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards” (Google had “value” instead)
What does this mean?
- We will return to this…

What is evaluation?
“The systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards”
What are the criteria?
- We will come back to this, too...
Who decides on the standards?
- Governments
- Organizations (e.g. ISO)
- Industry groups
- Research groups
- …

What if there is no standard?
By the nature of things, there are many more things to evaluate than there are well-developed standards
Not necessarily advisable to use a mismatched standard
Fallback: systematic, formalized method

Why evaluate?
Wrong question. Start with “For whom do we evaluate?”
- Researchers
- Developers
- Producers
- Buyers
- Consumer organizations
- Special interest groups
- …

So now: Why evaluate?
What do the groups we mentioned want from an evaluation?
- Researchers: Test of hypotheses…
- Developers: Proof of progress, functionality
- Producers: Does the manufacturing work? Is it cheaper?
- Buyers: More bang for the buck? Does it meet expectations?
- Consumer organizations: Does it meet promises made?
- Special interest groups: Does it meet specifications and requirements?

What to evaluate?
In other words, what do “merit, worth, significance” and “value” mean?

What to evaluate?
In other words, what do “merit, worth, significance” and “value” mean?
It depends.
- What is the purpose of the evaluation?
- What is the purpose of the evaluated?

In summary so far
Objective to a point
- But be aware of the reason for the evaluation: who wants it, and what do they want to know?
Standards are great
- But will not be available for all purposes
- Squeezing one type of evaluation into another type of standard will produce unpredictable results
- If designing new methods, be very clear with the details in the description
Must be possible to repeat

How is evaluation done?
We’ll use speech synthesis evaluation as our example domain
Here, we focus on evaluations that
- Test the functionality (with respect to a user)
- Prove a concept or an idea
- Compare different varieties
- …
We largely disregard
- Efficiency
- Cost
- Robustness
- …

User studies: representativeness
User selection
- Demographics
- …
Environment
- Sound environment
- …
General situation
- Lab environments are rarely representative of the intended usage environment of speech technology
- …
Stimuli/system
- Often not possible to test the exact system one is interested in

Synthesis evaluation overview
Overview used by MTM, the Swedish Agency for Accessible Media in education
Provides people with print impairments with accessible media
Books and papers (games, calendars…)
Braille and talking books
Speech synthesis for about 50% of the production of university level text books
Filibuster
- In-house developed unit selection system
- Tora & Folke (Swedish), Brage (Norwegian bokmål), Martin (Danish)

MTM purposes of evaluation
o Ready for release
o Comparison of voices
o Intelligibility, human-likeness
o Fatigue, habituation
o …

Test methods: Grading tests
Overall impression (mean opinion score, MOS)
- Grade the utterance on a scale
Specific aspects (categorical rating test, CRT)
- Intelligibility
- Human-likeness
- Speed
- Stress
- …
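A MOS is simply the arithmetic mean of the listeners' grades. As a minimal sketch, assuming a 1 to 5 scale and invented ratings (the function name, the data and the normal-approximation interval are illustrative assumptions, not part of the lecture), the score and a rough 95% confidence interval could be computed like this:

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    # 1.96 is the normal-approximation z value; it is rough for small samples.
    return mean, 1.96 * sem


# Hypothetical grades from eight listeners for one utterance (1 = bad, 5 = excellent).
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to tell whether two voices with different scores really differ, which matters when the evaluation is meant to be repeated and compared.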

Test methods: Discrimination tests
Repeat or write down what you heard
Choose between two or more given words
Minimal pairs: bil – pil
Suitable for diphone synthesis with a small voice database
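Scoring a forced-choice discrimination test comes down to counting how often the chosen word matches the word that was played. The sketch below assumes each trial is recorded as a (played, chosen) pair; the trial data and function names are hypothetical.

```python
from collections import Counter


def discrimination_accuracy(trials):
    """Fraction of trials where the listener chose the word that was played."""
    correct = sum(1 for played, chosen in trials if played == chosen)
    return correct / len(trials)


def confusions(trials):
    """Count each (played, chosen) mix-up, ignoring correct answers."""
    return Counter((played, chosen) for played, chosen in trials if played != chosen)


# Hypothetical responses for the minimal pair bil / pil.
trials = [("bil", "bil"), ("pil", "bil"), ("bil", "bil"), ("pil", "pil"), ("pil", "bil")]
accuracy = discrimination_accuracy(trials)
mixups = confusions(trials)
print(accuracy, mixups.most_common())
```

Keeping the confusion counts, and not only the overall accuracy, shows which sound contrast the synthesizer fails to render.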

Test methods: Preference tests
Comparison of two or more utterances
Typically words or short sentences
Choose which you like the best
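One common way to check whether a two-way preference result could be due to chance is an exact sign (binomial) test. This is a sketch of that standard test, not a method named in the lecture; the counts are invented and the function name is an assumption.

```python
from math import comb


def sign_test_p_value(prefer_a, prefer_b):
    """Two-sided exact binomial test of 'no preference' (p = 0.5)."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical outcome: 15 listeners preferred voice A, 5 preferred voice B.
p = sign_test_p_value(15, 5)
print(f"p = {p:.4f}")
```

Ties ("no preference" answers) are left out of the counts here; how to handle them is a design decision that should be stated in the description of the method.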

Test methods: Comprehension tests
Listen to a text and answer questions

Test methods: Comments
Comment fields
The subjects want to explain what is wrong
They are almost never right.
Time consuming!

Test methods: problems for narrative synthesis testing
You want to evaluate large texts!
Grading, discrimination and preference tests
- Difficult to judge longer texts
- Evaluation of a very small part of the possible outcome of the unit selection (US) TTS
- Time consuming
- You don’t know what the subjects liked or disliked
Comprehension tests
- Do not measure anything other than comprehension

Ecological validity
Representativeness again: ecological validity means that the methods, materials and setting of the study should approximate the real world that is being examined
Users: e.g. students, old people
Material: university level text book or newspapers with synthetic speech
Situation: reading long texts (in a learning or informational situation)

Audience response system-based tests
Hollywood: evaluations of pilot episodes and movies
- Clicking a button when they don’t like it
Voting in TV shows
Classroom engagement

Audience response system-based test
For TTS
Click when you hear something
- Unintelligible
- Irritating
- You just don’t like it
- …
Longer speech chunks
Possible to give simple instructions
Detailed analysis
Effectiveness
- 5 listening minutes = 5 evaluated minutes

Results – number of clicks/subject
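Click data from such a test can be summarised per subject and along the timeline of the stimulus. The following is a minimal sketch that assumes each click is logged as a (subject, time in seconds) pair; the log format, the 5-second window and the data are assumptions for illustration, not the setup used in the study.

```python
from collections import Counter


def clicks_per_subject(events):
    """Number of clicks made by each subject."""
    return Counter(subject for subject, _ in events)


def clicks_per_window(events, window_s=5.0):
    """Number of clicks falling in each time window of the stimulus."""
    return Counter(int(t // window_s) for _, t in events)


# Hypothetical click log: (subject id, seconds from start of the audio).
events = [("s1", 2.1), ("s2", 2.6), ("s1", 14.0), ("s3", 3.3), ("s2", 31.5)]
per_subject = clicks_per_subject(events)
per_window = clicks_per_window(events)
hot_window, hot_count = per_window.most_common(1)[0]
print(per_subject, f"most clicks in window {hot_window} ({hot_count} clicks)")
```

Windows where several subjects click close together point to the stretch of speech worth inspecting in detail.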

Evaluation of conversational systems and conversational synthesis
Conversations are incremental and continuous
- No straightforward way of segmenting
They are produced by all participants in collaboration
“Errors” are commonplace, but rarely have an adverse effect
Strict information transfer is often not the primary goal
So not much use for methods of evaluation that operate in terms of
- Efficiency
- Quality of single utterances
- Grammaticality
- Etc.

Other methods New methods are being developed for evaluation of complex systems and interactions. ARS is one. We’ll look at some other examples.

Analysis of captured interactions
Measures of machine extractable features, e.g. tone, rhythm, interaction flow, durations, movement, gaze…
Comparison to human-human interactions of the same type
The colour experiment is an example of this

3rd-party participant/spectator behaviours
People watching spoken interaction behave predictably
Monitoring people watching videos can give insights into their perception of the video
- E.g. gaze patterns

Thank you! Questions?