Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard.

Slides:

Advertisements

Similar presentations

Darwin Phones: the Evolution of Sensing and Inference on Mobile Phones Emiliano Miluzzo *, Cory T. Cornelius *, Ashwin Ramaswamy *, Tanzeem Choudhury *,

Advertisements

Chunyi Peng, Guobin Shen, Yongguang Zhang, Yanlin Li, Kun Tan BeepBeep: A High Accuracy Acoustic Ranging System using COTS Mobile Devices.

Low Cost Crowd Counting using Audio Tones Pravein Govindan Kannan 1, Seshadri Padmanabha Venkatagiri 1, Mun Choon Chan 1, Akhihebbal L. Ananda 1 and Li-Shiuan.

RadioSense: Exploiting Wireless Communication Patterns for Body Sensor Network Activity Recognition Xin Qi, Gang Zhou, Yantao Li, Ge Peng College of William.

Multi-Speaker Detection By Matt Fratkin EE /9/05.

Activity, Audio, Indoor/Outdoor classification using cell phones Hong Lu, Xiao Zheng Emiliano Miluzzo, Nicholas Lane CS 185 Final Project presentation.

DARWIN PHONES: THE EVOLUTION OF SENSING AND INFERENCE ON MOBILE PHONES PRESENTED BY: BRANDON OCHS Emiliano Miluzzo, Cory T. Cornelius, Ashwin Ramaswamy,

Speaker Recognition Sharat.S.Chikkerur Center for Unified Biometrics and Sensors

Multiple People Detection and Tracking with Occlusion Presenter: Feifei Huo Supervisor: Dr. Emile A. Hendriks Dr. A. H. J. Stijn Oomes Information and.

1 A scheme for racquet sports video analysis with the combination of audio-visual information Visual Communication and Image Processing 2005 Liyuan Xing,

F 鍾承道 Acoustic Features for Speech Recognition: From Mel-Frequency Cepstrum Coefficients (MFCC) to BottleNeck Features(BNF)

LYU0103 Speech Recognition Techniques for Digital Video Library Supervisor : Prof Michael R. Lyu Students: Gao Zheng Hong Lei Mo.

Evaluation of Speak Project 2b Due March 24th. Overview Experiments to evaluate performance of your audioconference (proj2) Focus not only on how your.

Evaluation of Speak Project 2b Due March 30 th. Overview Experiments to evaluate performance of your audioconference (proj2) Focus not only on how your.

Speaker Adaptation for Vowel Classification

Visually Fingerprinting Humans without Face Recognition

Biometrics: Voice Recognition

WINLAB SCPL: Indoor Device-Free Multi-Subject Counting and Localization Using Radio Signal Strength Rutgers University Chenren Xu Joint work with Bernhard.

Large-Scale Cost-sensitive Online Social Network Profile Linkage.

Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.

EMOTIONSENSE A Mobile Phones based Adaptive Platform for Experimental Social Psychology Research Kiran K. Rachuri, Mirco Musolesi, Cecilia Mascolo, Jason.

I AM THE ANTENNA: ACCURATE OUTDOOR AP LOCATION USING SMARTPHONES ZENGBIN ZHANG, XIA ZHOU, WEILE ZHANG, YUANYANG ZHANG GANG WANG, BEN Y. ZHAO, HAITAO ZHENG.

Sensing Meets Mobile Social Networks: The Design, Implementation and Evaluation of the CenceMe Application Emiliano Miluzzo†, Nicholas D. Lane†, Kristóf.

Object detection, tracking and event recognition: the ETISEO experience Andrea Cavallaro Multimedia and Vision Lab Queen Mary, University of London

“SoundSense: Scalable Sound Sensing for People-Centric Applications on Mobile Phones” Authors: Hong Lu, Wei Pan, Nicholas D. Lane, Tanzeem Choudhury and.

Crowdsourcing Game Development for Collecting Benchmark Data of Facial Expression Recognition Systems Department of Information and Learning Technology.

Knowledge Base approach for spoken digit recognition Vijetha Periyavaram.

Macquarie RT05s Speaker Diarisation System Steve Cassidy Centre for Language Technology Macquarie University Sydney.

SoundSense by Andrius Andrijauskas. Introduction  Today’s mobile phones come with various embedded sensors such as GPS, WiFi, compass, etc.  Arguably,

Educational Software using Audio to Score Alignment Antoine Gomas supervised by Dr. Tim Collins & Pr. Corinne Mailhes 7 th of September, 2007.

SCPL: Indoor Device-Free Multi-Subject Counting and Localization Using Radio Signal Strength Chenren Xu†, Bernhard Firner†, Robert S. Moore ∗, Yanyong.

Privacy Protection for Life-log Video Jayashri Chaudhari, Sen-ching S. Cheung, M. Vijay Venkatesh Department of Electrical and Computer Engineering Center.

Zhiphone: A Mobile Phone that Learns Context and User Preferences Jing Michelle Liu Raphael Hoffmann CSE567 class project, AUT/05.

Snooping Keystrokes with mm-level Audio Ranging on a Single Phone

Image Classification 영상분류

Human Activity Recognition Using Accelerometer on Smartphones

SocialWeaver: Collaborative Inference of Human Conversation Networks Using Smartphones Chengwen Luo and Mun Choon Chan School of Computing National University.

Experimental Results ■ Observations:  Overall detection accuracy increases as the length of observation window increases.  An observation window of 100.

AUTOMATIC TARGET RECOGNITION OF CIVILIAN TARGETS September 28 th, 2004 Bala Lakshminarayanan.

WINLAB Improving RF-Based Device-Free Passive Localization In Cluttered Indoor Environments Through Probabilistic Classification Methods Rutgers University.

Look who’s talking? Project 3.1 Yannick Thimister Han van Venrooij Bob Verlinden Project DKE Maastricht University.

Singer similarity / identification Francois Thibault MUMT 614B McGill University.

New Acoustic-Phonetic Correlates Sorin Dusan and Larry Rabiner Center for Advanced Information Processing Rutgers University Piscataway,

Speaker Identification by Combining MFCC and Phase Information Longbiao Wang (Nagaoka University of Technologyh, Japan) Seiichi Nakagawa (Toyohashi University.

Performance Comparison of Speaker and Emotion Recognition

ARTIFICIAL INTELLIGENCE FOR SPEECH RECOGNITION. Introduction What is Speech Recognition?  also known as automatic speech recognition or computer speech.

Predicting Voice Elicited Emotions

It Starts with iGaze: Visual Attention Driven Networking with Smart Glasses It Starts with iGaze: Visual Attention Driven Networking with Smart Glasses.

Turning a Mobile Device into a Mouse in the Air

I Am the Antenna Accurate Outdoor AP Location Using Smartphones Zengbin Zhang†, Xia Zhou†, Weile Zhang†§, Yuanyang Zhang†, Gang Wang†, Ben Y. Zhao† and.

IIT Bombay 17 th National Conference on Communications, Jan. 2011, Bangalore, India Sp Pr. 1, P3 1/21 Detection of Burst Onset Landmarks in Speech.

Evaluation of Speak Project 2b Due November 4 th.

Audio/Speech CS376: November 4, 2004 as presented by Jessica Kuo.

Speaker Verification System Middle Term Presentation Performed by: Barak Benita & Daniel Adler Instructor: Erez Sabag.

Evaluation of Speak Project 2b Due: March 20 th, in class.

Audio Processing Mitch Parry. Resource! Sound Waves and Harmonic Motion.

Using Speech Recognition to Predict VoIP Quality

Detecting Semantic Concepts In Consumer Videos Using Audio Junwei Liang, Qin Jin, Xixi He, Gang Yang, Jieping Xu, Xirong Li Multimedia Computing Lab,

ARTIFICIAL NEURAL NETWORKS

Ayon Chakraborty, Udit Gupta and Samir R. Das WINGS Lab

Presentation on Artificial Neural Network Based Pathological Voice Classification Using MFCC Features Presenter: Subash Chandra Pakhrin 072MSI616 MSC in.

Dejavu:An accurate Energy-Efficient Outdoor Localization System

Vijay Srinivasan Thomas Phan

Sharat.S.Chikkerur S.Anand Mantravadi Rajeev.K.Srinivasan

Previous Microphone Array System Integrated Microphone Array System

Anindya Maiti, Murtuza Jadliwala, Jibo He Igor Bilogrevic

John H.L. Hansen & Taufiq Al Babba Hasan

A maximum likelihood estimation and training on the fly approach

汉语连续语音识别年1月4日访北京工业大学 973 Project 2019/4/17 汉语连续语音识别年1月4日访北京工业大学郑方清华大学计算机科学与技术系语音实验室

Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems NDSS 2019 Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick.

Presentation transcript:

Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner ACM UbiComp’13 September 10 th, 2013

Chenren Scenario 1: Dinner time, where to go? 2 I will choose this one!

Chenren Scenario 2: Is your kid social? 3

Chenren Scenario 2: Is your kid social? 4

Chenren Scenario 3: Which class is attractive? 5 She will be my choice!

Chenren Solutions?  Speaker count!  Dinner time, where to go?  Find the place where has most people talking!  Is your kid social?  Find how many (different) people they talked with!  Which class is more attractive?  Check how many students ask you questions! Microphone + microcomputer 6

Chenren The era of ubiquitous listening 7 Glass Necklace Watch Phone Brace

Chenren What we already have 8 Next thing: Voice-centric sensing

Chenren Voice-centric sensing 9 Stressful Bob Family life Alice Speech recognition Emotion detection 2 3 Speaker identification Speaker count

Chenren How to count? 10 Challenge No prior knowledge of speakers Background noise Speech overlap Energy efficiency Privacy concern

Chenren How to count? 11 ChallengeSolution No prior knowledge of speakersUnique features extraction Background noiseFrequency-based filter Speech overlapOverlap detection Energy efficiencyCoarse-grained modeling Privacy concernOn-device computation

Chenren Overview of Crowd++ 12

Chenren Speech detection  Pitch-based filter  Determined by the vibratory frequency of the vocal folds  Human voice statistics: spans from 50 Hz to 450 Hz 13 f (Hz) Human voice

Chenren Speaker features  MFCC  Speaker identification/verification  Alice or Bob, or else?  Emotion/stress sensing  Happy, or sad, stressful, or fear, or anger?  Speaker counting  No prior information 14 Supervised Unsupervised

Chenren Speaker features  Same speaker or not?  MFCC + cosine similarity distance metric 15 θ MFCC 1 MFCC 2 We use the angle θ to capture the distance between speech segments.

Chenren Speaker features  MFCC + cosine similarity distance metric 16 θsθs Bob’s MFCC in speech segment 1 Bob’s MFCC in speech segment 2 Alice’s MFCC in speech segment 3 θdθd θd > θsθd > θs

Chenren Speaker features  MFCC + cosine similarity distance metric 17 histogram of θ s histogram of θ d 1 second speech segment 2-second speech segment 3-second speech segment 10-second speech segment seconds is not natural in conversation! We use 3-second for basic speech unit.

Chenren Speaker features  MFCC + cosine similarity distance metric 18 3-second speech segment 3015

Chenren Speaker features  Pitch + gender statistics 19 (Hz)

Chenren Same speaker or not? 20 Same speaker Different speakers Not sure IF MFCC cosine similarity score < 15 AND Pitch indicates they are same gender ELSEIF MFCC cosine similarity score > 30 OR Pitch indicates they are different genders ELSE

Chenren Conversation example 21 Speaker A Speaker B Speaker C Conversation Overlap Silence Time

Chenren  Phase 1: pre-clustering  Merge the speech segments from same speakers Unsupervised speaker counting 22

Chenren Unsupervised speaker counting 23 Ground truth Segmentation Pre-clustering Same speaker

Chenren Unsupervised speaker counting 24 Ground truth Segmentation Pre-clustering Merge

Chenren Unsupervised speaker counting 25 Ground truth Segmentation Pre-clustering Keep going …

Chenren Unsupervised speaker counting 26 Ground truth Segmentation Pre-clustering Same speaker

Chenren Unsupervised speaker counting 27 Ground truth Segmentation Pre-clustering Merge

Chenren Unsupervised speaker counting 28 Ground truth Segmentation Pre-clustering Keep going …

Chenren Unsupervised speaker counting 29 Ground truth Segmentation Pre-clustering Same speaker

Chenren Unsupervised speaker counting 30 Ground truth Segmentation Pre-clustering Merge

Chenren Unsupervised speaker counting 31 Ground truth Segmentation Pre-clustering Merge Pre-clustering...

Chenren  Phase 1: pre-clustering  Merge the speech segments from same speakers  Phase 2: counting  Only admit new speaker when its speech segment is different from all the admitted speakers.  Dropping uncertain speech segments. Unsupervised speaker counting 32

Chenren Unsupervised speaker counting 33 Counting: 0

Chenren Unsupervised speaker counting 34 Counting: 1 Admit first speaker

Chenren Unsupervised speaker counting 35 Counting: 1 Not sure

Chenren Unsupervised speaker counting 36 Counting: 1 Drop it

Chenren Unsupervised speaker counting 37 Counting: 1 Different speaker

Chenren Unsupervised speaker counting 38 Counting: 2 Counting: 1 Admit new speaker!

Chenren Unsupervised speaker counting 39 Counting: 2 Counting: 1 Same speaker

Chenren Unsupervised speaker counting 40 Counting: 2 Counting: 1 Different speaker Same speaker

Chenren Unsupervised speaker counting 41 Counting: 2 Counting: 1 Merge

Chenren Unsupervised speaker counting 42 Counting: 2 Counting: 1 Not sure

Chenren Unsupervised speaker counting 43 Counting: 2 Counting: 1 Not sure

Chenren Unsupervised speaker counting 44 Counting: 2 Counting: 1 Drop

Chenren Unsupervised speaker counting 45 Counting: 2 Counting: 1 Different speaker

Chenren Unsupervised speaker counting 46 Counting: 2 Counting: 1 Different speaker

Chenren Unsupervised speaker counting 47 Counting: 2 Counting: 3 Counting: 1 Admit new speaker!

Chenren Unsupervised speaker counting 48 Counting: 2 Counting: 3 Counting: 1 Ground truth We have three speakers in this conversation!

Chenren Evaluation metric  Error count distance  The difference between the estimated count and the ground truth. 49

Chenren Benchmark results 50

Chenren Benchmark results 51 Phone’s position on the table does not matter much

Chenren Benchmark results 52 Phones on the table are better than inside the pockets

Chenren Benchmark results 53 Collaboration among multiple phones provide better results

Chenren Large scale crowdsourcing effort  120 users from university and industry contribute 109 audio clips of 1034 minutes in total. 54 Private indoorPublic indoorOutdoor

Chenren Large scale crowdsourcing results 55 Sample number Error count distance Private indoor Public indoor Outdoor The error count results in all environments are reasonable.

Chenren Computational latency 56 Latency (msec) HTC EVO 4g Samsung Galaxy S2 Samsung Galaxy S3 Google Nexus 4 Google Nexus 7 MFCC Pitch Count Total It takes less than 1 minute to process a 5-minute conversation.

Chenren Energy efficiency 57

Chenren Use case 1: Crowd estimation 58 Group 1 Group 2 Group 1 Group 2 Group 3

Chenren Use case 2: Social log 59 Home Class Meeting Lunch Research Sleeping Research Dinner Ph.D student life snapshot before UbiComp’13 submission His paper is accepted!

Chenren Use case 3: Speaker count patterns 60 Different talks show very different speaker count patterns as time goes.

Chenren Conclusion  Smartphones can count the number of speakers with reasonable accuracies in different environments.  Crowd++ can enable different social sensing applications. 61

Chenren Thank you 62 Chenren Xu WINLAB/ECE Rutgers University Sugang Li WINLAB/ECE Rutgers University Gang Liu CRSS UT Dallas Yanyong Zhang WINLAB/ECE Rutgers University Emiliano Miluzzo AT&T Labs Research Yih-Farn Chen AT&T Labs Research Jun Li Interdigital Communication Bernhard Firner WINLAB/ECE Rutgers University