Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner ACM UbiComp’13 September 10 th, 2013
Chenren Scenario 1: Dinner time, where to go? 2 I will choose this one!
Chenren Scenario 2: Is your kid social? 3
Chenren Scenario 2: Is your kid social? 4
Chenren Scenario 3: Which class is attractive? 5 She will be my choice!
Chenren Solutions? Speaker count! Dinner time, where to go? Find the place where has most people talking! Is your kid social? Find how many (different) people they talked with! Which class is more attractive? Check how many students ask you questions! Microphone + microcomputer 6
Chenren The era of ubiquitous listening 7 Glass Necklace Watch Phone Brace
Chenren What we already have 8 Next thing: Voice-centric sensing
Chenren Voice-centric sensing 9 Stressful Bob Family life Alice Speech recognition Emotion detection 2 3 Speaker identification Speaker count
Chenren How to count? 10 Challenge No prior knowledge of speakers Background noise Speech overlap Energy efficiency Privacy concern
Chenren How to count? 11 ChallengeSolution No prior knowledge of speakersUnique features extraction Background noiseFrequency-based filter Speech overlapOverlap detection Energy efficiencyCoarse-grained modeling Privacy concernOn-device computation
Chenren Overview of Crowd++ 12
Chenren Speech detection Pitch-based filter Determined by the vibratory frequency of the vocal folds Human voice statistics: spans from 50 Hz to 450 Hz 13 f (Hz) Human voice
Chenren Speaker features MFCC Speaker identification/verification Alice or Bob, or else? Emotion/stress sensing Happy, or sad, stressful, or fear, or anger? Speaker counting No prior information 14 Supervised Unsupervised
Chenren Speaker features Same speaker or not? MFCC + cosine similarity distance metric 15 θ MFCC 1 MFCC 2 We use the angle θ to capture the distance between speech segments.
Chenren Speaker features MFCC + cosine similarity distance metric 16 θsθs Bob’s MFCC in speech segment 1 Bob’s MFCC in speech segment 2 Alice’s MFCC in speech segment 3 θdθd θd > θsθd > θs
Chenren Speaker features MFCC + cosine similarity distance metric 17 histogram of θ s histogram of θ d 1 second speech segment 2-second speech segment 3-second speech segment 10-second speech segment seconds is not natural in conversation! We use 3-second for basic speech unit.
Chenren Speaker features MFCC + cosine similarity distance metric 18 3-second speech segment 3015
Chenren Speaker features Pitch + gender statistics 19 (Hz)
Chenren Same speaker or not? 20 Same speaker Different speakers Not sure IF MFCC cosine similarity score < 15 AND Pitch indicates they are same gender ELSEIF MFCC cosine similarity score > 30 OR Pitch indicates they are different genders ELSE
Chenren Conversation example 21 Speaker A Speaker B Speaker C Conversation Overlap Silence Time
Chenren Phase 1: pre-clustering Merge the speech segments from same speakers Unsupervised speaker counting 22
Chenren Unsupervised speaker counting 23 Ground truth Segmentation Pre-clustering Same speaker
Chenren Unsupervised speaker counting 24 Ground truth Segmentation Pre-clustering Merge
Chenren Unsupervised speaker counting 25 Ground truth Segmentation Pre-clustering Keep going …
Chenren Unsupervised speaker counting 26 Ground truth Segmentation Pre-clustering Same speaker
Chenren Unsupervised speaker counting 27 Ground truth Segmentation Pre-clustering Merge
Chenren Unsupervised speaker counting 28 Ground truth Segmentation Pre-clustering Keep going …
Chenren Unsupervised speaker counting 29 Ground truth Segmentation Pre-clustering Same speaker
Chenren Unsupervised speaker counting 30 Ground truth Segmentation Pre-clustering Merge
Chenren Unsupervised speaker counting 31 Ground truth Segmentation Pre-clustering Merge Pre-clustering...
Chenren Phase 1: pre-clustering Merge the speech segments from same speakers Phase 2: counting Only admit new speaker when its speech segment is different from all the admitted speakers. Dropping uncertain speech segments. Unsupervised speaker counting 32
Chenren Unsupervised speaker counting 33 Counting: 0
Chenren Unsupervised speaker counting 34 Counting: 1 Admit first speaker
Chenren Unsupervised speaker counting 35 Counting: 1 Not sure
Chenren Unsupervised speaker counting 36 Counting: 1 Drop it
Chenren Unsupervised speaker counting 37 Counting: 1 Different speaker
Chenren Unsupervised speaker counting 38 Counting: 2 Counting: 1 Admit new speaker!
Chenren Unsupervised speaker counting 39 Counting: 2 Counting: 1 Same speaker
Chenren Unsupervised speaker counting 40 Counting: 2 Counting: 1 Different speaker Same speaker
Chenren Unsupervised speaker counting 41 Counting: 2 Counting: 1 Merge
Chenren Unsupervised speaker counting 42 Counting: 2 Counting: 1 Not sure
Chenren Unsupervised speaker counting 43 Counting: 2 Counting: 1 Not sure
Chenren Unsupervised speaker counting 44 Counting: 2 Counting: 1 Drop
Chenren Unsupervised speaker counting 45 Counting: 2 Counting: 1 Different speaker
Chenren Unsupervised speaker counting 46 Counting: 2 Counting: 1 Different speaker
Chenren Unsupervised speaker counting 47 Counting: 2 Counting: 3 Counting: 1 Admit new speaker!
Chenren Unsupervised speaker counting 48 Counting: 2 Counting: 3 Counting: 1 Ground truth We have three speakers in this conversation!
Chenren Evaluation metric Error count distance The difference between the estimated count and the ground truth. 49
Chenren Benchmark results 50
Chenren Benchmark results 51 Phone’s position on the table does not matter much
Chenren Benchmark results 52 Phones on the table are better than inside the pockets
Chenren Benchmark results 53 Collaboration among multiple phones provide better results
Chenren Large scale crowdsourcing effort 120 users from university and industry contribute 109 audio clips of 1034 minutes in total. 54 Private indoorPublic indoorOutdoor
Chenren Large scale crowdsourcing results 55 Sample number Error count distance Private indoor Public indoor Outdoor The error count results in all environments are reasonable.
Chenren Computational latency 56 Latency (msec) HTC EVO 4g Samsung Galaxy S2 Samsung Galaxy S3 Google Nexus 4 Google Nexus 7 MFCC Pitch Count Total It takes less than 1 minute to process a 5-minute conversation.
Chenren Energy efficiency 57
Chenren Use case 1: Crowd estimation 58 Group 1 Group 2 Group 1 Group 2 Group 3
Chenren Use case 2: Social log 59 Home Class Meeting Lunch Research Sleeping Research Dinner Ph.D student life snapshot before UbiComp’13 submission His paper is accepted!
Chenren Use case 3: Speaker count patterns 60 Different talks show very different speaker count patterns as time goes.
Chenren Conclusion Smartphones can count the number of speakers with reasonable accuracies in different environments. Crowd++ can enable different social sensing applications. 61
Chenren Thank you 62 Chenren Xu WINLAB/ECE Rutgers University Sugang Li WINLAB/ECE Rutgers University Gang Liu CRSS UT Dallas Yanyong Zhang WINLAB/ECE Rutgers University Emiliano Miluzzo AT&T Labs Research Yih-Farn Chen AT&T Labs Research Jun Li Interdigital Communication Bernhard Firner WINLAB/ECE Rutgers University