Download presentation
Presentation is loading. Please wait.
Published byLouisa Lawrence Modified over 9 years ago
1
Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner ACM UbiComp’13 September 10 th, 2013
2
Chenren Xulendlice@winlab.rutgers.edu Scenario 1: Dinner time, where to go? 2 I will choose this one!
3
Chenren Xulendlice@winlab.rutgers.edu Scenario 2: Is your kid social? 3
4
Chenren Xulendlice@winlab.rutgers.edu Scenario 2: Is your kid social? 4
5
Chenren Xulendlice@winlab.rutgers.edu Scenario 3: Which class is attractive? 5 She will be my choice!
6
Chenren Xulendlice@winlab.rutgers.edu Solutions? Speaker count! Dinner time, where to go? Find the place where has most people talking! Is your kid social? Find how many (different) people they talked with! Which class is more attractive? Check how many students ask you questions! Microphone + microcomputer 6
7
Chenren Xulendlice@winlab.rutgers.edu The era of ubiquitous listening 7 Glass Necklace Watch Phone Brace
8
Chenren Xulendlice@winlab.rutgers.edu What we already have 8 Next thing: Voice-centric sensing
9
Chenren Xulendlice@winlab.rutgers.edu Voice-centric sensing 9 Stressful Bob Family life Alice Speech recognition Emotion detection 2 3 Speaker identification Speaker count
10
Chenren Xulendlice@winlab.rutgers.edu How to count? 10 Challenge No prior knowledge of speakers Background noise Speech overlap Energy efficiency Privacy concern
11
Chenren Xulendlice@winlab.rutgers.edu How to count? 11 ChallengeSolution No prior knowledge of speakersUnique features extraction Background noiseFrequency-based filter Speech overlapOverlap detection Energy efficiencyCoarse-grained modeling Privacy concernOn-device computation
12
Chenren Xulendlice@winlab.rutgers.edu Overview of Crowd++ 12
13
Chenren Xulendlice@winlab.rutgers.edu Speech detection Pitch-based filter Determined by the vibratory frequency of the vocal folds Human voice statistics: spans from 50 Hz to 450 Hz 13 f (Hz) 50 450 Human voice
14
Chenren Xulendlice@winlab.rutgers.edu Speaker features MFCC Speaker identification/verification Alice or Bob, or else? Emotion/stress sensing Happy, or sad, stressful, or fear, or anger? Speaker counting No prior information 14 Supervised Unsupervised
15
Chenren Xulendlice@winlab.rutgers.edu Speaker features Same speaker or not? MFCC + cosine similarity distance metric 15 θ MFCC 1 MFCC 2 We use the angle θ to capture the distance between speech segments.
16
Chenren Xulendlice@winlab.rutgers.edu Speaker features MFCC + cosine similarity distance metric 16 θsθs Bob’s MFCC in speech segment 1 Bob’s MFCC in speech segment 2 Alice’s MFCC in speech segment 3 θdθd θd > θsθd > θs
17
Chenren Xulendlice@winlab.rutgers.edu Speaker features MFCC + cosine similarity distance metric 17 histogram of θ s histogram of θ d 1 second speech segment 2-second speech segment 3-second speech segment 10-second speech segment...... 10 seconds is not natural in conversation! We use 3-second for basic speech unit.
18
Chenren Xulendlice@winlab.rutgers.edu Speaker features MFCC + cosine similarity distance metric 18 3-second speech segment 3015
19
Chenren Xulendlice@winlab.rutgers.edu Speaker features Pitch + gender statistics 19 (Hz)
20
Chenren Xulendlice@winlab.rutgers.edu Same speaker or not? 20 Same speaker Different speakers Not sure IF MFCC cosine similarity score < 15 AND Pitch indicates they are same gender ELSEIF MFCC cosine similarity score > 30 OR Pitch indicates they are different genders ELSE
21
Chenren Xulendlice@winlab.rutgers.edu Conversation example 21 Speaker A Speaker B Speaker C Conversation Overlap Silence Time
22
Chenren Xulendlice@winlab.rutgers.edu Phase 1: pre-clustering Merge the speech segments from same speakers Unsupervised speaker counting 22
23
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 23 Ground truth Segmentation Pre-clustering Same speaker
24
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 24 Ground truth Segmentation Pre-clustering Merge
25
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 25 Ground truth Segmentation Pre-clustering Keep going …
26
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 26 Ground truth Segmentation Pre-clustering Same speaker
27
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 27 Ground truth Segmentation Pre-clustering Merge
28
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 28 Ground truth Segmentation Pre-clustering Keep going …
29
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 29 Ground truth Segmentation Pre-clustering Same speaker
30
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 30 Ground truth Segmentation Pre-clustering Merge
31
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 31 Ground truth Segmentation Pre-clustering Merge Pre-clustering...
32
Chenren Xulendlice@winlab.rutgers.edu Phase 1: pre-clustering Merge the speech segments from same speakers Phase 2: counting Only admit new speaker when its speech segment is different from all the admitted speakers. Dropping uncertain speech segments. Unsupervised speaker counting 32
33
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 33 Counting: 0
34
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 34 Counting: 1 Admit first speaker
35
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 35 Counting: 1 Not sure
36
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 36 Counting: 1 Drop it
37
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 37 Counting: 1 Different speaker
38
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 38 Counting: 2 Counting: 1 Admit new speaker!
39
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 39 Counting: 2 Counting: 1 Same speaker
40
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 40 Counting: 2 Counting: 1 Different speaker Same speaker
41
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 41 Counting: 2 Counting: 1 Merge
42
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 42 Counting: 2 Counting: 1 Not sure
43
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 43 Counting: 2 Counting: 1 Not sure
44
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 44 Counting: 2 Counting: 1 Drop
45
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 45 Counting: 2 Counting: 1 Different speaker
46
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 46 Counting: 2 Counting: 1 Different speaker
47
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 47 Counting: 2 Counting: 3 Counting: 1 Admit new speaker!
48
Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 48 Counting: 2 Counting: 3 Counting: 1 Ground truth We have three speakers in this conversation!
49
Chenren Xulendlice@winlab.rutgers.edu Evaluation metric Error count distance The difference between the estimated count and the ground truth. 49
50
Chenren Xulendlice@winlab.rutgers.edu Benchmark results 50
51
Chenren Xulendlice@winlab.rutgers.edu Benchmark results 51 Phone’s position on the table does not matter much
52
Chenren Xulendlice@winlab.rutgers.edu Benchmark results 52 Phones on the table are better than inside the pockets
53
Chenren Xulendlice@winlab.rutgers.edu Benchmark results 53 Collaboration among multiple phones provide better results
54
Chenren Xulendlice@winlab.rutgers.edu Large scale crowdsourcing effort 120 users from university and industry contribute 109 audio clips of 1034 minutes in total. 54 Private indoorPublic indoorOutdoor
55
Chenren Xulendlice@winlab.rutgers.edu Large scale crowdsourcing results 55 Sample number Error count distance Private indoor401.07 Public indoor441.35 Outdoor251.83 The error count results in all environments are reasonable.
56
Chenren Xulendlice@winlab.rutgers.edu Computational latency 56 Latency (msec) HTC EVO 4g Samsung Galaxy S2 Samsung Galaxy S3 Google Nexus 4 Google Nexus 7 MFCC42.9036.7124.4122.8623.14 Pitch102.7180.3658.1147.9358.33 Count175.16150.4789.0183.5370.23 Total320.77267.54171.53154.32151.7 It takes less than 1 minute to process a 5-minute conversation.
57
Chenren Xulendlice@winlab.rutgers.edu Energy efficiency 57
58
Chenren Xulendlice@winlab.rutgers.edu Use case 1: Crowd estimation 58 Group 1 Group 2 Group 1 Group 2 Group 3
59
Chenren Xulendlice@winlab.rutgers.edu Use case 2: Social log 59 Home Class Meeting Lunch Research Sleeping Research Dinner Ph.D student life snapshot before UbiComp’13 submission His paper is accepted!
60
Chenren Xulendlice@winlab.rutgers.edu Use case 3: Speaker count patterns 60 Different talks show very different speaker count patterns as time goes.
61
Chenren Xulendlice@winlab.rutgers.edu Conclusion Smartphones can count the number of speakers with reasonable accuracies in different environments. Crowd++ can enable different social sensing applications. 61
62
Chenren Xulendlice@winlab.rutgers.edu Thank you 62 Chenren Xu WINLAB/ECE Rutgers University Sugang Li WINLAB/ECE Rutgers University Gang Liu CRSS UT Dallas Yanyong Zhang WINLAB/ECE Rutgers University Emiliano Miluzzo AT&T Labs Research Yih-Farn Chen AT&T Labs Research Jun Li Interdigital Communication Bernhard Firner WINLAB/ECE Rutgers University
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.