Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard.

Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard Firner ACM UbiComp’13 September 10 th, 2013

Chenren Xulendlice@winlab.rutgers.edu Scenario 1: Dinner time, where to go? 2 I will choose this one!

Chenren Xulendlice@winlab.rutgers.edu Scenario 2: Is your kid social? 3

Chenren Xulendlice@winlab.rutgers.edu Scenario 2: Is your kid social? 4

Chenren Xulendlice@winlab.rutgers.edu Scenario 3: Which class is attractive? 5 She will be my choice!

Chenren Xulendlice@winlab.rutgers.edu Solutions?  Speaker count!  Dinner time, where to go?  Find the place where has most people talking!  Is your kid social?  Find how many (different) people they talked with!  Which class is more attractive?  Check how many students ask you questions! Microphone + microcomputer 6

Chenren Xulendlice@winlab.rutgers.edu The era of ubiquitous listening 7 Glass Necklace Watch Phone Brace

Chenren Xulendlice@winlab.rutgers.edu What we already have 8 Next thing: Voice-centric sensing

Chenren Xulendlice@winlab.rutgers.edu Voice-centric sensing 9 Stressful Bob Family life Alice Speech recognition Emotion detection 2 3 Speaker identification Speaker count

Chenren Xulendlice@winlab.rutgers.edu How to count? 10 Challenge No prior knowledge of speakers Background noise Speech overlap Energy efficiency Privacy concern

Chenren Xulendlice@winlab.rutgers.edu How to count? 11 ChallengeSolution No prior knowledge of speakersUnique features extraction Background noiseFrequency-based filter Speech overlapOverlap detection Energy efficiencyCoarse-grained modeling Privacy concernOn-device computation

Chenren Xulendlice@winlab.rutgers.edu Overview of Crowd++ 12

Chenren Xulendlice@winlab.rutgers.edu Speech detection  Pitch-based filter  Determined by the vibratory frequency of the vocal folds  Human voice statistics: spans from 50 Hz to 450 Hz 13 f (Hz) 50 450 Human voice

Chenren Xulendlice@winlab.rutgers.edu Speaker features  MFCC  Speaker identification/verification  Alice or Bob, or else?  Emotion/stress sensing  Happy, or sad, stressful, or fear, or anger?  Speaker counting  No prior information 14 Supervised Unsupervised

Chenren Xulendlice@winlab.rutgers.edu Speaker features  Same speaker or not?  MFCC + cosine similarity distance metric 15 θ MFCC 1 MFCC 2 We use the angle θ to capture the distance between speech segments.

Chenren Xulendlice@winlab.rutgers.edu Speaker features  MFCC + cosine similarity distance metric 16 θsθs Bob’s MFCC in speech segment 1 Bob’s MFCC in speech segment 2 Alice’s MFCC in speech segment 3 θdθd θd > θsθd > θs

Chenren Xulendlice@winlab.rutgers.edu Speaker features  MFCC + cosine similarity distance metric 17 histogram of θ s histogram of θ d 1 second speech segment 2-second speech segment 3-second speech segment 10-second speech segment...... 10 seconds is not natural in conversation! We use 3-second for basic speech unit.

Chenren Xulendlice@winlab.rutgers.edu Speaker features  MFCC + cosine similarity distance metric 18 3-second speech segment 3015

Chenren Xulendlice@winlab.rutgers.edu Speaker features  Pitch + gender statistics 19 (Hz)

Chenren Xulendlice@winlab.rutgers.edu Same speaker or not? 20 Same speaker Different speakers Not sure IF MFCC cosine similarity score < 15 AND Pitch indicates they are same gender ELSEIF MFCC cosine similarity score > 30 OR Pitch indicates they are different genders ELSE

Chenren Xulendlice@winlab.rutgers.edu Conversation example 21 Speaker A Speaker B Speaker C Conversation Overlap Silence Time

Chenren Xulendlice@winlab.rutgers.edu  Phase 1: pre-clustering  Merge the speech segments from same speakers Unsupervised speaker counting 22

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 23 Ground truth Segmentation Pre-clustering Same speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 24 Ground truth Segmentation Pre-clustering Merge

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 25 Ground truth Segmentation Pre-clustering Keep going …

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 28 Ground truth Segmentation Pre-clustering Keep going …

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 31 Ground truth Segmentation Pre-clustering Merge Pre-clustering...

Chenren Xulendlice@winlab.rutgers.edu  Phase 1: pre-clustering  Merge the speech segments from same speakers  Phase 2: counting  Only admit new speaker when its speech segment is different from all the admitted speakers.  Dropping uncertain speech segments. Unsupervised speaker counting 32

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 33 Counting: 0

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 34 Counting: 1 Admit first speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 35 Counting: 1 Not sure

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 36 Counting: 1 Drop it

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 37 Counting: 1 Different speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 38 Counting: 2 Counting: 1 Admit new speaker!

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 39 Counting: 2 Counting: 1 Same speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 40 Counting: 2 Counting: 1 Different speaker Same speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 41 Counting: 2 Counting: 1 Merge

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 42 Counting: 2 Counting: 1 Not sure

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 43 Counting: 2 Counting: 1 Not sure

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 44 Counting: 2 Counting: 1 Drop

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 45 Counting: 2 Counting: 1 Different speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 46 Counting: 2 Counting: 1 Different speaker

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 47 Counting: 2 Counting: 3 Counting: 1 Admit new speaker!

Chenren Xulendlice@winlab.rutgers.edu Unsupervised speaker counting 48 Counting: 2 Counting: 3 Counting: 1 Ground truth We have three speakers in this conversation!

Chenren Xulendlice@winlab.rutgers.edu Evaluation metric  Error count distance  The difference between the estimated count and the ground truth. 49

Chenren Xulendlice@winlab.rutgers.edu Benchmark results 50

Chenren Xulendlice@winlab.rutgers.edu Benchmark results 51 Phone’s position on the table does not matter much

Chenren Xulendlice@winlab.rutgers.edu Benchmark results 52 Phones on the table are better than inside the pockets

Chenren Xulendlice@winlab.rutgers.edu Benchmark results 53 Collaboration among multiple phones provide better results

Chenren Xulendlice@winlab.rutgers.edu Large scale crowdsourcing effort  120 users from university and industry contribute 109 audio clips of 1034 minutes in total. 54 Private indoorPublic indoorOutdoor

Chenren Xulendlice@winlab.rutgers.edu Large scale crowdsourcing results 55 Sample number Error count distance Private indoor401.07 Public indoor441.35 Outdoor251.83 The error count results in all environments are reasonable.

Chenren Xulendlice@winlab.rutgers.edu Computational latency 56 Latency (msec) HTC EVO 4g Samsung Galaxy S2 Samsung Galaxy S3 Google Nexus 4 Google Nexus 7 MFCC42.9036.7124.4122.8623.14 Pitch102.7180.3658.1147.9358.33 Count175.16150.4789.0183.5370.23 Total320.77267.54171.53154.32151.7 It takes less than 1 minute to process a 5-minute conversation.

Chenren Xulendlice@winlab.rutgers.edu Energy efficiency 57

Chenren Xulendlice@winlab.rutgers.edu Use case 1: Crowd estimation 58 Group 1 Group 2 Group 1 Group 2 Group 3

Chenren Xulendlice@winlab.rutgers.edu Use case 2: Social log 59 Home Class Meeting Lunch Research Sleeping Research Dinner Ph.D student life snapshot before UbiComp’13 submission His paper is accepted!

Chenren Xulendlice@winlab.rutgers.edu Use case 3: Speaker count patterns 60 Different talks show very different speaker count patterns as time goes.

Chenren Xulendlice@winlab.rutgers.edu Conclusion  Smartphones can count the number of speakers with reasonable accuracies in different environments.  Crowd++ can enable different social sensing applications. 61

Chenren Xulendlice@winlab.rutgers.edu Thank you 62 Chenren Xu WINLAB/ECE Rutgers University Sugang Li WINLAB/ECE Rutgers University Gang Liu CRSS UT Dallas Yanyong Zhang WINLAB/ECE Rutgers University Emiliano Miluzzo AT&T Labs Research Yih-Farn Chen AT&T Labs Research Jun Li Interdigital Communication Bernhard Firner WINLAB/ECE Rutgers University

Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard.

Similar presentations

Presentation on theme: "Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard.

Similar presentations

Presentation on theme: "Crowd++: Unsupervised Speaker Count with Smartphones Chenren Xu, Sugang Li, Gang Liu, Yanyong Zhang, Emiliano Miluzzo, Yih-Farn Chen, Jun Li, Bernhard."— Presentation transcript:

Similar presentations

About project

Feedback