New stuff
- People were interested in more detailed spatial information about media captures
- Added area of capture and point of capture attributes
- Also addresses the multi-view use case
- Offers a more flexible way of associating audio with video
- Removed the “linear array” audio type; replaced by using area of capture
Other topics to consider
- Framework has these in an appendix, to be discussed:
- VAD (voice activity detection)
- Media source selection (e.g. from a roster)
- Composition and switching algorithms for audio and video
Composition/Switching Algorithms
- Framework has simple boolean attributes for indicating that a Media Capture is switched or composed
- Is this enough? If not, what else do we need?
- Another use case to make it clear?
- More detailed indications about exactly how a capture is switched or composed?
- Anything else?
- Interested people should propose specific additions to the framework
Attributes
EXTENSIBILITY
Audio attributes
- Channel Format: Stereo, Mono
Video attributes
- Spatial scale
- Image width
Media Capture attributes
- Purpose (role): Main, Presentation
- Mixed: true/false
- Auto switched: true/false
- Area of Capture: ranges
- Point of Capture: point
- Area Scale: millimeters
Capture Scene
[Diagram: cameras and people in one Capture Scene, captured in three alternative ways]
- Three cameras: VC0, VC1, VC2
- Two cameras, moved and zoomed out: VC3, VC4
- Switched (based on voice) with composed PiP: VC5
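Capture Scene
[Diagram: area of capture and point of capture for each capture, on an x axis marked at 0, 100, 150, 200 and 300]
- Three cameras (VC0, VC1, VC2), areas of capture: xBegin=0 xEnd=100; xBegin=100 xEnd=200; xBegin=200 xEnd=300
- Two cameras (VC3, VC4), areas of capture: xBegin=0 xEnd=150; xBegin=150 xEnd=300
- Switched capture (VC5), area of capture: xBegin=0 xEnd=300
- Points of capture shown at x = 50, x = 150 and x = 250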
Capture Set
- Each alternative representation of a Capture Scene is a row in a Capture Set
Capture Set Rows
- (VC0, VC1, VC2): three cameras
- (VC3, VC4): two cameras, moved and zoomed out
- (VC5): switched (based on voice), composed PiP
- (AC0)
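The rows above can be sketched as a small data structure. This is only an illustration: the class name CaptureSet, the function choose_row, and the "largest row that fits" policy are assumptions made for this sketch, not part of the framework.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Row = Tuple[str, ...]


@dataclass
class CaptureSet:
    # Hypothetical container: each row is one alternative representation
    # of the same Capture Scene.
    video_rows: List[Row] = field(default_factory=list)
    audio_rows: List[Row] = field(default_factory=list)


def choose_row(rows: List[Row], max_streams: int) -> Optional[Row]:
    """Return the row with the most captures that fits within max_streams.

    Assumed policy, for illustration only. Returns None if no row fits.
    """
    fitting = [row for row in rows if len(row) <= max_streams]
    return max(fitting, key=len) if fitting else None


# The rows from the slide.
capture_set = CaptureSet(
    video_rows=[("VC0", "VC1", "VC2"), ("VC3", "VC4"), ("VC5",)],
    audio_rows=[("AC0",)],
)
```

Under this assumed policy, a receiver with two displays would pick (VC3, VC4), and a receiver with one display would pick the switched capture (VC5).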
Video Capture Adjacency
[Diagram: two cameras, VC0 and VC1, capturing people side by side (left and right) on an x axis marked at 0, 100 and 200; further marks at x = 50, x = 100 and x = 150]
- Capture Set: (VC0, VC1)
- Other capture set rows
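One way a receiver might derive left-to-right adjacency from area of capture ranges is sketched below. The function names are hypothetical, and the x ranges assigned to VC0 and VC1 are assumed for illustration; the slide does not state which capture covers which range.

```python
def left_to_right(areas):
    """Order capture names by the left edge (xBegin) of their area of capture."""
    return [name for name, _ in sorted(areas.items(), key=lambda kv: kv[1][0])]


def gaps_and_overlaps(areas):
    """For each neighbouring pair, report xBegin(next) - xEnd(previous).

    Positive means a gap, negative means an overlap, zero means the two
    areas of capture are exactly adjacent.
    """
    order = left_to_right(areas)
    return [(a, b, areas[b][0] - areas[a][1]) for a, b in zip(order, order[1:])]


# Assumed ranges (xBegin, xEnd) for the two captures in the row (VC0, VC1).
areas = {"VC1": (100, 200), "VC0": (0, 100)}

order = left_to_right(areas)            # ["VC0", "VC1"]
neighbours = gaps_and_overlaps(areas)   # [("VC0", "VC1", 0)]
```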
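Example with Field of View 1
[Diagram: two cameras, each facing its own area of capture; x runs along a straight line, y is the distance from the camera]
- First camera: point of capture = (673,0); area of capture xBegin=0, xEnd=1346, yBegin=3000, yEnd=3000
- Second camera: point of capture = (2119,0); area of capture xBegin=1446, xEnd=2792, yBegin=3000, yEnd=3000
- Field of view angle can be calculated from the area of capture and point of capture attributes
- Angle a = 2 * arctan((1346/2) / 3000) = 25.3°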
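The angle calculation on this slide can be reproduced with a short sketch. The function name and parameter names are hypothetical; the sketch assumes a flat area of capture at a single y distance, as in this example, and works in whatever unit the coordinates use.

```python
import math


def field_of_view_deg(x_begin, x_end, y_area, point_x, point_y):
    """Horizontal angle, in degrees, subtended at the point of capture by the
    edge of the area of capture running from (x_begin, y_area) to (x_end, y_area).
    """
    depth = y_area - point_y
    left = math.atan2(x_begin - point_x, depth)   # angle to the left edge
    right = math.atan2(x_end - point_x, depth)    # angle to the right edge
    return math.degrees(right - left)


# Values from the slide for the two cameras.
angle_first = field_of_view_deg(0, 1346, 3000, 673, 0)
angle_second = field_of_view_deg(1446, 2792, 3000, 2119, 0)
print(round(angle_first, 1), round(angle_second, 1))  # 25.3 25.3
```

When the point of capture is centred on the area of capture, this reduces to the slide's formula 2 * arctan((width/2) / distance); the two-edge form also covers an off-centre point of capture.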
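Example with Field of View 2
[Diagram: field of view angle a, with x measured along an arc and y the distance from the camera]
- Point of capture = (1396,0)
- Area of capture: xBegin=0, xEnd=1346, yBegin=3000, yEnd=3000
- Area of capture: xBegin=1446, xEnd=2792, yBegin=3000, yEnd=3000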
Matching Audio with Video
- Same capture scene
- Video adjacency matches audio sound stage
- Rendering side uses Area of Capture attributes to match the audio with the video
Matching Audio with Video
[Diagram: spatial extent of video and spatial extent of audio, left to right]
- Spatial extent of video (VC0, VC1, VC2): x = 0 to 100, x = 100 to 200, x = 200 to 300
- Spatial extent of audio, one stereo AC: Stereo x = 0 to 300
- Spatial extent of audio, three mono ACs: Mono x = 0 to 100, Mono x = 100 to 200, Mono x = 200 to 300
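A minimal sketch of how a rendering side might use these ranges to match audio with video follows. The function names and the audio capture names are hypothetical, and assigning VC0, VC1 and VC2 to the three ranges in left-to-right order is an assumption made for illustration.

```python
def overlap(a, b):
    """Length of the overlap between two (x_begin, x_end) ranges, or 0."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_audio_to_video(audio_areas, video_areas):
    """Map each audio capture to the video captures whose x range it overlaps."""
    return {
        ac: [vc for vc, v_range in video_areas.items() if overlap(a_range, v_range) > 0]
        for ac, a_range in audio_areas.items()
    }


# Assumed assignment of the slide's video ranges to capture names.
video = {"VC0": (0, 100), "VC1": (100, 200), "VC2": (200, 300)}

# One stereo AC covering the whole scene.
stereo = match_audio_to_video({"AC_stereo": (0, 300)}, video)

# Three mono ACs, one per video range.
mono = match_audio_to_video(
    {"AC_mono_a": (0, 100), "AC_mono_b": (100, 200), "AC_mono_c": (200, 300)},
    video,
)
```

With these ranges the stereo capture spans all three video captures, while each mono capture lines up with exactly one, because ranges that only touch at an edge are not counted as overlapping.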
Supporting the use cases
3.1 Point to point symmetric
- Different number of audio channels on each side
- Different number of video and audio channels
- Match the sound stage with video display
- Handle gaps/overlap between captures
- Audio levels match
Supporting the use cases
3.2 Point to point asymmetric
- Send subset of available streams
- Allow some user choice
- Sender does composition into one stream
- Receiver does composition of multiple streams onto one display
Supporting the use cases
3.3 Multipoint
- Site switching
- Segment switching
- Still need work on VAD
- Switch based on manual control
- Composing reduced image sizes (continuous presence)
Supporting the use cases
3.4 Presentation
- Video/audio streams for presentation
- Multiple presentation streams
- BFCP-like control of multiple streams (not in CLUE scope?)
- Consistent placement of multiple streams at each site
Supporting the use cases
3.5 Heterogeneous systems
- Transcoding middlebox
- Single or multiple streams
- Different bit rates
- Different layout policies
- Not settled yet
Supporting the use cases
3.6 Multipoint education
- Multiple streams with different roles (different scenes)
- Placing video on correct screen
- Still need work on VAD
- Requesting a stream from a particular site
Supporting the use cases
3.7 Multipoint multiview
- Different views of same scene
- Assigning camera views to remote displays for best eye contact
Addressing requirements
- Summary of whether or not items from the requirements document are met