This document outlines the mandatory feature requirements for a hardware video codec implementation in order to fully support Google’s vision of high-quality real-time communication (RTC), including the WebRTC project.
See the Tech Talk from Google I/O 2012 (40 minutes) for a good overview and current status of WebRTC technology, and how to use the APIs.
Google provides silicon-proven VP8 hardware encoder and decoder IP cores meeting the requirements listed in this document free of charge. Visit the hardware page for more information.
This section lists the VP8 codec tools required for a high-quality RTC experience.
The VP8 encoder must support the previous, golden and alternate reference frame types. The encoder API must enable handling of the reference frames as follows:
The three frame types are used for two purposes: temporal scalability and error resilience.
Furthermore, the encoder needs to support a set of coding tools that helps it meet the quality requirements mentioned later in this document. The encoder must also support multiple simultaneous encoding instances.
It is recommended that the coding system implements a denoiser either in the ISP (Image Signal Processor) or in the encoder. A motion-compensated temporal denoiser reference implementation is available in the libvpx code: http://code.google.com/p/webm/downloads/list
The decoder must support all the coding tools defined in the VP8 bitstream specification. It must also support multiple simultaneous decoding instances.
Temporal post-processing is recommended for increasing the quality of base layer frames when multiple temporal layers are coded, and for removing the popping effect caused by lower-quality keyframes.
In the reference post-processing algorithm, each macroblock in the current frame is compared to the co-located macroblock of the previous frame. If their Sum of Absolute Differences (SAD) exceeds a set threshold T, a weighted average of the two macroblocks is applied.
A reference implementation can be found in the libvpx code: http://code.google.com/p/webm/downloads/list
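The compare-and-blend step described above can be sketched as follows. This is a minimal illustration, not the libvpx reference implementation: the function name, the frame representation (a list of rows of luma values) and the default blend weight are assumptions made for the example.

```python
# Illustrative sketch only; not the libvpx reference implementation.
MB_SIZE = 16  # VP8 macroblocks cover 16x16 luma pixels


def temporal_blend(prev, curr, threshold, weight=0.5):
    """Return a filtered copy of `curr` (a list of rows of luma values).

    For each macroblock, the sum of absolute differences (SAD) against the
    co-located macroblock in `prev` is computed. If the SAD exceeds
    `threshold`, the block is replaced by a weighted average of the two.
    `weight` is the share given to the current frame (hypothetical default).
    """
    height, width = len(curr), len(curr[0])
    out = [row[:] for row in curr]
    for by in range(0, height, MB_SIZE):
        for bx in range(0, width, MB_SIZE):
            ys = range(by, min(by + MB_SIZE, height))
            xs = range(bx, min(bx + MB_SIZE, width))
            sad = sum(abs(curr[y][x] - prev[y][x]) for y in ys for x in xs)
            if sad > threshold:
                for y in ys:
                    for x in xs:
                        blended = weight * curr[y][x] + (1 - weight) * prev[y][x]
                        out[y][x] = int(blended + 0.5)  # round to nearest
    return out
```

A real implementation would also filter the chroma planes and tune T and the weights; those details are outside the scope of this sketch.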
In a one-to-one call, both the encoder and the decoder need to be capable of 720p@30fps throughput at 2 Mbps.
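To relate a resolution and frame rate to hardware load, the throughput can be expressed in macroblocks per second. The sketch below assumes that 720p means 1280x720 pixels and that partial macroblocks at the frame edge are rounded up to whole ones; the function name is hypothetical. It also evaluates the screencasting mode (2560 x 1600 at 5 fps) given later in this document.

```python
import math

MB_SIZE = 16  # a VP8 macroblock covers 16x16 luma pixels


def macroblocks_per_second(width, height, fps):
    """Macroblock throughput for one stream, rounding partial blocks up."""
    mbs_per_frame = math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)
    return mbs_per_frame * fps


# One-to-one call, assuming 720p = 1280x720: 80 x 45 macroblocks per frame.
hd_call = macroblocks_per_second(1280, 720, 30)

# Screencasting mode: 160 x 100 macroblocks per frame at 5 fps.
screencast = macroblocks_per_second(2560, 1600, 5)

print(hd_call, screencast)  # 108000 80000
```

Under these assumptions the screencasting mode needs fewer macroblocks per second than a 720p@30fps call, even though each frame is larger.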
The multiparty call provides spatial scalability by having each encoder encode and stream three separate bitstreams of different resolutions to the RTC backend server. The server sends each receiving client only one of the three streams, chosen according to that client's available bandwidth.
The encoder needs to be capable of simultaneous
throughput. This sums up to 141750 macroblocks per second, at a total bandwidth of 1.8 Mbps.
The decoder needs to be capable of simultaneous
throughput, where N is the maximum number of participants. Using N=20, the total is 243000 macroblocks per second, at a total bandwidth of 3.2 Mbps.
Both encoder and decoder need to be able to return the maximum supported resolution information to the host.
The encoder needs to process the data in real time, without buffering frames internally. That is, each frame received from the sensor needs to be encoded and sent out immediately.
The RTC applications use temporal thinning as a means to quickly adjust data rates to network bandwidth variations. Frames forming each incremental temporal layer can be dropped without affecting the decoding of lower layers.
The encoder must support up to three temporal layers, and use the reference frame handling operations for implementing them as described below:
A typical frame rate for layers 0, 1 and 2 is 7.5 fps, 7.5 fps and 15 fps respectively. Because the layers are cumulative, each participant views 7.5 fps (layer 0 only), 15 fps (layers 0 and 1) or 30 fps (all three layers) video, depending on their bandwidth.
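The layer frame rates above can be illustrated with a small sketch of temporal thinning. The repeating layer pattern (0, 2, 1, 2) is an assumption chosen because it reproduces the 7.5/7.5/15 fps split at a 30 fps capture rate; the function names are hypothetical, and the reference frame handling needed to make each layer decodable is not modeled here.

```python
# Assumed repeating assignment of frames to temporal layers.
LAYER_PATTERN = (0, 2, 1, 2)


def temporal_layer(frame_index):
    """Temporal layer of a frame under the assumed 4-frame pattern."""
    return LAYER_PATTERN[frame_index % len(LAYER_PATTERN)]


def thin(frame_indices, max_layer):
    """Keep only frames whose layer is at or below `max_layer`."""
    return [i for i in frame_indices if temporal_layer(i) <= max_layer]


def layer_rates(capture_fps):
    """Frame rate contributed by each layer at the given capture rate."""
    period = len(LAYER_PATTERN)
    return [LAYER_PATTERN.count(layer) / period * capture_fps
            for layer in range(max(LAYER_PATTERN) + 1)]


print(layer_rates(30))                 # [7.5, 7.5, 15.0]
print(len(thin(range(120), 0)) / 4)    # 7.5 fps over four seconds
print(len(thin(range(120), 1)) / 4)    # 15.0 fps
```

Dropping layer 2 halves the frame rate, and dropping layers 1 and 2 quarters it, which is how a receiver or server can adapt quickly to a bandwidth drop.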
In order to provide a great user experience over variable network conditions, the VP8 encoder must support the following bitrate adjusting features:
The application can set up a region of interest as a rectangular area in the frame. The encoder should use this information to concentrate the available bitrate on the specified region of interest. Region of interest control is applied only in constant bitrate mode, and can be implemented using VP8 segments.
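One way to map a rectangular region of interest onto VP8 segments is a per-macroblock segment map, sketched below. The function name, the pixel-based rectangle format (x, y, width, height) and the choice of segment IDs are assumptions for illustration; an encoder would then give the region-of-interest segment a lower quantizer than the background segment.

```python
MB_SIZE = 16  # a VP8 macroblock covers 16x16 luma pixels


def roi_segment_map(frame_w, frame_h, roi, roi_segment=1, background_segment=0):
    """Build a per-macroblock segment map for a rectangular ROI.

    `roi` is (x, y, width, height) in pixels. Every macroblock that the
    rectangle touches is assigned `roi_segment`; all others get
    `background_segment`. VP8 allows up to four segment IDs (0..3).
    """
    mb_cols = (frame_w + MB_SIZE - 1) // MB_SIZE
    mb_rows = (frame_h + MB_SIZE - 1) // MB_SIZE
    x, y, w, h = roi
    first_col, last_col = x // MB_SIZE, (x + w - 1) // MB_SIZE
    first_row, last_row = y // MB_SIZE, (y + h - 1) // MB_SIZE
    return [
        [
            roi_segment
            if first_row <= r <= last_row and first_col <= c <= last_col
            else background_segment
            for c in range(mb_cols)
        ]
        for r in range(mb_rows)
    ]


# A 64x48 frame (4x3 macroblocks) with a 32x16 ROI at pixel (16, 16).
for row in roi_segment_map(64, 48, (16, 16, 32, 16)):
    print(row)
```

Because the map works at macroblock granularity, a rectangle that is not aligned to 16-pixel boundaries is expanded to the macroblocks it overlaps.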
On desktop computers, the Google RTC applications use the libvpx encoder library with the command line settings --rt --cpu-used=-5. For consistent user experience across devices, a hardware encoder
For quality benchmarking, libvpx can be downloaded at http://code.google.com/p/webm/downloads/list.
The encoder needs to be able to disable the probability table updates so that the entropy tables can be independent between layers.
The aforementioned list of reference frame update/use operations can also be used for error resilience purposes.
The receiver can send information to the encoder about correctly received and decoded golden and alternate reference frames. The receiver can also send information about corrupted frames when they are detected in the bitstream. The encoder can use this information to encode the output bitstream in a way that allows the receiver to recover from the errors.
When the encoder gets information about corrupted frames, it should try to recover from the corruption either by encoding the corrupted blocks as intra blocks or by using correctly received alternate and/or golden reference frames as references. In constant bitrate mode, recovery must obey the given bitrate constraints.
This feature has been designed to closely match the RTP payload format for VP8 video, defined in http://tools.ietf.org/html/draft-westin-payload-vp8.
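The recovery choice described above (predict from an acknowledged reference if one exists, otherwise fall back to intra coding) can be sketched as a small decision function. The function name, the string labels and the preference order between golden and alternate reference frames are hypothetical, and the bitrate constraints of constant bitrate mode are not modeled.

```python
def choose_recovery(acked_refs, prefer=("golden", "altref")):
    """Pick a recovery strategy after the receiver reports corruption.

    `acked_refs` is the set of reference frames the receiver has confirmed
    as correctly received and decoded. If one of them is available, the
    next frame is predicted from it; otherwise intra coding is used.
    The preference order in `prefer` is an assumption of this sketch.
    """
    for ref in prefer:
        if ref in acked_refs:
            return ("inter", ref)
    return ("intra", None)


print(choose_recovery({"altref"}))            # ('inter', 'altref')
print(choose_recovery(set()))                 # ('intra', None)
```

Predicting from an acknowledged reference usually costs fewer bits than intra coding, which matters when recovery has to stay within a bitrate budget.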
In screencasting, a client in the RTC session shares their screen with the other participants, which is useful for seminars, video conferences, teaching, presentations etc. Screencasting supports up to 2560 x 1600 resolution at 5 fps, at 100 kbps. Both the encoder and the decoder should support this resolution.