Joint Audio and Video Internet Coding (JAVIC)
University of Bristol
Department of Electrical and Electronic Engineering
Final report
James T. Chung-How and David R. Bull
November 1999
A research programme funded by British Telecom
Image Communications Group
Centre for Communications Research
University of Bristol
Merchant Venturers Building
Woodland Road
Bristol BS8 1UB, U.K.
Tel: +44 (0)117 954 5195
Fax: +44 (0)117 954 5206
Email: Dave.Bull@bristol.ac.uk
Executive Summary
This document describes work undertaken by the University of Bristol, Centre for Communications Research,
between March 1998 and October 1999 in the area of robust and scalable video coding for Internet applications.
The main goals that have been achieved are:
• the development of an H.263+ based video codec that is robust to packet loss and can provide a layered bitstream with temporal scalability
• the real-time optimisation and integration of the above codec into vic
• the development of a continuously scalable wavelet-based motion-compensated video codec suitable for real-time interactive applications
The work carried out for the duration of the project is summarised below.
Investigation of robustness of RTP-H.263 to packet loss
The robustness of H.263 video to random packet loss when packetised with the Real-time Transport Protocol was
investigated. Various error concealment schemes and intra-replenishment algorithms were evaluated. The potential
improvements of having a feedback channel enabling selective intra replenishment were assessed.
Improving robustness to packet loss
Ways of improving the loss robustness of H.263 were investigated. It was found that the Error Resilient Entropy
Code (EREC), initially proposed as a potential solution, was unable to cope with packet loss. Forward error
correction of the motion information (MV-FEC) was found to give significant improvement with minimal increase
in bit-rate. The main problem with motion compensated video in the presence of loss lies with the temporal
propagation of errors. A scheme known as Periodic Reference (PR) frame, which uses the Reference Picture
Selection (RPS) mode of H.263+, was proposed and combined with forward error correction in order to stop
temporal error propagation.
Error-resilient H.263+ video codec
MV-FEC and RPS/FEC were then combined to give a robust H.263+ video codec. Extensive simulations with
random packet losses at various rates and standard video sequences were carried out. The results were compared with those of H.263 using intra-replenishment, and our codec was found to provide better robustness at the same bit-rate. Our robust codec can also produce a layered bitstream that gives temporal scalability without any additional latency.
Integration into vic
The robust H.263+ based codec was then integrated into vic. This required extensive modifications to the source
code to provide buffering of packets. The encoder was also optimised to enable near real-time operation of the
codec inside vic for QCIF images.
Continuously Scalable Wavelet Codec
The development of continuously scalable wavelet-based video codecs was carried out. Motion compensated 2D
wavelet codecs were preferred over 3D wavelet codecs because of the unacceptable temporal latency of 3D-wavelet
decomposition. A block matching MC–SPIHT codec and a hybrid H.263/SPIHT codec were implemented that
produce a continuously scalable bitstream. Continuous scalability, without any drift due to the MC process, is
obtained at the expense of a decrease in compression efficiency compared to the non-scalable situation. The
efficiency of the SPIHT algorithm at coding prediction error images was also investigated.
Contents
Executive Summary
Contents

1  Video over Internet: A Review
   1.1  Introduction
   1.2  The Internet: Background
        1.2.1  The Internet Protocol (IP)
        1.2.2  Internet Transport Protocols
        1.2.3  Multicasting and the Multicast Backbone
        1.2.4  Real-time Audio and Video over the MBone
   1.3  Real-time Multimedia over the Internet
        1.3.1  Real-time Multimedia Requirements
        1.3.2  The Real-time Transport Protocol (RTP)
        1.3.3  Integrated Services Internet and Resource Reservation
   1.4  Error Resilient Video over Internet
        1.4.1  Video Coding Standards
        1.4.2  Existing Video Conferencing Tools
        1.4.3  End-to-end Delay and Packet Loss
        1.4.4  Packetisation
        1.4.5  Robust Coding
        1.4.6  Forward Error Correction (FEC) and Redundant Coding
        1.4.7  Rate Control Mechanisms
        1.4.8  Scalable/Layered Coding
   1.5  Summary

2  RTP-H.263 with Packet Loss
   2.1  Introduction
   2.2  The H.263 and H.263+ Video Coding Standards
   2.3  RTP-H.263 Packetisation
        2.3.1  RTP-H.263 (version 1)
        2.3.2  RTP-H.263 (version 2)
   2.4  Error Concealment Schemes
        2.4.1  Temporal Prediction
        2.4.2  Estimation of Lost Motion Vectors
   2.5  Intra MB Replenishment
        2.5.1  Uniform Replenishment
        2.5.2  Conditional Replenishment
   2.6  Conclusions

3  Motion Compensation with Packet Loss
   3.1  Introduction
   3.2  Redundant Packetisation of Motion Vectors
        3.2.1  Motion Information
        3.2.2  Redundant Motion Vector Packetisation
   3.3  Motion Estimation with Packet Losses
        3.3.1  Motion Compensation – Method 1
        3.3.2  Motion Compensation – Method 2
        3.3.3  Motion Compensation – Method 3
   3.4  Performance of Intra-H.261 Coder in vic
   3.5  FEC of Motion Information
        3.5.1  RS Erasure Correcting Code
        3.5.2  FEC of Motion Information (MV-FEC)
   3.6  Conclusions

4  Periodic Reference Picture Selection
   4.1  Introduction
   4.2  Periodic Reference Picture Selection
        4.2.1  Periodic Reference Frames
        4.2.2  FEC of PR Frames
   4.3  Robust H.263+ Video Coding
   4.4  Conclusions

5  Scalable Wavelet Video Coding
   5.1  Introduction
   5.2  SPIHT Still Image Coding
   5.3  Wavelet-based Video Coding
        5.3.1  Motion Compensated 2D SPIHT Video Codec
        5.3.2  Effects of Scalability on Prediction Loop at Decoder
        5.3.3  Hybrid H.263/SPIHT Video Codec
   5.4  Comparison with Layered H.263+
   5.5  Efficiency of SPIHT Algorithm for DFD and Enhancement Images
   5.6  Conclusions

6  Summary and Future Work
   6.1  Summary of Work Done
   6.2  Potential Future Work

A  Codec Integration into vic
   A.1  Video Codec Specifications
   A.2  Encoder Optimisation
   A.3  Interface between Codec and vic

B  Colour Images

References
Chapter 1
Video over Internet: A Review
1.1 Introduction
The widespread availability of high speed links together with low-cost, high computational power
workstations and the recent progress in audio and video compression techniques means that real-time multimedia
transmission over the Internet is now possible. However, the Internet was originally designed for non-real-time data
such as electronic mail or file transfer applications, and its underlying transport protocols are ill-suited to real-time data. Packet losses are inevitable in any connectionless network such as the Internet, and existing Internet transport protocols cope with such losses through retransmissions. Error control schemes based on retransmissions are unsuitable for real-time applications because of their strict delay requirements. This means that real-time audio and video applications over the Internet will suffer from packet losses, and the compression algorithms used in such applications need to be robust to these losses.
After a brief description of the Internet and the Multicast Backbone (MBone), the requirements of real-time
multimedia applications are described. The remainder of this document concentrates on real-time video over the
Internet, particularly for videoconferencing applications. The problems with current video coding standards, which
are mainly designed for circuit-switched networks, are presented and existing videoconferencing tools are
introduced. A review of the work that has been done in the area of robust video over packet-switched networks then
follows.
1.2 The Internet: Background
The Internet started in the 1960s as an experimental network, funded by the US Department of Defence
(DoD), linking the DoD with military contractors and various universities and was known as the ARPANET. The
aim of the project was to investigate ways of interconnecting seemingly incompatible networks of computers, each
using different technologies, protocols and hardware infrastructure. The result is what is known today as the TCP/IP
Protocol Suite [Comer, 1995].
The network grew rapidly in popularity as universities and research bodies joined in. A number of regional
and national backbone networks were set up, such as the NSFNET in North America, and JANET in the United
Kingdom. The Internet has doubled in size every year since its origins, and it is estimated that there are now over 12
million computers connected to the Internet [Goode, 1997].
1.2.1 The Internet Protocol (IP)
The Internet is essentially a packet-switched network linking a large number of smaller heterogeneous subnetworks connected together via a variety of links and gateways or routers. The underlying mechanism that makes
such a complex interconnection possible is the Internet Protocol (IP). IP provides a connectionless,
‘best-effort’ packet delivery service. Every possible destination on the Internet is assigned a unique 32-bit IP
address. When a host wants to send a data packet to another host on the network, it appends the destination host IP
address to the packet header and transmits the packet over the network to the appropriate router. Every intermediate
host, i.e. router, in the network that receives the packet re-directs it on the appropriate link solely by looking at the
destination IP address and consulting its routing tables.
IP Addressing
IP addresses are unique 32 bit integers and every host connected to the Internet must be assigned an IP
address. Conceptually, each address is divided into two parts, a network part (netid) which identifies on which
specific network a host is connected, and a host part (hostid) which identifies the particular host in the network. This
distinction is made in order to simplify routing. There are five main classes of IP addresses, as depicted in Fig.
1.1.
    Class A:  0      | netid (7 bits)   | hostid (24 bits)
    Class B:  1 0    | netid (14 bits)  | hostid (16 bits)
    Class C:  1 1 0  | netid (21 bits)  | hostid (8 bits)
    Class D:  1 1 1 0    | multicast address (28 bits)
    Class E:  1 1 1 1 0  | reserved for future use (27 bits)
Fig. 1.1. The five classes of IP addresses.
Classes A, B and C differ only in the maximum number of hosts that a particular network can have. Class D
addresses are known as multicast addresses, and are dynamically assigned to a multicast group instead of a unique
Internet host. Multicasting is discussed in more detail in section 1.2.3.
IP Datagram, Encapsulation and Fragmentation
The basic transfer unit or packet that can be carried using IP is commonly known as an IP datagram or
merely a datagram. A datagram is divided into two parts, a header part and a data part, as shown in Fig. 1.2.
    Datagram Header | Datagram Data Area
Fig. 1.2. General form of an IP datagram.
The datagram header contains the source and destination addresses, and other information such as the version
of the IP protocol used, and the total length of the datagram. The header length can vary from 20 bytes upwards and
the maximum possible size of an IP datagram is 65,535 (2^16 - 1) bytes. The datagram data area can contain any type of information, and will generally consist of data packets generated by Internet applications. An IP datagram is a
conceptual unit of information that can exist in software only. When a datagram is transmitted across a physical
network, the entire datagram, including its header, is carried as the data part of a physical network frame. This is
referred to as encapsulation. The maximum size of a physical network frame is generally limited by the network
hardware, e.g. 1500 bytes of data for Ethernet and 4470 bytes for FDDI. This limit is called the network’s maximum
transfer unit or MTU. A datagram may be too long to fit in a single network frame, in which case the datagram is
divided into a number of smaller datagrams, each with its own header and data part. This process is known as
fragmentation.
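As a rough illustration of the fragmentation arithmetic (this sketch is ours, not part of the original report; the function name and defaults are illustrative), the following splits a datagram payload across fragments, respecting the rule that fragment offsets are expressed in 8-byte units, so every fragment except the last must carry a multiple of 8 bytes:

    def fragment_sizes(payload_len, mtu, header_len=20):
        """Return the payload size carried by each IP fragment.

        Each fragment carries its own IP header, and all fragments
        except the last must hold a multiple of 8 payload bytes.
        """
        max_data = ((mtu - header_len) // 8) * 8  # usable bytes per fragment
        sizes = []
        remaining = payload_len
        while remaining > max_data:
            sizes.append(max_data)
            remaining -= max_data
        sizes.append(remaining)
        return sizes

    # A 4000-byte payload over Ethernet (MTU 1500) yields 1480 + 1480 + 1040.
    print(fragment_sizes(4000, 1500))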
1.2.2 Internet Transport Protocols
The fundamental Internet Protocol provides an unreliable, best-effort, connectionless packet delivery system.
Thus, IP provides no guarantees on the timely or orderly delivery of packets and packets can be lost or corrupted
due to routing errors or congested routers. IP is a very simple protocol that deals only with the routing of packets
from host to host, and does not know if a packet is lost, delayed, or delivered out of order. Such conditions have to
be dealt with by higher level protocols or by the application itself.
More than one application may be sending and receiving packets within a single host,
meaning that there must be a means of identifying to which application a datagram is destined. Such functionality is
not provided by IP. Thus, transport protocols have to be used together with IP to enable applications within hosts to
communicate. Two main transport protocols are used over the Internet, namely UDP and TCP.
• User Datagram Protocol (UDP) – UDP is the simplest possible transport protocol. It is likely that more
than one application will be running on any given host, and UDP identifies to which application a packet
belongs by including a port number in the UDP header. Apart from a checksum indicating whether the
packet has been corrupted in transit, UDP does not add any other functionality to the ‘best-effort’ service
provided by IP.
• Transmission Control Protocol (TCP) – In addition to the services provided by UDP, TCP offers a
reliable connection-oriented end-to-end communication service. An error control mechanism using
sequence numbers and acknowledgements ensures that packets are received in the correct sequence and
corrupted or lost packets are retransmitted. TCP also includes flow control and congestion control
mechanisms. TCP is widely used for data transfer over the Internet. Thus, applications sending packets
using TCP do not have to worry about packets being lost or received out of sequence.
    Application
    UDP / TCP
    Internet Protocol (IP)
    Physical Network
Fig. 1.3. Conceptual layered approach of Internet services.
Transport protocols such as TCP or UDP are normally used on top of IP. This means that applications
transmitting data over the Internet pass their data packets to UDP or TCP, which then encapsulates the data with
their respective headers. The UDP or TCP packets are then encapsulated with an IP datagram header to generate the
IP datagram that is then transmitted over the physical network. This conceptual layered approach is illustrated in
Fig. 1.3.
1.2.3 Multicasting and the Multicast Backbone (MBone)
The traditional best effort IP architecture provides a unicast delivery model, i.e. each network host has a
unique IP address and packets are delivered according to the IP address in the destination field of the packet header.
In order to send the same data to more than one receiver, a separate packet with the appropriate header must be
forwarded to each receiver. IP Multicasting is the ability of an application/host to send a single message across the
network so that it can be received by more than one recipient at possibly different locations. This is useful in
applications such as videoconferencing where a single video stream is transmitted to a number of receivers and
results in considerable savings in transmission bandwidth since the video data is only transmitted once across any
single physical link in the network.
Multicasting is implemented in the Internet by an extension of the Internet Protocol (IP) which uses IP class
D addressing [Deering, 1989]. Multicast packets are identified by special IP addresses that are dynamically
allocated for each multicast session. Hosts indicate whether they want to join or leave multicast groups, i.e. to
receive packets with specific multicast addresses, by using the Internet Group Management Protocol (IGMP).
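To make the join mechanism concrete, the sketch below (ours, not from the report; group address and port are arbitrary examples) joins a multicast group through the standard sockets API, which causes the kernel to issue the corresponding IGMP membership report:

    import socket
    import struct

    GROUP = "239.1.2.3"   # example class D (multicast) address
    PORT = 5004

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # IP_ADD_MEMBERSHIP takes the group address plus a local interface;
    # 0.0.0.0 lets the kernel choose the interface and send the IGMP join.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    data, addr = sock.recvfrom(2048)  # receive one multicast datagram

Leaving the group (IP_DROP_MEMBERSHIP) triggers the corresponding IGMP leave processing.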
Since multicast is an extension of the original IP, not all routers in the Internet currently support multicasting.
The Multicast Backbone (MBone) refers to the virtual network of routers that support multicasting, and is an
overlay on the existing physical network [Eriksson, 1994]. Routers supporting multicast are known as mrouters.
Multicast packets can be sent from one mrouted host to another via routers that do not support multicasting by using
encapsulation and tunnelling, i.e. the multicast packets are encapsulated and transmitted as normal point-to-point IP
datagrams. As existing commercial hardware is upgraded to support multicast traffic, this mixed system of routers
and mrouters will gradually disappear, eliminating the need for tunnelling.
1.2.4 Real-time Audio and Video over the MBone
With the deployment of high speed backbone connections over the Internet and the availability of adequate
workstation/PC processing power, the implementation of real-time audio and video applications requiring the
transmission of bandwidth hungry and time sensitive packets over the Internet is becoming more of a reality. A
number of real-time applications using the MBone have been proposed such as the visual audio tool (vat) and robust
audio tool (rat) for audio; the INRIA videoconferencing system (ivs), network video (nv) and the video
conferencing tool (vic) for video; and whiteboard (wb) for drawing.
All traffic on the MBone tends to use the unreliable User Datagram Protocol (UDP) as the transport protocol
rather than the reliable Transmission Control Protocol (TCP). The main reason is that the flow and error control
mechanisms used in TCP are generally not suitable for real-time traffic.
The use of UDP with IP implies that traffic sent on the MBone will suffer from:
• Unpredictable and random delay - This is a feature of any packet switched network because of variable
delays caused by queuing at routers due to congestion and transmission through variable speed links.
• Packet loss - This is caused by packets being dropped at congested routers due to filled buffers and
routing errors or unreliable links.
The requirements of real-time multimedia applications are described next and ways by which these
requirements can be met by the existing Internet infrastructure are discussed.
1.3 Real-time Multimedia over the Internet
1.3.1 Real-time Multimedia Requirements
Real-time multimedia applications such as Internet telephony, audio or video conferencing and video-on-demand impose strict timing constraints on the presentation of the medium at the receiver. In designing applications
for real-time media, the temporal properties such as total end-to-end delay, jitter, synchronisation and available
bandwidth must be considered. Invariably, limits on these parameters are imposed by the tolerance of human
perception to such artefacts.
• End-to-end Delay – Interactive applications impose a maximum limit on the total end-to-end delay
above which smooth human interaction is no longer possible. For speech conversations, ITU-T
Recommendation G.114 specifies a maximum round trip delay of 400 ms and a maximum delay of 600
ms is considered as the limit for echo-free communications [Brady, 1971]. The total delay includes the
time for data capture or acquisition at the transmitter, the processing time at both encoder and decoder,
the network transmission time and the time for data display or rendering at the receiver. The
transmission delay in the case of UDP/IP over the Internet involves the time for processing and routing
of packets, the delays due to queuing at congested nodes and the actual propagation time through the
physical links.
• Jitter and Synchronisation – Any audio or video application requires some form of intra-media
synchronisation between the individual packets generated by the source. For a video application, each
frame must be displayed at a specific time interval with respect to the previous frame, otherwise a jerky
and inconsistent motion will be perceptible. Some applications where two or more media are received
simultaneously also require some form of inter-media synchronisation. In a video conferencing
application the playback of the audio and video streams must be synchronised to provide a realistic
rendering of the scene. Subjective tests show that a skew of 80-100 ms in lip synchronisation is below
the limit of human perception [Steinmetz, 1996]. The transmission delay in a packet switched network
will vary from packet to packet, depending on the network load and routing efficiency, resulting in what
is known as jitter. Network jitter can be eliminated by using a suitably large buffer, although this adds to
the total delay experienced by the end-user.
• Bandwidth – Media such as audio and video generally require a large amount of bandwidth for
transmission. Although advances in compression techniques have enabled huge reductions in bandwidth
requirements, the widespread and constantly growing use of multimedia applications means that
bandwidth is still scarce on the Internet. The amount of bandwidth required for a single audio or video
source will depend on the particular application, although for each medium there is a minimum bandwidth
below which the quality is perceived as unacceptable.
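The playout-buffer sketch referred to above follows (our illustration, not part of the report; the class name and fixed-delay policy are assumptions, and real tools adapt the delay to measured jitter). Each packet is played a fixed delay after its media timestamp, so jitter up to that delay is absorbed at the cost of added latency:

    import heapq

    class PlayoutBuffer:
        """Minimal fixed-delay playout buffer (illustrative sketch).

        Packets carry a media timestamp; each is played `delay` time
        units after its timestamp, so network jitter up to `delay`
        is absorbed at the cost of added end-to-end latency.
        """
        def __init__(self, delay):
            self.delay = delay
            self.heap = []  # (timestamp, payload), ordered by timestamp

        def arrive(self, timestamp, payload):
            heapq.heappush(self.heap, (timestamp, payload))

        def due(self, now):
            """Pop every packet whose playout time has been reached."""
            out = []
            while self.heap and self.heap[0][0] + self.delay <= now:
                out.append(heapq.heappop(self.heap)[1])
            return out

A packet arriving more than `delay` after its timestamp misses its playout point and is effectively treated as lost.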
The Internet was designed mainly for robust transmission of non real-time data, and its underlying
architecture makes it unsuitable for time sensitive data. The Internet traditionally offers a best effort delivery service
with no quality of service guarantees, and is not suitable for time-critical and delay and throughput sensitive data
generated by multimedia applications. In order to overcome some of these limitations and make it more suitable for
real-time transmissions, a number of standards have been proposed or are being developed by the Internet
Engineering Task Force (IETF).
1.3.2 Real-time Transport Protocol (RTP)
Real-time multimedia applications such as interactive audio and video, as described above, require a number
of end-to-end delivery services in addition to the functionality provided by traditional Internet transport protocols
for non-time-critical data. Such services include timestamping, payload type identification, sequence numbering and
provisions for robust recovery in the presence of packet losses. Different audio and video applications have different
requirements depending on the compression algorithms used and the intended use scenario, and a single well-defined protocol is inadequate for all conceivable real-time applications. The Real-time Transport Protocol (RTP)
[Schulzrinne et al., 1996] was defined to be flexible enough to cater for all these needs. RTP is an end-to-end
protocol for the transport of real-time data and does not provide any quality of service guarantees such as ordered
and timely delivery of packets.
The aim of RTP is to provide a thin transport layer which different applications can build upon to cater for
their specific needs. The protocol specification clearly states that RTP “is a protocol framework that is deliberately
not complete” and “is intended to be malleable to provide the information required by a particular application.” The
core document specifies those functions that are expected to be common among all applications where RTP will be
used. Several fields in the RTP header are then defined in an RTP profile document for a specific type of
application. An RTP profile for audio and video conferences with minimal control has been defined [Schulzrinne,
1996]. An RTP payload format specification document then specifies how a particular payload, such as audio or
video compressed with a specific algorithm, is to be carried in RTP. The general format of an RTP packet is shown
in Fig. 1.4. Payload formats for H.261 and H.263 (versions 1 and 2) encoded video have been defined [Turletti and
Huitema, 1996][Zhu, 1997][Bormann et al., 1998].
    RTP Header | Payload Header | Payload Data
Fig. 1.4. General format of an RTP packet.
RTP also defines a control protocol (RTCP) to allow monitoring of data delivery in a scaleable manner to
large multicast networks, and to provide minimal control and identification functionality. RTCP packets provide
feedback information on the network conditions, such as packet delay and loss rates. RTP and RTCP are designed to
be independent of the underlying transport and network layers, but are typically used in conjunction with UDP/IP.
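The RTP fixed header itself is only twelve bytes. As an illustration of the fields mentioned above (field layout per the RTP specification; the function name and dictionary keys are ours), a receiver might unpack it as follows:

    import struct

    def parse_rtp_header(packet):
        """Parse the 12-byte fixed RTP header (illustrative sketch)."""
        if len(packet) < 12:
            raise ValueError("packet shorter than fixed RTP header")
        b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
        return {
            "version": b0 >> 6,          # should be 2
            "padding": (b0 >> 5) & 1,
            "extension": (b0 >> 4) & 1,
            "csrc_count": b0 & 0x0F,
            "marker": b1 >> 7,           # e.g. set on the last packet of a frame
            "payload_type": b1 & 0x7F,   # identifies the codec, per the profile
            "sequence": seq,             # detects loss and reordering
            "timestamp": ts,             # media sampling instant
            "ssrc": ssrc,                # identifies the synchronisation source
        }

The sequence number is what allows a video application to detect lost packets, and the timestamp is what drives playout scheduling and inter-media synchronisation.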
1.3.3 Integrated Services Internet and Resource Reservation
The current widely used Internet Protocol (IPv4) does not provide any quality of service (QoS) guarantees,
e.g. there are no bounds on the delay, throughput and losses experienced by packets. This is mainly because of the
single class, best-effort model on which the Internet is based and is unsuitable for real-time multimedia traffic which
requires some guaranteed QoS.
One way of guaranteeing end-to-end delivery services is to avoid congestion altogether through some form of
admission control mechanism, where available network resources are monitored and sources must first declare their
traffic requirements. Network access is denied to traffic that would cause the network resources to be overloaded.
The successor of IPv4, which is known as IPv6, will provide means by which resources can be reserved along a
particular route from a sender to a receiver. A packet stream from a sender to a receiver is known as a flow. Flows
can be established with certain guaranteed QoS, such as maximum end-to-end delay and minimum bandwidth by
using protocols such as the Resource Reservation Protocol (RSVP) which is currently being standardised.
Admission control and provision for various QoS for different types of data will enable the emergence of an
Integrated Services Internet, capable of meeting the requirements of data, audio and video applications by using the
same physical network [White and Crowcroft, 1997]. However, it will take some time before such mechanisms can
be implemented on a large scale on the Internet, and multimedia applications developed in the near future still need
to be able to cope with congestion, delays and packet losses.
1.4 Error Resilient Video over Internet
1.4.1 Video Coding Standards
Video coding or compression is a very active research area and numerous compression algorithms based on
transforms, wavelets, fractals, prediction and segmentation have been described in the literature [Clarke, 1995].
Standards such as JPEG [Wallace, 1991] for still images, H.261 [Liou, 1991], H.263 [Rijkse, 1996], H.263+ [Cote
et al., 1998] and MPEG [Sikora, 1997] for video at various bit-rates have been established. All these standards are
based on block motion compensation for temporal redundancy removal and hybrid DCT and entropy coding for
spatial redundancy removal. H.261 is aimed primarily at videoconferencing applications over ISDNs at p x 64 kbps.
H.263 was initially intended for video communications at rates below 64 kbps, but was later found to outperform
H.261 even at ISDN rates. H.263 is currently the most popular standardised coder for low bit-rate video
communications. H.263 (version 2), also known as H.263+, is a backward compatible extension of H.263, and
provides 12 new negotiable modes. These new modes improves compression performance over circuit-switched and
packet-switched networks, support custom picture size and clock frequency, and improve the delivery of video in
error-prone environments. The MPEG-4 standard, which is currently being developed, is aimed at a much wider
range of applications than just compression, and has a much wider scope, with the objective of satisfying the
requirements of present and future multimedia applications. It is an extension of H.263 and will support object-based video
coding.
All these standards were developed primarily for circuit-switched networks in that they generate a continuous
hierarchical bit-stream where the decoding process relies on previously decoded information to function properly. A
number of issues such as packetisation and error and loss control must be addressed if these standards are to be used
over asynchronous networks such as ATM and the Internet [Karlsson, 1996].
1.4.2 Existing Videoconferencing Tools
A number of videoconferencing applications for the Internet and the MBone have been developed, and are
commonly known as videoconferencing tools. Most tools are freely available over the Internet and the most popular
ones are the Network Video Tool (nv) from XeroxParc [Frederick, 1994], the INRIA Videoconferencing System
(ivs) from INRIA [Turletti and Huitema, 1996] and the Videoconferencing tool (vic) from U.C. Berkeley/Lawrence
Berkeley Labs (UCB/LBL) [McCanne and Jacobson, 1995].
XeroxParc’s nv was one of the earliest widely used Internet video coding applications. It supports only video
and uses a custom coding scheme based on a Haar wavelet decomposition. Designed specifically for the Internet, the
compression algorithm used has low computational complexity and is targeted for efficient software implementation.
nv does not use any form of congestion control and transmits at a user-defined rate.
Soon after nv, an integrated audio/video conferencing system was released by INRIA. Based exclusively on
the H.261 standard, ivs provides better compression than nv at the expense of higher computational complexity.
Since ivs uses motion compensation, a lost update will propagate from frame to frame until an intra mode update is
received. In order to speed up recovery in the presence of errors, ivs adapts its intra refreshment rate according to
the packet loss rates experienced by receivers. A source based congestion algorithm is implemented in ivs to
minimise packet losses. For multicast sessions with a small number of receivers, ivs also incorporates an ARQ
scheme where receivers can request lost blocks to be coded in intra mode in subsequent frames.
Based on the experiences and lessons learned from nv and ivs, a new video coding tool, vic, was developed
at UCB/LBL. vic was designed to be more flexible, being network layer independent and providing support for
hardware-based codecs and diverse compression algorithms. The main improvement in vic was its compression
algorithm known as intra-H.261, which uses conditional replenishment of blocks and generates an H.261 compatible
bitstream. The use of conditional replenishment instead of motion compensation provides a better run-time
performance and robustness to packet losses compared to ivs, while the DCT-based spatial coding improves on the
compression efficiency of nv. vic was recently extended to support a novel layered coding known as Progressive
Video with Hybrid transform (PVH) and a receiver-based adaptation scheme referred to as Receiver-driven Layered
Multicast (RLM) [McCanne et al., 1997].
1.4.3 End-to-end Delay and Packet Loss
Variable delay and packet loss are unavoidable features of any connectionless packet-switched
network such as the Internet, where a number of sources are statistically multiplexed over variable capacity links via
routers. The end-to-end delay experienced by a packet is the sum of delays encountered at each intermediate node
on the network from the source to the destination. Each of these delays will be the sum of a fixed component and a
variable component. The fixed component is the transmission delay at a node and the propagation delay from one
node to another. The variable component is the processing and queuing delay at a node, which depends on the link
capacity and network load at that node. Packets will be dropped if congestion at a node causes the buffer at that
node to be full. Packet losses are mainly due to congestion at routers, rather than transmission errors.
Most data transfers over the Internet cope with packet loss by using reliable transport protocols, such as TCP,
which automatically provide for retransmission of lost packets. In the case of real-time video applications,
retransmission is generally not possible because a packet must be received within a maximum delay period,
otherwise it is unusable and considered as lost. Thus, real-time video packets over the MBone are transmitted using
UDP. It is argued in [Pejhan et al., 1996] that retransmission schemes can be effective in some multicast
situations. However, in the case of real-time applications, such as videoconferencing, where the processing time
required to compress and decompress the video stream already constitutes a significant fraction of the maximum
allowable delay, it is likely that ARQ schemes will be ineffective.
A number of studies have been made to try to understand packet loss behaviour in the Internet and
characterise the loss patterns seen by applications [Bolot, 1993][Yajnik et al., 1996][Boyce and Gaglianello,
1998][Paxson, 1999]. All of the measurements show that loss rate varies widely depending on the time of day,
location, type of connection and link bandwidth. Long burst losses are possible and the loss rate experienced for
UDP packets can be anything between 0 and 100%. It is also found that losses tend to be correlated, i.e. a lost
packet is more likely to be followed by another loss. However, for the backbone links of the MBone, it is observed
that the loss tends to be small (2% or less) and consists mainly of isolated single losses with occasional long lost
bursts. No simple method exists for modelling the typical loss patterns likely to be seen over the Internet since the
loss depends on so many intractable factors. Therefore, in all the loss simulations presented in this report, random
loss patterns were used. This is believed to be a valid assumption since our test sequences are of short duration (10
s) and burst losses can be modelled as very high random loss.
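To make the loss models concrete, the following sketch (ours, not from the report; parameter names are illustrative) generates a per-packet loss pattern under the independent random model used in our simulations, alongside the two-state Gilbert model often used to reproduce correlated, bursty loss:

    import random

    def bernoulli_losses(n, p, seed=0):
        """Independent random losses at rate p (the model used in our simulations)."""
        rng = random.Random(seed)
        return [rng.random() < p for _ in range(n)]

    def gilbert_losses(n, p_gb, p_bg, seed=0):
        """Two-state Gilbert model: correlated (bursty) losses for comparison.

        p_gb: probability of moving from 'good' (no loss) to 'bad' (loss);
        p_bg: probability of returning from 'bad' to 'good'.
        The mean loss rate is p_gb / (p_gb + p_bg).
        """
        rng = random.Random(seed)
        bad, out = False, []
        for _ in range(n):
            bad = rng.random() < ((1 - p_bg) if bad else p_gb)
            out.append(bad)
        return out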
Therefore, any video application that is to be used over the Internet must be able to provide acceptable video
quality at a low bit-rate even in the presence of random packet losses. Packet loss is caused by congestion, which in
turn depends on the network load, i.e. the amount of packet data generated by the video application. One obvious
way of improving video quality is to minimise the probability of packet loss. However, some losses are inevitable
and once packets have been lost, the next best solution is to minimise the effects of the lost packets on the decoded
video. Minimisation of actual packet losses and their effects on decoded video can be achieved by adapting the
coding, transmission, reception or decoding mechanisms in a number of ways according to the channel
characteristics, such as available bandwidth and loss rates. These are described next.
1.4.4 Packetisation
Existing video coding standards such as H.261, H.263 and MPEG generate a hierarchical bit-stream where
lower layers are contained in higher layers. In H.261 and H.263, an image or frame is divided into a number of non-overlapping blocks for coding. Coded blocks are grouped together to form macroblocks (MB) and a number of MBs
are then combined to form groups of blocks (GOB). Thus, a coded frame will be composed of a number of GOBs.
Each layer is preceded by its header which contains information which is vital for proper decoding of the data
contained in the layer. This means that one needs to receive the frame header in order to decode the GOBs, and for
each GOB, the GOB header is required to enable any MB contained in the GOB to be decoded. This hierarchical
organisation of an H.263 bit-stream is illustrated in Fig. 1.5.
The H.221 Recommendation specifies that for transmission over ISDN, the bitstream from an H.261 encoder
should be arranged in 512-bit frames or packets, each containing 492 bits of data, 2 synchronisation bits and 18 bits
of error-correcting code. This packetisation strategy is suitable if the underlying channel is a continuous serial
bitstream with isolated and independent bit errors, which are taken care of by the error correcting code. This fails for
packet-based networks where a packet loss results in an entire packet being erased, with catastrophic consequences
if any header information is contained in that packet.
    Picture layer:    Picture Header | GOB Data | GOB Data | ---
    GOB layer:        GOB Header | MB Data | MB Data | ---
    Macroblock layer: MB Header | Block Data | Block Data | ---
    Block layer:      TCoeff | --- | TCoeff | EOB
Fig. 1.5. Hierarchical organisation of an H.263 bitstream.
Therefore, to achieve robustness in the presence of packet losses, packetisation must be done so that
individual packets can be decoded independently of each other. This will typically imply the addition of some
redundant header information about the decoder state, such as quantiser values and block addresses, at the beginning
of the packet. This redundancy should be kept as small as possible while packet size should be close to the
maximum allowable length to minimise per packet overhead.
Packetisation mechanisms for H.261, H.263 and H.263+ used in conjunction with RTP have been defined
[Turletti and Huitema, 1996][Zhu, 1997][Bormann et al., 1998]. For H.261, the packetisation scheme takes the MB
as the smallest unit of fragmentation, i.e. a packet must start and end on a MB boundary. All the information
required to decode a MB independently, such as quantiser value, reference motion vectors and GOB number in
effect at the start of the packet are contained in a specific RTP-H.261 payload header. Each RTP packet contains a
sequence number and timestamp, so that packet losses can be detected using the sequence number. However, when
differential coding such as motion compensation is used as in H.261, a lost packet will results in error artefacts
which will propagate through subsequent differential updates until an intra coded version is received. The use of
more frequent intra coding minimises the duration of loss artefacts at the expense of increased bandwidth.
For H.263 bitstreams, the situation is further complicated by the fact that the motion vector for a MB is coded
with respect to the motion vectors of three neighbouring MBs, requiring more information in the RTP-H.263
payload header for independent decoding. The use of the four negotiable coding options (advanced prediction, PB-frames, syntax-based arithmetic coding and unrestricted motion vector modes) introduces complex relationships
between MBs, which impacts on the robustness and efficiency of any packetisation scheme. To try to solve this
problem, three modes are defined for the RTP-H.263 payload header. The shortest and most efficient payload
header supports fragmentation at GOB boundaries. However, a single GOB can be too long to fit in a single packet
and another payload header allowing fragmentation at MB boundaries is defined. The third mode has the longest
payload header and is the only one that supports the use of PB-frames.
A new packetisation scheme has been defined for H.263+ bitstreams, and this scheme can also be used for
H.263 bitstreams. The new format reduces the payload header to a minimum of 16 bits, and it is up to
the application to decide which packetisation strategy to use. If robustness is not required, i.e. packet loss is very
unlikely, packetisation can be done with minimum overhead. However, if the network is prone to losses, then
packetisation can be done at GOB or slice boundaries to maximise robustness.
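As an illustration of the GOB-boundary strategy (a sketch under our own assumptions, not code from any RTP implementation; the header-size default is the RTP fixed header plus a minimal payload header), coded GOBs can be packed greedily into packets no larger than the MTU, so that a lost packet erases only whole, independently decodable GOBs:

    def packetise_gobs(gob_sizes, mtu, header=12 + 2):
        """Group coded GOBs into packets without splitting any GOB.

        gob_sizes: coded size of each GOB in bytes, in picture order.
        header:    per-packet overhead assumed here (RTP fixed header
                   plus a minimal payload header).
        Returns a list of packets, each a list of GOB indices.
        """
        budget = mtu - header
        packets, current, used = [], [], 0
        for i, size in enumerate(gob_sizes):
            if size > budget:
                raise ValueError(f"GOB {i} exceeds the packet budget; "
                                 "fragment at MB boundaries instead")
            if used + size > budget:
                packets.append(current)
                current, used = [], 0
            current.append(i)
            used += size
        if current:
            packets.append(current)
        return packets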
1.4.5 Robust Coding
Video coding algorithms such as H.261, H.263 and MPEG exploit temporal redundancies in a video
sequence to achieve compression. Instead of coding each frame separately, it is the difference between the current frame and the previous frame, or its motion-compensated prediction, that is coded. This enables high compression because of the large
amount of correlation between frames in a video sequence. In H.261 and H.263, a frame is divided into a number of
non-overlapping blocks and these blocks are transformed using the DCT and runlength/Huffman coded. Blocks can
be coded in intra mode, where the actual pixel values are coded, or in inter mode, where the difference between
pixel values in the previous frame is coded. However, the use of differential coding makes the scheme very sensitive
to packet losses, since if a block is lost, the error will propagate to all subsequent frames until the next intra-coded
block is received. The problem of video transmission over lossy or noisy networks is being extensively researched
and a general review can be found in [Wang and Zhu, 1998].
One way to minimise the effect of packet loss is to increase the frequency of intra blocks so that errors due to
missing blocks do not propagate beyond a number of frames. The use of more intra blocks decreases the
compression efficiency of the coder and adds more redundancy to the compressed video. More frequent intra blocks
also increase the possibility of intra blocks being lost. In the extreme case, each frame is coded in intra mode, i.e. independently of the others. Such a scheme is implemented in motion-JPEG, where each frame is coded as an
individual still image using the JPEG standard. Motion JPEG can provide robust and high quality video in cases
where sufficient bandwidth is available [Willebeek-LeMair and Shae, 1997].
Another solution is to use both intra and inter modes adaptively according to the channel characteristics. In
the INRIA Videoconferencing System (ivs) [Turletti and Huitema, 1996], an H.261 coder using both intra and inter
modes is implemented. In order to speed up the recovery in the presence of losses, the intra refreshment rate is
varied according to the packet loss rate. When there are just a few participants in a multicast session, ivs uses a
scheme where a lost packet causes the receiver to send a NACK back to the source, enabling the source to encode
the lost blocks in intra mode in the following frame. The NACK information can be sent as RTCP packets. The
NACK-based method cannot be used in a multicast environment when there are many receivers because of the
feedback implosion problem. If all participants respond with a NACK, or some other information, at the same
instant in time, this can cause network congestion and lead to instability.
Robust coding can also be achieved by using only intra mode, and transmitting only those blocks that have
changed significantly since they were last transmitted. This is known as conditional replenishment and is used in the
nv and vic videoconferencing tools. In this case errors will not propagate since no prediction is used at the encoder.
Lost intra blocks are simply not updated by the decoder. The efficiency of the algorithm is justified by the fact that
motion in video sequences generally lasts for several frames, so that a lost intra block is likely to be updated again in
the next frame. The error artefact from a lost block will therefore not last for very long. In both nv and vic, each
block is updated at a regular interval by a background process even if it has not changed. This ensures that a receiver
who joins a session in the middle will eventually receive an update of every block after a specific time interval.
Conditional replenishment is also computationally efficient since it does not use any motion compensation or
prediction, and thus does not need to run a decoder in parallel with the encoder. The decision of whether or not to
update a block is also made early in the coding process.
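A minimal sketch of the block-selection logic behind conditional replenishment follows (our illustration, not the nv or vic source; the 16x16 block size, threshold and refresh interval are arbitrary):

    import numpy as np

    def select_blocks(curr, ref, age, block=16, diff_thresh=500, max_age=32):
        """Conditional replenishment block selection (illustrative sketch).

        curr: current frame; ref: copy of what was last transmitted;
        age:  frames since each block was last sent (updated in place).
        A block is (re)sent when it differs enough from its last sent
        version, or when the background refresh timer expires so that
        late joiners eventually receive every block.
        """
        sent = []
        for by in range(0, curr.shape[0], block):
            for bx in range(0, curr.shape[1], block):
                cur_blk = curr[by:by+block, bx:bx+block].astype(int)
                ref_blk = ref[by:by+block, bx:bx+block].astype(int)
                j, i = by // block, bx // block
                if np.abs(cur_blk - ref_blk).sum() > diff_thresh or age[j, i] >= max_age:
                    sent.append((by, bx))
                    ref[by:by+block, bx:bx+block] = curr[by:by+block, bx:bx+block]
                    age[j, i] = 0
                else:
                    age[j, i] += 1
        return sent

Note that no prediction loop is needed: the encoder only keeps the last transmitted version of each block, which is what makes the scheme cheap and loss-tolerant.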
1.4.6 Forward Error Correction (FEC) and Redundant Coding
FEC is very commonly used for both error detection and correction in data communications and it can also be
used for error-recovery in video communications. However, for packet video, it is much more difficult to apply error
correction because of the relatively large number of bits contained in a single packet. Another problem with FEC is
that the encoding must be applied over a certain number of packets or with some form of interleaving in order to be
useful against burst losses. This introduces a delay at both the encoder and decoder. Parity packets can be generated
using the exclusive-OR operation [Shacham and McKenney, 1990] or the more complex Reed-Solomon codes can
be used [McAuley, 1990][Rizzo, 1997].
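As a concrete illustration of the simplest of these schemes (our sketch, not from the report), one parity packet formed by XORing k equal-length media packets lets a receiver rebuild any single lost packet of the group:

    def xor_parity(packets):
        """Build one parity packet over k equal-length packets (pad in practice)."""
        parity = bytearray(len(packets[0]))
        for pkt in packets:
            for i, b in enumerate(pkt):
                parity[i] ^= b
        return bytes(parity)

    def recover(received, parity):
        """Rebuild the single missing packet: XOR of the parity and the k-1 survivors."""
        return xor_parity(list(received) + [parity])

    group = [b"pkt-one!", b"pkt-two!", b"pkt-3..!"]
    p = xor_parity(group)
    assert recover([group[0], group[2]], p) == group[1]  # packet 1 was lost

The cost is one extra packet per group of k, and the scheme fails if two packets of the same group are lost, which is why Reed-Solomon codes are used when more protection is needed.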
The major difficulty with FEC is choosing the right amount of redundancy to cope with changing network
conditions without excessive use of available bandwidth. If a feedback channel is available, then some form of
acknowledgement and retransmission scheme (ARQ) can be used [Girod and Farber, 1999]. It is generally assumed
that protocols based on retransmissions cannot be used with real-time applications. However, it is shown in [Pejhan
et al., 1996] that retransmission schemes can be effective in some multicast situations. Alternatively, a combination
of ARQ and FEC can be used [Nonnenmacher et al., 1998][Rhee, 1998]. A technique compatible with H.263 that
uses only negative acknowledgements (NAK) but no retransmission is proposed in [Girod et al., 1999].
Nevertheless, in the case of real-time applications, such as videoconferencing, where the processing time required to
compress and decompress the video stream already constitutes a significant fraction of the maximum allowable delay,
it is likely that ARQ schemes will be ineffective. Also, for multicast applications with multiple receivers, feedback is
generally not possible in practice because of the feedback implosion problem at the transmitter.
Redundant coding of information in more than one packet has been successfully used for Internet audio
coding applications. In the Robust Audio Tool (rat) [Hardman et al., 1995], redundant information in the form of
synthetically coded speech is included in each packet so that lost packets can be replaced by synthetic speech
contained in subsequent packets. The redundancy is very small since synthetic speech coding algorithms provide
very high compression. Similar schemes have also been proposed for video applications. In [Bolot and Turletti,
1996], a robust packet video coding scheme is proposed for block-based video coding algorithms, where packet n
contains a low-resolution version of all blocks transmitted in packet {n-k} that are not encoded in packets {n-i}, (i∈
(0,k]). It is claimed that the added redundancy is small because a block coded in packet {n-k} is likely to be
included in packet {n-k+1}. However, this is only true if an entire frame is contained in a single packet, which may
not be possible due to limitations on maximum packet size. The order of the scheme, i.e. the value of k, and the
amount of redundancy, can be varied at the source based on the network losses experienced by the receiver.
1.4.7 Rate Control Mechanisms
Packet losses are mainly caused by congestion at routers, which is due to excess traffic at a particular node.
Congestion, and its consequence, packet loss, can only be cleared by reducing the network load. Transport protocols such as TCP use feedback control mechanisms to control the data rate of sources of non-real-time traffic. This helps
to maximise throughput and minimise network delays. However, most real-time traffic is carried in the Internet using
UDP, which does not include any congestion control mechanisms. Thus if congestion occurs due to excess video
traffic, the congested links will continue to be swamped with video packets, leading to further congestion and
increased packet losses. Packet loss for real-time video can be greatly reduced by using rate-control algorithms,
where a source adjusts its video data rate according to feedback information about the network conditions. Feedback
can be obtained from receivers by using the RTP Control Protocol (RTCP). RTCP packets
contain information such as the number of packets lost, estimated inter-arrival jitter and timestamps from which loss
rates can be computed.
Such a rate control scheme is used in ivs, where the video data rate at the source is varied according to the
congestion experienced by receivers [Bolot and Turletti, 1994]. Whether a network is congested or not is decided
according to the measured packet loss rates. This method works relatively well in a unicast environment where
network congestion is easily determined. However in a multicast environment, loss rates may vary widely on
different branches of the multicast tree because of varying network loads and link capacities. The approach used by
ivs is to change the source output rate only if more than a threshold percentage of receivers are observing
sufficiently low or high packet loss. It can be seen that in such a heterogeneous environment, it is impossible for a
source transmitting at a single rate to satisfy the conflicting requirements of all receivers. Source adaptation based
on an average of the receivers' requirements will mean that low-capacity links will be constantly congested and high-capacity links will always be under-utilised.
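The kind of loss-driven rate adaptation described above can be sketched as follows (our reconstruction of the general idea, not the ivs source code; all thresholds, gains and rate bounds are illustrative):

    def adapt_rate(rate, loss_reports, min_rate=32e3, max_rate=512e3,
                   congested=0.05, clear=0.02, fraction=0.5):
        """Multiplicative decrease / additive increase on RTCP loss feedback.

        Acts only when more than `fraction` of the receivers agree that
        loss is high (congestion) or low (spare capacity), as in ivs.
        loss_reports: measured loss rate per receiver over the last interval.
        """
        n = len(loss_reports)
        if sum(l > congested for l in loss_reports) > fraction * n:
            rate *= 0.7        # back off multiplicatively under congestion
        elif sum(l < clear for l in loss_reports) > fraction * n:
            rate += 10e3       # probe for spare bandwidth additively
        return max(min_rate, min(rate, max_rate))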
A solution to this problem is to move the rate-adaptation process from the source to within the network. In
network-assisted bandwidth adaptation, the rate of media flow is adjusted from within the network to match the
available capacity of each link in the network. Thus, the source always transmits at its maximum possible rate and
so-called video gateways are placed throughout the network to transcode portions of the multicast tree into lower
bit-rates, either by changing the original coding parameters or by using a different coding scheme [Amir et al.,
1995]. Network-assisted bandwidth adaptation requires the deployment of transcoding gateways at strategic points
in the network, which is both costly and impractical to implement on a large scale. Transcoding at arbitrary nodes
also increases the computational burden at these nodes, and increases the latency in the network.
1.4.8 Scalable/Layered Coding
The idea behind layered or scalable coding is to encode a single video signal into a number of separate
layers. The base layer on its own provides a low resolution version of the video, and each additional decoded layer
further enhances the signal such that the full resolution video is obtained by decoding all the layers. A number of
layered coding schemes have been described [Taubman and Zakhor, 1994][Ghanbari, 1989].
Layered coding can be used in a multicast environment by having a source that transmits each separate layer
in a different multicast group. Receivers can then decide how many layers to join depending on the available
capacity and congestion experienced on their respective links. Thus, the source continuously multicasts all the
layers, and multicast pruning is used to limit packet flow only over those links necessary to reach active receivers.
In [McCanne et al., 1997], a specific multicast layered transmission scheme and its associated receiver
control mechanism is described and referred to as Receiver-driven Layered Multicast (RLM). A low complexity
layered video coding algorithm based on conditional replenishment and a hybrid DCT/subband transform is used
which outperforms all existing Internet video codecs at low bit-rates. A receiver detects congestion or spare capacity
by monitoring packet loss or the absence of loss, respectively. When congestion occurs, a receiver quits the
multicast group corresponding to the highest layer it is receiving at the time. When packet loss is low, receivers can
join the multicast group corresponding to the next highest layer and receive better video quality. RLM has been
implemented in the videoconferencing tool vic.
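The receiver-side control loop of RLM can be summarised in a few lines (a schematic sketch of the published idea, not the vic implementation; thresholds are arbitrary and the real protocol also uses join experiments and back-off timers, omitted here):

    def rlm_step(layers_joined, loss_rate, max_layers,
                 join_thresh=0.01, leave_thresh=0.05):
        """One RLM control decision based on loss measured over an interval."""
        if loss_rate > leave_thresh and layers_joined > 1:
            return layers_joined - 1   # congestion: leave the highest-layer group
        if loss_rate < join_thresh and layers_joined < max_layers:
            return layers_joined + 1   # spare capacity: join the next layer
        return layers_joined

Because each layer is a separate multicast group, this adaptation happens entirely at the receiver, with no feedback to the source and no transcoding in the network.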
1.5 Summary
The Internet is essentially a packet-switched network that was mainly designed for non-real time traffic and
its best-effort packet delivery approach means that packets can arrive out-of-sequence or even be lost. Internet
transport protocols normally deal with packet losses by requesting the retransmission of lost packets, but they cannot
be used with real-time data, such as interactive audio or video, because of the strict delay requirements of such data.
This means that any real-time video stream transmitted over the Internet will suffer from packet losses, and the video
coding algorithm needs to be able to cope with such losses. Most video compression algorithms are very sensitive to
losses, mainly because of the use of motion compensation, which results in spatial and temporal error propagation.
A number of Internet video coding tools have been developed, and some are now widely used. However,
most of these tools tend to sacrifice compression efficiency for increased robustness, either by not using motion
compensation or by coding more frequent intraframes. Therefore, the compression achieved with these tools is much
less than that possible with state-of-the-art motion-compensated DCT or wavelet codecs.
There are a number of tools and algorithms such as interleaving, forward error-correction, error-concealment,
and scalable coding that can be used to generate a more robust video bit-stream. These can be incorporated in the
video coding algorithm itself or applied to the compressed bit-stream. Further research is needed to look at how
these robust coding techniques can be combined with state-of-the-art video coding algorithms to generate a loss-resilient, high-performance codec suitable for use over the Internet.
Chapter 2
RTP-H.263 with Packet Loss
2.1 Introduction
In this section, the robustness to packet loss of an H.263 bitstream packetised with the Real-time Transport
Protocol (RTP) is investigated. First the RTP-H.263 packetisation schemes as specified in RFC 2190 [Zhu, 1997]
and RFC 2429 [Bormann et al., 1998] are described. The performance of RTP-H.263 for standard video sequences in
the presence of random packet losses is then investigated. A number of error concealment strategies, exploiting the
spatial and temporal redundancies in the video sequence, are used to minimise the decoding error resulting from lost
macroblocks. Even with the use of sophisticated error concealment algorithms, residual errors in the decoded frames
are inevitable and the motion compensation process will result in temporal propagation of these errors. The only
way to minimise this residual error propagation is to use some form of intra replenishment technique, where some
macroblocks are coded without any reference to previously coded blocks. This obviously results in an increase in
bit-rate since intraframe coding is less efficient than interframe coding. The performance, in terms of PSNR
improvement and increase in bit-rate, of using intra replenishment is then assessed.
In some video coding applications, a feedback channel exists between the encoder and decoder whereby the
decoder can inform the encoder of any lost packets by using some form of negative acknowledgement (NACK)
scheme. This enables a more efficient macroblock replenishment strategy since the encoder can decide which
macroblocks to replenish in order to minimise error propagation. Simulations for a range of round-trip feedback
delays and conditional replenishment schemes are carried out.
2.2 The H.263 and H.263+ Video Coding Standards
The H.263 video standard is based on a hybrid motion-compensated transform video coder. H.263 (version
1) includes four negotiable advanced coding modes that give improved compression performance: unrestricted
motion vectors, advanced prediction, PB frames, and syntax-based arithmetic coding. A description of H.263
(version 1) and its optional modes can be found in [Girod et al., 1997].
H.263 (version 2), also known as H.263+, is a backward compatible extension of H.263, and provides 12
new negotiable modes. These new modes improve compression performance over circuit-switched and packet-switched networks, support custom picture sizes and clock frequencies, and improve the delivery of video in error-prone environments. For more details about each of these modes, refer to [Côté et al., 1998]. Four of the optional
H.263+ modes are aimed at improving the error-resilience.
Annex K – Slice-Structured Mode: This mode replaces the original GOB structure used in baseline H.263
with a more flexible slice structure. A slice consists of an arbitrary number of MBs arranged either in scanning order
or in a rectangular shape. No dependencies are allowed across slice boundaries. The number of MBs in a slice can be
dynamically selected to allow the data for each slice to fit into a packet of a specific size.
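To make the slice-sizing idea concrete, the following C sketch greedily closes a slice once the coded bits of its macroblocks approach a packet's payload budget. This is a minimal sketch under our own assumptions: the coded_mb_bits() query and the budget parameters are illustrative, and are not part of the H.263+ syntax.

    #include <stddef.h>

    /* Hypothetical size query (our assumption): coded size in bits of
     * macroblock 'mb'. A real encoder would know this after coding the MB. */
    static size_t coded_mb_bits(int mb)
    {
        return 200 + (size_t)(mb % 7) * 50;  /* dummy values for the sketch */
    }

    /* Number of MBs, starting at 'first_mb', that fit into one packet of
     * 'payload_bytes' (less 'header_bits' for the slice header), so that
     * the slice can be sized dynamically. Always emits at least one MB. */
    int mbs_per_slice(int first_mb, int total_mbs,
                      size_t payload_bytes, size_t header_bits)
    {
        size_t budget = payload_bytes * 8;
        size_t used = header_bits;
        int count = 0;
        for (int mb = first_mb; mb < total_mbs; mb++) {
            size_t need = coded_mb_bits(mb);
            if (count > 0 && used + need > budget)
                break;                 /* close the slice at an MB boundary */
            used += need;
            count++;
        }
        return count;
    }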
Annex R – Independent Segment Decoding Mode: When used, this mode eliminates any data
dependencies between picture segments, where a segment can be defined as a GOB or a slice. Each segment is
treated as a separate picture where MBs in a particular segment can only be predicted from the picture area of the
reference frame belonging to the same segment.
Annex N – Reference Picture Selection Mode: This mode allows the use of an earlier picture as reference
for the motion compensation, instead of the last transmitted picture. The reference picture selection (RPS) mode can
also be applied to individual segments rather than complete pictures. This mode can be used with or without a back
channel. The use of a back channel enables the encoder to keep track of the last correctly received picture at the
12
JAVIC Final Report
Chapter 2: RTP-H.263 with Packet Loss
decoder. When the encoder learns about an incorrectly received picture or segment at the decoder, it can then code
the next picture using a previous correctly received coded picture as reference.
In some application scenarios, such as in a multicast environment, the use of a back channel may not be possible or available. In such cases, the reference picture selection mode can be used in a method known as video
redundancy coding (VRC) [Wenger, 1998], where the pictures in a sequence are divided into two or more threads
and each picture in a thread is predicted only from pictures in the same thread.
Annex O – Temporal, SNR, and Spatial Scalability Mode: Scalability means that the bitstream is divided into two or more layers. The base layer is independently decodable, and each additional enhancement layer increases the perceived picture quality. This mode can improve error-resilience when used in conjunction with error control schemes
such as FEC, ARQ or prioritisation. Scalability can also be useful for heterogeneous networks such as the Internet,
especially when employed in conjunction with layered multicast [McCanne et al., 1997][Martins and Gardos, 1998].
Each of the error-resilience oriented modes has specific advantages and is useful only in certain types of networks and application scenarios [Wenger et al., 1998].
2.3 RTP-H.263 Packetisation
Low bit rate video coding standards, such as H.261 and H.263, were designed mainly for circuit-switched
networks and direct packetisation of the H.261 or H.263 bitstream for use over the Internet would make it
particularly sensitive to packet losses. The packetisation of an H.263 bitstream for use with RTP has been specified
in RFC 2190 [Zhu, 1997]. This was subsequently modified in RFC 2429 [Bormann et al., 1998] to include the new
H.263 (version 2), also known as H.263+, which is a superset of H.263 (version 1). RFC 2190 continues to be used by existing implementations and may be required for backward compatibility in new implementations; however, RFC 2429 should be used by new implementations. Both formats are therefore described next.
Ideally, in order to achieve robustness in the presence of losses, each packet should be decodable on its own,
without any reference to previous packets. Thus, in addition to the RTP packet header, every RTP packet will have
some amount of payload header information to enable decoding of the accompanying payload data, followed by the
actual payload data, as shown in Fig. 2.1.
Fig. 2.1. RTP packet layout: RTP header, followed by the payload header, followed by the payload data.
2.3.1 RTP-H.263 (version 1)
Packetisation of the H.263 bitstream is done at GOB or MB boundaries. Three modes of RTP-H.263
packetisation are defined in RFC 2190:
Mode A: This is the most efficient mode, and allows packetisation at GOB level only, i.e. mode A packets always start at a GOB or picture boundary.
Mode B: This allows fragmentation at MB boundaries, and is used whenever a GOB is too large to fit into a packet. Mode B can only be used without the PB-frame option of H.263.
Mode C: Same as mode B, except that the PB-frame option can be used.
The packetisation mode is selected adaptively according to the maximum network packet size and H.263
coding options used. Mode A is always used for packets starting with a GOB or picture start code. Mode B or C is
used whenever a packet has to start at a MB boundary because a GOB is too large to fit in a single packet. Only
modes A and B will be considered here. The fields of the H.263 payload header depend on the packetisation mode
used. The header information for both modes includes some frame-level information such as the source format, picture type, temporal reference and the coding options used. In addition, since mode B allows fragmentation at the MB level, the address of the first MB encoded in the packet, as well as the number of the GOB to which it belongs, is needed in the mode B header to allow decoding of the packet. The mode B header also contains the motion vector predictors used by the first MB in the packet. The header fields for modes A and B are given in Fig. 2.2.
Mode A header (32 bits): F, P, SBIT, EBIT, SRC, I, U, S, A, R, DBQ, TRB and TR fields.
Mode B header (64 bits): F, P, SBIT, EBIT, SRC, QUANT, GOBN, MBA, R, I, U, S, A, HMV1, VMV1, HMV2 and VMV2 fields.
Fig. 2.2. Payload header for modes A and B.
2.3.2 RTP-H.263 (version 2)
RFC 2429 allows packetisation of the H.263+ bitstream at any arbitrary point and keeps the packetisation
overhead to a minimum in the no loss situation. However for maximum robustness, packetisation should be done at
GOB or slice boundaries, i.e. every packet should begin with a picture, GOB or slice start code and all start codes
should be byte aligned. This is possible if the slice structure mode is used such that the slice size can be dynamically
adjusted in order to fit into a packet of a particular size.
The H.263+ payload header is of variable length and the first 16 bits, which are structured as shown in Fig.
2.3, are mandatory. This can be followed by an 8-bit field for Video Redundancy Coding (VRC) as indicated by the
V bit and/or by an extra picture header as indicated by PLEN. The various fields present in the payload header are:
RR: 5 bits. Reserved; always zero.
P: 1 bit. Indicates that the payload starts with a picture/GOB/slice start code.
V: 1 bit. Indicates the presence of the VRC field.
PLEN: 6 bits. Gives the length in bytes of the extra picture header; if no picture header is present, PLEN is 0.
PEBIT: 3 bits. Indicates the number of bits to be ignored in the last byte of the picture header.
Fig. 2.3. RTP-H.263+ payload header: RR (5 bits), P (1 bit), V (1 bit), PLEN (6 bits) and PEBIT (3 bits), packed into 16 bits.
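To make the layout concrete, the C sketch below unpacks the mandatory 16 bits into their five fields. The struct and function names are our own; only the field widths and their order come from RFC 2429.

    #include <stdint.h>
    #include <stdio.h>

    /* Fields of the mandatory 16-bit RTP-H.263+ payload header (RFC 2429). */
    struct h263p_hdr {
        unsigned rr;    /* 5 bits, reserved, must be zero            */
        unsigned p;     /* 1 bit, payload starts with a start code   */
        unsigned v;     /* 1 bit, a VRC byte follows                 */
        unsigned plen;  /* 6 bits, length of the extra picture header */
        unsigned pebit; /* 3 bits, ignored bits in its last byte     */
    };

    static void parse_h263p_hdr(const uint8_t b[2], struct h263p_hdr *h)
    {
        h->rr    = (b[0] >> 3) & 0x1f;
        h->p     = (b[0] >> 2) & 0x01;
        h->v     = (b[0] >> 1) & 0x01;
        h->plen  = ((b[0] & 0x01) << 5) | ((b[1] >> 3) & 0x1f);
        h->pebit = b[1] & 0x07;
    }

    int main(void)
    {
        const uint8_t bytes[2] = { 0x04, 0x00 };  /* P=1, all other fields 0 */
        struct h263p_hdr h;
        parse_h263p_hdr(bytes, &h);
        printf("P=%u V=%u PLEN=%u PEBIT=%u\n", h.p, h.v, h.plen, h.pebit);
        return 0;
    }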
Whenever a packet starts with a picture/GOB/slice start code, this is indicated by the P bit and the first 16
zeros in the start code can be omitted. Sending a complete picture header at regular intervals and at least for every
frame is advisable in highly lossy environments, although it results in reduced compression efficiency. In addition,
some fields in the RTP header are also used, such as the marker bit which indicates if the current packet carries the
end of the current frame, the payload type which should specify the H.263+ video payload format, and the
timestamp which indicates the sampling instant of the frame contained in the RTP packet. In H.263+, picture, slice and EOSBS start codes are always byte-aligned, and all other start codes can be byte-aligned.
The RTP-H.263 (version 2) payload specifications will be used in the rest of this document since it can be
used to packetise H.263 as well as H.263+ bitstreams.
When used in a video coding application, the RTP packet is encapsulated into a UDP packet, which is further
encapsulated into an IP datagram, and it is the IP datagram that is actually transmitted across the Internet. This
process is illustrated in Fig. 2.4. The RTP-H263+ header can be 2 or more bytes long, depending on which coding
options are used, while the RTP header is generally about 12 bytes, the UDP header 8 bytes, and the datagram
header is 20 bytes or more. This gives a total header length of 42 bytes or more for a single H.263 data packet. In
the remainder of this document, the calculated bit-rates include only the RTP-H.263 payload header and the H.263+ data, and no other header information. Since packetisation is always done at GOB boundaries and the VRC option is
not used, the 16 bits of the RTP-H.263+ payload header are offset by the fact that the first 16 zeros of the GOB start codes can be omitted.
Fig. 2.4. Encapsulation of RTP with UDP and IP: the H263 data and the RTP-H263 header (4-12 bytes) follow the RTP header (12 bytes) to form the RTP packet, which is carried as UDP data behind the UDP header (8 bytes), inside an IP datagram with a 20-byte datagram header.
In the following experiments, the H.263 encoder is used only with the unrestricted motion vector mode. The
GOB header is always included at the beginning of each GOB in order to avoid dependencies between GOBs,
resulting in a more robust bitstream. The bitstream is then packetised with RTP-H.263+ with one GOB per
packet, unless stated otherwise. All the results involving random packet losses were obtained by averaging over a
number of simulations.
2.4 Error Concealment Schemes
The Foreman sequence (QCIF, 12.5 Hz, and 125 frames) is coded with H.263 at about 60 kbps and
packetised for use with RTP. Only the first frame of the sequence is coded as an intraframe and all subsequent
frames are coded as interframes. Original and decoded images are shown in Fig. 2.5.
(a) Original frames 20 and 60
(b) Coded frames 20 and 60.
Fig. 2.5. Foreman sequence coded with H.263 at 60 kbps.
It is noted that the maximum GOB size obtained in this particular case is about 4000 bits for intraframes and
1500 bits for interframes. Random packet losses are then introduced to give an average packet loss rate of 10%. The
first frame is always assumed to be free of losses. A lost packet results in some macroblocks being unavailable for a
particular frame. The use of sequence numbers by the RTP protocol enables the detection of lost packets, so that the
decoder knows which macroblocks in a particular frame are missing. Error concealment strategies can then be used
to predict the missing macroblocks.
2.4.1 Temporal Prediction
A simple error concealment method is to replace missing MBs with those in the same spatial location from
the previous decoded frame. The decoded image quality degrades rapidly as the errors propagate, as can be seen from the images in Fig. 2.6. The results for Foreman with 10% packet loss are shown in Fig. 2.7.
Fig. 2.6. Frame 20 and 60 of Foreman with RTP-H.263, packet loss rate = 10% and temporal prediction.
2.4.2 Estimation of Lost Motion Vectors
The previous error concealment scheme works well only if there is very little motion in the sequence and
macroblocks do not change much between frames. For sequences with considerable motion, it is more efficient to
replace the lost MBs with motion-compensated MBs from the previous frame. In order to do so, the motion vectors
of the lost MB must be estimated using information from the correctly received neighbouring MBs. A number of
error-concealment schemes using motion vector estimation have been reported [Lam et al., 1993][Chen et al., 1997]
and three methods of estimating the lost motion vectors are evaluated. In our simulations, each packet contains a
complete GOB. Since for a QCIF image, a GOB contains a row of MBs, a lost packet results in an entire row of
MBs in a frame being corrupted. Three different methods, referred to as A, B and C, are used to estimate the lost motion vectors:
• Method A: The lost motion vector is replaced by the average of the motion vectors of the MBs in the same
frame immediately above and below the lost MB.
• Method B: The lost motion vector is assumed to be the same as the motion vector of the MB in the
previous frame, but in the same spatial location as the lost MB.
• Method C: This is a combination of methods A and B. The lost motion vector is the average of the motion
vectors calculated using methods A and B.
In each case, if some neighbouring motion vectors are unavailable because they have been lost or because the MB is at a picture border, they are assumed to be zero. The results for each method are shown in Fig. 2.7 for a packet loss rate of 10%. All three motion vector estimation methods provide considerable improvement over error concealment using only temporal prediction. However, methods A and C perform better than method B, giving an average improvement of about 3 dB overall. As expected, the improvement is more significant at the beginning of the sequence, when the images are free of errors, making concealment using correctly received information more effective. In the remainder of this document, method C is used for motion vector estimation because it is thought to be more robust, since it uses three neighbouring motion vectors to predict the lost vector.
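The three estimators are simple enough to state directly in code. The C sketch below assumes per-macroblock motion fields and availability flags (our own data layout, for illustration only); as in the text, any unavailable neighbour contributes a zero vector.

    /* Minimal sketch of the three lost-motion-vector estimators (methods A-C).
     * 'cur' and 'prev' hold the motion fields of the current and previous
     * frames; the avail flags mark correctly received MBs. */
    typedef struct { int x, y; } mv_t;

    static mv_t mv_or_zero(const mv_t *field, const int *avail, int idx, int n)
    {
        mv_t z = { 0, 0 };
        if (idx < 0 || idx >= n || !avail[idx]) return z;  /* lost or off-frame */
        return field[idx];
    }

    /* Method A: average of the MVs immediately above and below the lost MB. */
    mv_t estimate_a(const mv_t *cur, const int *avail,
                    int mb, int mbs_per_row, int n)
    {
        mv_t up = mv_or_zero(cur, avail, mb - mbs_per_row, n);
        mv_t dn = mv_or_zero(cur, avail, mb + mbs_per_row, n);
        mv_t e = { (up.x + dn.x) / 2, (up.y + dn.y) / 2 };
        return e;
    }

    /* Method B: MV of the co-located MB in the previous frame. */
    mv_t estimate_b(const mv_t *prev, const int *prev_avail, int mb, int n)
    {
        return mv_or_zero(prev, prev_avail, mb, n);
    }

    /* Method C: average of the estimates from methods A and B. */
    mv_t estimate_c(const mv_t *cur, const int *avail,
                    const mv_t *prev, const int *prev_avail,
                    int mb, int mbs_per_row, int n)
    {
        mv_t a = estimate_a(cur, avail, mb, mbs_per_row, n);
        mv_t b = estimate_b(prev, prev_avail, mb, n);
        mv_t e = { (a.x + b.x) / 2, (a.y + b.y) / 2 };
        return e;
    }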
Fig. 2.7. Estimation of lost motion vectors for a packet loss rate of 10%: PSNR against frame number for no loss, methods A, B and C, and temporal prediction only.
2.5 Intra MB Replenishment
As can be seen in Fig. 2.7, even with the use of error concealment, packet losses cause the image quality to degrade significantly with time as the errors resulting from missing macroblocks propagate from frame to frame. This is a direct result of the use of motion compensation, and the only way to limit the error propagation is to resynchronise the encoder and decoder by periodically coding macroblocks in intra mode, i.e. without any reference to previously coded macroblocks.
2.5.1 Uniform Replenishment
The periodic replenishment scheme used here codes a fixed number of MBs per frame in intra mode. The replenished MBs are chosen to be uniformly distributed in the frame in order to match the random packet losses. Intra coding of MBs effectively limits the propagation of errors, but since intra coding is less efficient than motion-compensated coding, it also results in a significant increase in bit-rate for the same decoded image quality.
This is shown in Table 2.1.
                       No intraMB   5 intraMB/frame   10 intraMB/frame   15 intraMB/frame
Average PSNR/dB        32.01        32.07             32.14              32.19
Bit-rate/kbps          63.82        75.50             87.14              97.69
Increase in bit-rate   0%           18.30%            36.54%             53.07%

Table 2.1. Increase in bit-rate with number of intraMB/frame.
The MB replenishment strategy is used in conjunction with error concealment of lost MBs using the lost
motion vector estimation method C described previously. Simulations were carried out with 10% random packet
losses and with 5, 10 and 15 MBs being replenished per frame. The results are given in Fig. 2.8.
The replenishment of 5 MB/frame improves the decoded image quality by an average of about 2 dB
compared to just using error concealment and limits the error propagation to a certain extent. This comes with an
increase in the bit-rate of more than 18%. As the number of intraMBs per frame is increased, the improvement in image quality becomes less significant, with 10 and 15 intraMBs per frame giving more or less the same performance. Thus, using too many intraMBs per frame is a rather inefficient use of the available bit-rate, and it is very likely that the same bit-rate could be used in a different way to obtain better image quality. However, in most
practical video coding applications, some form of intraMB replenishment will always be necessary to limit error
propagation and allow resynchronisation.
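For reference, a uniform pattern of this kind can be generated with a strided, cycling MB schedule; the C sketch below is one plausible realisation (our own construction, not necessarily the exact pattern used in the simulations).

    /* Choose 'count' intra MBs for frame 'frame', spread uniformly over the
     * 'total' MBs of the picture (99 for QCIF) and cycled from frame to
     * frame so that every MB is eventually refreshed. */
    void pick_intra_mbs(int frame, int count, int total, int *out)
    {
        int stride = (total + count - 1) / count;  /* spacing between intra MBs  */
        int phase = frame % stride;                /* shift the pattern each frame */
        for (int i = 0; i < count; i++)
            out[i] = (phase + i * stride) % total;
    }

With count = 5 and total = 99, every macroblock is refreshed within 20 frames.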
Fig. 2.8. Intra MB replenishment with max. packet size = 4000 bits and 10% packet loss: PSNR against frame number for no loss, 15, 10 and 5 intraMB/frame, and no intra replenishment.
Fig. 2.9. Frame 20 and 60 with uniform replenishment of 5 intraMB/frame.
2.5.2 Conditional Replenishment
Instead of randomly coding some MBs in intra mode, a much better way of limiting error propagation is to selectively intracode only those MBs that are most severely affected by errors. This is possible if we assume a feedback channel where the decoder can signal to the encoder which MBs have been lost by using some negative acknowledgement (NAck) scheme. This can be done using the RTCP packets provided by the RTP protocol. The encoder can then decide which MBs to intracode in order to minimise error propagation. Similar schemes have been used in [Turletti and Huitema, 1997] for videoconferencing over the Internet and in [Steinbach, 1997] as a robust, standard-compatible extension of H.263.
When a MB is lost, there will be a delay, equal to the round-trip propagation time between the encoder and decoder, before the encoder knows that a packet has been lost. This delay is variable and can be equivalent to several frames. The encoder can then decide which MBs to intracode in the next frame following the receipt of a NAck. An error resulting from a lost MB will propagate to spatially neighbouring MBs in the next frame because of the use of motion compensation. In [Steinbach, 1997], an error tracking mechanism is used to reconstruct the spatio-temporal error propagation and only severely distorted MBs are replenished. A simplified version of this algorithm is used here.
Fig. 2.10. Same-MB replenishment for different round-trip delays and a 10% packet loss rate: PSNR against frame number for no loss, delays of 2 frames (85.74 kbps), 4 frames (86.74 kbps) and 6 frames (87.04 kbps), and uniform replenishment with 10 intraMB/frame (87.14 kbps).
In our algorithm, the spatial error propagation is not taken into account. Following the receipt of a NAck, the MB at the same spatial position as the reported lost MB is intracoded in the next frame. The results obtained using same-MB replenishment for a range of frame delays are shown in Fig. 2.10. The error concealment scheme using motion vector estimation described earlier is also used to conceal lost MBs before they are replenished. Since the Foreman sequence is coded at 12.5 Hz, each frame delay is equivalent to 80 ms. The maximum packet size used here is 4000 bits, resulting in one GOB per packet, so that a single packet loss results in a NAck for an entire GOB.
Since every lost MB is eventually intracoded, this causes a considerable increase in the total bit-rate. The results for uniform replenishment with 10 MBs/frame are also included for comparison, since this gives roughly the same bit-rate for a 10% packet loss rate. The same-MB replenishment strategy gives an improvement of about 2 dB
compared to the uniform replenishment method for a round trip delay of 2 frames. However, as the delay increases,
the performance goes down since errors propagate to a larger number of frames before the intracoded MBs are
received. It can be seen that for a delay of 6 frames, same-MB replenishment gives similar results to uniform
replenishment.
However, intracoding of every lost MB is not necessary in scenes where there is very little motion, since the
error concealment algorithm will work well in these situations. The residual error will then be minimal and barely
noticeable. The same-MB algorithm can be improved by intracoding a lost MB only if the distortion resulting from
the lost MB is above a certain threshold. The distortion measure used is the sum of absolute difference (SAD)
between the error-free MB at the encoder and the concealed MB used at the decoder. For each MB in each frame
that is coded, the encoder stores the distortion resulting from that MB being lost, and whenever a lost MB is
reported, the MB is intracoded only if its distortion exceeds a threshold. This method is referred to as selective-MB
replenishment and PSNR results for a threshold of 500 and a round-trip delay of 4 frames are given in Fig. 2.11, and
typical decoded images are shown in Fig. 2.12.
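The decision rule itself is compact. The C sketch below shows the SAD distortion measure and the threshold test; the array layouts and names are assumptions made for illustration.

    #include <stdlib.h>

    #define SAD_THRESHOLD 500  /* distortion threshold used in our simulations */

    /* Sum of absolute differences between the encoder's error-free MB and
     * the MB the decoder would obtain after concealment (16x16 luma). */
    int mb_sad(const unsigned char *clean, const unsigned char *concealed,
               int stride)
    {
        int sad = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                sad += abs(clean[y * stride + x] - concealed[y * stride + x]);
        return sad;
    }

    /* Selective-MB rule: on receipt of a NAck for macroblock 'mb', intracode
     * it in the next frame only if the stored loss distortion exceeds the
     * threshold. 'loss_distortion' is assumed to be filled in by the encoder
     * for every MB of every coded frame. */
    int should_intracode(const int *loss_distortion, int mb)
    {
        return loss_distortion[mb] > SAD_THRESHOLD;
    }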
Fig. 2.11. Comparison of same-MB and selective-MB replenishment with a 4-frame delay: PSNR against frame number for no loss, same-MB and selective-MB replenishment.
Fig. 2.12. Frames 20 and 60 for same-MB replenishment with 4 frame delay.
The selective-MB algorithm gives a small reduction in bit-rate compared with the same-MB algorithm for the
same decoded image quality. It is likely that further reductions in bit-rate are possible by exploiting the redundancy
in the replenishment strategy, e.g. the round-trip delay is likely to be several frames in practical applications, and if
the same MB is lost in two consecutive frames, then it only needs to be intracoded once.
2.6 Conclusions
The robustness of the RTP-H.263 packetisation method has been simulated under random packet losses. A number of simple error concealment algorithms using temporal prediction and lost motion vector estimation were implemented, and the motion vector estimation algorithm using neighbouring correctly received motion vectors was found to give the best results. Minimising temporal error propagation using intracoded MBs was investigated and found to be very expensive in terms of the increase in bit-rate. A feedback mechanism, where the encoder is informed by negative acknowledgements from the decoder of which MBs have been lost, was also implemented; the encoder can then selectively replenish MBs in order to limit error propagation. This scheme performs well when the round-trip delay is only a small number of frame durations, but degrades as the delay increases. The next step is to look at how best these schemes can be combined with other loss-resilient techniques to yield a loss-resilient, high-performance video coder.
Chapter 3
Motion Compensation with Packet Loss
3.1 Introduction
In the previous progress report, it was shown that the concealment of lost macroblocks (MB) using motion
compensated MBs from the previous frame gives good results. Since packetisation is done at the group of block
(GOB) level, a lost packet results in both the motion vectors and the displaced frame difference (DFD) information
being lost. The lost motion vectors were estimated from the neighbouring correctly received MBs. In order to
minimise the effect of packet losses, the inclusion of the same motion information in two consecutive packets was
proposed. Thus, motion vector information is lost only if two or more consecutive packets are lost. Simulations with
random packet losses have shown that this results in a significant improvement in image quality.
In this section, motion estimation in the presence of packet losses is considered. It is found that, although the motion information represents a relatively small fraction of the overall bit-rate, it contributes very significantly to the decoded image quality. A scheme using redundant packetisation of the motion information is proposed. Different methods for extracting the motion vectors and performing the motion compensation in the presence of packet loss are investigated. The results obtained so far with our modified H.263 coder are then compared with the Intra-H.261 coder currently used in vic to show that motion compensation is a viable solution for Internet video coding applications. A more efficient way of protecting the motion information using forward error correction across packets is then proposed and extensive simulation results with standard sequences are presented.
3.2 Redundant Packetisation of Motion Vectors
As mentioned in the previous chapter, the RTP-H.263+ packetisation scheme minimises the effect of packet
losses by making each packet independently decodable by doing the packetisation at GOB boundaries and including
all the necessary information in a header. However, in the case of H.263, because motion compensation is used, the
decoder still relies on previously decoded information to decode the current packet. Even though packets are
independently decodable, the decoded information will still be wrong if a previous packet has been lost. One way of
increasing the robustness of packetisation to losses is to include the same information in more than one packet, i.e. a
controlled addition of redundancy in the packets. Such a scheme is used in the Robust Audio Tool (rat) [Hardman et
al., 1995] for robust audio over the Internet, where redundant information about a packet, in the form of
synthetically coded speech, is included in the following packet. A similar scheme is proposed in [Bolot and Turletti,
1996] for robust packet video, where a low-resolution version of all macroblocks coded in a packet is included in
the next packet.
3.2.1 Motion Information
It is known that for a typical video sequence compressed with H.263+, the correct decoding of the motion vectors plays a crucial role in the proper reconstruction of the decoded video [Ghanbari and Seferidis, 1993]. This is especially true considering the relatively small fraction of the total bit-rate occupied by the motion information compared to the DCT coefficients and other side information. The breakdown of the various types of information making up the H.263+ bitstream for the Foreman sequence is shown in Fig. 3.1. About 70-80% of the total bits are made up of the DCT coefficients, whereas the motion vectors only take up 5-15% of the total. Similar
results were obtained for the Salesman sequence, where the motion vectors take up an even smaller fraction of the
total bit budget (between 2 and 5%) since there is much less motion than in Foreman.
Fig. 3.1. Proportion of DCT coefficients, header/side information and motion vectors in the H.263+ bitstream as a function of total bit-rate, for Foreman (125 frames, QCIF, 12.5 Hz).
3.2.2 Redundant Motion Vector Packetisation
As a robust packetisation method, we propose to include, in each packet, a copy of the motion vectors of the previous packet. This means that if a MB is lost, it can still be approximated by its motion-compensated prediction, unless two consecutive packets are lost, in which case both copies of the motion vectors will be unavailable at the decoder. When this happens, the lost motion vector estimation method described in Chapter 2 is used.
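In outline, the packetiser simply appends the previous packet's motion vector bits to the current payload. The C sketch below shows this under the simplifying assumption that the motion vector bits of each GOB are available as a separate byte segment; in the real bitstream the COD/MVD bits are interleaved with the rest of the GOB data.

    #include <string.h>

    /* Build the payload for one GOB: the coded GOB data followed by a copy
     * of the motion vector bits carried in the previous packet. If packet i
     * is lost but packet i+1 arrives, the decoder recovers the motion
     * vectors of GOB i from the redundant copy. */
    size_t build_payload(unsigned char *out,
                         const unsigned char *gob, size_t gob_len,
                         const unsigned char *prev_mv, size_t prev_mv_len)
    {
        memcpy(out, gob, gob_len);
        memcpy(out + gob_len, prev_mv, prev_mv_len);  /* redundant MV copy */
        return gob_len + prev_mv_len;
    }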
Fig. 3.2. Comparison of redundant motion vector packetisation (72.92 kbps) with uniform MB replenishment (5 intraMB/frame, 75.50 kbps) and error concealment using lost motion vector estimation only (63.82 kbps), together with the no-loss case.
This redundant packetisation scheme can still be used with RTP-H.263 packetisation by simply including a
copy of the motion vectors from the previous packet in the H.263-payload data of the current packet. For the
Foreman sequence coded at 60 kbps with H.263, assuming that an entire GOB is coded in each RTP packet, the
inclusion of the motion vectors within RTP results in a bit-rate increase of 14.3%, from 63.82 kbps to 72.92 kbps.
The performance of this robust redundant packetisation scheme is simulated for random packet losses of 10% and
the results obtained are given in Fig. 3.2.
Fig. 3.3. Frames 20 and 60 with redundant motion vector packetisation
for 10% random packet losses.
The redundant motion vector coding algorithm performs better than the uniform intra replenishment method by an average of 3 dB while producing a lower bit-rate. This confirms the importance of the motion vectors in motion-compensated video. Typical decoded images obtained for a random packet loss of 10% are shown in Fig. 3.3. This scheme could potentially be improved by modifying the encoder such that motion compensation is performed using the motion vector information only. This would cause a slight loss in compression in the error-free case, but better robustness to packet losses, since the decoder would be able to remain synchronised with the encoder even if only the motion information is received.
3.3 Motion Estimation with Packet Losses
The motion estimation and compensation techniques specified in the H.261, H.263 and MPEG video coding standards require the use of two reference frames, in addition to the current input frame. During motion estimation, the input frame is compared to a reference frame and motion vectors are extracted that best match the blocks in the input frame with the corresponding blocks in the reference frame. During motion compensation, the motion vectors are applied to blocks in the second reference frame to get the motion-compensated frame. The difference between the input frame and the motion-compensated frame is then coded. Normally, the same reference frame is used for both motion estimation and compensation, which results in optimum performance, i.e. minimum bit-rate for a given quality. In most cases, the previously decoded frame is used as the reference frame.
However, this is only valid if the reference frame used for motion compensation at the encoder is also
available at the decoder, i.e. the encoder and decoder are perfectly synchronised. In the case of Internet video
coding, this assumption is not valid as packet losses will result in loss of information at the decoder, and the
reference frame will no longer be available at the decoder. This results in a mismatch between the encoder and
decoder that will propagate from one frame to the next.
Redundant MV coding, where two copies of the same motion vectors are packetised in two adjacent packets,
performs better than simple error concealment or error concealment with uniform intra-MB replenishment.
However, it can be seen (Fig. 3.2) that the image quality degrades progressively as the errors, due to lost packets as
explained above, propagate between frames. The best way to limit this error propagation is to ensure that
synchronisation between encoder and decoder is regained by periodically replenishing each MB in intra mode, i.e.
without any reference to previous frames. This results in an increase in bit-rate, but also effectively limits the
propagation of errors. However an intracoded MB can also be lost and a lost intra MB will result in greater
distortion than a lost inter MB. This distortion can be minimised with little cost by transmitting motion vectors for
intra MB as well. Thus if an intra-MB is lost, its motion vector can then be recovered with redundant motion vector
coding, and used to predict the lost MB using the corresponding MB from the previous frame.
3.3.1 Motion Compensation - Method 1
Our modified H.263 codec uses full-search block matching with half-pixel accuracy. Up to now, the previous decoded frame has been used as reference for both motion estimation and compensation, and this is referred to as Method 1. Simulations were carried out with 10% random packet loss. Redundant MV packetisation, where the motion vectors are duplicated in two separate packets as described in Section 3.2, was used together with uniform intra-MB replenishment. The results for 0, 5 and 10 intra-MBs per frame are given in Fig. 3.4.
Fig. 3.4. Redundant motion vector packetisation and uniform intra MB replenishment: PSNR against frame number for RTP-H263 with no loss, and Method 1 with 10 intraMB/frame (96.2 kbps), 5 intraMB/frame (84.6 kbps) and no intraMB (72.92 kbps).
As expected, the use of intra-MB replenishment effectively stops the propagation of errors at the expense of
an increase in bit rate.
3.3.2 Motion Compensation - Method 2
Errors propagate from one frame to the next because the reference frames used for motion compensation at the encoder and decoder are not the same due to packet losses. A possible way to minimise error propagation is to
ensure that both encoder and decoder remain synchronised by using a reference frame for motion compensation that
is likely to be error free at the decoder even in the presence of packet losses. With the duplication of motion vectors
in two separate packets, the probability of motion vectors being unavailable at the decoder is minimised. Thus error
propagation can be reduced by using a reference frame for motion compensation derived using only the motion
vectors and not the DFD information. However, this reduces the efficiency of motion compensation, resulting in a
more complicated DFD and hence a considerable increase in bit rate. Note that the same reference frame is used for
motion estimation as well. The results for this technique, referred to as Method 2, are given in Fig. 3.5 and compared with Method 1.
When no intra MB replenishment is used, Method 2 performs better as the error propagation from frame to
frame is minimised, at the expense of more than doubling the total bit-rate. However, with intra MB replenishment,
similar performance is obtained with both methods. This is explained by the fact that although error propagation
between frames is minimised, the DFD information is more important and each lost MB results in greater distortion.
Fig. 3.5. Motion compensation with the reference frame derived using previously transmitted motion information only: PSNR against frame number for RTP-H263 with no loss, Method 2 with 10 intraMB/frame (125.6 kbps) and no intraMB (157.1 kbps), and Method 1 with 10 intraMB/frame (96.25 kbps) and no intraMB (72.92 kbps).
3.3.3 Motion Compensation - Method 3
Fig. 3.6. Motion estimation based on the original previous frame: PSNR against frame number for RTP-H263 with no loss, Method 3 with 10 intraMB/frame (98.9 kbps) and no intraMB (75.9 kbps), and Method 1 with 10 intraMB/frame (96.24 kbps) and no intraMB (72.9 kbps).
The next method uses the original previous frame as reference for extracting the motion vectors, but motion
compensation is still done based on the previously decoded frame as in Method 1. The advantage of this method is
that the motion vectors represent the true motion in the scene, and it was thought that this would make them more
suitable for error concealment. However, the results, shown in Fig. 3.6, are not noticeably better than those obtained
with Method 1, and the method causes a small increase in bit-rate.
3.4 Performance of Intra-H261 Coder in vic
It is often assumed that motion compensation cannot be used for robust Internet video coding because of its sensitivity to packet loss. The videoconferencing tool vic therefore uses a modified version of an H.261 encoder, known as intra-H261, where motion vectors are not used and all MBs are coded in intra mode. A form of conditional replenishment is performed where only the MBs that have changed significantly are transmitted. This means that in the presence of packet losses, a lost MB is simply not updated and errors do not propagate from one frame to the next. In Fig. 3.7, the results obtained when the standard Foreman sequence (QCIF, 12.5 Hz, 125 frames) is coded with intra-H261 are shown for various bit-rates and compared with those obtained so far with our modified H.263 coder with RTP packetisation. The overall bit-rate of the intra-H261 coder is varied by changing the quantisation factor.
Fig. 3.7. Comparison of intra-H.261 with the modified H.263 coder: PSNR against frame number for the modified H263 with no loss, intra-H261 at 264 kbps with no loss, redundant MV + 10 intraMB/frame at 96.2 kbps with 10% loss, and intra-H261 at 128 kbps with no loss.
It can thus be seen that the modified H.263 coder using redundant packetisation of motion vectors and uniform intra-MB replenishment outperforms the intra-H261 coder even in the presence of 10% random packet losses. This demonstrates that motion compensation, when combined with effective loss-resilient techniques, can be more efficient than intra coding, even in the presence of loss.
3.5 FEC of Motion Information
It has been shown that duplication of the motion vectors in separate packets gives a significant increase in robustness. However, duplication of information is inefficient in terms of redundancy, and a better way of protecting the motion vectors is to use forward error correction (FEC) across packets. FEC can be very effective in this situation since the underlying RTP/UDP protocol enables lost packets to be detected, resulting in packet erasures.
3.5.1 RS Erasure Correcting Code
Effective erasure codes such as the Reed-Solomon Erasure (RSE) correcting code have been developed
[McAuley, 1990] and implemented in software [Rizzo, 1997]. With a RSE(n,k) code, k data packets can be encoded
into n packets, i.e. with r=n-k parity packets, such that any subset of k encoded packets is enough to reconstruct the
k source packets. This provides for error-free decoding for up to r lost packets out of n, as illustrated in Fig. 3.8. The
main considerations in choosing values of r and k are:
• Encoding/Decoding Delay. In the event of a packet loss, the decoder has to wait until at least k packets have been received before decoding can be done, so in order to minimise decoding delay, k must not be too large.
• Robustness to Burst Losses. A higher value of k means that, for the same amount of redundancy, the FEC will be able to correct a larger number of consecutive lost packets.
Fig. 3.8. RSE(n,k) across k data packets: the k data packets and (n-k) FEC packets are transmitted, and provided at least k of the n packets are received, FEC decoding reconstructs the original data packets.
3.5.2 FEC of Motion Information (MV-FEC)
In order to achieve a good trade-off between decoding delay and robustness to burst packet losses, the RSE code is applied across the packets of a single frame. For QCIF images, this results in an RSE(n,9) code since H.263+ produces 9 GOBs per frame and each RTP packet contains a single GOB. The motion information for each GOB is contained in the COD and MVD bits of the H.263+ bitstream. The RSE(n,9) encoding is therefore applied across these bit segments in each of the nine GOBs, generating (n-9) parity bit segments. The length of the parity bit segments is equal to the maximum length of the COD and MVD data segments among the 9 GOBs. When applying the RSE encoding, the missing bits of shorter segments are assumed to be zero. The parity segments are then appended to the data packets of the following frame (Fig. 3.9), so that the number of RTP packets per frame does not change. Thus, up to 9 parity packets (i.e. r=9) can be used by such a scheme, and if an RTP packet is lost, there is an additional one-frame delay at the decoder before the motion vectors can be recovered.
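The r=1 case of this scheme reduces to a single XOR parity segment, which the C sketch below computes across the zero-padded COD/MVD segments of the nine GOBs; the general RSE(n,9) code replaces the XOR with Reed-Solomon arithmetic over a finite field, as implemented in [Rizzo, 1997]. The buffer layout is our own assumption.

    #include <stddef.h>
    #include <string.h>

    #define GOBS 9  /* QCIF: one GOB per packet, nine packets per frame */

    /* XOR parity across the COD/MVD segments of the nine GOBs of one frame.
     * Shorter segments are implicitly zero-padded; the parity segment is as
     * long as the longest input. This is the r=1 special case of RSE(n,9). */
    size_t mv_parity(const unsigned char *seg[GOBS], const size_t len[GOBS],
                     unsigned char *parity /* at least max(len) bytes */)
    {
        size_t max_len = 0;
        for (int g = 0; g < GOBS; g++)
            if (len[g] > max_len) max_len = len[g];

        memset(parity, 0, max_len);
        for (int g = 0; g < GOBS; g++)
            for (size_t i = 0; i < len[g]; i++)
                parity[i] ^= seg[g][i];
        return max_len;  /* appended to a data packet of the next frame */
    }

With a single parity segment, any one lost segment can be recovered at the decoder by XORing the parity with the eight segments that did arrive.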
Fig. 3.9. FEC of motion information across packets: the COD and MVD information of GOBs 1-9 of the current frame is encoded with RSE(11,9), and the resulting parity data is appended to the packets of the next frame.
Fig. 3.10. Foreman with 10% loss and different amounts of MV-FEC: PSNR against frame number for no loss (62.37 kbps), no MV-FEC (62.37 kbps), and MV-FEC with r=1 (63.38 kbps), r=2 (64.39 kbps) and r=4 (66.42 kbps).
Figs. 3.10 and 3.11 show simulation results with different values of r and 10% random loss for the Foreman and Salesman sequences, respectively. As expected, FEC of the motion vectors increases the robustness to loss. For the Foreman sequence, with r=4, i.e. 4 parity packets, the degradation in PSNR caused by packet loss is reduced by a factor of two for only about a 6% increase in rate. Increasing the amount of FEC beyond r=4 does not result in any significant improvement because, for a 10% loss rate, it is very unlikely that more than 4 out of any 9 consecutive packets are lost. The improvement is less substantial for the Salesman sequence because it has relatively less motion and most of the motion vectors are zero. However, the corresponding increase in bit-rate is also reduced, being less than 4% for r=4. Typical decoded frames obtained with r=2 and 10% random packet loss are shown in Fig. 3.12.
Fig. 3.11. Salesman with 10% loss and different amounts of MV-FEC: PSNR against frame number for no loss (43.04 kbps), no MV-FEC (43.04 kbps), and MV-FEC with r=1 (43.44 kbps), r=2 (43.83 kbps) and r=4 (44.62 kbps).
Fig. 3.12. Frame 20 and 60 for r=2 with 10% random packet loss.
Fig. 3.13 shows the performance of MV-FEC with various values of r for a range of packet loss rates. The
results shown were obtained by averaging over 10 runs for each packet loss rate. The same loss patterns were used
for each value of r and the PSNR is the average over 125 frames of the foreman sequence. As expected, the use of
MV-FEC improves the performance for all packet loss rates with a larger number of parity packets being more
beneficial at high loss rates. For example, r=3 is sufficient for 5% loss rate and using a higher value of r does not
result in any significant improvement for that particular loss rate.
Fig. 3.13. Performance of MV-FEC with different numbers of FEC packets per frame for Foreman: average PSNR against packet loss rate for r=6, r=3 and r=1 parity packets per frame, and for no MV-FEC.
3.6 Conclusions
The main problem with motion compensation is that errors due to lost packets propagate from one frame to the next. Different motion estimation and compensation methods were investigated to try to minimise error propagation, and simulations with random packet losses show that conventional motion estimation/compensation based on the previous decoded frame still offers the best trade-off between total bit-rate and robustness to losses.
It was shown that simple error concealment techniques together with duplication of important information,
namely the motion vectors, can greatly improve robustness to packet losses. However, duplicating information is a
very crude way of introducing redundancy. A better method would be to use error-correcting codes across packets
so that specific parts of missing packets can be reconstructed. This would then allow restructuring of the encoded
bitstream to ensure that more important information is placed in parts of the packet protected with FEC. Motion
estimation and compensation could then be based only on the protected information, thus preventing error
propagation as long as the protected information is correctly received. These options will be investigated next.
Chapter 4
Periodic Reference Picture Selection
4.1 Introduction
Here, the temporal error propagation resulting from motion compensation is considered. A scheme known as Periodic Reference Picture Selection (periodic RPS) is introduced which modifies the motion compensation process, whereby some frames, referred to as Periodic Reference (PR) frames, are predicted from a previously decoded frame several frames behind. This technique, when used with forward error correction, is found to be more efficient at minimising error propagation from one frame to the next than intraframe coding at an equivalent bit-rate. The use of PR frames is then combined with the MV-FEC scheme presented in Chapter 3 to give a robust H.263+ based coder. This codec can also provide a layered packet stream giving temporal scalability, with the PR frames considered as the base layer and the remaining frames as the enhancement layer. Extensive simulation results are presented for random packet loss. This robust codec has been optimised and integrated into vic, as described in Appendix A.
4.2 Periodic Reference Picture Selection
In all motion-compensated video coding schemes, the most recent previously coded frame is generally used as the reference for temporal prediction of the current frame. The Reference Picture Selection (RPS) mode (Annex N) of H.263+ enables the use of any previously coded frame as reference. Using any frame other than the most recent one reduces compression efficiency, but can be beneficial in limiting error propagation in error-prone environments, as we show next. Experiments on the standard video sequences have confirmed that it is much more efficient to predict a picture from a frame several frames behind than to use intraframe coding.
Fig. 4.1. Comparison of motion compensation with n-frame delay and intracoding: average PSNR against bit-rate for H.263+ with reference frame selection on Foreman (125 frames, 12.5 fps), for frame delays of 1, 2, 3, 4, 6 and 10 frames, intra-only coding, and intra-only coding with advanced intra coding.
Fig. 4.1 shows results for Foreman where each frame is predicted from the nth previous frame, and n is known as the frame delay. Note that for n=1, this is exactly the same as conventional motion compensation from the most recent previously coded frame. Results are also shown for the case where no prediction is used, i.e. each frame is intracoded, both with and without the Advanced Intra Coding mode (Annex I) of H.263+. It can be seen that prediction even with a 10-frame delay (800 ms) still results in less than half the bit-rate of intraframe coding. Similar results were obtained for other frame rates and sequences.
4.2.1 Periodic Reference frames
The same technique as proposed in [Rhee, 1998] is used here, where every nth frame in a sequence is coded with the nth previous frame as reference; this scheme is referred to as periodic RPS. Such frames are known as periodic reference (PR) frames, and n is the frame period, i.e. the number of frames between PR frames. All other frames are coded as usual, i.e. using the previous frame as reference. This scheme is illustrated in Fig. 4.2. The advantage of PR frames is that if any errors in a PR frame can be corrected, through the use of ARQ or FEC, before it is referenced by the next PR frame, then the maximum temporal error propagation is effectively limited to the number of frames between PR frames. We show next that FEC can be used on PR frames and yet be more efficient than intraframe coding in terms of bit-rate.
Fig. 4.2. Periodic Reference (PR) frame scheme with frame period = n: every nth frame is a PR frame predicted from the previous PR frame, while the intervening frames 1 to n-1 are each predicted from the immediately preceding frame.
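The reference selection rule is a one-liner; the hedged C sketch below states it for frame period n (frame 0, which is intracoded, is assumed to be handled separately).

    /* Reference frame used to predict frame 'f' under periodic RPS with
     * frame period 'n': every nth frame (a PR frame) is predicted from the
     * previous PR frame, n frames back; all other frames use the immediately
     * previous frame, as in conventional motion compensation. */
    int reference_frame(int f, int n)
    {
        if (f % n == 0)
            return f - n;   /* PR frame: reference the previous PR frame */
        return f - 1;       /* ordinary frame */
    }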
Fig. 4.3. Comparison of periodic RPS and intraframe coding: average PSNR against bit-rate for Foreman (12.5 Hz) with RPS frame periods of 15, 10 and 5, an intraframe every 15, 10 and 5 frames, and H263+ without RPS.
Fig. 4.4. Bits per frame for periodic RPS and intraframe coding: Foreman (12.5 Hz) with QP=10, comparing an intraframe every 10 frames, periodic RPS with a 10-frame period, and original H263+.
4.2.2 FEC of PR frames
FEC can be applied to the PR frames in a similar fashion to the motion vectors, by using the RSE code across packets. In this case as well, the RSE encoding is applied across the packets of a single frame; however, this time the generated parity data is transmitted as separate RTP packets, as shown in Fig. 4.5. The length of the FEC packets is equal to the maximum length of the 9 data packets.
Fig. 4.5. RSE(n,9) across the packets of a PR frame: the 9 data packets (GOBs 1-9) generate (n-9) FEC packets, which are transmitted as separate packets.
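On the receiver side, the single-parity-packet case (r=1) again illustrates the mechanism: one missing data packet of a PR frame is rebuilt by XORing the parity packet with the eight packets that arrived. The C sketch below assumes all packets are padded to the parity length; recovering up to r losses in the general case requires Reed-Solomon decoding instead.

    #include <stddef.h>
    #include <string.h>

    /* r=1 recovery for one PR frame: if exactly one of the nine data packets
     * is lost and the parity packet arrived, the missing packet is the XOR
     * of the parity with the eight received packets. Returns 1 if a packet
     * was recovered, 0 if nothing was lost, -1 if more than one was lost. */
    int recover_lost_packet(unsigned char *pkt[9], const int received[9],
                            const unsigned char *parity, size_t len)
    {
        int lost = -1;
        for (int i = 0; i < 9; i++) {
            if (!received[i]) {
                if (lost >= 0) return -1;  /* two or more losses: r=1 fails */
                lost = i;
            }
        }
        if (lost < 0) return 0;            /* nothing to do */

        memcpy(pkt[lost], parity, len);
        for (int i = 0; i < 9; i++)
            if (i != lost)
                for (size_t b = 0; b < len; b++)
                    pkt[lost][b] ^= pkt[i][b];
        return 1;
    }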
Fig. 4.6 compares the robustness of the periodic RPS/FEC scheme and intraframe coding for 10% packet loss. The results shown are for Foreman with a frame period of 10, i.e. for periodic RPS there is a PR frame every 10 frames, and for the intracoding scheme there is an intraframe every 10 frames. The RSE(13,9) code was used, giving r=4 parity packets for each PR frame. We observe that the periodic RPS scheme is as effective as intraframe coding in limiting temporal error propagation, while at the same time being more efficient in terms of bit-rate.
Fig. 4.6. Periodic RPS/FEC and intraframe coding for Foreman (QCIF, 12.5 Hz) with frame period = 10 and 4 FEC packets per PR frame: PSNR against frame number for intraframe coding with no loss (84.6 kbps), periodic RPS with no loss (81.8 kbps), periodic RPS with 10% loss, and intraframe coding with 10% loss.
Fig. 4.7. Comparison of periodic RPS/FEC (frame period = 10) and intraframe coding: PSNR against bit-rate for Salesman and Foreman, for periodic RPS with no FEC, r=1 and r=4, and an intraframe every 10 frames.
The amount of loss that the periodic RPS/FEC scheme can tolerate depends on the amount of FEC used. Experimental results, as shown in Fig. 4.7 for a frame period of 10, demonstrate that for the Foreman sequence each extra parity packet per PR frame causes an increase of about 3.6% of the original rate. The corresponding increase is about 4.5% for the Salesman sequence. For Foreman, the use of 4 parity packets (r=4) results in a total bit-rate
similar to that required for intraframe coding, whereas for the Salesman sequence, 4 parity packets result in less than half the equivalent increase in rate of intraframe coding at the same PSNR.
4.3 Robust H.263+ Video Coding
The Periodic RPS-FEC technique presented in this Chapter and the MV-FEC scheme presented in Chapter 3
are now combined and their robustness and efficiency with different amounts of FEC are compared.
Most practical applications require some form of resynchronisation, such as videoconferencing where a receiver joins a session halfway through. We therefore also include resynchronisation in our robust codec in the form of uniform intra-MB replenishment, applied to the PR frames only, where a number of MBs in each PR frame are intracoded. The following schemes, at roughly the same bit-rate, are compared:
• RTP-H.263+: H.263+ packetised with RTP and without any intra replenishment.
• RTP-H.263+ with intraframe: same as the previous scheme but with an intracoded frame every 10 frames to minimise error propagation.
• RPS/FEC/MV with r=1: H.263+ packetised with RTP, together with periodic RPS/FEC with a frame period of 10 as well as FEC of the motion vectors. FEC with r=1 is used on both the motion vectors and the PR frames, and 5 MBs per PR frame are also intracoded.
• RPS/FEC/MV with r=2: same as the previous scheme but with r=2.
• RPS/FEC/MV with r=4: same as the previous scheme but with r=4.
Figs. 4.8 and 4.9 show the results for different packet loss rates. RTP-H.263+ without intra replenishment performs best under error-free conditions but degrades catastrophically with loss. The use of intra replenishment provides slightly better loss performance at the expense of a decrease in coding efficiency, especially for the Salesman sequence. Depending on the amount of FEC used, our robust scheme can provide better coding efficiency than intraframe replenishment as well as greater resilience to packet loss. The image quality degrades gracefully with increasing loss rate, and increasing the amount of FEC provides greater robustness with only a minimal effect on coding efficiency. Typical decoded images obtained with 10% random loss for H.263 without and with intraframe replenishment at 85 kbps are shown in Fig. 4.10. The same images obtained for RPS/FEC/MV with r=4 at 85 kbps with 10 and 30% random loss are shown in Fig. 4.11.
Fig. 4.8. Robustness of the periodic RPS/FEC/MV scheme with different amounts of FEC for Foreman (QCIF, 12.5 Hz, 125 frames): average PSNR against packet loss rate for RTP-H.263+ with no replenishment (84 kbps), RTP-H.263+ with an intraframe every 10 frames (85 kbps), and RPS/FEC/MV with r=1 (88 kbps), r=2 (79 kbps) and r=4 (86 kbps).
Fig. 4.9. Robustness of the periodic RPS/FEC/MV scheme with different amounts of FEC for Salesman (QCIF, 12.5 Hz, 125 frames): average PSNR against packet loss rate for RTP-H.263+ with no replenishment (43 kbps), RTP-H.263+ with an intraframe every 10 frames (42 kbps), and RPS/FEC/MV with r=1 (45 kbps), r=2 (42 kbps) and r=4 (40 kbps).
The proposed robust coding scheme is fully compatible with the H.263+ standard and only requires minimal
changes to the RTP specifications so that FEC packets and the amount of FEC used can be signalled to the decoder.
In a typical multicast application, the scheme can be used in an adaptive fashion where the amount of FEC is varied
at the encoder based on the loss rate received from RTCP reports.
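For example, the encoder could map the fraction-lost field of RTCP receiver reports directly to a number of parity packets per PR frame, as in the C sketch below. The breakpoints are our own illustrative choices, loosely guided by the trends in Figs. 4.8 and 4.9, not values taken from the report.

    /* Map the fraction-lost field of an RTCP receiver report (8-bit fixed
     * point, loss = fraction_lost / 256) to a number of parity packets per
     * PR frame. Breakpoints are illustrative assumptions only. */
    int parity_packets_for_loss(unsigned fraction_lost)
    {
        double loss = fraction_lost / 256.0;
        if (loss < 0.02) return 0;
        if (loss < 0.05) return 1;
        if (loss < 0.10) return 2;
        return 4;
    }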
Fig. 4.10. Frame 65 with H.263 at 85 kbps with 10% loss (a) no replenishment and (b) intraframe every 10 frames.
Fig. 4.11. Frame 65 using RPS/FEC/MV with r=4 at 85 kbps (a) 10 and (b) 30% random packet loss.
4.4 Conclusions
In this chapter, a modified H.263+ codec has been presented that is robust to packet loss. Simulation results
show that acceptable image quality is still possible even with loss rates as high as 30%. The proposed modifications
are compatible with the H.263+ specifications and only require minor changes to the RTP-H.263+ payload format.
Moreover, the robust codec does not rely on any feedback from the network and does not introduce any additional
delays. Therefore it is suitable for real-time interactive video and multicast applications. The modified H.263+
codec has been implemented and integrated into the software videoconferencing tool vic (more details are given in
Appendix A), which can be used for real-time video multicast over the Internet.
Chapter 5
Scalable Wavelet Video Coding
5.1 Introduction
In this section, wavelet video coding is addressed, with particular emphasis on algorithms giving a
continuously scalable bitstream. One of the benefits of continuous scalability is that a single encoded bitstream can
be decoded at any arbitrary rate lower than the encoded bit-rate, which is ideally suited to the heterogeneous nature
of the Internet. It also allows precise dynamic rate control at the encoder and this can be used to minimise
congestion by dynamically matching the encoder output rate to the available network bandwidth. A continuously
scalable bitstream can also be divided into any arbitrary number of layers without any loss in efficiency. Each layer
could then be transmitted as a separate multicast stream using receiver-driven layered multicast. The motion-compensated 2-D wavelet approach is preferred to 3-D wavelet decomposition mainly because of the latency resulting from performing the wavelet decomposition in the temporal domain, which makes the latter unsuitable for interactive applications. Two wavelet-based video codecs are described: a motion-compensated codec using the 2-D SPIHT algorithm for coding the prediction error, and a hybrid H.263/SPIHT codec that codes a fixed base layer with H.263 and a scalable enhancement layer using 2-D SPIHT. Both codecs provide a continuously scalable bitstream in terms of image SNR. The performance of the wavelet codecs is compared to non-layered and layered H.263+.
5.2 SPIHT Still Image Coding
In [Said and Pearlman, 1996], a wavelet-based still image coding algorithm known as set partitioning in
hierarchical trees (SPIHT) is developed that generates a continuously scalable bitstream. This means that a single
encoded bitstream can be used to produce images at various bit-rates and quality, without any drop in compression.
The decoder simply stops decoding when a target rate or reconstruction quality has been reached.
Fig. 5.1. 2-level wavelet decomposition and spatial orientation tree.
In the SPIHT algorithm, the image is first decomposed into a number of subbands using hierarchical wavelet
decomposition. The subbands obtained for a two-level decomposition are shown in Fig. 5.1. The subband
coefficients are then grouped into sets known as spatial-orientation trees, which efficiently exploit the correlation
between the frequency bands. The coefficients in each spatial orientation tree are then progressively coded bit-plane
by bit-plane, starting with the coefficients with highest magnitude and at the lowest pyramid levels. Arithmetic
coding can also be used to give further compression.
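The bit-plane principle can be illustrated by the following highly simplified C++ sketch. It omits the set-partitioning lists that give SPIHT its efficiency (as well as the sign bits) and simply emits one bit per coefficient per plane, most significant planes first, so that decoding can stop after any bit.

#include <cstdlib>
#include <vector>

/* Hedged sketch of progressive bit-plane coding of wavelet coefficients.
   Real SPIHT codes significance jointly over spatial-orientation trees;
   here each coefficient is visited independently, for illustration only. */
void BitPlaneEncode(const std::vector<int>& coeff, std::vector<int>& bits, int maxBits)
{
    int maxMag = 0;
    for (std::size_t i = 0; i < coeff.size(); i++)
        if (std::abs(coeff[i]) > maxMag) maxMag = std::abs(coeff[i]);

    int threshold = 1;                       /* largest power of two <= maxMag */
    while (threshold * 2 <= maxMag) threshold *= 2;

    for (; threshold >= 1; threshold /= 2)   /* most significant plane first */
        for (std::size_t i = 0; i < coeff.size(); i++) {
            bits.push_back((std::abs(coeff[i]) & threshold) ? 1 : 0);
            if ((int)bits.size() >= maxBits) return;   /* exact rate control */
        }
}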
[Figure: PSNR (dB) against bit-rate (bpp) for SPIHT coding of Lenna (256 by 256) with 2, 3, 4 and 5 levels of wavelet decomposition.]
Fig. 5.2. SPIHT coding of Lenna image (binary uncoded).
[Figure: PSNR (dB) against bit-rate (bpp) for SPIHT coding of Lenna (256 by 256) with 5 levels of wavelet decomposition, binary-uncoded vs. arithmetic coding.]
Fig. 5.3. Binary uncoded vs. arithmetic coding.
The results obtained without arithmetic coding, referred to as binary-uncoded, for the grayscale Lenna image (256 by 256) are shown in Fig. 5.2 for different levels of wavelet decomposition. In general, increasing the number of levels gives better compression, although the improvement becomes negligible beyond 5 levels. In practice the number of possible levels can be limited by the image dimensions, since the wavelet decomposition can only be applied to images with even dimensions. The use of arithmetic coding only results in a slight improvement, as shown in Fig. 5.3 for a 5-level decomposition.
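Since each decomposition level halves both image dimensions, and halving requires even dimensions, the maximum usable depth can be computed as in this small sketch (derived from the constraint just stated, not taken from our codec):

/* Hedged sketch: maximum wavelet decomposition depth for a width x height
   image, given that every level requires both dimensions to be even. */
int MaxWaveletLevels(int width, int height)
{
    int levels = 0;
    while (width > 1 && height > 1 && width % 2 == 0 && height % 2 == 0) {
        width /= 2;
        height /= 2;
        levels++;
    }
    return levels;   /* e.g. 256 x 256 gives 8; 176 x 144 (QCIF) gives 4 */
}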
5.3 Wavelet-based Video Coding
Video coding using 3-D wavelet transforms enables SNR, spatial and frame-rate scalability [Taubman and Zakhor, 1994][Kim et al., 1999]. However, the latency resulting from the application of the wavelet transform in the temporal domain makes 3-D wavelets unsuitable for interactive applications with strict delay requirements. Therefore, only the 2-D wavelet transform applied in the spatial domain [Martucci et al., 1997] is considered here. In particular, two wavelet-based video codecs are presented. The first is a motion compensated block-matching video codec where the displaced frame difference (DFD) is coded using 2-D SPIHT. This approach gives a continuously scalable bitstream, resulting in image SNR scalability at the decoder, but suffers from error propagation along the MC prediction loop when reference information is absent at the decoder due to scaling of the bitstream. The second is a hybrid H.263/SPIHT codec where the base layer is coded with H.263 and the reconstructed frame error, i.e. the difference between the original frames and the H.263 coded base layer, is then coded using 2-D SPIHT. This codec is in essence a two-layer codec with a non-scalable base layer. However, the enhancement layer is continuously scalable and the codec does not suffer from error propagation.
5.3.1 Motion Compensated 2D SPIHT Video Codec
A block diagram of the video encoder is shown in Fig. 5.4.
[Figure: block diagram of the encoder. The difference between the input frame and the overlapped MC prediction is wavelet transformed and SPIHT encoded to give the SPIHT bitstream; a SPIHT decoder, inverse wavelet transform and overlapped MC stage (driven by the motion vectors) close the prediction loop.]
Fig. 5.4. 2D SPIHT motion compensated video encoder.
Overlapped block motion compensation as described in the H.263 standard is employed to remove temporal redundancy. Overlapping is necessary in order to remove the blocking artefacts resulting from block matching, which would otherwise adversely affect the performance of the wavelet decomposition. The motion compensated prediction error frame is then coded using the SPIHT algorithm. Intraframes are also coded using SPIHT. The motion vectors are predictively coded and then variable-length coded, exactly as in H.263. Frames can be coded in intra mode or in inter mode. In inter mode, all the macroblocks are predicted, since intra coding of a single macroblock cannot be done efficiently using frame-based wavelet techniques. This greatly simplifies the bitstream syntax compared to H.263 since there is no mode or quantiser information to transmit. Thus, it is assumed that the bitstream for an intercoded frame will consist of 400 bits of header information (approximately the same as for an H.263 bitstream), followed by the motion vector data and the SPIHT coded bits (Fig. 5.5). The number of bits used to encode each frame can be exactly controlled so that an exact desired bit-rate can be achieved. In the following results, the available bit budget is divided equally among all the frames, except for the first frame, which is intra coded.
[Figure: bitstream layout - Header (400 bits) | MV data (variable) | SPIHT bits (continuously scalable).]
Fig. 5.5. Bitstream of intercoded frame.
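A sketch of how such a frame might be assembled under an exact bit budget is given below. The function and structure names are hypothetical; only the layout of Fig. 5.5 and the truncation property of the embedded SPIHT stream are taken from the codec described above.

#include <cstddef>
#include <vector>

/* Hedged sketch: assemble one intercoded frame as in Fig. 5.5. The SPIHT
   bits are simply truncated so that the frame meets its budget exactly;
   the embedded nature of the bitstream makes this truncation safe. */
std::vector<unsigned char> AssembleInterFrame(
    const std::vector<unsigned char>& header,     /* approx. 400 bits */
    const std::vector<unsigned char>& mvData,     /* variable length  */
    const std::vector<unsigned char>& spihtBits,  /* embedded stream  */
    std::size_t frameBudgetBytes)
{
    std::vector<unsigned char> frame;
    frame.insert(frame.end(), header.begin(), header.end());
    frame.insert(frame.end(), mvData.begin(), mvData.end());

    std::size_t room = (frameBudgetBytes > frame.size())
                           ? frameBudgetBytes - frame.size() : 0;
    if (room > spihtBits.size()) room = spihtBits.size();
    frame.insert(frame.end(), spihtBits.begin(), spihtBits.begin() + room);
    return frame;   /* never exceeds frameBudgetBytes */
}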
[Figure: PSNR (dB) against frame number for Foreman (QCIF, 125 frames, 12.5 Hz) at 112 kbps. Curves: H.263 with SAC and overlapping MC; SPIHT with arithmetic coding; SPIHT binary-uncoded.]
Fig. 5.6. Foreman (QCIF, 12.5 Hz, 125 frames) coded with SPIHT and H.263 at 112 kbps.
[Figure: average PSNR (dB) against average bit-rate for coding of Foreman (QCIF, 125 frames, 12.5 Hz). Curves: H.263 with SAC and overlapped MC; SPIHT with arithmetic coding.]
Fig. 5.7. Comparison of MC-SPIHT and H.263.
The results obtained with the MC-SPIHT coder for 125 frames of the Foreman sequence (QCIF, 12.5 Hz) are given in Fig. 5.6. Results for H.263 with overlapped motion compensation and syntax-based arithmetic coding are also given. The first frame is intracoded and all other frames are intercoded. For the 2D SPIHT video coder, the use of arithmetic coding provides about 1 dB improvement. There is a much larger variation in PSNR over the sequence compared to H.263, since the same number of bits is used for each frame. This results in high PSNR for scenes with little or no motion and lower PSNR for scenes with high temporal activity. However, the average performance of SPIHT with arithmetic coding is similar to that of H.263. The compression performance of our SPIHT video coder is compared with that of H.263 over a range of bit-rates in Fig. 5.7. The PSNR values (luminance only) were obtained by averaging over 125 frames of the Foreman sequence. Decoded images at 60 kbps are shown in Fig. 5.8 for subjective comparison.
(a) Frames 20 and 60 with H.263
(b) Frames 20 and 60 with MC-SPIHT
Fig. 5.8. Subjective comparison of H.263 and MC-SPIHT at 60 kbps.
5.3.2 Effects of Scalability on Prediction Loop at Decoder
The advantage of the SPIHT video coding algorithm over conventional DCT-based approaches is that the bitstream generated for each frame is continuously scalable. For simplicity, it is assumed that the encoder packetises all the bits generated for a frame into a single packet, i.e. the encoder outputs one packet per frame (Fig. 5.9). This single packet can then be transmitted/decoded in its entirety, or only an initial portion of it can be transmitted/decoded. The packet could also be partitioned into two or more packets and sent on different channels or
with different priorities. Depending on the application, this task could be accomplished by the sending application
itself, or by some form of intelligent router or transcoder.
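Under the one-packet-per-frame assumption above, such a gateway needs nothing more than a truncation, as in this sketch (the packet layout is hypothetical; the header and motion vector bytes must be kept, and only the embedded SPIHT tail may be cut):

#include <cstddef>
#include <vector>

/* Hedged sketch: scale a one-frame packet from p to x bits by keeping
   only its initial portion. */
void ScaleFramePacket(std::vector<unsigned char>& packet,
                      std::size_t fixedPrefixBytes,  /* header + MV data */
                      std::size_t targetBytes)       /* x bits / 8       */
{
    if (targetBytes < fixedPrefixBytes) targetBytes = fixedPrefixBytes;
    if (packet.size() > targetBytes) packet.resize(targetBytes);
}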
[Figure: an I frame followed by P frames, encoded as one packet per frame of p bits each (Packets 1 to 4); the decoded bitstream keeps only the first x bits of each packet.]
Fig. 5.9. Scalable decoding at x bits/frame of video encoded at p bits/frame.
Thus, the scalable property of the SPIHT algorithm means that a decoder can decode as much or as little of a frame as it can, and still produce something meaningful. From a single encoded bitstream at p bits/frame, a
scaled bitstream can be generated for any bit rate less than p bits/frame. However, for a motion compensated codec,
the situation is complicated by the fact that it is the prediction error frame that is being coded with the SPIHT
algorithm. This error frame is then added to the motion compensated previous frame to give the decoded frame. This
decoded frame is subsequently used as reference for the next frame. If the decoder only partially decodes a frame,
the lack of refinement information at the decoder will create a mismatch in the motion compensation loop.
This mismatch can be removed by performing the motion compensation at the encoder using only partially
decoded information [Shen and Delp, 1999]. For example, if the encoder is coding a sequence at p bits/frame, the motion compensation is done using only the information coded with x bits/frame, where x < p. The rate x is known as the MC loop bit-rate, whereas p is called the encoding bit-rate. Then if the decoder decodes at least x bits/frame, there is no
mismatch in the motion compensation process and the encoded bitstream will be continuously scalable for any rate
between x and p bits/frame. However, this results in a loss in coding performance compared to the non-scaled
approach where the MC bit-rate is equal to the encoding bit-rate, since the motion compensation loop becomes less
efficient.
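The sketch below shows where the MC loop rate enters the encoder: the reference frame is rebuilt from only the first mcLoopBits of each frame's stream, although encodingBits are actually transmitted. The helper functions are trivial stubs standing in for the blocks of Fig. 5.4; only the structure of the loop reflects the scheme of [Shen and Delp, 1999].

#include <vector>

struct Frame { /* pixel data omitted */ };

/* Trivial stubs for the building blocks of Fig. 5.4 (real implementations assumed): */
static Frame MotionCompensate(const Frame&, const Frame&) { return Frame(); }
static Frame Subtract(const Frame&, const Frame&) { return Frame(); }
static Frame Add(const Frame&, const Frame&) { return Frame(); }
static std::vector<unsigned char> SpihtEncode(const Frame&, int) { return std::vector<unsigned char>(); }
static Frame SpihtDecodePrefix(const std::vector<unsigned char>&, int) { return Frame(); }

void EncodeInterFrame(Frame& reference, const Frame& input,
                      int encodingBits, int mcLoopBits)   /* mcLoopBits < encodingBits */
{
    Frame predicted = MotionCompensate(reference, input);
    Frame residual  = Subtract(input, predicted);
    std::vector<unsigned char> bits = SpihtEncode(residual, encodingBits);

    /* Key step: the next reference uses only the first mcLoopBits, so any
       decoder that receives at least mcLoopBits stays perfectly in sync. */
    reference = Add(predicted, SpihtDecodePrefix(bits, mcLoopBits));
}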
Results obtained for the Foreman sequence for the scalable decoding of bitstreams encoded at 0.70 bpp with an MC loop rate of 0.10, 0.20 and 0.35 bpp, respectively, are shown in Fig. 5.10. The results obtained for the non-scaled situation, where the encoder and decoder are perfectly matched, are also given. It can be seen that when the MC loop rate is equal to the scaled decoded rate, the results are exactly the same as for the non-scaled case.
However, when the decoded rate differs from the MC loop rate, there is a drop in performance that increases as the
difference between the MC loop rate and the decoded rate increases. Note that only the first frame is intra-coded and
all the remaining frames are motion-compensated. In practice some form of intra update is necessary for
resynchronisation, especially in an error-prone environment. Results for the case where all the frames are
independently intracoded with SPIHT are also shown for comparison. This is equivalent to our MC-SPIHT coder
with the MC loop rate at 0 bpp.
[Figure: PSNR (dB) against decoded bit-rate (kbps). Curves: SPIHT non-scaled; scalable with encoder MC loop at 0.35 bpp (112 kbps), 0.20 bpp (65 kbps) and 0.10 bpp (33 kbps); SPIHT with all frames intracoded.]
Fig. 5.10. Performance of continuously scalable MC SPIHT video coder.
Our MC-SPIHT video coder generates a continuously scalable bitstream, i.e. a single bitstream encoded at p
kbps can be decoded at any rate less than p. A single encoded bitstream can also be easily divided into an arbitrary
number of separate bitstreams or layers. However, because of the motion compensation process, although a single
encoded bitstream can be decoded at any bit-rate, the decoded video is only optimal in the rate distortion sense when
the decoded bit-rate is equal to the MC loop bit-rate used by the encoder. This is illustrated in Fig. 5.11 where
frames decoded at various bit-rates from a single encoded bitstream are shown.
Fig. 5.11. Frame 60 encoded with MC loop at 60 kbps and decoded at (a) 40 (b) 60 and (c) 120 kbps.
The MC-SPIHT video coder that we have developed gives a compression performance comparable to that of H.263, with the added advantage that our encoder produces a continuously scalable bitstream. The only drawback is that the quality of the decoded images obtained from the scaled bitstream is inferior to that obtained from a non-scaled bitstream at the same rate. It might be possible to improve the decoded image quality for the scaled bitstream at the expense of a reduction in the scalability properties.
5.3.3 Hybrid H.263/SPIHT Video Codec
The hybrid H.263/SPIHT codec is essentially a motion compensated codec with a fixed base layer and a
scalable enhancement layer. The base layer is encoded with H.263. The difference between the original frames and
the encoded base layer frames is then coded with SPIHT to form the enhancement layer. The enhancement frames
are similar to the so-called EI frames used in layered H.263+ since they are not predicted from previous frames in
the enhancement layer (Fig. 5.12). Therefore the enhancement layer is continuously scalable and a single encoded
bitstream can be decoded at any bit-rate without any problems with error propagation.
[Figure: the enhancement layer, coded with SPIHT, consists of EI frames with no prediction between them; the base layer, coded with H.263, consists of an I frame followed by P frames.]
Fig. 5.12. Scalable H.263/SPIHT encoder.
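In outline, each frame of the hybrid codec is produced as sketched below. The helper names are hypothetical stubs; the structure follows Fig. 5.12.

#include <vector>

struct Frame { /* pixel data omitted */ };

/* Trivial stubs for the assumed helpers: */
static Frame H263EncodeDecode(const Frame&, int /* baseKbps */) { return Frame(); }
static Frame Subtract(const Frame&, const Frame&) { return Frame(); }
static std::vector<unsigned char> SpihtEncode(const Frame&, int /* maxBits */)
{ return std::vector<unsigned char>(); }

/* Hedged sketch: one frame of the hybrid H.263/SPIHT codec. */
std::vector<unsigned char> EncodeEnhancementFrame(const Frame& input,
                                                  int baseKbps, int enhBits)
{
    Frame base  = H263EncodeDecode(input, baseKbps);   /* non-scalable base layer   */
    Frame error = Subtract(input, base);               /* no enhancement prediction */
    return SpihtEncode(error, enhBits);                /* continuously scalable EI  */
}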
The bitstream syntax for the SPIHT-coded enhancement frames is exactly the same as for the previous codec,
except that no motion vectors are present. The results obtained for 125 frames of the Foreman sequence (QCIF, 12.5
Hz) for base layer rates of 30, 40 and 61 kbps are shown in Fig. 5.13. As expected, doing the motion compensation
on the base layer only means that a certain amount of redundant information is being coded for each frame in the
enhancement layer, which increases as the enhancement bit-rate increases.
[Figure: PSNR (dB) against total bit-rate (kbps) for the scalable H.263+/SPIHT codec on Foreman (QCIF, 12.5 Hz, 125 frames). Curves: single-layer H.263+; base layer at 30, 40 and 61 kbps.]
Fig. 5.13. Performance of scalable H.263/SPIHT video coder.
5.4 Comparison with Layered H.263+
Annex O of the H.263+ specifications provides for temporal, spatial and SNR scalability. Only SNR scalability will be considered here. Note that the scalability provided by H.263+ is, unlike that of our MC-SPIHT coder, not continuous, but consists of a number of layers encoded at predefined bit-rates. Each layer must be decoded at the specific rate at which it was encoded. Rate-distortion curves for H.263+ with up to three layers are shown in Fig. 5.14. We can see that as the number of layers is increased, there is a drop in PSNR of at least 1.5 dB for each additional layer. Thus H.263+ becomes very inefficient if there are more than 3 or 4 layers. This is not the case for our continuously scalable SPIHT-based video coders since they generate an embedded bitstream. Any number of separate layers can be produced with just a minimal amount of header information as overhead per layer (typically less than 400 bits/frame).
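Producing the layers is then just a matter of slicing the embedded stream, as in this sketch (the content of the per-layer headers is not detailed here):

#include <cstddef>
#include <vector>

/* Hedged sketch: divide one embedded frame bitstream into consecutive slices,
   one per multicast layer; cut points may be chosen arbitrarily and each
   layer refines the ones below it. */
std::vector< std::vector<unsigned char> >
SplitIntoLayers(const std::vector<unsigned char>& embedded,
                const std::vector<std::size_t>& cutPoints)  /* ascending offsets */
{
    std::vector< std::vector<unsigned char> > layers;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= cutPoints.size(); i++) {
        std::size_t end = (i < cutPoints.size()) ? cutPoints[i] : embedded.size();
        layers.push_back(std::vector<unsigned char>(embedded.begin() + start,
                                                    embedded.begin() + end));
        start = end;
    }
    return layers;
}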
[Figure: average PSNR (dB) against total bit-rate (kbps) for layered H.263+ (SNR scalability). Curves: single layer; two layers with base at 41 kbps; three layers with base at 41 kbps and layer 2 at 46 kbps.]
Fig. 5.14. Efficiency of layered H.263+ with up to 3 layers.
In Fig. 5.15, layered H.263+ is compared with our MC-SPIHT and H.263/SPIHT coders for 125 frames of
Foreman (QCIF, 12.5 Hz). For H.263+, results for a 3-layer codec with the base layer at 32 kbps and the second
layer at 65 kbps are shown. For the MC-SPIHT coder, the MC loop rate is 0.1 bpp (33 kbps) whereas for the
H.263/SPIHT coder, the base layer is coded at 31 kbps. It can be seen that layered H.263+ is only marginally better
than MC-SPIHT for bit-rates above 100 kbps. However, a single encoded H.263+ bitstream can only be decoded at
three specific bit-rates, whereas a single MC-SPIHT encoded bitstream can be decoded at any bit-rate. The
performance of layered H.263+ would drop even further if more layers were used.
[Figure: average PSNR (dB) against decoder bit-rate (kbps). Curves: 3-layer H.263+ (SNR scalability); MC-SPIHT coder; H.263/SPIHT coder.]
Fig. 5.15. Comparison of 3-layer H.263+ with MC-SPIHT and H.263/SPIHT.
5.5 Efficiency of SPIHT Algorithm for DFD and Enhancement Images
The SPIHT algorithm is very efficient at coding still images because it exploits the correlation between the
coefficients across the various scales of the wavelet transform to code the position of significant coefficients. It
assumes that if a coefficient at a certain level in the pyramid has a large value, then it is very likely that coefficients
at the same spatial orientation further down in the pyramid will also have large values. This is generally true for
natural still images. However, our experiments have shown that this assumption is not generally true for prediction error
or enhancement images (Fig. 5.16).
It is possible that more efficient compression of prediction error or enhancement images can be achieved by
modifying the SPIHT algorithm so that the statistics of the wavelet coefficients distribution are better exploited.
[Figure: significant coefficients generated by SPIHT coding of (a) intra, (b) DFD and (c) enhancement error images, showing the wavelet decomposition and the first, second and third dominant passes. The pass thresholds were 512, 256 and 128 for the intra image, 64, 32 and 16 for the DFD, and 32, 16 and 8 for the enhancement image.]
Fig. 5.16. Validity of hierarchical tree assumption in SPIHT algorithm.
5.6 Conclusions
In this chapter, the work on the development of wavelet-based continuously scalable video codecs has been presented. Two types of motion compensated codec are described: an MC 2D-SPIHT codec and a layered H.263/SPIHT codec. Both produce a bitstream that is continuously scalable in terms of SNR. When the scalability property is not used, the compression performance of the MC 2D-SPIHT codec is similar to that of H.263. However, when the encoded bitstream is scaled at the decoder, i.e. decoded at a lower bit-rate than that at which it was encoded, the motion compensation causes a drift at the decoder, i.e. the encoder and decoder are no longer synchronised. This is because some refinement information used by the encoder for the motion compensation will be missing at the decoder. This drift can be avoided by performing the motion compensation at the encoder using only partially decoded information, but this results in a loss in compression efficiency. Our continuously scalable codecs are also compared with layered H.263+ and are shown to outperform the latter when a large number of layers is used.
Chapter 6
Summary and Future Work
6.1 Summary of Work Done
The main aims of the JAVIC project from the video coding point of view were to investigate ways of
improving the robustness of existing video codecs to packet loss and to develop novel scalable wavelet-based
codecs suitable for the heterogeneous nature of the Internet. We believe that these goals have been largely achieved
through the work that was carried out for the duration of the project.
Initially, we reviewed the general area of video coding for the Internet, especially for videoconferencing
applications, and this is summarised in Chapter 1. Then, we investigated the robustness of H.263 video to packet loss
when packetised according to the latest RTP specifications, as described in Chapter 2. The main problem lies with
the motion compensation process, which causes the temporal propagation of errors caused by lost packets. One way
of stopping this error propagation is to code macroblocks without any reference to the previously coded blocks, and
two intra-replenishment schemes are compared – intra-frame and uniform intra-macroblock replenishment. It is also
shown that considerable improvement is possible if a feedback channel is available. In general, for multicast
situations, it is not practical to have a feedback channel.
We then investigated other ways of minimising temporal error propagation (Chapters 3 and 4). It was found that the correct decoding of the motion information contributes considerably to the quality of the decoded images, especially since it represents a relatively small fraction of the overall bit-rate. Therefore a scheme based on forward error correction of the motion information only (MV-FEC) was proposed, which greatly improves the robustness to loss with only a minimal increase in bit-rate. A technique using the Reference Picture Selection mode of H.263+ was also proposed to minimise error propagation, whereby some frames known as Periodic Reference (PR) frames are inserted at regular intervals in a sequence. PR frames are predicted using only the previously coded PR frames. It is shown that PR frames protected with FEC (PR-FEC) can be as effective as intraframe coding at stopping error propagation, but are less costly in terms of bit-rate. The two proposed schemes, MV-FEC and PR-FEC, were then combined to give an H.263+ compatible coder that is robust to packet loss. Our coder was subsequently integrated into the video conferencing tool vic (as described in Appendix A).
The remainder of the work concentrated on the development of wavelet-based video codecs (Chapter 5). The interesting property of wavelet coding is that it can generate an embedded bitstream that is continuously scalable. Two types of block motion-compensated codec using the SPIHT wavelet coding algorithm were proposed - MC-SPIHT and H.263/SPIHT. In MC-SPIHT, the frame prediction error is coded using SPIHT, whereas in H.263/SPIHT, the base layer is coded with H.263, and the enhancement layer is then coded with SPIHT. The compression performance of non-scalable MC-SPIHT is comparable to that of H.263. Since the prediction error is coded with SPIHT, the bitstream for each frame is continuously scalable, i.e. decoding of a frame can stop anywhere along the bitstream. However, this results in error propagation since prediction information for decoding the next frame will be unavailable. This error propagation can be avoided by performing the motion compensation at the encoder at a lower bit-rate than the encoding bit-rate. The resulting bitstream will then be continuously scalable beyond the motion compensation rate, without any error propagation due to drift. However, this reduces the compression efficiency. The H.263/SPIHT coder generates a non-scalable base layer and a continuously scalable enhancement layer without any problem with drift, but its compression efficiency is slightly worse than that of MC-SPIHT. The wavelet codecs are compared with layered H.263+.
It is worth noting that our continuously scalable codecs are fundamentally different from typical layered codecs such as layered H.263+. Layered codecs generate a number of layers, which can only be decoded at the specific rates at which they were encoded. In comparison, continuously scalable codecs generate a single bitstream that can be divided into any arbitrary number of layers at any arbitrary bit-rate. A single encoded bitstream can also be scaled down to any suitable bit-rate either at the decoder or at any suitable point in a network, such as an intelligent router
or gateway. This can be very useful for congestion avoidance mechanisms using dynamic rate control or for
multicast transmission over heterogeneous networks where a single stream can be decoded at different bit-rates
according to the available bandwidth.
6.2 Potential Future Work
A lot of work remains to be done in the area of continuously scalable video codecs. The SPIHT algorithm
was designed for coding still images or intraframes and is not very efficient at coding prediction error frames
(although it gives similar performance to the DCT). It is possible that the algorithm can be modified to take into
consideration the characteristics of prediction error frames. In addition to improving the compression performance
of our scalable wavelet-based video codec, other issues such as robust packetisation and efficient intra-refresh
mechanisms must also be addressed. The work presented here assumes that each frame is packetised into a single
packet, which may not always be possible due to packet size constraints, or desirable due to loss robustness
considerations. However, as far as the JAVIC project is concerned, time constraints did not allow us to fully
investigate all these issues.
Appendix A
Codec Integration into vic
The robust codec that we developed was integrated into the video conferencing tool vic. Note that the
integration was originally meant to be carried out by UCL but was in the end done by us. A fully functional version
of vic with our robust H.263+ codec is now available and can be used for videoconferencing over the Internet, for
both unicast and multicast situations. The specifications for our codec are given next. In order to enable real-time
operation of our codec inside vic, we had to optimise the original source code and these changes are also described
here. The interface between our codec and vic is also provided for future reference.
A.1 Video Codec Specifications
The codec that we integrated into vic is essentially an H.263+ codec with the modifications described in
Chapters 3 and 4 of this document. However most of the H.263+ coding options have not been tested yet. The
current specifications for our codec are:
• QCIF and CIF images are supported
• Two packetisation modes are available: 1 GOB per packet or 3 GOBs per packet. Since the maximum size
of a packet is 1500 bytes, this restricts the overall encoding bit-rate.
• The use of PR-FEC and MV-FEC can be turned on or off.
• If PR-FEC is used, the frequency of PR frames and the number of FEC packets/frame can be specified.
• If MV-FEC is used, the number of FEC packets for the motion information can be specified.
• Periodic intraframe coding or uniform intra-macroblock replenishment can be used if PR-FEC and MV-FEC are off.
• Only uniform intra-macroblock replenishment can be used if either PR-FEC or MV-FEC or both are on.
In the current version of our codec, all the above coding options must be specified at compile-time in a
header file and cannot be changed during a session.
A minor modification to the RTP-H.263+ payload header is also required at the encoder. The 5th bit of the 5-bit Reserved field (RR) at the beginning of the payload header is set whenever a packet belongs to a PR frame that has been protected with FEC. If FEC is not used, then this modification is not required.
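Assuming the RR field occupies the five most significant bits of the first payload-header byte, as in RFC 2429, its 5th (least significant) bit corresponds to the mask 0x08; the sketch below sets it. The bit position is our reading of the layout and should be checked against the payload specification.

/* Hedged sketch: flag a packet as belonging to an FEC-protected PR frame by
   setting the 5th bit of the 5-bit RR field, assumed to be the top five bits
   of the first payload-header byte (RFC 2429 layout). */
void MarkPrFecPacket(unsigned char *payloadHeader)
{
    payloadHeader[0] |= 0x08;   /* bit 3 of byte 0 = 5th bit of RR */
}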
A.2 Encoder Optimisation
Our video encoder is based on the public domain H.263+ source code from the University of British
Columbia (UBC). The original source code for version 3.2 of the H.263+ encoder runs at less than 5 fps for QCIF
images on a Pentium 2 233 MHz PC with 64 MB RAM and running Windows NT 4.0, even with the fast search
motion-estimation algorithm. The encoder was optimised to run faster by making the following changes:
• Using a fast DCT and inverse DCT with only integer arithmetic.
• Keeping memory allocation and de-allocation to a minimum by allocating memory once for frequently used
variables.
• Minimising the amount of data that is actually saved to disk or displayed on the screen.
• Loop optimisation by avoiding recalculation of unchanging variables.
After these changes, with the codec running from within vic, simultaneous encoding and decoding can be
performed at a maximum speed of 12 fps for QCIF images, and 5 fps for CIF on a Pentium 2 233 MHz PC with 64
MB RAM and running Windows NT 4.0.
The implementation of the periodic reference frame scheme and the use of FEC required substantial changes
to the code. Packet buffering at both the encoder and decoder was added so that FEC could be applied.
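The exact FEC code used is not detailed in this appendix; as an indication of what the packet buffering enables, the sketch below computes a single XOR parity packet over a buffered group, in the spirit of the erasure codes of [McAuley, 1990] and [Rizzo, 1997]. Such a parity packet allows any one lost packet in the group to be recovered.

#include <cstddef>
#include <vector>

/* Hedged sketch: build one XOR parity packet over a group of buffered data
   packets; illustrative only, not the exact FEC of our codec. */
std::vector<unsigned char>
MakeParityPacket(const std::vector< std::vector<unsigned char> >& group)
{
    std::size_t maxLen = 0;
    for (std::size_t i = 0; i < group.size(); i++)
        if (group[i].size() > maxLen) maxLen = group[i].size();

    std::vector<unsigned char> parity(maxLen, 0);   /* short packets padded with 0 */
    for (std::size_t i = 0; i < group.size(); i++)
        for (std::size_t j = 0; j < group[i].size(); j++)
            parity[j] ^= group[i][j];
    return parity;
}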
A.3 Interface between Codec and vic
Video Encoder
When the video encoder is required in vic, i.e. when the user wants to transmit video, an object of type
H263Encoder is created. The H263Encoder( ) constructor allocates memory required for frame storage during
encoding. When a frame is ready to be encoded, H263Encoder::consume(VideoFrame *) is called and the frame to
be encoded is passed to the encoder as a VideoFrame * (as declared in vic source). The compressed bitstream is
packetised and each individual packet is assembled in a pktbuf structure (as defined in vic) and is transmitted by
calling Transmitter::send(pktbuf *).
class TransmitterModule;   /* as defined in vic source */

class H263EncoderFunctions {
public:
    H263EncoderFunctions( );
    ~H263EncoderFunctions( );
    :
    /* contains functions and variables used by H263Encoder */
};

class H263Encoder : public H263EncoderFunctions, public TransmitterModule {
public:
    int consume(const VideoFrame *)
    {
        :
        H263EncodeFrame(unsigned char *);
        :
    }
    int command(int argc, const char*const* argv);
protected:
    void InitializeH263Encoder( );
    int H263EncodeFrame(unsigned char *);
    /* encodes a single frame contained in unsigned char * and outputs data      */
    /* packets using Transmitter::send(pktbuf *pb); performs intra, inter or     */
    /* enhancement frame encoding depending on the value of state variables when */
    /* the function is called                                                    */
    H263Encoder( );
    /* is passed some coding options and modes as parameters; initialises some   */
    /* variables and allocates memory for the buffers required to store previous */
    /* reconstructed frames, depending on the coding mode used                   */
    ~H263Encoder( );
    /* Declaration of coding option and state variables, e.g. quantisers, rate   */
    /* control, etc.                                                             */
    /* Declaration of buffers for previous reconstructed frames and for base and */
    /* enhancement layers                                                        */
};
Video Decoder
Whenever vic is started, an object of type H263Decoder is created. Whenever an RTP packet with the
specified payload is received, it is passed to the decoder by calling H263Decoder::recv(const rtphdr *rh, const
unsigned char *bp, int cc), where rh is a pointer to the RTP header, bp is a pointer to the RTP payload and cc is the
length in bytes of the payload. The packet may either be decoded straight away and the decoded MB stored in a
frame buffer, or the packet may be buffered for decoding later, e.g. if FEC is being used. Whenever a complete
frame has been decoded, it is passed back to vic for display by calling Decoder::redraw(unsigned char *frame)
where frame is a pointer to the decoded frame.
class Decoder;   /* as defined in vic source */

class H263DecoderFunctions {
public:
    H263DecoderFunctions( );
    ~H263DecoderFunctions( );
    :
    /* contains functions and variables used by H263Decoder */
};

class H263Decoder : public H263DecoderFunctions, public Decoder {
public:
    H263Decoder( );
    ~H263Decoder( );
    void recv(const rtphdr *rh, const unsigned char *bp, int cc);
    /* decodes a single packet and stores the decoded MBs in a frame buffer; */
    /* when a complete frame has been decoded, it is displayed by calling    */
    /* Decoder::redraw(unsigned char *)                                      */
private:
    :
};
Appendix B
Colour Images
(a) Original frames 20 and 60
(b) Coded frames 20 and 60.
Fig. 2.5. Foreman sequence coded with H.263 at 60 kbps.
Fig. 2.6. Frame 20 and 60 of Foreman with RTP-H.263, packet loss rate = 10% and temporal prediction.
Fig. 2.9. Frame 20 and 60 with uniform replenishment of 5 intraMB/frame.
Fig. 2.12. Frames 20 and 60 for same-MB replenishment with 4 frame delay.
Fig. 3.3. Frames 20 and 60 with redundant motion vector packetisation
for 10% random packet losses.
Fig. 3.12. Frame 20 and 60 for r=2 with 10% random packet loss.
Fig. 4.10. Frame 65 with H.263 at 85 kbps with 10% loss (a) no replenishment and (b) intraframe every 10 frames.
Fig. 4.11. Frame 65 using RPS/FEC/MV with r=4 at 85 kbps (a) 10 and (b) 30% random packet loss.
(a) Frames 20 and 60 with H.263
(b) Frames 20 and 60 with MC-SPIHT
Fig. 5.8. Subjective comparison of H.263 and MC-SPIHT at 60 kbps.
Fig. 5.11. Frame 60 encoded with MC loop at 60 kbps and decoded at (a) 40 (b) 60 and (c) 120 kbps.
References
Amir E., McCanne S. and Zhang H., “An application-level video gateway”, Proc. ACM Multimedia ’95, San
Francisco, Nov. 1995.
Bolot J.-C. and Turletti T., "Adaptive error control for packet video in the Internet," Proc. ICIP 96, Lausanne, Sept.
1996.
Bolot J.-C. and Turletti T., “A rate control mechanism for packet video in the Internet”, Proc. IEEE Infocom ’94:
The conference on computer communications - networking for global communications, Toronto, Canada, vol. 13, ch. 75, pp. 1216-1223, 1994.
Bolot J.-C. and Turletti T., “Adaptive error control for packet video in the Internet”, Proc. ICIP ’96, Lausanne,
Sept. 1996.
Bolot J.-C. and Vega-Garcia A., "The case for FEC-based error control for packet audio in the Internet", ACM Multimedia Systems, ??.
Bolot J.-C., “Characterising end-to-end packet delay and loss in the Internet”, Journal of high-speed networks, vol.
2, no. 3, pp. 305-323, Dec. 1993.
Bormann C., Cline L., Deisher G., Gardos T., Maciocco C., Newell D., Ott J., Sullivan G., Wenger S. and Zhu C.,
"RTP payload format for the 1998 version of ITU-T Rec. H.263 video (H.263+)," RFC 2429, Oct. 1998.
Boyce J.M. and Gaglianello R.D., "Packet loss effects on MPEG video sent over the public Internet," ACM Multimedia
’98, Bristol, U.K., 1998.
Brady P. T., "Effects of transmission delay on conversational behaviour on echo-free telephone circuits", Bell Syst.
Tech. Journal, Jan. 1971, pp. 115-134.
Chen M.-J., Chen L.-G. and Weng R.-M., "Error concealment of lost motion vectors with overlapped motion
compensation," IEEE Trans. Circuits and Systems for Video Tech., vol. 7, no. 3, pp.560-563, June 1997.
Clarke R. J., Digital compression of still images and video, Academic Press, 1995.
Comer D., Internetworking with TCP/IP, Prentice Hall, 1995.
Côté G., Erol B., Gallant M. and Kossentini F., "H.263+: Video coding at low bit rates," IEEE Trans. Circuits Syst.
Video Technol., vol. 8, no. 7, pp. 849-866, Nov. 1998.
Deering S., Host extension for IP multicasting, RFC 1112, Aug. 1989.
Eriksson H., “MBone: The multicast backbone”, Comm. of the ACM, vol. 37, no. 8, pp. 54-60, Aug. 1994.
Frederick R., “Experiences with real-time software video compression”, Proc. 6th Int. Workshop Packet Video,
Portland, OR, Sept. 1994.
Ghanbari M. and Seferidis V., "Cell-loss concealment in ATM video codecs," IEEE Trans. Circuits Sys. Video
Technol., vol. 3, no. 3, pp.238-247, June 1993.
Ghanbari M., “Two-layer coding of video signals for VBR networks”, IEEE J. Select. Areas Commun., vol. 7, no. 5,
pp. 771-781, June 1989.
Girod B. and Farber N., "Feedback-based error control for mobile video transmission," Proc. IEEE, vol.87, no.10,
pp.1707-1723, Oct. 1999.
Girod B., Färber N. and Steinbach E., "Error-resilient coding for H.263," in Insights into Mobile Multimedia
Communications, D.R. Bull, C.N. Canagarajah and A.R. Nix, Eds., pp. 445-459, Academic Press, U.K., 1999.
Girod B., Steinbach E. and Farber N., "Performance of the H.263 video compression standard," J. VLSI Signal
Processing: Syst. for Signal, Image, and Video Technol., vol.17, no.2-3, pp.101-111, 1997.
Goode B., "Scanning the special issue on global information infrastructure", Proc. IEEE, vol. 85, no. 12, pp. 1883-1886, Dec. 1997.
Hardman V., Sasse M. A., Handley M. and Watson A., “Reliable audio for use over the Internet”, Proc. INET ’95,
International Networking Conference, Honolulu, Hawaii, pp. 171-178, June 1995.
Karlsson G., "Asynchronous transfer of video", IEEE Commun. Mag., pp. 118-126, Aug. 1996.
Kim B.J., Xiong Z.X., Pearlman W.A. and Kim Y.S., "Progressive video coding for noisy channels," J. Visual
Commun. and Image Repres., vol. 10, no. 2, pp.173-185, 1999.
Lam W.-M., Reibman A.R. and Lin B., "Recovery of lost or erroneously received motion vectors," Proc. ICASSP,
vol. 5, pp.417-420, April 1993.
Liou M., “Overview of the p x 64 kbits/s video coding standard”, Commun. of the ACM, vol. 34, no. 4, pp. 60-63,
April 1991.
Martins F.C.M. and Gardos T.R., "Efficient receiver-driven layered video multicast using H.263+ SNR scalability,"
Proc. ICIP ’98, Chicago, IL., 1998.
Martucci S.A., Sodagar I., Chiang T. and Zhang Y.-Q., "A zerotree wavelet video coder," IEEE Trans. Circuits
Syst. Video Technol., vol. 7, no. 1, pp. 109-118, Feb. 1997.
McAuley A.J., "Reliable broadband communications using a burst erasure correcting code," ACM SIGCOMM 90,
September 1990.
McCanne S. and Jacobson V., “vic: A flexible framework for packet video”, Proc. ACM Multimedia, San Francisco,
CA, pp. 511-522, Nov. 1995.
McCanne S., Vetterli M. and Jacobson V., "Low-complexity video coding for receiver-driven layered multicast,"
IEEE J. Select. Areas Commun., vol. 16, no. 6, pp. 983-1001, Aug. 1997.
Nonnenmacher J., Biersack E.W. and Towsley D., "Parity-based loss recovery for reliable multicast transmission,"
IEEE/ACM Trans. Networking, vol. 6, no. 4, pp. 349-361, Aug. 1998.
Paxson V., "End-to-end internet packet dynamics," IEEE/ACM Trans. Networking, vol. 7, no. 3, pp. 277-292, June
1999.
Pejhan S., Schwartz M. and Anastassiou D., “Error control using retransmission schemes in multicast transport
protocols for real-time media”, IEEE/ACM Trans. on Networking, vol. 4. no. 3, pp. 413-427, June 1996.
Rhee I., "Error control techniques for interactive low-bit rate video transmission over the Internet," Proc.
SIGCOMM ’98, Vancouver, Sept. 1998.
Rijkse K., “H.263: Video coding for low-bit-rate communication”, IEEE Comm. Mag., pp. 42-45, Dec. 1996.
Rizzo L., "Effective erasure codes for reliable computer communication protocols," ACM Computer Commun.
Review, vol. 27, no. 2, pp. 24-36, April 1997.
Said A. and Pearlman W.A., "A new, fast and efficient image codec based on set partitioning in hierarchical trees,"
IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243-250, June 1996.
Schulzrinne H., Casner S., Frederick R. and Jacobson V., RTP: A transport protocol for real-time applications, RFC
1889, Jan. 1996.
Schulzrinne H., RTP profile for audio and video conferences with minimal control, RFC 1890, Jan. 1996.
Shacham N. and McKenney P., “Packet recovery in high-speed networks using coding and buffer management”,
Proc. IEEE Infocom ’90, San Francisco, CA, pp. 124-131, May 1990.
Shen K. and Delp E.J., "Wavelet based rate scalable video compression," IEEE Trans. Circuits Syst. Video
Technol., vol. 9, no. 1, pp. 109-122, Feb. 1999.
Sikora T., “MPEG digital video coding standards”, IEEE Signal Proc. Mag., pp. 82-100, Sept. 1997.
Steinbach E., Farber N. and Girod B., "Standard compatible extension of H.263 for robust video transmission in
mobile environments," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 6, pp. 872-881, Dec. 1997.
Steinmetz R., “Human perception of jitter and media synchronisation”, IEEE J. Select. Areas Commun., vol. 14, no.
1, Jan. 1996, pp. 61-72.
Taubman D. and Zakhor A., "Multirate 3-D subband coding of video," IEEE Trans. Image Processing, vol. 3, pp.
572-588, Sept. 1994.
Turletti T. and Huitema C., "Videoconferencing on the Internet", IEEE/ACM Trans. Networking, vol. 4, no. 3, pp.
340-351, June 1996.
Turletti T. and Huitema C., RTP payload format for H.261 video streams, RFC 2032, Oct. 1996.
Wallace J. K., “The JPEG still picture compression standard”, Commun. of the ACM, vol. 34, no. 4, pp. 31-44, April
1991.
Wang Y. and Zhu Q.F., "Error control and concealment for video communication: A review," Proc. IEEE, vol. 86,
no. 5, pp. 974-997, 1998.
Wenger S., "Video redundancy coding in H.263+," Proc. AVSPN 97, Aberdeen, U.K., 1997.
Wenger S., Knorr G., Ott J., and Kossentini F., "Error resilience support in H.263+," IEEE Trans. Circuits Syst.
Video Technol., vol. 8, no. 7, pp. 867-877, Nov. 1998.
White P.P. and Crowcroft J., “The integrated services in the Internet: State of the art”, Proc. IEEE, vol. 85, no. 12,
pp. 1934-1946, Dec. 1997.
Willebeek-LeMair M. H. and Shae Z.-Y., “Videoconferencing over packet-based networks”, IEEE J. Select. Areas
Commun., vol. 15, no. 6, pp. 1101-1114, Aug. 1997.
Yajnik M., Kurose J. and Towsley D., "Packet loss correlation in the MBone multicast networks," IEEE Globecom
’96, London, U.K., 1996.
Zhu C., RTP payload format for H.263 video streams, RFC 2190, Sept. 1997.