When networks are used to deliver multimedia, timing becomes an issue. As IP networks were developed and deployed for data applications, timing seemed relatively unimportant: getting data to the destination and preserving its integrity was paramount. That began to change when IP started to transport voice (VoIP). Suddenly, we became concerned about pauses between speakers that would make the participants feel uncomfortable. Today, there are a variety of ways we manage timing for IP-transported information.
One of the earliest techniques was the Network Time Protocol (NTP), which was designed to synchronize clocks across general internet data traffic. After several revisions, the protocol could boast accuracy within a few milliseconds among a group of devices. The industries delivering voice (telecommunications) and video (broadcasters) viewed that figure unfavorably, since they were accustomed to the tighter timing of leased connections and satellite transmission.
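For a feel of how simple an NTP exchange is, here is a minimal SNTP client sketch in Python. The pool server name is an assumption; any reachable NTP server will answer, and a real client averages several exchanges to approach the millisecond-level accuracy described above.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between the NTP (1900) and Unix (1970) epochs

def ntp_time(server: str = "pool.ntp.org", port: int = 123, timeout: float = 2.0) -> int:
    """Send one SNTP client request and return the server's Unix time."""
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client);
    # the remaining 47 bytes of the request may be zero.
    request = b"\x1b" + 47 * b"\x00"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, port))
        reply, _ = sock.recvfrom(512)
    # The server's Transmit Timestamp (seconds field) starts at byte 40 of the reply.
    seconds = struct.unpack("!I", reply[40:44])[0]
    return seconds - NTP_EPOCH_OFFSET

print("Server time:", time.ctime(ntp_time()))
```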
In VoIP, timing is nearly always based on the Real-time Transport Protocol (RTP), which is also the primary technique used in audio- and videoconferencing. In RTP, each source is identified by a numeric code, called either a synchronization source (SSRC) or a contributing source (CSRC). The synchronization source provides a time stamp that is placed in the RTP header. In other words, the clock for an RTP transmission is based on the source of the audio or video. The primary purpose of this time stamp is to tell the receiver the exact time the data is to be presented (played) to the listener or viewer. Normally, network devices such as switches and routers will not tamper with the time stamp; the sender simply creates it and the receiver uses it. Voice and conferencing commonly use UDP at layer four, and UDP does not provide a time stamp. That is why RTP is needed.
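To make the header layout concrete, here is a minimal sketch in Python that parses the 12-byte fixed RTP header defined in RFC 3550, pulling out the time stamp along with the SSRC and any CSRC identifiers. It is illustrative only and assumes you already have the raw UDP payload in hand.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header (RFC 3550) plus any CSRC list."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        # Media-clock units, e.g. 1/8000 second for G.711 telephone audio.
        "timestamp": ts,
        "ssrc": ssrc,  # synchronization source identifier
        # Contributing sources, 4 bytes each, follow the fixed header.
        "csrcs": [
            struct.unpack("!I", packet[12 + 4 * i : 16 + 4 * i])[0]
            for i in range(csrc_count)
        ],
    }
```

Note that the time stamp counts ticks of the media clock, not wall-clock time; the receiver uses it to schedule playout and to smooth network jitter.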
When TCP is used for transport, as in adaptive bit-rate video, there is a provision for a time stamp to be inserted in the TCP header. However, both the sender and receiver must negotiate its use during session establishment, that is, during the three-way handshake. This time stamp has become more prevalent as networks exhibit more delay than in the past. Yet its use in a session depends on whether the TCP software in each station supports it, and the application software requesting the creation of the session must also ask for it to be used.
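As a sketch, the Timestamps option (kind 8 in RFC 7323) can be recognized by walking the TCP options field byte by byte. The parser below is illustrative and assumes you already have the raw options bytes, for example from a packet capture; applications normally never see this field, since the operating system negotiates the option automatically.

```python
import struct

TCP_OPT_END = 0        # end-of-option-list
TCP_OPT_NOP = 1        # no-operation (padding)
TCP_OPT_TIMESTAMP = 8  # Timestamps option, RFC 7323; total length is 10 bytes

def find_tcp_timestamps(options: bytes):
    """Return (TSval, TSecr) if the Timestamps option is present, else None."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_END:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option list
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCP_OPT_TIMESTAMP and length == 10:
            # TSval is the sender's clock; TSecr echoes the peer's last TSval.
            return struct.unpack("!II", options[i + 2 : i + 10])
        i += length
    return None
```

The option is only used when both SYN segments of the handshake carry it, which is why the TCP software in both stations must support it.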
One of the most popular mechanisms for transporting audio or video is the MPEG transport stream format (MPEG-TS). In this packet arrangement, audio and video data are carried in fixed 188-byte transport packets, and an IP packet generally carries seven of them. Each elementary stream carries Presentation Time Stamps (PTS). So, if a transport stream is carrying one video signal and two audio signals, the IP packets will contain PTS values corresponding to each of the three sources. Each PTS indicates the clock by which the sounds and video can be remixed into a synchronized output stream for the viewer.
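The sketch below, again illustrative rather than production code, splits a UDP payload into its 188-byte transport packets and extracts the 33-bit PTS from packets that begin a PES (Packetized Elementary Stream) packet. It deliberately ignores scrambling, program tables, and the related decode time stamp (DTS).

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def iter_ts_packets(udp_payload: bytes):
    """Split a UDP payload (typically 7 x 188 bytes) into transport packets."""
    for i in range(0, len(udp_payload) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        yield udp_payload[i : i + TS_PACKET_SIZE]

def extract_pts(ts_packet: bytes):
    """Return (pid, pts) if this packet starts a PES packet with a PTS, else None."""
    if len(ts_packet) != TS_PACKET_SIZE or ts_packet[0] != SYNC_BYTE:
        return None
    payload_unit_start = bool(ts_packet[1] & 0x40)
    pid = ((ts_packet[1] & 0x1F) << 8) | ts_packet[2]  # identifies the stream
    adaptation_control = (ts_packet[3] >> 4) & 0x03
    if not payload_unit_start or adaptation_control in (0, 2):
        return None  # no payload here, or not the start of a PES packet
    offset = 4
    if adaptation_control == 3:  # an adaptation field precedes the payload
        offset += 1 + ts_packet[4]
    pes = ts_packet[offset:]
    # Every PES packet begins with the start-code prefix 00 00 01.
    if len(pes) < 14 or pes[:3] != b"\x00\x00\x01":
        return None
    if (pes[7] >> 6) & 0x03 not in (2, 3):
        return None  # no PTS present in this PES header
    p = pes[9:14]
    # The 33-bit PTS (90 kHz ticks) is spread over 5 bytes with marker bits between.
    pts = (
        (((p[0] >> 1) & 0x07) << 30)
        | (p[1] << 22)
        | ((p[2] >> 1) << 15)
        | (p[3] << 7)
        | (p[4] >> 1)
    )
    return pid, pts
```

A receiver regroups packets by their packet identifier (PID) and compares each stream's PTS values against a shared program clock to line the outputs up.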
The choice of timing method seems to be based, to a great extent, on the segment of the industry that develops the products that will stream the audio or video. There is an engineering concern as well: some of these methods are more accurate than others. Finally, we should mention that the issue of synchronizing a group of output devices based on a common clock has not been considered here. What we have focused on in this article is the synchronization between one sender and one receiver.