So long, wordpress

I no longer use wordpress to host this blog. Instead, I compile it to static files using chronicle. It’s still at the same address (blog.corelatus.com), though, so the only way to see this post is to stumble on it via a search engine.

Audio power levels on E1/T1 timeslots: the digital milliwatt

Sometimes, you want to know when the audio on an E1/T1 timeslot has gotten louder than some limit. In a voice mail application, that’s useful for catching mistakes such as a subscriber leaving a message but then not hanging up the phone properly—you don’t want to record hours and hours of silence. In an IVR application, you might want to keep an eye on the audio level so that a frustrated (shouting!) subscriber can be forwarded to a human operator.

GTH provides a “level detector” to do that sort of thing. You start a level detector on a timeslot, give it a loudness threshold, and it’ll notify you whenever the audio on the timeslot goes over that threshold. Here’s an example command which notifies you if the power on timeslot 13 of an E1/T1 is louder than -20dBm0:


<new><level_detector threshold='-20'>
<pcm_source span='2A' timeslot='13'/>
</level_detector></new>

The algorithm is:


Take a 100ms block of audio (800 samples)
Square all the samples
Sum the squares
If the sum exceeds the loudness threshold, send an XML event

There are some details to worry about.

The digital milliwatt

The threshold, e.g. -20 in the example above, has to be relative to something. The standard reference power level in telecommunications is the milliwatt. ITU-T G.711, tables 5 and 6, define the sequences which represent a milliwatt:

A-law: 34 21 21 34 b4 a1 a1 b4
μ-law: 1e 0b 0b 1e 9e 8b 8b 9e

Here’s what a few periods of the digital milliwatt look like:

[Figure: a few periods of the digital milliwatt, as linear samples]

In this post, the unit ‘dBm0’ means power, in dB relative to the digital milliwatt, as defined by the sequences above. If you have no idea what dB means, wikipedia has a decent article.

What’s the loudest possible sound on a timeslot?

170 (0xAA) is the A-law value with the largest amplitude. It corresponds to linear 4032. A signal which stays at that amplitude, e.g. a full-scale square wave, is about 6dB louder than the digital milliwatt.

85 (0x55) is the A-law value with the smallest amplitude. It corresponds to linear -1. A signal at that amplitude is about 66dB softer than the digital milliwatt.

The range -66dBm0…+6dBm0 sets an upper bound on the range of power on a timeslot. Then there are other things which further limit the practical range, so you’re unlikely to actually use a +6dBm0 threshold in practice, but it’s there if you want it.
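Here's a minimal Python sketch of the level calculation described earlier (square the samples, sum them, compare to a threshold), which also checks the range limits above. It is not the GTH's implementation: the function names are mine, and normalising to mean power and converting to dB relative to the G.711 A-law sequence is my assumption about how the threshold is applied.

```python
import math

def alaw_to_linear(code):
    """Decode one A-law octet to a linear sample on the 13-bit scale (G.711)."""
    code ^= 0x55                          # undo the even-bit inversion
    mantissa = (code & 0x0F) << 4
    segment = (code & 0x70) >> 4
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    magnitude >>= 3                       # 16-bit scale -> 13-bit scale
    return magnitude if code & 0x80 else -magnitude

# One period of the A-law digital milliwatt: the 0 dBm0 reference.
MILLIWATT = bytes.fromhex("34212134b4a1a1b4")
REFERENCE = sum(alaw_to_linear(c) ** 2 for c in MILLIWATT) / len(MILLIWATT)

def level_dbm0(block):
    """Mean power of a block of A-law octets, in dB relative to the milliwatt."""
    power = sum(alaw_to_linear(c) ** 2 for c in block) / len(block)
    return 10 * math.log10(power / REFERENCE)

def too_loud(block, threshold_dbm0):
    return level_dbm0(block) > threshold_dbm0

# 800 samples = 100 ms at 8 kHz
print(level_dbm0(MILLIWATT * 100))        # 0.0
print(level_dbm0(bytes([0xAA]) * 800))    # roughly +6.0
print(level_dbm0(bytes([0x55]) * 800))    # roughly -66.1
print(too_loud(MILLIWATT * 100, -20))     # True
```

A-law has no code for exactly zero, so the power is never zero and the logarithm is always defined.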

Is the G.711 definition the best one?

The sequence given in G.711 is a 1kHz sine wave. The sampling rate on E1/T1 is 8kHz, so the reference sequence can be expressed in just eight values. That’s nice, but that also leads to small errors, about 0.13dB, because of quantisation.

ITU-T O.133 discusses that problem in detail and proposes a test signal which specifically is not 1kHz (i.e. not a submultiple of the sampling rate). For most practical purposes, 0.13dB doesn’t matter and so the simple and robust thing to do is to use the well-defined and well-known G.711 sequence as a reference.

Here’s what a few periods of a 1020Hz signal look like. Notice that the samples, i.e. the red crosses, don’t appear in the same spot one period later—that way we don’t get the same errors over and over again.

1020Hz signal sampled at 125us intervals
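A quick way to see why 1020Hz behaves differently from 1000Hz is to count how many samples pass before the sampling points land on the same phases again. This is just arithmetic on the two frequencies; the function name is mine.

```python
from math import gcd

SAMPLE_RATE = 8000  # Hz, the E1/T1 sampling rate

def samples_until_repeat(freq_hz):
    """Number of samples before a sampled sine wave repeats exactly."""
    return SAMPLE_RATE // gcd(SAMPLE_RATE, freq_hz)

print(samples_until_repeat(1000))   # 8
print(samples_until_repeat(1020))   # 400
```

At 1000Hz the same eight phases are hit over and over, so the same quantisation errors repeat. At 1020Hz the sample points cover 400 different phases before repeating.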

Testing pitfall: the .wav header

GTH players play raw A-law or μ-law data. If you feed a player a .wav file in 8kHz A-law, or μ-law if your network uses μ-law, there will be a very short bit of noise at the start of the playback because the .wav header gets treated as though it were audio.

When testing level detection, especially at quiet levels, that header noise is enough to trigger a detector. Here’s a .wav of a 1000Hz sine wave at about -30dBm0:

00000000  52 49 46 46 42 27 00 00  57 41 56 45 66 6d 74 20  |RIFFB'..WAVEfmt |
00000010  12 00 00 00 06 00 01 00  40 1f 00 00 40 1f 00 00  |........@...@...|
00000020  01 00 08 00 00 00 66 61  63 74 04 00 00 00 10 27  |......fact.....'|
00000030  00 00 64 61 74 61 10 27  00 00 d5 c4 f5 f1 f3 f1  |..data.'........|
00000040  f5 c4 d5 44 75 71 73 71  75 44 d5 c4 f5 f1 f3 f1  |...DuqsquD......|
*
00002740  f5 c4 d5 44 75 71 73 71  75 44                    |...DuqsquD|

The first 58 octets (bytes) are the header. If we turn that header into a periodic signal, it’s at about -8dBm0, which is fairly loud. With the default period parameter of 100ms in the level detector, that’ll cause a false level of about -11dBm0.
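One way to avoid the header noise is to strip the header before playing the file. Here's a minimal sketch which walks the RIFF chunks and returns only the contents of the 'data' chunk. The function name is mine, and the test file is rebuilt from the hex dump above rather than read from disk.

```python
import struct

def wav_payload(data):
    """Return the contents of the 'data' chunk of a RIFF/WAVE file in memory."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, offset)
        body = offset + 8
        if chunk_id == b"data":
            return data[body:body + size]
        offset = body + size + (size & 1)   # chunks are padded to even length
    raise ValueError("no data chunk found")

# The 58-octet header and the repeating sample pattern from the dump above.
HEADER = bytes.fromhex(
    "52494646 42270000 57415645 666d7420 12000000 06000100 401f0000 401f0000"
    "01000800 0000 66616374 04000000 10270000 64617461 10270000")
PERIOD = bytes.fromhex("d5c4f5f1f3f1f5c4d544757173717544")

wav_file = HEADER + PERIOD * 625
raw = wav_payload(wav_file)
print(len(HEADER), len(raw))    # 58 10000
```

The result is raw A-law (or μ-law) with nothing in front of it, which is what a GTH player expects.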

The period parameter

The level_detector has an optional parameter, the period. The period sets the size of the audio block the GTH considers when measuring the power. A short period makes the GTH responsive to sudden changes in power on the timeslot, which would be useful in an application such as figuring out which of the people in a conference call are currently talking. A long period averages out the power over a longer time, which is useful in deciding whether a voicemail recording has finished.

The default is 100ms.

Sample files

This .zip file contains sample recordings with 1kHz sine waves at 0, -10, -20, -30, -40 and -50 dBm0. The .wav versions are useful for listening to or importing into an audio editing program to calibrate the level meter. The .raw versions are just plain A-law samples—you can use them with a GTH player.

(Aside: wordpress.com won’t let me put audio files here. I don’t know whether that’s because they don’t want people filling up their servers with audio, or if it’s to avoid record label copyright claims. You never know, someone might have a claim on a 1kHz sine wave…).

How does TCP behave on an interrupted network?

GTH E1/T1 modules are always controlled by a general-purpose server, usually some sort of unix machine. The server and GTH are connected by ethernet and communicate using TCP sockets. Normally, that ethernet connection is chosen to be simple and reliable, for instance by putting the server and the GTH in the same rack, connected to the same ethernet switch.

I experimented a bit to see what happens when that network gets interrupted. I interrupted the network in a reproducible way by disabling and re-enabling the server’s ethernet port for a known length of time while running a <recorder>. (A <recorder> sends all the data, typically someone talking, from an E1 timeslot to the server over a TCP socket, 8000 octets per second.)

Capturing the ethernet packets

Here’s what I did to capture traffic and interrupt the ethernet:


tcpdump -w /tmp/capture.pcap -s 0 not port 22
sudo ifconfig eth0 down; sleep 5; sudo ifconfig eth0 up

A trace where traffic recovers in time to prevent an overrun

The GTH buffers about two seconds of timeslot traffic. So a ‘sleep’ of about a second won’t result in an overrun. Here’s what it looks like in wireshark:

Packet Time Direction Flags Seq. #

133 7.596 GTH -> server [PSH, ACK] 59393
134 7.633 server -> GTH [ACK] 1
135 7.724 GTH -> server [PSH, ACK] 60417
136 7.761 server -> GTH [ACK] 1
137 7.852 GTH -> server [PSH, ACK] 61441
138 7.889 server -> GTH [ACK] 1
139 7.980 GTH -> server [PSH, ACK] 62465
140 8.017 server -> GTH [ACK] 1
141 8.108 GTH -> server [PSH, ACK] 63489
142 8.145 server -> GTH [ACK] 1
143 8.236 GTH -> server [PSH, ACK] 64513
144 8.273 server -> GTH [ACK] 1
145 8.364 GTH -> server [PSH, ACK] 65537
146 8.401 server -> GTH [ACK] 1
147 10.151 GTH -> server [PSH, ACK] 66561
148 10.151 server -> GTH [ACK] 1
149 10.151 GTH -> server [ACK] 67585
150 10.151 server -> GTH [ACK] 1

Everything up to packet 146 is normal: the GTH (172.16.2.5) sends 8000 octets every second and the server (172.16.2.1) acks them. It happens to be in chunks of 1024 octets about eight times per second. After packet 146, about 8.4 seconds after the capture started, the ethernet interface went down and stayed down for 1s. The TCP stream started up again after about 1.5s and then ‘caught up’ by sending many packets in quick succession.

A trace where traffic didn’t recover

I took a second trace similar to the first one, except this time, I disabled ethernet for about five seconds:

Packet Time     Source IP     Dest IP    SPort   DPort
----------------------------------------------------------------------
 28   1.040083  172.16.2.5 -> 172.16.2.1 54271 > 45195 [PSH, ACK] Seq=7169
 29   1.040095  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 30   1.168065  172.16.2.5 -> 172.16.2.1 54271 > 45195 [PSH, ACK] Seq=8193
 31   1.168078  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 32   1.296067  172.16.2.5 -> 172.16.2.1 54271 > 45195 [PSH, ACK] Seq=9217
 33   1.296079  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 34   1.424068  172.16.2.5 -> 172.16.2.1 54271 > 45195 [PSH, ACK] Seq=10241
 35   1.424081  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 36   7.782851  172.16.2.5 -> 172.16.2.1 54271 > 45195 [PSH, ACK] Seq=11265
 37   7.782863  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 38   7.783406  172.16.2.5 -> 172.16.2.1 54271 > 45195 [ACK] Seq=12289
 39   7.783413  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 40   7.783569  172.16.2.5 -> 172.16.2.1 54271 > 45195 [ACK] Seq=13737
...
 50   7.784962  172.16.2.5 -> 172.16.2.1 54271 > 45195 [FIN, PSH, ACK] Seq=23873
 51   7.784972  172.16.2.1 -> 172.16.2.5 45195 > 54271 [ACK] Seq=1
 52   7.785026  172.16.2.1 -> 172.16.2.5 45195 > 54271 [FIN, ACK] Seq=1
 53   7.785348  172.16.2.5 -> 172.16.2.1 54271 > 45195 [ACK] Seq=25322

Everything is normal up to packet 35. Then, ethernet is suspended for five seconds and TCP takes a further second to recover, which causes a buffer overrun on the GTH (172.16.2.5). The GTH closes the socket at packet 50 and also sends an overrun event to the application so that it knows why the socket was closed.

Bottom line

GTH uses IP for control and traffic. It is important that the IP link between the GTH and the server is simple and reliable. Ideally the GTH and server should be in the same rack and be connected by an ethernet switch.

It’s possible for a system to survive a short interruption (less than a second) to the ethernet traffic without pre-recorded calls getting interrupted. For longer interruptions, all bets are off.

(Interruptions aren’t the only type of network problem, e.g. radio networks such as 802.11 can suffer significant packet loss, which can trigger TCP congestion avoidance. But that’s another topic.)

Capturing SS7 with wireshark or tshark

I often use wireshark to look at SS7 signalling on E1 links. Up until today, I’ve always done that by capturing the signalling (from a GTH), then converting the captured data to libpcap format and finally loading the file into wireshark.

Someone showed me a better way today: wireshark can read from a pipe or from standard input. That lets me see and filter the packets in wireshark in real time. Here’s how to do it, using the save_to_pcap demo program (included in gth_c_examples):

> ./save_to_pcap gth21 1A 2A 16 - | wireshark -k -i -
capturing packets, press ^C to abort
saving capture to stdout

The same thing works for tshark:

> ./save_to_pcap gth21 1A 2A 16 - | tshark -V -i -
capturing packets, press ^C to abort
saving capture to stdout
Capturing on -
Frame 1 (15 bytes on wire, 15 bytes captured)
    Arrival Time: Aug 10, 2009 20:38:29.388000000
...
   Message Transfer Part Level 2
    .000 1101 = Backward sequence number: 13
    1... .... = Backward indicator bit: 1
    .011 1000 = Forward sequence number: 56
    1... .... = Forward indicator bit: 1
    ..00 0000 = Length Indicator: 0
    00.. .... = Spare: 0
...

A few rough edges

Piping to wireshark/tshark works on all the *nixes, i.e. linux, BSD, OSX, Solaris, but for some reason it doesn’t work on windows. There, you have to save the pcap files and open them. I’m not sure why that is, but then again I rarely use windows, so maybe there’s some easy way around that. If someone knows, send me some mail, or comment.

Wireshark needs both the -i and -k switches for piping to work. That took me a while to figure out. Seems unnecessary.

On some older (as of August 2009) versions of wireshark, possibly in combination with older libraries, the “-i -” switch doesn’t work, at least according to google, even though the tshark version works. Both work fine for me on Debian Linux.

Generating DTMF using a ‘player’ on GTH

The GTH can transmit in-band signalling tones on a timeslot. That’s useful for testing and for building active in-band signalling systems.

DTMF

The tones transmitted when the subscriber presses a number key on a fixed or mobile handset are called DTMF. Wikipedia has an article about it. To generate DTMF, all we really need to know is that there are 16 possible DTMF signals, that each signal is made up of two sine waves of particular frequencies and that sending the signal for 100ms is a reasonable thing to do.
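For the curious, here's a sketch of how such a tone could be generated in Python. The row and column frequencies are the standard DTMF ones. Everything else is my assumption: the function names, the amplitude of 1000 per sine wave, and the A-law encoder, which simply picks the code whose decoded value is nearest to the sample.

```python
import math
from bisect import bisect_left

ROWS = (697, 770, 852, 941)           # Hz
COLS = (1209, 1336, 1477, 1633)       # Hz
KEYS = ("123A", "456B", "789C", "*0#D")

def alaw_to_linear(code):
    """Decode one A-law octet to a linear sample on the 13-bit scale."""
    code ^= 0x55
    mantissa = (code & 0x0F) << 4
    segment = (code & 0x70) >> 4
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    magnitude >>= 3
    return magnitude if code & 0x80 else -magnitude

# Sorted (linear value, A-law code) pairs, for nearest-value encoding.
_TABLE = sorted((alaw_to_linear(c), c) for c in range(256))
_VALUES = [value for value, _ in _TABLE]

def linear_to_alaw(sample):
    i = bisect_left(_VALUES, sample)
    candidates = _TABLE[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda vc: abs(vc[0] - sample))[1]

def dtmf(key, duration_ms=100, amplitude=1000):
    """Return raw A-law octets for one DTMF key at 8000 samples/s."""
    row = next(r for r, keys in enumerate(KEYS) if key in keys)
    col = KEYS[row].index(key)
    count = 8000 * duration_ms // 1000
    return bytes(
        linear_to_alaw(round(amplitude * (
            math.sin(2 * math.pi * ROWS[row] * t / 8000) +
            math.sin(2 * math.pi * COLS[col] * t / 8000))))
        for t in range(count))

tone = dtmf("5")
print(len(tone))    # 800
```

A 100ms tone comes out as 800 octets, one per sample.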

Here’s a .zip file with DTMF tones in it. Each file is raw ALAW data, i.e. it’s ready for the GTH to play (transmit) on a timeslot.

The GTH has two ways of playing tones. One way is to stream the audio data in over a TCP socket each time we want to play it. I wrote a post about that earlier. The other way is to store the sample data on the GTH and command its playback whenever it’s needed. Since there’s a small number of different tones (12, or 16 if you want to use the A/B/C/D tones as well) and the tones are short, storing them on the GTH makes sense. To store the tone:


<new><clip name='dtmf5'/></new> 
(and now send the 800 byte file)

To play the tone later on:


<new><player><clip id='clip dtmf5'/><pcm_sink span='3A' timeslot='19'/></player></new> 

Sequences of tones

Sometimes you want to transmit a sequence of DTMF tones, for instance to simulate a subscriber dialling a number. The GTH lets you start a player with a sequence of tones like this:


<new><player><clip id='clip dtmf5'/><clip id='clip dtmf6'/><clip id='clip dtmf8'/><pcm_sink span='3A' timeslot='19'/></player></new> 

But that isn’t a valid sequence of DTMF tones. Why not? Because DTMF expects a gap between tones. The cleanest way to handle that is to define another clip consisting of just silence and put it after each tone. A good ‘silence’ value on E1 lines is 0x54. 60ms (480 samples) is a reasonable length.
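Building that command by hand gets tedious, so here's a small sketch which generates it. The clip naming follows the 'dtmf5' example above. The name 'silence60' for the gap clip and the function name are hypothetical, and the sketch only handles the keys 0 to 9.

```python
def player_command(digits, span, timeslot, gap_clip="silence60"):
    """Build a <player> command which plays each digit followed by silence."""
    clips = "".join(
        "<clip id='clip dtmf%s'/><clip id='clip %s'/>" % (digit, gap_clip)
        for digit in digits)
    return ("<new><player>%s<pcm_sink span='%s' timeslot='%d'/></player></new>"
            % (clips, span, timeslot))

# 60 ms of silence, to be stored on the GTH as the gap clip.
SILENCE = bytes([0x54]) * 480

command = player_command("568", "3A", 19)
print(command)
```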

Other in-band tones

DTMF in-band signalling is used in pretty much all handsets (telephones), mostly for dialling, but also to navigate menus in IVR systems. But before SS7 became popular, in-band signalling in the form of CAS and SS5 was even used to communicate call setup information between exchanges. GTH can also generate those tones, but that can be the subject of another post.

Perl example code for GTH: SS7 ISUP decoding and playback/record

To help people get started, www.corelatus.com has some example code for doing useful things with GTH units.

Now it also has Perl example code. It does the same thing as the python examples:

  • Enable an E1/T1 port
  • Start MTP-2 monitoring on a timeslot and decode SS7 ISUP (to print out when calls start and stop). I wrote a post a while back about how to decode ISUP.
  • Dump the contents of a timeslot to a file (for later analysis)
  • Feed a file into a timeslot (for playback of previously captured files)

It’s built on top of a Perl module which provides a Perl API for a subset of the GTH API.

A quick example

Here’s a quick example of how it’s used. We want to enable (turn on) the first E1/T1 interface on a GTH module:


my $api = gth_control->new($host);
$api->send("<set name='pcm$span'><attribute name='mode' value='E1'/></set>");
defined $api->next_non_event()->{ok} || die("error from GTH (bogus PCM?)");
$api->bye();

It’s good for experimenting.

The Perl module the examples are based on, gth_control.pm, is at a level which makes it useful for experiments and prototypes. To build a full-fledged product on top of it, more work is needed.

For a start, you’d probably want to move the XML generation (like the ‘<set name=…’ code above) out of the application code and into the gth_control.pm module, thus making it a pure Perl interface.

Next, you need to come up with a strategy to deal with concurrency, because being limited to recording one timeslot at a time is fine for lab work, but not fine for (say) a voicemail system.

Download

The zipfile of the code is linked from the bottom of the API page.

Decoding MTP-3 and ISUP

Sometimes, you want to look at the signalling on an E1 and use it to figure out when telephone calls start and stop. In SS7 networks, call setup and tear-down is done by the ISUP layer, which fits in to the SS7 stack like this:

Layer 4: ISUP
Layer 3: MTP-3
Layer 2: MTP-2
Layer 1: MTP-1 (typically a 2Mbit/s E1 or a 1.5Mbit/s T1)

If you have a GTH connected to the E1 you’re interested in, either via a DXC or a monitor point, the GTH takes care of layers one and two. That leaves MTP-3 and ISUP to you.

The easiest way to decode MTP-3 and ISUP is to let wireshark do it for you. There’s a note about how to do that on Corelatus’ official site. But this blog entry is about how to decode MTP-3 and ISUP yourself.

A signal unit (packet)

In SS7, packets are usually called “signal units”. Here’s what an SS7 signal unit looks like ‘on the wire’, octet by octet, with MTP-2 and MTP-1 already decoded:

8d c8 1f 85 02 40 00 00 35 00 01 00 21 00 0a 02
02 08 06 01 10 12 52 55 21 0a 06 07 01 11 13 53
55 00 6e 00

MTP-3

The start of the packet is the MTP-2 (ITU-T Q.703) and MTP-3 (ITU-T Q.704) headers. These headers are easy to decode because they are always fixed-length:

Octet(s) Value Purpose
00–01 8d c8 MTP-2 sequence numbers, safe to ignore
02 1f MTP-2 length indicator. Anything less than 3 is reserved for MTP-2 itself and should be discarded.
03 85 MTP-2 SIO. The SIO tells us which ‘service’ the signal unit is intended for. Q.704 sections 14.2.1 and 14.2.2 tell us that anything ending in hex 5 is for ISUP.
04–07 02 40 00 00 MTP-3 Routing label. The routing label is just a “from” and “to” address in the SS7 network. For most applications we can ignore it. Q.704 figure 3 shows what’s in the routing label.

Upshot: to see calls start and stop, all we have to do for MTP-3 is:

  1. Look at the length indicator (offset 2) and discard any signal unit where it’s less than 3.
  2. Look at the SIO (offset 3). Discard if ((SIO & 0x0f) != 5)
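In Python, those two checks could look something like the sketch below, applied to the signal unit shown earlier. The function name is mine, and masking the length indicator to its low six bits is my assumption (the top two bits of that octet are spare).

```python
SIGNAL_UNIT = bytes.fromhex(
    "8dc81f85024000003500010021000a02"
    "0208060110125255210a060701111353"
    "55006e00")

def isup_payload(su):
    """Return everything after the routing label, or None if the SU isn't ISUP."""
    length_indicator = su[2] & 0x3F      # low six bits of octet 2
    if length_indicator < 3:
        return None                      # MTP-2 only, nothing for us
    if (su[3] & 0x0F) != 5:
        return None                      # some other service, e.g. SCCP
    return su[8:]

payload = isup_payload(SIGNAL_UNIT)
print(payload.hex())
```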

ISUP

The rest of the signal unit is ISUP. Annex C in ITU-T Q.767 tells us how to decode ISUP. ISUP is fiddly because there are several types of ISUP packets, because several of those types have optional fields and because some of those fields are variable length. Here are the octets we have left after removing MTP-2 and MTP-3:

35 00 01 00 21 00 0a 02
02 08 06 01 10 12 52 55
21 0a 06 07 01 11 13 53
55 00 6e 00

The first two octets are the CIC. The third octet is the Message type.

The CIC (Q.767 C.1.2) tells us which circuit this call uses. All the signalling for one call has the same CIC. In ITU networks, it’s a 12-bit value packed into the field in little-endian byte order. In this case CIC=0x0035. We’re sniffing an E1 line, so C.1.2.a tells us that the lower five bits correspond to the timeslot (0x35 & 0x1f = 21, i.e. timeslot 21) and the rest identifies the E1 itself.

The Message Type (Q.767 Table C-3) field tells us what sort of ISUP message this signal unit is. 0x01 is an IAM. 0x10 is RLC. For a minimal “show me what calls are going through the system” hack, we only need to look at the IAM (comes at the start of the call, contains the A and B numbers) and the RLC (sent when the call is finished) messages.

Now we know that the CIC=0x35, that the message is an IAM and we still have about a dozen octets to decode. Q.767 table C-16 tells us how to decode an IAM. There are some uninteresting fixed-length fields followed by the B number and then the A number. Look at the code (or Q.767, section C.3.7) if you’re interested in the details. All we really care about is that these octets

06 01 10 12 52 55 21

represent the B number: 21255512. You can see the number in the raw data if you skip the first three octets and swap the two digits in each remaining octet.
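Here's a sketch of the CIC and B number decoding in Python, run against the ISUP octets above. The function names are mine, and the offset of the B number (10) is read off this particular message rather than found by following the pointers in the IAM, which a real decoder would have to do.

```python
ISUP = bytes.fromhex(
    "3500010021000a02"
    "0208060110125255"
    "210a060701111353"
    "55006e00")

def cic(isup):
    """12-bit circuit identification code, little-endian in the first two octets."""
    return int.from_bytes(isup[0:2], "little") & 0x0FFF

def decode_number(field):
    """field: a length octet, two indicator octets, then the digits.

    Each digit octet carries two digits, the first one in the low nibble.
    """
    length = field[0]
    is_even = (field[1] & 0x80) == 0
    digits = ""
    for octet in field[3:1 + length]:
        digits += "%x%x" % (octet & 0x0F, octet >> 4)
    return digits if is_even else digits[:-1]

print(cic(ISUP), cic(ISUP) & 0x1F)    # 53 21
print(ISUP[2])                        # 1, i.e. an IAM
print(decode_number(ISUP[10:17]))     # 21255512
```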

Turning those ISUP steps into an algorithm to decode one signal unit:

  1. Save the CIC
  2. Is the message an IAM? Decode it as an IAM, which is fiddly.
  3. Is the message an RLC? Just print the CIC.

That’s all you need to do to make a simple system which prints the start and end of each call. To do something useful, you need to maintain a table of in-progress calls and match up the IAM and RLC messages with the same CIC. You also need to handle things like systems restarting.
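The table of in-progress calls can be as simple as a dictionary keyed on CIC. This sketch assumes the CIC, message type and B number have already been decoded. The function name and return values are made up for illustration, and it ignores restarts and every message type other than IAM and RLC.

```python
IAM = 0x01
RLC = 0x10

def track(calls, cic, message_type, b_number=None):
    """Update the table of in-progress calls; report what happened."""
    if message_type == IAM:
        calls[cic] = b_number
        return "start"
    if message_type == RLC:
        if cic in calls:
            del calls[cic]
            return "end"
        return "unmatched"       # e.g. we started sniffing mid-call
    return "ignored"

calls = {}
print(track(calls, 0x35, IAM, "21255512"))   # start
print(calls)                                 # {53: '21255512'}
print(track(calls, 0x35, RLC))               # end
print(track(calls, 0x35, RLC))               # unmatched
```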

Further reading

The ITU now have most of their standards freely available at www.itu.int. So one way to learn more about MTP-3 and ISUP is to read the standards, e.g. all the Q-series standards about signalling are here.

Erlang code

Everything discussed above is implemented in the ss7_sniffer.erl example on corelatus.com. It makes good use of Erlang’s binary syntax, e.g. here’s the MTP-3 decoder:


mtp3(<<_Sub:4, Service_indicator:4>>, <<DPC:14, OPC:14, SLS:4,
 Rest/binary>>) -> 
    case Service_indicator of 
    0 -> % Management 
        ignore; 
    1 -> % Test/maintenance 
        ignore; 
    3 -> % SCCP 
        ignore; 
    5 -> 
        isup(DPC, OPC, SLS, Rest); 
    9 -> % B-ISUP; similar to ISUP, but not compatible. 
        ignore; 
    X -> 
        io:fwrite("ignoring SU with unexpected service indicator=~p\n", [X]) 
    end.

It looks a lot like one of the examples in the original paper about the binary syntax.

Python Code

The same thing done in Python is fairly straightforward once you discover the Python ‘struct’ library, which is basically the same thing as Perl’s pack/unpack. The code is in sniff_isup.py, inside the GTH python examples zip.

It feels like I haven’t discovered whatever it is python people use to unpack bitfields, e.g. something neater than:

is_even = ((ord(num[1]) & 0x80) == 0)

or, better still, a clean way to decode the MTP-3 routing label.
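One candidate for the routing label, in Python 3, is to turn the four octets into one integer and then shift and mask. This is a sketch under an assumption: that the label is read as a little-endian 32-bit value with the DPC in the lowest 14 bits, then the OPC, then the SLS, which is how I read Q.704 figure 3. The function name is mine.

```python
def routing_label(label):
    """Split a four-octet ITU routing label into DPC, OPC and SLS."""
    value = int.from_bytes(label, "little")
    return {
        "dpc": value & 0x3FFF,            # bits 0..13
        "opc": (value >> 14) & 0x3FFF,    # bits 14..27
        "sls": value >> 28,               # bits 28..31
    }

# The routing label from the signal unit earlier in this post.
print(routing_label(bytes.fromhex("02400000")))
# {'dpc': 2, 'opc': 1, 'sls': 0}
```

It still isn't as neat as Erlang's binary syntax, but it keeps all the bit positions in one place.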

The timestamp field in signalling headers

When the Corelatus GTH is used to monitor (sniff) signalling, it sends each sniffed packet to your server over a TCP socket, along with a header. For instance, for SS7 MTP-2 the header looks like this:

octet 0x00: Length (16 bits)
octet 0x02: Tag (16 bits)
octet 0x04: Flags (16 bits)
octet 0x06: Timestamp (48 bits)

Every field is big-endian, i.e. the most significant byte comes first. Here’s an actual header from a GTH, octet by octet:

00 1c 00 00 00 00 01 20 34 ee fa 61 99 99 99 99 ...

The timestamp is thus 0x012034eefa61, or decimal 1237838658145. For most applications, you just want to know which packet came first, so the interpretation of that number doesn’t matter much, though it’s useful to know that it’s the number of milliseconds since the unix epoch. (wikipedia has a decent article about unix time)
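Pulling the fields out of that header takes a few lines of Python. Python's struct module has no 48-bit format code, so this sketch reads the timestamp as a 16-bit part and a 32-bit part and joins them. The function name is mine.

```python
import struct

def parse_header(data):
    """Return (length, tag, flags, timestamp_ms) from the first 12 octets."""
    length, tag, flags, ts_high, ts_low = struct.unpack_from(">HHHHI", data)
    return length, tag, flags, (ts_high << 32) | ts_low

header = bytes.fromhex("001c00000000012034eefa61")
length, tag, flags, timestamp = parse_header(header)
print(length, tag, flags, timestamp)    # 28 0 0 1237838658145
```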

Sometimes, though, you want to represent that as a human-readable time. Unix (and, most likely, Win32) provides functions to do that in the C library, so, after throwing away the last three digits (the milliseconds), this C program does it:

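#include <time.h>
#include <stdio.h>

int main(void) {
    /* GTH timestamp with the last three digits (milliseconds) removed */
    const time_t time_stamp = 1237838658;

    /* time_t isn't necessarily an int, so cast before printing.
       ctime() converts to local time and its result ends in a newline. */
    printf("%ld corresponds to %s", (long)time_stamp, ctime(&time_stamp));

    return 0;
}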

The output agrees with what the clock on my wall says:

1237838658 corresponds to Mon Mar 23 21:04:18 2009

Python

Since I’ve been messing around with python, the same thing in python:

>>> import time
>>> time.ctime(1237838658)
'Mon Mar 23 21:04:18 2009'

Erlang

Erlang doesn’t have an interface to the ‘ctime’ call, but you can use the gregorian calendar functions. The result is in UTC, which is why it’s an hour behind the local time shown above:

1> Epoch = calendar:datetime_to_gregorian_seconds({{1970, 1, 1}, {0,0,0}}).
62167219200
2> calendar:gregorian_seconds_to_datetime(1237838658 + Epoch).
{{2009,3,23},{20,4,18}}

Why use milliseconds?

Why is the GTH timestamp in milliseconds instead of either seconds or a ‘timeval’-like seconds + microseconds?

We chose millisecond resolution for several reasons. Firstly, the shortest possible useful packet in SS7 takes a bit more than a millisecond to transmit at 64kbit/s. Secondly, the practical limit of NTP time synchronisation over the internet is about one millisecond at a typical site.

GTH audio streaming: why stream over TCP?

GTH lets you stream audio from a TCP socket to a timeslot on an E1/T1 line. Some people are surprised by the choice to use TCP. When I added that support back in 2002, my first thought was to use RTP (RFC 1889). RTP is simple: you just dump the audio in a UDP packet with some timestamping information and shoot it out on ethernet at the right rate.

I’d worked with RTP before and I’d been at a couple of SIP interops where most of the attendees had trouble emitting audio at ‘the right rate’, i.e. 8000 samples/s. One manufacturer’s system would emit 8007 samples/s. Another would play it back at 7999 samples/s. What do you do with the extra 8 samples per second? If you do nothing, you get endlessly growing delays and, eventually, a buffer overflow. If you come up with a strategy for throwing away samples, it’s bound to interact badly with something, sooner or later.

The thing is, when you’re streaming in pre-recorded audio, you don’t need it to be at the right rate. You just need to make sure it doesn’t overrun or underrun the GTH’s internal buffer. I.e. you need flow control, not rate control. TCP has flow control, and everyone knows how to use TCP sockets. In 2002, doing things that way was right at the limit of what our 50MHz embedded CPU could keep up with. Now it’s no problem at all.

Python

I’m playing around with python at the moment. Here’s how to put some data on an E1 timeslot, straight from the python shell. First, set up a listening TCP socket:


import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))      # any local address, port chosen by the OS
s.listen(0)
addr, port = s.getsockname()

Next, open another socket to the GTH command port (2089) and tell it we want to stream in audio on the socket we opened above:

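a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.connect(("172.16.2.7", 2089))
my_ip, _port = a.getsockname()
# Assumed command: a <player> fed by a <tcp_source>; span and timeslot
# are placeholders.
command = ("<new><player><tcp_source ip_addr='%s' ip_port='%d'/>"
           "<pcm_sink span='1A' timeslot='1'/></player></new>") % (my_ip, port)
header = "Content-type: text/xml\r\nContent-length: %d\r\n\r\n" % len(command)
a.sendall((header + command).encode("ascii"))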

Finally, accept() and send the data:


d, _ = s.accept()
d.sendall(b"hello world")
d.close()

That looks OK to me, though I imagine the style betrays my Erlang mindset. There’s a more complete python example at the bottom of the API page.

SOX parameters for downsampling to 8kHz Alaw

A technician working for an operator mailed me a few days ago wondering why the recorded voice clips they use for their IVR sound so bad, “like they’re coming from the bottom of a deep well”. It turned out that the clips actually sounded OK on a telephone, just not through his laptop’s speaker. He asked if I recommend any specific filter parameters when converting audio from 44.1kHz wav to 8kHz Alaw voice clips.

An example

I took this audio snippet from the introduction to an audio book. It was originally a .ogg file. I converted it to a .wav file with a 44.1kHz sampling rate and 16 bits per sample. For my purposes any artefacts from ogg vorbis are negligible.

1_mono.wav (44.1kHz, 16 bit linear samples)

Next, I converted it to 8kHz Alaw using sox. 8kHz Alaw is what runs on the fixed telephone network in most of the world. (The US uses a minor variant, μlaw):

sox 1_mono.wav -A -r 8000 2_8kHz_alaw.wav

2_8kHz_alaw.wav (8kHz, 8 bit Alaw samples)

That sounds a bit less clear than the original, but it’s OK. It’s what you’d expect coming out of a telephone. There’s some weirdness though. The audible difference between the two files varies from one PC to another and even one playback program to another. Why? Because laptop speakers vary in quality and because playback programs usually quietly convert everything back to 48kHz or 44.1kHz sampling rates, and they do it with different approaches. For fun, I resampled to 44.1kHz:

sox 2_8kHz_alaw.wav -r 44100 -s 3_resampled.wav

3_resampled.wav (44.1kHz, 16 bit linear samples)

2_8kHz_alaw.wav and 3_resampled.wav should sound almost the same. But on some PCs they sound markedly different.

The GTH just plays octets (bytes)

The GTH has a simple approach to playing back audio. It just copies the bytes you give it to the destination timeslot. No format or rate conversion happens, though the GTH does make sure the data is played out at the E1’s frame rate (8000Hz). The downside of that is that you have to convert all the files for your IVR system before giving them to a GTH, e.g. using sox. The upside is that it’s simple. Nothing happens behind your back.

What are the best SOX options to use?

I don’t know. I used to suggest the following as a reasonable starting point:

sox original.wav -r 8000 -c 1 -A -t raw gth.raw resample -q

As of a few years ago, sox improved and the ‘resample’ effect got deprecated. So now I suggest just letting sox do what it thinks is best:

sox original.wav -r 8000 -c 1 -A -t raw gth.raw

At the time of writing, it uses its “rate” effect with reasonable default parameters for the bandwidth and filter characteristics. I experimented a bit with the -m, -h, -v and -s switches for the “rate” effect. I could not reliably hear a difference, let alone decide that one sounded better.

Why does the phone system use 8kHz anyway?

There’s a certain sound quality level expected in telephone networks, and part of that is that the network carries everything up to about 3500Hz. Analog local loop specifications mention that, and pretty much all digital telephone systems use an 8kHz sampling rate, which is what you need to be able to carry audio up to 3.5kHz. Even the GSM and AMR codecs start off with the assumption that the incoming audio is limited to 3500Hz.

So the bar is set pretty low. I haven’t come across any systems which set out to provide higher quality, e.g. even skype compresses the hell out of the audio to save bandwidth, even when both parties in a conversation have huge amounts of it. That’s surprising. Why not aim for VOIP to sound much better than a regular telephone?