Technology: Audio Time Compression

by Steve Cunningham

For most of us, it’s only been available for about 15 years, but we all wonder how we ever got on without it. It comes standard with almost every audio editor out there, and it’s available as a third-party plug-in as well. It makes our jobs infinitely easier and our clients appreciate it, but our listeners often hate it and with good reason.

It’s time compression and expansion. It’s the ability to shrink a 32-second spot down to the required 29.9 seconds with a click and a drag. It’s the ability to fix a rushed VO read, or to beat-match a song for a promo. It’s the ability to fit a half-page of single-spaced legal mumbo-jumbo into the last eight seconds of a spot, without making the talent sound like Alvin and the Chipmunks. And I doubt that many car dealer spots are ever completed without it. ’Nuff said.

Time compression and expansion (or TCE for short) can shorten or lengthen an audio file without changing the pitch of the talent’s voice or the formants of his or her vocal cavity. Closely related to time compression is pitch shifting, which is designed to change the pitch of an audio clip without changing its speed. But since pitch shifting is best used for effects, we’ll focus here on time compression and expansion.

The human voice can be described not only in terms of pitch, but also by the natural fixed filtering (also called formant filtering) that is created by the mouth, throat, and sinus cavities. It is these formants that give James Earl Jones much of his larger-than-life character — if you don’t believe me, ask any male baritone you know to say “this... is CNN” and see how close they can get to Jones. My guess is “not very” because of the formant filter created by Jones’ physiology.

Some TCE plug-ins include a formant control, which effectively changes the size of the vocal tract. This can be great fun if you want to perform a gender change on a VO track, but it too tends to fall into the category of special effects. Good time compression does not change the spectral characteristics of audio, and so does not affect formants.

Although TCE is included with most every software editor, each flavor has its own capabilities and limitations. Some give you a great deal of control over parameters, allowing you to tweak such obscure settings as Overlap and Splicing Frequency, while others only allow you to set the target time. Most work offline rather than in real time, due to the CPU-intensive nature of the TCE process (for more, see HOW DO THEY DO THAT, ANYWAY? below). Let’s look at a few specifics.

AUDITION

Time Stretch is a plug-in for Adobe Audition that combines both TCE and pitch shift. It’s an offline plug, meaning it does not work in real time (although it does have a Preview function that lets you hear the effect before actually committing to it). It comes with a small array of presets that includes settings for TCE, pitch shift, and a combination that mimics speeding up and slowing down tape. Its standout feature is gliding stretch, which, while a bit clumsy to use, yields a very good impression of a record player coming to a stop with the needle still down.

The quality of the stretch is good out to about 20%, where artifacts begin to creep into playback. It’s a bare-bones plug and about average in terms of processing speed, but it works.

SOUND FORGE

Included with Sound Forge, Sony’s Time Stretch is another offline plug-in, although it does include a Real-Time option that causes it to do the stretch in the background so it seems like real time. It is another no-frills plug, although it comes with a couple dozen presets for music, speech, drums and so on, along with a handy Preview button. It will stretch cleanly past 20%, but for speech the artifacts become intolerable past 25%. It is one of the faster offline TCE processors, and will grab all of your CPU while it works.

PRO TOOLS

Digidesign’s TCE can best be described as serviceable in terms of quality. What is outstanding about it is its speed and ease of use.

By clicking and holding on the Trim tool, you can change it into a TCE tool. Thereafter, grabbing either end of an audio region stretches or shrinks it immediately upon release. There are no bells, whistles, or controls to set, and you have to watch the Length parameter above the waveform display to check what the finished length will be before you release it. It’s good to just over 15% stretch before artifacts become evident, but boy is it fast to use.

WAVELAB

While I am a huge WaveLab fan, its included Time Stretching plug-in leaves me cold. There’s nothing wrong with it and it does the basic job up to almost 20% stretch, but the three Quality buttons are superfluous (why would you use anything other than High Quality?) and there’s not much else there in terms of controls. I’m happy to have it, and happier that I didn’t pay extra for it.

PITCH ‘N TIME LE

Serato’s Pitch ‘n Time is probably the most highly regarded of all TCE plug-ins available today. It is an AudioSuite plug-in so it works on both Macs and PCs, but only with Pro Tools. The Pro version includes both TCE and pitch-shift, as well as some fun features like “Time Morphing” and “Varispeed Mode,” both of which allow variable amounts of TCE and pitch shift over time. There’s really nothing like it out there, and the quality is uniformly good even at settings over 30%.

The Pro version of the software (with Varispeed and Morph) weighs in at a hefty $799 USD, but pictured here is the interface for the newly released LE version. This version uses the same engine as the Pro version, and carries a retail price of $399. It only does TCE and pitch shift, but does them just as well as its pricier cousin. Check the website at www.serato.com.

RADIUS

Newly released is iZotope’s Radius, which at this time runs only on Apple’s Logic Pro. Radius stood up well against Pitch ‘n Time in my evaluation, although its processing time is substantially longer. With a price of $199 USD, one would hope that a version will appear for use in other editors as do iZotope’s other plug-ins, but the company wouldn’t discuss it at this writing. Visit www.izotope.com to keep tabs on this one.

TIPS AND MY FAVES

In general, all these TCE plugs work better at compressing audio than they do at expanding it, for reasons that should be obvious. The more you expand a clip, the longer the vowel sounds will play and the longer the spaces between words will be. These are where any artifacts from TCE will show up, so be sure to listen to them closely.

If your TCE plug comes with presets, then by all means try several on each sound. Each preset tweaks the plug’s obscure technical parameters to optimize the process for a specific type of audio. Just because a preset is labeled “Solo Instruments” doesn’t mean it won’t be effective on a VO track.

With almost any of these plug-ins, it’s easy to get up to about 15% of compression or expansion cleanly. Over 15% and you’re on your own, although I like Pitch ‘n Time best for extreme compression and expansion; I’ve lengthened VO by 30% and it was remarkably good-sounding. As mentioned above, Pro Tools’ TCE tool wins my vote for the easiest of all to use, even if the quality is only adequate. It’s lightning-fast.
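
If you want to know how hard you are about to lean on a plug-in before you click, the arithmetic is simple. Here is a minimal Python sketch; the function names and the default 15% limit are illustrative assumptions taken from the rule of thumb above, not settings from any of these products.

```python
def tce_percent(original_s: float, target_s: float) -> float:
    """Signed length change as a percentage of the original.

    Negative means compression (shorter), positive means expansion (longer).
    """
    if original_s <= 0:
        raise ValueError("original length must be positive")
    return (target_s - original_s) / original_s * 100.0


def within_clean_range(original_s: float, target_s: float,
                       limit_percent: float = 15.0) -> bool:
    """True if the change stays inside the rule-of-thumb limit (assumed 15%)."""
    return abs(tce_percent(original_s, target_s)) <= limit_percent


# The 32-second spot squeezed down to 29.9 seconds:
change = tce_percent(32.0, 29.9)
print(f"{change:.1f}% change, clean: {within_clean_range(32.0, 29.9)}")
```

That 32-second spot works out to roughly 6.6% of compression, comfortably inside the range where any of these plugs should behave.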

But whatever you do, go easy on the legal tags, okay? Thanks.

HOW DO THEY DO THAT, ANYWAY?

DSP MAGIC

Getting audio to play faster or slower without changing its pitch was nigh on impossible in the days of analog audio. With the advent of digital audio, time compression and expansion is commonplace and requires only some Digital Signal Processing (DSP) magic, and one heck of a load of complex math. Let’s put on our propeller-beanies and take a look. No, seriously, put on the beanie. You’ll need it.

DSP algorithms are a bit like physics equations in that they use math to describe physical behavior. For example, we can drop a tennis ball from a rooftop and videotape the event, which allows us to measure how long the ball took to fall and how fast it fell. Alternatively, we can use Newton’s Second Law to describe the effect of gravity on a falling object, plugging in values for the mass of the ball and the height of the drop, and we’ll get the same results for the time and speed. Most DSP algorithms are like using Newton’s Second Law; they work with mathematical representations rather than with the actual digital audio.

One aspect of the DSP magic that makes time compression possible is the Fast Fourier Transform, or FFT. FFT is a mathematical algorithm used to compute the frequency versus time characteristics of a sound. In other words, it lets us look at a small chunk of digital audio and describe that chunk’s component frequencies, amplitudes, and phases in terms of a complex equation; comparing one chunk to the next shows how they change over time.
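
For the curious, here is a small Python sketch of what "describing a chunk's component frequencies" looks like in practice. It assumes NumPy is installed; the two test tones (440 Hz and 1000 Hz) and the variable names are made up for illustration.

```python
import numpy as np

SAMPLE_RATE = 44100   # samples per second
WINDOW = 2048         # size of the chunk we analyze

# Build a test chunk: a loud 440 Hz tone plus a quieter 1000 Hz tone.
t = np.arange(WINDOW) / SAMPLE_RATE
chunk = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Taper the chunk with a Hann window, then transform it.
spectrum = np.fft.rfft(chunk * np.hanning(WINDOW))
freqs = np.fft.rfftfreq(WINDOW, d=1 / SAMPLE_RATE)  # frequency of each bin in Hz
magnitudes = np.abs(spectrum)                       # how strong each bin is

# The strongest bin should sit next to the louder tone.
strongest_hz = freqs[int(np.argmax(magnitudes))]
bin_width_hz = SAMPLE_RATE / WINDOW                 # about 21.5 Hz per bin
print(f"strongest component near {strongest_hz:.0f} Hz "
      f"(resolution {bin_width_hz:.1f} Hz)")
```

Note the resolution: a 2048-sample chunk at 44.1 kHz can only pin a frequency down to the nearest 21.5 Hz or so, which is why the answer lands near 440 Hz rather than exactly on it.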

FFT is an extremely efficient algorithm that can be performed quickly by a computer’s CPU, and it’s the basis for many of the on-screen displays found in audio editing software. Spectrum analyzers, equalizers, and VU-meters all may use FFT to calculate and draw their displays. The difference between these types of displays lies in the equations they use, and whether those equations generate values for intensity or for decibel levels that are used in the graphic result.

Once we’ve created a Fourier representation of a chunk of audio, we can then apply an inverse FFT (IFFT) to that representation to re-create the chunk of audio. It’s something like performing an analog-to-digital conversion to turn audio into digital bits, and then performing a digital-to-analog conversion to turn those bits back into analog audio. Note that as errors can creep into the A-D/D-A process, so can errors creep into the FFT/IFFT process.
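
The round trip is easy to try for yourself. This is a minimal sketch assuming NumPy, with a chunk of random numbers standing in for real audio; it measures how much error creeps in when nothing is changed between the FFT and the IFFT.

```python
import numpy as np

# A chunk of stand-in "audio": 2048 random samples between -1 and 1.
rng = np.random.default_rng(0)
chunk = rng.uniform(-1.0, 1.0, 2048)

# Forward transform, then straight back again with no changes in between.
spectrum = np.fft.rfft(chunk)
rebuilt = np.fft.irfft(spectrum, n=len(chunk))

# The largest difference between any original and rebuilt sample.
max_error = float(np.max(np.abs(rebuilt - chunk)))
print(f"largest round-trip error: {max_error:.2e}")
```

With double-precision math the error is tiny, far below anything audible. The trouble starts when the representation is modified between the two transforms, which is exactly what time compression does.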

SAY WHAT?

With that understood, let’s look at time compression using a phase vocoder. The earliest time compressors (and the simplest of them) used this method to stretch or shrink the audio without changing its pitch. Note that a phase vocoder bears little resemblance to a channel or musical vocoder, commonly used in pop music to make a musical instrument “speak.” Instead, the phase vocoder uses phase information to predict how frequencies change over short periods of time.

To fully appreciate the magic of time compression, you must grasp this principle: rather than manipulating individual bits of digital audio, the phase vocoder actually manipulates an equation that describes that digital audio. Here is an abbreviated version of the phase vocoding process for time compression, broken down into steps:

1. Break the audio into small “windows,” typically 2048 samples in length for audio at 44.1kHz.

2. Perform an FFT on a single window to turn it into a mathematical Fourier representation with values.

3. Apply mathematical operations and processes to this Fourier representation to make the time-base longer or shorter, thus changing the speed of the sound.

4. Perform an IFFT to turn the representation back into windows of audio samples, and connect those windows together to recover the entire audio clip.

As you can see, it is in step 3 that the stretching or shrinking actually takes place, but the vocoder is modifying an equation rather than the audio itself. It then reverses the process using the modified equation to re-build the digital audio.
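
The four steps can be sketched in a few dozen lines of Python. This is a bare-bones textbook phase vocoder, assuming NumPy and mono audio; it is not the code inside any of the plug-ins reviewed here. The function name, the `rate` parameter, and the hop of 512 samples are illustrative assumptions.

```python
import numpy as np


def phase_vocoder_stretch(x, rate, n_fft=2048, hop=512):
    """Change the length of mono audio x by 1/rate without changing pitch.

    rate > 1 compresses (1.25 is 25% faster); rate < 1 expands.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(n_fft)

    # Step 1: break the (zero-padded) audio into overlapping windows.
    padded = np.concatenate([x, np.zeros(n_fft)])
    n_frames = 1 + (len(padded) - n_fft) // hop

    # Step 2: FFT each window.
    frames = np.array([
        np.fft.rfft(window * padded[i * hop:i * hop + n_fft])
        for i in range(n_frames)
    ])

    # Phase advance each bin's center frequency makes in one hop.
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft

    # Step 3: walk the analysis frames at `rate` frames per output frame.
    positions = np.arange(0, n_frames - 1, rate)
    phase = np.angle(frames[0])
    out = np.zeros(len(positions) * hop + n_fft)

    for k, pos in enumerate(positions):
        i = min(int(pos), n_frames - 2)
        frac = pos - i
        # Interpolate magnitude between the two neighboring frames.
        mag = (1 - frac) * np.abs(frames[i]) + frac * np.abs(frames[i + 1])

        # Step 4: IFFT back to samples and overlap-add into the output.
        out[k * hop:k * hop + n_fft] += window * np.fft.irfft(
            mag * np.exp(1j * phase), n=n_fft)

        # Advance the running phase by the measured per-hop phase change,
        # so each frequency keeps its original rate of rotation (its pitch).
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + omega + dphi

    # Undo the gain from the overlapping squared Hann windows.
    out *= hop / np.sum(window ** 2)
    return out[:int(round(len(x) / rate))]
```

Calling `phase_vocoder_stretch(audio, 1.25)` returns a clip 1/1.25 the original length at the original pitch. Nothing in this sketch treats transients specially, so it will smear drums in just the way described next.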

While the phase vocoder works reasonably well for simple waveforms like speech, it begins to fail on transients such as drum loops at even low rates of compression or expansion. The resulting artifacts include pre-echo and a “smeared” quality to the sound. Recent improvements to the phase vocoder have improved its performance, but the smearing still remains with complex material.

Time Domain Harmonic Scaling (TDHS) is an alternative method for time compression that generates fewer artifacts. TDHS is based on estimating the fundamental frequency of the sound in the window to be processed. The time-base is changed by copying the input samples to the output in an overlap-and-add manner, while simultaneously reading the input samples more quickly by a factor related to the estimated fundamental pitch. This results in the input samples being read at a different speed than their sampling rate, either faster or slower, while aligning them to the estimated fundamental frequency.
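
To make the overlap-and-add idea concrete, here is a deliberately naive Python sketch, assuming NumPy. It shows only the copy-and-overlap half of the story: frames are read from the input at one spacing and written to the output at another. It does not do the pitch alignment that makes TDHS work, so treat it as plain overlap-add, not TDHS. The function name and frame size are illustrative assumptions.

```python
import numpy as np


def ola_stretch(x, rate, frame=1024):
    """Naive overlap-add time scaling with no pitch alignment.

    rate > 1 compresses (input is read faster than it is written);
    rate < 1 expands.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame:
        raise ValueError("input must be at least one frame long")
    if rate <= 0:
        raise ValueError("rate must be positive")

    hop_out = frame // 2                          # output frames overlap by half
    hop_in = max(1, int(round(hop_out * rate)))   # input read spacing

    # Periodic Hann window: two half-overlapped copies sum to exactly 1.
    window = np.hanning(frame + 1)[:frame]

    n_frames = 1 + (len(x) - frame) // hop_in
    out = np.zeros((n_frames - 1) * hop_out + frame)
    for k in range(n_frames):
        segment = x[k * hop_in:k * hop_in + frame]
        out[k * hop_out:k * hop_out + frame] += window * segment
    return out
```

Because neighboring frames are pasted together wherever they happen to fall, their waveforms often cancel or double up at the seams. Lining the frames up on the estimated pitch period is what TDHS adds to avoid that.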

This algorithm works well with signals that have a prominent fundamental frequency, and can be used with all kinds of monophonic signals including speech. It also works well with complex musical sound, provided that the overlap is adjusted to make artifacts less audible. The basic problem with TDHS is in estimating the pitch period of the sound, especially in cases where the actual fundamental frequency is missing. But pitch detection algorithms have been around since the 1970s, and are now quite accurate.
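
One classic way to estimate the pitch period is autocorrelation: slide the window against a copy of itself and see which delay lines up best. The sketch below assumes NumPy; the function name and the 60 to 500 Hz search range are illustrative assumptions, and commercial pitch detectors are considerably more sophisticated. The demo signal contains only the second and third harmonics of 110 Hz, so the fundamental itself is missing.

```python
import numpy as np


def estimate_pitch_hz(window, sample_rate, fmin=60.0, fmax=500.0):
    """Estimate the fundamental frequency of one window by autocorrelation."""
    window = np.asarray(window, dtype=float)
    window = window - np.mean(window)             # remove any DC offset

    # Autocorrelation for delays (lags) of 0, 1, 2, ... samples.
    corr = np.correlate(window, window, mode="full")[len(window) - 1:]

    # Only search lags that correspond to plausible pitches.
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(corr) - 1)

    # The best-matching lag is the pitch period in samples.
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / lag


SAMPLE_RATE = 44100
t = np.arange(2048) / SAMPLE_RATE
# Harmonics at 220 Hz and 330 Hz only; the 110 Hz fundamental is absent.
signal = np.sin(2 * np.pi * 220 * t) + np.sin(2 * np.pi * 330 * t)
pitch = estimate_pitch_hz(signal, SAMPLE_RATE)
print(f"estimated fundamental: {pitch:.1f} Hz")
```

The estimate lands close to 110 Hz even though there is no energy at that frequency, because the waveform as a whole still repeats 110 times per second.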

Most commercial time compression plug-ins use a combination of the phase vocoder and TDHS techniques. With the increased power of desktop computers, software manufacturers have been able to tweak and expand these basic DSP algorithms to improve performance and reduce artifacts without increasing processing time. These algorithms are therefore proprietary to the companies and most are jealously guarded, so it’s difficult to know exactly how they’re performing the magic. Most time compression plugs process audio offline rather than in real time, due in part to the complex math involved in the DSP magic.

♦
