Spatial audio
Updated: Jun 10, 2025
An essential element of fully immersive audio is spatialization: the ability to play a sound as if it is positioned at a specific point in three-dimensional space. Spatialization is key to delivering an immersive experience because it provides powerful cues that make users feel they are actually in the 3D environment.
The two key components of spatialization are direction and distance. This guide covers both of these topics and the technologies that enable them.
Localization and the human auditory system
Localization, in the context of immersive experiences, refers to the brain’s ability to determine the 3D location of a sound in the real world. Despite having only two ears, humans are able to deduce the 3D position of the sounds around them, relying on audio cues such as timing, phase, level, and spectral modifications.
This section summarizes how humans localize sound, and how you can apply that knowledge to spatialization in your immersive app: taking a monophonic sound and transforming its signal so that it sounds like it comes from a specific point in 3D space.
Below are the cues humans use to determine the direction of a sound source.
Laterally localizing a sound is the simplest type of localization. When a sound is closer to the left, the left ear hears it before the right ear does, and hears it louder. The closer these differences are to parity, the more centered the sound appears, generally speaking.
There are, however, some interesting details. First, humans localize a sound based on the delay between the sound’s arrival at each ear, known as the interaural time difference (ITD). Humans may also localize a sound based on the difference in the sound’s volume level at each ear, known as the interaural level difference (ILD). Which cue dominates depends heavily on the frequency content of the signal.
Sounds below a certain frequency (anywhere from 500 to 800 Hz, depending on the source) are difficult to distinguish based on level. However, sounds in this frequency range have half wavelengths greater than the dimensions of a typical human head, allowing us to rely on timing information (or phase) between the ears without confusion.
At the other extreme, sounds with frequencies above approximately 1500 Hz have half wavelengths smaller than the typical head. Phase information is therefore no longer reliable for localizing the sound. At these frequencies, humans rely on level differences caused by head shadowing, or the sound attenuation that results from our heads obstructing the far ear.
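To make these crossover frequencies concrete, here is a quick sketch of the half-wavelength calculation behind them (the 343 m/s speed of sound and the head-width figures in the comments are typical assumed values):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C (assumed)

def half_wavelength_m(frequency_hz):
    """Half wavelength of a sound in air at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz / 2.0

# A typical head is roughly 0.15-0.18 m wide. Below ~800 Hz the half
# wavelength exceeds the head, so phase (timing) cues are unambiguous;
# above ~1500 Hz it is smaller than the head, so level cues take over.
print(round(half_wavelength_m(800.0), 3))   # 0.214 m
print(round(half_wavelength_m(1500.0), 3))  # 0.114 m
```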

Visualization of head shadowing, when sound is attenuated by our heads obstructing whichever ear is farthest away from the source.
Humans also key on the difference in the time of the signal’s onset: when a sound is played, which ear hears it first plays a big part in determining its location. There is a transitional zone between roughly 800 Hz and 1500 Hz in which both level differences and time differences are used for localization.
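As an illustration of the ITD cue, here is a minimal sketch using Woodworth’s classic spherical-head approximation (the head radius and speed of sound are assumed typical values, and the function name is our own):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)
HEAD_RADIUS = 0.0875    # m, average adult head (assumed)

def itd_seconds(azimuth_deg):
    """Interaural time difference for a distant source via Woodworth's
    spherical-head model: ITD = (r / c) * (sin(theta) + theta),
    valid for azimuths from 0° (straight ahead) to 90° (to the side)."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (math.sin(theta) + theta)

# A source directly to one side (90°) gives the maximum ITD, about
# 0.66 ms for an average head; a centered source (0°) gives zero.
print(round(itd_seconds(90.0) * 1000, 2))  # 0.66
```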
Front-versus-back localization is significantly more difficult than lateral localization. Humans cannot rely on interaural time or level differences alone, since both may be zero for a sound directly in front of or behind the listener.
In the following figure, see how sounds at locations A and B would be indistinguishable from each other since they are the same distance from both ears, giving identical level and time differences.

Visualization of front and back ambiguity, showing how identical sounds from the front and back could be indistinguishable.
Humans rely on spectral modification of sounds caused by the head and body to resolve this ambiguity. These spectral modifications are filters and reflections of sound caused by the shape and size of the head, neck, shoulders, torso, and especially the outer ears (or pinnae). Because sounds originating from different directions interact with the geometry of our bodies differently, our brains use spectral modification to infer the direction of the sound’s origin. For example, sounds approaching from the front produce resonances created by the interior of our pinnae, while sounds from the back are shadowed by our pinnae. Similarly, sounds from above may reflect off our shoulders, while sounds from below are shadowed by our torso and shoulders.
These reflections and shadowing effects combine to create a direction-selective filter.
Simply turning our head changes difficult front/back ambiguity problems into lateral localization problems that humans are better equipped to solve.
In the following figure, sounds at A and B are indistinguishable from each other based on level or time differences, since the distances to each ear are identical. By turning their head slightly, the listener alters the time and level differences between the ears, helping to disambiguate the location of the sound: D1 is now shorter than D2, which cues the listener that the sound is closer to the left, and therefore behind them.

Visualization showing how head movement can disambiguate the location of sound coming from the front and back.
Likewise, cocking our heads can help disambiguate objects vertically. In the following figure, the listener cocks their head, which results in D1 shortening and D2 lengthening. This helps the listener determine that the sound originated above their head instead of below it.

Visualization showing how cocking one's head to the side can help determine whether a sound originated from above or below.
The sounds humans experience are directly shaped by the geometry of the body (especially the outer ears), as well as by the direction of the incoming sound. These two elements, the body and the direction of the audio source, form the basis of Head-Related Transfer Functions (HRTFs), which are filters used to localize sound. The direction-selective filter described above can be encoded as an HRTF, and the HRTF is the cornerstone of most modern 3D sound spatialization techniques.
The most accurate method of HRTF capture is to take an individual, put a pair of microphones in their ears (right outside the ear canal), place the subject in an anechoic chamber (a room with no echo), play sounds in the chamber from every direction identified as important to the experience you are creating, and record these sounds from the ear-mounted microphones. Then compare the original sound with the captured sound and compute the transfer function between them; this is the Head-Related Transfer Function.
Do this for both ears, capturing sounds from a sufficient number of discrete directions to build a usable sample set. This captures HRTFs for only one specific person, however. While HRTFs are personal, they are similar enough to each other that a generic reference set is adequate for most situations, especially when combined with head tracking.
Most HRTF-based spatialization implementations use one of a few publicly available data sets like those outlined below. These are captured either from a range of human test subjects or from a synthetic head model such as the KEMAR.
Most HRTF databases do not have HRTFs in all directions. For example, there is often a large gap representing the area beneath the subject’s head, as it is difficult, if not impossible, to place a speaker one meter directly below an individual’s head. Some HRTF databases are sparsely sampled, including HRTFs only every 5 or 15 degrees.
Most implementations either snap to the nearest acquired HRTF (which exhibits audible discontinuities as a source moves) or use some method of HRTF interpolation. This is an ongoing area of research, but for immersive applications on desktops, it is often adequate to find and use a sufficiently dense data set.
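A minimal sketch of blending between measured HRIRs on a sparse grid (the function names, the 15-degree grid, and the dict layout are assumptions for illustration; production systems interpolate more carefully, often in the frequency domain, to avoid comb-filter artifacts):

```python
def interpolate_hrir(hrir_a, hrir_b, frac):
    """Linearly blend two measured head-related impulse responses (HRIRs)
    sampled at adjacent grid directions. frac=0.0 returns hrir_a and
    frac=1.0 returns hrir_b."""
    return [(1.0 - frac) * a + frac * b for a, b in zip(hrir_a, hrir_b)]

def hrir_for_azimuth(azimuth_deg, grid, step_deg=15.0):
    """Select or blend HRIRs from a dict keyed by azimuths measured
    every step_deg degrees."""
    lo = (azimuth_deg // step_deg) * step_deg
    hi = (lo + step_deg) % 360.0
    frac = (azimuth_deg - lo) / step_deg
    return interpolate_hrir(grid[lo], grid[hi], frac)

# Toy two-tap "HRIRs" at 0° and 15°; a source at 7.5° blends them evenly:
grid = {0.0: [1.0, 0.0], 15.0: [0.0, 1.0]}
print(hrir_for_azimuth(7.5, grid))  # [0.5, 0.5]
```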
Given an HRTF set, once you know the direction a sound should appear to come from, you can select an appropriate HRTF and apply it to the sound. This is usually done either as a convolution in the time domain or as a multiplication in the frequency domain using the FFT, which is mathematically equivalent but typically faster for long filters. Since HRTFs take the listener’s head geometry into account, it is important to use headphones when performing spatialization. Without headphones, you are effectively applying two HRTFs: the simulated one, and the actual HRTF created by the geometry of the listener’s body.
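A minimal sketch of HRTF application by direct time-domain convolution (all names are illustrative; a real engine would run FFT-based overlap-add convolution on streaming audio buffers):

```python
def convolve(signal, impulse_response):
    """Direct time-domain convolution. Equivalent to frequency-domain
    multiplication via the FFT, which is far faster for impulse
    responses longer than a few dozen taps."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Apply a left/right HRIR pair to a mono signal, producing the
    binaural stereo pair intended for headphone playback."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# A unit impulse convolved with an HRIR returns the HRIR itself:
print(convolve([1.0, 0.0], [0.5, 0.25]))  # [0.5, 0.25, 0.0]
```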
Listeners instinctively use head motion to disambiguate and fix sound in space. Take this ability away, and their capacity to locate sounds in space is diminished, particularly with respect to elevation and front/back position. Even setting localization aside, sound reproduction that cannot compensate for head motion is mediocre at best: when a listener turns their head 45 degrees to the side, the audio must respond accurately, or immersion will be lost.
Immersive headsets such as the Meta Quest provide the ability to track a listener’s head orientation and position. By providing this information to a sound package, you can update the spatial audio to make it feel like the sound is grounded in the virtual world as the user moves through the space. (This assumes that the listener is wearing headphones.) It is possible to mimic this with a speaker array, but it is significantly less reliable, more cumbersome, and more difficult to implement, and thus impractical for most immersive applications.
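As a minimal sketch of how head tracking feeds spatialization (the function name and the yaw-only treatment are assumptions for illustration; real head tracking supplies full 3D orientation and position):

```python
def world_to_head_azimuth(source_azimuth_deg, head_yaw_deg):
    """Convert a source direction in world coordinates into a direction
    relative to the listener's head, given the tracked head yaw. The
    head-relative angle is what drives the HRTF lookup, so the sound
    stays anchored in the virtual world as the head turns."""
    return (source_azimuth_deg - head_yaw_deg) % 360.0

# A source fixed at 90° in the world appears dead ahead (0°) once the
# listener has turned 90° toward it:
print(world_to_head_azimuth(90.0, 90.0))  # 0.0
```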
Distance localization and distance modeling
ILD, ITD, and HRTFs help us determine the direction to a sound source, but they offer relatively sparse cues for determining the distance to it. Listeners use a combination of the following factors to estimate distance.
Loudness is the most obvious distance cue, but it can be misleading. If a listener lacks a frame of reference, they can’t judge how much the sound has diminished in volume from its source, and thus can’t estimate a distance. Fortunately, listeners will be familiar with many of the sources they encounter in life, such as musical instruments, human voices, animals, and vehicles, so you can predict these distances reasonably well. For synthetic or unfamiliar sounds, listeners may have no such frame of reference, and must rely on other cues or on relative volume changes to judge whether a sound is approaching or receding.
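The loudness cue is usually modeled with inverse-distance attenuation, which loses roughly 6 dB of level per doubling of distance. A sketch (the clamping distance and function names are assumptions):

```python
import math

def distance_gain(distance_m, reference_m=1.0, min_distance_m=0.1):
    """Inverse-distance attenuation for a point source: the level falls
    about 6 dB per doubling of distance. The clamp avoids the
    singularity as the distance approaches zero."""
    return reference_m / max(distance_m, min_distance_m)

def gain_db(distance_m, reference_m=1.0):
    """The same attenuation expressed in decibels."""
    return 20.0 * math.log10(distance_gain(distance_m, reference_m))

print(round(gain_db(2.0), 2))  # -6.02 (one doubling)
print(round(gain_db(4.0), 2))  # -12.04 (two doublings)
```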
Initial time delay describes the interval between the direct sound and its first reflection. If the source is close to the listener, the direct sound arrives almost immediately, and there is a comparatively long delay before the first reflection arrives. If the source is instead close to a wall, the direct and reflected sounds arrive close together.

Initial time delay visualization, a direct sound will arrive immediately while the other will first reflect off a surface.
Anechoic (echoless) or open environments, such as deserts, may not generate appreciable reflections, which makes estimating distances more difficult. Initial time delay is significantly harder to model than loudness, as it requires computing the early reflections for a given set of geometry, along with that geometry’s acoustic characteristics. This is both computationally expensive and awkward to implement architecturally (sending world geometry to a lower-level API is often complex).
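Given the two path lengths, the delay itself is simple to compute; the hard part, as noted, is finding the reflection paths from the scene geometry. A sketch with an assumed 343 m/s speed of sound:

```python
SPEED_OF_SOUND = 343.0  # m/s at ~20 °C (assumed)

def initial_time_delay(direct_path_m, reflected_path_m):
    """Gap (in seconds) between the arrival of the direct sound and the
    arrival of the first reflection, given the two path lengths."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND

# Source 2 m from the listener, first reflection traveling 12 m
# (e.g., bouncing off a far wall):
print(round(initial_time_delay(2.0, 12.0) * 1000, 1))  # 29.2 ms
```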
Ratio of direct sound to reverberation
In a reverberant environment there is a long, diffuse sound tail consisting of all the late echoes interacting with each other, bouncing off surfaces, and slowly fading away. The more listeners hear a direct sound in comparison to the late reverberations, the closer they will assume it is.
Direct to reverberant sound (wet and dry mix) is a natural byproduct of any system that attempts to accurately model reflections and late reverberations. Unfortunately, such systems tend to be very expensive computationally. With ad hoc models based on artificial reverberators, the mix setting can be adjusted in software, but these are strictly empirical models.
Note: This property has been used by audio engineers for decades to move a musical instrument or vocalist “to the front” or “to the back” of a song by adjusting the “wet/dry mix” of an artificial reverb.
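The wet/dry idea can be sketched as a simple blend (the distance-to-wetness mapping below is an ad hoc assumption, not a physical model, and all names are illustrative):

```python
def mix_wet_dry(dry, wet, wet_amount):
    """Blend a dry (direct) signal with a wet (reverberant) signal.
    Pushing wet_amount toward 1.0 moves the source 'to the back' of
    the mix; pulling it toward 0.0 brings the source forward."""
    return [(1.0 - wet_amount) * d + wet_amount * w
            for d, w in zip(dry, wet)]

def wet_amount_for_distance(distance_m, max_distance_m=20.0):
    """Ad hoc mapping: more reverb as the source recedes, clamped
    to [0, 1]."""
    return min(1.0, max(0.0, distance_m / max_distance_m))
```

For example, a source 10 m away with the default 20 m range gets an even 50/50 blend of direct and reverberant signal.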
Motion parallax (the apparent movement of a sound source through space) indicates distance because nearby sounds typically exhibit a greater degree of parallax than far-away sounds. For example, a nearby insect can traverse from the left to the right side of your head very quickly, but a distant airplane may take many seconds to do the same. As a consequence, if a sound source travels quickly relative to a stationary perspective, humans tend to perceive that sound as coming from nearby. Likewise, the magnitude of directional changes for small head movements helps inform humans of the distance of sounds.
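The parallax relationship can be sketched as an angular-rate calculation, ω = v / d (the example speeds and distances are illustrative):

```python
import math

def angular_rate_deg_per_s(tangential_speed_mps, distance_m):
    """Apparent angular velocity of a source moving across the
    listener's position: omega = v / d. Nearby sources sweep across
    the head far faster than distant ones moving at greater speed."""
    return math.degrees(tangential_speed_mps / distance_m)

# An insect 1 m away at 2 m/s vs. an airplane 2000 m away at 200 m/s:
print(round(angular_rate_deg_per_s(2.0, 1.0), 1))       # 114.6 °/s
print(round(angular_rate_deg_per_s(200.0, 2000.0), 1))  # 5.7 °/s
```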
High-frequency attenuation and air absorption
High frequencies attenuate in air faster than low frequencies, so over long distances listeners can infer a bit about distance from how attenuated those high frequencies are. This is a minor effect: sound must travel hundreds or thousands of feet before high frequencies (i.e., well above 10 kHz) are noticeably attenuated, and the attenuation also depends on atmospheric conditions such as temperature and humidity. It is, however, reasonably easy to model by applying a simple low-pass filter. In practice, high-frequency attenuation is not very significant in comparison to the other distance cues.
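A sketch of distance-based high-frequency rolloff using a one-pole low-pass filter (the distance-to-cutoff mapping is an ad hoc assumption, not a calibrated air-absorption model such as ISO 9613-1):

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz=48000.0):
    """One-pole low-pass filter: y[n] = y[n-1] + a * (x[n] - y[n-1]),
    with the coefficient a derived from the cutoff frequency."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, y = [], 0.0
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out

def air_absorption_cutoff(distance_m, near_hz=20000.0, far_hz=4000.0,
                          max_distance_m=1000.0):
    """Ad hoc mapping: lower the filter cutoff linearly with distance
    to mimic high-frequency air absorption."""
    t = min(1.0, distance_m / max_distance_m)
    return near_hz + t * (far_hz - near_hz)
```

A distant source then gets filtered with a lower cutoff, dulling its high end the way real air absorption would over a long path.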
If you’re ready to kick off the technical side of audio for your app, review the following documentation: