Immersive Audio (What Is It & How Does It Work?)

  • Definition of Immersive Audio & The Basic Process
  • A History Lesson
  • Popular Types of Immersive Audio

Immersive audio: a term that has recently enjoyed a revival in popularity across the audio industry.

Thanks to advanced technologies from companies including Dolby Laboratories, DTS, Inc., Auro-3D, and Apple, listeners can now experience music and audio on an even larger and more interactive scale than traditional surround sound allows.

But what exactly is immersive audio? In this article, we dive deep into the definition, process, historical background, and popular types of this new sonic technology. 

Definition of Immersive Audio & The Basic Process

Human hearing and perception include a phenomenon called sound localization.

This is the ability to figure out where a sound source is located in relation to the ears and also figure out how far said source is from the ears. 

With a stereo speaker setup or traditional headphones, there are only two sound sources. Sitting in the sweet spot, the listener relies on timing, phase, and level differences between the channels to perceive the direction of, for example, the different instruments in a musical mix.

True or natural stereo has the sound source captured with an array of multiple mics that record at the same time, resulting in each channel being similar but with its own unique timing and sound pressure levels.

Artificial or pan stereo is when a mono sound is sent to multiple speakers and the pan pots are used to create a perceived stereo field.
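As a sketch of how a pan pot places a mono signal, here's a constant-power pan law (the -3 dB-at-center convention is a common assumption, not a claim about any specific console):

```python
import math

def constant_power_pan(sample: float, pan: float) -> tuple[float, float]:
    """Pan a mono sample into L/R with a constant-power (-3 dB center) law.

    pan: -1.0 = hard left, 0.0 = center, +1.0 = hard right.
    """
    angle = (pan + 1.0) * math.pi / 4.0   # map [-1, 1] onto [0, pi/2]
    left = sample * math.cos(angle)
    right = sample * math.sin(angle)
    return left, right

# At center pan both channels sit at ~0.707 (-3 dB), so the summed
# acoustic power stays roughly constant as the source sweeps across.
l, r = constant_power_pan(1.0, 0.0)
```

Level-only panning like this is exactly what the article contrasts with natural localization below.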

Stereo has been the format to which the vast majority of recorded music is mixed.

There are some qualities of stereo mixes that aren’t reflective of how human hearing works in an actual listening environment.

First, in stereo, humans differentiate the originating location (left, center, right) of a sound source by its relative level between the speakers.

When listening to a sound directly from its source, however, the brain localizes it primarily through interaural time differences.
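As a rough illustration of interaural time differences, the classic Woodworth spherical-head model estimates the delay between the ears for a distant source (the head radius and speed-of-sound values below are typical textbook assumptions):

```python
import math

def itd_woodworth(azimuth_deg: float, head_radius_m: float = 0.0875,
                  speed_of_sound: float = 343.0) -> float:
    """Approximate interaural time difference (seconds) for a far-field
    source, via the Woodworth spherical-head model: ITD = (r/c)(theta + sin theta)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

# A source directly to one side (90 degrees) gives roughly 0.65 ms,
# near the largest delay human hearing exploits for localization.
side_itd = itd_woodworth(90.0)
```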

Second, in stereo, all of the frequencies in the audible frequency spectrum (for humans) are present and reach the listener; however, directly from the source, there’s a decrease in high-end frequencies as the source moves farther from the listener.

Third, the most obvious limitation of stereo is that there is no way to pan audio in front of vs. behind the listener, or above vs. below the listener.

When directly listening to sound, humans perceive sound originating from all directions.

A quick internet search turns up multiple definitions of the term immersive audio. They all vary in the details but share the same core idea.

Distilling the various definitions into one: immersive audio is a sound field, expanded beyond traditional stereo or surround sound, created by adding lateral and overhead loudspeakers to provide the perception of height.

This allows a listening experience that makes it seem as if the sound source is capable of originating from an infinite number of points in the listening environment. 

The Listening Setup For Immersive Audio

Stereophonic, quadraphonic, and traditional 5.1 surround are all referred to as channel-based formats.

This means the ratio between the number of audio channels vs. speakers is fixed. Each channel is routed to a specific speaker.

With the correct setup, the listening experience is great.

The problem occurs when a less optimal setup causes constructive and destructive interference of the sound waves – some components of the sound will get louder and some will get quieter or cancel one another out. 

The need for immersive audio emerged from the desire for more information and localization cues in the listening experience along with an easier way to implement it in each unique listening setting and speaker configuration.

In immersive audio, a data stream from a computer is decoded in real-time and then mapped to designated speakers. An advanced decoding system allows for unique mapping.

Thanks to advances in real-time computing power, the technology can decode immersive data for any possible speaker configuration, as well as feed a binaural encoder for headphone use.

Different immersive audio formats have three categories used to describe their basis: channel, scene, or object.

Channel-based immersive audio can be thought of as maintaining the fixed channel-to-speaker ratio by adding speakers over the listener for additional height information.

Scene-based immersive audio encodes the entire 3D sound field in a single data stream, independent of any particular speaker layout.

Object-based immersive audio feeds a number of separate audio streams, plus metadata, into a decoder; the metadata instructs the decoder where each stream should be placed in the 3D sound field.

In channel-based audio, by contrast, each channel carries information about a specific part of the mix, tied to the specified speaker configuration.

Channel- and scene-based formats have the audio mixed before it reaches the immersive audio encoder, while object-based formats carry the main mix plus metadata instructing how the mix should be rendered in the given listening environment.
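To make the object-based idea concrete, here is a toy sketch of decoding metadata and mapping it to speakers. The speaker layout, function name, and the inverse-square panning rule are all hypothetical simplifications, not how any real renderer works:

```python
# Hypothetical speaker layout: name -> (x, y, z) position in meters.
SPEAKERS = {
    "L":  (-2.0,  2.0, 0.0), "R":  (2.0,  2.0, 0.0),
    "Ls": (-2.0, -2.0, 0.0), "Rs": (2.0, -2.0, 0.0),
    "TopL": (-1.0, 0.0, 2.5), "TopR": (1.0, 0.0, 2.5),
}

def render_object(position, power=1.0):
    """Toy object renderer: weight each speaker by inverse-square distance to
    the object's metadata position, then normalize so the gains sum to 1."""
    weights = {}
    for name, (sx, sy, sz) in SPEAKERS.items():
        d2 = (position[0] - sx) ** 2 + (position[1] - sy) ** 2 + (position[2] - sz) ** 2
        weights[name] = 1.0 / max(d2, 1e-6)  # clamp to avoid division by zero
    total = sum(weights.values())
    return {name: power * w / total for name, w in weights.items()}

# An object placed overhead pulls most of its energy into the top speakers.
gains = render_object((0.0, 0.0, 2.5))
```

The key point is that the same metadata can be re-rendered for any layout simply by swapping the speaker table.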

A History Lesson

In order to understand immersive audio today, one must look back in time at the audio practices that sowed the seeds of what was to come.

The first 2-channel audio setup was demonstrated by Clément Ader at the 1881 Paris Electrical Exhibition via connected telephone transmitters.

This was soon commercialized with the Théâtrophone and the Electrophone in France and England, respectively.

Later, during the 1930s, audio engineer Alan Blumlein would explore different ways to provide listeners with more directionality in stereo and binaural recording.

A big breakthrough occurred on April 27, 1933. An experiment by Harvey Fletcher and Dr. Leopold Stokowski was done with the Philharmonic Symphony Orchestra.

The performance by the orchestra was captured with microphones in the Academy of Science Hall in Philadelphia. That signal was transmitted to Constitution Hall in Washington, D.C., which was 140 miles away.

The effect successfully achieved that day was to make it seem as if the orchestra were performing in Constitution Hall, giving listeners the experience of being present in the concert hall.

The listeners reported the illusion of the orchestra being behind the curtain on the stage. The setup was done with 3 channels. Each channel had its own mic, amp, transmission, and recording gear. 

Evidently, this setup was reliable and yielded good performances.

Fast forward to 1938: film studio MGM started using 3 tracks for recording movie musical soundtracks onto the film reels.

It was later increased to 4 tracks, but the idea was one dialog track, 2 music tracks, and one sound FX track. 

The next big step was the 1940 release of Disney's Fantasia, with a soundtrack recorded by the Philadelphia Orchestra (though the movie didn't do well when it first came out).

The whole thing was a collaboration between Disney and RCA. The Disney-coined Fantasound system was built off of developments of the system Fletcher and Stokowski had demonstrated.

Fantasia's Fantasound also used 4 tracks on the film for its audio. The left, center, and right channels each took up one track. The fourth track carried three tones used to control the levels of the left, center, and right channels.

The setup necessitated the setup of speakers in the rear of the theater and the use of panoramic potentiometers (“pan pots”). In total, ten Fantasound speaker setups were created.

Symphonie pour un homme seul (Symphony for One Man Only) (1950), made by Pierre Schaeffer and Pierre Henry at Radiodiffusion-Télévision Française, was a 22-movement (later reduced to 12) musique concrète piece.

It involved 4 channels being sent through differently-placed speakers: left, right, behind, and overhead.

After WWII, Jack Mullin brought magnetic tape recording technology to the U.S. from Germany (see Analog Tape article).

This gave rise to 35 mm magnetic film among other formats, which by the early 1950s was used for mixing by all of the major film studios.

From 1957 to 1959, Henry Jacobs created the Vortex Concerts. These concerts incorporated abstract images with contemporary avant-garde electronic music.

The concerts took place at San Francisco’s Morrison Planetarium due to it having the advanced audio and visual technology that the pieces required – the planetarium housed a 65-foot dome.

It involved a multi-directional audio system of 40 speakers that were individually controlled by a console. There were 5 Vortex Series concerts. 

Kontakte by Karlheinz Stockhausen is considered to be the first piece written for a quadraphonic system: four audio channels routed to four speakers. It premiered in 1960 in Cologne, Germany.

In the 1970s, there was the quadraphonic format which, as the name suggests, involved four separate tracks.

It was a valiant step toward the immersive audio experience, but alas, it was limited by the technology of its day; it faded out of the mainstream audio industry and never became any sort of standard.

The Acousmonium is a sound diffusion system invented by François Bayle.

It consisted of multiple speakers arranged in front of, around, and within the audience seating.

It was meant to be controlled from a diffusion console. Each speaker can be thought of in terms of its placement in the foreground or background of the sound field.

This position is then used to determine what is to be sent to the speakers.

Michael Gerzon developed Ambisonics in the 1970s. The intention was to make the audio format independent of the speaker configuration.

In other words, the goal was to have a setup where the audio format could be adapted to any number of speakers of any position.

Ambisonics wasn't really a commercial success. It used four audio channels but couldn't reproduce sound with clear enough directionality.

In 1982, Dolby Stereo was adapted for home use, called Dolby Surround.

At the time, home video recordings were starting to be released as hi-fi stereo. The Dolby Pro Logic system was used to decode the Dolby Surround encoded audio. 

Higher order Ambisonics (HOA) was developed in the 1990s. It was a method that allowed the user to store and reproduce a sound field at a specific point with spatialization consistent with the original recording environment.

It improved upon the pitfall of Ambisonics’ lack of spatial resolution.

In 1991, Naut Humon founded Recombinant Media Labs in order to research the potential of spatial audio in the cinema setting. That same year, Dolby AC-3 was first released as the Dolby Digital standard. 

A new system for reproducing audio, wave field synthesis (WFS), was introduced in 1988 by A.J. Berkhout at Delft University of Technology, building on the wave-propagation principle described by Christiaan Huygens. This method stands out in that it doesn't use psychoacoustic techniques to operate.

Star Wars: Episode I – The Phantom Menace (1999) became the 1st film released with Dolby Digital Surround EX. Dolby Surround 7.1 was introduced in 2010. 

In 2000, an improved version of Dolby Pro Logic, called Dolby Pro Logic II (DPL II) was introduced. Dolby Pro Logic II was followed by DPL IIx in 2003. 

In 2012, the first implementation of Dolby Atmos occurred during the premiere of Pixar's Brave at the El Capitan Theatre in Los Angeles.

The theater was the first one to have Dolby Atmos technology installed.

More firsts for Dolby Atmos followed. The Pro Logic systems were soon replaced with Dolby Surround (this was a reboot of the past use of the term).

Dolby Atmos was said to be a new expanded surround sound listening experience with the new use of the overhead (vertical) dimension. 

Dolby Atmos was demonstrated for home theater at the CEDIA Expo in 2014.  

In 2016, the third season of Power became the first TV show natively mixed and aired in Dolby Atmos. That same year, Game of Thrones was upmixed from 5.1 to Dolby Atmos for its Blu-ray release.

The first major music release in Dolby Atmos was the anniversary release of R.E.M.’s album, Automatic for the People in 2017.

Popular Types of Immersive Audio

Dolby Atmos

This is the surround sound technology created by Dolby Laboratories.

It can be described as a hybrid immersive audio format, combining channel-based beds with audio objects. At present, the format can run up to 128 audio tracks along with their associated spatial metadata.

The metadata includes things such as location and pan automation data. Each audio track can be assigned to a bed or to an audio object.

An object is an audio signal and its metadata specifies the apparent sound source location as a set of 3D coordinates.

These coordinates are defined as relative to audio channel locations and theater boundaries defined by the user.

A bed is a group of tracks with fixed routing to the user's speaker setup, in layouts ranging from stereo up to 7.1.2.

In movie theaters, the Dolby Atmos Cinema Processor is used. This has up to 128 discrete audio tracks and up to 64 unique speaker feeds.

Each speaker gets its own unique feed based on its actual location in the theater.

The theater’s system renders the objects in real time, and this, along with the 3D coordinate system results in objects being perceived as if originating from said desired coordinates.

Bed channel configurations differ by application. Dolby Atmos for theaters uses a 9.1 bed (for ambience stems or center dialog), leaving 118 tracks for objects.

Dolby Atmos for home theater applications uses one bed channel for the LFE, usually leaving up to 118 dynamic objects. In a 7.1.4 setup, there's a 7.1 layout with 4 overhead speakers.

This can be done by extending a 5.1 setup by grouping the speakers into virtual arrays.

When compared to the movie theater setup, the Dolby Atmos home setup has more limited bandwidth, significantly less processing power, and only supports up to 7.1.4 channels.

In video games, Dolby Atmos uses an intermediate spatial format (ISF). There are 32 total active objects with 7.1.4 beds and 20 dynamic objects. The first video game to use Dolby Atmos was Star Wars: Battlefront.

For music, so far, Dolby Atmos is used by Tidal and Apple Music. Tidal uses E-AC3 while Apple Music uses spatial audio with “support for Dolby Atmos and lossless audio.”

In the video streaming realm, Netflix, Disney Plus, Vudu, Apple TV Plus, and HBO Max (movies only) are known platforms that have Dolby Atmos capability.

Dolby Atmos is available as plugins: Dolby Atmos Renderer, Dolby Atmos Production Suite, or Dolby Atmos Mastering Suite. It also is available in the large mixing consoles Neve DFC and the Harrison MPC5. 


Ambisonics

Ambisonics involves representing and/or describing the sound field at a point in space. It's described as a scene-based immersive audio format.

There are variations in mic setups. A tetrahedral array with 4 cardioid capsules can be used; this setup is common in post-production, and the resulting channels offer more flexibility for height and depth.

Another mic setup is the native/Nimbus/Halliday array: 3 coincident mics (one omni, one forward-facing figure-8, and one left-facing figure-8). This records directly into Ambisonic B-format (see below).


In Ambisonics, there are letters that designate which stage of the recording and processing the format is in.

A-format is the raw, unprocessed output of the microphone capsules in the recording setup (for a tetrahedral array, the four cardioid capsules).

B-format is the set of four signals derived from the A-format. The conversion is done with matrix math in software or hardware.

The conversion approximates the sound field on a sphere around the mic, transforming the capsule outputs into four signals: W, X, Y, and Z.

The W signal is the encoded omnidirectional sound pressure component. The X, Y, and Z signals are the encoded pressure-gradient (velocity) components along the front-back, left-right, and up-down axes.
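For a tetrahedral array, the A-to-B conversion is essentially a sum/difference matrix. This sketch assumes the common FLU/FRD/BLD/BRU capsule naming; real converters also apply scaling and capsule-spacing corrections that are omitted here:

```python
def a_to_b_format(flu: float, frd: float, bld: float, bru: float):
    """Convert tetrahedral A-format capsule samples into B-format (W, X, Y, Z).

    Capsules: front-left-up (FLU), front-right-down (FRD),
    back-left-down (BLD), back-right-up (BRU).
    Applied sample by sample; normalization conventions vary.
    """
    w = flu + frd + bld + bru   # omnidirectional pressure component
    x = flu + frd - bld - bru   # front-back velocity component
    y = flu - frd + bld - bru   # left-right velocity component
    z = flu - frd - bld + bru   # up-down velocity component
    return w, x, y, z
```

Note that equal signal in all four capsules (a perfectly diffuse, centered source) produces only a W component, as expected.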

B-format is used in mic-emulation software and 360-degree video productions. Horizontal-only mic emulation is achieved by discarding the Z signal and combining the remaining components into virtual polar patterns.

C-format is a matrix encoding that folds B-format down for transmission over fewer channels. The vertical information is discarded, compromising compatibility and directional resolution; the format sees little use today, since current technology allows full-bandwidth 4-channel audio delivery.

D-format represents the rendering for a specific speaker array. It involves a phase-amplitude decoding matrix, sometimes with a set of shelf filters; there's no fixed channel plan.

It's B-format audio decoded for a specific speaker array, with each feed derived as a linear combination of the ambisonic component signals. People usually just call this the speaker setup instead of the "D-format."
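A minimal horizontal-only decode illustrates the "linear combination" idea: each speaker feed is just a weighted sum of the B-format components. The weighting below is one basic convention; production decoders add normalization and shelf filtering:

```python
import math

def decode_horizontal(w: float, x: float, y: float, speaker_azimuths_deg):
    """Naive first-order horizontal ambisonic decode for a ring of speakers.

    Each feed is a fixed linear combination of W, X, and Y determined only
    by the speaker's azimuth, which is all a basic decoder matrix does.
    """
    feeds = []
    for az in speaker_azimuths_deg:
        phi = math.radians(az)
        feeds.append(0.5 * (w * math.sqrt(2.0)
                            + x * math.cos(phi)
                            + y * math.sin(phi)))
    return feeds

# A quad ring at 45/135/225/315 degrees fed a pure omni (W-only) signal:
feeds = decode_horizontal(1.0, 0.0, 0.0, [45, 135, 225, 315])
```

An omni-only signal lands equally in every speaker; adding X or Y energy skews the feeds toward the corresponding direction.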

A more obsolete term is the G-format, which refers to pre-decoded ambisonic material to 5.1. It was used in the early days of DVD. 

Higher-order Ambisonics refers to adding further groups of directional components to the B-format. This increases spatial resolution and enlarges the sweet spot. It isn't possible to directly mic above first order with a simple coincident array.

In order to achieve higher order, difference signals have to be derived from spaced omni capsules with a lot of DSP.

An ambisonic encoder can take mono sources and pan them into the sound field. There are usually two controls: the azimuth (horizontal angle) and the elevation angle.

Sometimes there is a radius control that is there to compensate for the near-field effect (distance-dependent bass boost/attenuation).
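The two panning controls can be sketched as a first-order encoder. The -3 dB weighting on W is a FuMa-style assumption; other normalization conventions (SN3D, N3D) exist, and the radius/near-field compensation is omitted:

```python
import math

def encode_first_order(sample: float, azimuth_deg: float, elevation_deg: float):
    """Pan a mono sample into first-order B-format (W, X, Y, Z).

    azimuth: counter-clockwise from straight ahead;
    elevation: upward from the horizontal plane.
    These are exactly the two controls described above.
    """
    a = math.radians(azimuth_deg)
    e = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)            # omni component, -3 dB by convention
    x = sample * math.cos(a) * math.cos(e)  # front-back
    y = sample * math.sin(a) * math.cos(e)  # left-right
    z = sample * math.sin(e)                # up-down
    return w, x, y, z

# A source panned straight up puts all directional energy into Z.
w, x, y, z = encode_first_order(1.0, 0.0, 90.0)
```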

Ambisonics is still found in current technology: Sennheiser offers a first-order tetrahedral VR mic, and Zoom offers an Ambisonics field recorder.

Dolby also acquired and liquidated imm Sound, a Spanish company that specialized in ambisonics, before it launched Dolby Atmos.


Auro-3D

This immersive 3D format was created by Auro Technologies. It has 3 layers of sound built onto a single horizontal layer of the 5.1 or 7.1 surround format.

The three layers are the surround, height, and overhead ceiling. The spatial sound field is created when the height layer is added around the listeners.

This gives the sound more localization cues. There are also additional height reflections that end up in the information of the lower surround layer.

The height info captured during recording is mixed into a 5.1 surround PCM carrier. When played back, a decoder extracts the original height channels from the said stream.


To use this immersive format, the playback setup requires an Auro-3D engine (consisting of the Auro-Codec and the Auro-Matic upmixing algorithm, which converts legacy formats into Auro-3D) and the Creative Tool Suite, a set of plugins for creating the native immersive 3D audio environment.

Auro-3D is compatible with existing productions and theater processes and systems.

There are different versions of Auro-3D for home theaters, large cinematic theaters, and headphones.


DTS & DTS:X

This immersive audio format was made by DTS (Digital Theater Systems), Inc.

It’s considered the main rival to the Dolby format.

DTS differs in that it operates on the principle that the surround channels should be directional instead of diffused, and the audio is put on a CD-ROM that’s separate from the film.

This gives it the advantage of a much bigger storage capacity than a soundtrack printed on the film itself, allowing higher audio fidelity and supposedly better dynamic range.

With the separate audio media from the film, said physical media is not subjected to the normal wear and tear on the film print.

To implement it, a 24-bit timecode is optically imaged onto film. An LED reader scans said timecode and sends this data to the DTS processor.

The timecode is used to sync the projected film image with the DTS soundtrack audio.

Multi-channel DTS audio is recorded into a compressed format on a standard CD-ROM at 882 kbit/sec bitrate.

An APT-X100 system is used to compress the audio at a fixed 4:1 ratio, using sub-band coding with linear prediction and adaptive quantization.

The DTS CD-ROM timecode and DTS timecode on the 35 mm film print have the same movie title labels to match the audio with the picture for showing. 

The DTS processor acts like a transport device, holding and reading the audio on the discs. Initially, 1 disc or 2 discs were used, but now 3 discs are used.

One DTS processor works with the 2-disc film soundtrack while the 3rd disc is used for the film trailers.

Each DTS CD-ROM has a DOS program that the DTS processor uses for soundtrack playback. The DOS program makes it easier to add updates.

The theatrical version of DTS has 5 discrete channels on the CD-ROMs. The LFE track is mixed into the discrete surround channels on the disc and then recovered by a low-pass filter in the theater.

There are different versions of DTS, but the main immersive format, that is compared to Dolby Atmos the most often, is the DTS:X format.

In this version, the location (direction from the listener) and the objects (audio tracks) are specified as polar coordinates. The DTS processor dynamically renders audio outputs based on the number and layout of the speakers being used.

It's the closest competitor to Dolby Atmos, and DTS:X for cinema settings can be thought of as Dolby Atmos plus Auro-3D combined.

There’s also DTS Surround Sensation (formerly named DTS Virtual) which enables virtual 5.1 surround sound through standard headphones.

This uses DTS Headphone:X which has metadata that can be encoded on top of a 2-channel lossy DTS bitstream, giving it the capability of reproducing 12 channels of binaural surround sound for any pair of stereo headphones.
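The underlying binaural idea, convolving each channel with a head-related impulse response (HRIR) pair and summing into two ear signals, can be sketched generically. This is not DTS's actual Headphone:X processing; the toy one-tap kernels stand in for measured HRTF data:

```python
def convolve(signal, kernel):
    """Plain discrete convolution (no external dependencies)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def binaural_render(channels, hrirs_left, hrirs_right):
    """Binaural downmix sketch: convolve each speaker-feed channel with its
    left/right HRIR and accumulate into a single stereo pair."""
    length = max(len(c) + max(len(l), len(r)) - 1
                 for c, l, r in zip(channels, hrirs_left, hrirs_right))
    left = [0.0] * length
    right = [0.0] * length
    for ch, hl, hr in zip(channels, hrirs_left, hrirs_right):
        for i, v in enumerate(convolve(ch, hl)):
            left[i] += v
        for i, v in enumerate(convolve(ch, hr)):
            right[i] += v
    return left, right

# One channel, toy one-tap HRIRs: the ear signals are just scaled copies.
left, right = binaural_render([[1.0, 0.0, 0.0]], [[0.5]], [[0.25]])
```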

It features a head-related transfer function created especially by DTS and has compensation for unique room or studio acoustic features.

Wave Field Synthesis (WFS)

As mentioned before, WFS doesn’t use psychoacoustic techniques.

The purpose is to create a space where every spatial position is the “sweet spot.” A large number of speakers simulate and synthesize a virtual acoustic environment.

The speakers are arranged in arrays around the listener and a computer synthesizes and controls the individual speakers’ membranes, simulating the virtual wavefront passing through the listening space.
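A heavily simplified sketch of the geometry: each speaker in the array gets a delay and attenuation so the superposed wavefronts approximate a virtual point source. Real WFS driving functions add spectral correction and tapering windows, which are omitted here:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s

def wfs_driving_params(source_pos, speaker_positions):
    """Per-speaker (delay, gain) for a virtual point source behind a line
    of speakers: delay from propagation distance, 1/sqrt(r) spreading gain.
    """
    params = []
    for (sx, sy) in speaker_positions:
        r = math.hypot(source_pos[0] - sx, source_pos[1] - sy)
        delay_s = r / SPEED_OF_SOUND          # wavefront arrival offset
        gain = 1.0 / math.sqrt(max(r, 0.1))   # geometric spreading (clamped)
        params.append((delay_s, gain))
    return params

# A virtual source 2 m behind the center of a 5-speaker line array:
line = [(x, 0.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
params = wfs_driving_params((0.0, -2.0), line)
```

The center speaker fires first and loudest; speakers toward the edges fire later and softer, recreating the curved wavefront of a source behind the array.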

Highlights of this technique include being able to make a sound be perceived as originating outside the physical placements of the speakers and from within the listener’s head.

It requires a high-powered computer to process the advanced algorithms that are calculating signals.

The WFS Principle


Clearly, immersive audio is a format that's gaining more traction every day.

In the current music and audio industry, an increasing number of engineers are taking it upon themselves to learn the ins and outs of the various versions of immersive audio.

The desire to have a listening experience that includes more surround information, along with the additional height layer, has been present since the start of audio recording and reproduction.

These new advances in audio technology have allowed engineers to provide listeners with an experience that is akin to the sound source actually being in the room with them.