Fundamentals of Sound
MODULE 7: SOUND SOURCE AUDITORY LOCALIZATION

 

Perceptual attributes of acoustic waves

SOUND SOURCE AUDITORY LOCALIZATION

Introduction / Definitions
Auditory Localization Cues
   Interaural Intensity/Level Differences (IID / ILD) & Interaural Time/Phase Differences (ITD / IPD)
   Monaural & Interaural Spectral Differences - HRTFs / ATFs
   Binaural Cues & "Release from Masking"
   "Cone of Confusion" - Head-Movement Localization Cues
   Judging Sound Source Distance
The Precedence Effect
 
Optional Section: Sound source localization neural mechanisms

 

 


 

Sound-Source Auditory Localization
Introduction / Definitions

 

The term auditory localization describes judgments about the location (direction & distance), movement, and size of a sound source, based solely on auditory cues.

Systematic identification of these cues and their physical/acoustical correlates supports:

_ microphone selection and placement appropriate to a given sonic context and desired spatial effect.
 
_ postproduction audio techniques and processes able to convey spatial imagery solely through sonic means.

_ deliberate creation of realistic or imaginary soundscapes in multisensory experiences 
    (e.g. for films, computer games, other A/V composites, or immersive AR or VR experiences)
 
     _ Watch this video illustrating 3D audio (by Sennheiser). Use headphones and start at 3'10".
     _ Explore this similar approach (by Sony).
     _ Scan through the Wikipedia entry on 3-D audio.
 
 

Terminology
 
   Monaural signal
       Signal entering a single ear (usually through headphones - earplug in one ear and signal/stimulus in the other)
 
   Binaural signal
       Signal entering both ears via the air (most common type of listening in real-world environments - can be simulated via headphones)

           Diotic (special case of  Binaural)
               Exactly the same sound signal entering both ears (artificial type of binaural listening - e.g. listening to a mono
               recording through stereo headphones)
           Dichotic (special case of  Binaural - also referred to as Binaural Splitting)
               Completely different sound signals entering each ear (artificial type of binaural listening - through headphones).
 
   Ipsilateral ear
       Ear closest to the sound source
 
   Contralateral ear
       Ear furthest from the sound source
 

Auditory localization judgments are mainly described in terms of the apparent position of a sound source relative to the listener's head (see to the right).

Azimuth describes sound-source location in the transverse (mid-horizontal) plane.
Elevation describes sound-source location in the median (mid-sagittal) plane.
      Front center: 0°
      Left/Up: positive angles
      Right/Down: negative angles

Lateralization describes judgments on the apparent position of a sound source when listening through stereo headphones; in this case, the apparent position is confined to within the listener's head.
3D Headphones circumvent this limitation (article by Waves Audio Ltd. - concepts apply broadly).

(in Plack, 2005: 174)

The JND of auditory localization is defined in terms of minimum audible angles (MAAs) of rotation in azimuth, elevation, or both, necessary to convey a corresponding change in sound-source positioning relative to the head (see below and to the right)

MAAs depend on:
i) rotation orientation,   ii) starting angle,  &   iii) frequency.

The smallest possible MAA is ~1°:
i) in the azimuth,   ii) at 0° starting angle,  &   iii) at low frequencies.

(Source: Freigang et al., 2014)


(in Plack, 2005: 175).

 


 

Interaural Intensity/Level Differences (IID / ILD)
&
Interaural Time/Phase Differences (ITD / IPD)

 


We localize sound sources based on three kinds of interaural differences
(i.e. differences between the portion of sonic energy arriving at each ear):

_ intensity level differences

_ arrival time (i.e. period-related time- or phase-) differences

_ spectral differences

Interaural differences in intensity and arrival time (phase) constitute the most important sound-source localization cues.

The theory outlining their contribution to sound-source localization judgments is referred to as the duplex theory of sound-source localization and was introduced in the early 20th century by the British physicist Lord Rayleigh (John William Strutt, 3rd Baron Rayleigh).

 
INTERAURAL INTENSITY / LEVEL DIFFERENCES
 

 


 

For high frequency sound signals (>~1500Hz), with wavelengths <~1/2 of the average head's circumference (i.e. <~9 inches or <~0.23m), auditory localization judgments are based mainly on interaural intensity/level differences (IIDs or ILDs) .

Frequencies >~1500Hz cannot diffract efficiently around a listener's head; the head therefore blocks part of their acoustic energy and produces interpretable interaural intensity/level differences.

The higher the frequency, the larger the portion of the signal energy blocked by the head, producing progressively stronger and more salient interaural intensity differences.
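A quick back-of-the-envelope check of these numbers, as a rough Python sketch (the 345 m/s speed of sound and the ~0.23 m head dimension follow the values used in this module; the chosen test frequencies are illustrative):

    # Rough check of the frequency above which the head casts an acoustic "shadow".
    speed_of_sound = 345.0      # m/s (value used in this module)
    head_size = 0.23            # m, roughly half the average head circumference

    def wavelength(frequency_hz):
        """Wavelength (m) of a sound wave at the given frequency."""
        return speed_of_sound / frequency_hz

    for f in (500, 1500, 4000, 8000):
        lam = wavelength(f)
        shadowed = lam < head_size   # too short to diffract efficiently around the head
        print(f"{f:>5} Hz -> wavelength {lam:.3f} m, head shadow: {shadowed}")
    # ~1500 Hz is the approximate borderline: its wavelength (~0.23 m) just matches the head.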

 

 

For frequencies below ~500Hz, IIDs are negligible (why?); above that, they increase gradually with frequency.
For frequencies >~1500Hz, IIDs change systematically with azimuth changes and provide reliable localization cues on the horizontal plane, except for front-to-back confusion, where IID = 0 (why?).
 


Artificial IIDs (i.e. imposed through headphones and not due to sound-source positioning relative to the head) can provide localization cues at all frequencies. 

Example: Three successive 300Hz tones are presented in stereo (must use headphones)
      _In the first, the intensity level is higher at the left channel (by 7dB);
      _In the second, the level is the same across channels;
      _In the third, the level is higher at the right channel (by 7dB). 
This arrangement results in signal lateralization that appears to move from the left, to the middle, to the right,
following the signals' IID.
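A sketch of how such a demo could be synthesized (Python with numpy; the 300Hz tone and 7dB offset follow the example above, while the sample rate, segment duration, and output handling are illustrative assumptions):

    import numpy as np

    sr = 44100                                   # sample rate (Hz) - illustrative choice
    t = np.arange(int(0.8 * sr)) / sr
    tone = np.sin(2 * np.pi * 300 * t)           # 300 Hz sine, as in the example
    gain = 10 ** (-7 / 20)                       # -7 dB as a linear factor (~0.45)

    # Three segments: left louder, equal, right louder (IID = +7 dB, 0 dB, -7 dB).
    left_louder  = np.stack([tone,        tone * gain], axis=1)
    centered     = np.stack([tone,        tone       ], axis=1)
    right_louder = np.stack([tone * gain, tone       ], axis=1)

    stereo = np.concatenate([left_louder, centered, right_louder])   # columns: [L, R]
    # Writing this array to a stereo file (e.g. with soundfile.write) and listening
    # over headphones should reproduce the left -> center -> right motion described above.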

Audio engineers often use artificial IIDs to simulate different sound-source positions.
This practice may introduce complications when listening through loudspeakers (rather than through headphones). One such complication is related to a perceptual strategy developed to address auditory localization ambiguities when listening in reflective environments. The manifestation of this strategy (referred to as the "precedence effect") is addressed at the end of the module.

 

 
INTERAURAL TIME / PHASE DIFFERENCES
 


For low frequency sound signals (<~500Hz), with wavelengths >~ 4/3 of the average head's circumference (i.e. >~ 26 inches or >~0.66m), auditory localization judgments are based mainly on single-cycle interaural time differences (ITDs), equivalent to interaural phase differences (IPDs).

Signals of frequencies <~500Hz diffract efficiently around a listener's head and do not produce interaural level differences that are large enough to be perceptible. However, they do arrive at the two ears at different times, with periods that are longer than the largest possible interaural time difference and wavelengths that are larger than the average head's diameter.
Consequently, at low frequencies, sound-source location produces unambiguous and interpretable interaural phase differences (other than for locations directly front/back/up/down).

The maximum possible interaural time difference that can occur due to sonic energy travelling around the head is ~0.65ms or ~0.00065s (assuming an average head diameter of ~0.22m and speed of sound = 345m/s).
For a sine signal to take advantage of IPD cues unambiguously, its period must be at least twice this value (= 2x0.00065s/cycle ~0.0013s/cycle) so that the entire head can fit in half a cycle. The absolute highest frequency for which IPDs provide useful cues is 1/0.0013 = ~770Hz.
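The arithmetic in the last two paragraphs, restated as a short Python sketch (the head diameter and speed of sound are the values assumed above):

    head_diameter = 0.22     # m, average value assumed above
    c = 345.0                # m/s, speed of sound

    max_itd = head_diameter / c              # ~0.00064 s (~0.65 ms)
    min_period = 2 * max_itd                 # half a cycle must span the whole head
    highest_unambiguous_f = 1 / min_period   # ~780 Hz (~770 Hz with the rounded 0.65 ms value)

    print(max_itd, min_period, highest_unambiguous_f)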
 

The minimum detectable interaural time difference is ~10 microseconds (0.000010s), corresponding to a shift in sound-source location of ~1° in azimuth relative to straight ahead (consistent with the minimum audible angle).

In Plack, 2005: 176


At higher frequencies, single-cycle IPDs do not provide useful localization cues because they are ambiguous and cannot be interpreted reliably. They depend not only on sound-source location on the horizontal & median planes but also on frequency and, most importantly, distance. Different sound-source locations can result in the same IPDs, while the same angular location in the localization coordinate system may result in different IPDs, depending on distance.
 

Strongest separation in lateralization (i.e. in virtual location inside the head) occurs for IPDs = 1/4 cycle, corresponding to signals with:
Period T = 4 x "time to travel around the head" = 4 x 0.00065s = 0.0026s/cycle; and Frequency f = 1/0.0026 = ~385Hz.

Example: Three successive 300Hz tones are presented in stereo (must use headphones - no onset difference)
      _In the first, the left channel leads by 1/4 cycle;
      _In the second, the phase is the same across channels; 
      _In the third, the right channel leads by 1/4 cycle.
This arrangement results in signal lateralization that appears to move from the left, to the middle, to the right,
following the signals' IPD.
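A sketch of how this phase-based demo could be synthesized (Python with numpy; the 300Hz tone and 1/4-cycle lead follow the example above, while sample rate and durations are illustrative assumptions):

    import numpy as np

    sr = 44100
    t = np.arange(int(0.8 * sr)) / sr
    f = 300.0                                   # Hz, as in the example
    quarter_cycle = np.pi / 2                   # 1/4 cycle = 90 degrees of phase

    def segment(left_phase, right_phase):
        """Stereo 300 Hz segment with the given starting phase per channel."""
        return np.stack([np.sin(2 * np.pi * f * t + left_phase),
                         np.sin(2 * np.pi * f * t + right_phase)], axis=1)

    demo = np.concatenate([
        segment(quarter_cycle, 0.0),   # left channel leads by 1/4 cycle
        segment(0.0, 0.0),             # no IPD
        segment(0.0, quarter_cycle),   # right channel leads by 1/4 cycle
    ])
    # Over headphones this should lateralize left -> center -> right, following the IPD.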

Listen to three stereo signals with the same phase relationships, as above, at 100Hz and 8000Hz.
What types of lateralization do they result in? Do you still get a clear sense of motion?

Interaural phase difference of exactly 1/2 cycle (180°) results in a wider stereo image rather than in lateralization changes.

Example: Two successive 300Hz tones are presented in stereo (must use headphones - no onset difference)
      _In the first, both channels are in phase (i.e. IPD = 0°);
      _In the second, there is a 1/2 cycle phase difference between channels (i.e. IPD = 180°).
 

IPDs also provide useful localization cues for complex signals whose spectrum includes components with frequencies <770Hz.

For complex signals with no low-frequency content (or for amplitude-modulated high-frequency-content signals), IPDs remain useful as long as several of the frequency components are separated by (or as long as the modulation rate is) <750Hz.  In such cases, the complex signal's envelope (in Plack, 2005:178) will display amplitude fluctuations at rates <750Hz and the IPDs between the signal envelopes arriving in each ear will provide interpretable localization cues.

Interaural Time Difference (ITD) vs. Interaural Phase Difference (IPD)
(must listen over headphones)

* When IPD = 0, the signal appears to come from the center, whether ITD = 0 (top-left graph) or ITD ≠ 0 (bottom-left graph).
* When IPD ≠ 0, the signal appears to come from one side, whether ITD ≠ 0 (top-right graph) or ITD = 0 (bottom-right graph).
In other words, the apparent location of the sound source is determined by the IPD rather than the ITD values; ITDs are relevant only in terms of the IPD values they impose.

 

 
INTERLUDE: DICHOTIC BEATS
 

The phenomenon of dichotic beats (often inaccurately referred to as 'binaural beats') describes a beating-like sensation arising when two signals with slightly different frequencies are presented dichotically (i.e. one per ear, through headphones).

Dichotic/binaural beats have acquired cult status, with hundreds of websites and videos linking them to a variety of mental effects. The apparent fascination is partially due to the disorienting sensation elicited by this unnatural form of listening (dichotic listening) and partially due to the coincidence between the most salient dichotic beat rates and the frequencies of some brain-waves (e.g. Theta and Delta brain-waves).
The sensation is based on our hearing mechanism's use of static IPDs as sound-source localization cues and of dynamic IPDs as sound-source motion detection cues, at low frequencies.  

Contrary to the beating sensations that accompany signals with amplitude fluctuation rates <~15 fluctuations/sec., dichotic beats are not the result of periodic alternations between constructive and destructive interference; dichotic presentation does not permit physical interaction. Rather, they are the result of periodic changes in IPDs and a direct manifestation of our ability to detect the systematic IPD changes that, for low frequencies, accompany sound-source motion.

For very small frequency differences (<~4Hz) between ears, the resulting IPD modulations give rise to a "rotating" sensation inside the head that can be easily identified as such (e.g. Channel 1: 250Hz & Channel 2: 250.5Hz). 

For larger frequency differences (>~4Hz), the sensation does resemble the loudness fluctuations (beating) that would result if the tones in each ear were allowed to interfere, because the rotation rate is faster than the hearing mechanism's ability to follow it. However, this beating-like sensation is much less pronounced than it would be if actual interference had taken place (e.g. Channel 1: 250Hz & Channel 2: 257Hz - listen via headphones and via loudspeakers; is there a perceptual difference?).
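A sketch of how the slow-rotation case could be generated (Python with numpy; the 250Hz / 250.5Hz pair follows the example above, while the duration and sample rate are illustrative assumptions):

    import numpy as np

    sr = 44100
    dur = 10.0                                # seconds - long enough to hear the slow rotation
    t = np.arange(int(dur * sr)) / sr

    left  = np.sin(2 * np.pi * 250.0 * t)     # Channel 1: 250 Hz
    right = np.sin(2 * np.pi * 250.5 * t)     # Channel 2: 250.5 Hz (0.5 Hz difference)

    dichotic = np.stack([left, right], axis=1)
    # Over headphones the two channels never interact physically; the slowly drifting IPD
    # (one full relative cycle every 2 s here) produces the "rotating-in-the-head" sensation.
    # Played over loudspeakers, the same two tones interfere acoustically and produce
    # ordinary 0.5 Hz loudness beats instead.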

Dichotic and interference-based beating sensations are manifestations of different physical, physiological, and perceptual phenomena. (For a short outline see here - for more details see here)

 


 

Monaural & Interaural Spectral Differences - HRTF / ATF

 

For sound signals of intermediate frequencies (~500Hz < f < ~1500Hz), IID and IPD cues do provide some useful localization information, but only in the azimuth, and only if combined (IID cues are perceivable down to ~500Hz and IPD cues are perceivable up to ~770Hz).
For such frequencies, where IID and IPD cues are ambiguous/unreliable in the azimuth, as well as for all frequencies in the case of sound-source elevation changes, which do not produce IIDs or IPDs, the auditory system relies on monaural spectral cues and interaural spectral difference cues. 

Monaural spectral cues and interaural spectral difference cues are due to the torso, head, and outer ear performing azimuth- and, most importantly, elevation-dependent spectral filtering on signals. This filtering 'colors' the spectral composition of the signal arriving in each ear differently, depending on sound-source location, and is commonly referred to as the Head-Related Transfer Function (HRTF) or, more accurately (to acknowledge the contribution of parts of the body other than the head), the Anatomic Transfer Function (ATF).

As is the case with most sound source localization cues, spectral cues are not foolproof. For example, in the median plane, signals rich in high frequencies tend to be localized higher than signals rich in low frequencies, even when the source elevation remains the same.

At frequencies low enough (<~200Hz) for their wavelengths to be larger than the dimensions of the structures involved (head, torso, pinnae), interaural spectral differences and the associated HRTFs are not useful at any orientation.

Interaural spectral differences due specifically to structural differences between the pinnae of the two ears contribute to better lateralization (i.e. more accurate virtual location of the sound within the head), especially in the median (i.e. elevation) plane. Pinna-related spectral filtering alone, however, does not help us 'construct' a complete aural 'image' of the outside world.
Experiments in which ATFs obtained from dummy heads/torsos are imposed on the equalization and reproduction of signals indicate that the interaural spectral differences contributed by the entire torso/head/pinnae system contain information that better supports perceiving and reconstructing the actual source location outside the head.

Head-Related Transfer Function (HRTF) is defined as the ratio of the sound pressure spectrum measured at the eardrum to the sound pressure spectrum that would exist at the center of the head if the head were removed. The figure to the right displays HRTFs as a function of sound-source elevation angle. From this and other similar datasets it has been inferred that:

• The 8kHz region seems to correlate with overhead perception (i.e. spectral changes in this region correlate with changes in the perceived location of sound sources above our heads) 
• Regions in the frequency bands 300-600Hz & 3000-6000Hz seem to correlate with frontal perception
• Regions centered at around 1200Hz & 12000Hz seem to correlate with rear perception.

HRTFs are personalized, as they depend on variable pinna, head, and torso construction among individuals (e.g. data in the figure, below).
Consequently, spectral sound-source localization cues and the associated HRTFs are most likely learned through experience, with individuals generally localizing better with their own cues than with those of others.

At the same time, it has been shown that physiological differences may result in some listeners performing much better than others on auditory localization tasks. 
Imposing 'good' HRTFs (through appropriately equalized headphone listening) on individuals who have difficulty localizing sound sources can improve their localization performance by providing more salient interaural spectral differences (assuming comparable head size between 'donor' and 'recipient' and sufficient learning time).

Experimental explorations of HRTFs/ATFs use specially-designed binaural heads (e.g. KEMAR, by G.R.A.S. Sound & Vibration, Denmark) to record signals at various positions in their path to the ear drum and tease out the various anatomical spectral-filtering contributions.
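A conceptual sketch of how measured HRTFs are typically applied in binaural synthesis: a mono signal is convolved with the left- and right-ear head-related impulse responses (HRIRs) measured for the desired direction. In the Python sketch below, hrir_left and hrir_right are placeholders for real, same-length measured responses (e.g. from a KEMAR dataset); the function name is an illustrative assumption.

    import numpy as np

    def render_binaural(mono_signal, hrir_left, hrir_right):
        """Place a mono signal at the virtual location encoded by an HRIR pair."""
        left  = np.convolve(mono_signal, hrir_left)
        right = np.convolve(mono_signal, hrir_right)
        return np.stack([left, right], axis=1)    # columns: [L, R]

    # hrir_left / hrir_right: short impulse responses (a few ms long) measured at the two
    # eardrums of a listener or dummy head for one azimuth/elevation; placeholders here.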
 

 

 


 

Binaural Cues & "Release from Masking"

 

IID and IPD cues also help us perceive tones that would have otherwise been masked. The following three listening examples illustrate this point and correspond to the three scenarios described in the figure to the right (must use headphones).

(A) Example 1: a 300Hz sine signal with no IIDs or IPDs is presented along with a 600Hz-wide noise band, centered at 300Hz. Due to the level difference between signal and noise (the signal is 15dB below the noise) the sine tone is masked.

(B) Example 2: same as in (A) but with a 180° IPD for the sine signal. In spite of the level difference between noise and signal, the sine tone is now perceivable.

(C) Example 3: same as in (A) but with the left channel of the sine signal removed (extreme case of IID for the sine signal). Again, in spite of the increased level difference between noise and signal, the sine tone is now perceivable.
The counter-intuitive observation in (C) is that a masked tone becomes audible by reducing its overall level.

[4 signals used to construct the audio examples:
Noise / 300Hz / 300Hz 180° IPD /  300Hz no left channel.]
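A sketch of how these three examples could be constructed (Python with numpy; the 300Hz tone, ~600Hz-wide noise band, and 15dB level difference follow the description above, while the sample rate, duration, crude band-limiting, and normalization are illustrative assumptions):

    import numpy as np

    sr = 44100
    t = np.arange(int(1.0 * sr)) / sr
    rng = np.random.default_rng(0)

    # Diotic noise band (roughly 0-600 Hz, centered near 300 Hz), identical in both ears.
    noise = rng.normal(size=t.size)
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(t.size, 1 / sr)
    spectrum[freqs > 600] = 0                    # crude band limiting
    noise = np.fft.irfft(spectrum, n=t.size)
    noise /= np.max(np.abs(noise))

    tone = 10 ** (-15 / 20) * np.sin(2 * np.pi * 300 * t)    # tone roughly 15 dB below the noise

    ex_A = np.stack([noise + tone, noise + tone], axis=1)    # no IID/IPD: tone masked
    ex_B = np.stack([noise + tone, noise - tone], axis=1)    # 180 deg IPD: tone becomes audible
    ex_C = np.stack([noise,        noise + tone], axis=1)    # tone in one ear only: also audible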

IPDs can reduce the detection threshold of low-frequency signals by up to ~15dB, while they have no effect at high frequencies (why?).

The release from masking of complex signals is facilitated further by interaural spectral differences.

The described release from masking is not due to our ability to localize a signal thanks to the imposed interaural differences. Rather, it is due to signal de-correlation between the ears, produced by those interaural differences, which supports cognitive strategies (e.g. attention focusing) for signal detection in complex sonic environments.

(in Plack, 2005: 180)

 


 

Cone of Confusion - Head-Movement Localization Cues

 

Cone of confusion

Sound sources moving in a sagittal plane (i.e. changing in elevation) do not produce IID or IPD changes. Consequently, their movement is difficult to track by purely auditory means. More generally, for any given IID or IPD value, there will be a conical surface extending out of the ear that will produce identical IIDs and IPDs, preventing precise sound source localization over the surface (see to the right). 

Interaural spectral differences and the associated HRTFs do help resolve localization ambiguities, whether in the median sagittal plane or within any cone of confusion.

However, the most effective strategy in resolving sound-source localization ambiguities is head movement, assuming the sound signal lasts long enough, unchanged, to allow for such a movement to be of use. Moving the head in the horizontal plane can help resolve front-to-back ambiguities (e.g. below-left), while head tilting can help resolve top-to-bottom ambiguities.

(in Plack, 2005: 184)

As is the case with most sound-source localization cues, the salience of head-movement cues depends largely upon long-term learning and experience. Auditory localization experiments using headphones and conflicting source and head movements exploit our reliance on previous experience, resulting in revealing illusions. Explore the figures, below, and read the explanations in the captions.

(in Plack, 2005: 185)

 

 


 

Judging Sound-Source Distance

 

Distance and loudness

Judging sound-source distance can be partially aided by loudness cues. In general, softer sounds are more likely to be associated with sources farther away and louder sounds with sources nearer. 

Loudness cues are only reliable when comparing the loudness of a sound to a known reference that precedes it and/or when judging familiar sounds. In all other cases, loudness cues cannot reliably support distance judgments.
Even in reliable contexts, distance changes are underestimated when judged based solely on loudness changes. 

More specifically, although distance doubling in the free field (i.e. where there are no reflections) corresponds to an ~6dB SPL reduction, listeners require an ~20dB SPL reduction in order to report that their distance to a sound-source has doubled.  
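The 6dB figure follows from the inverse-square law; a minimal Python check (the distances chosen are illustrative):

    import math

    def level_change_db(d1, d2):
        """SPL change (dB) when moving from distance d1 to d2 from a source in the free field."""
        return 20 * math.log10(d1 / d2)

    print(level_change_db(1, 2))   # doubling the distance: ~ -6.0 dB
    print(level_change_db(1, 4))   # two doublings:         ~ -12.0 dB
    # Listeners, however, typically need a drop of ~20 dB before they report
    # that a source sounds "twice as far away".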

Distance and reverberation

In reflective environments, distance judgments are aided by reverberation cues. In general, the greater the distance of the source the greater the proportion of the reverberant (relative to the direct) sound.

Direct-to-reverberant-sound cues provide coarse source-distance information and only become perceptible for distance changes by a factor of two or larger. In addition, distance-change judgments based on this cue alone tend, again, to be underestimated.

Distance and spectral composition (timbre)

Changes in sound-source distance are linked to timbral changes, mainly due to corresponding changes in a sound's high-to-low frequency SPL ratio. In general, increasing the distance from a sound source tends to reduce this ratio because air absorption reduces high frequencies far more than low frequencies. This cue is most perceptible for large changes in distance. 

In addition, the increased likelihood of higher frequencies being blocked by obstacles and of lower frequencies diffracting around them further reduces a sound's high-to-low frequency energy ratio with distance. Absorption (by air) and blocking (by obstacles) of high frequencies can therefore explain the observation that low frequencies travel much farther than high frequencies and that, at very large distances, the level of high frequencies becomes negligible.

NOTE
In the absence of obstacles and for short-to-middle distances (~20-40m from a source, where air absorption is negligible), all frequencies lose ~6dB SPL for each doubling in distance (why?) and the high-to-low frequency SPL ratio remains fixed.

However:
A given change in dB corresponds to a larger loudness change at low vs. high frequencies, as can be inferred from the equal loudness contours (why?).
In the above scenario, increasing the distance from a source will therefore reduce the loudness of low frequencies more than that of high frequencies, even though the SPL of both frequency ranges will be reduced by the same amount. In other words, while the SPL ratio of high-to-low frequencies remains fixed, their loudness ratio increases.

 

Sound-source distance and overall location judgments based solely on aural cues are not precise and require a combination of experience/familiarity with a given sound source and with the changes in context, loudness, reverberation, timbre, and visual cues that may accompany changes in sound-source location. This is one of the reasons why, for example, attempts to apply sound-source localization cues to self-driving vehicles through AI have, to date, failed. The most promising neural-network-based approaches explore the discussed sound-source localization cues differently than human listeners do.
[See, for example, this study exploring how an artificial (vs. a human) "machine" may use only ITD cues to pinpoint sound-source location: Three-Dimensional Sound Source Localization for Unmanned Ground Vehicles with a Self-Rotational Two-Microphone Array.]

   


 

The Precedence Effect

 

Precedence effect (Haas effect): The precedence effect describes a learned strategy employed implicitly by listeners to address conflicting or ambiguous localization cues in environments where sound-wave reflections play an important role (i.e. all rooms other than anechoic environments). The Haas effect, named after the German psychoacoustician Helmut Haas (mid-20th century), is a special case of the precedence effect. It refers to an audio processing technique that uses the precedence effect to create a wider stereo image from a mono source.

According to the precedence effect, listeners make their localization judgments based on the earliest arriving sound onset. The term "precedence" is used because the direct sound, with presumably accurate localization information, is given precedence over the subsequent reflections and reverberation, which usually convey inaccurate localization information. In fact, in reflective/reverberant environments, both IPD and ILD cues are diffused to such an extent that they become unusable, making the precedence effect a necessary strategy.

More specifically, for signal arrival-time differences of up to 30-40 milliseconds, the perceptual priority of the signal arriving first persists even if the second (delayed) signal is up to 10dB stronger than the first.
This strategy is developed through extensive exposure to reverberant listening contexts. After it has been developed, it is usually applied to all listening contexts automatically, leading to possible auditory localization "illusions."

The figure to the left illustrates a precedence effect demonstration with two loudspeakers reproducing the same pulsed wave.
The pulse from the left speaker leads in the left ear by a few microseconds, suggesting that the source is on the left. The pulse from the right speaker leads in the right ear by a similar amount, which provides a contradictory localization cue. Because the listener is closer to the left speaker, the left pulse arrives sooner and wins the competition: the listener perceives just a single pulse coming from the left.

From "How we Localize Sound" by American physicist, W. M. Hartmann; standard reference resource on the topic of sound source localization.

[Optional: Watch Prof. Hartmann's 2018 presentation on the state of the field at McGill University's Center for Interdisciplinary Research in Music Media and Technology.]

  

The figure, below, illustrates the interaction between interaural time & intensity differences, observed in experiments exploring the precedence effect. 

  • For small delays (<~1ms) between ears (i.e. small differences in the distance between a sound source and each ear), localization is biased towards the side producing the louder sound. 

  • For delays between 1-40ms, localization is biased towards the earlier sound (a rough headphone sketch of this case follows the list below). 

  • For larger delays we hear two sounds, with the second perceived as an echo of the first, even if the 'echo' is stronger than the original.
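A rough headphone analogue of the 1-40ms case can be sketched as follows (Python with numpy; the 10ms delay, 6dB level difference, pulse shape, and sample rate are illustrative choices within the ranges above):

    import numpy as np

    sr = 44100
    delay_ms = 10.0                          # within the 1-40 ms "precedence" range
    delay = int(sr * delay_ms / 1000)

    click = np.zeros(int(0.2 * sr))
    click[:20] = 1.0                         # short pulse

    lead = 0.5 * click                       # leading pulse (left channel)
    lag = np.roll(click, delay)              # 10 ms later and 6 dB stronger (right channel)

    demo = np.stack([lead, lag], axis=1)
    # Despite the lagging pulse being stronger, the pair should be heard as a single
    # click located toward the leading (left) side.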

[Optional: Brown et al., 2015. The Precedence Effect in Sound Localization.]

   


 

OPTIONAL: Sound source localization neural mechanisms

 
Localization of Sound Sources (Michigan State University)
Long list of links to relevant publications by Prof. W. Hartmann and his colleagues.
 

To explain ITD (IPD) detection, the American psychologist Lloyd Jeffress hypothesized (1948) the presence of a coincidence detector, at the neural level, that uses delay lines to compare arrival times at each ear. His theory is illustrated in the figure, below left (in Plack, 2005: 182).

Assuming the presence of neuron arrays, which are tuned to different delays and encode signal arrival time differences between ears, the Jeffress model is equivalent to cross-correlation, comparing the inputs from each ear at different time delays. [video explanation]
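A minimal illustration of the cross-correlation idea (Python with numpy; the 300Hz test tone, the simulated 0.4ms delay, and the ±1ms search range are illustrative assumptions): slide one ear's signal against the other and pick the lag at which they match best.

    import numpy as np

    def estimate_itd(left, right, sr, max_itd_s=0.001):
        """Estimate the interaural time difference by cross-correlation (Jeffress-style)."""
        max_lag = int(max_itd_s * sr)
        lags = np.arange(-max_lag, max_lag + 1)
        corr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
        return lags[int(np.argmax(corr))] / sr    # positive: right-ear signal lags (source to the left)

    sr = 44100
    t = np.arange(sr) / sr
    itd = 0.0004                                  # simulate a 0.4 ms interaural delay
    left = np.sin(2 * np.pi * 300 * t)
    right = np.sin(2 * np.pi * 300 * (t - itd))   # right ear lags by 0.4 ms
    print(estimate_itd(left, right, sr))          # ~0.0004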

Jeffress's model has been partially confirmed by physiological evidence from birds (e.g. barn owl).

However, evidence from mammals (e.g. the gerbil; above-right) suggests broad sensitivity to just two ranges of interaural time differences at each characteristic frequency (i.e. per neuron - see the figure, above; in Plack, 2005: 183), corresponding to the period of that frequency.
This observation fits with the observations, already discussed, that
a) ITDs do not provide sound-source localization cues at high frequencies
    (the Jeffress model cannot explain this) and
b) it is a very special type of ITD that matters: the IPD
    (only up to 1/2-cycle IPDs are well represented by the gerbil data). 
[Advanced review in Ashida & Carr, 2011]

IID detection can be explained in terms of the associated difference between excitatory and inhibitory activity in the two ears when stimulated by signals that display IIDs.

   


 

  

Loyola Marymount University - School of Film & Television