The Engineering and Art of Headphone Design: A Brief Overview, by Noel Lee
The Art of Listening
Critical listening is an acquired art. It’s a skill that must be learned, just like tasting fine wines, beers, or even fine cigars. It takes time and experience to know what to look for when judging the highest levels of quality. Such is the challenge in determining what makes a good headphone or a good speaker. To novice listeners, what they hear in a club or disco is good music reproduction. But advanced listeners will often find these speakers unnatural, inaccurate, and actually fatiguing to listen to over a long period of time.
With headphones, determining what superior sound is proves even more difficult than with speakers. Each ear is different, and even details that seem relatively insignificant, like ear tip fit, can dramatically influence the results. However, there are some absolute terms we can use to describe the listening experience that go beyond numbers alone, such as the frequency response measurements that many manufacturers rely on when designing headphones. If it were that simple, why wouldn’t two headphones that measure similarly sound the same? Why does a balanced armature design have a “signature sound” that dynamics don’t have? Why do headphones come in so many types and varieties and sound so different?
In search of the perfect sound.
High quality headphones will reproduce all music accurately and allow the listener to enjoy the music as if the headphones were transparent. They don’t sound like headphones, but like real music. The sound feels alive and sounds lifelike. One can listen to them for hours without fatigue. Unfortunately, really great headphones are extremely rare. But we are all in search of the perfect sound, and learning how to achieve that in a headphone is a combination of engineering and art.
A headphone. A speaker. A microphone. Understanding the similarities and differences as they relate to music reproduction.
These three devices are the same in that they’re all transducers. In other words, they turn mechanical energy into electrical signals, and vice versa. Headphones are similar to speakers in that both are transducers on the reproduction end. Their job is to recreate the music signal without adding sounds of their own. But that’s almost impossible, since every mechanical device has sounds, resonances and distortions of its own. For example, when reproducing a kick drum, the recorded sound may stop, but because of the inertia of the speaker or headphone diaphragm, the diaphragm keeps on going, kind of like a tuning fork that keeps on ringing. This is known as “decay” over a period of time, and it is easily measured today in the form of a “waterfall” graph. This ringing is bad because it adds sound that was not in the recording.
The “speed” at which the music signal occurs is also important to create a sense of realism. In real life, when a guitar pick hits the string, or when one hits a triangle, how fast is the initial impact? It’s immediate. But, in the same way a speaker or headphone has trouble stopping, it can also have trouble accelerating fast enough to accurately capture the initial impact of the music.
The microphone is a speaker in reverse. It captures the music as sound waves hit its diaphragm, and it too has a start and stop factor, as well as a frequency response of its own. That’s why you will see recording engineers being fanatical about their selection of microphones for various instruments. Even singers prefer particular microphones because they reproduce their voice the way they want to hear it.
Various technologies have been invented over the years to optimize some of these parameters. Dynamic speakers with huge magnets help bass speakers start and stop accurately, along with different cone materials that stiffen the speaker. On the high end, metalized mids and tweeters help the driver start and stop rapidly, but may have “ringing” distortions of their own. Electrostatic speakers with extremely light diaphragms are the reference used by many headphone and speaker listeners because of their ability to start and stop, but because they don’t move great distances, they may lack power and dynamic range.
In headphones there are designs that help one parameter, but often at the expense of another. Electrostatic headphones are considered the best, but they can’t move a lot of air, so they lack bass response. Dynamic headphones are all over the map in their ability to accurately reproduce music, but represent a good compromise if designed properly. Balanced armatures are fast in reacting, but are poor at stopping and produce resonances and sounds of their own, as can be seen in their waterfall measurements.
There is one last difference between speakers and headphones. Everyone knows that a speaker sounds best in a tuned room that is designed for the speaker. That’s how many recording studios are designed. With a headphone, however, everyone’s ear is slightly different. Obviously there’s no room, but there is an ear cup on over-ear headphones, and an ear tip on in-ears. Both can dramatically affect the sound. Together with the ear, they form an “ecosystem” where a whole lot of parameters depend on one another to get the best results.
That’s why designing a great headphone means knowing how to balance all of the parameters of the ecosystem to get the best sound reproduction. That is where the “art” and the “ear” become part of the design process.
Measurements vs. Listening.
Measurements are useful in design, but there is no one measurement that will tell you what a headphone will sound like. Many novices focus on frequency response callouts that carry no +/- dB variation specification or distortion measurement, so they don’t mean anything. Even if the frequency response were exact, the number is meaningless without the various distortion figures (IM, harmonic, TIM).
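To see why a frequency range quoted without a +/- dB tolerance says so little, here is a minimal Python sketch. The measurement points are made up for illustration, and `tolerance_db` is a hypothetical helper rather than any standard routine.

```python
def tolerance_db(response, low_hz, high_hz):
    """Return the +/- dB spread of measured levels between low_hz and high_hz.

    `response` is a list of (frequency_hz, level_db) points.
    """
    levels = [db for hz, db in response if low_hz <= hz <= high_hz]
    if not levels:
        raise ValueError("no measurement points inside the requested band")
    return (max(levels) - min(levels)) / 2.0


# Hypothetical measurement points, levels in dB relative to 1 kHz.
measured = [(20, -6.0), (100, 0.0), (1000, 0.0), (10000, 2.0), (20000, -8.0)]

print("20 Hz to 20 kHz: +/-", tolerance_db(measured, 20, 20000), "dB")
print("100 Hz to 10 kHz: +/-", tolerance_db(measured, 100, 10000), "dB")
```

The same hypothetical headphone can honestly be labeled "20 Hz to 20 kHz," yet the spread over that full range is far wider than over the middle of the band, which is why the tolerance matters as much as the range.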
Since the ear does not hear with a flat frequency response, a correct headphone frequency response should not be flat either. The human ear does not hear all frequencies at the same level, and is more sensitive to the middle ranges. Usually every other frequency is referenced to 1k, or 1000 Hz.
Frequency response curve of what the ear wants to hear. An ideal and "balanced" headphone would have a frequency response similar to this one.
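One widely used approximation of the ear's uneven sensitivity is the standard A-weighting curve, which is likewise referenced to 1 kHz. The sketch below evaluates it in Python. This is an assumption brought in for illustration: it models the ear at moderate listening levels and is not the reference curve from the figure described above.

```python
import math


def a_weighting_db(freq_hz):
    """Approximate relative sensitivity of the ear in dB, 0 dB at 1 kHz.

    Uses the standard A-weighting formula; negative values mean the ear
    is less sensitive at that frequency than at 1 kHz.
    """
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(numerator / denominator) + 2.00


for f in (30, 100, 1000, 10000):
    print(f"{f:>6} Hz: {a_weighting_db(f):6.1f} dB")
```

The steep drop at low frequencies is the reason a headphone that measures flat can sound short on bass.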
This test shows a relatively flat frequency response of a popular headphone and the harmonic distortion below. Since this curve does not match the reference curve, this headphone will have difficulty reproducing audible bass.
Yet another measurement that affects perceived frequency response is “waterfall,” or “decay” of the original signal. Just as the tuning fork can’t stop, various headphone diaphragms can’t stop, thereby adding coloration to the sound around that particular frequency. So the “frequency response” might look good, but the “waterfall” or decay may look bad, causing exaggerated high frequencies (such as in balanced armature designs) or exaggerated bass (such as in dynamic designs).
The same frequency response with the waterfall response plotted next to it. One can see a slow decay across the frequency spectrum where the energy is stored. This will result in a harsh sound with high and long decay at 8k.
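A waterfall plot can be thought of as the same frequency analysis repeated on later and later portions of the driver's response. The following is a rough Python sketch of one row of such a plot, following a single frequency over time. The two "drivers" are synthetic decaying sine waves with assumed time constants, not measurements of any real headphone.

```python
import cmath
import math

SAMPLE_RATE = 48000  # samples per second (assumed)


def ringing(freq_hz, time_constant_s, length=4800):
    """Hypothetical impulse response: a resonance that decays exponentially."""
    return [
        math.exp(-n / (SAMPLE_RATE * time_constant_s))
        * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(length)
    ]


def level_db(samples, freq_hz):
    """Level of one frequency in a block of samples (single-bin DFT)."""
    acc = sum(
        s * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, s in enumerate(samples)
    )
    return 20.0 * math.log10(abs(acc) + 1e-12)


def decay_slices(impulse_response, freq_hz, step, count):
    """Levels at freq_hz as the analysis window starts later and later.

    Each value is in dB relative to the first slice.
    """
    levels = [level_db(impulse_response[i * step:], freq_hz) for i in range(count)]
    return [lv - levels[0] for lv in levels]


# Two made-up drivers ringing at 8 kHz: one stops quickly, one stores energy.
quick = decay_slices(ringing(8000, 0.0005), 8000, step=96, count=4)
slow = decay_slices(ringing(8000, 0.002), 8000, step=96, count=4)
print("quick:", [round(x, 1) for x in quick])
print("slow: ", [round(x, 1) for x in slow])
```

Both synthetic drivers would show a peak at 8 kHz in a plain frequency response, but the slow one stays louder in every later slice, which is the stored energy a waterfall plot makes visible.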
Lastly, impulse response helps to determine how fast a headphone can respond to a signal, and how fast it can stop. An impulse allows us to see the rise and fall of the transducer and its ability to reproduce musical instruments that have fast transients.
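As a simple illustration of reading "how fast it starts" and "how fast it stops" from an impulse response, here is a hedged Python sketch. The function name, the -40 dB threshold, and the sample values are all assumptions chosen for the example.

```python
def rise_and_settle(impulse_response, sample_rate, threshold_db=-40.0):
    """Return (rise_time_s, settle_time_s) for a sampled impulse response.

    rise_time_s: time from the first sample to the largest peak.
    settle_time_s: time from that peak to the last sample that still
    reaches threshold_db relative to the peak.
    """
    magnitudes = [abs(s) for s in impulse_response]
    peak = max(magnitudes)
    if peak == 0:
        raise ValueError("impulse response is silent")
    peak_index = magnitudes.index(peak)
    limit = peak * 10 ** (threshold_db / 20.0)
    last_loud = max(i for i, m in enumerate(magnitudes) if m >= limit)
    return peak_index / sample_rate, (last_loud - peak_index) / sample_rate


# Hypothetical response sampled at 1000 samples per second.
response = [0.0, 0.5, 1.0, -0.6, 0.3, -0.1, 0.005, 0.0]
rise, settle = rise_and_settle(response, 1000)
print("rise:", rise, "s  settle:", settle, "s")
```

A shorter rise time corresponds to a driver that accelerates quickly, and a shorter settle time to one that stops quickly.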
One can see why two headphones that measure the same in frequency response can sound very different. It’s a combination of tests that will give us an indication of how a headphone will sound.
This is a simplified explanation. Headphone housing, materials, driver design, and ear tip design are only some of the other considerations in making a great headphone.
The final analysis is how it sounds to the critical human ear. How to "tune" all of the parameters is the "art" in the design. Knowing what measures well needs to be in concert with knowing what sounds good. Years of experience in knowing what to do, together with a critical ear, is a rare combination indeed.
Talking the Talk; Audio Terms Describing Headphone Listening
The following terms are intended to help us establish a common language for talking about headphone music reproduction. Just as wine aficionados have their terms, such as oaky, airy, and fruity, we too need terms to describe the sound of headphones.
We have enhanced the descriptions of these terms specifically around headphone listening, and have also introduced measurements so that we can begin to correlate what we hear with what we can measure. No single measurement is sufficient: it is a combination of measurements, along with the “ecosystem,” which includes your ear, the ear tips, and even the seal around the ear, that will determine the final result.
Aggressive: Forward and overly bright sonic character, as opposed to being smooth and balanced. It can be measured in high amounts of IM distortion and poor waterfall response. This distortion can cause long term listening fatigue.
Air: Spacious and open with a sense of lightness and transparency. Achieved through reproducing mid and high frequencies accurately with good phase response throughout the range.
Airy: Pertaining to treble which sounds light, delicate, open, and seemingly unrestricted in upper extension. From quality reproducing systems that have smooth and very extended HF response.
Ambience: Psychoacoustic impression of a physical acoustic space, such as a concert hall in which a recording is made.
Articulate: Imparting a sense of precise intelligibility and definition of vocals, instrumentals and the interactions between them. Comes from good transient and waterfall, especially at high frequencies.
Attack: The leading edge of a note, such as the “snap” of the drumstick as it hits the snare so one hears the individual snares. Also pertains to the ability of a system to reproduce the attack transients in music. Accomplished with extremely good transient and impulse response with great waterfall with no overhang.
Awesome: Describes the listening experience when all of the positive parameters of headphone design come together.
Articulation: The ability to reproduce fine details, especially quick transients. Tiny details reproduced accurately are a hallmark of a headphone with good articulation.
Balance: The smooth non-emphasis of any part of the audible spectrum. A headphone with good tonal balance can be played louder without fatigue, as it does not overemphasize any part of the frequency spectrum and therefore the overall level can be louder. Proper reproduction of a thunderous orchestra or big band is a good demonstration of tonal balance.
Another balance is channel balance, or the relative level of the left and right stereo channels. Channel balance is critical to good soundstage and imaging in a headphone, as the signals must arrive to both ears at the same time at the same level.
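Channel balance can be expressed as a level difference in dB between the two channels. A minimal Python sketch follows, assuming both channels are available as lists of samples of the same test signal; the helper names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def channel_imbalance_db(left, right):
    """Level of the left channel relative to the right, in dB (0 = balanced)."""
    return 20.0 * math.log10(rms(left) / rms(right))


# Hypothetical samples: the right channel plays at half the amplitude.
left = [1.0, -1.0, 1.0, -1.0]
right = [0.5, -0.5, 0.5, -0.5]
print(round(channel_imbalance_db(left, right), 2), "dB")
```

A result near 0 dB means the two channels are matched in level; this sketch says nothing about arrival time, which matters just as much for imaging.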
Bass: The audio frequencies between about 20Hz and 250Hz. New music with synthesized effects can produce very powerful low notes, so reproduction in the 30 to 50Hz region becomes important. Well recorded bass guitar is a good test for a combination of low end bass response with higher end fundamentals, as when a performer plays "slap" bass.
For a headphone design, the proper response needs to account for the insensitivities of the human ear. Flat response may not give very satisfying bass, as the ear becomes less sensitive as the volume and frequency go down. Good bass should also be “tight”: the headphone diaphragm needs to start (speed) and stop (waterfall) with the signal and not add sounds of its own when reproducing acoustic and electric bass.
Bass Extension: Realization of all low bass information from 250Hz down to 20Hz. Very few headphones can reproduce this well because of the tiny diaphragm that needs to move a lot of air to create these frequencies. Also, because the ear is less sensitive to bass frequencies, the bass response needs to go up as the frequency and volume go down.
Body: Fullness of sound, with particular emphasis on upper bass. Opposite of thin.
Bright: A sound that over-emphasizes the upper midrange and lower treble. This can be seen in exaggerated high frequency response as well as poor waterfall (long decay).
Clarity: Is the sound "clear" and "transparent" as opposed to muddy or fuzzy. Accomplished with good impulse and waterfall, so the headphone diaphragm starts and stops rapidly. A “slow” headphone may have all of the frequencies, but not good clarity.
Clear: Similar to clarity, but used to describe a lack of “speaker sound” or headphone sound. See also, Transparent.
Coloration: An audible added characteristic that a headphone produces and that is not part of the original source material. Caused by poor waterfall and/or frequency response resulting from resonances in the design of the diaphragm, as well as the earphone housing. Heavy metals are usually preferable to plastics, which can resonate.
Coherent: Showing no audible evidence of a crossover or of different driver colorations in any of the various frequency ranges. For example, a saxophone doesn’t sound like it’s coming from a low frequency and a high frequency diaphragm, but from one diaphragm. In dual and triple diaphragm headphone designs, it is extremely difficult to make all diaphragms sound as if they operate as one. Phase and impulse response measurements will show a lack of, or presence of, coherency.
Decay: Fadeout of a note following the initial attack, easily seen in waterfall response. Some frequencies may decay longer than others depending on the headphone design. Decay negatively affects sound accuracy, since it adds “coloration” to the music that wasn’t in the original recording.
Definition (or resolution): The ability of a component to reveal the subtle information that is fundamental to high fidelity sound. Also “inner definition,” such as the drawing of a bow across Yo-Yo Ma’s cello, which is extremely difficult to reproduce. Also revealed in the “bite” of horns in a big band recorded with a great condenser (electrostatic in reverse) microphone.
Delicate: High frequencies extending from 8kHz to 20kHz without accentuating peaks. This also describes a headphone’s ability to respond to extremely low-level signals, where some headphones may not have enough sensitivity or response.
Depth: A sense of hearing "into" the music, the 3rd dimension. Also referred to as front to back, there is a great sense of space. Great phase response is required to reproduce depth and “ambiance” of a recording or the "depth" of the soundstage.
Detail: The most delicate elements of the original recorded sound. These elements are the first to disappear with lesser equipment and headphones, as it requires high sensitivity and response.
Distortion: A sound that is not part of the original signal. Distortion can be the addition of overtones at multiples of the original signal’s frequencies (harmonic distortion), or the generation of new signals that result from the interaction of a combination of signals (intermodulation distortion). Can be characterized as a roughness, fuzziness, harshness, or stridency in the music. There are many distortions that can be measured: IM (intermodulation distortion), harmonic distortion, and TIM (transient intermodulation distortion). Also breakup, where the headphone can’t handle the power or the low frequencies, which causes the sound to crack up.
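Two of the distortion figures named in this entry can be made concrete with a little arithmetic. The Python sketch below uses the common textbook definitions: harmonic distortion as energy at whole-number multiples of a tone, and intermodulation as new tones at sums and differences of two tones played together. The amplitudes and test frequencies are invented for illustration.

```python
import math


def thd_percent(fundamental, harmonics):
    """Total harmonic distortion in percent.

    `fundamental` is the amplitude of the test tone; `harmonics` holds the
    amplitudes measured at 2x, 3x, ... the test frequency.
    """
    return 100.0 * math.sqrt(sum(h * h for h in harmonics)) / fundamental


def imd_products(f1_hz, f2_hz):
    """Frequencies where intermodulation products of two tones appear.

    Lists the second-order products and the third-order products that fall
    closest to the two test tones.
    """
    low, high = sorted((f1_hz, f2_hz))
    return {
        "second_order": [high - low, high + low],
        "third_order": [abs(2 * low - high), 2 * high - low],
    }


print(thd_percent(1.0, [0.03, 0.04]), "% THD")
print(imd_products(19000, 20000))
```

Note how two tones near the top of the audible range can create a product well down in the midrange, where the ear is far more sensitive.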
Dynamic: The ability to play very loud as well as very soft. Some headphones may "compress" on loud passages, or simply "distort." Some headphones will only reproduce the mids with moderate dynamics and cannot capture the power of real music. Some insensitive headphones are not able to resolve subtle low level signals, which makes them undynamic.
Dynamic is also used to describe a type of headphone speaker that uses a typical magnet and voice coil design, as opposed to "electrostatic" or "balanced armature" designs.
Dynamic Range: Pertaining to the ratio between the loudest and the quietest sounds. Wide dynamic range could be an explosion out of silence. Small dynamic range could be a loud rock band that doesn't ever play soft.
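Because dynamic range is a ratio, it is usually quoted in decibels. A minimal Python sketch, assuming the loudest and quietest levels are given as amplitudes (sound pressure or voltage) rather than power:

```python
import math


def dynamic_range_db(loudest, quietest):
    """Ratio between the loudest and quietest amplitudes, expressed in dB."""
    if loudest <= 0 or quietest <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(loudest / quietest)


# Hypothetical passages: one is 1000 times the amplitude of the other.
print(round(dynamic_range_db(1.0, 0.001), 1), "dB")
```

Each factor of ten in amplitude adds 20 dB, so a recording that never plays soft spans only a few dB.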
Efficient: The ratio of signal level in to sound output. An efficient headphone will play louder at the same volume setting, while an inefficient one will require that the volume level be turned up to get the same music level.
Therefore different headphones may play louder or quieter, despite being given the same level of output from the same media player. This does not mean that they are good or bad, unless you like to listen at loud levels and your headphones cannot play as loud as you want them to. In this case, high efficiency headphones are more satisfying. Many headphones cannot be driven to high output levels by a media player alone, and will require the use of a quality headphone amplifier.
When comparing headphones, it is helpful to set the volume levels so they play the same loudness in the middle frequencies, adjusting the volume between the two if necessary.
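The level matching described above can be estimated before listening if each headphone has a published sensitivity figure. The Python sketch below assumes sensitivities quoted in dB SPL at 1 mW and ignores differences in impedance, so treat the result as a starting point to be confirmed by ear. The function names and ratings are hypothetical.

```python
import math


def spl_db(sensitivity_db_at_1mw, power_mw):
    """Estimated sound level for a given input power."""
    return sensitivity_db_at_1mw + 10.0 * math.log10(power_mw)


def level_match_offset_db(sensitivity_a, sensitivity_b):
    """Gain change in dB to apply when switching from headphone A to B
    so that both play at roughly the same loudness."""
    return sensitivity_a - sensitivity_b


# Hypothetical ratings: A is 105 dB at 1 mW, B is 99 dB at 1 mW.
print(spl_db(105.0, 1.0), "dB SPL for A at 1 mW")
print(level_match_offset_db(105.0, 99.0), "dB more gain needed for B")
```

In this made-up case the less efficient headphone needs 6 dB more gain, which is four times the power, to match the other.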
Edgy: Excessive high frequency response. Occurs when there is a high level of distortion due to frequency response exaggeration, with a raspiness or harshness to the sound. Excessive decay (poor waterfall) inherent in the headphone design itself can cause this unpleasant sound, which will create listening fatigue.
Sometimes used in a positive sense when reproducing brass instruments such as trumpets, where one hears the "edge."
Extreme Highs: Audio frequencies of 10 kHz and above. Examples of instruments are triangles and cymbals. These frequencies are hard to hear because they do not occur often in music and the ear is less sensitive to them.
Fast: Good reproduction of rapid transients, which gives a sense of realism. Often refers to the ability of a headphone to reproduce sharp transients: the timely response and acceleration of a speaker to an incoming signal. A good brass band recording will show this off.
Focus: A strong, precise sense of image and music instrument placement.
Forward: Usually referring to the midrange, vocals or projection of instruments as being in the front of the soundstage.
Sub-bass – 16Hz to 30Hz
Bass – 20Hz to 250Hz
Mid-bass – 60Hz to 250Hz
Mid-range – 250Hz to 6kHz
Upper-mid-range – 2kHz to 6kHz
Highs – 6kHz and above
Extreme Highs – 10kHz and above
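The ranges listed above overlap, so a single frequency can belong to more than one band. A small Python lookup makes that explicit. Treating the band edges as inclusive is an assumption of this sketch, and `bands_for` is a hypothetical helper.

```python
# (name, lowest Hz, highest Hz or None for "and above"), as listed above.
BANDS = [
    ("Sub-bass", 16, 30),
    ("Bass", 20, 250),
    ("Mid-bass", 60, 250),
    ("Mid-range", 250, 6000),
    ("Upper-mid-range", 2000, 6000),
    ("Highs", 6000, None),
    ("Extreme Highs", 10000, None),
]


def bands_for(freq_hz):
    """Return the names of every listed band that contains freq_hz."""
    return [
        name
        for name, low, high in BANDS
        if freq_hz >= low and (high is None or freq_hz <= high)
    ]


for f in (25, 100, 3000, 12000):
    print(f, "Hz:", ", ".join(bands_for(f)))
```

For example, 3000 Hz falls in both the mid-range and the upper-mid-range, which is why the same note can be described with either term.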
Full: Strong sense of balance in the music with all instruments being equally reproduced. Good low frequency response, not necessarily extended, but with adequate level around 100 to 300 Hz. Male voices are full around 125 Hz; female voices and violins are full around 250 Hz; sax is full around 250 to 400 Hz. Opposite of thin.
Gritty: A harsh sound in the upper frequencies. Rough sandpapery sound caused by exaggerated high end and long decay times in the high frequencies.
Harmonics: The richness of sound and production of instrument overtones. Examples of this are the sound reproduction of a guitar, saxophone, and piano. Sometimes long decays in bass frequencies, although a form of distortion, can add a sense of fullness. Harmonics can also be a pleasant kind of distortion caused by electronics or the headphones themselves, but it is a distortion in that the harmonics are not part of the original music.
Harsh: Combination of unpleasant high frequency peaks and a hashy distortion. Harshness makes one want to turn the volume down as the harshness overrides other parts of the music.
Highs: The audio frequencies above about 6 kHz. Examples are the upper ranges of electric guitars, flutes, triangles. Most headphones cannot reproduce highs accurately.
High Midrange (High Mids, Upper Mids): The audio frequencies between about 2kHz and 6kHz. Examples are frequencies from the upper voice range, and the bite of electric guitars and brass instruments such as horns and big band.
Imaging: The placement of vocals or instruments within the soundstage. Good phase response and coherency will help provide realistic image.
Impact: How hard music "hits" the listener. Kick drums and explosions in a movie are more dynamic with a headphone’s ability to reproduce impact. Very few headphones have this ability.
Incredible: Like awesome, will describe a combination of desirable parameters.
Liquid: Smooth, relaxing, yet detailed sound. Opposite of harsh.
Low Bass: The audio frequencies ranging from below 20Hz to 60Hz. These are the hardest frequencies for headphones to reproduce because doing so involves moving a lot of air with a very small diaphragm. Very low organ pedal notes and today’s electronic instruments present new challenges for headphones. You want a headphone to accurately reproduce this low bottom end when it’s there in the music, but not reproduce it when it’s not.
Low Level Detail: Being able to resolve the delicate nuances of music, especially during quiet passages.
Low End Detail: The subtle distinct sounds that you can hear in the bass frequencies. Examples are when a pedal impacts the bass drum or when fingers move across a stringed bass.
Low Midrange (Low Mids): The audio frequencies between about 250Hz and 500Hz. Very critical in reproducing vocals accurately. Artificial exaggeration here can create a sense of fullness that is not natural.
Lush: Very rich and full reproduction. Smooth.
Mellow: Reduced high frequencies. The opposite of edgy.
Midbass: The audio frequencies between about 60Hz and 250Hz. Kick drum and bass guitar are examples of instruments represented by these frequencies.
Midrange (Mids): The audio frequencies between about 250 Hz and 6000 Hz. This is the ear’s most sensitive range. We can hear even small variations in this region. Because of this sensitivity, reproducing vocals, piano, and guitars so that they sound natural and not artificial is extremely important in this range.
Muddy: Ill defined and congested. Headphone keeps going after the signal stops. Can easily be seen with bad waterfall and/or high harmonic distortion. Muddy sound can actually be caused by soft eartips, such as some foam eartips.
Musical (or musicality): The ability to hear through the headphone and into the music. Instruments sound more like real instruments as opposed to speakers reproducing instruments.
Natural: Realistic sound reproduction.
Neutral: Free from coloration. Not artificially exaggerating one frequency or another.
Open: Sound which has height and "air." Relates to clean upper midrange and treble.
Perfection: No such thing, but it’s a nice word.
Presence: A sense that the instrument is present in the room with the listener. Able to reproduce an experience of "being there."
Punch: Good reproduction of dynamics. Good transient response, with strong impact. Sometimes a bump around 5 kHz or 200 Hz. Good waterfall is required here. Ear tips will also affect the ability to properly reproduce punch.
Reference: A standard by which all others are compared. The highest quality available.
Resolution (Resolving): Hearing under a microscope. The ability to hear fine details. Can also refer to sampling rates and ability to hear all of the harmonics and tonality of the music.
Rich: See “Full.” Also, having euphonic distortion made of even order harmonics.
Roll off: A frequency response which falls gradually above or below a certain frequency range. This can cause inaccurate sound, in that the headphone cannot reproduce all of the frequencies.
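The simplest textbook model of a roll off is a first-order one, where the level is about 3 dB down at the corner frequency and keeps falling beyond it. The Python sketch below uses that model purely as an illustration; real headphones roll off in more complicated ways, and the 40 Hz corner is an invented example.

```python
import math


def rolloff_db(freq_hz, corner_hz, kind="high"):
    """Level in dB of a first-order roll off, relative to the flat region.

    kind="high": response falls above corner_hz (treble roll off).
    kind="low":  response falls below corner_hz (bass roll off).
    """
    ratio = freq_hz / corner_hz if kind == "high" else corner_hz / freq_hz
    return -10.0 * math.log10(1.0 + ratio ** 2)


# Hypothetical headphone whose bass rolls off below 40 Hz.
for f in (20, 40, 80, 1000):
    print(f, "Hz:", round(rolloff_db(f, 40, kind="low"), 1), "dB")
```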
Sibilance: A coloration that resembles or exaggerates the vocal ssss sound. Bad waterfall in the high frequency can exaggerate these sounds.
Smooth: Easy on the ears. Not harsh. Flat (neutral) frequency response, especially in the midrange. Lack of peaks and dips in the response.
Soundstage: The space in front of the listener from far left, through center, to far right. Can also incorporate the "depth" of the soundstage. Is the listener in front of the band? Do instruments in the back of the orchestra sound like they are in the back? Soundstage should have width, depth, and height. Higher resolution source material (sampled at higher rates) will reproduce soundstage better.
Speed: Timely impulse response. The ability of a speaker to respond quickly to signal input. A fast system with good pace gives the impression of being right on the money in its timing. Good speed is absolutely critical to realistic musical reproduction and separates ordinary headphones from great ones.
Sub-Bass: The audio frequencies between about 16Hz and 30Hz. These frequencies are “felt” more than heard. Nearly impossible to reproduce with a headphone.
Sweet: Not strident or piercing. Delicate. Flat high frequency response, low distortion. Lack of peaks in the response. Highs are extended to above 10k.
Texture: A perceptible pattern or structure in reproduced sound. The ability to hear the differences between two similar instruments can be described in its texture.
Tight: Good low frequency transient response and detail. Accurate and fast impulse response. Low decay (great waterfall).
Timbre: The tonal character of an instrument, with all of the harmonics that give it its identity. Headphones themselves have a timbre of their own, which should not interfere with their ability to reproduce music.
Transient: The leading edge of a percussive sound. Good transient response makes the sound as a whole more live and realistic. Accomplished with fast impulse response and good waterfall.
Transparent: Easy to hear into the music. Detailed, clear and not muddy. Wide flat frequency response, sharp time response, very low distortion and noise. The headphone sound becomes invisible and only the music remains.
Upper Midrange (Upper Mids, High Mids): The audio frequencies between 2 kHz and 6 kHz. Electric guitar, brass instruments and high vocal ranges are in this region.
Uncolored: Free from audible colorations sounding more like real music.
Warm: Describes satisfying full sound. Low and mid bass needs to be accurately reproduced without becoming lean or thin. High frequencies need to be full and harmonious.
Weighty: Good low frequency response below 50 Hz. A sense of substance produced by deep, controlled bass.