MSc-IT Study Material
June 2010 Edition

Computer Science Department, University of Cape Town

Sound in the Interface

The vast majority of computer-based user interfaces that we encounter rely almost totally on the visual medium. However, another medium that is frequently used in interface design is sound. Before discussing how interface designers do and may make use of sound, we will review some of the properties of the audio medium consisting of sound together with the human sense of hearing.

Properties of sound and hearing

A number of properties of the audio medium are pertinent to the use of sound in the interface.

Sounds can vary in a number of dimensions: pitch (or frequency), timbre (or musical tone), and intensity (or loudness). Not all sounds or variations in sound are audible to humans. The ear is capable of hearing frequencies ranging from about 20 Hz up to about 15 kHz, and differences of around 1.5 Hz can be discerned (though discrimination is less accurate at high frequencies).

The capability of computer sound output devices to produce variations along each of the dimensions of pitch, timbre and intensity means that sound output is potentially a rich and sophisticated medium for conveying information to users.
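To make the three dimensions concrete, the following is a minimal sketch in Python of how a program might synthesise a tone. The function names, the sample rate and the use of overtone weights as a crude stand-in for timbre are all illustrative assumptions, not part of any particular sound API.

```python
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second (an assumed, commonly used rate)

def tone(frequency_hz, amplitude, harmonics=(1.0,), duration_s=0.5):
    """Return a list of samples in the range [-1, 1].

    frequency_hz controls pitch, amplitude (0..1) controls intensity, and
    harmonics (relative strengths of the overtones) roughly controls timbre.
    """
    total = sum(harmonics)
    count = int(SAMPLE_RATE * duration_s)
    samples = []
    for i in range(count):
        t = i / SAMPLE_RATE
        value = sum(weight * math.sin(2 * math.pi * frequency_hz * (k + 1) * t)
                    for k, weight in enumerate(harmonics))
        samples.append(amplitude * value / total)
    return samples

def write_wav(path, samples):
    """Write samples as a mono, 16-bit WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

# Example use (hypothetical file names):
# write_wav("pure.wav", tone(440, 0.5))                        # plain tone
# write_wav("rich.wav", tone(440, 0.5, harmonics=(1, 0.5, 0.25)))  # same pitch, different timbre
```

Changing only one argument at a time (frequency, amplitude or the harmonic weights) gives a simple way to hear each dimension in isolation.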

The audio medium, like any other, has a number of inherent properties that constrain the way humans process and make sense of the sounds they hear. Therefore, understanding these constraints will be crucial to the successful use of sound in interactive systems.

In contrast to vision, sound is a "volatile" medium in the sense that sounds do not persist in the same way that visual images do. Or to put it another way, the visual field can be regarded as a parallel array of information elements, each of which may remain constant or may vary over time. Sound, on the other hand, can be seen as a single element (described by its pitch, timbre and intensity) that may vary over time, and the rate at which it varies or carries information, or is perceived, is not under the control of the listener. Its potential as a means of conveying information, and the amount and type of information that can be carried, is therefore rather different from that of the visual channel. Consequently, the visual channel can be regarded as frequently having a much faster access time. For example, a large amount of information may be made simultaneously available in the visual channel, whereas presenting the information in the audio channel may take longer as the information must be "serialised". A further implication of this is that an audio information stream may place greater demands on the user’s memory: while listening to part of a message, the user must remember what has gone before. When reading a visual display, the parts previously read remain visible.

Another relevant property of the audio medium is the fact that, unlike vision, hearing is non-directional. While binaural hearing does grant us a limited ability to determine the direction of a sound source, there is no sense in which we can listen in a particular direction. Similarly, while we can very easily control our visual sense (e.g., by looking in a particular direction), it is much harder to be selective about what one listens to.

It is well known that people are rather good at noticing changes in sound – such as the onset or cessation of a noise or tone. However, we are rather less good at extracting information from a stream that remains relatively constant. In fact, if a background sound remains relatively constant, over a period of time we will tend to become less aware of it, and eventually will filter it completely and cease to notice it at all. See the later section on ATTENTION.

A further property of sound and hearing that designers should be aware of is that we are relatively poor at separating out a single sound or sequence of sounds from a background of similar sounds. Imagine trying to hold a conversation in a noisy environment where many other people are talking, or trying to follow two conversations at once.

For most of us, in our everyday lives, sound plays a very important part. It is often said that the majority of information we receive about the world comes to us through our visual sense. While this is true, it is also the case that sound plays a central role in communicating with others (through speech and other sounds), receiving information and entertainment (through radio broadcasts, musical performances, and so on), and allowing us to be aware of events – some of which may be outside our visual field (police sirens, ringing telephones, etc.). Sounds that are apparently in the ‘background’ often give us vital clues about the status of ongoing processes (e.g., the sound made by a car engine as we are driving, the noise made by machinery in a factory).

Despite the apparent limitations described above, sound is a remarkably important channel for conveying information. Next we will look at some of the ways sound is used in current user interfaces, and how it could be used in the future, and will identify some guidelines that can help designers to make the use of sound more successful.

Activity 1 - Sound and vision

You can carry out this activity on your own if you like, but it might be easier if you can work in pairs. Compare the time taken to read a written passage, with that taken to listen to the same text being spoken. For the latter part, you can simply read the text out loud, or use a computer speech synthesiser.
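Before timing yourself, it can help to estimate what to expect. The sketch below is a rough calculation only: the two rates are assumed typical values, not measurements, and should be replaced with the figures you obtain in the activity.

```python
# Assumed typical rates (words per minute); replace with your own timings.
SILENT_READING_WPM = 250
SPEECH_WPM = 150

def seconds_for(word_count, words_per_minute):
    """Time in seconds to get through word_count words at the given rate."""
    return word_count / words_per_minute * 60

def compare(passage):
    """Return (reading_seconds, listening_seconds) for a passage of text."""
    words = len(passage.split())
    reading = seconds_for(words, SILENT_READING_WPM)
    listening = seconds_for(words, SPEECH_WPM)
    return reading, listening
```

Under these assumed rates, listening takes noticeably longer than reading for any passage, which is the "serialised" nature of audio discussed above.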

Activity 2 - Menu Structure

It is often said that structured menus should be made wide (many choices in a menu) and shallow (few levels of menu), rather than deep (many levels). Studies suggest that doing this will allow users to carry out tasks faster, while making fewer errors. See (Shneiderman, 1998, pages 249-250) or (Norman, 1990, chapter 5) for discussions.

Suppose that instead of menus on a visual display, we are designing menus in an audio interface for telephone-based services. Do the same guidelines about wide/shallow or deep/narrow structures apply? How would you investigate the efficiency of different audio menu schemes?
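One way to begin such an investigation is with a simple model, before testing with real users. The sketch below is an illustration under stated assumptions: every option takes the same time to speak, the wanted option is equally likely to be anywhere in the list, and the caller can interrupt as soon as it is heard. The timing values and function names are hypothetical.

```python
def depth_needed(leaves, breadth):
    """Number of menu levels needed to reach `leaves` services with `breadth` options per menu."""
    depth, reachable = 0, 1
    while reachable < leaves:
        reachable *= breadth
        depth += 1
    return depth

def expected_seconds(leaves, breadth, item_s=2.0, select_s=1.0):
    """Expected time to reach one service in a spoken menu.

    item_s: assumed time to speak one option.
    select_s: assumed time for the caller to respond at each level.
    """
    depth = depth_needed(leaves, breadth)
    heard_per_level = (breadth + 1) / 2  # average options heard before the wanted one
    return depth * (heard_per_level * item_s + select_s)

# Compare structures for 64 services.
for breadth in (64, 8, 4, 2):
    print(breadth, depth_needed(64, breadth), expected_seconds(64, breadth))
```

A model like this only suggests hypotheses; it ignores memory load and errors, so the figures would need checking against measurements from real callers.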

A discussion of this activity can be found at the end of the chapter.

Use of sound in current and future interfaces

Sound is currently used in many computer interfaces and other interactive devices, sometimes as an important source of information, and at other times simply as a means of making the interface seem more impressive. Sound can be used for both input and output functions in an interface, and the kinds of sounds that are used include abstract tones and bleeps, naturally occurring sounds, music, and, of course, speech.

The audio channel is, for most of us, a rich and important source of information in interactions with other people or with the environment. However, sound has played a more limited role in human-computer interaction than in other aspects of life. Part of the reason for this may be that it is not always clear what kinds of functions sound is appropriate for, and how to make effective use of sounds to support those functions.

Audio Alerts

While sound might be ineffectively exploited, that is not to say it is unused. Almost all computer systems "bleep" when an error of some sort occurs. What is the purpose of such a bleep, and what can the user infer from it? Clearly the sound indicates to the user that something has happened, but it is typically left up to the user to determine by other means the nature of the event (an erroneous input, or a system generated warning, or simply the arrival of email?), its source (one of several applications, the operating system, networking software?), and what should be done about it. The following, rather extreme, example highlights the kind of problematic situation that can arise.


In 1979, at the Three Mile Island nuclear power station in the USA, one of the most serious nuclear accidents in history took place. As the incident was unfolding, operators in the plant control room were faced with a bewildering array of instruments and displays giving information about the state of the power station. As problems arose, auditory alarms sounded to alert the operators to what was going on. The following commentary on the incident (drawing on the report of the subsequent investigation by a Presidential Commission) explains one aspect of the operators’ predicament:

Frederick and Faust [two of the operators] were in the control room when the first alarm sounded, followed by a cascade of alarms that numbered 100 within minutes. … Later Faust would recall for the Commission his reaction to the incessant alarms: "I would have liked to have thrown away the alarm panel. It wasn't giving us any useful information."

Partly as a result of the confusing alarms, the operators failed to diagnose one of the serious problems with the reactor for two hours, during which time extensive damage and leakage of radioactive material occurred.

The point is that although alarms and alerts can be successful for indicating that some event has occurred, they often carry little useful information about the nature or location of the event or what should be done in response. And if several notable events occur together, then providing an auditory indication of each is simply going to confuse users. The world of power stations and control rooms may seem very far removed from everyday design, but the same issues are relevant in the interfaces of desktop systems and the design of web pages.

Sound output, therefore, must be used with care. Sometimes it is appropriate to indicate a change of status or a particular event with a sound, but we must be aware that beeps and similar alerting sounds often provide the user with too little information about either the nature of the event or what action will need to be taken as a result. Used with care, however, sound can enhance an interface and provide users with an important source of information.
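As an illustration of the two problems identified above (alerts that say nothing about nature, source or action, and many alerts arriving at once), the sketch below shows one possible way of structuring alerts so that a single informative message can be produced. The class, field names and message wording are hypothetical, not taken from any real system.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    source: str  # which application or subsystem raised the alert
    kind: str    # the nature of the event
    action: str  # what the user should do about it

def summarise(alerts):
    """Produce one message for a batch of alerts instead of one bleep per alert."""
    if not alerts:
        return ""
    if len(alerts) == 1:
        a = alerts[0]
        return f"{a.source}: {a.kind}. {a.action}"
    counts = Counter(a.source for a in alerts)
    parts = ", ".join(f"{n} from {src}" for src, n in counts.most_common())
    return f"{len(alerts)} alerts: {parts}"
```

The resulting text could be displayed, spoken, or mapped to distinct sounds per source; the point of the sketch is only that the alert carries its nature, source and required action with it.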

Providing Information

So far, we have been discussing sounds without saying much about what kinds of sounds are typically – or in the future could be – useful features of an interface. The most basic sounds are simple bleeps that indicate an event or change of state. However, other possibilities exist.

An idea that several researchers have experimented with is to add sounds to many of the familiar features of existing visual user interfaces. So, in addition to the visual cue provided by an icon, an auditory cue would be provided as a further memory aid. The sounds used in some of these experiments were natural ones, chosen to match the kind of interface objects with which they are associated. For example, folders on a desktop might have the sound of paper crumpling (presumably because in the real world, we put paper into folders), and dropping an item in the wastebasket might produce a sound of breaking (because things sometimes break when we throw them in the bin).

While this might seem an attractive way of providing the user with additional feedback and extra ways of remembering how the interface works, it has some problems. One is that the same sound might mean different things to different people – so the sounds are not as "natural" as they first seem. Another problem is that while auditory equivalents may readily be found for some interface elements, there are a great many computer-based objects and operations for which no real-world audio counterpart exists.

One researcher in this area, Stephen Brewster, has taken up the idea of augmenting existing, visually based interfaces with sounds to assist the user, but has used abstract sounds rather than naturally occurring ones. These sounds – known as earcons (icons for the ear!) – are made up of short musical phrases that vary in the sequence of notes, overall pitch, tempo, and so on. Earcons have been added to icons, menu items, and so on, of conventional computer interfaces.
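The idea of a short phrase that varies in its notes, overall pitch and tempo can be sketched as a small data structure. This is an illustrative representation only, not the one used in the research described; the class and method names are assumptions, and notes are written as MIDI note numbers (69 is the A at 440 Hz).

```python
from dataclasses import dataclass

def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz (equal temperament, A4 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

@dataclass(frozen=True)
class Earcon:
    notes: tuple          # (midi_note, beats) pairs making up the phrase
    tempo_bpm: int = 120  # beats per minute

    def transposed(self, semitones):
        """Same phrase at a different overall pitch."""
        return Earcon(tuple((n + semitones, b) for n, b in self.notes), self.tempo_bpm)

    def at_tempo(self, bpm):
        """Same phrase played faster or slower."""
        return Earcon(self.notes, bpm)

    def schedule(self):
        """Return (frequency_hz, duration_s) pairs ready for a synthesiser."""
        beat_s = 60 / self.tempo_bpm
        return [(midi_to_hz(n), b * beat_s) for n, b in self.notes]

# A hypothetical family: one base phrase, with related variants.
file_earcon = Earcon(((60, 1), (64, 1), (67, 2)))
folder_earcon = file_earcon.transposed(12)   # same phrase, an octave higher
urgent_earcon = file_earcon.at_tempo(240)    # same phrase, twice as fast
```

Deriving related earcons from a shared base phrase is one way to let the sounds for related interface objects resemble one another.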

Speech Output

In communicating with other people, we most commonly make use of speech, and our interactions with computers can be similarly speech-based. Speech synthesis has been possible for quite some time, but it is only relatively recently that using synthetic speech has become a reality in everyday interfaces. Speech is now becoming increasingly popular as an addition to more conventional user interfaces.


Example: Voice as an alert

Recent versions of Apple's MacOS operating system include a speech synthesiser that will speak the contents of pop-up alert windows. If a pop-up window appears with a text message, this will be accompanied by a voice speaking the text.

While this will undoubtedly be a valuable feature for some users, others simply find it unnecessary, and turn it off (as we already discussed, it can take rather longer to listen to the verbalisation than it does to read the corresponding text). In fact, in Apple's implementation of this feature, there is a short delay between the pop-up box appearing and the start of the speech, and if the pop-up is closed or cancelled during this delay, then the speech is never started. Some users are quite surprised when, the first time they are slow to close the dialogue box, the computer starts speaking to them!
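The delay-then-speak behaviour described here can be sketched with a cancellable timer. This is not Apple's implementation; the class name, the `speak` callback and the two-second default delay are all assumptions made for illustration.

```python
import threading

class SpokenAlert:
    """Speak an alert's text only if it is still on screen after a short delay."""

    def __init__(self, text, speak, delay_s=2.0):
        # `speak` is any callable that takes the text; in a real system it
        # would hand the text to a speech synthesiser (hypothetical here).
        self._timer = threading.Timer(delay_s, speak, args=(text,))

    def show(self):
        """The pop-up appears: start counting down to speech."""
        self._timer.start()

    def dismiss(self):
        """The pop-up is closed: cancel speech if it has not started yet."""
        self._timer.cancel()
```

A user who dismisses the pop-up promptly never hears the voice, which matches the surprise described above when a slow response triggers speech for the first time.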


Example: Voice can reduce demands on the visual sense

On modern aircraft flight decks, synthesised speech is used for many applications. For example, in the final stages of an approach to landing, the aircraft's height above the ground is critical. On some aircraft, the computer system monitors the height and provides a verbal read-back of the height in intervals of 10 feet as the aircraft approaches the ground.

This is one instance where speech has the potential to add real value to an interface. On approach to landing, a pilot's visual sense is usually in high demand – looking out of the window at the runway below, as well as monitoring other instruments. If the critical height information were available only on a visual display, then the pilot's vision would be stretched even further, by the requirement to look at yet another display, and by the need to continually re-focus on different things. On the other hand, if some of the information can be "off loaded" and presented acoustically instead of visually, then the demand on the pilot is lessened, reducing the potential for error and catastrophe.
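The read-back logic can be sketched as a function that, given two successive height readings, decides which heights should be spoken. This is an illustrative sketch only, not how any real avionics system works; the function name and the handling of readings that skip several marks are assumptions.

```python
def callouts(previous_ft, current_ft, step_ft=10):
    """Return the multiples of step_ft passed while descending from previous_ft to current_ft.

    Marks are listed highest first. Nothing is returned when climbing or level,
    and a mark equal to previous_ft is assumed to have been spoken already.
    """
    if current_ft >= previous_ft:
        return []
    first = int(previous_ft // step_ft) * step_ft
    if first == previous_ft:
        first -= step_ft
    marks = []
    height = first
    while height >= current_ft and height > 0:
        marks.append(height)
        height -= step_ft
    return marks
```

For example, a descent from 53 feet to 38 feet between readings passes the 50 and 40 foot marks, so both would be candidates for speaking. A real design would also need to decide what to do when marks arrive faster than they can be spoken.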

Sound Input

Sound can also be used as an input device. Almost always in human-computer interaction, this means using voice recognition – a relatively new technology that is beginning to achieve the level of performance needed to be effective. One often hears a view, presumably inspired by science fiction movies, that if only we could just speak to our computers, then all our usability problems would be solved. Indeed, we would no longer need user interfaces! It must be emphasised, though, that voice recognition is just another kind of user interface technology, like the mouse and keyboard. It is therefore appropriate for some things and not for others, and using voice input has just as much (or perhaps more) potential to create usability problems as any other technology.

A number of products exist that allow standard desktop computer systems to take speech as input. For instance, recent versions of Apple's MacOS operating system allow the user to speak standard commands (such as Save, Print, Open, and so on), which the computer interprets and executes. Typically, though, users spend only a small part of their time entering commands into a computer, and much more time entering and editing the content of their documents, so a facility for speaking commands is only of limited use.
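Spoken commands of this kind work from a small, fixed vocabulary. The sketch below assumes a hypothetical recogniser that has already turned the user's speech into text; it shows only the final step of deciding whether that text is one of the known commands. The names are illustrative.

```python
VOCABULARY = {"save", "print", "open"}  # the example commands named in the text

def interpret(recognised_text, vocabulary=VOCABULARY):
    """Return the matching command, or None if the utterance is not a known command."""
    word = recognised_text.strip().lower().rstrip(".!?")
    return word if word in vocabulary else None
```

Anything outside the vocabulary is rejected rather than guessed at, which is one reason a small command set is easier to recognise reliably than free dictation.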

A number of products are available that allow users to speak text to the computer, which is then used as the input for word processors and other software.

While voice input might be a nice addition to existing desktop computer systems, it is likely to play a much more central role in interactions where using familiar visual output and mouse and keyboard input is difficult or infeasible. One such situation that we have mentioned before is the provision of services that will be accessed by telephone using only the auditory channel. Several products exist that allow designers of interactive telephone-based services to incorporate voice recognition technology. This allows users to provide information, commands and so on. For example, the Vocalis group markets a range of products and has made demonstrations available over the telephone.

Implications for user interface design

In the following sections we will present some suggestions and guidelines for when and where sound might most successfully be employed in the user interface for both the output and input channels.

Sound output

Audio output may be the appropriate channel to use for:

  • Giving immediate feedback that an action has taken place (buttons on ATMs, telephones, etc., that "beep" or "click" as they are pressed);

  • Presenting different kinds of information to that made available using the visual channel (e.g., non-static information such as alerts);

  • Augmenting visual interfaces by providing additional information and cues;

  • Supporting users for whom the visual interfaces are not an option (e.g., those with visual impairments);

  • Supporting users whose visual senses are already heavily used for other parts of their task (e.g., aircraft pilots);

  • Providing output in interfaces where visual information cannot be presented (e.g., mobile or handheld devices with small or no screens).

Audio output may not be particularly good for:

  • Constantly changing status information

  • Use in shared offices or workplaces, where privacy is important or where the output of many users' computers would lead to confusion or disturbance

  • Noisy environments where sound may be difficult to hear (e.g., the user interface of machinery on a building site)

  • Quiet environments where sounds could cause distraction (e.g., a library or recording studio)

Voice input

As discussed earlier, using sound as an input device almost always means using voice recognition.

Voice input may be appropriate for:

  • Users who are unable to use more familiar input devices like the mouse or keyboard (e.g. those with severe motor impairments)

  • Users who are busy doing other things (e.g., an aircraft pilot, whose visual channel is occupied monitoring the approaching runway during landing)

  • Interactions that may be cumbersome using the other interaction devices that are available (e.g., menus used in mobile phone services)

  • Tasks that need only a small, predetermined range of spoken commands

Voice input may not be particularly good for:

  • Noisy environments

  • Use by many different users with different voices and accents

  • Tasks that need a wide range of words or a specialised technical vocabulary (though some commercially available voice recognition systems have versions adapted to technical specialisms such as medicine or law)

  • Things that are not easily verbalised (e.g., diagrams)