Transcribing audio to text


Transcribing audio to text for captions and transcripts involves more than capturing the speech. Good captions and transcripts include relevant non-speech audio information. The art is knowing what is relevant and how to communicate it in text, both of which benefit from experience. If you have the budget, consider outsourcing the transcription to professionals. Otherwise, you can do it yourself following the guidance on this page.

How to transcribe

You have two options for transcribing:

Type up the audio as you listen to it, stopping and restarting. This can be made less tedious with transcription software that slows the audio and facilitates pausing.
Start with an automatically-generated text file and correct errors.

More details on options and tools for transcribing are in How to get or make transcripts.

What to transcribe

You transcribe all speech and non-speech sounds (laughs, groans, sighs, screams, car backfiring, footsteps approaching, distant roaring).

For the most part, the best practices for what to include in captions and transcripts are identical. Differences are noted in the following sections.

Identify the speakers. Use the full name the first time and a single name after that. If a speaker is not identified, use Speaker + number (e.g., Speaker 1, Speaker 2) or use a role/title without a number (e.g., Interviewer, Doctor).
Include relevant information about the speech. For example,

(whispering)
Alan: You go first.

(mouthing)
Ellen: No way.
Put non-speech sounds in parentheses, italics, lowercase, and with a space before and after. For example,

( chatter in distance )

( sniff )
When a speaker is off-screen, set their speech in italics. For example,

Doug: Are you coming?

Annie: I’ll be right down.
Use punctuation to convey emphasis.
- For an incredulous question, use a question mark and an exclamation point:
  
  Ted: Are you saying we have to start over?!
- For pauses in speech, use an ellipsis:
  
  Sheila: We … Yes, we begin tonight.
For interrupted speech, use a dash at the end of the line. Any text that follows the interruption should be set on a new line:

Ted: We’ll never get --
Use all capital letters only to indicate yelling. For example,

Ellen: FORE!
When the speech is unintelligible or inaudible, transcribe:

[inaudible]
Indicate large silences as:

(silence)
Do not reveal intentionally withheld information before the appropriate time.
Exclude non-relevant speech and non-relevant background noise.
Include background music if it’s important to understand the content. Identify music with the uppercase label MUSIC (or a verb implying music), followed by a colon and the title in quotation marks followed by the artist. For example,

MUSIC: “Rocket Man” by Elton John

CAROL HUMS: “Happy Birthday”

BOB WHISTLES: “Take Me Out to the Ball Game”
Transcribe lyrics if they’re important and set them in italics. With captions, add a musical note to the beginning and end of each line of lyrics.

♪ A long, long time ago ♪
Describe music that’s not part of the action but sets the mood:

♪ scary music ♪
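
If you post-process a draft transcript with a script, a few of the conventions above can be applied mechanically. The following is a minimal Python sketch, not a standard tool: the function names and the surnames in the usage example are hypothetical, and it assumes that the first word of a full name is the single name to use after the first mention. Italics are left to whatever format you publish in.

```python
def label_speakers(turns):
    """Label each turn with the full name on first use, then a single name.

    `turns` is a list of (full_name, speech) pairs.
    Assumption: the first word of the full name is the single name to reuse.
    """
    seen = set()
    lines = []
    for full_name, speech in turns:
        if full_name in seen:
            label = full_name.split()[0]
        else:
            label = full_name
            seen.add(full_name)
        lines.append(f"{label}: {speech}")
    return lines


def non_speech_sound(description):
    """Format a non-speech sound: lowercase, in parentheses, padded with spaces."""
    return f"( {description.strip().lower()} )"


def music_label(title, artist=None, label="MUSIC"):
    """Format a music identifier: uppercase label, colon, quoted title, optional artist."""
    text = f"{label.upper()}: “{title}”"
    if artist:
        text += f" by {artist}"
    return text
```

For example, `label_speakers([("Alan Reed", "You go first."), ("Ellen Park", "No way."), ("Alan Reed", "Fine.")])` labels the third turn with `Alan` only, and `music_label("Happy Birthday", label="Carol hums")` produces the `CAROL HUMS` form shown above. Judging which sounds and music are relevant still needs a person.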

Transcribe accurately

When transcribing, the goal is accuracy:

Never paraphrase or omit words (and do not censor).
Never substitute words.
Never rearrange the order of speech.
Never correct or edit a speaker’s grammar.
Never provide clarifying information in the captions (you may in the transcript).

Transcript considerations and differences from captions

Descriptive transcripts also describe important visual information (animation, text or graphics, the setting and background, the actions and expressions of people, animals, etc.). Follow the best practices for writing descriptions (see Description of visual information).
Descriptive transcripts may be generated by combining the caption and description timed text files (e.g., WebVTT). This is necessary for interactive transcripts.
If your transcript is generated from timed text files, descriptions must fit into gaps in the main audio, or the player must provide functionality to pause during the description (see Description of visual information).
If your transcript is static, your descriptions do not need to fit into gaps in the audio, and may take as many words as needed for clarity.
Transcripts include onscreen text in videos. Captions do not include onscreen text.
Transcripts also identify the source of sounds, rather than just describe them.
In some cases, such as legal depositions, the transcript must be verbatim, including ums, ahs, and indicating pauses.
Headings, topics and links can make the transcript more usable. Here is an example transcript with headings. Here's another example transcript organized by topics in square brackets.
Include timestamps only when useful. If you do include them, they don’t need to be as granular as the captions, and don’t need to include end times. The example TED Talk transcript “How technology allowed me to read” adds a timestamp to each paragraph in an interactive transcript, where the timestamp doubles as a video link.
Add a timestamp to inaudible audio. For example,

Rebecca: You have one to two weeks and [inaudible 1:20:33] to prepare.
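
To illustrate how a descriptive transcript can be assembled from caption and description cues, here is a minimal Python sketch. It assumes the cues have already been read from the timed text files into (start time, text) pairs; it does not parse WebVTT. The `Cue` type, the function names, the `[Description: ...]` marker, and the sample description text are all hypothetical choices for this example, not part of any standard.

```python
from typing import NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the media
    text: str


def format_timestamp(seconds):
    """Format seconds as h:mm:ss, matching the [inaudible 1:20:33] style."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def merge_transcript(captions, descriptions, with_timestamps=False):
    """Interleave caption and description cues in order of start time.

    Descriptions are wrapped in a hypothetical "[Description: ...]" marker.
    The sort is stable, so a caption precedes a description with the same start.
    """
    merged = sorted(
        [(c.start, c.text) for c in captions]
        + [(d.start, f"[Description: {d.text}]") for d in descriptions],
        key=lambda item: item[0],
    )
    lines = []
    for start, text in merged:
        prefix = f"{format_timestamp(start)} " if with_timestamps else ""
        lines.append(prefix + text)
    return lines


def inaudible_marker(seconds):
    """Build the inaudible marker with a start-time timestamp."""
    return f"[inaudible {format_timestamp(seconds)}]"
```

Timestamps are off by default, in line with the guidance to include them only when useful. In a static transcript the description text can be as long as clarity requires, so this sketch does not check that descriptions fit into gaps in the audio.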

Related WCAG resources

Success criteria

Techniques

Failures

F8: Failure of Success Criterion 1.2.2 due to captions omitting some dialogue or important sound effects