“The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
– Tim Berners-Lee
Accessibility is an important element of web development, and as video content becomes ever more prevalent, the need for captioned content grows with it. WebVTT is a technology that helps with captioned content: it is a subtitle format that integrates easily with existing web APIs.
That’s what we’re going to look at here in this article. Sure, WebVTT is captioning at its most basic, but there are ways to implement it to make videos (and the captioned content itself) more accessible for users.
Hi, meet the WebVTT format
First and foremost: WebVTT is a type of file that contains the text “WEBVTT” and lines of captions with timestamps. Here’s an example:
WEBVTT

00:00:00.000 --> 00:00:03.000
- [Birds chirping]
- It's a beautiful day!

00:00:04.000 --> 00:00:07.000
- [Creek trickling]
- It is indeed!

00:00:08.000 --> 00:00:10.000
- Hello there!
A little weird, but makes pretty good sense, right? As you can see, the first line is “WEBVTT” and it is followed by a time range (in this case, 0 to 3 seconds) on Line 3. The time range is required. Otherwise, the WEBVTT file will not work at all and it won’t even display or log errors to let you know. Finally, each line below a time range represents captions contained in the range.
Note that you can have multiple captions in a single time range. Hyphens may be used to indicate the start of a line, though they’re not required and are more stylistic than anything else.
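To make the structure concrete, here is a minimal Python sketch that splits a WebVTT file into its time ranges and caption lines. It assumes cues are separated by blank lines and that the file uses "\n" line endings; the function name split_cues is made up for illustration, and this is not a complete WebVTT parser.

```python
def split_cues(vtt_text):
    """Return a list of (timing_line, caption_lines) tuples.

    Simplified sketch: blocks are separated by blank lines, and any
    block without a "-->" timing line (such as a comment) is skipped.
    """
    blocks = vtt_text.strip().split("\n\n")
    cues = []
    for block in blocks[1:]:  # blocks[0] holds the WEBVTT line
        lines = block.strip().splitlines()
        for index, line in enumerate(lines):
            if "-->" in line:
                # Everything below the time range is caption text.
                cues.append((line, lines[index + 1:]))
                break
    return cues
```

Each tuple pairs one time range with every caption line beneath it, which is why a single range can carry more than one caption.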
The time range can be in one of two formats: hh:mm:ss.ttt or mm:ss.ttt. Each part follows certain rules:
- Hours (hh): Minimum of two digits
- Minutes (mm): Between 00 and 59, inclusive
- Seconds (ss): Between 00 and 59, inclusive
- Milliseconds (ttt): Exactly three digits, between 000 and 999, inclusive
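The rules above translate fairly directly into a regular expression. The following Python sketch checks a single timestamp and converts it to seconds; the names TIMESTAMP_RE and parse_timestamp are illustrative only.

```python
import re

# Optional hours (two or more digits), then minutes and seconds in the
# range 00-59, then exactly three digits of milliseconds.
TIMESTAMP_RE = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def parse_timestamp(text):
    """Return the timestamp in seconds, or None if it breaks the rules."""
    match = TIMESTAMP_RE.match(text)
    if match is None:
        return None
    hours, minutes, seconds, millis = match.groups()
    total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total + int(millis) / 1000
```

For example, parse_timestamp("01:02.500") gives 62.5, while a value such as "00:60:00.000" is rejected because the minutes are out of range.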
This may seem rather daunting at first. You’re probably wondering how anyone can be expected to type and tweak this all by hand. Luckily, there are tools to make this easier. For example, YouTube can automatically caption videos for you with speech recognition, and it lets you download the captions as a VTT file. It works the other way around, too: you can upload your own VTT file to a YouTube video.
Once we have this file created, we can then embed it into an HTML5 video element.
<video autoplay controls width="300" height="150">
  <source src="your_video.mp4" type="video/mp4">
  <track default kind="captions" srclang="en" label="English" src="your_caption_file.vtt">
</video>
The <track> tag is sort of like a script that “plays” along with the video. We can use multiple tracks in the same video element. The default attribute indicates that the track will be enabled automatically.
Let’s run down all the attributes while we’re at it:
- srclang indicates what language the track is in.
- kind represents the type of track it is, and there are five kinds:
  - subtitles are usually translations and descriptions of different parts of a video.
  - descriptions help blind and low-vision users understand what is happening in a video.
  - captions provide deaf and hard-of-hearing users an alternative to audio.
  - metadata is a track that is used by scripts and cannot be seen by users.
  - chapters assist in navigating video content.
- label is a title for the text track that appears in the player’s list of caption tracks.
- src is the source file for the track. It cannot come from a cross-origin source unless crossorigin is specified.
While WebVTT is designed specifically for video, you can still use it with audio by placing an audio file within a <video> element.
Digging into the structure of a WebVTT file
MDN has great documentation and outlines the body structure of a WebVTT file, which consists of up to six components. Here’s how MDN breaks it down:
- An optional byte order mark (BOM)
- The string “WEBVTT”
- An optional text header to the right of WEBVTT.
  - There must be at least one space after WEBVTT.
  - You could use this to add a description to the file.
  - You may use anything in the text header except newlines or the string “-->”.
- A blank line, which is equivalent to two consecutive newlines.
- Zero or more cues or comments.
- Zero or more blank lines.
Note: a BOM is a Unicode character at the very start of a text file that indicates the Unicode encoding of the file.
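As a rough illustration of the first few components, this Python sketch checks whether the first line of a file is an acceptable WEBVTT line under the rules listed above. The function name check_header is hypothetical, and the check covers only the first line, not the blank line or the cues that follow.

```python
BOM = "\ufeff"  # the optional byte order mark, once decoded as text


def check_header(first_line):
    """Return True if first_line is a valid opening line for a WebVTT file."""
    if first_line.startswith(BOM):
        first_line = first_line[len(BOM):]
    if not first_line.startswith("WEBVTT"):
        return False
    rest = first_line[len("WEBVTT"):]
    if rest == "":
        return True  # no text header at all
    # A text header must be separated from WEBVTT by a space or tab.
    if rest[0] not in (" ", "\t"):
        return False
    # The header may not contain a newline or the string "-->".
    return "\n" not in rest and "-->" not in rest
```

So "WEBVTT - English captions" passes, while "WEBVTTcaptions" and "WEBVTT a --> b" do not.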
Bold, italic, and underline — oh my!
We can absolutely use some inline HTML formatting in WebVTT files! These are the ones that everyone is familiar with: <b>, <i>, and <u>. You use them exactly as you would in HTML.
WEBVTT

00:00:00.000 --> 00:00:03.000 align:start
This is <b>bold text</b>

00:00:03.000 --> 00:00:06.000 align:center
This is <i>italic text</i>

00:00:06.000 --> 00:00:09.000 vertical:rl align:center
This is <u>underlined text</u>
Cue settings
Cue settings are optional strings of text used to control the position of a caption on the video. It’s sort of like positioning elements in CSS.
Cue settings are written to the right of a cue timing. With them, we can control whether a caption is displayed horizontally or vertically, and define both the alignment and vertical position of the caption.
Here are the settings that are available to us.
Setting 1: Line
line controls the positioning of the caption on the y-axis. If vertical is specified (which we’ll look at next), then line will instead indicate where the caption will be displayed on the x-axis.
When specifying the line value, integers and percentages are both acceptable. With an integer, the distance per line is equal to the height (from a horizontal perspective) of the first line. So, for example, let’s say the height of the first line of the caption is 50px, the line value specified is 2, and the caption’s direction is horizontal. That means the caption will be positioned 100px (50px times 2) down from the top, up to a maximum set by the boundaries of the video. If we use a negative integer, the caption is positioned from the bottom instead and moves upward as the value decreases (or, when vertical:lr is specified, it moves from right to left, and vice versa for vertical:rl). Be careful here, as it’s possible to position the captions off-screen, and the positioning is inconsistent across browsers. With great power comes great responsibility!
In the case of a percentage, the value must be between 0-100%, inclusive (sorry, no 200% mega values here). Higher values will move the caption from top-to-bottom, unless vertical:lr or vertical:rl is specified, in which case the caption will move along the x-axis accordingly.
As the value increases, the caption will appear further down the video boundaries. As the value decreases (including into the negatives), the caption will appear further up.
Tough to picture this without examples, right? Here’s how this translates into code:
00:00:00.000 --> 00:00:03.000 line:50%
This caption should be positioned horizontally in the approximate center of the screen.

00:00:03.000 --> 00:00:06.000 vertical:lr line:50%
This caption should be positioned vertically in the approximate center of the screen.

00:00:06.000 --> 00:00:09.000 vertical:rl line:-1
This caption should be positioned vertically along the left side of the video.

00:00:09.000 --> 00:00:12.000 line:0
The caption should be positioned horizontally at the top of the screen.
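The arithmetic behind line can be sketched in a few lines of Python. This is a simplified model of the description above for horizontal captions, not what any browser actually runs: the function name line_offset_px and its parameters are invented for illustration, and the result is clamped to the video boundaries even though real browsers may behave differently.

```python
def line_offset_px(line, line_height_px, video_height_px):
    """Approximate distance from the top of the video for a horizontal cue.

    `line` is the raw setting value, such as "2", "-1" or "50%".
    """
    if line.endswith("%"):
        percent = float(line[:-1])
        if not 0 <= percent <= 100:
            raise ValueError("percentage must be between 0 and 100")
        return video_height_px * percent / 100
    count = int(line)
    if count >= 0:
        # Positive integers count lines down from the top.
        offset = count * line_height_px
    else:
        # Negative integers count lines up from the bottom.
        offset = video_height_px + count * line_height_px
    # Assumption: keep the result inside the video boundaries.
    return max(0, min(offset, video_height_px))
```

With a 50px line height on a 400px-tall video, a value of 2 works out to 100px from the top, matching the example in the text, and -1 works out to 350px.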
Setting 2: Vertical
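vertical indicates the caption will be displayed vertically and move in the direction specified by the line setting. It takes one of two values: rl stacks lines from right to left, and lr stacks them from left to right. Some languages are not displayed left-to-right and instead need a top-to-bottom display.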
00:00:00.000 --> 00:00:03.000 vertical:rl
This caption should be vertical.

00:00:03.000 --> 00:00:06.000 vertical:lr
This caption should be vertical.
Setting 3: Position
position specifies where the caption will be displayed along the x-axis. If vertical is specified, the position will instead specify where the caption will be displayed on the y-axis. It must be an integer value between 0% and 100%, inclusive.
00:00:00.000 --> 00:00:03.000 vertical:rl position:100%
This caption will be vertical and toward the bottom.

00:00:03.000 --> 00:00:06.000 vertical:rl position:0%
This caption will be vertical and toward the top.
At this point, you may notice that line and position are similar to the CSS flexbox properties for align-items and justify-content, and that vertical behaves a lot like flex-direction. A trick for remembering WebVTT directions is that line specifies a position perpendicular to the flow of the text, whereas position specifies the position parallel to the flow of the text. That’s why line suddenly moves along the horizontal axis, and position moves along the vertical axis if we specify vertical.
Setting 4: Size
size specifies the width of the caption. If vertical is specified, then it will set the height of the caption instead. Like other settings, it must be an integer between 0% and 100%, inclusive.
00:00:00.000 --> 00:00:03.000 vertical:rl size:50%
This caption will fill half the screen vertically.

00:00:03.000 --> 00:00:06.000 size:100%
This caption will fill the entire screen horizontally.
Setting 5: Align
align specifies where the text will appear horizontally. If vertical is specified, then it will control the vertical alignment instead.
The values we’ve got are: start, center, end, left and right. (Older drafts of the format called the center value middle, which is the name you may still see in older files.) Without vertical specified, the alignments are exactly what they sound like. If vertical is specified, they effectively become top, center (vertically), and bottom. Using start and end as opposed to left and right, respectively, is a more flexible way of allowing the alignment to be based on the unicode-bidi CSS property’s plaintext value.
Note that align behaves the same way whether vertical:lr or vertical:rl is specified.
WEBVTT

00:00:00.000 --> 00:00:03.000 align:start
This caption will be on the left side of the screen.

00:00:03.000 --> 00:00:06.000 align:center
This caption will be horizontally in the middle of the screen.

00:00:06.000 --> 00:00:09.000 vertical:rl align:center
This caption will be vertically in the middle of the screen.

00:00:09.000 --> 00:00:12.000 vertical:rl align:end
This caption will be vertically at the bottom right of the screen regardless of vertical:lr or vertical:rl orientation.

00:00:12.000 --> 00:00:15.000 vertical:lr align:end
This caption will be vertically at the bottom of the screen, regardless of the vertical:lr or vertical:rl orientation.

00:00:15.000 --> 00:00:18.000 align:left
This caption will appear on the left side of the screen.

00:00:18.000 --> 00:00:21.000 align:right
This caption will appear on the right side of the screen.
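Since every setting in this section follows the same name:value shape after the cue timing, pulling them out of a timing line is straightforward. Here is a minimal Python sketch; parse_cue_settings is a made-up helper, and it does not validate the names or values it finds.

```python
def parse_cue_settings(timing_line):
    """Return the cue settings on a timing line as a dict of name -> value."""
    parts = timing_line.split()
    # parts[0] is the start time, parts[1] is "-->", parts[2] is the end time.
    settings = {}
    for token in parts[3:]:
        name, separator, value = token.partition(":")
        if separator and name and value:
            settings[name] = value
    return settings
```

For the line "00:00:09.000 --> 00:00:12.000 vertical:rl align:end", this returns a dictionary with vertical set to rl and align set to end.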
WebVTT Comments
WebVTT comments are strings of text that are only visible when reading the source text of the file, the same way we think of comments in HTML, CSS, JavaScript and any other language. Comments may contain a new line, but not a blank line (which is essentially two new lines).
WEBVTT

00:00:00.000 --> 00:00:03.000
- [Birds chirping]
- It's a beautiful day!

NOTE This is a comment. It will not be visible to anyone viewing the caption.

00:00:04.000 --> 00:00:07.000
- [Creek trickling]
- It is indeed!

00:00:08.000 --> 00:00:10.000
- Hello there!
When the caption file is parsed and rendered, the NOTE line above will be completely hidden from users. Comments can be multi-line as well.
There are three very important characters/strings to take note of that may not be used in comments: <, &, and -->. As an alternative, you can use escaped characters instead.
Not Allowed | Alternative |
---|---|
NOTE PB&J | NOTE PB&amp;J |
NOTE 5 < 7 | NOTE 5 &lt; 7 |
NOTE puppy --> dog | NOTE puppy --&gt; dog |
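If comments are generated by a script, the escaping in the table can be applied automatically. A small Python sketch, with escape_note_text as an illustrative name:

```python
def escape_note_text(text):
    """Escape the characters and strings that may not appear in a comment."""
    # Replace "&" first so the entities added afterwards are not escaped twice.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace("-->", "--&gt;")
    return text
```

Prefixing the result with "NOTE " then gives a comment line that is safe to write into the file.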
A few other interesting WebVTT features
We’re going to take a quick look at some really neat ways we can customize and control captions, but that are lacking consistent browser support, at least at the time of this writing.
Yes, we can style captions!
WebVTT captions can, in fact, be styled. For example, to style the background of a caption to be red, set the background property on the ::cue pseudo-element:
video::cue {
  background: red;
}
Remember how we can use some inline HTML formatting in the WebVTT file? Well, we can select those as well. For example, to select an italic (<i>) element:
video::cue(i) {
  color: yellow;
}
Turns out WebVTT files support a style block, a lot like the way HTML files do:
WEBVTT

STYLE
::cue {
  color: blue;
  font-family: "Source Sans Pro", sans-serif;
}
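Cues can also be selected via their cue identifiers. Note that inside a selector, cue identifiers are escaped the same way as in CSS: a space is written as a backslash followed by a space, and a character can be written as a hex escape, such as \33 for the digit 3.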
WEBVTT

STYLE
::cue(#middle\ cue\ identifier) {
  text-decoration: underline;
}
::cue(#cue\ identifier\ \33) {
  font-weight: bold;
  color: red;
}

first cue identifier
00:00:00.000 --> 00:00:02.000
Hello, world!

middle cue identifier
00:00:02.000 --> 00:00:04.000
This cue identifier will have an underline!

cue identifier 3
00:00:04.000 --> 00:00:06.000
This one will be bold and red, unlike the first one!
Different types of tags
Many tags can be used to format captions. There is a caveat: these tags cannot be used in a <track> element whose kind attribute is chapters. Here are some formatting tags you can use.
The class tag
We can define classes in the WebVTT markup using a class tag that can be selected with CSS. Let’s say we have a class, .yellowish, that makes text yellow. We can apply it in a caption with the <c> tag, as in <c.yellowish>. We can control lots of styling this way, like the font, the font color, and background color.
/* Our CSS file */
video::cue(.yellowish) {
  color: yellow;
}
video::cue(.redcolor) {
  color: red;
}
WEBVTT

00:00:00.000 --> 00:00:03.000
<c.yellowish>This text should be yellow.</c> This text will be the default color.

00:00:03.000 --> 00:00:06.000
<c.redcolor>This text should be red.</c> This text will be the default color.
The timestamp tag
If you want to make captions appear at specific times, then you will want to use timestamp tags. They’re like fine-tuning captions to exact moments in time. The tag’s time must be within the given time range of the caption, and each timestamp tag must be later than the previous.
WEBVTT

00:00:00.000 --> 00:00:07.000
This <00:00:01.000>text <00:00:02.000>will <00:00:03.000>appear <00:00:04.000>over <00:00:05.000>6 <00:00:06.000>seconds.
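The two rules for timestamp tags (inside the cue's time range, and each later than the one before) are easy to check mechanically. The Python sketch below is one way to do it. It assumes every timestamp uses the full hh:mm:ss.ttt form and treats the cue's start and end as exclusive bounds; the names TAG_RE, to_seconds and timestamp_tags_valid are illustrative.

```python
import re

# Matches timestamp tags such as <00:00:01.000> inside cue text.
TAG_RE = re.compile(r"<(\d{2,}:[0-5]\d:[0-5]\d\.\d{3})>")


def to_seconds(stamp):
    """Convert an hh:mm:ss.ttt timestamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def timestamp_tags_valid(cue_start, cue_end, cue_text):
    """Return True if every timestamp tag is in range and in ascending order."""
    previous = to_seconds(cue_start)
    end = to_seconds(cue_end)
    for stamp in TAG_RE.findall(cue_text):
        current = to_seconds(stamp)
        # Each tag must be later than the previous one and before the cue ends.
        if current <= previous or current >= end:
            return False
        previous = current
    return True
```

Running it on the cue above with a start of 00:00:00.000 and an end of 00:00:07.000 returns True, while swapping two of the tags makes it return False.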
The voice tag
Voice tags are neat in that they help identify who is speaking.
WEBVTT

00:00:00.000 --> 00:00:03.000
<v Alice>How was your day, Bob?

00:00:03.000 --> 00:00:06.000
<v Bob>Great, yours?
The ruby tag
The ruby tag is a way to display small, annotative characters above the caption.
WEBVTT

00:00:00.000 --> 00:00:05.000
<ruby>This caption will have text above it<rt>This text will appear above the caption.</rt></ruby>
Conclusion
And that about wraps it up for WebVTT! It’s an extremely useful technology and presents an opportunity to improve your site’s accessibility a great deal, particularly if you are working with video. Try out some captions of your own to get a better feel for it!
Do you know about browser support for the voice tag? Doesn’t seem to be showing up for me. I’m using videoJS.
Do WebVTT files need to be encoded in UTF-8? Or can they be encoded in other encodings?
Thanks!
Voice tags are supported but I don’t think any browser has default colors for the different voices. You have to provide styles manually.
Styling itself is also complicated as most browsers don’t support it. The vtt.js project, which Video.js uses for captions by default on browsers other than Safari, doesn’t support the CSS extensions in webvtt.
The webvtt file must be utf-8 as per the specification: https://www.w3.org/TR/webvtt/#file-structure