The VTT Subtitle Format: A Complete Guide to WebVTT Syntax, Features, and Usage

2026-04-16 · 8 min read

WebVTT (Web Video Text Tracks) is the subtitle format built for the modern web. If you’ve ever added captions to an HTML5 video using the <track> element, you were using VTT whether you realized it or not — it’s the only subtitle format browsers understand natively.

This guide covers everything: file anatomy, timestamp syntax, cue settings, styling hooks, conversion, and the mistakes that silently break your captions in production. If you’ve worked with SRT before, VTT will feel familiar — but the differences matter.

What Is the VTT (WebVTT) Format?

VTT stands for Web Video Text Tracks. It’s a W3C standard (formally specified as WebVTT: The Web Video Text Tracks Format) designed to deliver timed text content alongside HTML5 video and audio. The file extension is .vtt and the MIME type is text/vtt.

Unlike older formats that were retrofitted for the web, VTT was purpose-built for browsers. It supports CSS styling, precise positioning, metadata tracks, chapter markers, and accessibility features that SRT and SUB formats simply cannot express. Every modern browser — Chrome, Firefox, Safari, Edge — parses VTT natively through the TextTrack API.

VTT File Anatomy

A VTT file is a plain-text UTF-8 document with a strict structure. Here’s a minimal but complete example:

WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, and welcome to this tutorial.

00:00:05.200 --> 00:00:08.500
Today we'll cover the WebVTT
subtitle format from scratch.

Every VTT file has three parts:

The header line — must be WEBVTT, optionally followed by text on the same line (e.g., WEBVTT - My Subtitles). This is non-negotiable. Without it, browsers reject the file.
A blank line — separates the header from the first cue. Required.
Cue blocks — each containing a timestamp line and one or more lines of text, separated from other cues by blank lines.

That’s the entire structure. If you can write those three parts correctly, you can write valid VTT by hand.
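The three parts above are easy to check mechanically. The following is a minimal sketch, not a full validator: the function name check_vtt_structure is made up for this example, and it treats a file with no cues as a problem, which mirrors the three-part description here but is stricter than a browser needs to be.

```python
def check_vtt_structure(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the basics look right."""
    # A leading byte order mark is tolerated, so drop it before inspecting the header.
    if text.startswith("\ufeff"):
        text = text[1:]
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

    problems = []
    header = lines[0]
    # Part 1: the header line is WEBVTT, optionally followed by a space or tab and text.
    if header != "WEBVTT" and not header.startswith(("WEBVTT ", "WEBVTT\t")):
        problems.append("first line must be WEBVTT")
    # Part 2: a blank line separates the header from the first cue.
    if len(lines) < 2 or lines[1].strip() != "":
        problems.append("header must be followed by a blank line")
    # Part 3: at least one cue, recognized here by its timing arrow.
    if not any("-->" in line for line in lines[1:]):
        problems.append("no cue timing line found")
    return problems
```

Calling check_vtt_structure on the minimal example above should return an empty list; a file that starts with a lowercase webvtt would come back with one problem.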

VTT Syntax Reference

The WEBVTT Header

The first line of every VTT file must begin with the string WEBVTT. You can append a description after a space or tab, but nothing may come before it apart from an optional byte order mark (BOM): no comments, no blank lines.

WEBVTT

WEBVTT - English Subtitles

WEBVTT Kind: captions; Language: en

All three headers above are valid. The text after WEBVTT is purely informational and ignored by parsers.

Timestamp Format

VTT timestamps use the format HH:MM:SS.mmm, where the hours portion is optional for content under one hour. The start and end times are separated by an arrow: --> (two hyphens and a greater-than sign), with spaces on both sides.

00:00:01.000 --> 00:00:04.500
This uses the full hours:minutes:seconds.milliseconds format.

00:15.000 --> 00:18.750
This omits hours — valid when your content is under 60 minutes.

The millisecond separator is a period, not a comma. This is the single most common mistake when converting from SRT, where commas are used. A VTT parser encountering 00:00:01,500 will either reject the cue or misinterpret the timing entirely.
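To make the timestamp rules concrete, here is a small Python sketch that parses and formats VTT timestamps as whole milliseconds. The function names are hypothetical, and returning None for an invalid timestamp is just one reasonable choice for this example.

```python
import re
from typing import Optional

# Optional hours (two or more digits), then MM:SS.mmm with a period before the milliseconds.
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def parse_timestamp_ms(stamp: str) -> Optional[int]:
    """Convert 'HH:MM:SS.mmm' or 'MM:SS.mmm' to milliseconds; None if it is not valid VTT."""
    match = _TIMESTAMP.fullmatch(stamp)
    if match is None:
        return None
    hours, minutes, seconds, millis = match.groups()
    return ((int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def format_timestamp_ms(total_ms: int) -> str:
    """Format milliseconds as a full 'HH:MM:SS.mmm' VTT timestamp."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

With this sketch, an SRT-style value such as 00:00:01,500 is rejected (None) rather than silently misread, which matches the behavior described above.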

Cue Identifiers

Unlike SRT, where sequential numbers are mandatory, VTT cue identifiers are optional. When present, they appear on the line before the timestamp:

WEBVTT

intro-1
00:00:01.000 --> 00:00:04.000
Welcome to the course.

intro-2
00:00:05.000 --> 00:00:08.000
Let's get started.

Cue identifiers can be any string (numbers, words, slugs) as long as they don’t contain --> or blank lines. They’re useful for scripting, JavaScript access via the TextTrack API, and for your own reference when editing large files.
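For scripting outside the browser, optional identifiers are easy to pick up with a simple block splitter. This is a rough sketch rather than a spec-compliant parser: parse_cues is a hypothetical helper, and it assumes cues are separated by truly empty lines.

```python
import re


def parse_cues(vtt_text: str) -> list[dict]:
    """Split a VTT document into cues with an optional id, a timing line, and text."""
    text = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    cues = []
    for block in blocks[1:]:  # blocks[0] is the WEBVTT header
        lines = block.split("\n")
        if "-->" in lines[0]:
            # No identifier: the block starts with the timing line.
            cue_id, timing, body = None, lines[0], lines[1:]
        elif len(lines) > 1 and "-->" in lines[1]:
            # Identifier on the line before the timing line.
            cue_id, timing, body = lines[0], lines[1], lines[2:]
        else:
            continue  # NOTE blocks and anything else without a timing line
        cues.append({"id": cue_id, "timing": timing, "text": "\n".join(body)})
    return cues
```

A lookup table such as {cue["id"]: cue for cue in parse_cues(text) if cue["id"]} then gives direct access to a cue like intro-2 by name.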

Cue Settings (Positioning and Alignment)

VTT lets you control where and how each cue renders on screen. Settings are appended to the timestamp line, space-separated:

00:00:01.000 --> 00:00:04.000 position:10% align:start
[Narrator] The story begins in 1985.

00:00:05.000 --> 00:00:08.000 line:0 position:50% align:center
CHAPTER 1: THE BEGINNING

00:00:09.000 --> 00:00:12.000 vertical:rl
This text renders vertically, right-to-left.

Available cue settings:

Setting	Values	Purpose
`position`	0%–100%	Horizontal position of the cue box
`line`	Number or %	Vertical position (line number or percentage)
`size`	0%–100%	Width of the cue box as percentage of video
`align`	start, center, end, left, right	Text alignment within the cue box
`vertical`	rl, lr	Vertical text direction (for CJK languages)

Most subtitle files never use cue settings — the browser defaults work fine. But when you need to place speaker labels in different corners or position chapter titles at the top of the frame, these settings are indispensable.
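If you need to read cue settings in a script, the timing line splits cleanly into a start time, an end time, and name:value pairs. The sketch below assumes a well-formed line; parse_timing_line is a hypothetical name, and the function does not validate setting names or values.

```python
def parse_timing_line(line: str) -> tuple[str, str, dict[str, str]]:
    """Split a cue timing line into (start, end, settings)."""
    start, arrow, rest = line.partition("-->")
    fields = rest.split()
    if not arrow or not start.strip() or not fields:
        raise ValueError(f"not a cue timing line: {line!r}")
    # The first field after the arrow is the end time; the rest are settings.
    end, raw_settings = fields[0], fields[1:]
    settings = {}
    for item in raw_settings:
        name, sep, value = item.partition(":")
        if sep and name and value:
            settings[name] = value
    return start.strip(), end, settings
```

For the chapter-title cue above, this returns the two timestamps plus a dictionary with the line, position, and align values as strings.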

Styling with Inline Tags

VTT supports a small set of inline tags for formatting cue text:

00:00:01.000 --> 00:00:04.000
This is <b>bold</b> and this is <i>italic</i>.

00:00:05.000 --> 00:00:08.000
<u>Underlined text</u> for emphasis.

00:00:09.000 --> 00:00:12.000
<c.highlight>This text has a CSS class attached.</c>

00:00:13.000 --> 00:00:16.000
<v Speaker A>I disagree completely.
<v Speaker B>You're wrong about that.

The supported tags:

<b> — Bold
<i> — Italic
<u> — Underline
<c> — Span with a class (e.g., <c.yellow>), stylable via the ::cue(.yellow) CSS selector
<v> — Voice tag for speaker identification
<ruby> / <rt> — Ruby annotations (pronunciation guides for CJK text)
<lang> — Language tag (e.g., <lang fr>Bonjour</lang>)

On the CSS side, you can style these from your stylesheet using the ::cue pseudo-element:

::cue {
  background: rgba(0, 0, 0, 0.7);
  color: white;
  font-size: 1.1em;
}

::cue(b) {
  color: #ffd700;
}

::cue(.highlight) {
  color: #00ff88;
}

This is one of VTT’s strongest advantages over SRT. Instead of embedding style information in the subtitle file (which is fragile and non-standard), VTT delegates presentation to CSS — the same system that styles everything else on your page.
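Outside the browser, for example when building a plain-text transcript, the inline tags usually need to be removed and the <v> speaker names pulled out. Here is a minimal Python sketch using regular expressions; the helper names are made up for this example, and a regex approach like this ignores escaped characters and other edge cases a real parser would handle.

```python
import re

# Any <...> span: opening tags, closing tags, and tags with classes or annotations.
_TAG = re.compile(r"<[^>]*>")
# A voice tag, with an optional class, followed by the speaker name.
_VOICE = re.compile(r"<v(?:\.[^\s>]+)?[ \t]+([^>]+)>")


def plain_text(cue_text: str) -> str:
    """Remove inline tags, leaving only the text a viewer would read."""
    return _TAG.sub("", cue_text)


def speakers(cue_text: str) -> list[str]:
    """List the speaker names found in <v> voice tags, in order of appearance."""
    return [name.strip() for name in _VOICE.findall(cue_text)]
```

Applied to the two-speaker cue shown earlier, speakers returns the names Speaker A and Speaker B, and plain_text returns the two lines of dialogue without tags.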

Comment Blocks (NOTE)

VTT supports comments for annotations, translator notes, or production metadata:

WEBVTT

NOTE
This file was auto-generated by Whisper
and manually corrected on 2026-04-10.

00:00:01.000 --> 00:00:04.000
Welcome to the session.

NOTE This is a single-line comment.

00:00:05.000 --> 00:00:08.000
Let's begin with an overview.

Comments start with NOTE followed by a space (for single-line) or a newline (for multi-line). They are stripped by the parser and never displayed to the viewer.
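If you want to ship a file without its production notes, the comment rule above translates into a short filter. This sketch uses a hypothetical strip_notes function and assumes blocks are separated by truly empty lines (no stray spaces or tabs on them).

```python
import re


def strip_notes(vtt_text: str) -> str:
    """Return the document with every NOTE block removed."""
    text = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    kept = [
        block
        for block in blocks
        # A comment is NOTE alone, or NOTE followed by a space, tab, or newline.
        if not (block == "NOTE" or block.startswith(("NOTE ", "NOTE\t", "NOTE\n")))
    ]
    return "\n\n".join(kept) + "\n"
```

A cue whose text merely begins with a word like NOTEBOOK is left alone, because the check requires whitespace or a line break right after NOTE.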

VTT vs SRT: Key Differences

For a deep comparison, see our dedicated SRT vs VTT comparison. Here’s the quick reference:

Feature	VTT	SRT
File header	`WEBVTT` required	None
Millisecond separator	Period (`.`)	Comma (`,`)
Cue numbers	Optional (any string)	Required (sequential integers)
Positioning	Yes (position, line, size, align)	No
CSS styling	Yes (::cue selector)	No
Comments	Yes (NOTE blocks)	No
Browser support	Native via <track>	Not supported
Desktop players	Most (VLC, MPV)	Universal

The short version: SRT is the universal legacy format. VTT is the web standard with richer features. For browser playback, VTT is the only option. For everything else, SRT has wider support.

Where VTT Is Used

VTT appears anywhere web video appears:

HTML5 <track> element — The native way to attach subtitles, captions, chapters, or descriptions to <video> and <audio> elements. Browsers only accept VTT here.
HLS and DASH streaming — Apple’s HTTP Live Streaming and MPEG-DASH can both deliver timed text as VTT. In HLS the track is split into segmented VTT files, each carrying an X-TIMESTAMP-MAP header that maps cue times onto the media timeline.
YouTube — Exports captions in multiple formats including VTT. When you download YouTube subtitles, VTT is one of the available options.
Video.js, Plyr, and custom players — JavaScript video player libraries universally support VTT, often as the preferred or only subtitle format.
Accessibility compliance — WCAG 2.1 guidelines reference the <track> element for media captions, making VTT the de facto format for accessible web video.
E-learning platforms — Coursera, Udemy, edX, and similar platforms use VTT for their web-based video players.

How to Create a VTT File from Scratch

You don’t need special software. VTT is plain text — any text editor works. Here’s the process:

Open a new file in any text editor (VS Code, Notepad, Sublime, TextEdit in plain-text mode).
Type WEBVTT on the first line.
Leave a blank line.
Add your first cue: a timestamp line followed by the subtitle text.
Separate cues with blank lines.
Save the file with a .vtt extension and UTF-8 encoding (this is critical — non-UTF-8 encoding will corrupt non-ASCII characters).

Here’s a ready-to-use template:

WEBVTT

00:00:00.000 --> 00:00:03.500
First subtitle line goes here.

00:00:04.000 --> 00:00:07.500
Second subtitle line.

00:00:08.000 --> 00:00:12.000
Third line — notice the blank lines
separating each cue block.

Once saved, you can test it immediately by dropping it into the caption viewer or attaching it to an HTML5 video element with <track src="subs.vtt" kind="subtitles" srclang="en" label="English">.
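The same steps can be scripted when the cues already exist as data. The following Python sketch is one way to do it, assuming cues arrive as (start, end, text) tuples with times in milliseconds; build_vtt and write_vtt are hypothetical names.

```python
from pathlib import Path


def _stamp(total_ms: int) -> str:
    """Format milliseconds as HH:MM:SS.mmm."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def build_vtt(cues: list[tuple[int, int, str]]) -> str:
    """Build a VTT document: header, blank line, then cues separated by blank lines."""
    blocks = ["WEBVTT"]
    for start_ms, end_ms, text in cues:
        blocks.append(f"{_stamp(start_ms)} --> {_stamp(end_ms)}\n{text}")
    return "\n\n".join(blocks) + "\n"


def write_vtt(path: str, cues: list[tuple[int, int, str]]) -> None:
    """Save the document as UTF-8, the encoding VTT requires."""
    Path(path).write_text(build_vtt(cues), encoding="utf-8")
```

For instance, write_vtt("subs.vtt", [(0, 3500, "First subtitle line goes here.")]) would produce a file shaped like the template above with a single cue.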

How to Convert To and From VTT

Converting between VTT and SRT is straightforward because the formats share the same core structure. The transformation involves three mechanical changes:

Add or remove the WEBVTT header
Swap the millisecond separator (period in VTT, comma in SRT)
Add or strip cue numbers

Our SRT to VTT converter handles this instantly in your browser — no file upload, no server, no data leaves your machine. It works in both directions.

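For programmatic conversion, subtitle libraries such as pysrt, subtitle.js, and webvtt-py can do the parsing for you, though format coverage varies by library (pysrt, for instance, focuses on SRT), so check the documentation before committing to one. If you’re processing files at scale, these libraries handle edge cases (overlapping timestamps, malformed cues) that a naive regex replacement will miss.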
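To show what the three mechanical changes look like in practice, here is a deliberately simple SRT-to-VTT sketch in plain Python. It is not how any particular converter or library works: srt_to_vtt is a hypothetical function, it assumes well-formed SRT input, and it does none of the edge-case handling mentioned above.

```python
import re

# An SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
_SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert well-formed SRT text to VTT."""
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    blocks = ["WEBVTT"]  # change 1: add the header
    for block in re.split(r"\n{2,}", text.strip("\n")):
        if not block.strip():
            continue
        lines = block.split("\n")
        # change 3: drop the numeric counter when a timing line follows it
        if len(lines) >= 2 and lines[0].strip().isdigit() and "-->" in lines[1]:
            lines = lines[1:]
        # change 2: comma becomes period, on the timing line only
        lines[0] = _SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

Restricting the comma swap to the timing line matters: a blanket replace across the whole file would also rewrite commas inside the subtitle text.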

Common VTT Mistakes and How to Fix Them

These are the errors I see most often. Each one will cause your subtitles to either fail silently or render incorrectly — browsers don’t throw helpful error messages for malformed VTT.

1. Missing or malformed WEBVTT header

The file must start with WEBVTT. No leading blank lines, no BOM characters visible in the text, no lowercase. If you copied your file content from a rich-text editor, check for invisible characters before the header.

2. Using commas instead of periods in timestamps

This happens every time someone converts from SRT by hand and forgets one line. 00:01:30,500 is SRT syntax. 00:01:30.500 is VTT syntax. One wrong comma and the entire cue is silently dropped.
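A quick scan can catch a stray SRT comma before the file ships. This sketch, with the hypothetical helper find_comma_timestamps, reports the 1-based line numbers of timing lines that still contain a comma before the milliseconds.

```python
import re

# A comma where VTT expects a period: MM:SS,mmm
_COMMA_STAMP = re.compile(r"\d{2}:\d{2},\d{3}")


def find_comma_timestamps(vtt_text: str) -> list[int]:
    """Return 1-based line numbers of timing lines that use SRT-style commas."""
    return [
        number
        for number, line in enumerate(vtt_text.splitlines(), start=1)
        # Only timing lines are checked, so commas in subtitle text are ignored.
        if "-->" in line and _COMMA_STAMP.search(line)
    ]
```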

3. Missing blank lines between cues

Each cue block must be separated by at least one empty line. If two cues run together without a blank line, the parser can misread where one cue ends and the next begins; a cue identifier on the second cue, for example, can end up displayed as part of the first cue’s text.

4. Wrong file encoding

VTT files must be UTF-8. If you save as ANSI, Latin-1, or UTF-16, accented characters, CJK text, and emoji will render as garbled symbols. In VS Code, check the encoding indicator in the bottom-right corner and switch to UTF-8 if needed.
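The encoding can also be checked from a script by reading the raw bytes. This is a rough sketch with a hypothetical check_encoding function; note that a file containing only ASCII characters is valid UTF-8 no matter which of these encodings the editor claimed to use, so it will pass.

```python
def check_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'utf-8', 'utf-16', or 'not-utf-8'."""
    # UTF-16 files normally start with one of these two byte order marks.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for ANSI or Latin-1 files that contain accented characters.
        return "not-utf-8"
    return "utf-8"
```

A typical call would be check_encoding(Path("subs.vtt").read_bytes()), with Path imported from pathlib and subs.vtt standing in for your own file.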

5. Overlapping timestamps without intent

Overlapping cues are technically valid in VTT — the browser will render both simultaneously. But if you didn’t intend it, two captions stacking on top of each other looks broken. Review your timestamps to ensure each cue’s end time comes before the next cue’s start time.
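Reviewing timestamps by eye gets tedious in long files, so here is a small sketch that flags overlaps. It assumes each cue is given as a (start, end) pair in milliseconds and only compares neighbors in start-time order; find_overlaps is a hypothetical name.

```python
def find_overlaps(cues: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return index pairs (a, b) where cue a is still on screen when cue b starts."""
    # Compare neighbors in start-time order, but report the original indices.
    order = sorted(range(len(cues)), key=lambda i: cues[i][0])
    overlaps = []
    for a, b in zip(order, order[1:]):
        if cues[a][1] > cues[b][0]:
            overlaps.append((a, b))
    return overlaps
```

A cue that ends at exactly the moment the next one starts is not reported here; tighten the comparison to >= if you want a visible gap between cues.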

6. Using unsupported HTML tags

VTT only recognizes its own set of inline tags (<b>, <i>, <u>, <c>, <v>, <ruby>, <rt>, <lang>). Dropping in other HTML tags such as <div> won’t work; they’ll be stripped or displayed as literal text.

FAQ

Can browsers play SRT files directly?

No. The HTML5 <track> element only accepts VTT. If you have SRT files, you need to convert them to VTT first. Some JavaScript player libraries (like Video.js) include built-in SRT-to-VTT conversion, but the browser itself won’t parse SRT.

What’s the maximum file size for a VTT file?

There is no specification-defined limit. A two-hour movie’s subtitles typically produce a VTT file between 50 KB and 150 KB — trivial by web standards. For HLS/DASH streaming, VTT files are segmented into smaller chunks that align with media segments, so file size is never a practical concern.

Does VTT support right-to-left languages like Arabic and Hebrew?

Yes. VTT inherits text direction from the document or you can set it explicitly with CSS. The Unicode bidirectional algorithm handles mixed-direction text within cues. For vertical CJK text, use the vertical:rl or vertical:lr cue setting.

Can I use VTT for chapters, not just subtitles?

Absolutely. Set the <track> element’s kind attribute to “chapters” and use cue text as chapter titles. Browsers and players that support chapter navigation will read the timestamp ranges as chapter boundaries. The VTT file format is identical — only the kind attribute changes how the player interprets it.