February 13, 2001

Typesetting dashes in HTML

by David Benbennick

Most of the characters you need for writing English text are available on a standard keyboard. Alas, there are three missing characters: the em-dash (—), the en-dash (–), and the opening double quote (“). The dashes are missing because they are too wide: keyboard layouts come from typewriter layouts, and typewriters were fixed width. The opening double quote is missing because on typewriters the closing double quote was straight (") and so could be used for both.

Now, instead of typewriters, we have HTML. The people in charge of this sort of thing defined the standard character set of HTML. Here was the perfect opportunity to fix the typewriter legacy by including these missing characters. The character set has 256 slots, so there is plenty of room (in fact, 62 slots are unused). But I guess they were smoking crack, because a) they didn't include either of the dashes, b) they didn't include an opening double quote, c) they made the opening single quote ugly (`), and d) they made the closing quotes straight (' and ").

This is where Unicode comes in. Unicode is a two-byte character set, which as of Unicode 3 has 49,194 characters defined. Among those are the characters that should have been part of HTML from the beginning. But how do you type a two-byte character?

In HTML there are three ways to type characters: as the character itself, as a character entity reference, or as a numeric character reference. For example, to write a double quote I can type " (the character itself), or &quot; (the entity reference), or &#34; (the numeric reference). Every HTML character has a numeric reference, and many characters that cannot be typed directly have a character entity reference. To get a Unicode character in HTML, you simply use its numeric reference, or an entity reference if there is one. A complete set of Unicode charts is at unicode.org/charts, and the HTML 4.01 specification includes a list of character entity references.
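To see that the three forms are interchangeable, here is a minimal sketch using Python's standard `html` module. The helper name `to_numeric_refs` is hypothetical, made up for this illustration, and replacing every non-ASCII character is just one possible policy.

```python
import html

# The three ways to write a double quote in HTML source.
forms = ['"', '&quot;', '&#34;']

# html.unescape resolves entity and numeric references to the character itself.
decoded = [html.unescape(f) for f in forms]
print(decoded)


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with its decimal numeric reference.

    Hypothetical helper, for illustration only.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# An en-dash, an em-dash, and a pair of curly double quotes.
print(to_numeric_refs("1999\u20132001 \u2014 \u201cquoted\u201d"))
```

The second call should produce text that is safe to paste into a page regardless of the declared character set, since only ASCII remains.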

Below is a chart of some typographic symbols that I need. Each row is a particular character: the first column is the character as it is displayed when given as a numeric reference, the second column is the actual number used, the third column is the entity reference (if any), the fourth is the name of the entity reference, and the fifth column describes what the character is.

Numeric display   Character number   Entity display   Entity name   Description
–                 8211               –                ndash         en-dash
—                 8212               —                mdash         em-dash
‘                 8216               ‘                lsquo         left single quote
’                 8217               ’                rsquo         right single quote, apostrophe
“                 8220               “                ldquo         left double quote
”                 8221               ”                rdquo         right double quote
á                 225                á                aacute        acute accent over a
ø                 248                ø                oslash        slash through o
ö                 246                ö                ouml          umlaut over o
π                 960                π                pi            Greek pi
±                 177                ±                plusmn        +/-
≠                 8800               ≠                ne            !=
≤                 8804               ≤                le            <=
≥                 8805               ≥                ge            >=
­                  173                ­                 shy           soft hyphen

This document acts as a conformance test for your browser. The first and third columns of the table above must be identical, and must have no question marks or box symbols or anything other than what is described in the fifth column. The final line, labelled soft hyphen, must have the first and third columns blank. A soft hyphen specifies an optional hyphenation point for breaking paragraphs into lines, so must not be displayed unless the browser chooses a break there. Finally, the words soft hyphen in this sentence must be rendered with delimiting quotation marks since they appear within the <Q> tag. If there are no quotes then your browser does not conform to HTML 4.0. Of course, laying the blame on a browser is not much consolation for not being able to view your web page. That is why I do not yet use <Q> elsewhere on my site.
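The requirement that the first and third columns match can also be checked outside a browser. The sketch below uses Python's standard `html.unescape` as a stand-in parser; the names `CHART` and `columns_agree` are hypothetical. It only confirms that each number and entity name refer to the same character, and says nothing about whether a given browser or font can draw it.

```python
import html

# (character number, entity name) pairs copied from the chart above.
CHART = [
    (8211, "ndash"), (8212, "mdash"),
    (8216, "lsquo"), (8217, "rsquo"),
    (8220, "ldquo"), (8221, "rdquo"),
    (225, "aacute"), (248, "oslash"), (246, "ouml"),
    (960, "pi"), (177, "plusmn"),
    (8800, "ne"), (8804, "le"), (8805, "ge"),
    (173, "shy"),
]


def columns_agree(number: int, name: str) -> bool:
    """True if the numeric and entity references decode to the same character."""
    return html.unescape(f"&#{number};") == html.unescape(f"&{name};")


# Collect any rows where the two reference styles disagree.
mismatches = [(n, name) for n, name in CHART if not columns_agree(n, name)]
print(mismatches)
```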

Here is a table of characters that are part of Unicode but are not yet officially part of HTML. I expect these characters will not be displayable on many browsers for a while. In fact, the only browser I know of that can display any of them is Lynx, the amazing open source text web browser that displays the whole table correctly.

Numeric display   Character number   Description
―                 8213               horizontal bar
ﬀ                 64256              ff ligature
ﬁ                 64257              fi ligature
ﬂ                 64258              fl ligature
ﬃ                 64259              ffi ligature
ﬄ                 64260              ffl ligature
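One way to confirm that these characters have no entity name is to look them up in Python's `html.entities.codepoint2name`, which covers the HTML 4 entity set. This is a sketch for illustration; the list name `NUMBERS` is hypothetical.

```python
from html.entities import codepoint2name

# Character numbers from the table above.
NUMBERS = [8213, 64256, 64257, 64258, 64259, 64260]

for n in NUMBERS:
    # .get returns None when the character has no HTML 4 entity name,
    # in which case only the numeric reference is available.
    entity = codepoint2name.get(n)
    print(n, f"&#{n};", entity)

# For contrast, the em-dash from the first chart does have a name.
print(codepoint2name.get(8212))
```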

Internet Explorer for Windows handles Unicode pretty well. In particular, it displays all of the characters in the above table. Netscape for Windows, on the other hand, still (as of 4.76) has some problems. Under Netscape, many of the characters in the table above do not display correctly. A mediocre work-around is to put the line

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

in the header of this document, instead of the standard

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
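The difference between the two declarations can be sketched in Python. This is an illustration of the encodings only and does not reproduce Netscape's behavior; the variable names are made up for the example.

```python
# Sample text containing an en-dash (U+2013) and an em-dash (U+2014).
text = "en\u2013dash and em\u2014dash"

# UTF-8 can encode any Unicode character directly.
utf8_bytes = text.encode("utf-8")

# ISO-8859-1 has no slot for either dash, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The "xmlcharrefreplace" error handler falls back to numeric references,
# which is how such characters can still be written on an ISO-8859-1 page.
latin1_with_refs = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(latin1_ok)
print(latin1_with_refs)
```

In other words, with the ISO-8859-1 declaration the dashes can only travel as numeric references, while with UTF-8 they can appear in the file as themselves.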

And in the HTML tradition of making things difficult, the typesetting is completely different on Netscape for the X Window System (the windowing system used on Linux). Under X, the keyboard single quotes (` and ') look good, and the Unicode versions (‘ and ’) look bad; and the dashes are rendered as question marks. Bah!
