February 13, 2001
Typesetting dashes in HTML
Most of the characters you need for writing English text are available on a standard keyboard. Alas, there are three missing characters: the em-dash (—), the en-dash (–), and the opening double quote (“). The dashes are missing because they are too wide: keyboard layouts come from typewriter layouts, and typewriters were fixed width. The opening double quote is missing because on typewriters the closing double quote was straight (") and so could be used for both.
Now, instead of typewriters, we have HTML. The people in charge of this sort of thing defined the standard character set of HTML. Here is the perfect opportunity to fix the typewriter legacy by including these missing characters. The character set has 255 slots, so there is plenty of room (in fact, 62 slots are unused). But I guess they were smoking crack, because a) they didn't include either of the dashes, b) they didn't include an opening double quote, c) they made the opening single quote ugly (`), and d) they made the closing quotes straight (' and ").
This is where Unicode comes in. Unicode is a two-byte character set, which as of Unicode 3 has 49,194 characters defined. Among those are the characters that should have been part of HTML from the beginning. But how do you type a two-byte character?
In HTML there are three ways to type characters: as the character itself, as a character entity reference, or as a numeric character reference. For example, to write a double quote I can type " (the character itself), or &quot; (the entity reference), or &#34; (the numeric reference). Every HTML character has a numeric reference, and many characters that cannot be typed directly also have an entity reference. To get a Unicode character in HTML, you simply use its numeric reference, or its entity reference if there is one. A complete set of Unicode charts is at unicode.org/charts. And here is a list of character entity references, as of HTML 4.01.
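As a small illustration of the three forms, here is a sketch of how the same characters might be written in a page. The sample phrases are made up for the example; the entity names (quot, ldquo, rdquo, ndash, mdash) are the ones defined in HTML 4, and whether each form displays correctly still depends on the browser.

```html
<!-- The same straight double quote, written three ways:
     the character itself, the entity reference, the numeric reference -->
<p>" &quot; &#34;</p>

<!-- Characters with no key on the keyboard:
     numeric reference first, then the equivalent entity reference -->
<p>&#8220;quoted&#8221; and &ldquo;quoted&rdquo;</p>
<p>1914&#8211;1918 and 1914&ndash;1918</p>
<p>wait&#8212;what? and wait&mdash;what?</p>
```

In a conforming browser, the two halves of each of the last three lines should look identical.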
Below is a chart of some typographic symbols that I need. Each row is a particular character: the first column is the character as it is displayed when given as a numeric reference, the second column is the actual number used, the third column is the entity reference (if any), the fourth is the name of the entity reference, and the fifth column describes what the character is.
|Numeric display||Character number||Entity display||Entity name||Description|
|‘||8216||‘||lsquo||left single quote|
|’||8217||’||rsquo||right single quote, apostrophe|
|“||8220||“||ldquo||left double quote|
|”||8221||”||rdquo||right double quote|
|á||225||á||aacute||acute accent over a|
|ø||248||ø||oslash||slash through o|
|ö||246||ö||ouml||umlaut over o|
This document acts as a conformance test for your browser. The first and third columns of the table above must be identical, and must contain no question marks, box symbols, or anything other than what is described in the fifth column. The final line, labelled "soft hyphen", must have the first and third columns blank. A soft hyphen specifies an optional hyphenation point for breaking paragraphs into lines, so it must not be displayed unless the browser chooses a break there. Finally, the words "soft hyphen" in this sentence must be displayed with delimiting quotation marks, since they appear within the <Q> tag. If there are no quotes then your browser does not conform to HTML 4.0. Of course, laying the blame on a browser is not much consolation for not being able to view your web page. That is why I do not yet use <Q> elsewhere on my site.
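For reference, here is a rough sketch of the two pieces of markup this test relies on. The sentences are invented for illustration; only the <Q> element and the soft hyphen references (&shy; or &#173;) come from HTML 4.

```html
<!-- A conforming browser supplies the quotation marks around the Q content itself -->
<p>The final row is labelled <q>soft hyphen</q>.</p>

<!-- &shy; (or &#173;) marks an optional break point; nothing should be
     displayed there unless the line is actually broken at that point -->
<p>A long word such as typo&shy;graph&shy;ical may break at either marked point.</p>
```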
Here is a table of characters that are part of Unicode but are not yet officially part of HTML. I expect these characters will not be displayable on many browsers for a while. In fact, the only browser I know of that can display any of them is Lynx, the amazing open source text web browser that displays the whole table correctly.
|Numeric display||Character number||Description|
Internet Explorer for Windows handles Unicode pretty well. In particular, it displays all of the characters in the above table. Netscape for Windows, on the other hand, still (as of 4.76) has some problems. Under Netscape, many of the characters in the table above do not display correctly. A mediocre work-around is to put the line
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
in the header of this document, instead of the standard
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
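To show where that line goes, here is a minimal sketch of a complete page using the work-around. The title and body text are placeholders, not taken from this site.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<!-- Declares the encoding of the bytes in this file -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Dash test</title>
</head>
<body>
<!-- Numeric references are plain ASCII, so they are unaffected by the declared charset -->
<p>An em-dash (&#8212;) and an en-dash (&#8211;), written as numeric references.</p>
</body>
</html>
```

One caveat with this approach: once the page declares utf-8, any non-ASCII character typed directly into the file (rather than as a reference) must actually be saved in UTF-8, or it will be misread.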
And in the HTML tradition of making things difficult, the typesetting is completely different on Netscape for the X Window System (the windowing system used on Linux). Under X, the keyboard single quotes (` and ') look good and the Unicode versions (‘ and ’) look bad, and the dashes are rendered as question marks. Bah!