A list of common misconceptions about HTML, with lots of excellent detail about how HTML parsers actually work.
(Though I'm not sure how common, or even controversial, some of them are; I'm not sure anyone is arguing for XHTML these days, are they?)
On the death of HTML4:
HTML4 is just outright dead. Browsers do not parse HTML documents as HTML4, regardless of the DOCTYPE.
On a genuinely solid reason not to use "self-closing" elements in HTML:
Lastly, because HTML parsing rules are not the same as SGML’s or XML’s, the trailing solidus carries an additional danger. If it directly follows unquoted attribute values without proper space before it, then it will be parsed as part of that attribute value.
• <img class=wide/> is an IMG with class value “wide/”.
• <img class=wide /> is an IMG with class value “wide”.
So don’t add that trailing solidus! It’s not better; it’s just more dangerous.
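That unquoted-attribute gotcha is easy to check for yourself. Here's a minimal sketch using Python's standard-library `html.parser`. It is not a spec-compliant HTML5 parser, so treat this as an illustration rather than proof of what browsers do, but it handles these two inputs the way the quote describes. The `class_values` helper is my own name, not something from the article.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Records the value of every class attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        for name, value in attrs:
            if name == "class":
                self.values.append(value)


def class_values(markup):
    """Hypothetical helper: return the class values found in markup."""
    parser = ClassCollector()
    parser.feed(markup)
    parser.close()
    return parser.values


print(class_values("<img class=wide/>"))   # ['wide/']
print(class_values("<img class=wide />"))  # ['wide']
```

Without the space, the solidus ends up inside the attribute value, so a CSS selector like `.wide` would no longer match.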
On how we need more HTML parsers:
The world needs more HTML parsers with different applications. Most available HTML parsers produce a DOM – a fully-parsed tree representation of the DOM the HTML represents. This is a memory-heavy operation and performs a lot of semantic cleanup to form a proper DOM document. However, lots of operations don’t need or want a DOM interface.
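As a rough illustration of an operation that doesn't need a DOM, here's a sketch that pulls link targets out of markup as it streams past, without ever building a tree. It uses Python's event-based `html.parser`, which is my choice for the example and not a parser the article names; `LinkCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags; no tree is built."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value is not None:
                self.links.append(value)


def collect_links(chunks):
    """Feed the markup chunk by chunk and return the hrefs seen."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


chunks = ['<p><a href="/one">x</a>', "<a href=/two>y</a></p>"]
print(collect_links(chunks))  # ['/one', '/two']
```

The only state kept is the list of links, so memory use doesn't grow with the size of the document the way a full DOM would.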
On how XML parsers (and RegEx, and string functions, and any other suggested alternative) will never actually be able to parse HTML; only an HTML parser can:
There’s significant advice on the web to use an XML parser for HTML, but it’s not safe. Reach for an HTML parser to parse HTML.
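A small sketch of why the XML route falls over, assuming nothing beyond Python's standard library: perfectly ordinary HTML with an unclosed `<br>` is rejected outright by `xml.etree.ElementTree`, while `html.parser` reads it without complaint. The helper names `parses_as_xml` and `start_tags` are mine.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parses_as_xml(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TagCollector(HTMLParser):
    """Records the name of each start tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup):
    """Return the start tags an HTML parser sees in the markup."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


snippet = "<p>Line one<br>line two</p>"
print(parses_as_xml(snippet))  # False: <br> is never closed, so this is not well-formed XML
print(start_tags(snippet))     # ['p', 'br']
```

This only shows the loud failure mode. The quieter risk is markup that happens to be well-formed XML but means something different under HTML's parsing rules.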