Theory of generalized markup, 2014

On xml-dev, Arjun Ray posts a link to Charles Goldfarb’s seminal paper on the theory of generalized markup and its application in SGML, which was first published in 1981 and subsequently revised and included, in 1986, as Annex A to the SGML standard (ISO 8879).

Thirty years along we have the web, and everything has changed, yet nothing has changed. The core of Goldfarb’s argument is the same lesson taught daily to neophyte web developers on how much better things are when you hang your styles on “semantic” class attributes.

Such labels are useful today because the elements on which they sit (p, td, li, div, span, what have you) have almost no “descriptive” semantics of their own (a p marks a paragraph?!): they have been reduced to their operations, viz. their effects in the browser. (No, p only starts a new line, with some vertical white space. Or not, as the case may be.) The pretense of HTML5 to mitigate this trend with new semantic elements like article and aside acquires a poignant irony when we reflect that it can last only as long as these elements are not used (or abused, if that’s how you look at it) to do stuff in the browser that has nothing to do with what they “are” or are “supposed to be”. At that point (which has undoubtedly already passed), the semantics of article, in HTML, become as vacuous as those of p. It means only what you say it means when it does what it does.

Similarly, I wryly note how the WordPress interface into which I type gives me an HTML strong element for its B button and an em for its I button, and how I then use strong to signal what might be term or gi in (“descriptive”) TEI, and em for what might be soCalled. (Someone somewhere has ruled that HTML b is bad and strong is good — it’s semantic! So I get strong whether I like it or not.) And yet, seeing only bold for my strong and italics for my em, you know well enough what I mean. Semantics are so sneaky!

This tug of war has gone on long enough to suggest that it cannot be won. Thus, having emerged as a de facto standard for formatting publications even offline, and accordingly reduced to the “presentational”, for good or ill, HTML kindly permits us, in order that we may do what we need, to sneak our semantics back into our markup. The fact that the application of class attribute values is so hard to constrain, particularly in comparison to the rigid document types imposed by SGML (Goldfarb calls them “rigorous”), is both a terrible weakness and a secret strength.

Does it seem paradoxical that an XML enthusiast should see any good in all this redoubled reversing? I hope not. Why we will never have a fully comprehensive descriptive markup language (after many valiant attempts) is more interesting than the simple fact of it. And the point of XML is not, it seems to me, what SGML so often presumed, to enable “true description” of our information. It is to achieve better layering in systems design, to be more flexible, more expressive, more graceful. As for HTML, if we didn’t have it, we’d have to invent it. And then we’d have to invent CSS to go with it.