XML – Eat Your Vegetables

Viral properties of schemas

December 16, 2014 Wendell Piez

Markup languages are memes as well as memetic, semantic and sometimes mimetic systems.

(Within XML, think TEI, Docbook, NLM/JATS, DITA, MODS/METS, vertical schemas. HTML, fer chrissake. Markdown. Grammar-based formats. Object property sets. Application binaries.)

I have written about markup languages considered as games; they can also be considered as genes. Indeed one might see a document (here in the very narrow, XML/SGML sense of that word) as phenotype to a schema’s genotype. The fact that formally, documents are always already valid to infinities of unspecified schemas (since an infinite number of valid schemas can be trivially and arbitrarily generated for a given document) obscures the fact that “in the real world” (that is, in the wider contexts of our knowing of it), a document does not stand merely as a formal abstraction (a collection of element types), but also has a history. As such, it is as representative of its history as it is of its (purported) representations. That is, the aircraft maintenance manual is not only a description of an aircraft; it is also a maintenance manual. This history, often, implicates a particular schema.
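To make the "trivially generated" point concrete, here is a minimal sketch in Python that derives one maximally permissive DTD from a document: every element it finds is declared ANY, and every attribute CDATA #IMPLIED. The function name trivial_dtd and the sample document are invented for illustration, and the sketch assumes a document without namespaces. Any number of variants (extra unused element declarations, for instance) would accept the same document equally well.

```python
import xml.etree.ElementTree as ET

def trivial_dtd(xml_text):
    """Build a permissive DTD that declares every element and attribute seen.

    Assumes no namespaces: ElementTree reports namespaced names in
    {uri}local form, which is not a legal name in a DTD.
    """
    root = ET.fromstring(xml_text)
    attrs_by_element = {}
    for el in root.iter():
        # Record the element name and any attribute names it carries.
        attrs_by_element.setdefault(el.tag, set()).update(el.attrib)
    lines = []
    for name in sorted(attrs_by_element):
        lines.append(f"<!ELEMENT {name} ANY>")
        for attr in sorted(attrs_by_element[name]):
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)

sample = '<poem><line n="1">x</line><line n="2">y</line></poem>'
print(trivial_dtd(sample))
```

A schema produced this way says almost nothing about the document, which is the point: formal validity alone does not tell you which schema a document's history actually implicates.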

Schemas:

Replicate (are copied across file systems)
Are modified and adapted
May be identified with “species”, which evolve over time
Are embedded in local systems, but also enable discrete semantic “environments” (which may be portable)
Occasionally cross the barriers between systems (often to be adapted on the other side)
Sometimes include mechanisms for their own adaptation

(BTW these are also all true of documents, at a more discrete level. Indeed, schemas are among the mechanisms that documents provide for their own adaptation. Note how you can stretch the concepts of “document” and “schema” here very widely, to all kinds of media, especially if you allow for the idea that schemas may exist implicitly without being codified.)

Unlike living systems (but perhaps like viruses), schemas cannot be said to have volition of their own. They get their volition from the environments in which they are promulgated. Perhaps they are more plant-like than animal-like.

Also the question arises as to whether (and to what extent) they are parasitic or symbiotic. One suspects they have to be symbiotic in order to encourage adoption. However, they clearly get much of their power from their network effects (the more people using HTML, the more useful HTML becomes to use), and at a certain point, this may introduce stresses between the local goals (of HTML users themselves) and the goals of the interests that promote HTML despite its poor fitness to those local goals.

Schemas are also the deliberate, logical/social/linguistic creations of people and of organizations of people. Can they be this, and also “viral” at the same time?

Model and Process, part I

November 12, 2014 Wendell Piez

Being called on to pinch-hit for a colleague at GSLIS (and seriously, it’s an honor to be asked), I am today pondering the relation between “document modeling” (especially but not only as a sort of “data modeling”) and the design and implementation of processes, that is, actual workflows that perform actual operations and achieve actual results.

To describe it thus is to see this immediately as a deep set of questions. (Not only are the questions themselves deep, but the set is deep.) Yet many or most, even among those of us who are students of these dark arts, rarely ponder them much; we pretty much go on our ways developing models and deploying (and even operating) processes, without much thinking about how truly dependent on one another these are.

It is not that one must always devote a document model to a process: document models can be designed independently of any actual or particular process, and have been, so much so that it puts me in mind of what Mark Twain is said to have said when asked if he believed in infant baptism: “Not only do I believe in it; I’ve seen it”. Indeed, this activity is theoretically necessary (or at least that argument can be made), and to design such models (to be “application independent”), and to design systems that support the design (and yes, ironically, the processing) of such models, is work of paramount importance. Yet at the same time, it is only when we actually try to do things with actual data (leveraging our modeling, that is to say, and capitalizing on and profiting from our investment in it) that we discover the true outlines and limits set by our models along with (and reflecting) their powers. (Well, that is not strictly true, as some people are possessed of enough perspicacity to be able to envision and thus anticipate the limits of a design, without actually running the thing. But these people are rare, and tend not to be listened to much in any case.)

Thus there is a theoretical as well as a practical interest in process, as well as model, as indeed there can be an abstraction of process too — models of process, as are specified in multiple layers of our system in any case, in its various software components designed to interface with each other in various ways. It’s models all the way down. But what enables the models to layer up is the processes in which they are expressed and transformed.

Maybe “model” and “process” are simply projections of “space” and “time” in the imaginal worlds we construct and actuate in our systems building and operation? Space can be thought of without time, if geometric relations can subsist among however many spatial directions there may be, straight or bent, outside of time. (Let no one come in without geometry, as it said over the door of Plato’s Academy.) But time moves, is not static, and in one direction only, even as it ebbs and flows and branches and aligns, as time lines (it may be, however we might define such a thing) cross and rejoin other threads or strands of themselves. With time, space is more than just a cold, abstract unchanging field of potential. It has energy, becomes capable of transformation, a setting for organic emergence.

Is this what we also see within the simulacra we build in the machine, whose laws of physics are so different from ours? Add process to model, that is, and you really get legs. A process must always have or work with a model, even if only implicitly, so the model comes first. But it is only when we put fuel in the line and start cranking that we find out how the engine runs.

XML vs/and HTML5

August 19, 2014 Wendell Piez

One thing we learned at the Balisage 2014 Workshop on “HTML5 and XML: Mending Fences” of two weeks ago is how vast a gulf is …

I thought the day went pretty well. Highlights included a knock-down survey by Steve DeRose on requirements for hypertext (not always so well addressed in the web stack); a demo by Phil Fearon showing the level of polish and performance that can be achieved today (at least by an engineer of his caliber) with XML/XSLT/SaxonCE in the browser; and the redoubtable Alex Miłowski reflecting ambivalence (or at least this is the way I took it, very sympathetically): regret for missed opportunities and concern for the future, mixed with an unshakeable sense of opportunities still in front of us.

Most of us in the room were ambivalent, probably, albeit all in our own ways. We were treated, by Robin Berjon (who did a fantastic job helping the Balisage conference committee organize and run the event) and by Google’s fearless and indomitable Domenic Denicola, to an examination of what HTML5 will offer us, including features (web components) that promise the capability of hiding “shadow DOMs” in the browser presenting arbitrary markup vocabularies (which in principle includes descriptive markup languages) and binding them to browser semantics, allowing them to be handled and processed in CSS and from script using standard APIs. Awesome stuff.

On the other hand, web components (notwithstanding prototypes) are still a distance off, and no one knows like this crowd how the most well-meaning standards initiatives can go awry. (Can anyone say “XML namespaces”?) Plus, this was an audience already satisfied (at least for the most part) that traditional XML document processing architectures — layered systems often running where browsers never see them — are meeting our needs. Not only has XML not failed us, it is a phenomenal success; on top of the many excellent tools we have, all we really want is more and better tools (sure, we’ll take web components); better integration in application stacks of all kinds; and — above all — more curiosity, sympathy and maybe even understanding from ambitious hot-shot developers who aspire to make the world a better place.

I mean, we came to listen, and we did, but I’m not sure anyone was there to hear us but us.

I hasten to add, of course, that this isn’t (or isn’t just) a matter of feeling unappreciated. To be sure, an audience that included technical staff from organizations such as (just to name a few) the US House of Representatives, the Library of Congress, NCBI/NLM (publishers of PMC), NIST, ACS, OSA, and other sophisticated publishers both large and small — people who use XML daily to get the job done (and web browsers too, if only for the last mile) — found it a bit jarring to hear our tried-and-true solution (which indeed is such close kindred to the web) described as a “pet project”, and therefore deemed unworthy of the attention of an important company that builds browser software. But this wasn’t the issue. More than anything, we need not recognition or respect (heck, this is a crowd of introverts, happy not to get too much attention) — but support — young blood, new, active and imaginative developers who will help us not to retire and forget our working systems, but to maintain, extend and improve them.

Instead, we are being offered a new platform on which to relearn old lessons, however disguised they will be in new syntax and technical terminology. And what is worse — the message we hear being told to others is that the successful technologies and solutions we have built do not matter, will soon obsolesce, and deserve no further consideration in the wider economy, to say nothing of investment.

Yes, I exaggerate! I didn’t hear any of this said, at least in so many words, by anyone last August 4. These were just implications hanging in the air.

Yet the sense was unmistakeable that these two cultures were frankly baffled, each by the other. One culture (“the web”?) deliberately limits its scope of interest to the web itself – necessarily and rightly so – and so it must perforce assume that the HTML web and its browser (name your favorite vendor here) are the be-all-end-all, the only communications medium a civilization would ever need. (I know this is a caricature here. Feel free to argue against it.) Another culture (call it “XML”) values standards in text-based document markup not because they are a driver for adoption of anything in particular, but when and as they support adaptability and heterogeneity — of applications and of needs and purposes — on a robust, capable, versatile, open and non-proprietary technical ecosystem — that is, not on one platform or another, but on whatever platforms work best, today and then (differently) tomorrow.

So why are XML developers regarded now as lacking vision? Because we live in a world bigger than the web, whose edges we do not claim to see?

Set this aside, the little voice tells me: it doesn’t really matter. Instead, come back to that unshakeable sense of opportunity that Alex Miłowski communicated. This could work: this does work. We have XML, XSLT, XQuery: the tools are there, and the work is being done. There is no shortage of strong ideas circulating in the XML space. (Over the course of the next few days, Balisageurs including David Lee, Hans-Jürgen Rennau, John Lumley, Abel Braaksma and others showed this well enough.) And HTML5 does not have to be a competitor any more than other formats, both data sources and transformation targets: like PDF, HTML, CSV, you name it, HTML5 will be a tool for us to use, for the work it is good for.

Microformat proving ground

April 21, 2014 Wendell Piez

From perfect grief there need not be
Wisdom or even memory:
One thing then learnt remains to me,—
The woodspurge has a cup of three.

Rossetti

As the Starved Maelstrom laps the Navies
As the Vulture teased
Forces the Broods in lonely Valleys
As the Tiger eased

By but a Crumb of Blood, fasts Scarlet
Till he meet a Man
Dainty adorned with Veins and Tissues
And partakes — his Tongue

Cooled by the Morsel for a moment
Grows a fiercer thing
Till he esteem his Dates and Cocoa
A Nutrition mean

I, of a finer Famine
Deem my Supper dry
For but a Berry of Domingo
And a Torrid Eye

Dickinson

Remarks

WordPress, with the help of a couple of plugins, is … barely … able to add a layer on top for me to edit CSS, to drive the formatting. The worst problem is not actually the part about its being a plugin (and therefore prone to breakage), but rather how WordPress is unable to save the native HTML dependably. It is evident why this is (for all kinds of reasons WordPress will not allow random HTML injections), but it creates a problem for anyone who … needs more …

Oh! And here’s an Achilles’ heel – the CSS is easily lost. For example, on the site’s front page, the same code that comes out pretty and formatted on the blog post’s page is … busted.

SVG in WordPress

April 18, 2014 Wendell Piez

So … what I’ve learned is, with the help of an extension one can indeed get an SVG to appear under WordPress … barely. (Take a look at the little Irish Airman experiment.)

It drops in as media, and WordPress can only link to it, not embed it. I.e., via an img, not by including the SVG in the HTML. This is okay for some purposes, maybe not so great for others, since among other things, it makes controlling the scaling relative to the HTML page next to impossible.

The larger theme is the “too much to know” problem. Because people don’t know how to use SVG, support for it is slow to come. Since support for it is slow to come, no one explores it and learns how to use it. Folks like me (nothing special, I just came in through a side door) are outliers again.

Yeats’s Irish Airman (a visualization)

April 17, 2014 Wendell Piez

Yeats's Irish Airman (a programmatic rendering)

A fanciful interpretation of William Butler Yeats’s fantastic poem, “An Irish Airman Foresees his Death”, in SVG. This was drawn (some years ago now) using an XSLT stylesheet working over a rather plain XML version of this poem in four quatrains of tetrameter lines. The “fourness” of this poem suggests its structure might be taken to be that of … a biplane.

If you see nothing, it’s due to a failure either in this platform, or your browser. (I can see it in the preview, but one of the hazards of this kind of work is that I can’t control every link in the chain. And some of them can be rather weak.) Some reflections on SVG in WordPress are coming in another post….

Are XML tags sharp objects?

April 11, 2014 Wendell Piez

Start and end tags, no, they are not sharp, despite appearances. They will generally not poke or hurt you as long as you keep them properly closed (that is, every start has its end inside the same parent). Tags written with angle brackets indicate structure, bracing the XML document, holding everything in place. They are your friends.
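The nesting rule can be checked mechanically. A minimal sketch in Python, using the standard library's xml.etree.ElementTree; the helper name is_well_formed and the sample markup are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Every start tag has its end tag inside the same parent.
nested = "<stanza><line>I know that I shall meet my fate</line></stanza>"
# The end tags cross: line is closed outside the parent it was opened in.
crossed = "<stanza><line>I know that I shall meet my fate</stanza></line>"

print(is_well_formed(nested))   # True
print(is_well_formed(crossed))  # False
```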

The really bad tags in XML, the ones you have to watch out for, are the entity references, the things that start with &. Think about what & means to an XML parser. It sees & and it doesn’t know what comes next. It looks for a name. (Let’s hope it finds a legal name before it hits ;.) Finding a name, it looks it up. (Let’s hope it is able to find someplace to do so.) It splices in what the declaration says. It then goes back to where it left off.

This is a precarious operation. Stuff supposed to be “XML” fails to parse all the time, not because its element markup is awry, but because its entities are not resolving correctly, if at all. And if even a single entity reference fails, the document cannot be processed. Use entities only with care. Don’t assume they’re safe just because you’ve seen them a lot elsewhere (such as in HTML).
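A small sketch of that failure, again in Python with xml.etree.ElementTree. The same reference to eacute breaks the parse when nothing declares it, and resolves when the document carries its own declaration in an internal DTD subset. The helper parse_text and both sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The name eacute is familiar from HTML, but nothing here declares it.
undeclared = "<p>caf&eacute;</p>"

# Same content, with the entity declared in the internal DTD subset.
declared = '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'

def parse_text(xml_text):
    """Return the root element's text, or None if the document fails to parse."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

print(parse_text(undeclared))  # None: one unresolved reference sinks the document
print(parse_text(declared))    # café
```

The element markup is identical in both cases; only the availability of the declaration differs.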

Note that XML character references look like entity references, but aren’t. It’s pretty safe in XML to refer to a character in Unicode by its number, such as &#10; (the LF character) or (its hexadecimal equivalent) &#xA;.
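For instance, the decimal reference &#10; and the hexadecimal reference &#xA; both resolve to the same line feed without any declaration at all. A quick check in Python (the sample documents are made up):

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references to U+000A (LF).
decimal = ET.fromstring("<p>one&#10;two</p>").text
hexadecimal = ET.fromstring("<p>one&#xA;two</p>").text

print(decimal == hexadecimal == "one\ntwo")  # True
```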

Watch out for your entity references! They can break your documents when they move across boundaries, if their declarations become lost. To have standalone XML (this means well-formed, but also entirely self-contained) you should avoid any entity references that have to be declared. Which is pretty much all of them.
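The exceptions are the five entities XML predefines (amp, lt, gt, apos, quot), which never need a declaration. A rough sketch of a check for everything else follows. It scans raw text with a regular expression, so it is only a heuristic: it does not skip comments or CDATA sections, and the name pattern is an ASCII-only approximation of XML names. The function name entities_needing_declaration is hypothetical.

```python
import re

# The five entities every XML parser knows without a declaration.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A named reference: & + name + ;  (character references start with &# and do not match).
ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")

def entities_needing_declaration(text):
    """List, in order of first appearance, entity names that must be declared."""
    names = []
    for name in ENTITY_REF.findall(text):
        if name not in PREDEFINED and name not in names:
            names.append(name)
    return names

sample = "<p>Caf&eacute; &amp; bar&nbsp;&#10;&#xA;&eacute;</p>"
print(entities_needing_declaration(sample))  # ['eacute', 'nbsp']
```

An empty list suggests the text can travel without a DTD; anything else names the declarations that must travel with it.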

XSLT and e-publishing, past and future

March 28, 2014 Wendell Piez

One link led to another, and so I found myself reading Liza Daly on “The unXMLing of digital books”, from about a year ago (February 2013). She also links to a nice presentation by John Maxwell, “The Webby Future of Structured Markup: Not your father’s XML”, also worth a look.

Both of these fine considerations give me the crawlies. I mean, not only do I agree with them on many key points regarding appropriate and inappropriate uses of structured markup (see my 2011 Balisage paper on some of this), but also, they evidently represent a trend. Which scares me. It’s hard not to wonder how much of this, for us die-hard structured markup fanatics, should be seen as writing on the wall, even if the fact that XML has now become unfashionable was predictable as soon as it got hot back in the early twenty-aughts. What goes up must come down: for something to be fashionable is precisely for it to be given credit it does not deserve (no it’s not a panacea, we kept saying), and fashion inevitably fades into embarrassment, then nostalgia, if it does not harden into ideology. And I’ve never been interested in XML for the ideology of it. So is there anything left?

Yet peel away a layer, and what both Daly and Maxwell say isn’t as hostile to XML as you might think. On the contrary: both of them allow, more or less explicitly, that there is a nugget in the dross. The question is how to keep it, indeed what we might make it into if we recognize and take care with it.

And this is what leads me back to XSLT. Daly is pleased that in her publishing system, she can just drop in the HTML and go. Whee! CSS gives her all the layering she needs. (Except it doesn’t … she talks about processing requirements for filtering, indexing and aggregation that uncontrolled HTML can’t address.) “We have very little preprocessing necessary; XSLT, which is hard to learn and harder to master, is almost absent from our workflow.” Interesting … so even with her fault-tolerant HTML toolkit, she has some need for preprocessing and some XSLT. Maybe eventually they’ll find a way to mothball that stuff too, presumably as soon as they find something else that meets the need half as well.

XSLT is hard to learn and hard to master. Who am I to say it isn’t? (If that wasn’t my experience, I’m willing to admit that mine is a special case, and not because I’m so smart. I’m only lucky or unlucky as the case may be.) The flip side is — it can be learned. (I’m also willing to help. As a professor of mine once said, “In Deutschland auch die Kinder sprechen Deutsch” — in Germany, even the children speak German!) And nothing else (at least out here in the free market) comes close to XSLT in power, adaptability and fitness for the particular class of problems for which it is designed — a class of problems that is central in the publishing space. XSLT is necessary enough to have been reinvented any number of times, when people have decided they would prefer to do without it, for reasons of platforms, or markets, or culture, or aesthetics.

Given Daly’s complaints about XML, the irony is in how XSLT’s strength is in dealing with only poorly controlled data sets. I mean, if your data is well controlled and as granular as you need, by all means use RDF or RDBMS or OO: go to town, enjoy, and consider yourself lucky. But XSLT needs only a tree; if you can get one out of your HTML, however sloppy it is, that’s good enough to get you in. You can even start with a data brick if you think of it as a tree with just one node. Of course what you can do with that tree depends on your level of control; but one of the things you can do in any case is expose the issues and start to assert the control you need. XSLT isn’t just for transformation into presentation formats or even preprocessing and normalization. It’s also for diagnostics, validation and heuristics — even conversion into structured formats from messes of tags or plain text. And yes, I mean XSLT 2.0 here. If the XSLT you tried offered no temporary trees, native grouping, stylesheet functions and regular expressions, you have no idea what you’re missing.
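As a taste of the diagnostic use, here is a sketch that takes a tree, however sloppy, and tallies which elements actually occur in it. It is written in Python with the standard library rather than XSLT, purely so it runs without extra tooling; it is a stand-in for the kind of first inspection described above, not an XSLT example. The name element_census and the sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_census(xml_text):
    """Count how often each element name occurs anywhere in the tree."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())

# Well-formed but uncontrolled markup: nested divs, an empty p, a stray span.
messy = "<div><p>One</p><div><span>two</span><p>three</p></div><p/></div>"

for name, count in element_census(messy).most_common():
    print(name, count)
```

A census like this is often the first step toward asserting control: it shows what is really in the data before any rule about what should be there gets written.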

This is why, when you hang out with XSLT people these days, you pick up such mixed feelings. On the one hand, there is trepidation. We wonder if we should feel silly wearing last year’s hat, and we know we will be judged for it. On the other, we know that within organizations that use it, XSLT is known, if often grudgingly, as powerful ju-ju. For some, eager to reduce dependencies on skills that are hard to find, this will be an excellent reason to get rid of it. For others, with the skills or the sense to invest in them, it will continue to be a secret weapon as long as they have inputs that will benefit from transparency and control.

In this context, I try to tell myself the future actually looks bright. Daly reminds us that “books aren’t data”, and they aren’t wrong just because a file is invalid to some schema or other. I’ve said the same myself; but she’s not complaining about XML or even about schemas: she’s complaining about the insensitive and clumsy ways XML-based systems have been designed, built and used, about muddle-headedness and misleading promises. Yea, verily, yea. But when will structure (as loose or strict as the case demands), inspection, validation and transformation go out of style?

Schemas, schemata

March 26, 2014 Wendell Piez

I guess it’s the browser’s spell checker that doesn’t like “schemas”. It prefers “schemata”.

*Sigh* the enigmata, the dramata, the dogmata…

(Yes, the plural of Greek σχῆμα is indeed σχήματα. But the spell checker is for English.)

Theory of generalized markup, 2014

March 21, 2014 Wendell Piez

On xml-dev, Arjun Ray posts a link to Charles Goldfarb’s seminal paper on the theory of generalized markup and its application in SGML, which was first published in 1981 and subsequently revised and included, in 1986, as Annex A to the SGML standard (ISO 8879).

Thirty years along we have the web, and everything has changed, yet nothing has changed. The core of Goldfarb’s argument is the same lesson taught daily to neophyte web developers on how much better things are when you hang your styles on “semantic” class attributes.

Such labels are useful today because the elements on which they sit (p, td, li, div, span, what have you) have almost no “descriptive” semantics of their own (a p marks a paragraph?!): they have been reduced to their operations, viz. their effects in the browser. (No, p only starts a new line, with some vertical white space. Or not, as the case may be.) The pretense of HTML5 to mitigate this trend with new semantic elements like article and aside acquires a poignant irony when we reflect that it can last only as long as these elements are not used (or abused, if that’s how you look at it) to do stuff in the browser that has nothing to do with what they “are” or are “supposed to be”. At that point (which has undoubtedly already passed), the semantics of article, in HTML, become as vacuous as those of p. It means only what you say it means when it does what it does.

Similarly, I wryly note how the WordPress interface into which I type gives me an HTML strong element for its B button and an em for its I button, and how I then use strong to signal what might be term or gi in (“descriptive”) TEI, and em for what might be soCalled. (Someone somewhere has ruled that HTML b is bad and strong is good — it’s semantic! So I get strong whether I like it or not.) And yet, seeing only bold for my strong and italics for my em, you know well enough what I mean. Semantics are so sneaky!

This tug of war has gone on long enough to suggest that it cannot be won. Thus, having emerged as a de facto standard for formatting publications even off line — and accordingly reduced to the “presentational”, for good or ill — HTML kindly permits us, in order that we may do what we need, to sneak our semantics back into our markup. The fact that the application of class attribute values is so hard to constrain, particularly in comparison to the rigid document types imposed by SGML (Goldfarb calls them “rigorous”), is both a terrible weakness and a secret strength.

Does it seem paradoxical that an XML enthusiast should see any good in all this redoubled reversing? I hope not. Why we will never have a fully comprehensive descriptive markup language (after many valiant attempts) is more interesting than the simple fact of it. And the point of XML is not, it seems to me, what SGML so often presumed, to enable “true description” of our information. It is to achieve better layering in systems design, to be more flexible, more expressive, more graceful. As for HTML, if we didn’t have it, we’d have to invent it. And then we’d have to invent CSS to go with it.
