Tuesday, May 27, 2008

Representation Taxation

It seems my thoughts about Atom Entries will be delayed a while longer. This post begun life as a response to this "BlubML" post by Bill de hÓra, but quickly turned out to be about my view of data models, as used in the IT industry yesterday, today and tomorrow. And industry rather ridden with technology wars, which has reduced the status of information into mere fodder. Leaving meaning, shared and reused concepts, discoverability, integratability and hence interoperability in many ways just a far-off vision many never even have time time to think about (leaving us thrashing data in the technology trenches).

The mentioned post reflects (by quotation) upon markup and the perceived "angle bracket tax". I can definitely get that. But, as mentioned in a comment by Iain Buckingham, the issue at hand is probably mainly about what the markup is supposed to represent. (And now I totally leave the subject of lightweight text formats (which I like, by the way); I'll focus on data models, not syntax). Take the content model of XML. It is a mixture of plain text (useful for primitive data like numbers, dates and other "human language" constructs) and different structure blocks/markers (elements, attributes, pi:s, some more) who have no useful semantics apart from a label (sometimes in a namespace), ordering and nesting. This has been proven as very useful to represent documents, where documents range from books via web pages (also as application views) to vector graphics and so on.

The "defining nature" of documents is difficult to pin down, but it seems we get by fairly efficiently anyway. Mainly since this semi-structure is intended (and, incidentally, often indented) for one-way (sometimes layered) rendering of output in turn meant for human consumption, which actually works without information precision (albeit sometimes with more or less apparent negative concequences).

Documents aside, let's continue to data records. More or less the first use of XML was as a serialization of the RDF data model. This model has real semantics, where unique resources are described with statements, whose parts (subject, predicate, object) are either resources (all three) or primitive values (for objects). Unfortunately, RDF still struggles to emerge as a useful representation in the industry at large, partly due to its initial appearance as XML (from whose abundance of namespaces and literal URI:s even many XML neophytes has fled in panic), partly due to its data model being more academic (rooted in logics and AI) and thus less approachable than the (IMHO) less expressive but quite succinct object oriented, class based data model of most common modern programming languages.

This latter model has always been fragmented by language differences, and up close hard to pin down due to conflicting computer science details (OO encapsulation, coupling with function and message metaphors etc). And while objects often live in (connect as) graphs, their identity has been hidden, often being a mere memory address. Quite parallell with this, data has been pushed into the even more mechanical and implementation focused semantics of the relational model, in SQL databases, for decades. While these records do have explicit IDs (or composites, or...), they are for internal use within a given application context. Any usage beyond that must be secured by convention outside of that. The often cumbersome use of RDBs (where "splitting and slicing" all but the flattest records is needed) has been minimalistically remedied by the O/R-mapping tools in the OO languages, which has proven exceptionally useful in introspectable and/or dynamic languages. But it's mostly just a prettier surface upon the same old relational model. Signs of which can be seen by the specific restrictions (that many O/R:s have) placed upon the object semantics of the "host" language (inheritance, composition, coupling with functionality), and not the least the invention of SQL-like languages, mostly one per mapping implementation (although some credit should go to earlier efforts by the object database people for trying to standardize such things).

But there are very useful aspects of an OO-based representation. As a common denominator, JSON should be held up as an extremely useful format (albeit quite void of namespacing). It represents the "bare bones" data record format which is isomorphic with the OO languages (be it class- or prototype-based ones). And although a more narrow scope, it has more defined semantics than XML, since it explicitly differs between properties and lists (and does not introduce the artificial controversy of whether to use "attributes" or "elements"). But it's still a long way from the identifiable resources and datatyped or human language typed literals of RDF. Depending on context/domain needs, this may or may not be a painful point to realize (if not, JSON is indeed the simpler thing that actually works).

The attempt to express similar semantics in W3C:s XML Schema is to me a painful one, and in view of the complexities it adds and how little it brings in succinctness and clarity (and immediately/apparently useful general semantics) is a sign of that schema technology's failure. Use of RelaxNG isn't too bad though, and I find nothing inherently long in some formats, such as Atom, whose role is to bring a packaging and insulating channel format for resources (in the URI, REST, and most imporantly the RDF sense of the word). Atom defines a model which is scoped and simple (as in easy to grasp, limited in expressiveness). This is done with a text specification (RFC 4287 and its kin) coupled with a (non-normative) RelaxNG (compact) schema for easier machine use. For some things this can be a fitting approach. I don't believe its the best way to fully express records though (but perhaps to glean a narrow but useful aspect of such).

Now, let's turn the focus back to the objects often used as records with e.g. O/R technologies. It's been mainly through the emergence of adherence to the principles of REST, in turn (in practise) dependent upon URI:s and ideals of "cool URI:s" (for e.g. sustainable connectedness), that these objects have gained identities usable outside of a given, often ad-hoc defined, changing and quite internal (think "data silo") management of the data at hand. This process is the evolution of the Web as a data platform (which I believe should be the basis of a stable distributed computing platform, where one is needed), and it seems to me that the use of RDF is now a very promising next step. Since it embraces (indeed builds upon) this uniform and global identification scheme. Since it is "data first" oriented — you can use it to state things about resources which are meaningful even for machine processing without any invention/specification on your part. And since (due to URI:s and the semantic precision) it's less vulnerable to coupling with internal solutions, which "bare" O/R-mapped objects often are (in the same way, albeit less humongous, as automatically generated XML-schema defined API-fragments that hide in SOAP envelopes), due to implementation details (and sometimes a hard-wired and thus fragile URI minting mechanism).

With JSON, RDF shares the decoupling of types/classes ("frames" in AI) and properties of resources. But in JSON properties are simple, totally local keys, and not first-class resources themselves possible to describe in a uniform manner.  And JSON has other upper limits (such as the literal language mechanism), which can't be surpassed without adding semantics, "somehow" (e.g. conventions). This may work for specific needs (and very well), but is hard to scale to the levels of interoperability needed for a proper web-based data record format. (Similar problems riddles it's ragged step-cousin microformats, which also lacks formal definitions for how to handle (especially decentrally so).)

Back in "just XML"-land, as said, it seems to work for the semi-structured logical enigmas of XML that the above mentioned "documents" are, but as for data records, the XML format will be a serialization of some very specific, elsewhere explicitly defined model (document formats as well of course, but these often make some use of the more exoteric mixed content model XML supports). One such model is the "web resources over time" that Atom works for (coupled with the HTTP infrastructure and RESTful principles). Granted, such XML-based formats can be used without worrying too much about the mysterious XML infoset model (including the RDF/XML format, its free-form character aside). But to be useful, they need to be standardized and extensively supported by the larger community. Otherwise, apart from needing to design and implement the deserialized model from scratch, the creeping complexities that an ill-thought XML model design lets in (or forever locks out any extensibility from) may emerge whenever you want to express something idiomatic to your data.

That's all. This was mostly letting out some thoughts that have been building up pressure in my mind lately. And became yet another prelude to my thoughts still simmering regarding how to effectively use Atom with payloads of RDF. And also about some of the more crude ways, if you need to gain ground quickly (but more or less imperfectly). You can do that with JSON (fairly ok, but with JSON you're quite localized anyway, so why take the Atom highway if you only need to use your bike?), microformats (not ok in by my standards (again: where's your model?)), RDFa (if you have overlapping needs it can be brilliant), or with simplistic Atom extensions (could be very XML-taxating and thus risky; but done "right" it's kind of a "poor mans RDF", a bleak but somewhat useful shadow of the real thing).

(Admittedly this post also turned out to be a rant with questionable focus, but why not. Pardon my posting in quite a state of fatigue.)

No comments: