Dust Feed: 2008

Thursday, November 27, 2008

Labelled Reduction as A Good Thing

One small thing (of the many) in Python 2.6 I like (and have waited for since it appeared as a recipe), is collections.namedtuple. It is very useful in itself, but the fact that the stdlib has been adapted to use it throughout is quite nice. Consider the following code:

from urlparse import urlparse
print urlparse(
        "http://localhost:8080/doc/something;en?rev=1")

If run with Python 2.5, you get this tuple:

('http', 'localhost:8080', '/doc/something',
 'en', 'rev=1', '')

, whereas in 2.6 it is a namedtuple:

ParseResult(
        scheme='http', netloc='localhost:8080',
        path='/doc/something',
        params='en', query='rev=1', fragment='')

The last one is unpackable just like a regular tuple, but you can access the parts as attributes as well. This little "data struct" is quite handy, since I don't like to access tuples by index, but quite often pass them around and only access some piece of them at a time.

(With these you don't need to create full-fledged classes for every kind of instrumental data. (Sometimes coupling data and functionality in a single paradigm may be a coarse hammer, treating nails and iron chips alike..) Nor resort to the use of dictionaries where you really want a "restricted value lens", if you will.. But this is another rant altogether.)

Of course, there's lots more to enjoy in 2.6 (the enhanced property for decorator use, ABC:s, json, 2to3 etc).

On a related note, do check out Swaroop C H:s excellent and free books on Python (2.x + 3.0(!)): A Byte of Python. And if you're into Vim (you should be, IMHO) his new A Byte of Vim.

Friday, October 24, 2008

Gnostic Nihilism

Here's my take on it. God actually exists. It's the world that doesn't. It's just that when you leave a being like that all alone in nothingness for an eternity, it starts to dream up all sorts of amazing and insane things.

Or to put it in mathematical terms (incidentally also being my unbound alternative to "42" for half my life):

0 * oo = x, where 0 < x < oo.

("oo" is for 𝌺, i.e. "eternity", of course.)

Take that you goddamn realists! I laugh in your general direction! Ha. Ha I say.

If you nevertheless find a discrepancy in my flawless argument, I'd gather you're just cheating by using rationality. How gnostic is that? I suppose next you're going to argue that there is logical truth in empirical facts.

Saturday, September 27, 2008

Resources, Manifests, Contexts

Just took a quick look at oEmbed (found from a context I was led to from Emil Stenström (an excellent Django promoter in my surroundings btw. Kudos.)).

While oEmbed is certainly quite neat, I very much agree with the criticism regarding the lack of RESTfulness, and that they have defined a new metadata carrier. I think oEmbed would work very well as an extension element in Atom Entry documents (who already have most of the properties oEmbed (re-)defines). Or by reusing (in such atom entry docs) e.g. Media RSS, as Stephen Weber suggested.

Granted, if (as I do hope) RESTfulness and Atom permeation on the web becomes much more well established (approaching ubiquity), this would be dead easy to define further down the line. (And signs of this adoption continue to pop up, even involving the gargantuans..)

But since it wasn't done right away, oEmbed is to some extent another part of the fragmented web data infrastructure — already in dire need of unification. It's not terrible of course, JSON is very effective — it's just too context-dependent and stripped to work for much more than end-user consumption in "vertical" scenarios. While oEmbed itself is such a scenario, it could very well piggy-back on a more reusable format and thus promote much wider data usability.

A mockup (with unsolicited URI minting in the spaces of others) based on the oEmbed quick example could look like:


<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:oembed="http://oembed.com/ns/2008/atom/">
  <id>tag:flickr.com,2008:/3123/2341623661_7c99f48bbf_m.jpg</id>
  <title>ZB8T0193</title>
  <summary></summary>
  <content src="http://farm4.static.flickr.com/3123/2341623661_7c99f48bbf_m.jpg"
           type="image/jpg"/>
  <oembed:photo version="1.0" width="240" height="160"/>
  <author>
    <name>Bees</name>
    <uri>http://www.flickr.com/photos/bees/</uri>
  </author>
  <source>
    <id>tag:flickr.com,2008:/feed</id>
    <author>
      <name>Flickr</name>
      <uri>http://www.flickr.com/</uri>
    </author>
  </source>
</entry>

The main point, which I have mentioned before, is that Atom Entries work extremely well as manifests of resources. This is something I hope the REST community will pick up in a large way. Atom feeds complement the RESTful infrastructure by defining a standard format for resource collections, and from that it seems quite natural to expose manifests of singular resources as well using the same format.

In case you're wondering: no, I still believe in RDF. It's just easier to sell uniformity one step at a time, and RDF is unfortunately still not well known in the instrumental service shops I've come in contact with (you know, the ones where integration projects pop up ever so often, mainly involves hard technology, and rarely if ever reuse domain knowledge properly). So I choose to support Atom adoption to increase resource orientation and uniformity — we can continue on to RDF if these principles continue to gain momentum (which they will, I'm sure).

Thus I also think we should keep defining the bridge(s) from Atom to RDF for the 3.0 web.. There are some sizzling activities on that respect which can be seen both in the Atom syntax mailing list and the semantic web list. My interest stems from what I currently do at work (and as a hobby it seems). Albeit this is from a very instrumental perspective — and as a complement, rather than an actual bridge.

In part, it's about making Atom entries from RDF, in order for simple RESTful consumers to be able to eat some specific Atom crumbs from the semantic cakes I'm most certainly keeping (the best thing since croutons, no doubt). These entries aren't complete mappings, only selected parts, semantically more coarse-grained and ambiguous. While ambiguity corrupts data (making integration a nightmare), it is used effectively in "lower-case sem-web" things such as tagging and JSON. (Admittedly I suppose it's ontologically and cognitively questionable whether it can ever be fully avoided though.)

We have proper RDF at the core, so this is about meeting "half way" with the gist of keeping things simple without loosing data quality in the process. To reduce and contextualize for common services — that is at the service level, not the resource level. (I called this "RA/SA decoupling" somewhere, for "Resource Application"/"Service Application". Ah well, this will all be clarified when I write down the COURT manifesto ("Crafting Organisation Using Resources over Time"). :D)

Hopefully, this Atom-from-RDF stuff will be reusable enough to be part of my Out of RDF Transmogrifier work. Which (in my private lab) has been expanded beyond Python, currently with a simple Javascript version of the core mapping method ("soonish" to be published). Upon that I'm aiming for a pure js-based "record editor", ported from an (py-)Oort-based prototype from last year. I also hope the service parts of my daytime work may become reusable and open-sourced as well in the coming months. Future will tell.

Sunday, September 21, 2008

Possession

I honestly thought that my next post wouldn't be a drunken one. Alas, it'll be. This time, it's prompted by a sample initializing "Requiem" by Delerium (-89, you won't bother): "Possession is a state of mind". Somehow this statement rings true. Then I thought about "theft". Following, I figured reflecting about "possession" was at the core. You'll be the judge. As always. Please respect possession - and never (ever ever) be the thief. Right.

Thursday, June 26, 2008

Have you ever

stayed within your own contemplation
long enough,
to realize
that you're never
outside
of your contemplation?

Leave.

Saturday, June 21, 2008

The

most merciful thing, is the inability of the human mind to correlate all its contents. Please don't. Or fret not. At least -- not at least.

Tuesday, June 17, 2008

Brave New World

[I wrote this little silliness more than a year ago. Didn't find an appropriate time to post it (it served mostly as critique of the computerized semantics I usually like so much). Now, with Sweden (where I live) on the brink of legalizing a little Echelon of our own (or their own, as it were), it seems as good a time as any.]

.. Sometimes I wonder if what we do is just a fancy way of digging our own graves.


Echelon> Event 116822052.26#113967:
  Instant Message detected: thinking.. done (in 0.032 ms).
  Storing inferred knowledge as:
   [ foaf:mbox_sha1sum "286e3265...";
     emo:likes <urn:bbdb:1997:python> ].
  in context:
   <urn:gov:surveillance:event:116822052.26:113967>
     a :NLPResult;
     :timestamp 116822052.26032;
     :trustability 0.681132 .
  Done. Running ruleset..
   CoreBot>
     - No known subject for:
       owl:inverseFunctionalProperty
         foaf:mbox_sha1sum with "286e3265..."
       , using BNode _:a21c3dd723ffe39
       (subject of 31102 previous facts).
     - rdf:type foaf:Person inferred for _:a21c3dd723ffe39.
     - No unknown knowledge gained. Done.
   EmotionTracker>
     - increasing likability of <urn:bbdb:1997:python> .
   MicroSoftAgent>
     - increasing demand of <urn:bbdb:2006:ironpython> .
     - increasing threat of <urn:bbdb:1998:opensource> .
  Backtrack threshold at 1337
    - refactoring reasoner dependencies.. Done.
  Done (passed through 4211 reasoners).
  Done (in 0.11 ms).

Echelon> System Update:
  "Owner Change. Renaming."
  Done (in 0.002 ms).

SkyNet>

Tuesday, May 27, 2008

Representation Taxation

It seems my thoughts about Atom Entries will be delayed a while longer. This post begun life as a response to this "BlubML" post by Bill de hÓra, but quickly turned out to be about my view of data models, as used in the IT industry yesterday, today and tomorrow. And industry rather ridden with technology wars, which has reduced the status of information into mere fodder. Leaving meaning, shared and reused concepts, discoverability, integratability and hence interoperability in many ways just a far-off vision many never even have time time to think about (leaving us thrashing data in the technology trenches).

The mentioned post reflects (by quotation) upon markup and the perceived "angle bracket tax". I can definitely get that. But, as mentioned in a comment by Iain Buckingham, the issue at hand is probably mainly about what the markup is supposed to represent. (And now I totally leave the subject of lightweight text formats (which I like, by the way); I'll focus on data models, not syntax). Take the content model of XML. It is a mixture of plain text (useful for primitive data like numbers, dates and other "human language" constructs) and different structure blocks/markers (elements, attributes, pi:s, some more) who have no useful semantics apart from a label (sometimes in a namespace), ordering and nesting. This has been proven as very useful to represent documents, where documents range from books via web pages (also as application views) to vector graphics and so on.

The "defining nature" of documents is difficult to pin down, but it seems we get by fairly efficiently anyway. Mainly since this semi-structure is intended (and, incidentally, often indented) for one-way (sometimes layered) rendering of output in turn meant for human consumption, which actually works without information precision (albeit sometimes with more or less apparent negative concequences).

Documents aside, let's continue to data records. More or less the first use of XML was as a serialization of the RDF data model. This model has real semantics, where unique resources are described with statements, whose parts (subject, predicate, object) are either resources (all three) or primitive values (for objects). Unfortunately, RDF still struggles to emerge as a useful representation in the industry at large, partly due to its initial appearance as XML (from whose abundance of namespaces and literal URI:s even many XML neophytes has fled in panic), partly due to its data model being more academic (rooted in logics and AI) and thus less approachable than the (IMHO) less expressive but quite succinct object oriented, class based data model of most common modern programming languages.

This latter model has always been fragmented by language differences, and up close hard to pin down due to conflicting computer science details (OO encapsulation, coupling with function and message metaphors etc). And while objects often live in (connect as) graphs, their identity has been hidden, often being a mere memory address. Quite parallell with this, data has been pushed into the even more mechanical and implementation focused semantics of the relational model, in SQL databases, for decades. While these records do have explicit IDs (or composites, or...), they are for internal use within a given application context. Any usage beyond that must be secured by convention outside of that. The often cumbersome use of RDBs (where "splitting and slicing" all but the flattest records is needed) has been minimalistically remedied by the O/R-mapping tools in the OO languages, which has proven exceptionally useful in introspectable and/or dynamic languages. But it's mostly just a prettier surface upon the same old relational model. Signs of which can be seen by the specific restrictions (that many O/R:s have) placed upon the object semantics of the "host" language (inheritance, composition, coupling with functionality), and not the least the invention of SQL-like languages, mostly one per mapping implementation (although some credit should go to earlier efforts by the object database people for trying to standardize such things).

But there are very useful aspects of an OO-based representation. As a common denominator, JSON should be held up as an extremely useful format (albeit quite void of namespacing). It represents the "bare bones" data record format which is isomorphic with the OO languages (be it class- or prototype-based ones). And although a more narrow scope, it has more defined semantics than XML, since it explicitly differs between properties and lists (and does not introduce the artificial controversy of whether to use "attributes" or "elements"). But it's still a long way from the identifiable resources and datatyped or human language typed literals of RDF. Depending on context/domain needs, this may or may not be a painful point to realize (if not, JSON is indeed the simpler thing that actually works).

The attempt to express similar semantics in W3C:s XML Schema is to me a painful one, and in view of the complexities it adds and how little it brings in succinctness and clarity (and immediately/apparently useful general semantics) is a sign of that schema technology's failure. Use of RelaxNG isn't too bad though, and I find nothing inherently long in some formats, such as Atom, whose role is to bring a packaging and insulating channel format for resources (in the URI, REST, and most imporantly the RDF sense of the word). Atom defines a model which is scoped and simple (as in easy to grasp, limited in expressiveness). This is done with a text specification (RFC 4287 and its kin) coupled with a (non-normative) RelaxNG (compact) schema for easier machine use. For some things this can be a fitting approach. I don't believe its the best way to fully express records though (but perhaps to glean a narrow but useful aspect of such).

Now, let's turn the focus back to the objects often used as records with e.g. O/R technologies. It's been mainly through the emergence of adherence to the principles of REST, in turn (in practise) dependent upon URI:s and ideals of "cool URI:s" (for e.g. sustainable connectedness), that these objects have gained identities usable outside of a given, often ad-hoc defined, changing and quite internal (think "data silo") management of the data at hand. This process is the evolution of the Web as a data platform (which I believe should be the basis of a stable distributed computing platform, where one is needed), and it seems to me that the use of RDF is now a very promising next step. Since it embraces (indeed builds upon) this uniform and global identification scheme. Since it is "data first" oriented — you can use it to state things about resources which are meaningful even for machine processing without any invention/specification on your part. And since (due to URI:s and the semantic precision) it's less vulnerable to coupling with internal solutions, which "bare" O/R-mapped objects often are (in the same way, albeit less humongous, as automatically generated XML-schema defined API-fragments that hide in SOAP envelopes), due to implementation details (and sometimes a hard-wired and thus fragile URI minting mechanism).

With JSON, RDF shares the decoupling of types/classes ("frames" in AI) and properties of resources. But in JSON properties are simple, totally local keys, and not first-class resources themselves possible to describe in a uniform manner. And JSON has other upper limits (such as the literal language mechanism), which can't be surpassed without adding semantics, "somehow" (e.g. conventions). This may work for specific needs (and very well), but is hard to scale to the levels of interoperability needed for a proper web-based data record format. (Similar problems riddles it's ragged step-cousin microformats, which also lacks formal definitions for how to handle (especially decentrally so).)

Back in "just XML"-land, as said, it seems to work for the semi-structured logical enigmas of XML that the above mentioned "documents" are, but as for data records, the XML format will be a serialization of some very specific, elsewhere explicitly defined model (document formats as well of course, but these often make some use of the more exoteric mixed content model XML supports). One such model is the "web resources over time" that Atom works for (coupled with the HTTP infrastructure and RESTful principles). Granted, such XML-based formats can be used without worrying too much about the mysterious XML infoset model (including the RDF/XML format, its free-form character aside). But to be useful, they need to be standardized and extensively supported by the larger community. Otherwise, apart from needing to design and implement the deserialized model from scratch, the creeping complexities that an ill-thought XML model design lets in (or forever locks out any extensibility from) may emerge whenever you want to express something idiomatic to your data.

That's all. This was mostly letting out some thoughts that have been building up pressure in my mind lately. And became yet another prelude to my thoughts still simmering regarding how to effectively use Atom with payloads of RDF. And also about some of the more crude ways, if you need to gain ground quickly (but more or less imperfectly). You can do that with JSON (fairly ok, but with JSON you're quite localized anyway, so why take the Atom highway if you only need to use your bike?), microformats (not ok in by my standards (again: where's your model?)), RDFa (if you have overlapping needs it can be brilliant), or with simplistic Atom extensions (could be very XML-taxating and thus risky; but done "right" it's kind of a "poor mans RDF", a bleak but somewhat useful shadow of the real thing).

(Admittedly this post also turned out to be a rant with questionable focus, but why not. Pardon my posting in quite a state of fatigue.)

Tuesday, May 20, 2008

Memes and Principles, Intent

This is a prelude to an upcoming post where I intend to speak about Atom Entries as some kind of "simplest thing that could possibly work" (for the specific purpose of representing manifests of resources ("resource" as the R in URI (and in RDF, of course) — i.e. the resources of the web, and the mind (perhaps even "the world", but I doubt it))).

The prelude is just some thoughts about a couple of principles.

First: the simplest thing that could possibly work. A phrase often quoted and often misapplied. This is common knowledge, and many clarify things by interpreting what "possibly work" means. I won't. I'd just like to rephrase it as "the simpler thing that actually works". It's useful since it both hints that there can be more complex/complicated things (see #3 and #4 in The Zen of Python) that don't actually work (for some arbitrary meaning of "work", admittedly), and (obviously) that there are simpler things that don't work at all (less or more simple than the one that works — this is the heart of the problem). I feel this phrasing avoids the confusion the original one often causes (it just seemed to be a simpler way of saying it that still works..). So, to repeat: just say "the simpler thing that actually works". It works.

Second: just a recap of an old joke of mine regarding the principle of DRY. I very much do prefer if programming practise adheres to that one. There's too much code that suffers from MOIST. Still, I advice you to avoid to DRY until SHRIVELLED. (Those sure are acronyms. Please do hover.) This is related to the first principle above, and I mainly mean that some practises, such as heavy use of metaprogramming, simply performs too much reduction to make the intent clear even to the appropriately trained eye. Mileage will vary, but as many others have already advised: don't be too clever. It will bite others, and you as well. (Ok; honestly, my main intent was probably just to make a pun out of a perfectly sane advice.)

I will probably repeat this.

[Side note: It's funny. Just this last week I've noticed I habitually misspell "intent" as "indent". I had to correct myself about three times while writing this. Obviously my prolonged use of Python has caused these two distinct words to conflate in my mind. I never indented that.]

Wednesday, April 16, 2008

Getting lxml 2 into my gnarly old Tiger; going on ROA

In case anyone might google for something like:

lxml "Symbol not found" _xmlSchematronNewParserCtxt

, I'd like to reiterate the steps I just took to get lxml 2 (2.1 beta 1 to be precise) up and running on OS X 10.4 (which I haven't yet purged in favour of the shiny Leopard for reasons of fear, uncertainty and doubt. And while pleasurable, quite mixed impressions from that cat when running my wicked new iMac (and my funky old iBook, in case you're dying to know)).

In short: you need a fresh MacPorts (1.6, I had 1.3.something):

$ sudo port selfupdate

, and then, wipe your current libxml2 (along with any dependent libraries, MacPorts will tell you which):

$ sudo port uninstall libxml2@2.6.23_0

(You might not need the version number; I had tried to get the new libxml2 before this so there was an Activating libxml2 2.6.31_0 failed: Image error: Another version of this port thingy going on.) Then just:

$ sudo port install libxml2 libxslt

and you'll be ready to do the final (after removing any misbehaving lxml (compiled with the old libxml2) from site-packages) :

$ sudo easy_install lxml

But why?

Because lxml 2 is a marvellous thing for churning through XML and HTML with Python. There was always XPath and XSLT, C14N in lxml 1.x too (admittedly also in 4Suite as well; Python has had strong XML support for many, many years). But in lxml 2, you also get:

CSS selectors
much improved "bad HTML" support (including BeautifulSoup integration)
a relaxed comparison mechanism for XML and HTML in doctests!

And more, I'm sure.

So, why all this Yak Shaving tonight? Just this afternoon I put together an lxml and httplib2-based "REST test" thing. Doesn't sound too exciting? I think it may be, since I use it with the old but still insanely modern doctest module, namely its support for executing plain text documents as full-fledged tests. This gives the possibility (from a tiny amount of code), to run plain text specifications for ROA-apps with a single command:

Set up namespaces:

 >>> xmlns("http://www.w3.org/1999/xhtml")
 >>> xmlns(a="http://www.w3.org/2005/Atom")

We have a feed:

 >>> http_get("/feed",
 ...     xpath("/a:feed/a:link[@rel='next-archive']"))
 ...
 200 OK
 Content-Type: application/atom+xml;feed
 [XPath 1 matched]

Retrieving a document entry with content-negotiation:

 >>> http_get("/publ/sfs/1999:175",
 ...         headers={"Accepts": "application/pdf"},
 ...         follow=True)
 ...
 303 See Other
 Location: /publ/sfs/1999:175/pdf,sv
 [Following..]
 200 OK
 Content-Type: application/pdf
 Content-Language: sv
 [Body]

That's the early design at least. Note that the above is the test. Doctest checks the actual output and complains if it differs. Declarative and simple. (And done before, of course.)

Ok, I was a bit overambitious with all the "ROA"; this is mostly for checking your content-negotiation, redirects and so on. But I think it'll be useful for a large system I'm putting together, which will (if things work out well) at its core have a well-behaved Atom-based resource service loaded with (linked) RDF payloads. Complete with feed archiving and tombstones, it will in turn be used as fodder for a SPAQRL service. (Seriously; it's almost "SOA Orchestration Done Right", if such a thing can be done.) But this I'll get back too..

(.. I may call the practice "COURT" — Crafting Organization Using Resources over Time. It's about using Atom Entries as manifests of web resources with multiple representations, fully described by RDF and working as an over-time serialized content repository update stream. (Well, that's basically what Atom is/works perfectly for, so I'm by no means inventing grand things here — grand as they IMHO are.)

Old Bad Breaks Improved by Dividing

I stopped blogging quite a while ago. Partly because I cooked up long articles about Atom-based services with RDF payloads in my head that never got written down (I shall get back to this). Partly because I use this "convert newlines to breaks" in blogspot. If I turned it off, I hadto edit all my old posts. Simple in itself, but I couldn't bear touching those old "modified" times.

So all my posts became these horribly formatted blocks of old, stale, ill-parsable html.

I don't care anymore. I'll use this to post random stuff, and put proper articles (if any) at Neverspace.net or Oort.to. And in a couple of decades from now, move my blogging to a a construct of my own, full of Atom-based services with RDF payloads.

Oh. Just-in-Time Update: Switching to edit-mode in the blogspot editor I just realize that there are no horrible little bloody breaks anymore. They are div:s now! Perhaps they could even become p:s. Nice! Thank you blogspot/google people! Then I don't have to move, since apart from RDF payloads, all Google services are indeed Atom-based (RSS 2.0 too, but I don't care for that).

(Thus I shall "soon" attempt to post from Vim to here, possibly with the python-based GData stuff I haven't been using for anything productive yet. Or some of the other TODO:s that clog my creative pipe far too often.)

Now for another post, possibly with some useful content in it as well.

Out-of-Sync Update: Typically, those pesky breaks still show up. Well, at least they are funnily isolated in div:s too..