Wednesday, April 16, 2008

Getting lxml 2 into my gnarly old Tiger; going on ROA

In case anyone might google for something like:
lxml "Symbol not found" _xmlSchematronNewParserCtxt

, I'd like to recount the steps I just took to get lxml 2 (2.1 beta 1, to be precise) up and running on OS X 10.4, which I haven't yet purged in favour of the shiny Leopard for reasons of fear, uncertainty and doubt. (And because of pleasurable but quite mixed impressions of that cat from running it on my wicked new iMac, and on my funky old iBook, in case you're dying to know.)

In short: you need a fresh MacPorts (1.6, I had 1.3.something):

$ sudo port selfupdate

, and then wipe your current libxml2 (along with any ports that depend on it; MacPorts will tell you which):

$ sudo port uninstall libxml2@2.6.23_0

(You might not need the version number; I had tried to get the new libxml2 before this, so there was an "Activating libxml2 2.6.31_0 failed: Image error: Another version of this port" thingy going on.) Then just:

$ sudo port install libxml2 libxslt

and you'll be ready for the final step (after removing any misbehaving lxml, compiled against the old libxml2, from site-packages):

$ sudo easy_install lxml
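
To check that the rebuilt lxml really picked up the new libxml2, something like the following may help. It is only a sketch: the function name `lxml_build_report` is made up for illustration, and it assumes that a linking problem such as the "Symbol not found" one surfaces as an ImportError when importing `lxml.etree`. The version tuples it reads are the ones lxml exposes on the `etree` module.

```python
def lxml_build_report():
    """Describe which libxml2/libxslt the installed lxml is using.

    Hypothetical helper, not part of lxml itself.
    """
    try:
        from lxml import etree
    except ImportError as exc:
        # A stale build against an old libxml2 is assumed to end up here.
        return "lxml could not be imported: %s" % exc

    def dotted(version):
        # Version constants are tuples of ints, e.g. (2, 6, 31).
        return ".".join(str(part) for part in version)

    return "lxml %s, libxml2 %s (compiled against %s), libxslt %s" % (
        dotted(etree.LXML_VERSION),
        dotted(etree.LIBXML_VERSION),
        dotted(etree.LIBXML_COMPILED_VERSION),
        dotted(etree.LIBXSLT_VERSION),
    )


print(lxml_build_report())
```

If the runtime and compiled-against libxml2 versions differ wildly, or the import fails outright, the old build is probably still lurking in site-packages.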

But why?

Because lxml 2 is a marvellous thing for churning through XML and HTML with Python. XPath, XSLT and C14N were already there in lxml 1.x (and, admittedly, in 4Suite as well; Python has had strong XML support for many, many years). But in lxml 2, you also get:
  • CSS selectors
  • much improved "bad HTML" support (including BeautifulSoup integration)
  • a relaxed comparison mechanism for XML and HTML in doctests!
And more, I'm sure.

So, why all this Yak Shaving tonight? Just this afternoon I put together an lxml and httplib2-based "REST test" thing. Doesn't sound too exciting? I think it may be, since I use it with the old but still insanely modern doctest module, namely its support for executing plain text documents as full-fledged tests. With a tiny amount of code, this makes it possible to run plain text specifications for ROA-apps with a single command. Such a specification looks like this:

Set up namespaces:

>>> xmlns("http://www.w3.org/1999/xhtml")
>>> xmlns(a="http://www.w3.org/2005/Atom")

We have a feed:

>>> http_get("/feed",
... xpath("/a:feed/a:link[@rel='next-archive']"))
...
200 OK
Content-Type: application/atom+xml;type=feed
[XPath 1 matched]

Retrieving a document entry with content-negotiation:

>>> http_get("/publ/sfs/1999:175",
... headers={"Accept": "application/pdf"},
... follow=True)
...
303 See Other
Location: /publ/sfs/1999:175/pdf,sv
[Following..]
200 OK
Content-Type: application/pdf
Content-Language: sv
[Body]

That's the early design at least. Note that the above is the test. Doctest checks the actual output and complains if it differs. Declarative and simple. (And done before, of course.)
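
For the curious, the doctest side of this can be sketched with nothing but the standard library. This is not my actual tool: the `http_get` below is a stand-in that prints canned responses from a hypothetical `FAKE_SERVER` table instead of talking HTTP via httplib2, and `run_spec` is a made-up name. The point is only to show how a plain text specification becomes a test.

```python
import doctest

# A tiny plain text specification; the prose between examples is ignored.
SPEC = '''
We have a feed:

>>> http_get("/feed")
200 OK
Content-Type: application/atom+xml

A missing resource:

>>> http_get("/nope")
404 Not Found
'''

# Hypothetical canned responses standing in for a real server.
FAKE_SERVER = {
    "/feed": (200, "OK", {"Content-Type": "application/atom+xml"}),
}


def http_get(path):
    """Print the status line and headers, so doctest can compare them."""
    status, reason, headers = FAKE_SERVER.get(path, (404, "Not Found", {}))
    print("%d %s" % (status, reason))
    for name in sorted(headers):
        print("%s: %s" % (name, headers[name]))


def run_spec(text, name="spec"):
    """Run the examples in `text` with http_get available as a global."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {"http_get": http_get}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # run() returns a result with .failed and .attempted counts.
    return runner.run(test)


results = run_spec(SPEC)
print("failed=%d attempted=%d" % (results.failed, results.attempted))
```

For a specification kept in a file, `doctest.testfile` with a `globs` argument does much the same in one call.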

Ok, I was a bit overambitious with all the "ROA"; this is mostly for checking your content-negotiation, redirects and so on. But I think it'll be useful for a large system I'm putting together, which will (if things work out well) at its core have a well-behaved Atom-based resource service loaded with (linked) RDF payloads. Complete with feed archiving and tombstones, it will in turn be used as fodder for a SPARQL service. (Seriously; it's almost "SOA Orchestration Done Right", if such a thing can be done.) But this I'll get back to..

(.. I may call the practice "COURT": Crafting Organization Using Resources over Time. It's about using Atom Entries as manifests of web resources with multiple representations, fully described by RDF and working as an over-time serialized content repository update stream. (Well, that's basically what Atom is/works perfectly for, so I'm by no means inventing grand things here, grand as they IMHO are.))

Old Bad Breaks Improved by Dividing

I stopped blogging quite a while ago. Partly because I cooked up long articles about Atom-based services with RDF payloads in my head that never got written down (I shall get back to this). Partly because I use this "convert newlines to breaks" setting in blogspot. If I turned it off, I would have had to edit all my old posts. Simple in itself, but I couldn't bear touching those old "modified" times.

So all my posts became these horribly formatted blocks of old, stale, ill-parsable html.

I don't care anymore. I'll use this to post random stuff, and put proper articles (if any) at Neverspace.net or Oort.to. And in a couple of decades from now, move my blogging to a construct of my own, full of Atom-based services with RDF payloads.

Oh. Just-in-Time Update: Switching to edit-mode in the blogspot editor, I just realized that there are no horrible little bloody breaks anymore. They are div:s now! Perhaps they could even become p:s. Nice! Thank you blogspot/google people! Then I don't have to move, since apart from RDF payloads, all Google services are indeed Atom-based (RSS 2.0 too, but I don't care for that).

(Thus I shall "soon" attempt to post from Vim to here, possibly with the python-based GData stuff I haven't been using for anything productive yet. Or some of the other TODO:s that clog my creative pipe far too often.)

Now for another post, possibly with some useful content in it as well.

Out-of-Sync Update: Typically, those pesky breaks still show up. Well, at least they are funnily isolated in div:s too..