
Saturday, September 27, 2008

Resources, Manifests, Contexts

Just took a quick look at oEmbed (found via a link from Emil Stenström, an excellent Django promoter in my surroundings, by the way. Kudos.).

While oEmbed is certainly quite neat, I very much agree with the criticism regarding the lack of RESTfulness, and that they have defined a new metadata carrier. I think oEmbed would work very well as an extension element in Atom Entry documents (which already have most of the properties oEmbed (re-)defines). Or by reusing, in such Atom entry docs, e.g. Media RSS, as Stephen Weber suggested.

Granted, if (as I do hope) RESTfulness and Atom permeation of the web become much better established (approaching ubiquity), this would be dead easy to define further down the line. (And signs of this adoption continue to pop up, even involving the gargantuans..)

But since it wasn't done right away, oEmbed is to some extent another part of the fragmented web data infrastructure — already in dire need of unification. It's not terrible, of course; JSON is very effective — it's just too context-dependent and stripped-down to work for much more than end-user consumption in "vertical" scenarios. While oEmbed itself is such a scenario, it could very well piggy-back on a more reusable format and thus promote much wider data usability.

A mockup (with unsolicited URI minting in the spaces of others) based on the oEmbed quick example could look like:

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:oembed="http://oembed.com/ns/2008/atom/">
  <id>tag:flickr.com,2008:/3123/2341623661_7c99f48bbf_m.jpg</id>
  <title>ZB8T0193</title>
  <summary></summary>
  <content src="http://farm4.static.flickr.com/3123/2341623661_7c99f48bbf_m.jpg"
           type="image/jpeg"/>
  <oembed:photo version="1.0" width="240" height="160"/>
  <author>
    <name>Bees</name>
    <uri>http://www.flickr.com/photos/bees/</uri>
  </author>
  <source>
    <id>tag:flickr.com,2008:/feed</id>
    <author>
      <name>Flickr</name>
      <uri>http://www.flickr.com/</uri>
    </author>
  </source>
</entry>

The main point, which I have mentioned before, is that Atom Entries work extremely well as manifests of resources. This is something I hope the REST community will pick up in a large way. Atom feeds complement the RESTful infrastructure by defining a standard format for resource collections, and from that it seems quite natural to expose manifests of singular resources as well using the same format.

In case you're wondering: no, I still believe in RDF. It's just easier to sell uniformity one step at a time, and RDF is unfortunately still not well known in the instrumental service shops I've come in contact with (you know, the ones where integration projects pop up every so often, mainly involve hard technology, and rarely if ever reuse domain knowledge properly). So I choose to support Atom adoption to increase resource orientation and uniformity — we can continue on to RDF if these principles continue to gain momentum (which they will, I'm sure).

Thus I also think we should keep defining the bridge(s) from Atom to RDF for the 3.0 web.. There are some sizzling activities in that respect, which can be seen both on the Atom syntax mailing list and the semantic web list. My interest stems from what I currently do at work (and as a hobby, it seems), albeit from a very instrumental perspective — and as a complement rather than an actual bridge.

In part, it's about making Atom entries from RDF, so that simple RESTful consumers can eat some specific Atom crumbs from the semantic cakes I'm most certainly keeping (the best thing since croutons, no doubt). These entries aren't complete mappings, only selected parts: semantically more coarse-grained and ambiguous. While ambiguity corrupts data (making integration a nightmare), it is used effectively in "lower-case sem-web" things such as tagging and JSON. (Admittedly, I suppose it's ontologically and cognitively questionable whether it can ever be fully avoided.)
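To make that concrete, here's a rough sketch of how such a partial mapping could be done with rdflib. The property choices (Dublin Core terms), the file name and the resource URI are just illustrative assumptions for the example, not the actual mapping:

from xml.sax.saxutils import escape

from rdflib import Graph, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")

def atom_entry_from(graph, subject):
    # Pick a few selected statements about the subject and render them
    # as a (partial, deliberately lossy) Atom entry.
    title = graph.value(subject, DCT.title) or ""
    issued = graph.value(subject, DCT.issued) or ""
    return """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>%s</id>
  <title>%s</title>
  <updated>%s</updated>
</entry>""" % (escape(str(subject)), escape(str(title)), escape(str(issued)))

g = Graph()
g.parse("some-resource.rdf")  # assumed RDF/XML description
print(atom_entry_from(g, URIRef("http://example.org/publ/thing/1")))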

We have proper RDF at the core, so this is about meeting "half way", with the gist of keeping things simple without losing data quality in the process. To reduce and contextualize for common services — that is, at the service level, not the resource level. (I called this "RA/SA decoupling" somewhere, for "Resource Application"/"Service Application". Ah well, this will all be clarified when I write down the COURT manifesto ("Crafting Organisation Using Resources over Time"). :D)

Hopefully, this Atom-from-RDF stuff will be reusable enough to be part of my Out of RDF Transmogrifier work. Which (in my private lab) has been expanded beyond Python, currently with a simple JavaScript version of the core mapping method ("soonish" to be published). On top of that I'm aiming for a pure js-based "record editor", ported from a (py-)Oort-based prototype from last year. I also hope the service parts of my daytime work may become reusable and open-sourced as well in the coming months. The future will tell.

Wednesday, April 16, 2008

Getting lxml 2 into my gnarly old Tiger; going on ROA

In case anyone might google for something like:
lxml "Symbol not found" _xmlSchematronNewParserCtxt

, I'd like to reiterate the steps I just took to get lxml 2 (2.1 beta 1, to be precise) up and running on OS X 10.4 (which I haven't yet purged in favour of the shiny Leopard, for reasons of fear, uncertainty and doubt, and because of the pleasurable but quite mixed impressions that cat has made running on my wicked new iMac (and my funky old iBook, in case you're dying to know)).

In short: you need a fresh MacPorts (1.6, I had 1.3.something):

$ sudo port selfupdate

, and then, wipe your current libxml2 (along with any dependent libraries, MacPorts will tell you which):

$ sudo port uninstall libxml2@2.6.23_0

(You might not need the version number; I had tried to get the new libxml2 before this, so there was an "Activating libxml2 2.6.31_0 failed: Image error: Another version of this port" thingy going on.) Then just:

$ sudo port install libxml2 libxslt

and you'll be ready for the final step (after removing any misbehaving lxml (compiled with the old libxml2) from site-packages):

$ sudo easy_install lxml

But why?

Because lxml 2 is a marvellous thing for churning through XML and HTML with Python. XPath, XSLT and C14N were already there in lxml 1.x (and in 4Suite too; Python has had strong XML support for many, many years). But in lxml 2, you also get:
  • CSS selectors
  • much improved "bad HTML" support (including BeautifulSoup integration)
  • a relaxed comparison mechanism for XML and HTML in doctests!
And more, I'm sure.
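A tiny taste of the first two (the scrap of bad markup here is made up, of course):

from lxml import html

# Parse some decidedly broken HTML; lxml 2 repairs it on the fly.
doc = html.fromstring("<p class=intro>Hello <b>world<p>second para")
for p in doc.cssselect("p.intro"):  # CSS selectors, new in lxml 2
    print(p.text_content())         # -> Hello world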

So, why all this Yak Shaving tonight? Just this afternoon I put together an lxml and httplib2-based "REST test" thing. Doesn't sound too exciting? I think it may be, since I use it with the old but still insanely modern doctest module, namely its support for executing plain text documents as full-fledged tests. This gives the possibility (from a tiny amount of code) to run plain text specifications for ROA apps with a single command:
Set up namespaces:

>>> xmlns("http://www.w3.org/1999/xhtml")
>>> xmlns(a="http://www.w3.org/2005/Atom")

We have a feed:

>>> http_get("/feed",
... xpath("/a:feed/a:link[@rel='next-archive']"))
...
200 OK
Content-Type: application/atom+xml;type=feed
[XPath 1 matched]

Retrieving a document entry with content-negotiation:

>>> http_get("/publ/sfs/1999:175",
... headers={"Accept": "application/pdf"},
... follow=True)
...
303 See Other
Location: /publ/sfs/1999:175/pdf,sv
[Following..]
200 OK
Content-Type: application/pdf
Content-Language: sv
[Body]

That's the early design, at least. Note that the above is the test: doctest checks the actual output and complains if it differs. Declarative and simple. (And done before, of course.)
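For the curious, here's roughly how such helpers could be wired together with httplib2 and lxml. This is only a sketch of my own making: the base URL, the exact printing, and the spec.txt filename are placeholder assumptions, and the real thing will surely differ:

import doctest
import httplib2
from lxml import etree

BASE = "http://localhost:8080"  # assumed base URL of the app under test
NAMESPACES = {}

def xmlns(default=None, **prefixes):
    # Register namespace prefixes for later XPath checks. (A default
    # namespace isn't usable in XPath 1.0, so it's only remembered.)
    if default:
        NAMESPACES["_default"] = default
    NAMESPACES.update(prefixes)

def xpath(expr):
    # Just pass the expression along; http_get does the matching.
    return expr

def http_get(path, check=None, headers=None, follow=False):
    # GET the resource and print status, interesting headers and the
    # result of any XPath check, in the shape the doctests expect.
    h = httplib2.Http()
    h.follow_redirects = False
    resp, body = h.request(BASE + path, "GET", headers=headers or {})
    print("%s %s" % (resp.status, resp.reason))
    if "location" in resp:
        print("Location: %s" % resp["location"])
    if follow and resp.status in (301, 302, 303, 307):
        print("[Following..]")
        resp, body = h.request(BASE + resp["location"], "GET",
                               headers=headers or {})
        print("%s %s" % (resp.status, resp.reason))
    for name in ("content-type", "content-language"):
        if name in resp:
            print("%s: %s" % (name.title(), resp[name]))
    if check:
        prefixed = dict((k, v) for k, v in NAMESPACES.items()
                        if k != "_default")
        matched = etree.fromstring(body).xpath(check, namespaces=prefixed)
        print("[XPath %d matched]" % len(matched))
    elif body:
        print("[Body]")

if __name__ == "__main__":
    # Run the plain text specification as a test suite.
    doctest.testfile("spec.txt", globs=globals())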

Ok, I was a bit overambitious with all the "ROA"; this is mostly for checking your content negotiation, redirects and so on. But I think it'll be useful for a large system I'm putting together, which will (if things work out well) at its core have a well-behaved Atom-based resource service loaded with (linked) RDF payloads. Complete with feed archiving and tombstones, it will in turn be used as fodder for a SPARQL service. (Seriously; it's almost "SOA Orchestration Done Right", if such a thing can be done.) But this I'll get back to..

(.. I may call the practice "COURT" — Crafting Organization Using Resources over Time. It's about using Atom Entries as manifests of web resources with multiple representations, fully described by RDF and working as an over-time serialized content repository update stream. (Well, that's basically what Atom is/works perfectly for, so I'm by no means inventing grand things here — grand as they IMHO are.))

Wednesday, September 19, 2007

A Mind must Become be4 it can Go

Now, I do not understand why the Web 4.0 page at Wikipedia has been protected to prevent creation. Is it the fear of the machine? Because as everyone must know, Web 4.0 is the peak of our civilization, when mankind will unite in celebration, marvelling at our own magnificence as we give birth to AI. The Web at 4.0 will also affectionately be known as "Web H4L", to honor its predecessor, born on January 12, 1997 at the HAL Plant in Urbana, Illinois.

Thursday, May 10, 2007

Point of Data

(Read all of this, it is a Zen Koan.)

Been thinking about data lately (possibly "in" as well). And information, content, metadata, meaning, knowledge, taxonomies, ontologies.. Ontology. Flashbacks of never-ending philosophical debates, epistemology, Plato's Ideals, empiricism, positivism, all of that. Well not so much of the latter really, I guess I've learnt to recognize mental swamps before I go trekking nowadays.

But data. I think a little clarification could be in order. I'd say it goes like:

Data
Particulars (atoms, if you will) that compose our impressions. Not the world, but that stuff from which we "get the world".
Information
I just quote Gregory Bateson: "a difference that makes a difference". The part of data we can use, as opposed to "noise" or "void".
Content
A tricky thing. "Composited information with a bound context" perhaps, with varying (often hard to measure) complexity.
Metadata
Let's say "added bits used to externally correlate content". I'm not much of a fan of the word anymore. I may deprecate it in favour of just "context-providing statements". The stuff that turns data we don't get into data we get.

I often just call content and metadata "data in" and "data about" nowadays.

Of Content

Content is perhaps the most widely used and least defined stuff. It's abundant, the substance which we structure (and in so doing "contentify" further). It is the composition from which richer meaning can be derived, by virtue of its having a context. This article is "content". That last statement is information (and a reification, but I digress). Possibly "metadata". Now it's just getting funny. Anyway, content is stuff which can be molded to gain shapes and shades, color and tone; somewhat "synergistic" effects which may or may not add meaning — more often than not depending on cultural aspects (part of the implicit context).

It is stuff which we leave semi-formal at best, information we can hardly process with anything less than our own neural networks (brains). Somewhat pessimistically perhaps: information which, during interpretation, gains illusory qualities (akin to optical illusions) that either enhance or corrupt the bringing of meaning.

I won't get into information theory or semantic discussions. Nor into techniques for "fluffing" expressions to become more sympathetic, making the receiver prone to interpret them as "meaningful".

To the Point

If you're still reading, this is the part where I intend to make a point. Binding the context by which this becomes meaningful.

This content bears no inherent meaning. It's all in your head. Perchance it may have conveyed a structure which made sense, put your mind into a state of "ah, ok". I really just needed to externalize that first part of categorizing some terms that continue to come up in discussions, and for which I needed some binding interpretations.

Perhaps you detected an "anomaly" if you tried to interpret the terms in a hierarchical fashion. "Data in" and "data about" was my differentiation of Content and Metadata, seemingly "instances of Information". But Data isn't necessarily Information. The thing is, that "difference that makes a difference" is where "meaning", "semantic relevance" and "precise form" come in, and what that is, my current state of mind can only grasp by intuition.

We use Content, and the more fine-grained and "to the point" context-providing statements (Metadata): one as compositions within a context, the other to bind both particulars and contexts by means of relations and characteristics. Having a feel for the difference is important, for it is the key to understanding why Knowledge Representation is the missing piece in many Information Technology issues today.

And this is where How To Tell Stuff To a Computer, and then The Semantic Web FAQ, come in as my recommended reading of the week, to get you going. Those sources of information will hopefully make the difference that eluded you here.

That's the point.

(Or is it?)

Friday, January 26, 2007

A Great Day for Specificity

dbpedia.org - Using Wikipedia as a Web Database
[...] dbpedia can also be seen as a huge ontology that assigns URIs to plenty of concepts and backs these URIs with dereferencable RDF descriptions.
We have advanced a tech level!

This is really good timing, as I just recently considered using TDB URNs for referencing non-retrievable things and concepts (using TDB versions of URLs to Wikipedia articles and web sites). Finding out whether the idea of TDB and DURI URNs has long since been abandoned was my next step down that path. (Then there's the use of owl:InverseFunctionalProperty and bnodes (or "just-in-time"-invented ad-hoc URIs), throwing in owl:sameAs if official ones are discovered..)

With DBpedia, a lot of that becomes unnecessary. The availability of public, durable URIs for common concepts will surely ease the integration of disparate sources of knowledge. That is, if we start to use DBpedia URIs in our statements.
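For illustration (the personal URI and the use of foaf:topic_interest are just assumptions for this example), such a statement could be made like this with rdflib:

from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DBPEDIA = Namespace("http://dbpedia.org/resource/")

g = Graph()
me = URIRef("http://example.org/people/me")  # assumed personal URI
# State an interest in the concept itself, not a word or a web page:
g.add((me, FOAF.topic_interest, DBPEDIA["Semantics"]))
print(g.serialize(format="xml"))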

And gone will be the strangeness of saying "I'm interested in [the word 'semantics']", "This text was edited with [the website <http://vim.org>]" and "I was born in [the wikipedia article about 'Stockholm']"..