Wednesday, January 17, 2007

Knowledge, The Bits and Pieces

In a recent post by Lee Feigenbaum, he talks about Using RDF on the Web. Naturally I find this very interesting. In my work on the Oort toolkit, I use an approach of "removing dimensions": namespaces, I18N (optionally), RDF-specific distinctions (collections vs. multiple properties) and other forms of graph traversing. This is done with declarative programming very similar to ORM-tools in dynamic languages (mainly class declarations with attributes describing desired data selection). The resulting objects become simple value trees — with the added bonus of automatic JSON-serialization.
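To make the "removing dimensions" idea concrete, here is a minimal sketch in plain Python. It is not Oort's actual API: the `Selector`, `Facet`, `Person` and `Friend` names, the toy tuple-based graph and the example.org URIs are all assumptions made up for illustration. It shows a class declaration flattening namespaces, language tags and graph traversal into a simple value tree that serializes to JSON.

```python
import json
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    """Toy literal with an optional language tag (illustrative only)."""
    value: str
    lang: Optional[str] = None

FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"

# Toy graph: a set of (subject, predicate, object) tuples.
GRAPH = {
    (EX + "niklas", FOAF + "name", Literal("Niklas")),
    (EX + "niklas", FOAF + "interest", Literal("kunskap", "sv")),
    (EX + "niklas", FOAF + "interest", Literal("knowledge", "en")),
    (EX + "niklas", FOAF + "knows", EX + "lee"),
    (EX + "lee", FOAF + "name", Literal("Lee")),
}

class Selector:
    """Declares one attribute: which property to follow and how to flatten it."""
    def __init__(self, predicate, many=False, localized=False, facet=None):
        self.predicate, self.many = predicate, many
        self.localized, self.facet = localized, facet

    def select(self, graph, subject, lang):
        objects = sorted((o for s, p, o in graph
                          if s == subject and p == self.predicate), key=repr)
        if self.localized:
            # Drop the I18N dimension: keep only the requested language.
            objects = [o for o in objects
                       if isinstance(o, Literal) and o.lang == lang]
        values = [self._flatten(graph, o, lang) for o in objects]
        if self.many:
            return values
        return values[0] if values else None

    def _flatten(self, graph, obj, lang):
        if self.facet is not None:
            # Traverse the graph: describe the object with a nested facet.
            return self.facet(graph, obj, lang).to_dict()
        return obj.value if isinstance(obj, Literal) else obj

class Facet:
    """Wraps a graph, a resource and a language; yields a plain value tree."""
    def __init__(self, graph, subject, lang="en"):
        self.graph, self.subject, self.lang = graph, subject, lang

    def to_dict(self):
        tree = {"uri": self.subject}
        for name, selector in vars(type(self)).items():
            if isinstance(selector, Selector):
                tree[name] = selector.select(self.graph, self.subject, self.lang)
        return tree

class Friend(Facet):
    name = Selector(FOAF + "name")

class Person(Facet):
    name = Selector(FOAF + "name")
    interest = Selector(FOAF + "interest", localized=True)
    knows = Selector(FOAF + "knows", many=True, facet=Friend)

person = Person(GRAPH, EX + "niklas", lang="en")
as_json = json.dumps(person.to_dict(), sort_keys=True)
```

The declared class holds no data of its own; it only describes which slice of the graph to select, which is what makes the ORM comparison apt.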

Ideally, this approach will be deterministically reversible. I have not yet implemented it, but the idea is that the declared classes (RdfQueries in Oort — let's call them "facets" here) could take JSON as input and reproduce triples. Using checksums of the JSON would make over-the-web editing possible.
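A rough sketch of what such a round trip might look like follows. Since this is not implemented in Oort, everything here is an assumption: the `PERSON_FACET` mapping, the helper names, the choice of SHA-1, and the simplification that every value is a plain string kept in a list (so that nothing is lost on the way back).

```python
import hashlib
import json

FOAF = "http://xmlns.com/foaf/0.1/"
URI = "http://example.org/niklas"

# Hypothetical facet declaration: JSON key -> full property URI.
PERSON_FACET = {"name": FOAF + "name", "nick": FOAF + "nick"}

def to_tree(graph, subject, facet):
    """Forward direction: triples -> JSON-friendly value tree."""
    tree = {"uri": subject}
    for key, predicate in facet.items():
        tree[key] = sorted(o for s, p, o in graph
                           if s == subject and p == predicate)
    return tree

def to_triples(tree, facet):
    """Reverse direction: value tree -> the triples it stands for."""
    return {(tree["uri"], facet[key], value)
            for key in facet for value in tree.get(key, [])}

def checksum(tree):
    """Digest of a canonical JSON form, handed to the client with the data."""
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def accept_edit(graph, edited_tree, token, facet):
    """Refuse the edit if the stored data changed since the client fetched it."""
    current = to_tree(graph, edited_tree["uri"], facet)
    if checksum(current) != token:
        raise ValueError("stale edit: the data changed since it was fetched")
    return to_triples(edited_tree, facet)

graph = {
    (URI, FOAF + "name", "Niklas"),
    (URI, FOAF + "nick", "nik"),
    (URI, FOAF + "nick", "niklasl"),
}
tree = to_tree(graph, URI, PERSON_FACET)
token = checksum(tree)

# A client edits the tree and sends it back together with the token.
edited = dict(tree, name=["Niklas L."])
new_triples = accept_edit(graph, edited, token, PERSON_FACET)
```

The checksum only guards against editing stale data; what to do with `new_triples` once accepted is a separate question.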

Since the task of "updating" a subgraph is somewhat difficult at best, I think a basic "wipe and replace" approach may be simplest. There are many dangers here of course (removing a relation must not remove knowledge about its object — unless perhaps if that object is a bnode..).
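As an illustration of that caveat, here is a small sketch of a "wipe and replace" over a toy set of triples. The `BNode` marker class, the function name and the example.org data are assumptions for the sake of the example. Statements about a named object are left alone, while statements about a blank node are dropped once nothing refers to it any more.

```python
class BNode(str):
    """Marker type for blank node identifiers (illustrative only)."""

def wipe_and_replace(graph, subject, predicates, new_triples):
    """Replace the subject's statements for the given predicates.

    Knowledge about named objects is kept. Statements about blank nodes
    are removed only when no remaining triple points at them.
    """
    removed = {t for t in graph if t[0] == subject and t[1] in predicates}
    result = (graph - removed) | set(new_triples)
    pending = {o for _, _, o in removed if isinstance(o, BNode)}
    while pending:
        node = pending.pop()
        if any(o == node for _, _, o in result):
            continue  # still referenced somewhere, so keep it
        about_node = {t for t in result if t[0] == node}
        result -= about_node
        # Blank nodes reachable only through this one may be orphaned too.
        pending |= {o for _, _, o in about_node if isinstance(o, BNode)}
    return result

EX = "http://example.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
addr = BNode("_:addr1")

graph = {
    (EX + "niklas", FOAF + "knows", EX + "lee"),
    (EX + "lee", FOAF + "name", "Lee"),
    (EX + "niklas", EX + "address", addr),
    (addr, EX + "city", "Stockholm"),
}

updated = wipe_and_replace(
    graph,
    EX + "niklas",
    {FOAF + "knows", EX + "address"},
    [(EX + "niklas", FOAF + "knows", EX + "dan")],
)
```

Here the relation to Lee is removed but Lee's name survives, whereas the address blank node disappears along with its city.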

Although all of this is Python to the core now, nothing in the design (declarative as it is) prevents the approach from being more general. Indeed, such facets, were they themselves serializable, could be used for structured content retrieval over the web too. OK, maybe I'm reinventing SPARQL now.. Or should I use SPARQL for their remote execution? It seems reasonable (I mean, that's exactly how ORMs use SQL).

Now, I seem to end up in the RPC/RESTful camp with this. A solution to that could be: use the facets on the client, having them use SPARQL for retrieval. Then you have clients working against any kind of SPARQL endpoint, mashup-style. Still, if facets are completely reversible, they may be powerful, aware tools, and perhaps an alternative to SPARQL in intercommunication? That's a pipe dream for me right now though.
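One way a client-side facet could drive SPARQL retrieval is to generate the query from the declaration. The sketch below only builds the query string; the `PERSON_FACET` mapping and `facet_to_sparql` are hypothetical names, the keys are assumed to be valid SPARQL variable names, and no escaping of URIs is attempted.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical facet declaration: result key -> full property URI.
PERSON_FACET = {"name": FOAF + "name", "homepage": FOAF + "homepage"}

def facet_to_sparql(subject, facet):
    """Build a SELECT query fetching each declared property of the subject."""
    variables = " ".join("?" + key for key in facet)
    # OPTIONAL keeps the result row even when a property is missing.
    patterns = "\n".join(
        "  OPTIONAL { <%s> <%s> ?%s }" % (subject, predicate, key)
        for key, predicate in facet.items()
    )
    return "SELECT %s WHERE {\n%s\n}" % (variables, patterns)

query = facet_to_sparql("http://example.org/niklas", PERSON_FACET)
```

The resulting string could then be sent to any SPARQL endpoint, and the variable bindings mapped back onto the facet's keys.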

The SPARQL-way is of course only for reading data. Fine, the RDF model may be too expressive (fine-grained) to be practical for direct over-the-web editing in specific situations anyway. A confined approach such as this JSON+checksums+"careful with nested knowledge" may be better for this.

I think of JSON-RPC here as I view e.g. microformats with GRDDL — it's leverage, not a final solution. RDF models are the ideal, we may just need reversible simplifications to gain mass usability. I touched upon JSON for integrated apps, but stuff like Atom for "talking to strangers" in response to a previous post by Lee. His post which I refer to above hints at better stuff than just Atom if we want RDF all the way also in more specific apps.

So, what I'm saying is really: I also desire the best of two worlds. The RDF model is sound and extremely valuable, but after all, simple domain-specific object manipulation is what makes the web go round. A solution may be some form of O/R mapping for RDF. The difference is not wanting to forget the data layer (there is no monster in the closet as there is with raw SQL..), just streamline some of the work with it.

Let's hope 2007 will be a good year for the knowledge revolution.

5 comments:

L said...

Thanks for the follow-up and thoughts, Niklas. I agree with everything you have to say here--in particular, even though I approached the survey from a client-side point of view, the general ideas are applicable in a traditional servlet/CGI/python/ruby/perl/java environment.

We've tried RDF ORM around here by developing and using Jastor ( http://jastor.sourceforge.net/ ) with mixed results. I think ActiveRDF has a great deal of promise for the Ruby community; it's still under quite a bit of active development.

The subgraph problem is one that fascinates me and I think is closely related to the second part of my post (which, hmm, I should be writing right about now :-) ). RDF update is where a great deal of the challenge lies, especially in figuring out how much we can simplify the model while still keeping it invertible.

Blank nodes tend to complicate matters; I like to ignore them when I can. :-)

Lee

Niklas Lindström said...

Thanks for the reply!

Yes, I recall looking at Jastor previously. It's just that the static shackles of Java encumber me so.. ;) It is a good thing to have around though, I never know when I might need Jena (or IBM-SLRP).

Furthermore, I basically ignore metainfo about the model (i.e. RDFS+OWL) in the "facet" approach (well, they themselves become little schemas, just as in dynamic ORMs). I'd say it's because I want a "data first" approach, although one could blame me for ignoring the appropriateness of OWL.. Still, resources can have limitless statements about them, which is why I used the "facet" metaphor - it's all about a certain slice of a multifaceted, sometimes enormous information graph. And this before even thinking about inference..

Indeed, ActiveRDF looks really good. The rdfview parts of Oort resemble it to some extent, just as they resemble Sparta. I guess this hints at the need for simplification. Working with proper triples (and SPARQL) is powerful stuff, but as said, often overly fine-grained for domain-specific daily work.

The thing with blank nodes may be that while they are unidentified, they most definitely have a unique identity.. :)

Looking forward to your second part!

Anonymous said...

im not sure why this 'monster in the closet' you speak of with SQL isnt also applicable to RDF - the only huge triple-stores, swh's and bigOWLIM, aren't open source, so we're stuck with SQL-layered ones for now - the queries ive seen Redland generate are pretty nasty..

i dont think you need ORM for domain-specific tools. ive just got utility functions that make working with pure triples as simple as possible - lessening the overhead of object-creation in scripting language space...

Anonymous said...

is there a language that can mirror exactly RDF's model with objects having multiple parent classes, exact equivalence with other objects, etc - haskell?

SQL requires ORM, as it has no innate object model. RDF on the other hand is already a model - so a mediocre duplication of rdf objects as programming-language-of-choice objects is bound to be imperfect and inefficient

..i admire ActiveRDF for its effort, but the need to bypass it and get at the raw toolkit (Redland) so much, and not wanting to solidify things in actual code (eg, storing view lenses as RDF, instead of hardcoding a domain specific view as FOAF::Person.new() etc), meant it was an unnecessary middleman between the user and the model behind its model..

Niklas Lindström said...

Thanks for the comments! I'll try to reply to both at once here.

The monster I thought of regarding SQL was mostly the artificial model with its lack of (knowledge) semantics. The RDF core — the triples — is very simple (a good thing). Their intent of representing statements — as opposed to say "relations" — adds a lot of power. This in unison with URI:s and the open-ended typed or localized literals make RDF truly sound. Upon this we have RDFS to model classes and properties (and then OWL upon that), which becomes the model that presumably makes an O/R approach feel less correct.

Implementations may well be a little scary today though, I agree with that. For instance, about RDF graphs having SQL backends.. In theory, I don't mind that more than I do Python being implemented in C. In practise, perhaps the way we work with RDF may be ill-served by how the RDBMS:s work. I'm guessing Berkeley DB is a better fit. I'm not educated enough to assert much in this space though.

And now, regarding shoving RDF semantics into lesser programming language semantics. In this case, it is somewhat about "dumbing down" stuff. I just don't think that imperfection is necessarily inefficient (whereas the striving for perfection often is). I hesitated about the equation with O/R, since my approach rests upon the better model of the RDF itself. To be lazy, I'll just say I feel that an "object wrapping" solution is adhering to "less is more", DRY, hiding details and whatnot.

To get specific. I did go with utility functions for a while, but I felt a shortcoming in that I exposed the complexity to layers (the view) where I didn't need them. More than that, I consider the wrapping of a state (the graph, the resource in question, current language) bundled with utility selectors to be more contained (and OO, sure). Object instantiation in Python isn't expensive (oh I feel the birds of prey sweeping down now) and I wouldn't consider such implementation details in this case (unless it was a daily obstacle, but then I'd consider a tool switch).

I should dig deeper into ActiveRDF perhaps, I'm unfamiliar with these lenses for instance. One thing I do differently I think, is that I directly associate namespaces (vocabularies) with individual properties. Hence, a "Person" in Oort can easily have attributes from many different vocabularies, and also aliases, composite properties, filters etcetera. These latter things are a dangerous thing in regard to reversibility, but they are convenient tools in the upper layers. Neither do I use RDF types as "types" are used in OO/ORMs/"Web2.0"-frameworks. I match on the type of the requested resource, and apply different "aspects" of a display depending on that. These aspects then use the "facet" mechanism to retrieve the parts of interest. All in the name of experimentation for now. It works great for my use cases at least. I pay some attention to subtyping, but little to multiple type values (here Oort is a stochastic can of worms).

Admittedly, dynamic languages come with a trade-off. I would like a language where the RDF model was inherent. Say something where Versa was syntactically and semantically valid. Haskell isn't a bad suggestion; I did mention that whereas my wish for "Web 3.0" would be Python+RDF, "4.0" could do better with say OCaml+OWL.. ;) Still, exact mirroring in an existing language would be by chance, and this coincidence of semantics may diverge at any time. Anyway, I'm doubtful if it would be enough to have a rich language well-suited for RDF. Many more correct programming languages haven't prospered just because they're more difficult to fit in your brain. Some would say RDF is similarly cursed. I wouldn't agree per se, but regardless, I do think making it simple to work with in domain-specific situations may serve it better than having its richness explicit in every case.

I do feel a lot can be done in this field. To paraphrase myself: I don't want to forget the RDF data layer (at all), just streamline some of the work with it. I take your objections to the suggested approach seriously; I'm not adamant about it (though I'll likely stay on course for now).