Ivan’s private site

January 24, 2012

Nice reading on Semantic Search

I had a great time reading a paper on Semantic Search[1]. Although the paper is about the details of a specific Semantic Web search engine (DERI’s SWSE), I read it as somebody not really familiar with all the intricate details of such a search engine’s setup and operation (i.e., I would not dare to give an opinion on whether the choices taken by this group are better or worse than those taken by the developers of other engines), wanting to gain a good picture of what is happening in general. And, for that purpose, this paper was really interesting and instructive. It is long (ca. 50 pages), so I did not even try to understand everything on my first reading, but it did give a great overall impression of what is going on.

One of the “associations” I had, maybe somewhat surprisingly, is with another paper I read lately, namely a report on basic profiles for Linked Data[2]. In that paper Nally et al. look at what “subsets” of current Semantic Web specifications could be defined, as “profiles”, for the purpose of publishing and using Linked Data. This was also a general topic at a W3C Workshop on Linked Data Patterns at the end of last year (see also the final report of the event) and it is not a secret that W3C is considering setting up a relevant Working Group in the near future. Well, the experiences of an engine like SWSE might come in very handy here. For example, SWSE uses a subset of the OWL 2 RL Profile for inferencing; that may be a good input for a possible Linked Data profile (although the differences are really minor, if one looks at the appendix of the paper that lists the rule sets the engine uses). The idea of “Authoritative Reasoning” is also interesting and possibly relevant; that approach makes a lot of pragmatic sense, and I wonder whether this is not something that should be, somehow, documented for general use. And I am sure there are more: in general, analyzing the experiences of major Semantic Web search engines on handling Linked Data might provide great input for such pragmatic work.
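To give a feel for what “a subset of the OWL 2 RL Profile for inferencing” means in practice, here is a minimal sketch of forward-chaining over just two of the RL rules (subclass and subproperty inheritance). The triples and the `ex:` vocabulary are purely illustrative, and a real engine like SWSE works at a very different scale and with a larger rule set.

```python
# Forward-chaining over two OWL 2 RL rules (cax-sco and prp-spo1).
# Triples are plain (subject, predicate, object) tuples; the data is
# illustrative, not SWSE's actual rule set or data.

def infer(triples):
    """Apply the two rules to a set of (s, p, o) triples until fixpoint."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in triples:
            if p == "rdfs:subClassOf":
                # cax-sco: x rdf:type s  =>  x rdf:type o
                for s2, p2, o2 in triples:
                    if p2 == "rdf:type" and o2 == s:
                        new.add((s2, "rdf:type", o))
            elif p == "rdfs:subPropertyOf":
                # prp-spo1: x s y  =>  x o y
                for s2, p2, o2 in triples:
                    if p2 == s:
                        new.add((s2, o, o2))
        if not new <= triples:
            triples |= new
            changed = True
    return triples

graph = {
    ("ex:Cat", "rdfs:subClassOf", "ex:Animal"),
    ("ex:felix", "rdf:type", "ex:Cat"),
}
closed = infer(graph)
assert ("ex:felix", "rdf:type", "ex:Animal") in closed
```

The point of a profile is exactly that such a rule set stays small and terminating, so the fixpoint loop above is cheap even over large data.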

I was also wondering about a very different issue. A great deal of work had to be done in SWSE on the proper handling of owl:sameAs. On the other hand, one of the recurring discussions on various mailing lists and elsewhere is whether the usage of this property is semantically o.k. or not (see, e.g., [3]). A possible alternative would be to define (beyond owl:sameAs) a set of properties borrowed from the SKOS Recommendation, like closeMatch, exactMatch, broadMatch, etc. It is almost trivial to generalize these SKOS properties for the general case but, reading this paper, I was wondering: what effect would such predicates have on search? Would they make it more complicated or, in fact, would such predicates make the life of search engines easier by providing “hints” that could be used for the user interface? Or both? Or is it already too late, because the usage of owl:sameAs is so prevalent that it is not worth touching that stuff? I do not have a clear answer at this moment…
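The difference for an engine can be sketched quite simply: owl:sameAs forces the engine to merge (“smush”) identifiers, whereas a SKOS-style closeMatch could be left in the data as a mere hint. The identifiers and the merging strategy below are illustrative, of course, not what SWSE actually does.

```python
# Sketch: owl:sameAs triggers identifier merging, skos:closeMatch does not.
# All names (ex:..., dbpedia:..., geo:...) are made up for illustration.

def smush(triples):
    """Rewrite all identifiers linked by owl:sameAs to one canonical id."""
    canon = {}
    def find(x):
        while canon.get(x, x) != x:
            x = canon[x]
        return x
    for s, p, o in triples:
        if p == "owl:sameAs":
            canon[find(o)] = find(s)
    return {(find(s), p, find(o)) for s, p, o in triples}

data = {
    ("ex:Paris", "owl:sameAs", "dbpedia:Paris"),
    ("dbpedia:Paris", "ex:population", "2200000"),
    ("ex:Paris", "skos:closeMatch", "geo:Paris_TX"),  # a hint, not merged
}
merged = smush(data)
# The population statement now hangs off the canonical identifier:
assert ("ex:Paris", "ex:population", "2200000") in merged
# The closeMatch survives untouched, available to the user interface:
assert ("ex:Paris", "skos:closeMatch", "geo:Paris_TX") in merged
```

Seen this way, the “hint” predicates would actually spare the engine work at indexing time and push the decision to the UI, which was exactly my question above.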

Thanks to the authors!

  1. A. Hogan et al., “Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine”, Journal of Web Semantics, vol. 9, no. 4, pp. 365–401, 2011.
  2. M. Nally and S. Speicher, “Toward a Basic Profile for Linked Data”, IBM developerWorks, 2011.
  3. H. Halpin et al., “When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data”, Proceedings of the International Semantic Web Conference, pp. 305–320, 2010.

October 30, 2009

ISWC2009 4-5

Filed under: Semantic Web,Work Related — Ivan Herman @ 0:59

Fourth day

Shame on me, but I missed the morning keynote… I was a bit late arriving to the conference site and I got stuck in a conversation at breakfast. Things happen…

The most notable event in the morning, at least for me, was the SPARQL WG panel. All members of the Working Group (me included) were on the panel and the room was full. I mean, full, people were standing in the back. And I regard that as a success by itself: it shows not only the overall importance of SPARQL, but also the real interest in the new version, ie, SPARQL 1.1 (in case you have missed it, the first working draft was published just a few days ago). Lee Feigenbaum (co-chair of the group) gave a quick overview of the new features and then the questions came.

The difficulty of the SPARQL 1.1 work is that it has to find a balance between what is realistic to standardize in a relatively short time frame and what would be good to see in a new query language. As a consequence, there are features that the community has discussed but that have not made it into the document, or only in a simple form. That came up during the discussion, but I had the impression that the audience, by and large, understood this balance. Actually, for some, the set of new features was even too much for an efficient implementation. I have the feeling that the WG will have to publish a separate conformance document (a bit like OWL 2 has), because there is a certain confusion on whether a conforming SPARQL implementation will have to implement, say, update or inference regimes or not. That clearly came up through the questions. Anyway, remember one email address (yes, it is a bit of a mouthful): public-rdf-dawg-comments@w3.org; this is where comments on SPARQL 1.1 have to be sent!

I chaired a session in the in-use track in the afternoon. The paper by Daniel Elenius et al. on reasoning about resources (for military exercises) was interesting to me because it was based on reasoning with relatively large OWL ontologies plus rules. The OWL ‘side’ was not very complex (Daniel referred to DLP; today I would probably say OWL 2 RL) but was extended with extra rules. What this shows is that when RIF is finished and published, the combination of OWL with RIF may become very important for tons of practical applications. (As an aside, a nice little joke from Daniel: what is the system used by the military today when planning for exercises? The system is called BOGSAT. It stands for ‘Bunch Of Guys Sitting Around a Table’…)

Roland Stuhmer gave a very different style of presentation, on how user events (clicks, combinations of clicks, etc) can be collected, categorized, integrated into an application, and analyzed with some rules for, eg, targeted ads. The system is based on harvesting not only the structure of the Web page, but also the annotations appearing in the Web page via RDFa. The result is an RDF structure describing the events that can be sent to a server, analyzed locally, distributed, etc. Nice usage of RDFa, but it is also important to have a Javascript API that can retrieve the RDF triples from the RDFa structure attached to a specific node. (B.t.w., the old graphics standards of the 80’s and 90’s, called GKS and PHIGS, had notions of combined event structures with different event types. I do not remember all the details any more, but might it be worth looking at those again in a modern setting?)
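The core of such an API, ie, pulling property/content pairs out of RDFa markup attached to a node, can be sketched with the standard library’s HTML parser; a real RDFa processor does far more (prefix and CURIE handling, chaining, typed literals), and the `about`/`property`/`content` attributes below carry made-up event data.

```python
# Minimal sketch of extracting RDFa-style triples from a page.
# A real RDFa API (Javascript in the paper's case) is far more complete;
# the ev: vocabulary here is invented for illustration.

from html.parser import HTMLParser

class RDFaSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None   # current subject, set by an 'about' attribute
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "property" in a and "content" in a and self.subject:
            self.triples.append((self.subject, a["property"], a["content"]))

page = """
<div about="http://example.org/event1">
  <span property="ev:type" content="click"></span>
  <span property="ev:target" content="#buy-button"></span>
</div>
"""
p = RDFaSketch()
p.feed(page)
assert ("http://example.org/event1", "ev:type", "click") in p.triples
```

The resulting list of triples is exactly the kind of event description that, in the presented system, gets shipped to a server or analyzed locally with rules.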

Personally, the highlight of the day was the presentation of the Semantic Web Challenge finalists. I was a member of the jury, which meant that I had to review the submissions in advance, and we had two very enjoyable discussions with the rest of the jury on the submissions. We had made the first selection the day before, and this time all finalists gave their presentations and demos. And it was a tough task to choose (that is why we had such long discussions:-) because, well, the submissions were great overall. I do not really want to analyze each of the entries; I do not think it would be appropriate for me in this position. But the winning entry, namely TrialX, really made a great impression on me. In short, the application is a consumer-centric tool through which patients can find matching clinical trials in which they want to participate; it also helps those who organize those trials, etc. It is some sort of a matchmaking tool using all kinds of medical ontologies and vocabularies, public health record data, and the like. We should realize the importance of this: here is a great Semantic Web application, winner of the challenge, which is really an application, not only a demonstration, already deployed on the Web (soon as an iPhone app, too), and, to be a bit dramatic, it may save lives (and possibly already has). What else do we want as proof that this technology is not only an academic exercise any more?

Fifth day

Only a partial day for me, as far as the conference goes, because I had to fly out before the end… But I could listen to the last keynote of the conference, ie, that of Nova Spivack.

Not surprisingly, Nova talked about Twine-2, a.k.a. T2. I did not really know what T2 was to be; I had only heard that Twine, ie, T1, is moribund. As Nova acknowledged, it is too complicated, too hard for users to really figure out; in fact, most users used it for search, which was not the strongest feature of T1 in the first place.

So T2 is (well, will be) all about semantically backed search. It semantically indexes the Web, with an attempt to extract semantic information from the pages. The user interface would then be, essentially, some sort of faceted interface that automatically classifies the search results into different tabs; the user can use these tabs, drill down along other categories, etc. So far nothing radically new, though the user interface Nova showed was indeed very clean and nice. All this is done, internally, via vocabularies/ontologies, using RDF, RDFS, or OWL.

The interesting aspect of T2 (at least as far as I am concerned) is the incorporation of collective knowledge. First of all, T2 will include a system whereby users can add vocabularies that T2 will use in categorization. Users can get back those ontologies in OWL/RDF, they can improve them, etc. The other tool they will provide is a means to help semantically index pages that are, by themselves, not semantically annotated. This can be done via a Firefox extension; users can identify parts of the web pages (I presume, essentially, the DOM nodes) and associate these with classes of specific ontologies. The extension produces an XSLT transformation that can be sent back to the T2 system. Some social mechanism should of course be set up (eg, webmasters annotating their own pages should get a higher priority than third-party annotators) but, essentially, it is some sort of a GRDDL transformation by proxy: T2 will have information on how to find transformations to semantically index specific pages without requiring the modification of the pages themselves (in contrast to GRDDL, where such a transformation is referred to from the page itself).
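The “by proxy” idea boils down to the engine keeping a registry mapping pages to community-contributed transformations and applying the transform on its own side. The sketch below is my guess at the architecture, not T2’s actual design; the registry, the priority scheme, and the toy transform (standing in for an XSLT sheet) are all invented for illustration.

```python
# Sketch of GRDDL-by-proxy: the engine, not the page, knows which
# transformation turns a page into triples. Everything here is illustrative.

transform_registry = {}

def register_transform(url_prefix, transform, priority=0):
    """Register a transform; webmasters could register at higher priority
    than third-party annotators, as the social mechanism would require."""
    transform_registry.setdefault(url_prefix, []).append((priority, transform))

def annotate(url, html):
    """Apply the best-matching registered transform to produce triples."""
    for prefix, entries in transform_registry.items():
        if url.startswith(prefix):
            _, best = max(entries, key=lambda e: e[0])  # highest priority wins
            return best(html)
    return []

# A toy transform standing in for an XSLT sheet:
register_transform(
    "http://example.org/",
    lambda html: [("http://example.org/page", "ex:title",
                   html.split("<title>")[1].split("</title>")[0])],
    priority=1)

triples = annotate("http://example.org/page",
                   "<html><title>Hello</title></html>")
assert triples == [("http://example.org/page", "ex:title", "Hello")]
```

The contrast with plain GRDDL is visible in the code: nothing in the page itself points at the transformation; the lookup happens entirely on the engine’s side.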

Of course, the system was a bit controversial in this community; indeed, it was not clear whether T2 would make use of the semantic information that does exist in pages already (microformats, RDFa, …), let alone the Linked Open Data information that is already out there. When asked, Nova did not seem to give a clear answer though, to be fair, he did not specifically say no, and he also said that the semantic index might be put back to the public in the form of linked data. To be decided. It is also not fully clear whether those proxy-GRDDL transformations would be available for the community at large (hopefully the answer is yes…). It will be interesting to see how it plays out (T2 comes out in beta sometime in early 2010). Certainly a project to keep an eye on.

From a slightly more general point of view it is also interesting to note that two out of the three Semantic Web Challenge winners are also semantic search engines with different user interfaces (though sig.ma and VisiNav definitely do use the LOD cloud, no question there…). Definitely an area on the move!

I had the time and, frankly, the energy to really listen to only one more paper in the regular track, namely the paper on functions of RDF language elements by Bernhard Schandl. A nice idea: imagine a traditional spreadsheet, where each cell is a collection of resources from an RDF graph, or a function that can manipulate those resources (extract information, produce a new set of resources, etc). Just like in a spreadsheet, if you modify the underlying graph, ie, the resources in a cell, everything is automatically recalculated. Because, just like in a spreadsheet, a function can refer to the result of another function in another cell, one can do fairly complicated transformations and information extraction quite easily. Neat idea, to be tried out from their site.
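The spreadsheet analogy above can be sketched in a few lines: cells are functions over the graph (and over other cells), recomputed whenever they are read. The cell names, the tiny graph, and the on-demand recalculation strategy are all my own simplifications, not the paper’s actual system.

```python
# Sketch of the spreadsheet-over-RDF idea: cells are functions of the
# graph and of other cells, recalculated whenever the graph changes.
# Names and data are illustrative.

class RDFSheet:
    def __init__(self, graph):
        self.graph = set(graph)
        self.cells = {}          # cell name -> function(sheet) -> value

    def define(self, name, fn):
        self.cells[name] = fn

    def value(self, name):
        # Recomputed on demand, like a spreadsheet recalculation.
        return self.cells[name](self)

sheet = RDFSheet({
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:b", "rdf:type", "ex:Person"),
})
# Cell A: all resources of type ex:Person.
sheet.define("A", lambda s: {x for x, p, o in s.graph
                             if p == "rdf:type" and o == "ex:Person"})
# Cell B refers to cell A, like a dependent spreadsheet cell.
sheet.define("B", lambda s: len(s.value("A")))

assert sheet.value("B") == 2
sheet.graph.add(("ex:c", "rdf:type", "ex:Person"))
assert sheet.value("B") == 3   # reflects the graph change automatically
```

A real implementation would of course cache cell values and track dependencies rather than recompute everything on every read, but the “change the graph, everything follows” behaviour is the essence.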

That is it for ISWC2009. I obviously missed a lot of papers, partly because social life and hallway conversations sometimes had the upper hand, and sometimes simply because there were too many parallel sessions. But it was definitely an enriching week… See you all, hopefully, at ISWC2010, in Shanghai!
