A nice paper I just heard at the USEWOD2011 Workshop at the WWW2011 conference: “Empirical study of real-world SPARQL queries”, by M.A. Gallego and his friends from the Univ. of Valladolid, in Spain. What they did was to analyse the SPARQL queries issued by various clients against the DBpedia and the Semantic Web Dogfood datasets, to see whether general patterns emerge that RDF triple stores and SPARQL implementers could take into account. This is a workshop paper, i.e., work in progress, so the results must be taken with a pinch of salt. E.g., it seems that DESCRIBE and CONSTRUCT queries are very rarely used (not a big surprise); that OPTIONAL and UNION are used quite a lot, so their optimization is important; and that most of the queries are dead simple, although around half of them rely on FILTER (albeit with one variable only), etc.
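To make that profile concrete, here is a minimal sketch of the kind of query the paper describes as typical: a couple of basic triple patterns, one OPTIONAL block, and a FILTER over a single variable. The predicates are real DBpedia ontology terms, but the query itself is my own illustration, not one taken from the analysed logs.

```sparql
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# A "dead simple" query in the paper's sense: basic triple
# patterns, one OPTIONAL, and a FILTER on a single variable.
SELECT ?city ?name ?population
WHERE {
  ?city a dbo:City ;
        rdfs:label ?name .
  OPTIONAL { ?city dbo:populationTotal ?population }
  FILTER (lang(?name) = "en")
}
LIMIT 10
```

Note that the OPTIONAL keeps cities without a recorded population in the result set (their ?population is simply unbound), which is exactly why its efficient evaluation matters when it shows up in a large fraction of the query load.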
The interesting point for me is, however, that some of these figures were radically different between the two datasets. E.g., 16% of the queries used OPTIONAL for DBpedia, whereas only 0.41% did for the Dogfood dataset. What this tells me is that it is extremely difficult to optimise data stores in general: the characteristics of the dataset, and indeed the application area (e.g., I would expect SPARQL queries to be much more complicated in the health care domain), have to play an important role. What the dimensions of optimization are is not yet clear, but the type of research Gallego and his friends are doing might shed some light… Kudos for having started this discussion!