For those of us programming in Python, RDFLib is certainly one of the RDF packages of choice. Several years ago, when I developed a distiller for RDFa 1.0, some good souls picked the code up and added it to RDFLib as one of the parser formats. However, years have gone by, and have seen the development of RDFa 1.1, of microdata, and also the specification of directly embedding Turtle into HTML. It is time to bring all these into RDFLib…
Some time ago I developed both a new version of the RDFa distiller, adapted to the RDFa 1.1 standard, and a microdata-to-RDF distiller, based on the Interest Group note on converting microdata to RDF. Both were packages and applications on top of RDFLib, which is fine because they can be used with the deployed RDFLib installations out there. But, ideally, these should be retrofitted into the core of RDFLib; I have used the last few quiet days of the vacation period in August to do just that (thanks to Niklas Lindström and Gunnar Grimes for some email discussion and for helping me through the hooplas of RDFLib-on-github). The results are in a separate branch of the RDFLib github repository, under the name structured_data_parsers. Using these parsers, here is what one can do:
    from rdflib import Graph

    g = Graph()
    # parse an SVG+RDFa 1.1 file and store the results in 'g':
    g.parse(URI_of_SVG_file, format="rdfa1.1")
    # parse an HTML+microdata file and store the results in 'g':
    g.parse(URI_of_HTML_file, format="microdata")
    # parse an HTML file for any structured content and store the results in 'g':
    g.parse(URI_of_HTML_file, format="html")
The third option is interesting (thanks to Dan Brickley, who suggested it): this will parse an HTML file for any structured data, be it in microdata, RDFa 1.1, or Turtle embedded in a <script type="text/turtle">...</script> tag.
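To make the embedded Turtle case concrete, here is a minimal sketch of how the content of such script elements could be pulled out of an HTML page using only Python's standard library. This is not the code used in the structured_data_parsers branch; the class TurtleScriptExtractor, the function extract_turtle, and the sample page are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TurtleScriptExtractor(HTMLParser):
    """Collect the text of every <script type="text/turtle"> element.

    Hypothetical helper for illustration only, not part of RDFLib.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []          # one string per Turtle script element
        self._in_turtle = False   # True while inside a Turtle script element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values may be None.
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "text/turtle":
            self._in_turtle = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_turtle:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_turtle:
            self.blocks.append("".join(self._buffer).strip())
            self._in_turtle = False


def extract_turtle(html_text):
    """Return a list with the content of each embedded Turtle block."""
    parser = TurtleScriptExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


# A made-up page with one Turtle block and one unrelated script.
SAMPLE_HTML = """<html><head>
<script type="text/turtle">
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/doc> dc:title "Example" .
</script>
<script type="text/javascript">var x = 1;</script>
</head><body><p>Hello</p></body></html>"""

turtle_blocks = extract_turtle(SAMPLE_HTML)
```

Each returned string could then be handed to RDFLib's regular Turtle parser, for instance with g.parse(data=block, format="turtle").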
The core of the RDFa 1.1 parser has gone through very thorough testing, using the extensive test suite on rdfa.info. This is less true for microdata, because there is no extensive test suite for it yet (but the code is also simpler). On the other hand, any restructuring like this may introduce some extra bugs. I would very much appreciate it if interested geeks in the community could install and test it, and forward me the bugs that are undoubtedly still there… Note that the microdata->RDF mapping specification may still undergo some changes in the coming few weeks/months (primarily catching up with some development around schema.org); I hope to adapt the code to the changes quickly.
I have also made some arbitrary decisions here, which are minor, but arbitrary nevertheless. Any feedback on those is welcome:
- I decided not to remove the old, 1.0 parser from this branch. Although the new version of the RDFa 1.1 parser can switch into 1.0 mode if the necessary switches are in the code (e.g., @version or an RDFa 1.0 specific DTD), in the absence of those 1.1 will be used. As, unfortunately, 1.1 is not 100% backward compatible with 1.0, this may create some issues with deployed applications. This also means that the format="rdfa" argument will refer to 1.0 and not to 1.1. Am I too cautious here?
- The format argument in parse can also hold media types. Some of those are fairly obvious: e.g., application/svg+xml will map to the new parser with RDFa 1.1. But what should be the default mapping for text/html? At present, it maps to the “universal” extractor (i.e., extracting everything).
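The media type question can be illustrated with a small lookup table. The sketch below is an assumption about how such a mapping might look, not RDFLib's actual plugin registration; it contains only the two mappings mentioned above, and the names MEDIA_TYPE_TO_FORMAT and resolve_format are hypothetical.

```python
# Only the two mappings stated in the text; anything else is left untouched.
MEDIA_TYPE_TO_FORMAT = {
    "application/svg+xml": "rdfa1.1",  # new RDFa 1.1 parser
    "text/html": "html",               # the "universal" extractor
}


def resolve_format(value):
    """Map a media type to a parser format name (hypothetical helper).

    Values that are not known media types, such as "microdata" or
    "rdfa", are returned unchanged.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = value.split(";", 1)[0].strip().lower()
    return MEDIA_TYPE_TO_FORMAT.get(media_type, value)
```

With a table like this, changing the default for text/html would be a one-line change, which is why the choice of default is mostly a question of what users expect.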
Of course, at some point, this branch will be merged with the main branch of RDFLib, meaning that, eventually, this will be part of the core distribution. I cannot say at this point when this will happen, as I am not involved in the day-to-day management of RDFLib development.
I hope this will be useful…