Ivan’s private site

January 18, 2014

Some W3C Documents in EPUB3

Filed under: Code,Digital Publishing,Python,Work Related — Ivan Herman @ 13:04

I have been having fun the past few months, when I had some time, with a tool to convert official W3C publications (primarily Recommendations) into EPUB3. Apart from the fact that this helped me to dive into some details of the EPUB3 Specification, I think the result might actually be useful. Indeed, it often happens that a W3C Recommendation consists, in fact, of several different publications. This means that just archiving one single file is not enough if, for example, you want to have those documents offline. On the other hand, EPUB3 is perfect for this; one creates an eBook that contains all constituent publications as “chapters”. Yep, EPUB3 as a complex archiving tool :-)

The Python tool (which is available on GitHub) has now reached a fairly stable state, and it works well for documents that have been produced by Robin Berjon’s great respec tool. I have generated, and put up on the Web, two books for now:

  1. RDFa 1.1, a Recommendation that was published last August (in fact, there was an earlier version of an RDFa 1.1 EPUB book, but that was done largely manually; this one is much better).
  2. JSON-LD, a Recommendation published this week (i.e., 16th of January).

(Needless to say, these books have no formal standing; the authoritative versions are the official documents published as a W3C Technical Report.)
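
To make the “several publications as chapters” idea concrete: in EPUB3 a package document lists every file of the book in a manifest and gives the reading order in a spine. The snippet below is only a minimal sketch of that structure, not the code of the tool mentioned above; the function name, the file names, and the identifier are hypothetical, and a real book needs more (a navigation document, style sheets, images, and so on).

```python
from xml.sax.saxutils import escape, quoteattr


def build_package_document(title, identifier, modified, chapters):
    """Return a minimal EPUB3 package document (OPF) as a string.

    chapters is a list of (item_id, href) pairs, one per constituent
    publication, given in reading order. All names are illustrative.
    """
    # The navigation document is a required manifest entry in EPUB3.
    manifest = [
        '    <item id="nav" href="nav.xhtml" '
        'media-type="application/xhtml+xml" properties="nav"/>'
    ]
    spine = []
    for item_id, href in chapters:
        # Each publication appears once in the manifest (the file list)...
        manifest.append(
            f"    <item id={quoteattr(item_id)} href={quoteattr(href)} "
            'media-type="application/xhtml+xml"/>'
        )
        # ...and once in the spine (the reading order).
        spine.append(f"    <itemref idref={quoteattr(item_id)}/>")

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0" '
        'unique-identifier="pub-id">',
        '  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">',
        f'    <dc:identifier id="pub-id">{escape(identifier)}</dc:identifier>',
        f"    <dc:title>{escape(title)}</dc:title>",
        "    <dc:language>en</dc:language>",
        f'    <meta property="dcterms:modified">{escape(modified)}</meta>',
        "  </metadata>",
        "  <manifest>",
        *manifest,
        "  </manifest>",
        "  <spine>",
        *spine,
        "  </spine>",
        "</package>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical file names standing in for the parts of a Recommendation.
    print(build_package_document(
        title="Example multi-document book",
        identifier="urn:example:multi-document-book",
        modified="2014-01-18T13:04:00Z",
        chapters=[("part1", "part1.xhtml"), ("part2", "part2.xhtml")],
    ))
```

The point of the sketch is simply that adding one more publication to the book means adding one manifest entry and one spine entry.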

There is also a draft version of a much larger book on RDF 1.1, consisting of all the RDF 1.1 specifications to come, including all the various serializations (among them RDFa and JSON-LD). I say “draft”, because those documents are not yet final (i.e., not yet Recommendations); a final version (with, for example, all the cross-links properly set) will be at that URI when RDF 1.1 becomes a Recommendation (probably in February).


January 4, 2014

Data vs. Publishing: my change of responsibilities…

Fairy Lake Botanical Garden, Shenzhen, China

There was an official announcement, as well as some references, on the fact that the structure of data related work has changed at W3C. A new activity has been created called “Data Activity”, that subsumes what used to be called the Semantic Web Activity. “Subsumes” is an important term here: W3C does not abandon the Semantic Web work (I emphasize that because I did get such reactions); instead, the existing and possible future work is simply continuing within a new structure. The renaming is simply a sign that W3C also has to pay attention to the fact that there are many different data formats used on the Web, not all of them follow the principles and technologies of the Semantic Web, and those other formats and approaches also have technological and standardization needs that W3C might be in a position to help with. It is not the purpose of this blog, however, to look at the details; the interested reader may consult the official announcements (or consider Tim Finin’s formula: Data Activity ⊃ Semantic Web ∪ eGovernment 🙂).

There is a much less important but more personal aspect of the change, though: I will not be the leader of this new Data Activity (my colleague and friend, Phil Archer, will do that). Before anybody tries to find some complicated explanation (e.g., that I was fired): the reason is much simpler. About a year ago I got interested in a fairly different area, namely Digital Publishing. What used to be, back then, a so-called “headlight” project at W3C, i.e., an exploration into a new area, turned into an Activity on its own, with me as the lead, last summer. There is a good reason for that: after all, digital publishing (e.g., e-books) may represent one of the largest usage areas of the core W3C technologies (i.e., HTML5, CSS, or SVG) right after browsers; indeed, for those of you who do not realize that (I did not know that just a year and a half ago either…), an e-book is “just” a frozen and packaged Web site, using many of the technologies defined by W3C. A major user area, thus, but one whose requirements may be special and not yet properly represented at W3C. Hence the new Activity.

However, this development at W3C had its price for me: I had to choose. Heading both the Digital Publishing and the Data Activities was not an option. I have led W3C’s Semantic Web Activity for about 7 years; 7 years that were rich in events and results (the forward march of Linked Open Data, a much more general presence and acceptance of the technology, specifications like OWL 2, RDFa, RDB2RDF, PROV, SKOS, SPARQL 1.1, with RDF 1.1 just around the corner now…). I had my role in many of these, although I was merely a coordinator for the work done by other amazing individuals. But I had to choose, and I decided to go towards new horizons (in view of my age, probably for the last time in my professional life); hence my choice for Digital Publishing. As simple as that…

But this does not mean I am completely “out”. First of all, I will still actively participate in some of the Data Activity groups (e.g., in the “CSV on the Web WG”), and have a continuing interest in many of the issues there. But, maybe more importantly, there are some major overlapping areas between Digital Publishing and Data on the Web. For example, publishing also means scientific, scholarly publishing, and this particular area is increasingly aware of the fact that publishing data, as part of reporting on a particular scientific endeavor, becomes as important as publishing a traditional paper. And this raises tons of issues on data formats, linked data, metadata, access, provenance, etc. Another example: the traditional publishing industry makes increasingly heavy use of metadata. There is a recognition among publishers that well-chosen and well-curated metadata for books is a major business asset that may make a publication win or lose. There are many (overlapping…) vocabularies, and relationships to libraries, archival facilities, etc., come to the fore. Via this metadata the world of publishing may become a major player in the Linked Data cloud. A final example may be annotation: while many aspects of the annotation work are inherently bound to the Semantic Web (see, e.g., the work of the W3C Community Group on Annotation), it is also considered to be one of the most important areas for future development in, say, the educational publishing area.

I can, hopefully, contribute to these overlapping areas with my experience from the Semantic Web. So no, I am not entirely gone, just changed hats! Or, as in the picture, acting (also) as a bridge…

February 22, 2013

Browsers and eBook Readers

Filed under: Digital Publishing,Work Related — Ivan Herman @ 23:02

eBook Readers Galore (Photo credit: libraryman)

My last week was all about digital publishing: first, I was at the W3C Workshop on eBooks and the Open Web Platform, that I helped to organize. If I extrapolate from the discussions at the W3C Workshop, there are good prospects that this topic will become more important at the W3C, and that it will also keep us busy (in addition to my role on the Semantic Web). By the way, the minutes of the W3C Workshop (both for the 1st and the 2nd days) and the presentations are public; a somewhat more detailed workshop report should also be available soon.

The Workshop was followed by O’Reilly’s Tools of Change (TOC) conference: a first time for me. And it was extremely interesting to find myself in an environment where I had never been before. I have seen some great keynotes (e.g., Mark Waid’s on “Reinventing Comics And Graphic Novels For Digital”, or Maria Popova’s, from Brain Pickings), and learned a lot at some of the sessions (for example, at Bill Rosenblatt’s session on some of the legal aspects surrounding eBooks).

My interest in this whole area is, primarily, in how digital publishing in general, and electronic books in particular, relate to technologies developed at W3C. For those of you who may not realize that: if an electronic book uses the ePUB standard (and more and more books do) then the book is, in fact, a “frozen” Web site (depending on the ePUB version either based on XHTML1 or HTML5). Technically, it is a zip file containing all the files necessary to render the content, plus some ePUB specific files to manage the table of contents, to help readers to display the content even more quickly, etc. (Actually, as far as I know, most of the ePUB readers are based on the same core technology as many of the Web browsers, namely WebKit.) The strong relationship between the Web and publishing in general, and eBooks in particular, was emphasized several times at the conference, especially by the keynote of Jeff Jaffe, the CEO of W3C.
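
The “zip file plus some ePUB specific files” description can be illustrated with a few lines of Python. This is a minimal sketch using only the standard `zipfile` module; the function names and the `package.opf` path are assumptions made for the example, and the result is a skeleton rather than a complete, valid book. It shows two container conventions: the first entry is an uncompressed file called `mimetype`, and `META-INF/container.xml` tells the reading system where the package document lives.

```python
import io
import zipfile

# Points the reading system at the package document. The path
# "package.opf" is an arbitrary choice for this sketch.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub_skeleton(target, files):
    """Write an ePUB-style zip to target (a path or a binary file object).

    files maps archive paths to text content (the "frozen Web site").
    """
    with zipfile.ZipFile(target, "w") as archive:
        # The mimetype entry comes first and is stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for path, content in files.items():
            archive.writestr(path, content,
                             compress_type=zipfile.ZIP_DEFLATED)


def list_epub_entries(source):
    """Return the names of all entries in an ePUB file, in archive order."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_epub_skeleton(buffer, {
        "package.opf": "<package/>",        # placeholder content only
        "chapter1.xhtml": "<html/>",        # placeholder content only
    })
    for name in list_epub_entries(buffer):
        print(name)
```

Running `list_epub_entries` on any ePUB file at hand is a quick way to see the XHTML, CSS, and image files sitting next to the ePUB specific ones.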

But then… if so, why do we need separate eBook readers, either in hardware or in software? (Let us put aside for now the issue of DRM, vendor lock-in, etc.; these are of course reasons but let us hope the business will evolve towards a more open environment where those issues will be less relevant.) Do we really need separate ePUB reader software on, say, my iPad, or should we simply rely on the browsers taking care of ePUB files either directly or through some extensions? (There is, for example, a project called Readium to add such capabilities to Chrome.) The answer is not obvious; there are proponents of both approaches. My 2 cents here is: it is not a core technology issue, but a user experience and interface one. Reading a book, electronic or otherwise, is a different intellectual activity than reading an average Web page. Here are some differences that I feel are important, and I am sure there are more, many more:

  • A book must be available off-line; this is, actually, its natural state. This difference is obvious, but worth noting: for example, the user interface for books has to be able to list what is and what is not available at a given moment (all readers have some sort of an imitation of a traditional bookshelf).
  • The amount of “information” you want to absorb is different. A typical Web page is not terribly long; even the more detailed Wikipedia articles, when printed, are rarely longer than 4-5 pages. Compare that to an average book that may be hundreds or even thousands of pages. What this means in practice is that, whereas a Web page is usually read, understood, “absorbed” in one go, reading a single book may take several days or weeks. This has all kinds of consequences on how one navigates, uses traditional bookmarks (not the ones browsers usually provide, i.e., to store URLs, but what used to be bookmarks in the past), tables of contents, indexes, glossaries, etc. These features are essential for books but much less so for an average Web page.
  • Modern Web pages have more and more interactive features, they are related to various social sites like Twitter or Facebook; very often these pages are Web applications with very complex features (think of Gmail, for example). Obviously, browsers have to be prepared for a high level of interactivity and have to be optimized to offer an optimal user experience. Books are much less interactive. Newer generations of books may include some level of interactivity, and this is important for, say, the educational book markets or for children’s books, but it is a far cry from what Web sites do. Also, some readers (like Kobo’s) try to include some level of Social Web facilities (sharing information about books with friends, that sort of thing); to be honest, I never found those social features interesting or important (o.k., I may just be old-skool). Reading a book for me remains a linear reading activity, whether it is fiction, poetry, history, or politics. I want my eBook reader to optimize for that, and avoid distractions.
  • There are some features that a good eBook reader should offer and browsers traditionally do not. A prime example is annotation facilities. Many people like to scribble on their books, underline full sentences, highlight words; I still have not found any tools to do that properly in a Web browser, although all the eBook readers that I have tested so far have such functionality. This is a typical user interface difference that comes from different demands. (Another example that comes to my mind is quick access to a dictionary, to an encyclopedia, etc.)
  • Some sort of a payment/rights management system must be part of the reader. I personally consider the current DRM system, as used in the eBook world, as fundamentally broken insofar as it may drive people away from this market. However, I recognise that something should be available that allows authors of books to get some reward for their work. Whether that is some sort of watermarking, social DRM, or whatever, I do not know, but something is needed, and the reader environment has to handle this.

I realize, of course, that this is a continuum: with ePUB3 we have the ability to make eBooks much more interactive, possibly with scripts, multimedia, etc.; in effect, electronic books are becoming more and more like Web applications. I.e., some of these differences may disappear or become less important. Nevertheless, I believe there will always be a difference in user expectations, in the emphasis that a piece of software (or hardware) may have. eBook readers are not browsers, although electronic books are, in fact, part of the Web just like other types of Web content. Is it a sign that we may need a more diverse landscape of accessing the Web than we have today?
