Thoughts on this week’s Beyond the PDF workshop, and a suggestion that scientists join engineers in pioneering real-world solutions for science publishing.
The Beyond the PDF Workshop is currently in session, running from January 19-21, 2011 at the University of California, San Diego.
Why is the workshop entitled “Beyond the PDF”? It’s a plain, unvarnished fact that much of today’s science publishing is rooted in the use of static PDF files as the authoritative platform for the electronic publication of scientific research papers. PDF thus symbolizes the static printed page that workshop attendees regard as a ball-and-chain retarding scientific advancement. This workshop is intended to help ventilate the frustrations of those who see the traditionally inanimate printed page as an archaic convention of unimaginative publishers.
Why not publish scientific papers and supporting data in a seamless, dynamic, interactive package? At this workshop, attendees sense the potential for enormous efficiency gains in research productivity, peer-review, dissemination, cross-referencing and more.
Some are captured by the possibilities for reinvigorating science education with attractive interfaces bridging stale textbooks and live data. They see the chance to bring students closer to the data than ever before, and believe that new sources of inspiration could revolutionize the process of drawing young minds into science and engineering.
If only the publishers, libraries, institutions and technology people would just get their act together, it is claimed, the pace of progress itself would accelerate, encompassing source materials, built-in interactive analysis tools, social media, and more.
There’s another widespread assumption: that the marketplace for science publishing is relatively tiny and thus vulnerable to capture by commercial interests. Revolutionizing science publishing, it is said, is an ideal case for open-source solutions.
I’d like to offer one perspective from the point of view of an organization which has helped dozens of science publishers, pharmaceutical companies, engineering organizations and other demanding consumers of document technologies transition into the digital realm. There’s lots of room for open-source and commercial solutions alike, and the solution is nearer at hand than many might think.
The Concept of “Page”
Paper pages overtook stone tablets, scrimshaw and scrolls a long, long time ago. Whether handwritten or printed, the humble paper page provided key benefits in terms of portability, functionality (ease of use), reproducibility and more.
Humanity entered the information age with a tremendous legacy of paper documents and records. We’ve evolved an ingrained respect for information committed to paper, information that can be readily dated, authenticated, shared, reviewed, analyzed and annotated.
Adobe Systems invented PDF as a more flexible iteration of its PostScript printing technology. It’s ironic that the birth of electronic paper occurred as a function of the need to produce more of the dead-tree variety.
Once again, why PDF?
Some have it that the concept of a page is nothing but a convention. They argue that physical, self-contained documents are immaterial to the recording and dissemination of knowledge. They say that the page is simply an anachronism born of technologies that are thousands of years older than computing (not to mention far older than modern science itself).
Nanopublishing is in the air: the idea that science publishing should support the publication of a single observation or assertion. It can include annotations, links to data, and (possibly) comments. This snippet would not be a document in the traditional sense. The idea is more like an amalgam of text, metadata and sources – more like a couple of sentences together with a bunch of hyperlinks. So new is nanopublishing that it doesn’t even (Jan 21, 2011) have a Wikipedia page or rank in Google Trends. Nonetheless, some are sure that it’s the second coming of scientific publishing. Maybe it is, and if so, PDF may still be a great technical solution for it.
I’m here to argue that the scope of the publication itself isn’t the issue; scientists and engineers, no less than lawyers and accountants, generally (not always!) require an electronic analog of paper documents to do their work.
You can’t do science or engineering if you can’t easily record, reference and communicate about your materials. Reliably reproducible documents provide indisputable common points of reference without proprietary software, network connections and so on.
From the start, PDF offered the mission-critical features for a successful electronic document format that had to play nice in a paper-based world: portability and reliability. The factors that allowed PDF to leap from printing solution to universal electronic document format are twofold. First, Adobe’s choice to publish the PDF Reference from the beginning in 1993. Second, Adobe’s very early choice to make the Adobe Reader free at a time when very little software was free.
As a result, online science publishing over the past fifteen or so years has typically taken the same form as the printed version: PDF. HTML and other formats are often (not always) secondary or supplemental.
There are other, more prosaic reasons for PDF in many science and science-related implementations, clinical records in drug trials being only one such example. PDF is a superb medium for clinical records because it is flexible enough to encompass a wide variety of document sources, and includes the metadata and other features required by the industry’s characteristically goliath implementations.
PDF is well and good for a reliable page, but the desperate need for nanopublishing aside, what’s bothering attendees at this workshop is that there’s no framework, no readily available and generally accepted tools, for associating scientific data with the research that relies on it, much less for making that data attractive and easy to use.
We know that, valuable or not, being able to print your document is not going to be the last word in science publishing. The question this workshop is rightly raising is: what’s beyond the printed page?
Why the PDF page isn’t enough
For me, there’s no arguing with the workshop’s premise. The most common, most basic PDF file is simply for viewing and printing. It can be manipulated in a variety of useful and interesting ways, but it’s still rather like its paper origin: flat.
Scientists, publishers and engineers need more. First and foremost, they need methods and models for publishing scientific data alongside, and preferably integrated within, the analytical and presentational document that is the research paper itself.
Secondly, the tools for annotating documents with dynamic information, including video, scripts, animation, tabular data and more, are lacking, confusing, clunky or expensive. (ISO 32000-2, soon to be better known as PDF 2.0, will improve support for rich media of all sorts. It’s targeted for publication in 2012.)
Last, but by no means least, the technology needs to be open and extensible, with room for all players to thrive, including non-commercial publishers, organizations and individuals. Empowerment is key.
It’s important to realize that PDF can and will accommodate all (or at least most) of the desired qualities. More work needs to be done, but PDF is already a lot of the solution.
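One concrete example: PDF has supported embedded file attachments since well before this debate, so a dataset can already travel inside the paper that cites it. Below is a minimal, hand-rolled sketch in Python that builds a one-page PDF with a CSV attached via the standard /EmbeddedFiles name tree. It is illustrative only; the file names and data are invented, and a real workflow would use a proper PDF library rather than writing objects and the cross-reference table by hand.

```python
# Sketch: a one-page PDF carrying a data file as a standard attachment.
# Hypothetical names and data; a production tool would use a PDF library.
data = b"x,y\n1,2\n3,4\n"  # the "scientific data" to embed

# PDF bodies for objects 1..5: catalog (with EmbeddedFiles name tree),
# page tree, page, file specification, and the embedded file stream.
objects = [
    b"<< /Type /Catalog /Pages 2 0 R"
    b" /Names << /EmbeddedFiles << /Names [(results.csv) 4 0 R] >> >> >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    b"<< /Type /Filespec /F (results.csv) /EF << /F 5 0 R >> >>",
    b"<< /Type /EmbeddedFile /Length %d >>\nstream\n%s\nendstream"
    % (len(data), data),
]

out = bytearray(b"%PDF-1.7\n")
offsets = []  # byte offset of each object, for the xref table
for i, body in enumerate(objects, start=1):
    offsets.append(len(out))
    out += b"%d 0 obj\n" % i + body + b"\nendobj\n"

# Cross-reference table: one free entry plus one 20-byte entry per object.
xref_pos = len(out)
out += b"xref\n0 %d\n" % (len(objects) + 1)
out += b"0000000000 65535 f \n"
for off in offsets:
    out += b"%010d 00000 n \n" % off
out += (
    b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
    % (len(objects) + 1, xref_pos)
)

with open("paper_with_data.pdf", "wb") as fh:
    fh.write(bytes(out))
```

Any conforming PDF viewer that supports attachments will list results.csv in its attachments panel; the data rides along wherever the document goes, with no network connection or proprietary software required.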
Wish I Was There
I’m not attending the conference due to other commitments, but if I were there I would tell the attendees these things.
FIRST. PDF is mature, stable and established, with broad acceptance across the spectrum of producers and consumers. Adobe Systems, although still the leading toolmaker for PDF manipulation, is far from the only player. The pool is open. PDF is ready and waiting for commercial and open-source developers alike to create new solutions for science publishing needs.
SECOND. As of 2008, the world’s electronic document file-format is an Open Standard, ISO 32000. The subset ISO standards developed over the past twelve years to meet industry-specific needs testify to the conceptual and technical depth and sophistication of the format. PDF/A sets requirements for archival and retention purposes while PDF/X sets norms for print-production implementations. PDF/VT is a brand-new ISO Standard focused on the needs of the variable and transactional printing industry, and PDF/UA is a soon-to-be-published Standard specifying what is meant by an “accessible” PDF file. Then there’s PDF/E, which we’ll address shortly.
THIRD. PDF is already an International Standard, and International Standards are a perfectly reasonable way to address the technology needs and publishing standards required by the science and engineering communities.
Go Beyond the PDF: Help Develop ISO 24517-2
PDF/E, formally known as ISO 24517-1:2008, was published in 2008. A subset of PDF, PDF/E is designed to be an open and neutral exchange format for engineering and technical documentation. More specifically, part 1 of the Standard defines the use of the Adobe Systems PDF Reference 1.6 for the creation of documents used in engineering workflows.
Key benefits of PDF/E include:
- Reduced requirements for expensive, proprietary software
- Reliable, trusted exchange across multiple applications and platforms
- Self-contained documents
- A cost-effective and accurate means of capturing markup
- An open, International Standard managed by volunteers in a transparent, democratic process
- Reduced storage and exchange costs (vs. paper)
One problem with the current ISO 24517-1 is the fact that it’s based on an older, Adobe-proprietary version of the PDF Reference. Part 2 of PDF/E will be based on ISO 32000-2, the next version of the ISO standard for PDF itself.
PDF/E-1 won’t meet the needs of the scientific community, but Part 2 of the standard offers the possibility of a solution to the issues raised in the Beyond the PDF workshop. When it comes to publishing, the needs of engineers and the needs of scientists have a lot in common, including rich media, 3D modeling, and data.
Part 2 of PDF/E, just launching now, is a sensible step toward going beyond the PDF because the intention of PDF/E-2 is to specify normative language for the archiving of dynamic technical data.
Interested engineers, scientists and others with a stake in the future of dynamic electronic document technology should get involved with the PDF/E Committee and help fashion the next generation of scientific, technical and engineering publishing. Subject-matter experts from any country represented on ISO’s TC 171 Committee are welcome to join the International Committee for PDF/E. The next face-to-face meeting is mid-May, in Salt Lake City.
Science publishing isn’t to be confused with the paper it’s printed on. It is, or could be, a complex stew of content, representation and data. PDF has served publishing well until now. PDF is useful, capable and open enough to serve publishing well into the future, but only if the stakeholders engage with its potential.
At the end of the day, it’s not about open-source versus commercial. The future of the document lies in International Standards, broad acceptance and an open technical architecture.
Readers may be interested in the study I conducted way back in 2005 regarding the relative popularity of PDF and HTML versions of scientific literature. The methodology is, I’m sure, far from exemplary; then again, I’m not a trained scientist, so it’s simply my best effort emerging from a couple of weekends with a lot of raw data and a copy of Microsoft Excel.
by Duff Johnson