Few know it yet, but a new paradigm for document capture begins in 2011. This is the year in which PDF/A and PDF/UA become cornerstone International Standards for electronic documents in the decades and centuries ahead. This is the year in which document capture begins its fourth paradigm – semantics.
My old company started life as an imaging service bureau. In 1996, our very first job involved scanning several hundred pages of typed poetry and converting the results into a useful Microsoft Word file. I think we charged $400 (which didn’t come close to covering the OCR software), but we learned a bit about document capture.
This article argues for a definable set of paradigms in the development of document capture, and suggests that PDF/A and PDF/UA will enable a brave new world in which documents and data exchange freely, improving document longevity, search, reuse and accessibility into the future.
Where We’re Coming From
Paper was (and often remains) the indispensable medium for documents. Paper is, for the most part, indisputable. It either exists, or it doesn’t. It has this or that writing on it (whether currency or contracts or music). Paper is portable and convenient, or rather, seemed convenient. Dead trees have enjoyed a very long run as the medium of documents, but the utility of paper, relative to the alternatives, has been declining for decades. From micrographics to PDF/A, the set of technologies providing alternatives to paper has long established one thing: To replace paper, you had better be able to act like paper. Each of the paradigms of document capture (the art, science and industry of converting source content into reliable, useful documents) has respected that fact.
The First Paradigm of Document Capture: Micrographics
Measured in paper consumption terms, some organizations generate metric tons of documents. The really big organizations generate tons of documents on a daily basis. The (seemingly) simple act of storing and retrieving these documents can pose a major organizational and financial burden. It was for this reason that the micrographics (microfilm and microfiche) industry was born in the years following World War II.
The idea was (and remains) simple: It’s easier to store, view and share pictures of documents than it is to store the documents themselves. If you are willing to trust these little pictures as if they were the original document, then you can safely shred the paper and start saving money.
Some organizations still store pictures of their documents on film or fiche – it’s eye-readable, even without electricity. Ever since the advent of the personal computer, however, it’s become popular to capture pictures using a digital camera instead. The most common type of digital camera used for this purpose is more commonly known as a scanner.
The Second Paradigm of Document Capture: Digital Imaging and OCR
Scanning allows documents to be captured to a digital form rather than to another physical medium.
Compared with previous analogue technologies (paper and micrographics), the digital age developed very, very quickly. Almost as soon as imaging became popular for storage, retrieval and sharing of business, research, technical and other documents, developers began to produce software to analyze and convert those images into text for search or reuse purposes.
While Optical Character Recognition (OCR) was first made commercially available in the late 1970s, it took until the late 1990s for the accuracy, speed and cost to begin to make sense for large-scale capture.
OCR represents a crucial step beyond a reproducible document image. When locating interesting content was purely a function of metadata and human indexing, there was simply no way to find specific content without physically reading every page – a problem that’s unimaginable in the age of search engines. These days, reasonably high-quality and fast OCR is approaching commodity pricing. It’s now routine for scanned documents to be OCRed without much extra thought. Google will even OCR your scanned documents for free simply because you leave them on a webserver!
In the 1980s, while electronic imaging began to get going, others were asking themselves: Why scan or image a paper document to store it? Why not just create it electronically in the first place?
It was this idea that animated Adobe’s “Camelot” project, eventually resulting in PDF, the third paradigm of document capture.
The Third Paradigm of Document Capture: PDF
Second Paradigm thinking wasn’t limited to scanning paper or previously-imaged microfilm. Computer Output to Laser Disc (COLD) became popular in the late 1980s and early 1990s as a “straight to storage” solution for electronically-generated documents such as bank and insurance statements. Most COLD technologies represented the document with images (usually TIFF files) and database records. As it turns out, electronic documents are more than just images and metadata.
In the 1980s, Adobe Systems was a humming factory for a vast array of technologies and concepts. Among other accomplishments, the company drove the publishing industry from a set of highly technical specialities right into the computer on your desk. Adobe got many requests from government and industry for electronic document technology, but it was a request from the IRS for a reliable, cross-platform electronic document format that caught their eye.
In 1993, Adobe Systems announced PDF (Portable Document Format) along with its soon-to-be flagship PDF creation, management and manipulation software, Adobe Acrobat. The “electronic document” was born.
There are many reasons why the PDF format has proven so durable. I cover these reasons in detail in this article, so here, I’ll keep it to the bullet points:
- Easy to make from any source
- Authentically represents the original
- Portable, and free to view and print
- Flexible and powerful, with many features
- Relatively secure (and secure-able)
As it turns out, these are the attributes that allow an electronic document to sufficiently resemble paper such that people are willing to trust it as they do paper.
The Fourth Paradigm of Document Capture: Semantics
Once page and text have been captured, users can view and search for documents. What’s left to capture? The answer: semantics.
What are “semantics”?
In the electronic document context, semantic information describes the relationships between elements of content. Those familiar with HTML will recognize the concept right away because semantics are an inextricable part of HTML. Here’s an example, showing first the browser’s interpretation and then the HTML source that produces it:
A second-level heading in the current page.
A paragraph of text.
- The first List Item in an Unordered List (bullets instead of numbers)
- The second List Item in the Unordered List
<H2>A second-level heading in the current page.</H2>
<P>A paragraph of text.</P>
<UL><LI>The first List Item in an Unordered List (bullets instead of numbers)</LI>
<LI>The second List Item in the Unordered List</LI></UL>
Pretty simple, right? The tags within the brackets express logical relationships that help software interpret and display the actual content in a pleasing, easy-to-read fashion. Tags accomplish two things:
- Organize content into the correct logical reading order, and
- Specify the role or function of the text in the document
For example, an <H2> tag signifies that the text enclosed by the tag is to be understood as a 2nd level heading. This fact allows the reader (and other consumers, such as software) to characterize the text as important: a chapter, or perhaps a section heading. Likewise, a <UL> tag, along with its subordinate <LI> tags, denotes a list of items, to be distinguished from simple paragraphs of text.
Semantics, in other words, allow one to distinguish a document from a stream of words.
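To make the distinction concrete, here is a minimal sketch in Python that reads the HTML example above and reports each piece of text together with its role, in reading order. It uses only the standard library's html.parser module. The names RoleExtractor, extract_roles and SAMPLE, and the role labels, are illustrative choices of mine and not part of any standard.

```python
from html.parser import HTMLParser


class RoleExtractor(HTMLParser):
    """Collect (role, text) pairs in document order from simple HTML."""

    # Tags whose text is reported. Container tags such as <ul> carry
    # structure only, so they are not listed here. Labels are illustrative.
    TEXT_ROLES = {"h2": "heading (level 2)", "p": "paragraph", "li": "list item"}

    def __init__(self):
        super().__init__()
        self.items = []        # (role, text) pairs in reading order
        self._current = None   # role of the tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H2> arrives as "h2".
        if tag in self.TEXT_ROLES:
            self._current = self.TEXT_ROLES[tag]

    def handle_endtag(self, tag):
        if tag in self.TEXT_ROLES:
            self._current = None

    def handle_data(self, data):
        # Ignore whitespace between tags and text outside a known role.
        text = data.strip()
        if self._current and text:
            self.items.append((self._current, text))


def extract_roles(html):
    """Return the (role, text) pairs found in an HTML string."""
    parser = RoleExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE = """<H2>A second-level heading in the current page.</H2>
<P>A paragraph of text.</P>
<UL><LI>The first List Item in an Unordered List (bullets instead of numbers)</LI>
<LI>The second List Item in the Unordered List</LI></UL>"""

if __name__ == "__main__":
    for role, text in extract_roles(SAMPLE):
        print(f"{role}: {text}")
```

Without the tags, the same input is just four runs of words. With them, software can tell which run is the heading and which runs belong to the list.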
How Semantics Work in PDF
PDF was originally designed to ensure reliable on-screen and in-print appearance. Searchability, text extraction and content re-use weren’t priorities. While it was always possible to extract text from PDFs, the means of including semantic information along with text, graphics, annotations and other content in the document was only added to commercially-available software beginning in 2000.
Since 2000, PDF files may include tags, and they even look rather similar to HTML tags. PDF tags perform exactly the same sort of role as HTML tags: they define the logical reading order of the content. A tagged PDF contains information that can help all manner of consumers (blind users, those copying text, search engines and others) to navigate and understand their documents.
Today, the vast majority (probably over 99%) of PDF files are created without semantic information. Even if their software is capable of including semantics in their PDF files, most users don’t know how to implement correct semantics when they write documents, let alone when they convert them into some other format.
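How might software tell a tagged PDF from an untagged one? As a rough sketch: under ISO 32000, a Tagged PDF sets /Marked to true in the /MarkInfo dictionary of its document catalog, and the catalog has a /StructTreeRoot entry pointing to the tag structure. The Python below checks for those two markers. The function name looks_tagged and the use of a plain dictionary to stand in for a parsed catalog are assumptions for illustration; a real PDF library would supply its own objects, and passing this check says nothing about whether the tags are correct.

```python
def looks_tagged(catalog):
    """Return True if a catalog-like dict has the markers of a Tagged PDF.

    `catalog` is a hypothetical plain-dict stand-in for a parsed PDF
    document catalog, keyed by PDF names such as "/MarkInfo".
    """
    mark_info = catalog.get("/MarkInfo") or {}
    # Both markers must be present: /Marked set to true, and a structure tree.
    return mark_info.get("/Marked") is True and "/StructTreeRoot" in catalog


# Two illustrative catalogs (made-up data, not read from real files).
tagged_catalog = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged_catalog = {"/Type": "/Catalog", "/Pages": {}}

if __name__ == "__main__":
    print(looks_tagged(tagged_catalog))
    print(looks_tagged(untagged_catalog))
```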
What’s the new Paradigm?
In 2011, we’ll see the publication of ISO 19005-2, part two of PDF/A. We’ll also see the approval (if not the publication) of ISO 14289-1, PDF/UA. Why should document capture people care?
- PDF/A specifies constraints and quality-control measures for PDF files to ensure they will operate reliably anytime in the future.
- PDF/UA defines the correct usage of the features in PDF that allow for document semantics to be stored and retrieved in addition to the raw text and graphics.
Between these two standards and ISO 32000 itself, the technical underpinnings of the Fourth Paradigm of Document Capture are complete. It is now possible to capture not just images of documents, not just their text, but also the structure and organization of the content.
From search-engines to screen-readers, from tables to footnotes, the humble PDF will become even easier to use and to re-use than ever before.
How can we help?
Appligent Document Solutions was the first company in the world to offer commercial PDF tagging services to capture document semantics in addition to document text. PDF tags work with any sort of PDF file, including electronic-source content, scanned documents, forms and more. Contact us for more information.
by Duff Johnson