If you haven’t seen it, can I recommend Kristen Snawder’s recent post on the Library of Congress Digital Preservation blog, Digitization is different than digital preservation. Kristen reiterates familiar points about the long-term commitment necessary for serious digital preservation, contrasted with the quick hit of a scanning project. “In the hurry to meet user expectations, institutions may scan large quantities of materials without having a solid plan for preserving the digital images into the future.”

However, another recent find on the Web compels me to make an additional point: namely, that we might do equally well to differentiate between scanning and digitisation. Anyone can set to work with a scanner and create a bunch of digital images – but that barely scratches the surface of what I think we should be expecting of a digitisation project in 2011.

First and foremost, we need metadata: the more the merrier, but something at least, even if we expect to come back later and polish it up (once the images can be browsed and examined on screen). In the absence of any established metadata profiles for a project, at least try to cover as many Dublin Core elements as possible – title, creator, date, subject/keywords… Images, in particular, may prove tricky or time-consuming to find again, especially once there are thousands of them on a disk. We should probably keep the metadata in a database, and perhaps additionally store metadata with the objects themselves. This can be done as XML or plain-text files stored alongside the digital images, or by embedding the metadata in the files we create (many common file formats – TIFF, JPEG, MPEG, PDF – support metadata embedding, and there are many free tools available to help).
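
As a rough illustration, here is a minimal sketch in Python (standard library only) of one way to write a Dublin Core sidecar file alongside a scanned image; the element values, file names and directory layout are purely hypothetical.

```python
# Minimal sketch: write a Dublin Core sidecar XML file next to a scanned image.
# The field values and file names below are purely illustrative.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def write_dc_sidecar(image_path, metadata):
    """Write image_path + '.xml' containing the supplied Dublin Core elements."""
    root = ET.Element("metadata")
    for element, value in metadata.items():
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, element))
        child.text = value
    ET.ElementTree(root).write(image_path + ".xml",
                               encoding="utf-8", xml_declaration=True)

# Hypothetical record for one scanned page
write_dc_sidecar("scans/report_0001.tif", {
    "title": "Annual report 1923, page 1",
    "creator": "Example Institution",
    "date": "1923",
    "subject": "annual reports",
})
```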

There is yet more, though, that we should be doing, particularly when we are scanning text-based objects (articles, books, magazines, reports, etc.). Most importantly, we really should try to extract the text from the image if possible. [1]

My recent web find was the teaching blog of Dr Toine Bogers at the Royal School of Library and Information Science (RSLIS) in Copenhagen, Denmark. One fascinating post describes a Lab Session exercise, From OCR To NER, a set of comparatively simple command-line processes to get the most out of a scanned-text project.

Toine’s post walks us through the process. Once the article is scanned, we should apply some OCR. The exercise goes further and also describes the use of tools to clean up and spell-check the resulting OCR’d text. This will, at the very least, result in a separate text file, hopefully containing a fairly accurate version of the article text. Finally, the cleaned-up text can be submitted to a Named Entity Recognition service. Toine’s exercise uses NER tools at the University of Illinois. (We’ve been using similar functionality provided by OpenCalais and GATE for our AIM25 Open Metadata project.)
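
For a flavour of what the first steps might look like in practice, here is a very rough Python sketch that shells out to the tesseract command-line OCR engine and applies a trivial clean-up pass. The file names are placeholders and Toine’s exercise uses its own toolchain, so treat this as an illustration rather than a recipe.

```python
# Rough sketch: OCR a scanned page with the tesseract CLI, then do a trivial clean-up.
# File names are hypothetical; Toine's exercise uses its own set of tools.
import re
import subprocess

def ocr_page(image_path, out_base):
    """Run tesseract on one image; it writes out_base + '.txt'."""
    subprocess.check_call(["tesseract", image_path, out_base])
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read()

def clean_text(text):
    """Very naive clean-up: rejoin hyphenated line breaks and collapse whitespace."""
    text = re.sub(r"-\n", "", text)      # re-join words split across lines
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()

raw = ocr_page("scans/article_0001.tif", "ocr/article_0001")
cleaned = clean_text(raw)
# The cleaned text could now be spell-checked and submitted to an NER service.
```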

Why do all this? The most important and immediate result is that we can now easily index our article for full-text searching – in a local repository system such as EPrints or DSpace – and, of course, by Google. None of this is possible if we leave the scanned image as just that – an image.
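
To make the indexing point concrete, the toy sketch below builds a tiny inverted index over extracted texts; a repository such as EPrints or DSpace would of course delegate this to its own search engine, so this is illustration only and the sample texts are invented.

```python
# Toy sketch: build an inverted index over extracted article texts.
# A real repository would use a proper search engine; this just shows
# what becomes possible once we have plain text rather than images.
import re
from collections import defaultdict

def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenise(text):
            index[term].add(doc_id)
    return index

# Hypothetical extracted texts keyed by repository identifier
docs = {
    "article_0001": "Digitisation is more than scanning ...",
    "article_0002": "Optical character recognition and named entities ...",
}
index = build_index(docs)
print(index["scanning"])   # -> {'article_0001'}
```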

Another side-effect of any successful OCR outcome is that the text is now free to be re-flowed. This means that we might consider sharing it with users in a variety of forms, enhancing usability and accessibility.

It’s important not to confuse preservation formats with formats for access and dissemination. You will probably have your scanned image masters in TIFF, RAW, JPEG 2000, PostScript or SVG. None of these is likely to be of much use to your users over the Web. Not only are the formats not widely supported by Web browsers, but most users probably don’t need or want your master image. If it’s a high-resolution scan of a 100-page book, they might be looking at a 100MB download, or worse – slow to load, and probably slow to render and navigate.
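
As a sketch of keeping masters and access copies separate, the snippet below uses the Pillow imaging library (my choice for illustration; ImageMagick or similar would do just as well) to derive a modest JPEG for Web delivery while leaving the TIFF master untouched. Paths and sizes are hypothetical.

```python
# Sketch: derive a Web-friendly JPEG from a TIFF master, leaving the master untouched.
# Assumes the Pillow imaging library (pip install Pillow); paths are illustrative.
from PIL import Image

def make_web_derivative(master_path, out_path, max_size=(1024, 768), quality=85):
    with Image.open(master_path) as im:
        im = im.convert("RGB")          # JPEG has no alpha channel
        im.thumbnail(max_size)          # scale down in place, preserving aspect ratio
        im.save(out_path, "JPEG", quality=quality)

make_web_derivative("masters/page_0001.tif", "web/page_0001.jpg")
```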

Time taken thinking about what formats will give users the best experience is time well spent. What platforms might they want to use now and in the foreseeable future? It’s less than 18 months since the Kindle 3 made e-book readers affordable, and the iPad made them sexy. E-books look and function very impressively on both platforms (albeit in different ways): for an overview of some of the benefits of the EPUB format, see Martin Fenner’s post Beyond The PDF… is EPUB. PDF outputs may yet have their uses, if users can at least search for text within them. The point is that only with properly digitised text do these kinds of accessibility options become possible.

Even image collections can be disseminated as e-books – nice offline items that some users might care to flick through on their tablet computers, possibly even their smartphones. I’ve demonstrated how we can create OJS XML from EPrints XML on the fly with XSLT: since EPUB and Mobi/Kindle are XML-based formats, we should be able to do something similar to create e-books using repository APIs. Also, by using appropriately sized images in dissemination formats (the iPad screen is 1024×768px; the iPhone 4 is 960×640px) we can not only ship our users a sensibly sized download but also protect any capital we may have in the master images, without having to resort to ugly tricks like watermarking. (Giving users full-size, high-res images with embedded watermarks seems to me the worst of all worlds.)
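
By way of illustration, applying such a stylesheet on the fly takes only a few lines with the lxml library; the stylesheet and file names here are placeholders rather than the ones from the demonstration I mentioned.

```python
# Sketch: apply an XSLT stylesheet to a repository XML export on the fly.
# Assumes the lxml library; stylesheet and file names are placeholders.
from lxml import etree

transform = etree.XSLT(etree.parse("eprints_to_xhtml.xsl"))   # hypothetical stylesheet
source = etree.parse("eprints_export.xml")                    # hypothetical export file
result = transform(source)

with open("chapter_0001.xhtml", "wb") as f:
    f.write(etree.tostring(result, pretty_print=True,
                           xml_declaration=True, encoding="UTF-8"))
```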

Therefore I’d suggest that, in order to get the best out of a digitisation project, you consider what you would like to see at the end of the project – and, more importantly, what would give your users the best experience, or even win you new users. Ask around and do some tests, with users if possible, to get an idea of how they want to use the materials and how they will get the best out of them. Maybe there are comparable projects and systems that you admire, with features you’d like to be available for your collection. What about in five or ten years’ time: will your current project outputs help or hinder longer-term accessibility goals?

This kind of vision is essential. Without some conception of the end result – how the materials will be used and managed most effectively – all the scanning in the world isn’t going to amount to a successful digitisation project.

[1] Of course manuscripts and ‘difficult’ print formats – early printing typefaces, multilingual objects – may be resistant to OCR. For those we may need specialised solutions or rekeying, as discussed in recent posts on DA Blog (House Of Books (Part 2), Synergies Abound), or the kind of online tool we developed with UCL for Transcribe Bentham.

3 Comments

  1. Following up this post I was interested to hit on the outputs of the EU-funded IMPACT (Improving Access to Text) project: http://www.impact-project.eu/. Outputs include some interesting-looking OCR-related tools and an imminent conference at the British Library.

  2. Jill L Schneider

    The Bill LeFurgy post to which you refer in the first paragraph above was actually a guest post written by Kristen Snawder – let’s give credit where credit is due. :)

  3. Thanks for the update, Jill, and duly amended. (So much for dc:creator!)

  4. Pingback: The next wave of digitization « Henrytapper's Blog
