Digitisation

Word Cloud 2012!

Lots of updates in the pipeline for the coming weeks, and some spring cleaning for DA Blog, but for now here’s a Wordle Word Cloud from a recent report on our activities.

ULCC Digital Archives & Repositories Word Cloud 2012


The House of Books: Manuscripts and religious identity in Iraq

Father Najeeb Michaeel examines a manuscript

Father Najeeb Michaeel is an Iraqi Christian priest who speaks Arabic, English, French, Aramaic and Syriac, not to mention being able to read Latin and Greek. In the garden of Zaytun library, Erbil, I hear this gentle man tell me how his community of friars used to live in Mosul, a traditional centre for Christianity in Iraq, which has the highest proportion of Assyrian Christians of all the Iraqi cities. Father Najeeb’s community has had to leave Mosul due to persecution. Later on, during The House of Books workshop, he gives us a presentation on the magnificent early Christian manuscripts they are digitising. Over coffee he gives us a moving rendition of the ‘Our Father’ sung in Aramaic. I wasn’t expecting to feel so moved by a religion I have become increasingly frustrated by, least of all in Iraq.

Early Christian manuscript, Centre Numerique des Manuscrits Orientaux, Mosul, Iraq.

Iraq has often been compared to a mosaic for its religious diversity. Iraq is a Shia majority country and contains the sacred Shia cities of Najaf and Karbala. Most sources estimate that around 65% of Iraqis follow Shia Islam, and around 35% follow Sunni Islam. What is not so well known is that Christians have inhabited what is modern day Iraq for about 2,000 years. The person traditionally held responsible for bringing Christianity to Iraq is St Thomas the Apostle. Assyrians (also called Syriacs and Chaldeans), most of whom are adherents of the Chaldean Catholic Church, the Syriac Orthodox Church and the Assyrian Church of the East, account for most of Iraq’s Christian population, along with Armenians. Tariq Aziz was born to an Assyrian family and is a member of the Chaldean Catholic Church. There are also small populations of Shabaks and Yarsan; Mandaeans and their sub-sects, who are Gnostics; and Yezidis, who believe in a god but have a blue peacock angel in their pantheon. The Iraqi Jewish community, numbering around 150,000 in 1941, has almost entirely left the country. And of course there are the Zoroastrians, whose religion the ancient Babylonians followed.


Scanning is different from digitisation

If you haven’t seen it, can I recommend Kristen Snawder’s recent post on the Library of Congress Digital Preservation blog, ‘Digitization is different than digital preservation’? Kristen reiterates familiar points about the long-term commitment necessary for serious digital preservation, contrasted with the quick hit of a scanning project. “In the hurry to meet user expectations, institutions may scan large quantities of materials without having a solid plan for preserving the digital images into the future.”

However, another recent find on the Web compels me to make an additional point, namely that we might do equally well to differentiate between scanning and digitisation. Anyone can set to work with a scanner and create a bunch of digital images – but that barely scratches the surface of what I think we should be expecting of a digitisation project in 2011.

First and foremost, we need metadata: the more the merrier, but something at least, even if we expect to come back later and polish it up (once the images can be browsed and examined on screen). In the absence of any established metadata profiles for a project, at least try to cover as many Dublin Core elements as possible – title, creator, date, subject/keywords… Images, in particular, may prove tricky or time-consuming to find again, especially once there are thousands of them on a disk. We should probably keep the metadata in a database, and perhaps additionally store metadata with the objects. This can be done as XML or plain text files stored alongside the digital images, or by embedding it in the files we create (many common file formats – TIFF, JPEG, MPEG, PDF – support metadata embedding, and there are many free tools available to help).
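To make the "XML stored alongside the digital images" option concrete, here is a minimal sketch in Python. It is only one possible approach, not a recommended profile: the function names (`build_dc_record`, `write_sidecar`), the plain `<metadata>` wrapper element, the `.xml` sidecar naming and the sample values are all assumptions made for illustration. The only fixed point is the standard Dublin Core element namespace.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard namespace for the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return a simple Dublin Core record as an XML string.

    `fields` maps element names (title, creator, date, subject...)
    to a single value or a list of values. The <metadata> wrapper
    element is an arbitrary choice for this sketch.
    """
    root = ET.Element("metadata")
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            # One repeated element per value, e.g. several dc:subject entries.
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = str(item)
    return ET.tostring(root, encoding="unicode")


def write_sidecar(image_path, fields):
    """Write the record next to the image, e.g. scan_0001.tif.xml."""
    sidecar = Path(str(image_path) + ".xml")
    sidecar.write_text(build_dc_record(fields), encoding="utf-8")
    return sidecar


if __name__ == "__main__":
    # Hypothetical example values, not taken from any real collection.
    record = {
        "title": "Letter to a friend",
        "creator": "Unknown",
        "date": "1850",
        "subject": ["correspondence", "manuscripts"],
    }
    print(build_dc_record(record))
```

Appending `.xml` to the full file name (rather than replacing the extension) keeps the sidecars distinct if the same scan exists as both a TIFF master and a JPEG derivative.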

There is yet more, though, that we should be doing, particularly when we are scanning text-based objects (articles, books, magazines, reports, etc). Most importantly, we really should try and extract the text from the image if possible. [1]

My recent web find was the teaching blog of Dr Toine Bogers at the Royal School of Library and Information Science (RSLIS) in Copenhagen, Denmark. One fascinating post describes a Lab Session exercise, From OCR To NER, a set of comparatively simple command-line processes to get the most out of a scanned-text project.


Transcribing Bentham

Jeremy Bentham, Bloomsbury WC1 by Ewan-M on Flickr (CC:BY)

Did I mention that we are very excited to be contributing to UCL’s Bentham Transcription Initiative? This is an AHRC-funded project to complete the digitisation of the manuscripts of 18th Century philosopher Jeremy Bentham, and transcribe them using a wiki-based collaborative approach. It is being run by the Bentham Project at UCL, with support from ourselves and UCL’s newly-launched Centre for Digital Humanities. You can read an overview of the project on Melissa Terras’s blog.

Obviously, transcription of manuscript materials is an important digitisation activity that can rarely, if ever, be left to computers, in the way that printed texts can be, using OCR. But it’s painstaking and laborious work, and anything that eases the burden is welcome.

The project is already throwing up some very interesting conversations about transcription. At ULCC we have thought about transcription before, particularly with regard to our ongoing work for the Linnean Society archives, and we hope that there will yet be synergies to exploit. It is a great feeling to be so closely involved with disseminating the work of two such seminal figures as Linnaeus and Bentham.

We’re not naïve enough to think that collaborative web-based transcription is new, but we’ve yet to find any substantial comparable examples. A comment on UCL’s Digital Humanities blog teases us with the prospect of information about other similar projects, but fails to provide even a single link or hint, so is effectively useless: hardly in the collaborative spirit! A more useful lead was Joanne Evans’ link to the National Library of Australia’s Australian Newspapers project, which is crowdsourcing the proof-reading and correcting of OCR outputs, and has an impressive-looking site – I’m sure we’ll be borrowing some ideas from there.

Another useful lead has come from Ben Brumfield of Austin, Texas, directing us to his blog about collaborative manuscript transcription, which has been going even longer than DA Blog and looks like it’s going to make interesting reading. Ben’s recent blog post about a distributed transcription exercise of the US Geological Survey’s Bird Phenology Program includes a link to a training video for volunteers (it even sounds like it’s been recorded in a birdhouse). In the video we can see a database-form approach to transcription, which is particularly appropriate for transcribing data already entered on structured forms.

For more heterogeneous and free-form texts, such as the Bentham manuscripts, wikis seem to me much more appropriate, being in essence discrete hypertext engines. As for collaborative features, MediaWiki in particular has strong and proven features: there can be few better advertisements for effective virtual, global collaboration and crowdsourcing than Wikipedia.

One thing that is particularly compelling about the BPP video is that it is an excellent example of a thorough approach to online collaboration, giving clear and unequivocal guidance to contributors. Now that screencast tools are so readily available, it’s clear that for many activities like this, video-based instruction is the ideal tool, and often preferable to any number of written instructions. No less than for online teaching and learning environments, the need for effective induction and inclusive management of the online community must never be overlooked.

Farewell ‘TASI’, Hello ‘JISC Digital Media’

Photo by Chad Miller


On 5 March I attended the London launch of the rebranding of ‘TASI’ as ‘JISC Digital Media’. Tables were decked with everything from canapés and wine to a variety of AV and photographic media (on separate tables, of course!). Although the former ‘TASI’ was always a JISC-funded venture, that connection is now self-evident in its new name.

As of August this year, JISC Digital Media will become part of a consortium of JISC advisory services that aim to provide joined-up solutions for clients. Other aligned services include JISC InfoNet, JISC TechDis, JISC Legal Information, Procureweb and JISC Netskills.

JISC Digital Media’s official brief is “to ensure that digital media resources being created, used and managed within the further and higher education community meet the teaching, learning and research needs of individuals and institutions within the UK.” The recently expanded service now also provides expertise in moving images and sound. (In fact, as I blog, a couple of members of our very own Digitisation team are attending their new training course on Audio Production.)

Speakers at the launch touched upon some specific aspirations for the Service, and a few points of interest stood out:

  • JISC Digital Media are keen for the HE and FE sector to use the JISC Digital Media blog and share expertise across the sector;
  • They would like to adopt more web 2.0 technologies, for example Skype-based e-learning that could support some aspects of practical training;
  • Their emphasis will be on helping the HE/FE sector to use images for teaching;
  • They recognise that more must be done to help the FE sector.

Of course, with any newly rebranded organisation comes a new-look website: http://www.jiscdigitalmedia.ac.uk/ It has nice bold colours and is user-friendly too! …So farewell, dear ‘TASI’ [now a dirty word that incurs a fine if spoken out loud by its own staff], and ‘hello’ to the new and improved Advisory Service: ‘JISC Digital Media’.