
House of Books Part 2: OCR and Arabic texts

A glimpse of Petra, Jordan

‘Machine replication of human functions, like reading, is an ancient dream’ *

One of the many topics discussed in the House of Books project in Amman was the issue of OCR and Arabic texts. Optical character recognition, or OCR, has become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. It is now a necessary step in the transition from analogue text to the electronic world, particularly given the quantity of information now available in the electronic age, as it enables rapid searching and scanning. In the last five decades, machine reading of text has grown from a dream to reality.

Software for OCR is now almost 100% successful for Roman scripts. Middle Eastern library content, however, particularly Arabic and other non-Roman language materials, poses special challenges to the creation of digital repositories of Arabic texts. Arabic, being a diacritic language, has many characters (letters) which have exactly the same form and are distinguished only by the position of various dots above, below, or inside the main character block. This poses a special difficulty for OCR, as dots can be ignored by software as speckling or error, or even removed. Most institutions digitising Arabic manuscripts use Sakhr OCR software, but it does not seem to pick up the intricacies of Arabic script. What to do?
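
To make the dot problem concrete, here is a minimal Python sketch. It is my own illustration and has nothing to do with how Sakhr or any other OCR package works internally. It groups a few Arabic letters that share one base shape, then shows that two different words become identical once the dots are lost. The names `SAME_SHAPE_GROUPS`, `strip_dots` and `word` are invented for this example, and the list of groups is deliberately partial.

```python
import unicodedata

# Groups of Arabic letters that share one base shape and differ only by dots.
# Letters are given by Unicode character name; the first name in each group
# is the dotless form, used here as the group's "skeleton".
SAME_SHAPE_GROUPS = [
    ["DOTLESS BEH", "BEH", "TEH", "THEH"],
    ["HAH", "JEEM", "KHAH"],
    ["DAL", "THAL"],
    ["REH", "ZAIN"],
    ["SEEN", "SHEEN"],
    ["SAD", "DAD"],
    ["TAH", "ZAH"],
    ["AIN", "GHAIN"],
]


def _letter(name):
    """Return the Arabic letter with the given Unicode name suffix."""
    return unicodedata.lookup("ARABIC LETTER " + name)


# Map every letter in a group to that group's dotless skeleton.
SKELETON = {
    _letter(name): _letter(group[0])
    for group in SAME_SHAPE_GROUPS
    for name in group
}


def strip_dots(text):
    """Simulate an OCR pass that has discarded the dots as speckle."""
    return "".join(SKELETON.get(ch, ch) for ch in text)


def word(*names):
    """Spell a word from Unicode letter names (first letter first)."""
    return "".join(_letter(name) for name in names)


news = word("KHAH", "BEH", "REH")  # the word for "news"
ink = word("HAH", "BEH", "REH")    # the word for "ink"

print(news != ink)                          # True: different words
print(strip_dots(news) == strip_dots(ink))  # True: same shape once dots go
```

One missing dot is the whole difference between the two words, which is why a despeckling step that treats dots as noise is so damaging for Arabic.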

Some Arabic fonts http://university.arabsbook.com/

It seems that, if prepared well, the Sakhr recognition software package has the capability to recognize generic Arabic fonts (called Naskh or Kūfī) with a fair degree of accuracy. However, the software has to be taught to recognize any peculiarities or unusual characteristics in the font of the scanned volume in question. This is extremely time-consuming and requires technical expertise. The process also takes for granted that the font will be more or less consistent throughout any given volume. In many manuscripts the hand can change, so I imagine the software would need to be re-instructed for each section where the hand or the font changes. In addition, the quality of any OCR depends on the quality of the original scanned file. And not everyone wants to use generic fonts: think of how much we like to personalise our own. Another headache for Sakhr.

Our group in Amman as a whole expressed frustration with Sakhr and really hoped that it could in some way be generally instructed to recognise characters which it consistently fails to pick up. We felt sure that this will be solved soon, and I personally cannot imagine that the military have not got a solution up their sleeve for this, considering the politics of the world these days.

Interestingly, in terms of resource discovery, Google Scholar does not allow searching in Arabic, while it allows searching of both Japanese and Chinese scholarly texts. Surely these are as complex for a piece of OCR software to recognise as Arabic? This means that any texts written in Arabic cannot be accessed, which means that scholarship in Arabic is not being picked up by one of the biggest and widest search engines for scholarly literature. Why such an oversight by Google Scholar? I have contacted them and have yet to find out!

This of course brought home the real need for more collaboration between libraries and archives involved in digitisation projects in the Middle East itself. There are many projects based in North America, such as Ameel, and in the UK, such as SOAS (which our own Repository folk in DART have been working on!), which unify and make available digital resources from the Middle East. There was also an interesting JISC study with the University of Exeter about user requirements for digitised resources in Islamic studies. These of course represent a Western approach to Arabic material, albeit material in their own collections. They are often also concerned with translations of Arabic texts into Greek or Latin, as was the norm.

The issue of OCR and its success rate for non-Roman fonts also raises questions about the power of the digital. If OCR cannot serve one of the great languages, Arabic, how many minority languages which are also very diacritic are not being served well by the OCR software available? The result must be a tip in the balance of available research material in favour of texts in Roman script, and an imbalance in what is being made available online.

Baghdad at night, 2011

There is a need for the countries which created this material to work together on such projects. Many very interesting and topical projects were being proposed in Amman relating to digitisation and to working together to track missing journals, as well as trying to avoid duplicating efforts.

So how to do this? Several libraries attending our workshop in Amman highlighted the necessity of coordinating the effort for Arabic text digitisation in order to avoid duplication, share best practices and develop common standards, indexes and software. To enable this, a decision was made to work on developing new cultural cooperation interventions for digitisation in the Middle East, to fund-raise for this, and to set up groups in a social network (Facebook, LinkedIn) including all the participants from the House of Books project. Importantly, further workshops will be run to encourage this cooperation and hopefully see strides being made in cooperation and digitisation of Arabic texts in the Middle East.

* http://www.nr.no/~eikvil/OCR.pdf

**Thanks to Qaiss Hatef  Saeed of the Iraq National Library and Archives for his help.

The House of Books/Dar El Kataub/دار الكتب والوثائق العراقية Part 1

Newspapers in National Library of Jordan

Baghdad, 2003 -  when Domenico Chirico, Director of Un Ponte Per… first asked various organisations for support and resources for the reconstruction of the Iraq National Library and Archives (INLA) destroyed during the Iraq invasion and occupation, he was met with cries of bemusement and disbelief:

‘Why worry about books and archives when we have lives to save?!’.

Some if not many  struggled to understand that a library in Baghdad could be a priority during such horrific times.  But they had not yet met Dr Saad Eskander, Director of the INLA…but that is another story which has been told.

Cairo, 2011 and the ‘Spring revolution’ is happening in the Middle East. In Egypt, almost as soon as unrest broke out, two public libraries in Cairo were burnt to the ground. More cheeringly, news came of staff and citizens in Alexandria forming a human chain around the Bibliotheca Alexandrina (much supported by Suzanne Mubarak), and of another around the National Museum of Egypt to protect it, though it wasn’t so lucky. These stories, as told via Twitter, cited what had happened in Iraq. Iraq, sadly, is not alone in having suffered such destruction to its cultural property. These attacks (fire, pillage, looting to order and just plain old theft) do more than just destroy a building or some documents. They are attacks on civil society, denying Iraqi people the chance to engage in democracy and access to information as well as to their cultural memory.

Amman, 2011 – In light of all this, it was with great interest that I agreed to participate in ‘The House of Books/ Dar El Kataub’ workshop run by the aforementioned Domenico Chirico’s NGO UPP and UNESCO. Its purpose was to work on challenges facing the digitisation of Arabic texts. I was there to look at the preservation aspect of digitisation, a shock to some as it is often thought that digitisation is itself a preservation strategy! So there was work to be done…

William Kopycki and I leading a session

We had a wide range of people from Iraq, Lebanon, Qatar/Australia, Jordan, Egypt/USA, Italy and the United Kingdom. From a digital archivist’s point of view (and in my view the right one), the projects represented presented us with the gamut of activities which are now present in a digital library or archive, from the development of impressive copyright legislation in Jordan to a start-to-finish overview of a project digitising journals relating to the 19th century Arab cultural renaissance, Al-Nahda. We also heard about training and developing infrastructures for digital object management in Iraq, as well as an overview of the current project in the INLA to reconstitute many of its collections through digitisation. We heard a good deal, of course, about preservation from myself, Giovanni Bergamin from the University of Florence and Maurizio Messina.

My work consisted partly of leading the group in a consolidation of ideas discussed during day one. A very important goal of this workshop was collaboration, and as a result we wanted the group to think individually and in groups of four about why they saw the need to collaborate and then to tell us how they would approach collaboration. We ended up with great points which reinforced the absolute need for collaboration in terms of standardisation, best practice, resource discovery, lobbying and important networking opportunities across the Arabic-speaking world.

I was also there to speak and work with the group a lot about digital preservation and digitisation. This is an area which can often be neglected, as many consider digitisation to be simply about delivering access to materials. In addition, these digital objects are considered digital surrogates and little consideration is given to their preservation, as the analogue copies are available. It is important to consider preservation of these digital surrogates over time at the point of their creation. Do we really want to invest the time and money again in their re-digitisation? I very much doubt it! However, unless consideration is given to their long term sustainability, this is what will happen: data loss or re-digitisation. This is time and money few of course can afford to spend.

Touching on some of the issues we covered in terms of preservation, we looked at planning digitisation projects in light of preservation and their sustainability, and explored the main points to consider for preservation of digitised content, drawing a lot from our very popular report on digitisation and preservation produced by DPC/ULCC and PORTICO. I reinforced the real need for honest sharing, and for sharing failures as well as successes. We have all had them, so we can only progress through acknowledgement of both.

Looking at the 1923 magazine on women’s issues called Layla, we explored in a session what characteristics we would like to preserve over time. Many things which we assume will be kept in the paper-based world have to be planned for in detail in the digital world; not much, if anything, can be left to chance.

Workshop over!

All speakers were extremely interesting individually and collectively, as they gave an outsider like me an overview of the digital library/archives world in this part of the Middle East. Qaiss Hatef Saeed of the INLA spoke about the motivation of the INLA to establish a digital library due to the loss of their holdings in the library during and after the conflict. Much of the library is being rebuilt literally from ashes by means of digital content, and popular holdings need to be digitised to give, as Qaiss said, ‘books a rest’ from handling.

Qaiss also proposed that in 100 years hard copy books will no longer exist and everything will be digital. Fighting talk! But digital libraries will only survive as long as we invest in their sustainability. Digital resources have great power in terms of access, but they are also very vulnerable in terms of long term sustainability. As such, action needs to be taken now to stop bit rot and technology obsolescence being the next threat to access to information which is now increasingly digital.
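
One common first line of defence against bit rot is a fixity check: record a checksum for every file when it is created, then recompute the checksums later and compare. The sketch below is a minimal illustration of that idea using only the Python standard library. It is not a description of what the INLA or any workshop participant actually runs, and the function names (`sha256_of`, `build_manifest`, `changed_files`) are invented for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file under `folder` (by relative path) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files whose current checksum differs from the stored one.

    A file that has gone missing is reported as changed too.
    """
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

In use, the manifest from `build_manifest` would be stored alongside the digitised masters at the point of creation, and `changed_files` run on a schedule. An empty list means every file still matches its original checksum.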

*Note: The House of Books workshop was picked up by 16  newspapers in Jordan and Iraq. These include  articles in the Baghdad press, The Jordan Times and even the Iraq National Congress.

Statistically relevant

From the SHERPA-LEAP blog.

Over the last year or so we’ve installed and configured (in some cases reconfigured) the IRStats package for several of the LEAP repositories, including those hosted by ULCC. It seemed a good moment to share a few thoughts about the process of getting “all statted up” with EPrints.

By default, and without any further action, IRStats provides a kind of smorgasbord control panel, demonstrating the many optional graphs, charts and lists available. You can see an example on our own ULCC Publications repository.

More recently we’ve seen growing demand among repository managers to share data on downloads with both their depositors and users at large. It’s really important for repository managers to select carefully which statistics views they actually want or need to display – we can only suggest things we think might work. Once you’ve decided on the views you want, we can look at the most effective ways to display them: and this is why I’ve been having fun souping up some of the displays already offered by IRStats.

The first display we’ve been working on is the Statistics digest. These are common enough and we’ve used the example of UCL Discovery repository as the basis of work for both SAS-Space and SOAS institutional repository.

The second approach has been to re-style the IRStats “dashboard” view to lay the graphs on top of each other and then use some JavaScript to handle the tabbed navigation. This seemed a more elegant approach than inserting lots of charts in the abstract page itself (as, for example, at ECS EPrints). I’ve used this technique to display statistics for individual eprints for the School of Pharmacy, as well as SAS and SOAS.

IRStats on School of Pharmacy EPrints
The tabbed display of graphs and tables was also combined with a ‘modal box’ display that keeps the height of the page the same (for example on this Abstract page at SOAS). At the bottom of the Abstract page I’ve added a statistics section showing the number of full-text downloads, and a link that displays detailed stats in an overlaid box.

This method doesn’t just work for individual items, but can be used on other datasets too. For example, on SAS-Space we have added it to the bottom of their Collection browse pages, so that at the bottom of each Collection view there is an opportunity to view download statistics for that collection as a whole.

Additionally in SAS-Space, since it is a repository for a number of discrete institutes, there was a requirement for institutional editors to have access to their own institute’s statistics. To achieve this, I allowed access to a constrained version of the IRStats control panel for editor-users who had the appropriate editorial permissions for the institute in question. (Unless you are a SAS-Space editor, you won’t be able to access this.)

Which statistics views to insert as tabs is the decision of the repository manager. Views we’ve used include:

  • Monthly downloads
  • Daily downloads
  • Unique visitors
  • Referrers
  • Search Engines
  • Top 10 items downloaded (only for a Collection, Repository or Division)
  • Top 10 search terms
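
The choice of tabs can be thought of as a small piece of configuration: a list of views, some of which only make sense for certain kinds of page. The Python sketch below illustrates that idea only. It is not IRStats or EPrints configuration (EPrints itself is written in Perl), and the scope names and the `tabs_for` function are invented for this example.

```python
# Each view is paired with the page scopes it applies to.
# None means "show on every kind of page". Scope names are hypothetical.
VIEWS = [
    ("Monthly downloads", None),
    ("Daily downloads", None),
    ("Unique visitors", None),
    ("Referrers", None),
    ("Search Engines", None),
    ("Top 10 items downloaded", {"collection", "repository", "division"}),
    ("Top 10 search terms", None),
]


def tabs_for(scope):
    """Return the view names to show as tabs for a given page scope."""
    return [
        name for name, scopes in VIEWS
        if scopes is None or scope in scopes
    ]


print(tabs_for("eprint"))      # no "Top 10 items downloaded" tab
print(tabs_for("collection"))  # all seven tabs
```

Keeping the list in one place like this means a repository manager's decision about which views to show becomes a one-line change rather than a template edit.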

From a technical point of view, we will have to review these configurations when we upgrade to EPrints version 3.3, possibly later in the year (if it’s released!!), in conjunction with our VM infrastructure migration, and start doing things with EPStats rather than IRStats. But we now have an effective framework for adding statistics quickly to any EPrints installation.

To have and to hold

Susan Corrigall (NRS), Patricia Sleeman & Ed Pinsent (ULCC).

We gave a two-day in-house workshop to the National Archives of Scotland last week. As Ed Pinsent has noted in his post about legal admissibility, our timing was quite interesting; we arrived on the Monday of the week after NAS had merged with the General Register Office to become the National Records of Scotland. Interesting times for all, and while we couldn’t have known in 2010 what a pivotal day it would be for NAS, we hoped our training session on digital preservation would be in some way timely.

The 20 staff attending were very varied in their responsibilities and roles in the NAS/NRS; wisely, however, it was felt that all needed to have an entry-level awareness of the issues relating to digital preservation. NAS has been engaged for some time in the management of its digital resources and is looking at cost-effective ways of doing so. Staff are being skilled up in relation to digital preservation; in this way NAS are ‘maximising their current resource potential’, to use a buzzword.

Working with a group from one organisation in situ is quite different for us, as normally the DPTP is an open course which sees a variety of delegates hailing from many very varied organisations. As a result we cover many topics to give a good grounding in what we feel are the most important aspects of digital preservation, while trying to accommodate varying levels of expertise and knowledge. This type of course is successful in many ways – it can often lift spirits, particularly for people working in a solitary environment, to see how much they have in common in terms of digital preservation and its challenges. The social aspect of the course is important, as people recognise that digital preservation affects all communities involved in any kind of management of digital content in the long term. A lot of what we aspire to do is build confidence in people to tackle the issue.

Working in an organisation that has commissioned the DPTP is quite a different experience. We see the coming together of a group already known to each other to a greater or lesser degree. Very often, however, they come from different sections, and this ‘time out’ together should be of value in many ways: as a collective learning opportunity, as socialisation and, in a time of change, as an opportunity for supporting each other.

We also get the chance during a class project to investigate a particular issue of note for the organisation. This hopefully enables the group to leave with a finished product. In the ‘open to all’ DPTP we want our group to leave with a better understanding of everything, but more precisely to have gone some way towards applying theory to real-life work situations through our classwork. To leave the course with something immediately implementable within the workplace is, we feel, important. Our class session at NAS/NRS remains confidential, but we thoroughly enjoyed it and felt that it translated a lot of what was hitherto theory into an enhancement of an existing practice to enable improved digital object management.

Interesting times for all and I feel perhaps it was timely that NAS are skilling up their staff to embark on digital preservation now.  Digital preservation is a concern for all government departments but who can manage it? Archivists as a profession have a good set of tools and concepts from their profession which translate well to digital preservation.  In fact OAIS tells many archivists nothing new – they are doing most of it already in the analogue world.

Thus archivists hold one major key on our large key set which we need to unlock digital preservation.