‘Machine replication of human functions, like reading, is an ancient dream’ *
One of the many topics discussed in the House of Books project in Amman was the issue of OCR and Arabic texts. Optical character recognition, or OCR, has become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. It is now a necessary step in the transition from analogue text to the electronic world, particularly given the quantity of information now available in the electronic age, because it enables rapid searching and scanning. In the last five decades, machine reading of text has grown from a dream to reality.
Software for OCR is now almost 100% successful for Roman scripts. Middle Eastern library content, however, particularly Arabic and other non-Roman language materials, poses special challenges to the creation of digital repositories of Arabic texts. Arabic relies heavily on diacritical dots: many characters (letters) have exactly the same form and are distinguished only by the number and position of dots above, below, or inside the main character block. This poses a special difficulty for OCR, as dots can be treated by the software as speckling or error and ignored, or even removed. Most institutions digitising Arabic manuscripts use Sakhr OCR software, but it does not seem to pick up the intricacies of Arabic script. What to do?
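To make the dot problem concrete, here is a minimal Python sketch of what happens when dots are lost. It is only an illustration under stated assumptions: the `DOTLESS_BASE` table and the `strip_dots` function are hypothetical names invented for this example, the table covers only a few letter groups that share a base shape in every position, and none of it reflects how Sakhr or any other OCR package works internally.

```python
# Hypothetical table mapping dotted Arabic letters to a dotless letter with
# the same base shape. Illustration only: it covers just the groups whose
# members look alike in every position, not the whole script.
DOTLESS_BASE = {
    "\u0628": "\u066E",  # BEH   -> DOTLESS BEH
    "\u062A": "\u066E",  # TEH   -> DOTLESS BEH
    "\u062B": "\u066E",  # THEH  -> DOTLESS BEH
    "\u062C": "\u062D",  # JEEM  -> HAH
    "\u062E": "\u062D",  # KHAH  -> HAH
    "\u0630": "\u062F",  # THAL  -> DAL
    "\u0632": "\u0631",  # ZAIN  -> REH
    "\u0634": "\u0633",  # SHEEN -> SEEN
    "\u0636": "\u0635",  # DAD   -> SAD
    "\u0638": "\u0637",  # ZAH   -> TAH
    "\u063A": "\u0639",  # GHAIN -> AIN
}


def strip_dots(word: str) -> str:
    """Simulate an OCR pass that discards dots as if they were speckle."""
    return "".join(DOTLESS_BASE.get(ch, ch) for ch in word)


arab = "\u0639\u0631\u0628"  # AIN + REH + BEH, "Arabs"
west = "\u063A\u0631\u0628"  # GHAIN + REH + BEH, "west"

# The two words differ by a single dot, so they collide once dots are lost.
print(arab != west)                          # distinct words in the source
print(strip_dots(arab) == strip_dots(west))  # same shape after "despeckling"
```

Once the dot on the first letter is discarded, the two words can no longer be told apart by shape alone, so any cleaning step that removes small marks needs to be applied very cautiously to Arabic scans.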


It seems that, if prepared well, the Sakhr recognition software package can recognise generic Arabic script styles (such as Naskh or Kūfī) with a fair degree of accuracy. However, the software has to be taught to recognise any peculiarities or unusual characteristics in the font of the scanned volume in question. This is extremely time consuming and requires technical expertise. Such a process also takes for granted that the font will be more or less consistent throughout any given volume; in many manuscripts the hand changes, so I imagine the software would need to be retrained for each section where the hand or the font changes. In addition, the quality of any OCR depends on the quality of the original scanned file. Nor does everyone want to use generic fonts: think of how much we like to personalise our own. Another headache for Sakhr.
Our group in Amman as a whole expressed frustration with Sakhr and really hoped that it could in some way be taught to recognise the characters which it consistently fails to pick up. We felt sure that the problem will be solved soon, and I personally cannot imagine that the military have not got a solution up their sleeve, considering the politics of the world these days.
Interestingly, in terms of resource discovery, Google Scholar does not allow searching in Arabic, while it does allow searching of both Japanese and Chinese scholarly texts. Surely these are as complex for a piece of OCR software to recognise as Arabic? This means that texts written in Arabic cannot be found there, and so scholarship in Arabic is not being picked up by one of the biggest and widest search engines for scholarly literature. Why such an oversight by Google Scholar? I have contacted them and have yet to find out!
This of course brought home the real need for more collaboration between libraries and archives involved in digitisation projects in the Middle East itself. There are many projects based in North America, such as Ameel, and in the UK, such as SOAS (which our own Repository folk in DART have been working on!), which unify and make available digital resources from the Middle East. There was also an interesting JISC study with the University of Exeter about user requirements for digitised resources in Islamic studies. These of course represent a western approach to Arabic material, albeit material in their own collections. They are also often concerned with translations of Arabic texts into Greek or Latin, as was the norm.
The issue of OCR and its success rate for non-Roman scripts also raises questions about the power of the digital. If OCR cannot serve Arabic, one of the great languages, how many minority languages which also rely heavily on diacritics are not being served well by the OCR software available? The result must be a tilt in the balance of available research material in favour of texts in Roman script, and an imbalance in what is being made available online.
There is a need for the countries which created this material to work together on such projects. Many very interesting and topical projects were being proposed in Amman, relating to digitisation and to working together to track missing journals, as well as to avoiding duplication of effort.
So how to do this? Several libraries attending our workshop in Amman highlighted the need to coordinate efforts to digitise Arabic texts in order to avoid duplication, share best practices, and develop common standards, indexes and software. To enable this, a decision was made to work on developing new cultural cooperation interventions for digitisation in the Middle East, to fund-raise for this, and to set up groups on social networks (Facebook, LinkedIn) including all the participants from the House of Books project. Importantly, further workshops will be run to encourage this cooperation, and hopefully we will see strides being made in cooperation on and digitisation of Arabic texts in the Middle East.
* http://www.nr.no/~eikvil/OCR.pdf
**Thanks to Qaiss Hatef Saeed of the Iraq National Library and Archives for his help.
















