Training

House of Books Part 2: OCR and Arabic texts

A glimpse of Petra, Jordan

‘Machine replication of human functions, like reading, is an ancient dream’ *

One of the many topics discussed in the House of Books project in Amman was the issue of OCR and Arabic texts. Optical character recognition, or OCR, has become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. It is now a necessary step in the transition from analogue text to the electronic world, particularly given the quantity of information now available in the electronic age, as it enables rapid searching and scanning. In the last five decades, machine reading of text has grown from a dream to reality.

Software for OCR is now almost 100% successful for Roman scripts. Middle Eastern library content, however, particularly Arabic and other non-Roman language materials, poses special challenges to the creation of digital repositories of Arabic texts. Arabic, being a diacritic language, has many characters (letters) which have exactly the same basic form and are distinguished only by the position of various dots above, below, or inside the main character block. This poses special difficulty for OCR, as dots can be ignored by software as speckling or error, or even removed. Most institutions digitising Arabic manuscripts use Sakhr OCR software, but it does not seem to pick up the intricacies of Arabic script. What to do?
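To see why dots are so easily lost, here is a minimal sketch in Python of a generic "despeckle" clean-up step. It is not Sakhr's algorithm, which I have not seen; the tiny glyph, the function names and the `min_size` threshold are all invented for illustration. The clean-up removes any blob of ink smaller than a threshold, which is a sensible way to remove scanner noise and also removes the dots that tell one Arabic letter from another.

```python
from collections import deque

# A made-up 8x5 "scan": a bowl-shaped base stroke with two dots above it.
# '#' is ink, '.' is background. This is an illustration, not a real glyph image.
GLYPH = [
    "..#.#...",
    "........",
    "#......#",
    "#......#",
    "########",
]


def connected_components(grid):
    """Group ink pixels into blobs using 4-connectivity (up, down, left, right)."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            queue = deque([(r, c)])
            seen.add((r, c))
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(blob)
    return components


def despeckle(grid, min_size):
    """Erase every blob with fewer than min_size pixels (a naive noise filter)."""
    keep = set()
    for blob in connected_components(grid):
        if len(blob) >= min_size:
            keep.update(blob)
    return [
        "".join("#" if (r, c) in keep else "." for c in range(len(row)))
        for r, row in enumerate(grid)
    ]


if __name__ == "__main__":
    for line in despeckle(GLYPH, min_size=3):
        print(line)
```

With `min_size=3` the base stroke survives but both dots vanish, so letters that share that base shape become indistinguishable to whatever recognition step comes next. A filter tuned for Arabic would need to treat small marks near a stroke as meaningful rather than as noise.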

Some Arabic fonts http://university.arabsbook.com/

It seems that, if prepared well, the Sakhr recognition software package has the capability to recognise generic Arabic fonts (called Naskh or Kūfī) with a fair degree of accuracy. However, the software has to be taught to recognise any peculiarities or unusual characteristics in the font of the scanned volume in question. This is extremely time-consuming and requires technical expertise. It is also taken for granted in such a process that the font will be more or less consistent throughout any given volume; in many cases the hand can change within a manuscript, so I imagine the software would need to be reinstructed for each section where the hand or the font changes. In addition, the quality of any OCR depends on the quality of the original scanned file. Nor does everyone want to use generic fonts: think of how much we like to personalise our own! Another headache for Sakhr.

Our group in Amman as a whole expressed frustration with Sakhr and really hoped that it could in some way be generally instructed to recognise characters which it consistently fails to pick up. We felt sure that this will be solved soon, and I personally cannot imagine that the military have not got a solution up their sleeve, considering the politics of the world these days.

Interestingly, in terms of resource discovery, Google Scholar does not allow searching in Arabic, while it allows searching of both Japanese and Chinese scholarly texts. Surely these are as complex for a piece of OCR software to recognise as Arabic? This means that texts written in Arabic cannot be found there, which means that scholarship in Arabic is not being picked up by one of the biggest and widest search engines for scholarly literature. Why such an oversight by Google Scholar? I have contacted them and have yet to find out!

This of course brought home the real need for more collaboration between libraries and archives involved in digitisation projects in the Middle East itself. There are many projects based in North America, such as Ameel, and in the UK, such as SOAS (which our own Repository folk in DART have been working on!), which unify and make available digital resources from the Middle East. There was also an interesting JISC study with the University of Exeter about user requirements for digitised resources in Islamic studies. These are of course a western approach to Arabic material, albeit in their own collections. Such work is often also concerned with translations of Arabic texts into Greek or Latin, as was the norm.

The issue of OCR and its success rate for non-Roman fonts also raises questions about the power of the digital. If OCR cannot serve one of the great languages, Arabic, how many minority languages which are also very diacritic are not being served well by the OCR software available? The result of this must be a tip in the balance of available research material in favour of texts in Roman script, and an imbalance in what is being made available online.

Baghdad at night, 2011

There is a need for the countries which created this material to work together on such projects. Many very interesting and topical projects were being proposed in Amman relating to digitisation, to working together to track missing journals, and to avoiding duplication of effort.

So how to do this? Several libraries attending our workshop in Amman highlighted the need to coordinate the effort for Arabic text digitisation in order to avoid duplication, share best practices and develop common standards, indexes and software. To enable this, a decision was made to work on developing new cultural cooperation interventions for digitisation in the Middle East, to fund-raise for this, and to set up groups on social networks (Facebook, LinkedIn) including all the participants from the House of Books project. Importantly, further workshops will be run to encourage this cooperation and hopefully see strides being made in cooperation and digitisation of Arabic texts in the Middle East.

* http://www.nr.no/~eikvil/OCR.pdf

**Thanks to Qaiss Hatef  Saeed of the Iraq National Library and Archives for his help.

The House of Books/Dar El Kataub/دار الكتب والوثائق العراقية Part 1

Newspapers in National Library of Jordan

Baghdad, 2003 – when Domenico Chirico, Director of Un Ponte Per…, first asked various organisations for support and resources for the reconstruction of the Iraq National Library and Archives (INLA), destroyed during the Iraq invasion and occupation, he was met with cries of bemusement and disbelief:

‘Why worry about books and archives when we have lives to save?!’.

Some, if not many, struggled to understand that a library in Baghdad could be a priority during such horrific times. But they had not yet met Dr Saad Eskander, Director of the INLA… but that is another story which has been told.

Cairo, 2011, and the ‘Spring revolution’ is happening in the Middle East. In Egypt, almost as soon as unrest broke out, two public libraries in Cairo were burnt to the ground. Cheeringly, however, news came from Alexandria of staff and citizens forming a human chain around the Bibliotheca Alexandrina (much supported by Suzanne Mubarak) and another around the National Museum of Egypt to protect it, though it wasn’t so lucky. These stories, as told via Twitter, cited what had happened in Iraq. Iraq, sadly, is not alone in having suffered such destruction to its cultural property. These attacks (fire, pillage, looting to order and just plain old theft) do more than just destroy a building or some documents. They are attacks on civil society, denying the Iraqi people the means to engage in democracy and access to information as well as to their cultural memory.

Amman, 2011 – In light of all this, it was with great interest that I agreed to participate in ‘The House of Books/Dar El Kataub’ workshop run by the aforementioned Domenico Chirico’s NGO UPP and UNESCO. Its purpose was to work on challenges facing the digitisation of Arabic texts. I was there to look at the preservation aspect of digitisation, a shock to some, as it is often thought that digitisation is itself a preservation strategy! So there was work to be done…

William Kopycki and I leading a session

We had a wide range of people from Iraq, Lebanon, Qatar/Australia, Jordan, Egypt/USA, Italy and the United Kingdom. From a digital archivist’s point of view (and in my view the right one), the projects represented presented us with the gamut of activities which are now present in a digital library or archive, from the development of impressive copyright legislation in Jordan to a start-to-finish overview of a project digitising journals relating to the 19th century Arab cultural renaissance, Al-Nahda. We also heard about training and developing infrastructures for digital object management in Iraq, as well as an overview of the current project in the INLA to reconstitute many of its collections through digitisation. We heard a good deal, of course, about preservation from myself, from Giovanni Bergamin from the University of Florence, and from Maurizio Messina.

My work consisted partly of leading the group in a consolidation of ideas discussed during day one. A very important goal of this workshop was collaboration, and so we wanted the group to think individually and in groups of four about why they saw the need to collaborate and then to tell us how they would approach collaboration. We ended up with great points which reinforced the absolute need for collaboration in terms of standardisation, best practice, resource discovery, lobbying and important networking opportunities across the Arabic-speaking world.

I was also there to speak to and work with the group a lot about digital preservation and digitisation. This is an area which can often be neglected, as many consider digitisation to be simply about delivering access to materials. In addition, these digital objects are considered digital surrogates, and little consideration is given to their preservation as the analogue copies are available. It is important to consider preservation of these digital surrogates over time at the point of their creation. Do we really want to invest the time and money again in their re-digitisation? I very much doubt it! However, unless consideration is given to their long-term sustainability, this is what will happen: data loss or re-digitisation. This is time and money few, of course, can afford to spend.

To touch on some of the issues we covered in terms of preservation: we looked at planning digitisation projects in light of preservation and their sustainability, and explored the main points to consider for preservation of digitised content, drawing a lot from our very popular report on digitisation and preservation produced by DPC/ULCC and PORTICO. I reinforced the real need for honest sharing, and for sharing failures as well as successes. We have all had them, so we can only progress through acknowledgement of both.

Looking at the 1923 magazine on women’s issues called Layla, we explored in a session what characteristics we would like to preserve over time. Many things which we assume will be kept in the paper-based world have to be planned for in detail in the digital world; not much, if anything, can be left to chance.

Workshop over!

All speakers were extremely interesting individually and collectively, as they gave an outsider like me an overview of the digital library/archives world in this part of the Middle East. Qaiss Hatef Saeed of the INLA spoke about the motivation of the INLA to establish a digital library due to the loss of their holdings in the library during and after the conflict. Much of the library is being rebuilt literally from ashes by means of digital content, and popular holdings need to be digitised to give, as Qaiss said, ‘books a rest’ from handling.

Qaiss also proposed that in 100 years hard copy books will no longer exist and everything will be digital. Fighting talk! But digital libraries will only survive as long as we invest in their sustainability. Digital resources have great power in terms of access, but they are also very vulnerable in terms of long-term sustainability. As such, action needs to be taken now to stop bit rot and technology obsolescence being the next threat to access to information which is now increasingly digital.

*Note: The House of Books workshop was picked up by 16 newspapers in Jordan and Iraq. These include articles in the Baghdad press, The Jordan Times and even the Iraq National Congress.

DPTP at the NAS – legal admissibility


We recently gave a two-day version of the Digital Preservation Training Programme to the National Archives of Scotland. Our timing was quite interesting; we arrived on the Monday the week after NAS had merged with the General Register Office, to become a new body called the National Records of Scotland. And just days before, the Public Records (Scotland) Bill was amended to strengthen its powers for the preservation of digital records. I was personally very encouraged to read what the Minister for Culture had to say about the latter, as it confirms what I’ve always believed about digital preservation; it has a lot of common ground with traditional archives and records management.

Our brief was to introduce digital preservation topics to traditional archivists. Among other things I was asked to deliver a module on BIP0008 and the legal admissibility of electronic records. In preparing this I discovered that the standard is very comprehensive, requiring written policies for classes of records that are in scope, a stated “duty of care” and written procedures for what staff should be doing, an extremely meticulous and well-documented methodology for the production of scanned versions of records, plus a reliable IT framework in which this can work. And of course, an audit trail marked with date and timestamps generated at every possible link in the chain of custody. That’s a lot of boxes to tick, but the payoff is a scanned document (or born-digital record) which is regarded in the eyes of the law as an authentic unaltered copy, hence legally admissible.

My personal take on the standard is that it goes a long way to satisfying the requirements of auditors who seem to take legal admissibility to rather extreme lengths; to my mind, any good EDRMS or records management system ought to be providing enough audit trails to keep them happy. Nonetheless the archivists in Scotland seem to have a requirement not only to observe this standard, but also to ensure they continue the chain of custody from current records management into archival storage. In other words, legally admissible records must also be legally admissible archives; and any preservation actions performed on these objects while in digital custody must not compromise that authenticity.

This may also have been reflected in their IT manager’s interest in use of the checksum and file format validation tools which can be used in the repository; he seemed to be wondering if such tools could verify authenticity, perhaps by indicating whether a file had been tampered with at any stage between leaving the EDRM system and entering the archival repository. It’s an interesting line of thought, and an approach that would probably involve a heightened degree of audit trailing for any organisation that wanted to work this way. My off-the-cuff contribution to this discussion on the day involved something about providing evidence that “best effort” had been made, but I suppose the real proof would be in a court of law and a legal precedent.
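As a rough illustration of the idea the IT manager was exploring, here is a minimal sketch in Python that compares checksums recorded when records leave a records system with checksums recomputed when they arrive in the repository. The function names, the record names and the choice of SHA-256 are my own assumptions, not anything NAS uses. Note what it can and cannot show: a matching checksum says the bits have not changed since the first manifest was made, and nothing about what happened before that point.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(records: dict) -> dict:
    """Map each record name to the checksum of its content."""
    return {name: sha256_hex(content) for name, content in records.items()}


def compare_manifests(exported: dict, ingested: dict) -> dict:
    """Report which records arrived unchanged, altered, missing or unexpected.

    'exported' is the manifest made when records left the source system;
    'ingested' is the manifest recomputed on arrival in the repository.
    """
    report = {"unchanged": [], "altered": [], "missing": [], "unexpected": []}
    for name in sorted(exported):
        if name not in ingested:
            report["missing"].append(name)
        elif ingested[name] == exported[name]:
            report["unchanged"].append(name)
        else:
            report["altered"].append(name)
    report["unexpected"] = sorted(set(ingested) - set(exported))
    return report


if __name__ == "__main__":
    # Hypothetical records, held in memory to keep the sketch self-contained.
    at_export = build_manifest({"minutes.pdf": b"minutes", "contract.pdf": b"contract"})
    at_ingest = build_manifest({"minutes.pdf": b"minutes", "contract.pdf": b"contract v2"})
    print(compare_manifests(at_export, at_ingest))
```

In practice the report itself, with a date and timestamp, would become one more entry in the audit trail, which is where the "heightened degree of audit trailing" comes in.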

Getting Started in Digital Preservation

Last week Patricia and I attended this event organised by the DPC at the Wellcome Collection Conference Centre on Digital Preservation. There was a good mixture of attendees which showed us digital preservation is a priority.

William Kilbride from the DPC asked the audience what their main concerns were, and some of the answers were: obsolescence and migration issues; partners links; storage requirements; business cases and funding development. He explained the challenges of preserving digital data, the value and opportunities that preservation creates, and the key approaches to achieving digital preservation (migration, emulation, hardware preservation and exhumation) as well as the risk management approach.

Our own Patricia Sleeman from ULCC explained clearly and with a very interesting example how to use the OAIS model for preserving our personal archive of photos, giving us some light on how to start and how to avoid losing crucial data.

She described the OAIS model as a tool that develops consensus across different sectors by providing a shared vocabulary, bringing everybody together. We were shown how an information package (digital object, metadata, and packaging information which relates the digital object and metadata) travels through the OAIS model, using a photographic archive example for the SIP, AIP and DIP stages. We had the opportunity to see other models from Portico and DCC.

Bram Van Der Werf from Open PLANETS Foundation presented and demonstrated the Plato tool. He raised the need for more technical background connected to archival training paths. He welcomed the attendees to participate in the content community at Open Planets Foundation.

Caroline Peach from BLPAC gave us the opportunity to use a preservation plan. We spent time with a working example to identify the plan, its status and triggers, description of the institutional setting and the collection, requirements for preservation, evidence of decision for a preservation strategy, costs, roles and responsibilities. We had to think about what is important about the digital access we want to preserve. We used the PLATO tool to assess the preservation plan and its collection.

Ed Fay from LSE explained in detail how they came to preserve the born-digital and digitised collections, and how they maintain the Institutional Repository. Initially, they carried out user requirements and security analysis to establish where they were in the digital preservation process. They had to investigate all the formats and their backups at the LSE data centre, and he praised the robust service and infrastructure from the LSE Information Services Department to maintain their secure digital library. He briefly explained the use of checksum creation and verification with MD5. In conclusion, we saw that the LSE Digital Library is flexible, extensible and modular. They have a transparent process for decision making, so if there are any changes in the technical infrastructure, everything is well documented, and there is excellent engagement throughout the institution for the purpose of digital preservation.
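For anyone who has not seen checksum creation and verification in practice, here is a minimal sketch in Python. It is not LSE's implementation, which was not shown in detail; the function names and the chunk size are my own choices. It creates an MD5 checksum for a file, reading it in pieces so that large files do not need to fit in memory, and later verifies the file against the stored value.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Create an MD5 checksum for a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_md5):
    """Return True if the file still matches the checksum recorded earlier."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Demonstration on a throwaway file: record a checksum, then change one byte.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "photo.tif")
        with open(path, "wb") as handle:
            handle.write(b"pretend this is image data")
        recorded = md5_of_file(path)
        print("intact:", verify_file(path, recorded))
        with open(path, "ab") as handle:
            handle.write(b"!")
        print("after change:", verify_file(path, recorded))
```

MD5 is fine for spotting accidental corruption such as bit rot. Where deliberate tampering is the concern, a stronger algorithm such as SHA-256 is the usual choice.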

Other Resources and Training:
DPC events
AIDA
DPTP
PRONOM
National Library of New Zealand metadata extraction tool
BagIt
JHOVE
DPC
PADI
Library of Congress
Gloucestershire Archives
Archivematica
UKOLN
ULCC
Repositories Support Project
#starting_dp at Twitter

Digital natives of Hackney, we salute you.

Know your rights, the future is unwritten

My 6 year old son brought back from school a booklet about the Convention on the Rights of the Child. As a school councillor he is expected to have a copy on his person, or so I believe, in case anyone might need to check out their rights in the playground or corridor! It is a neat and accessible booklet explaining the many rights which we often take for granted here in the UK but which are absent for many children around the world. I am constantly being reminded of the right to play!

I had heard a lot about it from my son but never actually read it. When I did do so, the archivist in me was drawn to Articles 7 & 8.  Article 7 states ‘every child has a right to a birth certificate’ in the child friendly version, and states in the actual convention that:

‘1. The child shall be registered immediately after birth and shall have the right from birth to a name, the right to acquire a nationality and, as far as possible, the right to know and be cared for by his or her parents.’

Article 8 reinforces this notion of identity:

‘1. States Parties undertake to respect the right of the child to preserve his or her identity, including nationality, name and family relations as recognized by law without unlawful interference.

2. Where a child is illegally deprived of some or all of the elements of his or her identity, States Parties shall provide appropriate assistance and protection, with a view to re-establishing speedily his or her identity.’

The importance of a birth certificate

Interesting, yet blindingly obvious to most of us reading this blog. It is wonderful to live in a society where it would never occur to us that we couldn’t have a birth certificate or, if we lost it, that we couldn’t get a replacement. So many people around the world do not live with this certainty. Despite 191 countries ratifying the Convention, the births of millions of children worldwide go unregistered. Birth registration opens the door for children and adults to rights which many other human beings take for granted: to prove their age; to prove their nationality; to receive healthcare; to go to school; to take exams; to be adopted; to protection from under-age military service or conscription; to marry; to open a bank account; to hold a driving licence; to obtain a passport; to inherit money or property; and to vote or stand for elected office.

If there was any doubt about this, this map of the percentage of children under 5 whose births are registered demonstrates that this basic right is not met in many countries around the world. Questions of birth certification integrity have arisen in recent times in the ‘developed’ world, just think about how much controversy Obama’s birth cert has caused!

The importance of records for our identity

Barack Obama birth cert

Records, on the whole, including records of births, have lived in discreet repositories most of their lives, and it is usually their absence or loss during wars and conflicts that brings them to most people’s attention. During most recent conflicts one of the first targets will be a library and archive or a museum. The destruction of archives occurred in Bosnia and Iraq during conflicts there. This has been termed ‘cultural genocide’. I would go further: if I were to use such language I would call it ‘identicide’, the destruction of people’s identity. Now that this vital data is increasingly, if not primarily, electronic, we are looking at increased risks and vulnerabilities of data loss or corruption – not by totalitarian regimes or warring ethnic conflicts but by that single enemy known as benign neglect, due partly to hardware and software obsolescence and partly to our own neglect or inaction.

Bringing it all together

Musing on all this I landed in Millfields school, thanks to Jonathan Everingham, the teacher in charge of history at Millfields.  He asked me to speak to a group of  children at the school in year 6 about ‘My Job’.  I never thought I would speak to fifteen 9-10 year olds about data and data creation.  With children I usually steer clear of this and show them ‘fun’ documents such as old alphabets or old archive footage of children their age – fun fun fun. I really wondered how they would respond. I knew from Jonathan they were studying Ancient Egypt and knew they were doing some simple data gathering and making graphs. Fun! I really wanted to communicate to them the importance of primary sources whether it be electronic or analogue but I was going to focus on data this time. But how to make data fun? I decided to combine the two topics they were studying – Egypt and data gathering – and weave them into my talk.

A quick visit to the time of the Ptolemies

Demotic census, Petrie Museum, museum number - UC32223

So, closing our eyes, we paid a visit to the time of the Ptolemies in Ancient Egypt and looked at a census from then written in demotic. They were a clever group but no demotic readers! I explained that it had 21 columns of a Ptolemaic census-return for a household. We discussed why governments then and now gather data, the idea of language as code, and the absolute need for a demotic dictionary to interpret this data. They also noted how well the papyrus had survived despite its age. I showed them pictures of the people this data could have related to: beautiful portraits of soul-searching Greco-Egyptian men and women looking at us across the ages from their coffin paintings.

Greco-Egyptian Tomb portraits

I asked them to identify which records produced by the government today they thought related to them. They themselves came up with birth and death, and discussed what you needed these for and what would happen if you didn’t have the record of your birth. I wanted them to draw a direct relationship between their lives and the record of their lives and to think about what it would be like NOT to have this information, what it would mean to them as individuals. I also wanted them to be enfranchised by, and connected to, the fact that the state holds and is supposed to look after these records of their important life events.

Back to the future

Primary births dataset, 1963-64. NDAD, TNA reference RG 71/2

We then cut to a slide of a 20th century census. This time it was the register of primary births in the United Kingdom in 1963, i.e. the stuff needed to prove you were born. I had worked on this dataset while on the NDAD project. I showed them a snapshot of the raw encoded data and asked them to identify anything which made sense! They immediately zoned in on identifiable codes; one recognised it as ‘some sort of data’ but agreed it was gobbledegook on the whole. In assessing the data, they quickly realised that something crucial was missing to accompany the data – the data dictionary. We hummed and hawed a while trying to decipher the code, but to no avail. I then showed them the partially decoded data. They could easily read through the data and understand each field and what a field was. The speed at which things ‘clicked’ for them surprised me. So these students are the ‘digital natives’ I had heard about! They are far removed from people like me (not hard) and their teacher, who are in many ways always trying to catch up with technology. They were then very keen to discuss what data should be kept about them – Facebook! Twitter! Photos, text messages. They were outraged by the idea that Facebook may not be being kept! I think they were also outraged that a government department with responsibility for such crucial data would not keep the dictionary explaining the code. This is their data, and their parents’!
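To show what the children were missing, here is a minimal sketch in Python of what a data dictionary does. Everything in it is invented for illustration: the field names, positions and code tables are hypothetical and are not the real layout of the births dataset.

```python
# A hypothetical data dictionary: each entry names a field, says which
# character positions it occupies in the raw record, and (optionally)
# what the codes in that field stand for.
DATA_DICTIONARY = [
    ("year_of_birth", 0, 2, None),
    ("sex", 2, 3, {"1": "male", "2": "female"}),
    ("birth_type", 3, 4, {"1": "single", "2": "twin"}),
    ("region", 4, 6, {"01": "North", "02": "South"}),
]


def decode_record(raw, dictionary):
    """Turn one fixed-width coded record into named, readable fields."""
    decoded = {}
    for name, start, end, codes in dictionary:
        value = raw[start:end]
        if codes is None:
            decoded[name] = value  # stored as-is, no code table needed
        else:
            decoded[name] = codes.get(value, "unknown code " + repr(value))
    return decoded


if __name__ == "__main__":
    raw_record = "632101"  # gobbledegook without the dictionary
    print(decode_record(raw_record, DATA_DICTIONARY))
```

Without `DATA_DICTIONARY` the string `632101` is just six digits. With it, the same six digits become a year, a sex, a birth type and a region. Lose the dictionary and the data survives but its meaning does not.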

What fascinated me was that these 9-10 year old children so easily made so many leaps in understanding: 1. The leap of association between encoded data and language as a code. 2. The identification of data as having a role in their lives. 3. Awareness of the fragility of digital stuff. 4. That they should have a say in what is being kept.

This is our data, look after it please

Should we be talking to these ‘digital natives’ more than we do? Shouldn’t everyone be engaged in what is being kept in relation to their collective memory? This generation and each subsequent generation have an aptitude for technology and its uses that is astounding. Do we do them a disservice by not asking them what it is they would like, while bearing in mind their youth and inexperience, but also bearing in mind that they will inherit whatever we have left behind in our (in the minds of the future) clumsy, inept approach?

Digital natives

Since the visit to Millfields I have discovered CensusAtSchool, a project where a national census can be used to help children learn and do statistics. I think there is a lot of potential here to look at working with children and enfranchising them with regard to data and its use. I like this important idea of enfranchising people – there is so much disenfranchisement around. Data isn’t easy and people need help, but it is also up to the creators, especially government, to make it easy to access and view. A message from Hackney: ‘This is our data, let’s reclaim it.’