This is the third and final post of mine summing up my notes from the 4th International Digital Curation Conference (now a few weeks ago). These notes cover the bulk of the second day, which consisted of submitted (as opposed to invited) papers. Most of these were given in parallel tracks, so I’m only able to cover half of them. As luck would have it, Chris Rusbridge seems to have covered many of the others.
The opening paper, however, was in its own session as it won the prize for best paper of the conference: Manjula Patel and Alexander Ball’s study on the issues surrounding the preservation of engineering CAD models was a useful guide to the problems and to possible solutions. One of the things that they opened my eyes to is that CAD files aren’t just about geometry and spatial representation – there’s lots of other information possibly embedded in them or attached to them, from engineering tolerances to feedback from the manufacturing process or field maintenance.
Aaron Griffiths then spoke about the RIN report looking at attitudes to data publishing across different disciplines (To Share or Not To Share: Publication and quality assurance of research data outputs). Through interviews with researchers, it examined culture, infrastructure barriers, the effects of policy and the overall propensity for sharing across a number of disciplines: astronomy was generally high (that is, well disposed towards sharing), while social and health sciences were low, for instance. Motivations included altruism and opportunities for collaboration; constraints included time, legal and ethical concerns, competition, a sense of limited rewards, and having nowhere to put stuff. The report says we need incentives: evidence of benefits, standard workable mechanisms for citation, and more explicit rewards that affect career progression. It also looked at issues around discovery, access and usability, and then asked researchers about processes relating to quality assurance. QA didn’t get support from the researchers they spoke to, but I feel we can’t have greater rewards for data publishing without a willingness to have QA. His presentation made it clear to me that I ought to read the report.
Amanda Spencer then spoke about TNA’s web continuity project, which aims to ensure (eventually) that 404 errors on UK government websites will be a thing of the past. One detail that was new to me (we’ve covered much of this in the PoWR project) was that the TNA team followed up their original study of link persistence in Hansard answers (which found that 60% of the 4,000 URLs given in answers to parliamentary questions over the last 10 years no longer worked) with a more wide-ranging study of government sites using Google Webmaster Tools. They wanted to check whether PQ answers were in some way a special case: they weren’t. They are promoting the use of XML sitemaps to help with capture, but these also help search engines and therefore current use of the sites.
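For anyone who hasn’t met XML sitemaps, here is a minimal sketch of generating one in Python, using the sitemaps.org 0.9 namespace. The function name build_sitemap and the example.gov.uk URLs are made up for illustration; this is not TNA’s tooling, just the general shape of the file a crawler or search engine would read.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML (as a string) for (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages, for illustration only
example_pages = [
    ("https://www.example.gov.uk/", "2008-12-01"),
    ("https://www.example.gov.uk/reports/annual.html", "2008-11-15"),
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

The same list of URLs serves both audiences: a web archive’s crawler can use it to find pages it might otherwise miss, and a search engine can use it to index the live site.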
Ronald Jantz then spoke of ideas around institutional support for authentic digital objects, starting with the assumption that “digital scholarship requires authentic digital objects”. An example he gave in support of this (attempts to recreate the cold fusion experiment failed because the experimental design – including the location of the thermometer – wasn’t available) was interesting, but seemed to bear on questions of selection or completeness rather than authenticity. He reminded us of the useful distinction made in archival diplomatics between authenticity and reliability, but ended up proposing that the solution to the authenticity question depended on institutionally-supported key signing, together with some process like TRAC to authenticate the institution and/or its repository. I and others weren’t convinced by this argument, or of its novelty. The issue of keys and what they attest was being dealt with at the time of the second DLM forum in 1999 by the Swedish National Archives, amongst others. Using keys to place the equivalent of wax seals on documents isn’t difficult; the complex problem is whether the trust networks exist to make the keys useful to anyone.
A presentation about the curation of weather data at NCAR provided lots of numbers and some interesting new terminology. ‘Enriched staff’ are those who have subject knowledge of the data they are curating. NCAR storage requirements double every 2.5 years, and are now about 6 petabytes. A new phrase for what I think of as media refreshing – moving data between storage media without necessarily changing the format of the data – was ‘tape oozing’. This is still something that requires significant planning and coordination for large data centres, to ensure that data can be moved before media become obsolete and without interfering with day-to-day access needs. NCAR’s observation is that poor curation causes more incidents of data loss than equipment failure or media failure. ‘Poor curation’ can mean both bad practice and the failure to follow established good practice. (This is true of many areas of activity, from healthcare to manufacturing, and quality management systems like ISO 9000 are designed to minimise both causes of error, but they will never eliminate them.)
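To get a feel for what doubling every 2.5 years means, here is a back-of-envelope projection in Python. It assumes the doubling period stays constant, which is my assumption rather than anything NCAR claimed, and the function name is just for illustration.

```python
def projected_storage_pb(current_pb, doubling_years, years_ahead):
    """Exponential growth: storage doubles once per doubling period."""
    return current_pb * 2 ** (years_ahead / doubling_years)

# Starting point from the talk: about 6 PB, doubling every 2.5 years
for years in (2.5, 5, 10):
    print(years, "years:", projected_storage_pb(6, 2.5, years), "PB")
```

On those assumptions the archive would be around 24 PB in five years and 96 PB in ten, which gives some idea of why moving data between media needs so much planning.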
Mackenzie Smith then gave the day’s second paper on CAD, specifically in architecture. A fascinating collaboration with MIT’s architecture school looked at well-known architects like Gehry, Thom Mayne and Moshe Safdie. The data is a 3D CAD model (known as a BIM, or building information model), which moves from architects to builders to owners. Target audiences are practitioners, historians, instructors and the public. The practitioners already know they have a problem managing the data, so are incentivised. There are tens of thousands of files, 100+ file formats, many gigabytes and almost no metadata: just filesystems, sometimes with spreadsheets in them that are like partial catalogues. The MIT project built a GUI to help automate the assignment of five properties to every file: when, where, how, why, what. “How” is generic type, such as presentation; “what” is specific format; “when” is to do with building phase rather than absolute time. Important stuff gets extra tags: models tell you what was built, but not why. MIT are archiving by creating three derivatives, all of which lack something (STEP, desiccated shape, and web display). One problem is that geometry is not authentic, but practitioners say much parametric stuff is tool artefact, not design intent. CAD vendors are very resistant to releasing format information. The project is exploring escrow solutions for this, which sounds like a pragmatic way forward.
In the afternoon, Stephen Abrams began with the statement that “Preservation is not a place” and went on to a wide-ranging reflection on the nature and role of repositories in a way that has relevance for many other institutions and services. Starting from first principles, Abrams’ group worked from values to services, reimagining the repository as a process, not a place (still a work in progress). They went back to Ranganathan’s five laws and also looked at archival science, particularly concerns over provenance and its importance in supporting authenticity. They identified 10 properties which they think apply to all objects (identity, viability and visibility are some). But curation depends on a lot of human things: competency, decision making, analysis. They identify 13 minimal services that support the human endeavours: these include identity services (ARK and NOID), storage (Pairtree), fixity/replication (ACE), catalog (relational or RDF/XML database), characterization (JHOVE2), transformation (at ingest, preservation desiccation/migration, and use copy generation), ingest (BagIt/grabit), request (metadata-based search and browse), search (Lucene), publication (sitemap, RSS/Atom, OAI-PMH) and annotation (social tagging, OAI-ORE). Any one of those would have made an interesting paper in itself, and some of them were the subject of separate posters. He ended by asking how simple a curation environment can be and still be effective. He also took the concept of LOCKSS a little bit further: lots of description keeps stuff meaningful, services keep stuff useful, use keeps stuff valuable.
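Of the services listed, fixity is probably the easiest to illustrate. What follows is a generic sketch of a checksum-based fixity check in Python, not a description of how ACE or any of the tools above actually works: record a digest at ingest, recompute it later, and compare. The function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_ok(path, recorded_digest):
    """True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == recorded_digest

# Demonstration with a throwaway file (hypothetical object, illustration only)
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "object.bin")
    with open(path, "wb") as handle:
        handle.write(b"preservation is not a place")
    recorded = sha256_of(path)            # digest stored at ingest
    intact = fixity_ok(path, recorded)    # audit before any change
    with open(path, "ab") as handle:
        handle.write(b"!")                # simulate silent corruption
    damaged = fixity_ok(path, recorded)   # audit after the change

print(intact, damaged)
```

A check like this only detects damage; repairing it is where replication comes in, since a bad copy can be replaced from a good one held elsewhere.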
Finally, Malcolm Atkinson gave a stirring speech which began with thoughts of scientists who still go out once a month to count birds or moss species – trained volunteers are important as well as industrialised scientists. Data can have uses not envisaged when it was created, nor even when it was first preserved: the original purpose of ships’ logs was safety. They were collected by archives to understand trade, and are now helping climate modelling because of their detailed records of such things as temperature and wind. Soon power will cost more than hardware: 1 PB consumes 2.2 MWh in 5 years. He talked of ‘data huggers’ who avoid competition and don’t describe, expose or share their data. There are also good data sharers, but at the moment incentives and disincentives aren’t well balanced. We need research on how to enable better research, and data must be fundamental to this.
IDCC has been over-subscribed in three of its years and always offers lots of inspiration and food for thought. The linkup with CODATA, whose events cover a similar subject area, can only be a good thing. I’m already looking forward to IDCC5, expected to be in London in 2009.

