Web Archiving

A Workshop for the Web

We delivered our first DPTP workshop in London on 28 June 2010, on the subject of archiving websites. I delivered most of the training myself, working from my experience with archiving JISC project websites, writing the PoWR Handbook, and my sense of how the work should fit into a traditional archiving continuum. Accordingly I tried to structure the day to reflect a start-to-finish approach to the job, as follows:

  1. Consider your organisational requirements, your drivers and the level at which you want to archive, and create a selection policy that matches them. Consider the legal framework.
  2. Understand the technology that copies websites (harvesters) and how websites themselves behave. I talked about aspects of the dynamic web that sometimes trip up a harvester – CMS, wiki, databases.
  3. Consider how (and indeed whether) you want to offer access to the collection, and whether metadata is needed.
  4. Build a programme for web archiving, adapting existing methodologies as needed – e.g. Institutional vs Individual. What other services exist, and can they do it for you?

In the middle, we had an excellent case study from Dave Thompson at the Wellcome Trust, and his experiences strongly reflected many of the themes of the course. Like many organisations, they don’t have one single reason for collecting web archives, and the future value of these collections is something we can’t yet see (because they are still so close to the live web).

We were all impressed by the people attending the course, all from a variety of backgrounds and projects, coming with widely different expectations of how they would be managing their web content. National libraries and business archives were represented, but also the arts; the Tate Gallery are doing interesting work in time-based media and specialist works of art that manifest themselves over the web. How to capture that content, and make it perform in the future?

The DPTP recognises the value of participation and sharing experiences, which we can all learn from. When I was holding forth on the concept of three possible points of capture for web content, I was very pleased to hear a proposal for a fourth possible method from our Swiss delegate Daniel Spichty. There were also numerous questions about exactly what it is that Content Management Systems do, which suggested to me I need to learn more about the inner workings and preservation implications of such systems.

I’m also pleased we were able to offer a printed copy of the PoWR Handbook to all those attending – in advance of the official launch of the book, which will take place at IWMW 2010 on 12 July.

JISC-PoWR Guide To Web Preservation now available at Lulu

You’ll all remember the JISC-PoWR (Preservation of Web Resources) project in 2008, during which we worked with Brian Kelly and Marieke Guy of UKOLN to raise awareness of Web preservation issues through face-to-face workshops and the project blog, and to develop the JISC-PoWR Handbook. The JISC-PoWR baton has, in different ways, passed to other projects such as ArchivePress, and the Beginners Guide to Digital Preservation, but you can still catch up with all of its outputs and discussions at the JISC-PoWR blog.

It was observed by some, at the time, that the Handbook was quite a hefty tome. It was full of good advice and illustrative case studies collected during the course of the project, but arguably suffered a bit from a lack of objective editing.

However, thanks to Brian’s prudent audit of the project budget, he was able to engage Susan Farrell to take this work further, and Susan has done a great job of editing and condensing the material. We can now present what I think can truly be described as a much handier Handbook, one that fulfils the project’s key aims of offering an accessible introduction to the preservation issues surrounding Web sites and materials, and a practical guide to dealing with them.

The 60-page guide is available through Lulu.com for the bargain price of £2.82 for anyone who wants a hard copy now. Both UKOLN and ULCC hope to be able to share copies with those attending events such as IWMW and DPTP. I hope we’ll also shortly be able to deposit OA PDFs in one or more repositories.

And with a bit of luck there should even be a preview of it embedded below; if not, hop on over to Lulu.com and check it out there.

PICT Memento plugin allows us to step into a wiki’s past

The PICT project is pretty much over, but I can steal a few moments out of my day every now and then to do a bit of housekeeping, try out a new plugin and maybe even blog about it.

Inspired by Rob Sanderson’s lightning talk on Memento at dev8D, I decided to go for the bounty offered for writing a Memento client. My tack was to enable a MediaWiki instance to handle the Accept-Date protocol using an existing plugin, and then to write a little PICT tool that supplied a user interface through which users could specify a date and browse their PICT-enabled MediaWiki “from the past”… spooky!
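To make the date-negotiation idea concrete, here is a minimal sketch of the request header such a client sends. It assumes the header name Accept-Datetime, which is what the Memento specification uses as later published in RFC 7089; the header handled by the plugin described here may have been named differently at the time. The function name `accept_datetime_header` and the example.org address are hypothetical and are not taken from the PICT code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the request header asking for the version of a page as of `when`.

    Assumption: the server understands Memento-style datetime negotiation
    with an Accept-Datetime header (the name used in RFC 7089).
    """
    # Treat naive datetimes as UTC so the caller gets a predictable result.
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # The header value is an HTTP-date, which is always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


if __name__ == "__main__":
    headers = accept_datetime_header(datetime(2010, 6, 28))
    print(headers)
    # The headers could then be attached to an ordinary HTTP request, e.g.
    # urllib.request.Request("http://wiki.example.org/wiki/Main_Page", headers=headers)
    # (hypothetical address; nothing is sent by this sketch).
```

A server that supports the negotiation would answer with, or redirect to, the revision that was current at the requested moment; a server that does not would simply ignore the header and return the live page.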

Thanks to getting nowhere near the deadline (those involved with this project will be grinning at that), I got nowhere near the prize, but I did finish a prototype plugin and threw it up on a MediaWiki instance for another project: CLASM-demo. CLASM is the name of a project, not a piece of conjurers’ onomatopoeia, so click on that link to see the PICT-memento client in action. A wiki with more pages would have been a better example, but at least it does have a lot of revisions.

Innovations in Reference Management

Beacon cited through fog

Who would have thought that reference management could be so interesting? We spent a very informative and enjoyable Thursday in snowy Milton Keynes, at the Innovations in Reference Management (#IRM10) event (part of the OU/JISC TELSTAR project). It was all thoroughly blogged by Owen Stephens, and tweeted by many.

Owen Stephens and Jason Platts of OU described the outputs of the TELSTAR project, which integrates the OU’s Moodle VLE with Refworks. This means that students using the VLE can move seamlessly between their reading lists and Refworks, locating resources, maintaining consistency of style and generating bibliographies easily.

Paul Stainthorp of Lincoln University described some exciting, bleeding-edge uses of Yahoo Pipes to mash up data from Refworks, OPAC, and Amazon. Arguably even more bleeding-edge was the presentation by Euan Adie from Nature Publishing, who showed us Help Me Igor, a reference manager plugin for Google Wave. Speakers from CiteULike and Mendeley also gave us fascinating insights into their respective social-tinged bibliographic management offerings.

Perhaps unsurprisingly, Kevin and I brought to the table the theme of web preservation. With reference to our work with JISC-PoWR, UKWAC and ArchivePress, we reminded anyone who hasn’t heard our spiel already that there are many important, valuable and eminently citable web resources, notably blogs by academic researchers, that are at risk of disappearing – making references to them virtually useless.

Authors may not be responsible for ensuring their readers can access the resources they reference, but we think they should at least give them a fighting chance of doing so! We therefore proposed that students and researchers should be encouraged to locate and cite copies of web resources in stable web archives (such as the UK Web Archive) rather than “in the wild”.
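As a rough illustration of what citing the archived copy might look like in a reference management tool, the sketch below stores both addresses and puts the archived one first. The class name, field names, output style and all example values are hypothetical; this is not the format used by Refworks, the UK Web Archive or any citation standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WebCitation:
    """A reference that points readers at a stable archived copy (hypothetical model)."""

    author: str
    title: str
    original_url: str   # where the resource lived "in the wild"
    archived_url: str   # the copy held in a web archive
    captured: date      # when the archive captured that copy

    def render(self) -> str:
        # Lead with the archived copy, and keep the original address for context.
        return (
            f"{self.author}. {self.title}. "
            f"Archived copy captured {self.captured.isoformat()}: {self.archived_url} "
            f"(originally published at {self.original_url})."
        )


if __name__ == "__main__":
    citation = WebCitation(
        author="A. Researcher",
        title="Notes on a research blog",
        original_url="http://blog.example.org/notes",
        archived_url="http://archive.example.org/20100225/http://blog.example.org/notes",
        captured=date(2010, 2, 25),
    )
    print(citation.render())
```

Recording the capture date alongside the archived address tells the reader which version of the page the author actually consulted, which the live address alone cannot do.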

We also discussed the idea that persistent collections of web resources could be created at the institutional level, whether that is an open archive of blog posts by a university’s researchers, or a closed repository where researchers can store copies of the web resources they cite.

One of the strong themes that emerged in discussion was the need for information literacy/digital skills training at all levels to address current tools and trends in reference management; and to re-assert the purpose, value and nature of citation in online digital environments.

An interesting suggestion also made was that reference management tools are becoming a natural part of the environment, just as email has: is provision of specialised applications by universities an “aberration”?

I’m inclined to think not. After all, it was clear from the workshop that there’s still a need to support ongoing study and research effectively, and scope to develop and validate new approaches. Microsoft Word may now include reference management features, but that doesn’t obviate the need to educate people in how to use them effectively, and why.

We’re very grateful to Owen for including us in his programme: this is a fascinating area, where e-learning, libraries, preservation and publishing collide, and I’m sure we haven’t heard the last of it.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library Life and Life2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led me to make a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is that UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
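A selection decision like this usually ends up expressed as a URL filter. The sketch below shows the idea in Python, assuming the common MediaWiki address patterns (an `action`, `oldid` or `diff` query parameter for history and editing views, and `Talk:` or `Special:` namespace prefixes in the page title). The function name `should_capture`, the `/wiki/` article path and the example.org addresses are hypothetical; this is not how Web Curator Tool itself is configured, and the actual DigiRep addresses may follow a different pattern.

```python
from urllib.parse import parse_qs, unquote, urlparse

# Assumption: pretty URLs look like /wiki/Page_title on this (hypothetical) wiki.
ARTICLE_PATH = "/wiki/"
# Views that show history or editing forms rather than the finished content.
EXCLUDED_ACTIONS = {"history", "edit", "submit", "raw", "info"}


def should_capture(url: str) -> bool:
    """Return True if the URL looks like finished wiki content worth gathering."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)

    # Skip edit history, old revisions and diffs between revisions.
    if query.get("action", ["view"])[0] in EXCLUDED_ACTIONS:
        return False
    if "oldid" in query or "diff" in query:
        return False

    # Work out the page title from ?title=... or from the pretty URL path.
    if "title" in query:
        title = query["title"][0]
    elif parsed.path.startswith(ARTICLE_PATH):
        title = unquote(parsed.path[len(ARTICLE_PATH):])
    else:
        title = ""

    # Skip discussion pages and automatically generated special pages.
    namespace, sep, _ = title.partition(":")
    if sep and (namespace in {"Talk", "Special"} or namespace.endswith("_talk")):
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "http://wiki.example.org/wiki/Main_Page",
        "http://wiki.example.org/index.php?title=Main_Page&action=history",
        "http://wiki.example.org/wiki/Talk:Main_Page",
    ):
        print(candidate, should_capture(candidate))
```

Keeping the rules in one small function makes the policy easy to review with the wiki’s owners before any gather is scheduled, and easy to reverse if they later decide the change history should be kept after all.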

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.