A repository for pi(es)

January 7th, 2010 Kevin Ashley Posted in General, Technical 4 Comments »

As you may have read recently, Fabrice Bellard has announced the computation of π to almost 2.7 trillion decimal places using a faster algorithm that allows desktop technology to be used, rather than the supercomputers that are usually used to break this particular record. Bellard is an extremely talented programmer who has made a useful contribution to one area of digital preservation with his emulation and virtualisation system QEMU. But it’s a comment by Les Carr that set me thinking about costs, research data and repositories.

“Would you want to put that in your repository?” asked Les. And this is a particularly extreme example where we can do some calculations to give us a fairly good answer. Scientific data centres and the researchers that use them have been considering this question for many years, and one way of looking at it is to see if the cost of recomputation exceeds the cost of storage over a particular time period. We’re assuming here that the initial question – is this worth keeping at all – has been answered at least vaguely positively.

Pi Pie - CC-BY-NC-SA by Maitri@flickr

Let’s look first at the cost of recomputation. Fabrice says the equipment used for this task cost no more than €2000. If we assume that it has a life of 3 years, that gives us a cost per day of €1.83. I’m avoiding the usual accounting practice of allowing for inflation, or lost interest on capital, in calculating the true depreciation value of the asset – there are a number of different schemes and they all give similar results. I’ve just divided the capital cost by the number of days of use we’ll get. But computers use electricity, and that costs money as well. Let’s assume this is a power-hungry beast that draws 400W and that power costs us 13.5¢ per kWh (which is what my domestic tariff is if we assume a euro/sterling rate of €1.10 = £1 and 5% VAT). That adds €1.30/day to the cost of running the system, for a total cost of €3.13/day.

Fabrice’s announcement says that it took 131 days of system time to calculate and verify his results, which gives a computational cost of €410.03 – which I’ll round to €410 since I’ve only been using 3 significant figures so far in the computations, and because there’s a lot of hand-waving involved in lots of these figures. So, we know how much it would take to recompute this result given the software, machine and instructions. (And the computational cost is likely to decline over time in the short term.)
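The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same back-of-the-envelope sum: the constant and function names are mine, the figures are the ones quoted in the text, and it rounds at each step in the same way the text does.

```python
# Back-of-the-envelope recomputation cost, using the figures quoted in the post.
# All names here are illustrative; none of them come from Bellard's announcement.
HARDWARE_COST_EUR = 2000.0      # upper bound on equipment cost
HARDWARE_LIFE_DAYS = 3 * 365    # assumed 3-year life, straight-line, no inflation
POWER_DRAW_KW = 0.4             # assumed 400W draw
PRICE_PER_KWH_EUR = 0.135       # assumed domestic tariff, 13.5 cents per kWh
RUN_DAYS = 131                  # system time to calculate and verify


def daily_costs():
    """Return (hardware, energy, total) cost per day in euros, rounded to cents."""
    hardware = round(HARDWARE_COST_EUR / HARDWARE_LIFE_DAYS, 2)
    energy = round(POWER_DRAW_KW * 24 * PRICE_PER_KWH_EUR, 2)
    return hardware, energy, round(hardware + energy, 2)


def recomputation_cost():
    """Total cost in euros of running the machine for the whole computation."""
    _, _, total_per_day = daily_costs()
    return round(total_per_day * RUN_DAYS, 2)


if __name__ == "__main__":
    hardware, energy, total = daily_costs()
    print(f"hardware {hardware:.2f}/day, energy {energy:.2f}/day, total {total:.2f}/day")
    print(f"recomputation cost: {recomputation_cost():.2f} EUR")
```

Changing any one assumption (a longer hardware life, a cheaper tariff) is then a one-line edit, which is useful given how much hand-waving the inputs involve.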

The answer needs a Terabyte of storage. What will it cost to keep that in a repository? That’s a slightly more difficult question to answer, but we can give a number of figures that provide upper and lower bounds. SDSC quote $390/Tbyte/year for archival tape storage (dual copies), excluding setup costs and assuming no retrieval. Moore et al quote $500/year as a raw figure, obtained by dividing total system costs by usable storage within it. At current rates of $1 = €0.67, that gives us a cost of €261/year or €335/year. SDSC are likely to be at the cheap end of the scale. ULCC’s costs, given our lower total volumes, would be closer to €1500/year for a similar service (dual archival tape copies on separate sites) although that does include retrieval costs. Amazon’s AWS would be about €100/year for a single copy. You would want two copies, so it’s twice that, and the cost of transferring the data in would be about 25% more than the storage cost. Since I haven’t factored in ingest costs for any of the other models, I’ll ignore it for AWS as well. (And yes, AWS isn’t a repository, and there’s no metadata, and… This is a back-of-the-envelope calculation. It’s a small envelope.)

Which means, at a very rough level and ignoring many pertinent factors, that after about two years of storage in the repository, we would have been better off recalculating the data rather than storing it. There’s a lot of assumptions hidden there, however. For one, we’re assuming that this data will rarely, if ever, be required. If many people want it, the recalculation cost rapidly becomes prohibitive (and so does the 131 days they have to wait for their request to be satisfied!)
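As a rough check on that break-even point, the sketch below divides the recomputation cost by each annual storage figure quoted earlier. The names are mine, and it assumes storage prices stay flat, the data is recomputed exactly once, and ingest and retrieval costs are ignored, just as in the text.

```python
# Break-even between storing the result and recomputing it once.
# Figures are the ones quoted in the post; names are illustrative only.
RECOMPUTE_COST_EUR = 410.0
USD_TO_EUR = 0.67  # exchange rate used in the post

STORAGE_EUR_PER_YEAR = {
    "SDSC archival tape (dual copies)": 390 * USD_TO_EUR,
    "Moore et al raw figure": 500 * USD_TO_EUR,
    "ULCC (dual copies, separate sites)": 1500.0,
    "AWS (two copies, ingest ignored)": 2 * 100.0,
}


def break_even_years(annual_storage_cost, recompute_cost=RECOMPUTE_COST_EUR):
    """Years of storage after which recomputing once would have been cheaper."""
    return recompute_cost / annual_storage_cost


if __name__ == "__main__":
    for name, cost in STORAGE_EUR_PER_YEAR.items():
        years = break_even_years(cost)
        print(f"{name}: {cost:.0f} EUR/year, break-even after {years:.1f} years")
```

On these figures the break-even falls somewhere between a few months (ULCC) and roughly two years (two copies on AWS), so "about two years" sits at the generous end of the range.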

One of the other problems is more subtle. I said that, in the short term, recalculation costs would be likely to fall as computational power becomes cheaper. The energy costs involved will rise, of course, but there’s still a significant downward trend. But after a sufficient period of time, it becomes non-trivial to reconstruct the software and the environment it needs in order to allow the computation to happen. Imagine trying to recalculate something now where the original software is a PL/I program designed to run under OS/360. It’s not impossible by any means, but the cost involved and the expertise required are non-trivial. At least with our example we won’t have any doubts about whether the right answer has been produced – the computation of π produces an exact, if never-ending, answer. Most scientific software doesn’t do this, and the exact answers produced can depend on the compiler, the floating-point hardware, mathematical libraries and the operating system. Over time, it becomes harder and harder to recreate these faithfully, and we often don’t have any means of checking whether or not we have succeeded. (Keeping the original outputs would help in this, of course, but that’s exactly what we’re trying to avoid.) That’s part of the problem that Brian Matthews and his colleagues examine in the SigSoft project and there’s still a great deal of work to be done there.

So have we answered Les’s question? My feeling is that in this case we have – there’s a fair amount of evidence that suggests that keeping this particular data set isn’t cost-effective. But in general, the question is far harder to answer. Yet we must strive harder for more general answers as the cost of not doing so is not trivial. Even if money did grow on trees, it still wouldn’t be free and at present we need to be very careful how we use it.

Latest Digital Preservation Training Programme, SOAS May 2009.

June 1st, 2009 Patricia Sleeman Posted in DCC, DPTP, Events, General, News No Comments »

Japanese Zen garden at Brunei gallery, School of Oriental and African Studies.

So another DPTP over! As presenters we felt it went really well. We again had a great group of people. The level of knowledge was very high, and even so the course really does seem to help consolidate many levels of knowledge about digital preservation. For many the focus was OAIS, and the class project seems to have helped put the theory into practice. One quote from the feedback:

‘Things really fell into place for me during this exercise and models started to make proper sense. Moved things from theory to practice.’

Overall the level of satisfaction with what we are providing is high.

‘Overall an excellent course. Bringing together so many disparate ideas and concepts and making sense of the muddle! Just hope we can move forward using the models. Excellent group too, good interaction and discussion – I got as much out of this element as from the taught content. Thank you so much all!’

We are now looking to develop links with the DCC as well as moving on to another stage of the DPTP. We will keep providing these 3-day courses with readjustments and updates, but we are also looking at developing the modules into e-learning objects. Now all we need is funding!

On the limits of preservation

May 31st, 2009 Kevin Ashley Posted in General 3 Comments »

A recent article in New Scientist on the outer fringes of the chiptune scene prompted me to think about preservation, emulation and the fact that some digital things simply aren’t preservable in any useful sense.

Chiptunes are typically created using early personal computers or videogames and/or their soundchips. In that respect, they depend on technology preservation – the museum approach to digital preservation. Chiptune composers either use the systems as designed, programming them directly to create their music, or alter them in some way using techniques collectively known as ‘circuit-bending’, which makes the machines capable of producing sounds that they could not have originally produced. Some aspects of the chiptune scene utilise more modern synthetic techniques to recreate the sounds produced by these early chips – these are, in a loose sense, emulating the original systems, although not in a way that would allow you to use original software to create your sounds. But some adherents of the chiptune genre are going further, using the sounds of the systems themselves in their compositions.

Harp and netbook - Fzero@flickr CC BY NC

The article which set my train of thought going covered Matthew Applegate’s (aka Pixelh8) concert in late March 2009 at the National Museum of Computing,

Read the rest of this entry »

Will Rawaa and Waleed get here?

April 30th, 2009 Patricia Sleeman Posted in DPTP, Events, General, News No Comments »

Iraq National Library and Archives

Iraq National Library and Archives, Baghdad, Iraq.

So we are on tenterhooks awaiting the visa applications of Rawaa and Waleed from the Iraq National Library and Archives (INLA). The visas are now being processed in Baghdad and should be ready any day. Thanks to the British Council and the British Institute for the Study of Iraq, Rawaa and Waleed are (hopefully) coming to attend the next DPTP, which takes place at SOAS on the 18th-20th of May. The programme is really shaping up; we have added a few things and altered the schedule somewhat in light of feedback from the last course. So exciting stuff.

More about the INLA: The Iraq National Library and Archives had 95% of its holdings destroyed, mostly during and after the conflict. This was an institution which held 417,000 books, 2,618 periodicals dating from the late Ottoman era to modern times, and a collection of 4,412 rare books and manuscripts. Now, with the energy and dedication of its director, Dr Saad Eskander, and his staff, the library is being rebuilt with donations and a lot of digital surrogates. Rawaa and Waleed are coming here to learn more about how to manage these digital assets, which are vital to the reconstitution of Iraqi culture and history.

The alert among you will remember me talking about Dr Eskander before.

Here’s looking forward to meeting our Iraqi colleagues as well as everyone booked on the next DPTP.

Set a blog to catch a blog…

March 23rd, 2009 Richard M. Davis Posted in General, JiSC-PoWR 5 Comments »

Originally published on the JISC-PoWR blog.

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?

Nesting...

Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about Wordpress is the number of uses its simple blog model has been put to. Under the hood it is based on a remarkably simple database schema of about 10 tables and a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user-base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (Wordpress MU, Buddypress).

Another possibility exploited by a 3rd-party plugin is that of using Wordpress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

Read the rest of this entry »
