A repository for pi(es)

January 7th, 2010 Kevin Ashley Posted in General, Technical 4 Comments »

As you may have read recently, Fabrice Bellard has announced the computation of π to almost 2.7 trillion decimal places using a faster algorithm that allows desktop technology to be used, rather than the supercomputers that are usually used to break this particular record. Bellard is an extremely talented programmer who has made a useful contribution to one area of digital preservation with his emulation and virtualisation system QEMU. But it’s a comment by Les Carr that set me thinking about costs, research data and repositories.

“Would you want to put that in your repository?” asked Les. And this is a particularly extreme example where we can do some calculations to give us a fairly good answer. Scientific data centres and the researchers that use them have been considering this question for many years, and one way of looking at it is to see if the cost of recomputation exceeds the cost of storage over a particular time period. We’re assuming here that the initial question – is this worth keeping at all – has been answered at least vaguely positively.

[Image: Pi Pie - CC-BY-NC-SA by Maitri@flickr]

Let’s look first at the cost of recomputation. Fabrice says the equipment used for this task cost no more than €2000. If we assume that it has a life of 3 years, that gives us a cost per day of €1.83. I’m avoiding the usual accounting practice of allowing for inflation, or lost interest on capital, in calculating the true depreciation value of the asset – there are a number of different schemes and they all give similar results. I’ve just divided the capital cost by the number of days of use we’ll get. But computers use electricity, and that costs money as well. Let’s assume this is a power-hungry beast that draws 400W and that power costs us 13.5 euro cents per kWh (which is what my domestic tariff is if we assume a euro/sterling rate of €1.10 = £1 and 5% VAT). That adds €1.30/day to the cost of running the system, for a total cost of €3.13/day.

Fabrice’s announcement says that it took 131 days of system time to calculate and verify his results, which gives a computational cost of €410.03 – which I’ll round to €410 since I’ve only been using 3 significant figures so far in the computations, and because there’s a lot of hand-waving involved in lots of these figures. So, we know how much it would take to recompute this result given the software, machine and instructions. (And the computational cost is likely to decline over time in the short term.)
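The arithmetic above can be sketched in a few lines of Python. This is only a restatement of the figures quoted in the post (straight-line depreciation over 3 years, an assumed 400W draw, 13.5 euro cents per kWh, 131 days of run time); the variable names are illustrative. It also shows how much the intermediate rounding matters: rounding the daily figures to the cent first gives €410.03, while carrying full precision gives roughly €409.

```python
# Back-of-the-envelope recomputation cost, using the figures quoted in the post.
HARDWARE_COST_EUR = 2000.0       # upper bound on the equipment cost
HARDWARE_LIFE_DAYS = 3 * 365     # assumed 3-year life, straight-line depreciation
POWER_DRAW_KW = 0.4              # assumed 400 W draw
PRICE_PER_KWH_EUR = 0.135        # 13.5 euro cents per kWh
RUN_DAYS = 131                   # system time to calculate and verify

# Daily cost components.
depreciation_per_day = HARDWARE_COST_EUR / HARDWARE_LIFE_DAYS
electricity_per_day = POWER_DRAW_KW * 24 * PRICE_PER_KWH_EUR

# Total carried at full precision.
total_unrounded = RUN_DAYS * (depreciation_per_day + electricity_per_day)

# Total with each daily figure rounded to the cent first, as in the post.
total_rounded = RUN_DAYS * (round(depreciation_per_day, 2) + round(electricity_per_day, 2))

print(f"Depreciation per day: EUR {depreciation_per_day:.2f}")
print(f"Electricity per day:  EUR {electricity_per_day:.2f}")
print(f"Total (rounded daily figures): EUR {total_rounded:.2f}")
print(f"Total (full precision):        EUR {total_unrounded:.2f}")
```

Either way the answer is about €410 to the precision that the inputs justify.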

The answer needs a Terabyte of storage. What will it cost to keep that in a repository? That’s a slightly more difficult question to answer, but we can give a number of figures that provide upper and lower bounds. SDSC quote $390/Tbyte/year for archival tape storage (dual copies), excluding setup costs and assuming no retrieval. Moore et al quote $500/year as a raw figure, obtained by dividing total system costs by usable storage within it. At current rates of $1 = €0.67, that gives us a cost of €261/year or €335/year. SDSC are likely to be at the cheap end of the scale. ULCC’s costs, given our lower total volumes, would be closer to €1500/year for a similar service (dual archival tape copies on separate sites) although that does include retrieval costs. Amazon’s AWS would be about €100/year for a single copy. You would want two copies, so it’s twice that, and the cost of transferring the data in would be about 25% more than the storage cost. Since I haven’t factored in ingest costs for any of the other models, I’ll ignore it for AWS as well. (And yes, AWS isn’t a repository, and there’s no metadata, and… This is a back-of-the-envelope calculation. It’s a small envelope.)

Which means, at a very rough level and ignoring many pertinent factors, that after about two years of storage in the repository, we would have been better off recalculating the data rather than storing it. There are a lot of assumptions hidden there, however. For one, we’re assuming that this data will rarely, if ever, be required. If many people want it, the recalculation cost rapidly becomes prohibitive (and so do the 131 days they have to wait for their request to be satisfied!)
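As a rough check on that break-even point, the sketch below divides the one-off recomputation cost (about €410) by each of the annual storage figures quoted above. It assumes a single recomputation, constant prices, and no ingest or retrieval costs, so it is an illustration of the comparison rather than a costing model; the dictionary labels are just descriptive names.

```python
# Break-even time: how many years of storage cost as much as one recomputation?
RECOMPUTE_COST_EUR = 410.0   # one-off recomputation cost from the post
USD_TO_EUR = 0.67            # exchange rate quoted in the post

# Annual cost of keeping 1 TB, in euros, from the figures quoted in the post.
annual_storage_eur = {
    "SDSC archival tape (dual copies)": 390 * USD_TO_EUR,
    "Moore et al raw figure": 500 * USD_TO_EUR,
    "ULCC (dual tape copies, two sites)": 1500.0,
    "AWS (two copies, ingest ignored)": 2 * 100.0,
}

def break_even_years(recompute_cost, annual_storage_cost):
    """Years of storage after which recomputing once would have been cheaper."""
    return recompute_cost / annual_storage_cost

break_even = {
    name: break_even_years(RECOMPUTE_COST_EUR, cost)
    for name, cost in annual_storage_eur.items()
}

for name, years in break_even.items():
    print(f"{name}: {years:.2f} years")
```

On these figures the break-even point ranges from a few months (ULCC) to roughly two years (two copies on AWS), so "about two years" sits at the generous end of the range.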

One of the other problems is more subtle. I said that, in the short term, recalculation costs would be likely to fall as computational power becomes cheaper. The energy costs involved will rise, of course, but there’s still a significant downward trend. But after a sufficient period of time, it becomes non-trivial to reconstruct the software and the environment it needs in order to allow the computation to happen. Imagine trying to recalculate something now where the original software is a PL/I program designed to run under OS/360. It’s not impossible by any means, but the cost involved and expertise required are non-trivial. At least with our example we won’t have any doubts about whether the right answer has been produced – the computation of π produces an exact, if never-ending, answer. Most scientific software doesn’t do this and the exact answers produced can depend on the compiler, the floating-point hardware, mathematical libraries and the operating system. Over time, it becomes harder and harder to recreate these faithfully, and we often don’t have any means of checking whether or not we have succeeded. (Keeping the original outputs would help in this, of course, but that’s exactly what we’re trying to avoid.) That’s part of the problem that Brian Matthews and his colleagues examine in the SigSoft project and there’s still a great deal of work to be done there.

So have we answered Les’s question? My feeling is that in this case we have – there’s a fair amount of evidence that suggests that keeping this particular data set isn’t cost-effective. But in general, the question is far harder to answer. Yet we must strive harder for more general answers as the cost of not doing so is not trivial. Even if money did grow on trees, it still wouldn’t be free and at present we need to be very careful how we use it.

File formats…or data streams?

December 3rd, 2009 Ed Pinsent Posted in DPC, Events, Reports, Technical 4 Comments »

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time; migrate data from a format that’s about to become obsolete or unsupported, onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format. Read the rest of this entry »

SNEEP 0.3.2 (now with automagic installer) + PICT (SNEEP evolves!)

June 11th, 2009 Rory McNicholl Posted in JISC, News, PICT, SNEEP, Technical No Comments »

SNEEP 0.3.2

The JISC-funded SNEEP project (Social Networking Extensions for EPrints) – part of the original JISC rapid innovation programme – aimed to provide a set of social networking tools for EPrints repositories. It ran for 6 months and ended in May 2008. Since the rather low-key publication of the resultant EPrints plugin, interest and uptake have been slowly but surely gathering momentum.

Today I am pleased to announce a couple of significant SNEEP-related developments. Firstly, thanks to my colleague Ben Wheeler here at ULCC, SNEEP 0.3.2, released this week, offers an automagic installer. This does away with the (slightly tortuous) manual install procedure that we suspect discouraged all but the hardier EPrints hac… I mean administrators.

You can download SNEEP 0.3.2 and/or read Ben’s post to the EP-tech mailing list. The download page is also a good place to see SNEEP in action.

PICT

I am also pleased to announce a new project (funded as part of the 2009 JISC rapid innovation programme) that aims to build on the SNEEP work to provide SNEEP-ish services to a broader range of web resources. The goal of the PICT project (Platform Independent Community Toolbox) is a lightweight JavaScript tool that can be deployed across a number of web resources (not just a repository) to encompass the web-based real estate of a given research community and provide that community with collaborative tools available at the on-line research coalface.

Effectively PICT will allow resource owners to offer

  • tags
  • comments
  • notes
  • other goodies

from their web page. The data gathered by these tools will be managed by a PICT server (probably run by a community-minded resource owner) and be available for cross referencing with other resources in a PICT community.

If all that is a bit difficult to picture, rest assured that demos will appear throughout the course of the project that should help to clear the murk.

If you can keep your blog when all around…

March 20th, 2009 Richard M. Davis Posted in General, JiSC-PoWR, Technical 1 Comment »

I was a keen participant in the activities of ERPANET, but I must confess I haven’t kept abreast of its successor, Digital Preservation Europe (DPE). However, I was interested to see the recent DPE briefing paper about blog preservation, since it covers an area that we also tackled in the course of the JISC-PoWR project – on the blog, in the workshops and the handbook. The Briefing Paper highlights key issues for those who would preserve blogs. It is a necessarily general overview, and manages to cram a lot of preservation issues into its two sides of A4. But, for the blogger approaching preservation, or the preservationist approaching blogs, I wonder if such avalanches of considerations aren’t sometimes unnecessarily overwhelming. It seemed worth looking at a few of the points made in the DPE briefing paper, and considering whether we can demystify them or make the task seem less daunting.

Read the rest of this entry »

Draft standard for long-term archiving of CAD data

September 8th, 2008 Kevin Ashley Posted in News, Technical No Comments »

September’s edition of BSI’s Update Standards magazine alerted me to another batch of standards, currently at the public comment stage, which are of particular relevance to digital preservation. The BS EN 9300 family is entitled ‘Long term archiving and retrieval of digital technical product documentation such as 3D, CAD and PDM data’ and 6 parts (100; 110; 007; 005; 002 and 115) are open for comment until September 30th. I was initially surprised that I had heard nothing of this series of standards before, and wasn’t sure if this was simply lack of observation on my part or because they had come from an entirely different domain. They clearly aren’t new – unlike BS10008, which I wrote about in June, this is not a home-grown British Standard but one which is being proposed ‘for adoption’ – which means that it’s already been adopted by another body somewhere. That status also means that you can’t use BSI’s excellent online commenting system. You have to buy the drafts from BSI on paper, so far as I can see.

In fact, as I was relieved to note, this group of standards isn’t entirely new to the digital preservation community, and the authors are also aware of general DP standards such as OAIS. They derive from a group of standards known as STEP (Standard for Exchange of Product Model Data), codified in ISO 10303. Read the rest of this entry »
