<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ulcc da blog &#187; digital curation</title>
	<atom:link href="http://dablog.ulcc.ac.uk/tag/digital-curation/feed/" rel="self" type="application/rss+xml" />
	<link>http://dablog.ulcc.ac.uk</link>
	<description>ulcc digital archives blog</description>
	<lastBuildDate>Fri, 03 Feb 2012 16:24:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>4th International Digital Curation Conference part 3 (idcc4 rides again)</title>
		<link>http://dablog.ulcc.ac.uk/2008/12/22/4th-international-digital-curation-conference-part-3-idcc4-rides-again/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/12/22/4th-international-digital-curation-conference-part-3-idcc4-rides-again/#comments</comments>
		<pubDate>Mon, 22 Dec 2008 17:28:59 +0000</pubDate>
		<dc:creator>Kevin Ashley</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[data curation]]></category>
		<category><![CDATA[DCC]]></category>
		<category><![CDATA[digital curation]]></category>
		<category><![CDATA[edinburgh]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[idcc4]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=246</guid>
		<description><![CDATA[This is the third and final post of mine summing up my notes from the 4th international digital curation conference (now a few weeks ago.) These notes cover the bulk of the second day, which consisted of submitted (as opposed to invited) papers. Most of these were given in parallel tracks, so I&#8217;m only able [...]]]></description>
			<content:encoded><![CDATA[<p>This is the third and final post of mine summing up my notes from the 4th international digital curation conference (now a few weeks ago.) These notes cover the bulk of the second day, which consisted of submitted (as opposed to invited) papers. Most of these were given in parallel tracks, so I&#8217;m only able to cover half of them. As luck would have it, Chris Rusbridge seems to have <a href="http://digitalcuration.blogspot.com/2008/12/strand-b1-research-papers-at-idcc4.html">covered many of the others.</a></p>
<p>The opening paper, however, was in its own session as it won the prize for best paper of the conference: Manjula Patel and Alexander Ball&#8217;s study on the issues surrounding the preservation of engineering CAD models was a useful guide to the <a href="http://www.ukoln.ac.uk/projects/grand-challenge/">problems and to possible solutions</a>. One of the things that they opened my eyes to is that CAD files aren&#8217;t just about geometry and spatial representation &#8211; there&#8217;s lots of other information possibly embedded in them or attached to them, from engineering tolerances to feedback from the manufacturing process or field maintenance.</p>
<p>Aaron Griffiths then spoke about the RIN report looking at attitudes to data publishing across different disciplines (<em><a href="http://www.rin.ac.uk/data-publication">To Share or Not To Share: Publication and quality assurance of research data outputs</a></em>). <span id="more-246"></span> It examined culture, infrastructure barriers, the effects of policy and overall propensity for sharing across a number of disciplines via interviews with researchers: astronomy was generally high (that is, well disposed towards sharing), social and health sciences low, for instance. Motivations included altruism and opportunities for collaboration; constraints included time, legal and ethical issues, competition, a sense of limited rewards, and nowhere to put stuff. The report says we need incentives: evidence of benefits, standard workable mechanisms for citation, more explicit rewards that affect career progression. It also looked at issues around discovery, access and usability, and then asked researchers about processes relating to quality assurance. This didn&#8217;t get support from the researchers they spoke to, but we can&#8217;t have greater rewards for data publishing without a willingness to have QA, I feel. His presentation made it clear to me that I ought to read the report.</p>
<p>Amanda Spencer then spoke about TNA&#8217;s web continuity project, which aims to ensure (eventually) that 404 errors on UK government websites will be a thing of the past. One detail that was new to me (we&#8217;ve covered much of this in the PoWR project) was that the TNA team followed up their original study of link persistence in Hansard answers (which found that 60% of the 4,000 URLs given in answers to parliamentary questions over the last 10 years no longer worked) with a more wide-ranging study of government sites using Google Webmaster Tools. They wanted to check to see if PQ answers were in some way a special case: they weren&#8217;t. They are promoting the use of XML sitemaps to help in capture, but these also help search engines and therefore current use of the sites.</p>
<p>Ronald Jantz then spoke of ideas around institutional support for authentic digital objects, starting with the assumption that &#8220;digital scholarship requires authentic digital objects&#8221;. An example he gave in support of this (attempts to recreate the cold fusion experiment failed because the experimental design &#8211; including location of thermometer &#8211; wasn’t available) was interesting but didn&#8217;t seem directly relevant to questions of authenticity but rather of selection or completeness. He reminded us of the useful distinction made in archival diplomatics between authenticity and reliability, but ended up proposing that the solution to the authenticity question depended on institutionally-supported key signing, together with some process like TRAC to authenticate the institution and/or its repository. I and others weren&#8217;t convinced by this argument, nor of its novelty. The issue of keys and what they attest was being dealt with at the time of the second DLM forum in 1999 by the Swedish National Archives, amongst others. Using keys to place the equivalent of wax seals on documents isn&#8217;t difficult; the complex problem is whether the trust networks exist to make the keys useful to anyone.</p>
<p>A presentation about the curation of weather data at NCAR provided lots of numbers and some interesting new terminology, such as &#8216;enriched staff&#8217;: staff who have subject knowledge of the data they are curating. NCAR storage requirements double every 2.5 years, and are now about 6 petabytes. A new phrase for what I think of as media refreshing &#8211; moving data between storage media without necessarily changing the format of the data &#8211; was &#8216;tape oozing&#8217;. This is still something that requires significant planning and coordination for large data centres to ensure that data can be moved before media become obsolete and without interfering with day-to-day access needs. NCAR&#8217;s observation is that poor curation causes more incidents of data loss than equipment failure or media failure. &#8216;Poor curation&#8217; can mean both bad practice and the failure to follow established good practice. (This is true of many areas of activity, from healthcare to manufacturing, and quality management systems like ISO9000 are designed to minimise both causes of error, but they will never eliminate it.)</p>
<p>Mackenzie Smith then gave the day&#8217;s second paper on CAD, specifically in architecture. A fascinating collaboration with MIT&#8217;s architecture school looked at well-known architects like Gehry, Thom Mayne and <a href="http://www.msafdie.com/">Moshe Safdie</a>. The data is a 3D CAD model (known as BIM &#8211; building information model) which moves from architects to builders to owners. Target audiences are practitioners, historians, instructors and the public. The practitioners already know they have a problem managing the data, so are incentivised. 10,000s of files, 100+ file formats, many gigabytes, almost no metadata: just filesystems, sometimes with spreadsheets in them that are like partial catalogues. The MIT project built a GUI to help in automating the assignment of 5 properties to every file: when, where, how, why, what. &#8220;How&#8221; is generic type, such as presentation; &#8220;what&#8221; is specific format; &#8220;when&#8221; is to do with building phase rather than absolute time. Important stuff gets extra tags: models tell you what was built, but not why. MIT are archiving by creating 3 derivatives, all of which lack something (STEP, desiccated shape, and web display). One problem is that geometry is not authentic, but practitioners say much parametric stuff is tool artefact, not design intent. CAD vendors are very resistant to releasing format information. The project is exploring escrow solutions for this, which sounds like a pragmatic way forward.</p>
<p>In the afternoon, Stephen Abrams began with the statement that “Preservation is not a place” and went on to a wide-ranging reflection on the nature and role of repositories in a way that has relevance for many other institutions and services. Starting from first principles, Abrams&#8217; group worked from values to services, reimagining the repository as a process, not a place (still a work in progress.) They went back to <a href="http://en.wikipedia.org/wiki/Five_laws_of_library_science">Ranganathan&#8217;s 5 laws</a> and also looked at archival science, particularly concerns over provenance and its importance in supporting authenticity. They identified 10 properties which they think apply to all objects (identity, viability, visibility are some.) But curation depends on a lot of human things: competency, decision making, analysis. They identify 13 minimal services that support the human endeavours: these include identity services (ark and noid), storage (pairtree), fixity/replication (ACE), catalog (relational or rdf/xml database), characterization (jhove2), transformation (at ingest, preservation desiccation/migration, and use copy generation), ingest (bagit/grabit), request &#8211; metadata-based search and browse, search (lucene), publication (sitemap, rss/atom, oai-pmh), annotation (social tagging, oai-ore). Any one of those would have made an interesting paper in itself, and some of them were the subject of separate posters. He ended by asking how simple a curation environment can be and still be effective. He also took the concept of LOCKSS a little bit further: lots of description keeps stuff meaningful, services keep stuff useful, use keeps stuff valuable.</p>
<p>Finally Malcolm Atkinson gave a stirring speech which began with thoughts of scientists who still go out once a month to count birds or moss species &#8211; trained volunteers are important as well as industrialised scientists. Data can have uses not envisaged when it was created nor even when it was first preserved: ships&#8217; logs were originally kept for safety. They were collected by archives to understand trade, and are now helping climate modelling because of their detailed records of such things as temperature and wind. Soon power will cost more than hardware. 1PB consumes 2.2MWh in 5 years. He talked of ‘data huggers’ who avoid competition, don&#8217;t describe, don’t expose, don’t share their data. There are also good data sharers, but at the moment incentives and disincentives aren’t well balanced. We need research on how to enable better research and data must be fundamental to this.</p>
<p>IDCC has been over-subscribed for 3 of its years and always offers lots of inspiration and food for thought. The linkup with CODATA, whose events cover a similar subject area, can only be a good thing. I&#8217;m already looking forward to IDCC5, expected to be in London in 2009.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/12/22/4th-international-digital-curation-conference-part-3-idcc4-rides-again/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>4th International Digital Curation Conference part 2 (return of idcc4)</title>
		<link>http://dablog.ulcc.ac.uk/2008/12/07/4th-international-digital-curation-conference-part-2-return-of-idcc4/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/12/07/4th-international-digital-curation-conference-part-2-return-of-idcc4/#comments</comments>
		<pubDate>Sun, 07 Dec 2008 19:54:45 +0000</pubDate>
		<dc:creator>Kevin Ashley</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[data curation]]></category>
		<category><![CDATA[DCC]]></category>
		<category><![CDATA[digital curation]]></category>
		<category><![CDATA[edinburgh]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[idcc4]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/12/07/4th-international-digital-curation-conference-part-2-return-of-idcc4/</guid>
		<description><![CDATA[My last post from IDCC4 ended with me being unable to report on John Wilbanks&#8217; closing keynote on day 1, so I&#8217;ll rectify that now with the benefit of handwritten notes and a little more time for reflection. John is working on the Science Commons initiative which, through projects and advocacy, is taking the Creative [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://dablog.ulcc.ac.uk/2008/12/03/the-4th-international-digital-curation-conference-idcc4/">last post</a> from IDCC4 ended with me being unable to report on John Wilbanks&#8217; closing keynote on day 1, so I&#8217;ll rectify that now with the benefit of handwritten notes and a little more time for reflection.</p>
<p>John is working on the <a href="http://sciencecommons.org/">Science Commons</a> initiative which, through projects and advocacy, is taking the Creative Commons concepts and applying them to the doing of science, as well as the publishing of its outputs. (The DCC were <a href="http://forum.dcc.ac.uk/viewtopic.php?t=106">drawing our attention</a> to the initiative over 3 years ago.) He began with a view[1] that science is not unlike Wikipedia: both are about publishing, in the sense of disclosure, advances are made by individual action and proceed by small, discrete steps, and trust ratings accrue from peer review. He also commented on the &#8220;tyranny of the crowd&#8221; effect that general search tools like Google suffer from: someone searching for information about spears (their manufacture, use, or &#8220;spears, the carrying and chucking of&#8221;) will be somewhat overwhelmed by the number of results relating to Britney. And from this he moved to a view that science, to advance further, requires a disruptive change to its practices that it is inherently resistant to. One thing that needs to change is the notion that science is communicated through periodic papers (itself an outdated metaphor), &#8220;units of knowledge which are adverts for years of work.&#8221; He observed that, even if we still want papers, we really want them with embedded (ideally semantic) linking and tagging. Yet, although we have the technology to do this in a semi-automated way, the licenses that apply to many e-journals explicitly prevent us from doing so.</p>
<p>He then moved to considering ways to improve openness of journals and their content: by giving incentives such as better statistics to those who publish in open journals, and through simple but effective tools such as the <a href="http://scholars.sciencecommons.org/">scholar&#8217;s copyright and addendum tool</a>, <span id="more-245"></span> a really simple idea which impressed me: something which automatically adds text to a publisher&#8217;s licence from a closed journal which gives the author the necessary rights to self-publish (and hence link and annotate) their work. He demonstrated the effect of linking and the power of the semantic web through a Google search on particular receptors in brain chemistry and the genes which relate to them. A Google search, despite highly specific language, returned 88,400 results, most of which were papers about the receptors. But what the researcher probably wants is a list of genes and evidence for the nature of their relationship (as encoders, regulators, etc.). Their semantic web tools give exactly this, and allow the resultant RDF query to be turned into a simple (if unwieldy) hyperlink. What&#8217;s more, they were able to use Google Maps for brain data in a way that allows it to be annotated; one of those clever, simple ideas that makes you wonder why no one else did it before. As John made clear, one of the reasons is that the data isn&#8217;t sufficiently open.</p>
<p>He argued strongly that, for open science data to realise its potential, we must abandon the notion of requiring attribution and/or citation because it places too great a burden on those combining data from multiple sources. He&#8217;s doing pilot work on open data in Science Commons with charities funding work on <a href="http://neurocommons.org/">rare brain diseases</a>: the sort of thing in which the ability to link scattered data, often from other areas of research, hugely amplifies the value of the original research funding. Working with Google, they&#8217;ve come to the realisation that typical data mashups or search results might involve 40,000 individual citations if all the data sources are taken into account. For Google that&#8217;s a burden they&#8217;re not willing to deal with. So John is arguing strongly that we should abandon the desire to have databases, or cells in databases, attributable. It was a powerful argument, although I&#8217;m concerned that it shouldn&#8217;t be forgotten that we often still need the ability to determine data provenance, sometimes at the level of an individual value in a database cell, to ensure that we know we&#8217;re comparing like with like, or applying an appropriate statistical technique. And the provenance information is often tied up with the citation information. Still, John&#8217;s argument, and that of the Science Commons, seems very persuasive: huge benefits can be gained from making data completely open (as in public domain) and we will not realise those benefits if we cling to attribution or citation. He also made the point that although data growth is exponential, our brain capacity remains constant, and that the only human factor increasing along with data is population. People-driven annotation and sharing therefore helps us process increasing volumes of information (I think.)</p>
<p>Cameron Neylon has also <a href="http://blog.openwetware.org/scienceintheopen/2008/12/02/quick-update-from-international-digital-curation-conference/">written about John&#8217;s talk and that crucial distinction between attribution and citation.</a> (His blog post also contains a link to <a href="http://www.slideshare.net/CameronNeylon/radical-sharing-transforming-science-presentation">his presentation from Tuesday morning</a>, which makes a great case for the need to curate networks of science, not digital objects per se. I&#8217;m sorry I missed the talk.) John points out that many people confuse the need to attribute with the need to cite; attribution is a legal requirement, bound up with copyright and licences (even in the open world of Creative Commons) and failure to attribute material puts you legally in the wrong. Citation is merely a social convention or an ethical obligation, something we <em>ought</em> to do; failure to cite leads to the disapproval of your peers and is called plagiarism, at least when you attempt to pass off the ideas of others as your own. And that leads to the conclusion that, for an academic, the consequences of not meeting that social obligation are potentially much worse than the consequences of falling foul of copyright law. The latter may, at worst, lead to a financial penalty which is unpleasant but survivable, whereas plagiarism can lead to the loss of reputation, job and career &#8211; despite no laws having been broken.</p>
<p><a href="http://www.flickr.com/photos/asifch/356920779/sizes/l/"><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/12/356920779_c0f4d50b91_m.jpg" alt="Edinburgh Castle in the mist: asifch@flickr.com, CC-BY-ND-NC licence" title="Edinburgh Castle in the mist: asifch@flickr.com, CC-BY-ND-NC licence" class="float-right" style="border: 1px solid #dddddd; padding: 4px" /></a>But that was a minor corollary of the Science Commons thesis, which appears to be that we have to be willing to really let go of data, in a way that we haven&#8217;t done before, to allow science to proceed in a way that permits disruptive, rather than incremental, change.</p>
<p>Tuesday morning saw Martin Lewis, university librarian at Sheffield, look at the role libraries need to play in supporting research. His talk had two threads: one a reflection on how libraries are still agile, capable of change and sensitive to the needs of scholars and learners, and one a summary of the findings and consequences of the <a href="http://www.ukrds.ac.uk/">UKRDS</a> report, which at present is still in draft. He used new library buildings at Sheffield and Glasgow Caledonian to illustrate that libraries now create very different sorts of spaces for learners and as a result still attract students to congregate there. Capping of collection size is now commonplace &#8211; they recognise that most collections cannot continue to consume more space indefinitely. He reflected that UK university libraries had been devoting too much attention to learning and teaching and had, as a result, neglected the needs of researchers. And thus he moved to his assertion that university library services were best placed to meet the needs that the UKRDS feasibility study would identify. Martin says that they can raise awareness of data issues, lead policy on data management, provide advice to researchers early in the data lifecycle, and work with IT services to develop appropriate local capacity.</p>
<p>Martin&#8217;s talk was entertaining, informative and well-argued in equal measure, but I am not entirely persuaded by it. University libraries certainly have a part to play here, and in some institutions they may well need to adopt a central role. But just because they can do all of the things that Martin describes does not mean that they are the only people who can, nor that they are the best-placed to do so in every case. His argument put up a number of straw men, one of which seemed to see future RDS provision as a choice between a library-led service and a computing services-led one. It was a cheap dig (&#8220;How many university computing services have a &#8216;friends&#8217; organisation?&#8221;) and sets up a false dichotomy. This won&#8217;t be a simple either/or choice and there are many other routes to the intended end of a federated shared service of some sort. He emphasised that the UKRDS study was primarily a political act to keep the issue of research data management high on the agenda and that it should be seen in that light. I&#8217;m in complete agreement with that, and that&#8217;s the very reason I don&#8217;t see that it&#8217;s useful to engage in a turf war about which part of the university takes this forward.</p>
<p>That takes us to just over halfway at IDCC4, and it&#8217;s enough for one blog post. The rest will follow in a day or two, and there&#8217;s lots more of interest to write about.<br />
&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br />
[1] &#8211; which he credited to someone else that sounded like Jean-Pierre Galdon, but wasn&#8217;t</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/12/07/4th-international-digital-curation-conference-part-2-return-of-idcc4/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The 4th International Digital Curation Conference: idcc4</title>
		<link>http://dablog.ulcc.ac.uk/2008/12/03/the-4th-international-digital-curation-conference-idcc4/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/12/03/the-4th-international-digital-curation-conference-idcc4/#comments</comments>
		<pubDate>Wed, 03 Dec 2008 00:54:14 +0000</pubDate>
		<dc:creator>Kevin Ashley</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[data curation]]></category>
		<category><![CDATA[DCC]]></category>
		<category><![CDATA[digital curation]]></category>
		<category><![CDATA[edinburgh]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[idcc4]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/12/03/the-4th-international-digital-curation-conference-idcc4/</guid>
		<description><![CDATA[This is a brief and undigested report from day 1 of the DCC&#8217;s international digital curation conference taking place in Edinburgh. After a welcome from Chris Rusbridge (DCC director) and Professor Peter Clarke (NeSC director) we had a keynote from Professor David Porteous, a professor of human molecular genetics and medicine and a key player [...]]]></description>
			<content:encoded><![CDATA[<p>This is a brief and undigested report from day 1 of the DCC&#8217;s international digital curation conference taking place in Edinburgh. After a welcome from Chris Rusbridge (<a href="http://www.dcc.ac.uk/">DCC</a> director) and Professor Peter Clarke (<a href="http://www.nesc.ac.uk/">NeSC</a> director) we had a keynote from Professor David Porteous, a professor of human molecular genetics and medicine and a key player in <a href="http://generationscotland.org/">Generation Scotland</a>.</p>
<p>He began by illustrating the changes in health, disease and knowledge of causes that lie behind some of his work. Changes in Scottish demography illustrate this: in 1911 everyone is young, and numbers decline with age in a smooth curve. After WWII, in 1951, there is a flat bulge from ages 10 to 50 with a decline thereafter, whereas 2001 and 2031 see a bulge in pensioners. There is a consequent rise in chronic disease: disease that treatments of today are not very effective for, unlike the killers of the past, where effective treatments contributed to the changes in age profile in the population that we are now seeing. He illustrated this with reference to the <a href="http://www.sasi.group.shef.ac.uk/publications/reaper.html">grim reaper&#8217;s road map</a>: an atlas of mortality in the UK (and, as he said, a fine book title.) It showed cancer, heart and lung disease unequally distributed over the UK. Glasgow is particularly bad. Why? Nature and nurture both play a part, but other than smoking, we have very little evidence for the real effects underlying nurture causes such as diet variation. So his research concentrates on the nature aspect: what part our genetic inheritance plays.</p>
<p>He then looked at changes in sequencing costs. From 1990-2003 the human genome project spent $3bn to sequence the first genome, with machines spread over aircraft hangars; now one machine can do a genome for $500k, and within the next year for $5k. <a href="http://www.completehenomics.com">completegenomics</a> plans 20,000 genomes at $5k each in 2010 using 60,000 processors and 30PB of storage. <span id="more-244"></span>One goal of all this is personalised medicine &#8211; drugs that work for your genetic makeup. In the USA, adverse drug reaction is the 4th leading cause of hospital mortality (but I&#8217;m thinking that only some of this is genome-related; some of it must surely be because some drugs are just downright toxic, with prescription involving balanced risk that sometimes doesn&#8217;t go the way we want, and because prescribing errors are still all too frequent.) Bringing together mass genomics and automated drug screening is key, and involves two big sets of data. Generation Scotland plans to do this: a competitive advantage is an unhealthy, stable population. Large-scale family-based studies are possible, and there is a supportive attitude in Scotland to doctors and medical schools. Expertise in health informatics and in ethical, legal and social science is essential. It&#8217;s all volunteer-based (a striking contrast to Iceland and the UK genome bank); grandmothers are key influencers. Recruits have blood, serum and urine samples and tests of lung, bone, etc. and mental health status, so it&#8217;s more than just aggregating health records. Mental health is one area where drugs generally don&#8217;t work and the interaction between nature, nurture, aetiology and drugs needs to be explored in much more depth. The system is linked to medical records; subjects have the right to withdraw.</p>
<p>There were 10 years of consultation before it started. But then there&#8217;s Google Health and Google health trends &#8211; both ways of using large amounts of data to gain knowledge which work in different ways.</p>
<p>I missed the next morning sessions because I had to attend another meeting, and rejoined to see the minute-madness presentations for the posters, about which I hope to write later this week.</p>
<p>After lunch we had Dr Bryan Lawrence of <a href="http://www.scitech.ac.uk/">STFC</a> talk about big science data curation at the STFC environmental data archive. There were lots of numbers in this presentation and I only captured some of them: petabytes of data overall, 50TB from the Met Office, 4000 years&#8217; work to &#8216;look&#8217; at it all. That&#8217;s why you need metadata, because you can&#8217;t examine it all at ingest. At 2 minutes to find and do something simple, that&#8217;s 60,000 images/year/person. So we need to automate metadata creation and extraction. Google needs this metadata to help; it can&#8217;t deal with non-cited data directly. Most data mining processes text, not image data. We find data with discovery and ontology metadata; then look at context, character and discipline stuff; then also archival metadata. He mentions ISO 19115, which should be derived from browse metadata. (There was a much more formal classification of metadata in his presentation which I haven&#8217;t captured in these notes.)</p>
<p>Data scientists can&#8217;t do their job unless the scientist has done theirs. They can choose <strong>not</strong> to take stuff, though, because the scientist hasn&#8217;t done their job. But even not taking something consumes resources to make the assessment and decision. He makes the point that you can automate streams, but you can&#8217;t automate jobs away (10 things still need to be done, even if they are automated, so there&#8217;s still a linear resource relationship to the number of objects.)</p>
<p>They charge for 3 years&#8217; storage up-front at ingest time; if volumes continue to grow, historical data storage then lives on the margins from current business. Core budget pays for management and access systems, data management, network access, etc., then per-data-stream costs are charged to projects. Core covers some projects already. With 25 FTE they can support 10 new types of data per year, 100 things of a few hours&#8217; work, and 1000 things of a few minutes&#8217; work; beyond that it must be automated. The next IGPCC requirements change the scale and thus the cost models. The interesting problems are in the browse ontology and extra metadata space. Preserving metadata now presents its own challenges; real data publication is the way forward. In questions we determine that the deluge is necessary for the cost model, as is storage cost reduction. And at the moment it isn&#8217;t worth bothering about things to throw away, but that could change if those cost assumptions aren&#8217;t true (such as if storage costs plateau at some point.)</p>
<p>Neal Beagrie then speaks about research data costs. 4 case studies, 12 interviews, a literature review and a detailed look at 2 cost models against OAIS and UKHE TRAC led to <a href="http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx">the report</a>. It produced a 3-part activity model which supports a cost framework (pre-archive, archive, support services.) There are key cost variables and a resource template, and economic adjustments are kept separate from service adjustments. He contrasts repository costs for publications with costs for data repositories, and cites data from elsewhere about the much bigger cost of repairing metadata later on as opposed to doing it right at creation. The report looks at efficiency curve effects, economies of scale and problems with first mover costs. He mentions the ADS 20-year rule (that all-time costs of preservation are essentially accounted for in the first 20 years), but points out there are a number of assumptions behind that. He points out the usefulness of the NSB/NSF distinction between research, community and reference collections. The study is new in using FEC (full economic costs, a UK HE funding model which comes from TRAC &#8211; transparent approach to costing, not trustworthy archive certification!)</p>
<p>The study is not just about DIY; it can account for partial or full outsourcing. The OSI study shows that 1.4% or 1.5% of research funding goes on data preservation and access.</p>
<p>Brian Lavoie talks about economic sustainability, from the <a href="http://brtf.sdsc.edu/">Blue Ribbon Task Force</a>. Resources aren&#8217;t just &#8216;available&#8217;: meaningful engagement is necessary. They need to be comprehensive (or at least a critical mass), actionable and sustainable (hence persistent.) Sustainability is economic, technical and social. The task force is supported by NSF, Mellon, LoC, JISC, CLIR and NARA. Its mission is to frame digital preservation as a sustainable economic activity. We need to articulate benefits and incentives for decision makers (parallels with PoWR and the digital preservation policy study.) The first gives a willingness to pay, the second a willingness to provide.</p>
<p>We need selection and efficiency, and reliable predictability. Then we need to choose organisational form and governance. The organisation may have no interest (3rd party provider), private interest (university library/archive) or statutory/mandate interest (national library/archive). Issues: separating costs of access now vs access in the future; monetizing public good. He mentions &#8220;spend now for future value&#8221; which appears to resonate with the DELOS/NSF <a href="http://eprints.erpanet.org/48/">&#8220;Invest to Save&#8221;</a> message (in which I must declare an interest.) The first report is due this month.</p>
<p>At this point my laptop battery gave up the ghost. The final presentation from John Wilbanks was a real highlight in many ways, but it will take until tomorrow for me to transcribe my handwritten notes.</p>
<p>Day 1 ended with a conference dinner in the splendid setting of Edinburgh Castle. The harpist was a particularly fine touch.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/12/03/the-4th-international-digital-curation-conference-idcc4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

