<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ulcc da blog &#187; web archiving</title>
	<atom:link href="http://dablog.ulcc.ac.uk/tag/web-archiving/feed/" rel="self" type="application/rss+xml" />
	<link>http://dablog.ulcc.ac.uk</link>
	<description>ulcc digital archives blog</description>
	<lastBuildDate>Fri, 03 Feb 2012 16:24:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>DA visits CERN</title>
		<link>http://dablog.ulcc.ac.uk/2011/06/22/da-visits-cern/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/06/22/da-visits-cern/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 15:41:57 +0000</pubDate>
		<dc:creator>Silvia Arango-Docio</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[BlogForever]]></category>
		<category><![CDATA[CERN]]></category>
		<category><![CDATA[European Commission]]></category>
		<category><![CDATA[FP7]]></category>
		<category><![CDATA[Organisations]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=1540</guid>
		<description><![CDATA[Last week I was in a very welcoming Geneva, at the European Organization for Nuclear Research (CERN), to meet other partners working on BlogForever and to attend several Invenio workshops. I felt very lucky to be in the hub of such an organization and to see how many young international students are getting the [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_1556" class="wp-caption alignleft" style="width: 202px"><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/computercentre_cern-300x225.jpg"><img class="size-full wp-image-1556 " title="CERN Computer Centre" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/computercentre_cern-300x225.jpg" alt="CERN Computer Centre" width="192" height="144" /></a><p class="wp-caption-text">CERN Computer Centre</p></div>
<div id="attachment_1557" class="wp-caption alignright" style="width: 164px"><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/computer_raks_cern-300x300.jpg"><img class="size-full wp-image-1557  " title="CERN Computer Racks" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/computer_raks_cern-300x300.jpg" alt="" width="154" height="154" /></a><p class="wp-caption-text">CERN Computer Racks</p></div>
<p>Last week I was in a very welcoming Geneva, at the <a href="http://public.web.cern.ch/public/en/About/About-en.html">European Organization for Nuclear Research</a> (CERN), to meet other partners working on <a href="http://blogforever.eu/">BlogForever</a> and to attend several <a href="http://invenio-software.org/">Invenio</a> workshops. I felt very lucky to be in the hub of such an organization and to see how many young international students are getting the opportunity to be at the forefront of high-energy physics research.</p>
<div id="attachment_1558" class="wp-caption alignleft" style="width: 164px"><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/Outside_Globe_Cern.jpg"><img class="size-medium wp-image-1558 " title="The Globe at CERN" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/Outside_Globe_Cern-300x300.jpg" alt="The Globe at CERN" width="154" height="154" /></a><p class="wp-caption-text">The Globe at CERN</p></div>
<p>CERN is home to the world&#8217;s biggest and most powerful particle accelerator, the <a href="http://public.web.cern.ch/public/en/lhc/lhc-en.html">Large Hadron Collider</a> (LHC). The machine is installed in a tunnel 27 km in circumference and records around 15 petabytes of data per year. All the data is stored in CERN&#8217;s vast <a href="http://cdsweb.cern.ch/record/1103476/">computer centre</a>. Open access and sharing have been the driving principles since CERN was founded in 1954, which made it an inspirational environment for the Web to be born in.</p>
<p>The Invenio workshops showed us that this electronic document management system is robust and versatile: it is designed to manage more than 1.2 million documents and can be used in 19 different languages. Its content is clean and complete. In the High Energy Physics domain alone, there are around 700 collections and approximately 20,000 queries a day. Invenio is also used for special programmes such as the UNESCO-funded digital repositories in Africa and EU-funded projects like <a href="http://www.d4science.eu/">D4Science</a> and <a href="http://www.openaire.eu/">OpenAIRE</a>.</p>
<div id="attachment_1559" class="wp-caption alignright" style="width: 250px"><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/me_at_CERN-300x225.jpg"><img class="size-full wp-image-1559  " title="Silvia at CERN" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/06/me_at_CERN-300x225.jpg" alt="Silvia at CERN" width="240" height="180" /></a><p class="wp-caption-text">Me at CERN</p></div>
<p>In the case of BlogForever and Invenio, there is plenty of work to be done by the Invenio Team at the User and Document Service Group. At the moment, they have more than 30 readily available <a href="http://www.python.org/">Python</a> modules that can be adapted to the task of preserving huge numbers of blogs. From the point of view of my work with repositories as part of the <a href="http://www.ulcc.ac.uk/content/repositories-he-and-research">Digital Archives and Repositories Team</a> at <a href="http://www.ulcc.ac.uk/">ULCC</a>, I was inspired by Invenio&#8217;s advanced search engine and its indexing and ranking methods.</p>
<p>On a more personal note, if you are ever crossing the border between France and Switzerland near Geneva, take Tram 18 and hop off at CERN to see their <a href="http://microcosm.web.cern.ch/microcosm/Welcome.html">Microcosm</a> and <a href="http://public.web.cern.ch/public/en/spotlight/SpotlightGlobe-en.html">Globe</a> exhibitions.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/06/22/da-visits-cern/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Tab in the Ocean</title>
		<link>http://dablog.ulcc.ac.uk/2011/05/24/a-tab-in-the-ocean/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/05/24/a-tab-in-the-ocean/#comments</comments>
		<pubDate>Tue, 24 May 2011 14:52:23 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=1426</guid>
		<description><![CDATA[I&#8217;ve been using Web Curator Tool (WCT) to curate the JISC website collection at UKWA since 2008. I&#8217;ve long been aware that the system offered me the opportunity to record a lot of metadata, in tabs called General, Annotations, Groups and Access. It&#8217;s a mix of technical metadata (about the gather / website) and descriptive [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/05/tabs.jpg"><img class="alignleft size-medium wp-image-1434" style="margin: 5px;" title="tabs" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/05/tabs-300x148.jpg" alt="" width="300" height="148" /></a><br />
I&#8217;ve been using Web Curator Tool (WCT) to curate the <a href="http://www.jisc.ac.uk/whatwedo/programmes/preservation/ajw.aspx" target="_blank">JISC website collection</a> at <a href="http://www.webarchive.org.uk/ukwa/"  target="_blank">UKWA</a> since 2008. I&#8217;ve long been aware that the system offered me the opportunity to record a lot of metadata, in tabs called General, Annotations, Groups and Access. It&#8217;s a mix of technical metadata (about the gather / website) and descriptive metadata. It&#8217;s mainly of value to the curator who wants to keep track of what they&#8217;re doing with the website gathering; but WCT also allows us to create some descriptive metadata for exposure. At the bare minimum, we&#8217;re required to use Groups; despite its name, this component is actually a simple subject classification scheme, allowing me to tag all my websites with &#8220;Higher Education&#8221; for example. Once stored in the WCT database and rendered through Wayback Machine, this subject selection translates into <a href="http://www.webarchive.org.uk/ukwa/subject/79/page/1" target="_blank">this useful view of the collection</a>.</p>
<p>Recently the British Library team approached all the users of the shared WCT tool. It seems that the curators involved in UKWA have been using these metadata fields slightly differently and the BL team have initiated a project to move towards more consistency. The project will involve deciding on definitive interpretations of how to use these fields, followed by a process of cleaning up legacy data stored in the system. Some of it is potentially useful, some of it not so useful; some is legacy from the earlier PANDAS phase of the project, mostly not needed, or entered into the wrong field.</p>
<p>As noted, a lot of this metadata is mainly to do with selection and evaluation decisions, curation information such as changes in status of the site, and as such it&#8217;s never been exposed anywhere except within WCT. However, one descriptive field will eventually end up exposed on the UKWA live site, and provide us cataloguer types with an opportunity to describe the resources in more detail. It will appear on the Title Entry Page (TEP) for each instance.</p>
<p>I welcome any move towards exposing more descriptive metadata on the UKWA public site. I have always taken the view that the phrase which currently appears alongside a Title, “The live site may provide more information”, is not really very helpful in the context of a web archive, for three reasons: (1) we don’t want our users clicking away from UKWA; (2) the link to the live site may be dead by now; and (3) I feel strongly that we, as archivists and curators, are the ones who should be providing that “more information” in the shape of a catalogue description of some kind.</p>
<p>The JISC project sites, as a collection, have high evidentiary value as stages in development of very specific tools, services and activities that benefit the UK Higher Education community. The sites by themselves don’t always explain their history or intentions; I would argue that a lot of rich contextual detail about the reasons these sites existed (the JISC programme under which they were developed, the dates, the staff involved, the themes, the outputs) would help interpret the collection to the users and make it more intelligible.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/05/24/a-tab-in-the-ocean/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thoughts about blog data and metadata</title>
		<link>http://dablog.ulcc.ac.uk/2011/04/25/thoughts-about-blog-data-and-metadata/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/04/25/thoughts-about-blog-data-and-metadata/#comments</comments>
		<pubDate>Mon, 25 Apr 2011 20:39:18 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[Atom]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[BlogForever]]></category>
		<category><![CDATA[blogs]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data model]]></category>
		<category><![CDATA[European Commission]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[newsfeeds]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://blogforever.eu/?p=487</guid>
		<description><![CDATA[During the ArchivePress project at ULCC, we briefly considered the data and metadata generally made available with blogs and blog posts. As ArchivePress focused on the representations of blogs in newsfeeds, we examined the metadata that is generated in common, and exposed in the newsfeeds of three of the most common blog platforms, WordPress, Blogger [...]]]></description>
			<content:encoded><![CDATA[<i>From the <a href="http://blogforever.eu/">BlogForever</a> blog.</i>
<p>During the <a href="http://jiscpowr.jiscinvolve.org/wp/2009/06/24/archivepress-when-one-size-doesnt-fit-all/">ArchivePress</a> project at ULCC, we briefly considered the data and metadata generally made available with blogs and blog posts. As ArchivePress focused on the representations of blogs in newsfeeds, we examined the metadata that is generated in common, and exposed in the newsfeeds of three of the most common blog platforms: WordPress, Blogger and TypePad. Blogger and Typepad prefer the Atom newsfeed format; WordPress (particularly WordPress.com) prefers RSS (though it can be made to publish Atom feeds too). This analysis was done about a year ago and things may have changed, but here is a summary of what we found.</p>
<p>For each <strong>Blog</strong>, the following core information is available in the feeds:</p>
<table style="border-collapse: collapse" border="1" cellspacing="1" cellpadding="2" width="100%">
<tbody>
<tr>
<th></th>
<th><strong>WordPress (RSS)</strong></th>
<th><strong>Blogger (Atom)</strong></th>
<th><strong>Typepad (Atom)</strong></th>
</tr>
<tr>
<th>Feed Unique ID</th>
<td style="vertical-align: top">NA </td>
<td style="vertical-align: top">feed/id </td>
<td style="vertical-align: top">feed/id</td>
</tr>
<tr>
<th>Blog URL</th>
<td style="vertical-align: top">rss/channel/link </td>
<td style="vertical-align: top">feed/link@rel=&quot;alternate&quot; </td>
<td style="vertical-align: top">feed/link@rel=&quot;alternate&quot;</td>
</tr>
<tr>
<th>Blog Title</th>
<td style="vertical-align: top">rss/channel/title </td>
<td style="vertical-align: top">feed/title </td>
<td style="vertical-align: top">feed/title</td>
</tr>
<tr>
<th>Blog Description</th>
<td style="vertical-align: top">rss/channel/description </td>
<td style="vertical-align: top">feed/subtitle </td>
<td style="vertical-align: top">feed/subtitle</td>
</tr>
<tr>
<th>Date of last update</th>
<td style="vertical-align: top">rss/channel/lastBuildDate </td>
<td style="vertical-align: top">feed/updated </td>
<td style="vertical-align: top">feed/updated</td>
</tr>
<tr>
<th>Generating software</th>
<td style="vertical-align: top">rss/channel/generator </td>
<td style="vertical-align: top">feed/generator </td>
<td style="vertical-align: top">feed/generator</td>
</tr>
</tbody>
</table>
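<p>As a rough illustration of the blog-level mapping in the table above, here is a minimal Python sketch using the standard library. The helper name <code>blog_metadata</code> is my own invention for this example, not something taken from ArchivePress, and the sample feed is built in memory with values from this blog's own feed.</p>

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom elements live in this namespace


def blog_metadata(root):
    """Return the blog-level fields from the table for an RSS or Atom feed root.

    Hypothetical helper, for illustration only.
    """
    if root.tag == "rss":
        channel = root.find("channel")
        return {
            "id": None,  # the RSS feed exposes no feed-level unique ID
            "url": channel.findtext("link"),
            "title": channel.findtext("title"),
            "description": channel.findtext("description"),
            "updated": channel.findtext("lastBuildDate"),
            "generator": channel.findtext("generator"),
        }
    # Anything else is treated as an Atom feed in this sketch.
    alternate = root.find(ATOM + "link[@rel='alternate']")
    return {
        "id": root.findtext(ATOM + "id"),
        "url": alternate.get("href") if alternate is not None else None,
        "title": root.findtext(ATOM + "title"),
        "description": root.findtext(ATOM + "subtitle"),
        "updated": root.findtext(ATOM + "updated"),
        "generator": root.findtext(ATOM + "generator"),
    }


# Build a tiny RSS feed in memory so the sketch runs without network access.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "ulcc da blog"
ET.SubElement(channel, "link").text = "http://dablog.ulcc.ac.uk"
ET.SubElement(channel, "description").text = "ulcc digital archives blog"

rss_fields = blog_metadata(rss)
print(rss_fields["title"], rss_fields["url"])
```

<p>Returning the same keys for both formats is a deliberate choice in this sketch: downstream code can then treat a missing value (such as the RSS feed ID) uniformly as <code>None</code>.</p>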
<p></p>
<p>For each <strong>Post</strong>, we established that the following core information is available in the newsfeeds:</p>
<table border="1" cellspacing="1" cellpadding="2" width="100%" style="border-collapse: collapse">
<tbody>
<tr>
<th></th>
<th>WordPress (RSS)</th>
<th>Blogger (Atom)</th>
<th>Typepad (Atom)</th>
</tr>
<tr>
<th>Post Unique ID</th>
<td style="vertical-align: top">rss/channel/item/guid@isPermaLink </td>
<td style="vertical-align: top">feed/entry/id </td>
<td style="vertical-align: top">feed/entry/id</td>
</tr>
<tr>
<th>Post Title</th>
<td style="vertical-align: top">rss/channel/item/title </td>
<td style="vertical-align: top">feed/entry/title </td>
<td style="vertical-align: top">feed/entry/title</td>
</tr>
<tr>
<th>Post Summary</th>
<td style="vertical-align: top">rss/channel/item/description </td>
<td style="vertical-align: top">NA </td>
<td style="vertical-align: top">feed/entry/summary</td>
</tr>
<tr>
<th>Post URL</th>
<td style="vertical-align: top">rss/channel/item/link </td>
<td style="vertical-align: top">feed/entry/link@rel=&quot;alternate&quot; </td>
<td style="vertical-align: top">feed/entry/link@rel=&quot;alternate&quot;</td>
</tr>
<tr>
<th>Date of publication</th>
<td style="vertical-align: top">rss/channel/item/pubDate </td>
<td style="vertical-align: top">feed/entry/published </td>
<td style="vertical-align: top">feed/entry/published </td>
</tr>
<tr>
<th>Date of last update</th>
<td style="vertical-align: top">NA </td>
<td style="vertical-align: top">feed/entry/updated </td>
<td style="vertical-align: top">feed/entry/updated</td>
</tr>
<tr>
<th>Post Author</th>
<td style="vertical-align: top">rss/channel/item/dc:creator<br />
rss/xmlns:dc
</td>
<td style="vertical-align: top">feed/entry/author/name</td>
<td style="vertical-align: top">feed/entry/author/name</td>
</tr>
<tr>
<th>Post Category</th>
<td style="vertical-align: top">rss/channel/item/category </td>
<td style="vertical-align: top">feed/entry/category@term </td>
<td style="vertical-align: top">feed/entry/category@term</td>
</tr>
<tr>
<th>Post Content</th>
<td style="vertical-align: top">rss/channel/item/content:encoded<br />
rss/xmlns:content
</td>
<td style="vertical-align: top">feed/entry/content
</td>
<td style="vertical-align: top">feed/entry/content
</td>
</tr>
<tr>
<th>Post Comments</th>
<td style="vertical-align: top">rss/channel/item/comments </td>
<td style="vertical-align: top">feed/entry/link@rel=&quot;replies&quot; </td>
<td style="vertical-align: top">feed/entry/link@rel=&quot;replies&quot;</td>
</tr>
<tr>
<th>Post Comments Feed</th>
<td style="vertical-align: top">rss/channel/item/wfw:commentRss </td>
<td style="vertical-align: top">NA
</td>
<td style="vertical-align: top">NA
</td>
</tr>
</tbody>
</table>
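<p>The WordPress (RSS) column of the post table can be read off a feed in much the same way. The sketch below is only an assumption about how one might do it in Python: <code>rss_posts</code> is a hypothetical helper, and the sample item reuses values from one of this blog's own posts. Note that <code>dc:creator</code> and <code>content:encoded</code> need their namespaces declared, as the table indicates.</p>

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used by the WordPress feed for author and full content.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}


def rss_posts(channel):
    """Return one dict per RSS item, keyed by the row labels of the table.

    Hypothetical helper, for illustration only.
    """
    posts = []
    for item in channel.findall("item"):
        posts.append({
            "id": item.findtext("guid"),
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "url": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "updated": None,  # the RSS feed carries no per-post update date
            "author": item.findtext("dc:creator", namespaces=NS),
            "categories": [c.text for c in item.findall("category")],
            "content": item.findtext("content:encoded", namespaces=NS),
        })
    return posts


# Build a one-item channel in memory so the sketch runs on its own.
channel = ET.Element("channel")
item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "A Tab in the Ocean"
ET.SubElement(item, "link").text = "http://dablog.ulcc.ac.uk/2011/05/24/a-tab-in-the-ocean/"
ET.SubElement(item, "guid").text = "http://dablog.ulcc.ac.uk/?p=1426"
ET.SubElement(item, "pubDate").text = "Tue, 24 May 2011 14:52:23 +0000"
ET.SubElement(item, "{http://purl.org/dc/elements/1.1/}creator").text = "Ed Pinsent"
for name in ("metadata", "UKWAC", "web archiving"):
    ET.SubElement(item, "category").text = name

posts = rss_posts(channel)
print(posts[0]["title"], "by", posts[0]["author"])
```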
<p></p>
<p>One interesting point we noted was that neither Blogger nor Typepad published a link to a Comments Feed for each post. This made our work on ArchivePress more difficult since it was predicated on being able to easily identify the Comments feed for each post, and harvest new Comments as they were published. Obviously for blogs generated other than by WordPress, this was not going to be so easy. (Our ace developer Emanuele found some workarounds, but that&#8217;s another story.)</p>
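<p>To make the difficulty concrete, here is a small hedged sketch of the first step a harvester might take: working out which posts advertise a comments feed at all. The function name <code>comment_feed_urls</code> and the second, feed-less item are assumptions made for the example; this is not the ArchivePress code, nor one of Emanuele's workarounds.</p>

```python
import xml.etree.ElementTree as ET

WFW = "{http://wellformedweb.org/CommentAPI/}"  # namespace of wfw:commentRss


def comment_feed_urls(channel):
    """Map each post URL to its comments feed URL, or None if none is advertised.

    Hypothetical helper, for illustration only.
    """
    return {
        item.findtext("link"): item.findtext(WFW + "commentRss")
        for item in channel.findall("item")
    }


# Two sample items: one advertises a comments feed, the other does not.
channel = ET.Element("channel")
with_feed = ET.SubElement(channel, "item")
ET.SubElement(with_feed, "link").text = "http://dablog.ulcc.ac.uk/2011/06/22/da-visits-cern/"
ET.SubElement(with_feed, WFW + "commentRss").text = (
    "http://dablog.ulcc.ac.uk/2011/06/22/da-visits-cern/feed/"
)
without_feed = ET.SubElement(channel, "item")
ET.SubElement(without_feed, "link").text = "http://example.org/post-without-comment-feed/"

feeds = comment_feed_urls(channel)
missing = [url for url, feed in feeds.items() if feed is None]
print("posts needing a workaround:", missing)
```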
<p>I think this offers us an interesting overview of the core of standard, structured blog data and metadata, in three of the leading blog platforms. This is the data structure and metadata profile that is maintained in blog databases, in one of its native forms, and I&#8217;d expect it to be present in all blog platforms, since it arguably represents the essence of blogs. I hope this will be useful background when considering the core models for data and metadata handling that will be developed for BlogForever.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/04/25/thoughts-about-blog-data-and-metadata/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nominate blogs for our survey</title>
		<link>http://dablog.ulcc.ac.uk/2011/04/20/nominate-blogs-for-our-survey/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/04/20/nominate-blogs-for-our-survey/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 07:59:46 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[BlogForever]]></category>
		<category><![CDATA[blogs]]></category>
		<category><![CDATA[European Commission]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://blogforever.eu/?p=443</guid>
		<description><![CDATA[Is there a particular blog or blogger you would like to see included in the BlogForever survey? We invite you to use this form to nominate them, and we will try to ensure that the blog is reviewed or the blogger contacted to participate in our survey.]]></description>
			<content:encoded><![CDATA[<i>From the <a href="http://blogforever.eu/">BlogForever</a> blog.</i>
<p>Is there a particular blog or blogger you would like to see included in the BlogForever survey? We invite you to use this form to nominate them, and we will try to ensure that the blog is reviewed or the blogger contacted to participate in our survey.</p>
<div class="pageview">
	
  <iframe src="https://spreadsheets.google.com/embeddedform?formkey=dDlxOEgwRUpKVU9KZHFxSGo3MGxIb0E6MQ" frameborder="0" style="" scrolling="yes" height="700px" width="100%">Get a better browser!</iframe>
</div>

]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/04/20/nominate-blogs-for-our-survey/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Asynchronicities in blog structure</title>
		<link>http://dablog.ulcc.ac.uk/2011/04/11/asynchronicities-in-blog-structure/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/04/11/asynchronicities-in-blog-structure/#comments</comments>
		<pubDate>Mon, 11 Apr 2011 15:56:04 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[blog]]></category>
		<category><![CDATA[BlogForever]]></category>
		<category><![CDATA[blogs]]></category>
		<category><![CDATA[data model]]></category>
		<category><![CDATA[European Commission]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[time]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://blogforever.eu/?p=430</guid>
		<description><![CDATA[At an atomic level, a “blog” comprises “blog posts”, which are continually added to the blog corpus: that is the dynamic essence of a blog, and distinguishes it from old-fashioned, largely static Websites and hypertexts in which little content changed between major update iterations, which process was probably more akin to “publishing a new edition” [...]]]></description>
			<content:encoded><![CDATA[<i>From the <a href="http://blogforever.eu/">BlogForever</a> blog.</i>
<p>At an atomic level, a “blog” comprises “blog posts”, which are continually added to the blog corpus: that is the dynamic essence of  a blog, and distinguishes it from old-fashioned, largely static Websites and hypertexts in which little content changed between major update iterations, which process was probably more akin to “publishing a new edition” in the world of non-digital publications.</p>
<p>The blog also displays, as part of its frame, other graphical and functional elements (sidebars, widgets, “blogrolls”, etc) which may themselves contain dynamically updated, constantly changing information. These can be added, removed, amended and rearranged at will by the blog author/editor. Blog posts that were “published” in the context of one set of framing elements, will persist through subsequent versions of that framework.</p>
<p>Similarly with design (layout, colours, mastheads, etc), though the persistence tends to be longer, the informal nature of blogs means that these may be easily changed by the blog editor/author, and are thus more volatile than a typical “corporate” website. Again, blog posts may persist, unchanged in themselves, through many iterations of the blog site design and layout.</p>
<div id="attachment_431" class="wp-caption aligncenter" style="width: 310px"><a href="http://blogforever.eu/wp-content/uploads/2011/04/blogatoms.png"><img class="size-medium wp-image-431" src="http://blogforever.eu/wp-content/uploads/2011/04/blogatoms-300x223.png" alt="blogatoms 300x223 Asynchronicities in blog structure" width="300" height="223" title="Asynchronicities in blog structure" /></a><p class="wp-caption-text">A simple view of blog elements and their temporal relationship</p></div>
<p>&nbsp;</p>
<p>This very simplified visualisation suggests where we might start conceptualising key elements of a blog. It indicates that they iterate over time, but in the cases of Design, Posts and Widgets (as we’ll call them for brevity), according to independent schedules. While Posts and Comments persist in the online view of a blog, designs and widget arrangements are overwritten.</p>
<p>With my earlier ArchivePress project we deliberately overlooked preservation of the blog&#8217;s framing elements, and (given the much smaller scope of that project) established an <a href="http://archivepress.ulcc.ac.uk/2009/07/31/archival-musings/">acceptable rationale</a> for doing so. The challenge for BlogForever is to find a solution to  precisely these issues. Unless we were simply to adopt the snapshot approach of Heritrix-based web archiving initiatives (e.g. Wayback/archive.org, UK Web Archive), we need to ensure the BlogForever repository supports a degree of granularity that can capture, describe and preserve atomic blog objects in a way that reflects the particular interdependencies, in order to understand and preserve them authentically, and permit the many possible authentic and valid “time slice” views and analyses that users of the archive will need.</p>
<p>(I appreciate, by the way, that these objects are themselves compound objects, so not strictly &#8220;atomic&#8221;: but the same is also true of atoms, as our CERN colleagues can attest!)</p>
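<p>One way to picture the “time slice” idea is a structure in which each overwritten element keeps its dated versions, while posts simply accumulate. The Python below is a speculative sketch under that assumption; the class and function names (<code>Versioned</code>, <code>time_slice</code>) and the sample data are invented for illustration and do not describe the BlogForever data model.</p>

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Versioned:
    """A blog element whose state is overwritten over time (design, widgets)."""
    name: str
    versions: list = field(default_factory=list)  # (valid_from, state) pairs

    def record(self, valid_from, state):
        """Store a new state and keep the versions ordered by start time."""
        self.versions.append((valid_from, state))
        self.versions.sort(key=lambda version: version[0])

    def at(self, when):
        """Return the state in force at `when`, or None if none existed yet."""
        starts = [start for start, _ in self.versions]
        index = bisect_right(starts, when)
        return self.versions[index - 1][1] if index else None


def time_slice(elements, posts, when):
    """Reassemble the blog as it stood at `when`.

    `posts` is a list of (published, title) pairs; posts persist once
    published, so every post published up to `when` is included.
    """
    return {
        "frame": {element.name: element.at(when) for element in elements},
        "posts": [title for published, title in posts if when >= published],
    }


# Invented sample data: one design change, two posts.
design = Versioned("design")
design.record(datetime(2010, 1, 1), "blue theme")
design.record(datetime(2011, 1, 1), "green theme")
posts = [
    (datetime(2010, 3, 1), "First post"),
    (datetime(2011, 2, 1), "Second post"),
]

view = time_slice([design], posts, datetime(2010, 6, 1))
print(view)
```

<p>The point of the sketch is only that design and posts are looked up independently for the same moment, which is what lets them follow independent schedules.</p>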
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/04/11/asynchronicities-in-blog-structure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Workshop for the Web</title>
		<link>http://dablog.ulcc.ac.uk/2010/07/19/a-workshop-for-the-web/</link>
		<comments>http://dablog.ulcc.ac.uk/2010/07/19/a-workshop-for-the-web/#comments</comments>
		<pubDate>Mon, 19 Jul 2010 13:57:46 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[DPTP]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[web archiving]]></category>
		<category><![CDATA[web preservation]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=1025</guid>
		<description><![CDATA[We delivered our first DPTP workshop in London on 28 June 2010, on the subject of archiving websites. I delivered most of the training myself, working from my experience with archiving JISC project websites, writing the PoWR Handbook, and my sense for how the work should fit into a traditional archiving continuum. Accordingly I tried [...]]]></description>
			<content:encoded><![CDATA[<p>We delivered our first <a href="http://www.dptp.org">DPTP</a> workshop in London on 28 June 2010, on the subject of archiving websites. I delivered most of the training myself, working from my experience with archiving JISC project websites, writing the <a href="http://jiscpowr.jiscinvolve.org/wp/">PoWR Handbook</a>, and my sense for how the work should fit into a traditional archiving continuum. Accordingly I tried to structure the day to reflect a start-to-finish approach for the job, as follows:</p>
<ol>
<li>Consider your organisational requirements, your drivers and the level at which you want to archive, and create a selection policy that matches this. Consider the legal framework.</li>
<li>Understand the technology that copies websites (harvesters) and how websites themselves behave. I talked about aspects of the dynamic web that sometimes trip up a harvester &#8211; CMS, wiki, databases.</li>
<li>Consider how (and indeed whether) you want to offer access to the collection, and whether metadata is needed.</li>
<li>Build a programme for web archiving, adapting existing methodologies as needed &#8211; e.g. Institutional vs Individual. What other services exist, and can they do it for you?</li>
</ol>
<p>In the middle, we had an excellent case study from Dave Thompson at the Wellcome Trust, and his experiences strongly reflected many of the themes of the course. Like many organisations, they don&#8217;t have one single reason for collecting web archives, and the future value of these collections is something we can&#8217;t yet see (due to their closeness to the live web).</p>
<p>We were all impressed by the people attending the course, all from a variety of backgrounds and projects, coming with widely different expectations of how they would be managing their web content. National libraries and business archives were represented, but also the arts; the <a href="http://www.tate.org.uk/conservation/time/about.htm">Tate Gallery</a> are doing interesting work in time-based media and specialist works of art that manifest themselves over the web. How to capture that content, and make it perform in the future?</p>
<p>The DPTP recognises the value of participation and sharing experiences, which we can all learn from. When I was holding forth on the concept of three possible points of capture for web content, I was very pleased to hear a proposal for a fourth possible method from our Swiss delegate <a href="http://virtualworld.ch/">Daniel Spichty</a>. There were also numerous questions about exactly what it is that Content Management Systems do, which suggested to me I need to learn more about the inner workings and preservation implications of such systems.</p>
<p>I&#8217;m also pleased we were able to offer a printed copy of the PoWR Handbook to all those attending &#8211; in advance of the official launch of the book, which will take place at <a href="http://iwmw.ukoln.ac.uk/blog/2010/">IWMW 2010 on 12 July</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2010/07/19/a-workshop-for-the-web/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working with Web Curator Tool (part 2): wikis</title>
		<link>http://dablog.ulcc.ac.uk/2009/03/10/working-with-web-curator-tool-part-2-wikis/</link>
		<comments>http://dablog.ulcc.ac.uk/2009/03/10/working-with-web-curator-tool-part-2-wikis/#comments</comments>
		<pubDate>Tue, 10 Mar 2009 12:16:22 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[British Library]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=391</guid>
		<description><![CDATA[How to archive a website built with a wiki? It&#8217;s worth looking into this as increasingly JISC projects are using wikis to manage and report on their projects; of the available brands, MediaWiki is a popular one. The challenge for me is how to bring in a good copy of a wiki site without causing [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-medium wp-image-397" style="border: 1px solid black; margin: 15px;" title="mediawiki" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2009/03/mediawiki-300x246.jpg" alt="mediawiki" width="300" height="246" /><br />
How to archive a website built with a wiki? It&#8217;s worth looking into this as increasingly JISC projects are using wikis to manage and report on their projects; of the available brands, <a href="http://www.mediawiki.org" target="_blank">MediaWiki</a> is a popular one.</p>
<p>The challenge for me is how to bring in a good copy of a wiki site without causing Web Curator Tool to gather too many pages from it. We don&#8217;t want that, because (a) the finished result occupies unnecessary space in the archive and (b) because it takes so long to complete that it can hold up the gather queue in the shared web-archiving service, delaying the work of other UKWAC partners.</p>
<p>I am not technical enough to tell you in great detail what&#8217;s causing this, although I sense that it&#8217;s something to do with the Heritrix crawler requesting too many pages from the wiki. When you consider that a wiki is database-driven it should not surprise us that it&#8217;s creating a lot of its pages on the fly. Secondly, since a wiki is editable by lots of contributors (that&#8217;s its core function after all), it presumably means we have numerous past versions of pages also stored somewhere in the wiki labyrinth, and it&#8217;s possible that the implacable Heritrix will not cease until it&#8217;s faithfully requested and copied every single one of them.</p>
<p>Let&#8217;s look at the <strong><a href="http://www.ukoln.ac.uk/repositories/digirep/index/JISC_Digital_Repository_Wiki" target="_blank">Repositories Research Team wiki (DigiRep)</a></strong> owned by UKOLN, which I tried to gather five times in 2008. WCT conveniently keeps a history of these attempts, information about which I can still access even if the actual gathered pages have been discarded or archived. The size problems were chronic. Of five 2008 gathers, one was aborted after it had reached a massive 16.87 GB; a second one was rejected at 14.69 GB. I have archived one impression at 5.31 GB, another at 736.26 MB and another at 157.36 MB. Quite large variations there, which was worrying enough in itself.</p>
<p>At first, my workaround was to adjust the Profile Setting in the title to override the maximum number of documents Heritrix can gather. Setting &#8216;Maximum Documents&#8217; at 10000 worked, but it was not ideal; I suppose all this means is that Heritrix stops when it collects 10,000 pages, whether we have everything we want or not. (I found, however, that the copies in the archive seemed to render OK.)</p>
<p>To get a closer look at what&#8217;s going on, I started to browse the Log Files created by WCT (complete records of every single client-server request), which show patterns that I can vaguely understand; when these Log Files are packed with near-identical strings of code I sense that something&#8217;s up. For example, a string containing <code>index.php?title=Repositories_Research&amp;action=edit</code> tells us that Heritrix is requesting a specific named page <strong>and</strong> the edit view of that page. If you multiply that by the number of pages in the wiki, you can see how the problem builds up. (PHP is the scripting language in which MediaWiki is written.)</p>
<p>I follow this up by browsing the actual gathered pages in Web Curator Tool using the Tree View. From here I can click on the &#8216;View&#8217; button to examine a page which I think to be suspect, and compare it with other suspect pages. Lastly, I go back to the live DigiRep site to confirm in my mind what&#8217;s happening when certain links are followed.</p>
<p>All the above gave me just about enough information to experiment with exclusion filters. After a certain amount of trial and error, and working with other MediaWiki sites, I arrived at the following exclusion codes which I can add to the Profile Setting:</p>
<p><code>.*&amp;oldid.*<br />
.*&amp;diff.*<br />
.*&amp;limit.*<br />
.*&amp;direction.*<br />
.*Recentchanges.*<br />
.*/Special.*<br />
.*\?title=Special.*<br />
.*&amp;action=edit.*<br />
.*&amp;action=history.*<br />
.*&amp;section.*<br />
.*&amp;redlink.*<br />
.*&amp;printable=yes.*<br />
.*&amp;redirect=no.*</code></p>
<p>These have the effect of telling WCT to exclude certain pages and actions from Heritrix&#8217;s harvesting action. The expectation was that I would lose the discussion / edit / history functions of the wiki in the archive copy.</p>
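<p>As a rough check on what these filters do, the following Python sketch applies a subset of them to a handful of made-up MediaWiki-style URLs. It assumes each filter is a regular expression that has to match the whole URL; the example.org addresses and the <code>split_by_filters</code> helper are hypothetical and are not part of WCT or Heritrix.</p>
<pre>
```python
import re

# A subset of the exclusion codes listed above, as regular expressions.
FILTERS = [
    r".*&oldid.*",
    r".*&diff.*",
    r".*\?title=Special.*",
    r".*&action=edit.*",
    r".*&action=history.*",
    r".*&printable=yes.*",
]


def split_by_filters(urls, filters=FILTERS):
    """Split URLs into (kept, excluded) lists (hypothetical helper)."""
    kept, excluded = [], []
    for url in urls:
        if any(re.fullmatch(f, url) for f in filters):
            excluded.append(url)
        else:
            kept.append(url)
    return kept, excluded


# Made-up URLs in the style of a MediaWiki site, for illustration only.
base = "http://example.org/index.php?title="
urls = [
    base + "Repositories_Research",
    base + "Repositories_Research&action=edit",
    base + "Repositories_Research&action=history",
    base + "Repositories_Research&oldid=123",
    base + "Special:Recentchanges",
    "http://example.org/index/Repositories_Research",
]
kept, excluded = split_by_filters(urls)
print(len(kept), "kept;", len(excluded), "excluded")
```
</pre>
<p>Only the two plain page addresses survive in this toy list; the edit, history, old-version and Special pages are all filtered out, which is the same kind of thinning-out the filters are meant to achieve in a real gather.</p>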
<p>The title with the above exclusion profile gathered just 63.41 MB and it completed in under ten minutes. I would say that&#8217;s an improvement on 16.87 GB. Log Files and the Tree View confirmed the success of this new &#8220;slimline&#8221; gather. As well as losing the discussion / edit / history functions, we have also eliminated the Toolbox functions, the &#8216;printable&#8217; views, and the login pages.</p>
<p>This is no great loss at all for our purposes, as scholars who browse the archived copy of DigiRep are not expecting to be able to edit pages, nor join in the discussions, nor browse the history of stored versions of pages. Indeed in a lot of cases, they would require a login to do so. The users simply want to see the results of the DigiRep team&#8217;s work.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2009/03/10/working-with-web-curator-tool-part-2-wikis/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Working with Web Curator Tool (part 1)</title>
		<link>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/</link>
		<comments>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/#comments</comments>
		<pubDate>Wed, 25 Feb 2009 14:32:54 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=338</guid>
		<description><![CDATA[Keen readers may recall a post from April 2008 about my website-archiving forays working with Web Curator Tool, the workflow database, used for programming Heritrix, the crawler which does the harvesting of websites. Other UKWAC partners and myself have since found that Heritrix sometimes has a problem, described by some as &#8216;collateral harvesting&#8217;. This means [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/docman/212526202/" target="_blank"><img class="size-medium wp-image-339 alignright" style="margin: 15px;" title="212526202_c78bcda4cb" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2009/02/212526202_c78bcda4cb-300x225.jpg" alt="Early Morning Wheatfield by docman. Retrieved from http://www.flickr.com/photos/docman/212526202/" width="240" height="180" /></a></p>
<p>Keen readers may recall <a href="/2008/04/30/web-wct/">a post from April 2008</a> about my website-archiving forays working with <strong>Web Curator Tool</strong>, the workflow database, used for programming Heritrix, the crawler which does the harvesting of websites.</p>
<p>Other <a href="http://www.ukwebarchive.org.uk/ukwa/" target="_blank">UKWAC</a> partners and I have since found that Heritrix sometimes has a problem, described by some as &#8216;collateral harvesting&#8217;. This means it can gather links, pages, resources, images, files and so forth from websites we don&#8217;t actually want to include in the finished archived item.</p>
<p>Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com for example. Sometimes though it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example if the crawler somehow finds a website full of .avi files.)</p>
<p>I have probably become overly preoccupied with this issue, since I don&#8217;t want to increase our sponsor (JISC)&#8217;s overheads by occupying their share of the server space with unnecessarily bloated gathers, nor clutter up the shared bandwidth by spending hours gathering pages unnecessarily.</p>
<p>Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the <strong>Prune Tool</strong> on the harvested site after the gather has run. The Prune Tool allows you to browse the gather&#8217;s tree structure, and to delete a single file or an entire folder full of files which you don&#8217;t want.</p>
<p>The other option is to apply <strong>exclusion filters</strong> to the title before the gather runs. This can be a much more effective method. You enter a little bit of code in the &#8216;Exclude Filters&#8217; box of a title&#8217;s profile. The basic principle is to wrap the text you want to exclude in the code <code>.*</code>, which stands for &#8216;any run of characters&#8217;. <code>.*www.aes.org.*</code> will exclude that entire website from the gather. <code>.*/images/.*</code> will exclude any path containing a folder named &#8216;images&#8217;.</p>
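<p>To show how such filters behave, here is a minimal Python sketch. It assumes the filters are regular expressions that must match the whole URL; the <code>is_excluded</code> helper and the sample URLs are hypothetical and are not part of WCT.</p>
<pre>
```python
import re

# The two exclusion filters quoted above, treated as regular expressions.
EXCLUDE_FILTERS = [r".*www.aes.org.*", r".*/images/.*"]


def is_excluded(url, filters=EXCLUDE_FILTERS):
    """Return True if any filter matches the whole URL (hypothetical helper)."""
    return any(re.fullmatch(pattern, url) for pattern in filters)


# Made-up sample URLs, for illustration only.
samples = [
    "http://www.aes.org/events/",
    "http://example.org/images/logo.png",
    "http://example.org/about.html",
]
for url in samples:
    print(url, "excluded" if is_excluded(url) else "kept")
```
</pre>
<p>Note that the dots in <code>www.aes.org</code> are not escaped, so strictly speaking each one matches any single character; in practice that rarely matters for a filter like this.</p>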
<p>So far I generally find myself making two types of exclusion:</p>
<p>(a) <em>Exclusions of websites we don&#8217;t want</em>. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It&#8217;s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.</p>
<p>(b) <em>Exclusions of certain pages or folders within the Target which we don&#8217;t want</em>. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler is staying in the target, but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.</p>
<p>I believe I may have had a &#8216;breakthrough&#8217; of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this for my next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Continuity Girl</title>
		<link>http://dablog.ulcc.ac.uk/2008/07/25/continuity/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/07/25/continuity/#comments</comments>
		<pubDate>Fri, 25 Jul 2008 13:29:21 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[continuity]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/07/25/continuity/</guid>
		<description><![CDATA[Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The Web Continuity Project at TNA is a large-scale and Government-centric project, which includes a &#8220;comprehensive archiving of the government web estate by The National Archives&#8221;. Its aims are to address both “persistence” and [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/07/51y3yjwar6l_ss500_.jpg" title="Not Amanda Spencer" alt="Not Amanda Spencer" align="right" hspace="5" vspace="0" />Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The <a href="http://www.nationalarchives.gov.uk/webcontinuity"><strong>Web Continuity Project at TNA</strong></a> is a large-scale and Government-centric project, which includes a &#8220;comprehensive archiving of the government web estate by The National Archives&#8221;. Its aims are to address both “persistence” and “preservation” in a way that is seamless and robust: in many ways, “continuity” seems a very apposite concept with which to address the particular nature of web resources. It&#8217;s all about the issue of sustainable information across government.</p>
<p>At ULCC we&#8217;re interested to see if we can align some &#8216;continuity&#8217; ideas within the context of our <a href="http://jiscpowr.jiscinvolve.org/">PoWR project</a>.  Many of the issues facing departmental web and information managers are likely to have analogues in HE and FE institutions, and Web Continuity offers concepts and ways of working that may be worth considering and may be adaptable to a web-archiving programme in a University.</p>
<p><span id="more-152"></span>A main area of focus for Web Continuity is integrity of websites &#8211; links, navigation, consistency of presentation. The working group on this, set up by Jack Straw, found a lot of mixed practices in e-publication (some use attached PDFs, others HTML pages); and numerous different content management systems in use. No centralised or consistent publication method, in other words.</p>
<p>To achieve <em>persistency of links</em>, Web Continuity are making use of digital object identifiers (DOIs) which can marry a live URL to a persistent identifier. Further, they use a redirection component which is derived from open-source software. It can be installed on common web server applications, e.g. Apache and Microsoft IIS. This component will &#8220;deliver the information requested by the user whether it is on the live website, or retrieved from the web archive and presented appropriately&#8221;. Of course, this redirection component only works if the domains are still being maintained, but it will do much to ensure that links persist over time.</p>
<p>They are building a <em>centralised registry database</em>, which is growing into an authority record of Government websites, including other useful contextual and technical detail (which can be updated by Departmental webmasters). It is a means of auditing the website crawls that are undertaken. Such a registry approach would be well worth considering on a smaller scale for a University.</p>
<p>Their <em>sitemap implementation plan</em> involves the rollout of XML sitemaps across government. XML sitemaps can help archiving, because they help to expose hidden content that is not linked to by navigation, or dynamic pages created by a CMS or database. This methodology may be something for HE and FE webmasters to consider, as it would assist with remote harvesting by an agreed third party.</p>
<p>The intended <em>presentation method</em> will make it much clearer to users that they are accessing an archived page instead of a live one. Indeed, user experience has been a large driver for this project. I suppose that UK Government want to ensure that the public can trust the information they find and that the frustrating experience of meeting dead-ends in the form of dead links is minimised. Further, it does something to address any potential liability issues arising from members of the public accessing &#8211; and possibly acting upon &#8211; outdated information.</p>
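<p>For anyone unfamiliar with the format, the Python sketch below builds a very small sitemap in the sitemaps.org style. It is only an illustration of the general idea, not the Web Continuity implementation; the <code>build_sitemap</code> helper and the example.ac.uk addresses are hypothetical.</p>
<pre>
```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document listing the given URLs (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Made-up pages, including one that no navigation menu might link to.
pages = [
    "http://www.example.ac.uk/",
    "http://www.example.ac.uk/reports/2008/annual-report.html",
]
print(build_sitemap(pages))
```
</pre>
<p>A crawler that reads such a file can request every listed address directly, whether or not any page on the site links to it.</p>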
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/07/25/continuity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JISC-PoWR @ IWMW2008</title>
		<link>http://dablog.ulcc.ac.uk/2008/07/25/jisc-powr-iwmw2008/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/07/25/jisc-powr-iwmw2008/#comments</comments>
		<pubDate>Fri, 25 Jul 2008 10:24:47 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[IWMW2008]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[JISC-PoWR]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[repositories]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/07/25/jisc-powr-iwmw2008/</guid>
		<description><![CDATA[Seems I turned up just in time for the UKOLN IWMW 2008 event at Aberdeen. The sun was shining, weather was sweet, and the University buildings in Old Aberdeen looked magnificent. Not only outside &#8211; the Conference Hall in the Old King’s Library is a beautiful example of state of the art conferencing facilities, complete [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/07/aberdeen-crombie-towers.jpg" title="Aberdeen - Crombie Gates"><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/07/aberdeen-crombie-towers.thumbnail.jpg" class="float-right" alt="Aberdeen - Crombie Gates" /></a>Seems I turned up just in time for the UKOLN <a href="http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2008/" target="_blank">IWMW 2008</a> event at Aberdeen. The sun was shining, weather was sweet, and the University buildings in Old Aberdeen looked magnificent. Not only outside &#8211; the Conference Hall in the Old King’s Library is a beautiful example of state of the art conferencing facilities, complete with individual microphones and voting panels &#8211; as impressive a debating chamber as I’ve seen since we attended DLM 1999, at the European Commission&#8217;s Charlemagne Conference Centre in Brussels.</p>
<p>To warm up my brain before the afternoon’s PoWR workshop on preserving web resources, I sampled James Currall’s enjoyable <a href="http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2008/talks/currall/slides/IWMW08-JC-outline_Outline.html" title="discussion of web archiving" id="p:.w">discussion of web archiving</a>, and also a talk on Institutional Repositories by Stephanie Taylor of UKOLN and <a href="http://www.ukoln.ac.uk/repositories/digirep/index/Repositories_Research" title="RRT" id="v08s">RRT</a>.</p>
<p>In <em id="ni9y">The Tangled Web is but a Fleeting Dream &#8230; but then again&#8230;</em>, James Currall covered the essentials of web archiving in a clear and engaging way, drawing comparisons between the survival of WWI soldiers’ diaries, and the blogs of present-day servicemen in Iraq. Another example given was the trials and tribulations of the website for the Lockerbie Trial Briefing Unit: outsourced, host ceased trading, domain name lapsed, original website content dependent on outdated ColdFusion and Access environment. Thankfully a remote-harvested, static HTML image of the site does survive.</p>
<p>Why have there not been more conspicuous successes in web archiving in the past two decades? <span id="more-151"></span>Partly because it can be difficult to decide exactly whether and when a website should be treated as an authentic (and authenticable) record, a publication channel, or a publication itself &#8211; among other things. Partly because there are always waves of innovation washing up more interesting things to do. Partly because archiving is a policy issue, yet is generally addressed as purely a techn(olog)ical problem (where it is addressed at all).</p>
<p>It was reassuring, as ever, to hear it said again that there is not one single tool that addresses all possible web preservation issues (behaviour, dynamic content, scripts, versioning, emerging standards, etc.); that depending on the Internet Archive is at best a partial and risky solution; and that &#8220;whatever you do is likely to be imperfect&#8221;.</p>
<p>James put the chamber’s electronic voting systems to entertaining and informative use with a number of snap votes: would, for example, present-day soldiers’ blogs still be available in 90 years’ time? Of course I could confidently vote ‘yes’, knowing that the publication of the JISC PoWR handbook is barely months away! Other ad hoc vox pops revealed that the audience was about as familiar with OAIS as with the Book of Ezra.</p>
<p>James is director of the <a href="http://www.gla.ac.uk/espida/" title="Espida" id="ho4c">Espida</a> Project, and this was a timely reminder that we must consider the relevance of that project’s work on assessing and controlling costs to the guidance we’re assembling for the forthcoming PoWR handbook. James and I share what must be a fairly unusual distinction of citing Thomas Carlyle in support of our cause in <a href="/2008/04/02/open-repositories-2008-in-southampton/" title="SNEEPing at OR08">recent presentations</a>. Unlike James, however, I don&#8217;t bear such an <a href="http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2008/speakers/#currall" title="Currall or Carlyle?" target="_blank">uncanny likeness</a> to the great man.</p>
<p>Stephanie Taylor&#8217;s talk, <em id="ni9y0">Institutional Repositories: Asset or Obstacle?</em>, gave us a brief history of the Institutional Repository, and a well-paced explanation of the value and purpose of IRs &#8211; rather more substantial, thankfully, than my Bluffer&#8217;s Guide To IRs last year. I was initially intrigued that her presentation might be leading to the conclusion that IRs are &#8220;obstacles&#8221;, but I quickly realised the alarmist title was rhetorical in nature. The talk was, among other things, an engaging appeal to the better nature of web managers and other techies to appreciate the value of, and issues faced by, librarians.</p>
<p>In fact, the emergence of repositories, and the recognition of their place among institutional information systems, is a watershed in the evolution of electronic resources. Leaving it to researchers and teachers to manage what they create &#8211; variously on websites, blogs, Google Docs, thumb drive, or what you will &#8211; is no more sensible now than it ever was, if we want to ensure that they are consistently managed and accessible, let alone think about their preservation over the longer term.</p>
<p>There are many ways to approach it, none the only right and proper way. You may add extra value to it, through your choice of software or implementation (in-house or outsourced), or by using Web-Two-Oh-ish features and services. Institutional Repositories are only &#8220;one of many small conversations going on in different ways in different mediums”, but the need for an institution to manage valuable academic outputs in an orderly way is unarguable.</p>
<p>However, one tweet that flew over the <a href="http://twemes.com/iwmw2008" target="_blank">Twitosphere</a> during the afternoon suggests that there may be more to do in assessing the pros and cons of different approaches &#8211; particularly the merits or otherwise of Google&#8217;s omnipresent panaceas. Mike Ellis commented: &#8220;I&#8217;ve never been clear why it is that institutions would trust a repository vendor more than someone like Google&#8230;.?&#8221;</p>
<p><a href="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/07/aberdeen-art-richard.jpg" title="Aberdeen City Gallery"><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/07/aberdeen-art-richard.thumbnail.jpg" class="float-right" alt="Aberdeen City Gallery" /></a>I won’t write up the JISC-PoWR workshop &#8211; the results of that will be on the <a href="http://jiscpowr.jiscinvolve.org/" target="_blank">JISC-PoWR blog</a> &#8211; except to say that Marieke succinctly summarised the many web preservation issues we’ve accumulated, while Brian found effective ways to draw out issues and concerns around the growing range of Web 2.0 applications finding favour among staff and students at our institutions; and we were pleased that both Stephanie and James were able to contribute to the discussion.</p>
<p>And then it was off to the Aberdeen City Art Gallery for a glass of wine or three and some inspirational art&#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/07/25/jisc-powr-iwmw2008/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>POWRing up the POWR project</title>
		<link>http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/#comments</comments>
		<pubDate>Fri, 23 May 2008 16:57:05 +0000</pubDate>
		<dc:creator>Kevin Ashley</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[UKOLN]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/</guid>
		<description><![CDATA[The preservation of web resources is something that&#8217;s been of interest to me for some time. It presents some interesting technical challenges in terms of capture and access, and many interesting organisational and resource-oriented problems, some of which are shared with other aspects of digital preservation and some of which are unique to web resources. [...]<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/' addthis:title='POWRing up the POWR project '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></description>
			<content:encoded><![CDATA[<p>The preservation of web resources is something that&#8217;s been of interest to me for some time. It presents some interesting technical challenges in terms of capture and access, and many interesting organisational and resource-oriented problems, some of which are shared with other aspects of digital preservation and some of which are unique to web resources. How does one select material? When are we trying to preserve information and when is it the experience, behaviour or appearance that is paramount? How straightforward is it to move web resources between curatorial environments? (This is something that <a href="http://www.restore.ac.uk/">the ReStore</a> project is looking at.) Most everyone knows that information persistence on the web is a fragile thing. And, <a href="http://digitalcuration.blogspot.com/2008/04/rluk-launched-but-relaunch-is-flawed.html">as Chris Rusbridge has observed</a>, even those who care about information persistence don&#8217;t necessarily do a good job of it on their websites. <span id="more-114"></span>This, despite the fact that <a href="http://www.w3.org/Provider/Style/URI">good advice about URI persistence</a> has been available for some time. But URI persistence is just one small (albeit important) part of the problem.</p>
<p>Not everything on the web needs to be kept. And there&#8217;s more than one way to go about keeping it &#8211; often it&#8217;s just the information that needs to survive, and the particular way it is presented on a web site today is not, of itself, worthy of long-term preservation. Yet there&#8217;s a lack of knowledge about *how* to preserve web resources, and even when people know how to do it, for some reason it just doesn&#8217;t happen. That&#8217;s not a situation I feel comfortable with.</p>
<p>Thus, we&#8217;re pleased to be cooperating with UKOLN who are leading a short JISC project to produce guidelines on the preservation of web resources in UK academic institutions. We have a memorable acronym (POWR) and a project-specific blog at <a href="http://jiscpowr.jiscinvolve.org/">JISC Involve</a>, a relatively new service which allows for the creation of JISC project blogs. We&#8217;ll be doing most of the project blogging there rather than here on <a href="http://dablog.ulcc.ac.uk/">DABlog</a>. In using services like JISC Involve, we&#8217;ll (hopefully) go some way towards understanding whether such services create or solve problems in the area of resource persistence.</p>
<p>Through workshops and through the blog, we&#8217;re keen to gain input from everyone with an interest in this problem, particularly from the web managers who are most likely to be affected by the guidelines we&#8217;re charged with producing. Do contribute your thoughts.</p>
<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/' addthis:title='POWRing up the POWR project '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/05/23/powring-up-the-powr-project/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Web-archiving: the WCT workflow tool</title>
		<link>http://dablog.ulcc.ac.uk/2008/04/30/web-wct/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/04/30/web-wct/#comments</comments>
		<pubDate>Wed, 30 Apr 2008 15:11:57 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[British Library]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/04/30/web-wct/</guid>
		<description><![CDATA[This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website [...]<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/04/30/web-wct/' addthis:title='Web-archiving: the WCT workflow tool '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></description>
			<content:encoded><![CDATA[<p><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/04/web-curator-tool-logo.gif" alt="web-curator-tool-logo.gif" class="float-left" style="border: 0pt none " />This month I have been happily harvesting JISC project website content using my new toy, the <a href="http://webcurator.sourceforge.net/" target="_blank">Web Curator Tool</a>. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?</p>
<p>Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each &#8216;target&#8217; website, and allows a certain amount of interfacing with Heritrix, the actual engine that gathers the materials. The <a href="http://crawler.archive.org/" target="_blank">open-source Heritrix project</a> is being developed by the <a href="http://www.archive.org" target="_blank">Internet Archive</a>, whose access software (effectively the &#8216;Wayback Machine&#8217;) may also be deployed in the new public-facing website when it is launched in May 2008.</p>
<p><span id="more-94"></span><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/04/title-icon-targets.gif" alt="title-icon-targets.gif" class="float-right" style="border: 0pt none " />Although the idiosyncrasies of WCT caused me some anguish at first, largely through being removed from my &#8216;comfort zone&#8217; of managing regular harvests, I suddenly turned the corner about two weeks ago. The diagnostics are starting to make sense. Through judicious ticking of boxes and refreshing of pages, I can now interrogate the database to the finest detail. I learned how to edit and save a target so as to &#8216;force&#8217; a gather, thus helping to clear the backlog of scheduled gathers which had been accumulating, unbeknownst to us, since December. Most importantly, with the help of <a href="http://www.webarchive.org.uk" target="_blank">UKWAC </a>colleagues, we&#8217;re slowly finding ways of modifying the profile so as to gather less external material (or reduce collateral harvesting, to put it another way); or extend its reach to capture stylesheets and other content which is outside the root URL.</p>
<p>True, a lot of this has been trial and error, involving experimental gathers before a setting was found that would &#8216;take&#8217;. But WCT, unlike our previous set-up, allows the possibility of gathering a site more than once in a day. And it’s much faster. It can bring in results on some of the smaller sites in less than two minutes.</p>
<p>Now, 200 new instances of JISC project sites have been successfully gathered during March and April alone. A further 50 instances have been brought in from the Jan-Feb backlog. The daunting backlog of queued instances has been reduced to zero. Best of all, over 30 new JISC project websites (i.e. those which started around or after December 07) have been brought into the new system. I&#8217;ll be back in my comfort zone in no time…</p>
<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/04/30/web-wct/' addthis:title='Web-archiving: the WCT workflow tool '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/04/30/web-wct/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is this blog being preserved?</title>
		<link>http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/#comments</comments>
		<pubDate>Tue, 15 Jan 2008 11:11:24 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/</guid>
		<description><![CDATA[Only launched last month, yet already we received this question from Heather Needham, the Principal Archivist (ICT &#38; e-services) at Hampshire Record Office, soon after we announced the existence of the DA blog to the archival community via the JISC NRA listserv. &#8220;I presume ULCC is preserving its own blog somehow?!&#8221; she asks. By preserved [...]<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/' addthis:title='Is this blog being preserved? '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></description>
			<content:encoded><![CDATA[<p>Only launched last month, yet already we received this question from Heather Needham, the Principal Archivist (ICT &amp; e-services) at <a href="http://www.hants.gov.uk/record-office/" target="_blank">Hampshire Record Office</a>, soon after we announced the existence of the DA blog to the archival community via the <a href="http://www.jiscmail.ac.uk/lists/archives-nra.html" target="_blank">JISC NRA listserv</a>. &#8220;I presume ULCC is preserving its own blog somehow?!&#8221; she asks. By preserved she means preserved in a digital repository, say like <a href="http://www.webarchive.org.uk" target="_blank">UKWAC</a> sites, or <a href="http://www.nationalarchives.gov.uk/preservation/digitalarchive/default.htm" target="_blank">TNA&#8217;s digital archive</a>. &#8220;I&#8217;d assumed technology existed to preserve blogs, as they are a 21st century diary, and therefore a continuation of material collected by archives already. I&#8217;m probably completely wrong on the technology front!&#8221; <span id="more-44"></span></p>
<p>ULCC are not actively preserving this blog as yet. It&#8217;s backed up, and uses open standards and open source software: I don&#8217;t consider there would be any technical obstacles to capturing and preserving it like any other website in UKWAC. I have already archived some JISC blogs successfully,  for example, the <a href="http://www.webarchive.org.uk/tep/16095.html" target="_blank">archived PeerPigeon project site</a>, which contains a link to an archived copy of its own blog.  It&#8217;s been good to capture a copy of this particular blog (very coincidentally, it also runs on WordPress), which only existed for nine months in 2007 and is starting to wind down. The resource is still live at time of writing, but for how much longer?</p>
<p>It&#8217;s also worth noting that within the UKWAC Consortium, major partners The British Library are actively building a <a href="http://info.webarchive.org.uk/col.html" target="_blank">collection of blogs</a> (and similar sites using social software), which they call the <a href="http://www.webarchive.org.uk/col/c8250.html" target="_blank">&#8216;Digital Lives&#8217; collection</a>.</p>
<p>Some things to resolve with blog preservation, as with any other digital resource you want to preserve, are:</p>
<ul>
<li>selecting it for preservation in the first place</li>
<li>resolving any copyright issues that might prevent its archiving and/or repurposing</li>
<li>agreeing on a suitable frequency for harvesting of the blog</li>
<li>ensuring the organisation has sufficient resources to maintain the preserved copies</li>
</ul>
<p>The web harvesting software UKWAC have used to this point (PANDAS, which sits on top of HTTrack) has not met with any problems harvesting and copying the output from blogs. No problems are anticipated with the new Web Curator Tool, either.</p>
<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/' addthis:title='Is this blog being preserved? '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/01/15/is-this-blog-being-preserved/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>UKWAC: what about HLF websites?</title>
		<link>http://dablog.ulcc.ac.uk/2007/12/13/ukwac-what-about-hlf-websites/</link>
		<comments>http://dablog.ulcc.ac.uk/2007/12/13/ukwac-what-about-hlf-websites/#comments</comments>
		<pubDate>Thu, 13 Dec 2007 15:11:34 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[British Library]]></category>
		<category><![CDATA[Community Archives]]></category>
		<category><![CDATA[Digital sustainability]]></category>
		<category><![CDATA[HLF]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[minority ethnic groups]]></category>
		<category><![CDATA[National Archives]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dash.ulcc.ac.uk/blog/?p=25</guid>
		<description><![CDATA[We were recently relieved to learn that the Bernie Grant Trust archives website is still alive and well at http://www.berniegrantarchive.org.uk/. For a few weeks in November 2007, the site appeared to have vanished, ostensibly another web-based resource to have fallen to the vicissitudes of short-term funding. True, the Internet Archive had captured a few impressions [...]<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2007/12/13/ukwac-what-about-hlf-websites/' addthis:title='UKWAC: what about HLF websites? '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></description>
			<content:encoded><![CDATA[<p>We were recently relieved to learn that the <strong>Bernie Grant Trust archives</strong> website is still alive and well at <a href="http://www.berniegrantarchive.org.uk/" target="_blank">http://www.berniegrantarchive.org.uk/</a>. For a few weeks in November 2007, the site appeared to have vanished, ostensibly another web-based resource to have fallen to the vicissitudes of short-term funding. True, the <a href="http://www.archive.org">Internet Archive</a> had captured a few impressions of it, but the site is a complex one &#8211; full of interactive elements and database-driven deliverables, to say nothing of the online exhibition and other materials which can only be experienced through the website.</p>
<p>Why haven&#8217;t <a href="http://www.webarchive.org.uk">UKWAC</a> got a copy of this site? True, complex sites like this one tend to remain out of the reach of harvesting tools like PANDAS, which is based on HTTrack, and can&#8217;t get good results for sites which rely on complex server-side architecture. The site however is still unarchived as far as we know. <span id="more-25"></span>ULCC&#8217;s Joanne Anthony (who had worked as the archivist for the Bernie Grant Trust) was keen to learn if there was any way of submitting the site for consideration to one of the UKWAC partners. There is indeed an <a href="http://info.webarchive.org.uk/cgi-bin/submission.cgi" target="_blank">online submissions form</a> available, but this merely delivers a message to the UKWAC webmaster, who then forwards the request to the most appropriate partner. It would help considerably if the individual collection policies of each partner were made more manifest and published on the public site. But the visitor to <a href="http://www.webarchive.org.uk/">www.webarchive.org.uk</a> will find only a sketchy description of these policies, for example &#8220;The British Library will focus on sites of cultural, historical and political importance.&#8221;</p>
<p>Among UKWAC partners, the BL and the TNA are known to be directing their energies towards certain specialised collection strands. These are given more descriptive paragraphs at <a href="http://info.webarchive.org.uk/col.html">http://info.webarchive.org.uk/col.html</a>, yet the underlying pattern or theme of these collections is not immediately obvious. At least three of them &#8211; the Tsunami, General Election and London Terrorist attack strands &#8211; appear to be based primarily on the fact that the sites are ephemeral and most in danger of loss (regardless of their informational or evidentiary value as records).</p>
<p>It is not clear how a concerned individual, or a member of the DP Community, might be empowered to somehow influence UKWAC&#8217;s collection policies for the better. In the case of the Bernie Grant website, Joanne&#8217;s interest was to see minority ethnic groups better represented in UK archival collections; but another approach would be to see it within the larger group of &#8216;websites funded by Heritage Lottery Funding&#8217;. It seems likely there are many such project sites, all with short-term funding and therefore potentially at risk of being removed from cyberspace at any time, yet containing unique digital materials of huge potential cultural value. As a discrete collection of websites, it has parallels with JISC&#8217;s collection focus, i.e. JISC-funded projects which are occupying web space on a similar short-term lease. How can we persuade the relevant funding bodies to ensure their web outputs are archived, as JISC already does?</p>
<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2007/12/13/ukwac-what-about-hlf-websites/' addthis:title='UKWAC: what about HLF websites? '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2007/12/13/ukwac-what-about-hlf-websites/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>UKWAC&#8217;s migration of websites</title>
		<link>http://dablog.ulcc.ac.uk/2007/12/11/migration-of-websites/</link>
		<comments>http://dablog.ulcc.ac.uk/2007/12/11/migration-of-websites/#comments</comments>
		<pubDate>Tue, 11 Dec 2007 14:48:08 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[British Library]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dash.ulcc.ac.uk/blog/?p=23</guid>
		<description><![CDATA[As I write, the UK&#8217;s archive of websites is undergoing the process of migration, in the hands of the British Library who continue to act as the lead partners for the UKWAC Consortium. There are at least two sides to this mammoth task. The first (which I assume is probably relatively easy) involves moving the [...]<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2007/12/11/migration-of-websites/' addthis:title='UKWAC&#8217;s migration of websites '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></description>
			<content:encoded><![CDATA[<p>As I write, the <a href="http://www.webarchive.org.uk" target="_blank">UK&#8217;s archive of websites</a> is undergoing the process of migration, in the hands of the <a href="http://www.bl.uk" target="_blank">British Library</a> who continue to act as the lead partners for the <a href="http://info.webarchive.org.uk/index.html" target="_blank">UKWAC Consortium</a>.</p>
<p>There are at least two sides to this mammoth task. The first (which I assume is probably relatively easy) involves moving the archive of gathered websites from its current server infrastructure to its new one. The previous hosts, <a href="http://www.magus.co.uk/index.html">Magus</a>, have decided they can&#8217;t see a future in archiving websites. The new host, very coincidentally, is ULCC; our infrastructure services recently won the contract to provide a home for the large quantities of stored websites.</p>
<p>The second migration aspect, which involves complexities I&#8217;m glad I don&#8217;t have to deal with, involves moving the publisher and website profiles across from the <a href="http://pandora.nla.gov.au/pandas.html" target="_blank">PANDAS</a> database to the <a href="http://webcurator.sourceforge.net/" target="_blank">Web Curator Tool</a> (WCT) database. <span id="more-23"></span>WCT, as fate would have it, is being jointly developed by the BL and the National Library of New Zealand, and it will become our weapon of choice for all future web-harvesting activities. It&#8217;s certainly a more sophisticated piece of software than the clunky, web-object driven PANDAS, and appears able to handle the concept of one publisher owning more than one title (something which always baffled PANDAS).</p>
<p>The planned migration moves have been causing consternation to many of the UKWAC partners, particularly those who have been storing unprocessed gathers in the Magus &#8216;Temporary Drive&#8217; for a long time. We at ULCC have been assisting with managing that process for months. Kevin Ashley devised a simple script that could query this drive, and report back on the occupancy broken down by website number, with figures on file sizes and dates. Ed Pinsent, by querying PANDAS, was able to match website numbers to their owners, thus providing a handy set of reports on information that was otherwise unavailable. (PANDAS wasn&#8217;t able to see these unprocessed gathers, for some reason; Magus wouldn&#8217;t run a script to report on them because they&#8217;d never been asked to, and it would probably have incurred additional charges anyway.)</p>
<p>The JISC occupancy of the temp drive has been negligible, however. This is mainly because we have been so efficient in processing our completed gathers, and using the ftp connection has allowed us to look more closely at failed gathers. Additionally, JISC&#8217;s requirements are such that (unlike other partner members) we have rarely had to gather entire websites, instead concentrating on a few pages that constitute a JISC Project.</p>
<div class="addthis_toolbox addthis_default_style " addthis:url='http://dablog.ulcc.ac.uk/2007/12/11/migration-of-websites/' addthis:title='UKWAC&#8217;s migration of websites '  ><a class="addthis_button_facebook_like" fb:like:layout="button_count"></a><a class="addthis_button_tweet"></a><a class="addthis_button_google_plusone" g:plusone:size="medium"></a><a class="addthis_counter addthis_pill_style"></a></div>]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2007/12/11/migration-of-websites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

