<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ulcc da blog &#187; web</title>
	<atom:link href="http://dablog.ulcc.ac.uk/tag/web/feed/" rel="self" type="application/rss+xml" />
	<link>http://dablog.ulcc.ac.uk</link>
	<description>ulcc digital archives blog</description>
	<lastBuildDate>Fri, 03 Feb 2012 16:24:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Working with Web Curator Tool (part 1)</title>
		<link>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/</link>
		<comments>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/#comments</comments>
		<pubDate>Wed, 25 Feb 2009 14:32:54 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[UKWAC]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web archiving]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=338</guid>
		<description><![CDATA[Keen readers may recall a post from April 2008 about my website-archiving forays working with Web Curator Tool, the workflow database, used for programming Heritrix, the crawler which does the harvesting of websites. Other UKWAC partners and myself have since found that Heritrix sometimes has a problem, described by some as &#8216;collateral harvesting&#8217;. This means [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/docman/212526202/" target="_blank"><img class="size-medium wp-image-339 alignright" style="margin: 15px;" title="212526202_c78bcda4cb" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2009/02/212526202_c78bcda4cb-300x225.jpg" alt="Early Morning Wheatfield by docman. Retrieved from http://www.flickr.com/photos/docman/212526202/" width="240" height="180" /></a></p>
<p>Keen readers may recall <a href="/2008/04/30/web-wct/">a post from April 2008</a> about my website-archiving forays with <strong>Web Curator Tool</strong>, the workflow database used for programming Heritrix, the crawler that harvests websites.</p>
<p>Other <a href="http://www.ukwebarchive.org.uk/ukwa/" target="_blank">UKWAC</a> partners and I have since found that Heritrix sometimes has a problem, described by some as &#8216;collateral harvesting&#8217;. This means it can gather links, pages, resources, images, files and so forth from websites we don&#8217;t actually want to include in the finished archived item.</p>
<p>Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).</p>
<p>I have probably become overly preoccupied with this issue, since I don&#8217;t want to increase the overheads of our sponsor, JISC, by filling their share of the server space with unnecessarily bloated gathers, nor to clutter up the shared bandwidth by spending hours gathering pages unnecessarily.</p>
<p>Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the <strong>Prune Tool</strong> on the harvested site after the gather has run. The Prune Tool allows you to browse the gather&#8217;s tree structure, and to delete a single file or an entire folder full of files which you don&#8217;t want.</p>
<p>The other option is to apply <strong>exclusion filters</strong> to the title before the gather runs, which can be a much more effective method. You enter a short pattern in the &#8216;Exclude Filters&#8217; box of a title&#8217;s profile. The patterns are regular expressions, and the basic building block is <code>.*</code>, which matches any run of characters. <code>.*www.aes.org.*</code> will exclude that entire website from the gather. <code>.*/images/.*</code> will exclude any path containing a folder named &#8216;images&#8217;.</p>
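<p>As a rough illustration of how patterns like these behave, here is a minimal Python sketch. It is not WCT or Heritrix code: the names <code>EXCLUDE_FILTERS</code> and <code>is_excluded</code> and the sample URLs are made up for the example, and it assumes that a filter has to match the whole URL, which is what the leading and trailing <code>.*</code> suggest. The dots in the hostname are escaped here because an unescaped dot in a regular expression matches any character, not just a literal full stop.</p>
<pre><code class="language-python">import re

# Hypothetical exclusion patterns in the style described above.
EXCLUDE_FILTERS = [
    r".*www\.aes\.org.*",  # a whole external website (dots escaped)
    r".*/images/.*",       # any path containing an 'images' folder
]


def is_excluded(url, filters=EXCLUDE_FILTERS):
    """Return True if any filter matches the entire URL (an assumption)."""
    return any(re.fullmatch(pattern, url) for pattern in filters)


# Made-up URLs, for illustration only.
print(is_excluded("http://www.aes.org/events/"))          # True
print(is_excluded("http://example.org/images/logo.png"))  # True
print(is_excluded("http://example.org/about/"))           # False
</code></pre>
<p>Trying a pattern against a handful of URLs from a previous gather in this way is a cheap check that a filter is neither too narrow nor too greedy before it goes into a profile.</p>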
<p>So far I generally find myself making two types of exclusion:</p>
<p>(a) <em>Exclusions of websites we don&#8217;t want</em>. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It&#8217;s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.</p>
<p>(b) <em>Exclusions of certain pages or folders within the Target which we don&#8217;t want</em>. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler is staying in the target, but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.</p>
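<p>One way to spot this kind of repetition is to count how often each URL turns up in the requests. The following Python sketch assumes the URLs have already been pulled out of the log into a plain list. It does not parse any real log format, and the function name <code>most_requested</code> and the sample URLs are hypothetical.</p>
<pre><code class="language-python">from collections import Counter


def most_requested(urls, top=3):
    """Return the `top` most frequently requested URLs with their counts."""
    return Counter(urls).most_common(top)


# Made-up request list standing in for URLs extracted from a log.
requested = [
    "http://example.org/wiki/Home",
    "http://example.org/wiki/Home?action=edit",
    "http://example.org/wiki/Home",
    "http://example.org/wiki/Home",
    "http://example.org/wiki/About",
]

for url, count in most_requested(requested):
    print(count, url)
</code></pre>
<p>URLs that sit at the top of such a list with unexpectedly high counts are natural candidates for a closer look, and possibly for an exclusion filter.</p>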
<p>I believe I may have had a &#8216;breakthrough&#8217; of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this in my next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2009/02/25/working-with-web-curator-tool-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hot off the preservation press: JISC-PoWR and the Beagrie Survey</title>
		<link>http://dablog.ulcc.ac.uk/2008/11/21/hot-off-the-preservation-press-jisc-powr-and-the-beagrie-survey/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/11/21/hot-off-the-preservation-press-jisc-powr-and-the-beagrie-survey/#comments</comments>
		<pubDate>Fri, 21 Nov 2008 11:44:58 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[digital preservation]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[JISC-PoWR]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web 2.0]]></category>
		<category><![CDATA[web preservation]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/11/21/hot-off-the-preservation-press-jisc-powr-and-the-beagrie-survey/</guid>
		<description><![CDATA[We were pleased to have finally made available version 1.0 of the JISC PoWR Handbook. The Handbook is the result of our extensive work with UKOLN on the JISC Preservation of Web Resources project, which included three hugely valuable workshops, and extensive discussion on the PoWR blog. In the Handbook we&#8217;ve tried to cover a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/avantgardener4/2110782575/" title="Raspberry Jam by avantgardener4 on Flickr, CC by-nc-nd"><img src="http://dablog.ulcc.ac.uk/wp-content/uploads/2008/11/2110782575_a2e3d0d0b9_m.jpg" alt="Raspberry Jam by avantgardener4 on Flickr, CC by-nc-nd" align="right" width="120" /></a> We were pleased to have finally made available version 1.0 of the <a href="http://jiscpowr.jiscinvolve.org/handbook/">JISC PoWR Handbook</a>. The Handbook is the result of our extensive work with UKOLN on the JISC Preservation of Web Resources project, which included three hugely valuable workshops, and extensive discussion on the <a href="http://jiscpowr.jiscinvolve.org">PoWR blog</a>.</p>
<p>In the Handbook we&#8217;ve tried to cover a huge, and sometimes controversial, area in as accessible a way as possible. The workshops, attended by both web-management and records-management professionals from HE institutions, brought a wide range of concerns and issues to light. It&#8217;s been quite a job fitting it all in.</p>
<p>Even as the project progressed, we became aware of new developments in thinking about how to approach the special issues of managing web resources, including everybody&#8217;s favourite new fast automatic Web 2.0 applications. We saw the publication of Steve Bailey&#8217;s Records Management 2.0 book, TNA&#8217;s Web Continuity project, and further web archiving developments at UKWAC. We&#8217;ve even heard it whispered in some quarters that approaches to preservation may need a more profound reassessment in the context of the Web and the Cloud. Many of these issues were recorded on the PoWR blog, and we tried to reflect as much of this in the Handbook as possible.</p>
<p>Another recent JISC publication, <a href="http://www.jisc.ac.uk/Home/publications/publications/jiscpolicyfinalreport.aspx" target="_blank">The Digital Preservation Policies Study</a> by Charles Beagrie Ltd, published at the same time, is complementary in many ways, and reassured us that many of the conclusions we groped towards in the Handbook were not so wide of the mark!<span id="more-234"></span> Like PoWR, the Digital Preservation Policies Study identified the necessity of high-level policy engagement as the <em>sine qua non</em> of effective digital preservation.</p>
<p style="margin-left: 40px">Digital preservation solutions are undoubtedly partly technical, and the tools being created will enhance digital longevity, but these solutions are also equally dependent on organisational issues. It is important to remember that digital preservation relies on the interaction between the digital preservation environment and wider organisational objectives and procedural issues. These could be financial and staffing issues, collection management, legal obligations, auditing requirements, and other strategies and policies. In this respect, recognition by organisational divisions that digital data is important and key to the successful running of an organisation is crucial.</p>
<p style="margin-left: 40px" align="right"><em><a href="http://www.jisc.ac.uk/Home/publications/publications/jiscpolicyfinalreport.aspx" target="_blank">The Digital Preservation Policies Study</a></em>, p.11</p>
<p>Other recommendations the Study shares with PoWR include:</p>
<ul>
<li>Analysis of existing policies and strategies, and how our work can support them even if said policies don&#8217;t explicitly refer to preservation or digital assets</li>
<li>Taking a phased approach &#8211; nothing happens all at once. (PoWR recommends pilot projects and working with supportive departments.)</li>
<li>Careful scoping of preservation requirements. (With regard to web resources, PoWR suggests not everything, not every version, and not forever.)</li>
<li>Identifying if and where existing systems will do the job</li>
<li>Consideration of lifecycle, publication, and retention schedules</li>
</ul>
<p>The Charles Beagrie survey is a very concise and accessible contribution to the field, and we hope the PoWR Handbook, with its specific focus on established and emerging Web issues, and attention to the detailed and everyday concerns of our many contributors and correspondents, will be similarly useful. We also hope that the work of PoWR will continue in some form, on the blog and perhaps in the form of new projects and workshops, to fill in the gaps we left, and deal with the constantly emerging Web developments. Anyone for PoWR 2.0?</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/11/21/hot-off-the-preservation-press-jisc-powr-and-the-beagrie-survey/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital preservation in a nutshell, part II</title>
		<link>http://dablog.ulcc.ac.uk/2008/06/10/digital-preservation-in-a-nutshell-part-ii/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/06/10/digital-preservation-in-a-nutshell-part-ii/#comments</comments>
		<pubDate>Tue, 10 Jun 2008 13:04:32 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[digital preservation]]></category>
		<category><![CDATA[JISC]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[JISC-PoWR]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[web 2.0]]></category>
		<category><![CDATA[web preservation]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/06/10/digital-preservation-in-a-nutshell-part-ii/</guid>
		<description><![CDATA[As Richard noted in Part I, digital preservation is a &#8220;series of managed activities necessary to ensure continued access to digital materials for as long as necessary.&#8221; But what sort of digital materials might be in scope for the PoWR project?
We think it extremely likely that institutional web resources are going to include digital materials [...]]]></description>
			<content:encoded><![CDATA[<p><em>Originally published on the <a href="http://jiscpowr.jiscinvolve.org/2008/06/10/digital-preservation-in-a-nutshell-part-ii/">JISC-PoWR blog</a>.</em><br />
<hr /></p>
<p>As Richard noted in <a href="/2008/05/23/digital-preservation-in-a-nutshell-part-i/">Part I</a>, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?</p>
<p>We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.</p>
<p>What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.</p>
<p>By <strong>migration</strong> we mean “a means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next.” In contrast, <strong>emulation</strong> is “a means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers.”</p>
<p>Note also that when we talk about preserving anything, “for as long as necessary” doesn’t always mean “forever”. For the purposes of the PoWR project, it may be worth us considering <strong>medium-term preservation</strong> for example, which allows “continued access to digital materials beyond changes in technology for a defined period of time, but not indefinitely.”</p>
<p>We also hope to consider the idea of <strong>life-cycle management</strong>. According to DPC, “The major implications for life-cycle management of digital resources is the need actively to manage the resource at each stage of its life-cycle and to recognise the inter-dependencies between each stage and commence preservation activities as early as practicable.”</p>
<p>From these definitions alone, it should be apparent that success in the preservation of web resources will potentially involve the participation and co-operation of a wide range of experts: information managers, asset managers, webmasters, IT specialists, system administrators, records managers, and archivists.</p>
<p>(All the quotations and definitions above are taken from the <a href="http://www.dpconline.org/graphics/intro/definitions.html">DPC’s online handbook</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/06/10/digital-preservation-in-a-nutshell-part-ii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital preservation in a nutshell (Part I)</title>
		<link>http://dablog.ulcc.ac.uk/2008/05/23/digital-preservation-in-a-nutshell-part-i/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/05/23/digital-preservation-in-a-nutshell-part-i/#comments</comments>
		<pubDate>Thu, 22 May 2008 23:00:06 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Web Archiving]]></category>
		<category><![CDATA[JiSC-PoWR]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/05/23/digital-preservation-in-a-nutshell-part-i/</guid>
		<description><![CDATA[One of the goals of PoWR is to make current trends in digital preservation meaningful and relevant to information professionals with the day-to-day responsibility for looking after web resources. Anyone coming for the first time to the field of digital preservation can find it a daunting area, with very distinct terminology and concepts. Some of [...]]]></description>
			<content:encoded><![CDATA[<em>From the <a href="http://jiscpowr.jiscinvolve.org/">JISC-PoWR Project blog</a>.</em>
<hr />
One of the goals of PoWR is to make current trends in digital preservation meaningful and relevant to information professionals with the day-to-day responsibility for looking after web resources. Anyone coming for the first time to the field of digital preservation can find it a daunting area, with very distinct terminology and concepts. Some of these are drawn from time-honored approaches to managing things like government records or institutional archives, while others have been developed exclusively in the digital domain. It is an emerging and evolving field that can take some time to get your head round: so we thought it was a good idea to offer a series of brief primers.

Starting, naturally, with <strong>digital preservation</strong>: this is defined as a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (<a href="http://www.dpconline.org/graphics/intro/definitions.html" title="DPC">Digital Preservation Coalition</a>, 2002). <span id="more-118"></span>It's best to consider the scope of digital preservation as much broader than <strong>digital archiving</strong>, though the terms are often used interchangeably. Because, in computing generally, “archiving” is the process of backup and offline storage of data, the term “digital preservation” helps avoid confusion when referring to the broader issues of managing digital materials and information in and about them.

The Digital Preservation Coalition (DPC) is a consortium of many leading institutions working in the field, including The British Library and The National Archives. Its <a href="http://www.dpconline.org/graphics/handbook/">online handbook</a> contains much excellent information (though its online format could be improved), and includes a useful <a href="http://www.dpconline.org/graphics/intro/definitions.html">glossary</a>.

A third term, <strong>digital curation</strong>, has recently gained prominence. This places greater emphasis on the activities required to maintain the integrity of digital collections over time, and keep them usable. It promotes a pro-active approach to managing digital resources and the use of technological solutions, like web services, to address the problems that technology itself has created. It also paves the way for the emergence of “digital curators”, continually monitoring collections and intervening when necessary - a role analogous to their non-digital counterparts. The best source of information about digital curation is the <a href="http://www.dcc.ac.uk/">Digital Curation Centre</a>, based at Edinburgh University.

In the next part we'll look at some of the key concepts in digital preservation, including migration, emulation, and life-cycle models for digital objects. This will help us identify some of the things we should be considering when trying to preserve web resources.
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/05/23/digital-preservation-in-a-nutshell-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

