<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ulcc da blog &#187; file formats</title>
	<atom:link href="http://dablog.ulcc.ac.uk/tag/file-formats/feed/" rel="self" type="application/rss+xml" />
	<link>http://dablog.ulcc.ac.uk</link>
	<description>ulcc digital archives blog</description>
	<lastBuildDate>Fri, 03 Feb 2012 16:24:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Scanning is different from digitisation</title>
		<link>http://dablog.ulcc.ac.uk/2011/07/26/scanning-is-different-from-digitisation/</link>
		<comments>http://dablog.ulcc.ac.uk/2011/07/26/scanning-is-different-from-digitisation/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 12:47:38 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[Digitisation]]></category>
		<category><![CDATA[access]]></category>
		<category><![CDATA[accessibility]]></category>
		<category><![CDATA[digitisation]]></category>
		<category><![CDATA[e-books]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[preservation]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=1630</guid>
		<description><![CDATA[If you haven’t seen it, can I recommend Kristen Snawder&#8217;s recent post on the Library of Congress Digital Preservation blog, Digitization is different than digital preservation. Kristen reiterates familiar points about the long-term commitment necessary for serious digital preservation, contrasted with the quick hit of a scanning project. “In the hurry to meet user expectations, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/kylemcdonald/4287375982"><img class="alignright size-medium wp-image-1631" title="Autocorrelation scan by Kyle McDonald on Flickr" src="http://dablog.ulcc.ac.uk/wp-content/uploads/2011/07/4287375982_5b5767939d_o-300x300.png" alt="" width="171" height="171" /></a></p>
<p>If you haven’t seen it, can I recommend Kristen Snawder&#8217;s recent post on the Library of Congress Digital Preservation blog, <a class="c3" href="http://blogs.loc.gov/digitalpreservation/2011/07/digitization-is-different-than-digital-preservation-help-prevent-digital-orphans/">Digitization is different than digital preservation</a>.  Kristen reiterates familiar points about the long-term commitment necessary for serious digital preservation, contrasted with the quick hit of a scanning project. “In the hurry to meet user expectations, institutions may scan large quantities of materials without having a solid plan for preserving the digital images into the future.”</p>
<p class="c2">However, another recent find on the Web compels me to make an additional point, namely that we might do equally well to differentiate between scanning and digitisation. Anyone can set to work with a scanner and create a bunch of digital images &#8211; but that barely scratches the surface of what I think we should be expecting of a digitisation project in 2011.</p>
<p class="c0">First and foremost, we need metadata: the more the merrier, but something at least. Even if we expect to come back later and polish it up (once the images can be browsed and examined on screen). In the absence of any established metadata profiles for a project, at least try to cover as many <a class="c3" href="http://dublincore.org/documents/dces/">Dublin Core</a> elements as possible &#8211; title, creator, date, subject/keywords&#8230; Images, in particular, may prove tricky or time-consuming to find again, especially once there are thousands of them on a disk. We should probably keep the metadata in a database, and perhaps additionally store metadata with the objects. This can be as XML or plain text files stored alongside the digital images, or embedded in the files we create (many common file formats &#8211; TIFF, JPEG, MPEG, PDF &#8211; support metadata embedding, and there are many free tools available to help).</p>
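<p class="c0">As a rough illustration of storing metadata alongside the objects, here is a minimal Python sketch that writes a handful of Dublin Core elements as an XML record. The <code>metadata</code> wrapper element, the <code>build_sidecar</code> function name and the sample values are all assumptions made for the example, not part of any established profile; only the Dublin Core namespace URI is standard.</p>

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace (the same URI this feed declares as "dc").
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_sidecar(fields):
    """Return a small Dublin Core record as an XML string.

    `fields` maps Dublin Core element names (title, creator, date,
    subject and so on) to a string or a list of strings.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper name is our own choice
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for one scanned page.
record = build_sidecar({
    "title": "Minute book, page 12",
    "creator": "Unknown",
    "date": "1887",
    "subject": ["minutes", "committee"],
})
print(record)
```

<p class="c0">The resulting string could be saved next to the image under the same base name, so the record travels with the file even if the database is lost.</p>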
<p class="c0">There is yet more, though, that we should be doing, particularly when we are scanning text-based objects (articles, books, magazines, reports, etc). Most importantly, we really should try and extract the text from the image if possible. <sup class="c1"><a name="ftnt_ref1" href="#ftnt1">[1]</a></sup></p>
<p class="c2">My recent web find was the teaching blog of Dr Toine Bogers at the <a class="c3" href="http://www.iva.dk/">Royal School of Library and Information Science</a> (RSLIS) in Copenhagen, Denmark. One fascinating post describes a Lab Session exercise, <a class="c3" href="http://itlab.dbit.dk/~toine/?page_id=304">From OCR To NER</a>, a set of comparatively simple command-line processes to get the most out of a scanned-text project.</p>
<p class="c0"><span id="more-1630"></span>Toine’s post walks us through the process. Once the article is scanned, we should apply some OCR. The exercise goes further and also describes the use of tools to clean up and spell-check the resulting OCR’d text. This will, at the very least, result in a separate text file, hopefully containing a fairly accurate version of the article text. Finally, the cleaned-up text can be submitted to a Named Entity Recognition service. Toine’s exercise uses NER <a class="c3" href="http://cogcomp.cs.illinois.edu/demo/ner/">tools at University of Illinois</a>. (We’ve been using similar functionality provided by <a class="c3" href="http://www.opencalais.com/">OpenCalais</a> and <a class="c3" href="http://gate.ac.uk/">GATE</a> for our <a class="c3" href="http://www.jisc.ac.uk/whatwedo/programmes/inf11/infrastructureforresourcediscovery/pathfinder.aspx">AIM25 Open Metadata</a> project.)</p>
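<p class="c0">To give a flavour of the clean-up stage, here is a minimal Python sketch. It is not the toolchain from Toine’s exercise: it assumes the OCR text is already in a string, and the <code>candidate_names</code> function is only a crude stand-in for a real NER service, picking out runs of capitalised words. Both function names and the sample text are invented for the example.</p>

```python
import re


def clean_ocr_text(raw):
    """Tidy typical OCR output: rejoin hyphenated words and unwrap lines.

    Note that this simple version also flattens paragraph breaks.
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin words split across lines
    text = re.sub(r"\s*\n\s*", " ", text)        # turn remaining line breaks into spaces
    return re.sub(r" {2,}", " ", text).strip()   # collapse repeated spaces


def candidate_names(text):
    """Naive stand-in for NER: runs of two or more capitalised words."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text)


# Hypothetical OCR output with a hard line wrap and a hyphenated break.
raw = "Dr Toine Bogers teaches at\nthe Royal School in Copen-\nhagen."
cleaned = clean_ocr_text(raw)
print(cleaned)
print(candidate_names(cleaned))
```

<p class="c0">A real entity recogniser would also catch single-word names such as Copenhagen, which this pattern misses; the point is only that each stage takes plain text in and hands plain text on.</p>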
<p class="c0">Why do all this? The most important, instant, result of this is that we can now easily index our article for full-text searching &#8211; in a local repository system such as EPrints or DSpace &#8211; and of course by Google. None of this is possible if we leave the scanned image as just that &#8211; an image.</p>
<p class="c2">Another side-effect of any successful OCR outcome is that the text is now free to be re-flowed. This means that we might consider sharing it with users in a variety of forms, enhancing usability and accessibility.</p>
<p class="c2">It’s important not to confuse preservation formats with formats for access and dissemination. You will probably have your scanned image masters in TIFF, RAW, JPEG2000, PostScript, SVG. None of these are likely to be of much use to your users over the Web. Not only are the formats not widely supported by Web browsers, but most users probably don’t need or want your master image. If it’s a high-resolution scan of a 100 page book, they might be looking at a 100MB download, or worse &#8211; slow to load, and probably slow to render and navigate.</p>
<p class="c2">Time taken thinking about what formats will give users the best experience is time well spent. What platforms might they want to use now and in the foreseeable future? It’s less than 18 months since the Kindle 3 made e-book readers affordable, and the iPad made them sexy. E-books look and function very impressively on both platforms (albeit in different ways): for an overview of some of the benefits of the EPUB format, see Martin Fenner&#8217;s post <a href="http://blogs.plos.org/mfenner/2011/01/23/beyond-the-pdf-%E2%80%A6-is-epub/">Beyond The PDF&#8230; is EPUB</a>. PDF outputs may yet have their uses, if users can at least search for text within them. The point is that only with properly digitised text do these kinds of accessibility options become possible.</p>
<p class="c2">Even image collections can be disseminated as e-books &#8211; nice offline items some users might care to flick through on their tablet computers, possibly even smartphones. I&#8217;ve demonstrated how we can <a href="http://sasopenjournals.blogspot.com/2011/07/populating-ojs-from-eprints.html">create OJS XML from EPrints XML on-the-fly with XSLT</a>: since EPUB and Mobi/Kindle are XML-based formats, we should be able to do something similar to create e-books using repository APIs. Also, by using appropriately sized images in dissemination formats (the iPad screen is 1024x768px; the iPhone 4 is 960x640px) we can not only ship our users a sensibly-sized download, we can protect any capital we may have in the master images, without having to resort to ugly tricks like watermarking. (Giving users full-size, high-res images with embedded watermarks seems to me the worst of all worlds.)</p>
<p class="c2">Therefore I&#8217;d suggest that, in order to get the best out of a digitisation project, consider what you would like to see at the end of the project &#8211; and, more importantly, what would give your users the best experience, or even win you new users? Ask around, do some tests, with users if possible, and get an idea how they want to use the materials and how they will get the best out of them. Maybe there are comparable projects and systems that you admire, with features you’d like to be available for your collection. What about in five or ten years’ time: will your current project outputs help or hinder longer term accessibility goals?</p>
<p class="c2">This kind of vision is essential. Without some conception of the end result, how the materials will be used and managed most effectively, all the scanning in the world isn’t going to amount to a successful digitisation project.</p>
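<p class="c2">Working out an appropriate dissemination size is simple arithmetic. The Python sketch below scales a master image’s pixel dimensions to fit a target screen while keeping the aspect ratio. The screen sizes are the ones quoted above; the <code>fit_within</code> function name and the 4800x3600 master size are assumptions for the example, and the actual resampling would be done by whatever imaging tool you already use.</p>

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit inside a box, keeping aspect ratio."""
    scale = min(max_width / width, max_height / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


# Screen sizes quoted in the post; the device labels are just dictionary keys.
SCREENS = {"tablet": (1024, 768), "phone": (960, 640)}

master = (4800, 3600)  # hypothetical master scan dimensions in pixels
for name, box in SCREENS.items():
    print(name, fit_within(*master, *box))
```

<p class="c2">For this 4:3 master the tablet derivative fills the 1024x768 screen exactly, while the phone derivative is limited by height and comes out at 853x640.</p>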
<p>&nbsp;</p>
<hr class="c9" style="text-align: left; width: 50%; margin: 0 auto 0 0;" />
<div>
<p class="c2"><a name="ftnt1" href="#ftnt_ref1">[1]</a> Of course manuscripts and ‘difficult’ print formats &#8211; early printing typefaces, multilingual objects &#8211; may be resistant to OCR. For that we may need specialised solutions or rekeying, as discussed in recent posts on DA Blog (<a class="c3" href="http://dablog.ulcc.ac.uk/2011/05/05/house-of-books-part-2/">House Of Books (Part 2)</a>, <a class="c3" href="http://dablog.ulcc.ac.uk/2011/02/21/synergies-abound/">Synergies Abound</a>). Or the kind of online tool we developed with UCL for <a class="c3" href="http://dablog.ulcc.ac.uk/2010/03/01/transcribing-bentham/">Transcribe Bentham</a>.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2011/07/26/scanning-is-different-from-digitisation/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>File formats&#8230;or data streams?</title>
		<link>http://dablog.ulcc.ac.uk/2009/12/03/ffods/</link>
		<comments>http://dablog.ulcc.ac.uk/2009/12/03/ffods/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 16:42:53 +0000</pubDate>
		<dc:creator>Ed Pinsent</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[DPC]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[Technical]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/?p=811</guid>
		<description><![CDATA[On 1st December Malcolm Todd of The National Archives gave a good account of the work he&#8217;s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon [...]]]></description>
			<content:encoded><![CDATA[<p>On 1st December Malcolm Todd of The National Archives gave a good account of the work he&#8217;s been doing on <strong>File Formats for Preservation</strong>, resulting in a substantial new <a href="http://www.dpconline.org/docs/reports/dpctw09-02.pdf">Technology Watch report for the DPC</a>. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.</p>
<p>My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we&#8217;re facing is broadly the same every time; migrate data from a format that&#8217;s about to become obsolete or unsupported, onto another format that&#8217;s stable, supported, and open. MS Word document to PDF or PDF/A…now <em>that</em>, I can understand!</p>
<p>In fact, I learned at least two ways of thinking about formats that hadn&#8217;t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format. <span id="more-811"></span>For example, we&#8217;ve tended to pin our faith on the TIFF format for images for many reasons, but there&#8217;s a high storage price to be paid for all that wonderful losslessness. This may be one reason why the DP world is looking with more favour on the JPEG2000 format, which is &#8216;virtually&#8217; lossless and smaller in size.</p>
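<p>The cost comparison can be put into a back-of-envelope form. This Python sketch simply multiplies storage over time and adds a per-migration charge, as described above. Every figure in it is invented for illustration (collection sizes, price per gigabyte, migration cost), and the <code>preservation_cost</code> function is my own simplification rather than anything from the report.</p>

```python
def preservation_cost(size_gb, years, cost_per_gb_year,
                      migrations=0, cost_per_migration=0.0):
    """Storage cost multiplied over time, plus the cost of format migrations."""
    storage = size_gb * cost_per_gb_year * years
    return storage + migrations * cost_per_migration


# Hypothetical collection: the same images held as TIFF or as JPEG2000.
# All numbers below are made up purely to show the shape of the sum.
tiff_cost = preservation_cost(size_gb=1000, years=10, cost_per_gb_year=0.5)
jp2_cost = preservation_cost(size_gb=400, years=10, cost_per_gb_year=0.5,
                             migrations=1, cost_per_migration=500.0)
print("TIFF:", tiff_cost)
print("JPEG2000:", jp2_cost)
```

<p>Even a toy sum like this makes the trade-off visible: a smaller format can absorb the cost of an extra migration and still come out ahead, or not, depending on the figures you plug in.</p>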
<p>The second is the problem of preserving digital data which doesn&#8217;t actually have a specified stable preservation format. Chris Puttick of <a href="http://thehumanjourney.net/">Oxford Archaeology</a> gave a vivid description of the problems he&#8217;s facing with CAD and GIS files, where the data can&#8217;t easily be tied to a single format in the first place (nor can a stable format for migration be identified). As the NLA put it on their <a href="http://www.nla.gov.au/padi/topics/432.html">PADI page</a>, &#8220;At present there is little dealing specifically or comprehensively with the preservation of this particular type of data, although some aspects of database preservation are applicable to GIS. Some long term preservation issues include a lack of open source formats and metadata standards, large data volume and complex data objects.&#8221; Puttick suggests that his data doesn&#8217;t really perform at all unless it&#8217;s operated within a very specific environment of hardware and software. How do we preserve an environment? This appears to be quite a distinct preservation problem and much harder to solve than Word to PDF, to put it mildly.</p>
<p>William Kilbride suggested that such cases (and websites too, arguably, because they are time-based) are more like a <em>stream</em> of data &#8211; a handy image which conveys something about the dynamic of such information packages, and shows us that it&#8217;s much harder to nail them down into a single format. You can never step into the same river twice.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2009/12/03/ffods/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>DCC discussions on image formats</title>
		<link>http://dablog.ulcc.ac.uk/2008/07/04/dcc-discussions-on-image-formats/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/07/04/dcc-discussions-on-image-formats/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 23:14:12 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digitisation]]></category>
		<category><![CDATA[DART]]></category>
		<category><![CDATA[DCC]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[images]]></category>
		<category><![CDATA[JPEG]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[RAW]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[TIFF]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/07/04/dcc-discussions-on-image-formats/</guid>
		<description><![CDATA[Rich pickings in a couple of fascinating posts on the DCC Digital Curation Blog, in which Chris Rusbridge summarises recent discussions on the DCC-Associates email list about appropriate photo image file formats for preservation, specifically TIFF, RAW and JPEG 2000. A sibling post also discusses the merits of RAW versus TIFF from the perspective of [...]]]></description>
			<content:encoded><![CDATA[<p>Rich pickings in a couple of fascinating posts on the <a href="http://digitalcuration.blogspot.com/2008/07/responses-to-raw-versus-tiff.html" title="Responses to RAW versus TIFF on DCC Blog" target="_blank">DCC Digital Curation Blog</a>, in which Chris Rusbridge summarises recent discussions <span class="postbody">on the DCC-Associates email list </span>about appropriate photo image file formats for preservation, specifically TIFF, RAW and JPEG 2000. A <a href="http://digitalcuration.blogspot.com/2008/07/responses-to-raw-versus-tiff-image.html" title="Responses to RAW versus TIFF image on DCC Blog" target="_blank">sibling post</a> also discusses the merits of RAW versus TIFF from the perspective of different users and uses.</p>
<p>The proprietary nature of RAW formats (an emerging OpenRAW standard notwithstanding) and the relative newness on the block of JPEG 2000 would both tend to bolster the longstanding preference for TIFF, but as Chris&#8217;s posts make clear, each preservation project should nevertheless weigh the options based on its own requirements and resources.</p>
<p>If in doubt, the &#8220;keep everything&#8221; approach is attractive, as ever, but &#8211; in spite of the old mantras about ever-cheaper filestore &#8211; the implications for storage space and management are potentially very costly once one enters the world of Terabytes and Petabytes. <span id="more-134"></span>In one example, Sean Martin of BL concludes that</p>
<blockquote><p>probably only a small amount of additional value is created for the additional expense approaching £200K. This leads to the question &#8220;if we had £200K on what would we spend it?&#8221; and probably the answer is &#8220;not in this way&#8221;.</p></blockquote>
<p>The posts contain many interesting examples of costings, based on the experience of BL, SDSC and others. I won&#8217;t even attempt to summarise (Chris&#8217;s summary of) them here, but I hope that this post will still be available for me to consult next time I have to dabble in the murky science of DP costings.</p>
<p>In some instances, the case for preserving the RAW image may nevertheless be compelling. One can imagine that the risk of missing a new planet or virus (or identifying a non-existent one), or other potential infelicities in scientific and medical imaging, is not worth contemplating. By contrast, it&#8217;s hard to see what value might be added to a collection like the <a href="http://dablog.ulcc.ac.uk/2007/11/02/7/" title="Launch of Linnean Online">Linnean Society&#8217;s</a> by keeping raw camera data alongside the TIFFs, and more than doubling the amount of storage capacity required.</p>
<p>It&#8217;s for making decisions like this that the intellectual exercise of identifying <a href="http://dablog.ulcc.ac.uk/2008/04/08/significant-properties/">significant properties</a> <em>and</em> the needs of all stakeholders &#8211; creators, curators and users &#8211;  seems particularly essential.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/07/04/dcc-discussions-on-image-formats/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>KB file format guides</title>
		<link>http://dablog.ulcc.ac.uk/2008/03/20/kb-reading-list-file-formats/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/03/20/kb-reading-list-file-formats/#comments</comments>
		<pubDate>Thu, 20 Mar 2008 17:00:28 +0000</pubDate>
		<dc:creator>Richard M. Davis</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[DCC]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[KB]]></category>
		<category><![CDATA[Technical]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/03/20/kb-reading-list-file-formats/</guid>
		<description><![CDATA[I was pleased the DCC news feed alerted me to a new publication by the Koninklijke Bibliotheek, Alternative File Formats for Storing Master Images of Digitisation Projects. Particularly as on the KB publications page are also links to two other recent reports Evaluating File Formats for Long-term Preservation and Recommendations for the creation of PDF [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased the <a href="http://www.dcc.ac.uk/news/" title="DCC" target="_blank">DCC news</a> feed alerted me to a new publication by the Koninklijke Bibliotheek, <em><a href="http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/Alternative%20File%20Formats%20for%20Storing%20Masters%202%201.pdf" title="KB report on alternative file formats " target="_blank">Alternative File Formats for Storing Master Images of Digitisation Projects</a>. </em>I was particularly pleased because the KB publications page also links to two other recent reports, <a href="http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf">Evaluating File Formats for Long-term Preservation</a> and <a href="http://www.kb.nl/hrd/dd/dd_links_en_publicaties/PDF_Guidelines.pdf" target="_blank">Recommendations for the creation of PDF files for long-term preservation and access</a>. These look like excellent resources for wrestling with the gnarly issue of file formats. It can be hard to find one&#8217;s way among the many reports and papers circulating on all aspects of DP &#8211; as I found out working on the <a href="http://spelos.ulcc.ac.uk/" title="SPeLOs" target="_blank">SPeLOs</a> report, at the busy intersection of E-learning and Digital Preservation. For file formats issues, at least, these very recent KB publications look like a good place to start. I only hope I get time soon to read them thoroughly.</p>
<p>It occurred to me that the blog is the perfect place to invite any passing readers to suggest other key readings, so if you know of any other recent, authoritative work in file formats, please leave a link in the comments box below.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/03/20/kb-reading-list-file-formats/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The MS Office 2003 format debacle</title>
		<link>http://dablog.ulcc.ac.uk/2008/01/11/the-ms-office-2003-format-debacle/</link>
		<comments>http://dablog.ulcc.ac.uk/2008/01/11/the-ms-office-2003-format-debacle/#comments</comments>
		<pubDate>Fri, 11 Jan 2008 17:45:13 +0000</pubDate>
		<dc:creator>Kevin Ashley</dc:creator>
				<category><![CDATA[All]]></category>
		<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[Technical]]></category>

		<guid isPermaLink="false">http://dablog.ulcc.ac.uk/2008/01/11/the-ms-office-2003-format-debacle/</guid>
		<description><![CDATA[The consternation that has been caused by Office 2003 Service Pack 3 has provoked a great deal of discussion amongst the electronic records community about the vulnerability of proprietary file formats. (It&#8217;s provoked a lot of other discussion as well, much of it not complimentary to Microsoft.) If you&#8217;re not familiar with the problem, it [...]]]></description>
			<content:encoded><![CDATA[<p>The consternation that has been caused by <a href="http://support.microsoft.com/kb/923618">Office 2003 Service Pack 3</a> has provoked a great deal of discussion amongst the electronic records community about the vulnerability of proprietary file formats. (It&#8217;s provoked a lot of other discussion as well, much of it not complimentary to Microsoft.) If you&#8217;re not familiar with the problem, it boils down to this: the latest major update to Office 2003 silently disables the ability to open many files that Office would previously handle. They include formats such as Corel Draw, but worryingly for many people also include older MS Office formats, such as Excel 97 and Word 97.</p>
<p>I say &#8216;silently&#8217; because the information about the effect that this update will have is extremely well hidden. <span id="more-41"></span> Many individuals, and many large organisations, will do as MS recommends and allow their systems to download and install updates like this automatically. They&#8217;ll be caught unawares. Others may be more like me and read through the release information to see what the update is going to do beforehand. They would be caught unawares too. The information linked to at the front of this post tells you that this is a major security update for Office. It says nothing about losing access to old files. It links to another page which goes into <a href="http://support.microsoft.com/kb/923618">greater detail</a> about what the update contains. That doesn&#8217;t mention the issue either, except that under the &#8220;known issues&#8221; section, it says:</p>
<blockquote><p> After you install Office 2003 SP3, you may receive an error message when you try to open or to save a file. For more information about this issue, click the following article number to view the article in the Microsoft Knowledge Base:</p></blockquote>
<p>And even that article doesn&#8217;t really list things in detail, although it does mention that older office formats are amongst those affected. The recommended solutions are not ideal either.</p>
<p>Microsoft has belatedly realised that it was a bad idea and released a patch to circumvent the problem. But you need to explicitly request that and many people won&#8217;t realise they need it for some time.</p>
<p>There&#8217;s been a good deal of <a href="http://listserv.albany.edu:8080/cgi-bin/wa?A1=ind08&amp;L=erecs-l#10"> discussion on the e-recs list </a> about this, including this remark from Marc Fresko:</p>
<blockquote><p> Personally, I think this is one of the more &#8220;visible&#8221; wake-up calls that<br />
should contribute to everyone realising that long-term access to<br />
electronic records, even over less than ten years, cannot be ignored.<br />
Active management is needed.  Anything that helps to get this message<br />
across is welcome, so long as it is (as is the case here) practically<br />
harmless.</p></blockquote>
<p>I agree with Marc that it&#8217;s a useful wake-up call. Consider that this could cause problems even to those organisations that had taken explicit steps to ensure continued access:</p>
<ul>
<li>You checked when moving to Office 2003 that you could still access your older documents using it.</li>
<li>You checked the documentation with each update to ensure that it wasn&#8217;t going to invalidate your earlier checks.</li>
</ul>
<p>With those precautions, you would <strong>still</strong> be caught out by a silent update that you <strong>cannot uninstall.</strong> Only if yours is the sort of organisation that explicitly tests all updates in a safe environment, and specifically includes a check for access to legacy material as part of those tests, would you be alerted to the problem in time.</p>
<p>Microsoft aren&#8217;t the only software providers to build in automated update features to their products &#8211; the practice is common with both proprietary and open-source products. Incidents like this <strong>will</strong> occur again. It reinforces the belief that one needs to have good control over the application(s) needed to access preserved content as well as the formats used to preserve it, whether one is looking at 5-year retention or 500-year retention.</p>
]]></content:encoded>
			<wfw:commentRss>http://dablog.ulcc.ac.uk/2008/01/11/the-ms-office-2003-format-debacle/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

