Working with Web Curator Tool (part 2): wikis

March 10th, 2009 Ed Pinsent Posted in JISC, UKWAC 7 Comments »

How to archive a website built with a wiki? It’s worth looking into this, as JISC projects are increasingly using wikis to manage and report on their work; of the available brands, MediaWiki is a popular one.

The challenge for me is how to bring in a good copy of a wiki site without causing Web Curator Tool to gather too many pages from it. We don’t want that, because (a) the finished result occupies unnecessary space in the archive and (b) it takes so long to complete that it can hold up the gather queue in the shared web-archiving service, delaying the work of other UKWAC partners.

I am not technical enough to tell you in great detail what’s causing this, although I sense that it’s something to do with the Heritrix crawler requesting too many pages from the wiki. When you consider that a wiki is database-driven it should not surprise us that it’s creating a lot of its pages on the fly. Secondly, since a wiki is editable by lots of contributors (that’s its core function after all), it presumably means we have numerous past versions of pages also stored somewhere in the wiki labyrinth, and it’s possible that the implacable Heritrix will not cease until it’s faithfully requested and copied every single one of them.

Let’s look at the Repositories Research Team wiki (DigiRep) owned by UKOLN, which I tried to gather five times in 2008. WCT conveniently keeps a history of these attempts, information about which I can still access even if the actual gathered pages have been discarded or archived. The size problems were chronic. Of five 2008 gathers, one was aborted after it had reached a massive 16.87 GB; a second one was rejected at 14.69 GB. I have archived one impression at 5.31 GB, another at 736.26 MB and another at 157.36 MB. Quite large variations there, which was worrying enough in itself.

At first, my workaround was to adjust the Profile Setting in the title to override the maximum number of documents Heritrix can gather. Setting ‘Maximum Documents’ at 10000 worked, but it was not ideal; I suppose all this means is that Heritrix stops when it has collected 10,000 pages, whether we have everything we want or not. (I found that the copies in the archive seemed to render OK, however.)

To get a closer look at what’s going on, I started to browse the Log Files created by WCT (complete records of every single client-server request), which show patterns which I can vaguely understand; when these Log Files are packed with near-identical strings of code I sense that something’s up. For example, a string containing index.php?title=Repositories_Research&action=edit tells us that the crawler is requesting a specific named page from the wiki, together with the edit action on that page. If you multiply that by the number of pages in the wiki, you can see how the problem builds up. (PHP is the scripting language in which MediaWiki is written.)
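The pattern-spotting described above can be roughed out in a few lines of Python. This is only a sketch, not how WCT itself reports anything: it assumes the requested URLs have already been pulled out of the log into a list, and the host wiki.example.org and the function name count_actions are made up for illustration. It simply counts which query parameters (other than the page title) keep turning up.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def count_actions(urls):
    """Count query parameters seen across a list of requested URLs.

    'action' parameters are counted per value (e.g. 'action=edit');
    every other parameter except 'title' is counted by name only.
    """
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        for key, value in parse_qsl(query, keep_blank_values=True):
            if key == "title":
                continue  # the page name itself is not the problem
            label = f"action={value}" if key == "action" else key
            counts[label] += 1
    return counts


# Hypothetical URLs of the kind seen in a wiki gather's log.
sample_urls = [
    "http://wiki.example.org/index.php?title=Repositories_Research&action=edit",
    "http://wiki.example.org/index.php?title=Repositories_Research&action=history",
    "http://wiki.example.org/index.php?title=Main_Page&action=edit",
    "http://wiki.example.org/index.php?title=Main_Page&oldid=1234",
    "http://wiki.example.org/index.php?title=Main_Page",
]

counts = count_actions(sample_urls)
for label, n in counts.most_common():
    print(label, n)
```

Any parameter that appears roughly once per page in the wiki is a candidate for an exclusion filter.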

I follow this up by browsing the actual gathered pages in Web Curator Tool using the Tree View. From here I can click on the ‘View’ button to examine a page which I think to be suspect, and compare it with other suspect pages. Lastly, I go back to the live DigiRep site to confirm in my mind what’s happening when certain links are followed.

All the above gave me just about enough information to experiment with exclusion filters. After a certain amount of trial and error, and working with other MediaWiki sites, I arrived at the following exclusion codes which I can add to the Profile Setting:

.*&oldid.*
.*&diff.*
.*&limit.*
.*&direction.*
.*Recentchanges.*
.*/Special.*
.*\?title=Special.*
.*&action=edit.*
.*&action=history.*
.*&section.*
.*&redlink.*
.*&printable=yes.*
.*&redirect=no.*

These have the effect of telling WCT to exclude certain pages and actions from Heritrix’s harvesting action. The expectation was that I would lose the discussion / edit / history functions of the wiki in the archive copy.
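Before committing a gather to the queue, the filters above can be tried out against a handful of URLs. The following Python sketch does that; it assumes each filter has to match the whole URL (which is why every one begins and ends with .*), and the sample URLs on wiki.example.org and the function name is_excluded are invented for the purpose. Python’s regular expressions are close to, but not identical with, the Java ones Heritrix uses, so treat the result as a rough check only.

```python
import re

# The exclusion codes listed above, copied as-is.
EXCLUDE_FILTERS = [
    r".*&oldid.*",
    r".*&diff.*",
    r".*&limit.*",
    r".*&direction.*",
    r".*Recentchanges.*",
    r".*/Special.*",
    r".*\?title=Special.*",
    r".*&action=edit.*",
    r".*&action=history.*",
    r".*&section.*",
    r".*&redlink.*",
    r".*&printable=yes.*",
    r".*&redirect=no.*",
]
COMPILED = [re.compile(pattern) for pattern in EXCLUDE_FILTERS]


def is_excluded(url):
    """Return True if any filter matches the whole URL (assumed semantics)."""
    return any(pattern.fullmatch(url) for pattern in COMPILED)


# Hypothetical URLs for a MediaWiki site.
sample_urls = [
    "http://wiki.example.org/index.php?title=Repositories_Research",
    "http://wiki.example.org/index.php?title=Repositories_Research&action=edit",
    "http://wiki.example.org/index.php?title=Special:Recentchanges",
    "http://wiki.example.org/index.php?title=Main_Page&printable=yes",
    "http://wiki.example.org/index.php/Main_Page",
]

kept = [u for u in sample_urls if not is_excluded(u)]
dropped = [u for u in sample_urls if is_excluded(u)]
print("kept:", kept)
print("dropped:", dropped)
```

With these samples, only the two plain page views survive; the edit, Special and printable URLs are dropped.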

The title with the above exclusion profile gathered just 63.41 MB and it completed in under ten minutes. I would say that’s an improvement on 16.87 GB. Log Files and the Tree View confirmed the success of this new “slimline” gather. As well as losing the discussion / edit / history functions, we have also eliminated the Toolbox functions, the ‘printable’ views, and the login pages.

This is no great loss at all for our purposes, as scholars who browse the archived copy of DigiRep are not expecting to be able to edit pages, nor join in the discussions, nor browse the history of stored versions of pages. Indeed in a lot of cases, they would require a login to do so. The users simply want to see the results of the DigiRep team’s work.


Working with Web Curator Tool (part 1)

February 25th, 2009 Ed Pinsent Posted in JISC, UKWAC No Comments »

Early Morning Wheatfield by docman. Retrieved from http://www.flickr.com/photos/docman/212526202/

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow database used for programming Heritrix, the crawler which does the harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’. This means it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com for example. Sometimes though it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor, JISC, by occupying their share of the server space with unnecessarily bloated gathers, nor clutter up the shared bandwidth by spending hours gathering pages unnecessarily.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs. This can be a much more effective method. The method is to enter a little bit of code in the ‘Exclude Filters’ box of a title’s profile. The basic principle is using the code .* for exclusions. .*www.aes.org.* will exclude that entire website from the gather. .*/images/.* will exclude any path containing a folder named ‘images’.
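A small Python sketch may help show why the .* wrapping matters. It assumes, as the examples above suggest, that a filter has to match the entire URL rather than just part of it; the test URLs are invented. It also illustrates a side effect worth knowing about: an unescaped dot in a filter matches any character, so a stricter filter would escape the dots in the host name.

```python
import re


def matches_whole_url(pattern, url):
    """True if the pattern matches the URL from start to end."""
    return re.fullmatch(pattern, url) is not None


url = "http://www.aes.org/journal/"

# Wrapped in .* the filter covers anything before and after the host.
with_wildcards = matches_whole_url(r".*www.aes.org.*", url)

# Without the wrapping, the pattern would have to equal the whole URL.
without_wildcards = matches_whole_url(r"www.aes.org", url)

# A path filter: any URL with an 'images' folder somewhere in its path.
images_hit = matches_whole_url(
    r".*/images/.*", "http://example.org/site/images/logo.gif"
)

# An unescaped dot matches any character, so this look-alike host also matches...
loose_hit = matches_whole_url(r".*www.aes.org.*", "http://wwwxaes.org/")
# ...whereas escaping the dots restricts the filter to the literal host name.
strict_hit = matches_whole_url(r".*www\.aes\.org.*", "http://wwwxaes.org/")

print(with_wildcards, without_wildcards, images_hit, loose_hit, strict_hit)
```

In practice the looser form rarely causes trouble, but the escaped form is the safer habit.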

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.
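The rule of thumb in (a) can be sketched in Python. This is not something WCT does for you: it assumes you have somehow obtained a list of (URL, size in bytes) pairs for a gather, and the hosts, sizes and function names below are made up. It totals the bytes gathered from each external host, keeps those at 1 MB or more, and turns each into an exclusion filter of the .*host.* kind.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

ONE_MB = 1024 * 1024


def collateral_hosts(records, target_host, threshold=ONE_MB):
    """Total bytes per external host; keep hosts at or above the threshold."""
    totals = defaultdict(int)
    for url, size in records:
        host = urlsplit(url).hostname
        if host and host != target_host:
            totals[host] += size
    return {host: total for host, total in totals.items() if total >= threshold}


def to_exclude_filter(host):
    """Build a .*host.* filter, escaping the dots in the host name."""
    return ".*" + re.escape(host) + ".*"


# Hypothetical gather: (URL, size in bytes).
records = [
    ("http://wiki.example.org/index.php?title=Main_Page", 5 * ONE_MB),
    ("http://media.example.net/clips/one.avi", 700 * 1024),
    ("http://media.example.net/clips/two.avi", 700 * 1024),
    ("http://www.adobe.com/reader/", 20 * 1024),
]

big_hosts = collateral_hosts(records, target_host="wiki.example.org")
filters = [to_exclude_filter(host) for host in big_hosts]
print(big_hosts)
print(filters)
```

Here the small adobe.com gather falls under the threshold and is left alone, while the host serving the .avi files is flagged for exclusion.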

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler is staying in the target, but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.

I believe I may have had a ‘breakthrough’ of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this in my next post.


The Continuity Girl

July 25th, 2008 Ed Pinsent Posted in JiSC-PoWR, UKWAC No Comments »

Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The Web Continuity Project at TNA is a large-scale and Government-centric project, which includes a “comprehensive archiving of the government web estate by The National Archives”. Its aims are to address both “persistence” and “preservation” in a way that is seamless and robust: in many ways, “continuity” seems a very apposite concept with which to address the particular nature of web resources. It’s all about the issue of sustainable information across government.

At ULCC we’re interested to see if we can align some ‘continuity’ ideas within the context of our PoWR project. Many of the issues facing departmental web and information managers are likely to have analogues in HE and FE institutions, and Web Continuity offers concepts and ways of working that may be worth considering and may be adaptable to a web-archiving programme in a University.

Read the rest of this entry »


Web-archiving: the WCT workflow tool

April 30th, 2008 Ed Pinsent Posted in UKWAC No Comments »

This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?

Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each ‘target’ website, and allows a certain amount of interfacing with Heritrix, the actual engine that gathers the materials. The open-source Heritrix project is being developed by the Internet Archive, whose access software (effectively the ‘Wayback Machine’) may also be deployed in the new public-facing website when it is launched in May 2008.

Read the rest of this entry »


Say Hello, Wave Goodbye

April 1st, 2008 Kevin Ashley Posted in Events, Organisations, Reports, UKWAC 3 Comments »

I spent much of yesterday at an event at KCL celebrating the achievements of the AHDS (and the Methods Network) on its final day of existence, and welcoming the phoenix-like birth of CeRch. The day was informative, entertaining and emotional in equal measure and I am very glad I was there.
The morning’s overview of the Methods work was almost all new to me, and extremely interesting. Read the rest of this entry »
