Searching the full text of decades of Guardian and Observer articles is now possible. Making sure no libellous material was republished was the difficult task of a library team, alongside more straightforward quality assurance work. Katy Heslop explains.

Last November, Guardian News & Media launched its digital archive. It spans every issue of the Guardian printed between 1821 and 1975 and also of the Observer, 1900-75. After the second phase is launched later this year, there will be 20m articles, images and advertisements in a fully searchable, keyworded database.

The archive is the culmination of several years’ work by various Guardian departments, including the library, Newsroom and Syndication. Olive, a company that also created archives for the Sydney Morning Herald and the Scotsman, scanned the full run of both newspapers from microfilm. Each page was segmented so that all articles, adverts and images became separate entities. Each word within these files was then indexed to create a database of keywords. Users can search for articles chronologically or by relevance, and use various wildcards, returning results that contain words that sound similar, have the same meaning or are spelled wrongly. It also allows the user to place more weight on one of several keywords, so that articles containing that word are returned first.
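For readers curious about how keyword indexing and weighted search fit together, here is a minimal sketch in Python. It is not Olive's software, whose internals are not described here; the function names, the sample headlines and the simple additive scoring are all assumptions made for illustration.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?'\"()"

def build_index(articles):
    """Map each lower-cased word to the set of article ids containing it.

    `articles` is a hypothetical dict of {article_id: text}.
    """
    index = defaultdict(set)
    for article_id, text in articles.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(PUNCTUATION)
            if word:
                index[word].add(article_id)
    return index

def search(index, weighted_terms):
    """Rank articles by the summed weight of the keywords they contain.

    `weighted_terms` is {keyword: weight}; a higher weight pushes
    articles containing that keyword towards the top of the results.
    """
    scores = defaultdict(float)
    for term, weight in weighted_terms.items():
        for article_id in index.get(term.lower(), ()):
            scores[article_id] += weight
    # Highest score first; ties broken by article id for a stable order.
    return sorted(scores, key=lambda a: (-scores[a], a))

# Invented sample data, not real archive content.
articles = {
    1: "Churchill speaks on cheese rationing",
    2: "Cheese prices rise",
    3: "Churchill returns to Downing Street",
}
index = build_index(articles)
results = search(index, {"churchill": 2.0, "cheese": 1.0})
```

With 'churchill' weighted more heavily than 'cheese', the article mentioning both comes first, followed by the one mentioning only Churchill, then the one mentioning only cheese.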
 
I work as a librarian in the Guardian’s research department. I was approached 18 months ago to help run Quality Assurance (QA) testing on the archive once the pages were digitised, to ensure that it met the rigorous standards agreed with Olive; and to identify any articles that would leave the paper open to claims of libel if they were republished.

I headed a team of four – Amy Williams, Ruth Craven, Anastacia Regan and Steven Duckworth – who juggled the digital project with their jobs as library trainees. When I took maternity leave at the end of August, Caroline White took over the project, overseeing our new trainee team.

QA testing
The aim of testing was to ensure the archive would operate under all possible search conditions, so that it could be: browsed by issue; searched by non-computer-savvy readers looking for specific articles but with few of the details to hand; and used as a teaching tool in primary and secondary classrooms and by trained researchers in academic libraries seeking coverage of a specific topic across the years.

The issues were split into four roughly equal periods, and each trainee took responsibility for testing a section. They formulated 50 search scenarios, from vague queries of the ‘I remember reading an article some time in the 1960s about fish’ kind, to specific article searches where all details (date, headline, byline) were known. Additionally, the trainees compiled a list of 50 keywords to test, from the general ‘cheese’ to the more precise ‘Virginia Woolf’ and ‘draft-dodgers’. They also browsed issues at random.

With 900,000 pages contained in the first phase (initially intended to cover both papers, 1925-75), it was impossible to test a high proportion within the time allotted. It was decided that a survey of five per cent would suffice; this was later reduced to one per cent as deadlines loomed. This equated to five issues per year to be checked for errors in three fields: readability, searchability and segmentation.
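A sample of five issues per year could be drawn along the following lines. This is only a sketch: the article does not record how the issues were actually chosen, so the random selection, the fixed seed and the function name are assumptions, and every calendar day is treated as a possible issue date, which ignores days on which no paper was printed.

```python
import random
from datetime import date, timedelta

def sample_issue_dates(first_year, last_year, per_year=5, seed=0):
    """Pick `per_year` distinct dates at random from each year in the range."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1)
        days_in_year = (date(year + 1, 1, 1) - start).days
        # Simplification: any calendar day counts as a candidate issue date.
        candidates = [start + timedelta(days=n) for n in range(days_in_year)]
        sample[year] = sorted(rng.sample(candidates, per_year))
    return sample

# The first phase was initially intended to cover 1925-75.
sample = sample_issue_dates(1925, 1975)
total_issues = sum(len(dates) for dates in sample.values())
```

Across the 51 years from 1925 to 1975, five issues a year gives 255 issues to check.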

Each query was run through the archive’s advanced search page, which allows searches by publication, date, article type (article, advert or image) and keyword. First we tested for readability. If the text or image of the selected article(s) was understandable, it passed. Only six search queries failed, all because of the poor quality of the print before it was scanned, so we easily surpassed our target of 80 per cent accuracy.

Next we checked for searchability. A specific search (Thomas Edison’s obituary, for example) passed if the relevant article was returned; a vaguer search (references to Churchill in the 1950s) passed if eight of the first 10 results were relevant. Twenty-two of 408 searches failed, a pass rate of about 95 per cent, comfortably meeting the 80 per cent target. The failures were specific to older issues, and result from the poor quality of the papers when they were originally microfilmed and the smaller size of print used in the earlier years. Some words look like other words and others are unreadable, so the software has had to guess which word to index. Amy, for example, searched the 1920s and 1930s for ‘briefcase’; the archive highlighted words as diverse as ‘brick’, ‘picking’ and ‘Brisbane’, but only two briefcases. Unfortunately there is nothing that can be done about this; it is simply the reality of archiving old newsprint.
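The searchability figures can be checked with a few lines of Python. The sketch below encodes the eight-in-ten rule for vague searches and works out the overall pass rate; the function names and the list-of-booleans representation of a results page are assumptions for illustration, not part of the testing procedure we actually used.

```python
def vague_search_passes(relevance_flags, top_n=10, min_relevant=8):
    """Return True if enough of the first `top_n` results are relevant.

    `relevance_flags` lists, in rank order, whether a tester judged
    each result relevant (True) or not (False).
    """
    return sum(relevance_flags[:top_n]) >= min_relevant

def pass_rate(total_searches, failed_searches):
    """Fraction of searches that passed."""
    return (total_searches - failed_searches) / total_searches

TARGET = 0.80                      # the agreed 80 per cent target
rate = pass_rate(408, 22)          # figures reported in the article
meets_target = rate >= TARGET

# A hypothetical results page: two of the first ten hits are irrelevant.
example_page = [True, True, False, True, True, True, True, False, True, True]
example_passes = vague_search_passes(example_page)
```

Here `rate` comes to roughly 0.946, so the 22 failures leave the archive well above the target.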

Finally, the segmentation was checked. If a paragraph was missing or a section from a second article was attached, the item failed. This was the main problem that surfaced during testing. Olive’s software identifies solid black lines between articles and uses these to segment them. Unfortunately, some of the lines weren’t visible on the microfilm, particularly in the older issues. The software also fell down when it came across unusual page layouts, such as quotes embedded in article text and long subheadings, which became separated from the articles they applied to. Fortunately Olive also supplied a software tool to amend and maintain the archive, so the errors have been corrected manually by the library staff – a continuing task as further errors surface.

Libel issues
The second phase of the library’s task was to track down articles involved in libel cases, particularly where the claimant had been successful. Numerous articles in which details proved untrue or defamatory were later retracted. In some cases, a printed apology was enough to pacify the injured party; in others, substantial damages were paid. The law surrounding digital versions of print is murky but, according to precedent, creating a digital archive is the same as republishing the original material. So it would be illegal to include in the archive any articles which the paper had agreed in court not to print again.

The Guardian and the Observer have been involved in some of the most high-profile libel cases of the last century. One of the most notorious was the suit brought by Jonathan Aitken, former Minister of State for Defence Procurement, in 1995, after the Guardian published allegations concerning his involvement with a Saudi arms dealer. The Guardian was exonerated and Aitken was convicted of perjury, so the original articles remain in the archive.

Some cases involved ordinary members of the public who felt they had been defamed. A personal favourite is an article from April 1950 reporting the death of former fascist leader General Attilio Teruzzi, which stated he had died on the penal island of Procida in Italy, where his wife ran a hotel; he was photographed with his partner and their 16-year-old daughter. A letter from a Mrs Lilliana Teruzzi followed, complaining that in fact she had never been to Procida, she had emigrated to America some years ago, and certainly was not involved in the hotel business. Through lawyers’ letters it transpired that General Teruzzi had divorced his wife, though she did not recognise it, and had since remarried. The final correction stated: ‘We are satisfied that Mme Teruzzi has never been on the island of Procida...and has never opened or had any interest in any hotel there.’ No mention was made of the mistress or illegitimate daughter.

Approaching this task, I assumed it would be relatively easy to locate a list of libel cases brought against the two papers. Surely our former law firms, legal department or the Newsroom archive would have a log of such complaints? Sadly not. The law firms that represented the Guardian until 1997, when the internal legal department was set up, only store case files dating back 16 years. The internal legal department had only documented the decade of cases since it was established. The Newsroom held boxes of files relating to legal cases, but they were piecemeal and few were catalogued.

So we set about finding as many cases as we could, hoping that we wouldn’t miss any important ones. It was reasonable to assume that those who had brought a suit against the newspapers before 1935 would not be in a position to sue us again and, as a case of libel cannot be brought by anyone but the individual in question (families of the deceased cannot claim), we searched only for cases after that date.

Searching boxes of files at the Newsroom, the Guardian and Observer’s historical archive, proved exhausting but fruitful. The archive opened in 2002, but understandably has yet to catalogue much of its vast collection. Fortunately the outgoing archivist pointed us in the direction of some useful material, including former editors’ correspondence, financial records and legal letters.

The library’s own cuttings files were also brought into play. The cuttings collection dates back to the 1950s, and was maintained until 2001. The trainees searched under ‘Law: Reports: Libel, Press Council and Press Complaints Commission’, and found several printed apologies and reports of libel cases in which the Guardian was involved. We also visited the John Rylands Library at Manchester University, which houses part of the Guardian archive. They provided several folders containing letters of complaint from readers and the ensuing legal correspondence. The legal department’s records were raided for recent cases, and the trainees looked for apologies and corrections in the digital archive itself, using keywords like ‘sued’, ‘libel’ and ‘damages’.

In all, more than 1,000 libel complaints were identified and reviewed. Our next task was to remove the offending articles from the archive. It is frustrating when one searches for an article one knows was published, only to come up against a blank results page and no explanation. I have often had to explain patiently to a journalist that, yes, the article was printed but, no, owing to legal problems they cannot have it in electronic format. Olive’s library tool navigates this problem quite neatly; it enables the administrator to blur an article on the page, so users cannot read it but know it exists, saving them hours of fruitless searching.

The main problem that arose with the legal side of our work was how to identify those libel cases the newspapers had won. We want as much original material as possible to remain in the archive; where the papers had won, there was no obligation to refrain from republishing the disputed article. However, some of the cases identified did not have a clear outcome. We found reports of trials in the cuttings files but no report of the judge’s decision, letters of complaint with no final follow-up from the paper’s legal team, and cases where the legal machinations were so confusing that the result remained unclear even though a judgement was reached. After much debate, it was decided that articles from cases whose outcome was unclear would have to be removed from the archive, but that the trainees would work towards finding out what happened and, where appropriate, restore articles to the database.

Later in 2008, the second phase of the archive will be launched, covering the earlier period of the Observer, from the first issue in 1791, and both papers from 1976 to 2003. Though the testing and legal searches will end there, the library will maintain an administrative role over the product, removing any further libellous articles that come to light, correcting badly segmented pages and in the process ensuring its continued success.

Katy Heslop (katy.heslop@guardian.co.uk) is a Librarian in the Guardian Research Department.

Updated: 04 April 2008