Reg Readings and Mark Holland describe constructing the digital archive for 200 years of The Times.

This article is from the July 2003 issue of Update.

The importance of a contemporary record has become a given since Boswell recorded Johnson’s insistence on scholarly rigour. Newspapers flourished in the 18th century, being read by an increasingly literate population. Ever since, they have played the role of recording the daily events and circumstances that concerned their readerships. Their back issues are bread and butter to the political, social and economic historian.

Today’s younger generation of readers and researchers increasingly assume that electronic access to the sum of the world’s knowledge will be available — and if it isn’t, it is too tiresome to retrieve, or it probably doesn’t exist at all. To many it comes as a surprise that it is only since the mid-1980s that newspapers have been composed electronically.

Cumbersome access
Until recent years, access to historic newspapers has been a cumbersome exercise, requiring the mining of a frequently idiosyncratic and selective index (where available at all) alongside the bulky and often fragile original paper edition, or its microfilm surrogate. Happily for readers and librarians, the last decade has seen technological advances inspiring initiatives to digitise older newspaper and journal runs in every continent. Digital newspaper libraries are what researchers want, promising as they do the saving of hundreds of hours in the library and the classroom.

It is user expectations which have fuelled the numerous digitisation projects now active or in preparation. Those in production are found in varying degrees of sophistication and have had equally varying degrees of success. While some projects have flowed smoothly and effectively to their conclusion, some publicly funded ones have had a short or interrupted life; others have converted quite long runs of material, but have not made it easy for the user to find useful information. [1] The most complete runs of English-language international newspaper titles have largely been entrusted to the commercial sector, for delivery to researchers and the public via a library subscription or digital archive acquisition. [2] Recently, OCLC’s Preservation Resources has announced a newspaper digitising service to its research community, powered by the same software used in the British Library’s pilot Online Newspaper Archive. [3]

Times/Gale agreement
Gale, under an agreement with The Times, has planned and implemented a digital edition covering 200 years of the oldest continuously published daily newspaper in the English language. Over this period The Times has been a witness to the world’s changing ideologies, its social and political constructs, its conflicts and creative endeavours. In the course of its life the newspaper has become an institution in its own right, on occasion reflecting the views of Britain’s leaders and on others urging change with such confident authority that by the mid-19th century it had acquired the sobriquet of ‘The Thunderer’.

Its coverage of the industrial and political revolutions of that century, of the British colonial experience, of the expansion and ascendancy of the US in the West, and of the conflicts of the 20th century, has long made it an essential primary source for research. Such was its reputation during the American Civil War that Lincoln reportedly acknowledged that if he wanted to know what was happening across the Potomac, he read Russell in The Times.

Broad range of users
Times editorials, letters, feature articles, and the law and parliamentary reports and obituaries are key witnesses to governing opinion and movements for reform. Yet before we began to design the digital historical edition, discussions with librarians and faculty members revealed the need to open out the paper beyond these well-indexed, if tedious to recover, materials, creating a resource for a broader range of users. We had to serve very diverse and often discrete interests, for example:

  • those looking at the effect of change over decades — e.g. in gender roles in public and private life, at the adaptation to change within institutions and to the explosion of new knowledge and perceptions in the sciences and industry;
  • students of such disciplines as art and design, the media, and visual culture, including fashion and advertising history;
  • pupils working on projects such as ‘The Victorians’ and ‘Women at War’;
  • local historians, and the general public in their present or future communities;
  • family historians — by disclosing not only the births, deaths and marriages of those whose families were prepared to pay for classified advertisements, but also the movements and achievements of a much broader population through access to notices of appointments and honours, and career advancement in the professions and trades.

How could we provide access from these differing perspectives to nearly 1,000,000 pages containing about 10 million articles? The project required some clear goals: we had to get it right, as such a task is unlikely to be repeated in the near future. This pre-eminent national resource had to be accessible in digital form in as friendly, effective and accurate a format as possible.

Determining principles
So, with a sense of trusteeship of the national ‘paper of record’, we settled — with much advice — on the following determining principles:

  • make the complete newspaper, including display and classified advertising, easily retrievable;
  • ensure the most accurate and comprehensive search results possible;
  • add a subject category to the metadata of every article, so that results could be limited to particular areas of enquiry such as law reports, letters and birth announcements;
  • make use as attractive and as easy as possible (we expanded the functionality of Gale’s well-known Infotrac platform);
  • keep the delivery time of results to a minimum;
  • indicate the context of publication: alongside every search result would appear a full-page thumbnail highlighted to show the position of an article or advertisement and the relationship between editorial and advertising content;
  • offer seamless navigation from an individual article to its related full page;
  • in the browse function, hotlink all article headlines from a page of the paper alongside a large thumbnail page image, allowing users to see the article’s position on a page.

Ensuring the best microfilm base
It was clear by the mid-1990s that advances in computer software, falling storage costs and growing web bandwidth would soon make it possible to ‘de-construct’ a complex newspaper page digitally and to deliver searchable facsimile articles in context, and full pages, over the web.

The Times was among the earliest serials to be microfilmed for use alongside the Times Index. However, the poor quality of the original imaging led to a complete re-filming from original paper copies in the early 1970s. It is from the masters created at that time and from subsequent continuations that Gale’s Primary Source Microfilm imprint has offered copies of the paper to the library community over the last 30 years.

Using our own film duplication plant we identified the best medium-to-low-contrast duplicating microfilm to get the best possible image for our film scanning. This is important: the specification of microfilm for reader use in libraries does not necessarily give the best results for film scanning. Every reel is tested, and the duplicating process parameters adjusted, to ensure the best microfilm base: this overcomes shortcomings in the quality of the original material and compensates for changes in filming techniques and standards over the last 30 years.

After testing several machines we decided to use Mekel M500 greyscale scanners. Our own algorithms, combined with the controls available on the Mekel, provide results superior to those available with other makes, at least for newspaper work. In the model we adopted, images are captured in greyscale and converted to 300 dpi bi-tonal TIFFs at a rate of six frames a minute. Allowing for set-up, each operator converting Times microfilm to specification can process between three and five reels a day. This can drop to as low as two reels a day if the film quality is inconsistent and frequent operator intervention is required.

The digital stream from the scanner’s 6,000 pixel charge-coupled device (CCD) passes to an image processor where algorithmic filters are applied for gamma correction (light balance), noise removal and edge enhancement. By experimenting with these filters and CCD exposure control, high-contrast images are produced capable of giving good OCR — typically 95 per cent or better for most editorial content.
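
To illustrate the kind of filtering described above, here is a minimal Python sketch. It is not the scanner's own software: the function names, the 3x3 median filter used for noise removal, the fixed threshold of 128 and the gamma value are all assumptions chosen for illustration, and edge enhancement is left out for brevity.

```python
from statistics import median

def gamma_correct(img, gamma):
    """Apply gamma correction to 8-bit greyscale pixels (gamma < 1 lightens)."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]

def median_filter(img):
    """Remove isolated specks: replace each pixel with the median of its
    3x3 neighbourhood (the window is smaller at the image borders)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(int(median(window)))
        out.append(row)
    return out

def to_bitonal(img, threshold=128):
    """Convert greyscale to bi-tonal: 1 = ink (dark), 0 = paper (light)."""
    return [[1 if p < threshold else 0 for p in row] for row in img]

# A tiny 3x3 'page': light paper with a single dark speck in the middle.
page = [[200, 200, 200],
        [200,   0, 200],
        [200, 200, 200]]
clean = to_bitonal(median_filter(gamma_correct(page, gamma=0.8)))
```

Without the median filter the speck would survive thresholding as a stray black pixel, which is the sort of noise that can mislead an OCR engine.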

Image post-processing
On completion of a year’s filming the output images are renamed and divided into monthly subdirectories. The images are then moved to other machines for cropping and de-skewing.

Most cropping programs require images to be of a regular size with constant black edges. However, The Times was microfilmed by our predecessors from a variety of sources including bound volumes, with a significant run of originals being between 100 and 200 years old. Folding, tears, adhesive tape and bad trimming tend to be reflected in the borders of the page. We had to create parameter-driven programs capable of removing the worst excesses without cropping the newspaper’s text.

Fig. 1 Keyword Search screen for the Times Digital Archive: the screen appears immediately, is uncluttered, and offers fast results.

If pages were not filmed flat or straight the final scanned image can appear skewed and sometimes trapezoidal in shape. Because we electronically clip the page into articles it is important that the page is corrected vertically: without this the clipped article image would start to stray into adjacent columns. Although this often leaves us with images skewed in the horizontal axis, most OCR engines can deal with up to five degrees of skew.

Our own tools identify and remove skew and report images that are too badly skewed to process to acceptable standards. These images are manually corrected and fed back into the system. Pages that were badly cropped prior to microfilming are manually cleaned and the page title and number electronically inserted where necessary.

These processes not only make the final image look more attractive but, more importantly, they improve the OCR of the text and reduce the overall file size for delivery over the web.

As a final test a check is made to ensure that pages are all present, correctly numbered, sequenced and dated. Where they are missing or of unsatisfactory quality despite our best efforts, individual copies are bought in the second-hand market, filmed and integrated. The images are then output to DLT for transfer to our clipping vendor.
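
A completeness check of this kind can be sketched in a few lines of Python. The record layout (an issue date plus a page number for each scanned image, in scan order) and the function name are hypothetical, and the sketch assumes that the pages of an issue are numbered from 1 without gaps.

```python
from collections import defaultdict
from datetime import date

def find_page_problems(records):
    """records: list of (issue_date, page_number) in scan order.
    Returns (missing, out_of_order), each a list of (issue_date, page_number)."""
    pages_by_issue = defaultdict(list)
    for issue_date, page in records:
        pages_by_issue[issue_date].append(page)

    missing, out_of_order = [], []
    for issue_date, pages in sorted(pages_by_issue.items()):
        # Assumption: pages run 1..N with no gaps, N being the highest page seen.
        expected = set(range(1, max(pages) + 1))
        missing += [(issue_date, p) for p in sorted(expected - set(pages))]
        # A page is out of sequence if it is not greater than its predecessor.
        out_of_order += [(issue_date, p)
                         for prev, p in zip(pages, pages[1:]) if p <= prev]
    return missing, out_of_order

issue = date(1951, 1, 9)
scanned = [(issue, 1), (issue, 2), (issue, 4), (issue, 3), (issue, 6)]
missing, out_of_order = find_page_problems(scanned)
```

A check like this cannot detect a missing final page, since it infers the length of an issue from the highest page number scanned; that would need an independent page count for each issue.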

Fig. 2 Sample Results screen. Showing the context in which an article was originally published (the article is highlighted in the thumbnail to the left) came high on the 'wish list' of project advisers.

Clipping
At our contractor’s facility in Hyderabad, the pages are passed to operators who decide how each page is to be marked up for clipping. This is done twice to ensure that the whole page is being captured and that each clip is being defined in the same way.

Every page is clipped into articles or segments: editorial content is nearly always clipped into individual articles. The individual clips are OCR’d, and fielded metadata is added, as is the article category already mentioned.

Clipping the page into articles generates co-ordinates or positional information for each article on the page and the OCR process provides the co-ordinates of every word within each clip or article. In this way articles are extracted ‘on the fly’ from full-page images and the selected search terms are highlighted on the delivered image.

This approach overcomes what would otherwise be the need to store both full pages and clips on the image servers.
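
The idea of cutting articles from the full-page image on demand, and highlighting search terms within them, can be illustrated with a short Python sketch. The Box class, the function names and the sample data are hypothetical, and the sketch assumes that word co-ordinates are stored in full-page pixels; if they were stored relative to the clip, the translation step would be unnecessary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, other):
        """True if the other box lies entirely inside this one."""
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)

    def relative_to(self, origin):
        """Re-express this box relative to the top-left corner of origin."""
        return Box(self.left - origin.left, self.top - origin.top,
                   self.right - origin.left, self.bottom - origin.top)

def crop(page, box):
    """Cut an article out of a full-page image held as a list of pixel rows."""
    return [row[box.left:box.right] for row in page[box.top:box.bottom]]

def highlight_boxes(article_box, words, term):
    """Return clip-relative boxes for every OCR word matching the search term."""
    term = term.lower()
    return [word_box.relative_to(article_box)
            for text, word_box in words
            if text.lower() == term and article_box.contains(word_box)]

# A toy 4x6 'page' whose pixel values encode their own row and column.
page = [[10 * r + c for c in range(6)] for r in range(4)]
article = Box(left=2, top=1, right=5, bottom=3)
words = [("Microfilm", Box(3, 1, 5, 2)),   # inside the article
         ("index", Box(2, 2, 4, 3)),       # inside, but not the search term
         ("microfilm", Box(0, 0, 2, 1))]   # matches, but in another article
clip = crop(page, article)
hits = highlight_boxes(article, words, "microfilm")
```

Only the full-page image and the two sets of co-ordinates need to be kept: the clip and its highlights are derived at delivery time.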

Images, clipped images and XML containing the OCR’d text and metadata are returned to the UK for checking before being forwarded to our US site for indexing and uploading to our Infotrac servers.

Fig. 3 The article delivered is the third of the three results shown in Fig. 2. The search term is highlighted in colour in the metadata and the body of an article - especially useful in longer articles.

Searching the archive
The design of the delivery platform involved many specialised staff working together over many months. Considerable effort went into the design of the Infotrac screens to enable easy searching and retrieval by both novice and experienced users.

As currently configured the Times Digital Archive can be searched in four ways.

Relevance Search
A Relevance Search is a simple yet effective way to search for articles. It looks for words and word variants, alone and in combination.

Relevance Search is most effective when two or more search terms are entered. Each term is analysed for its frequency of use in each article and within all articles. Articles are assigned a higher relevancy score when they contain terms more often, or when they contain terms that are found in relatively few other articles.
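
The scoring principle can be illustrated with a simple term-weighting sketch in Python. This is not the algorithm used by the search platform, whose details are not given here: the formula (occurrences multiplied by a logarithmic rarity weight), the function name and the sample articles are assumptions, and the handling of word variants is omitted.

```python
import math
from collections import Counter

def relevance_scores(articles, query):
    """articles: dict of id -> text. Returns a list of (id, score), best first."""
    tokenised = {aid: text.lower().split() for aid, text in articles.items()}
    n = len(tokenised)
    terms = query.lower().split()
    # Document frequency: the number of articles containing each query term.
    df = {t: sum(1 for words in tokenised.values() if t in words) for t in terms}
    scores = []
    for aid, words in tokenised.items():
        counts = Counter(words)
        score = 0.0
        for t in terms:
            if counts[t]:
                # More occurrences raise the score; rarer terms weigh more.
                score += counts[t] * math.log(1 + n / df[t])
        if score > 0:
            scores.append((aid, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)

articles = {
    "a": "microfilm scanner scans microfilm",
    "b": "the scanner failed",
    "c": "weather report",
}
ranked = relevance_scores(articles, "microfilm scanner")
```

Article "a" ranks first because it contains both terms, one of them twice, and because ‘microfilm’ occurs in fewer articles than ‘scanner’.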

Fig. 4 Newspaper articles are categorised manually into six broad categories, each of which is subdivided to allow narrowing of a search across nearly 10 million individual articles.

Keyword Search
A Keyword Search lets the user match words in the metadata or in the article text. The screen capture (Fig. 1) from the first segment released shows how a basic search for the word ‘Microfilm’ in the entire database can be made.

A successful search results in a list of citations for matching articles. Citations are brief references to articles, normally displayed from oldest to most recent: with relevance searches the most relevant are displayed first.

If the search returns results in more than one category of newspaper article, a list of all article categories that contain results appears on the first page of citations. A researcher interested only in a particular category may click on it to reduce the results to just that category.

Each citation contains information designed to let the user decide if they want to view or retrieve the article itself. For relevance searches, you’ll also see the relevance rank and a list of the exact words that matched the article.

On the left side of the citation is a thumbnail image of the page on which the article is to be found — or begins. The portion of the page containing the article is outlined, to provide an idea of the size of the article clip and its location on the page. An icon appears below the thumbnail where an article continues on another page — a feature more common with US newspapers than with those from the UK. When the thumbnail image is clicked the article facsimile image is delivered to the screen.

The title of each citation links to the content of the full record. Often there are multiple formats available; these are listed as links just below the citation. Clicking on the first link is the same as clicking on the title. The rest of the links show other formats available, either electronically (such as a PDF) or at the user’s institution (library holdings).

To the left of each citation is the mark box. Click on the box to mark (set aside) the citation for later action. By selecting the ‘View mark list’ link in the left-hand column at any time the user can view all the articles tagged to date.

Fig. 5 Page 2 of The Times of 9 January 1951. In the 'Browse by Date' search, mousing over a headline on the right-hand side outlines the related article in the large thumbnail image. Clicking on either the headline or within the outlined area will deliver the article selected.

Advanced Search
Advanced Search presents a framework for building as simple or as complex a search expression as required. The user can search for up to three terms (each consisting of one or more words) from as many as three different indexes, and can link the search terms with any of three logical operators.
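
A search expression of this shape can be modelled in a few lines of Python. The sketch is illustrative only: the operators are assumed to be AND, OR and NOT, the index names (title, text, category) and sample records are hypothetical, and clauses are combined strictly from left to right, which may differ from the platform's own precedence rules.

```python
def search_index(articles, index, term):
    """Return the ids of articles whose chosen field contains the term."""
    term = term.lower()
    return {aid for aid, fields in articles.items()
            if term in fields.get(index, "").lower()}

def advanced_search(articles, clauses):
    """clauses: [(operator, index, term), ...]; the first operator is ignored.
    Clauses are combined strictly left to right."""
    if not 1 <= len(clauses) <= 3:
        raise ValueError("expected between one and three search terms")
    _, index, term = clauses[0]
    result = search_index(articles, index, term)
    for operator, index, term in clauses[1:]:
        hits = search_index(articles, index, term)
        if operator == "AND":
            result &= hits
        elif operator == "OR":
            result |= hits
        elif operator == "NOT":
            result -= hits
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result

articles = {
    1: {"title": "Microfilm for libraries", "text": "Scanning newspapers",
        "category": "News"},
    2: {"title": "Letters", "text": "microfilm readers are tiresome",
        "category": "Letters"},
    3: {"title": "Microfilm sale", "text": "readers for sale",
        "category": "Classified Advertising"},
}
found = advanced_search(articles, [
    (None, "title", "microfilm"),
    ("OR", "text", "microfilm"),
    ("NOT", "category", "advertising"),
])
```

Here the third clause removes the classified advertisement from the results, leaving the news article and the letter.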

The screen shot (Fig. 4) illustrates the use made of categories to subdivide the newspaper’s contents. For reasons of timeliness and cost no attempt is made to assign subject headings based on individual article content.

Browse by Date
With a drop-down calendar, Browse by Date permits the selection of an individual issue of The Times, and browsing through that issue page by page. Pages are initially presented as large thumbnails, with the titles of articles on the selected page displayed on the right-hand side of each image (Fig. 5).

The browse pages provide the electronic equivalent of leafing through the newspaper. This screenshot shows one of the four small images available in a typical browse view.

To the right of each small newspaper page is displayed a list of the articles on that page. As you point to a title, the article is outlined on the page image, so that you can see the size and placement of the article. If you click on the article title, the article is downloaded for reading.

If you point to portions of the page image, a pop-up appears telling you the title of the article at that spot of the page. You can move the cursor around the page image to find an article. If you click on a spot of the page image, the article there is displayed so that you can read it.
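
Finding the article under the cursor is a simple hit test against the article outlines. The following Python sketch assumes each outline is a rectangle given as (left, top, right, bottom) in page-image pixels; the function name, titles and co-ordinates are invented for illustration.

```python
def article_at(point, articles):
    """Return the title of the first article whose outline contains the point,
    or None if the point falls outside every outline.
    articles: list of (title, (left, top, right, bottom)) in page-image pixels."""
    x, y = point
    for title, (left, top, right, bottom) in articles:
        if left <= x < right and top <= y < bottom:
            return title
    return None

outlines = [("Court Circular", (0, 0, 100, 300)),
            ("Weather", (100, 0, 200, 150))]
under_cursor = article_at((120, 40), outlines)
```

The same lookup serves both the pop-up title and the click that delivers the article.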

Above each page image there are a number of links. The View Page link displays that full page in the browser: three further options allow retrieval of the page in PDF format for printing.

While there is much content yet to be added to the database — the project will take until towards the end of this year to be completed — more than 100 years are already available. Early feedback from users and librarians has been enthusiastic. Among the points made so far are that the interface is particularly easy to navigate, searching is fast and flexible, and that interrogation and navigation of the large full-page images is particularly innovative.

Accuracy of results
We have received some helpful criticism also. One team of reviewers has been particularly concerned about the accuracy of some search results, and it is worthwhile pointing here to how they may occur and how they can be avoided.

A search across the complete database on the word ‘beatles’, for example, can return ‘beagles’ or, more oddly, ‘beating’ or ‘hunting’. Despite all the image clean-up, some newspaper text is set in a font so small that it is hard enough to read even in the original: the impression on the printed page may be poor and the paper used may encourage ink to spread and make letters ‘fuzzy’ to the search engine. (A poor ‘b’ can be read as an ‘h’, and a broken ‘l’ may be read as an ‘i’.)

Users coming to the database may assume that the searchable text of the Times Digital Archive has been keyboarded in the same way as a current newspaper or journal. It is as well to point out that the text here is created not from modern electronic keying but from OCR of scanned pages of the newspaper printed over 200 years. The OCR voting engine has to make its best ‘stab’ at a word, and it will not always be correct.

Eliminating errors
Happily there are various ways to limit the occasions when inaccurate or unwanted results are delivered.

  • If you want to search for The Beatles, key ‘The Beatles’. This will omit the word ‘beatles’ on its own, and cut out most mentions of the insects in agricultural articles and Nature Notes!
  • Include advertising in your search only if you specifically want it. If you do want display advertising, include that, but consider leaving out classified adverts: most are set in very small type, and the results here will inevitably be less accurate than elsewhere.
  • Limit the date range of your search. Staying with The Beatles, users should at least know that the group was a post-Second World War phenomenon. Limit a search on the phrase to post-1945, and reports of the meets of pre-war beagle packs will not be included.
  • Choose a section or those sections of The Times you wish to search in. If you want editorial/news coverage of The Beatles, and don’t want to see every concert or broadcast listing (you will have the same concerns here as with classified advertising), limit your search to these categories alone. If you want to include feature articles on the band, search only the arts/entertainment and reviews areas, so that you do not cover sport or the weather.

While you can get round some existing irritants, others will need attention when the whole content of the database has been loaded later this year. Then we’ll be introducing an element of ‘fuzzy searching’ so that, for example, a search can be made on ‘Hitler’ which will encompass the term ‘Hittler’, as The Times spelled the name until the mid-1920s. There are some minor navigational points to address also.
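
One common way to implement fuzzy matching of this sort is edit distance: the number of single-character changes needed to turn one word into another. The Python sketch below is an illustration of that general technique, not a description of the feature planned for the archive; the function names, the vocabulary and the tolerance of one character are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

def fuzzy_matches(term, vocabulary, max_distance=1):
    """Return the vocabulary words within max_distance edits of the term."""
    term = term.lower()
    return sorted(w for w in vocabulary
                  if edit_distance(term, w.lower()) <= max_distance)

variants = fuzzy_matches("Hitler", ["Hitler", "Hittler", "Hilter", "Butler"])
```

The tolerance has to be chosen with care: ‘beatles’ and ‘beagles’ are also only one character apart, so a looser match finds more variant spellings at the cost of more unwanted results.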

Finally, we would like to emphasise that we encourage all comments which will make working with the Times Digital Archive an even more rewarding pathway to the history of Britain and the world — from the French Revolution to the age of Mrs Thatcher.

References
1 There are links to many initiatives at http://bcdlib.tc.ca/links-subjects-newspapers.html

2 Gale will complete a 200-year run of The Times in the autumn of 2003. In the US, ProQuest has completed The New York Times and is expanding its portfolio of other national titles.

3 www.oclc.org/digitalpreservation/services/digitizing/text/ and www.uk.olivesoftware.com

Reg Readings and Mark Holland are on the staff of Gale/Thomson Learning in the company’s Reading offices. Reg Readings (reg.readings@gale.com) is Production and IT Director, and Editor of The Times Index. Mark Holland (mark.holland@gale.com) is a Vice-President of Gale and the company’s UK publisher.

An earlier version of this article appeared in Microfilm and Imaging Review, Vol. 31, No. 3 (Summer 2002).

For further information, or to set up a trial of the Times Digital Archive, 1785-1985, contact: Gale/Thomson International, High Holborn House, 50-51 Bedford Row, London WC1 4LR (020 7067 2500; online@gale.com; www.gale.com/world).

The Times Digital Archive was launched during Online Information at Olympia, December 2002.

Updated: 11 August 2004