4 questions to ask when setting up an in-house digitisation project

A selection of documents being digitised at Kew

Digitisation, the conversion of material into a digital form that can be processed directly by computer, presents a huge opportunity for libraries and archives to reach audiences beyond their four walls and to protect and preserve unique original material through reduced handling.

Here we present examples of two digitisation projects in progress at the Royal Botanic Gardens, Kew to highlight four of the key questions to consider when establishing an in-house digitisation project.

About digitisation at the Royal Botanic Gardens, Kew

Major digitisation of collections held by the Royal Botanic Gardens, Kew was initiated in 2004 as part of the African Plants Initiative, a global collaboration of partner organisations bringing together scholarly resources relating to African plants. 

Kew's African type herbarium specimens (the original specimens on which new species descriptions are based) were digitised with generous funding from the Andrew W. Mellon Foundation, which also funded the digitisation of the African-related Directors' Correspondence, a significant collection held within the archive.

About the Directors' Correspondence Project (DC)

Following the successful digitisation of the African DC and with the continued support of the Mellon Foundation, a team of up to four staff has been digitising, summarising and making available online further parts of the DC collection.  The collection itself comprises incoming correspondence to the Directors and senior staff of Kew from 1841 to 1928, as well as correspondence received by Kew's first official Director, Sir William Jackson Hooker (1785-1865), as Professor of Botany at the University of Glasgow. 

The correspondents comprise a Who's Who of nineteenth century society and include scientists, politicians, explorers, gardeners, and members of the general public. The letter content varies enormously from the anecdotal to the ethnological, from traditional plant remedies to the establishment of botanical gardens across the globe. To date over 29,000 letters have been made available online, comprising correspondence from Latin America, North America, Africa and Asia. 

About the Joseph Hooker Correspondence Project (JHC)

The JHC Project is conserving, digitising, transcribing and making available online the personal and scientific correspondence of Sir Joseph Dalton Hooker (1817-1911), an important but often overlooked nineteenth century naturalist and explorer. 

Hooker was the leading botanist of his day, pioneered the discipline of geographical botany, served as President of the Royal Society (1873-1878) and was Director of the Royal Botanic Gardens, Kew for twenty years (1865-1885).

The JHC project is a relative newcomer to Kew, running since July 2013, but the material being digitised is very much complementary to the ongoing DC project.

1. How many items should we digitise?

A decision must be made regarding the scale of a digitisation project. The associated costs will of course be a major limiting factor but it is also important to consider the scale of the collections or parts thereof to be worked on. For instance, it may make sense to digitise the most well used collections to prevent further handling damage.

The Directors' Correspondence Project

Because of the large scale of the DC collection, it was decided that the best approach would be to break the digitisation into discrete projects reflecting the existing geographically based archive arrangement. For example, digitisation of all the Latin American DC volumes would form one discrete project. The collection itself is indexed by author, so it would have been possible to digitise particular authors selectively, but as the exact content of much of the correspondence was unknown, it was felt that working on all the items would create the greatest potential for research use.

The Joseph Hooker Correspondence Project

While the DC comprises incoming correspondence to Kew, the Hooker correspondence project is currently focusing on letters written by Hooker himself. This is in contrast to other similar projects, notably the Darwin Correspondence and Wallace Correspondence projects, which seek to create a collection representing all letters to, from and about their focal figure.

The reasons for initially limiting the JHC project content raise two interesting points to consider at the planning stages of a digitisation project:

  • Firstly, this approach avoids overlap with material already digitised; we wish to make the most of project funding to reveal new material and increase our digital offer.
  • Secondly, limiting the content minimises the risk of copyright issues. Under current UK law, unpublished material of this kind remains in copyright, held by the descendants of the original author, until 2039. In the case of the JHC project, by only including Joseph Hooker material, the due diligence required to clear copyright is limited to one set of descendants.

2. What method should we use?

A wide variety of imaging techniques is now available for digitisation, from book scanners to tripod-mounted digital cameras. At Kew we have acquired new imaging equipment with project funding and have also made use of existing equipment normally used for imaging enquiries and reprographics requests.

The Directors' Correspondence Project

600 dpi TIFF images of each letter are captured using a Cambo camera with a digital back, purchased using project funds. Metadata for each item is also captured, comprising the letter author, recipient, date and location, and each item is assigned a unique ID which links to the metadata via a custom built Access database.
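The link between an image and its metadata through a unique ID can be sketched as follows. This is only an illustration: SQLite stands in for the project's Access database, and the table layout, ID format and file naming rule are all assumptions rather than the project's actual design.

```python
import sqlite3


def create_catalogue():
    """Create an in-memory catalogue (SQLite is a stand-in for Access here)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per letter, keyed by the unique item ID.
    conn.execute(
        """CREATE TABLE letters (
               item_id   TEXT PRIMARY KEY,
               author    TEXT,
               recipient TEXT,
               date      TEXT,
               location  TEXT
           )"""
    )
    return conn


def add_letter(conn, item_id, author, recipient, date, location):
    """Store the basic metadata captured for one letter."""
    conn.execute(
        "INSERT INTO letters VALUES (?, ?, ?, ?, ?)",
        (item_id, author, recipient, date, location),
    )
    conn.commit()


def image_filename(item_id):
    """Name the master image after the item ID (naming rule is an assumption)."""
    return f"{item_id}.tif"


def metadata_for_image(conn, filename):
    """Look up a letter's metadata from its image file name, or None if absent."""
    item_id = filename.rsplit(".", 1)[0]
    return conn.execute(
        "SELECT author, recipient, date, location FROM letters WHERE item_id = ?",
        (item_id,),
    ).fetchone()
```

Because the ID is the only thing shared between the image file and the record, either can be moved or reprocessed without breaking the link, provided the ID itself never changes.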

One of the strengths of the DC project is that other important information is captured from the letter content through a free-text searchable content summary. This summary includes key information such as the names of people, locations, plants and publications discussed. 
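To show why a free-text summary is useful to researchers, here is a minimal sketch of searching summaries for several terms at once. The function name, the dictionary layout and the simple substring matching are assumptions for illustration; the project's own search tools are not described in this post.

```python
def search_summaries(summaries, query):
    """Return the IDs of items whose summary contains every query term.

    `summaries` maps a unique item ID to its free-text content summary.
    Matching is case-insensitive and uses plain substring tests.
    """
    terms = query.lower().split()
    return [
        item_id
        for item_id, text in summaries.items()
        if all(term in text.lower() for term in terms)
    ]
```

A search for a plant name and a place together would then return only those letters whose summaries mention both, which is the kind of query that author-only indexing cannot answer.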

Unfortunately, because the correspondents' handwriting varies widely, is often difficult to decipher and is sometimes cross-written, Optical Character Recognition (the conversion of images of text into machine-encoded, computer-readable text) was not a practical option.

The Joseph Hooker Correspondence Project

We use a Zeutschel book scanner to create high quality 300 dpi TIFF images of the letters, which are retained as an archival copy, and the images are processed into smaller JPEG files for the web. 
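Resolution has a direct effect on storage, which is worth estimating before choosing a capture standard. The sketch below works out pixel dimensions and uncompressed file size for a page at a given dpi. The page size of 8 x 10 inches and the 24-bit colour depth (3 bytes per pixel) are assumptions for illustration, not measurements from either project.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate size of an uncompressed image (24-bit colour by default)."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


# Hypothetical 8 x 10 inch page at the two resolutions mentioned in this post.
size_300 = uncompressed_bytes(8, 10, 300)  # 21,600,000 bytes
size_600 = uncompressed_bytes(8, 10, 600)  # 86,400,000 bytes
```

Doubling the resolution quadruples the pixel count, so under these assumptions a 600 dpi master needs roughly four times the storage of a 300 dpi one.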

Imaging is performed by a member of staff who is funded for one day per week. This is sufficient staff time as the imaging progresses relatively quickly when compared to the corresponding data creation. 

As with the DC project, JHC staff generate the basic metadata and a summary of the letter content; the data is compiled into a project database (also in Access and custom built) and linked with the image using a unique ID. 

3. Should we use volunteers or full time staff?

Volunteers can often be attractive to potential funders as well as a boon to the project. Having a team of volunteers is a great way to involve people from the community, whether that is local residents, a community of people who have an interest in your specific content, or those looking for career experience in the sector. 

Community outreach is a big plus on a grant application and volunteers can be excellent ambassadors for a project, helping to spread the word to their peers. However, it can be hard to balance the staff time spent checking work against the need to ensure a high standard, particularly when volunteers are creating a complex document such as a transcript rather than entering simple data in a standard format.

The Directors' Correspondence Project

The DC project employs several full-time staff, representing a significant investment. Staffing costs can be high but, provided turnover is avoided, full-time staff will develop expertise in interpreting specific collections, and the efficiency and quality of their work will improve quickly. A smaller team of full-time staff will likely produce work of a more even quality and at a more reliable rate, but at a higher overhead cost.

The Joseph Hooker Correspondence Project

The JHC project differs from DC in that its data creation also includes a full transcript of every letter. As with the DC, OCR is not possible, so the transcription is performed by volunteers. 

The project is run by one full-time member of staff who manages a team of volunteer letter transcribers. The project has 12-15 volunteers who work remotely; they are emailed digital images of the letters and return the transcripts as Word documents.

Though Kew has a long history of using volunteers, this model, remote volunteering, was new to the organisation. As a transcript is a complex document unique to each letter, the decision was made to use a small team of volunteers rather than attempt to create a framework for crowdsourcing. 

The crowdsourcing model is used for projects where clearly defined data can be entered into consistent fields or where specific information within a document is tagged. For many examples of this crowdsourcing model see Zooniverse, which hosts projects in collaboration with academic and other partner institutions, such as the long standing 'Old Weather' project and the more recent 'Operation War Diary'. 

4. Should we host our own content?

Hosting your own digital content allows you to maintain greater control over it, for instance by choosing the format in which it is provided, such as a high resolution screen image or a PDF download. It also allows you to determine how metadata is presented to users, tailoring it to the ‘quirks’ of your particular collections.

Maintaining online content yourself can be expensive, especially if design and maintenance work cannot be performed in house, as website contractors can be costly. Contributing data to a partner organisation passes the burden of hosting, development and maintenance costs to experts and can help ensure the longevity of data beyond the end of project funding. Your content may also shine alongside a partner’s complementary materials and reach a larger audience as part of a more comprehensive resource.

The Directors' Correspondence Project

High quality digital images of the correspondence and the associated summaries are made available online through JSTOR Global Plants. This portal hosts an unrivalled collection of botanical resources including digitised herbarium specimens, primary sources and reference works, contributed by over 300 herbaria worldwide. 

This is a subscription site; non-subscribers still have access to the metadata summary of the correspondence but not the high quality images of the original. Though the site is a fantastic resource, its search functionality is tailored more towards herbarium specimens than archival documents, and it may take a researcher a little time to learn how best to search it for the DC. The DC metadata is also being made available online through the archive catalogue of the Royal Botanic Gardens, Kew.

The Joseph Hooker Correspondence Project

Digitised JHC material is made available online through a non-subscription microsite on Kew's website. Researchers can view the metadata, summary and images, and a full transcript is available to download. A particular advantage of the microsite is that it allows Kew to provide context for the collection, together with supporting material such as biographical details and links to other related Kew collections, for example in the library and the economic botany collection.

Let us know about your in-house digitisation experiences in the comments below

What have you learned about working in partnership? What would you do differently if you were initiating a new project? 


Image source: Image courtesy of the Royal Botanic Gardens, Kew / Original image cropped and resized

