The Illustration Archive: Retrieving the Visual Element of Books from the Nineteenth Century

A selection of illustrations from the Lost Visions Project

Despite the mass digitization of books, illustrations have remained more or less invisible. As an aesthetic form, illustration is conventionally positioned at the bottom of a hierarchy that places painting and sculpture at the top. The hybridity or bi-mediality of illustration is also problematic, the genre having fallen between the cracks of literary studies and art history. In a digital context, illustration has fared no better: new technologies can aid the editing of a literary text far more successfully than they can deal with the images that accompany it.

Lost Visions is an AHRC-funded Big Data project led by Professor Julia Thomas in Cardiff University’s School of English, Communication and Philosophy, in collaboration with the School of Computer Science and Informatics. The team has created a searchable online archive containing over a million book illustrations from the British Library’s collections, which were originally scanned by Microsoft.

The images span the late eighteenth to the early twentieth century, cover a variety of reproductive techniques (including etching, wood engraving, lithography and photography) and are taken from around 68,000 works of literature, history, geography and philosophy. Users can search across illustrations ranging from maps to decorative motifs to scientific diagrams, view and curate online exhibitions, and create and share their own collections of images from the archive.

Bibliographic Metadata: The Challenges

On release of the scanned images, the British Library also made a full set of bibliographic metadata available in TSV format on its public GitHub account. This data included details of the author, title and publisher of the book from which each illustration was retrieved, as well as the date of publication, the page number and an index for cases where multiple illustrations appear on a single page. However, much of the data was unreliable or incomplete. For example, many of the book titles had been abridged or amended, possibly as a result of varying catalogue records, the processes used to digitise handwritten records, or technological limitations such as maximum-length limits on text entry fields in databases. In addition, the page numbers themselves were incorrect because they had been counted from the book’s front cover.
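As a rough illustration of working with metadata of this kind, the following Python sketch reads tab-separated records and flags rows with missing fields. The column names (`author`, `title`, `publisher`, `date`, `page`, `idx`) and the sample rows are assumptions made for the example; they are not the headers or contents of the British Library's actual files.

```python
import csv
import io

# Hypothetical sample data: column names and rows are invented for this sketch
# and do not reproduce the British Library's real TSV headers or records.
SAMPLE_TSV = (
    "author\ttitle\tpublisher\tdate\tpage\tidx\n"
    "Example, A.\tAn Example Title\tExample & Co.\t1865\t000012\t1\n"
    "Example, A.\tAn Example Title\tExample & Co.\t1865\t000012\t2\n"
    "\tUntitled fragment\t\t\t000003\t1\n"
)


def load_records(handle):
    """Read tab-separated rows and note which bibliographic fields are empty."""
    # QUOTE_NONE stops quotation marks inside titles being treated as delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        # 'page' here is a position counted through the scan, not a printed page number.
        row["page"] = int(row["page"])
        # 'idx' distinguishes multiple illustrations on the same page.
        row["idx"] = int(row["idx"])
        row["date"] = int(row["date"]) if row["date"].isdigit() else None
        row["incomplete"] = [
            key for key in ("author", "publisher", "date") if not row[key]
        ]
        records.append(row)
    return records


records = load_records(io.StringIO(SAMPLE_TSV))
for record in records:
    print(record["title"], record["page"], record["idx"], record["incomplete"])
```

Keeping a list of missing fields on each record, rather than discarding incomplete rows, means that illustrations with partial metadata remain searchable by whatever information does survive.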

The challenges associated with the collection’s bibliographic metadata draw attention to wider complexities relating to the status of illustration: illustration is positioned as subservient to the words it accompanies in various ways. For example, the publication dates provided for the images in the dataset refer to the publication of the volume from which the illustration was taken, not to the date of production or original publication of the illustration. Even more significantly, the metadata contained no discrete field for the illustrator or engraver of an image: where they are included, these details are subsumed within the title of the work itself.
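One way to picture the problem of credits buried in titles is a simple pattern-matching heuristic. The Python sketch below is only an illustration under stated assumptions: the phrasing it looks for ("illustrated by", "with engravings by" and similar) and the example titles are hypothetical, and it is not a description of how the project itself handled these fields.

```python
import re

# Heuristic pattern (an assumption for this sketch): a keyword such as
# "illustrated" or "engravings", a short gap, then "by" and a capitalised name.
CREDIT_PATTERN = re.compile(
    r"(?i:illustrat\w*|engrav\w*|drawings|etchings)\b"
    r"[^.;\[\]]{0,40}?\bby\s+"
    r"(?P<name>[A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*)"
)


def extract_credit(title):
    """Return the name following an illustration credit in a title, or None."""
    match = CREDIT_PATTERN.search(title)
    return match.group("name") if match else None


# Hypothetical titles, invented for the example.
for title in (
    "A Tour in Wales ... With illustrations by John Smith",
    "Poems. Illustrated by G. H. Thomas, etc",
    "The History of Cardiff",
):
    print(repr(title), "->", extract_credit(title))
```

A heuristic like this would miss credits phrased in other ways and cannot tell an illustrator from an engraver, which is part of why the absence of a dedicated metadata field matters.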

Analysing Iconographic Features: Illustration and Crowdsourcing

The existing bibliographic metadata is useful in its own right: a bibliographic search by author name, for instance, could reveal which writers were illustrated most frequently in the period and which were seldom illustrated. As a result, one can begin to see how illustration, by popularising certain texts and keeping them in circulation, had the power to determine their market, or even literary, value. However, the bibliographic metadata is not always a reliable way of describing the content of an image. In some cases the bibliographic and iconographic descriptions of images coincide and the title also helps to describe the content of the text, but this is not always the case.

The ability to search iconographically is of obvious importance in a digital image archive: an iconographic search can reveal hitherto unrecognised parallels between images illustrating disparate texts, suggesting ways in which these images signify by the repetition of stylistic features, characteristics and devices and, as such, form part of an aesthetic and artistic tradition that is distinct from textual analogues.

In Professor Thomas’s previous AHRC-funded project, the Database of Mid-Victorian Illustration (DMVI), one of the principal aims was to develop a method of allowing images to be searched by content, as well as by artist, author and title. Such a scheme involved 'tagging' or 'marking-up' the content of images, i.e. attaching particular search terms to a particular illustration, according to what is depicted. The user of the database could then search according to these tags, either by typing a word into a search box, or by browsing the set of available terms.
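The search-by-tag idea described above can be sketched as a small inverted index that maps each term to the images carrying it. This is a minimal Python illustration, not the DMVI implementation; the class name `TagIndex`, its methods and the image identifiers are all hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: each search term points to the images tagged with it."""

    def __init__(self):
        self._images_by_tag = defaultdict(set)

    def tag(self, image_id, *terms):
        """Attach one or more search terms to an illustration."""
        for term in terms:
            self._images_by_tag[term.strip().lower()].add(image_id)

    def search(self, term):
        """Return the images tagged with a typed word, in sorted order."""
        return sorted(self._images_by_tag.get(term.strip().lower(), set()))

    def browse(self):
        """Return the full set of available terms, in alphabetical order."""
        return sorted(self._images_by_tag)


index = TagIndex()
index.tag("img-001", "Dog", "river")
index.tag("img-002", "dog")
print(index.search("dog"))
print(index.browse())
```

Normalising terms to lower case before storing them means that a typed search and a browsed term resolve to the same entry.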

This process was done manually, using a tailor-made system of keywords designed for Victorian visual art, and for illustrations in particular. While the manual tagging of iconographic features was possible for the 896 illustrations in DMVI, it was obviously not a practical methodology for a dataset of one million images. The approach taken for the Illustration Archive, therefore, combined image recognition software, crowdsourced tagging and machine learning.

Illustration and Crowdsourcing: Classification and Hierarchy

Crowdsourcing (in this case, asking the public to help categorise and tag digital image archives) has a dual function in the construction of digital archives, simultaneously helping to improve bibliographic metadata and engage the public. Within the context of the Illustration Archive, generating a process of crowdsourced tagging also drew attention to some of the wider complexities relating to the interaction between word and image.

The tagging process needed to be simple, and keeping task complexity low meant that we could not ask for specialist or technical knowledge; for example, most taggers would not be able to distinguish between different reproductive techniques. The tagging process also needed to be attuned to the ultimate goal of searchability and image retrieval and to a wide range of potential users.

The team consulted various subject and genre indexing resources in the hope of finding an appropriate model for classifying images. While the detailed hierarchies and categories offered by these resources were hugely appealing in the construction of a searchable archive, the level of complexity and specialist knowledge they represent was not consistent with the requirements of the tagging process. We therefore devised a simple tagging sequence, which asks users to define the type of illustration from a range of categories, then to describe ‘things or ideas’ present in the illustration, before transcribing the caption (if present) and providing any additional relevant information.
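The four-step tagging sequence could be represented as a single record per submission. The Python sketch below is one possible shape for such a record, under stated assumptions: the category list, the field names and the class name `TagSubmission` are hypothetical and are not taken from the archive's own software.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical categories for the example; the archive's real list is not given here.
CATEGORIES = ("map", "decorative motif", "scientific diagram", "other")


@dataclass
class TagSubmission:
    """One pass through the tagging sequence for a single illustration."""

    image_id: str
    category: str                                              # step 1: type of illustration
    things_or_ideas: List[str] = field(default_factory=list)   # step 2: free-text tags
    caption: Optional[str] = None                              # step 3: transcribed caption
    notes: Optional[str] = None                                # step 4: any additional information

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        # Normalise tags and drop empty entries so that searches match consistently.
        self.things_or_ideas = [
            tag.strip().lower() for tag in self.things_or_ideas if tag.strip()
        ]
        # Treat a blank caption the same as no caption.
        if self.caption is not None:
            self.caption = self.caption.strip() or None


submission = TagSubmission("img-001", "map", [" River ", "", "Bridge"], caption="  ")
print(submission)
```

Only the category is validated against a fixed list; the remaining steps stay free-text, which reflects the aim of asking taggers for description rather than specialist classification.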

The Digital Archive and Illustration Studies

The challenges faced throughout the Lost Visions project draw attention to some of the broader complexities that characterise the relationship between word and image. These challenges were wide-ranging: the lack of established systems of classification for illustration, the hierarchy of textual production in which the illustrator is secondary to the author, and the tendency for graphic texts to be invaded by the verbal in digitisation projects.

The Illustration Archive has sought to reclaim and re-establish the visual element of the printed book; in doing so, the Lost Visions project has exposed and problematised the complex interactions between word and image in the creation of the digital archive. The archive was launched on 31st March and is now available at


Image credits: The Illustration Archive

