This article is from the May 2002 Issue of Update.
The Gutenberg printing revolution led Europe out of the Dark Ages — the loss of knowledge of the learning of the ancient Greeks and Romans. The digital revolution may land us in an age even darker if urgent action is not taken.
In particular need of attention is the ever-changing information available on the internet, through email and the world wide web. Details preserved in private communications, by accident or design, reveal much of the past. The writings of Roman soldiers in Britain on the Vindolanda tablets, the notebooks of King Alfred, the letters of Charles Dickens, the papers of statesmen like Jefferson or Mountbatten, offer many personal, literary and political insights unavailable in more public documents. The telegraph and telephone led to a decline in letter-writing. The rise of email and text-messaging has meant that communication through writing is popular again.
The disadvantage is that either the written products will be deleted or destroyed, or we will be tempted to keep so much that the costs and problems of data organisation render them virtually useless.
The many advantages of digital data were apparent from the beginning, and this has led to large programmes of retrospective conversion of traditional materials into digital format. However, some data is born, lives and dies only in digital form, and it is the potential death of this data that will have far-reaching consequences for society. Digital documents may seem easy to store, back up and restore, but this is only a short-term benefit: try to find a file a year later on the hard drive, among 500 others, when you have forgotten what it was called.
Materials published on the web may derive from professional sources such as publishers or libraries, with analogues elsewhere, which will continue to be available even after they disappear from digital view, but what of other kinds of digital documents? In the past, ephemera such as playbills, advertisements, menus, etc have been conserved as vital witnesses to aspects of the past. Today, these artefacts appear on the web for a matter of days, to disappear in the twinkling of an eye, the flash of a pixel.
There are, too, many government documents, records and other official papers which only ever have a digital form: what is to be done about these? We have a warning from recent history of the possible dangers: tapes of many seminal radio and television programmes were destroyed, and many live broadcasts were never taped, resulting in serious gaps in the history of the performing arts of the 20th century. In the digital world, the experience of solving the Year 2000 problem has demonstrated that avoiding data loss is extremely expensive.
Another major source of digital data is the retro-conversion of analogue originals by libraries in order that valuable cultural objects can be better preserved. Digitisation has often been mooted as a possible preservation alternative to microfilm, but while digital objects clearly offer better access than microfilm, they are still regarded as high risk for preservation. As Mannerheim points out, 'on preservation and conservation …information technology so far has had only marginal effects'.1 Large-scale microfilming of compromised originals has been carried out for many decades because of acid paper. Many digital imaging projects have grown out of earlier microfilming initiatives.
Why data needs preservation
Data is at risk because it is recorded on a transient medium, in a specified file format, and it needs a transient coding scheme (a programming language) to interpret it. Digital data is also complex, and meaning can depend as much on how individual data objects are linked as on what those objects are. Of course, written documents are also highly complex objects, but their structure does not need to be comprehended for their preservation, only for their interpretation.
Over time, knowledge of how to interpret documents can be lost, but this can usually be recreated, as their textual and physical characteristics are explicit. Their decipherment generally needs only human faculties. Digital documents differ from analogue, too, in that they are not inextricably bound to their 'containers', and therefore preserving them is not necessarily a matter of preserving containers as it is in the analogue world. With digital data, a machine needs to be interposed between it and its human interpreter, which adds another layer of complication. Meanings recreated in modern contexts will necessarily differ from 'original' meanings, but this is generally the case in the interpretation of the past, even when the language and contexts have come down to us in an unbroken line: the past is always interpreted through our own historical moment.
Digital data is in danger, not because it is inherently fragile or flawed, but because there is a continually accelerating rate of replacement, adaptation and redundancy of hardware, software and data formats and standards. This means that the bit stream may not be readable, interpretable or usable for very long into the future.
For most people the bit stream for the word-processing document used to write this piece would be totally unintelligible without the suitable computer applications, software and operating system environments to interpret and repackage the data into a readable form. We take this automatic decoding for granted until we try to read a word-processing file from 10 years ago and find that none of our current systems or software have any idea what the bit stream means without expert help. The longer the data is left unattended, its data coding unrecorded, the higher the risk that systems will become obsolete and the decoding expertise will become unavailable.
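The dependence of meaning on the decoder can be illustrated with a minimal sketch in Python. It is only an analogy: simple text encodings stand in here for the far more complex word-processing formats discussed above, and the function name interpret and the sample sentence are illustrative assumptions rather than part of any real preservation system.

```python
# One bit stream, read under three different assumptions about its coding.
text = "One vellum copy costs around £30,000"
bit_stream = text.encode("utf-8")  # what is actually stored on the medium


def interpret(data: bytes, encoding: str) -> str:
    """Decode the bytes under an assumed encoding (hypothetical helper).

    Bytes that cannot be decoded are replaced with U+FFFD rather than
    raising an error, so the damage stays visible in the result.
    """
    return data.decode(encoding, errors="replace")


as_utf8 = interpret(bit_stream, "utf-8")      # the coding was recorded
as_latin1 = interpret(bit_stream, "latin-1")  # a plausible wrong guess
as_ascii = interpret(bit_stream, "ascii")     # an older, narrower scheme

for label, reading in [("utf-8", as_utf8), ("latin-1", as_latin1), ("ascii", as_ascii)]:
    print(f"{label:8} {reading}")
```

The bytes never change; only the knowledge of how to read them does. Recording the coding alongside the data is what allows the first reading to be recovered later.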
The challenge is not to perpetuate the scenario of data loss and poor records that has dogged our progress over the last 25 years. Otherwise, in just 50 years from now, the human record of the early 21st century may be unreadable, and its decipherment an expensive and intellectually challenging feat way beyond the achievements of the great codebreakers of the 20th.
How is digital data to be preserved?
There are two key issues for data preservation, which surprisingly have little to do with preserving the original bit stream:
- preserving the physical media on which the bit stream is recorded;
- preserving the means of interpreting, reading and utilising the bit stream.
Given that the bit stream is merely a very long series of binary codes, it should retain its integrity for as long as the physical media on which it is recorded remain intact. However, being able to read, use or interpret that bit stream may become increasingly difficult as systems evolve, adapt and eventually become redundant, creating a fog through which the bit stream becomes unusable.
The main preservation methods that have been suggested for digital data are technology preservation; refreshing; migration; emulation; data archaeology; and output to analogue media. Technology preservation, maintaining the hardware and software needed to access the data, can only be a short-term solution while better methods are found: the increasing load of maintaining generations of out-of-date equipment is just too onerous and expensive. Refreshing, the copying of the bit stream from one medium to another (from tape to CD-ROM, for instance) without changing its format, will always need to be done at regular intervals, whatever other preservation methods are undertaken, given the uncertainties about media longevity. It has its own dangers, however: unless stringent checks are carried out, data corruption can go unnoticed and be copied from medium to medium.
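One common form of stringent check is to record a checksum when a file is first archived and to verify it before and after every refresh. The sketch below shows the idea in Python using the standard hashlib module. The function names, file names and sample content are assumptions made for illustration, and SHA-256 is simply one reasonable choice of checksum, not a method prescribed by any of the projects mentioned here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path, recorded_digest: str) -> None:
    """Copy source to target unchanged, checking the bit stream both ways.

    Hypothetical helper: raises ValueError if the source no longer matches
    the digest recorded at archiving time, or if the new copy differs.
    """
    if sha256_of(source) != recorded_digest:
        raise ValueError(f"{source} no longer matches its recorded digest")
    shutil.copyfile(source, target)
    if sha256_of(target) != recorded_digest:
        raise ValueError(f"copy to {target} is corrupt")


with tempfile.TemporaryDirectory() as work:
    old = Path(work) / "old_medium.dat"
    new = Path(work) / "new_medium.dat"
    old.write_bytes(b"minutes of the committee, 2002")
    digest = sha256_of(old)  # recorded when the file was first archived

    refresh(old, new, digest)
    clean_copy_ok = sha256_of(new) == digest

    # Simulate silent corruption on the old medium: one character changes.
    old.write_bytes(b"minutes of the committee, 2O02")
    try:
        refresh(old, new, digest)
        corruption_detected = False
    except ValueError:
        corruption_detected = True

print(clean_copy_ok, corruption_detected)
```

Without the recorded digest, the corrupted file would have been copied forward as faithfully as a good one, which is exactly the danger described above.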
Migration involves changes in format as data is moved between systems, and often leads to loss if the new formats cannot handle some of the complexities or linkages of the old. Complex computer games, or simulations, for instance, may lose some functionality when migrated. Emulation is the recreation of the hardware and software environment required to access a resource.
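The kind of loss that migration can cause is easy to show on a small scale. In the hedged Python sketch below, three invented records that link to one another are migrated into a flat target format that has no place for links. The records, the field names and the migrate function are all hypothetical, chosen only to illustrate the principle.

```python
import csv
import io

# Source format: each record can point to related records.
records = [
    {"id": "doc1", "title": "Annual report", "links": ["doc2", "doc3"]},
    {"id": "doc2", "title": "Appendix A", "links": []},
    {"id": "doc3", "title": "Appendix B", "links": ["doc1"]},
]

# The hypothetical target format has no field for links.
TARGET_FIELDS = ["id", "title"]


def migrate(source_records):
    """Write records as flat CSV; return the CSV text and the fields lost."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TARGET_FIELDS, extrasaction="ignore")
    writer.writeheader()
    dropped = set()
    for record in source_records:
        # Note every non-empty field the target format cannot carry.
        dropped.update(
            key for key, value in record.items()
            if key not in TARGET_FIELDS and value
        )
        writer.writerow(record)
    return buffer.getvalue(), sorted(dropped)


migrated_text, lost_fields = migrate(records)
restored = list(csv.DictReader(io.StringIO(migrated_text)))

print("fields lost in migration:", lost_fields)
print("restored record:", restored[0])
```

Every title survives, but the relationships between the documents do not. Reporting what was dropped, as this sketch does, at least makes the loss visible at the moment of migration rather than years later.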
There is a lot of debate about emulation in the digital library world: some say that, without it, preservation of interactive digital objects is almost impossible, others that the technical and intellectual challenges of emulation are too great to make it an affordable or scalable solution for any but the most valuable data.
Data archaeology encompasses a whole range of rescue techniques which can be used when valuable data has already been lost. Output to analogue media, some say, is the only guaranteed way to save valuable materials. There are a number of options: documents can be printed out and saved; files can be output to microfilm; and there is the extreme example of Hansard, one copy of which is printed every year on vellum, at the cost of around £30,000.
National and international initiatives in digital preservation
The costs of digital preservation are high. The costs of not preserving our digital heritage will be even higher in terms of lost data and materials. Many individual projects and institutions are working hard on this problem, and there are also national and international initiatives collaborating to find long-term solutions.
The most ephemeral content is that on websites, and a number of initiatives have been established to archive the websites of particular countries. The National Library of Australia archives 'snapshots' of a whole range of websites, public, private, serious and light-hearted, including those of the 2000 Sydney Olympics. Key sites are archived every few months, but there are no guarantees that external links will continue to work.2
In the US, the Library of Congress has been awarded $100m to develop a national programme for the preservation of digital materials, in particular those that are only ever digital.3 Some archived sites are now accessible through the recently announced Wayback Machine, available through the Internet Archive.4 Also in the US, the Government Printing Office5 is archiving defunct government websites, and in January 2001 was expecting an increase in demand as Clinton government agencies were replaced under the Bush regime. Future historians will thank government workers who paid attention to archiving websites. After all, saving the content will allow internet researchers to compare politicians' past promises with their current votes, track reports on health care and social security, or even reread the Starr report, all without leaving their computers.6
In Europe, national libraries in Denmark, Sweden, Finland and the UK are grappling with the issues around preserving ephemeral information on the web.
There are also a number of initiatives to establish national and international partnerships to solve digital preservation problems. In the UK, the Joint Information Systems Committee (Jisc), the National Preservation Office (NPO), the British Library and other leading UK organisations have established a Digital Preservation Coalition, together with international organisations such as RLG and OCLC. This has been set up to deal urgently with digital preservation issues in the UK within an international context.7
The coalition was launched at the House of Commons at the end of February, and received extensive press coverage (April Update, p. 4) — clearly the threat is one which is increasingly being taken seriously.
Conclusion
Much cultural material has always been lost to following generations, through accident or intent. Sometimes we know what was lost, as in the case of the destruction of the statues of Buddha by the Taliban in 2001, sometimes we don't, as in the Viking raids on Anglo-Saxon monasteries or the plundering of churches by Cromwellian forces. Often, we don't even know if things existed or not, and the survival of the ephemeral has always been a matter of happenstance. But we are facing a new situation where, without urgent action, a digital black hole could open up in late 20th- and early 21st-century written culture — truly a digital dark age from which information may never reappear.
References
1 J. Mannerheim. 'The WWW and Our Digital Heritage: the new preservation tasks of the library community.' In 66th IFLA Council and General Conference, Jerusalem, Israel, 2000 (www.ifla.org/IV/ifla66/66cp.htm).
2 http://pandora.nla.gov.au/
3 www.loc.gov/
4 www.archive.org
5 www.access.gpo.gov/
6 L. M. Bowman. Bush Camp Takes Charge of Whitehouse.gov in Transition. cnet News.com, 9 January 2001 (http://news.cnet.com/news/0-1005-201-4421306-0.html).
7 www.jisc.ac.uk/dner/preservation/
Marilyn Deegan is Director of Forced Migration Online, Refugee Studies Centre, Queen Elizabeth House, University of Oxford (marilyn.deegan@qeh.ox.ac.uk). Simon Tanner is Senior Consultant at the Higher Education Digitisation Service, University of Hertfordshire (s.g.tanner@herts.ac.uk).
The concepts in this paper are discussed in more detail in Digital Futures: strategies for the information age (Deegan and Tanner, Library Association Publishing, 2002).