11.2 Long term strategies for electronic documents - report from a Swedish study 11 August 1995

Mats G. Lindquist, Consultant

The proliferation of electronic documents (henceforth called e-documents) both from original sources and from the conversion, mostly by digitization, of traditional documents calls for increased attention to the questions of preservation and access of these documents.

A Swedish study was launched in May 1994 by the Royal Library (KB) in cooperation with the National Archive of Recorded Sound and Moving Images (ALB) and the National Archives (RA). The objective was to identify methods for the long term preservation and access of e-documents. Subsequently, in the USA, a similar study was initiated by the Commission for Preservation and Access (CPA). In Europe the COBRA Task Group Five has been set up to study the long-term availability of electronic publications. Since these studies have essentially the same mandate, it seems sensible to coordinate the investigations and share results as they emerge. The following is a summary of the findings from the Swedish study.

Specification of the task and the identification of e-documents.

A method is seen as a systematic mode of procedure to attain an objective, in this case the long term availability of e-documents. A method for the preservation of e-documents encompasses a set of choices or selections:

1. Selection of material; what is an e-document, and which of these should be preserved?
2. Choice of technology; which information carriers and which equipment should be used?
3. Choice of form for representing and storing the information.
4. Selection of access mechanism; how can logical/intellectual access be secured?
5. Choice of mechanism (system) for making e-documents available.

Many projects have been carried out in different countries to address one or more of these issues, and there are many published findings. The ambition of this study is to combine these findings into a holistic set of recommendations.

From a national viewpoint the selection of material for long term preservation is tied to the legislation for legal deposit (in some countries: copyright deposit) of documents. The material definitions in this legislation do not reflect current technology. A review of these definitions is therefore called for. For this study the following definition has been used:

An electronic document is one or more objects carrying information for reading, viewing and/or listening, the content of which cannot be rendered without the aid of electronic equipment.

The information carrier together with a specified way of recording is a medium; the same information can occur on several media, e.g. CD-ROM and magnetic tape.

The term "multi media" usually refers to different kinds of information such as text, pictures, sound and moving images, and although they mostly occur on the same medium the term is so established that it is more practical to live with this inconsistency for the time being. A multi media document, then, may or may not be on multiple media.

The development in electronic media is currently in a very dynamic, almost turbulent phase. Some overall trends can, however, be seen:

1. With regard to production, the volume of e-documents is increasing, and the digital form is increasing its share at the expense of the analog form. Artistic works, and other information products, are to an increasing extent published on multiple media, in some cases as straightforward editions, in other cases as more complex constructs where the contents are rearranged and modified.

2. With regard to distribution the communications networks are growing in importance as a delivery channel, both as an alternative to physical distribution and as a vehicle for downloading information products. Broadcast, cable and telecom channel operators are entering each others' markets.

3. With regard to storage and information carriers there is a technological convergence between development in the computer industry and the (information) media business: the same magnetic tape cartridge can, for example, be used for digital moving images and data processing files. The CD-family of media is gaining a strong market position which is further strengthened by its adoption by the PC industry.

E-documents are intimately tied to the technology used to create and display them. The technological base also makes new conceptual constructs possible, for example so-called "hyperlinks". E-documents are in several ways fundamentally different from traditional ones, and in Appendix 1 a list is given of unique functional properties (UFP's) that require consideration when planning procedures and processes for the management of e-documents.

Findings regarding the selection of material.

For library material the technological development, above all in multimedia, will affect the selection that is controlled by the legal deposit legislation. Content that formerly was considered library material will, to an increasing extent, be subject to other legislation, or none at all. In Sweden the law stipulates that multimedia be delivered to ALB in one copy, instead of the seven copies required for print material. So as publishers move from print to multimedia, e.g. for encyclopedias, there will be gaps and inconsistencies in the collections of the deposit libraries. This is an issue of great concern, and ways are being sought to resolve the problem. For the long term a revision of the legislation is necessary.

In general all libraries should reconsider media-bound policies and guidelines, and instead re-interpret their mission statements when planning their acquisitions and collection development.

Digitization into e-documents of traditional material in library collections can improve both access and preservation. Before selecting material for digitization it should be clearly stated which benefits are sought. Digitization will not reduce preservation costs unless measures are taken to reduce the traditional preservation activities.

Findings regarding media

For assessing the long term adequacy of different media there are several aspects to consider:

- physical deterioration of the information carrier
- technical obsolescence of the recording method
- technical obsolescence of the equipment

Together these make continuous migration of information (sometimes referred to as "refreshing" or "re-copying") inevitable.

The longevity of different carriers has been the subject of many studies, and is "under control" in the sense that the usability can be statistically predicted. Advances in diagnostic quality assessment are being made, so the risk of information loss will be reduced continuously. The recording methods are quite often dependent on specific equipment, and the obsolescence of this equipment constitutes the greatest threat to long term availability.

Currently the Compact Disc (CD) is emerging as an important medium for many different applications (texts, video, photographs, sound, multi-media), and there is convergence in the technology so that different variants of CDs (Audio-CD, CD-ROM, Photo-CD, CD-I) are becoming compatible. This makes the CD a suitable candidate for holding e-documents, provided that the capacity of the CD is sufficient for the application.

For very large volumes, magnetic tape is still the most feasible solution. High capacity storage systems based on tape-cartridges and cabinet-robots might gain enough of a market to become a viable technical approach in the long run. The same medium can be used for data processing storage (including back-ups) and digital video, which can give these media a big enough market base. However, the market growth of services such as Video-on-Demand should be followed since it will push the technology towards high-capacity disc-storage.

Findings regarding form for representation and storage

There are many competing formats for representing and storing information, for example different image formats. There are also many compression algorithms, aimed at reducing the storage required. These constitute further risks for the long term availability because of:

- computer software obsolescence
- computer hardware obsolescence

The use of compression for images and video generally results in some loss of information. (There is research on designing loss-less compression algorithms, but it has not yet resulted in specific standards.) To achieve the benefits of compression regarding storage and handling economy there is therefore a trade-off that has to be made. Can it be justified that less than 100% of the information is preserved for posterity if the quality of the video material is deemed to be sufficient for current usage?
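The nature of this trade-off can be illustrated with a deliberately simple sketch. It is not a method from the study and not a real compression standard: the sample values are invented, and `quantize` and `restore` are hypothetical helper names. Reducing each 8-bit sample to 4 bits halves the storage needed, but the discarded bits cannot be recovered afterwards.

```python
# Illustrative sketch only: lossy reduction of 8-bit samples to 4 bits.
# The sample values and the helper names are assumptions made for this example.

def quantize(samples, drop_bits=4):
    """Discard the lowest `drop_bits` bits of every 8-bit sample."""
    return [s >> drop_bits for s in samples]

def restore(quantized, drop_bits=4):
    """Scale reduced samples back to the 8-bit range; the dropped bits stay lost."""
    return [q << drop_bits for q in quantized]

original = [0, 17, 100, 101, 200, 255]   # invented 8-bit values
reduced = quantize(original)             # each value now fits in 4 bits
restored = restore(reduced)

# The largest difference between an original sample and its restored value.
max_error = max(abs(a - b) for a, b in zip(original, restored))
print("restored:", restored)
print("largest error per sample:", max_error)
```

Whether an error of this size is acceptable depends on the application, which is why the minimum quality level has to be decided before the compression method is chosen.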

For video compression the method defined by the Moving Picture Experts Group (MPEG) is growing in the consumer market. There are, however, different levels of MPEG compression and it is too early to identify a long term dominant method. The quest for more powerful compression is supported by strong market forces.

Digitization by scanning can lead to loss of information. If scanning is used to make a preservation copy the image quality (resolution) should be sufficiently high to make the original superfluous.

For the representation of text there are different coding schemes, the most prevalent being ASCII related. In an increasingly international exchange this situation is not satisfactory, since the overall design of the different ISO character sets for language groups is not modular (which leads to collisions when mixing languages) and does not have sufficient scope.

The problems related to character sets have been underestimated. To implement a scheme based on Unicode (which is technically compatible with ISO 10646) seems to be a feasible alternative in the long run. The adoption of this approach in the commercial sector will determine the long term viability of this alternative.
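The collision problem can be shown with a short sketch. The byte value 0xE4 is chosen here purely for illustration; the three ISO 8859 parts are decoded with Python's standard codecs. The same stored byte stands for a different character in each 8-bit set, whereas Unicode gives each character its own code point.

```python
# Illustrative sketch: one byte value, three meanings, depending on which
# 8-bit ISO 8859 character set is assumed when the byte is read.
raw = bytes([0xE4])  # example byte, chosen for this illustration

as_latin = raw.decode("iso8859_1")     # Latin alphabet No. 1 (Western European)
as_cyrillic = raw.decode("iso8859_5")  # Latin/Cyrillic
as_greek = raw.decode("iso8859_7")     # Latin/Greek

# In Unicode each of the three characters has its own code point, so text
# in several languages can be mixed without collisions.
code_points = [hex(ord(ch)) for ch in (as_latin, as_cyrillic, as_greek)]
print(code_points)

# A two-byte encoding form (UTF-16, big-endian here) stores each of these
# characters in 16 bits.
two_byte = (as_latin + as_cyrillic + as_greek).encode("utf-16-be")
print(len(two_byte), "bytes for 3 characters")
```

Without a record of which character set was used when a text was stored, the byte alone cannot say which of the three characters was meant.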

For the structuring of texts SGML has gained a strong position both in academia and in the commercial sector. It can therefore be considered a candidate as part of a preservation strategy. However, for the Document Type Definitions (DTDs) that are needed as a supplement, there is still no convergence.

For document structures the situation is still fluid: the ODA and ODIF standards have been established but do not have a wide market acceptance.

Findings regarding logical access

For access it is necessary to achieve bibliographic control of the material. Many of the Unique Functional Properties (see Appendix 1) of e-documents have direct implications for cataloging and bibliographic description. One main difference compared to traditional material is that it is necessary to describe how the document is accessed and used. Requirements to this effect are part of the delivery procedure for material to the Swedish National Archives (RA). Furthermore, the dependence on technology makes it necessary to include meta-information about the e-document so that access and preservation can be secured over time. In general, "the principle of provenance", which is fundamental for archives, should be given more recognition in the library world.
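As a minimal sketch of what such meta-information might look like, the record below gathers the kinds of description mentioned above: provenance, medium, form of representation, required equipment, and instructions for access and use. The class name, the field names and all values are hypothetical and are not taken from any cataloging rules or from the RA delivery procedure.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class EDocumentRecord:
    """Hypothetical description record; all field names are assumptions."""
    title: str
    provenance: str            # originating body (principle of provenance)
    medium: str                # information carrier plus recording method
    storage_format: str        # form used to represent and store the content
    equipment_required: str    # what is needed to render the content
    access_instructions: str   # how to access and use the document
    migration_history: list = field(default_factory=list)

# All values below are invented for the example.
record = EDocumentRecord(
    title="Example encyclopedia",
    provenance="Example Publisher AB",
    medium="CD-ROM",
    storage_format="SGML text with scanned images",
    equipment_required="PC with CD-ROM drive",
    access_instructions="Start the retrieval program supplied on the disc.",
)

# Each migration is logged so that the technical history stays traceable.
record.migration_history.append("1995: copied from magnetic tape to CD-ROM")
print(asdict(record))
```

Keeping the technical description next to the bibliographic one means that a later migration can be planned from the record alone, without first examining the carrier.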

There are several international efforts underway to develop rules for cataloging e-documents. All these should be followed closely, but since cataloging is an activity where local features are of importance, national initiatives should be carried out as well. Since e-documents do not fall naturally into collection categories there is a need for more uniformity in cataloging and description between libraries, archives for broadcast media, and national archives. The Swedish effort to coordinate (bibliographic) authority control between archives, museums, and libraries is a case in point.

The links between documents and document parts, and the emerging linking between collections, pose special problems for bibliographic control. The different parts (objects) in a linked structure can be under the control of different bodies; coordination of authority and budgetary responsibility must be sorted out. The "ownership vs. access" trade-off requires this organizational foundation.

Findings regarding availability

Copyright issues are still very difficult to resolve when making e-documents available to the public. Transition to electronic form will require a total review of the compensation systems for artistic works. Until this has happened there will be many, and possibly individual and specific, limitations to availability.

The technological requirements for making e-documents available have implications for investments in equipment and in staff training. So there are necessary cost increases, but availability through electronic means can also be more effective and have a great geographical reach. An electronic network can also tie together collections and archives at different locations.

Equipment for electronic access is unevenly distributed among the population; this fact must be considered when electronic availability is planned.

Tentative recommendations:

The final set of recommendations from the Swedish study has not yet been formalized; the following points seem important to include:

- Review the acquisition and collection development policy in view of the mission, to identify gaps and inconsistencies caused by changes in material from traditional to electronic form.
- Before digitizing material, make explicit what benefits are sought in terms of accessibility, preservation and economy.
- Prepare for digital representation of images and video.
- Plan for a continuous migration ("refreshing") of e-documents. This will incur a cost that depends directly on the longevity of the medium. Shelving arrangements to facilitate migration and technical maintenance should be considered.
- Choose standardized products with a wide market acceptance as archival media. For some applications the CD seems to be a suitable candidate for archiving information.
- Make trade-offs between compression and image (and video) quality explicit. Define minimum quality levels in terms of resolution (and video quality) for different applications.
- Prepare for a two-byte representation of text; follow the market acceptance of Unicode.
- Harmonize cataloging rules with ongoing international work regarding e-documents. Work nationally on harmonizing rules for description of library collections and archives.
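The dependence of migration cost on the longevity of the medium, noted in the list above, can be sketched with a simple calculation. This is a rough illustration under stated assumptions, not a costing model from the study: one migration is assumed at the end of each refresh interval, the cost per migration is assumed constant, and all figures and function names are invented.

```python
import math

def migrations_needed(horizon_years, refresh_interval_years):
    """Re-copying rounds needed to keep a document for `horizon_years`,
    assuming one migration at the end of each refresh interval."""
    return math.ceil(horizon_years / refresh_interval_years) - 1

def total_migration_cost(horizon_years, refresh_interval_years, cost_per_migration):
    """Total cost, assuming every migration costs the same (a simplification)."""
    return migrations_needed(horizon_years, refresh_interval_years) * cost_per_migration

# Invented figures: a 100-year horizon, a cost of 1000 per migration, and
# three assumed refresh intervals in years.
for interval in (5, 10, 25):
    rounds = migrations_needed(100, interval)
    cost = total_migration_cost(100, interval, 1000)
    print(f"interval {interval:>2} years: {rounds:>2} migrations, total cost {cost}")
```

Under these assumptions a shorter-lived medium multiplies the number of migrations, which is why the expected longevity belongs in the choice of archival medium.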

A note on experience exchange

To an increasing extent libraries, archives on the national and local levels, and archives of recorded sound and moving images will face the same problems with regard to preservation and access since the material they handle will be based on similar technologies. Digital information will in some respects look the same regardless of whether it represents text, images, sound or video.

In the Swedish study the steering committee consists of representatives from the Royal Library (KB), the National Archive of Recorded Sound and Moving Images (ALB), and the National Archives (RA). It has been fruitful to have three different points-of-view when discussing issues relating to long term strategies. Since the amount of e-documents will increase for all three organizations there are benefits to be achieved in sharing experiences.

To outline the situation one can point out specific areas of expertise:

- Libraries have a strong position with experience in subject control and character sets (and questions about filing order);
- Archives for broadcast media have experience of handling large volumes of information, and of migration. They also have technical experience of video and moving images.
- National archives have long standing experience of applying the principle of provenance, and of considering meta-data in object descriptions. They also have experience of handling computer produced records (in various formats).

It therefore seems natural that all three types of organization should be represented in future projects relating to e-documents. In addition it would be fruitful to include museums since these organizations also have begun to digitize some of their document collections.

Appendix 1.

UFP's for electronic documents

The following is a list of unique functional properties (UFP's) of electronic documents (e-documents) that set them apart from traditional documents, and that require consideration when planning procedures and processes for the management of e-documents. In some cases established concepts and legal aspects must be reviewed.

Transcendence. E-documents encompass in a uniform way information that traditionally has been considered to be of different kinds: text, graphics, images, sound, and video. All definitions and classifications of documents based on media must be reconsidered. Digitization is making it difficult to maintain consequential differences based on media. E-documents are also, at the same time, potential print, film, phonogram and video.

Large volume. Technical tools for the production of e-documents are powerful and have a large installed base. The number of producers is beyond estimation. E-documents with image information are voluminous.

Multiplicity (variants). E-documents can be manipulated relatively easily, and this is indeed one of their benefits. Re-use of information characterizes both the commercial publishing world and the individual arena. The consequences are problems of physical control and problems with information integrity.

Copies equal to or better than the original. E-documents can be copied without loss of quality. Together with the ease of manipulation this compounds the problem of establishing authenticity. The distribution of "originals" cannot be controlled by technical means. The quality of an e-document can be enhanced by algorithmic methods: shapes and forms can be made more distinct, and shadows can be washed away. Restoration of e-documents must be considered as part of preservation.
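The two properties named here, exact copying and easy manipulation, can be demonstrated in a few lines. The study does not name any technique for this; a message digest (SHA-256 from Python's standard `hashlib`) is used below only as one common way of comparing byte sequences, and the document content and the `fingerprint` helper are invented for the sketch.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Invented content standing in for an e-document.
original = b"Example e-document content (invented for this sketch)."
copy = bytes(original)                                # a bit-for-bit copy
altered = original.replace(b"invented", b"modified")  # a manipulated variant

# An exact copy cannot be told apart from the original by its content.
print("copy matches original:   ", fingerprint(copy) == fingerprint(original))
# A manipulated variant yields a different digest.
print("altered matches original:", fingerprint(altered) == fingerprint(original))
```

A digest shows only whether two byte sequences are identical. It says nothing about which one is the "original", so the question of authenticity still depends on a trustworthy record of the reference value.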

Links (pointers). E-documents have structures that are, at least in part, logical constructs. They can encompass parts which are not physically connected (or bundled). Links occur on several levels: within a document (e.g. hypertext), between documents, within/between series, and within/between collections. Links between libraries/archives are also emerging. These links raise organizational questions about responsibility and economic aspects of co-ordination.

Foreseeable impermanence. E-documents are intimately tied to the technology used for their creation. Technological development gives new dimensions to maintenance and preservation. Technological obsolescence must be considered especially. Physical attrition is also a problem.

Volatile distribution. E-documents can be "distributed" without manifesting themselves as a physical instance ("copy"). Access to an e-document can be equivalent to having it.

Complex copyright. As a consequence of the transcendence (above) it is problematic to apply the legislation on intellectual property rights to e-documents, since the laws often build on definitions that are media-based. Economic (and aesthetic) consequences cannot be foreseen, which leads to complicated discussions about compensation.

