1. Fading of information in classical documents
2. A special case: the text
3. The digital data
4. The stability of the physical carrier is not critical
5. Platform independence required
6. The document as a system of objects
7. The compound document
8. The SGML solution for the compound document
In the non-computer world of traditional documents and analogue recording methods, data is encoded with the help of various chemical materials: inks, colours, plastics, and so on. Together with the data, these materials form a very special unit which is the expression of the information to be shared. Any change of the data is also a material change within this unit, and any material change in this system is also a change of the data. These changes can be very slow, and over certain periods of time they are perhaps not relevant to the user; nevertheless, they happen and can never be stopped. Thus the illumination from a medieval manuscript we saw yesterday, or the analogue video we are watching now, will not be quite the same next time. There will be a small shift of information, one which will surely not disturb the illusion that repeated viewing of the document recreates in our minds.
Over a longer period of time, however, larger shifts of information can accumulate, and under some conditions the document will no longer be able to recreate the original impression it generated. The message sent by the author of the document will thus be altered, and the quality of communication will change. Moreover, this effect is accelerated by repeated use of the document. Any analogue copy of an image, a sound recording, or a film lowers the original resolution, and each next generation of copies brings further loss of information in comparison with the original. Everybody who has copied a video tape or an audio tape recording knows this.
As to text, there is a difference. If we take a text as an image, which may be very important for hand-written texts or old printed documents, and make a photograph, a xerox copy, or any other type of analogue image reproduction, the words above hold fully here as well, both for the original and for the copies and their succeeding generations. It is enough to xerox a page and then to make each new xerox from the most recent copy: the loss of information will be very clearly observable. However, if we take a text only as a tool for the communication of words and ideas, then linguistic preservation of the text is also exact preservation of those words and ideas. There will be no loss of information for as long as we are able to preserve the integrity of the text.
It follows that the quality of a copy of a text, when we move the text from one information carrier onto another, is controlled by the criterion of its readability. If the entire text is readable as it was, the information expressed by the text is as complete as it was.
The text has, however, other problems to fight with: the viability of languages, of their grammars, of alphabets. Over periods of time measured in centuries and millennia, problems of understanding can arise. This is not the same for images, sound, or motion pictures, because they act upon our perception more directly than such a complex encoding system as human language.
If texts, images, sound, or motion pictures are encoded digitally, the encoded information is independent of its material carrier; the carrier therefore does not affect the encoded information as it did in the non-computer world. This is a very great advantage for the preservation of information: the information can be frozen in digits in the shape it had when it was digitized.
One important preservation criterion here can be the stability of the carrier on which the encoded (digitally or non-digitally) information was written. Classical carriers such as parchment, palm leaves, stone, film strips, or even good paper are more stable than the carriers of digital information, even if the gold compact disc is now becoming more reliable than, for example, newsprint.
However, it has happened to us that even inscriptions in stone, which thanks to the stability of the carrier have safely survived the centuries without deterioration, cannot be understood for lack of information about the alphabet, language, or grammar used to encode the information. Similarly, certain electronic texts or databases that have not been in use for a long time are about to create the same problem, because the technological environment has changed substantially since the time when they were created. The problem is then not so much the readability of the physical carriers (e.g. diskettes or tapes) as the no longer existing data formats for texts, database machines, operating systems, or even hardware.
If old electronic data is still readable, it is thanks to the fact that it has been in use since the moment it was created. Thanks to this use, it has migrated, perhaps several times, from one format into another, from one database machine into another, and so on. Sometimes even the encoding of non-ASCII characters has changed because of better shaped standards.
All this is to say that the most critical point in the preservation of, and long-term access to, digital data is its dependence on software and hardware platforms.
Digital data can be copied easily from one carrier onto another without loss of information. This must be done as carriers develop and ever newer ones appear. A good monitoring and archival policy can ensure that from this side of the problem no danger will threaten the preservation of the recorded data. This procedure is called data refreshment, and it should be combined with reliable information and management systems within digital archives.
However, the standard used for the data format can be a problem. All of us know this, as we change text editors and their versions in the course of our work. It was the text domain which was the first to feel this problem of long-term communication. In 1986, a standard solution, independent of computer and software platforms, was defined: SGML (ISO 8879:1986, Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)). The principle of this encoding is the mark-up of predefined objects in the text, together with rules for writing the structural definition to be followed by the mark-up process. This definition is written in the so-called DTD (Document Type Definition).
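To illustrate (this is a minimal, invented DTD, not taken from any published application), such a structural definition might declare a document consisting of a title followed by one or more paragraphs:

```sgml
<!-- Hypothetical minimal DTD: a document is a title followed by paragraphs -->
<!DOCTYPE simpledoc [
<!ELEMENT simpledoc - - (title, para+)>
<!ELEMENT title     - O (#PCDATA)>
<!ELEMENT para      - O (#PCDATA)>
]>
```

A document marked up against such a DTD can then be checked by an SGML parser: the mark-up must follow the structure prescribed here.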
From the semantic point of view, the objects within a text can be very rich (names of persons, or any other content-based objects) or very poor (formal objects such as a paragraph). If the mark-up is based mainly on semantically poor formal objects, it in fact prescribes how the document will be formatted when displayed or printed. This formal formatting is based on typographic and publishing practice. Thanks to one such formally formatting SGML-based standard, called HTML, we are able to share information on the WWW, and you can share with me now what I have written for you here. It is the HTML DTD which enables this communication.
However, semantically richer objects can also be defined in a DTD and then marked up throughout the text. It is even recommended to do so, because it makes it possible to use such objects for multiple outputs and for the building of dictionaries or thesauri. These objects are in fact described semantically. We know this from cataloguing formats: we can imagine the catalogue record as a text in which objects well defined by the cataloguing rules are marked following a mark-up system (e.g. UNIMARC) and then formatted in various ways on display, on the basis of prescribed formats. The idea of SGML is similar.
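The difference between the two levels of mark-up can be sketched on a single sentence (the element names below are invented for the purpose of illustration):

```sgml
<!-- Semantically poor mark-up: only the formal object "paragraph" is marked -->
<p>Charles IV founded the university in Prague in 1348.</p>

<!-- Semantically rich mark-up: content-based objects are marked as well -->
<p><persname>Charles IV</persname> founded the university in
<placename>Prague</placename> in <date>1348</date>.</p>
```

Both versions can be formatted identically on display; only the second can also feed an index of persons, places, and dates.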
The situation is more complex when the data is used to represent images, sound, or movies. If we follow strictly the philosophy based on text approaches, we are left only with series of digital files, and our main concern will be the stability of the data formats themselves. For example, the image data now shared on the Internet is encoded in the GIF or JPEG formats. If we want to preserve it over a long period, we must monitor the development of raster graphics formats, and at the moment when the GIF or JPEG formats cease to be used, or when the versions in use differ very much from the versions in which our data is stored, we shall have to convert our GIFs or JPEGs into new, reliable formats. In principle, the reliable formats are those which are widely used.
In the text domain, much additional information about the text itself was encoded through the mark-up and within the appropriate DTDs, to make the orientation of any future user easier. In fact, the text itself was described. This descriptive information is information about information, data about data; it can therefore be called metadata, while the object of the description is the data.
Documents in which the data is contained in image, sound, or video files, while the metadata has no choice but to be text-based, are called compound documents. In order to ensure long life for the data here as well, it is recommended to add the necessary metadata to it. This added value should be encoded in a way that has the highest possible degree of stability.
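Such a compound document might be sketched as follows (a purely illustrative example; the element names are invented and do not come from any particular DTD). The image file itself carries no description, so the text-based metadata supplies it:

```sgml
<!-- Hypothetical compound document: text-based metadata describing a data file -->
<compdoc>
<metadata>
<title>Illuminated initial, fol. 12r</title>
<creator>Unknown illuminator, 14th century</creator>
<format>image/jpeg</format>
</metadata>
<datafile href="fol012r.jpg">
</compdoc>
```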
Digital image, sound, or video data is frequently stored without any additional descriptive electronic metadata (digital sound recordings as published on CDs, for example), or it is stored in and accessed through various database systems, where the electronic data is database-dependent (very often the case for image files).
Neither of these two approaches can really be recommended, because both lack platform-independent metainformation about the data within the stored documents. Sometimes we can hardly even speak of autonomous documents, because there are more or less only certain numbers of files stored together and differing by name.
Long-term archival storage and access cannot be ensured in this way. It is therefore more and more recommended to apply SGML also to the storage of compound documents. These are, however, much more complex than texts: they are not so flat, and they can form multiple trees.
From the formal point of view, HTML is a very good container for compound documents, as becomes quite clear when working with Web documents. HTML also offers very good basic open access to the data files of the compound document and very good orientation within it. However, it does not offer content-descriptive qualities.
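The container role of HTML can be seen in a plain page which merely links together the files of a compound document (the filenames here are invented for the example). It gives access and orientation, but says nothing about the content of the linked files:

```sgml
<!-- HTML as a formal container: access and orientation, no content description -->
<html>
<head><title>Manuscript fol. 12r</title></head>
<body>
<h1>Manuscript fol. 12r</h1>
<p><img src="fol012r.jpg" alt="Illuminated initial"></p>
<p><a href="fol012r.wav">Recorded commentary</a></p>
</body>
</html>
```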
Non-HTML-based SGML approaches create problems of access: in practice, access to them is usually enabled by their conversion into HTML. It is the access quality of HTML which has led us to define an SGML content-description environment for compound documents by accepting the formal formatting qualities of HTML and enriching them with SGML-based description of contents.
Our solution for compound documents is then an SGML environment defined in a special DTD called the DOBM DTD. This DTD adds content-description qualities to HTML. The concrete definitions of the objects to be marked up are based on the application of content-descriptive standards (e.g. AACR2) and of good practices established during work with the documents; they can change from one group or type of documents to another. The concrete objects to be marked up in a compound document are defined in the so-called map of the document, represented by an SGML file called the DOBMENT. The structure of the DOBMENT is defined in its DTD. The rules are general; their specifications for each type of document are mapped in special concrete applications of this new standardized approach.
The SGML dependence of our application is as follows:
This CD-ROM contains the explanation of the new structure, with examples of encoded documents and also with special access software, to show that the encoded data can be not only viewed in your WWW browser, but also used and evaluated in a special