
Storage of and Access to Data: The Solution
for the Compound Document

 

Contents:
1. Fading of information in classical documents
2. A special case: the text
3. The digital data
4. The stability of the physical carrier is not critical
5. Platform independence required
6. The document as a system of objects
7. The compound document
8. The SGML solution for the compound document

 

 

1. Fading of information in classical documents

In the non-computer world of traditional documents and analogue recording methods, data is encoded with the help of various chemical materials, be it inks, colours, plastics, etc. Together with the data, these materials form a special unit which is the expression of the information to be shared. Any change of the data is also a material change within this unit, and any material change in this system is also a change of the data. These changes can be very slow, and within certain periods of time they are perhaps not relevant to the user. However, they happen and they can never be stopped. Thus, the illumination from a medieval manuscript we saw yesterday, or the analogue video we are watching now, will not be the same next time. There will be a small shift of information which will surely not disturb the illusion that repeated viewing of the document recreates in our minds.

A longer period of time, however, can bring larger shifts of information which, under some conditions, will no longer recreate the original impression generated by the document. The message sent by the author of the document will thus be altered and the quality of communication will change. Moreover, this effect is speeded up by the repeated use of the document. Any analogue copying of an image, a sound or a moving picture lowers the original resolution, and each next generation of copies brings further losses of information in comparison with the original. Everybody who has ever copied a video tape or a sound recording on tape knows it.
 

2. A special case: the text

As to text, there is a difference. If we take the text as an image, which may be very important for hand-written texts or old printed documents, and make a photograph, a xerox copy or any other type of analogue image reproduction, the words above apply fully here as well, both to the original and to the copies and their next generations. It is enough to xerox a page and then to make each new xerox from the most recent copy: the loss of information will be very clearly observable. However, if we take a text only as a tool for communicating words and ideas, then the linguistic preservation of the text is also the exact preservation of those words and ideas. There will be no loss of information as long as we are able to preserve the integrity of the text.

It follows that the quality of a copy of a text - when we move the text from one information carrier onto another - is controlled by the criterion of its readability. If the entire text is readable as it was, the information expressed by the text is as complete as it was.

The text has, however, other problems to fight with: the changing of languages, of their grammar and of alphabets. Over periods of time measured in centuries and millennia, there can be problems of understanding. This is not the same for images, sound or motion pictures, because they act upon our perception more directly than such a complex encoding system as human language.
 

3. The digital data

If texts, images, sound or motion pictures are encoded digitally, then the carrier of the encoding is immaterial; it therefore does not affect the encoded information as it did in the non-computer world. This is a great advantage which can help the preservation of information: the information can be frozen in digits in the shape it had when it was digitized.

One of the important preservation criteria here can be the stability of the carrier on which the encoded (digitally or non-digitally) information was written. The classical carriers such as parchment, palm leaves, stone, film strips, or even good paper are more stable than the carriers of digital information, even if the golden compact disc is now becoming more reliable than, for example, newsprint.
 

4. The stability of the physical carrier is not critical

However, it has happened to us that even inscriptions in stone which, thanks to the stability of the carrier, have safely survived the centuries undeteriorated cannot be understood, because information about the alphabet, language or grammar used to encode them is lacking. Similarly, certain electronic texts or databases that have not been in use for a long time are about to create the same problem, because the technological environment has changed substantially since they were created. The problem is then not so much the readability of the physical carriers (e.g. diskettes or tapes) as data formats for texts or database machines, operating systems, or even hardware which no longer exist.

If old electronic data is still readable, it is thanks to the fact that it has been in use since the moment it was created. Thanks to this use, it has perhaps migrated several times from one format into another, from one database machine into another, etc. Sometimes even the encoding of non-ASCII characters has changed because of better shaped standards.
 

5. Platform independence required

All this is to say that the most critical factor in the preservation of and long-term access to digital data is its dependence on software and hardware platforms.

Digital data can be copied easily from one carrier onto another without loss of information. This must be done as carriers develop and ever newer ones appear. A good monitoring and archival policy can ensure that from this side of the problem no danger threatens the preservation of the recorded data. This procedure is called data refreshment and should be combined with reliable information and management systems within digital archives.

However, the problem can be the standard used for the data format. All of us know it as we change text editors and their versions over time. It was the text domain which was the first to feel this problem of long-term communication. In 1986, a standard solution, independent of computer and software platforms, was defined: SGML (ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), 1986). The principle of this encoding is the mark-up of predefined objects in the text, together with rules for writing the structural definition to be followed by the mark-up process. This definition is written in the so-called DTD (Document Type Definition).
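To illustrate the principle (the element names in this small example are invented and do not come from any particular standard), a DTD may declare the objects of a simple letter, and a document instance then marks those objects up without reference to any software or hardware platform:

    <!-- illustrative DTD: element names are hypothetical -->
    <!DOCTYPE LETTER [
    <!ELEMENT LETTER - - (SENDER, DATE, BODY)>
    <!ELEMENT SENDER - - (#PCDATA)>
    <!ELEMENT DATE   - - (#PCDATA)>
    <!ELEMENT BODY   - - (PARA+)>
    <!ELEMENT PARA   - - (#PCDATA)>
    ]>
    <!-- a document instance conforming to that DTD -->
    <LETTER>
    <SENDER>An example sender</SENDER>
    <DATE>An example date</DATE>
    <BODY><PARA>The text of the letter, marked up independently of any
    particular formatting or display software.</PARA></BODY>
    </LETTER>

Any SGML-aware software which reads the DTD can validate such an instance and process the marked-up objects, whatever platform it runs on.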
 

6. The document as a system of objects

From the semantic point of view, the objects within the text can be very rich (names of persons or other content-based objects) or very poor (formal objects such as a paragraph). If the mark-up is based mainly on semantically poor formal objects, it in fact prescribes how the document will be formatted when displayed or printed. This formal formatting is based on typographic and publishing practice. Thanks to such a formally formatting SGML-based standard called HTML, we are able to share information on the WWW, and you can share with me now what I have written for you here. It is the HTML DTD which enables this communication.
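A small fragment of HTML shows this formal, presentation-oriented mark-up: the elements describe headings, paragraphs and emphasis, i.e. how the text is to be structured and displayed, but say nothing about what the names or dates in it mean (the sentence itself is only an invented example):

    <!-- formal objects only: heading, paragraph, emphasis -->
    <H1>An Example Document</H1>
    <P>The manuscript was written by <EM>an unknown scribe</EM>
    at the beginning of the fifteenth century.</P>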

However, semantically richer objects can also be defined in a DTD and then marked up throughout the text. It is even recommended to do so, because it makes it possible to use such objects for multiple outputs and for building dictionaries or thesauri. These objects are in fact described semantically. We know this from cataloguing formats: we can imagine the catalogue record as a text in which well-defined objects (defined by the cataloguing rules) are marked following a mark-up system (e.g. UNIMARC) and then formatted in various ways when displayed, on the basis of prescribed formats. The idea of SGML is similar.
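By contrast, a DTD with semantically richer objects allows the same sentence to be marked up by content, so that persons, dates and similar objects can later be extracted for indexes, dictionaries or different outputs. The element and attribute names below are purely illustrative; they are not taken from UNIMARC, AACR2 or any existing DTD:

    <!-- the same sentence with illustrative content-based mark-up -->
    <P>The manuscript was written by
    <PERSNAME ROLE="scribe">an unknown scribe</PERSNAME> at the beginning of
    <DATE TYPE="origin">the fifteenth century</DATE>.</P>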

 

7. The compound document

The situation is more complex when the data is used for representing images, sound or movies. If we follow strictly the philosophy based on text approaches, we are left only with series of digital files, and our main concern will be the stability of the data formats themselves. For example, the image data shared on the Internet today is encoded in the GIF or JPEG formats. If we want to preserve it over a long period, we must monitor the development of raster graphic formats, and at the moment when GIF or JPEG cease to be used, or the versions in use differ very much from the versions in which our data is stored, we shall have to convert our GIFs or JPEGs into new, reliable formats. In principle, the reliable formats are those which are widely used.

In the text domain, much additional information about the text itself was encoded through the mark-up and within the appropriate DTDs, to make the orientation of any future user easier. In fact, the text itself was described. This descriptive information is information about information, data about data; it can therefore be called metadata, while the object of the description is the data.

Documents in which the data is contained in image, sound or video files, while the metadata has no other choice than to be text-based, are called compound documents. In order to ensure a long life for the data here as well, it is recommended to add the necessary metadata to it. This added value should be encoded in a way which has the highest possible degree of stability.

Digital image, sound or video data is frequently stored without any additional descriptive electronic metadata (digital sound recordings, for example, as published on CDs), or it is stored in and accessed through various database systems, where the electronic data is database-dependent (very often the case for image files).

Neither of these two approaches can really be recommended, because they lack platform-independent metainformation about the data within the stored documents. Sometimes we can hardly even speak about autonomous documents, because there is more or less only a certain number of files stored together and differing by name.
 

8. The SGML solution for the compound document

Long-term archival storage and access cannot be ensured in this way. Instead, it is more and more recommended to apply SGML also to the storage of compound documents. Compound documents, however, are much more complex than texts: they are not so flat and they can form multiple trees.

From the formal point of view, a very good container for compound documents is HTML. This is quite clear when working with Web documents. HTML also offers very good basic open access to the data files of the compound document and a very good orientation within it. However, it does not offer content-descriptive qualities.
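A sketch of such an HTML container might look as follows; the file names and the document identification are invented for the example, but they show how HTML binds the image and sound files of one compound document together and gives the user basic access and orientation, while saying nothing about the content of those files:

    <!-- hypothetical container page of a compound document -->
    <HTML>
    <HEAD><TITLE>Example manuscript - digital copy</TITLE></HEAD>
    <BODY>
    <H1>Example manuscript</H1>
    <P><A HREF="images/f001r.jpg">Folio 1 recto</A></P>
    <P><A HREF="images/f001v.jpg">Folio 1 verso</A></P>
    <P><A HREF="sound/commentary.wav">Recorded commentary</A></P>
    </BODY>
    </HTML>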

Non-HTML-based SGML approaches create problems of access: in practice, access to them is usually enabled by converting them into HTML. It is the access quality of HTML which has led us to define an SGML content-description environment for compound documents by accepting the formal formatting qualities of HTML and enriching them with an SGML-based description of the contents.

Our solution for compound documents is then an SGML environment defined in a special DTD called the DOBM DTD. This DTD adds content-description qualities to HTML. The concrete definitions of the objects to be marked up are based on the application of content-descriptive standards (such as AACR2) and on good practices established during the work with the documents. They can change from one group or type of documents to another. The concrete objects to be marked up in the compound document are defined in the so-called map of the document, represented by an SGML file called the DOBMENT. The structure of the DOBMENT is defined in its DTD. The rules are general; their specifications for each type of document are mapped in special concrete applications of this new standardized approach.
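Purely as an illustration of the principle - the element and attribute names below are invented and do not reproduce the actual DOBM or DOBMENT definitions - such a map can be imagined as an SGML file which lists the data files of the compound document and attaches content-descriptive elements, based for example on cataloguing practice, to the whole document and to its parts:

    <!-- illustrative sketch only; not the real DOBMENT structure -->
    <DOCMAP>
      <DESCRIPTION>
        <TITLE>Example manuscript</TITLE>
        <ORIGIN>Bohemia, beginning of the fifteenth century</ORIGIN>
      </DESCRIPTION>
      <PART FILE="images/f001r.jpg">
        <CAPTION>Folio 1 recto, illuminated initial</CAPTION>
      </PART>
      <PART FILE="sound/commentary.wav">
        <CAPTION>Recorded commentary on the manuscript</CAPTION>
      </PART>
    </DOCMAP>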

The SGML dependence of our application is as follows:

This CD-ROM contains an explanation of the new structure, with examples of encoded documents and also with special access software, to show that the encoded data can not only be viewed in your WWW browser but also used and evaluated in a special application.