

In the non-computer world of traditional documents and analogue
recording methods, data is encoded with the help of various chemical
materials, be they inks, colours, plastics, etc. Together with the data, these materials
form a very special unit which is the expression of the information to be shared. Any change of
data is also a material change within this unit and any material change in this system is
also a change of data. These changes can be very slow and within certain periods of time
they are perhaps not relevant to the user. However, they happen and they can never be
stopped. Thus, the illumination from a medieval manuscript we saw yesterday or the
analogue video we are watching now will not be the same next time. There will be a small
shift of information, although surely too small to disturb the impression that repeated
viewing of the document recreates in our minds.
A longer period of time, however, can bring larger shifts of
information, after which the document may, under some conditions, no longer recreate the
original impression. Thus, the message sent by the author of the document
will be altered and the quality of communication will be changed. Moreover, this effect
is accelerated by repeated use of the document. Any analogue copying of images,
sound, or movies lowers the original resolution, and each further generation of copies
brings further losses of information in comparison with the original. Anyone who has
copied a video tape or an audio tape knows this.
With text, there is a difference. If we take the text as an image,
which may be very important for hand-written texts or old printed documents, and make a
photograph, a xerox copy, or any other type of analogue image reproduction, the above
holds fully for the original, for its copies, and for their
further generations. It is enough to xerox a page and then to make each new xerox
from the most recent copy: the information loss will be clearly observable. However,
if we take a text only as a metatool for communicating words and ideas, then the
linguistic preservation of the text is also the exact preservation of those words and
ideas. There is no information loss as long as we are able to preserve the
integrity of the text.
It follows that the quality of copying a text - when we
move the text from one information carrier onto another - is governed by the
criterion of its readability. If the entire text is readable as it was, the information
expressed by the text is as complete as it was.
Text has, however, other problems to contend with: the
viability of languages, of their grammar, and of alphabets. Over periods of time measured in
centuries and millennia, problems of understanding can arise. This is not the same for
images, sound, or motion pictures, because they act upon our perception more directly than
such a complex encoding system as human language.
If texts, images, sound, or motion pictures are encoded
digitally, then the encoding carrier is immaterial; therefore, it does not affect the
encoded information as it did in the non-computer world. This is a great advantage
for preservation: the information can be frozen in digits in the shape
which it had when it was digitized.
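The contrast between analogue generations and digital copies can be illustrated by a
small sketch. It is an assumption of this sketch, not a model of any real recording
medium, that one analogue copy acts as a slight blur of neighbouring values; the function
names and the sample signal are likewise chosen only for illustration.

```python
def analogue_copy(signal):
    """Model one analogue copy as a 3-point blur (an assumed stand-in for real wear)."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]        # repeat the edge value at the borders
        right = signal[min(i + 1, n - 1)]
        out.append((left + signal[i] + right) / 3)
    return out


def digital_copy(samples):
    """A digital copy reproduces every stored number exactly."""
    return list(samples)


def deviation(original, copy):
    """Total absolute difference between a copy and the original."""
    return sum(abs(a - b) for a, b in zip(original, copy))


def copy_chain(original, copier, generations):
    """Make each copy from the previous one and record its deviation from the original."""
    current = list(original)
    history = []
    for _ in range(generations):
        current = copier(current)
        history.append(deviation(original, current))
    return history


original = [0, 0, 0, 0, 1, 1, 1, 1]  # a sharp edge, e.g. a pen stroke on a page
analogue_history = copy_chain(original, analogue_copy, 3)
digital_history = copy_chain(original, digital_copy, 3)
```

Under these assumptions the deviation of the analogue chain grows with every generation,
while the digital chain stays at zero however many times the data is copied.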
One important preservation criterion here can be the stability of
the carrier on which the encoded information (digital or non-digital) is written.
Classical carriers such as parchment, palm leaves, stone, film strips, or even good paper are
more stable than the carriers of digital information, even if the golden compact disc
is now becoming more reliable than, for example, newsprint.
However, ..., it has happened that even inscriptions in stone, which
thanks to the stability of the carrier have survived through centuries
undeteriorated, cannot be understood for lack of information about the alphabet,
language, or grammar used to encode the information. Similarly,
certain electronic texts or databases that have not been in use for a long time are about to
create the same problem, because the technological environment has changed substantially
since they were created. The problem is then not so much the readability of
physical carriers (e.g. diskettes or tapes) as the fact that the data formats for texts,
the database machines, the operating systems, or even the hardware no longer exist.
If old electronic data is still readable, it is thanks to the fact
that it has been in use since the moment it was created. Thanks to this use, it
has migrated, perhaps several times, from one format into another, from one database
machine into another, etc. Sometimes even the encoding of non-ASCII characters has changed
because of better-shaped standards.
All this means that the most critical point in the preservation
of, and long-term access to, digital data is its dependence on a software/hardware platform.
Digital data can be copied easily from carrier to carrier without
loss of information. This must be done as carriers develop and new ones keep
appearing. A good archival monitoring policy can ensure that this side of the problem
poses no danger to the preservation of recorded data. This procedure is called data
refreshment and should be combined with reliable information and management
systems within digital archives.
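As a minimal sketch of one refreshment step, the copy onto the new carrier can be
verified by comparing checksums before and after. The function names, the file name,
and the choice of SHA-256 are assumptions made for this illustration; the text above
does not prescribe any particular tool or checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source, target):
    """Copy a file to a new carrier and check that the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise OSError(f"refreshment of {source} failed: checksums differ")
    return after  # a digest the archive's management system could record


# Usage: two temporary directories stand in for an old and a new carrier.
with tempfile.TemporaryDirectory() as old_carrier, \
        tempfile.TemporaryDirectory() as new_carrier:
    source = Path(old_carrier) / "manuscript.dat"
    source.write_bytes(b"digitized page image")
    target = Path(new_carrier) / "manuscript.dat"
    recorded_digest = refresh(source, target)
    copy_is_identical = target.read_bytes() == source.read_bytes()
```

Recording the digest alongside the file is one way to connect refreshment with the
information and management systems mentioned above, since a later refreshment can be
checked against the stored value.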
However, the problem can be the standard used for the data format.
All of us know it, as we change text editors and their versions in the course of our work.
The text domain was the first to feel this problem of long-term
communication. In 1986, a standard solution, independent of computer and software
platforms, was defined: SGML (ISO
8879. Information Processing - Text and Office Systems - Standard Generalized Markup
Language (SGML), 1986). The principle of this encoding is the mark-up of predefined objects
in the text, together with rules for writing the structural definition that the
mark-up process must follow. This definition is written in the so-called DTD (Document Type
Definition).
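The relation between mark-up and its structural definition can be sketched in a few
lines. This is only an illustration under stated assumptions: XML syntax, parsed with
Python's standard library, stands in for SGML; a small table of allowed child elements
plays the role of a DTD; and the element names (letter, sender, body, para) are invented
for the example. A real DTD also governs order, repetition, and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical structural definition, playing the role of a DTD:
# each element name maps to the child elements it may contain.
STRUCTURE = {
    "letter": {"sender", "body"},
    "sender": set(),
    "body": {"para"},
    "para": set(),
}


def structure_errors(element, structure=STRUCTURE):
    """Return a list of places where the mark-up breaks the structural definition."""
    if element.tag not in structure:
        return [f"undeclared element <{element.tag}>"]
    errors = []
    for child in element:
        if child.tag not in structure[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            errors.extend(structure_errors(child, structure))
    return errors


valid = ET.fromstring(
    "<letter><sender>A. Author</sender><body><para>Text.</para></body></letter>"
)
invalid = ET.fromstring(
    "<letter><body><sender>A. Author</sender></body></letter>"
)
valid_errors = structure_errors(valid)
invalid_errors = structure_errors(invalid)
```

The first document follows the definition and yields no errors; the second places a
sender inside the body, which the definition does not allow, and is reported.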
From the semantic point of view, the objects within the text can be
very rich (names of persons or any content-based objects) or very poor (formal objects such
as a paragraph). If the mark-up is based mainly on semantically poor formal objects, it in
fact prescribes how the document will be formatted when displayed or printed. This formal
formatting is based on typographic and publishing practices. Thanks to one such
SGML-based standard for formal formatting, called HTML, we are able to share information
on the WWW, and you can share with me now what I have written for you here. It is the HTML
DTD which enables this communication.
However, semantically richer objects can also be defined in a DTD
and then marked up throughout the text. This is even recommended, because it allows
such objects to be used for multiple outputs and for building dictionaries or thesauri. These
objects are in fact described semantically. We know this from cataloguing formats: we
can imagine the catalogue record as a text in which objects well defined by the
cataloguing rules are marked following a mark-up system (e.g. UNIMARC) and then formatted
in various ways when displayed on the basis of prescribed formats. The idea of SGML is
similar.
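A short sketch may show why semantically rich mark-up allows multiple outputs. Here
again XML syntax stands in for SGML, and the element names (para, person) as well as the
sample sentence and the names in it are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up text: personal names are tagged as <person>.
SOURCE = (
    "<para>The manuscript was copied by <person>Brother Anselm</person> "
    "and later owned by <person>Jan Novak</person>.</para>"
)


def plain_text(element):
    """First output: the running text with all mark-up removed."""
    return "".join(element.itertext())


def name_index(element):
    """Second output: a sorted index of the persons marked up in the text."""
    return sorted({person.text for person in element.iter("person")})


document = ET.fromstring(SOURCE)
reading_text = plain_text(document)
persons = name_index(document)
```

The same encoded text serves both a reading view and a name index; with purely formal
mark-up (paragraphs only) the second output could not be derived automatically.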
The situation is more complex when the data represents
images, sound, or movies. If we strictly follow the philosophy based on text approaches,
we are left only with series of digital files, and our main concern will be the stability of
the data formats themselves. For example, image data shared on the Internet is now encoded
in GIF or JPEG formats. If we want to preserve it over a long period, we must
monitor the development of raster graphic formats, and at the moment when the GIF or JPEG
formats cease to be used, or the versions in use differ greatly from the versions
in which our data is stored, we shall have to convert our GIFs or JPEGs into new reliable
formats. In principle, the reliable formats are those which are widely used.
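Such monitoring presupposes that the archive knows which format and version each file
is in. A minimal sketch follows, identifying files by their leading bytes: GIF files
begin with "GIF87a" or "GIF89a" and JPEG files with the bytes FF D8 FF. The list of
formats regarded as at risk is a purely hypothetical policy setting, and the function
names are chosen for this example.

```python
# Leading byte signatures of the two formats named in the text.
SIGNATURES = {
    b"GIF87a": "GIF (87a)",
    b"GIF89a": "GIF (89a)",
    b"\xff\xd8\xff": "JPEG",
}

# Hypothetical policy: formats the archive currently regards as at risk.
FORMATS_AT_RISK = {"GIF (87a)"}


def identify(data):
    """Name the format of a file from its first bytes, or return 'unknown'."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"


def needs_review(data, at_risk=FORMATS_AT_RISK):
    """Flag files that are unidentified or stored in a format considered at risk."""
    name = identify(data)
    return name == "unknown" or name in at_risk


# Usage with the first bytes of three imaginary files.
old_gif = b"GIF87a\x10\x00\x10\x00"
new_gif = b"GIF89a\x10\x00\x10\x00"
photo = b"\xff\xd8\xff\xe0\x00\x10JFIF"
review_list = [needs_review(f) for f in (old_gif, new_gif, photo)]
```

The actual conversion into a new format is outside this sketch; the point is only that
the decision to migrate can be driven by a regularly revised policy list.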
In the text domain, much additional information about the text itself
is encoded by the mark-up and within the appropriate DTDs to make orientation easier
for any future user. In effect, the text itself is described. The descriptive information is
information about information, data about data; therefore, it can be called metadata,
while the object of the description is data.
Documents in which the data is contained in image, sound, or video
files, while the metadata has no other choice than to be text-based, are called compound
documents. To ensure a long life for the data here too, it is recommended to add
the necessary metadata to it. This added value should be encoded in a way that
has the highest possible degree of stability.
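What such a text-based description might look like can be sketched as follows. The
element and attribute names (document, title, part, file), the file names, and the
sample title are all invented for this illustration, and XML syntax stands in for
SGML; the sketch does not reproduce any particular standard.

```python
import xml.etree.ElementTree as ET


def describe(title, files):
    """Build a text-based description of a compound document.

    `files` maps each data file name to a short description of its role.
    """
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    for name, role in files.items():
        part = ET.SubElement(root, "part", {"file": name})
        part.text = role
    return ET.tostring(root, encoding="unicode")


record = describe(
    "Illuminated manuscript, folio 1",
    {"f001r.jpg": "recto, full page", "f001v.jpg": "verso, full page"},
)
parsed = ET.fromstring(record)
```

The description is plain text, so it can be read without the software that produced
the image files, and it states which files belong to the document and what each one is.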
Digital image, sound, or video data is frequently stored without
any additional descriptive electronic metadata (digital sound recordings as
published on CDs, for example), or it is stored in and accessed through various database
systems, where the electronic data is database-dependent (very often the case for image files).
Neither of these two approaches can be recommended, because
they lack platform-independent metainformation about the data within the stored documents.
Sometimes we can hardly even speak of autonomous documents, because there is little
more than a certain number of files stored together and differing by name.
Long-term archival storage and access cannot be ensured in this way. It
is, however, increasingly recommended to apply SGML to the storage of compound
documents as well. These are much more complex than texts: they are not so flat, and they
can form multiple trees.
From the formal point of view, a very good container for compound
documents is HTML, as is quite clear when working with Web documents. HTML also offers
very good basic open access to the data files of the compound document and very
good orientation in it. However, it does not offer content-descriptive qualities.
The non-HTML-based SGML approaches create problems of access: in
practice, access to them is usually enabled by converting them into HTML. It is the
access quality of HTML which has led us to define the SGML content description environment
for compound documents by accepting the formal formatting qualities of HTML and
enriching them with an SGML-based description of contents.
Our solution for compound documents is then an SGML environment defined
in a special DTD called DOBM DTD. This DTD adds content-descriptive qualities to HTML.
The concrete definitions of objects to be marked up are based on the application of
content-descriptive standards (e.g. AACR2) and on good practices established during work
with documents. They can change from one group or type of documents to another. The
concrete objects to be marked up in the compound document are defined in the so-called
map of the document, represented by the SGML file called DOBMENT. The structure of
DOBMENT is defined in its DTD. The rules are general; their specifications for
each type of document are mapped in special concrete applications of this new standardized
approach.
The SGML dependence of our application is as follows:
This CD-ROM contains the explanation of the new structure with examples
of encoded documents and also with special access software to show that the encoded
data can be not only viewed in your WWW browser, but also used and evaluated in a special
application.