Národní knihovna CR
Albertina icome Praha

 


 

The Format for Storage of Metadata 

Version 2.1

 

Contents:
1. Introduction
2. Data created during digitization
3. Definition of the format
3.1. The DOBM format
3.1.1. DOBM and SGML
3.1.2. DOBM and HTML
3.1.3. The category
3.1.4. The DOBM element
3.1.5. The DOBM.DX element
3.1.6. Access to data - DOBM.DATA element
3.1.7. The tree structure of DOBM files - DOBM.REFERENCE element
3.1.8. The sequence of DOBM files
3.1.9. The structure of categories
3.2. The DOBMENT format
3.2.1. DOBMENT and SGML
3.2.2. The structure of the DOBMENT file
3.3. Storage of documents on CD-ROM - the MNSXDEF.INF file
3.4. Writing of characters
4. Concrete application of the format
5. Conclusion
6. Literature


1. Introduction

The article follows the ideas presented in the document Proposal of the Structure of Digitized Old Books and Manuscripts [1] in which the authors explained the proposal of a format recommended for storage of digital copies of manuscripts and old printed books. 

Because the format has already been adopted by the Digitization Centre of the National Library of the Czech Republic in Prague, several problems have been identified [1]: for example, several digital copies cannot be merged in a simple way, and the format is difficult to use for the digitization of sound recordings or films. Other problems concern the complicated combination of two tree structures, the HTML structure and the structure of 'levels' of description, whose simultaneous use can lead to ambiguous interpretations when the data is processed by computer.

The aim of this work is to solve the above-mentioned problems and to offer a sufficiently general definition of an improved format for storage of digital copies of various types of documents.

The concrete application of the defined format to digital copies of manuscripts and old printed books [2] is based on this work. Specifications for digital copies of journals and newspapers, sound recordings, and other documents have also been written.

The documents [6], [7], and [8] are also parts of this proposal.
 

2. Data created during digitization

Two groups of data are created during digitization:

  1. digital copies of documents 
  2. a supporting structure, mostly textual, which enables access to the first group of data

Let us call the first group data and the second metadata. The data are, for example, images of the manuscript pages, while the metadata are descriptions of these pages.

The distinction between these two groups is rather imprecise, because a digital copy can often be a direct component of the description. In this case, the end-user's point of view is decisive: the metadata is the whole description, including any preview images embedded in the text, while the data is everything that is not a visible part of this description and is referenced from it as an external file.

If we want to proceed with digitization, we must decide which formats will be used for storage of the two groups of data. This article is concerned mainly with the metadata. We think that the format for its storage should meet the following requirements:

  1. It should be as independent as possible of the software that lets the user work with the metadata; only in this way will it be possible to use the digital copy in the future as well.
  2. It should allow metadata to be classified into categories such as author, shelf-number, or page number in the case of a book description. This classification is very useful for the mass processing of data by special software.
  3. It should allow hierarchical classification of metadata, for example to distinguish the description of the book as a whole from the description of a single page. Special software can also use this classification, according to its meaning, to present the metadata to the user appropriately.
  4. It should allow an easy transition from metadata to data.

HTML meets the first and fourth requirements very well. HTML is itself an application of the SGML standard [3]. Metadata stored in this way is very easy to share on the Internet. The disadvantage of HTML, however, is that it does not allow text to be classified according to its content; it therefore meets neither requirement 2 nor requirement 3. If we want to use HTML for storage of metadata, we must extend it in a certain way.
This article defines a format that extends HTML version 2.0 and meets all four requirements above; it can therefore be used for storage of metadata. The definition is followed by the specification of a format for storage of metadata of digital copies of manuscripts and old printed books [2].
 

3. Definition of the format

The files representing metadata can be grouped into three parts: 

  1. The tree structure of DOBM files, which contain the structured text from which references point to data. The contents of the DOBM documents are what users actually study.
  2. The DOBMENT file, which defines how the DOBM files can be accessed by software.
  3. The MNSXDEF.INF files, which are closely bound to CD-ROM carriers. These files contain basic information about the location of DOBM files on individual discs.

The data in the figure above is marked by red frames. Users access this data through DOBM files (black frames). These files contain the description of the data and are interrelated through significant references (see chapter 3.1.7. The tree structure of DOBM files - DOBM.REFERENCE element). They form the so-called tree structure.

The software accesses the structure, or only a part of it, with the help of the DOBMENT file (blue frame). This file defines which part of the tree structure can be accessed and of which objects it consists.
Each CD-ROM that is part of the digital copy must contain exactly one MNSXDEF.INF file with basic information related to this carrier. The MNSXDEF.INF file contains the identification of the CD-ROM and basic data about the location of files on the carriers.
The three subsections below define how each of the three parts of the metadata is stored.
 

   3.1. The DOBM format

The DOBM format has been designed for storage of descriptions (metadata) of digitized data. The descriptions in this format are stored in so-called DOBM files, which are interconnected through significant references.
 

   3.1.1. DOBM and SGML

The DOBM file is a text file complying with the internationally recognized SGML standard [4]. Its structure is defined in a standard manner in the DOBM DTD [6].
 

Note:  Any file complying with the SGML standard consists of a DOCTYPE declaration and an element. Each element is marked by the so-called tags (special symbols) and it consists of characters and other elements through which it forms a certain structure. The DTD is a formal description of this structure.


The first line of the DOBM file must contain the DOCTYPE declaration <!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">.

The SGML declaration is based on HTML [3] and is given in [8].
 

   3.1.2. DOBM and HTML

The DOBM document is in fact stored in the HTML version 2.0 format. It differs from a usual HTML file only in the DOCTYPE declaration, in a wider set of tags, and possibly in character coding (see chapter 3.4).

HTML only prescribes how the text is formatted visually: which fonts are used and where, where an image is placed, where a chapter heading is located, etc. Descriptions of documents, however, can typically be classified into various categories according to their meaning, and these categories form a hierarchy. Such differences of content are not always represented visually in formatted documents, but they are very important for the mass processing of metadata by special software.

The classification of metadata according to its meaning is made possible by new special tags. (HTML browsers usually skip unknown tags; therefore DOBM files can be read by them and shared even on the Internet.)

Any extension tag has this format:

    <Name and  parameters>

The name always starts with the letters DOBM (<DOBM>, <DOBM.DX>, ... ).
The extension elements are optional. If, in an HTML 2.0 file, any of the declarations
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">,
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">,
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Level 2//EN">
or <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN">
is replaced by the declaration
<!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">,
the resulting document complies with all the requirements of the DOBM format.
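
The replacement of the declaration described above can be sketched in a few lines of Python. This is only an illustration, not part of the format: the function name html2_to_dobm is hypothetical, and the sketch assumes that the declaration appears once, written with single spaces as in the four variants listed above.

```python
import re

# Matches the four HTML 2.0 DOCTYPE variants listed in the text
# (an assumption: single spaces inside the public identifier).
HTML2_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+HTML\s+PUBLIC\s+'
    r'"-//IETF//DTD HTML(?: 2\.0)?(?: Level 2)?//EN"\s*>',
    re.IGNORECASE,
)

DOBM_DOCTYPE = '<!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">'


def html2_to_dobm(text):
    """Return the text with its HTML 2.0 DOCTYPE replaced by the DOBM one."""
    converted, replaced = HTML2_DOCTYPE.subn(DOBM_DOCTYPE, text, count=1)
    if replaced == 0:
        # Not one of the four recognised declarations: leave the decision
        # to the caller instead of guessing.
        raise ValueError("no HTML 2.0 DOCTYPE declaration found")
    return converted
```

A file converted in this way contains no DOBM tags at all, which is allowed because the extension elements are optional.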
 

 

   3.1.3. The category

A category is a class of metadata (or data) items that share the same meaning. Three types of categories are defined:

  1. DOBM category, to which entire DOBM files can be assigned. Thanks to this, the special software can make the distinction between, for example, files containing  bibliographic data about the book and files describing individual pages of the book.
  2. DX category, to which parts of DOBM files can be assigned. Thanks to this, the software can, for example, find the shelf-number or the author in a file with bibliographic description. 
  3. DATA category, to which only data can be assigned. Thanks to DATA categories, it is possible to make the distinction e.g. between various quality levels of  digital copies. 


Each category has its own label and denomination. DX and DATA categories have also the type attribute. 

The label identifies the category. It consists of ASCII letters and digits and always begins with a letter; upper and lower case are not distinguished. Concrete applications of the DOBM format can define these labels and assign meanings to them. Each such concrete application also has its own identification. Together, the identification of the concrete application and the label fully identify the category. For example, under "NKP//MANUSCRIPT 2.1" the label LIBRARY identifies the name of the library (see chapter 4). The label is important for the special software.

The denomination of the category is interesting especially for users. The denomination should inform clearly about the meaning of the category. 

For the DX category the DOBM format defines these types: TEXT, NUMBER, DATE, or PICTURE. The DATA category can have the types TEXT, SOUND, IMAGE, or VIDEO. Categories that have the same label and come from the same concrete application must have the same type.
 

   3.1.4. The DOBM element

The tag DOBM marks the whole text of the DOBM file. It has the parameters: 

  • SPEC - identification of the concrete application of the DOBM format that governs the structuring of the document. For example, "NKP//MANUSCRIPT 2.1" signifies that SHELFNO marks the shelf-number of the manuscript. The default value is an EMPTY STRING.
  • LANG - parameter based on the ISO 639 standard [5]; it states the language in which the file has been written. Its default value is EN.
  • ENCODING - parameter based on the ISO 646 standard; it defines the code page used for the creation of the file. The default value is "ISO646 - ASCII". For more information see chapter 3.4. Writing of characters.
  • CTGLABEL - label of the category to which the entire document belongs. For example, the above-mentioned value of the SPEC parameter permits the values MANUSCRIPT for the file with the description of the entire manuscript, GALLERY for the file with a gallery of small images of manuscript pages, PAGE for files with the description of one page of the manuscript, etc. The default value is HTML.
  • NAME - denomination of the category to which the document belongs. The NAME is written in the language defined by LANG. The default value is HTML.

   Example of the correct location of the DOBM tag in the  DOBM file:

    <!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">
    <DOBM SPEC="NKP//MANUSCRIPT 2.1" CTGLABEL=PAGE
     NAME="Page" LANG=EN>
    <HTML>
    <HEAD>
         ...
    </HEAD>
     <BODY>
         ...
    </BODY>
    </HTML>
    </DOBM>

 

    3.1.5.  The DOBM.DX element

The tag DOBM.DX places a part of the DOBM file into a DX category. It has four parameters: 

  • LANG - defines the language of the text inside the element marked in this way. The default value is taken from the enclosing DX element or, if there is none, from the DOBM tag. The parameter applies only to the TEXT type.
  • CTGLABEL - label of the category.
  • NAME - denomination of the category in the language of the enclosing DX element or, if there is none, in the language of the DOBM element.
  • TYPE - type of the category. It can be TEXT, NUMBER, DATE, or PICTURE. The default value is TEXT.

The text inside the DOBM.DX tags can contain any HTML formatting tags, for example <I> ... </I> or <B> ... </B> (subject to the limitations imposed by the TYPE parameter); it can also contain other DOBM.DX elements and even references to data (DOBM.DATA elements).
 

   Example of the correct location of the DOBM.DX tag in the DOBM file:

    <TITLE>...</TITLE>
    <H3>Bibliographic Description</H3>
    <DOBM.DX CTGLABEL=BIBLDESCR NAME="Bibliographic Description">
        ...
    <DT>Date of Publication</DT>
    <DD>
    <DOBM.DX CTGLABEL=DATOFPUBL NAME="Date of Publication" TYPE=DATE>
      1756
    </DOBM.DX>
    </DD>
       ...
    </DOBM.DX>
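
How special software might read such nested categories can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not a reference implementation: the names DxCollector and collect_dx are hypothetical, and a tolerant HTML parser stands in for a real SGML parser driven by the DOBM DTD.

```python
from html.parser import HTMLParser


class DxCollector(HTMLParser):
    """Collects (label, type, text) for every DOBM.DX element."""

    def __init__(self):
        super().__init__()
        self._open_dx = []   # stack of DOBM.DX elements currently open
        self.found = []      # finished elements, innermost first

    def handle_starttag(self, tag, attrs):
        if tag == "dobm.dx":  # html.parser lower-cases tag names
            params = dict(attrs)
            self._open_dx.append({
                "label": (params.get("ctglabel") or "").upper(),
                "type": (params.get("type") or "TEXT").upper(),  # default TEXT
                "text": [],
            })

    def handle_endtag(self, tag):
        if tag == "dobm.dx" and self._open_dx:
            element = self._open_dx.pop()
            text = " ".join("".join(element["text"]).split())
            self.found.append((element["label"], element["type"], text))

    def handle_data(self, data):
        # Text belongs to every DOBM.DX element that encloses it.
        for element in self._open_dx:
            element["text"].append(data)


def collect_dx(document):
    collector = DxCollector()
    collector.feed(document)
    collector.close()
    return collector.found


SAMPLE = """
<DOBM.DX CTGLABEL=BIBLDESCR NAME="Bibliographic Description">
<DT>Date of Publication</DT>
<DD>
<DOBM.DX CTGLABEL=DATOFPUBL NAME="Date of Publication" TYPE=DATE>
  1756
</DOBM.DX>
</DD>
</DOBM.DX>
"""
```

Calling collect_dx(SAMPLE) yields the inner DATOFPUBL element first, because its end tag is reached before the end tag of the enclosing BIBLDESCR element.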

 

   Meaning of the TYPE parameter in the DOBM.DX tag

The optional TYPE parameter has the values TEXT (default), NUMBER, DATE, or PICTURE. Upper and lower case are not distinguished.

  • TEXT - in practice, this is the most used type. If the TYPE parameter is not given in the DOBM.DX tag, the type of the given category is TEXT. Categories with the TEXT type cover any text, including HTML formatting tags (e.g. <P>, <I> ... </I>, <B> ... </B>, ... ) and inline images.

Example of the correct mark-up of the metadata belonging to the category with the TEXT type:
                   ...

    <H3>Annotation</H3>
    <DOBM.DX CTGLABEL=ANNOT NAME="Annotation">
    <P>
               ...
    <P>
               ...
    </P>
    </DOBM.DX>
    <P>
               ...

  • NUMBER - categories with the NUMBER type include only numbers. The text marked by tags whose TYPE parameter equals NUMBER must be (blanks and HTML formatting tags omitted) a decimal number or an interval of decimal numbers in one of the following formats:

   a) Number:
         decimalnumber
         decimalnumberEinteger
         decimalnumbereinteger

   b) Interval:
         number..number
         ..number
         number..

   For storage of physical quantities a special format can be used:
         decimalnumber(naturalnumber)
         decimalnumber(naturalnumber)Einteger
         decimalnumber(naturalnumber)einteger

   The decimal point is used for writing a decimal fraction. In the patterns above:
         naturalnumber stands for numbers such as 0, 1, 2, 3, ...
         decimalnumber stands for numbers such as 10.8, +21.55, -6, 0.08
         integer stands for numbers such as 4, +56, -1025
         number stands for any number of the form shown under a).

Example of metadata belonging to the category with the NUMBER type: 

      10.5         10.5e5         10.1e-1..10.2E3          10.55(25)e+5 

 


 

Example of the correct mark-up of metadata belonging to the category with the NUMBER type: 

     <DOBM.DX CTGLABEL=PAGE NAME="Page" TYPE=NUMBER>
         23 
     </DOBM.DX> 
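
The NUMBER patterns above translate directly into a regular expression. The following Python sketch is only an illustration; the names normalise and is_valid_number are hypothetical. It reads the text literally, so the parenthesised form for physical quantities is accepted on its own but not inside an interval, which is an assumption because the text does not say either way.

```python
import re

DECIMAL = r"[+-]?\d+(?:\.\d+)?"        # 10.8, +21.55, -6, 0.08
INTEGER = r"[+-]?\d+"                   # 4, +56, -1025
NUMBER = rf"{DECIMAL}(?:[eE]{INTEGER})?"                 # form a)
PHYSICAL = rf"{DECIMAL}\(\d+\)(?:[eE]{INTEGER})?"        # decimalnumber(naturalnumber)
INTERVAL = rf"(?:{NUMBER})\.\.(?:{NUMBER})|\.\.(?:{NUMBER})|(?:{NUMBER})\.\."

VALUE_RE = re.compile(rf"(?:{INTERVAL}|{PHYSICAL}|{NUMBER})", re.ASCII)


def normalise(raw):
    """Drop HTML formatting tags and blanks, as the format prescribes."""
    return re.sub(r"<[^>]*>|\s+", "", raw)


def is_valid_number(raw):
    return VALUE_RE.fullmatch(normalise(raw)) is not None
```

All four example values shown above (10.5, 10.5e5, 10.1e-1..10.2E3, and 10.55(25)e+5) are accepted by this sketch, as is the marked-up value 23.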
 

 

  • DATE - categories of this type include only dates, which must match (blanks and HTML formatting tags omitted) one of the following patterns:

   a) Date:
         day.month.year          month.year          year
         month/day/year          month/year
         year-month-day          year-month
         centuryc.
         day.month.yearB.C.      month.yearB.C.      yearB.C.
         month/day/yearB.C.      month/yearB.C.
         year-month-dayB.C.      year-monthB.C.
         centuryc.B.C.

   b) Interval:
         ..date
         date..
         date..date

   Here day is an integer in the range 1..31, month an integer in the range 1..12, and year and century are integers from 1 upwards. The suffix c. indicates a century and B.C. means before the Christian era. In b), date stands for any value of the form shown under a).

 

Example of metadata belonging to the category with the DATE type: 

    1450..1500             15c...16c.             1950-12-01 

 

Example of the correct mark-up of metadata belonging to the category with the DATE type: 

    <DOBM.DX CTGLABEL = DatOfPubl NAME="Date 
      of Publication" TYPE = DATE>23.5.  
    <B>1705</B></DOBM.DX> 
    <DOBM.DX CTGLABEL = DatOfPubl NAME="Date 
      of Publication" TYPE = DATE > 
      1700..1750  
    </DOBM.DX> 
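
The DATE patterns can be checked in the same spirit. The Python sketch below is an illustration only; the name is_valid_date is hypothetical. It assumes that leading zeros are allowed (as in the example 1950-12-01) and it checks the ranges of day and month but not whether the day exists in the given month.

```python
import re

DAY = r"(?:0?[1-9]|[12]\d|3[01])"       # 1..31
MONTH = r"(?:0?[1-9]|1[0-2])"           # 1..12
YEAR = r"0*[1-9]\d*"                     # 1.. (also used for the century)

DATE_CORE = (
    rf"(?:{DAY}\.{MONTH}\.{YEAR}|{MONTH}\.{YEAR}"
    rf"|{MONTH}/{DAY}/{YEAR}|{MONTH}/{YEAR}"
    rf"|{YEAR}-{MONTH}-{DAY}|{YEAR}-{MONTH}"
    rf"|{YEAR}c\.|{YEAR})"
)
DATE = rf"{DATE_CORE}(?:B\.C\.)?"
VALUE_RE = re.compile(
    rf"(?:{DATE}\.\.{DATE}|\.\.{DATE}|{DATE}\.\.|{DATE})", re.ASCII
)


def is_valid_date(raw):
    # Blanks and HTML formatting tags are omitted before matching.
    text = re.sub(r"<[^>]*>|\s+", "", raw)
    return VALUE_RE.fullmatch(text) is not None
```

Note that the interval 15c...16c. is read as the century 15c. followed by the interval sign and the century 16c.; the regular expression resolves this by backtracking.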
 

 

  • PICTURE - categories of this type include only images.

Example of the correct mark-up of metadata belonging to the category with the PICTURE type: 

   <DOBM.DX CTGLABEL = Preview NAME="Preview quality 
    picture" TYPE = PICTURE> 
   <IMG SRC="preview/0001.gif">
   </DOBM.DX> 

 

    3.1.6. Access to data -  DOBM.DATA element

The metadata must fulfil two important functions: to describe the data and to provide access to it. The latter function is provided by references to data in the DOBM files.

The classification of data into categories is enabled by the DOBM.DATA tag with the parameters HREF, CTGLABEL, NAME, TYPE, LANG,  and ENCODING. 

HREF contains the URL of the file with data. 

ENCODING marks the code page used in the data file. The default value of this parameter is given by the parameter ENCODING of the  DOBM tag; for the list of possible values see the chapter 3.4. Writing of characters. 
 
The signification of the parameters CTGLABEL, NAME, LANG is the same as in the case of the DOBM.DX tag. 

The parameter TYPE is the type of the data category. It can have the values TEXT, IMAGE, SOUND, or MOTION.

 

Note: The parameters LANG and ENCODING are applied only for the type TEXT. 

 

The element DOBM.DATA is empty and has no end tag. It is placed between the tags DOBM and HTML.
As HTML browsers do not recognize the DOBM.DATA tags, it is recommended to write the corresponding HTML hypertext references (<A HREF=... >...</A>) within the body of the document (between the tags <HTML> ... </HTML>).

Example of the correct location of the DOBM.DATA tag in the DOBM file: 

   <!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">
   <DOBM SPEC="NKP//MANUSCRIPT 2.1" 
     CTGLABEL=MANUSCRIPT NAME="Manuscript description">
   <DOBM.DATA HREF="high/0001r.jpg" CTGLABEL=HIGHQ
     NAME="High quality picture">
   <DOBM.DATA HREF="preview/0001r.gif" CTGLABEL=PREVIEWQ
     NAME="Preview quality picture">
   <HTML>  
        ...  
   </HTML>  
 


  3.1.7. The tree structure of DOBM files - DOBM.REFERENCE element

The DOBM files describing the digital copy can be arranged in a tree structure. This is done by so-called significant references given by the DOBM.REFERENCE tags. Each such tag represents a reference that points to another DOBM metadata file. The referenced DOBM file is thus subordinate to the DOBM file containing the reference. Caution: such references must form a tree structure; that is, they may not point to a file that has already been included in the metadata by another significant reference.


 


The tag DOBM.REFERENCE has the parameters HREF,  SPEC, CTGLABEL, LANG, ENCODING, and NAME. 

The parameter HREF contains the URL of the subordinate DOBM file  to which the REFERENCE points. 

The parameters SPEC, CTGLABEL, and LANG correspond to the same parameters of the DOBM element in the subordinate file. 

NAME is the denomination of the  DOBM category of the subordinate file in the language of the file from which the reference is made. 
 

Note:  The default values of SPEC, LANG, and ENCODING of the subordinate file are given by the values of  SPEC, LANG, and ENCODING of the DOBM element which contains the reference to the subordinate file.


The DOBM.REFERENCE elements have no end tag; their start tags are placed just after the DOBM tag and before the tags DOBM.DATA and HTML. As HTML browsers do not recognize the DOBM.REFERENCE tags, it is recommended to add the corresponding HTML hypertext references (<A HREF=... >...</A>) to the body of the document (<HTML> ... </HTML>) as well.

   Example of the correct location of the DOBM.REFERENCE tags:   

   <!DOCTYPE DOBM PUBLIC "-//AIP//DTD DOBM 2.1//EN">
   <DOBM SPEC="NKP//MANUSCRIPT 2.1"  
     CTGLABEL=MANUSCRIPT  NAME="Manuscript description">   
   <DOBM.REFERENCE HREF="0001r.htm" CTGLABEL=PAGE
     NAME="Page">
   <DOBM.REFERENCE HREF="0021r.htm" CTGLABEL=PAGE
     NAME="Page">
   <HTML>   
        ...   
   </HTML>   
 

 

    3.1.8. The sequence of DOBM files

The tree structure of DOBM files can be converted into a sequence. The order of the DOBM files is the order in which they are reached by a depth-first traversal of the tree, in which the significant references of each file are followed in the order in which their tags are written in that DOBM file.
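
The conversion into a sequence can be sketched as a depth-first walk. The Python example below is an illustration under assumptions: the function name dobm_sequence and the file names are hypothetical, the references are given as a ready-made mapping instead of being read from DOBM.REFERENCE tags, and a file is assumed to precede its subordinate files (pre-order), which the text does not state explicitly.

```python
def dobm_sequence(references, roots):
    """Return the DOBM files in depth-first order.

    references maps a file name to the list of files it points to through
    significant references, in the order the tags appear in that file.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            # A second significant reference to the same file would break
            # the tree structure required by the format.
            raise ValueError(f"{name} is referenced more than once")
        seen.add(name)
        order.append(name)
        for child in references.get(name, []):
            visit(child)

    for root in roots:
        visit(root)
    return order


# Hypothetical example: a description file pointing to a gallery and a page.
EXAMPLE = {
    "descr.htm": ["gallery.htm", "0001r.htm"],
    "gallery.htm": ["0002r.htm"],
}
```

For the example mapping, dobm_sequence(EXAMPLE, ["descr.htm"]) visits descr.htm, gallery.htm, 0002r.htm, and then 0001r.htm.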
 

    3.1.9. The structure of categories

As said already, the categories are considered to be identical if they have the same label (parameter CTGLABEL) and if they are controlled by the same concrete application of the DOBM format (parameter SPEC). 
In the same way as the DOBM.REFERENCE tags form a tree structure, their DOBM categories form a directed graph. This graph is not necessarily a tree. For example, it is possible to have a significant reference from a DOBM file belonging to the PAGE category to another file of the same category; the PAGE category then points to itself. Some files of the PAGE category can be accessed through the DOBM file of the GALLERY category, while other files of the same PAGE category can be accessed through the file of the BOOK category.

 

    However, the references between categories cannot be arbitrary. If a file points to another through a significant reference, the latter file completes the information in the former. The category of the referencing file is superior or equal to the category of the referenced file. If one category is superior to another, the second cannot be superior to the first. Superiority need not be defined between every pair of categories (see categories 2 and 3 in the first part of the figure above), but it must be transitive; that is, the graph of DOBM categories may not contain any loops other than loops representing a reference to the same category (see category 2 in the second part of the figure above).
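
The rule that the graph of DOBM categories may contain no loops other than a category pointing to itself can be checked with a standard depth-first search. The Python sketch below is illustrative; the function name has_forbidden_loop and the representation of references as (superior, referenced) pairs are assumptions.

```python
def has_forbidden_loop(edges):
    """Return True if the category graph has a loop other than a self-loop.

    edges is an iterable of (referencing_category, referenced_category).
    """
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set())
        graph.setdefault(target, set())
        if source != target:          # a category may point to itself
            graph[source].add(target)

    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    state = dict.fromkeys(graph, WHITE)

    def visit(node):
        state[node] = GREY
        for successor in graph[node]:
            if state[successor] == GREY:
                return True           # reached a category on the current path
            if state[successor] == WHITE and visit(successor):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in graph)
```

With the categories used in this document, the references MANUSCRIPT to GALLERY, GALLERY to PAGE, MANUSCRIPT to PAGE, and PAGE to PAGE are acceptable, while an additional reference from PAGE back to GALLERY would create a forbidden loop.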

  

    A similar situation can also arise for DX categories. Because DOBM.DX elements can be nested, they form graphs similar to those shown above. The category of an element that contains another element is superior to the category of the nested element. For this superiority, too, the format requires non-ambiguity and transitivity. Thus it is not possible to have an element of category A inside an element of category B in one file and the reverse in another file. Furthermore, an element of a DX category A may not contain another element of the same category A.

 

 3.2. The DOBMENT format

The structure of DOBM files can be rather complex; therefore, it is useful to declare it in advance. Special software can then, for example, substantially shorten the time needed to process the metadata. The tree structure is described by an SGML file called DOBMENT. It contains the names of the root files from which the tree of DOBM files grows through significant references, and the structure of the categories that the software accesses. The DOBMENT document also makes it possible to limit from the bottom (in the direction from the leaves) the part of the tree of DOBM files that the software will access.
 

   3.2.1. DOBMENT and SGML

The DOBMENT file is an SGML file [4]; its structure is defined in the DOBMENT DTD [7].

The first line must contain the DOCTYPE declaration <!DOCTYPE DOBMENT PUBLIC "-//AIP//DTD DOBMENT//EN">.

The SGML declaration is based on HTML [3] and is given in [8].
 

   3.2.2. The structure of the DOBMENT file

The tag DOBMENT marks the whole SGML file. It has the parameter LANG (default value EN), a two-letter abbreviation of the language as given by the ISO 639 standard [5]. The LANG value defines the language of "access" to the DOBM files; it matters only for the interpretation of special tags (parameter NAME).

The DOBMENT file consists of the definition of the root files (element ROOTS) and the declaration of the categories (element CTGSET). A root file is the top of a tree of DOBM files formed by significant references. The digital copy can consist of several such trees. The root files may not have any subordinate file in common, and the trees that start from them must together cover all the DOBM files of the digital copy.

The order in which the root files are defined determines the order in which the special software processes the metadata. The software starts with the tree whose root is defined first, then takes the tree whose root is defined second, etc.

   Example of the basic structure of the DOBMENT file:   

   <!DOCTYPE DOBMENT PUBLIC "-//AIP//DTD DOBMENT//EN">   
   <DOBMENT>   
   <ROOTS>   
       ...   
   </ROOTS>   
   <CTGSET>   
       ...   
   </CTGSET>   
   </DOBMENT> 

The element ROOTS contains only the elements BEG. 

The element BEG is empty. It has the parameters HREF, CTGLABEL, SPEC, NAME, LANG, ENCODING. The parameter HREF contains the URL of the root DOBM file from which the special software starts to process the metadata. Through significant references from the root document, the software can also access subordinate DOBM files. The NAME defines the denomination of the category of the referenced file in the language given by the parameter LANG of the  DOBMENT tag. The other parameters must be identical with the parameters of the DOBM tag of the referenced file. The default values for CTGLABEL, SPEC, NAME, and  ENCODING are HTML, EMPTY STRING, HTML, ISO646 ASCII. The default value for LANG is given by the parameter LANG of the DOBMENT tag. 

The element CTGSET contains the list of DOBM elements that hold the declarations of the DOBM, DX, and DATA categories. (Both DOBM and DOBMENT files contain a DOBM element, but it has a different meaning in each.)

The DOBM element, with the parameters SPEC, CTGLABEL, and NAME, declares the DOBM categories of all the DOBM files that the special software must go through. The values of SPEC and CTGLABEL must be identical with the values of the same parameters of the DOBM tag in the files whose categories are declared in this way, except that the version part of SPEC may be greater than or equal to the version indicated in the DOBM file (see chapter 4. Concrete application of the format). By default, SPEC, CTGLABEL, and NAME are EMPTY STRING, HTML, and HTML. NAME gives the denomination of the category in the language given by the parameter LANG of the DOBMENT tag.

The DOBM element contains a succession of declarations of all the significant references, of all the categories of data, and of all the categories of metadata.

The significant references are declared by empty REFERENCE elements with the parameters SPEC and CTGLABEL. The values of the parameters are identical with the values of the parameters of the DOBM.REFERENCE tag for the significant reference which is declared in this way. The significant references which are not declared in advance are disregarded by software. In this manner, it is possible to limit from the bottom (in the direction from the leaves) the tree of the DOBM files to which the user will get access through software. The SPEC is equal by default to the parameter SPEC of the superior DOBM element. 

The categories of  data are declared by the empty DATA elements with the parameters CTGLABEL, NAME, and TYPE whose signification is similar to that from the preceding elements. 

The categories of metadata are declared by the DX elements with the parameters CTGLABEL, NAME, and TYPE. The elements can contain also other DX elements. The way of inclusion of these declarations into each other must strictly correspond to the nesting of categories within the DOBM files. Some DX categories can be declared even several times. This depends on the fact within which elements they can occur. However, any such declaration must be identical and it must contain the same nested elements. 

   Example of the DOBMENT file:   

 <!DOCTYPE DOBMENT PUBLIC "-//AIP//DTD DOBMENT//EN">   
 <DOBMENT>   

   <ROOTS>   
   <BEG HREF=DESCR.HTM>   
   </ROOTS>   

   <CTGSET>   
    
   <DOBM SPEC="NKP//MANUSCRIPT 2.1"   
     CTGLABEL=MANUSCRIPT  NAME="Manuscript Description">   
   <REFERENCE CTGLABEL=GALLERY NAME="Gallery">   
   <DX CTGLABEL=BIBLDESCR NAME="Bibliographic Description">   
          ...   
   <DX CTGLABEL=SHELFNO NAME="Shelf-Number"> ... </DX>   
          ...   
   </DX>   
   <DX CTGLABEL=CAPTURE NAME="Image Capturing Data"> ... </DX>   
   </DOBM>   
   

   <DOBM SPEC="NKP//MANUSCRIPT 2.1" CTGLABEL=GALLERY  
     NAME="Gallery">   
   <REFERENCE CTGLABEL=PAGE NAME="Page">   
          ...     
   </DOBM>   
   

   <DOBM CTGLABEL=PAGE NAME="Page">   
   <DATA CTGLABEL=HIGHQ NAME="High Quality Picture">   
   <DATA CTGLABEL=LOWQ NAME="Low Quality Picture">   
         ...   
   </DOBM>   

  </CTGSET>   
</DOBMENT>   
   


    This DOBMENT file is the example of the correct declaration of the following structure of DOBM files:

 


3.3. Storage of documents on CD-ROM - the MNSXDEF.INF file

The digital copies of documents are usually stored on CD-ROM media. A single CD-ROM can contain several documents, and a single document can span several CD-ROMs. Every CD-ROM must contain in its root directory the MNSXDEF.INF file with the information described below.

[Disc] - Marks the section with information about the CD-ROM. Mandatory.
Version - The version of the MNSXDEF.INF file. This version is 2.1. Mandatory.
NoOfDocuments - The number of documents on the disc. The default value is 1.
[Document_1] - Marks the section with information about digital copy no. 1. Mandatory.
DocID - Unambiguous identification of the digital copy. Mandatory. It consists of 2 characters defining the country as in ISO 3166, 1 - 5 digits giving the telephone code of the town, 3 - 8 characters identifying the producer, a slash, and 8 - 32 characters identifying the digital document. DocID can contain only visible ASCII characters and no '/' other than the single separator. No two documents anywhere may have the same DocID. A correct DocID is, e.g., cz311aip/smriechental.

Note: Copies of the same document stored on a different number of CD-ROMs are considered to be different documents; therefore, they must differ in their DocID.

NoOfDiscs - The number of discs on which the digital copy is stored. The default value is 1.
CurrDisc - The number of the current CD-ROM. The default value is 1.
EntryPoint - The relative path to the DOBMENT file. Mandatory for the CD-ROM with CurrDisc = 1. The other discs do not contain this parameter.
[Document_2] - Marks the section with information about digital copy no. 2.

...

     
The whole MNSXDEF.INF file may contain only ASCII characters. No distinction is made between lower and upper case.
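
The DocID rule given above can be checked mechanically. The following Python sketch is only an illustration and rests on assumptions that the text does not state: the two country characters are taken to be letters, the telephone code to be decimal digits, and the function name is_valid_docid is hypothetical. It does not verify that the country code really exists in ISO 3166, and it cannot verify worldwide uniqueness.

```python
import re

# Visible ASCII is 0x21-0x7E. The class below leaves out "/" (0x2F),
# which may appear only once, as the separator.
_VISIBLE = r"[!-.0-~]"

DOCID_RE = re.compile(
    r"[A-Za-z]{2}"          # country (assumed: two letters, as in ISO 3166)
    r"[0-9]{1,5}"           # telephone code of the town (assumed: digits)
    rf"{_VISIBLE}{{3,8}}"   # identification of the producer
    r"/"                    # the single separator
    rf"{_VISIBLE}{{8,32}}"  # identification of the digital document
)


def is_valid_docid(docid: str) -> bool:
    """Return True if docid has the shape described for the DocID parameter."""
    return DOCID_RE.fullmatch(docid) is not None


print(is_valid_docid("cz311aip/smriechental"))
print(is_valid_docid("cz311aip/short"))
```

Because the producer part may itself begin with digits, the boundary between telephone code and producer is not always unique; the regular expression only decides whether at least one valid split exists.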

The DOBMENT file must be on the CD-ROM with CurrDisc = 1.

If a digital copy is stored on several CD-ROMs and its contents are copied onto a single CD-ROM (with enough capacity), then this copy must comply with all the rules given above.

Furthermore, the data and the metadata pointing to them through significant references must be written on one CD-ROM only. The DOBM files must be distributed over the media as follows: the first CD-ROM contains the files that come first in the sequence (see section 3.1.8), the second CD-ROM the files that follow them, the third CD-ROM the files that follow those, and so on.

Concrete applications can add further information to this file.
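
The rules for MNSXDEF.INF can be illustrated by a small reader. The sketch below is written in Python and assumes that the file uses the common INI syntax ("Key = value" lines under bracketed section headers), which the text suggests but does not state explicitly. The function name read_mnsxdef and the EntryPoint path in the sample are hypothetical.

```python
import configparser

# Hypothetical file contents; the EntryPoint path is invented for illustration.
SAMPLE = """\
[Disc]
Version = 2.1
NoOfDocuments = 1

[Document_1]
DocID = cz311aip/smriechental
NoOfDiscs = 2
CurrDisc = 1
EntryPoint = data/dobment.sgm
"""


def read_mnsxdef(text: str) -> dict:
    """Parse MNSXDEF.INF text, apply the defaults, check mandatory items."""
    text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII content
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # Case is not significant: lower-case the section names here;
    # configparser already lower-cases the parameter names.
    sections = {name.lower(): dict(parser[name]) for name in parser.sections()}

    disc = sections.get("disc")
    if disc is None or "version" not in disc:
        raise ValueError("[Disc] section with Version is mandatory")
    count = int(disc.get("noofdocuments", "1"))

    documents = []
    for i in range(1, count + 1):
        doc = sections.get(f"document_{i}")
        if doc is None or "docid" not in doc:
            raise ValueError(f"[Document_{i}] with DocID is mandatory")
        curr = int(doc.get("currdisc", "1"))
        entry = doc.get("entrypoint")
        if curr == 1 and entry is None:
            raise ValueError("EntryPoint is mandatory when CurrDisc = 1")
        documents.append({
            "docid": doc["docid"],
            "noofdiscs": int(doc.get("noofdiscs", "1")),
            "currdisc": curr,
            "entrypoint": entry,
        })
    return {"version": disc["version"], "documents": documents}


info = read_mnsxdef(SAMPLE)
print(info["documents"][0]["docid"], info["documents"][0]["entrypoint"])
```

Parameters added by a concrete application are simply ignored by this reader; a real implementation would probably keep them.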
 

3.4. Writing of characters

The MNSXDEF.INF file may contain only ASCII characters. 

The basic character set in the DOBM and DOBMENT files is ASCII. It is for this code page that the SGML declaration is written [9]. All characters outside this code page must be written with the help of ampersand symbols (e.g. &amp; is & or &aacute; is á). The difference from HTML is that ampersand symbols may be used not only for the characters of ISO Latin 1, but also for the other characters defined in Annex D of ISO 8879 - 1986 [4].

E.g.   

d&rcaron;&iacute;ve    dříve    
&quot;&amp;&quot;      "&"  

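
The two examples above can be reproduced with the Python standard library. This is only a sketch under a stated assumption: Python's table of HTML5 entity names is used as a stand-in for the ISO 8879 entity sets. The two tables agree on the names used here (rcaron, iacute, quot, amp) but are not guaranteed to be identical. The function names are hypothetical.

```python
import html
from html.entities import html5

# Reverse table: one non-ASCII character -> one symbol name.
# Where a character has several names, the alphabetically first is kept;
# that choice is arbitrary.
_NAME_FOR = {}
for name, value in sorted(html5.items()):
    if name.endswith(";") and len(value) == 1 and ord(value) > 127:
        _NAME_FOR.setdefault(value, name[:-1])


def decode_symbols(text: str) -> str:
    """Replace ampersand symbols by the characters they stand for."""
    return html.unescape(text)


def encode_symbols(text: str) -> str:
    """Write text in pure ASCII, using ampersand symbols where needed."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append("&amp;" if ch == "&" else ch)
        elif ch in _NAME_FOR:
            out.append(f"&{_NAME_FOR[ch]};")
        else:
            raise ValueError(f"no named symbol known for {ch!r}")
    return "".join(out)


print(decode_symbols("d&rcaron;&iacute;ve"))
print(decode_symbols("&quot;&amp;&quot;"))
print(encode_symbols("dříve"))
```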
 

However, these ampersand symbols are not displayed correctly in HTML browsers; therefore, other code pages may also be used in the DOBM files. These must be 8-bit code pages whose lower half is ASCII. The name of the code page used must additionally be given in the ENCODING parameter of the DOBM element.
 
   
Possible values of the ENCODING parameter in the elements DOBM and DOBM.DATA:   

   ISO646 ASCII 
   ISO6937 Latin alphabet 
   ISO8859-1 ISO Latin 1 
   ISO8859-2 ISO Latin 2 
   ISO8859-3 ISO Latin 3 
   ISO8859-4 ISO Latin 4 
   ISO8859-5 ISO Cyrillic 
   ISO8859-7 ISO Greek 
   ISO8859-9 ISO Latin 5 
   ISO8859-10 ISO Latin 6 
   cp437 PC standard 
   cp850 PC Latin 1 
   cp852 PC Latin 2 
   cp853 PC Turkey 
   cp855 PC Cyrillic 
   cp857 PC Turkey 
   cp860 PC Portugal 
   cp861 PC Iceland 
   cp863 PC Canada-French 
   cp865 PC Norway 
   cp866 PC Russian 
   cp869 PC Greek 
   cp897 PC Hungarian (WP) 
   cp1250 MS Windows Latin 2 
   cp1251 MS Windows Cyrillic 
   cp1252 MS Windows Latin 1 
   cp1253 MS Windows Greek 
   Apple 
   Apple-CE 
   Apple-Cyrillic 
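
Software that reads DOBM files has to translate the ENCODING value into a decoder. The following Python sketch maps most of the labels listed above to codec names of the Python standard library. The mapping is an assumption of this example, not part of the format: in particular, Apple is taken to mean Mac Roman and Apple-CE to mean Mac Central European. ISO6937, cp853, and cp897 are left out because Python ships no codec for them. Treating the label as case-insensitive and defaulting to ISO646 are also assumptions.

```python
import codecs

# ENCODING label (upper-cased) -> Python codec name. Assumed mapping.
PYTHON_CODEC = {
    "ISO646": "ascii",
    "ISO8859-1": "iso8859_1", "ISO8859-2": "iso8859_2",
    "ISO8859-3": "iso8859_3", "ISO8859-4": "iso8859_4",
    "ISO8859-5": "iso8859_5", "ISO8859-7": "iso8859_7",
    "ISO8859-9": "iso8859_9", "ISO8859-10": "iso8859_10",
    "CP437": "cp437", "CP850": "cp850", "CP852": "cp852",
    "CP855": "cp855", "CP857": "cp857", "CP860": "cp860",
    "CP861": "cp861", "CP863": "cp863", "CP865": "cp865",
    "CP866": "cp866", "CP869": "cp869",
    "CP1250": "cp1250", "CP1251": "cp1251",
    "CP1252": "cp1252", "CP1253": "cp1253",
    "APPLE": "mac_roman",            # assumption
    "APPLE-CE": "mac_latin2",        # assumption
    "APPLE-CYRILLIC": "mac_cyrillic",
}


def decode_dobm_bytes(raw: bytes, encoding: str = "ISO646") -> str:
    """Decode the bytes of a DOBM file according to its ENCODING value."""
    try:
        codec = PYTHON_CODEC[encoding.upper()]
    except KeyError:
        raise LookupError(f"no codec mapped for ENCODING={encoding!r}") from None
    return raw.decode(codec)


print(decode_dobm_bytes(b"d\xf8\xedve", "ISO8859-2"))
```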
 

4. Concrete application of the format

The proposed format can be used even without further concrete applications. In that case, however, the meaning of the individual categories will be understandable to the user only through their names. If special software is to understand it as well, this meaning must be established in advance. For this purpose, a concrete application of the format must be written. At present, concrete applications are available for manuscripts and old printed books [2], modern books, periodicals, sound recordings, and collections.

The concrete application is declared in the DOBM and DOBMENT files by the SPEC parameter of the DOBM element. The value of this parameter consists of an identification, at least one blank or end of line, and the version number (e.g. SPEC="NKP//MANUSCRIPT 2.1"). The identification consists of the name of the responsible institution, two slashes, and the name of the application. The permitted characters are ASCII letters, digits, and the special characters ' ( ) + , - . / : = ? . Case is not significant. The version has the format number dot number. If the DOBM file uses no concrete application, then the SPEC parameter has the value "". A specification with a higher version number may not alter the meaning of the categories defined in lower versions. Each category is identified (besides its label) only by the identification of the concrete application, and not by the version.
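
The structure of the SPEC value can be sketched as a small parser. The Python code below is an illustration only; the names Spec, parse_spec, and same_application are hypothetical. Since the slash is itself a permitted character, the sketch assumes that the first pair of slashes separates the institution from the application. It also reads the first special character in the list above as the ASCII apostrophe.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Spec(NamedTuple):
    institution: str
    application: str
    version: Tuple[int, int]


# Permitted characters: ASCII letters, digits and ' ( ) + , - . / : = ?
_CH = r"[A-Za-z0-9'()+,\-./:=?]"
_SPEC_RE = re.compile(rf"({_CH}+?)//({_CH}+)\s+([0-9]+)\.([0-9]+)")


def parse_spec(value: str) -> Optional[Spec]:
    """Parse a SPEC value; return None when no concrete application is used."""
    if value == "":
        return None
    match = _SPEC_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"malformed SPEC value: {value!r}")
    institution, application, major, minor = match.groups()
    # Case is not significant, so both names are normalised to upper case.
    return Spec(institution.upper(), application.upper(),
                (int(major), int(minor)))


def same_application(a: Spec, b: Spec) -> bool:
    """Categories are identified by the application, not by its version."""
    return (a.institution, a.application) == (b.institution, b.application)


print(parse_spec("NKP//MANUSCRIPT 2.1"))
```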

Concrete applications are expected to define the categories for particular purposes and to define how these categories are nested. For this purpose, it is recommended to use the DOBMENT file. Concrete applications may restrict this format in various ways, but they may not change or extend it.
 
 

5. Conclusion

The aim of this document has been to describe the structure that envelops the digital data and enables access to it through WWW browsers or special software. The rules for the creation of such a structure are general; the concrete applications for various types of digital documents will be published as individual proposals and will fully respect the structure described. At present, concrete applications for digital copies of old books and manuscripts, modern books, periodicals, sound recordings, and collections are available.
 
 

6. Literature

[1]  Tomáš Mayer, Adolf Knoll: Proposal of the Structure of Digitized Old Books and Manuscripts, Version 1.11 of 31st July, 1996
[2]  Tomáš Mayer, Adolf Knoll: The Structure of Digital Copies of Old Books and Manuscripts, Version 2.1.
[3]  Hypertext Markup Language - 2.0, http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_toc.html
[4]  ISO 8879 - 1986: Information processing - Text and office systems - Standard Generalized Markup Language (SGML) 
[5]  ISO 639 - 1988: Codes for the representation of languages. 
[6]  PUBLIC "-//AIP//DTD DOBM 2.1//EN" see the file dobm_dtd.txt
[7]  PUBLIC "-//AIP//DTD DOBMENT//EN" see the file dobmndtd.txt
[8]  SGML Declaration for DOBM and DOBMENT see the file dobm_dec.txt