Národní knihovna CR
Albertina icome Praha

 

Contents:
1. Introduction
2. The Purpose of the GENTEMP and GENHTML Programs
3. Distribution of duties between the three work sites
4. How to get started?
5. Conclusion
6. Literature

 

1. Introduction 

This article describes a method for creating a standardized description of a digital copy of a manuscript. This description consists of so-called metadata (see [1]). The article introduces the programs that make creating a standardized description easier and explains the files that form a digital copy of a manuscript on CD-ROM. The method was designed in accordance with the document Proposal of the Structure of Digital Copies of Manuscripts and Old Books, version 2.1, 1997, AiP, NK Praha [1], which is based on the more general work Digitization of Old Books, Manuscripts and Other Documents: The Format for Storage of Metadata, version 2.1, 1997, AiP, NK Praha [6].

The text [1] sets out the rules for creating standardized descriptions of digital copies of old books and manuscripts that will be backed up on digital media in accordance with UNESCO's Memory of the World programme. Like the previous version, this proposal was written on the basis of recommendations of the Memory of the World Sub-Committee on Technology.

The description of digitized manuscripts in [6] and [1] is designed so that it can be used in any WWW browser, for example NETSCAPE or MSIE, and digital copies can be made available on the Internet at any time. For this reason, extended HTML rules are used as the basis for the structured description of digital copies of manuscripts. The description, which follows the international SGML standard (ISO 8879), contains information that can later be used for mass processing of data. Special search software can also be applied to digital copies of old manuscripts. One example is the ManuFret application, developed by Albertina icome Praha Ltd., which allows the user not only to browse, magnify, reduce, and print pictures of individual pages of a manuscript, but also to search the HTML texts with a wide range of query options.

According to the proposal in [1], a digital copy of a manuscript is composed of pictures of individual pages (called data in [1]) and a structured HTML description (which [1] refers to more precisely as metadata: the DOBM, DOBMENT and MNSXDEF.INF files). The goal of this method, which is used for example in the Czech National Library, is to provide the easiest, quickest, and safest way of producing a digital copy of a manuscript.

This method of producing digital copies of manuscripts:

  1. makes the preparation of structured HTML documents as specified in [1] simpler and quicker - the scholar preparing descriptions of manuscripts does not need to be familiar with the proposal in [1] or with the HTML language,
  2. simplifies communication between the person preparing the descriptions of manuscripts, the person making pictures of individual pages, and the person who saves the digital copy on CD-ROM,
  3. limits the possibility of making a mistake when naming the picture files - there is a clear correspondence between the physical page of the manuscript and the name of the file containing its picture.

 

2. The Purpose of the GENTEMP and GENHTML Programs

The conversion of a description of a manuscript into the extended HTML format is carried out in two steps, using two programs: GENTEMP and GENHTML.

The GENHTML program generates HTML documents (conforming to [1]) from an input text file that is written according to simple rules. The person preparing the descriptions of the manuscripts therefore does not have to work directly with the format defined in [1] or with the HTML language.

The GENTEMP program speeds up the preparation of the input text file for the GENHTML program. It is a very simple application which, on the basis of a few entered parameters, creates a text file with the prescribed structure and with the numbers of the individual sheets or pages of the manuscript. We will call the file produced by the GENTEMP program a template.

With both programs, the structured HTML description is prepared as follows:

  1. the person preparing the structured HTML description uses the GENTEMP program to generate a template (for example MNSCR.TXT),
  2. all data that is meant to be part of the HTML description is entered into the generated template,
  3. the same person then hands over the completed text file to the people in charge of preparing the CD-ROM,
  4. these people, with the help of the GENHTML application, create the structured HTML description, which is saved on CD-ROM together with the pictures of the individual pages of the manuscript.

 The method of creating digital copies of manuscripts with the help of the GENTEMP and GENHTML programs is depicted in Figure no. 1.

 

Figure no. 1: Creation of digital copies of manuscripts   

 

3. Distribution of duties between the three work sites 

The process depends on co-operation between three independent work sites: the documentation work site, the digitization work site, and the work site for CD-ROM preparation. How the work sites communicate and transfer data depends on the knowledge and qualifications of the people working at them. If the people at the documentation work site are not interested in arranging the appearance of the final HTML documents, or do not want to learn the HTML format, the duties are distributed as follows:

  1. documentation work site - decides how the documentation of the manuscript will be written. Its main tasks are to prepare, with the help of the GENTEMP program, a text file (for example MNSCR.TXT) that serves as the data source for the HTML documents representing one manuscript; to hand this file over to the work site for CD-ROM preparation; and to inform the digitization work site, in good time, of the number of pages the manuscript contains and the way in which it is numbered.
  2. digitization work site - receives information from the documentation work site about the number of pages the manuscript contains and the way in which it is numbered. It then scans the individual pages of the manuscript and prepares the files containing the pictures.
    The names of these files are easily derived from the information given by the documentation work site. The pictures of the pages of the manuscript are then handed over to the work site for CD-ROM preparation.
  3. work site for CD-ROM preparation - receives the text file (MNSCR.TXT) from the documentation work site and the pictures of individual pages from the digitization work site. With the help of the GENHTML application, this work site generates a structured HTML description from the MNSCR.TXT file, creates the description file called MNSXDEF.INF (see [1]), and prepares the CD-ROM, which it submits to the documentation work site for a final check.

Figure no. 2: Distribution of duties between the three work sites

In accordance with the standard [1], the HTML description consists of general data about the manuscript and descriptions of its individual pages. From now on, we will use the word record for the description of an individual page of the manuscript. Each type of record is made up of specific data sets, which we call items. Following the rules set out in [1], each item has its own name (denomination, for example: Foliation, Size, Watermark, Type, English text, German text, Motif, ...), type (only TEXT, DATE, and NUMBER are allowed), label (identification of the item), language, and contents (for example: 7r, 255-260 x 348 mm, watermark - apparently the emblem "IHS" in a cartouche, illustrative scene, the depiction of our former school or the Seminary of St Ildefonso in Mexico, Vorstellung unserer gewesenen Pflantzschul oder Seminarium in Mexico San Ildefonso, Mexico - the city, ...).
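The structure of an item described above can be pictured as a small record type. The following Python sketch is only an illustration: the class name and field names are assumptions made for this example and are not taken from [1].

```python
from dataclasses import dataclass

# The three item types that the text says are allowed.
ALLOWED_TYPES = {"TEXT", "DATE", "NUMBER"}


@dataclass
class Item:
    """One item of a record (hypothetical class and field names)."""
    name: str      # denomination, e.g. "Foliation"
    type: str      # TEXT, DATE or NUMBER
    label: str     # identification of the item, e.g. "FOLIATION"
    language: str  # e.g. "EN"
    contents: str  # e.g. "7r"

    def __post_init__(self):
        # Reject any type other than the three named in the text.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported item type: {self.type}")


# A record describes one page and is made up of items.
record = [Item("Foliation", "TEXT", "FOLIATION", "EN", "7r")]
print(record[0])
```

Checking the type when the item is created mirrors the restriction to TEXT, DATE, and NUMBER; how GENHTML itself enforces it is not described here.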


An automatically created description consists, among other things, of general data about the manuscript derived from the AACR2 standard (the BIBLDESCR section); the book, which is a list of the numbers of the individual pages of the manuscript; optionally a gallery of small pictures of the individual pages; and the descriptions of the individual pages, which contain references to pictures of higher quality. Each page of the manuscript corresponds to one HTML file. All pages of the manuscript are described by specific statements (for example: Foliation, Motif, Latin text, Translation, etc.).

For the HTML references to files to work, all the HTML files that are part of the description must be placed in the same subdirectory as the MNSXDEF.INF file (see [1]), whether on one or more CD-ROMs. When naming and saving files, the user must follow specific rules. The number of the sheet or page determines the name of the picture file, while the path is determined by the quality of the picture. If the digital copy is saved on more than one CD-ROM, you must be careful when dividing up the individual files: all files that are in any way connected with an individual page of the manuscript have to be placed on the same CD-ROM.
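The rule that all files belonging to one page must share a CD-ROM can be checked mechanically. The Python sketch below is an illustration under one assumption: that files belonging to the same page share a base name (for example 0001R), as in the naming examples later in this article. The function name and the shape of the input are hypothetical.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def pages_split_across_discs(file_to_disc):
    """Return base names whose files sit on more than one CD-ROM.

    file_to_disc -- dict mapping a file path such as "INET\\0001R.jpg"
                    to the number of the CD-ROM that holds it
    """
    discs_by_page = defaultdict(set)
    for path, disc in file_to_disc.items():
        # "INET\\0001R.jpg" -> "0001R"
        page = PureWindowsPath(path).stem
        discs_by_page[page].add(disc)
    return sorted(page for page, discs in discs_by_page.items()
                  if len(discs) > 1)


layout = {
    "EN\\0001R.HTM": 1,
    "INET\\0001R.jpg": 1,
    "NORMAL\\0001R.jpg": 2,   # same page, different disc
    "EN\\0001V.HTM": 2,
    "NORMAL\\0001V.jpg": 2,
}
print(pages_split_across_discs(layout))
```

In this example only page 0001R is reported, because its NORMAL picture is on a different disc from its other files.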

The technology presented here is based on the idea that all pages of the manuscript are treated in the same way during digitization. By this we mean that pictures of the same qualities (PREVIEWQ, GALLERYQ, INTERNETQ, NORMALQ, or EXCELLENTQ - see [1]) are available for all pages. In addition, we expect that at least one picture of some quality will be made for each page of the document. More precisely, each page of the manuscript is represented by a maximum of five pictures (one picture for each quality):

  1. a picture of a page in GALLERYQ quality (data marked as CTGLABEL="GALLERYQ", very small preview images for quick orientation; the recommended format is GIF, max. 10 kB - see [1]). GALLERYQ pictures exist for either all or none of the pages of the manuscript.
  2. a picture of a page in PREVIEWQ quality (data marked as CTGLABEL="PREVIEWQ", preview images of the original pages used in the description of individual pages; the recommended format is GIF, max. 50 kB). PREVIEWQ pictures exist for either all or none of the pages of the manuscript.
  3. a picture of a page in INTERNETQ quality (data marked as CTGLABEL="INTERNETQ", images for on-line viewing of the digital copy on the Internet; the recommended format is JPEG, max. 150 kB). INTERNETQ pictures exist for either all or none of the pages of the manuscript.
  4. a picture of a page in NORMALQ quality (data marked as CTGLABEL="NORMALQ", images for normal use by scholars; the recommended format is JPEG, approx. 1 MB). NORMALQ pictures exist for either all or none of the pages of the manuscript.
  5. a picture of a page in EXCELLENTQ quality (data marked as CTGLABEL="EXCELLENTQ", images in the highest quality allowed by the scanning device; the recommended format is JPEG). EXCELLENTQ pictures exist for either all or none of the pages of the manuscript.
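The "all or none" rule in the list above lends itself to a simple consistency check before the CD-ROM is prepared. The following Python sketch shows one way to do this; the function name and the shape of its inputs are assumptions made for illustration and are not part of GENHTML.

```python
# Quality labels named in the text.
QUALITIES = ["GALLERYQ", "PREVIEWQ", "INTERNETQ", "NORMALQ", "EXCELLENTQ"]


def check_all_or_none(pages, pictures):
    """Return the qualities that cover only some of the pages.

    pages    -- list of page identifiers, e.g. ["FC", "1r", "1v"]
    pictures -- dict mapping a quality label to the set of pages
                for which a picture of that quality exists
    """
    all_pages = set(pages)
    problems = []
    for quality in QUALITIES:
        covered = pictures.get(quality, set()) & all_pages
        # Covering some pages but not all of them breaks the rule.
        if covered and covered != all_pages:
            problems.append(quality)
    return problems


pages = ["FC", "1r", "1v"]
pictures = {
    "GALLERYQ": {"FC", "1r", "1v"},  # every page: fine
    "NORMALQ": {"FC", "1r"},         # 1v is missing: reported
}
print(check_all_or_none(pages, pictures))
```

A quality with no pictures at all is accepted, which matches the "none" half of the rule.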

Before reading further, use a WWW browser to look at the examples of the digital copies of the manuscripts Labirynt sveta a lusthauz srdce, Knizky sestery o obecných vecech krestanskych and Codex Pictoricus Mexicanus (part). This will give you a better idea of what the digitized manuscripts look like.
 

4. How to get started?

We recommend that the people taking part in the digitization of manuscripts go through the entire process of making a digital copy of a fictitious manuscript (everything except the scanning itself). This gives them a better idea of how the process works. At the same time, they have a chance to work out how they will transfer information between the different work sites.

Let us assume that we want to create a digital copy of a manuscript which has a front cover with an outside and an inside part, a back cover with an outside and an inside part, and five sheets between the two covers. Let us say that we want to number the sheets (not the pages) of the manuscript and that we will not attach any additional data to the individual pages.

Start the GENTEMP application by clicking its icon. A form for specifying the layout of the digitized manuscript appears on the screen. Check whether the information already in the form fits the requirements above. When the form opens, the numbering method is set to Foliation, which is correct if you plan to number the manuscript by sheets. Front Cover, Front End-Sheet, Back Cover and Back End-Sheet (the outside and inside parts of the front cover and of the back cover) are also all selected. This means that the covers of the manuscript will be scanned as well. Since our fictitious manuscript contains five sheets, replace the 0 in the Number of sheets in main part line with 5. Now press the Generate button to open the dialogue window, choose the name of the template and the directory in which you want to save the generated file, and press the OK button. The application then creates the requested text file, shown below. Text enclosed in curly braces is not part of the template; it is only commentary on the template. Three dots arranged vertically stand for omitted text.

{The user fills in the general information about the manuscript. This information is derived from the AACR2 standard} 
  

!Document title:
{in this spot, write the name of the manuscript, it can be shortened} 

!Shelf-number: 
{insert the shelf-number of the manuscript here} 

!Library: 
{in this spot, write the name of the library where the manuscript comes from} 

!Owner: 
{in this spot, write the name of the owner of the document} 

!Title: 
{the name of the manuscript goes here} 

!Author: 
{the name of the author goes here} 
     . 
     . {three periods indicate that some text was left out} 
     . 
!Literature: 

!Language of the Original:
      LA
{already filled in: the two-character abbreviation of the language in which the original was written, following the ISO 639 standard}

!Image Capturing Data: 
{in this spot, write information about the method in which the digitized copies of the pages of the manuscript were created} 

!!!!! end of bibliographic description !!!!!!!!!!!!!!! 

{ names, types, labels, and language of the statements with the help of which the individual pages of the manuscript will be described} 

!Foliation: TEXT, FOLIATION, EN 
{the manuscript pages will be described only with the help of one statement with the name of Foliation, type TEXT, and the label (identification - see [1]) FOLIATION; the contents of this statement is in English and it has been generated by the GENTEMP application} 

!!!!! end of definitions !!!!!!!!!!!!!!! 

{first record - description of the front cover of the manuscript} 

!Foliation: 
     FC
{the sign for front cover - the outside part of the front cover} 

!!!!! end of record no. 1 !!!!!!!!!!!!!!! 

{second record - description of the inside part of the front cover} 

!Foliation: 
     FS
{the sign for Front End-Sheet - the inside part of the front cover} 

!!!!! end of record no. 2 !!!!!!!!!!!!!!! 

{third record - description of the first page of the manuscript} 

!Foliation: 
    1r
{the sign for the front side of the first sheet} 

!!!!! end of record no. 3 !!!!!!!!!!!!!!! 

{fourth record - description of the second page of the manuscript}

!Foliation: 
     1v
{the sign for the back side of the first sheet} 

!!!!! end of record no. 4 !!!!!!!!!!!!!!!

{fifth record - description of the third page of the manuscript} 

!Foliation: 
     2r
{the sign for the front side of the second sheet} 

!!!!! end of record no. 5 !!!!!!!!!!!!!!! 

{sixth record - description of the fourth page of the manuscript} 

!Foliation: 
     2v
{the sign for the back side of the second sheet} 

!!!!! end of record no. 6 !!!!!!!!!!!!!!! 

!Foliation: 
     3r
{the sign for the front side of the third sheet} 

!!!!! end of record no. 7 !!!!!!!!!!!!!!! 
     . 
     . {three dots indicate that some text is missing}

!!!!! end of record no. 11 !!!!!!!!!!!!!!!  

!Foliation: 
      5v
{the sign for the back side of the fifth sheet} 

!!!!! end of record no. 12 !!!!!!!!!!!!!!! 

!Foliation: 
     BS
{the sign for Back End-Sheet - the inside part of back cover} 

!!!!! end of record no. 13 !!!!!!!!!!!!!!! 

!Foliation: 
     BC
{the sign for Back Cover - the outside part of the back cover}

!!!!! end of record no. 14 !!!!!!!!!!!!!!! 

!!END {the sign for the end of the text file} 
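The order of the Foliation values in the template above can be reproduced with a few lines of Python. This is a sketch of the sequence only, not of GENTEMP itself; the function name and its parameters are hypothetical.

```python
def foliation_labels(sheets, front_cover=True, front_end_sheet=True,
                     back_end_sheet=True, back_cover=True):
    """List the Foliation values in the order used by the template above."""
    labels = []
    if front_cover:
        labels.append("FC")      # outside part of the front cover
    if front_end_sheet:
        labels.append("FS")      # inside part of the front cover
    for sheet in range(1, sheets + 1):
        labels.append(f"{sheet}r")  # front side of the sheet
        labels.append(f"{sheet}v")  # back side of the sheet
    if back_end_sheet:
        labels.append("BS")      # inside part of the back cover
    if back_cover:
        labels.append("BC")      # outside part of the back cover
    return labels


labels = foliation_labels(5)
print(len(labels))  # 14, one per record in the template above
print(labels)
```

For the fictitious manuscript with five sheets and both covers, this gives the fourteen records of the template: four cover parts and ten sides of sheets.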

 

Close the GENTEMP application with the Exit button. Start the text editor WORDPAD or NOTEPAD (if you use a different program, make sure that it saves the file as plain text). Open the template and fill in the first part with general information about the manuscript (Shelf-number, Library, Owner, ...). Begin each new paragraph or entry in the text with a tab or one or more spaces at the beginning of the line.
  

 !Document Title:
    Shortened Chronicle of the Republic of Lapalie 

!Shelf-number:
    XXIII B 99/1 

!Library:
    National Library of the Republic of Lapalie 

!Owner:
    The Republic of Lapalie 

!Title:
    Shortened Chronicle of the Republic of Lapalie 

!Author:
    Anonymous 

!Edition:

!Type of Document:
    manuscript 

!Publisher:

!Place of Publication:

!Printer:

!Place of Printing:

!Datation: ......... 
    16 c. 

!Physical Description:

!Material:
    paper 

!Size:

!Extent:
     .
     . 
     .

!Language of the Original:
    LA 

!Image Capturing Data:
    digitized with the KODAK DCS 460

!!!!! end of bibliographic description !!!!!!!!!!!!!!!

     .
     . {in this example, nothing else needs to be filled in} 
     .  

!Foliation: TEXT, FOLIATION, EN 

!!!!! end of definitions !!!!!!!!!!!!!!!

!Foliation:  
    FC 
     .  
     .  
     .  
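To make the layout of the text file more concrete, here is a minimal Python sketch that reads such a file into sections. It rests on what can be inferred from the example above: a line beginning with one exclamation mark names an item, indented lines carry its contents, a line beginning with five exclamation marks closes a section, and !!END closes the file. The exact rules are given in [2], so treat this as an assumption-laden illustration; the function name is hypothetical. The text in curly braces in this article is commentary and is assumed not to occur in a real file.

```python
def parse_sections(text):
    """Split a template-like text into a list of {item name: contents} dicts."""
    sections = []
    current = {}
    name = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("!!END"):
            break                          # end of the text file
        if stripped.startswith("!!!!!"):
            sections.append(current)       # "end of ..." separator line
            current, name = {}, None
        elif stripped.startswith("!") and ":" in stripped:
            name, _, rest = stripped[1:].partition(":")
            name = name.strip()
            current[name] = rest.strip()   # value on the same line, if any
        elif stripped and name is not None and line[:1] in (" ", "\t"):
            # Indented line: contents of the most recent item.
            current[name] = (current[name] + " " + stripped).strip()
    return sections


sample = """!Title:
    Shortened Chronicle of the Republic of Lapalie
!Author:
    Anonymous
!!!!! end of bibliographic description !!!!!!!!!!!!!!!
!Foliation: TEXT, FOLIATION, EN
!!!!! end of definitions !!!!!!!!!!!!!!!
!Foliation:
    FC
!!!!! end of record no. 1 !!!!!!!!!!!!!!!
!!END
"""
for section in parse_sections(sample):
    print(section)
```

The first section holds the bibliographic description, the second the item definitions, and each later section one record.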
  

Save the text file and close the editor. Open the GENHTML program and fill out the form. Pressing the button with the yellow icon, which has Input File/Name written above it, opens the dialogue window. In the dialogue window, first choose the directory and then the name of the template that you have filled out. Next, select the correct code page from the five offered in the Input File/Code Page line. If your Windows system uses a western code page, choose Windows Latin I. If your Windows system uses an eastern code page, choose Windows Latin II. If you worked with a DOS editor, choose Latin I, Latin II or MJK (the Czech code of the Kamenický brothers), depending on the code page that your editor uses.

Note:  If you write the descriptions solely in English and do not use characters outside the English alphabet, which we recommend, you need not concern yourself with choosing the correct code page, because any of them will work.
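The reason the choice does not matter for plain English text can be shown with a short Python sketch. The mapping from the code page names used above to Python codec names is an assumption made for this example (GENHTML's own definitions may differ), and the Kamenický code is left out of the sketch.

```python
# Assumed correspondence between the code pages named above and
# Python codec names; this mapping is not taken from GENHTML.
CODE_PAGES = {
    "Windows Latin I": "cp1252",
    "Windows Latin II": "cp1250",
    "DOS Latin I": "cp850",
    "DOS Latin II": "cp852",
}

ascii_text = "Shortened Chronicle of the Republic of Lapalie"
raw = ascii_text.encode("ascii")

# Pure ASCII bytes decode to the same text under every code page.
decoded = {name: raw.decode(codec) for name, codec in CODE_PAGES.items()}
print(all(text == ascii_text for text in decoded.values()))

# A byte above 127 does not: 0xE8 is read as different letters.
accented = {name: bytes([0xE8]).decode(codec)
            for name, codec in CODE_PAGES.items()}
print(accented)
```

All of these code pages agree on the first 128 characters, so only accented or other non-English characters depend on the setting.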

 

Choose the button in the lower right part of the form and, in the dialogue window, set the Output Directory in which the generated HTML documents should be saved. Then press the Generate button. If an error is reported, fix the input text file and press the Generate button again. For now, ignore any warnings about Contents, Notation or Illuminations. After the HTML documents have been created, close the GENHTML application with the Exit button. Your structured HTML description is ready. If, however, you want to view the pictures of the individual manuscript pages in a WWW browser (for example NETSCAPE), you have to copy the picture files to the appropriate place. The subdirectory GALLERY, located in the ...\EXAMPLES\FICT directory, contains files with GALLERYQ pictures of the pages of the fictitious manuscript. Copy this subdirectory to the directory where your MNSXDEF.INF file is located. Do the same with the PREVIEW, INET and NORMAL directories. Start a WWW browser and look through your fictitious digitized manuscript by opening the EN\DESCR.HTM file.

Note how the names of the HTML documents and of the files containing pictures are derived from the number of a sheet, or more precisely from the item Foliation.
  

Page with Foliation:     Name of the HTML file describing the page:

    FC                   EN\FC.HTM
    FS                   EN\FS.HTM
    1r                   EN\0001R.HTM
    1v                   EN\0001V.HTM
    2r                   EN\0002R.HTM
    .                    .
    .                    .
    .                    .
    5v                   EN\0005V.HTM
    BS                   EN\BS.HTM
    BC                   EN\BC.HTM
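The correspondence in the table above can be expressed as a small function. The Python sketch below infers the rule from the examples alone (cover signs are kept as they are, sheet numbers are padded to four digits, and the side letter is upper-cased); it covers numbering by sheets only, and the function names are hypothetical.

```python
import re

# Signs for the cover parts, which are used unchanged as file names.
COVER_LABELS = {"FC", "FS", "BS", "BC"}


def base_name(foliation):
    """Derive the file base name from a Foliation value (1r -> 0001R)."""
    if foliation.upper() in COVER_LABELS:
        return foliation.upper()
    match = re.fullmatch(r"(\d+)([rv])", foliation)
    if match is None:
        raise ValueError(f"unexpected foliation: {foliation!r}")
    number, side = match.groups()
    return f"{int(number):04d}{side.upper()}"


def html_name(foliation, language="EN"):
    """Path of the HTML file describing the page, as in the table above."""
    return f"{language}\\{base_name(foliation)}.HTM"


for label in ["FC", "FS", "1r", "1v", "5v", "BS", "BC"]:
    print(label, "->", html_name(label))
```

The EN directory is taken from the examples; treating it as a parameter tied to the language of the description is an assumption.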


The names of the files containing pictures of the pages of the manuscript are the same as the names of the HTML files describing those pages. The extension (only *.jpg and *.gif are allowed) and the name of the subdirectory are both determined by the quality of the picture. The naming of the picture files can be seen clearly in the following examples:
  

Page with Foliation:     Name of the file with its GALLERYQ picture:

    FC                   GALLERY\FC.gif
    FS                   GALLERY\FS.gif
    1r                   GALLERY\0001R.gif
    1v                   GALLERY\0001V.gif
    2r                   GALLERY\0002R.gif
    .                    .
    .                    .
    .                    .
    5v                   GALLERY\0005V.gif
    BS                   GALLERY\BS.gif
    BC                   GALLERY\BC.gif

 

Page with Foliation:     Name of the file with its PREVIEWQ picture:

    FC                   PREVIEW\FC.gif
    FS                   PREVIEW\FS.gif
    1r                   PREVIEW\0001R.gif
    1v                   PREVIEW\0001V.gif
    2r                   PREVIEW\0002R.gif
    .                    .
    .                    .
    .                    .
    5v                   PREVIEW\0005V.gif
    BS                   PREVIEW\BS.gif
    BC                   PREVIEW\BC.gif

 

Page with Foliation:     Name of the file with its INTERNETQ picture:

    FC                   INET\FC.jpg
    FS                   INET\FS.jpg
    1r                   INET\0001R.jpg
    1v                   INET\0001V.jpg
    2r                   INET\0002R.jpg
    .                    .
    .                    .
    .                    .
    5v                   INET\0005V.jpg
    BS                   INET\BS.jpg
    BC                   INET\BC.jpg

 

Page with Foliation:     Name of the file with its NORMALQ picture:

    FC                   NORMAL\FC.jpg
    FS                   NORMAL\FS.jpg
    1r                   NORMAL\0001R.jpg
    1v                   NORMAL\0001V.jpg
    2r                   NORMAL\0002R.jpg
    .                    .
    .                    .
    .                    .
    5v                   NORMAL\0005V.jpg
    BS                   NORMAL\BS.jpg
    BC                   NORMAL\BC.jpg
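The four tables above follow one pattern: the quality selects the subdirectory and the extension, and the Foliation value selects the file name. A Python sketch of that pattern follows, with the rule again inferred from the examples only. The function names are hypothetical, and EXCELLENTQ is left out because no directory name for it appears in this article.

```python
import re

# Subdirectory and extension per quality, as shown in the tables above.
QUALITY_LAYOUT = {
    "GALLERYQ": ("GALLERY", "gif"),
    "PREVIEWQ": ("PREVIEW", "gif"),
    "INTERNETQ": ("INET", "jpg"),
    "NORMALQ": ("NORMAL", "jpg"),
}

# Signs for the cover parts, which are used unchanged as file names.
COVER_LABELS = {"FC", "FS", "BS", "BC"}


def base_name(foliation):
    """Derive the file base name from a Foliation value (2r -> 0002R)."""
    if foliation.upper() in COVER_LABELS:
        return foliation.upper()
    match = re.fullmatch(r"(\d+)([rv])", foliation)
    if match is None:
        raise ValueError(f"unexpected foliation: {foliation!r}")
    number, side = match.groups()
    return f"{int(number):04d}{side.upper()}"


def picture_paths(foliation):
    """Map each quality to the path of the picture file for one page."""
    name = base_name(foliation)
    return {quality: f"{directory}\\{name}.{extension}"
            for quality, (directory, extension) in QUALITY_LAYOUT.items()}


for quality, path in picture_paths("2r").items():
    print(quality, "->", path)
```

Keeping the layout in one table makes it easy to see that a page's files differ only in directory and extension, which is what limits naming mistakes.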

 

5. Conclusion

The purpose of this article was to introduce you to the method of generating a description of a manuscript in the extended HTML structure and of arranging the files in their final form on CD-ROM. Directions for using the programs can be found in [2] and [3]. The structure of the text file created by the GENTEMP program is explained in [2]. The final form of the digital copy is explained in more detail in [3].
 
 

6. Literature

[1]  Tomas Mayer: Proposal of the Structure of Digital Copies of Manuscripts and Old Books. Version 2.1, 1997, AiP, NK Praha
[2]  Help of GENTEMP Application, AiP, 1997
[3]  Help of GENHTML Application, AiP, 1997
[4]  Hana Krizova: Description of a Manuscript in TEXT Format, AiP, 1997
[5]  Hana Krizova: Description of a Manuscript in Structured HTML Format, AiP, 1997
[6]  Jan Vomlel: Digitization of Old Books, Manuscripts and Other Documents: The Format for Storage of Metadata. Version 2.1, 1997, AiP, NK Praha