

Section VI: Use of computers in audiovisual archives

6.1 Evaluating computer cataloguing systems: A guide for audiovisual archivists

 


Roger Smither, Keeper of Film and Television, Imperial War Museum

INTRODUCTION

Those thinking of acquiring computer systems are liable to be submerged in a flood of brochures and swamped by the attentions of salesmen, all determined to convince the potential customer that the system they are promoting is uniquely qualified to meet that customer's needs. It is all too easy, confronted by these impassioned declarations, by incomprehensible technical specifications and, often enough, by impressive demonstrations, to lose sight of what the system is actually offering and how that relates to the customer's real needs. The demonstrations can be particularly seductive, since they will be conducted using information the suppliers know their system can handle; the archive's data may be significantly different. This paper offers a selection of the kinds of considerations that are likely to arise in the context of systems for cataloguing, and about which the customer should be aware before making up his or her mind.

The paper is divided into two parts. The first offers definitions of the terminology used to describe computer systems by salesmen and in promotional literature. Although not intended to provide a full explanation of the technology of computing, it is hoped this section will give potential purchasers an understanding of what vendors are really offering. Such a prelude is necessary because it is this terminology that is also inevitably used in the second part of the paper, which suggests the sort of questions to which archivists should be seeking answers before agreeing to adopt a system. Readers who are already familiar with the language of computers probably need not bother with this first part, although they may wish to refer to it if they encounter any unfamiliar terms later.

Very deliberately, this paper does not refer to specific systems. In the first place the most important priority for archives is the quality of the data entering cataloguing systems, not the specific system used; the transfer between systems of good quality data compiled in accordance with a sound and logical structure does not these days present major problems. In the second place there is no single correct answer to the question of computer usage for film cataloguing. The needs, the budgets, the capabilities and the other circumstances of archives are too divergent for it to be sensible to expect a single system to satisfy them all.

It is perhaps also worth pointing out that it may be unreasonable to expect to find a system that scores 100% on the list of desiderata a film archive may compile (although there is no harm in hoping). Solutions, especially solutions based on packages for micro-computers, are likely to be reached on the basis of best-available compromise. This paper will have achieved its purpose if it helps an archive to an appreciation of some of the areas where it may be acceptable to make a compromise and some where it will be necessary to draw the line.

I. DEFINITIONS OF TERMINOLOGY

The acquisition of computers and computer systems is notorious for the amount and complexity of the jargon attaching to it. This part of the paper tries to explain the jargon, beginning with a few of the more fundamental concepts that the purchaser of any computing system is likely to encounter and then progressing to topics more specific to the kind of systems likely to be acquired for cataloguing. Additional definitions are also provided in appropriate contexts in the second part of the paper.

To begin at the most obvious level: computer systems consist of combinations of hardware and software. Hardware means the mechanical components of the system, software the intellectual components. The effectiveness of a cataloguing system is largely dependent on the software selected, but the performance of software in turn depends on the hardware on which it runs.

1. HARDWARE DEFINITIONS

Computers are traditionally described as main-frames, mini-computers (or minis) and micro-computers (or micros).

Main-frames are big computer installations serving several large-scale applications, typically administered by government departments, universities and research establishments, or large corporations.

Continuous technological progress has resulted in a reduction in size and production costs and an increase in capacity and speed of many key computer components. The combination of capacity and speed is what is normally meant when a computer's power is discussed.

Mini-computers are effectively scaled-down mainframes; their size, cost and comparative simplicity of operation are such as to make them available to larger archives or to the parent bodies of smaller archives. The micro-computer originates in a different design concept, characterised by the phrases personal computer (PC) or desk-top computer. This is the idea of giving individual users personal access to systems with capabilities equivalent to those which they might have been able to use (on a shared basis) on a larger machine. Costs of micro-computers are low enough to put them in reach of almost all organisations.

The main hardware components of a computer are the CPU or central processing unit, the terminal(s) where users enter commands or data and the computer displays the results of its operations, and the disc-drives or tape-drives used for the storage of software or data. In large computer installations these components may be recognisably separate; in smaller systems they are often combined in a single machine.

Other hardware is usually described generically as peripherals to distinguish it from parts essential to the computer's operations. This category includes devices for output (printers etc.), and for communications, input and remote storage.

1A. Principal hardware components: the CPU

The most important components of CPUs are the processors (the microprocessors or chips) and the main memory. The circuit-board on which a micro-computer's processor and memory are located is known as the mother-board. Processors and memory are areas of rapid development in the industry, especially in micro-computers, and generate a great deal of technical language, which it is only possible to explain briefly.

The processor is the part of the computer that actually carries out the operations required by the program. Where microprocessors are concerned, the recent past has seen considerable progress in the quantity of information a microprocessor can handle at one time ("8-bit", to "16-" and "32-bit" chips) and in the speed at which the processor operates, measured in megahertz. These trends enhance the "power" of computers, with higher figures indicating superior performance: such figures are prominently featured in current brochures.

The computer's central or main memory holds the software which the computer is currently operating and the data immediately needed for processing by that software. This memory is available for all computer operations, and is thus also known as random access memory or RAM. Main memory capacity, like other forms of storage, is measured in Kb or Mb (standing for Kilobyte and Megabyte respectively - measurements of capacity roughly equal to one thousand or one million characters of data, program etc.), so it is the size of main memory that is being described when a brochure mentions "640 Kb of RAM" or "1 Mb of main memory".
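As a rough illustration of how these units are used in practice, the following Python sketch estimates the storage a catalogue might need and checks whether a program and its working data would fit in a given amount of RAM. The record count, record length and program sizes are invented figures for the example, not measurements from any archive, and the function names are hypothetical. The units follow the approximation given above (one Mb as about a million characters).

```python
# Approximate unit size, following the rounding used in the text.
CHARS_PER_MB = 1_000_000


def storage_needed_mb(record_count: int, chars_per_record: int) -> float:
    """Estimate storage in Mb for a catalogue with a given average record size."""
    return record_count * chars_per_record / CHARS_PER_MB


def fits_in_ram(program_kb: int, data_kb: int, ram_kb: int) -> bool:
    """Check whether a program plus its working data fit in main memory."""
    return program_kb + data_kb <= ram_kb


# Hypothetical archive: 20,000 records averaging 1,500 characters each.
estimate = storage_needed_mb(20_000, 1_500)
print(f"Estimated catalogue size: {estimate} Mb")

# Hypothetical program of 400 Kb with 200 Kb of working data on a 640 Kb machine.
print(fits_in_ram(400, 200, 640))
```

An estimate of this kind ignores the space taken by indexes and by the software itself, so it is best treated as a lower bound.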

Cache-ing uses RAM to store data for easy access when processing, thus increasing speed of operation as the data would otherwise have to be recalled from disc: where disc access time is measured in milliseconds (ms), RAM access time is measured in nanoseconds (ns: a nanosecond is one millionth of a millisecond). Since these indicate durations, they are one of the few performance measurements in computing where "less is better".
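The principle of cache-ing can be sketched in a few lines of Python. This is a simplified illustration only: the class name `CachedStore` is hypothetical, and an ordinary dictionary stands in for the slow disc so that the example can run anywhere. The point it shows is that a record requested twice is fetched from the slow device only once.

```python
class CachedStore:
    """Keeps records already read in memory so that repeat reads skip the disc."""

    def __init__(self, read_from_disc):
        self._read_from_disc = read_from_disc  # slow lookup (stand-in for a disc)
        self._cache = {}                       # record number -> record, held in RAM
        self.disc_reads = 0                    # how often the slow device was used

    def get(self, record_number):
        if record_number not in self._cache:
            # Not yet in memory: pay the cost of a disc access once.
            self.disc_reads += 1
            self._cache[record_number] = self._read_from_disc(record_number)
        return self._cache[record_number]


# Illustrative data: a dictionary plays the part of the disc file.
disc = {1: "Example Film A", 2: "Example Film B"}
store = CachedStore(disc.__getitem__)

store.get(1)
store.get(1)  # served from memory, no second disc access
store.get(2)
print(store.disc_reads)
```

A real cache also has to decide what to discard when memory fills up, which this sketch leaves out.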

Other publicity jargon relating to CPUs includes the reasonably self-explanatory phrase system architecture (as in "revolutionary 386-based system architecture") and the less obvious "wait state", "expansion slot" and "bus".

Wait state derives from machinery where processors operate faster than memory and the processor has to pause for the memory to catch up, which reduces the value of a faster processor. As with access times, the lower the number of wait states the better: the optimum value is therefore "zero wait state".

An expansion slot gives the facility to plug extra printed-circuit expansion cards or boards into the same circuitry as the mother-board: such cards may be needed to add extra storage or memory, to control the screen or to "drive" a peripheral such as a tape-streamer (all these concepts are defined later), etc.

1B. Principal hardware components: the terminal

The keyboard is an essential part of a terminal, used by the operator to give instructions to the computer or to enter data. It resembles the keyboard of a conventional typewriter with the addition of several extra keys (such as programmable function keys marked "F1", "F2" etc.) providing for specific computer actions. Some software expects the operator to use these extra keys when working with the program, so it may be important to check that they are present and are located where operators can use them easily. Keyboards are also normally supplied to generate the appropriate character set of letters and symbols for the country where the machine is sold. The vendor's expectations of what is appropriate may not coincide with the expectations of the archive or the software, so this area also requires checking.

The other part of a terminal is the screen or monitor on which the operator sees the immediate results of the computer's operations. A monitor is commonly described as a VDU - visual display unit - or CRT - cathode ray tube. Monitors cover a range of sophistication, from monochrome text-only to colour with high-resolution graphics capability. Colour and graphics add significantly to the cost. The handling of text, colour and graphics on screen is controlled by a display protocol which is usually identified in publicity literature by a name or acronym (examples include Hercules, CGA, EGA, VGA, etc.). Generally, colour screens and sophisticated graphics capability are not essential for catalogue-type applications, but a lot of software is now written to make use of such facilities and may not function to the standard demonstrated without them.

1C. Principal hardware components: storage devices

Data and software not in use are kept on file storage devices - normally meaning disc or tape. Computer systems talk of writing and reading data into or out of storage. The development of disc storage offers fast access. Micro-computers promoted the development of the small removable disc (known as the floppy or floppy diskette) and of the miniaturised and robust permanent disc-drive.

Micro-computers use floppy discs with a high capacity. Floppy discs cannot automatically be used to carry data between any two machines: in addition to physical differences in disc size (5.25" has given way to 3.5", and the CD-ROM is gaining ground), discs are formatted by different generations of machines to pack data more or less densely and thus to fit more or less on a disc. A micro-computer that formats discs to the "high density" rating of 2. Mb can frequently (but not always) read a lower density disc, but the reverse is not true, and some machines do not read discs formatted on another machine even of the same supposed standard.

1D. Peripheral hardware components: output devices

The purpose of an output device is self-evident: to produce typed or printed text output, known in the industry as hard copy, from the data held. Output devices typically connect to the computer through a port on the CPU box: the more common interface (or linkage) is called parallel interface but some devices will use the serial interface port. Usage of a particular port may require the insertion of an expansion card into the computer.

To print a large report can take a long time, during which the computer may not be available for other uses; similarly, users may find themselves queuing for a shared device and unable to get on with other work while they wait. To resolve such problems, manufacturers commonly offer facilities known as "buffering" and "spooling". Buffering provides temporary memory, which can absorb the data at the fast speed at which a computer transmits it and pass it on at the slower speed at which it is actually printed. Spooling adds to buffering a form of queue-management.
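The difference between buffering and spooling can be made concrete with a small Python sketch. The class name `Spooler`, the chunk size and the user names are all invented for the illustration, and no real printer is driven. Jobs are accepted at once and held in a queue (the spooling part), and each job is then handed on in small pieces at the pace a slow device could accept (the buffering part).

```python
from collections import deque


class Spooler:
    """Queues print jobs so that users need not wait for the printer."""

    def __init__(self):
        self._queue = deque()  # jobs waiting, oldest first

    def submit(self, user, text):
        """Accept a job immediately and return its position in the queue."""
        self._queue.append((user, text))
        return len(self._queue)

    def print_next(self, chunk_size=16):
        """Take the oldest job and split it into chunks for a slow printer.

        Returns (user, chunks), or None when no jobs are waiting.
        """
        if not self._queue:
            return None
        user, text = self._queue.popleft()
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        return user, chunks


spooler = Spooler()
spooler.submit("first user", "A long catalogue listing, sent in one go.")
spooler.submit("second user", "A short index.")

user, chunks = spooler.print_next()
print(user, len(chunks))
```

Jobs are printed strictly in the order received here. A shared installation might prefer to let short jobs jump the queue, which is a policy question rather than a technical one.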

Printers may use either standard cut-sheet paper or the familiar "fan fold" continuous computer stationery: the former is typically associated with quality, the latter with bulk and speed. Printers using continuous stationery are described as tractor feed because of the mechanics of paper movement.

Printers use different methods of creating characters. The principal types are:

- dot-matrix printers, which make characters out of patterns of dots formed by different arrangements of a print-head of small pins striking a ribbon (ink jet printers are similar, except that ink is squirted onto the paper not struck from a ribbon: printing is quieter but special paper or a heat process is often needed to accelerate ink drying);
- daisy-wheel printers, which use the same technology as type-writers, i.e. cast letters striking a ribbon (the name derives from the flower-like wheels on which different styles of lettering may be loaded into the printer); and
- laser printers, which use the movement of a laser to draw output onto plain paper and offer, for a price, an optimum combination of quality and noise level.

1E. Peripheral hardware components: communication devices and networking

Communication devices enable one computer to transmit data or instructions to others or to receive data or instructions from them. They may be simple cable linkages to connect two computers by way of one or other of their ports, or more complex devices such as a modem (which enables a computer to address another computer at a great distance, commonly over a telephone line).

A special form of communications device is that linking two or more computers in a network. Within a network, users of different computers may have access to each other's data files and disc storage and share peripheral devices, or they may share access to one or more central CPU and storage resource devices known as file servers as well as to peripherals. Users of file servers may either operate on a dumb terminal or work station (with no processing capability of their own) or on a micro-computer linked to the network, which can both operate the shared resources and carry out its own local processing. Literature commonly distinguishes between a local area network or LAN, contained within a single building or set of offices, and a wide area network or WAN involving terminals (or nodes) at more remote sites - some perhaps participating in the network via modems.

Networking increases the theoretical benefit of computerisation by extending the availability of a computer system beyond the single machine on which it is first installed, and by offering the possibility of linking users to more powerful processing or larger storage resources than could justifiably be made available to individuals. The opportunities for networking should be explored when planning any new system.

1F. Peripheral hardware components: input devices.

A "mouse" and a "scanner" are input devices that may be offered as alternatives to the keyboard (or communication devices) for the entry of data or instructions into computer systems.

A mouse is a hand-held device that controls the movement of a pointer on the terminal screen; the pointer is used to select options from a "menu" of possible operations. Normally, however, the software must have been written with mouse-operation in mind: it will not be possible to use a mouse in a system written solely for keyboard operation.

A scanner is a machine that looks like a facsimile (fax) machine or photocopier and can "read" the contents of a page into a computer system. The typical usage in computer applications is the capture or transfer of graphic designs, or, with OCR (optical character recognition) capabilities, the reading of text from a clearly printed or typed page. Such devices may help a computerised cataloguing system catch up with backlogs of data from well-kept manual systems.

1G. Peripheral hardware components: remote storage devices.

In addition to the storage devices built into a computer, further storage may be provided as a peripheral. This may consist simply of extra disc-drives (floppy or hard) to supplement those built into the computer or to act as file servers for a network (see definition above), but other data storage technologies also exist. For example, tape streamers offer miniaturised tape storage and can be useful methods of making precautionary (back-up) copies of programs and data files to store in case of problems with the copies held on the computer. Various forms of laser disc technology such as CD-ROM offer high storage capacity for large data files, although it is an important restriction that this is not yet commonly in a re-usable form (a restriction indicated by the letters standing for "read-only memory"; another designation is WORM, standing for "write once, read many times").

2. SOFTWARE DEFINITIONS

Software is the generic term for computer programs - the machine-readable instructions that enable the computer to perform the intended operations. Software is divided between operating systems - the programs which control the computer's work whatever task it is given - and application programs, which address the user's specific requirements (such as word-processing, accounts management or, in the context of this paper, cataloguing).

2A. Operating systems

Operating systems are (as previously noted) the programs used within the computer to control its operations. Application programs must be tailored to work within a particular operating system: not all applications are available under all operating systems, or even under all versions of a given operating system. Operating systems evolve, with enhanced facilities in successive released versions, and applications software often requires not just a particular operating system but a version no earlier than a particular release. Note also that operating systems may impose limitations on an application that are not apparent when that application is run on other systems. For example, some operating systems impose limits on the size of file that can be handled. If this limit is lower than one tolerated by an application program in other contexts, users may find that program running up against limitations a demonstration had not led them to expect.

Some operating systems are specific to individual computer types; others have widespread currency. The first significant "machine-independent" operating system was CP/M, but the market leaders now are PC-DOS/MS-DOS (the system developed for the IBM Personal Computer and now used by the whole range of "IBM-compatibles") and UNIX/XENIX (a system developed by Bell Laboratories/AT&T for multi-user or multi-tasking computers - concepts considered below). The new generation of IBM machines will operate under a new operating system, OS/2, which many expect to overtake MS-DOS as the effective industry standard during the next decade. The vast majority of micro-computer software currently available, however, runs under MS-DOS.

Hardware and operating system combinations will be a factor in determining whether an archive's use of its computer is single-user or multi-user (i.e. whether more than one person can use the system at the same time) and whether or not the equipment can be multi-tasking (i.e. capable of running more than one job at a time). The majority of micro-computer systems hitherto have been single-user and single-task operations, and have obliged their owners to come to terms with "the single user bottleneck" - the difficulties arising from the fact that while the machine is in use by one person it, and the data it contains, are not available to other would-be users. The development of more powerful chips, of easier networking (see above), and of multi-tasking operating systems should all make these problems easier to avoid.

It is important to remember that software must be appropriate to multi-user access as well as the hardware. For example, the software should include file-locking or record-locking procedures that will resolve conflicts between users seeking simultaneous access by preventing a second user from opening a file (or, less drastically, a single record) already opened by another.
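Record-locking as just described can be sketched as follows. This is a minimal single-process illustration in Python, with the hypothetical class name `RecordLocks` and invented user names. A real multi-user system would need the lock table itself to be shared safely between machines, which is not attempted here.

```python
class RecordLocks:
    """Tracks which user currently has each record open for editing."""

    def __init__(self):
        self._owners = {}  # record number -> user holding the lock

    def acquire(self, record_number, user):
        """Try to open a record; return True if this user now holds it."""
        # setdefault stores the user only if nobody holds the record already.
        owner = self._owners.setdefault(record_number, user)
        return owner == user

    def release(self, record_number, user):
        """Close a record, but only if this user is the one holding it."""
        if self._owners.get(record_number) == user:
            del self._owners[record_number]


locks = RecordLocks()
print(locks.acquire(101, "cataloguer A"))  # first user opens record 101
print(locks.acquire(101, "cataloguer B"))  # second user is refused
print(locks.acquire(102, "cataloguer B"))  # a different record is still free
locks.release(101, "cataloguer A")
print(locks.acquire(101, "cataloguer B"))  # available once released
```

File-locking would work in the same way with a file name in place of the record number, at the cost of shutting other users out of every record in that file.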

2B. Applications software: general introduction

One of the chief advantages of the enormous increase in usage of micro-computers has been the reduction of the need for the majority of potential users to have to contemplate writing or commissioning their own software. Most common uses for computers have been identified, and a range of solutions is on offer for each. AV archivists need to be aware, however, that their particular needs do not necessarily fit into this category: film cataloguing is not the same as stock-taking in shops or even book-cataloguing in libraries. AV archivists may hope to avoid writing a new system entirely from first principles, but they should not expect to find there is no work to be done.

The kind of software on offer to archives for cataloguing will be variously described: the terms "package", "program" or "suite of programs" may be used as well as "system". The first necessary distinction is between interactive or on-line systems and batch-mode or off-line systems. The former offer the user the facility to interrogate the computer to seek immediate answers to specific enquiries from the information in its files. Off-line systems are those where a computer is used to process a structured file to generate catalogue- and index-type listings, which enquirers use in an essentially conventional way (although the physical format of these listings may be less conventional than their contents). "Batch-mode" derives from the fact that the computer processes data in cumulative batches rather than interactively (or "as it comes"). In fact many interactive systems will also carry out certain functions in batch-mode if this suits the user: batches of new information may be added to the file overnight, or long printouts generated in quiet periods, so as to avoid tying up the computer at times when it is needed for other purposes.

Most systems currently on offer to archives for in-house use (see below) are interactive, although usually with a report-generating capability which enables users to produce printed catalogue- or index-listings as well as making on-screen enquiries. Bureau-service and other remote systems may only be available off-line. "Off-line" does not automatically mean inferior: many archives have made successful starts in computerisation using generalised information-processing systems operating entirely in batch mode.

On-line or interactive systems may run in-house or be available to the archive on a time-sharing basis. In-house systems are, as their name suggests, entirely self-contained within the archive, which may therefore expect considerable freedom of choice (within organisational and budgetary limits) in their selection. The term "time-sharing" describes the relationship where a system user is allowed or is sold on-line access to a computer which is also (or primarily) dedicated to the needs of another (usually larger) user, such as a university, library, government agency etc. Time-sharing will commonly give an archive access to a computer of greater power than it would be likely to acquire itself, but there may well be limitations on the choice of software, on the times when the archive is allowed access, etc., as well as the possible factor of connect costs - the charges levied for the time the archive spends actually communicating with the system.

Another topic for potential confusion is the extent to which a purchased system is generalised or customised. Commercial systems are obviously normally written for general use: some suppliers expect the user to adapt the system to his or her own needs while others offer to make the specific adjustments required by the buyer and include this service in their price. Since the latter systems should be delivered in a "ready to run" form, they are often known as turn-key systems. Turn-key systems tend to be considerably more expensive than a generalised package. The extra investment may, however, be worthwhile: although implementation of some systems should not defeat the intelligent amateur, others are very difficult indeed for a non-programmer to work with. Even generalised systems can offer help to the user in coming to terms with the system by offering on-screen help facilities.

2C. Applications software: cataloguing systems.

There are several different types of interactive system, but most of those offered as solutions to cataloguing problems will be described either as data base or data base management systems or as retrieval, text retrieval or information retrieval systems.

In describing a computer system, suppliers will commonly talk of the handling of information in a vocabulary including the words "fields", "records" and "files".

Fields are the computer equivalent of the "boxes" on a manual catalogue card - each field contains a single item of information so that a film catalogue might have a "Running Time" field, a "Date of First Screening" field, a "Director" field, an "Archive Accession Number" field, etc. For some applications, users might welcome the possibility of using sub-fields and/or group fields. Sub-fields offer divisions of a field, so that a "Cast Credit" field might be divided into sub-fields allowing both the actor's name and the role portrayed to be entered. Group fields, as their name suggests, hold together a coherent set of data fields and are capable of repeating as groups so that the computer will not confuse information from two different sets: for example, a group field might give details of festival screenings, with festival name, date and awards held in a separate group for each festival, or describe the physical characteristics of an archive's holdings of a particular film, with details of the length, gauge, base, sound system etc. of each copy held in a separate group.

Records, to continue the analogy, are the equivalent of the complete catalogue cards - the collection of fields representing a complete set of information about a single catalogue item. For some applications, users might welcome the possibility of using sub-records within the main record, for example to describe individual stories within a single newsreel issue or to describe alternative versions of a single film.
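The relationship between fields, sub-fields, group fields and records can be pictured as a data structure. The Python sketch below is one possible modelling, not the layout of any particular cataloguing package. All class names, field names and sample values are invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CastCredit:
    """One field divided into two sub-fields."""
    actor: str
    role: str


@dataclass
class CopyHeld:
    """A group field: the fields stay together and repeat once per copy."""
    gauge_mm: int
    base: str
    length_ft: int


@dataclass
class FilmRecord:
    """A record: the equivalent of one complete catalogue card."""
    accession_number: str
    title: str
    running_time_min: int
    cast: List[CastCredit] = field(default_factory=list)
    copies: List[CopyHeld] = field(default_factory=list)


# Illustrative record with two copies held by the archive.
record = FilmRecord(
    accession_number="X 0001",
    title="Example Film A",
    running_time_min=12,
    cast=[CastCredit(actor="A. Performer", role="Narrator")],
    copies=[
        CopyHeld(gauge_mm=35, base="acetate", length_ft=1080),
        CopyHeld(gauge_mm=16, base="acetate", length_ft=432),
    ],
)

# Because each copy is a separate group, its details cannot be confused
# with those of another copy.
print([(c.gauge_mm, c.length_ft) for c in record.copies])
```

Sub-records (for example, the individual stories in a newsreel issue) could be modelled in the same way, as a repeating list of smaller records inside the main one.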

A file is a set of records: for example, an archive's catalogue. Precisely how files are stored varies between systems. Some have only a single file but most have at least two, of which the first contains the data entered by cataloguers and the second, third etc. are inverted files created by the computer system to act as indexes to give faster access to the data in the main file. There may be one inverted file covering all fields in the record, or separate inverted files for fields of different types. This type of system is sometimes designated "flat file" to differentiate it from systems like those described in the next paragraph.
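To show what an inverted file contains, the following Python sketch builds one for a single field of a small main file. It is a simplified illustration: the function names and sample records are invented, and real systems differ in how they split text into index terms. The main file maps record numbers to records. The inverted file maps each word to the set of record numbers in which it occurs, so a search need not read every record.

```python
from collections import defaultdict


def build_inverted_file(main_file, field_name):
    """Map each word in the chosen field to the records containing it."""
    index = defaultdict(set)
    for record_number, record in main_file.items():
        for word in record[field_name].lower().split():
            index[word].add(record_number)
    return index


def search(index, word):
    """Return the sorted record numbers indexed under a word."""
    return sorted(index.get(word.lower(), set()))


# Illustrative main file: record number -> record (a dictionary of fields).
main_file = {
    1: {"title": "Harbour at dawn"},
    2: {"title": "Dawn patrol"},
    3: {"title": "Harbour lights"},
}

title_index = build_inverted_file(main_file, "title")
print(search(title_index, "dawn"))
print(search(title_index, "Harbour"))
```

The index has to be rebuilt or updated whenever the main file changes, which is one reason some systems defer indexing to a batch run.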

Another approach is that described as a relational data base. In such systems, the total information requirement is analysed in several different files. An archive, for example, might create a film titles file, a credits file (production and distribution companies, film makers and actors), a file of copyright holders and donors, a file of subject classifications or keywords, and a file of film copies. The equivalent function to "access to the full catalogue" is then provided by the computer searching the different data bases and making links between them as necessary.

Because of the way data is separated into specialised files in relational systems, data storage tends to be more economical and searching more rapid. Another attraction of such an approach is the possibility of maintaining in separate files details which would be impractical to enter in generalised files - for example biographical details of film makers, "scope notes" for subject headings, addresses for distribution companies. It is, however, important to note that, on the whole, micro-computers are not well suited to the full complexities of relational data base usage for large collections of complex items: the hardware tends to limit the number of files that may be opened simultaneously, the total size of each file, etc.
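The relational approach described above can be sketched with Python's built-in `sqlite3` module, used here purely as a convenient stand-in for whatever data base system an archive might adopt. The table names, column names and sample entries are all invented for the illustration. A film maker's biography is stored once in its own file, and the link to each title is made at the moment of enquiry.

```python
import sqlite3

# An in-memory data base, so the example leaves nothing on disc.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE titles  (title_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people  (person_id INTEGER PRIMARY KEY, name TEXT, biography TEXT);
    CREATE TABLE credits (title_id  INTEGER REFERENCES titles(title_id),
                          person_id INTEGER REFERENCES people(person_id),
                          role      TEXT);
""")

conn.executemany("INSERT INTO titles VALUES (?, ?)",
                 [(1, "Example Film A"), (2, "Example Film B")])
# The biography is entered once, however many films the person worked on.
conn.execute("INSERT INTO people VALUES (1, 'A. Director', 'Biography held once.')")
conn.executemany("INSERT INTO credits VALUES (?, ?, ?)",
                 [(1, 1, "Director"), (2, 1, "Director")])


def titles_for(name):
    """Link the three files to list the titles credited to one person."""
    rows = conn.execute(
        "SELECT t.title FROM titles t "
        "JOIN credits c ON c.title_id = t.title_id "
        "JOIN people p ON p.person_id = c.person_id "
        "WHERE p.name = ? ORDER BY t.title",
        (name,),
    )
    return [row[0] for row in rows]


print(titles_for("A. Director"))
```

In a flat file, by contrast, the same biography would have to be repeated in (or omitted from) every record that mentions the person.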

2D. Applications software: system weaknesses

System suppliers will normally fail to mention the weaknesses in a system, but it would be naive to accept any product as totally secure. Even manual systems are vulnerable to poor design (ink that fades, or glue on labels that ceases to stick), to accident (fire or flood), or to vandalism or carelessness (a disfigured or stolen ledger, or a dropped card-index drawer). Computer systems can suffer similar physical problems (system design faults are known as bugs), and may also be vulnerable to hazards which might seem quite trivial to those used to manual systems. Some data bases, for example, can be damaged if there is a power failure or other interruption while data is being entered or amended or an inverted file is being built or restructured. It is important to know which processes place the system most at risk of such data corruption, and what if any safeguards are available.

Two forms of vandalism specific to computers have been in the news: these are "hacking" and "viruses". Hacking is the morally dubious hobby of attempting to gain access to a computer system from which outsiders are supposedly barred: the intention is rarely malicious, but inadvertent damage may result. Viruses are deliberately-written computer programs whose purpose is to disrupt or destroy legitimate programs or files and which are often written to spread (like their medical namesakes) to other computers by replicating themselves into programs or onto discs processed by an infected machine. Archives should be safe from hacking until they become involved in schemes for information exchange involving a wide-area network with modem linkages; the virus threat is more general, and the best precaution is to be strict about forbidding staff to bring in discs from outside (of games, "pirated" software, etc.) to run on the archive computer.

The other vital precaution is to have back-up copies of programs and of essential files permanently available for use in case of damage to the main copies. Systems will frequently offer recovery facilities to lessen the damage resulting from a data base corruption - these may be described as a "salvage or rebuild data base" function. Note, however, that these will seldom be 100% effective in undoing damage, and offer no protection against a serious hardware failure (or crash). Backing-up data - the regular generation of copies of the data base or at least of the "raw" data it contains onto removable disc or tape for storage separately from the computer - is an essential routine precaution against loss or corruption of the main file: if a recent back-up copy is always available, the amount of damage resulting even from a major file corruption can be contained. To be prepared for disaster is a more constructive approach than merely to hope a disaster will not happen.
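A back-up routine of the kind recommended here might look like the following Python sketch. It is an illustration under stated assumptions rather than a recommended procedure: the function names, the file name `catalogue.dat` and the date label are invented, and a temporary directory stands in for the removable disc or tape. The one design point worth noting is that the copy is checked against the original before it is trusted.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path):
    """Return a SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, backup_dir, label):
    """Copy a data file to the back-up location and verify the copy."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{Path(source).name}.{label}.bak"
    shutil.copy2(source, target)
    if checksum(source) != checksum(target):
        raise OSError(f"back-up of {source} does not match the original")
    return target


# Demonstration in a temporary directory (standing in for removable media).
with tempfile.TemporaryDirectory() as workspace:
    data_file = Path(workspace) / "catalogue.dat"
    data_file.write_text("record 1\nrecord 2\n")

    copy = back_up(data_file, Path(workspace) / "backups", "1990-01-31")
    backup_name = copy.name
    backup_matches = checksum(copy) == checksum(data_file)

print(backup_name, backup_matches)
```

Labelling each copy with its date makes it possible to keep several generations of back-up, so that a corruption discovered late can still be undone from an older copy.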
