United Nations Educational, Scientific and Cultural Organization
Second ICSU-UNESCO International Conference on Electronic Publishing in Science
Is electronic publishing being used in the best interests of science? The scientist's view
Steve Berry


Session VI: ISSUES IN ELECTRONIC PUBLISHING
Chair: Graciela Munoz, Universidad Catolica de Valparaiso, Chile

Publication and use of large data sets
John Rumble Jr.,
President, ICSU Committee on Data for Science and Technology (CODATA)
National Institute of Standards and Technology,
100 Bureau Drive, Gaithersburg, MD 20899-2310, USA

Abstract

Scientific information comes in many sizes, types and levels of quality. Because of the diversity of the scientific information being published, different issues will arise in publishing different types of information electronically. In this paper, we will address issues related to electronic publication of large scientific data sets, a subset of scientific information often overlooked in discussions on electronic scientific publications. First, we establish the parameters that define large scientific data sets. Then we identify examples from a variety of scientific disciplines. Large data sets (LDS) require special technology for their creation and management, and that technology is briefly described, as well as traditional publication and use of LDS. We discuss electronic publication of large scientific data sets and their uses as they exist today. Finally, we look into the future of electronic publication of LDS, including issues such as intellectual property rights (IPR), and LDS as a source of new discovery and economics.

Introduction

Before starting, we should define the term publication, both non-electronic and electronic. The Oxford Encyclopedic English Dictionary [1] provides two definitions: (1) the preparation and issuing of a book,…, to the public and (2) the act … of making something publicly known. Both definitions apply to science. In the first instance, scientists formally publish in books and journals. In the second instance, they release information and data to the public, but without the formality of the first instance.

Formal publication includes the peer-review literature, the preprint literature, the report literature, conference presentations and proceedings and depositions in public databases that were part of a formal publication. Publicly released data are simply that, data released by a scientist, or group of scientists, without formal publication of the type just defined. In a paper environment, formal publication of large data sets was virtually impractical. The sheer volume of data made it expensive. The rigidity of printed formats made it difficult for someone else to use the data. Large sets of raw data that supported most formal publications were either archived and then lost, or lost directly. The report literature often included some large data sets, but accessibility and usability were limited.

In an electronic environment, both formal and informal publication become practical. In practice, the volume of an electronic data file associated with a formal publication is almost unrestricted. Publication as part of a formal report or as a publicly released data set now becomes a matter of policy and custom of specific scientific communities. In this paper, we will consider both types of publication: formal publication and public release of large scientific data sets.

Definition of Large Data Sets

A scientific data set can be large in many different ways. Table 1 summarizes some of the possible dimensions.

 

Table 1 – Dimensions of Large Scientific Data Sets

| Dimension | Aspects | Examples |
|---|---|---|
| Density | Data given at many values of one or more independent variables | Data taken every second for 24 hours of a volcanic explosion; data taken at intervals of 0.1 energy units from 0 to 10^6 energy units |
| Repetition | Repeated experiments under the same conditions | 1000s of runs on a particle accelerator |
| Number of experiments | Different measurements for the same property | Fluid properties determined by several techniques; sky survey by visible, radio and x-ray astronomy |
| Number of substances or systems | Measurements on a large number of different substances or systems | Every star; every chemical compound; every species |
| Amount of metadata | Many independent variables | x, y, z positions of every atom in biomacromolecules; 200 variables needed for composite material testing |

In an electronic environment, one can conceive of a simple definition of a large scientific data set, namely establishing a minimum necessary size, say one gigabyte. Obviously, any such definition is arbitrary and in today’s environment, with automated instruments and observation platforms, such a limit is routinely reached in many instances. What is more significant is the complexity of the information, for it is the complexity that complicates collection, management, analysis and publication. Each of the dimensions in Table 1 brings different requirements. Databases that are simply large because of many data points probably never are published per se. Instead, they are most likely analyzed, reduced, fitted to equations or summarized before publication. This might be particularly true, for example, for satellite data.
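To make the one-gigabyte threshold concrete, the following minimal sketch estimates the raw size of one Table 1 example, readings taken every second for 24 hours. The storage cost of 8 bytes per value, the reading of one gigabyte as 10^9 bytes, and the function names are illustrative assumptions, not figures from this paper.

```python
# Rough size estimate for a densely sampled data set.
# All parameter values are illustrative assumptions, not figures from the paper.

SECONDS_PER_DAY = 24 * 60 * 60   # one reading per second for 24 hours
BYTES_PER_VALUE = 8              # assumed: one double-precision number per reading
GIGABYTE = 10**9                 # example threshold, taken here as 10^9 bytes


def dataset_bytes(samples: int, variables: int,
                  bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Return the raw size in bytes of `samples` readings of `variables` quantities."""
    return samples * variables * bytes_per_value


def variables_to_reach(threshold: int, samples: int,
                       bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Return the smallest number of variables whose raw size meets `threshold` bytes."""
    per_variable = samples * bytes_per_value
    return -(-threshold // per_variable)  # ceiling division on integers


if __name__ == "__main__":
    one_variable = dataset_bytes(SECONDS_PER_DAY, variables=1)
    print(f"1 variable, 24 h at one reading per second: {one_variable:,} bytes")
    needed = variables_to_reach(GIGABYTE, SECONDS_PER_DAY)
    print(f"variables needed to reach 1 GB: {needed:,}")
```

Under these assumptions a single measured quantity stays well below a megabyte, and the threshold is reached only when well over a thousand quantities are recorded at once. This is consistent with the point that complexity, rather than raw point count, is what makes a data set demanding.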

Examples of Large Data Sets

Virtually every scientific discipline generates and uses large data sets of one type or another. Table 2 provides some examples. Needless to say, these are just representative of the possible large data sets found in science. With the advent of modern database management, inexpensive mass storage devices and the Web, researchers can capture, store and disseminate virtually any set of measured and calculated data. The value of an unlimited number of large data sets is unclear. This new ability to create and disseminate large scientific data sets informally must be considered in any discussion of the topic.

 

Table 2 – Examples of Large Scientific Data Sets

| Discipline | Examples |
|---|---|
| Astronomy | Sky surveys; star catalogs; opacity calculations |
| Satellite Data | Planetary observations; earth observations |
| Temporal Data | Meteorological data |
| Earth Science Data | Seismic data; motion pictures of volcanoes; river flow |
| Biological Data | Biodiversity surveys; Visible Human Project; museum collections |
| Genomic Sequence Data | Human, animal, plant, one-cell organisms |
| Chemistry Data | Spectral data; chemical structure diagrams; crystal structure; molecular dynamics calculations |
| Large Apparatus Physics Data | High energy physics; synchrotron |
| Geography-Based Data | Land features; population |
| Ecological Data | Land use |

The Technology of Large Data Sets

Anyone dealing with scientific database management quickly realizes that commercial database management systems (DBMSs) have not been designed with scientific applications in mind. Simple scientific information/data such as scientific notation, Greek letters, super- and subscripts, chemical structure diagrams and much more are not routinely accommodated in commercial DBMSs. While the situation is improving, a large body of specialized DBMSs has been developed for chemistry, biology and many other fields.

The situation for large scientific databases is equally unsatisfactory. Additional challenges that large scientific databases present to database management systems include large volumes of data, images and photographs, reuse of metadata with minor changes and unpredictable access patterns. To accommodate these needs, scientists have developed many homegrown database management systems. These systems are often optimized for specific applications, which makes reuse difficult or impossible. Many times, an additional data location facility is added on top of a DBMS to facilitate data retrieval or, in some cases, to support loading of archival data from an offline storage resource.

The special needs of very large databases were recognized many years ago. To support the use of database management technology for large databases, the Association for Computing Machinery (ACM) Special Interest Group on the Management of Data (SIGMOD) initiated a series of conferences on very large databases beginning in the 1970s. This conference series has provided perhaps the most important source of new database management ideas for large databases.

Another manifestation of the recognition of the special nature of large scientific data sets is the network of specialized data centers that has been established worldwide to handle such data. Some have been set up by special facilities such as CERN (European Organization for Nuclear Research) and the Hubble Space Telescope. Others are groups of related data centers such as the World Data Centers, with branches in the United States, Europe, Russia and China. The World Data Centers collect and make available a wide variety of earth observation data, including weather data. Many of the international groups that support land-based telescopes provide centralized facilities for their data collections. The same is true for satellite data generated by NASA, the European Space Agency and the Japan Space Agency. Other examples include large-scale collections of chemical structure data, such as those maintained by Chemical Abstracts (U.S.), JICST (Information Center for Science and Technology of the Japan Science and Technology Corporation) and various commercial vendors, as well as crystal structure and spectral databases. Of growing importance today are the special genomic repositories such as GenBank (U.S.), EBI (European Bioinformatics Institute) and DDBJ (DNA Data Bank of Japan).

One common feature of these activities is the ability to support the development and maintenance of specialized database management software for their applications. Even if these groups utilize commercial systems, they still develop a large amount of specialized software tailored to their needs.

This need for specialized database software has major implications for the electronic publishing of large scientific databases. Electronic publication of such data sets is not primarily meant to support reading or browsing, as it is for full text databases; it must instead support extraction, reuse, manipulation and analysis. Publications of large data sets that do not allow such functionality are of severely limited utility. However, supporting specialized functionality imposes additional economic considerations beyond those of electronic publication of full text. Different IPR issues, such as data extraction and fair use, are also raised. These ideas are dealt with below.

We cannot overlook the fact that significant collections of scientific data already exist in paper format only. Many of these collections have been computerized in recent years, but many specialized collections have not, and these are in jeopardy of being lost. Especially of value are long-term collections of observations of natural phenomena, from weather records to one-time events such as photographs of volcanic eruptions. The records are too valuable to discard, but too expensive to convert into electronic publications. Other important paper data sets are descriptions of museum collections and other natural history collections. Significant resources are now being spent to computerize these notes.

In summary, electronic publication of large scientific data sets requires specialized database management software. The required investment not only provides special capabilities, but also affects data accessibility and usability issues and policies.

Traditional Publication of Large Data Sets

The traditional methods for paper publication of large scientific data sets had several features. First, the publication usually was simply an overview and descriptive article about the collection. Normally only subsets and samples were included in this publication, and even the most complete publications rarely included all available data. However, selection criteria were not always documented. A few journals do publish large tables of data, such as the Journal of Physical and Chemical Reference Data (U.S.), the Journal of Chemical Engineering Data (U.S.) and Atomic and Nuclear Data Tables (U.K.). In addition, some journals allowed for deposition on microfilm or paper of supplementary data tables. However, in reality, very few depositions were ever made. In some areas, especially crystallography and genomics, it has become common practice to require deposition of complete data sets into international data centers as a condition of publication acceptance.

Many different types of institutions have established series of reports, which constitute a significant part of the so-called gray literature. These government, academic, nonprofit organization and even industrial reports often contain large sets of data not available anywhere else. Locating such reports can be a very difficult task, and the possibility that these will be preserved in an electronic era is quite slim.

Traditional Use of Large Data Sets

Given the rather haphazard paper publication of large scientific data sets in the past, it is useful to examine briefly how such data sets were used. Those available through publicly accessible data centers were utilized the most. For serious researchers in developed countries, knowledge of such data sets was an integral part of the research effort. However, for researchers embarking on cross-disciplinary work, outside the mainstream of research, or in less-developed countries, knowledge of these sources was generally minimal and access even more difficult. Today, at the beginning of the year 2001, with the Web being an integral tool of every scientist, it is difficult to remember that even ten years ago, data sets of interest were hard to find, hard to gain access to, and harder yet to use.

These difficulties led to a discernible hierarchy of use: principal investigators at the top; colleagues within the same institutions and co-investigators at other institutions next; then researchers within the discipline; and finally the general scientific community. The drop-off in knowledge and accessibility between adjacent levels was considerable, and science suffered because of that drop-off.

In what specific ways did science suffer? First, existing data were not easily available, and research was done on the basis of incomplete data sets. Knowledge discovery from the existing data sets was limited. Data could not be searched, manipulated, analyzed and exploited because they were not electronic. The detection of long-term trends and the recognition of patterns, which become possible only with complete data sets, were hampered. Perhaps most importantly, these deficits were not always readily recognized, leading to over-confidence and overestimation of the validity of new research findings based on incomplete data sets.

Some Economics Issues on the Electronic Publication of Large Data Sets

Today’s technology for electronic publication of large scientific data sets is well understood. Data are generated, collected, analyzed and stored using a set of interrelated or compatible software. Data files of virtually any size can be made available with relatively little investment via the Web, either as an informal publication, as part of a larger data collection or linked to formal peer-reviewed articles. Large-scale disks and burnable CD-ROMs allow for local storage, rapid access and easy backup. The dream of having a “personal” library not only of all books, but also of all scientific data is not too far away.

That said, why should the topic of electronic publication of large scientific data sets be of interest to a conference on electronic publishing in science? The simple answer is, of course, economics. How does the existence of large scientific data sets change the real or perceived economics of electronic publishing in science? The answer lies in the impact of such data sets on how science is done. Let us set forth some issues and ideas to answer this question.

  • There are no real technological barriers to collecting and making available large scientific data sets. Modern database technology, large scale mass storage and the Web make collection, storage and dissemination of even the largest scientific data sets possible. A legitimate question is: Will scientists challenge present technology in their inevitable quest for more of everything? Of course the answer is yes, but there are few if any discernible barriers in the near (next two decades) future.

  • The economics of generating large data sets through experimentation, measurement and observation remain essentially the same as before. The design, construction and operation of large instruments, experiments and observation programs are only marginally impacted by modern data technology. The bulk of any such investment is still in the people and the engineering. Electronic data collection is integral to any modern program, but still a small fraction of the overall cost.

  • The investment in data management, analysis, and dissemination of large data sets is generally too small to support their full use and exploitation. Further, in many instances, as time progresses from the initial generation of the data set, such investments rapidly fall, thereby putting in jeopardy the long-term availability and utility of such data sets. Very few scientific programs have built in enough support for long-term data management. Yet the value of these data sets, especially of short- and long-term observations of nature or from expensive instruments, may not be achieved quickly. Too often the maintenance of data sets falls into competition with the generation of new results, and data maintenance inevitably suffers.

  • The economics of collecting, preserving and disseminating data generated by large-scale calculational programs are poorly understood and not being studied. Modern computational science has made many advances and is poised to replace experimentation as the source of “measured” data in many areas in the near future. Virtually no effort is being expended in discussing which calculated data should be stored in large data sets, how they should be preserved, and what the economics of such activities are.

  • Both formal and informal publication of large scientific data sets will continue to be important. In spite of the relatively low cost of formally publishing large scientific data sets in comparison with the cost of generating such data, there are still significant costs associated with formal publication. In particular, specialized software is needed, and the variety of large data sets means that the software has to be flexible and extensible. Further, it is very unclear whether users would be willing to pay even enough to cover the incremental cost of delivering large data sets. Episodic users would usually be willing to pay little or nothing. Heavy users would want to acquire a data set only if the cost is reasonable. Unless the market is proven, commercial publishers will be hesitant to provide large data sets as part of their full text electronic publishing efforts. Consequently, the existing paradigm of many informal electronic data publishers will likely continue. To the extent custom and the user community allow, some cost recovery will likely be tried. For the foreseeable future, self-supporting informal electronic publication seems unlikely.

  • Most large scientific data sets have value in the industrial and commercial sectors. Scientists either do not realize or choose to ignore the fact that many large scientific data sets have significant economic value. This leads to competing economic rights and detracts from important principles related to the conduct of science and the accessibility of data. Scientists can no longer ignore this situation and must be involved in formulating long-term solutions that support the best possible science as well as societal gains from improved or increased commerce.

  • The relative roles of various players in linking and distributing large scientific data sets to the electronic versions of traditional scientific publication are unknown and need to be analyzed carefully. For the many reasons discussed above, publication of large data sets has been separate from traditional scientific publishing. However, computer technology can now link these data sets directly to original articles, thereby involving the primary publishing community to a much greater extent than before. In addition, nonprofit organizations and industrial companies can generate and disseminate large data sets as part of their normal output. Other possibilities exist. All involve new strains on the conduct of science. Scientific publishers, both commercial and non-commercial, can limit access to subscribers. Companies can put limitations on the use of data. Scientific communities and organizations need to be proactive in defining the issues and preserving the importance of full and open access to large data sets.

  • Large scientific data sets present new discovery opportunities that will compete with individual experiments, observations and theory in the future. As data sets get more complete and related data sets are linked together, new knowledge discovery and data mining techniques are going to provide important scientific insights and discoveries in the future. In addition, large data sets may facilitate the replacement of measurement with prediction. For example, with a body of 100,000+ critically evaluated mass spectra available, techniques could be developed to predict new spectra and reduce the need for new measurements. Knowledge discovery is a more exciting possibility. Much of science is based on observing something new and unknown and trying to understand what has been observed. Many complex relationships are observable only when a data set is extensive. Perhaps the most visible of these are the linkages between nutrition and human behavior with disease, as discovered through epidemiology. Every area of science – from astronomy to ecology to genomics to chemical structure – has large data sets that are ripe for exploitation. We can anticipate exciting science ahead.

  • The intellectual property rights associated with large scientific data sets are unknown, especially in view of changes such as the European Union Database Directive (1996) and similar proposed legislation in other countries. Copyright law, for all its ambiguity, provides fairly clear guidance on the ownership rights and use of large data sets. Clearly, for data sets to be copyright protected, they have to result from creativity and intellectual effort. Equally clearly, fair use provisions allow scientists to extract and use data for research purposes. However, the EU Directive has created a new class of intellectual property rights, called sui generis, that essentially provides intellectual property protection to compilations that do not involve the creativity required by copyright legislation. Furthermore, fair use exceptions are not made a mandatory requirement in the required national legislation implementing sui generis rights. To date, this Directive has had no visible practical effect, but we can anticipate it will. For example, protecting a collection of genomic data under sui generis seems very possible and would result in much restricted access compared to present practice.

  • Aside from economic considerations, the biggest barrier to the use of large scientific data sets is the lack of data and metadata standards. During the lifetime, possibly quite long, of large scientific data sets, there will be many changes in storage and database management technology. Careful planning and good documentation will make these changes and migrations manageable. However, the lack of common data and metadata definitions and ambiguous data elements will cause great problems unless standards are developed. These standards must be developed by the communities involved both in the generation and potential use of the data. Today, many large scientific data sets are aggregations of data from many sources. They can be constructed and disseminated because copyright provisions are not violated under the provisions of fair use and with the addition of a creative element. However, the EU Database Directive appears to provide new restrictions on making aggregated databases from data covered by sui generis. While this has not been an issue so far, the possibility is clear.

  • (Speculative) As the decades pass, the volume of data will give rise to new data access mechanisms or agents that will take on the responsibility for maintaining and advancing large scientific data sets. As technology changes, volume grows and new data sets must be integrated with older sets. We can look at computing today and ask what we can learn about how we manage software and data now. First, most people do not want to expend much time maintaining their computer. They want new software to work every time and to be easy to install. They want the updates and changes to come automatically and do not want to have to keep track of changes. They want to have the tools to build data sets easily and to use standards if they are easy to use. They do not want the bother of upgrading database management systems, disk space and Web sites to interfere with research duties. Of course, there are some people who delight in this sort of work, and they will evolve into the data providers of tomorrow. The actual form of this business is unknown, but it will develop, perhaps by traditional publishers or perhaps by Web site consolidation (already starting). Scientists should not decry this development, as it will be a natural evolution of libraries, commercial publishers, handbook producers and scientific service providers (similar to instrument makers, etc.).
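The metadata-standards issue raised in the list above can be made concrete with a small sketch. It assumes a hypothetical shared record definition (property name, value, unit) and a hypothetical conversion table to one agreed unit; neither is a standard proposed in this paper. It shows only that values from different sources can be combined safely once the data elements are unambiguous.

```python
from dataclasses import dataclass


# Hypothetical shared definition: every value carries a property name and a unit.
@dataclass(frozen=True)
class Measurement:
    property_name: str  # e.g. "temperature"
    value: float
    unit: str           # e.g. "K" or "degC"


# Assumed conversion table to one agreed unit (kelvin) for temperatures.
TO_KELVIN = {
    "K": lambda v: v,
    "degC": lambda v: v + 273.15,
}


def to_kelvin(m: Measurement) -> Measurement:
    """Return the temperature measurement expressed in kelvin.

    Raises ValueError when the record is not a temperature or its unit is
    not in the agreed table, rather than guessing at an ambiguous element.
    """
    if m.property_name != "temperature":
        raise ValueError(f"not a temperature: {m.property_name!r}")
    try:
        convert = TO_KELVIN[m.unit]
    except KeyError:
        raise ValueError(f"unit not in agreed table: {m.unit!r}") from None
    return Measurement("temperature", convert(m.value), "K")


if __name__ == "__main__":
    # Two sources reporting the same quantity in different units.
    source_a = Measurement("temperature", 298.15, "K")
    source_b = Measurement("temperature", 25.0, "degC")
    merged = [to_kelvin(source_a), to_kelvin(source_b)]
    print(merged)
```

A record whose unit is missing from the agreed table is rejected instead of being silently merged, which is the kind of ambiguity that common definitions are meant to remove.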

Conclusions

Scientific data are often considered the realization of scientific understanding. If a phenomenon can be quantified, it is measurable and understandable. Of course, the understanding comes only after much time, many observations and much misunderstanding. The large data sets that capture scientific experiments, observations and calculations will become only more valuable in the future. As the entire realm of scientific information becomes available electronically, and as the business and economic aspects of the dissemination of scientific information change and evolve in the electronic era, the issues associated with large scientific data sets must be identified, addressed and ultimately solved for science to progress. In this paper, we have outlined some of the issues specific to large scientific data sets. Other issues will arise because this is the beginning of the electronic era, not its maturity.

CODATA, the Committee on Data for Science and Technology, will play an important role in ensuring that the scientific data community has long-term access to large sets of scientific data. By virtue of its international and multi-disciplinary nature, CODATA is ideally suited to keep the focus on what is best for science. We cannot ignore the fact that information is an economic quantity, and that the laws of economics apply even to scientific data. We must work diligently, however, to make sure that the importance of full and open access to scientific data, paramount for scientific progress, continues to be recognized. For society to gain the return on its investments in scientific research, it must ensure that a balance is maintained between preserving future opportunities and reaping short-term economic benefits. In any case, scientists have the responsibility to speak out on these issues.

For example, in October 2000, CODATA and ICSU co-sponsored a Workshop on the European Directive on the Legal Protection of Databases. This workshop featured many excellent presentations from the scientific and legal communities as well as one from the European Community itself. As a result, CODATA and ICSU are working together to prepare the scientific community for the scheduled mandatory review of the Directive.

Large scientific data sets open many scientific possibilities – discovery, new knowledge, understanding, even controlling, nature for the benefit of all living things. The electronic era of scientific information will be exciting.

References

[1] The Oxford Encyclopedic English Dictionary, J. Pearsall and B. Trumble, Eds. (Oxford University Press, New York, 1995)

     

    End of presentation

    New practices for electronic publishing: how to maintain quality and guarantee integrity
    Joost G. Kircz
    KRA-Publishing Research, Prins Hendrikkade 141, 1011 AS Amsterdam
    and Van der Waals-Zeeman Instituut, Universiteit van Amsterdam
     

    1. Introduction

    At present, we are witnessing a change from paper to electronic media for the storage, dissemination and handling of scientific articles. In practice, this often means only a change from one carrier to another. Most electronic publications are simply paper products transposed onto electronic media. Neither the structure, nor the way language is used, is significantly different from earlier practice.

    Nevertheless, we witness a sometimes heated debate on the value of such "electronic documents". In my view, we have to distinguish between documents that look, smell and sound like a paper document but are stored and transmitted by electronic means, and documents that are originally created for an electronic environment, and hence are new animals in the zoo of scientific communications.

    The discussion on the value of electronic documents is often hampered by the fact that one starts from what one is accustomed to in the paper world and attempts to impose that on an electronic environment. The scientific paper as we know it is a paper-based object that obviously can be cast into various technical forms, but intrinsically remains a paper object.

    In order to grasp the impact of the current electronic revolution, as well as being able to set out a policy towards the future, we have to abstract from the presentation form and start with the aims and content of scientific communication before we zoom in on a particular presentation form.

    We have to step back and analyse what it means to write for an electronic medium and what it means to read material that is stored electronically. In a paper world, writing and reading are very close. Writing for an electronic medium requires an understanding of the full capacities of the medium. Reading electronic articles, on the other hand, doesn’t necessarily mean reading from a screen. The presentation becomes flexible! In contrast to paper, the electronic media allow a distinct difference in presentation between the author’s favoured presentation and the consumer’s reading practice.

    An electronic document is not the electronic version of a traditional paper document with embellishments such as hyperlinks, colour pictures and illustrative animations. An electronic document is a document comprising a variety of different types of information presentations that are brought together by an author in order to present a comprehensive scientific argument. Or to put it in other terms: in an electronic publication, images, animations and so on cease to be illuminating illustrations to the text, but are now semi-independent knowledge representations that together with the text comprise the scientific argument communicated to peer scientists.

    In order to develop new insights into an editorial policy that maintains the essential virtues of the paper document as well as incorporates all the new exciting features, I will first discuss the scientific paper as we know it. Subsequently, new ways of expressing knowledge are dealt with. In the concluding section, I try to set out some guidelines for the coming period.

    2. What is a scientific paper?

    For the purpose of this conference, it is not necessary to dwell at length on the coming into being and practice of present-day scientific publishing. The reader is referred to Garvey’s book Communication: the essence of science (Gar79) and the more recent book by Meadows Communicating Research (Mea98) and references therein. A good starting point for our discussion is the report of an International Working Group (IWG99) based on a Workshop organized by the AAAS, ICSU Press and UNESCO and published under the title Defining and Certifying Electronic Publication in Science. A proposal to the International Association of STM Publishers.

    The necessity of a clear understanding of what a scientific publication actually is, is well formulated as: Publication is the hard currency of science. It is the primary yardstick for establishing priority of discovery, making the status of a publication a critical factor in resolving priority disputes or intellectual property claims. Academic tenure and promotion decisions are based in large part on publication in peer-reviewed journals or scholarly books. To make these decisions fairly and with confidence, scientists and their institutions need assurance of what counts as a legitimate electronic publication.

    Thus, the challenge is to ensure that, independent of the technology used, the use and exchange value of this type of currency can be established universally for all participants in the world of science.

    The Working Group proposes a list of minimum characteristics to qualify a document as a "publication". It is worthwhile to confront this list with, on the one hand, the expansion of the concept of a document to all coherent knowledge presentations, whether textual, non-textual or a mixture, and, on the other hand, the list of communication needs presented by Kircz and Roosendaal (Kir96). This list of communication needs reads: 1) awareness of knowledge, 2) awareness of new research outcomes, 3) specific information, 4) scientific standards, 5) platform of communication, and 6) ownership protection. It is immediately clear that scientific communication needs as such encompass a much wider range of interaction between scientists than formal publications.

    The Working Group makes a useful distinction between an informal notification, a first publication and a definitive publication. It recommends four main characteristics that apply to all publications, which we discuss in turn.

    1) Fixation (i.e., the document must be durably recorded on some medium). This demand is obvious, as no communication or debate independent of time and location is possible unless the object of discussion is fixed on some platform of communication, whatever its technical type. The change to an electronic environment implies that the notion of "durably recorded" is under attack. In contrast to the paper world, where we can demand that the paper is printed on acid-free paper according to an official standard, in an electronic environment we don’t yet have any idea what an equivalent form looks like. We have no idea what kind of optical, magnetic or other medium will be selected in the coming years as the accepted standard, and we have no idea what the method of writing on that medium will be. The technology push is such that almost every month we are confronted with a claim for an even better technology. On top of that, we have to expand the notion of a document to objects that are not purely textual, such as images, simulations or other multimedia objects that might be the final outcome of a research programme. We don’t have to think about fashionable computer-game type presentations. The ability of electronic media to treat civil engineering design drawings (not necessarily of a complicated CAD/CAM type) the same way as textual documents is a simple example and already very difficult to tackle.

    This means that the demand for fixedness must be tailored towards a demand for the inalterability of the content of the said object. We have to interpret the demand for fixedness as a demand for a well-defined descriptive standard for the content of the document: a standard that enables the storage and maintenance of the integrity of the information independent of the carrier of that information, be it a clay tablet or a future DNA chip. It goes without saying that the current developments in descriptive languages such as the Standard Generalized Markup Language (SGML) and its successor the Extensible Markup Language (XML) are of the utmost importance. If, finally, all information in a document is properly coded according to such a language, we deal with simple ASCII, or better Unicode, strings that can be handled in all conceivable material memory structures. For integrity reasons, such a file can be endowed with an electronic watermark. For the future user of the once-stored document, only the capability to read it again from the then popular medium is of importance. For the immediate future, an interesting initiative is the NCSA Astronomical Digital Image Library (ADIL), a repository providing astronomers with research-quality images straight from the telescope to their desk over the Web (Pla99).

    2) Public availability (in principle, not necessarily free of charge). This demand is clearly medium-independent and does not need any further consideration here.

    3) Persistence (i.e., it should remain in the same form and at the same location, so that it is reliably accessible and retrievable over time). This point dovetails with the first demand and again we see a mix-up of old and new concerns. The persistence of the work has two aspects: the integrity of the appearance and the completeness of the content.

    Firstly, we have to deal with the problem of the integrity of appearance. This issue is also an important discussion point in the world of the archivist. In many cases, only the content of the information is important, e.g., the figures of a town or departmental budget. In other cases, the visual and textural aspects are essential for an archival object such as an official certification or a signed treaty. It goes without saying that for non-textual material this issue is different than for text, and persistence of presentation form can be crucial. But here, too, we cannot be too conservative: the pictorial presentation of a data set can be essential to spot a peculiar behaviour, but in time more sophisticated presentations might reveal more details. This argument leads us to the notion that in such cases we have to differentiate between the basic non-figurative data and the presentation of them by the original author. Both have to be fixed, as together they form part of the author’s original publication. But the data set must be available separately from the presentation module, so that future authors can use these data, integrate them with new data, or apply new presentation techniques in the publication of new work.
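    The separation argued for in the paragraph above can be illustrated with a minimal sketch in Python. It is only an illustration under stated assumptions: the class DataSet, the identifier "example-dataset-1", the numbers and both presentation functions are hypothetical and do not come from any system discussed here. The point is merely that the fixed data can be rendered by the author’s original presentation and, later, by a different one, without the data being altered.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: the published data cannot be altered
class DataSet:
    """The basic, non-figurative data fixed by the original author."""
    identifier: str              # hypothetical identifier, for illustration
    values: Tuple[float, ...]


def author_presentation(data: DataSet) -> str:
    """The author's original presentation: a plain listing of the values."""
    return ", ".join(f"{v:g}" for v in data.values)


def later_presentation(data: DataSet) -> str:
    """A later presentation of the same data: a crude text bar chart."""
    return "\n".join("#" * round(v) for v in data.values)


# Hypothetical measurements, stored once and presented twice.
measurements = DataSet("example-dataset-1", (1.0, 3.0, 2.0))
print(author_presentation(measurements))
print(later_presentation(measurements))
```

    Because the data object is immutable and independent of both functions, a future author could add a third presentation without touching the original publication.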

    Secondly, we have the aspect of internal integrity and coherence. This is typically an XML issue. This persistence aspect can be covered by the introduction of a complete list or map of contents as an integral part of every document. Not only do the bitstreams of every component of the document have to maintain their integrity, but also the mutual relations between the various components. We also need a mechanism to check that all components are present. This last demand can become a serious problem in the future. More and more documents will be rendered from components residing in different databases. Think about an astronomy article that calls for data from a huge database filled with satellite measurement data. As an electronic publication is, in principle, a modular entity and not an essay (Kir98a), the persistence demand requires that a publication guarantee that all components remain available. This demand is closely linked to the problem of dead hyperlinks.

    All this converges on the discussion of the Digital Object Identifier initiative. The International DOI Foundation was "created in 1998 and supports the need of the intellectual property community in the digital environment, by the development and promotion of the Digital Object Identifier system as a common infrastructure for content management" (Doi00a, Doi00b). The DOI Foundation is supported by almost all major (commercial) publishers and societies. The idea behind DOI is that every item that has an assigned copyright (hence also books) will get a unique identifier. In the course of development, this identifier will be endowed with metadata such as bibliographic information and genre, but also publishers’ information and price. In the first round, as tried out in Crossref (Cro00), DOI is limited to a one-to-one link with the URL of scientific articles in a publisher’s database. In the full implementation, it is envisioned that DOI will also allow choices, e.g., to go to a copy of the identified entity, to a metadata record about the entity, or to an identical copy of the same entity at a different location (mirror site). Adding metadata to DOIs will allow the reader to choose which type of realisation of a particular document is wanted, e.g., a PDF file, an XML file or whatever other storage types are available. It is clear that the DOI approach is a strong attempt to ensure the integrity of information entities, seen not only as intellectual property containers but also as a step towards electronic commerce and trade in intellectual property rights.
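    The map of contents and the check that all components are present, as demanded above, can be sketched as follows. This is a minimal sketch, assuming that each component is available as a byte string and that a SHA-256 digest is an acceptable fingerprint; the function names and the component names "text.xml" and "figure1.dat" are hypothetical. It covers only the integrity of the bitstreams and the presence of the components, not the mutual relations between them.

```python
import hashlib
from typing import Dict, List


def digest(content: bytes) -> str:
    """Fingerprint of one component's bitstream."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(components: Dict[str, bytes]) -> Dict[str, str]:
    """Map of contents: every component name with its fingerprint."""
    return {name: digest(content) for name, content in components.items()}


def check_document(manifest: Dict[str, str],
                   components: Dict[str, bytes]) -> List[str]:
    """Report components that are missing or whose content has changed."""
    problems = []
    for name, expected in manifest.items():
        if name not in components:
            problems.append(f"missing: {name}")
        elif digest(components[name]) != expected:
            problems.append(f"altered: {name}")
    return problems


# Hypothetical document consisting of a text and a data component.
original = {"text.xml": b"<article>...</article>", "figure1.dat": b"1 2 3"}
manifest = build_manifest(original)

# Later, one component has been edited and the other has disappeared.
later = {"text.xml": b"<article>changed</article>"}
print(check_document(manifest, later))
```

    A dead hyperlink corresponds to the "missing" case; a silently edited component corresponds to the "altered" case.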

    A competing scheme for reference linking, emerging from the scientists who are engaged in the world of pre-print servers, is the Open Archives Initiative (OAI). Its goal reads: "The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication." (Oai00). The aim of this initiative is to support those active authors for whom self-archiving solutions are a preferred option. The interoperability between such archives becomes the prime research topic (Som00). Though DOI and OAI approach the problem from two antagonistic philosophical backgrounds, both schemes must in the end meet the integrity and quality demands that are at the basis of proper scientific discourse.

    4) Version control (a bibliographic record must be attached to each version; a set of minimum details is suggested in the document). As long as we talk about a document in its traditional form, version management can be straightforward. However, if a document is a composition consisting of various modules originating from different sources, new schemes have to be developed. This issue will be further dealt with in section 3. The main point is that electronic documents are no longer the smallest exchangeable entities. Many electronic documents and most professional web pages are derived from a variety of dynamic databases. A feature of an electronic document can be that it changes with time (or outside temperature, or stock market index, or rocket launch date). Such an electronic document can be the result of some deep science or engineering advance, and hence be a scientific publication.

    A bibliographic record (metadata) is essential for fulfilment of this recommendation. The issue of metadata that entails much more than the traditional bibliographic information is also dealt with in section 3.

    Depending on the availability and affordability of the technical means, the Working Group recommends:

    5) Authenticity (i.e., versions should be certified as authentic and protected from change). Although nobody will challenge the absolute need for authentication of every published document, we run into problems when we enter the discussion of parts of a document. As noted above, a necessary distinction is made between documents as the smallest units of communication and documents that are built up from various components. We have to understand that there is a difference between re-use and multiple use. In the case of multiple use, the citing author integrates the full body of an "information chunk" into the new work and uses it (see further section 3.2). A good example is the case of pattern recognition. The journals in this field are full of data sets (e.g., in the form of distorted pictures) on the one hand and methods on the other hand. Would it not be much more exciting if we could swap methods and data sets between authors and allow a true comparison between different methods unleashed on the same data set? This would be an extension of the already current practice in some fields, especially astronomy, of tapping data from a common database; see for instance the French astronomical database Centre de Données Astronomiques de Strasbourg (CDA00).

    For a first publication, next to version control, the Working Group recommends:

    6) Notification (the community of one’s peers must be informed as to the version associated with priority claims). This is an obvious though essential demand for the awareness of current and new research and the free and democratic flow of information and knowledge. Notification can be enormously enhanced by electronic tools such as bulletin boards, newsgroups and current awareness services.

    7) Assignment and persistence of a Web address/location for the record. In its document, the Working Group sometimes phrases this point as the need to identify the work unambiguously. This demand is an obvious call for retrievability. Above we already discussed the DOI and OAI programmes that try to tackle this point. The DOI Foundation especially is working on this issue, as its main goal is the identification and subsequent handling of intellectual property. The problem is that the persistence of a unique identification code can never be linked to a unique URL. It is much better to ensure a unique code per item, allowing that item to flow from database to database, provided that those databases have a searchable index capable of understanding the grammar of the unique identification code.
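    The idea of one unique code per item, with the item free to flow from database to database, can be sketched as a small lookup table. This is a hypothetical illustration only: the class IdentifierRegistry, the identifier "item:0001" and the example addresses are assumptions made for this sketch and do not represent the DOI system or any existing resolver.

```python
from typing import Dict, List


class IdentifierRegistry:
    """Maps a persistent identifier to the current locations of an item."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, identifier: str, location: str) -> None:
        """Record that a copy of the item is held at the given location."""
        known = self._locations.setdefault(identifier, [])
        if location not in known:
            known.append(location)

    def move(self, identifier: str, old: str, new: str) -> None:
        """The item moves to another database; its identifier is unchanged."""
        known = self._locations[identifier]
        known[known.index(old)] = new

    def resolve(self, identifier: str) -> List[str]:
        """All locations currently known for the identifier."""
        return list(self._locations.get(identifier, []))


# Hypothetical identifier and addresses.
registry = IdentifierRegistry()
registry.register("item:0001", "https://archive-a.example/0001")
registry.register("item:0001", "https://mirror-b.example/0001")
registry.move("item:0001",
              "https://archive-a.example/0001",
              "https://archive-c.example/0001")
print(registry.resolve("item:0001"))
```

    A citation that stores only "item:0001" stays valid after the move, whereas a citation that stored the first address would now be a dead link.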

    Another issue here is of a more archival nature, namely that of every serious document at least one copy must be stored safely in an archive. This is an important and strong demand in a period where paper is on the way out and a plethora of digital media, each with its own way of handling data, emerges. This point is closely related to the first point on fixation. It is also closely related to the metadata issue. It implies that a central organization such as the Library of Congress in the USA must establish a legal deposit of all items, with unique storage codes, for ever.

    8) Commitment not to withdraw (authors must agree, prior to commencing the selection process, that they will not delete the document from the electronic literature). This recommendation is a clear statement intended to keep lots of free-floating drafts, and worse, out of the main stream of public scientific discourse. In practice this will be a very difficult issue, as in some fields people dump almost everything on their home web sites and feel free to send all drafts to pre-print servers. A problem arises when a second version of a draft has a slightly different title and a different number or order of authors. To impose strict adherence to version control and a commitment not to withdraw the final draft, i.e., the version open for peer review, will be very difficult, as the correction of a typo or a number, or the addition or deletion of a reference, can be important, as has already been proven in the paper world. A solution might be that an erratum is permanently linked to the original instead of being stored separately as in paper journals.

    For the definitive publication, the Working Group recommends, alongside persistence, version control, and assignment and persistence of a web address:

    9) Quality control (vetted to ensure quality), in order to maximise usefulness for science and to establish a high level of trust among readers. With this issue, we enter the essence of quality control and the heated current debate on peer review. It is not the purpose of this contribution to discuss the various possible peer review schemes in detail. The literature on this issue is abundant, ranging from full-scale books analysing a particular journal such as "Angewandte Chemie" (applied chemistry) (Dan93) to regular contributions on the pros and cons of double-blind refereeing, nepotism and sexism in peer review, and so on. An important new aspect is the self-publishing current in science. Here, new schemes for refereeing are regularly discussed in several internet lists and discussion forums such as the September forum (Sep98), and by individual protagonists such as the cognitive psychologist and editor of the e-journal Psycoloquy, Stevan Harnad (Hrd00).

    Out of all this discussion, one thing becomes crystal clear, namely that the issue is very much domain-dependent. Whilst in theoretical physics the pace of research is such that every new idea is immediately broadcast via pre-print servers, although often after being internally peer reviewed by the researcher’s institute, in more experimental fields the tempo is more relaxed. After all, it is easier to steal an idea than to redo an experiment. In medicine, the question is intrinsically more sensitive, as new medical information is often rocketed to high levels of public fantasy. In this field, the discussion on ethics and misconduct is a permanent concern (Hud00). For a recent review on the domain dependency of refereeing in e-journals, see Weller (Wel00).

    10) Commitment to archiving and long-term preservation. For this point, the same holds as for the persistence point. However, the issue of long-term preservation is, more than all other issues, one of current concern. Within the archivist’s world, an enormous effort is being made to design protocols for long-term storage. As mentioned above, the problem can be split into storage of the digitised content and storage of the textural and visual appearance as well. One important ingredient in this discussion is the scheme of Jeff Rothenberg, in which he proposes to store, next to the information item itself, also the software programs used, including the operating systems (Rot99). This very intriguing so-called emulation scheme is severely under attack from XML aficionados. A less fundamental but directly applicable scheme is discussed in the "Draft recommendation for space data system standards: Reference Model for an Open Archival Information System (OAIS)". This scheme allows the storage of heterogeneous information. Here, too, astronomy and space research take the lead, as in these large fields much information is already available only electronically (CCS99).

    In the above, we critically discussed the recommendations of an International Working Group. As we have seen, the most visible tension between the very reasonable recommendations and the electronic publication, is that an electronic publication is not a paper publication stored in a different medium. In the next section, I will dwell on the unique features of electronic publications, sharpening the argument for publishing standards, which will be summarized in the last section.

    3. Towards an understanding of electronic publications

    As already indicated in the above discussion, before we arrive at new guidelines for electronic publishing, we need a full understanding of the differences between traditional paper documents and electronic documents. This means that we have to abstract from the currently accepted practice of scientific communication in order to define societally and scientifically acceptable rules of conduct and then apply them within the context of a new environment.

    The abstract notions of the International Working Group are, of course, fine; however, the problem is in the implementation. This implementation demands a better grasp of what electronic documents are. For precisely that reason, we try in the present section to make an advance on this issue in order to specify recommendations in the final section.

    3.1. The most notable feature of electronic publishing is the integration of text, image, sound and simulations

    The greatest step forward in scientific communication is that we are now able to use one carrier for all possible expressions of scientific knowledge. By translating knowledge into a binary code, we create a mono-medium that allows us to integrate all kinds of representation. It thus becomes immediately clear that text will play a less prominent rôle in the future. Although language will remain the essential transfer mechanism for knowledge exchange, non-linguistic communication will regain some of the prominence lost since the written language enabled scientific communications to emerge, independent of place and time.

    In the electronic future, stills and moving pictures, sounds, simulations and soon also tactile information can be exchanged and experienced, hence analysed and interpreted by different people separated by time and place (Kir98b). This means that a genuine electronic document will be a composition of text, images, sounds, animations, etc. All these components of the electronic document must adhere to quality and integrity standards. Thus, within the law of proper scientific discourse, all knowledge presentations are equal. To continue this political metaphor, we can say that we certainly need a diversity policy, to replace the period of positive discrimination of text only.

    This is not the place to dwell at length on the differences between intuitive understanding by means of non-textual stimuli and scientific understanding through linguistic reasoning, but we must accept the tenet that non-textual components will play a central rôle in the electronic document of the future.

    In order to create an environment in which all this can be organised in a meaningful way, the first conclusion is that, in the first approximation, we have to consider all the various components as independent but interacting objects. This will lead to a modular approach of information.

    3.2. The next most notable feature of electronic publishing is multiple use

    In a traditional environment, an author refers to an earlier author and cites part of the original work by referring to the original work, quoting a short or long part of it, or paraphrasing some of the text. This is a typical paper-based process, as it relieves the new author of the need to copy extensively from an already existing text. Only in the case of images, and then often only in review papers, do authors sometimes incorporate a full illustration from another article. In standard publishing practice, the author first requests permission from the original author and subsequently the publisher requests permission from the original publisher.

    However, in an electronic environment, introducing already existing information into a new work is trivial. This is exactly the reason why the concept of modules is so crucial. In order to keep the integrity of the original work, introducing a module into a new work means introducing a complete module.

    The difference between quoting and multiple use is that in multiple use, the new author can rely on the completeness and integrity of the original module. Hence, if a new work needs a description of a machine, of the working of a medicine, or of a mathematical proof, reference to another work takes on a new dimension. Now we can seamlessly introduce the existing text into the new work. The old work doesn’t have to be located in a library elsewhere; the electronic network allows us to insert this information right where it is needed.
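    Multiple use as described above can be sketched in a few lines of Python. This is a minimal sketch, assuming that modules are stored once in a shared store and that a new work is a sequence of the author’s own text and references to complete modules; the names Module, Include and render, and the example identifiers and texts, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Module:
    """A complete information unit, stored once and never altered."""
    identifier: str
    author: str
    content: str


@dataclass(frozen=True)
class Include:
    """A link that brings a complete module into the citing work."""
    identifier: str


def render(parts: List[Union[str, Include]], store: Dict[str, Module]) -> str:
    """Assemble a new work from the author's own text and included modules."""
    lines = []
    for part in parts:
        if isinstance(part, Include):
            module = store[part.identifier]
            # The module is inserted whole, with credit to its originator.
            lines.append(f"{module.content} "
                         f"[from {module.author}, {module.identifier}]")
        else:
            lines.append(part)
    return "\n".join(lines)


# Hypothetical store with one existing module, reused in a new work.
store = {"mod-17": Module("mod-17", "A. Author", "Description of the device.")}
new_work = ["We used the following apparatus.", Include("mod-17"),
            "Our measurements are reported below."]
print(render(new_work, store))
```

    Since the included module is taken from the store unchanged, the new author neither retypes nor truncates it, which is the difference from quoting.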

    This means that a module must be compatible with use in different environments: a link no longer merely points to relevant information elsewhere, but now transports information located elsewhere into the present work.

    3.3. Modularity as model for electronic documents

    The idea of modularity as the next step in scientific communication (Kir96, Kir98a) is further developed by Harmsze (Har00).

    Harmsze proposes a new structuring of scientific articles in modular form. A module is defined as a "uniquely characterised, self-contained representation of a conceptual information unit aimed at communicating that information". This means that a module is a textual, pictorial, or other representation of an amount of information that in itself is sufficiently comprehensive to convey meaning to a reader. Note that neither length nor size enters the definition of a module. Although Harmsze deals mainly with modules that comprise coherent texts, the model is perfectly able to integrate non-textual modules as well. In the model, a distinction is made between elementary modules and complex modules. Depending on the purpose, elementary modules can be merged to form complex modules, just as atoms bind to form molecules. Two types of such "bounded" complex modules can be distinguished.

    a- A compound module is a complex module that is an aggregate of (elementary or complex) constituent modules.

    This is the case if the complex module itself again represents "uniquely characterised, self-contained" information of a new kind. An easy example is the complex module that describes a measuring device and consists of a series of other modules comprehensively describing more-or-less independent components such as the cooling, the memory, the housing, etc.

    We can compare such a compound module with a chemical molecule that is unique in itself, but can be analysed as a set of bound molecules and atoms.

    b- A cluster module is a complex module that focuses on a single concept which is a generalisation of the specific concepts dealt with in the (elementary or complex) constituent modules.

    In this case, the complex module hosts a multiplicity of the same kind of information. An easy example is the complex module of a set of PET scans of a particular part of the brain recorded from various patients. Every scan is a module in itself, with its own specific metadata. The complex module disregards the specific, e.g., the patient’s name, and concentrates on the common aspects.

    We can compare this kind of complex module with the chemical example of a cluster, where we have many identical atoms weakly bound together.
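    The distinction between elementary, compound and cluster modules can be sketched as a small data structure. This is a minimal sketch and not Harmsze’s formal model: the class Module, the helper functions and the example metadata are hypothetical. In particular, letting a cluster keep only the metadata shared by all of its constituents is an assumption made here to mimic "disregarding the specific".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    identifier: str
    concept: str                                  # what the module is about
    metadata: Dict[str, str] = field(default_factory=dict)
    constituents: List["Module"] = field(default_factory=list)

    def is_elementary(self) -> bool:
        """An elementary module has no constituent modules."""
        return not self.constituents


def compound(identifier: str, concept: str, parts: List[Module]) -> Module:
    """Aggregate of different constituents forming information of a new kind."""
    return Module(identifier, concept, constituents=list(parts))


def cluster(identifier: str, concept: str, parts: List[Module]) -> Module:
    """Collection of like constituents; only their shared metadata is kept."""
    parts = list(parts)
    common: Dict[str, str] = {}
    if parts:
        common = {key: value for key, value in parts[0].metadata.items()
                  if all(p.metadata.get(key) == value for p in parts)}
    return Module(identifier, concept, metadata=common, constituents=parts)


# Hypothetical compound module: a measuring device built from components.
device = compound("dev-1", "measuring device",
                  [Module("c-1", "cooling"), Module("c-2", "memory")])

# Hypothetical cluster module: scans of one brain region, several patients.
scans = cluster("scans-1", "PET scans", [
    Module("s-1", "PET scan", {"region": "hippocampus", "patient": "A"}),
    Module("s-2", "PET scan", {"region": "hippocampus", "patient": "B"}),
])
print(scans.metadata)
```

    Each scan keeps its own specific metadata inside the cluster; only the cluster’s own record drops what differs between the constituents.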

    Modularity allows for selected reading paths so that modules can be skipped or emphasised, depending on the reader’s wish, expertise or level of understanding.

    Please note that we store information units only once! The bottom line is a set of SGML-coded objects that change their appearance according to the document style demanded by the presentation medium.

    Unfortunately Harmsze’s approach is not the end of the analysis. If we discuss multiple use, we also have to incorporate other granularities of information as well, even down to a single number.

    At all events, whether full modules or single data, all items must be identifiable as unique entities in a database. This means that all coherent objects must carry inseparable metadata with them.

    3.4. Relations as information objects

    After having defined the electronic document as a collection of independent information units or modules, the next obvious step is to tackle the mutual relationships between these modules. As a database approach does not necessarily mean that we deal with one physical storage device, but rather that the database objects can be distributed world-wide, it is logical to concentrate on the establishment of a system of relationships that not only connects the modules but immediately defines the type of connection as well.

    It is crucial in the following to realise that links are considered to be anchored on both sides, source and target, and can be traversed back and forth. This means that, e.g., the characterisation "section" in one direction indicates "belongs to" in the other direction. This is technically still a tedious problem, but within the XML environment good progress is being made (XML99).
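    A link that is anchored on both sides and carries a type in each direction can be sketched as follows. This is a minimal sketch: the class Link, the function links_from and the identifiers are hypothetical, and the table of inverse relation names contains only the "section" / "belongs to" pair mentioned above. It is not an implementation of any XML linking standard.

```python
from dataclasses import dataclass
from typing import Dict, List

# Each relation type and the name it takes when traversed backwards.
INVERSE: Dict[str, str] = {"section": "belongs to", "belongs to": "section"}


@dataclass(frozen=True)
class Link:
    source: str      # identifier of the module the link starts from
    target: str      # identifier of the module the link arrives at
    relation: str    # type of the link, read from source to target

    def reversed(self) -> "Link":
        """The same link traversed from its target back to its source."""
        return Link(self.target, self.source, INVERSE[self.relation])


def links_from(node: str, links: List[Link]) -> List[Link]:
    """All links touching a node, each expressed as seen from that node."""
    seen = []
    for link in links:
        if link.source == node:
            seen.append(link)
        elif link.target == node:
            seen.append(link.reversed())
    return seen


# Hypothetical article with one module marked as its section.
links = [Link("article-1", "module-3", "section")]
for link in links_from("module-3", links):
    print(link.source, link.relation, link.target)
```

    The link is stored once, yet a reader arriving at either end sees the relation type appropriate to that direction.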

    In research, part of the work is to relate previously unrelated scientific findings within a new context. In a modular environment, this process can be enhanced. The way to do this is by naming hyperlinks in such a way that the reader knows why a link is being suggested by the author. At present, we have no clue as to why hyperlinks are added; we can only find out by clicking on them. In a structured environment, we know what the reason for a link is and we can decide to follow it or not. This brings us to the tedious discussion on hyperlink taxonomies or typologies.

    Unfortunately, very little has been published in the literature. Most of the initiatives are attempts towards a more-or-less complete list of possible notions (tags). In some works, a distinction is suggested between structural/organisational relations and rhetorical or discourse relations. Our feeling is that in a distributed database environment, we have to start with a clear differentiation between at least two, and maybe three, categories of relations:

    a) Organisational relations, describing the structural relationship of modules, e.g., hierarchical relations such as "part of".

    b) Discourse relations, describing the reasoning, such as argument for/against, example, clarification. The discussion on this issue is ongoing and part of current research (Har00, Kir00 and references therein).

    c) Context relations, describing the context in which a certain relation is valid. Obviously, the structure of this last category might be domain-dependent.

    4. Conclusions

    One goal is to establish a clear and transparent understanding of what we mean by a scientific contribution, how we guarantee quality and integrity, and how we value the intellectual ownership of its originator. In this contribution, I have tried to critically evaluate the notion of a scientific document in an electronic environment. The result of my discussion is that we have to step back from the accepted practice of paper journals, without, however, abandoning the societal and scientific morals vis-à-vis quality and integrity. People can, much more easily than in the past, cut and paste from each other’s works. This dynamic cooperation has to be accepted and appreciated as an advancement in communications.

    Instead of trying to curb history by conservative approaches, as some publishers try to enforce with their refusal to allow authors to post their own papers on their web site, we have to be forward-looking.

    The conclusion so far is that we face a transition in which the traditional journal article will cease to exist. This means that we have to reformulate our notions about scientific documentation. In my view, which I defend in this contribution, we have to go for a distinctly different granularity of information units than the traditional paper article allows.

  • If we define modules as conceptual units, we can apply strict rules about quality. At present, a scientific article is peer-reviewed without any discrimination between the various kinds of information in it. In a world of well-defined modules, the refereeing standard for a module Method will be distinctly different from that for a module Data-acquisition. Thus, quality control will go up.

  • If all modules are endowed with a set of metadata that clearly identifies the author and time of creation, integration of a module in another work is automatically taken care of, with due credit being given. The DOI approach is promising in this respect. Of course, people can always retype, steal and add fraudulent data, but misconduct is a social problem and not a scientific one.

  • Another interesting new outcome of this analysis is that relations, which express themselves in hyperlinks, become information objects in their own right. As relations in an electronic environment can be typed, they become objects with metadata. Thus, we have to add the bibliographic information of the originator and a time stamp. This way, the minimum scientific publication becomes the brilliant insight of a researcher who connects two separate information units by a typed link, without any further business.
  • For documents that are built from available and new modules, we will have two levels of authentication, one on the level of each module and the other on the level of the complete new work.

  • A modular publication will have a list or map of contents with links to all components, as well as a new kind of abstract that reflects the content of all modules and serves as an orientation tool in the hypertext environment. The integrity of the publication depends not only on the completeness of the information but also on the overview and on a description of the mutual relationships between the components.

    Therefore, the lesson of this contribution is that electronic media enhance the integration of textual and non-textual knowledge representations, enabling a proper conceptual separation between various kinds of knowledge and therefore allowing for more specific refereeing. The flip side of these new capabilities is that we have to develop a stable system of domain-dependent metadata for modules and relations, which will steer the logistics and storage of these modules and relations.

    We can think back wistfully to the stable situation of established peer-reviewed journals that we built over the last century; however, the unknown is the object of science, and we are entering a new and unknown phase in scientific communication. Therefore, we have to make sure that our societal and scientific demands for quality and integrity are not confused with the latest fashion in technology. Technology is enabling us to expand scientific communication into a serious mix of textual and non-textual components. For most of the non-textual components we do not even have good insight into what the quality standards are. Like all real advancement in science, the development of scientific communication will go through experimental phases. From the analysis of these experiments we will be able to develop new standards and rules. It is a matter of the highest importance that the scientific community takes this experimentation seriously and does not bend to conservative forces that try to restrict developments to the known and established practices of the paper world.
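    The modules and typed relations described in the list above can be pictured as small records that each carry their own author and time stamp. The following is only a minimal sketch under stated assumptions: the class names, the field names, the identifiers and the relation type "obtained-with" are hypothetical and are not taken from any existing metadata standard or from the DOI system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Metadata:
    """Minimal provenance: who created the object and when."""
    author: str
    created: datetime


@dataclass(frozen=True)
class Module:
    """A conceptual unit such as 'Method' or 'Data-acquisition'."""
    identifier: str  # hypothetical persistent identifier
    kind: str
    meta: Metadata


@dataclass(frozen=True)
class TypedLink:
    """A relation between two modules, treated as an object with its own metadata."""
    source: str
    target: str
    relation: str    # hypothetical link type
    meta: Metadata


def credits(modules, links):
    """Collect every author to be credited in a work built from modules and links."""
    names = {m.meta.author for m in modules} | {l.meta.author for l in links}
    return sorted(names)


stamp = datetime(2000, 1, 1, tzinfo=timezone.utc)  # illustrative time stamp
method = Module("mod-1", "Method", Metadata("A. Author", stamp))
data = Module("mod-2", "Data-acquisition", Metadata("B. Author", stamp))
link = TypedLink("mod-2", "mod-1", "obtained-with", Metadata("C. Author", stamp))

print(credits([method, data], [link]))
```

    In this sketch the author of the link is credited alongside the authors of the two modules, which mirrors the idea that a typed link alone can count as a contribution.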

    End of presentation

    The Importance of Aggregators
    Simon Inger, CatchWord Ltd., UK

    Introduction

     It seemed strange to me that, when I was invited to write this paper for the second ICSU conference on electronic publishing, the organisers wanted to categorise it amongst papers on secondary publishing. Much of what I am to say seems so close to the core businesses of today’s journal publishers that one tends to consider it to be primary publishing. But not very long ago the only kind of aggregator in existence was the kind that licensed full text content from primary publishers and sold it as a collection to libraries and researchers – a truly secondary publishing function.

    In reassessing the role of the aggregator, one needs to look once again at the basic concept of aggregation and uncloak some of the varied business models now somewhat misleadingly known as aggregation models. The term ‘aggregation’ has become too widely deployed in the electronic publishing sector.

    Is it an aggregator?

    So what is this aggregator aggregating? Users take the term aggregator to mean an aggregation of full text content. But the companies collectively termed aggregators today range from those that aggregate full text on a selective basis, organised by subject, to those that simply provide a non-selective hosting service for full text publishers, to those that aggregate abstracts and metadata. More subtly, there are those that just aggregate links to full text. Grouping all of these functions under one collective term has caused considerable confusion.

    In the beginning, back in the days when compact disc aggregations of material appeared to be the answer to every collection-developer’s dreams, the new term ‘aggregator’ meant just one thing: a company that licensed content from primary publishers to create a single collection of information available for purchase en masse. As these same companies have moved into the internet age, they have made their collections available online, in many cases before the primary publishers’ own online offerings, and as a consequence they now present a significant threat to the primary publishers themselves.

    A closer definition of aggregators

    There are three clear, distinct classes of company that have become "aggregators" in this new world. Firstly, there are those companies whose primary focus is to provide a hosting service for publishers – the content hosts. Secondly, there are those that index or categorise disparate content on other content host services – the gateways. And lastly, there are the "traditional" aggregators of licensed full text content – the full-text aggregators.

    The role of content hosts

    The role of the content host should be primarily a service to publishers. This is the role occupied by CatchWord, Highwire Press, Allen Press, American Institute of Physics, the hosting services of Ingenta and so on.

    The fundamental business model of a full text hosting service provider dictates that its primary revenue stream will come from the services it provides to publishers. Primary publishers simply pay content hosts for the services that they need. This means that, in general, these companies cannot afford to be selective about the subject area (or indeed quality) of the content that they host.

    The exception to this is Highwire Press. It has made a success of cornering one subject area and has surrounded it in a mantle of exclusivity that made many publishers believe that it was the only sensible place to host their content. This fact, coupled with the apparently exclusive nature of its client base, limited to society and university presses, has further enabled the host to concentrate on the subject areas that it wished to target and not those that target it.

    The lack of selectivity of the majority of content hosts has not impinged on their success. On the contrary, their size alone ensures that gateway services and libraries alike place these organisations high on their lists of services that they need to link to or index.

    The role of gateways

    This role is the one that almost all of the major subscription agents of the world embarked upon from late 1995 as their model for inclusion into the electronic journal market. Latterly, this is also the territory of the abstracting and indexing companies, such as Cambridge Scientific Abstracts, Silverplatter and ISI. This is also a goal of some of the dot com companies in our market, e.g. TheScientificWorld, although many of them seek to make significant document sales from this linking operation as opposed to subscription sales for the subscription agents.

    The gateway is a large collection of links to publishers’ full text content. The gateway does not host the full text. The gateway does own (or at least accumulate) information about the full text, usually an abstract and other key "header" information, such as author, article title and other standard article metadata. It uses this information to provide its users with an adequate set of information to support a browsing function and a searching function, but not inclusive (generally) of searching the full text.

    Another distinguishing feature of many gateway services is that these gateway companies know the access rights to the content that they index. They achieve this either by being a subscription agent, or by acquiring the subscription information from the publishers whose content they index and link to, or having obtained this information by a combination of inputs from publisher and library, e.g. OCLC FirstSearch ECO. This provides the end user with a certainty of being able to access content presented to him after having done a search.

    Abstracting and indexing companies, however, tend not to acquire this subscription information, but rely on their clients, usually libraries, to indicate the titles (rather than title combined with year) that their users are allowed to see. This is the route taken by ISI, for example. While this method does not provide the gateway with irrefutable information on access rights, it is extremely likely that the library will set up appropriate access for its users, even if it means that those users might be challenged for payment before being able to view the resource.
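    The difference between the two ways of knowing access rights can be made concrete with a small sketch. It assumes, purely for illustration, that a gateway holds one header record per article, that a subscription agent records rights per journal and year, and that an abstracting and indexing service receives a list of titles only; the record fields, function names, journal names and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    """Article metadata held by the gateway; the full text stays with the host."""
    journal: str
    year: int
    author: str
    title: str
    abstract: str
    url: str  # link to the full text on the publisher's or host's site


def search_headers(headers, term):
    """Search title, author and abstract only; the full text is never searched."""
    term = term.lower()
    return [h for h in headers
            if term in f"{h.title} {h.author} {h.abstract}".lower()]


def agent_can_access(subscriptions, header):
    """Subscription-agent style: rights are known per journal and year."""
    return (header.journal, header.year) in subscriptions


def library_declared_access(titles, header):
    """A&I style: the library lists titles only, so the year is not checked."""
    return header.journal in titles


headers = [
    Header("Journal A", 1999, "Smith", "Link resolution", "On linking.",
           "https://example.org/a1"),
    Header("Journal A", 1995, "Jones", "Early linking", "Older work.",
           "https://example.org/a2"),
]
hits = search_headers(headers, "linking")
subscriptions = {("Journal A", 1999)}

print([agent_can_access(subscriptions, h) for h in hits])         # [True, False]
print([library_declared_access({"Journal A"}, h) for h in hits])  # [True, True]
```

    The second result shows why title-only information is not irrefutable: the 1995 article is reported as accessible even though, in this example, only 1999 is subscribed.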

    Having said all that, many libraries are themselves becoming gateways. The simple fact that no commercial gateway normally facilitates access to all of the content that any one library requires means that the world’s larger libraries, particularly those in the USA, are creating library web pages that link to all their subscribed electronic content or, in some cases, are able to enhance their web-based catalogues to link directly to the full text resources. For example, the University of Massachusetts Medical School has set up web pages detailing all of the titles that students on and off campus can access via the proxy server of their networked site. Indeed, the creation of such library web pages, or library portals as they are increasingly known, has become part of the business model of companies like Ingenta.

    Furthermore, libraries are supported in their quest to be the most appropriate gateways for their students and researchers by a very important initiative called SFX.

    In a traditional gateway environment the gateway provides the user with his first port of call only, since once a user has made a search, located an article or title and navigated his way to the full text, there is a significant likelihood that his next click of the mouse will take him to a text referenced by the article, probably on a publisher’s web site or within another aggregation. The gateway has lost control of its user and indeed, worse still, the user may now be approaching content through the web pages of a rival service.

    SFX, however, allows a library, and hence a library portal, to override the linking suggested within a document and replace it with something more appropriate for its user. In other words, the user may be re-directed not to content on a publisher web site, but perhaps to a manifestation of that content through the library’s chosen gateway or in a locally held full-text aggregation. Whatever the mechanics, the crucial element is that library gateways are able to retain the user as he browses and navigates from one document to another. This currently puts libraries in a uniquely strong position in the provision of information portals to their clients.
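    The idea of overriding an embedded link can be sketched in a few lines. This is not the SFX interface or its data model; it is a hypothetical illustration in which a library keeps an ordered list of preferred targets and falls back to the link embedded in the document when none of them holds the cited journal. All names and URLs are invented for the example.

```python
def resolve(citation, library_targets, default_url):
    """Return the library's preferred copy of a cited article, else the embedded link."""
    for target in library_targets:
        if citation["journal"] in target["journals"]:
            return target["base_url"] + citation["id"]
    return default_url


# Ordered by the library's preference (hypothetical services).
library_targets = [
    {"name": "local full-text aggregation", "journals": {"Journal A"},
     "base_url": "https://library.example.edu/local/"},
    {"name": "chosen gateway", "journals": {"Journal B"},
     "base_url": "https://gateway.example.com/article/"},
]

citation = {"journal": "Journal B", "id": "b-42"}
print(resolve(citation, library_targets, "https://publisher.example.com/b-42"))
```

    Because the library decides the order of the targets, the user stays within services the library has chosen, which is the control described above.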

    The table below describes the key differences between the major classes of e-journal gateway providers.
     

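    Type of gateway                   | Search Headers | Search Full Text | Know Permissions | Maintain Control of User | Coverage                                      | Examples
    ----------------------------------+----------------+------------------+------------------+--------------------------+-----------------------------------------------+-------------------------------------------------
    Subscription Agent, some Dot Coms | Yes            | No               | Yes              | No                       | Multi-disciplinary, partial and non-selective | SwetsBlackwell, Rowecom IQ, OCLC FirstSearch ECO
    A&I Company, other Dot Coms       | Yes            | No               | No               | No                       | Single discipline, selective                  | ISI WOS, CSA, TheScientificWorld
    Library                           | Yes            | No               | Yes              | Yes                      | Everything selected (by definition)           | All libraries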

    The role of full-text aggregators

    The full-text aggregator, as the name suggests, is a company that creates databases of full text articles, defined by subject area and sold as a single product, rather than as individual subscriptions to components of the database.

    Whilst the companies in this sector were the first to be called aggregators, increasingly their business model is causing some difficulty for primary publishers. On the one hand these companies provide a valuable service to the internet-inexperienced publishers. They can take input from as little as scanned images of the printed text and make an online product quickly and easily. But on the other hand, their very presence has the potential to limit the growth of the primary publishers as they begin to experiment with flexible pricing to "new" markets for their material. A new market for a small publisher would quite likely include the very institutions to which the full-text aggregator had been selling the publisher’s content. In that sense, many new opportunities are stifled since the aggregator has "got there first".

    There are many companies that provide full-text aggregations. Perhaps the best known of these are Adonis, Ebsco Publishing and UMI (Bell and Howell). The business model is fairly straightforward. The full-text aggregator licenses content from its owner and pays a royalty for the sale of the content to libraries and information centres. This has the benefit of allowing the generation of some income for journal publishers from those libraries that would not normally pay a full subscription price.

    However, herein lies a problem. The very concept of a subscription price is being challenged as primary content is placed on the internet. In the print medium a journal subscription has a list price, occasionally with some small variations based on institutional type or regional considerations. But in the internet subscription age, an ever-increasing number of publishers price their content on a case-by-case basis. In other words, they are creating a market for their subscriptions within the very institutions that would have purchased their content via a full-text aggregator. It would be quite conceivable for a publisher to charge just 5% of a full electronic subscription rate to a small community college and still make money from it. As a consequence, many of the larger publishers are reviewing their agreements with the full-text aggregators, often limiting the terms of the licence to include only older materials, leaving them free to negotiate a discounted price directly with the client for current information.
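    A rough comparison shows why a heavily discounted direct sale can still be attractive. Only the 5% figure comes from the text; the full rate, the collection price, the royalty percentage, the number of journals in the collection and the equal split of the royalty between journals are hypothetical values and assumptions chosen for illustration.

```python
def direct_price(full_rate, percent_of_full_rate):
    """Price charged directly to an institution, as a percentage of the full rate."""
    return full_rate * percent_of_full_rate / 100


def royalty_share(collection_price, royalty_percent, journals_in_collection):
    """One journal's royalty from one aggregation sale, assuming an equal split."""
    return collection_price * royalty_percent / 100 / journals_in_collection


FULL_RATE = 1000  # hypothetical full electronic subscription rate

print(direct_price(FULL_RATE, 5))     # 50.0
print(royalty_share(5000, 30, 100))   # 15.0
```

    Under these invented figures the direct sale at 5% of the full rate brings in more per journal than the royalty share. The real comparison depends on the actual licence terms and on the publisher's cost of serving an additional online subscriber.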

    In addition, many publishers now have successful relationships with consortia of libraries. In many cases these deals with consortia include discounts for low-use, peripheral-interest materials, or materials destined for small institutions. Academic Press was the clear leader in this way of thinking.

    As we speak, there are a number of initiatives under way to bring together groups of smaller publishers who can collectively negotiate with library consortia. This in turn will lead these publishers to the same differential pricing strategy as their larger forerunners and thus to review their relationships with full-text aggregators.

    Who benefits from the presence of aggregators?

    Libraries

    Purchasing a mass collection, or organising access to a mass collection that was purchased separately, allows libraries to address the information needs of their patrons quickly.

    Small publishers

    Small publishers gain a great deal through careful deal-making with aggregators. In particular, by using an appropriate content host, the small publisher achieves the same "shop-window" status as its larger counterparts. In addition, it can use gateways to further improve its visibility. The very existence of value-for-money content hosts allows the small publishing house to continue its tradition of securing a niche market for its niche products.

    Large publishers

    Larger publishers save money by outsourcing many of their non-core competencies, such as printing and typesetting, freeing valuable management time to concentrate on the key differentiators between publishers, namely editorial and organisational differences. Electronic journal hosting can and should be one of those outsourced functions, just like printing, typesetting and distribution. In addition, a large publisher’s presence in the right gateways can be used to enhance its brand, as long as care is taken not to allow the gateway to subsume the publisher brand.

    Scholarship

    Scholarship as a whole gains from a combination of the above benefits for libraries and publishers alike. Aggregators facilitate the diversity of publication from large numbers of publishers rather than promoting the continued conglomeration of publishing houses.

    Lessons for small to medium commercial publishers

    The primary lesson for all publishers is to get their content online quickly, but at a price that they can afford. Publishers should use gateways to maximise content visibility, and should retain ownership of their own content by keeping tight control over licensing arrangements. In addition, they should probably join a consortium of publishers with the goal of increasing their market by differential pricing to smaller institutions. Likewise, it is important to make sure the content is never sold too cheaply, especially as wealthy multi-nationals stretch the meaning of the "site-licence".

    Lessons for large commercial publishers

    Most large commercial publishers are already online, but should keep a close eye on cost to make sure the most cost-effective method of having content online is being deployed. Many larger publishers already take great care of their licensing terms and are at least experimenting with differential pricing.

    Lessons for not-for-profit publishers

    While the commercial imperative of differential pricing and sales to consortia is less apparent to the not-for-profit publisher, maximising readership remains essential. This can best be achieved through cost-effective content hosting, careful management of the publisher’s own (often powerful) brand and maximum visibility through gateways appropriate to its readership.

    Conclusions

    This paper has largely been intended to bring improved clarity to the varying roles of a group of companies collectively termed aggregators. The three classes of aggregator bring different benefits to publishers and libraries alike, and their business models are quite distinct.

    The future for libraries appears to be very bright indeed, and with careful planning, can be equally bright for publishing as a whole.

    References

    CCS99 Consultative Committee for Space Data Systems. CCSDS 650.0-R-1: Reference Model for an Open Archival Information System (OAIS). Red Book. Issue 1. May 1999. http://www.ccsds.org/review_books.html

    CDA00 Centre de Données astronomiques de Strasbourg. http://cdsweb.u-strasbg.fr/CDS.html

    Cro00 Crossref. The central source for reference linking. www.crossref.org

    Dan93 H.-D. Daniel. Guardians of science. Fairness and reliability of peer review. Translated by William E. Russey. VCH, Weinheim 1993.

    Doi00a Home page Digital Object Identifier Foundation. www.doi.org

    Doi00b The DOI handbook Version 0.5.1. 11 August 2000. http://www.doi.org/handbook_2000/index.html

    Gar79 W.D. Garvey. Communication: The Essence of Science. Pergamon Press, Oxford 1979.

    Har00 Frédérique Harmsze. A modular structure for scientific articles in an electronic environment. PhD dissertation University of Amsterdam, 2000. The full text and appendices is available via: www.science.uva.nl/projects/commphys/papers

    Hrd00 See for the publications on internet by Stevan Harnad. http://cogsci.soton.ac.uk/~harnad/intpub.html

    Hud00 Anne Hudson Jones and Faith McLellan (eds.) Ethical Issues in Biomedical Publication. Johns Hopkins UP, Baltimore 2000.

    IWG99 International Working Group. Defining and Certifying Electronic Publication in Science. A proposal to the International Association of STM Publishers. http://associnst.ox.ac.uk/~icsuinfo/aaas-stm.htm

    Kir96 Joost G. Kircz and Hans E. Roosendaal. Understanding and shaping scientific information transfer. In: Dennis Shaw and Howard Moore (eds.) Electronic publishing in science: proceedings of the joint CDSI/UNESCO Expert Conference Paris February 1996. Unesco Press 1996 pp. 106-116.

    Kir98a Joost G. Kircz. Modularity: the next form of scientific information presentation? Journal of Documentation, vol.54, no. 2, March 1998, pp. 210-235. The final draft can be found on: www.science.uva.nl/projects/commphys/papers

    Kir98b Joost Kircz. Nouvelles présentati