Second ICSU-UNESCO International Conference on Electronic Publishing in Science
Is electronic publishing being used in the best interests of science? The scientist's view
Steve Berry


Session III. RESPONSES FROM THE SCIENTIFIC COMMUNITY
Chair: Peter Schindler, Swiss Academy of Sciences

Creating a global knowledge network
Paul Ginsparg
Los Alamos National Laboratory, Los Alamos, NM, US

Abstract

If we were to start from scratch today to design a quality-controlled archive and distribution system for research findings, would it be realized as a set of "electronic clones" of print journals? Could we imagine instead some form of incipient knowledge network for our research communications infrastructure? What differences should be expected in its realization for different scientific research fields? Is there an obvious alternative to the false dichotomy of "classical peer review" vs. no quality control at all? What is the proper role of governments and their funding agencies in this enterprise, and what might be the role of suitably configured professional societies? These are some of the key questions raised by the past decade of initial experience with new forms of electronic research infrastructure. In the article below, I will suggest only some partial answers to the above, with more complete answers expected on the 5-10 year timescale. 

Some arXiv background

Since my talk at the first conference in this series five years ago [Ginsparg, 1996], "Electronic Publishing in Science" has evolved in perception from intriguing possibility to inevitability. This period has also seen widespread acceptance of the internet as a communications medium, both inside and outside of academia, fostered largely by applications such as the WorldWideWeb. While progress on some fronts has been more rapid than might have been anticipated, the core structure and policies of scientific publishing remain essentially unchanged, as do the conclusions and recommendations of this follow-up meeting. In what follows, I will nonetheless suggest some looming instabilities of the current system, and reasons to anticipate much further evolution in the coming decade.

The essential question for "Electronic Publishing in Science" is how our scientific research communications infrastructure should be reconfigured to take maximal advantage of newly evolving electronic resources. Rather than "electronic publishing" which connotes a rather straightforward cloning of the paper methodology to the electronic network, many researchers would prefer to see the new technology lead to some form of global "knowledge network", and sooner rather than later.

Some of the possibilities offered by a unified global archive are suggested by the e-print arXiv (where "e-print" denotes self-archiving by the author), which since its inception in 1991 has become a major forum for dissemination of results in physics and mathematics. This resource has been entirely scientist driven, and is flexible enough either to co-exist with the pre-existing publication system, or to help it evolve to something better optimized for researcher needs. The arXiv is an example of a service created by a group of specialists for their own use: when researchers or professionals create such services, the results often differ markedly from the services provided by publishers and libraries. It is also important to note that the rapid dissemination it provides is not in the least inconsistent with concurrent or post facto peer review, and in the long run offers a possible framework for a more functional archival structuring of the literature than is provided by current peer review processes.

As argued by Odlyzko [1995, 1999], the current methodology of research dissemination and validation is premised on a paper medium that was difficult to produce, difficult to distribute, difficult to archive, and difficult to duplicate -- a medium that hence required numerous local redistribution points in the form of research libraries. The electronic medium is opposite in each of the above regards, and, hence, if we were to start from scratch today to design a quality-controlled distribution system for research findings, it would likely take a very different form both from the current system and from the electronic clone it would spawn without more constructive input from the research community.

An overview of the growth of the arXiv can be found in the monthly submission statistics, showing the number of submissions received during each month since the inception of service in August 1991. The total number of submissions received during the first 10 years of operation is roughly 170,000. The submission rate continues to increase, and roughly 35,000 new submissions are expected during calendar year 2001. Additional signal is contained in the submission data sorted by subject area. The primary observation is that submission growth during the period 1995-2000 was dominated by new users in Condensed Matter Physics and Astrophysics, whose combined submission rate grew to exceed that of High Energy Physics by late 1997. Extrapolating current growth rates, within a very few years Condensed Matter submissions alone will likely exceed those of High Energy Physics, which had begun to reach saturation (i.e., 100% participation of the community) by the mid-1990s. This suggests that the widespread preexisting practice of exchanging hard copy preprints, as was the case in High Energy Physics, may not be essential for the electronic analog of this behavior to be adopted by other research communities.

Where do all the submissions come from? According to the submission statistics sorted by e-mail domain of the submitting author, roughly 30% of the submissions come from United States based submitters; 12% from Germany; 6% from each of the U.K., Italy, Japan; 5% from France; and submissions overall arrive from about 100 different countries. The distribution is similar to that for refereed physics journals, and the participation of any given country is typically proportional to its Gross Domestic Product. Reflecting the international nature of the enterprise, the arXiv maintains a 16-country mirror network to facilitate remote access, which together with the main site typically handles in aggregate many millions of accesses per week.

The real lesson of the electronic distribution format

The need to reexamine the current methodology of scholarly publishing is reinforced by considering the hierarchy of costs and revenues in Figure 1. The figure depicts five orders of magnitude in US dollars per scientific article. To first approximation in an all-electronic future, the editorial cost per article should be roughly independent of length (as are already the bulk of the current costs, with the exception of the time and energy spent by referees, a significant hidden cost omitted from the numbers below).

At the top of the scale is the minimum $50,000 on average to produce the underlying research for the article, money typically in the form of salary and overhead, and also for experimental equipment. This sets a scale for the overall order of magnitude of the funding involved, and is roughly independent of whether the research is conducted at a university, government lab, or industrial lab.

The next figure on the scale is a rough estimate of the revenues for "high end" commercial journals. (In this case "high end" refers to the pricing, rather than to any additional services provided.) The $10,000--$20,000/article published range is obtained by multiplying the subscription cost per year for some representative "high end" journals by an estimated number of institutional library subscribers and dividing by the number of articles published per year.
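The estimate just described is a single product and quotient. A minimal sketch in Python follows; the function name and all three input values are hypothetical placeholders chosen only to show the arithmetic, and are not figures for any actual journal.

```python
def revenue_per_article(subscription_price, subscribers, articles_per_year):
    """Estimated publisher revenue per published article, in dollars.

    subscription_price: institutional subscription cost per year
    subscribers: estimated number of institutional library subscribers
    articles_per_year: number of articles the journal publishes per year
    """
    if articles_per_year <= 0:
        raise ValueError("articles_per_year must be positive")
    return subscription_price * subscribers / articles_per_year


# Hypothetical inputs, for illustration only (not data from the article).
example = revenue_per_article(subscription_price=10_000,
                              subscribers=500,
                              articles_per_year=400)
print(f"${example:,.0f} per article")  # $12,500 per article
```

Because the result scales linearly with each input, an uncertainty of a factor of 2 in the estimated subscriber count carries straight through to the per-article figure, which is why the text quotes a range rather than a single number.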

Odlyzko's estimate [Odlyzko, 1999] for average aggregate publisher revenues in his survey of Mathematics and Computer Science journals is of the order of $4000/article. (Odlyzko has also pointed out that since acquisition costs are typically 1/3 of library budgets, the current system expends an additional $8000/article in other library costs, another cost omitted from these considerations.)
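The $8000 figure follows directly from the 1/3 share: if $4000 is one third of the library budget attributable to an article, the remaining two thirds is twice that. A small Python sketch of the step, with a hypothetical function name and exact fractions used to avoid rounding:

```python
from fractions import Fraction


def other_library_costs(acquisition_cost, acquisition_share):
    """Non-acquisition library cost implied by the acquisition share of the budget."""
    total_budget = acquisition_cost / acquisition_share
    return total_budget - acquisition_cost


# $4000/article in acquisitions at 1/3 of the budget, as quoted in the text.
extra = other_library_costs(4000, Fraction(1, 3))
print(extra)  # 8000
```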

If that average holds more generally, then there must be publishers operating at well below that value. At least one professional society publisher in Physics brings in about $2000/article in revenue. (Note as an aside that insofar as revenues=costs for non-profit operations, and if the level of services provided is the same as by the "high end" commercial publishers, then it is possible to estimate the potential profit margin of the latter.) By eliminating the print product, and by restructuring the workflow to take greater advantage of electronically facilitated efficiencies, it is likely that the costs for a relatively large existing publisher could be brought down closer to $1000/article. This number includes long-term infrastructural needs, such as an editor-in-chief for long-range planning, a small research and development (R&D) staff, and maintenance of an archival database.

We can also ask whether an idealistic electronic start-up venture, without the legacy problems of an existing publisher, might be even more efficient. At least one such venture in Physics, currently publishing about 700 articles per year, operates in the $500/article range, including support of computer operations and overhead for use of space and network connections. But private communications suggest that this number is likely to creep upward rather than downward, as some of the labor volunteered from initial enthusiasm is replaced by paid labor and salaries for existing labor are adjusted to competitive levels for retention, and so might also move closer to the $1000/article published range.

The point of these observations is not by any means to argue that any of the above operations are hopelessly inefficient. To the contrary, the object is to assess in an operational (rather than theoretical) context what are the likely editorial costs if the current system is taken all-electronic. The order of magnitude conclusion is that costs on the order of some irreducible $1000 per peer-reviewed published article should be expected, using current methodology. The number is not too surprising after taking into account that the human labor (plus overhead) that dominates the costs is effectively quantized in chunks of order $100,000 per person (including overhead, since space, utilities, and network connections all cost money). A functional operation, to peer review and publish many hundreds of articles per year, will ordinarily require at least parts of a few people for high level editor, low level secretarial work, system administration, and some amount of R&D on an ongoing basis (still of course assuming volunteered referee time). The costs are therefore immediately in the range of many hundreds of thousands of dollars, confirming the rough $1000/article order of magnitude, up to a factor of 2 one way or another. Moreover, there do not appear to be dramatic economies of scale from taking such an already idealized skeletal editorial operation to larger sizes of thousands or tens of thousands of published articles per year: the labor costs just scale proportionally. (It is even possible that there are certain diseconomies of scale, i.e., that organizing peer review for many thousands of articles per year leads to additional overhead for centralized offices, managerial staff, and more complicated communications infrastructure; but with the potential benefit of a more coherent set of policies and long-range planning for a larger fraction of the literature.)

Another data point in Figure 1 is the current revenue for a representative "web printer", i.e., an operation that takes the data feed from an existing print publisher and converts it to HTML and/or PDF for rendering by a suitable browser. At least one such operation currently functions in the $100/article range. It is eminently reasonable that the costs should be somewhat lower than any of the above peer review figures, since these services are conducted after the peer review and other editorial functions have taken place. The revenue may seem high, but that is because the operation currently involves reverse engineering part of a legacy process intended for print, and can require a slightly different re-engineering for each participating publisher. With better standardized formats, and better authoring tools to produce them, the associated costs may diminish. There are also other "transitional" R&D costs to working with a still evolving technology, and additional costs associated with experimentation on formats, search engines, alert services, and other forms of reader personalization.

Finally, at the bottom of the scale in Figure 1 is an estimate of the cost per current arXiv submission: in the $1-$5/submission range, based on the direct labor costs per year involved only in processing incoming submissions and operating an e-mail "help desk". (Hardware and labor costs for maintaining the static archival database add on only a small percentage.) The estimate is given as a range because the labor per submission is a skewed distribution. There are subsets, such as the original hep-th (High Energy Physics - Theory), which operate according to the original "fully automated" design, with users requiring no assistance at all. Indeed the vast majority of submissions require zero labor time and only a very small number of new users or problematic submissions are responsible for all labor time spent. This has to be the case since there are upwards of 200 new submissions and replacements per weekday -- if each took even just 15 minutes of human labor at the arXiv end, that would mean over 50 hours of work per day, i.e., at least 7 full-time employees. The current tiny percentage of problematic submissions, and smattering of other user questions, in reality requires less than a single full-time equivalent, placing the cost in the middle of the above cited range. This is also assuming a relatively matured and static system, without the need for constant R&D -- it is not clear whether this is a realistic long-term assumption, but including more R&D would only push the costs closer to the upper end of the $1-$5 range.
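The staffing argument above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the 8-hour working day is an assumption not stated in the text, used to convert hours into full-time employees.

```python
import math


def staff_needed(items_per_day, minutes_per_item, hours_per_fte_day=8):
    """Return (labor hours per day, full-time employees needed, rounded up).

    hours_per_fte_day is an assumed working day; the text does not state one.
    """
    hours = items_per_day * minutes_per_item / 60
    return hours, math.ceil(hours / hours_per_fte_day)


# 200 submissions and replacements per weekday, 15 minutes each, as in the text.
hours, ftes = staff_needed(200, 15)
print(hours, ftes)  # 50.0 7
```

With the same inputs, observed labor of less than one full-time equivalent corresponds to an average of under about 2.4 minutes per submission, consistent with most submissions requiring no human attention at all.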

A key point of the electronic communication medium is that the cost to archive an article and make it freely available to the entire world in perpetuity is a tiny fraction of the amount to produce the research in the first place. This is, moreover, consistent with public policy goals [Bachrach et al., 1998] for what is in large part publicly funded research.

In the future there is likely to be a more ideal case, in which the steady state labor is not dominated by the current ever-expanding profile of new users, but would rely instead on an experienced userbase in possession of better local authoring tools. Such tools would make it possible for the user to prepare a more sophisticated and fully portable document format, with accurate and automatically parsable metadata, auto-linked references, better treatment of figures and other attachments, and more. Then the rest of the research community could interact with the automated system as autonomously as the original hep-th community, with the result that the system as a whole could operate in the $1/submission range (or below).

The conclusion of the above is that the per article costs for a pure dissemination system are likely to be at least a factor of 100 to 1000 lower than for a conventionally peer reviewed system. This is the real lesson of the move to electronic formats and distribution, i.e., not that everything should somehow be free, but that with many of the production tasks automatable or off-loadable to the authors, the editorial costs will then dominate the costs of an unreviewed distribution system by many orders of magnitude. This crucial point is the subtle difference from the paper system, in which the expenses directly associated to print production and distribution were roughly the same order of magnitude as the editorial costs (estimates for the cost of the print component are typically 30% of the total). It wasn't as essential to ask whether the production and dissemination system should be decoupled from the intellectual authentication system when the two were comparable in cost. Now that the former may be feasible at less than 1% of the latter, the unavoidable question is whether the utility provided by the latter, in its naive extrapolation to electronic form, continues to justify the associated time and expense. Since many communities rely in an essential way on the structuring of the literature provided by the editorial process, the related question is whether there might be some hybrid methodology that can provide all of the benefits of the current system but for a cost somewhere in between the order $1000/article cost of current editorial methodology and the order $1/article cost of a pure distribution system.

The above questions cannot yet be answered, but some closing observations regarding the above revenue per article estimates may be relevant. The key for any automated system to getting the per article cost down (presuming for the moment that is the objective) will always be to handle far greater volume than can a conventionally edited journal. As mentioned above, any significant fraction of an employee immediately puts the costs per year in the $100,000 range, including overhead, so that requires of order 100,000 articles per year in order to get down towards the bottom of Figure 1. It is also worthwhile to clarify that the above comparisons involve cost per arXiv submission on the one hand, and editorial cost per article published on the other. It is not so much an issue that adding in the numbers for rejected articles would reduce the nominal cost/article (e.g., by a factor of 2 for a journal with a 50% acceptance rate), but that the bulk of the editorial time, hence cost, is evidently spent on the articles rejected for publication. (Lest it seem a hopelessly paradoxical and inefficient effort to devote the majority of time to the material that won't be seen, recall that we don't ordinarily regard sculptors as involved in a futile effort just because the vast majority of their time is also spent removing extraneous material.) This does suggest however that there might be some modification of the existing editorial methodology to somehow take advantage of the open distribution sector as a pre-filter in order to maximize the time and effort spent on the articles that will be selected for publication, while still maintaining high standards (presuming that remains the objective).
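Two pieces of arithmetic in the paragraph above can be made explicit: the volume needed to reach a target per-article cost for a fixed annual labor cost, and the effect of counting rejected articles in the denominator. The Python sketch below uses hypothetical function names; the $1000 per published article input in the second call is the order-of-magnitude editorial cost quoted earlier, used here purely as an example.

```python
def volume_for_target(annual_cost, target_cost_per_article):
    """Articles per year needed so that annual_cost / volume equals the target."""
    return annual_cost / target_cost_per_article


def cost_per_handled_article(cost_per_published, acceptance_rate):
    """Nominal cost per article handled (accepted or rejected).

    Total cost is cost_per_published * published, and the number handled is
    published / acceptance_rate, so the quotient is their product.
    """
    return cost_per_published * acceptance_rate


# One employee-equivalent at $100,000/year, targeting $1 per article.
print(volume_for_target(100_000, 1))        # 100000.0

# A 50% acceptance rate halves the nominal per-article figure.
print(cost_per_handled_article(1000, 0.5))  # 500.0
```

The second function only rescales the denominator; it says nothing about where editorial time is actually spent, which is the more important point made in the text.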

It should also be noted that for the most part the current peer review system has itself escaped a systematic assessment. Despite its widespread use, and the widespread dependence on it both for publication and for grant allocation, much of the evidence for its efficacy remains largely anecdotal. In the Health Sciences, recent studies suggest that conventional editorial peer review can be "expensive, slow, subjective and biased, open to abuse, patchy at detecting important methodological defects, and almost useless at detecting fraud or misconduct" [Peer Review..., 1999]. While it can improve the quality of those articles that do eventually get published, studies suggest that a competent lone editor can perform as well or better. Peer review is by no means a monolithic practice, however, and the Health Sciences differ from, say, Physics in a number of potentially crucial respects. The journals in the former discipline frequently have much lower acceptance rates, as low as 10%, and provide a conduit for a small number of researchers to speak to a much larger number of clinicians. In Physics and closely related disciplines, by contrast, the acceptance rates are typically higher, and the author and reader communities essentially coincide (and hence the referee community as well is composed of the same set of researchers). It will consequently be very valuable to assess peer review more systematically in other disciplines, to determine whether for those as well it is "a process with so many flaws that it is only the lack of an obvious alternative that keeps the process going" [Peer Review..., 1999].

Another corollary of the above observation concerning volume is that a physically distributed set of repositories, even if seamlessly aggregated via some interoperability protocol, is not likely to be as cost-efficient as a centralized one. The argument again is that any manual labor involved will bring in some fraction of an employee at a cost of a few tens of thousands of dollars per year, and then a few tens of thousands of articles per year would be required to get the cost down to the few dollars per article range. But volume in that range is closer to the world output for a given discipline, not to the output of a single department. Distributed archiving, even if not as cost-efficient, could of course have other advantages, including redundancy and non-centralized control. In addition, such "in-sourcing" of research communication infrastructure could also make more effective use of existing support labor resources than does the current system. 

The near future?

Currently, the research literature continues to owe its structure to the editorial work funded by publisher revenues to organize peer review. The latter of course depends on the donated time and energy of the research community, and is subsidized by the same grant funds and institutions that sponsor the research in the first place. The question crystallized by the new communications medium is whether this arrangement remains the most efficient way to organize the review and certification functions, or if the dissemination and authentication systems can be naturally disentangled to create a more forward-looking research communications infrastructure.

Figure 2 is meant to illustrate one such possible hierarchical structuring of our research communications infrastructure. At left it depicts three electronic service layers, and at right the eyeball of the interested reader/researcher is given the choice of most auspicious access method for navigating the electronic literature. The three layers, depicted in blue, green, and red, are respectively the data, information, and "knowledge" networks (where "information" is usually taken to mean data + metadata (i.e. descriptive data), and "knowledge" here signifies information + synthesis (i.e. additional synthesizing information)). Figure 2 also represents graphically the key possibility of disentangling and decoupling the production and dissemination on the one hand from the quality control and validation on the other (as was not possible in the paper realm).

At the data level, Figure 2 suggests a small number of potentially representative providers, including the e-print arXiv (and implicitly its international mirror network), a university library system (CDL = California Digital Library eScholarship project), and a typical foreign funding agency (the French CNRS = Centre National de la Recherche Scientifique CCSD project). These are intended to convey the likely importance of library and international components. Note that there already exist cooperative agreements with each of these to coordinate via the "open archives" protocols to facilitate aggregate distributed collections.

Representing the information level, Figure 2 shows, among others, a commercial indexer (ISI = Institute for Scientific Information) and a generic government resource (the PubScience initiative at the DOE), suggesting a mixture of free, commercial, and publicly funded resources at this level. For the biomedical audience at hand, I might have included services like Chemical Abstracts and PubMed at this level. A service such as GenBank is a hybrid in this setting, with components at both the data and information layers. The proposed role of PubMedCentral would be to fill the electronic gaps in the data layer highlighted by the more complete PubMed metadata.

At the "knowledge" layer, Figure 2 shows a tiny set of existing Physics publishers (APS = American Physical Society, JHEP = Journal of High Energy Physics, and ATMP = Advances in Theoretical and Mathematical Physics; the second is based in Italy and the third already uses the arXiv entirely for its electronic dissemination); and BMC (= BioMedCentral) should also have been included at this level. These are the third parties that can overlay additional synthesizing information on top of the information and data levels, and partition the information into sectors according to subject area, overall importance, quality of research, degree of pedagogy, interdisciplinarity, or other useful criteria; and can maintain other useful retrospective resources (such as suggesting a minimal path through the literature to understand a given article, and suggesting pointers to outstanding lines of research later spawned by it). The synthesizing information in the knowledge layer is the glue that assembles the building blocks from the lower layers into a knowledge structure more accessible to both experts and non-experts.

The three layers depicted are multiply interconnected. The green arrows indicate that the information layer can harvest and index metadata from the data layer to generate an aggregation which can in turn span more than one particular archive or discipline. The red arrows suggest that the knowledge layer points to useful resources in the information layer. As mentioned above, the knowledge layer in principle provides much more information than that contained in just the author-provided "data": e.g. retrospective commentaries, etc. The blue arrows -- critical here -- represent how journals of the future can exist in an "overlay" form, i.e. as a set of pointers to selected entries at the data level. Abstracted, that is the current primary role of journals: to select and certify specific subsets of the literature for the benefit of the reader. A heterodox point that arises in this model is that a given article at the data level can be pointed to by multiple such virtual journals, insofar as they're trying to provide a useful guide to the reader. (Such multiple appearance would no longer waste space on library shelves, nor be viewed as dishonest.) This could tend to reduce the overall article flux and any tendency on the part of authors towards "least publishable units". The future author could thereby be promoted on the basis of quality rather than quantity: instead of 25 articles on a given subject, the author can point to a single critical article that "appears" in 25 different journals.
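The "overlay" idea described above, in which a journal is nothing more than a set of pointers into the data level, can be sketched as a small data structure. Everything in the Python example below is hypothetical (the identifiers, titles, journal names, and function names are invented for illustration); it is meant only to show that one archived article can be pointed to by several virtual journals without being duplicated.

```python
# Data layer: archive identifier -> article record (all entries are made up).
archive = {
    "arch/0001": {"title": "Paper A"},
    "arch/0002": {"title": "Paper B"},
    "arch/0003": {"title": "Paper C"},
}

# Knowledge layer: each overlay journal holds only pointers into the archive.
overlay_journals = {
    "Journal X": {"arch/0001", "arch/0002"},
    "Journal Y": {"arch/0001"},
}


def journals_pointing_to(article_id, journals):
    """Names of all overlay journals that have selected the given article."""
    return sorted(name for name, ids in journals.items() if article_id in ids)


def unselected(archive_records, journals):
    """Archive entries that no overlay journal points to (the raw data level only)."""
    selected = set().union(*journals.values()) if journals else set()
    return sorted(set(archive_records) - selected)


print(journals_pointing_to("arch/0001", overlay_journals))  # ['Journal X', 'Journal Y']
print(unselected(archive, overlay_journals))                # ['arch/0003']
```

In this arrangement the article text lives in exactly one place, and certification by a journal amounts to adding an identifier to a set, so multiple "appearances" of the same article cost essentially nothing to store.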

Finally, the black arrows suggest how the reader might best proceed for any given application: either trolling for gems directly from the data level (as many graduate students are occasionally wont, hoping to find a key insight missed by the mainstream), or instead beginning the quest at the information or knowledge levels, in order to benefit from some form of pre-filtering or other pre-organization. The reader most in need of a structured guide would turn directly to the highest level of "value-added" provided by the "knowledge" network. This is where capitalism should return to the fore: researchers can and should be willing to pay a fair market value for services provided at the information or knowledge levels that facilitate and enhance the research experience. For reasons detailed above, however, we expect that access at the raw data level can be provided without charge to readers. In the future this raw access can be further assisted not only by full text search engines but also by automatically generated reference and citation linking. The experience from the Physics e-print archives is that this raw access is extremely useful to research, and the small admixture of noise from an unrefereed sector has not constituted a major problem. (Research in science has certain well-defined checks and balances, and is ordinarily pursued by certain well-defined communities.)

Ultimately, issues regarding the correct configuration of electronic research infrastructure will be decided experimentally, and it will be edifying to watch the evolving roles of the current participants. Some remain very attached to the status quo, as evidenced by responses to successive forms of the PubMedCentral proposal from professional societies and other agencies, ostensibly acting on behalf of researchers but sometimes disappointingly unable to recognize or consider potential benefits to them. (Media accounts have been equally telling and disappointing in giving more attention to the "controversy" between opposing viewpoints than to a substantive accounting of the proposed benefits to researchers, and to taxpayers.) It is also useful to bear in mind that much of the entrenched current methodology is largely a post World War II construct, including both the large-scale entry of commercial publishers and the widespread use of peer review for mass production quality control (neither necessary to, nor a guarantee of, good science). Ironically, the new technology may allow the traditional players from a century ago, namely the professional societies and institutional libraries, to return to their dominant role in support of the research enterprise.

The original objective of the e-print arXiv was to provide functionality that was not otherwise available, and to provide a level playing field for researchers at different academic levels and different geographic locations -- the dramatic reduction in cost of dissemination came as an unexpected bonus. (The typical researcher is entirely unaware, and sometimes quite upset to learn, that the average article generates many thousands of dollars in publisher revenues.) As Andy Grove of Intel has pointed out [Grove, 1996], when a critical business element is changed by a factor of 10, it is necessary to rethink the entire enterprise. The e-print arXiv suggests that dissemination costs can be lowered by more than two orders of magnitude, not just one.

But regardless of how different research areas move into the future (perhaps by some parallel and ultimately convergent evolutionary paths), and independent of whether they also employ "pre-refereed" sectors in their data space, on the one- to two-decade time scale it is likely that other research communities will also have moved to some form of global unified archive system without the current partitioning and access restrictions familiar from the paper medium, for the simple reason that it is the best way to communicate knowledge and hence to create new knowledge. 

References

P. Ginsparg, Winners and Losers in the Global Research Village, Electronic Publishing in Science, at UNESCO HQ, Paris, 1996 (eds. Dennis Shaw and Howard Moore).

A. Odlyzko, Tragic loss or good riddance? The impending demise of traditional scholarly journals, Intern. J. Human-Computer Studies (formerly Intern. J. Man-Machine Studies) 42 (1995), pp. 71-122, and in the electronic J. Univ. Comp. Sci., pilot issue, 1994.

A. Odlyzko, Competition and cooperation: Libraries and publishers in the transition to electronic scholarly journals, Journal of Electronic Publishing 4(4) (June 1999), and in J. Scholarly Publishing 30(4) (July 1999), pp. 163-185.

S. Bachrach et al., Who Should 'Own' Scientific Papers?, Science, Volume 281, Number 5382, Issue of 4 Sep 1998, pp. 1459-1460. See also Bits of Power: Issues in Global Access to Scientific Data, by the Committee on Issues in the Transborder Flow of Scientific Data; U.S. National Committee for CODATA; Commission on Physical Sciences, Mathematics, and Applications; and the National Research Council; National Academy Press (1997).

Peer Review in Health Sciences, Ed. by Fiona Godlee and Tom Jefferson, BMJ Books, 1999.

Andy Grove, Only the Paranoid Survive: How to Exploit the Crisis Points That Challenge Every Company and Career, Bantam Doubleday Dell, 1996 (as cited in A. Odlyzko, "The economics of electronic journals", First Monday 2(8) (August 1997), and Journal of Electronic Publishing 4(1) (September 1998). Definitive version on pp. 380-393 in Technology and Scholarly Communication, R. Ekman and R. E. Quandt, eds., Univ. Calif. Press, 1999.)

End of presentation

E-BioSci: a Europe-based platform for e-publishing
and information integration in the life sciences
Les Grivell
Center for Social Informatics
European Molecular Biology Organisation,  Heidelberg, Germany

In the summer of 1989, the Genetics Society of America organized one of its bi-annual meetings on Yeast Genetics. As a participant at that meeting, two sessions still stand out for me. The first, a part of the official programme, was a session entitled ‘Who’s working on my/your gene?’. The other was a brief, slightly conspiratorial meeting of a handful of European scientists gathered around the conference hotel’s Steinway, focused on preparations for an ambitious, EU-funded project aimed at determining the complete DNA sequence of the genome of the yeast Saccharomyces cerevisiae. The outcome of the first session was a long list of genes, each remarkable for being involved in so many apparently different cellular processes that it had emerged time after time in different guises and under different names in different laboratories. The session clearly illustrated the benefits of sharing information. It also highlighted the need for well-structured databases capable of storing and retrieving different types of information derived from many different experimental techniques, in a way that would allow researchers to construct as complete a picture as possible of all the facets of a given gene and its functional relationships to others involved in the same or related cellular processes.

Genomics and the information explosion

The final outcome of the second session was the publication on 24th April 1996 of the complete sequence of the 13.3 M base pairs that make up the 16 chromosomes of Saccharomyces cerevisiae, the simple, single-cell organism that is so often used as a model for more complex and experimentally less accessible eukaryotic cells. The relatively brief period since this historic date has been one of unprecedentedly rapid progress, culminating on June 26th of last year with the announcement of the first draft of the 3000 M base pairs of the human genome (see Fig. 1: Increasingly rapid progress of genomic sequencing projects). As of February 2001, the number of completely sequenced genomes stands at 800. The total amount of DNA sequence stored in the joint EMBL/DDBJ/GenBank databanks has risen to a mind-boggling 11,526,750,544 base pairs in 10,711,124 records and continues to increase at an exponential rate (see Fig. 2: Exponential growth of the EMBL DNA sequence database).

The fields of genomics and bio-informatics have firmly established themselves in research programmes and teaching curricula, together with the related areas of functional and structural genomics and their derived specializations of transcriptomics, proteomics and metabolomics, which deal, respectively, with all the RNAs, proteins and metabolites present in a cell. Common to all these areas is the production of vast amounts of raw data. Common, too, is the increasing dependence on the internet as a means of disseminating or acquiring data and of providing access to specialized software for analysis. Paper is turning out to be an inadequate medium for the flood of new data, which often demands both further manipulation and new methods of visualization as an aid to interpretation. More often than not, paper publications contain little more than summary pointers to data tables that are too large to print, or to videos and multi-dimensional images that cannot be printed. It is against this background of changing practices and expectations that bio-medical researchers have also come to question established editorial, reviewing and publishing practices and even to reconsider the nature of the publication itself.

From data to knowledge

For some, this veritable embarras de richesses of data is seen as the death knell for hypothesis-driven research and the dawn of an era in which data-mining will generate novel leads and concepts for innovative research. For others, it signals just the opposite - a means of enabling biologists to construct for the first time precise, detailed and experimentally verifiable models of cellular function. Either way, the success of data analysis depends on the ready availability of as complete a set of data as possible. Several recent developments are likely to contribute to the achievement of this ideal:

Encouraging though these developments are, a number of hurdles still have to be overcome. They include:

Fig 3a. A novel approach to the visualisation of transcriptomic data: the changing patterns of gene activity as detected by DNA micro-array analysis

Fig. 3b Visualisation of data present in the yeast protein-protein interaction database facilitates the discovery of new relationships between cellular processes

With respect to this last point, PubMed in the USA has provided a first set of tools for searching and retrieval of information from the MEDLINE collection of abstracts that is linked to DNA and protein sequence databases. The system is interlinked at the level of keywords and identifiers. There are, however, clear needs for innovation and refinement: needs to increase the sophistication of search algorithms, to develop methods for searching full-text publications, to develop better discriminative criteria for interlinking and establishing relationships between published documents, and to link publications with data in a variety of other formats, including structures, images and animations. E-BioMed (subsequently PubMed Central), the first attempt to establish a single site for the storage and retrieval of electronic text and data, was an important initiative of the NIH in the USA. Unfortunately, however, the controversial aspects of this proposal, namely the absence or possibly only optional presence of peer review and the lack of realism in aiming to distribute content owned by others without charge, prevented implementation as originally conceived. Even now, the more important issues mentioned above are being pushed into the background by a controversial call for open access to the published literature, aimed at individual scientists by a number of the original proponents of PubMed Central. The call, in the form of an open letter to publishing organisations, encourages scientists to show their support for open access by pledging to publish in, review for or edit only those journals that grant unrestricted distribution rights to PubMed Central and similar entities within 6 months of publication. Inexplicably, the call focuses only on primary journal publications, ignoring a potentially much more serious problem concerning a growing tendency towards limited access to database information.
This problem has been highlighted recently by the decision of the publisher of the journal Science to accept Celera’s terms for the release of their human genome sequence data. The data will not be submitted to public databanks and access at Celera’s own site will be restricted to those agreeing not to ‘redistribute’ the information. The implications of the latter restriction are crippling, since, depending on the exact interpretation of what is meant by redistribution, they may well extend to severe limitation of the freedom to publish studies based on the data, to carry out large-scale bio-informatic analysis and to incorporate derived data into other databases.

The E-BioSci platform for information access and retrieval

It is against this confused background that EMBO, the European Molecular Biology Organisation, has decided to take the lead in a collaborative effort to establish E-BioSci as a European-based information resource network with a global role. A series of discussions with interested parties (including research organisations, learned societies, publishers, individual research scientists and representatives of a large number of EU member states) identified the shortcomings of earlier proposals and led to the formulation of the current initiative. This defines E-BioSci as a networked platform that will extensively combine the skills and content already present, or being developed in various centres in Europe. It will work in harmony with other global initiatives such as PubMed Central, publishers and other information providers. Although superficially more complex, this setup more accurately reflects the European dimension of the project. Additionally, it offers potential advantages in terms of speed of access, provision of backup or secure storage facilities and it will allow queries to be performed in different language formats. By providing an extensive set of linkages through the biological information chain E-BioSci will:

The E-BioSci network will:

As indicated in these last two points, E-BioSci will, besides acting as an information portal, provide hosting services for electronic publications. The aim here will be to provide a platform for the dissemination of material that has previously undergone peer review and authentication by an independent body. E-BioSci need not be the sole repository of such material and authors may choose to submit their reviewed and authenticated manuscripts to as many sites as they wish. This emphasis on a reliable form of quality assessment and control distinguishes E-BioSci from a number of other e-publishing initiatives, including those modeled on the Los Alamos Physics Archive (e.g. the eprints-based Cogprints server), or commercially based services such as those offered by BioMed-Central. One of the main issues here is that authors rely on the perceived quality of their publications as support for funding applications and career advancement and are thus likely to be reluctant to abandon a tried and trusted model of assessment in the absence of reliable and widely accepted alternatives. Additionally, from the reader’s point of view, some degree of editorial control is, at least in part, a guarantee that technical standards have been met, that the conclusions are adequately supported by the experimental data and that the presentation meets acceptable standards of clarity. In cases in which a submission is accompanied by significant amounts of supplementary data, the peer review process also provides an appropriate opportunity for watermarking of both manuscript and data to protect against tampering at a later stage.
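The text does not say how manuscripts and data would be watermarked. As a purely illustrative stand-in, one simple way to detect later tampering is to record a cryptographic digest of each file at the moment of acceptance and to recompute it whenever the file is retrieved. The sketch below assumes SHA-256 as the digest and uses hypothetical function names; it detects alteration but, unlike a true watermark, embeds nothing in the content itself.

```python
import hashlib
import hmac


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


def is_unaltered(content: bytes, recorded_digest: str) -> bool:
    """Check content against the digest recorded at acceptance (hypothetical workflow)."""
    return hmac.compare_digest(fingerprint(content), recorded_digest)


# Placeholder content standing in for a reviewed manuscript or data file.
manuscript = b"Accepted manuscript text"
recorded = fingerprint(manuscript)

print(is_unaltered(manuscript, recorded))
print(is_unaltered(manuscript + b" (edited)", recorded))
```

The digest would have to be stored by a party other than the one hosting the file, for example the independent reviewing body, for the check to carry any weight.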

Conclusions and prospects

Just as the emerging field of genomics is changing the way in which molecular biologists plan, execute and interpret their research, so is the transition from traditional to electronic publishing technologies changing the ways in which the results of this research are disseminated to and used by other scientists. In this brief overview, I have presented a perspective largely based on that of the individual scientist, who wishes to have free, or at least unhindered, access to as wide a range of electronic information sources as possible, to be able to navigate effortlessly between them and to search, select, integrate and manipulate information without leaving his or her desk. I have outlined a number of recent developments that will contribute to the achievement of this goal. E-BioSci is one of these. Much still remains to be done, however, and a brief wish list of a typical user might include:

References
1.  Goffeau A. (2000) Four years of post-genomic life with 6,000 yeast genes. FEBS Lett; 480(1): 37-41. Review.
2.  http://www.ebi.ac.uk/~sterk/genome-MOT/MOTgraph.html
3.  http://www3.ebi.ac.uk/Services/DBStats/
4.  Etzold T, Ulyanov A, Argos P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol.; 266: 114-28.
5.  http://www.w3.org/XML/
6.  http://www.corba.org/
7.  http://www.doi.org/
8.  http://www.crossref.org/
9.  http://www.sfxit.com/
10.  Carazo JM, Stelzer EH.(1999) The BioImage Database Project: organizing multidimensional biological images in an object-relational database. J Struct Biol; 125: 97-102.
11.  Gonzalez-Couto E, Hayes B, Danckaert A. (2001) The life sciences global image database.  Nucleic Acids Res. 29: 336-9.
12.  Discala C, Benigni X, Barillot E, Vaysseix G. (2000)  DBcat: a catalog of 500 biological databases. Nucleic Acids Res; 28: 8-9.
13.  Eisen MB, Spellman PT, Brown PO, Botstein D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8.
14.  Schwikowski B, Uetz P, Fields S. (2000) A network of protein-protein interactions in yeast. Nat Biotechnol; 18: 1257-61.
15.  http://www3.ncbi.nlm.nih.gov/Omim/
16.  http://www.ncbi.nlm.nih.gov:80/entrez/query/static/overview.html
17.  http://www.nlm.nih.gov/databases/freemedl.html
18.  http://www.pubmedcentral.nih.gov/
19.  http://www.publiclibraryofscience.org/
20.  http://www.celera.com/
21.  See http://www.embo.org/E_Pub_pages.html
22.  http://www.eprints.org/
23.  http://www.biomedcentral.com/
End of presentation

The SciELO Model for electronic publishing and measuring of usage
and impact of Latin American and Caribbean scientific journals
Abel L. Packer,
SciELO, FAPESP/BIREME Project Operational Coordinator, BIREME/PAHO/WHO, Director

1. Introduction

The Scientific Electronic Library Online (SciELO) is emerging as a regional model for the electronic publishing of scientific journals, intended primarily to cover publications from Latin America, the Caribbean and Spain. In the long term, the library is also designed to serve as an instrument for measuring the usage and impact of these scientific journals.

It currently operates through the SciELO Internet Web portal, which references the decentralized national collections of selected scientific journals, organized by country of publication, and regional collections organized by subject area.

SciELO was launched in 1997 as a cooperative project among the State of São Paulo Science Foundation (FAPESP), the Latin American and Caribbean Center on Health Sciences Information (BIREME), a center of the Pan American Health Organization (PAHO/WHO), and a selected group of Brazilian scientific editors. FAPESP is a very dynamic agency that promotes scientific research in São Paulo and has a program that supports scientific journals. Its fundamental motivation for SciELO is the production of indicators of usage and impact of Brazilian scientific journals. BIREME is a specialized center for the promotion of technical cooperation in health sciences information throughout Latin America and the Caribbean (LA&C). BIREME’s main objective is to strengthen the flow of health sciences information, and its current operational strategy is the construction of the Virtual Health Library (VHL) as a network of information sources on the Internet. A specific product of the VHL is the LILACS (LA&C health sciences literature) bibliographic database, which indexes the regional literature in health sciences. BIREME’s fundamental motivation for SciELO is to move the best journals indexed by LILACS to electronic format. The partnership between FAPESP and BIREME involves all subject areas.

By the end of 2000, SciELO was regularly operating collections of the best journal titles published in Chile and Brazil in all scientific areas, and it is being implemented in Cuba and Costa Rica as well, though covering only health sciences journals there. A separate collection of the best public health journals from Latin America covers titles from Mexico, Brazil and Spain and the Bulletin of the Pan American Health Organization. In total, at the beginning of 2001, SciELO operates 85 titles and about 12,000 articles online. The number of online accesses to all these collections is increasing constantly.

The strength of the SciELO model resides in the fact that, in addition to promoting the transition of paper-based journals to the Internet, it intends to address and help overcome the traditional problems inherent in publications that fall outside the mainstream of scientific communication, as represented by the titles indexed in the major international bibliographic databases, such as those produced and marketed by the Institute for Scientific Information (ISI), MEDLINE, produced by the National Library of Medicine (NLM) of the United States National Institutes of Health (NIH), the United States IEEE databases, the American Psychological Association database, etc.

2. The traditional vicious circle that affects developing countries' scientific journals

The problems that traditionally affect non-mainstream journals, especially publications from developing countries, concern the quality, and more specifically the perception of the quality, of the scientific communication they accomplish.

There is no systematic local quality control of scientific communication carried out by developing countries based on time series of bibliometric and informetric indicators. In consequence, the measurement of developing countries' scientific production relies most of the time on indicators produced by developed countries' information products and services, especially the ISI Journal Citation Reports (ISI JCR). This situation represents a tremendous limitation because most of the scientific journals published by developing countries are not indexed, and not measured systematically in terms of usage and impact, by the traditional developed countries' indexes. For example, the ISI JCR 1999 Social Science Edition indexes 8 titles from LA&C and the ISI JCR 1999 Science Edition indexes 43 titles. NLM MEDLINE indexes 43 titles from a total of more than 600 titles indexed by LILACS.
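To make the scale of this gap concrete, the MEDLINE figure quoted above can be expressed as a share of the LILACS collection. The following is a minimal sketch that assumes exactly 600 LILACS titles; since the text says "more than 600", the computed share is an upper bound. The function name is introduced here for illustration only.

```python
def coverage_share(indexed_titles: int, total_titles: int) -> float:
    """Return the fraction of a collection's titles that an index covers."""
    if total_titles <= 0:
        raise ValueError("total_titles must be positive")
    return indexed_titles / total_titles


MEDLINE_LAC_TITLES = 43          # LA&C titles indexed by MEDLINE (from the text)
LILACS_TITLES_LOWER_BOUND = 600  # assumption: the text says "more than 600"

share = coverage_share(MEDLINE_LAC_TITLES, LILACS_TITLES_LOWER_BOUND)
print(f"At most {share:.1%} of LILACS titles are indexed by MEDLINE")
```

Under this assumption, roughly 93 of every 100 journals indexed regionally are invisible to a reader who searches MEDLINE alone.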

The limited number of journals indexed in the major developed countries' indexes has historically limited the visibility and accessibility of a great volume of developing countries' publications. This phenomenon has been a constant in the analysis and discussion of the problems faced by LA&C scientific journals and was described by W. Wayt Gibbs as "Lost science in the Third World" in his article in Scientific American 1995;273(2):76-83.

Without systematic quality control, developing countries' journals have been considered in many circles as second class in terms of quality. Worse than that, even the titles from developing countries that are indexed as mainstream journals are frequently not perceived as such because they come from developing countries. This perception permeates the academic community and the agencies that support scientific research in both developed and developing countries. In consequence, the stigma or myth of being second class is often attached to developing countries' journals, and, in the view of many, it can only be minimized when the title is indexed in the mainstream indexes. The practical consequence is to reinforce authors' and institutions' preference for publishing in journals from developed countries, because they will be better rewarded. So, under this stigma, local journals are relegated to, or viewed as, simply recipients of manuscripts not accepted elsewhere or with a high probability of being rejected for a variety of reasons, including the authors' perceived value of the manuscript, a subject focused on local problems or interests, language, etc.

This is the context that feeds the vicious circle that has affected the evolution of developing countries' journals. And there is no prospect of breaking this vicious circle if developing countries' journals continue to be judged by the sole criterion of being or not being indexed by developed countries' bibliographic databases, in particular by the indicators supplied by the ISI JCR.

The SciELO Model was conceived, and is being developed, to break this vicious circle by promoting a context in which the publishing of non-mainstream journals can have a positive feedback with their potential or target community of readers and authors, and in which these journals can be progressively judged by their real dimension and value. In this line, the SciELO Model intends to provide its journal collections with advanced mechanisms of indexing and measurement that complement ISI JCR and other developed countries' information products and services.

3. The conception and evolution of the SciELO Model

The SciELO Model is based on the assumption that the universe of scientific journals published by developing countries varies widely in terms of quality, and that it includes high-quality titles that are not indexed by developed countries' bibliographic databases for a variety of reasons. In addition, the contents communicated by developing countries' journals are to a significant extent related to local problems, and therefore the promotion of their credibility and wide dissemination is crucial for the use of scientific information in the social and economic development process. Therefore, a strategic objective of SciELO is to shape and strengthen regional or peripheral scientific communication, envisaging and promoting its integration into the global information flow that the Internet is creating. An expected result of this strategy is to foster the development of science in and for developing countries as a consequence of the wide dissemination of local scientific research results.

In practical terms, the SciELO Model promotes the operation of a network of decentralized collections of selected journal titles that publish original articles communicating the results of original scientific research, as well as other original works, case reports, technical reports, reviews, and communications related to scientific research. SciELO sites are organized to increase the visibility, accessibility and usage of individual titles. In addition, SciELO encompasses a series of measurements of the usage and impact of journal titles, with a view to assessing and monitoring the quality of the journal titles and the contents they communicate.

For the purpose of its development and dissemination, the SciELO Model has evolved along three components: SciELO Methodology, SciELO Site and SciELO Network.

The SciELO Methodology comprises standards, guidelines and software for the electronic publishing of integrated collections of scientific journals. It makes intensive use of Internet-related technologies to provide users with the capability to browse and search texts across a collection of titles as well as within individual titles, to consult the tables of contents of individual issues, and to display and print abstracts and full texts, which are presented in HTML format and optionally in PDF format. The methodology also comprises a set of selection criteria and guidelines for including and maintaining titles in a collection, as well as an integrated set of tools to measure the usage and impact of the journal titles and their texts.

The texts are marked up using SGML-based international standards, so each text is accompanied by its metadata, which can be used to exchange records with different bibliographic databases as well as to establish hyperlinks inside a SciELO collection or with external information sources. A practical example is the interchange of records and the automatic linking between LILACS and MEDLINE bibliographic records and SciELO articles.
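The idea of a text that carries its own metadata, including identifiers usable for linking to external databases, can be sketched as follows. This is not the SciELO markup: the element names, the `database` attribute and the identifier value are all hypothetical, and XML is used as a convenient stand-in for SGML because Python's standard library can build and read it directly.

```python
import xml.etree.ElementTree as ET


def build_record(journal, title, year, lilacs_id=None, medline_id=None):
    """Build an article metadata record (hypothetical element names)."""
    article = ET.Element("article")
    ET.SubElement(article, "journal").text = journal
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "year").text = year
    # One <link> element per external database in which the article is indexed.
    for database, identifier in (("LILACS", lilacs_id), ("MEDLINE", medline_id)):
        if identifier is not None:
            link = ET.SubElement(article, "link", database=database)
            link.text = identifier
    return article


def external_links(article):
    """Map each external database name to the article's identifier there."""
    return {link.get("database"): link.text for link in article.findall("link")}


# Placeholder values only; "L-0001" is not a real LILACS identifier.
record = build_record("Example Journal", "An example article", "2000",
                      lilacs_id="L-0001")
print(ET.tostring(record, encoding="unicode"))
print(external_links(record))
```

Because the identifiers travel with the text, a bibliographic database that receives the record can build a hyperlink back to the full text without any manual matching.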

The methodology requires constant improvement in order to increase its efficiency, meet new needs and follow advances in electronic publishing. A key characteristic of the SciELO Methodology is that it combines compliance with international standards and practices with adaptability to the prevailing conditions of developing countries in terms of information technology infrastructure and economic and human resources. The first version of the methodology was developed during the pilot phase of the project, from March 1997 to May 1998, using a selected group of 10 Brazilian titles. A second version was launched in 1999, and the third version, fully based on XML, is expected to be released by the end of 2002.

The second component of the model is the SciELO Site, which comprises all issues related to the actual production and operation of a collection of electronic journals on the Internet, according to the SciELO Methodology. A SciELO Site requires an established organization to deal with its daily management and operation, including text conversion, editing, markup, storage, publishing, exchanging metadata and linking with external databases, and producing bibliographic indicators and reports of usage and impact.

SciELO Sites are planned to be operational at national and regional levels. The first SciELO Site was SciELO Brazil, which started its regular operation in June 1998 with 10 titles, reaching 54 titles covering all scientific areas by the end of 2000. SciELO Brazil is expected to reach around 80 titles by the end of 2002. SciELO Brazil, which was created and developed by the partnership between BIREME, FAPESP and a group of Brazilian scientific editors, was the origin of the SciELO Model. SciELO Brazil is financed by FAPESP and operated by BIREME, which also coordinates the dissemination of the model.

The success of SciELO Brazil was followed by SciELO Chile, which has operated regularly since the end of 1999 under the coordination of the Chilean National Council of Science and Technology (CONICyT). It started with 5 journal titles and reached 20 titles by the end of 2000. SciELO Chile is expected to reach around 35 titles by the end of 2002, covering all scientific areas.

SciELO Costa Rica is operating in pilot mode under the coordination of the Library of the Social Security Institute of Costa Rica (BINASS). It is expected to start its regular operation by April 2001 with 5 titles related to health sciences.

SciELO Cuba is also expected to start its regular operation by April 2001 with 5 titles related to health sciences. SciELO Cuba is coordinated by the Infomed unit of the National Center of Medical Sciences Information of the Ministry of Health of Cuba (CNCM).

In the next two years, the SciELO Model is expected to be adopted by at least three new countries from LA&C.

BIREME and the Instituto de Salud Carlos III of Spain are working on a cooperative project to implement SciELO Spain, covering health sciences journals published in Spain. The participation of Spain in SciELO will increase the availability of scientific information in Spanish for LA&C countries.

At the regional level, SciELO Public Health includes 5 titles, drawn from Brazil, Mexico and Spain together with the Bulletin of the Pan American Health Organization. The Bulletin of WHO is expected to be added to the collection by April 2001.

By bringing together selected collections of journal titles, SciELO Sites maximize the development of individual titles in several dimensions. First, a SciELO Site makes the electronic publishing of journals feasible in an advanced and compatible way, which would be impossible, or would take a long time and cost much more, if the journals were published individually. Most of the editors and publishers of LA&C journals do not have the economic and technological means to move their journals to electronic format in a sustainable way. Second, it gives the collection and each individual title higher visibility and accessibility than paper-based distribution or electronic publication on scattered sites on the Internet. The collection per se stimulates browsing, navigation and cross-linking among the texts of the different titles. SciELO also increases incoming links from external sources. The library adds value to users' time by maximizing the relationship between recall and precision in the searching process when compared to searching across non-compatible Internet sites. Third, as SciELO succeeds in quality control, it will be progressively recognized as a context that privileges quality and will therefore become a reference for authors, readers, editors, publishers, research-related agencies, etc., which will increase the credibility of SciELO journals.
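Recall and precision, mentioned above, have standard definitions in information retrieval: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. A minimal sketch, with made-up document identifiers and a hypothetical function name:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: identifiers returned by the search.
    relevant:  identifiers the searcher would actually want.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up identifiers: four articles returned, three actually relevant.
retrieved = ["a1", "a2", "a3", "a4"]
relevant = ["a2", "a4", "a7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A search over uniformly marked-up texts can improve both figures at once, whereas searching heterogeneous sites typically forces a trade-off between them.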

In order to guarantee the continued search for excellence, it is recommended that the operation of a SciELO Site be assisted by an advisory committee responsible for applying the selection criteria for including new titles in the collection and for monitoring the performance of individual titles against the minimum requirements for remaining in the collection.

The SciELO Site automatically produces a set of unique numeric indicators of usage and impact, which provide the advisory committee, editors, publishers and the agencies that support scientific communication with empirical data to monitor performance and to identify the weaknesses and flaws that affect the collection as a whole and each individual title. Examples of indicators are: Web pages visited, articles visited by journal, issues visited, citing and cited journals, etc. As these indicators acquire critical mass, countries using SciELO will have systematic and up-to-date series of bibliometric and informetric indicators with which to evaluate and improve their scientific communication. In other words, the SciELO Site is expected to create an environment that induces the constant monitoring and enhancement of the quality of journals.
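Indicators of the kind listed above are, at bottom, counts over two sorts of events: visits to texts and citations between journals. The sketch below shows one way such counts could be derived. The input formats, function names and sample data are assumptions made for illustration; the source does not describe how SciELO actually computes its indicators.

```python
from collections import Counter


def usage_indicators(visits):
    """Count visits per journal and per issue.

    visits: iterable of (journal, issue, article_id) tuples, one per visit.
    """
    visits = list(visits)
    return {
        "articles_visited_by_journal": Counter(j for j, _, _ in visits),
        "issues_visited": Counter((j, i) for j, i, _ in visits),
    }


def citation_indicators(citations):
    """Count how often each journal cites and is cited.

    citations: iterable of (citing_journal, cited_journal) pairs.
    """
    citations = list(citations)
    return {
        "citing": Counter(source for source, _ in citations),
        "cited": Counter(target for _, target in citations),
    }


# Invented sample data, for illustration only.
sample_visits = [
    ("Journal A", "v1n1", "a1"),
    ("Journal A", "v1n1", "a2"),
    ("Journal B", "v3n2", "b1"),
]
sample_citations = [
    ("Journal A", "Journal B"),
    ("Journal A", "Journal B"),
    ("Journal B", "Journal A"),
]

usage = usage_indicators(sample_visits)
cites = citation_indicators(sample_citations)
print(usage["articles_visited_by_journal"].most_common())
print(cites["cited"].most_common())
```

Accumulated period by period, such counts would form the time series that an advisory committee could use to compare titles within a collection.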

Finally, the systematic evaluation of a SciELO Site helps developing countries to build and improve local capacity in scientific writing, editing and publishing.

The third component of the SciELO Model is the development and operation of the network of SciELO sites. Currently the network is at its initial stage. Its current public expression is the Web Portal that links to the different decentralized SciELO Sites described before.

The dissemination and development of the SciELO network is promoted primarily by BIREME as part of its major program to strengthen the flow of scientific and technical health information among the countries of LA&C through the Virtual Health Library (VHL) technical cooperation framework. The VHL is conceived as a virtual space on the Internet to be operated cooperatively by information producers, intermediaries and users, with a view to equitable access to health information. Inside the VHL, SciELO texts are linked with other information sources such as the LILACS (Latin American and Caribbean Health Sciences) and MEDLINE databases, directories of authors, institutions, etc. These links increase the number of access points and therefore the public exposure of the texts. Inside the VHL, users can search for literature and have access to the original documents, whether they are on paper or in electronic media. For example, when the LILACS database is searched for "chagas disease Cochabamba", the VHL returns references that point to electronic texts when available and to the inter-library loan services for locating the document on paper. When the link to SciELO in the first reference is selected, the full-text article is displayed, with links to the authors' résumés and from the bibliographic references at the bottom of the article.

The dissemination of the model is carried out through the promotion of partnerships with national institutions supporting scientific research and communication. The SciELO Model is designed to cover all scientific subject areas and, whenever possible, the implementation of SciELO at the national level is coordinated by the national agency or agencies responsible for supporting scientific research and communication. When this is not possible in the short term, priority is given to the publishing of health science journals.

Once a national institution assumes the responsibility for operating a SciELO Site, BIREME transfers, free of charge, the whole of the SciELO Methodology needed to operate a collection of journals. In addition, the national institutions involved in the operation of SciELO are also expected to cooperate in the development and further dissemination of the Model at national and regional levels. It takes about 1 year to have a SciELO site operating on a regular basis.

The SciELO Network bases its operation on decentralized SciELO sites that are operated at the national level. Decentralization offers long-term advantages because it promotes the development of national capacity to run the local information flow as part of the regional and global information flow. This approach requires that local needs and idiosyncrasies be taken into account, including language priorities, national agendas for scientific research and communication, financing policies, etc. Decentralization may take more time than a centralized approach, but in the long term it assures greater sustainability.

The decentralization also poses challenges on the harmonization of policies and criteria as well as on the inter-operability of the collections.

Quality control for both the inclusion and the permanence of titles in SciELO collections will demand constant discussion and evaluation in order to balance the search for excellence and the promotion of local scientific communication. The continued development of a common set of indicators that take local and regional conditions into consideration will be necessary.

The inter-operability of the collections will demand constant technological improvement, including the provision of adequate connectivity, the searching and mining of databases of texts in different languages, the setting up of dynamic exchange of records and hyperlinks with external databases, etc. In addition to facilitating user access to information, efficient inter-operability among SciELO Sites will stimulate cross-citation among local and regional authors.

Finally, as SciELO becomes fully operational in several countries in the coming years, it will be possible to work on the elaboration of regional and sub-regional bibliometric indicators, which will complement the statistics currently available for the mainstream journals.