The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS | |
WORKSHOP ON ELECTRONIC TEXTS | |
PROCEEDINGS | |
Edited by James Daly | |
9-10 June 1992 | |
Library of Congress | |
Washington, D.C. | |
Supported by a Grant from the David and Lucile Packard Foundation | |
*** *** *** ****** *** *** *** | |
TABLE OF CONTENTS | |
Acknowledgements | |
Introduction | |
Proceedings | |
Welcome | |
Prosser Gifford and Carl Fleischhauer | |
Session I. Content in a New Form: Who Will Use It and What Will They Do? | |
James Daly (Moderator) | |
Avra Michelson, Overview | |
Susan H. Veccia, User Evaluation | |
Joanne Freeman, Beyond the Scholar | |
Discussion | |
Session II. Show and Tell | |
Jacqueline Hess (Moderator) | |
Elli Mylonas, Perseus Project | |
Discussion | |
Eric M. Calaluca, Patrologia Latina Database | |
Carl Fleischhauer and Ricky Erway, American Memory | |
Discussion | |
Dorothy Twohig, The Papers of George Washington | |
Discussion | |
Maria L. Lebron, The Online Journal of Current Clinical Trials | |
Discussion | |
Lynne K. Personius, Cornell mathematics books | |
Discussion | |
Session III. Distribution, Networks, and Networking: | |
Options for Dissemination | |
Robert G. Zich (Moderator) | |
Clifford A. Lynch | |
Discussion | |
Howard Besser | |
Discussion | |
Ronald L. Larsen | |
Edwin B. Brownrigg | |
Discussion | |
Session IV. Image Capture, Text Capture, Overview of Text and | |
Image Storage Formats | |
William L. Hooton (Moderator) | |
A) Principal Methods for Image Capture of Text: | |
direct scanning, use of microform | |
Anne R. Kenney | |
Pamela Q.J. Andre | |
Judith A. Zidar | |
Donald J. Waters | |
Discussion | |
B) Special Problems: bound volumes, conservation, | |
reproducing printed halftones | |
George Thoma | |
Carl Fleischhauer | |
Discussion | |
C) Image Standards and Implications for Preservation | |
Jean Baronas | |
Patricia Battin | |
Discussion | |
D) Text Conversion: OCR vs. rekeying, standards of accuracy | |
and use of imperfect texts, service bureaus | |
Michael Lesk | |
Ricky Erway | |
Judith A. Zidar | |
Discussion | |
Session V. Approaches to Preparing Electronic Texts | |
Susan Hockey (Moderator) | |
Stuart Weibel | |
Discussion | |
C.M. Sperberg-McQueen | |
Discussion | |
Eric M. Calaluca | |
Discussion | |
Session VI. Copyright Issues | |
Marybeth Peters | |
Session VII. Conclusion | |
Prosser Gifford (Moderator) | |
General discussion | |
Appendix I: Program | |
Appendix II: Abstracts | |
Appendix III: Directory of Participants | |
*** *** *** ****** *** *** *** | |
Acknowledgements | |
I would like to thank Carl Fleischhauer and Prosser Gifford for the | |
opportunity to learn about areas of human activity unknown to me a scant | |
ten months ago, and the David and Lucile Packard Foundation for | |
supporting that opportunity. The help given by others is acknowledged on | |
a separate page. | |
19 October 1992 | |
*** *** *** ****** *** *** *** | |
INTRODUCTION | |
The Workshop on Electronic Texts (1) drew together representatives of | |
various projects and interest groups to compare ideas, beliefs, | |
experiences, and, in particular, methods of placing and presenting | |
historical textual materials in computerized form. Most attendees gained | |
much in insight and outlook from the event. But the assembly did not | |
form a new nation, or, to put it another way, the diversity of projects | |
and interests was too great to draw the representatives into a cohesive, | |
action-oriented body.(2) | |
Everyone attending the Workshop shared an interest in preserving and | |
providing access to historical texts. But within this broad field the | |
attendees represented a variety of formal, informal, figurative, and | |
literal groups, with many individuals belonging to more than one. These | |
groups may be defined roughly according to the following topics or | |
activities: | |
* Imaging | |
* Searchable coded texts | |
* National and international computer networks | |
* CD-ROM production and dissemination | |
* Methods and technology for converting older paper materials into | |
electronic form | |
* Study of the use of digital materials by scholars and others | |
This summary is arranged thematically and does not follow the actual | |
sequence of presentations. | |
NOTES: | |
(1) In this document, the phrase electronic text is used to mean | |
any computerized reproduction or version of a document, book, | |
article, or manuscript (including images), and not merely a machine- | |
readable or machine-searchable text. | |
(2) The Workshop was held at the Library of Congress on 9-10 June | |
1992, with funding from the David and Lucile Packard Foundation. | |
The document that follows represents a summary of the presentations | |
made at the Workshop and was compiled by James DALY. This | |
introduction was written by DALY and Carl FLEISCHHAUER. | |
PRESERVATION AND IMAGING | |
Preservation, as that term is used by archivists,(3) was most explicitly | |
discussed in the context of imaging. Anne KENNEY and Lynne PERSONIUS | |
explained how the concept of a faithful copy and the user-friendliness of | |
the traditional book have guided their project at Cornell University.(4) | |
Although interested in computerized dissemination, participants in the | |
Cornell project are creating digital image sets of older books in the | |
public domain as a source for a fresh paper facsimile or, in a future | |
phase, microfilm. The books returned to the library shelves are | |
high-quality and useful replacements on acid-free paper that should last | |
a long time. To date, the Cornell project has placed little or no | |
emphasis on creating searchable texts; one would not be surprised to find | |
that the project participants view such texts as new editions, and thus | |
not as faithful reproductions. | |
In her talk on preservation, Patricia BATTIN struck an ecumenical and | |
flexible note as she endorsed the creation and dissemination of a variety | |
of types of digital copies. Do not be too narrow in defining what counts | |
as a preservation element, BATTIN counseled; for the present, at least, | |
digital copies made with preservation in mind cannot be as narrowly | |
standardized as, say, microfilm copies with the same objective. Setting | |
standards precipitously can inhibit creativity, but delay can result in | |
chaos, she advised. | |
In part, BATTIN's position reflected the unsettled nature of image-format | |
standards, and attendees could hear echoes of this unsettledness in the | |
comments of various speakers. For example, Jean BARONAS reviewed the | |
status of several formal standards moving through committees of experts; | |
and Clifford LYNCH encouraged the use of a new guideline for transmitting | |
document images on Internet. Testimony from participants in the National | |
Agricultural Library's (NAL) Text Digitization Program and LC's American | |
Memory project highlighted some of the challenges to the actual creation | |
or interchange of images, including difficulties in converting | |
preservation microfilm to digital form. Donald WATERS reported on the | |
progress of a master plan for a project at Yale University to convert | |
books on microfilm to digital image sets, Project Open Book (POB). | |
The Workshop offered rather less of an imaging practicum than planned, | |
but "how-to" hints emerge at various points, for example, throughout | |
KENNEY's presentation and in the discussion of arcana such as | |
thresholding and dithering offered by George THOMA and FLEISCHHAUER. | |
NOTES: | |
(3) Although there is a sense in which any reproductions of | |
historical materials preserve the human record, specialists in the | |
field have developed particular guidelines for the creation of | |
acceptable preservation copies. | |
(4) Titles and affiliations of presenters are given at the | |
beginning of their respective talks and in the Directory of | |
Participants (Appendix III). | |
THE MACHINE-READABLE TEXT: MARKUP AND USE | |
The sections of the Workshop that dealt with machine-readable text tended | |
to be more concerned with access and use than with preservation, at least | |
in the narrow technical sense. Michael SPERBERG-McQUEEN made a forceful | |
presentation on the Text Encoding Initiative's (TEI) implementation of | |
the Standard Generalized Markup Language (SGML). His ideas were echoed | |
by Susan HOCKEY, Elli MYLONAS, and Stuart WEIBEL. While the | |
presentations made by the TEI advocates contained no practicum, their | |
discussion focused on the value of the finished product, what the | |
European Community calls reusability, but what may also be termed | |
durability. They argued that marking up--that is, coding--a text in a | |
well-conceived way will permit it to be moved from one computer | |
environment to another, as well as to be used by various users. Two | |
kinds of markup were distinguished: 1) procedural markup, which | |
describes the features of a text (e.g., dots on a page), and 2) | |
descriptive markup, which describes the structure or elements of a | |
document (e.g., chapters, paragraphs, and front matter). | |
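The reusability argument above can be illustrated with a minimal sketch. It is not taken from the Workshop and does not use TEI or SGML tooling: it assumes a small XML fragment as a stand-in for SGML, and the element names (chapter, title, para) and both rendering functions are hypothetical. The point is only that a text marked up by structure can be rendered for more than one environment without touching the text itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively marked-up fragment. The element names
# (chapter, title, para) are illustrative only and are not TEI tags.
SOURCE = """<chapter n="1">
  <title>Arms and the Man</title>
  <para>First paragraph.</para>
  <para>Second paragraph.</para>
</chapter>"""


def to_plain_text(chapter):
    """Render the structure for a plain-text environment."""
    lines = [chapter.findtext("title").upper(), ""]
    lines += [p.text for p in chapter.findall("para")]
    return "\n".join(lines)


def to_html(chapter):
    """Render the same structure for a different display environment."""
    parts = ["<h1>{}</h1>".format(chapter.findtext("title"))]
    parts += ["<p>{}</p>".format(p.text) for p in chapter.findall("para")]
    return "\n".join(parts)


chapter = ET.fromstring(SOURCE)
print(to_plain_text(chapter))
print(to_html(chapter))
```

Because the markup records what each element is rather than how it should look, a third rendering could be added later without re-encoding the source.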
The TEI proponents emphasized the importance of texts to scholarship. | |
They explained how heavily coded (and thus analyzed and annotated) texts | |
can underlie research, play a role in scholarly communication, and | |
facilitate classroom teaching. SPERBERG-McQUEEN reminded listeners that | |
a written or printed item (e.g., a particular edition of a book) is | |
merely a representation of the abstraction we call a text. To concern | |
ourselves with faithfully reproducing a printed instance of the text, | |
SPERBERG-McQUEEN argued, is to concern ourselves with the representation | |
of a representation ("images as simulacra for the text"). The TEI proponents' | |
interest in images tends to focus on corollary materials for use in teaching, | |
for example, photographs of the Acropolis to accompany a Greek text. | |
By the end of the Workshop, SPERBERG-McQUEEN confessed to having been | |
converted to a limited extent to the view that electronic images | |
constitute a promising alternative to microfilming; indeed, an | |
alternative probably superior to microfilming. But he was not convinced | |
that electronic images constitute a serious attempt to represent text in | |
electronic form. HOCKEY and MYLONAS also conceded that their experience | |
at the Pierce Symposium the previous week at Georgetown University and | |
the present conference at the Library of Congress had compelled them to | |
reevaluate their perspective on the usefulness of text as images. | |
Attendees could see that the text and image advocates were in | |
constructive tension, so to speak. | |
Three non-TEI presentations described approaches to preparing | |
machine-readable text that are less rigorous and thus less expensive. In | |
the case of the Papers of George Washington, Dorothy TWOHIG explained | |
that the digital version will provide a not-quite-perfect rendering of | |
the transcribed text--some 135,000 documents--available for research | |
during the decades while the perfect or print version is completed. | |
Members of the American Memory team and the staff of NAL's Text | |
Digitization Program (see below) also outlined a middle ground concerning | |
searchable texts. In the case of American Memory, contractors produce | |
texts with about 99-percent accuracy that serve as "browse" or | |
"reference" versions of written or printed originals. End users who need | |
faithful copies or perfect renditions must refer to accompanying sets of | |
digital facsimile images or consult copies of the originals in a nearby | |
library or archive. American Memory staff argued that the high cost of | |
producing 100-percent accurate copies would prevent LC from offering | |
access to large parts of its collections. | |
THE MACHINE-READABLE TEXT: METHODS OF CONVERSION | |
Although the Workshop did not include a systematic examination of the | |
methods for converting texts from paper (or from facsimile images) into | |
machine-readable form, nevertheless, various speakers touched upon this | |
matter. For example, WEIBEL reported that OCLC has experimented with a | |
merging of multiple optical character recognition systems that will | |
reduce errors from an unacceptable rate of 5 characters out of every | |
1,000 to an unacceptable rate of 2 characters out of every 1,000. | |
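One way to picture the merging of several OCR outputs is a character-by-character majority vote. The sketch below is an illustration only and is not a description of OCLC's system: the function names and sample strings are hypothetical, and it assumes the readings are already aligned and of equal length, which real OCR output seldom is.

```python
from collections import Counter


def vote(readings):
    """Return the majority character at each position.

    Assumes all readings are aligned and of equal length. When no
    character has more than one vote, the first engine's reading wins.
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to equal length")
    merged = []
    for chars in zip(*readings):
        best, count = Counter(chars).most_common(1)[0]
        merged.append(best if count > 1 else chars[0])
    return "".join(merged)


def mismatches(text, truth):
    """Count positions where text differs from the known-correct string."""
    return sum(1 for a, b in zip(text, truth) if a != b)


truth = "optical character recognition"
engine_a = "optical character recognition"
engine_b = "optical charactcr recognition"   # 'e' misread as 'c'
engine_c = "0ptical character rec0gnition"   # 'o' misread as '0' twice

merged = vote([engine_a, engine_b, engine_c])
print(merged, mismatches(merged, truth))
```

Voting helps only where the engines err in different places; an error shared by a majority of engines survives the merge, which is consistent with a reduced but nonzero error rate.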
Pamela ANDRE presented an overview of NAL's Text Digitization Program and | |
Judith ZIDAR discussed the technical details. ZIDAR explained how NAL | |
purchased hardware and software capable of performing optical character | |
recognition (OCR) and text conversion and used its own staff to convert | |
texts. The process, ZIDAR said, required extensive editing and project | |
staff found themselves considering alternatives, including rekeying | |
and/or creating abstracts or summaries of texts. NAL reckoned costs at | |
$7 per page. By way of contrast, Ricky ERWAY explained that American | |
Memory had decided from the start to contract out conversion to external | |
service bureaus. The criteria used to select these contractors were cost | |
and quality of results, as opposed to methods of conversion. ERWAY noted | |
that historical documents or books often do not lend themselves to OCR. | |
Bound materials represent a special problem. In her experience, quality | |
control--inspecting incoming materials, counting errors in samples--posed | |
the most time-consuming aspect of contracting out conversion. ERWAY | |
reckoned American Memory's costs at $4 per page, but cautioned that fewer | |
cost-elements had been included than in NAL's figure. | |
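The quality-control step ERWAY described, counting errors in samples, can be sketched as follows. This is a hedged illustration and not American Memory's actual procedure: the function names are hypothetical, the error measure (character edit distance against a carefully proofread reference) is an assumption, and any acceptance threshold would be set by the project.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def sample_accuracy(reference_pages, delivered_pages):
    """Estimate character accuracy over a sample of proofread pages."""
    total_chars = sum(len(r) for r in reference_pages)
    total_errors = sum(edit_distance(r, d)
                       for r, d in zip(reference_pages, delivered_pages))
    return 1 - total_errors / total_chars


# Hypothetical one-line "pages"; a real sample would use full pages.
reference = ["the quick brown fox", "jumps over the lazy dog"]
delivered = ["the quick brovvn fox", "jumps over the lazy dog"]
print(round(sample_accuracy(reference, delivered), 4))
```

The estimate is only as good as the sample: a few pages drawn from the cleanest part of a delivery would overstate the accuracy of the whole.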
OPTIONS FOR DISSEMINATION | |
The topic of dissemination proper emerged at various points during the | |
Workshop. At the session devoted to national and international computer | |
networks, LYNCH, Howard BESSER, Ronald LARSEN, and Edwin BROWNRIGG | |
highlighted the virtues of Internet today and of the network that will | |
evolve from Internet. Listeners could discern in these narratives a | |
vision of an information democracy in which millions of citizens freely | |
find and use what they need. LYNCH noted that a lack of standards | |
inhibits disseminating multimedia on the network, a topic also discussed | |
by BESSER. LARSEN addressed the issues of network scalability and | |
modularity and commented upon the difficulty of anticipating the effects | |
of growth in orders of magnitude. BROWNRIGG talked about the ability of | |
packet radio to provide certain links in a network without the need for | |
wiring. However, the presenters also called attention to the | |
shortcomings and incongruities of present-day computer networks. For | |
example: 1) Network use is growing dramatically, but much network | |
traffic consists of personal communication (E-mail). 2) Large bodies of | |
information are available, but a user's ability to search across their | |
entirety is limited. 3) There are significant resources for science and | |
technology, but few network sources provide content in the humanities. | |
4) Machine-readable texts are commonplace, but the capability of the | |
system to deal with images (let alone other media formats) lags behind. | |
A glimpse of a multimedia future for networks, however, was provided by | |
Maria LEBRON in her overview of the Online Journal of Current Clinical | |
Trials (OJCCT), and the process of scholarly publishing on-line. | |
The contrasting form of the CD-ROM disk was never systematically | |
analyzed, but attendees could glean an impression from several of the | |
show-and-tell presentations. The Perseus and American Memory examples | |
demonstrated recently published disks, while the descriptions of the | |
IBYCUS version of the Papers of George Washington and Chadwyck-Healey's | |
Patrologia Latina Database (PLD) told of disks to come. According to | |
Eric CALALUCA, PLD's principal focus has been on converting Jacques-Paul | |
Migne's definitive collection of Latin texts to machine-readable form. | |
Although everyone could share the network advocates' enthusiasm for an | |
on-line future, the possibility of rolling up one's sleeves for a session | |
with a CD-ROM containing both textual materials and a powerful retrieval | |
engine made the disk seem an appealing vessel indeed. The overall | |
discussion suggested that the transition from CD-ROM to on-line networked | |
access may prove far slower and more difficult than has been anticipated. | |
WHO ARE THE USERS AND WHAT DO THEY DO? | |
Although concerned with the technicalities of production, the Workshop | |
never lost sight of the purposes and uses of electronic versions of | |
textual materials. As noted above, those interested in imaging discussed | |
the problematical matter of digital preservation, while the TEI proponents | |
described how machine-readable texts can be used in research. This latter | |
topic received thorough treatment in the paper read by Avra MICHELSON. | |
She placed the phenomenon of electronic texts within the context of | |
broader trends in information technology and scholarly communication. | |
Among other things, MICHELSON described on-line conferences that | |
represent a vigorous and important intellectual forum for certain | |
disciplines. Internet now carries more than 700 conferences, with about | |
80 percent of these devoted to topics in the social sciences and the | |
humanities. Other scholars use on-line networks for "distance learning." | |
Meanwhile, there has been a tremendous growth in end-user computing; | |
professors today are less likely than their predecessors to ask the | |
campus computer center to process their data. Electronic texts are one | |
key to these sophisticated applications, MICHELSON reported, and more and | |
more scholars in the humanities now work in an on-line environment. | |
Toward the end of the Workshop, Michael LESK presented a corollary to | |
MICHELSON's talk, reporting the results of an experiment that compared | |
the work of one group of chemistry students using traditional printed | |
texts and two groups using electronic sources. The experiment | |
demonstrated that in the event one does not know what to read, one needs | |
the electronic systems; the electronic systems hold no advantage at the | |
moment if one knows what to read, but neither do they impose a penalty. | |
DALY provided an anecdotal account of the revolutionizing impact of the | |
new technology on his previous methods of research in the field of classics. | |
His account, by extrapolation, served to illustrate in part the arguments | |
made by MICHELSON concerning the positive effects of the sudden and radical | |
transformation being wrought in the ways scholars work. | |
Susan VECCIA and Joanne FREEMAN delineated the use of electronic | |
materials outside the university. The most interesting aspect of their | |
use, FREEMAN said, could be seen as a paradox: teachers in elementary | |
and secondary schools requested access to primary source materials but, | |
at the same time, found that "primariness" itself made these materials | |
difficult for their students to use. | |
OTHER TOPICS | |
Marybeth PETERS reviewed copyright law in the United States and offered | |
advice during a lively discussion of this subject. But uncertainty | |
remains concerning the price of copyright in a digital medium, because a | |
solution remains to be worked out concerning management and synthesis of | |
copyrighted and out-of-copyright pieces of a database. | |
As moderator of the final session of the Workshop, Prosser GIFFORD directed | |
discussion to future courses of action and the potential role of LC in | |
advancing them. Among the recommendations that emerged were the following: | |
* Workshop participants should 1) begin to think about working | |
with image material, but structure and digitize it in such a | |
way that at a later stage it can be interpreted into text, and | |
2) find a common way to build text and images together so that | |
they can be used jointly at some stage in the future, with | |
appropriate network support, because that is how users will want | |
to access these materials. The Library might encourage attempts | |
to bring together people who are working on texts and images. | |
* A network version of American Memory should be developed or | |
consideration should be given to making the data in it | |
available to people interested in doing network multimedia. | |
Given the current dearth of digital data that is appealing and | |
unencumbered by extremely complex rights problems, developing a | |
network version of American Memory could do much to help make | |
network multimedia a reality. | |
* Concerning the thorny issue of electronic deposit, LC should | |
initiate a catalytic process in terms of distributed | |
responsibility, that is, bring together the distributed | |
organizations and set up a study group to look at all the | |
issues related to electronic deposit and see where we as a | |
nation should move. For example, LC might attempt to persuade | |
one major library in each state to deal with its state | |
equivalent publisher, which might produce a cooperative project | |
that would be equitably distributed around the country, and one | |
in which LC would be dealing with a minimal number of publishers | |
and minimal copyright problems. LC must also deal with the | |
concept of on-line publishing, determining, among other things, | |
how serials such as OJCCT might be deposited for copyright. | |
* Since a number of projects are planning to carry out | |
preservation by creating digital images that will end up in | |
on-line or near-line storage at some institution, LC might play | |
a helpful role, at least in the near term, by accelerating work on how | |
to catalog that information into the Research Libraries Information | |
Network (RLIN) and then into OCLC, so that it would be accessible. | |
This would reduce the possibility of multiple institutions digitizing | |
the same work. | |
CONCLUSION | |
The Workshop was valuable because it brought together partisans from | |
various groups and provided an occasion to compare goals and methods. | |
The more committed partisans frequently communicate with others in their | |
groups, but less often across group boundaries. The Workshop was also | |
valuable to attendees--including those involved with American Memory--who | |
came less committed to particular approaches or concepts. These | |
attendees learned a great deal, and plan to select and employ elements of | |
imaging, text-coding, and networked distribution that suit their | |
respective projects and purposes. | |
Still, reality rears its ugly head: no breakthrough has been achieved. | |
On the imaging side, one confronts a proliferation of competing | |
data-interchange standards and a lack of consensus on the role of digital | |
facsimiles in preservation. In the realm of machine-readable texts, one | |
encounters a reasonably mature standard but methodological difficulties | |
and high costs. These latter problems, of course, represent a special | |
impediment to the desire, as it is sometimes expressed in the popular | |
press, "to put the [contents of the] Library of Congress on line." In | |
the words of one participant, there was "no solution to the economic | |
problems--the projects that are out there are surviving, but it is going | |
to be a lot of work to transform the information industry, and so far the | |
investment to do that is not forthcoming" (LESK, per litteras). | |
*** *** *** ****** *** *** *** | |
PROCEEDINGS | |
WELCOME | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
GIFFORD * Origin of Workshop in current Librarian's desire to make LC's | |
collections more widely available * Desiderata arising from the prospect | |
of greater interconnectedness * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
After welcoming participants on behalf of the Library of Congress, | |
American Memory (AM), and the National Demonstration Lab, Prosser | |
GIFFORD, director for scholarly programs, Library of Congress, located | |
the origin of the Workshop on Electronic Texts in a conversation he had | |
had considerably more than a year ago with Carl FLEISCHHAUER concerning | |
some of the issues faced by AM. On the assumption that numerous other | |
people were asking the same questions, the decision was made to bring | |
together as many of these people as possible to ask the same questions | |
together. In a deeper sense, GIFFORD said, the origin of the Workshop | |
lay in the desire of the current Librarian of Congress, James H. | |
Billington, to make the collections of the Library, especially those | |
offering unique or unusual testimony on aspects of the American | |
experience, available to a much wider circle of users than those few | |
people who can come to Washington to use them. This meant that the | |
emphasis of AM, from the outset, has been on archival collections of the | |
basic material, and on making these collections themselves available, | |
rather than selected or heavily edited products. | |
From AM's emphasis followed the questions with which the Workshop began: | |
who will use these materials, and in what form will they wish to use | |
them. But an even larger issue deserving mention, in GIFFORD's view, was | |
the phenomenal growth in Internet connectivity. He expressed the hope | |
that the prospect of greater interconnectedness than ever before would | |
lead to: 1) much more cooperative and mutually supportive endeavors; 2) | |
development of systems of shared and distributed responsibilities to | |
avoid duplication and to ensure accuracy and preservation of unique | |
materials; and 3) agreement on the necessary standards and development of | |
the appropriate directories and indices to make navigation | |
straightforward among the varied resources that are, and increasingly | |
will be, available. In this connection, GIFFORD requested that | |
participants reflect from the outset upon the sorts of outcomes they | |
thought the Workshop might have. Did those present constitute a group | |
with sufficient common interests to propose a next step or next steps, | |
and if so, what might those be? They would return to these questions the | |
following afternoon. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
FLEISCHHAUER * Core of Workshop concerns preparation and production of | |
materials * Special challenge in conversion of textual materials * | |
Quality versus quantity * Do the several groups represented share common | |
interests? * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress, | |
emphasized that he would attempt to represent the people who perform some | |
of the work of converting or preparing materials and that the core of | |
the Workshop had to do with preparation and production. FLEISCHHAUER | |
then drew a distinction between the long term, when many things would be | |
available and connected in the ways that GIFFORD described, and the short | |
term, in which AM not only has wrestled with the issue of what is the | |
best course to pursue but also has faced a variety of technical | |
challenges. | |
FLEISCHHAUER remarked on AM's endeavors to deal with a wide range of library | |
formats, such as motion picture collections, sound-recording collections, | |
and pictorial collections of various sorts, especially collections of | |
photographs. In the course of these efforts, AM kept coming back to | |
textual materials--manuscripts or rare printed matter, bound materials, | |
etc. Text posed the greatest conversion challenge of all. Thus, the | |
genesis of the Workshop, which reflects the problems faced by AM. These | |
problems include physical problems. For example, those in the library | |
and archive business deal with collections made up of fragile and rare | |
manuscript items, bound materials, especially the notoriously brittle | |
bound materials of the late nineteenth century. These are precious | |
cultural artifacts, however, as well as interesting sources of | |
information, and LC desires to retain and conserve them. AM needs to | |
handle things without damaging them. Guillotining a book to run it | |
through a sheet feeder must be avoided at all costs. | |
Beyond physical problems, issues pertaining to quality arose. For | |
example, the desire to provide users with a searchable text is affected | |
by the question of acceptable level of accuracy. One hundred percent | |
accuracy is tremendously expensive. On the other hand, the output of | |
optical character recognition (OCR) can be tremendously inaccurate. | |
Although AM has attempted to find a middle ground, uncertainty persists | |
as to whether or not it has discovered the right solution. | |
Questions of quality arose concerning images as well. FLEISCHHAUER | |
contrasted the extremely high level of quality of the digital images in | |
the Cornell Xerox Project with AM's efforts to provide a browse-quality | |
or access-quality image, as opposed to an archival or preservation image. | |
FLEISCHHAUER therefore welcomed the opportunity to compare notes. | |
FLEISCHHAUER observed in passing that conversations he had had about | |
networks have begun to signal that for various forms of media a | |
determination may be made that there is a browse-quality item, or a | |
distribution-and-access-quality item that may coexist in some systems | |
with a higher quality archival item that would be inconvenient to send | |
through the network because of its size. FLEISCHHAUER referred, of | |
course, to images more than to searchable text. | |
As AM considered those questions, several conceptual issues arose: ought | |
AM occasionally to reproduce materials entirely through an image set, at | |
other times, entirely through a text set, and in some cases, a mix? | |
There probably would be times when the historical authenticity of an | |
artifact would require that its image be used. An image might be | |
desirable as a recourse for users if one could not provide 100-percent | |
accurate text. Again, AM wondered, as a practical matter, if a | |
distinction could be drawn between unique items and rare printed matter | |
that might exist in multiple collections--that is, in ten or fifteen | |
libraries. In such cases, the need for perfect reproduction would be | |
less than for unique items. Implicit in his remarks, FLEISCHHAUER | |
conceded, was the admission | |
that AM has been tilting strongly towards quantity and drawing back a | |
little from perfect quality. That is, it seemed to AM that society would | |
be better served if more things were distributed by LC--even if they were | |
not quite perfect--than if fewer things, perfectly represented, were | |
distributed. This was stated as a proposition to be tested, with | |
responses to be gathered from users. | |
In thinking about issues related to reproduction of materials and seeing | |
other people engaged in parallel activities, AM deemed it useful to | |
convene a conference. Hence, the Workshop. FLEISCHHAUER thereupon | |
surveyed the several groups represented: 1) the world of images (image | |
users and image makers); 2) the world of text and scholarship and, within | |
this group, those concerned with language--FLEISCHHAUER confessed to finding | |
delightful irony in the fact that some of the most advanced thinkers on | |
computerized texts are those dealing with ancient Greek and Roman materials; | |
3) the network world; and 4) the general world of library science, which | |
includes people interested in preservation and cataloging. | |
FLEISCHHAUER concluded his remarks with special thanks to the David and | |
Lucile Packard Foundation for its support of the meeting, the American | |
Memory group, the Office for Scholarly Programs, the National | |
Demonstration Lab, and the Office of Special Events. He expressed the | |
hope that David Woodley Packard might be able to attend, noting that | |
Packard's work and the work of the foundation had sponsored a number of | |
projects in the text area. | |
****** | |
SESSION I. CONTENT IN A NEW FORM: WHO WILL USE IT AND WHAT WILL THEY DO? | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DALY * Acknowledgements * A new Latin authors disk * Effects of the new | |
technology on previous methods of research * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Serving as moderator, James DALY acknowledged the generosity of all the | |
presenters for giving of their time, counsel, and patience in planning | |
the Workshop, as well as of members of the American Memory project and | |
other Library of Congress staff, and the David and Lucile Packard | |
Foundation and its executive director, Colburn S. Wilbur. | |
DALY then recounted his visit in March to the Center for Electronic Texts | |
in the Humanities (CETH) and the Department of Classics at Rutgers | |
University, where an old friend, Lowell Edmunds, introduced him to the | |
department's IBYCUS scholarly personal computer, and, in particular, the | |
new Latin CD-ROM, containing, among other things, almost all classical | |
Latin literary texts through A.D. 200. Packard Humanities Institute | |
(PHI), Los Altos, California, released this disk late in 1991, with a | |
nominal triennial licensing fee. | |
Playing with the disk for an hour or so at Rutgers brought home to DALY | |
at once the revolutionizing impact of the new technology on his previous | |
methods of research. Had this disk been available two or three years | |
earlier, DALY contended, when he was engaged in preparing a commentary on | |
Book 10 of Virgil's Aeneid for Cambridge University Press, he would not | |
have required a forty-eight-square-foot table on which to spread the | |
numerous, most frequently consulted items, including some ten or twelve | |
concordances to key Latin authors, an almost equal number of lexica to | |
authors who lacked concordances, and where either lexica or concordances | |
were lacking, numerous editions of authors antedating and postdating Virgil. | |
Nor, when checking each of the average six to seven words contained in | |
the Virgilian hexameter for its usage elsewhere in Virgil's works or | |
other Latin authors, would DALY have had to maintain the laborious | |
mechanical process of flipping through these concordances, lexica, and | |
editions each time. Nor would he have had to frequent as often the | |
Milton S. Eisenhower Library at the Johns Hopkins University to consult | |
the Thesaurus Linguae Latinae. Instead of devoting countless hours, or | |
the bulk of his research time, to gathering data concerning Virgil's use | |
of words, DALY--now freed by PHI's Latin authors disk from the | |
tyrannical, yet in some ways paradoxically happy scholarly drudgery-- | |
would have been able to devote that same bulk of time to analyzing and | |
interpreting Virgilian verbal usage. | |
Citing Theodore Brunner, Gregory Crane, Elli MYLONAS, and Avra MICHELSON, | |
DALY argued that this reversal in his style of work, made possible by the | |
new technology, would perhaps have resulted in better, more productive | |
research. Indeed, even in the course of his browsing the Latin authors | |
disk at Rutgers, its powerful search, retrieval, and highlighting | |
capabilities suggested to him several new avenues of research into | |
Virgil's use of sound effects. This anecdotal account, DALY maintained, | |
may serve to illustrate in part the sudden and radical transformation | |
being wrought in the ways scholars work. | |
****** | |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
MICHELSON * Elements related to scholarship and technology * Electronic | |
texts within the context of broader trends within information technology | |
and scholarly communication * Evaluation of the prospects for the use of | |
electronic texts * Relationship of electronic texts to processes of | |
scholarly communication in humanities research * New exchange formats | |
created by scholars * Projects initiated to increase scholarly access to | |
converted text * Trend toward making electronic resources available | |
through research and education networks * Changes taking place in | |
scholarly communication among humanities scholars * Network-mediated | |
scholarship transforming traditional scholarly practices * Key | |
information technology trends affecting the conduct of scholarly | |
communication over the next decade * The trend toward end-user computing | |
* The trend toward greater connectivity * Effects of these trends * Key | |
transformations taking place * Summary of principal arguments * | |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Avra MICHELSON, Archival Research and Evaluation Staff, National Archives | |
and Records Administration (NARA), argued that establishing who will use | |
electronic texts and what they will use them for involves a consideration | |
of both information technology and scholarship trends. This | |
consideration includes several elements related to scholarship and | |
technology: 1) the key trends in information technology that are most | |
relevant to scholarship; 2) the key trends in the use of currently | |
available technology by scholars in the nonscientific community; and 3) | |
the relationship between these two very distinct but interrelated trends. | |
The investment in understanding this relationship being made by | |
information providers, technologists, and public policy developers, as | |
well as by scholars themselves, seems to be pervasive and growing, | |
MICHELSON contended. She drew on collaborative work with Jeff Rothenberg | |
on the scholarly use of technology. | |
MICHELSON sought to place the phenomenon of electronic texts within the | |
context of broader trends within information technology and scholarly | |
communication. She argued that electronic texts are of most use to | |
researchers to the extent that the researchers' working context (i.e., | |
their relevant bibliographic sources, collegial feedback, analytic tools, | |
notes, drafts, etc.), along with their field's primary and secondary | |
sources, also is accessible in electronic form and can be integrated in | |
ways that are unique to the on-line environment. | |
Evaluation of the prospects for the use of electronic texts includes two | |
elements: 1) an examination of the ways in which researchers currently | |
are using electronic texts along with other electronic resources, and 2) | |
an analysis of key information technology trends that are affecting the | |
long-term conduct of scholarly communication. MICHELSON limited her | |
discussion of the use of electronic texts to the practices of humanists | |
and noted that the scientific community was outside the panel's overview. | |
MICHELSON examined the nature of the current relationship of electronic | |
texts in particular, and electronic resources in general, to what she | |
maintained were, essentially, five processes of scholarly communication | |
in humanities research. Researchers 1) identify sources, 2) communicate | |
with their colleagues, 3) interpret and analyze data, 4) disseminate | |
their research findings, and 5) prepare curricula to instruct the next | |
generation of scholars and students. This examination would produce a | |
clearer understanding of the synergy among these five processes that | |
fuels the tendency of the use of electronic resources for one process to | |
stimulate their use for other processes of scholarly communication. | |
For the first process of scholarly communication, the identification of | |
sources, MICHELSON remarked on the opportunity scholars now enjoy to | |
supplement traditional word-of-mouth searches for sources among their | |
colleagues with new forms of electronic searching. So, for example, | |
instead of having to visit the library, researchers are able to explore | |
descriptions of holdings in their offices. Furthermore, if their own | |
institutions' holdings prove insufficient, scholars can access more than | |
200 major American library catalogues over Internet, including the | |
universities of California, Michigan, Pennsylvania, and Wisconsin. | |
Direct access to the bibliographic databases offers intellectual | |
empowerment to scholars by presenting a comprehensive means of browsing | |
through libraries from their homes and offices at their convenience. | |
The second process of communication involves communication among | |
scholars. Beyond the most common methods of communication, scholars are | |
using E-mail and a variety of new electronic communications formats | |
derived from it for further academic interchange. E-mail exchanges are | |
growing at an astonishing rate, reportedly 15 percent a month. They | |
currently constitute approximately half the traffic on research and | |
education networks. Moreover, the global spread of E-mail has been so | |
rapid that it is now possible for American scholars to use it to | |
communicate with colleagues in close to 140 other countries. | |
Other new exchange formats created by scholars and operating on Internet | |
include more than 700 conferences, with about 80 percent of these devoted | |
to topics in the social sciences and humanities. The rate of growth of | |
these scholarly electronic conferences also is astonishing. From 1990 to | |
1991, 200 new conferences were identified on Internet. From October 1991 | |
to June 1992, an additional 150 conferences in the social sciences and | |
humanities were added to this directory of listings. Scholars have | |
established conferences in virtually every field, within every different | |
discipline. For example, there are currently close to 600 active social | |
science and humanities conferences on topics such as art and | |
architecture, ethnomusicology, folklore, Japanese culture, medical | |
education, and gifted and talented education. The appeal to scholars of | |
communicating through these conferences is that, unlike any other medium, | |
electronic conferences today provide a forum for global communication | |
with peers at the front end of the research process. | |
Interpretation and analysis of sources constitutes the third process of | |
scholarly communication that MICHELSON discussed in terms of texts and | |
textual resources. The methods used to analyze sources fall somewhere on | |
a continuum from quantitative analysis to qualitative analysis. | |
Typically, evidence is culled and evaluated using methods drawn from both | |
ends of this continuum. At one end, quantitative analysis involves the | |
use of mathematical processes such as a count of frequencies and | |
distributions of occurrences or, on a higher level, regression analysis. | |
At the other end of the continuum, qualitative analysis typically | |
involves nonmathematical processes oriented toward language | |
interpretation or the building of theory. Aspects of this work involve | |
the processing--either manual or computational--of large and sometimes | |
massive amounts of textual sources, although the use of nontextual | |
sources as evidence, such as photographs, sound recordings, film footage, | |
and artifacts, is significant as well. | |
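The quantitative end of this continuum can be illustrated with a minimal sketch of a frequency count and a distribution of occurrences across texts. It is an illustration only, not a method described by MICHELSON: the function names and the two sample passages are hypothetical, and the tokenizer assumes plain unaccented English text.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in a single text."""
    return Counter(tokenize(text))


def distribution(word, texts):
    """Report the occurrences of one word in each named text."""
    target = word.lower()
    return {name: word_frequencies(body)[target] for name, body in texts.items()}


# Hypothetical sample passages, invented for this illustration.
sample_texts = {
    "passage_a": "Arms and the man I sing. The man fled by fate.",
    "passage_b": "Sing, goddess, the anger of the man Achilles.",
}

if __name__ == "__main__":
    print(word_frequencies(sample_texts["passage_a"]).most_common(3))
    print(distribution("man", sample_texts))
```

A scholar working with real sources would substitute converted machine-readable texts for the sample passages and would likely need a tokenizer suited to the language and encoding of those texts.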
Scholars have discovered that many of the methods of interpretation and | |
analysis that are related to both quantitative and qualitative methods | |
are processes that can be performed by computers. For example, computers | |
can count. They can count brush strokes used in a Rembrandt painting or | |
perform regression analysis for understanding cause and effect. By means | |
of advanced technologies, computers can recognize patterns, analyze text, | |
and model concepts. Furthermore, computers can complete these processes | |
faster with more sources and with greater precision than scholars who | |
must rely on manual interpretation of data. But if scholars are to use | |
computers for these processes, source materials must be in a form | |
amenable to computer-assisted analysis. For this reason many scholars, | |
once they have identified the sources that are key to their research, are | |
converting them to machine-readable form. Thus, a representative example | |
of the numerous textual conversion projects organized by scholars around | |
the world in recent years to support computational text analysis is the | |
TLG, the Thesaurus Linguae Graecae. This project is devoted to | |
converting the extant ancient texts of classical Greece. (Editor's note: | |
according to the TLG Newsletter of May 1992, TLG was in use in thirty-two | |
different countries. This figure updates MICHELSON's previous count by one.) | |
The scholars performing these conversions have been asked to recognize | |
that the electronic sources they are converting for one use possess value | |
for other research purposes as well. As a result, during the past few | |
years, humanities scholars have initiated a number of projects to | |
increase scholarly access to converted text. So, for example, the Text | |
Encoding Initiative (TEI), about which more is said later in the program, | |
was established as an effort by scholars to determine standard elements | |
and methods for encoding machine-readable text for electronic exchange. | |
In a second effort to facilitate the sharing of converted text, scholars | |
have created a new institution, the Center for Electronic Texts in the | |
Humanities (CETH). The center estimates that there are 8,000 series of | |
source texts in the humanities that have been converted to | |
machine-readable form worldwide. CETH is undertaking an international | |
search for converted text in the humanities, compiling it into an | |
electronic library, and preparing bibliographic descriptions of the | |
sources for the Research Libraries Information Network's (RLIN) | |
machine-readable data file. The library profession has begun to initiate | |
large conversion projects as well, such as American Memory. | |
While scholars have been making converted text available to one another, | |
typically on disk or on CD-ROM, the clear trend is toward making these | |
resources available through research and education networks. Thus, the | |
American and French Research on the Treasury of the French Language | |
(ARTFL) and the Dante Project are already available on Internet. | |
MICHELSON summarized this section on interpretation and analysis by | |
noting that: 1) increasing numbers of humanities scholars in the library | |
community are recognizing the importance to the advancement of | |
scholarship of retrospective conversion of source materials in the arts | |
and humanities; and 2) there is a growing realization that making the | |
sources available on research and education networks maximizes their | |
usefulness for the analysis performed by humanities scholars. | |
The fourth process of scholarly communication is dissemination of | |
research findings, that is, publication. Scholars are using existing | |
research and education networks to engineer a new type of publication: | |
scholarly-controlled journals that are electronically produced and | |
disseminated. Although such journals are still emerging as a | |
communication format, their number has grown, from approximately twelve | |
to thirty-six during the past year (July 1991 to June 1992). Most of | |
these electronic scholarly journals are devoted to topics in the | |
humanities. As with network conferences, scholarly enthusiasm for these | |
electronic journals stems from the medium's unique ability to advance | |
scholarship in a way that no other medium can do by supporting global | |
feedback and interchange, practically in real time, early in the research | |
process. Beyond scholarly journals, MICHELSON remarked on the delivery of | |
commercial full-text products, such as articles in professional journals, | |
newsletters, magazines, wire services, and reference sources. These are | |
being delivered via on-line local library catalogues, especially through | |
CD-ROMs. Furthermore, according to MICHELSON, there is general optimism | |
that the copyright and fees issues impeding the delivery of full text on | |
existing research and education networks soon will be resolved. | |
The final process of scholarly communication is curriculum development | |
and instruction, and this involves the use of computer information | |
technologies in two areas. The first is the development of | |
computer-oriented instructional tools, which includes simulations, | |
multimedia applications, and computer tools that are used to assist in | |
the analysis of sources in the classroom, etc. The Perseus Project, a | |
database that provides a multimedia curriculum on classical Greek | |
civilization, is a good example of the way in which entire curricula are | |
being recast using information technologies. It is anticipated that the | |
current difficulty in exchanging computer-based instructional software | |
electronically, which in turn makes it difficult for one scholar | |
to build upon the work of others, will be resolved before too long. | |
Stand-alone curricular applications that involve electronic text will be | |
sharable through networks, reinforcing their significance as intellectual | |
products as well as instructional tools. | |
The second aspect of electronic learning involves the use of research and | |
education networks for distance education programs. Such programs | |
interactively link teachers with students in geographically scattered | |
locations and rely on the availability of electronic instructional | |
resources. Distance education programs are gaining wide appeal among | |
state departments of education because of their demonstrated capacity to | |
bring advanced specialized course work and an array of experts to many | |
classrooms. A recent report found that at least 32 states operated at | |
least one statewide network for education in 1991, with networks under | |
development in many of the remaining states. | |
MICHELSON summarized this section by noting two striking changes taking | |
place in scholarly communication among humanities scholars. First is the | |
extent to which electronic text in particular, and electronic resources | |
in general, are being infused into each of the five processes described | |
above. As mentioned earlier, there is a certain synergy at work here. | |
The use of electronic resources for one process tends to stimulate their | |
use for other processes, because the chief course of movement is toward a | |
comprehensive on-line working context for humanities scholars that | |
includes on-line availability of key bibliographies, scholarly feedback, | |
sources, analytical tools, and publications. MICHELSON noted further | |
that the movement toward a comprehensive on-line working context for | |
humanities scholars is not new. In fact, it has been underway for more | |
than forty years in the humanities, since Father Roberto Busa began | |
developing an electronic concordance of the works of Saint Thomas Aquinas | |
in 1949. What we are witnessing today, MICHELSON contended, is not the | |
beginning of this on-line transition but, for at least some humanities | |
scholars, the turning point in the transition from a print to an | |
electronic working context. Coinciding with the on-line transition, the | |
second striking change is the extent to which research and education | |
networks are becoming the new medium of scholarly communication. The | |
existing Internet and the pending National Research and Education Network | |
(NREN) represent the new meeting ground where scholars are going for | |
bibliographic information, scholarly dialogue and feedback, the most | |
current publications in their field, and high-level educational | |
offerings. Traditional scholarly practices are undergoing tremendous | |
transformations as a result of the emergence and growing prominence of | |
what is called network-mediated scholarship. | |
MICHELSON next turned to the second element of the framework she proposed | |
at the outset of her talk for evaluating the prospects for electronic | |
text, namely the key information technology trends affecting the conduct | |
of scholarly communication over the next decade: 1) end-user computing | |
and 2) connectivity. | |
End-user computing means that the person touching the keyboard, or | |
performing computations, is the same as the person who initiates or | |
consumes the computation. The emergence of personal computers, along | |
with a host of other forces, such as ubiquitous computing, advances in | |
interface design, and the on-line transition, is prompting the consumers | |
of computation to do their own computing, and is thus rendering obsolete | |
the traditional distinction between end users and ultimate users. | |
The trend toward end-user computing is significant to consideration of | |
the prospects for electronic texts because it means that researchers are | |
becoming more adept at doing their own computations and, thus, more | |
competent in the use of electronic media. By avoiding programmer | |
intermediaries, computation is becoming central to the researcher's | |
thought process. This direct involvement in computing is changing the | |
researcher's perspective on the nature of research itself, that is, the | |
kinds of questions that can be posed, the analytical methodologies that | |
can be used, the types and amount of sources that are appropriate for | |
analyses, and the form in which findings are presented. The trend toward | |
end-user computing means that, increasingly, electronic media and | |
computation are being infused into all processes of humanities | |
scholarship, inspiring remarkable transformations in scholarly | |
communication. | |
The trend toward greater connectivity suggests that researchers are using | |
computation increasingly in network environments. Connectivity is | |
important to scholarship because it erases the distance that separates | |
students from teachers and scholars from their colleagues, while allowing | |
users to access remote databases, share information in many different | |
media, connect to their working context wherever they are, and | |
collaborate in all phases of research. | |
The combination of the trend toward end-user computing and the trend | |
toward connectivity suggests that the scholarly use of electronic | |
resources, already evident among some researchers, will soon become an | |
established feature of scholarship. The effects of these trends, along | |
with ongoing changes in scholarly practices, point to a future in which | |
humanities researchers will use computation and electronic communication | |
to help them formulate ideas, access sources, perform research, | |
collaborate with colleagues, seek peer review, publish and disseminate | |
results, and engage in many other professional and educational activities. | |
In summary, MICHELSON emphasized four points: 1) A portion of humanities | |
scholars already consider electronic texts the preferred format for | |
analysis and dissemination. 2) Scholars are using these electronic | |
texts, in conjunction with other electronic resources, in all the | |
processes of scholarly communication. 3) The humanities scholars' | |
working context is in the process of changing from print technology to | |
electronic technology, in many ways mirroring transformations that have | |
occurred or are occurring within the scientific community. 4) These | |
changes are occurring in conjunction with the development of a new | |
communication medium: research and education networks that are | |
characterized by their capacity to advance scholarship in a wholly unique | |
way. | |
MICHELSON also reiterated her three principal arguments: 1) Electronic | |
texts are best understood in terms of the relationship to other | |
electronic resources and the growing prominence of network-mediated | |
scholarship. 2) The prospects for electronic texts lie in their capacity | |
to be integrated into the on-line network of electronic resources that | |
comprise the new working context for scholars. 3) Retrospective conversion | |
of portions of the scholarly record should be a key strategy as information | |
providers respond to changes in scholarly communication practices. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
VECCIA * AM's evaluation project and public users of electronic resources | |
* AM and its design * Site selection and evaluating the Macintosh | |
implementation of AM * Characteristics of the six public libraries | |
selected * Characteristics of AM's users in these libraries * Principal | |
ways AM is being used * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Susan VECCIA, team leader, and Joanne FREEMAN, associate coordinator, | |
American Memory, Library of Congress, gave a joint presentation. First, | |
by way of introduction, VECCIA explained her and FREEMAN's roles in | |
American Memory (AM). Serving principally as an observer, VECCIA has | |
assisted with the evaluation project of AM, placing AM collections in a | |
variety of different sites around the country and helping to organize and | |
implement that project. FREEMAN has been an associate coordinator of AM | |
and has been involved principally with the interpretative materials, | |
preparing some of the electronic exhibits and printed historical | |
information that accompanies AM and that is requested by users. VECCIA | |
and FREEMAN shared anecdotal observations concerning AM with public users | |
of electronic resources. Notwithstanding a fairly structured evaluation | |
in progress, both VECCIA and FREEMAN chose not to report on specifics in | |
terms of numbers, etc., because they felt it was too early in the | |
evaluation project to do so. | |
AM is an electronic archive of primary source materials from the Library | |
of Congress, selected collections representing a variety of formats-- | |
photographs, graphic arts, recorded sound, motion pictures, broadsides, | |
and soon, pamphlets and books. In terms of the design of this system, | |
the interpretative exhibits have been kept separate from the primary | |
resources, with good reason. Accompanying this collection are printed | |
documentation and user guides, as well as guides that FREEMAN prepared for | |
teachers so that they may begin using the content of the system at once. | |
VECCIA described the evaluation project before talking about the public | |
users of AM, limiting her remarks to public libraries, because FREEMAN | |
would talk more specifically about schools from kindergarten to twelfth | |
grade (K-12). Having started in spring 1991, the evaluation currently | |
involves testing of the Macintosh implementation of AM. Since the | |
primary goal of this evaluation is to determine the most appropriate | |
audience or audiences for AM, very different sites were selected. This | |
makes evaluation difficult because of the varying degrees of technology | |
literacy among the sites. AM is situated in forty-four locations, of | |
which six are public libraries and sixteen are schools. Represented | |
among the schools are elementary, junior high, and high schools. | |
District offices also are involved in the evaluation, which will | |
conclude in summer 1993. | |
VECCIA focused the remainder of her talk on the six public libraries, one | |
of which doubles as a state library. They represent a range of | |
geographic areas and a range of demographic characteristics. For | |
example, three are located in urban settings, two in rural settings, and | |
one in a suburban setting. A range of technical expertise is to be found | |
among these facilities as well. For example, one is an "Apple library of | |
the future," while two others are rural one-room libraries--in one, AM | |
sits at the front desk next to a tractor manual. | |
All public libraries have been extremely enthusiastic, supportive, and | |
appreciative of the work that AM has been doing. VECCIA characterized | |
various users: Most users in public libraries describe themselves as | |
general readers; of the students who use AM in the public libraries, | |
those in fourth grade and above seem most interested. Public libraries | |
in rural sites tend to attract retired people, who have been highly | |
receptive to AM. Users tend to fall into two additional categories: | |
people interested in the content and historical connotations of these | |
primary resources, and those fascinated by the technology. The format | |
receiving the most comments has been motion pictures. The adult users in | |
public libraries are more comfortable with IBM computers, whereas young | |
people seem comfortable with either IBM or Macintosh, although most of | |
them seem to come from a Macintosh background. This same tendency is | |
found in the schools. | |
What kinds of things do users do with AM? In a public library there are | |
two main ways in which AM is being used: as an individual learning | |
tool, and as a leisure activity. Adult learning was one area that VECCIA | |
would highlight as a possible application for a tool such as AM. She | |
described a patron of a rural public library who comes in every day on | |
his lunch hour and literally reads AM, methodically going through the | |
collection image by image. At the end of his hour he makes an electronic | |
bookmark, puts it in his pocket, and returns to work. The next day he | |
comes in and resumes where he left off. Interestingly, this man had | |
never been in the library before he used AM. In another small, rural | |
library, the coordinator reports that AM is a popular activity for some | |
of the older, retired people in the community, who ordinarily would not | |
use "those things"--computers. Another example of adult learning in | |
public libraries is book groups, one of which, in particular, is using AM | |
as part of its reading on industrialization, integration, and urbanization | |
in the early 1900s. | |
One library reports that a family is using AM to help educate their | |
children. In another instance, individuals from a local museum came in | |
to use AM to prepare an exhibit on toys of the past. These two examples | |
emphasize the mission of the public library as a cultural institution, | |
reaching out to people who do not have the same resources available as | |
those who live in a metropolitan area or have access to a major library. | |
One rural library reports that junior high school students in large | |
numbers came in one afternoon to use AM for entertainment. A number of | |
public libraries reported great interest among postcard collectors in the | |
Detroit collection, which was essentially a collection of images used on | |
postcards around the turn of the century. Train buffs are similarly | |
interested because that was a time of great interest in railroading. | |
People, it was found, relate to things that they know of firsthand. For | |
example, in both rural public libraries where AM was made available, | |
observers reported that the older people with personal remembrances of | |
the turn of the century were gravitating to the Detroit collection. | |
These examples served to underscore MICHELSON's observation regarding the | |
integration of electronic tools and ideas--that people learn best when | |
the material relates to something they know. | |
VECCIA made the final point that in many cases AM serves as a | |
public-relations tool for the public libraries that are testing it. In | |
one case, AM is being used as a vehicle to secure additional funding for | |
the library. In another case, AM has served as an inspiration to the | |
staff of a major local public library in the South to think about ways to | |
make its own collection of photographs more accessible to the public. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
FREEMAN * AM and archival electronic resources in a school environment * | |
Questions concerning context * Questions concerning the electronic format | |
itself * Computer anxiety * Access and availability of the system * | |
Hardware * Strengths gained through the use of archival resources in | |
schools * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Reiterating an observation made by VECCIA, that AM is an archival | |
resource made up of primary materials with very little interpretation, | |
FREEMAN stated that the project has attempted to bridge the gap between | |
these bare primary materials and a school environment, and in that cause | |
has created guided introductions to AM collections. Loud demand from the | |
educational community, chiefly from teachers working with the upper | |
grades of elementary school through high school, greeted the announcement | |
that AM would be tested around the country. | |
FREEMAN reported not only on what was learned about AM in a school | |
environment, but also on several universal questions that were raised | |
concerning archival electronic resources in schools. She discussed | |
several strengths of this type of material in a school environment as | |
opposed to a highly structured resource that offers a limited number of | |
paths to follow. | |
FREEMAN first raised several questions about using AM in a school | |
environment. There is often some difficulty in developing a sense of | |
what the system contains. Many students sit down at a computer resource | |
and assume that, because AM comes from the Library of Congress, all of | |
American history is now at their fingertips. As a result of that sort of | |
mistaken judgment, some students are known to conclude that AM contains | |
nothing of use to them when they look for one or two things and do not | |
find them. It is difficult to discover that middle ground where one has | |
a sense of what the system contains. Some students grope toward the idea | |
of an archive, a new idea to them, since they have not previously | |
experienced what it means to have access to a vast body of somewhat | |
random information. | |
Other questions raised by FREEMAN concerned the electronic format itself. | |
For instance, in a school environment it is often difficult both for | |
teachers and students to gain a sense of what it is they are viewing. | |
They understand that it is a visual image, but they do not necessarily | |
know that it is a postcard from the turn of the century, a panoramic | |
photograph, or even machine-readable text of an eighteenth-century | |
broadside, a twentieth-century printed book, or a nineteenth-century | |
diary. That distinction is often difficult for people in a school | |
environment to grasp. Because of that, it occasionally becomes difficult | |
to draw conclusions from what one is viewing. | |
FREEMAN also noted the obvious fear of the computer, which constitutes a | |
difficulty in using an electronic resource. Though students in general | |
did not suffer from this anxiety, several older students feared that they | |
were computer-illiterate, an assumption that became self-fulfilling when | |
they searched for something but failed to find it. FREEMAN said she | |
believed that some teachers also fear computer resources, because they | |
believe they lack complete control. FREEMAN related the example of | |
teachers shooing away students because it was not their time to use the | |
system. This was a case in which the situation had to be extremely | |
structured so that the teachers would not feel that they had lost their | |
grasp on what the system contained. | |
A final question raised by FREEMAN concerned access and availability of | |
the system. She noted the occasional existence of a gap in communication | |
between school librarians and teachers. Often AM sits in a school | |
library and the librarian is the person responsible for monitoring the | |
system. Teachers do not always take into their world new library | |
resources about which the librarian is excited. Indeed, at the sites | |
where AM had been used most effectively within a library, the librarian | |
was required to go to specific teachers and instruct them in its use. As | |
a result, several AM sites will have in-service sessions over a summer, | |
in the hope that perhaps, with a more individualized link, teachers will | |
be more likely to use the resource. | |
A related issue in the school context concerned the number of | |
workstations available at any one location. Centralization of equipment | |
at the district level, with teachers invited to download things and walk | |
away with them, proved unsuccessful because the hours these offices were | |
open were also school hours. | |
Another issue was hardware. As VECCIA observed, a range of sites exists, | |
some technologically advanced and others essentially acquiring their | |
first computer for the primary purpose of using it in conjunction with | |
AM's testing. Users at technologically sophisticated sites want even | |
more sophisticated hardware, so that they can perform even more | |
sophisticated tasks with the materials in AM. But once they acquire a | |
newer piece of hardware, they must learn how to use that also; at an | |
unsophisticated site it takes an extremely long time simply to become | |
accustomed to the computer, not to mention the program offered with the | |
computer. All of these small issues raise one large question, namely, | |
are systems like AM truly rewarding in a school environment, or do they | |
simply act as innovative toys that do little more than spark interest? | |
FREEMAN contended that the evaluation project has revealed several strengths | |
that were gained through the use of archival resources in schools, including: | |
* Psychic rewards from using AM as a vast, rich database, with | |
teachers assigning various projects to students--oral presentations, | |
written reports, a documentary, a turn-of-the-century newspaper-- | |
projects that start with the materials in AM but are completed using | |
other resources; AM thus is used as a research tool in conjunction | |
with other electronic resources, as well as with books and items in | |
the library where the system is set up. | |
* Students are acquiring computer literacy in a humanities context. | |
* This sort of system is overcoming the isolation between disciplines | |
that often exists in schools. For example, many English teachers are | |
requiring their students to write papers on historical topics | |
represented in AM. Numerous teachers have reported that their | |
students are learning critical thinking skills using the system. | |
* On a broader level, AM is introducing primary materials, not only | |
to students but also to teachers, in an environment where often | |
simply none exist--an exciting thing for the students because it | |
helps them learn to conduct research, to interpret, and to draw | |
their own conclusions. In learning to conduct research and what it | |
means, students are motivated to seek knowledge. That relates to | |
another positive outcome--a high level of personal involvement of | |
students with the materials in this system and greater motivation to | |
conduct their own research and draw their own conclusions. | |
* Perhaps the most ironic strength of these kinds of archival | |
electronic resources is that many of the teachers AM interviewed | |
were desperate, it is no exaggeration to say, not only for primary | |
materials but for unstructured primary materials. These would, they | |
thought, foster personally motivated research, exploration, and | |
excitement in their students. Indeed, these materials have done | |
just that. Ironically, however, this lack of structure produces | |
some of the confusion to which the newness of these kinds of | |
resources may also contribute. The key to effective use of archival | |
products in a school environment is a clear, effective introduction | |
to the system and to what it contains. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Nothing known, quantitatively, about the number of | |
humanities scholars who must see the original versus those who would | |
settle for an edited transcript, or about the ways in which humanities | |
scholars are using information technology * Firm conclusions concerning | |
the manner and extent of the use of supporting materials in print | |
provided by AM to await completion of evaluative study * A listener's | |
reflections on additional applications of electronic texts * Role of | |
electronic resources in teaching elementary research skills to students * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the discussion that followed the presentations by MICHELSON, | |
VECCIA, and FREEMAN, additional points emerged. | |
LESK asked if MICHELSON could give any quantitative estimate of the | |
number of humanities scholars who must see or want to see the original, | |
or the best possible version of the material, versus those who typically | |
would settle for an edited transcript. While unable to provide a figure, | |
she offered her impressions as an archivist who has done some reference | |
work and has discussed this issue with other archivists who perform | |
reference, that those who use archives and those who use primary sources | |
for what would be considered very high-level scholarly research, as | |
opposed to, say, undergraduate papers, were few in number, especially | |
given the public interest in using primary sources to conduct | |
genealogical or avocational research and the kind of professional | |
research done by people in private industry or the federal government. | |
More important in MICHELSON's view was that, quantitatively, nothing is | |
known about the ways in which, for example, humanities scholars are using | |
information technology. No studies exist to offer guidance in creating | |
strategies. The most recent study was conducted in 1985 by the American | |
Council of Learned Societies (ACLS), and what it showed was that 50 | |
percent of humanities scholars at that time were using computers. That | |
constitutes the extent of our knowledge. | |
Concerning AM's strategy for orienting people toward the scope of | |
electronic resources, FREEMAN could offer no hard conclusions at this | |
point, because she and her colleagues were still waiting to see, | |
particularly in the schools, what has been made of their efforts. Within | |
the system, however, AM has provided what are called electronic exhibits-- | |
such as introductions to time periods and materials--and these are | |
intended to offer a student user a sense of what a broadside is and what | |
it might tell her or him. But FREEMAN conceded that the project staff | |
would have to talk with students next year, after teachers have had a | |
summer to use the materials, and attempt to discover what the students | |
were learning from the materials. In addition, FREEMAN described | |
supporting materials in print provided by AM at the request of local | |
teachers during a meeting held at LC. These included time lines, | |
bibliographies, and other materials that could be reproduced on a | |
photocopier in a classroom. Teachers could walk away with and use these, | |
and in this way gain a better understanding of the contents. But again, | |
reaching firm conclusions concerning the manner and extent of their use | |
would have to wait until next year. | |
As to the changes she saw occurring at the National Archives and Records | |
Administration (NARA) as a result of the increasing emphasis on | |
technology in scholarly research, MICHELSON stated that NARA at this | |
point was absorbing the report by her and Jeff Rothenberg addressing | |
strategies for the archival profession in general, although not for the | |
National Archives specifically. NARA is just beginning to establish its | |
role and what it can do. In terms of changes and initiatives that NARA | |
can take, no clear response could be given at this time. | |
GREENFIELD remarked on two trends mentioned in the session. Reflecting on | |
DALY's opening comments on how he could have used a Latin collection of | |
text in an electronic form, he said that at first he thought most scholars | |
would be unwilling to do that. But as he thought of that in terms of the | |
original meaning of research--that is, having already mastered these texts, | |
researching them for critical and comparative purposes--for the first time, | |
the electronic format made a lot of sense. GREENFIELD could envision | |
growing numbers of scholars learning the new technologies for that very | |
aspect of their scholarship and for convenience's sake. | |
Listening to VECCIA and FREEMAN, GREENFIELD thought of an additional | |
application of electronic texts. He realized that AM could be used as a | |
guide to lead someone to original sources. Students cannot be expected | |
to have mastered these sources, things they have never known about | |
before. Thus, AM is leading them, in theory, to a vast body of | |
information and giving them a superficial overview of it, enabling them | |
to select parts of it. GREENFIELD asked if any evidence exists that this | |
resource will indeed teach the new user, the K-12 students, how to do | |
research. Scholars already know how to do research and are applying | |
these new tools. But he wondered why students would go beyond picking | |
out things that were most exciting to them. | |
FREEMAN conceded the correctness of GREENFIELD's observation as applied | |
to a school environment. The risk is that a student would sit down at a | |
system, play with it, find some things of interest, and then walk away. | |
But in the relatively controlled situation of a school library, much will | |
depend on the instructions a teacher or a librarian gives a student. She | |
viewed the situation not as one of fine-tuning research skills but of | |
involving students at a personal level in understanding and researching | |
things. Given the guidance one can receive at school, it then becomes | |
possible to teach elementary research skills to students, which in fact | |
one particular librarian said she was teaching her fifth graders. | |
FREEMAN concluded that introducing the idea of following one's own path | |
of inquiry, which is essentially what research entails, involves more | |
than teaching specific skills. To these comments VECCIA added the | |
observation that the individual teacher and the use of a creative | |
resource, rather than AM itself, seemed to make the key difference. | |
Some schools and some teachers are making excellent use of the nature | |
of critical thinking and teaching skills, she said. | |
Concurring with these remarks, DALY closed the session with the thought that | |
the more that producers produced for teachers and for scholars to use with | |
their students, the more successful their electronic products would prove. | |
****** | |
SESSION II. SHOW AND TELL | |
Jacqueline HESS, director, National Demonstration Laboratory, served as | |
moderator of the "show-and-tell" session. She noted that a | |
question-and-answer period would follow each presentation. | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
MYLONAS * Overview and content of Perseus * Perseus' primary materials | |
exist in a system-independent, archival form * A concession * Textual | |
aspects of Perseus * Tools to use with the Greek text * Prepared indices | |
and full-text searches in Perseus * English-Greek word search leads to | |
close study of words and concepts * Navigating Perseus by tracing down | |
indices * Using the iconography to perform research * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Elli MYLONAS, managing editor, Perseus Project, Harvard University, first | |
gave an overview of Perseus, a large, collaborative effort based at | |
Harvard University but with contributors and collaborators located at | |
numerous universities and colleges in the United States (e.g., Bowdoin, | |
Maryland, Pomona, Chicago, Virginia). Funded primarily by the | |
Annenberg/CPB Project, with additional funding from Apple, Harvard, and | |
the Packard Humanities Institute, among others, Perseus is a multimedia, | |
hypertextual database for teaching and research on classical Greek | |
civilization, which was released in February 1992 in version 1.0 and | |
distributed by Yale University Press. | |
Consisting entirely of primary materials, Perseus includes ancient Greek | |
texts and translations of those texts; catalog entries--that is, museum | |
catalog entries, not library catalog entries--on vases, sites, coins, | |
sculpture, and archaeological objects; maps; and a dictionary, among | |
other sources. The objects for which catalog entries exist are | |
accompanied by thousands of color images, which | |
constitute a major feature of the database. Perseus contains | |
approximately 30 megabytes of text, an amount that will double in | |
subsequent versions. In addition to these primary materials, the Perseus | |
Project has been building tools for using them, making access and | |
navigation easier, the goal being to build part of the electronic | |
environment discussed earlier in the morning in which students or | |
scholars can work with their sources. | |
The demonstration of Perseus will show only a fraction of the real work | |
that has gone into it, because the project had to face the dilemma of | |
what to enter when putting something into machine-readable form: should | |
one aim for very high quality or make concessions in order to get the | |
material in? Since Perseus decided to opt for very high quality, all of | |
its primary materials exist in a system-independent--insofar as it is | |
possible to be system-independent--archival form. Deciding what that | |
archival form would be and attaining it required much work and thought. | |
For example, all the texts are marked up in SGML, which will be made | |
compatible with the guidelines of the Text Encoding Initiative (TEI) when | |
they are issued. | |
Drawings are PostScript files, not meeting international standards, but | |
at least designed to go across platforms. Images, or rather the real | |
archival forms, consist of the best available slides, which are being | |
digitized. Much of the catalog material exists in database form--a form | |
that the average user could use, manipulate, and display on a personal | |
computer, but only at great cost. Thus, this is where the concession | |
comes in: All of this rich, well-marked-up information is stripped of | |
much of its content; the images are converted into bit-maps and the text | |
into small formatted chunks. All this information can then be imported | |
into HyperCard and run on a mid-range Macintosh, which is what Perseus | |
users have. This fact has made it possible for Perseus to attain wide | |
use fairly rapidly. Without those archival forms the HyperCard version | |
being demonstrated could not be made easily, and the project could not | |
have the potential to move to other forms and machines and software as | |
they appear, none of which information is in Perseus on the CD. | |
Of the numerous multimedia aspects of Perseus, MYLONAS focused on the | |
textual. Part of what makes Perseus such a pleasure to use, MYLONAS | |
said, is this effort at seamless integration and the ability to move | |
around both visual and textual material. Perseus also made the decision | |
not to attempt to interpret its material any more than one interprets by | |
selecting. But, MYLONAS emphasized, Perseus is not courseware: No | |
syllabus exists. There is no effort to define how one teaches a topic | |
using Perseus, although the project may eventually collect papers by | |
people who have used it to teach. Rather, Perseus aims to provide | |
primary material in a kind of electronic library, an electronic sandbox, | |
so to say, in which students and scholars who are working on this | |
material can explore by themselves. With that, MYLONAS demonstrated | |
Perseus, beginning with the Perseus gateway, the first thing one sees | |
upon opening Perseus--an effort in part to solve the contextualizing | |
problem--which tells the user what the system contains. | |
MYLONAS demonstrated only a very small portion, beginning with primary | |
texts and running off the CD-ROM. Having selected Aeschylus' Prometheus | |
Bound, which was viewable in Greek and English pretty much in the same | |
segments together, MYLONAS demonstrated tools to use with the Greek text, | |
something not possible with a book: looking up the dictionary entry form | |
of an unfamiliar word in Greek after subjecting it to Perseus' | |
morphological analysis for all the texts. After finding out about a | |
word, a user may then decide to see if it is used anywhere else in Greek. | |
Because vast amounts of indexing support all of the primary material, one | |
can find out where else all forms of a particular Greek word appear-- | |
often not a trivial matter because Greek is highly inflected. Further, | |
since the story of Prometheus has to do with the origins of sacrifice, a | |
user may wish to study and explore sacrifice in Greek literature; by | |
typing sacrifice into a small window, a user goes to the English-Greek | |
word list--something one cannot do without the computer (Perseus has | |
indexed the definitions of its dictionary)--where the string sacrifice | |
appears in the definitions of sixty-five words. One may then find out | |
where any of those words is used in the work(s) of a particular author. | |
The English definitions are not lemmatized. | |
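The two-step lookup described above can be sketched in a few lines of Python. This is only an illustration, not the Perseus implementation: the transliterated word forms, the glosses, the work titles, and every function name are hypothetical. The sketch matches the raw English string against the dictionary definitions (which, as noted, are not lemmatized), then uses a lemma index to find every inflected form of the matching Greek words.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: dictionary form (lemma) -> English definition.
# The transliterated entries are illustrative only, not taken from Perseus.
LEXICON = {
    "thuo": "to offer by burning, to sacrifice",
    "thusia": "an offering, a sacrifice",
    "pur": "fire",
}

# Hypothetical result of morphological analysis: inflected form -> lemma.
FORM_TO_LEMMA = {
    "thuo": "thuo",
    "thuei": "thuo",
    "ethusen": "thuo",
    "thusia": "thusia",
    "thusias": "thusia",
    "pur": "pur",
    "puros": "pur",
}

# Hypothetical tokenized texts: (work, line number, inflected form).
TOKENS = [
    ("Work A", 1, "ethusen"),
    ("Work A", 2, "puros"),
    ("Work B", 7, "thusias"),
    ("Work B", 9, "thuei"),
]


def build_lemma_index(tokens, form_to_lemma):
    """Map each lemma to every place any of its inflected forms occurs."""
    index = defaultdict(list)
    for work, line, form in tokens:
        lemma = form_to_lemma.get(form)
        if lemma is not None:
            index[lemma].append((work, line, form))
    return dict(index)


def english_to_greek(term, lexicon):
    """Return lemmas whose English definition contains the raw string `term`."""
    needle = term.lower()
    return sorted(
        lemma for lemma, gloss in lexicon.items() if needle in gloss.lower()
    )


def find_concept(term, lexicon, lemma_index, work=None):
    """Chain the two lookups: English string -> lemmas -> occurrences."""
    hits = {}
    for lemma in english_to_greek(term, lexicon):
        places = lemma_index.get(lemma, [])
        if work is not None:
            # Optionally restrict the results to a single work.
            places = [p for p in places if p[0] == work]
        hits[lemma] = places
    return hits


LEMMA_INDEX = build_lemma_index(TOKENS, FORM_TO_LEMMA)
```

Calling find_concept("sacrifice", LEXICON, LEMMA_INDEX, work="Work B") would return the occurrences of both matching lemmas within that one hypothetical work, whatever inflected form each occurrence takes.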
All of the indices driving this kind of usage were originally devised for | |
speed, MYLONAS observed; in other words, all that kind of information-- | |
all forms of all words, where they exist, the dictionary form they belong | |
to--was collected into databases to expedite searching. Then | |
it was discovered that one can do things searching in these databases | |
that could not be done searching in the full texts. Thus, although there | |
are full-text searches in Perseus, much of the work is done behind the | |
scenes, using prepared indices. Re the indexing that is done behind the | |
scenes, MYLONAS pointed out that without the SGML forms of the text, it | |
could not be done effectively. Much of this indexing is based on the | |
structures that are made explicit by the SGML tagging. | |
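A minimal Python sketch may help show why explicit markup makes this kind of prepared index possible. It is an assumption-laden illustration only: the tag names (work, speech, line), the attributes, and the sample passage are hypothetical and are not claimed to match the Perseus markup. The index is built once from the tagged structure, and later queries consult the index rather than rescanning the full text.

```python
import re
from collections import defaultdict

# Hypothetical tagged passage; the tag and attribute names are illustrative
# and are not claimed to match the markup used by Perseus.
SAMPLE = """
<work title="Sample Play">
<speech who="A"><line n="1">fire was given to mortals</line>
<line n="2">and the gods were angry</line></speech>
<speech who="B"><line n="3">the gods demand sacrifice</line></speech>
</work>
"""

SPEECH_RE = re.compile(r'<speech who="([^"]+)">(.*?)</speech>', re.DOTALL)
LINE_RE = re.compile(r'<line n="(\d+)">(.*?)</line>', re.DOTALL)


def build_index(marked_up_text):
    """Map each word to the (speaker, line number) units in which it occurs."""
    index = defaultdict(set)
    for speaker, body in SPEECH_RE.findall(marked_up_text):
        for number, text in LINE_RE.findall(body):
            for word in re.findall(r"[a-z]+", text.lower()):
                index[word].add((speaker, int(number)))
    return index


def lookup(index, word, speaker=None):
    """Answer from the prepared index, optionally restricted to one speaker."""
    hits = index.get(word.lower(), set())
    if speaker is not None:
        hits = {h for h in hits if h[0] == speaker}
    return sorted(hits)


INDEX = build_index(SAMPLE)
```

Because each entry records the structural unit it came from, a query such as lookup(INDEX, "gods", speaker="B") can be restricted by speaker, something a plain string search over untagged text could not do directly.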
It was found that one of the things many of Perseus' non-Greek-reading | |
users do is start from the dictionary and then move into the close study | |
of words and concepts via this kind of English-Greek word search, by which | |
means they might select a concept. This exercise has been assigned to | |
students in core courses at Harvard--to study a concept by looking for the | |
English word in the dictionary, finding the Greek words, and then finding | |
the words in the Greek but, of course, reading across in the English. | |
That tells them a great deal about what a translation means as well. | |
Should one also wish to see images that have to do with sacrifice, that | |
person would go to the object key word search, which allows one to | |
perform a similar kind of index retrieval on the database of | |
archaeological objects. Without words, pictures are useless; Perseus has | |
not reached the point where it can do much with images that are not | |
cataloged. Thus, although it is possible in Perseus with text and images | |
to navigate by knowing where one wants to end up--for example, a | |
red-figure vase from the Boston Museum of Fine Arts--one can perform this | |
kind of navigation very easily by tracing down indices. MYLONAS | |
illustrated several generic scenes of sacrifice on vases. The features | |
demonstrated derived from Perseus 1.0; version 2.0 will implement even | |
better means of retrieval. | |
MYLONAS closed by looking at one of the pictures and noting again that | |
one can do a great deal of research using the iconography as well as the | |
texts. For instance, students in a core course at Harvard this year were | |
highly interested in Greek concepts of foreigners and representations of | |
non-Greeks. So they performed a great deal of research, both with texts | |
(e.g., Herodotus) and with iconography on vases and coins, on how the | |
Greeks portrayed non-Greeks. At the same time, art historians who study | |
iconography were also interested, and were able to use this material. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Indexing and searchability of all English words in Perseus * | |
Several features of Perseus 1.0 * Several levels of customization | |
possible * Perseus used for general education * Perseus' effects on | |
education * Contextual information in Perseus * Main challenge and | |
emphasis of Perseus * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Several points emerged in the discussion that followed MYLONAS's presentation. | |
Although MYLONAS had not demonstrated Perseus' ability to cross-search | |
documents, she confirmed that all English words in Perseus are indexed | |
and can be searched. So, for example, sacrifice could have been searched | |
in all texts, the historical essay, and all the catalog entries with | |
their descriptions--in short, in all of Perseus. | |
Boolean logic is not in Perseus 1.0 but will be added to the next | |
version, although an effort is being made not to restrict Perseus to a | |
database in which one just performs searching, Boolean or otherwise. It | |
is possible to move laterally through the documents by selecting a word | |
one is interested in and selecting an area of information one is | |
interested in and trying to look that word up in that area. | |
Since Perseus was developed in HyperCard, several levels of customization | |
are possible. Simple authoring tools exist that allow one to create | |
annotated paths through the information, which are useful for note-taking | |
and for guided tours for teaching purposes and for expository writing. | |
With a little more ingenuity it is possible to begin to add or substitute | |
material in Perseus. | |
Perseus has not been used so much for classics education as for general | |
education, where it seemed to have an impact on the students in the core | |
course at Harvard (a general required course that students must take in | |
certain areas). Students were able to use primary material much more. | |
The Perseus Project has an evaluation team at the University of Maryland | |
that has been documenting Perseus' effects on education. Perseus is very | |
popular, and anecdotal evidence indicates that it is having an effect at | |
places other than Harvard, for example, test sites at Ball State | |
University, Drury College, and numerous small places where opportunities | |
to use vast amounts of primary data may not exist. One documented effect | |
is that archaeological, anthropological, and philological research is | |
being done by the same person instead of by three different people. | |
The contextual information in Perseus includes an overview essay, a | |
fairly linear historical essay on the fifth century B.C. that provides | |
links into the primary material (e.g., Herodotus, Thucydides, and | |
Plutarch), via small gray underscoring (on the screen) of linked | |
passages. These are handmade links into other material. | |
To different extents, most of the production work was done at Harvard, | |
where the people and the equipment are located. Much of the | |
collaborative activity involved data collection and structuring, because | |
the main challenge and the emphasis of Perseus is the gathering of | |
primary material, that is, building a useful environment for studying | |
classical Greece, collecting data, and making it useful. | |
Systems-building is definitely not the main concern. Thus, much of the | |
work has involved writing essays, collecting information, rewriting it, | |
and tagging it. That can be done off site. The creative link for the | |
overview essay as well as for both systems and data was collaborative, | |
and was forged via E-mail and paper mail with professors at Pomona and | |
Bowdoin. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
CALALUCA * PLD's principal focus and contribution to scholarship * | |
Various questions preparatory to beginning the project * Basis for | |
project * Basic rule in converting PLD * Concerning the images in PLD * | |
Running PLD under a variety of retrieval software packages * Encoding the | |
database a hard-fought issue * Various features demonstrated * Importance | |
of user documentation * Limitations of the CD-ROM version * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Eric CALALUCA, vice president, Chadwyck-Healey, Inc., demonstrated a | |
software interpretation of the Patrologia Latina Database (PLD). PLD's | |
principal focus from the beginning of the project about three-and-a-half | |
years ago was on converting Migne's Latin series, and in the end, | |
CALALUCA suggested, conversion of the text will be the major contribution | |
to scholarship. CALALUCA stressed that, as possibly the only private | |
publishing organization at the Workshop, Chadwyck-Healey had sought no | |
federal funds or national foundation support before embarking upon the | |
project, but instead had relied upon a great deal of homework and | |
marketing to accomplish the task of conversion. | |
Ever since the possibilities of computer-searching emerged, scholars | |
in the field of late ancient and early medieval studies (philosophers, | |
theologians, classicists, and those studying the history of natural law | |
and the history of the legal development of Western civilization) have | |
been longing for a fully searchable version of Western literature, for | |
example, all the texts of Augustine and Bernard of Clairvaux and | |
Boethius, not to mention all the secondary and tertiary authors. | |
Various questions arose, CALALUCA said. Should one convert Migne? | |
Should the database be encoded? Is it necessary to do that? How should | |
it be delivered? What about CD-ROM? Since this is a transitional | |
medium, why even bother to create software to run on a CD-ROM? Since | |
everybody knows people will be networking information, why go to the | |
trouble--which is far greater with CD-ROM than with the production of | |
magnetic data? Finally, how does one make the data available? Can many | |
of the hurdles to using electronic information that some publishers have | |
imposed upon databases be eliminated? | |
The PLD project was based on the principle that computer-searching of | |
texts is most effective when it is done with a large database. Because | |
PLD represented a collection that serves so many disciplines across so | |
many periods, it was irresistible. | |
The basic rule in converting PLD was to do no harm, to avoid the sins of | |
intrusion in such a database: no introduction of newer editions, no | |
on-the-spot changes, no eradicating of all possible falsehoods from an | |
edition. Thus, PLD is not the final act in electronic publishing for | |
this discipline, but simply the beginning. The conversion of PLD has | |
evoked numerous unanticipated questions: How will information be used? | |
What about networking? Can the rights of a database be protected? | |
Should one protect the rights of a database? How can it be made | |
available? | |
Those converting PLD also tried to avoid the sins of omission, that is, | |
excluding portions of the collections or whole sections. What about the | |
images? PLD is full of images; some are extremely pious | |
nineteenth-century representations of the Fathers, while others contain | |
highly interesting elements. The goal was to cover all the text of Migne | |
(including notes, in Greek and in Hebrew, the latter of which, in | |
particular, causes problems in creating a search structure), all the | |
indices, and even the images, which are being scanned in separately | |
searchable files. | |
Several North American institutions that have placed acquisition requests | |
for the PLD database have requested it in magnetic form without software, | |
which means they are already running it without software, without | |
anything demonstrated at the Workshop. | |
What cannot practically be done is go back and reconvert and re-encode | |
data, a time-consuming and extremely costly enterprise. CALALUCA sees | |
PLD as a database that can, and should, be run under a variety of | |
retrieval software packages. This will permit the widest possible searches. | |
Consequently, the need to produce a CD-ROM of PLD, as well as to develop | |
software that could handle some 1.3 gigabytes of heavily encoded text, | |
developed out of conversations with collection development and reference | |
librarians who wanted software both compassionate enough for the | |
pedestrian but also capable of incorporating the most detailed | |
lexicographical studies that a user desires to conduct. In the end, the | |
encoding and conversion of the data will prove the most enduring | |
testament to the value of the project. | |
The encoding of the database was also a hard-fought issue: Did the | |
database need to be encoded? Were there normative structures for encoding | |
humanist texts? Should it be SGML? What about the TEI--will it last, | |
will it prove useful? CALALUCA expressed some minor doubts as to whether | |
a data bank can be fully TEI-conformant. Every effort can be made, but | |
in the end to be TEI-conformant means to accept the need to make some | |
firm encoding decisions that can, indeed, be disputed. The TEI points | |
the publisher in a proper direction but does not presume to make all the | |
decisions for him or her. Essentially, the goal of encoding was to | |
eliminate, as much as possible, the hindrances to information-networking, | |
so that if an institution acquires a database, everybody associated with | |
the institution can have access to it. | |
CALALUCA demonstrated a portion of Volume 160, because it had the most | |
anomalies in it. The software was created by Electronic Book | |
Technologies of Providence, RI, and is called Dynatext. The software | |
works only with SGML-coded data. | |
Viewing a table of contents on the screen, the audience saw how Dynatext | |
treats each element as a book and attempts to simplify movement through a | |
volume. Familiarity with the Patrologia in print (i.e., the text, its | |
source, and the editions) will make the machine-readable versions highly | |
useful. (Software with a Windows application was sought for PLD, | |
CALALUCA said, because this was the main trend for scholarly use.) | |
CALALUCA also demonstrated how a user can perform a variety of searches | |
and quickly move to any part of a volume; the look-up screen provides | |
some basic, simple word-searching. | |
CALALUCA argued that one of the major difficulties is not the software. | |
Rather, in creating a product that will be used by scholars representing | |
a broad spectrum of computer sophistication, user documentation proves | |
to be the most important service one can provide. | |
CALALUCA next illustrated a truncated search under mysterium within ten | |
words of virtus and how one would be able to find its contents throughout | |
the entire database. He said that the exciting thing about PLD is that | |
many of the applications in the retrieval software being written for it | |
will exceed the capabilities of the software employed now for the CD-ROM | |
version. The CD-ROM faces genuine limitations, in terms of speed and | |
comprehensiveness, in the creation of a retrieval software to run it. | |
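The kind of query described here, a truncated term within ten words of another, can be sketched in a few lines of Python. This is a minimal illustration only: it does not represent how Dynatext or any PLD retrieval software works, and the function names, the stems chosen, and the sample Latin sentence are all invented for the example rather than taken from Migne.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def proximity_hits(tokens, stem_a, stem_b, window=10):
    """Return (i, j) index pairs where a token beginning with stem_a lies
    within `window` words of a token beginning with stem_b.
    Matching on a prefix stands in for a truncated search term."""
    pos_a = [i for i, t in enumerate(tokens) if t.startswith(stem_a)]
    pos_b = [j for j, t in enumerate(tokens) if t.startswith(stem_b)]
    return [(i, j) for i in pos_a for j in pos_b
            if i != j and abs(i - j) <= window]

# Hypothetical sample sentence, made up for this sketch.
sample = "Magnum est mysterium fidei, et virtus Dei in mysteriis operatur."
tokens = tokenize(sample)
hits = proximity_hits(tokens, "mysteri", "virtu")

for i, j in hits:
    print(tokens[i], "is within ten words of", tokens[j])
```

Narrowing the `window` argument tightens the search; a production system would work from a prebuilt index rather than scanning every token, which is one reason speed on CD-ROM becomes a constraint.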
CALALUCA said he hoped that individual scholars will download the data, | |
if they wish, to their personal computers, and have ready access to | |
important texts on a constant basis, which they will be able to use in | |
their research and from which they might even be able to publish. | |
(CALALUCA explained that the blue numbers represented Migne's column numbers, | |
which are the standard scholarly references. Pulling up a note, he stated | |
that these texts were heavily edited and the image files would appear simply | |
as a note as well, so that one could quickly access an image.) | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
FLEISCHHAUER/ERWAY * Several problems with which AM is still wrestling * | |
Various search and retrieval capabilities * Illustration of automatic | |
stemming and a truncated search * AM's attempt to find ways to connect | |
cataloging to the texts * AM's gravitation towards SGML * Striking a | |
balance between quantity and quality * How AM furnishes users recourse to | |
images * Conducting a search in a full-text environment * Macintosh and | |
IBM prototypes of AM * Multimedia aspects of AM * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
A demonstration of American Memory by its coordinator, Carl FLEISCHHAUER, | |
and Ricky ERWAY, associate coordinator, Library of Congress, concluded | |
the morning session. Beginning with a collection of broadsides from the | |
Continental Congress and the Constitutional Convention, the only text | |
collection in a presentable form at the time of the Workshop, FLEISCHHAUER | |
highlighted several of the problems with which AM is still wrestling. | |
(In its final form, the disk will contain two collections, not only the | |
broadsides but also the full text with illustrations of a set of | |
approximately 300 African-American pamphlets from the period 1870 to 1910.) | |
As FREEMAN had explained earlier, AM has attempted to use a small amount | |
of interpretation to introduce collections. In the present case, the | |
contractor, a company named Quick Source, in Silver Spring, Md., used | |
software called Toolbook and put together a modestly interactive | |
introduction to the collection. Like the two preceding speakers, | |
FLEISCHHAUER argued that the real asset was the underlying collection. | |
FLEISCHHAUER proceeded to describe various search and retrieval | |
capabilities while ERWAY worked the computer. In this particular package | |
the "go to" pull-down allowed the user in effect to jump out of Toolbook, | |
where the interactive program was located, and enter the third-party | |
software used by AM for this text collection, which is called Personal | |
Librarian. This was the Windows version of Personal Librarian, a | |
software application put together by a company in Rockville, Md. | |
Since the broadsides came from the Revolutionary War period, a search was | |
conducted using the words British or war, with the default operator reset | |
as or. FLEISCHHAUER demonstrated both automatic stemming (which finds | |
other forms of the same root) and a truncated search. One of Personal | |
Librarian's strongest features, the relevance ranking, was represented by | |
a chart that indicated how often words being sought appeared in | |
documents, with the one receiving the most "hits" obtaining the highest | |
score. The "hit list" that is supplied takes the relevance ranking into | |
account, making the first hit, in effect, the one the software has | |
selected as the most relevant example. | |
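The combination of an "or" search, stemming, and ranking by number of hits can be approximated as follows. This is a rough sketch under stated assumptions, not Personal Librarian's actual algorithm, which the presentation does not describe: the suffix-stripping rule is deliberately crude, and the function names and the three sample texts are hypothetical.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for real stemming

def stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def rank(documents, query_terms):
    """Score each document by how many of its words share a stem with any
    query term (an OR search); return (name, hits) pairs, best first."""
    wanted = {stem(term.lower()) for term in query_terms}
    scores = Counter()
    for name, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        scores[name] = sum(1 for w in words if stem(w) in wanted)
    return [(name, hits) for name, hits in scores.most_common() if hits > 0]

# Hypothetical texts, invented for this sketch.
docs = {
    "broadside_a": "The war continues and the British fleet has sailed.",
    "broadside_b": "Congress resolved to fund the wars; British troops "
                   "and British ships remain.",
    "broadside_c": "A resolution concerning the post office.",
}
ranking = rank(docs, ["British", "war"])
print(ranking)
```

Here "wars" matches the query term "war" through stemming, and the document with the most hits heads the list, which mirrors the behavior of the hit list described above.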
While in the text of one of the broadside documents, FLEISCHHAUER | |
remarked on AM's attempt to find ways to connect cataloging to the texts, | |
which it does in different ways in different manifestations. In the case | |
shown, the cataloging was pasted on: AM took MARC records that were | |
written as on-line records right into one of the Library's mainframe | |
retrieval programs, pulled them out, and handed them off to the contractor, | |
who massaged them somewhat to display them in the manner shown. One of | |
AM's questions is, Does the cataloguing normally performed in the mainframe | |
work in this context, or ought AM to think through adjustments? | |
FLEISCHHAUER made the additional point that, as far as the text goes, AM | |
has gravitated towards SGML (he pointed to the boldface in the upper part | |
of the screen). Although extremely limited in its ability to translate | |
or interpret SGML, Personal Librarian will furnish both bold and italics | |
on screen; a fairly easy thing to do, but it is one of the ways in which | |
SGML is useful. | |
Striking a balance between quantity and quality has been a major concern | |
of AM, with accuracy being one of the places where project staff have | |
felt that less than 100-percent accuracy was acceptable. | |
FLEISCHHAUER cited the example of the standard of the rekeying industry, | |
namely 99.95 percent; as one service bureau informed him, to go from | |
99.95 to 100 percent would double the cost. | |
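To give a sense of what 99.95 percent means in practice, the arithmetic can be worked out as below. This is an illustrative calculation only; the figure of 2,000 characters per page is an assumption introduced for the example and does not come from the presentation.

```python
from decimal import Decimal

def expected_errors(characters, accuracy_percent):
    """Expected number of wrong characters at a given accuracy rate.
    Decimal avoids the rounding noise of binary floating point."""
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return Decimal(characters) * error_rate

# 99.95 percent accuracy leaves 5 errors in every 10,000 characters.
per_ten_thousand = expected_errors(10_000, "99.95")

# Assumed page density (hypothetical): 2,000 characters per page.
per_page = expected_errors(2_000, "99.95")

print(per_ten_thousand, per_page)
```

On that assumption, the industry standard allows roughly one wrong character per page, which is the residue that would cost as much again to remove.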
FLEISCHHAUER next demonstrated how AM furnishes users recourse to images, | |
and at the same time recalled LESK's pointed question concerning the | |
number of people who would look at those images and the number who would | |
work only with the text. If the implication of LESK's question was | |
sound, FLEISCHHAUER said, it raised the stakes for text accuracy and | |
reduced the value of the strategy for images. | |
Contending that preservation is always a bugaboo, FLEISCHHAUER | |
demonstrated several images derived from a scan of a preservation | |
microfilm that AM had made. He awarded a grade of C at best, perhaps a | |
C minus or a C plus, for how well it worked out. Indeed, the matter of | |
learning if other people had better ideas about scanning in general, and, | |
in particular, scanning from microfilm, was one of the factors that drove | |
AM to attempt to think through the agenda for the Workshop. Skew, for | |
example, was one of the issues that AM in its ignorance had not reckoned | |
would prove so difficult. | |
Further, the handling of images of the sort shown, in a desktop computer | |
environment, involved a considerable amount of zooming and scrolling. | |
Ultimately, AM staff feel that perhaps the paper copy that is printed out | |
might be the most useful one, but they remain uncertain as to how much | |
on-screen reading users will do. | |
Returning to the text, FLEISCHHAUER asked viewers to imagine a person who | |
might be conducting a search in a full-text environment. With this | |
scenario, he proceeded to illustrate other features of Personal Librarian | |
that he considered helpful; for example, it provides the ability to | |
notice words as one reads. Clicking the "include" button on the bottom | |
of the search window pops the words that have been highlighted into the | |
search. Thus, a user can refine the search as he or she reads, | |
re-executing the search and continuing to find things in the quest for | |
materials. This software not only contains relevance ranking, Boolean | |
operators, and truncation, it also permits one to perform word algebra, | |
so to speak, where one puts two or three words in parentheses and links | |
them with one Boolean operator and then a couple of words in another set | |
of parentheses and asks for things within so many words of others. | |
Until they became acquainted recently with some of the work being done in | |
classics, the AM staff had not realized that a large number of the | |
projects that involve electronic texts were being done by people with a | |
profound interest in language and linguistics. Their search strategies | |
and thinking are oriented to those fields, as is shown in particular by | |
the Perseus example. As amateur historians, the AM staff were thinking | |
more of searching for concepts and ideas than for particular words. | |
Obviously, FLEISCHHAUER conceded, searching for concepts and ideas and | |
searching for words may be two rather closely related things. | |
While displaying several images, FLEISCHHAUER observed that the Macintosh | |
prototype built by AM contains a greater diversity of formats. Echoing a | |
previous speaker, he said that it was easier to stitch things together in | |
the Macintosh, though it tended to be a little more anemic in search and | |
retrieval. AM, therefore, increasingly has been investigating | |
sophisticated retrieval engines in the IBM format. | |
FLEISCHHAUER demonstrated several additional examples of the prototype | |
interfaces: One was AM's metaphor for the network future, in which a | |
kind of reading-room graphic suggests how one would be able to go around | |
to different materials. AM contains a large number of photographs in | |
analog video form worked up from a videodisc, which enable users to make | |
copies to print or incorporate in digital documents. A frame-grabber is | |
built into the system, making it possible to bring an image into a window | |
and digitize or print it out. | |
FLEISCHHAUER next demonstrated sound recording, which included texts. | |
Recycled from a previous project, the collection included sixty 78-rpm | |
phonograph records of political speeches that were made during and | |
immediately after World War I. These constituted approximately three | |
hours of audio, as AM has digitized it, which occupy 150 megabytes on a | |
CD. Thus, they are considerably compressed. From the catalogue card, | |
FLEISCHHAUER proceeded to a transcript of a speech with the audio | |
available and with highlighted text following it as it played. | |
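The degree of compression implied by fitting about three hours of audio into 150 megabytes can be estimated as follows. This is a back-of-the-envelope sketch: the presentation does not state AM's audio format, so the comparison with uncompressed CD-quality stereo (44,100 samples per second, 16 bits, two channels) and the reading of a megabyte as one million bytes are both assumptions made for the example.

```python
SECONDS = 3 * 60 * 60              # about three hours of audio
STORED_BYTES = 150 * 10**6         # 150 megabytes, taken as decimal megabytes

# Average storage rate actually used, in bytes per second.
stored_rate = STORED_BYTES / SECONDS

# Assumed reference: uncompressed CD-quality stereo.
# 44,100 samples/s x 2 bytes per sample x 2 channels = 176,400 bytes/s.
cd_rate = 44_100 * 2 * 2
uncompressed_bytes = cd_rate * SECONDS

ratio = uncompressed_bytes / STORED_BYTES

print(round(stored_rate), uncompressed_bytes, round(ratio, 1))
```

Against that reference the stored audio is smaller by a factor of about twelve; measured against a monaural or lower-sample-rate source, which would suit 78-rpm recordings, the factor would be correspondingly lower.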
A photograph has been added and a transcription made. | |
Considerable value has been added beyond what the Library of Congress | |
normally would do in cataloguing a sound recording, which raises several | |
questions for AM concerning where to draw lines about how much value it can | |
afford to add and at what point, perhaps, this becomes more than AM could | |
reasonably do or reasonably wish to do. FLEISCHHAUER also demonstrated | |
a motion picture. As FREEMAN had reported earlier, the motion picture | |
materials have proved the most popular, not surprisingly. This says more | |
about the medium, he thought, than about AM's presentation of it. | |
Because AM's goal was to bring together things that could be used by | |
historians or by people who were curious about history, | |
turn-of-the-century footage seemed to represent the most appropriate | |
collections from the Library of Congress in motion pictures. These were | |
the very first films made by Thomas Edison's company and some others at | |
that time. The particular example illustrated was a Biograph film, | |
brought in with a frame-grabber into a window. A single videodisc | |
contains about fifty titles and pieces of film from that period, all of | |
New York City. Taken together, AM believes, they provide an interesting | |
documentary resource. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Using the frame-grabber in AM * Volume of material processed | |
and to be processed * Purpose of AM within LC * Cataloguing and the | |
nature of AM's material * SGML coding and the question of quality versus | |
quantity * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the question-and-answer period that followed FLEISCHHAUER's | |
presentation, several clarifications were made. | |
AM is bringing in motion pictures from a videodisc. The frame-grabber | |
devices create a window on a computer screen, which permits users to | |
digitize a single frame of the movie or one of the photographs. It | |
produces a crude, rough-and-ready image that high school students can | |
incorporate into papers, and that has worked very nicely in this way. | |
Commenting on FLEISCHHAUER's assertion that AM was looking more at | |
searching ideas than words, MYLONAS argued that without words an idea | |
does not exist. FLEISCHHAUER conceded that he ought to have articulated | |
his point more clearly. MYLONAS stated that they were in fact both | |
talking about the same thing. By searching for words and by forcing | |
people to focus on the word, the Perseus Project felt that they would get | |
them to the idea. The way one reviews results is tailored more to one | |
kind of user than another. | |
Concerning the total volume of material that has been processed in this | |
way, AM at this point has in retrievable form seven or eight collections, | |
all of them photographic. In the Macintosh environment, for example, | |
there probably are 35,000-40,000 photographs. The sound recordings | |
number sixty items. The broadsides number about 300 items. There are | |
500 political cartoons in the form of drawings. The motion pictures, as | |
individual items, number sixty to seventy. | |
AM also has a manuscript collection, the life history portion of one of | |
the federal project series, which will contain 2,900 individual | |
documents, all first-person narratives. AM has in process about 350 | |
African-American pamphlets, or about 12,000 printed pages for the period | |
1870-1910. Also in the works are some 4,000 panoramic photographs. AM | |
has recycled a fair amount of the work done by LC's Prints and | |
Photographs Division during the Library's optical disk pilot project in | |
the 1980s. For example, a special division of LC has tooled up and | |
thought through all the ramifications of electronic presentation of | |
photographs. Indeed, they are wheeling them out in great barrel loads. | |
The purpose of AM within the Library, it is hoped, is to catalyze several | |
of the other special collection divisions which have no particular | |
experience with, and in some cases mixed feelings about, an activity such as | |
AM. Moreover, in many cases the divisions may be characterized as | |
lacking experience not only in "electronifying" things but also in automated | |
cataloguing. MARC cataloguing as practiced in the United States is | |
heavily weighted toward the description of monograph and serial | |
materials, but is much thinner when one enters the world of manuscripts | |
and things that are held in the Library's music collection and other | |
units. In response to a comment by LESK, that AM's material is very | |
heavily photographic, and is so primarily because individual records have | |
been made for each photograph, FLEISCHHAUER observed that an item-level | |
catalog record exists, for example, for each photograph in the Detroit | |
Publishing collection of 25,000 pictures. In the case of the Federal | |
Writers Project, for which nearly 3,000 documents exist, representing | |
information from twenty-six different states, AM with the assistance of | |
Karen STUART of the Manuscript Division will attempt to find some way not | |
only to have a collection-level record but perhaps a MARC record for each | |
state, which will then serve as an umbrella for the 100-200 documents | |
that come under it. But that drama remains to be enacted. The AM staff | |
is conservative and clings to cataloguing, though of course visitors tout | |
artificial intelligence and neural networks in a manner that suggests that | |
perhaps one need not have cataloguing or that much of it could be put aside. | |
The matter of SGML coding, FLEISCHHAUER conceded, returned the discussion | |
to the earlier treated question of quality versus quantity in the Library | |
of Congress. Of course, text conversion can be done with 100-percent | |
accuracy, but it means that when one's holdings are as vast as LC's only | |
a tiny amount will be exposed, whereas permitting lower levels of | |
accuracy can lead to exposing or sharing larger amounts, but with the | |
quality correspondingly impaired. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
TWOHIG * A contrary experience concerning electronic options * Volume of | |
material in the Washington papers and a suggestion of David Packard * | |
Implications of Packard's suggestion * Transcribing the documents for the | |
CD-ROM * Accuracy of transcriptions * The CD-ROM edition of the Founding | |
Fathers documents * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Finding encouragement in a comment of MICHELSON's from the morning | |
session--that numerous people in the humanities were choosing electronic | |
options to do their work--Dorothy TWOHIG, editor, The Papers of George | |
Washington, opened her illustrated talk by noting that her experience | |
with literary scholars and numerous people in editing was contrary to | |
MICHELSON's. TWOHIG emphasized literary scholars' complete ignorance of | |
the technological options available to them or their reluctance or, in | |
some cases, their downright hostility toward these options. | |
After providing an overview of the five Founding Fathers projects | |
(Jefferson at Princeton, Franklin at Yale, John Adams at the | |
Massachusetts Historical Society, and Madison down the hall from her at | |
the University of Virginia), TWOHIG observed that the Washington papers, | |
like all of the projects, include both sides of the Washington | |
correspondence and deal with some 135,000 documents to be published with | |
extensive annotation in eighty to eighty-five volumes, a project that | |
will not be completed until well into the next century. Thus, it was | |
with considerable enthusiasm several years ago that the Washington Papers | |
Project (WPP) greeted David Packard's suggestion that the papers of the | |
Founding Fathers could be published easily and inexpensively, and to the | |
great benefit of American scholarship, via CD-ROM. | |
In pragmatic terms, funding from the Packard Foundation would expedite | |
the transcription of thousands of documents waiting to be put on disk in | |
the WPP offices. Further, since the costs of collecting, editing, and | |
converting the Founding Fathers documents into letterpress editions were | |
running into the millions of dollars, and the considerable staffs | |
involved in all of these projects were devoting their careers to | |
producing the work, the Packard Foundation's suggestion had a | |
revolutionary aspect: Transcriptions of the entire corpus of the | |
Founding Fathers papers would be available on CD-ROM to public and | |
college libraries, even high schools, for an annual license fee of | |
$100-$150, a fraction of the cost of producing a limited university | |
press run of 1,000 of each volume of the published papers at $45-$150 per | |
printed volume. Given the current budget crunch in educational systems | |
and the corresponding constraints on librarians in smaller institutions | |
who wish to add these volumes to their collections, producing the | |
documents on CD-ROM would likely open a greatly expanded audience for the | |
papers. TWOHIG stressed, however, that development of the Founding | |
Fathers CD-ROM is still in its infancy. Serious software problems remain | |
to be resolved before the material can be put into readable form. | |
Funding from the Packard Foundation resulted in a major push to | |
transcribe the 75,000 or so documents of the Washington papers remaining | |
to be transcribed onto computer disks. Slides illustrated several of the | |
problems encountered, for example, the present inability of CD-ROM to | |
indicate the cross-outs (deleted material) in eighteenth century | |
documents. TWOHIG next described documents from various periods in the | |
eighteenth century that have been transcribed in chronological order and | |
delivered to the Packard offices in California, where they are converted | |
to the CD-ROM, a process that is expected to consume five years to | |
complete (that is, reckoning from David Packard's suggestion made several | |
years ago, until about July 1994). TWOHIG found an encouraging | |
indication of the project's benefits in the ongoing use made by scholars | |
of the search functions of the CD-ROM, particularly in reducing the time | |
spent in manually turning the pages of the Washington papers. | |
TWOHIG next furnished details concerning the accuracy of transcriptions. | |
For instance, the insertion of thousands of documents on the CD-ROM | |
currently does not permit each document to be verified against the | |
original manuscript several times as in the case of documents that appear | |
in the published edition. However, the transcriptions receive a cursory | |
check for obvious typos, the misspellings of proper names, and other | |
errors from the WPP CD-ROM editor. Eventually, all documents that appear | |
in the electronic version will be checked by project editors. Although | |
this process has met with opposition from some of the editors on the | |
grounds that imperfect work may leave their offices, the advantages in | |
making this material available as a research tool outweigh fears about the | |
misspelling of proper names and other relatively minor editorial matters. | |
Completion of all five Founding Fathers projects (i.e., retrievability | |
and searchability of all of the documents by proper names, alternate | |
spellings, or varieties of subjects) will provide one of the richest | |
sources of this size for the history of the United States in the latter | |
part of the eighteenth century. Further, publication on CD-ROM will | |
allow editors to include even minutiae, such as laundry lists, not | |
included in the printed volumes. | |
It seems possible that the extensive annotation provided in the printed | |
volumes eventually will be added to the CD-ROM edition, pending | |
negotiations with the publishers of the papers. At the moment, the | |
Founding Fathers CD-ROM is accessible only on the IBYCUS, a computer | |
developed out of the Thesaurus Linguae Graecae project and designed for | |
the use of classical scholars. There are perhaps 400 IBYCUS computers in | |
the country, most of which are in university classics departments. | |
Ultimately, it is anticipated that the CD-ROM edition of the Founding | |
Fathers documents will run on any IBM-compatible or Macintosh computer | |
with a CD-ROM drive. Numerous changes in the software will also occur | |
before the project is completed. (Editor's note: an IBYCUS was | |
unavailable to demonstrate the CD-ROM.) | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Several additional features of WPP clarified * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Discussion following TWOHIG's presentation served to clarify several | |
additional features, including (1) that the project's primary | |
intellectual product consists in the electronic transcription of the | |
material; (2) that the text transmitted to the CD-ROM people is not | |
marked up; (3) that cataloging and subject-indexing of the material | |
remain to be worked out (though at this point material can be retrieved | |
by name); and (4) that because all the searching is done in the hardware, | |
the IBYCUS is designed to read a CD-ROM which contains only sequential | |
text files. Technically, it then becomes very easy to read the material | |
off and put it on another device. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
LEBRON * Overview of the history of the joint project between AAAS and | |
OCLC * Several practices the on-line environment shares with traditional | |
publishing on hard copy * Several technical and behavioral barriers to | |
electronic publishing * How AAAS and OCLC arrived at the subject of | |
clinical trials * Advantages of the electronic format and other features | |
of OJCCT * An illustrated tour of the journal * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Maria LEBRON, managing editor, The Online Journal of Current Clinical | |
Trials (OJCCT), presented an illustrated overview of the history of the | |
joint project between the American Association for the Advancement of | |
Science (AAAS) and the Online Computer Library Center, Inc. (OCLC). The | |
joint venture between AAAS and OCLC owes its beginning to a | |
reorganization launched by the new chief executive officer at OCLC about | |
three years ago and combines the strengths of these two disparate | |
organizations. In short, OJCCT represents the process of scholarly | |
publishing on line. | |
LEBRON next discussed several practices the on-line environment shares | |
with traditional publishing on hard copy--for example, peer review of | |
manuscripts--that are highly important in the academic world. LEBRON | |
noted in particular the implications of citation counts for tenure | |
committees and grants committees. In the traditional hard-copy | |
environment, citation counts are readily demonstrable, whereas the | |
on-line environment represents an ethereal medium to most academics. | |
LEBRON remarked on several technical and behavioral barriers to electronic | |
publishing, for instance, the problems in transmission created by special | |
characters or by complex graphics and halftones. In addition, she noted | |
economic limitations such as the storage costs of maintaining back issues | |
and market or audience education. | |
Manuscripts cannot be uploaded to OJCCT, LEBRON explained, because it is | |
not a bulletin board or E-mail, forms of electronic transmission of | |
information that have created an ambience clouding people's understanding | |
of what the journal is attempting to do. OJCCT, which publishes | |
peer-reviewed medical articles dealing with the subject of clinical | |
trials, includes text, tabular material, and graphics, although at this | |
time it can transmit only line illustrations. | |
Next, LEBRON described how AAAS and OCLC arrived at the subject of | |
clinical trials: It is 1) a highly statistical discipline that 2) does | |
not require halftones but can satisfy the needs of its audience with line | |
illustrations and graphic material, and 3) there is a need for the speedy | |
dissemination of high-quality research results. Clinical trials are | |
research activities that involve the administration of a test treatment | |
to some experimental unit in order to test its usefulness before it is | |
made available to the general population. LEBRON proceeded to give | |
additional information on OJCCT concerning its editor-in-chief, editorial | |
board, editorial content, and the types of articles it publishes | |
(including peer-reviewed research reports and reviews), as well as | |
features shared by other traditional hard-copy journals. | |
Among the advantages of the electronic format are faster dissemination of | |
information, including raw data, and the absence of space constraints | |
because pages do not exist. (This latter fact creates an interesting | |
situation when it comes to citations.) Nor are there any issues. AAAS's | |
capacity to download materials directly from the journal to a | |
subscriber's printer, hard drive, or floppy disk helps ensure highly | |
accurate transcription. Other features of OJCCT include on-screen alerts | |
that allow linkage of subsequently published documents to the original | |
documents; on-line searching by subject, author, title, etc.; indexing of | |
every single word that appears in an article; viewing access to an | |
article by component (abstract, full text, or graphs); numbered | |
paragraphs to replace page counts; publication in Science, every thirty | |
days, of an index of all articles published in the journal; | |
typeset-quality screens; and Hypertext links that enable subscribers to | |
bring up Medline abstracts directly without leaving the journal. | |
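
Two of the features listed above, indexing of every word and numbered
paragraphs in place of page counts, can be illustrated together. The
following is a minimal sketch only, not a description of OCLC's actual
software: the function name build_index and the sample article text are
hypothetical, invented for illustration. It maps every word to the
numbers of the paragraphs in which it appears, so that a search result
can be cited by paragraph number rather than by page.

```python
import re
from collections import defaultdict


def build_index(paragraphs):
    """Map each lowercased word to the sorted paragraph numbers containing it.

    Paragraphs are numbered from 1, standing in for the numbered
    paragraphs that replace page counts. No stop words are removed,
    so every single word is indexed.
    """
    index = defaultdict(set)
    for number, text in enumerate(paragraphs, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical article text, invented for illustration only.
article = [
    "A clinical trial tests a treatment before general release.",
    "The trial enrolled patients at two sites.",
    "Results of the treatment are reported in tabular form.",
]

index = build_index(article)
print(index["trial"])       # paragraph numbers mentioning "trial"
print(index["treatment"])   # paragraph numbers mentioning "treatment"
```

A citation built on such an index would name the article and a paragraph
number, which stays stable however the text is displayed or printed.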
After detailing the two primary ways to gain access to the journal, | |
through the OCLC network and CompuServe if one desires graphics or through | |
the Internet if just an ASCII file is desired, LEBRON illustrated the | |
speedy editorial process and the coding of the document using SGML tags | |
after it has been accepted for publication. She also gave an illustrated | |
tour of the journal, its search-and-retrieval capabilities in particular, | |
but also including problems associated with scanning in illustrations, | |
and the importance of on-screen alerts to the medical profession re | |
retractions or corrections, or more frequently, editorials, letters to | |
the editors, or follow-up reports. She closed by inviting the audience | |
to join AAAS on 1 July, when OJCCT was scheduled to go on-line. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Additional features of OJCCT * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
In the lengthy discussion that followed LEBRON's presentation, these | |
points emerged: | |
* The SGML text can be tailored as users wish. | |
* All these articles have a fairly simple document definition. | |
* Document-type definitions (DTDs) were developed and given to OJCCT | |
for coding. | |
* No articles will be removed from the journal. (Because there are | |
no back issues, there are no lost issues either. Once a subscriber | |
logs onto the journal he or she has access not only to the currently | |
published materials, but retrospectively to everything that has been | |
published in it. Thus the table of contents grows bigger. The date | |
of publication serves to distinguish between currently published | |
materials and older materials.) | |
* The pricing system for the journal resembles that for most medical | |
journals: for 1992, $95 for a year, plus telecommunications charges | |
(there are no connect time charges); for 1993, $110 for the | |
entire year for single users, though the journal can be put on a | |
local area network (LAN). However, only one person can access the | |
journal at a time. Site licenses may come in the future. | |
* AAAS is working closely with colleagues at OCLC to display | |
mathematical equations on screen. | |
* Without compromising any steps in the editorial process, the | |
technology has reduced the time lag between when a manuscript is | |
originally submitted and the time it is accepted; the review process | |
does not differ greatly from the standard six to eight weeks | |
employed by many of the hard-copy journals. The process still | |
depends on people. | |
* As far as a preservation copy is concerned, articles will be | |
maintained on the computer permanently and subscribers, as part of | |
their subscription, will receive a microfiche-quality archival copy | |
of everything published during that year; in addition, reprints can | |
be purchased in much the same way as in a hard-copy environment. | |
Hard copies are prepared but are not the primary medium for the | |
dissemination of the information. | |
* Because OJCCT is not yet on line, it is difficult to know how many | |
people would simply browse through the journal on the screen as | |
opposed to downloading the whole thing and printing it out; a mix of | |
both types of users likely will result. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
PERSONIUS * Developments in technology over the past decade * The CLASS | |
Project * Advantages for technology and for the CLASS Project * | |
Developing a network application an underlying assumption of the project | |
* Details of the scanning process * Print-on-demand copies of books * | |
Future plans include development of a browsing tool * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Lynne PERSONIUS, assistant director, Cornell Information Technologies for | |
Scholarly Information Services, Cornell University, first commented on | |
the tremendous impact that developments in technology over the past ten | |
years--networking, in particular--have had on the way information is | |
handled, and how, in her own case, these developments have counterbalanced | |
Cornell's relative geographical isolation. Other significant technologies | |
include scanners, which are much more sophisticated than they were ten years | |
ago; mass storage and the dramatic savings that result from it in terms of | |
both space and money relative to twenty or thirty years ago; new and | |
improved printing technologies, which have greatly affected the distribution | |
of information; and, of course, digital technologies, whose applicability to | |
library preservation remains at issue. | |
Given that context, PERSONIUS described the College Library Access and | |
Storage System (CLASS) Project, primarily a library preservation | |
project, and what it has accomplished. Directly funded by the | |
Commission on Preservation and Access and by the Xerox Corporation, which | |
has provided a significant amount of hardware, the CLASS Project has been | |
working with a development team at Xerox to develop a software | |
application tailored to library preservation requirements. Within | |
Cornell, participants in the project have been working jointly with both | |
library and information technologies. The focus of the project has been | |
on reformatting and saving books that are in brittle condition. | |
PERSONIUS showed Workshop participants a brittle book, and described how | |
such books were the result of developments in papermaking around the | |
beginning of the Industrial Revolution. The papermaking process was | |
changed so that a significant amount of acid was introduced into the | |
actual paper itself, which deteriorates as it sits on library shelves. | |
One of the advantages for technology and for the CLASS Project is that | |
the information in brittle books is mostly out of copyright and thus | |
offers an opportunity to work with material that requires library | |
preservation, and to create and work on an infrastructure to save the | |
material. Acknowledging the familiarity of those working in preservation | |
with this information, PERSONIUS noted that several things are being | |
done: the primary preservation technology used today is photocopying of | |
brittle material. Saving the intellectual content of the material is the | |
main goal. With microfilm copy, the intellectual content is preserved on | |
the assumption that in the future the image can be reformatted in any | |
other way that then exists. | |
An underlying assumption of the CLASS Project from the beginning was | |
that it would develop a network application. Project staff scan books | |
at a workstation located in the library, near the brittle material. | |
An image-server filing system is located at a distance from that | |
workstation, and a printer is located in another building. All of the | |
materials digitized and stored on the image-filing system are cataloged | |
in the on-line catalogue. In fact, a record for each of these electronic | |
books is stored in the RLIN database so that a record exists of what is | |
in the digital library through standard catalogue procedures. In the | |
future, researchers working from their own workstations in their offices, | |
or their networks, will have access--wherever they might be--through a | |
request server being built into the new digital library. A second | |
assumption is that the preferred means of finding the material will be by | |
looking through a catalogue. PERSONIUS described the scanning process, | |
which uses a prototype scanner being developed by Xerox and which scans a | |
very high resolution image at great speed. Another significant feature, | |
because this is a preservation application, is the placing of the pages, | |
which fall apart, one by one on the platen. Ordinarily, a scanner could | |
be used with some sort of a document feeder, but because of this | |
application that is not feasible. Further, because CLASS is a | |
preservation application, after the paper replacement is made there, a | |
very careful quality control check is performed. An original book is | |
compared to the printed copy and verification is made, before proceeding, | |
that all of the image, all of the information, has been captured. Then, | |
a new library book is produced: The printed images are rebound by a | |
commercial binder and a new book is returned to the shelf. | |
Significantly, the books returned to the library shelves are beautiful | |
and useful replacements on acid-free paper that should last a long time, | |
in effect, the equivalent of preservation photocopies. Thus, the project | |
has a library of digital books. In essence, CLASS is scanning and | |
storing books as 600 dot-per-inch bit-mapped images, compressed using | |
Group 4 CCITT (i.e., the French acronym for International Consultative | |
Committee for Telegraph and Telephone) compression. They are stored as | |
TIFF files on an optical filing system that is composed of a database | |
used for searching and locating the books and an optical jukebox that | |
stores 64 twelve-inch platters. A very-high-resolution printed copy of | |
these books at 600 dots per inch is created, using a Xerox DocuTech | |
printer to make the paper replacements on acid-free paper. | |
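
The storage figures above invite a rough calculation. The sketch below
is a back-of-the-envelope estimate, not a measurement from the CLASS
Project: the page dimensions and the compression ratio are assumptions
chosen for illustration, since the achieved Group 4 ratio depends
heavily on page content. Only the 600 dot-per-inch resolution and the
one-bit-per-pixel (bit-mapped) format come from the presentation.

```python
def raw_bitmap_bytes(width_in, height_in, dpi=600):
    """Uncompressed size in bytes of a 1-bit-per-pixel page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8   # pad each row to a whole byte
    return row_bytes * height_px


# Assumed values, not taken from the presentation.
PAGE_WIDTH_IN = 6.0      # hypothetical page width in inches
PAGE_HEIGHT_IN = 9.0     # hypothetical page height in inches
ASSUMED_RATIO = 15       # hypothetical Group 4 compression ratio

raw = raw_bitmap_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN)
compressed = raw / ASSUMED_RATIO
print(f"raw page image: {raw:,} bytes")
print(f"compressed at {ASSUMED_RATIO}:1: about {compressed:,.0f} bytes")
```

Under these assumptions a single page occupies a little over 2 megabytes
before compression, which suggests why compressed storage on an optical
jukebox was part of the design.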
PERSONIUS maintained that the CLASS Project presents an opportunity to | |
introduce people to books as digital images by using a paper medium. | |
Books are returned to the shelves while people are also given the ability | |
to print on demand--to make their own copies of books. (PERSONIUS | |
distributed copies of an engineering journal published by engineering | |
students at Cornell around 1900 as an example of what a print-on-demand | |
copy of material might be like. This very cheap copy would be available | |
to people to use for their own research purposes and would bridge the gap | |
between an electronic work and the paper that readers like to have.) | |
PERSONIUS then attempted to illustrate a very early prototype of | |
networked access to this digital library. Xerox Corporation has | |
developed a prototype of a view station that can send images across the | |
network to be viewed. | |
The particular library brought down for demonstration contained two | |
mathematics books. CLASS is developing and will spend the next year | |
developing an application that allows people at workstations to browse | |
the books. Thus, CLASS is developing a browsing tool, on the assumption | |
that users do not want to read an entire book from a workstation, but | |
would prefer to be able to look through and decide if they would like to | |
have a printed copy of it. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Re retrieval software * "Digital file copyright" * Scanning | |
rate during production * Autosegmentation * Criteria employed in | |
selecting books for scanning * Compression and decompression of images * | |
OCR not precluded * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the question-and-answer period that followed her presentation, | |
PERSONIUS made these additional points: | |
* Re retrieval software, Cornell is developing a Unix-based server | |
as well as clients for the server that support multiple platforms | |
(Macintosh, IBM and Sun workstations), in the hope that people from | |
any of those platforms will retrieve books; a further operating | |
assumption is that standard interfaces will be used as much as | |
possible, where standards can be put in place, because CLASS | |
considers this retrieval software a library application and would | |
like to be able to look at material not only at Cornell but at other | |
institutions. | |
* The phrase "digital file copyright by Cornell University" was | |
added at the advice of Cornell's legal staff with the caveat that it | |
probably would not hold up in court. Cornell does not want people | |
to copy its books and sell them but would like to keep them | |
available for use in a library environment for library purposes. | |
* In production the scanner can scan about 300 pages per hour, | |
capturing 600 dots per inch. | |
* The Xerox software has filters to scan halftone material and avoid | |
the moire patterns that occur when halftone material is scanned. | |
Xerox has been working on hardware and software that would enable | |
the scanner itself to recognize this situation and deal with it | |
appropriately--a kind of autosegmentation that would enable the | |
scanner to handle halftone material as well as text on a single page. | |
* The books subjected to the elaborate process described above were | |
selected because CLASS is a preservation project, with the first 500 | |
books selected coming from Cornell's mathematics collection, because | |
they were still being heavily used and because, although they were | |
in need of preservation, the mathematics library and the mathematics | |
faculty were uncomfortable having them microfilmed. (They wanted a | |
printed copy.) Thus, these books became a logical choice for this | |
project. Other books were chosen by the project's selection committees | |
for experiments with the technology, as well as to meet a demand or need. | |
* Images will be decompressed before they are sent over the line; at | |
this time they are compressed and sent to the image filing system | |
and then sent to the printer as compressed images; they are returned | |
to the workstation as compressed 600-dpi images and the workstation | |
decompresses and scales them for display--an inefficient way to | |
access the material though it works quite well for printing and | |
other purposes. | |
* CLASS is also decompressing on Macintosh and IBM, a slow process | |
right now. Eventually, compression and decompression will take | |
place on an image conversion server. Trade-offs will be made, based | |
on future performance testing, concerning where the file is | |
compressed and what resolution image is sent. | |
* OCR has not been precluded; images are being stored that have been | |
scanned at a high resolution, which presumably would suit them well | |
to an OCR process. Because the material being scanned is about 100 | |
years old and was printed with less-than-ideal technologies, very | |
early and preliminary tests have not produced good results. But the | |
project is capturing an image that is of sufficient resolution to be | |
subjected to OCR in the future. Moreover, the system architecture | |
and the system plan have a logical place to store an OCR image if it | |
has been captured. But that is not being done now. | |
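
The scaling step mentioned above, in which the workstation reduces a
600-dpi image for display, can be sketched as follows. This is one
simple way to do it, assumed here for illustration, and not necessarily
the method used in the Xerox or Cornell software: each square block of
black-and-white pixels is replaced by one gray value proportional to the
share of black pixels in the block. The function name downscale and the
tiny sample bitmap are hypothetical.

```python
def downscale(bitmap, factor):
    """Reduce a bilevel bitmap (rows of 0/1, 1 = black) by an integer factor.

    Each factor-by-factor block becomes one gray value from 0 (white)
    to 255 (black), proportional to the share of black pixels in it.
    Rows and columns left over after the last full block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            black = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(round(255 * black / (factor * factor)))
        out.append(row)
    return out


# A tiny 4 x 4 sample; a real 600-dpi page has millions of pixels.
tiny = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
print(downscale(tiny, 2))
```

A factor of 8, for example, would take a 600-dpi scan down to 75 dots
per inch, a display resolution chosen here only for illustration. Doing
this work on every workstation for every page is the inefficiency that
an image conversion server would remove.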
****** | |
SESSION III. DISTRIBUTION, NETWORKS, AND NETWORKING: OPTIONS FOR | |
DISSEMINATION | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
ZICH * Issues pertaining to CD-ROMs * Options for publishing in CD-ROM * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Robert ZICH, special assistant to the associate librarian for special | |
projects, Library of Congress, and moderator of this session, first noted | |
the blessed but somewhat awkward circumstance of having four very | |
distinguished people representing networks and networking or at least | |
leaning in that direction, while lacking anyone to speak from the | |
strongest possible background in CD-ROMs. ZICH expressed the hope that | |
members of the audience would join the discussion. He stressed the | |
subtitle of this particular session, "Options for Dissemination," and, | |
concerning CD-ROMs, the importance of determining when it would be wise | |
to consider dissemination in CD-ROM versus networks. A shopping list of | |
issues pertaining to CD-ROMs included: the grounds for selecting | |
commercial publishers, and in-house publication where possible versus | |
nonprofit or government publication. A similar list for networks | |
included: determining when one should consider dissemination through a | |
network, identifying the mechanisms or entities that exist to place items | |
on networks, identifying the pool of existing networks, determining how a | |
producer would choose between networks, and identifying the elements of | |
a business arrangement in a network. | |
Options for publishing in CD-ROM: an outside publisher versus | |
self-publication. If an outside publisher is used, it can be nonprofit, | |
such as the Government Printing Office (GPO) or the National Technical | |
Information Service (NTIS), in the case of government. The pros and cons | |
associated with employing an outside publisher are obvious. Among the | |
pros, there is no trouble getting accepted. One pays the bill and, in | |
effect, goes one's way. Among the cons, when one pays an outside | |
publisher to perform the work, that publisher will perform the work it is | |
obliged to do, but perhaps without the production expertise and skill in | |
marketing and dissemination that some would seek. There is the body of | |
commercial publishers that do possess that kind of expertise in | |
distribution and marketing but that obviously are selective. In | |
self-publication, one exercises full control, but then one must handle | |
matters such as distribution and marketing. Such are some of the options | |
for publishing in the case of CD-ROM. | |
In the case of technical and design issues, which are also important, | |
there are many matters which many at the Workshop already knew a good | |
deal about: retrieval system requirements and costs, what to do about | |
images, the various capabilities and platforms, the trade-offs between | |
cost and performance, concerns about local-area networkability, | |
interoperability, etc. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
LYNCH * Creating networked information is different from using networks | |
as an access or dissemination vehicle * Networked multimedia on a large | |
scale does not yet work * Typical CD-ROM publication model a two-edged | |
sword * Publishing information on a CD-ROM in the present world of | |
immature standards * Contrast between CD-ROM and network pricing * | |
Examples demonstrated earlier in the day as a set of insular information | |
gems * Paramount need to link databases * Layering to become increasingly | |
necessary * Project NEEDS and the issues of information reuse and active | |
versus passive use * X-Windows as a way of differentiating between | |
network access and networked information * Barriers to the distribution | |
of networked multimedia information * Need for good, real-time delivery | |
protocols * The question of presentation integrity in client-server | |
computing in the academic world * Recommendations for producing multimedia | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Clifford LYNCH, director, Library Automation, University of California, | |
opened his talk with the general observation that networked information | |
constituted a difficult and elusive topic because it is something just | |
starting to develop and not yet fully understood. LYNCH contended that | |
creating genuinely networked information was different from using | |
networks as an access or dissemination vehicle and was more sophisticated | |
and more subtle. He invited the members of the audience to extrapolate, | |
from what they heard about the preceding demonstration projects, to what | |
sort of a world of electronic information--scholarly, archival, | |
cultural, etc.--they wished to end up with ten or fifteen years from now. | |
LYNCH suggested that to extrapolate directly from these projects would | |
produce unpleasant results. | |
Putting the issue of CD-ROM in perspective before getting into | |
generalities on networked information, LYNCH observed that those engaged | |
in multimedia today who wish to ship a product, so to say, probably do | |
not have much choice except to use CD-ROM: networked multimedia on a | |
large scale basically does not yet work because the technology does not | |
exist. For example, anybody who has tried moving images around over the | |
Internet knows that this is an exciting touch-and-go process, a | |
fascinating and fertile area for experimentation, research, and | |
development, but not something that one can become deeply enthusiastic | |
about committing to production systems at this time. | |
This situation will change, LYNCH said. He differentiated CD-ROM from | |
the practices that have been followed up to now in distributing data on | |
CD-ROM. For LYNCH the problem with CD-ROM is not its portability or its | |
slowness but the two-edged sword of having the retrieval application and | |
the user interface inextricably bound up with the data, which is the | |
typical CD-ROM publication model. It is not a case of publishing data | |
but of distributing a typically stand-alone, typically closed system, | |
all--software, user interface, and data--on a little disk. Hence, all | |
the between-disk navigational issues as well as the impossibility in most | |
cases of integrating data on one disk with that on another. Most CD-ROM | |
retrieval software does not network very gracefully at present. However, | |
in the present world of immature standards and lack of understanding of | |
what network information is or what the ground rules are for creating or | |
using it, publishing information on a CD-ROM does add value in a very | |
real sense. | |
LYNCH drew a contrast between CD-ROM and network pricing and in doing so | |
highlighted something bizarre in information pricing. A large | |
institution such as the University of California has vendors who will | |
offer to sell information on CD-ROM for a price per year in four digits, | |
but for the same data (e.g., an abstracting and indexing database) on | |
magnetic tape, regardless of how many people may use it concurrently, | |
will quote a price in six digits. | |
What is packaged with the CD-ROM in one sense adds value--a complete | |
access system, not just raw, unrefined information--although it is not | |
generally perceived that way. This is because the access software, | |
although it adds value, is viewed by some people, particularly in the | |
university environment where there is a very heavy commitment to | |
networking, as being developed in the wrong direction. | |
Given that context, LYNCH described the examples demonstrated as a set of | |
insular information gems--Perseus, for example, offers nicely linked | |
information, but would be very difficult to integrate with other | |
databases, that is, to link seamlessly with source files | |
from other sources. It resembles an island, and in this respect is | |
similar to numerous stand-alone projects that are based on videodiscs, | |
that is, on the single-workstation concept. | |
As scholarship evolves in a network environment, the paramount need will | |
be to link databases. We must link personal databases to public | |
databases, to group databases, in fairly seamless ways--which is | |
extremely difficult in the environments under discussion with copies of | |
databases proliferating all over the place. | |
The notion of layering also struck LYNCH as lurking in several of the | |
projects demonstrated. Several databases in a sense constitute | |
information archives without a significant amount of navigation built in. | |
Educators, critics, and others will want a layered structure--one that | |
defines or links paths through the layers to allow users to reach | |
specific points. In LYNCH's view, layering will become increasingly | |
necessary, and not just within a single resource but across resources | |
(e.g., tracing mythology and cultural themes across several classics | |
databases as well as a database of Renaissance culture). This ability to | |
organize resources, to build things out of multiple other things on the | |
network or select pieces of it, represented for LYNCH one of the key | |
aspects of network information. | |
Contending that information reuse constituted another significant issue, | |
LYNCH commended to the audience's attention Project NEEDS (i.e., National | |
Engineering Education Delivery System). This project's objective is to | |
produce a database of engineering courseware as well as the components | |
that can be used to develop new courseware. In a number of the existing | |
applications, LYNCH said, the issue of reuse (how much one can take apart | |
and reuse in other applications) was not being well considered. He also | |
raised the issue of active versus passive use, one aspect of which is | |
how much information will be manipulated locally by users. Most people, | |
he argued, may do a little browsing and then will wish to print. LYNCH | |
was uncertain how these resources would be used by the vast majority of | |
users in the network environment. | |
LYNCH next said a few words about X-Windows as a way of differentiating | |
between network access and networked information. A number of the | |
applications demonstrated at the Workshop could be rewritten to use X | |
across the network, so that one could run them from any X-capable | |
device--a workstation, an X terminal--and transact with a database across the | |
network. Although this opens up access a little, assuming one has enough | |
network capacity to handle it, it does not provide an interface to develop a | |
program that conveniently integrates information from multiple databases. | |
X is a viewing technology that has limits. In a real sense, it is just a | |
graphical version of remote log-in across the network. X-type applications | |
represent only one step in the progression towards real access. | |
LYNCH next discussed barriers to the distribution of networked multimedia | |
information. The heart of the problem is a lack of standards to provide | |
the ability for computers to talk to each other, retrieve information, | |
and shuffle it around fairly casually. At the moment, little progress is | |
being made on standards for networked information; for example, present | |
standards do not cover images, digital voice, and digital video. A | |
useful tool kit of exchange formats for basic texts is only now being | |
assembled. The synchronization of content streams (i.e., synchronizing a | |
voice track to a video track, establishing temporal relations between | |
different components in a multimedia object) constitutes another issue | |
for networked multimedia that is just beginning to receive attention. | |
Underlying network protocols also need some work; good, real-time | |
delivery protocols on the Internet do not yet exist. In LYNCH's view, | |
highly important in this context is the notion of networked digital | |
object IDs, the ability of one object on the network to point to another | |
object (or component thereof) on the network. Serious bandwidth issues | |
also exist. LYNCH was uncertain if billion-bit-per-second networks would | |
prove sufficient if numerous people ran video in parallel. | |
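
The bandwidth concern can be made concrete with simple division. The
sketch below is illustrative only: the per-stream video rates are
assumptions, not figures given by LYNCH, and the calculation ignores
protocol overhead and all other traffic on the network. Only the
billion-bit-per-second link speed comes from the talk.

```python
def max_parallel_streams(link_bits_per_s, stream_bits_per_s):
    """Whole number of constant-rate streams that fit on one link."""
    return link_bits_per_s // stream_bits_per_s


LINK = 1_000_000_000            # a billion-bit-per-second network

# Assumed per-stream rates, chosen for illustration only.
COMPRESSED_VIDEO = 1_500_000    # hypothetical compressed stream, 1.5 Mbit/s
HIGH_QUALITY_VIDEO = 45_000_000 # hypothetical high-quality stream, 45 Mbit/s

print(max_parallel_streams(LINK, COMPRESSED_VIDEO))
print(max_parallel_streams(LINK, HIGH_QUALITY_VIDEO))
```

Under these assumed rates the link carries several hundred compressed
streams but only a few dozen high-quality ones, which is consistent
with LYNCH's doubt that such a network would suffice for a large
community running video in parallel.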
LYNCH concluded by offering an issue for database creators to consider, | |
as well as several comments about what might constitute good trial | |
multimedia experiments. In a networked information world the database | |
builder or service builder (publisher) does not exercise the same | |
extensive control over the integrity of the presentation; strange | |
programs "munge" with one's data before the user sees it. Serious | |
thought must be given to what guarantees integrity of presentation. Part | |
of that is related to where one draws the boundaries around a networked | |
information service. This question of presentation integrity in | |
client-server computing has not been stressed enough in the academic | |
world, LYNCH argued, though commercial service providers deal with it | |
regularly. | |
Concerning multimedia, LYNCH observed that good multimedia at the moment | |
is hideously expensive to produce. He recommended producing multimedia | |
with either very high sale value, or multimedia with a very long life | |
span, or multimedia that will have a very broad usage base and whose | |
costs therefore can be amortized among large numbers of users. In this | |
connection, historical and humanistically oriented material may be a good | |
place to start, because it tends to have a longer life span than much of | |
the scientific material, as well as a wider user base. LYNCH noted, for | |
example, that American Memory fits many of the criteria outlined. He | |
remarked on the extensive discussion about bringing the Internet or the | |
National Research and Education Network (NREN) into the K-12 environment | |
as a way of helping the American educational system. | |
LYNCH closed by noting that the kinds of applications demonstrated struck | |
him as excellent justifications of broad-scale networking for K-12, but | |
that at this time no "killer" application exists to mobilize the K-12 | |
community to obtain connectivity. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Dearth of genuinely interesting applications on the network | |
a slow-changing situation * The issue of the integrity of presentation in | |
a networked environment * Several reasons why CD-ROM software does not | |
network * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the discussion period that followed LYNCH's presentation, several | |
additional points were made. | |
LYNCH reiterated even more strongly his contention that, historically, | |
once one goes outside high-end science and the group of those who need | |
access to supercomputers, there is a great dearth of genuinely | |
interesting applications on the network. He saw this situation changing | |
slowly, with some of the scientific databases and scholarly discussion | |
groups and electronic journals coming on as well as with the availability | |
of Wide Area Information Servers (WAIS) and some of the databases that | |
are being mounted there. However, many of those things do not seem to | |
have piqued great popular interest. For instance, most high school | |
students of LYNCH's acquaintance would not qualify as devotees of serious | |
molecular biology. | |
Concerning the issue of the integrity of presentation, LYNCH believed | |
that a couple of information providers have laid down the law at least on | |
certain things. For example, his recollection was that the National | |
Library of Medicine feels strongly that one needs to employ the | |
identifier field if he or she is to mount a database commercially. The | |
problem with a real networked environment is that one does not know who | |
is reformatting and reprocessing one's data when one enters a | |
client-server mode. It becomes anybody's guess, for example, if the network | |
uses a Z39.50 server, or what clients are doing with one's data. A data | |
provider can say that his contract will only permit clients to have | |
access to his data after he vets them and their presentation and makes | |
certain it suits him. But LYNCH held out little expectation that the | |
network marketplace would evolve in that way, because it required too | |
much prior negotiation. | |
CD-ROM software does not network for a variety of reasons, LYNCH said. | |
He speculated that CD-ROM publishers are not eager to have their products | |
really hook into wide area networks, because they fear it will make their | |
data suppliers nervous. Moreover, until relatively recently, one had to | |
be rather adroit to run a full TCP/IP stack plus applications on a | |
PC-size machine, whereas nowadays it is becoming easier as PCs grow | |
bigger and faster. LYNCH also speculated that software providers had not | |
heard from their customers until the last year or so, or had not heard | |
from enough of their customers. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
BESSER * Implications of disseminating images on the network; planning | |
the distribution of multimedia documents poses two critical | |
implementation problems * Layered approach represents the way to deal | |
with users' capabilities * Problems in platform design; file size and its | |
implications for networking * Transmission of megabyte size images | |
impractical * Compression and decompression at the user's end * Promising | |
trends for compression * A disadvantage of using X-Windows * A project at | |
the Smithsonian that mounts images on several networks * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Howard BESSER, School of Library and Information Science, University of | |
Pittsburgh, spoke primarily about multimedia, focusing on images and the | |
broad implications of disseminating them on the network. He argued that | |
planning the distribution of multimedia documents posed two critical | |
implementation problems, which he framed in the form of two questions: | |
1) What platform will one use and what hardware and software will users | |
have for viewing of the material? and 2) How can one deliver a | |
sufficiently robust set of information in an accessible format in a | |
reasonable amount of time? Depending on whether network or CD-ROM is the | |
medium used, this question raises different issues of storage, | |
compression, and transmission. | |
Concerning the design of platforms (e.g., sound, gray scale, simple | |
color, etc.) and the various capabilities users may have, BESSER | |
maintained that a layered approach was the way to deal with users' | |
capabilities. A result would be that users with less powerful | |
workstations would simply have less functionality. He urged members of | |
the audience to advocate standards and accompanying software that handle | |
layered functionality across a wide variety of platforms. | |
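The layered approach can be illustrated with a minimal sketch. It assumes
that each image is stored in several versions of increasing richness and
that the only user capability considered is display bit depth. The names
Rendition, LAYERS, and best_rendition are hypothetical, chosen for this
illustration, and do not come from BESSER's talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    """One stored version (layer) of the same image."""
    name: str
    bits_per_pixel: int


# Hypothetical layers, ordered from least to most demanding.
LAYERS = [
    Rendition("bitonal", 1),
    Rendition("gray scale", 8),
    Rendition("true color", 24),
]


def best_rendition(display_bits: int) -> Rendition:
    """Return the richest layer that the user's display can show."""
    usable = [r for r in LAYERS if r.bits_per_pixel <= display_bits]
    if not usable:
        raise ValueError("display cannot show even the simplest layer")
    return max(usable, key=lambda r: r.bits_per_pixel)


# A less powerful workstation simply receives a less demanding layer.
for depth in (1, 8, 24):
    print(depth, "->", best_rendition(depth).name)
```

Under this arrangement no user is shut out; a user with a less capable
display simply gets less functionality, as the talk suggests.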
BESSER also addressed problems in platform design, namely, deciding how | |
large a machine to design for situations when the largest number of users | |
have the lowest level of the machine, and one desires higher | |
functionality. BESSER then proceeded to the question of file size and | |
its implications for networking. He discussed still images in the main. | |
For example, a digital color image that fills the screen of a standard | |
mega-pel workstation (Sun or NeXT) will require one megabyte of storage
for an eight-bit image or three megabytes of storage for a true color or | |
twenty-four-bit image. Lossless compression algorithms (that is, | |
computational procedures in which no data is lost in the process of | |
compressing [and decompressing] an image--the exact bit-representation is | |
maintained) might bring storage down to a third of a megabyte per image, | |
but not much further than that. The question of size makes it difficult | |
to fit an appropriately sized set of these images on a single disk or to | |
transmit them quickly enough on a network. | |
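The storage figures above follow from simple arithmetic, sketched here
under two assumptions that are not stated in the talk: a "mega-pel"
screen is taken as exactly one million pixels, and a megabyte as one
million bytes. The function name raw_image_bytes is hypothetical.

```python
def raw_image_bytes(pixels: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of an image with the given pixel count."""
    return pixels * bits_per_pixel // 8


MEGAPEL = 1_000_000  # assumption: one million pixels per screen

eight_bit = raw_image_bytes(MEGAPEL, 8)    # eight-bit image
true_color = raw_image_bytes(MEGAPEL, 24)  # twenty-four-bit image

print(f"8-bit image:  {eight_bit / 1e6:.1f} MB")
print(f"24-bit image: {true_color / 1e6:.1f} MB")
```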
With these full-screen mega-pel images that constitute a third of a
megabyte, one gets 1,000-3,000 full-screen images on a one-gigabyte disk;
a standard CD-ROM represents approximately 60 percent of that. Storing
images the size of a PC screen (just eight-bit color) increases storage
capacity to 4,000-12,000 images per gigabyte; 60 percent of that gives
one the capacity of a CD-ROM, which in turn creates a major problem. One
cannot have full-screen, full-color images with lossless compression
alone; one must compress them further or use a lower resolution. For
megabyte-size images,
anything slower than a T-1 speed is impractical. For example, on a | |
fifty-six-kilobaud line, it takes three minutes to transfer a | |
one-megabyte file, if it is not compressed; and this speed assumes ideal | |
circumstances (no other user contending for network bandwidth). Thus, | |
questions of disk access, remote display, and current telephone | |
connection speed make transmission of megabyte-size images impractical. | |
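The capacity and transfer figures in this paragraph can be checked with a
short calculation. This is a rough sketch under stated assumptions:
decimal units (one megabyte as one million bytes), one bit per baud on
the fifty-six-kilobaud line, a T-1 rate of 1.544 megabits per second, and
no protocol overhead or contention. The function names are hypothetical.

```python
def images_per_disk(disk_bytes: int, image_bytes: int) -> int:
    """Whole images of a given size that fit on a disk."""
    return disk_bytes // image_bytes


def transfer_seconds(file_bytes: int, bits_per_second: int) -> float:
    """Idealized transfer time: no overhead, no competing traffic."""
    return file_bytes * 8 / bits_per_second


GIGABYTE = 1_000_000_000
CD_ROM = 600_000_000        # the talk's "approximately 60 percent"
ONE_MB = 1_000_000
THIRD_MB = ONE_MB // 3      # losslessly compressed eight-bit image

for label, disk in (("1 GB disk", GIGABYTE), ("CD-ROM", CD_ROM)):
    low = images_per_disk(disk, ONE_MB)
    high = images_per_disk(disk, THIRD_MB)
    print(f"{label}: {low} to {high} full-screen images")

for label, rate in (("56-kilobaud line", 56_000), ("T-1 line", 1_544_000)):
    secs = transfer_seconds(ONE_MB, rate)
    print(f"{label}: {secs:.0f} s for one uncompressed megabyte")
```

The idealized figure for the slower line comes to roughly two and a half
minutes, the same order as the three minutes cited in the talk; real
lines carry overhead that this sketch ignores.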
BESSER then discussed ways to deal with these large images, for example, | |
compression and decompression at the user's end. In this connection, the | |
issues of how much one is willing to lose in the compression process and | |
what image quality one needs in the first place are unknown. But what is | |
known is that compression entails some loss of data. BESSER urged that | |
more studies be conducted on image quality in different situations, for | |
example, what kind of images are needed for what kind of disciplines, and | |
what kind of image quality is needed for a browsing tool, an intermediate | |
viewing tool, and archiving. | |
BESSER noted two promising trends for compression: from a technical
perspective, algorithms that use what is called subjective redundancy | |
employ principles from visual psycho-physics to identify and remove | |
information from the image that the human eye cannot perceive; from an | |
interchange and interoperability perspective, the JPEG (i.e., Joint | |
Photographic Experts Group, an ISO standard) compression algorithms also | |
offer promise. These issues of compression and decompression, BESSER | |
argued, resembled those raised earlier concerning the design of different | |
platforms. Gauging the capabilities of potential users constitutes a | |
primary goal. BESSER advocated layering or separating the images from | |
the applications that retrieve and display them, to avoid tying them to | |
particular software. | |
BESSER detailed several lessons learned from his work at Berkeley with | |
Imagequery, especially the advantages and disadvantages of using | |
X-Windows. In the latter category, for example, retrieval is tied | |
directly to one's data, an intolerable situation in the long run on a | |
networked system. Finally, BESSER described a project of Jim Wallace at | |
the Smithsonian Institution, who is mounting images in an extremely
rudimentary way on the CompuServe and GEnie networks and is preparing to
mount them on America Online. Although the average user takes over
thirty minutes to download these images (assuming a fairly fast modem), | |
nevertheless, images have been downloaded 25,000 times. | |
BESSER concluded his talk with several comments on the business | |
arrangement between the Smithsonian and CompuServe. He contended that not
enough is known concerning the value of images. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Creating digitized photographic collections nearly | |
impossible except with large organizations like museums * Need for study | |
to determine quality of images users will tolerate * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the brief exchange between LESK and BESSER that followed, several | |
clarifications emerged. | |
LESK argued that the photographers were far ahead of BESSER: It is | |
almost impossible to create such digitized photographic collections | |
except with large organizations like museums, because all the | |
photographic agencies have been going crazy about this and will not sign | |
licensing agreements on any sort of reasonable terms. LESK had heard | |
that National Geographic, for example, had tried to buy the right to use | |
some image in some kind of educational production for $100 per image, but | |
the photographers will not touch it. They want accounting and payment | |
for each use, which cannot be accomplished within the system. BESSER | |
responded that a consortium of photographers, headed by a former National | |
Geographic photographer, had started assembling its own collection of | |
electronic reproductions of images, with the money going back to the | |
cooperative. | |
LESK contended that BESSER was unnecessarily pessimistic about multimedia | |
images, because people are accustomed to low-quality images, particularly | |
from video. BESSER urged the launching of a study to determine what | |
users would tolerate, what they would feel comfortable with, and what | |
absolutely is the highest quality they would ever need. Conceding that | |
he had adopted a dire tone in order to arouse people about the issue, | |
BESSER closed on a sanguine note by saying that he would not be in this | |
business if he did not think that things could be accomplished. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
LARSEN * Issues of scalability and modularity * Geometric growth of the | |
Internet and the role played by layering * Basic functions sustaining | |
this growth * A library's roles and functions in a network environment * | |
Effects of implementation of the Z39.50 protocol for information | |
retrieval on the library system * The trade-off between volumes of data | |
and its potential usage * A snapshot of current trends * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Ronald LARSEN, associate director for information technology, University | |
of Maryland at College Park, first addressed the issues of scalability | |
and modularity. He noted the difficulty of anticipating the effects of | |
orders-of-magnitude growth, reflecting on the twenty years of experience | |
with the Arpanet and Internet. Recalling the day's demonstrations of | |
CD-ROM and optical disk material, he went on to ask if the field has yet | |
learned how to scale new systems to enable delivery and dissemination | |
across large-scale networks. | |
LARSEN focused on the geometric growth of the Internet from its inception | |
circa 1969 to the present, and the adjustments required to respond to | |
that rapid growth. To illustrate the issue of scalability, LARSEN | |
considered computer networks as including three generic components: | |
computers, network communication nodes, and communication media. Each | |
component scales (e.g., computers range from PCs to supercomputers; | |
network nodes scale from interface cards in a PC through sophisticated | |
routers and gateways; and communication media range from 2,400-baud | |
dial-up facilities through 4.5-Mbps backbone links, and eventually to | |
multigigabit-per-second communication lines), and architecturally, the | |
components are organized to scale hierarchically from local area networks | |
to international-scale networks. Such growth is made possible by | |
building layers of communication protocols, as BESSER pointed out. | |
By layering both physically and logically, a sense of scalability is | |
maintained from local area networks in offices, across campuses, through | |
bridges, routers, campus backbones, fiber-optic links, etc., up into | |
regional networks and ultimately into national and international | |
networks. | |
LARSEN then illustrated the geometric growth over a two-year period-- | |
through September 1991--of the number of networks that comprise the | |
Internet. This growth has been sustained largely by the availability of | |
three basic functions: electronic mail, file transfer (ftp), and remote | |
log-on (telnet). LARSEN also reviewed the growth in the kind of traffic | |
that occurs on the network. Network traffic reflects the joint contributions | |
of a larger population of users and increasing use per user. Today one sees | |
serious applications involving moving images across the network--a rarity | |
ten years ago. LARSEN recalled and concurred with BESSER's main point | |
that the interesting problems occur at the application level. | |
LARSEN then illustrated a model of a library's roles and functions in a | |
network environment. He noted, in particular, the placement of on-line | |
catalogues onto the network and patrons obtaining access to the library | |
increasingly through local networks, campus networks, and the Internet. | |
LARSEN supported LYNCH's earlier suggestion that we need to address | |
fundamental questions of networked information in order to build | |
environments that scale in the information sense as well as in the | |
physical sense. | |
LARSEN supported the role of the library system as the access point into | |
the nation's electronic collections. Implementation of the Z39.50 | |
protocol for information retrieval would make such access practical and | |
feasible. For example, this would enable patrons in Maryland to search | |
California libraries, or other libraries around the world that are | |
conformant with Z39.50 in a manner that is familiar to University of | |
Maryland patrons. This client-server model also supports moving beyond | |
secondary content into primary content. (The notion of how one links | |
from secondary content to primary content, LARSEN said, represents a | |
fundamental problem that requires rigorous thought.) After noting | |
numerous network experiments in accessing full-text materials, including | |
projects supporting the ordering of materials across the network, LARSEN | |
revisited the issue of transmitting high-density, high-resolution color | |
images across the network and the large amounts of bandwidth they | |
require. He went on to address the bandwidth and synchronization | |
problems inherent in sending full-motion video across the network. | |
LARSEN illustrated the trade-off between volumes of data in bytes or | |
orders of magnitude and the potential usage of that data. He discussed | |
transmission rates (particularly, the time it takes to move various forms | |
of information), and what one could do with a network supporting | |
multigigabit-per-second transmission. At the moment, the network | |
environment includes a composite of data-transmission requirements, | |
volumes and forms, going from steady to bursty (high-volume) and from | |
very slow to very fast. This aggregate must be considered in the design, | |
construction, and operation of multigigabit-per-second networks.
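The relation between data volume and transmission rate can be made
concrete with a small table. This is an illustrative sketch only: the
2,400-baud and fifty-six-kilobaud rates are mentioned elsewhere in these
proceedings, the one-gigabit rate is an assumed stand-in for a
"multigigabit-per-second" network, one bit per baud is assumed, and the
payload sizes are BESSER's one-megabyte and three-megabyte images.

```python
def seconds_to_send(n_bytes: int, bits_per_second: int) -> float:
    """Idealized time to move n_bytes over a link of the given rate."""
    return n_bytes * 8 / bits_per_second


RATES = {
    "2,400-baud dial-up": 2_400,
    "56-kilobaud line": 56_000,
    "1 Gbps (assumed)": 1_000_000_000,
}
PAYLOADS = {
    "1 MB image": 1_000_000,
    "3 MB image": 3_000_000,
}

for payload, size in PAYLOADS.items():
    for link, rate in RATES.items():
        secs = seconds_to_send(size, rate)
        print(f"{payload:10s} over {link:20s}: {secs:12.3f} s")
```

The spread across rows runs from hours to milliseconds, which is the
range of behavior a single network design has to accommodate.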
LARSEN's objective is to use the networks and library systems now being | |
constructed to increase access to resources wherever they exist, and | |
thus, to evolve toward an on-line electronic virtual library. | |
LARSEN concluded by offering a snapshot of current trends: continuing | |
geometric growth in network capacity and number of users; slower | |
development of applications; and glacial development and adoption of | |
standards. The challenge is to design and develop each new application | |
system with network access and scalability in mind. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
BROWNRIGG * Access to the Internet cannot be taken for granted * Packet | |
radio and the development of MELVYL in 1980-81 in the Division of Library | |
Automation at the University of California * Design criteria for packet | |
radio * A demonstration project in San Diego and future plans * Spread | |
spectrum * Frequencies at which the radios will run and plans to | |
reimplement the WAIS server software in the public domain * Need for an | |
infrastructure of radios that do not move around * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Edwin BROWNRIGG, executive director, Memex Research Institute, first | |
polled the audience in order to seek out regular users of the Internet as | |
well as those planning to use it some time in the future. With nearly | |
everybody in the room falling into one category or the other, BROWNRIGG | |
made a point regarding access, namely that numerous individuals, especially those
who use the Internet every day, take for granted their access to it, the | |
speeds with which they are connected, and how well it all works. | |
However, as BROWNRIGG discovered between 1987 and 1989 in Australia, | |
if one wants access to the Internet but cannot afford it or has some | |
physical boundary that prevents her or him from gaining access, it can | |
be extremely frustrating. He suggested that because of economics and | |
physical barriers we were beginning to create a world of haves and have-nots | |
in the process of scholarly communication, even in the United States. | |
BROWNRIGG detailed the development of MELVYL in academic year 1980-81 in | |
the Division of Library Automation at the University of California, in | |
order to underscore the issue of access to the system, which at the | |
outset was extremely limited. In short, the project needed to build a | |
network, which at that time entailed use of satellite technology, that is, | |
putting earth stations on campus and also acquiring some terrestrial links | |
from the State of California's microwave system. The installation of | |
satellite links, however, did not solve the problem (which actually | |
formed part of a larger problem involving politics and financial resources). | |
For while the project team could get a signal onto a campus, it had no means | |
of distributing the signal throughout the campus. The solution involved | |
adopting a recent development in wireless communication called packet radio, | |
which combined the basic notion of packet-switching with radio. The project | |
used this technology to get the signal from a point on campus where it | |
came down, an earth station for example, into the libraries, because it | |
found that wiring the libraries, especially the older marble buildings, | |
would cost $2,000-$5,000 per terminal. | |
BROWNRIGG noted that, ten years ago, the project had neither the public | |
policy nor the technology that would have allowed it to use packet radio | |
in any meaningful way. Since then much had changed. He proceeded to | |
detail research and development of the technology, how it is being | |
deployed in California, and what direction he thought it would take. | |
The design criteria are to produce a high-speed, one-time, low-cost, | |
high-quality, secure, license-free device (packet radio) that one can | |
plug in and play today, forget about it, and have access to the Internet. | |
By high speed, BROWNRIGG meant 1 megabit and 1.5 megabits per second. Those units
have been built, he continued, and are in the process of being | |
type-certified by an independent underwriting laboratory so that they can | |
be type-licensed by the Federal Communications Commission. As is the | |
case with citizens band, one will be able to purchase a unit and not have | |
to worry about applying for a license. | |
The basic idea, BROWNRIGG elaborated, is to take high-speed radio data | |
transmission and create a backbone network that at certain strategic | |
points in the network will "gateway" into a medium-speed packet radio | |
(i.e., one that runs at 38.4 kilobits per second), so that perhaps by
1994-1995 people like those in the audience could, for the price of a
VCR, purchase
a medium-speed radio for the office or home, have full network connectivity | |
to the Internet, and partake of all its services, with no need for an FCC | |
license and no regular bill from the local common carrier. BROWNRIGG | |
presented several details of a demonstration project currently taking | |
place in San Diego and described plans, pending funding, to install a | |
full-bore network in the San Francisco area. This network will have 600 | |
nodes running at backbone speeds, and 100 of these nodes will be libraries, | |
which in turn will be the gateway ports to the 38.4-kilobit radios that
will give coverage for the neighborhoods surrounding the libraries. | |
BROWNRIGG next explained Part 15.247, a new rule within Title 47 of the | |
Code of Federal Regulations enacted by the FCC in 1985. This rule | |
challenged the industry, which has only now risen to the occasion, to | |
build a radio that would run at no more than one watt of output power and | |
use a fairly exotic method of modulating the radio wave called spread | |
spectrum. Spread spectrum in fact permits the building of networks so | |
that numerous data communications can occur simultaneously, without | |
interfering with each other, within the same wide radio channel. | |
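How several transmissions can share one wide channel may be easier to see
in a toy example. The sketch below uses direct-sequence spreading, one
common spread-spectrum technique; the proceedings do not say which
technique the project's radios use. The two short codes are hypothetical
and perfectly synchronized, whereas practical radios use much longer
pseudo-noise sequences and must cope with noise and timing error.

```python
# Two orthogonal spreading codes (their element-wise products sum to zero).
CODE_A = [+1, +1, -1, -1]
CODE_B = [+1, -1, +1, -1]


def spread(bits, code):
    """Replace each data bit with the code (bit 1) or its negation (bit 0)."""
    chips = []
    for bit in bits:
        sign = 1 if bit else -1
        chips.extend(sign * c for c in code)
    return chips


def despread(signal, code):
    """Recover bits by correlating each code-length slice with the code."""
    n = len(code)
    bits = []
    for start in range(0, len(signal), n):
        window = signal[start:start + n]
        correlation = sum(s * c for s, c in zip(window, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]

# Both senders transmit at once; the channel carries the sum of the chips.
channel = [a + b for a, b in zip(spread(bits_a, CODE_A),
                                 spread(bits_b, CODE_B))]

print(despread(channel, CODE_A) == bits_a)
print(despread(channel, CODE_B) == bits_b)
```

Each receiver correlates the shared signal against its own code, so the
other sender's contribution cancels out rather than interfering.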
BROWNRIGG explained that the frequencies at which the radios would run | |
are very short wave signals. They are well above standard microwave and | |
radar. With a radio wave that small, one watt becomes a tremendous punch | |
per bit and thus makes transmission at reasonable speed possible. In | |
order to minimize the potential for congestion, the project is | |
undertaking to reimplement software which has been available in the | |
networking business and is taken for granted now, for example, TCP/IP, | |
routing algorithms, bridges, and gateways. In addition, the project | |
plans to take the WAIS server software in the public domain and | |
reimplement it so that one can have a WAIS server on a Mac instead of a | |
Unix machine. The Memex Research Institute believes that libraries, in | |
particular, will want to use the WAIS servers with packet radio. This | |
project, which has a team of about twelve people, will run through 1993 | |
and will include the 100 libraries already mentioned as well as other | |
professionals such as those in the medical profession, engineering, and | |
law. Thus, the need is to create an infrastructure of radios that do not | |
move around, which, BROWNRIGG hopes, will solve a problem not only for | |
libraries but for individuals who, by and large today, do not have access | |
to the Internet from their homes and offices. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Project operating frequencies * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During a brief discussion period, which also concluded the day's | |
proceedings, BROWNRIGG stated that the project was operating on four
frequencies. The slow-speed radios operate at 435 megahertz and would
later go up to 920 megahertz. Of the high-speed radios, the one-megabit
units will run at 2.4 gigahertz and the 1.5-megabit units at 5.7.
At 5.7, rain can be a factor, but it would have to be tropical rain, | |
unlike what falls in most parts of the United States. | |
****** | |
SESSION IV. IMAGE CAPTURE, TEXT CAPTURE, OVERVIEW OF TEXT AND | |
IMAGE STORAGE FORMATS | |
William HOOTON, vice president of operations, I-NET, moderated this session. | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
KENNEY * Factors influencing development of CXP * Advantages of using | |
digital technology versus photocopy and microfilm * A primary goal of | |
CXP; publishing challenges * Characteristics of copies printed * Quality | |
of samples achieved in image capture * Several factors to be considered | |
in choosing scanning * Emphasis of CXP on timely and cost-effective | |
production of black-and-white printed facsimiles * Results of producing | |
microfilm from digital files * Advantages of creating microfilm * Details | |
concerning production * Costs * Role of digital technology in library | |
preservation * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Anne KENNEY, associate director, Department of Preservation and | |
Conservation, Cornell University, opened her talk by observing that the | |
Cornell Xerox Project (CXP) has been guided by the assumption that the | |
ability to produce printed facsimiles or to replace paper with paper | |
would be important, at least for the present generation of users and | |
equipment. She described three factors that influenced development of | |
the project: 1) Because the project has emphasized the preservation of | |
deteriorating brittle books, the quality of what was produced had to be | |
sufficiently high to return a paper replacement to the shelf. 2) CXP was
interested only in a system that was cost-effective, which meant that it
had to be cost-competitive with the processes currently available,
principally photocopy and microfilm. 3) CXP was interested only in using
new or currently available product hardware and software.
KENNEY described the advantages that using digital technology offers over | |
both photocopy and microfilm: 1) The potential exists to create a higher | |
quality reproduction of a deteriorating original than conventional | |
light-lens technology. 2) Because a digital image is an encoded | |
representation, it can be reproduced again and again with no resulting | |
loss of quality, as opposed to the situation with light-lens processes, | |
in which there is a discernible difference between a second and a
subsequent generation of an image. 3) A digital image can be manipulated | |
in a number of ways to improve image capture; for example, Xerox has | |
developed a windowing application that enables one to capture a page | |
containing both text and illustrations in a manner that optimizes the | |
reproduction of both. (With light-lens technology, one must choose which | |
to optimize, text or the illustration; in preservation microfilming, the | |
current practice is to shoot an illustrated page twice, once to highlight | |
the text and the second time to provide the best capture for the | |
illustration.) 4) A digital image can also be edited: density levels can
be adjusted to remove underlining and stains and to increase the
legibility of faint documents. 5) On-screen inspection can take place at
the time of
initial setup and adjustments made prior to scanning, factors that | |
substantially reduce the number of retakes required in quality control. | |
A primary goal of CXP has been to evaluate the paper output printed on | |
the Xerox DocuTech, a high-speed printer that produces 600-dpi pages from | |
scanned images at a rate of 135 pages a minute. KENNEY recounted several
challenges in producing faithful and legible reproductions of the
originals, challenges that the 600-dpi copy for the most part met
successfully. For example, many of the deteriorating volumes in the project
were heavily illustrated with fine line drawings or halftones or came in | |
languages such as Japanese, in which the buildup of characters composed
of varying strokes is difficult to reproduce at lower resolutions; a | |
surprising number of them came with annotations and mathematical | |
formulas, which it was critical to be able to duplicate exactly. | |
KENNEY noted that 1) the copies are being printed on paper that meets the | |
ANSI standards for performance, 2) the DocuTech printer meets the machine | |
and toner requirements for proper adhesion of print to page, as described | |
by the National Archives, and thus 3) the paper product is considered to be
the archival equivalent of preservation photocopy. | |
KENNEY then discussed several samples of the quality achieved in the | |
project that had been distributed in a handout, for example, a copy of a | |
print-on-demand version of the 1911 Reed lecture on the steam turbine, | |
which contains halftones, line drawings, and illustrations embedded in | |
text; the first four loose pages in the volume compared the capture | |
capabilities of scanning to photocopy for a standard test target, the | |
IEEE Standard 167A-1987 test chart. In all instances scanning proved
superior to photocopy, though only slightly more so in one. | |
Conceding the simplistic nature of her comparison of scanning quality with
photocopy quality, KENNEY described it as one representation of the kinds of
settings that could be used with scanning capabilities on the equipment | |
CXP uses. KENNEY also pointed out that CXP investigated the quality | |
achieved with binary scanning only, and noted the great promise in gray | |
scale and color scanning, whose advantages and disadvantages need to be | |
examined. She argued further that scanning resolutions and file formats
can represent a complex trade-off among the time it takes to capture
material, file size, fidelity to the original, on-screen display,
printing, and equipment availability. All these factors must be taken
into consideration.
CXP placed primary emphasis on the production in a timely and | |
cost-effective manner of printed facsimiles that consisted largely of | |
black-and-white text. With binary scanning, large files may be | |
compressed efficiently and in a lossless manner (i.e., no data is lost in | |
the process of compressing [and decompressing] an image--the exact | |
bit-representation is maintained) using Group 4 CCITT (i.e., the French | |
acronym for International Consultative Committee for Telegraph and | |
Telephone) compression. CXP was getting compression ratios of about | |
forty to one. Gray-scale compression, which primarily uses JPEG, is much | |
less economical and can represent a lossy compression (i.e., not | |
lossless), so that as one compresses and decompresses, the illustration | |
is subtly changed. While binary files produce a high-quality printed | |
version, it appears 1) that other combinations of spatial resolution with | |
gray and/or color hold great promise as well, and 2) that gray scale can | |
represent a tremendous advantage for on-screen viewing. The quality | |
associated with binary and gray scale also depends on the equipment used. | |
For instance, binary scanning produces a much better copy on a binary | |
printer. | |
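The effect of a forty-to-one ratio on a 600-dpi binary page can be
estimated as follows. This is a back-of-the-envelope sketch: the page
size of 8.5 by 11 inches is an assumption made for illustration (the
talk does not give page dimensions), and the function name
raw_bitonal_bytes is hypothetical.

```python
def raw_bitonal_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a one-bit-per-pixel scan of a page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px // 8  # eight pixels per byte


RATIO = 40  # "about forty to one," as reported for Group 4 compression

raw = raw_bitonal_bytes(8.5, 11, 600)  # assumed page size
compressed = raw / RATIO

print(f"raw page:        {raw / 1e6:.2f} MB")
print(f"after about 40:1: {compressed / 1e3:.0f} KB")
```

On these assumptions a page of several megabytes shrinks to roughly a
hundred kilobytes, with the exact bit representation still recoverable.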
Among CXP's findings concerning the production of microfilm from digital | |
files, KENNEY reported that the digital files for the same Reed lecture | |
were used to produce sample film using an electron beam recorder. The | |
resulting film was faithful to the image capture of the digital files, | |
and while CXP felt that the text and image pages represented in the Reed
lecture were superior to those of the light-lens film, the resolution
readings for the 600-dpi film were not as high as those for standard
microfilming.
KENNEY argued that the standards defined for light-lens technology are | |
not totally transferable to a digital environment. Moreover, they are | |
based on a definition of quality for a preservation copy. Although making
this case will prove to be a long, uphill struggle, CXP plans to continue | |
to investigate the issue over the course of the next year. | |
KENNEY concluded this portion of her talk with a discussion of the | |
advantages of creating film: it can serve as a primary backup and as a | |
preservation master to the digital file; it could then become the print | |
or production master and service copies could be paper, film, optical | |
disks, magnetic media, or on-screen display. | |
Finally, KENNEY presented details regarding production:
* Development and testing of a moderately-high resolution production | |
scanning workstation represented a third goal of CXP; to date, 1,000 | |
volumes have been scanned, or about 300,000 images. | |
* The resulting digital files are stored and used to produce | |
hard-copy replacements for the originals and additional prints on | |
demand; although the initial costs are high, scanning technology | |
offers an affordable means for reformatting brittle material. | |
* A technician in production mode can scan 300 pages per hour when | |
performing single-sheet scanning, which is a necessity when working | |
with truly brittle paper; this figure is expected to increase | |
significantly with subsequent iterations of the software from Xerox; | |
a three-month time-and-cost study of scanning found that the average | |
300-page book would take about an hour and forty minutes to scan | |
(this figure included the time for setup, which involves keying in | |
primary bibliographic data, going into quality control mode to | |
define page size, establishing front-to-back registration, and | |
scanning sample pages to identify a default range of settings for | |
the entire book--functions not dissimilar to those performed by | |
filmers or those preparing a book for photocopy). | |
* The final step in the scanning process involved rescans, which | |
happily were few and far between, representing well under 1 percent | |
of the total pages scanned. | |
In addition to technician time, CXP costed out equipment (amortized over | |
four years), the cost of storing and refreshing the digital files every | |
four years, and the cost of printing and binding, in book cloth, a paper | |
reproduction. The total amounted to a little under $65 per single | |
300-page volume, with 30 percent overhead included--a figure competitive | |
with the prices currently charged by photocopy vendors. | |
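
As a rough check on the production figures reported above, the following
Python sketch works through the arithmetic. It uses only numbers from
KENNEY's report (300 pages per hour, an average 300-page book scanned in
about an hour and forty minutes, and roughly $65 per volume). The
function names are illustrative, and the assumption that raw scanning
proceeds at the full 300 pages per hour is made here for the sake of the
example; it is not a figure CXP reported.

```python
def non_scanning_minutes(pages, pages_per_hour, total_minutes):
    """Minutes of the reported total left over after raw page scanning.

    Assumes the technician sustains pages_per_hour while actually scanning.
    """
    scanning_minutes = pages / pages_per_hour * 60
    return total_minutes - scanning_minutes


def cost_per_page(cost_per_volume, pages):
    """Average cost of one page, given the all-in cost of one volume."""
    return cost_per_volume / pages


if __name__ == "__main__":
    # Figures from the report: 300-page book, 300 pages/hour, 1 h 40 min total.
    setup = non_scanning_minutes(pages=300, pages_per_hour=300, total_minutes=100)
    print(f"Setup and other non-scanning time: {setup:.0f} minutes")
    # "A little under $65" is rounded up to 65.0 for the estimate.
    print(f"Approximate cost per page: ${cost_per_page(65.0, 300):.2f}")
```

Under these assumptions, roughly two-fifths of the time spent on an
average book goes to setup rather than to scanning itself, which is
consistent with the expectation that software improvements would raise
throughput.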
Of course, with scanning, in addition to the paper facsimile, one is left | |
with a digital file from which subsequent copies of the book can be | |
produced for a fraction of the cost of photocopy, with readers afforded | |
choices in the form of these copies. | |
KENNEY concluded that digital technology offers an electronic means for a | |
library preservation effort to pay for itself. If a brittle-book program | |
included the means of disseminating reprints of books that are in demand | |
by libraries and researchers alike, the initial investment in capture | |
could be recovered and used to preserve additional but less popular | |
books. She disclosed that an economic model for a self-sustaining | |
program could be developed for CXP's report to the Commission on | |
Preservation and Access (CPA). | |
KENNEY stressed that the focus of CXP has been on obtaining high quality | |
in a production environment. The use of digital technology is viewed as | |
an affordable alternative to other reformatting options. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
ANDRE * Overview and history of NATDP * Various agricultural CD-ROM | |
products created inhouse and by service bureaus * Pilot project on | |
Internet transmission * Additional products in progress * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Pamela ANDRE, associate director for automation, National Agricultural | |
Text Digitizing Program (NATDP), National Agricultural Library (NAL), | |
presented an overview of NATDP, which has been underway at NAL for the | |
last four years, before Judith ZIDAR discussed the technical details. ANDRE | |
defined agricultural information as a broad range of material going from | |
basic and applied research in the hard sciences to the one-page pamphlets | |
that are distributed by the cooperative state extension services on such | |
things as how to grow blueberries. | |
NATDP began in late 1986 with a meeting of representatives from the | |
land-grant library community to deal with the issue of electronic | |
information. NAL and forty-five of these libraries banded together to | |
establish this project--to evaluate the technology for converting what | |
were then source documents in paper form into electronic form, to provide | |
access to that digital information, and then to distribute it. | |
Distributing that material to the community--the university community as | |
well as the extension service community, potentially down to the county | |
level--constituted the group's chief concern. | |
Since January 1988 (when the microcomputer-based scanning system was | |
installed at NAL), NATDP has done a variety of things, concerning which | |
ZIDAR would provide further details. For example, the first technology | |
considered in the project's discussion phase was digital videodisc, which | |
indicates how long ago it was conceived. | |
Over the four years of this project, four separate CD-ROM products on | |
four different agricultural topics were created, two at a | |
scanning-and-OCR station installed at NAL, and two by service bureaus. | |
Thus, NATDP has gained comparative information in terms of those relative | |
costs. Each of these products contained the full ASCII text as well as | |
page images of the material, or between 4,000 and 6,000 pages of material | |
on these disks. Topics included aquaculture, food, agriculture and | |
science (i.e., international agriculture and research), acid rain, and | |
Agent Orange, which was the final product distributed (approximately | |
eighteen months before the Workshop). | |
The third phase of NATDP focused on delivery mechanisms other than | |
CD-ROM. At the suggestion of Clifford LYNCH, who was a technical | |
consultant to the project at this point, NATDP became involved with the | |
Internet and initiated a project with the help of North Carolina State | |
University, in which fourteen of the land-grant university libraries are | |
transmitting digital images over the Internet in response to interlibrary | |
loan requests--a topic for another meeting. At this point, the pilot | |
project had been completed for about a year and the final report would be | |
available shortly after the Workshop. In the meantime, the project's | |
success had led to its extension. (ANDRE noted that one of the first | |
things done under the program title was to select a retrieval package to | |
use with subsequent products; Windows Personal Librarian was the package | |
of choice after a lengthy evaluation.) | |
Three additional products had been planned and were in progress: | |
1) An arrangement with the American Society of Agronomy--a | |
professional society that has published the Agronomy Journal since | |
about 1908--to scan and create bit-mapped images of its journal. | |
ASA granted permission first to put and then to distribute this | |
material in electronic form, to hold it at NAL, and to use these | |
electronic images as a mechanism to deliver documents or print out | |
material for patrons, among other uses. Effectively, NAL has the | |
right to use this material in support of its program. | |
(Significantly, this arrangement offers a potential cooperative | |
model for working with other professional societies in agriculture | |
to try to do the same thing--put the journals of particular interest | |
to agriculture research into electronic form.) | |
2) An extension of the earlier product on aquaculture. | |
3) The George Washington Carver Papers--a joint project with | |
Tuskegee University to scan and convert from microfilm some 3,500 | |
images of Carver's papers, letters, and drawings. | |
It was anticipated that all of these products would appear no more than | |
six months after the Workshop. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
ZIDAR * (A separate arena for scanning) * Steps in creating a database * | |
Image capture, with and without performing OCR * Keying in tracking data | |
* Scanning, with electronic and manual tracking * Adjustments during | |
scanning process * Scanning resolutions * Compression * De-skewing and | |
filtering * Image capture from microform: the papers and letters of | |
George Washington Carver * Equipment used for a scanning system * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program | |
(NATDP), National Agricultural Library (NAL), illustrated the technical | |
details of NATDP, including her primary responsibility, scanning and | |
creating databases on a topic and putting them on CD-ROM. | |
(ZIDAR remarked on a separate arena from the CD-ROM projects, although the | |
processing of the material is nearly identical, in which NATDP is also | |
scanning material and loading it on a NeXT microcomputer, which in turn | |
is linked to NAL's integrated library system. Thus, searches in NAL's | |
bibliographic database will enable people to pull up actual page images | |
and text for any documents that have been entered.) | |
In accordance with the session's topic, ZIDAR focused her illustrated | |
talk on image capture, offering a primer on the three main steps in the | |
process: 1) assemble the printed publications; 2) design the database | |
(database design occurs in the process of preparing the material for | |
scanning; this step entails reviewing and organizing the material, | |
defining the contents--what will constitute a record, what kinds of | |
fields will be captured in terms of author, title, etc.); 3) perform a | |
certain amount of markup on the paper publications. NAL performs this | |
task record by record, preparing work sheets or some other sort of | |
tracking material and designing descriptors and other enhancements to be | |
added to the data that will not be captured from the printed publication. | |
Part of this process also involves determining NATDP's file and directory | |
structure: NATDP attempts to avoid putting more than approximately 100 | |
images in a directory, because placing more than that on a CD-ROM would | |
reduce the access speed. | |
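
The directory rule just described (no more than about 100 images per
directory) can be sketched in Python as follows. The naming scheme
(DIR0000, IMG00000.TIF) and the function names are hypothetical and
chosen only for illustration; the source does not say how NATDP actually
named its files or directories.

```python
def directory_for(image_index, per_dir=100):
    """Name of the directory that holds the image with this 0-based index."""
    return f"DIR{image_index // per_dir:04d}"


def layout(total_images, per_dir=100):
    """Map each directory name to the list of image files placed in it."""
    dirs = {}
    for i in range(total_images):
        dirs.setdefault(directory_for(i, per_dir), []).append(f"IMG{i:05d}.TIF")
    return dirs


if __name__ == "__main__":
    plan = layout(6000)  # a database at the low end of the 6,000-7,000 range
    print(len(plan), "directories")
    print(max(len(files) for files in plan.values()), "images at most in each")
```

A 6,000-page database laid out this way needs 60 directories, none of
which exceeds the 100-image limit.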
This up-front process takes approximately two weeks for a | |
6,000-7,000-page database. The next step is to capture the page images. | |
How long this process takes is determined by the decision whether or not | |
to perform OCR. Not performing OCR speeds the process, whereas text | |
capture requires greater care because of the quality of the image: it | |
has to be straighter and allowance must be made for text on a page, not | |
just for the capture of photographs. | |
NATDP keys in tracking data, that is, a standard bibliographic record | |
including the title of the book and the title of the chapter, which will | |
later either become the access information or will be attached to the | |
front of a full-text record so that it is searchable. | |
Images are scanned from a bound or unbound publication, chiefly from | |
bound publications in the case of NATDP, however, because often they are | |
the only copies and the publications are returned to the shelves. NATDP | |
usually scans one record at a time, because its database tracking system | |
tracks the document in that way and does not require further logical | |
separating of the images. After performing optical character | |
recognition, NATDP moves the images off the hard disk and maintains a | |
volume sheet. Though the system tracks electronically, all the | |
processing steps are also tracked manually with a log sheet. | |
ZIDAR next illustrated the kinds of adjustments that one can make when | |
scanning from paper and microfilm, for example, redoing images that need | |
special handling, setting for dithering or gray scale, and adjusting for | |
brightness or for the whole book at one time. | |
NATDP is scanning at 300 dots per inch, a standard scanning resolution. | |
Though adequate for capturing text that is all of a standard size, 300 | |
dpi is unsuitable for any kind of photographic material or for very small | |
text. Many scanners allow for different image formats, TIFF, of course, | |
being a de facto standard. But if one intends to exchange images with | |
other people, the ability to scan other image formats, even if they are | |
less common, becomes highly desirable. | |
CCITT Group 4 is the standard compression for normal black-and-white | |
images, JPEG for gray scale or color. ZIDAR recommended 1) using the | |
standard compressions, particularly if one attempts to make material | |
available and to allow users to download images and reuse them from | |
CD-ROMs; and 2) maintaining the ability to output an uncompressed image, | |
because in image exchange uncompressed images are more likely to be able | |
to cross platforms. | |
ZIDAR emphasized the importance of de-skewing and filtering as | |
requirements on NATDP's upgraded system. For instance, scanning bound | |
books, particularly books published by the federal government whose pages | |
are skewed, and trying to scan them straight if OCR is to be performed, | |
is extremely time-consuming. The same holds for filtering of | |
poor-quality or older materials. | |
ZIDAR described image capture from microform, using as an example three | |
reels from a sixty-seven-reel set of the papers and letters of George | |
Washington Carver that had been produced by Tuskegee University. These | |
resulted in approximately 3,500 images, which NATDP had had scanned by | |
its service contractor, Science Applications International Corporation | |
(SAIC). NATDP also created bibliographic records for access. (NATDP did | |
not have such specialized equipment as a microfilm scanner.) | |
Unfortunately, the process of scanning from microfilm was not an | |
unqualified success, ZIDAR reported: because microfilm frame sizes vary, | |
occasionally some frames were missed, which without spending much time | |
and money could not be recaptured. | |
OCR could not be performed from the scanned images of the frames. Because | |
of bleeding in the text, running OCR simply produced output that could not | |
even be edited. NATDP tested for negative versus positive images, | |
landscape versus portrait orientation, and single- versus dual-page | |
microfilm, none of which seemed to affect the quality of the image; but | |
also on none of them could OCR be performed. | |
In selecting the microfilm they would use, therefore, NATDP had other | |
factors in mind. ZIDAR noted two factors that influenced the quality of | |
the images: 1) the inherent quality of the original and 2) the amount of | |
size reduction on the pages. | |
The Carver papers were selected because they are informative and visually | |
interesting, treat a single subject, and are valuable in their own right. | |
The images were scanned and divided into logical records by SAIC, then | |
delivered, and loaded onto NATDP's system, where bibliographic | |
information taken directly from the images was added. Scanning was | |
completed in summer 1991 and by the end of summer 1992 the disk was | |
scheduled to be published. | |
Problems encountered during processing included the following: Because | |
the microfilm scanning had to be done in a batch, adjustment for | |
individual page variations was not possible. The frame size varied on | |
account of the nature of the material, and therefore some of the frames | |
were missed while others were just partial frames. The only way to go | |
back and capture this material was to print out the page with the | |
microfilm reader from the missing frame and then scan it in from the | |
page, which was extremely time-consuming. The quality of the images | |
scanned from the printout of the microfilm compared unfavorably with that | |
of the original images captured directly from the microfilm. The | |
inability to perform OCR also was a major disappointment. At the time, | |
computer output microfilm was unavailable to test. | |
The equipment used for a scanning system was the last topic addressed by | |
ZIDAR. The type of equipment that one would purchase for a scanning | |
system included: a microcomputer, at least a 386, but preferably a 486; | |
a large hard disk, 380 megabyte at minimum; a multi-tasking operating | |
system that allows one to run some things in batch in the background | |
while scanning or doing text editing, for example, Unix or OS/2 and, | |
theoretically, Windows; a high-speed scanner and scanning software that | |
allows one to make the various adjustments mentioned earlier; a | |
high-resolution monitor (150 dpi); OCR software and hardware to perform | |
text recognition; an optical disk subsystem on which to archive all the | |
images as the processing is done; file management and tracking software. | |
ZIDAR opined that the software one purchases was more important than the | |
hardware and might also cost more than the hardware, but it was likely to | |
prove critical to the success or failure of one's system. In addition to | |
a stand-alone scanning workstation for image capture, then, text capture | |
requires one or two editing stations networked to this scanning station | |
to perform editing. Editing the text takes two or three times as long as | |
capturing the images. | |
Finally, ZIDAR stressed the importance of buying an open system that allows | |
for more than one vendor, complies with standards, and can be upgraded. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
WATERS * Yale University Library's master plan to convert microfilm to | |
digital imagery (POB) * The place of electronic tools in the library of | |
the future * The uses of images and an image library * Primary input from | |
preservation microfilm * Features distinguishing POB from CXP and key | |
hypotheses guiding POB * Use of vendor selection process to facilitate | |
organizational work * Criteria for selecting vendor * Finalists and | |
results of process for Yale * Key factor distinguishing vendors * | |
Components, design principles, and some estimated costs of POB * Role of | |
preservation materials in developing imaging market * Factors affecting | |
quality and cost * Factors affecting the usability of complex documents | |
in image form * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Donald WATERS, head of the Systems Office, Yale University Library, | |
reported on the progress of a master plan for a project at Yale to | |
convert microfilm to digital imagery, Project Open Book (POB). Stating | |
that POB was in an advanced stage of planning, WATERS detailed, in | |
particular, the process of selecting a vendor partner and several key | |
issues under discussion as Yale prepares to move into the project itself. | |
He commented first on the vision that serves as the context of POB and | |
then described its purpose and scope. | |
WATERS sees the library of the future not necessarily as an electronic | |
library but as a place that generates, preserves, and improves for its | |
clients ready access to both intellectual and physical recorded | |
knowledge. Electronic tools must find a place in the library in the | |
context of this vision. Several roles for electronic tools include | |
serving as: indirect sources of electronic knowledge or as "finding" | |
aids (the on-line catalogues, the article-level indices, registers for | |
documents and archives); direct sources of recorded knowledge; full-text | |
images; and various kinds of compound sources of recorded knowledge (the | |
so-called compound documents of Hypertext, mixed text and image, | |
mixed-text image format, and multimedia). | |
POB is looking particularly at images and an image library, the uses to | |
which images will be put (e.g., storage, printing, browsing, and then use | |
as input for other processes), OCR as a subsequent process to image | |
capture, or creating an image library, and also possibly generating | |
microfilm. | |
While input will come from a variety of sources, POB is considering | |
especially input from preservation microfilm. A possible outcome is that | |
the film and paper which provide the input for the image library | |
eventually may go off into remote storage, and that the image library may | |
be the primary access tool. | |
The purpose and scope of POB focus on imaging. Though related to CXP, | |
POB has two features which distinguish it: 1) scale--conversion of | |
10,000 volumes into digital image form; and 2) source--conversion from | |
microfilm. Given these features, several key working hypotheses guide | |
POB, including: 1) Since POB is using microfilm, it is not concerned with | |
the image library as a preservation medium. 2) Digital imagery can improve | |
access to recorded knowledge through printing and network distribution at | |
a modest incremental cost of microfilm. 3) Capturing and storing documents | |
in a digital image form is necessary to further improvements in access. | |
(POB distinguishes between the imaging, digitizing process and OCR, | |
which at this stage it does not plan to perform.) | |
Currently in its first or organizational phase, POB found that it could | |
use a vendor selection process to facilitate a good deal of the | |
organizational work (e.g., creating a project team and advisory board, | |
confirming the validity of the plan, establishing the cost of the project | |
and a budget, selecting the materials to convert, and then raising the | |
necessary funds). | |
POB developed numerous selection criteria, including: a firm committed | |
to image-document management, the ability to serve as systems integrator | |
in a large-scale project over several years, interest in developing the | |
requisite software as a standard rather than a custom product, and a | |
willingness to invest substantial resources in the project itself. | |
Two vendors, DEC and Xerox, were selected as finalists in October 1991, | |
and with the support of the Commission on Preservation and Access, each | |
was commissioned to generate a detailed requirements analysis for the | |
project and then to submit a formal proposal for the completion of the | |
project, which included a budget and costs. The terms were that POB would | |
pay the loser. The results for Yale of involving a vendor included: | |
broad involvement of Yale staff across the board at a relatively low | |
cost, which may have long-term significance in carrying out the project | |
(twenty-five to thirty university people are engaged in POB); better | |
understanding of the factors that affect corporate response to markets | |
for imaging products; a competitive proposal; and a more sophisticated | |
view of the imaging markets. | |
The most important factor that distinguished the vendors under | |
consideration was their identification with the customer. The size and | |
internal complexity of the company also was an important factor. POB was | |
looking at large companies that had substantial resources. In the end, | |
the process generated for Yale two competitive proposals, with Xerox's | |
the clear winner. WATERS then described the components of the proposal, | |
the design principles, and some of the costs estimated for the process. | |
Components are essentially four: a conversion subsystem, a | |
network-accessible storage subsystem for 10,000 books (and POB expects | |
200 to 600 dpi storage), browsing stations distributed on the campus | |
network, and network access to the image printers. | |
Among the design principles, POB wanted conversion at the highest | |
possible resolution. Assuming TIFF files, TIFF files with Group 4 | |
compression, TCP/IP, and ethernet network on campus, POB wanted a | |
client-server approach with image documents distributed to the | |
workstations and made accessible through native workstation interfaces | |
such as Windows. POB also insisted on a phased approach to | |
implementation: 1) a stand-alone, single-user, low-cost entry into the | |
business with a workstation focused on conversion and allowing POB to | |
explore user access; 2) movement into a higher-volume conversion with | |
network-accessible storage and multiple access stations; and 3) a | |
high-volume conversion, full-capacity storage, and multiple browsing | |
stations distributed throughout the campus. | |
The costs proposed for start-up assumed the existence of the Yale network | |
and its two DocuTech image printers. Other start-up costs are estimated | |
at $1 million over the three phases. At the end of the project, the annual | |
operating costs estimated primarily for the software and hardware proposed | |
come to about $60,000, but these exclude costs for labor needed in the | |
conversion process, network and printer usage, and facilities management. | |
Finally, the selection process produced for Yale a more sophisticated | |
view of the imaging markets: the management of complex documents in | |
image form is not a preservation problem, not a library problem, but a | |
general problem in a broad, general industry. Preservation materials are | |
useful for developing that market because of the qualities of the | |
material. For example, much of it is out of copyright. The resolution | |
of key issues such as the quality of scanning and image browsing also | |
will affect development of that market. | |
The technology is readily available but changing rapidly. In this | |
context of rapid change, several factors affect quality and cost, to | |
which POB intends to pay particular attention, for example, the various | |
levels of resolution that can be achieved. POB believes it can bring | |
resolution up to 600 dpi, but an interpolation process from 400 to 600 is | |
more likely. The variation in the quality of microfilm will prove to be a | |
highly important factor. POB may reexamine the standards used to film in | |
the first place by looking at this process as a follow-on to microfilming. | |
Other important factors include: the techniques available to the | |
operator for handling material, the ways of integrating quality control | |
into the digitizing work flow, and a work flow that includes indexing and | |
storage. POB's requirement was to be able to deal with quality control | |
at the point of scanning. Thus, thanks to Xerox, POB anticipates having | |
a mechanism which will allow it not only to scan in batch form, but to | |
review the material as it goes through the scanner and control quality | |
from the outset. | |
The standards for measuring quality and costs depend greatly on the uses | |
of the material, including subsequent OCR, storage, printing, and | |
browsing. But especially at issue for POB is the facility for browsing. | |
This facility, WATERS said, is perhaps the weakest aspect of imaging | |
technology and the most in need of development. | |
A variety of factors affect the usability of complex documents in image | |
form, among them: 1) the ability of the system to handle the full range | |
of document types, not just monographs but serials, multi-part | |
monographs, and manuscripts; 2) the location of the database of record | |
for bibliographic information about the image document, which POB wants | |
to enter once and in the most useful place, the on-line catalog; 3) a | |
document identifier for referencing the bibliographic information in one | |
place and the images in another; 4) the technique for making the basic | |
internal structure of the document accessible to the reader; and finally, | |
5) the physical presentation on the CRT of those documents. POB is ready | |
to complete this phase now. One last decision involves deciding which | |
material to scan. | |
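
Factors 2 through 4 above (a single catalog record, a document
identifier that links it to the images, and access to the document's
internal structure) can be pictured with a small data-structure sketch
in Python. Everything here is hypothetical: the class names, the field
names, and the identifier "yale-0001" are invented for illustration and
do not describe POB's or Xerox's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """Bibliographic data, entered once, in the on-line catalog."""
    doc_id: str
    title: str
    author: str


@dataclass
class ImageDocument:
    """Page images held elsewhere, tied to the catalog only by doc_id."""
    doc_id: str
    pages: list = field(default_factory=list)      # image files in reading order
    structure: dict = field(default_factory=dict)  # section label -> first page index


def pages_for_section(doc, label):
    """Return the page images of one section, up to the start of the next."""
    first = doc.structure[label]
    later_starts = sorted(s for s in doc.structure.values() if s > first)
    last = later_starts[0] if later_starts else len(doc.pages)
    return doc.pages[first:last]


if __name__ == "__main__":
    record = CatalogRecord("yale-0001", "An Example Monograph", "A. Author")
    images = ImageDocument(
        doc_id=record.doc_id,
        pages=[f"p{i:04d}.tif" for i in range(10)],
        structure={"Preface": 0, "Chapter 1": 2, "Chapter 2": 6},
    )
    print(pages_for_section(images, "Chapter 1"))
```

Keeping the bibliographic record and the images separate, joined only by
the identifier, is what allows the catalog to remain the single database
of record.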
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * TIFF files constitute de facto standard * NARA's experience | |
with image conversion software and text conversion * RFC 1314 * | |
Considerable flux concerning available hardware and software solutions * | |
NAL through-put rate during scanning * Window management questions * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
In the question-and-answer period that followed WATERS's presentation, | |
the following points emerged: | |
* ZIDAR's statement about using TIFF files as a standard meant de | |
facto standard. This is what most people use and typically exchange | |
with other groups, across platforms, or even occasionally across | |
display software. | |
* HOLMES commented on the unsuccessful experience of NARA in | |
attempting to run image-conversion software or to exchange between | |
applications: What are supposedly TIFF files go into other software | |
that is supposed to be able to accept TIFF but cannot recognize the | |
format and cannot deal with it, and thus renders the exchange | |
useless. Re text conversion, he noted the different recognition | |
rates obtained by substituting the make and model of scanners in | |
NARA's recent test of an "intelligent" character-recognition product | |
for a new company. In the selection of hardware and software, | |
HOLMES argued, software no longer constitutes the overriding factor | |
it did until about a year ago; rather it is perhaps important to | |
look at both now. | |
* Danny Cohen and Alan Katz of the University of Southern California | |
Information Sciences Institute began circulating as an Internet RFC | |
(RFC 1314) about a month ago a standard for a TIFF interchange | |
format for Internet distribution of monochrome bit-mapped images, | |
which LYNCH said he believed would be used as a de facto standard. | |
* FLEISCHHAUER's impression from hearing these reports and thinking | |
about AM's experience was that there is considerable flux concerning | |
available hardware and software solutions. HOOTON agreed and | |
commented at the same time on ZIDAR's statement that the equipment | |
employed affects the results produced. One cannot draw a complete | |
conclusion by saying it is difficult or impossible to perform OCR | |
from scanning microfilm, for example, with that device, that set of | |
parameters, and system requirements, because numerous other people | |
are accomplishing just that, using other components, perhaps. | |
HOOTON opined that both the hardware and the software were highly | |
important. Most of the problems discussed today have been solved in | |
numerous different ways by other people. Though it is good to be | |
cognizant of various experiences, this is not to say that it will | |
always be thus. | |
* At NAL, the through-put rate of the scanning process for paper, | |
page by page, performing OCR, ranges from 300 to 600 pages per day; | |
not performing OCR is considerably faster, although how much faster | |
is not known. This is for scanning from bound books, which is much | |
slower. | |
* WATERS commented on window management questions: DEC proposed an | |
X-Windows solution which was problematical for two reasons. One was | |
POB's requirement to be able to manipulate images on the workstation | |
and bring them down to the workstation itself and the other was | |
network usage. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
THOMA * Illustration of deficiencies in scanning and storage process * | |
Image quality in this process * Different costs entailed by better image | |
quality * Techniques for overcoming various deficiencies: fixed | |
thresholding, dynamic thresholding, dithering, image merge * Page edge | |
effects * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
George THOMA, chief, Communications Engineering Branch, National Library | |
of Medicine (NLM), illustrated several of the deficiencies discussed by | |
the previous speakers. He introduced the topic of special problems by | |
noting the advantages of electronic imaging. For example, it is regenerable | |
because it is a coded file, and real-time quality control is possible with | |
electronic capture, whereas in photographic capture it is not. | |
One of the difficulties discussed in the scanning and storage process was | |
image quality which, without belaboring the obvious, means different | |
things for maps, medical X-rays, or broadcast television. In the case of | |
documents, THOMA said, image quality boils down to legibility of the | |
textual parts, and fidelity in the case of gray or color photo print-type | |
material. Legibility boils down to scan density, the standard in most | |
cases being 300 dpi. Increasing the resolution with scanners that | |
perform 600 or 1200 dpi, however, comes at a cost. | |
Better image quality entails at least four different kinds of costs: 1) | |
equipment costs, because the CCD (i.e., charge-coupled device) with a
greater number of elements costs more; 2) time costs that translate to
the actual capture costs, because manual labor is involved (the time is | |
also dependent on the fact that more data has to be moved around in the | |
machine in the scanning or network devices that perform the scanning as | |
well as the storage); 3) media costs, because at high resolutions larger | |
files have to be stored; and 4) transmission costs, because there is just | |
more data to be transmitted. | |
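The media and transmission costs listed above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 8.5-by-11-inch page and the 1-bit-per-pixel (black-and-white) depth are assumptions, not figures given in the talk, and compression is ignored.

```python
# Uncompressed size of a black-and-white (1 bit per pixel) page scan.
# Page size and bit depth are illustrative assumptions, not figures from the talk.

def raw_bitonal_bytes(dpi, width_in=8.5, height_in=11.0):
    """Return the uncompressed size in bytes of a 1-bit-per-pixel page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8  # eight 1-bit pixels fit in one byte

for dpi in (300, 600, 1200):
    size = raw_bitonal_bytes(dpi)
    print(f"{dpi:>4} dpi: {size / 1_000_000:.1f} MB uncompressed")
# Prints 1.1 MB, 4.2 MB, and 16.8 MB respectively.
```

Each doubling of the scan density quadruples the raw data, which is why storage and transmission costs climb so quickly with resolution.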
But while resolution takes care of the issue of legibility in image | |
quality, other deficiencies have to do with contrast and with elements on
the page scanned or the image that need to be removed or clarified. Thus,
THOMA proceeded to illustrate various deficiencies, how they are | |
manifested, and several techniques to overcome them. | |
Fixed thresholding was the first technique described, suitable for | |
black-and-white text, when the contrast does not vary over the page. One | |
can have many different threshold levels in scanning devices. Thus, | |
THOMA offered an example of extremely poor contrast, which resulted from | |
the fact that the stock was a heavy red. This is the sort of image that | |
when microfilmed fails to provide any legibility whatsoever. Fixed | |
thresholding is the way to change the black-to-red contrast to the | |
desired black-to-white contrast. | |
Other examples included material that had been browned or yellowed by | |
age. This was also a case of contrast deficiency, and correction was | |
done by fixed thresholding. A final example boils down to the same | |
thing: the contrast varies slightly, but not significantly. Fixed thresholding
solves this problem as well. The microfilm equivalent is certainly legible, | |
but it comes with dark areas. Though THOMA did not have a slide of the | |
microfilm in this case, he did show the reproduced electronic image. | |
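Fixed thresholding as described above amounts to comparing every pixel with one page-wide level. The following is a minimal sketch, not the scanner's actual implementation; the function name, the gray values, and the chosen level are hypothetical, with 0 standing for black and 255 for white.

```python
def fixed_threshold(gray, level):
    """Map each gray value (0 = black, 255 = white) to pure black or white."""
    return [[255 if px >= level else 0 for px in row] for row in gray]

# Hypothetical scan: dark ink (about 40) on a dark, low-contrast stock (about 110).
page = [
    [110, 112, 40, 108],
    [109, 38, 42, 111],
]
print(fixed_threshold(page, level=75))
# [[255, 255, 0, 255], [255, 0, 0, 255]]
```

Any level between the ink values and the stock values works here, which is the situation THOMA described: contrast is poor but uniform over the page.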
When one has variable contrast over a page or the lighting over the page | |
area varies, especially in the case where a bound volume has light | |
shining on it, the image must be processed by a dynamic thresholding | |
scheme. One scheme, dynamic averaging, allows the threshold level not to | |
be fixed but to be recomputed for every pixel from the neighboring | |
characteristics. The neighbors of a pixel determine where the threshold | |
should be set for that pixel. | |
THOMA showed an example of a page that had been made deficient by a | |
variety of techniques, including a burn mark, coffee stains, and a yellow | |
marker. Application of a fixed-thresholding scheme, THOMA argued, might | |
take care of several deficiencies on the page but not all of them. | |
Performing the calculation for a dynamic threshold setting, however, | |
removes most of the deficiencies so that at least the text is legible. | |
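One simple way to recompute the threshold for every pixel from its neighbors is to compare each pixel with the mean of a small surrounding window. The sketch below assumes that approach; the window radius, the bias value, and the sample page are hypothetical and are not taken from THOMA's slides.

```python
def dynamic_threshold(gray, radius=1, bias=10):
    """Binarize by comparing each pixel with the mean of its neighborhood.

    A pixel becomes black (0) when it is more than `bias` levels darker than
    the local mean; otherwise it becomes white (255).
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            neighbors = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(neighbors) / len(neighbors)
            row.append(0 if gray[y][x] < local_mean - bias else 255)
        out.append(row)
    return out

# Hypothetical page lit unevenly: paper fades from 200 to 100 left to right.
# Ink on the bright side (100) is as light as paper on the dark side (100),
# so no single fixed level can separate ink from paper everywhere.
page = [
    [200, 180, 160, 140, 120, 100],
    [200, 100, 160, 140, 40, 100],
    [200, 180, 160, 140, 120, 100],
]
for row in dynamic_threshold(page):
    print(row)
# Only the two ink pixels in the middle row come out black (0).
```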
Another problem is representing a gray level with black-and-white pixels | |
by a process known as dithering or electronic screening. But dithering | |
does not provide good image quality for pure black-and-white textual | |
material. THOMA illustrated this point with examples. Although its | |
suitability for photoprint is the reason for electronic screening or | |
dithering, it cannot be used for every compound image. In the document | |
that was distributed by CXP, THOMA noticed that the dithered image of the | |
IEEE test chart evinced some deterioration in the text. He presented an | |
extreme example of deterioration in the text in which compounded | |
documents had to be set right by other techniques. The technique | |
illustrated by the present example was an image merge in which the page | |
is scanned twice and the settings go from fixed threshold to the | |
dithering matrix; the resulting images are merged to give the best | |
results with each technique. | |
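The merge can be sketched as follows. This is an illustration under stated assumptions rather than the method THOMA demonstrated: the dithering uses a standard 4-by-4 ordered (Bayer) matrix, and the choice of which version to keep at each pixel comes from a hypothetical mask marking the photographic region, since the talk summary does not say how regions were selected.

```python
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]

def ordered_dither(gray):
    """Render gray levels (0-255) as black/white dots using a tiled threshold matrix."""
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, px in enumerate(row):
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) * 16
            out_row.append(255 if px > threshold else 0)
        out.append(out_row)
    return out

def fixed_threshold(gray, level=128):
    """Map each gray value to pure black (0) or white (255) at one fixed level."""
    return [[255 if px >= level else 0 for px in row] for row in gray]

def merge(text_version, photo_version, photo_mask):
    """Keep dithered pixels where photo_mask is True, thresholded pixels elsewhere."""
    return [
        [p if m else t for t, p, m in zip(t_row, p_row, m_row)]
        for t_row, p_row, m_row in zip(text_version, photo_version, photo_mask)
    ]

# Hypothetical compound page: text strokes on the left, flat mid-gray photo on the right.
page = [[20, 230, 20, 230, 128, 128, 128, 128] for _ in range(4)]
mask = [[False] * 4 + [True] * 4 for _ in range(4)]  # assumed photo region

merged = merge(fixed_threshold(page), ordered_dither(page), mask)
for row in merged:
    print(row)
```

In the merged result the text columns stay crisp black and white, while the gray patch is rendered as a dot pattern that is half white and half black. Dithering the text columns directly would turn some of the dark stroke pixels white, which is the kind of deterioration described above.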
THOMA illustrated how dithering is also used in nonphotographic or | |
nonprint materials with an example of a grayish page from a medical text, | |
which was reproduced to show all of the gray that appeared in the | |
original. Dithering provided a reproduction of all the gray in the | |
original of another example from the same text. | |
THOMA finally illustrated the problem of bordering, or page-edge, | |
effects. Books and bound volumes that are placed on a photocopy machine | |
or a scanner produce page-edge effects that are undesirable for two | |
reasons: 1) the aesthetics of the image; after all, if the image is to | |
be preserved, one does not necessarily want to keep all of its | |
deficiencies; 2) compression (with the bordering problem THOMA | |
illustrated, the compression ratio deteriorated tremendously). One way | |
to eliminate this more serious problem is to have the operator, at the
point of scanning, window the part of the image that is desirable and
automatically turn all of the pixels outside that window to white.
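The windowing step can be sketched in a few lines. The function name and the coordinate convention (top and left inclusive, bottom and right exclusive) are assumptions made for this illustration.

```python
def window_to_white(image, top, left, bottom, right, white=255):
    """Keep pixels inside the operator's window; set all others to white.

    `top`/`left` are inclusive and `bottom`/`right` are exclusive.
    """
    return [
        [px if top <= y < bottom and left <= x < right else white
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]

# Hypothetical scan with a dark border (0) around a small page area.
scan = [
    [0, 0, 0, 0],
    [0, 255, 0, 0],
    [0, 0, 255, 0],
    [0, 0, 0, 0],
]
print(window_to_white(scan, top=1, left=1, bottom=3, right=3))
# [[255, 255, 255, 255], [255, 255, 0, 255], [255, 0, 255, 255], [255, 255, 255, 255]]
```

Replacing the noisy border with uniform white leaves long unbroken runs at the page edges, which is what run-based compression schemes handle best.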
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
FLEISCHHAUER * AM's experience with scanning bound materials * Dithering | |
* | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress, | |
reported AM's experience with scanning bound materials, which he likened | |
to the problems involved in using photocopying machines. Very few | |
devices in the industry offer book-edge scanning, let alone book cradles. | |
The problem may be unsolvable, FLEISCHHAUER said, because a large enough | |
market does not exist for a preservation-quality scanner. AM is using a | |
Kurzweil scanner, which is a book-edge scanner now sold by Xerox. | |
Devoting the remainder of his brief presentation to dithering, | |
FLEISCHHAUER related AM's experience with a contractor who was using | |
unsophisticated equipment and software to reduce moire patterns from | |
printed halftones. AM took the same image and used the dithering | |
algorithm that forms part of the same Kurzweil Xerox scanner; it | |
disguised moire patterns much more effectively. | |
FLEISCHHAUER also observed that dithering produces a binary file which is | |
useful for numerous purposes, for example, printing it on a laser printer | |
without having to "re-halftone" it. But it tends to defeat efficient | |
compression, because the very thing that dithers to reduce moire patterns | |
also tends to work against compression schemes. AM thought the | |
difference in image quality was worth it. | |
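Why dithering works against compression can be seen by counting runs of identical pixels, since the facsimile-style schemes used for black-and-white images encode run lengths. The sketch below is a rough proxy, not a real encoder, and the two sample scan lines are hypothetical.

```python
def count_runs(row):
    """Count maximal runs of identical pixels in one scan line."""
    return sum(1 for i, px in enumerate(row) if i == 0 or px != row[i - 1])

thresholded = [255] * 16   # a light-gray area forced to solid white
dithered = [255, 0] * 8    # the same area rendered as alternating dots

print(count_runs(thresholded), count_runs(dithered))
# 1 16
```

One long run can be described very compactly, whereas sixteen one-pixel runs give a run-length coder almost nothing to exploit.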
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Relative use as a criterion for POB's selection of books to | |
be converted into digital form * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the discussion period, WATERS noted that one of the criteria for | |
selecting books among the 10,000 to be converted into digital image form | |
would be how much relative use they would receive--a subject still | |
requiring evaluation. The challenge will be to understand whether | |
coherent bodies of material will increase usage or whether POB should | |
seek material that is being used, scan that, and make it more accessible. | |
POB might decide to digitize materials that are already heavily used, in | |
order to make them more accessible and decrease wear on them. Another | |
approach would be to provide a large body of intellectually coherent | |
material that may be used more in digital form than it is currently used | |
in microfilm. POB would seek material that was out of copyright. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
BARONAS * Origin and scope of AIIM * Types of documents produced in | |
AIIM's standards program * Domain of AIIM's standardization work * AIIM's | |
structure * TC 171 and MS23 * Electronic image management standards * | |
Categories of EIM standardization where AIIM standards are being | |
developed * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Jean BARONAS, senior manager, Department of Standards and Technology, | |
Association for Information and Image Management (AIIM), described the | |
not-for-profit association and the national and international programs | |
for standardization in which AIIM is active. | |
Accredited for twenty-five years as the nation's standards development | |
organization for document image management, AIIM began life in a library | |
community developing microfilm standards. Today the association | |
maintains both its library and business-image management standardization | |
activities--and has moved into electronic image-management | |
standardization (EIM). | |
BARONAS defined the program's scope. AIIM deals with: 1) the | |
terminology of standards and of the technology it uses; 2) methods of | |
measurement for the systems, as well as quality; 3) methodologies for | |
users to evaluate and measure quality; 4) the features of apparatus used | |
to manage and edit images; and 5) the procedures used to manage images. | |
BARONAS noted that three types of documents are produced in the AIIM | |
standards program: the first two, accredited by the American National | |
Standards Institute (ANSI), are standards and standard recommended | |
practices. Recommended practices differ from standards in that they | |
contain more tutorial information. A technical report is not an ANSI | |
standard. Because AIIM's policies and procedures for developing | |
standards are approved by ANSI, its standards are labeled ANSI/AIIM, | |
followed by the number and title of the standard. | |
BARONAS then illustrated the domain of AIIM's standardization work. For | |
example, AIIM is the administrator of the U.S. Technical Advisory Group | |
(TAG) to the International Organization for Standardization's (ISO) technical
committee, TC 171 Micrographics and Optical Memories for Document and
Image Recording, Storage, and Use. AIIM officially works through ANSI in | |
the international standardization process. | |
BARONAS described AIIM's structure, including its board of directors, its | |
standards board of twelve individuals active in the image-management | |
industry, its strategic planning and legal admissibility task forces, and | |
its National Standards Council, which is comprised of the members of a | |
number of organizations who vote on every AIIM standard before it is | |
published. BARONAS pointed out that AIIM's liaisons deal with numerous | |
other standards developers, including the optical disk community, office | |
and publishing systems, image-codes-and-character set committees, and the | |
National Information Standards Organization (NISO). | |
BARONAS illustrated the procedures of TC 171, which covers all aspects of
image management. When AIIM's national program has conceptualized a new | |
project, it is usually submitted to the international level, so that the | |
member countries of TC 171 can simultaneously work on the development of
the standard or the technical report. BARONAS also illustrated a classic | |
microfilm standard, MS23, which deals with numerous imaging concepts that | |
apply to electronic imaging. Originally developed in the 1970s, revised
in the 1980s, and revised again in 1991, this standard is scheduled for
another revision. MS23 is an active standard whereby users may propose | |
new density ranges and new methods of evaluating film images in the | |
standard's revision. | |
BARONAS detailed several electronic image-management standards, for | |
instance, ANSI/AIIM MS44, a quality-control guideline for scanning 8.5" | |
by 11" black-and-white office documents. This standard is used with the | |
IEEE fax image--a continuous tone photographic image with gray scales, | |
text, and several continuous tone pictures--and AIIM test target number | |
2, a representative document used in office document management. | |
BARONAS next outlined the four categories of EIM standardization in which | |
AIIM standards are being developed: transfer and retrieval, evaluation, | |
optical disc and document scanning applications, and design and | |
conversion of documents. She detailed several of the main projects of | |
each: 1) in the category of image transfer and retrieval, a bi-level | |
image transfer format, ANSI/AIIM MS53, which is a proposed standard that | |
describes a file header for image transfer between unlike systems when | |
the images are compressed using G3 and G4 compression; 2) the category of | |
image evaluation, which includes the AIIM-proposed TR26 tutorial on image | |
resolution (this technical report will treat the differences and | |
similarities between classical or photographic and electronic imaging); | |
3) design and conversion, which includes a proposed technical report | |
called "Forms Design Optimization for EIM" (this report considers how | |
general-purpose business forms can be best designed so that scanning is | |
optimized; reprographic characteristics such as type, rules, background, | |
tint, and color will likewise be treated in the technical report); 4) | |
disk and document scanning applications includes a project a) on planning | |
platters and disk management, b) on generating an application profile for | |
EIM when images are stored and distributed on CD-ROM, and c) on | |
evaluating SCSI2, and how a common command set can be generated for SCSI2 | |
so that document scanners are more easily integrated. (ANSI/AIIM MS53 | |
will also apply to compressed images.) | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
BATTIN * The implications of standards for preservation * A major | |
obstacle to successful cooperation * A hindrance to access in the digital | |
environment * Standards a double-edged sword for those concerned with the | |
preservation of the human record * Near-term prognosis for reliable | |
archival standards * Preservation concerns for electronic media * Need | |
for reconceptualizing our preservation principles * Standards in the real | |
world and the politics of reproduction * Need to redefine the concept of | |
archival and to begin to think in terms of life cycles * Cooperation and | |
the La Guardia Eight * Concerns generated by discussions on the problems | |
of preserving text and image * General principles to be adopted in a | |
world without standards * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Patricia BATTIN, president, the Commission on Preservation and Access | |
(CPA), addressed the implications of standards for preservation. She | |
listed several areas where the library profession and the analog world of | |
the printed book had made enormous contributions over the past hundred | |
years--for example, in bibliographic formats, binding standards, and, most | |
important, in determining what constitutes longevity or archival quality. | |
Although standards have lightened the preservation burden through the | |
development of national and international collaborative programs, | |
nevertheless, a pervasive mistrust of other people's standards remains a | |
major obstacle to successful cooperation, BATTIN said. | |
The zeal to achieve perfection, regardless of the cost, has hindered | |
rather than facilitated access in some instances, and in the digital | |
environment, where no real standards exist, has brought an ironically | |
just reward. | |
BATTIN argued that standards are a double-edged sword for those concerned | |
with the preservation of the human record, that is, the provision of | |
access to recorded knowledge in a multitude of media as far into the | |
future as possible. Standards are essential to facilitate | |
interconnectivity and access, but, BATTIN said, as LYNCH pointed out | |
yesterday, if set too soon they can hinder creativity, expansion of | |
capability, and the broadening of access. The characteristics of | |
standards for digital imagery differ radically from those for analog | |
imagery. And the nature of digital technology implies continuing | |
volatility and change. To reiterate, precipitous standard-setting can | |
inhibit creativity, but delayed standard-setting results in chaos. | |
Since in BATTIN's opinion the near-term prognosis for reliable archival
standards, as defined by librarians in the analog world, is poor, two | |
alternatives remain: standing pat with the old technology, or | |
reconceptualizing. | |
Preservation concerns for electronic media fall into two general domains. | |
One is the continuing assurance of access to knowledge originally | |
generated, stored, disseminated, and used in electronic form. This | |
domain contains several subdivisions, including 1) the closed, | |
proprietary systems discussed the previous day, bundled information such | |
as electronic journals and government agency records, and electronically | |
produced or captured raw data; and 2) the application of digital | |
technologies to the reformatting of materials originally published on a | |
deteriorating analog medium such as acid paper or videotape. | |
The preservation of electronic media requires a reconceptualizing of our | |
preservation principles during a volatile, standardless transition which | |
may last far longer than any of us envision today. BATTIN urged the | |
necessity of shifting focus from assessing, measuring, and setting | |
standards for the permanence of the medium to the concept of managing | |
continuing access to information stored on a variety of media and | |
requiring a variety of ever-changing hardware and software for access--a | |
fundamental shift for the library profession. | |
BATTIN offered a primer on how to move forward with reasonable confidence | |
in a world without standards. Her comments fell roughly into two sections: | |
1) standards in the real world and 2) the politics of reproduction. | |
In regard to real-world standards, BATTIN argued the need to redefine the | |
concept of archive and to begin to think in terms of life cycles. In | |
the past, the naive assumption that paper would last forever produced a | |
cavalier attitude toward life cycles. The transient nature of the | |
electronic media has compelled people to recognize and accept upfront the | |
concept of life cycles in place of permanency. | |
Digital standards have to be developed and set in a cooperative context | |
to ensure efficient exchange of information. Moreover, during this | |
transition period, greater flexibility concerning how concepts such as | |
backup copies and archival copies in the CXP are defined is necessary, | |
or the opportunity to move forward will be lost. | |
In terms of cooperation, particularly in the university setting, BATTIN | |
also argued the need to avoid going off in a hundred different | |
directions. The CPA has catalyzed a small group of universities called | |
the La Guardia Eight--because La Guardia Airport is where meetings take | |
place--Harvard, Yale, Cornell, Princeton, Penn State, Tennessee, | |
Stanford, and USC, to develop a digital preservation consortium to look | |
at all these issues and develop de facto standards as we move along, | |
instead of waiting for something that is officially blessed. Continuing | |
to apply analog values and definitions of standards to the digital | |
environment, BATTIN said, will effectively lead to forfeiture of the | |
benefits of digital technology to research and scholarship. | |
Under the second rubric, the politics of reproduction, BATTIN reiterated | |
an oft-made argument concerning the electronic library, namely, that it | |
is more difficult to transform than to create, and nowhere is that belief | |
expressed more dramatically than in the conversion of brittle books to | |
new media. Preserving information published in electronic media involves | |
making sure the information remains accessible and that digital | |
information is not lost through reproduction. In the analog world of | |
photocopies and microfilm, the issue of fidelity to the original becomes | |
paramount, as do issues of "Whose fidelity?" and "Whose original?" | |
BATTIN elaborated these arguments with a few examples from a recent study | |
conducted by the CPA on the problems of preserving text and image. | |
Discussions with scholars, librarians, and curators in a variety of | |
disciplines dependent on text and image generated a variety of concerns, | |
for example: 1) Copy what is, not what the technology is capable of. | |
This is very important for the history of ideas. Scholars wish to know | |
what the author saw and worked from. And make available at the | |
workstation the opportunity to erase all the defects and enhance the | |
presentation. 2) The fidelity of reproduction--what is good enough, what | |
can we afford, and the difference it makes--issues of subjective versus | |
objective resolution. 3) The differences between primary and secondary | |
users. Restricting the definition of primary user to the one in whose | |
discipline the material has been published runs one headlong into the | |
reality that these printed books have had a host of other users from a | |
host of other disciplines, who not only were looking for very different | |
things, but who also shared values very different from those of the | |
primary user. 4) The relationship of the standard of reproduction to new | |
capabilities of scholarship--the browsing standard versus an archival | |
standard. How good must the archival standard be? Can a distinction be | |
drawn between potential users in setting standards for reproduction? | |
Archival storage, use copies, browsing copies--ought an attempt to set | |
standards even be made? 5) Finally, costs. How much are we prepared to | |
pay to capture absolute fidelity? What are the trade-offs between vastly | |
enhanced access, degrees of fidelity, and costs? | |
These standards, BATTIN concluded, serve to complicate further the | |
reproduction process, and add to the long list of technical standards | |
that are necessary to ensure widespread access. Ways to articulate and | |
analyze the costs that are attached to the different levels of standards | |
must be found. | |
Given the chaos concerning standards, which promises to linger for the | |
foreseeable future, BATTIN urged adoption of the following general | |
principles: | |
* Strive to understand the changing information requirements of | |
scholarly disciplines as more and more technology is integrated into | |
the process of research and scholarly communication in order to meet | |
future scholarly needs, not to build for the past. Capture | |
deteriorating information at the highest affordable resolution, even | |
though the dissemination and display technologies will lag. | |
* Develop cooperative mechanisms to foster agreement on protocols | |
for document structure and other interchange mechanisms necessary | |
for widespread dissemination and use before official standards are | |
set. | |
* Accept that, in a transition period, de facto standards will have | |
to be developed. | |
* Capture information in a way that keeps all options open and | |
provides for total convertibility: OCR, scanning of microfilm, | |
producing microfilm from scanned documents, etc. | |
* Work closely with the generators of information and the builders | |
of networks and databases to ensure that continuing accessibility is | |
a primary concern from the beginning. | |
* Piggyback on standards under development for the broad market, and | |
avoid library-specific standards; work with the vendors, in order to | |
take advantage of that which is being standardized for the rest of | |
the world. | |
* Concentrate efforts on managing permanence in the digital world, | |
rather than perfecting the longevity of a particular medium. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Additional comments on TIFF * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the brief discussion period that followed BATTIN's presentation, | |
BARONAS explained that TIFF was not developed in collaboration with or | |
under the auspices of AIIM. TIFF is a company product, not a standard, | |
is owned by two corporations, and is always changing. BARONAS also | |
observed that ANSI/AIIM MS53, a bi-level image file transfer format that | |
allows unlike systems to exchange images, is compatible with TIFF as well | |
as with DEC's architecture and IBM's MODCA/IOCA. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
HOOTON * Several questions to be considered in discussing text conversion | |
* | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
HOOTON introduced the final topic, text conversion, by noting that it is | |
becoming an increasingly important part of the imaging business. Many | |
people now realize that it enhances their system to be able to have more | |
and more character data as part of their imaging system. Regarding the
issue of OCR versus rekeying, HOOTON posed several questions: How does one get
text into computer-readable form? Does one use automated processes? | |
Does one attempt to eliminate the use of operators where possible? | |
Standards for accuracy, he said, are extremely important: it makes a | |
major difference in cost and time whether one sets as a standard 98.5 | |
percent acceptance or 99.5 percent. He mentioned outsourcing as a | |
possibility for converting text. Finally, what one does with the image | |
to prepare it for the recognition process is also important, he said, | |
because such preparation changes how recognition is viewed, as well as | |
facilitates recognition itself. | |
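The gap between the two accuracy standards is easier to appreciate as errors per page. The sketch below assumes a page of 2,000 characters, a round figure chosen for illustration and not one given by HOOTON.

```python
def expected_errors(chars_per_page, accuracy_percent):
    """Expected number of wrong characters on a page at a given accuracy."""
    return chars_per_page * (100 - accuracy_percent) / 100

for accuracy in (98.5, 99.5):
    print(f"{accuracy}% accurate: {expected_errors(2000, accuracy):.0f} errors per page")
# 98.5% accurate: 30 errors per page
# 99.5% accurate: 10 errors per page
```

Under that assumption, a one-point difference in the acceptance standard means three times as many characters to find and correct on every page.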
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
LESK * Roles of participants in CORE * Data flow * The scanning process * | |
The image interface * Results of experiments involving the use of | |
electronic resources and traditional paper copies * Testing the issue of | |
serendipity * Conclusions * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Michael LESK, executive director, Computer Science Research, Bell | |
Communications Research, Inc. (Bellcore), discussed the Chemical Online | |
Retrieval Experiment (CORE), a cooperative project involving Cornell | |
University, OCLC, Bellcore, and the American Chemical Society (ACS). | |
LESK spoke on 1) how the scanning was performed, including the unusual | |
feature of page segmentation, and 2) the use made of the text and the | |
image in experiments. | |
Working with the chemistry journals (because ACS has been saving its | |
typesetting tapes since the mid-1970s and thus has a significant back-run | |
of the most important chemistry journals in the United States), CORE is | |
attempting to create an automated chemical library. Approximately a | |
quarter of the pages by square inch are made up of images of | |
quasi-pictorial material; dealing with the graphic components of the | |
pages is extremely important. LESK described the roles of participants | |
in CORE: 1) ACS provides copyright permission, journals on paper, | |
journals on microfilm, and some of the definitions of the files; 2) at | |
Bellcore, LESK chiefly performs the data preparation, while Dennis Egan | |
performs experiments on the users of chemical abstracts, and supplies the | |
indexing and numerous magnetic tapes; 3) Cornell provides the site of the | |
experiment; 4) OCLC develops retrieval software and other user interfaces. | |
Various manufacturers and publishers have furnished other help. | |
Concerning data flow, Bellcore receives microfilm and paper from ACS; the | |
microfilm is scanned by outside vendors, while the paper is scanned | |
in-house on an Improvision scanner, twenty pages per minute at 300 dpi,
which provides sufficient quality for all practical uses. LESK would | |
prefer to have more gray level, because one of the ACS journals prints on | |
some colored pages, which creates a problem. | |
Bellcore performs all this scanning, creates a page-image file, and also | |
selects from the pages the graphics, to mix with the text file (which is | |
discussed later in the Workshop). The user is always searching the ASCII | |
file, but she or he may see a display based on the ASCII or a display | |
based on the images. | |
LESK illustrated how the program performs page analysis, and the image | |
interface. (The user types several words, is presented with a list-- | |
usually of the titles of articles contained in an issue--that derives | |
from the ASCII, clicks on an icon and receives an image that mirrors an | |
ACS page.) LESK also illustrated an alternative interface based on the
ASCII text, the so-called SuperBook interface from Bellcore.
LESK next presented the results of an experiment conducted by Dennis Egan | |
and involving thirty-six students at Cornell, one third of them | |
undergraduate chemistry majors, one third senior undergraduate chemistry | |
majors, and one third graduate chemistry students. A third of them | |
received the paper journals, the traditional paper copies and chemical | |
abstracts on paper. A third received image displays of the pictures of | |
the pages, and a third received the text display with pop-up graphics. | |
The students were given several questions made up by some chemistry | |
professors. The questions fell into five classes, ranging from very easy | |
to very difficult, and included questions designed to simulate browsing | |
as well as a traditional information retrieval-type task. | |
LESK furnished the following results. In the straightforward question | |
search--the question being, what is the phosphorus-oxygen bond distance
in hydroxy phosphate?--the students were told that they could take
fifteen minutes and, then, if they wished, give up. The students with | |
paper took more than fifteen minutes on average, and yet most of them | |
gave up. The students with either electronic format, text or image, | |
received good scores in reasonable time, hardly ever had to give up, and | |
usually found the right answer. | |
In the browsing study, the students were given a list of eight topics, | |
told to imagine that an issue of the Journal of the American Chemical | |
Society had just appeared on their desks, and were also told to flip | |
through it and to find topics mentioned in the issue. The average scores | |
were about the same. (The students were told to answer yes or no about | |
whether or not particular topics appeared.) The errors, however, were | |
quite different. The students with paper rarely said that something | |
appeared when it had not. But they often failed to find something | |
actually mentioned in the issue. The students using computers found numerous
things, but they also frequently said that a topic was mentioned when it | |
was not. (The reason, of course, was that they were performing word | |
searches. They were finding that words were mentioned and they were | |
concluding that they had accomplished their task.) | |
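The false positives described here follow directly from how a plain word search decides that a topic is present. The sketch below is a hypothetical illustration, not the CORE retrieval software; the function name, the topic, and the sample text are invented for the example.

```python
def word_search_says_yes(topic, text):
    """Return True if every word of the topic appears somewhere in the text."""
    words = set(text.lower().split())
    return all(word in words for word in topic.lower().split())

# Hypothetical sentence from an issue that does not discuss phosphate bonds as a topic.
issue_text = "the reaction was run in a phosphate buffer and the bond was cleaved"

print(word_search_says_yes("phosphate bond", issue_text))
# True
```

Both words occur, so the search reports a hit, even though they appear in unrelated contexts. A reader skimming the paper issue would be less likely to make that particular mistake but more likely to miss a topic that really is there.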
This question also contained a trick to test the issue of serendipity. | |
The students were given another list of eight topics and instructed, | |
without taking a second look at the journal, to recall how many of this | |
new list of eight topics were in this particular issue. This was an | |
attempt to see if they performed better at remembering what they were not | |
looking for. They all performed about the same, paper or electronics, | |
about 62 percent accurate. In short, LESK said, people were not very | |
good when it came to serendipity, but they were no worse at it with | |
computers than they were with paper. | |
(LESK gave a parenthetical illustration of the learning curve of students | |
who used SuperBook.) | |
The students using the electronic systems started off worse than the ones | |
using print, but by the third of the three sessions in the series had | |
caught up to print. As one might expect, electronics provide a much | |
better means of finding what one wants to read; reading speeds, once the | |
object of the search has been found, are about the same. | |
Almost none of the students could perform the hard task--the analogous | |
transformation. (It would require the expertise of organic chemists to | |
complete.) But an interesting result was that the students using the text | |
search performed terribly, while those using the image system did best. | |
That the text search system is driven by text offers the explanation. | |
Everything is focused on the text; to see the pictures, one must press | |
on an icon. Many students found the right article containing the answer | |
to the question, but they did not click on the icon to bring up the right | |
figure and see it. They did not know that they had found the right place, | |
and thus got it wrong. | |
The short answer demonstrated by this experiment was that in the event | |
one does not know what to read, one needs the electronic systems; the | |
electronic systems hold no advantage at the moment if one knows what to | |
read, but neither do they impose a penalty. | |
LESK concluded by commenting that, on one hand, the image system was easy | |
to use. On the other hand, the text display system, which represented | |
twenty man-years of work in programming and polishing, was not winning, | |
because the text was not being read, just searched. The much easier | |
system is highly competitive as well as remarkably effective for the | |
actual chemists. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
ERWAY * Most challenging aspect of working on AM * Assumptions guiding | |
AM's approach * Testing different types of service bureaus * AM's | |
requirement for 99.95 percent accuracy * Requirements for text-coding * | |
Additional factors influencing AM's approach to coding * Results of AM's | |
experience with rekeying * Other problems in dealing with service bureaus | |
* Quality control the most time-consuming aspect of contracting out | |
conversion * Long-term outlook uncertain * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
To Ricky ERWAY, associate coordinator, American Memory, Library of | |
Congress, the constant variety of conversion projects taking place | |
simultaneously represented perhaps the most challenging aspect of working | |
on AM. Thus, the challenge was not to find a solution for text | |
conversion but a tool kit of solutions to apply to LC's varied | |
collections that need to be converted. ERWAY limited her remarks to the | |
process of converting text to machine-readable form, and the variety of | |
LC's text collections, for example, bound volumes, microfilm, and | |
handwritten manuscripts. | |
Two assumptions have guided AM's approach, ERWAY said: 1) A desire not | |
to perform the conversion inhouse. Because of the variety of formats and | |
types of texts, to capitalize the equipment and have the talents and | |
skills to operate them at LC would be extremely expensive. Further, the | |
natural inclination to upgrade to newer and better equipment each year | |
made it reasonable for AM to focus on what it did best and seek external | |
conversion services. Using service bureaus also allowed AM to have | |
several types of operations take place at the same time. 2) AM was not a | |
technology project, but an effort to improve access to library | |
collections. Hence, whether text was converted using OCR or rekeying | |
mattered little to AM. What mattered were cost and accuracy of results. | |
AM considered different types of service bureaus and selected three to | |
perform several small tests in order to acquire a sense of the field. | |
The sample collections with which they worked included handwritten | |
correspondence, typewritten manuscripts from the 1940s, and | |
eighteenth-century printed broadsides on microfilm. On none of these | |
samples was OCR performed; they were all rekeyed. AM had several special | |
requirements for the three service bureaus it had engaged. For instance, | |
any errors in the original text were to be retained. Working from bound | |
volumes or anything that could not be sheet-fed also constituted a factor | |
eliminating companies that would have performed OCR. | |
AM requires 99.95 percent accuracy, which, though it sounds high, often | |
means one or two errors per page. The initial batch of test samples | |
contained several handwritten materials for which AM did not require | |
text-coding. The results, ERWAY reported, were in all cases fairly | |
comparable: for the most part, all three service bureaus achieved 99.95 | |
percent accuracy. AM was satisfied with the work but surprised at the cost. | |
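As a rough check on what 99.95 percent accuracy means per page, the arithmetic can be sketched as follows. The figure of 2,700 characters per page is the average ERWAY cited later in the discussion; the function name is illustrative only, and the 99.99 percent level is the one raised in connection with quality control.

```python
def expected_errors_per_page(accuracy_percent, chars_per_page=2700):
    """Expected number of wrong characters on one page at a given accuracy."""
    error_rate = (100.0 - accuracy_percent) / 100.0
    return error_rate * chars_per_page


for accuracy in (99.95, 99.99):
    # Rounded to two decimals for display only.
    print(accuracy, round(expected_errors_per_page(accuracy), 2))
```

At 2,700 characters per page, 99.95 percent works out to a little over one error per page, in line with the "one or two errors per page" quoted above.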
As AM began converting whole collections, it retained the requirement for | |
99.95 percent accuracy and added requirements for text-coding. AM needed | |
to begin performing work more than three years ago before LC requirements | |
for SGML applications had been established. Since AM's goal was simply | |
to retain any of the intellectual content represented by the formatting | |
of the document (which would be lost if one performed a straight ASCII | |
conversion), AM used "SGML-like" codes. These codes resembled SGML tags | |
but were used without the benefit of document-type definitions. AM found | |
that many service bureaus were not yet SGML-proficient. | |
Additional factors influencing the approach AM took with respect to | |
coding included: 1) the inability of any known microcomputer-based | |
user-retrieval software to take advantage of SGML coding; and 2) the | |
multiple inconsistencies in format of the older documents, which | |
confirmed AM in its desire not to attempt to force the different formats | |
to conform to a single document-type definition (DTD) and thus create the | |
need for a separate DTD for each document. | |
The five text collections that AM has converted or is in the process of | |
converting include a collection of eighteenth-century broadsides, a | |
collection of pamphlets, two typescript document collections, and a | |
collection of 150 books. | |
ERWAY next reviewed the results of AM's experience with rekeying, noting | |
again that because the bulk of AM's materials are historical, the quality | |
of the text often does not lend itself to OCR. While non-English | |
speakers are less likely to guess or elaborate or correct typos in the | |
original text, they are also less able to infer what we would; they also | |
are nearly incapable of converting handwritten text. Another | |
disadvantage of working with overseas keyers is that they are much less | |
likely to telephone with questions, especially on the coding, with the | |
result that they develop their own rules as they encounter new | |
situations. | |
Government contracting procedures and time frames posed a major challenge | |
to performing the conversion. Many service bureaus are not accustomed to | |
retaining the image, even if they perform OCR. Thus, questions of image | |
format and storage media were somewhat novel to many of them. ERWAY also | |
remarked other problems in dealing with service bureaus, for example, | |
their inability to perform text conversion from the kind of microfilm | |
that LC uses for preservation purposes. | |
But quality control, in ERWAY's experience, was the most time-consuming | |
aspect of contracting out conversion. AM has been attempting to perform | |
a 10-percent quality review, looking at either every tenth document or | |
every tenth page to make certain that the service bureaus are maintaining | |
99.95 percent accuracy. But even if they are complying with the | |
requirement for accuracy, finding errors produces a desire to correct | |
them and, in turn, to clean up the whole collection, which defeats the | |
purpose to some extent. Even a double entry requires a | |
character-by-character comparison to the original to meet the accuracy | |
requirement. LC is not accustomed to publishing imperfect texts, which | |
makes attempting to deal with the industry standard an emotionally | |
fraught issue for AM. As was mentioned in the previous day's discussion, | |
going from 99.95 to 99.99 percent accuracy usually doubles costs and | |
means a third keying or another complete run-through of the text. | |
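The 10-percent review described above can be sketched in code. This is only an illustration, assuming each page is available as a pair of strings (a trusted transcription of the original and the service bureau's keyed text). The function names are hypothetical, and accuracy is measured here by edit distance, which is one reasonable choice rather than AM's documented method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_accuracy(original, keyed):
    """Percent of characters judged correct, relative to the original's length."""
    if not original:
        return 100.0
    return 100.0 * (1 - edit_distance(original, keyed) / len(original))


def failing_pages(page_pairs, required=99.95, step=10):
    """Check every `step`-th page and return the indices that fall short."""
    failures = []
    for index in range(0, len(page_pairs), step):
        original, keyed = page_pairs[index]
        if char_accuracy(original, keyed) < required:
            failures.append(index)
    return failures
```

Even in this sketch the comparison needs a trusted copy of the original text, which is the step that makes quality control so time-consuming in practice.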
Although AM has learned much from its experiences with various collections | |
and various service bureaus, ERWAY concluded pessimistically that no | |
breakthrough has been achieved. Incremental improvements have occurred | |
in some of the OCR technology, some of the processes, and some of the | |
standards acceptances, which, though they may lead to somewhat lower costs, | |
do not offer much encouragement to many people who are anxiously awaiting | |
the day that the entire contents of LC are available on-line. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
ZIDAR * Several answers to why one attempts to perform full-text | |
conversion * Per page cost of performing OCR * Typical problems | |
encountered during editing * Editing poor copy OCR vs. rekeying * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program | |
(NATDP), National Agricultural Library (NAL), offered several answers to | |
the question of why one attempts to perform full-text conversion: 1) | |
Text in an image can be read by a human but not by a computer, so of | |
course it is not searchable and there is not much one can do with it. 2) | |
Some material simply requires word-level access. For instance, the legal | |
profession insists on full-text access to its material; with taxonomic or | |
geographic material, which entails numerous names, one virtually requires | |
word-level access. 3) Full text permits rapid browsing and searching, | |
something that cannot be achieved in an image with today's technology. | |
4) Text stored as ASCII and delivered in ASCII is standardized and highly | |
portable. 5) People just want full-text searching, even those who do not | |
know how to do it. NAL, for the most part, is performing OCR at an | |
actual cost per average-size page of approximately $7. NAL scans the | |
page to create the electronic image and passes it through the OCR device. | |
ZIDAR next rehearsed several typical problems encountered during editing. | |
Praising the celerity of her student workers, ZIDAR observed that editing | |
requires approximately five to ten minutes per page, assuming that there | |
are no large tables to audit. Confusion among the three characters I, 1, | |
and l constitutes perhaps the most common problem encountered. Zeroes | |
and O's also are frequently confused. Double M's create a particular | |
problem, even on clean pages. They are so wide in most fonts that they | |
touch, and the system simply cannot tell where one letter ends and the | |
other begins. Complex page formats occasionally fail to columnate | |
properly, which entails rescanning as though one were working with a | |
single column, entering the ASCII, and decolumnating for better | |
searching. With proportionally spaced text, OCR can have difficulty | |
telling the narrow gaps between letters from the true spaces between | |
words, and therefore will merge text or break up words where it should | |
not. | |
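Some of the character confusions ZIDAR listed can be flagged mechanically for an editor's attention. The sketch below is a hypothetical helper, not part of NATDP's workflow: it marks tokens that mix letters and digits and contain one of the look-alike characters (I, l, 1, O, 0). It does nothing for touching double M's or for spacing errors.

```python
import re

# Look-alike characters named in the talk: I / 1 / l and 0 / O.
LOOKALIKES = set("Il1O0")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a look-alike
    character, e.g. a '1' inside a word or an 'O' inside a number."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_letter and has_digit and any(c in LOOKALIKES for c in token):
            flagged.append(token)
    return flagged


print(suspicious_tokens("The so1l sample weighed 1O grams in 1992"))
```

A check of this kind only points the editor at likely trouble spots; deciding what the correct character should be still requires looking at the page image.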
ZIDAR said that it can often take longer to edit a poor-copy OCR than to | |
key it from scratch. NAL has also experimented with partial editing of | |
text, whereby project workers go into and clean up the format, removing | |
stray characters but not running a spell-check. NAL corrects typos in | |
the title and authors' names, which provides a foothold for searching and | |
browsing. Even extremely poor-quality OCR (e.g., 60-percent accuracy) | |
can still be searched, because numerous words are correct, while the | |
important words are probably repeated often enough that they are likely | |
to be found correct somewhere. Librarians, however, cannot tolerate this | |
situation, though end users seem more willing to use this text for | |
searching, provided that NAL indicates that it is unedited. ZIDAR | |
concluded that rekeying of text may be the best route to take, in spite | |
of numerous problems with quality control and cost. | |
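The claim above that poor OCR can still be searched rests on repetition, and a back-of-the-envelope calculation shows why. The sketch assumes, purely for illustration, that the 60-percent figure applies per word and that errors on different occurrences are independent; the source states neither.

```python
def chance_found(word_accuracy, occurrences):
    """Probability that at least one of `occurrences` copies of a word
    was recognized correctly, assuming independent errors."""
    return 1.0 - (1.0 - word_accuracy) ** occurrences


for n in (1, 2, 5):
    print(n, round(chance_found(0.6, n), 3))
```

Under these assumptions a word that appears five times is almost certain to be recognized correctly at least once, while a word that appears only once is missed four times in ten.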
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Modifying an image before performing OCR * NAL's costs per | |
page * AM's costs per page and experience with Federal Prison Industries * | |
Elements comprising NATDP's costs per page * OCR and structured markup * | |
Distinction between the structure of a document and its representation | |
when put on the screen or printed * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
HOOTON prefaced the lengthy discussion that followed with several | |
comments about modifying an image before one reaches the point of | |
performing OCR. For example, in regard to an application containing a | |
significant amount of redundant data, such as form-type data, numerous | |
companies today are working on various kinds of form removal, prior to | |
going through a recognition process, by using dropout colors. Thus, | |
acquiring access to form design or using electronic means is worth | |
considering. HOOTON also noted that conversion usually makes or breaks | |
one's imaging system. It is extremely important, extremely costly in | |
terms of either capital investment or service, and determines the quality | |
of the remainder of one's system, because it determines the character of | |
the raw material used by the system. | |
Concerning the four projects undertaken by NAL, two inside and two | |
performed by outside contractors, ZIDAR revealed that an in-house service | |
bureau executed the first at a cost between $8 and $10 per page for | |
everything, including building of the database. The project undertaken | |
by the Consultative Group on International Agricultural Research (CGIAR) | |
cost approximately $10 per page for the conversion, plus some expenses | |
for the software and building of the database. The Acid Rain Project--a | |
two-disk set produced by the University of Vermont, consisting of | |
Canadian publications on acid rain--cost $6.70 per page for everything, | |
including keying of the text, which was double keyed, scanning of the | |
images, and building of the database. The in-house project offered | |
considerable convenience and greater control of the process. On | |
the other hand, the service bureaus know their job and perform it | |
expeditiously, because they have more people. | |
As a useful comparison, ERWAY revealed AM's costs as follows: $0.75 to | |
$0.85 per thousand characters, with an average page | |
containing 2,700 characters. Requirements for coding and imaging | |
increase the costs. Thus, conversion of the text, including the coding, | |
costs approximately $3 per page. (This figure does not include the | |
imaging and database-building included in the NAL costs.) AM also | |
enjoyed a happy experience with Federal Prison Industries, which | |
precluded the necessity of going through the request-for-proposal process | |
to award a contract, because it is another government agency. The | |
prisoners performed AM's rekeying just as well as other service bureaus | |
and proved handy as well. AM shipped them the books, which they would | |
photocopy on a book-edge scanner. They would perform the markup on | |
photocopies, return the books as soon as they were done with them, | |
perform the keying, and return the material to AM on WORM disks. | |
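ERWAY's per-character rates can be turned into a per-page figure with a short calculation. The sketch below uses only the numbers she gave (a rate per thousand characters and 2,700 characters on an average page); the function name is illustrative.

```python
def keying_cost_per_page(rate_per_thousand_chars, chars_per_page=2700):
    """Raw keying cost for one page, before coding or imaging surcharges."""
    return rate_per_thousand_chars * chars_per_page / 1000.0


low = keying_cost_per_page(0.75)
high = keying_cost_per_page(0.85)
print(f"${low:.2f} to ${high:.2f} per page")
```

Plain keying thus comes to roughly $2.00 to $2.30 per page; the difference between that and the approximately $3 quoted above reflects the added requirements for coding.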
ZIDAR detailed the elements that constitute the previously noted cost of | |
approximately $7 per page. Most significant are the editing, correction | |
of errors, and spell-checking, which, though they may sound easy to | |
perform, in fact require a great deal of time. Reformatting text also | |
takes a while, but a significant share of NAL's expenses is for equipment, | |
which was extremely expensive when purchased because it was one of the few | |
systems on the market. The costs of equipment are being amortized over | |
five years but are still quite high, nearly $2,000 per month. | |
HOCKEY raised a general question concerning OCR and the amount of editing | |
required (substantial in her experience) to generate the kind of | |
structured markup necessary for manipulating the text on the computer or | |
loading it into any retrieval system. She wondered if the speakers could | |
extend the previous question about the cost-benefit of adding or inserting | |
structured markup. ERWAY noted that several OCR systems retain italics, | |
bolding, and other spatial formatting. While the material may not be in | |
the format desired, these systems possess the ability to remove the | |
original materials quickly from the hands of the people performing the | |
conversion, as well as to retain that information so that users can work | |
with it. HOCKEY rejoined that the current thinking on markup is that one | |
should not say that something is italic or bold so much as why it is that | |
way. To be sure, one needs to know that something was italicized, but | |
how can one get from one to the other? One can map from the structure to | |
the typographic representation. | |
FLEISCHHAUER suggested that, given the 100 million items the Library | |
holds, it may not be possible for LC to do more than report that a thing | |
was in italics as opposed to why it was italics, although that may be | |
desirable in some contexts. Promising to talk a bit during the afternoon | |
session about several experiments OCLC performed on automatic recognition | |
of document elements, and which they hoped to extend, WEIBEL said that in | |
fact one can recognize the major elements of a document with a fairly | |
high degree of reliability, at least as good as OCR. STEVENS drew a | |
useful distinction between standard, generalized markup (i.e., defining | |
for a document-type definition the structure of the document), and what | |
he termed a style sheet, which had to do with italics, bolding, and other | |
forms of emphasis. Thus, two different components are at work, one being | |
the structure of the document itself (its logic), and the other being its | |
representation when it is put on the screen or printed. | |
****** | |
SESSION V. APPROACHES TO PREPARING ELECTRONIC TEXTS | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
HOCKEY * Text in ASCII and the representation of electronic text versus | |
an image * The need to look at ways of using markup to assist retrieval * | |
The need for an encoding format that will be reusable and multifunctional * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Susan HOCKEY, director, Center for Electronic Texts in the Humanities | |
(CETH), Rutgers and Princeton Universities, announced that one talk | |
(WEIBEL's) was moved into this session from the morning and that David | |
Packard was unable to attend. The session would attempt to focus more on | |
what one can do with a text in ASCII and the representation of electronic | |
text rather than just an image, what one can do with a computer that | |
cannot be done with a book or an image. It would be argued that one can | |
do much more than just read a text, and from that starting point one can | |
use markup and methods of preparing the text to take full advantage of | |
the capability of the computer. That would lead to a discussion of what | |
the European Community calls REUSABILITY, what may better be termed | |
DURABILITY, that is, how to prepare or make a text that will last a long | |
time and that can be used for as many applications as possible, which | |
would lead to issues of improving intellectual access. | |
HOCKEY urged the need to look at ways of using markup to facilitate | |
retrieval, not just for referencing or to help locate an item that is | |
retrieved, but also to put markup tags in a text to help retrieve the | |
thing sought either with linguistic tagging or interpretation. HOCKEY | |
also argued that little advancement had occurred in the software tools | |
currently available for retrieving and searching text. She pressed the | |
desideratum of going beyond Boolean searches and performing more | |
sophisticated searching, which the insertion of more markup in the text | |
would facilitate. Thinking about electronic texts as opposed to images | |
means considering material that will never appear in print form, or for | |
which print will not be the primary form, that is, material which only | |
appears in electronic form. | |
HOCKEY alluded to the history and the need for markup and tagging and | |
electronic text, which was developed through the use of computers in the | |
humanities; as MICHELSON had observed, Father Busa had started in 1949 | |
to prepare the first-ever text on the computer. | |
HOCKEY remarked several large projects, particularly in Europe, for the | |
compilation of dictionaries, language studies, and language analysis, in | |
which people have built up archives of text and have begun to recognize | |
the need for an encoding format that will be reusable and multifunctional, | |
that can be used not just to print the text, which may be assumed to be a | |
byproduct of what one wants to do, but to structure it inside the computer | |
so that it can be searched, built into a Hypertext system, etc. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
WEIBEL * OCLC's approach to preparing electronic text: retroconversion, | |
keying of texts, more automated ways of developing data * Project ADAPT | |
and the CORE Project * Intelligent character recognition does not exist * | |
Advantages of SGML * Data should be free of procedural markup; | |
descriptive markup strongly advocated * OCLC's interface illustrated * | |
Storage requirements and costs for putting a lot of information on line * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Stuart WEIBEL, senior research scientist, Online Computer Library Center, | |
Inc. (OCLC), described OCLC's approach to preparing electronic text. He | |
argued that the electronic world into which we are moving must | |
accommodate not only the future but the past as well, and to some degree | |
even the present. Thus, starting out at one end with retroconversion and | |
keying of texts, one would like to move toward much more automated ways | |
of developing data. | |
For example, Project ADAPT had to do with automatically converting | |
document images into a structured document database with OCR text as | |
indexing and also a little bit of automatic formatting and tagging of | |
that text. The CORE project hosted by Cornell University, Bellcore, | |
OCLC, the American Chemical Society, and Chemical Abstracts, constitutes | |
WEIBEL's principal concern at the moment. This project is an example of | |
converting text for which one already has a machine-readable version into | |
a format more suitable for electronic delivery and database searching. | |
(Since Michael LESK had previously described CORE, WEIBEL would say | |
little concerning it.) Borrowing a chemical phrase, de novo synthesis, | |
WEIBEL cited the Online Journal of Current Clinical Trials as an example | |
of de novo electronic publishing, that is, a form in which the primary | |
form of the information is electronic. | |
Project ADAPT, then, which OCLC completed a couple of years ago and in | |
fact is about to resume, is a model in which one takes page images either | |
in paper or microfilm and converts them automatically to a searchable | |
electronic database, either on-line or local. The operating assumption | |
is that accepting some blemishes in the data, especially for | |
retroconversion of materials, will make it possible to accomplish more. | |
Not enough money is available to support perfect conversion. | |
WEIBEL related several steps taken to perform image preprocessing | |
(processing on the image before performing optical character | |
recognition), as well as image postprocessing. He denied the existence | |
of intelligent character recognition and asserted that what is wanted is | |
page recognition, which is a long way off. OCLC has experimented with | |
merging of multiple optical character recognition systems that will | |
reduce errors from an unacceptable rate of 5 characters out of every | |
1,000 to an unacceptable rate of 2 characters out of every 1,000, but it | |
is not good enough. It will never be perfect. | |
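The source does not say how OCLC merged the output of multiple recognition systems. One common way to combine engines is a per-character majority vote, sketched below under a strong simplifying assumption: the readings are already aligned and of equal length. Real OCR output would first need an alignment step, which is omitted here.

```python
from collections import Counter


def vote(readings):
    """Merge equal-length OCR readings by per-position majority vote.
    When the top count is tied, keep the character from the first reading."""
    if len(set(len(r) for r in readings)) != 1:
        raise ValueError("readings must be aligned to the same length")
    merged = []
    for chars in zip(*readings):
        counts = Counter(chars)
        best, best_count = counts.most_common(1)[0]
        if list(counts.values()).count(best_count) > 1:
            best = chars[0]
        merged.append(best)
    return "".join(merged)


print(vote(["c1ean", "clean", "cleam"]))
```

Voting helps only where the engines make different mistakes; an error shared by most of them passes straight through, which is consistent with WEIBEL's remark that the merged result is still not good enough.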
Concerning the CORE Project, WEIBEL observed that Bellcore is taking the | |
typography files, extracting the page images, and converting those | |
typography files to SGML markup. LESK hands that data off to OCLC, which | |
builds that data into a Newton database, the same system that underlies | |
the on-line system in virtually all of the reference products at OCLC. | |
The long-term goal is to make the systems interoperable so that not just | |
Bellcore's system and OCLC's system can access this data, but other | |
systems can as well, and the key to that is the Z39.50 common command | |
language and the full-text extension. Z39.50 is fine for MARC records, | |
but by itself is not sufficient for full text (that is, for making full | |
texts interoperable). | |
WEIBEL next outlined the critical role of SGML for a variety of purposes, | |
for example, as noted by HOCKEY, in the world of extremely large | |
databases, using highly structured data to perform field searches. | |
WEIBEL argued that by building the structure of the data in (i.e., the | |
structure of the data originally on a printed page), it becomes easy to | |
look at a journal article even if one cannot read the characters and know | |
where the title or author is, or what the sections of that document would be. | |
OCLC wants to make that structure explicit in the database, because it will | |
be important for retrieval purposes. | |
The second big advantage of SGML is that it gives one the ability to | |
build structure into the database that can be used for display purposes | |
without contaminating the data with instructions about how to format | |
things. The distinction lies between procedural markup, which tells one | |
where to put dots on the page, and descriptive markup, which describes | |
the elements of a document. | |
WEIBEL believes that there should be no procedural markup in the data at | |
all, that the data should be completely unsullied by information about | |
italics or boldness. That should be left up to the display device, | |
whether that display device is a page printer or a screen display device. | |
By keeping one's database free of that kind of contamination, one can | |
make decisions down the road, for example, reorganize the data in ways | |
that are not cramped by built-in notions of what should be italic and | |
what should be bold. WEIBEL strongly advocated descriptive markup. As | |
an example, he illustrated the index structure in the CORE data. With | |
subsequent illustrated examples of markup, WEIBEL acknowledged the common | |
complaint that SGML is hard to read in its native form, although markup | |
decreases considerably once one gets into the body. Without the markup, | |
however, one would not have the structure in the data. One can pass | |
markup through a LaTeX processor and convert it relatively easily to a | |
printed version of the document. | |
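The separation WEIBEL advocated, descriptive markup in the data and formatting decisions left to the display device, can be illustrated with a small sketch. XML stands in for SGML here, and the tag names, sample text, and style tables are all invented for the example; none of them comes from the CORE data.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: tags say what each element IS, not how it looks.
ARTICLE = """<article>
  <title>Sample Title</title>
  <author>A. N. Author</author>
  <para>Values are reported for <term>hydroxy phosphate</term>.</para>
</article>"""

# Two separate "style sheets": element name -> (text before, text after).
SCREEN_STYLE = {"title": ("== ", " =="), "author": ("by ", ""),
                "term": ("*", "*")}
LATEX_STYLE = {"title": ("\\section{", "}"), "author": ("\\emph{", "}"),
               "term": ("\\textit{", "}")}


def render(element, style):
    """Walk the tree and wrap each element's text per the style sheet."""
    before, after = style.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)


def render_document(source, style):
    """Render each top-level element of the document on its own line."""
    root = ET.fromstring(source)
    return "\n".join(render(child, style).strip() for child in root)


print(render_document(ARTICLE, SCREEN_STYLE))
print(render_document(ARTICLE, LATEX_STYLE))
```

The same stored text yields both a screen rendering and LaTeX source, and a decision to show terms in bold rather than italics would change only a style table, never the data.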
WEIBEL next illustrated an extremely cluttered screen dump of OCLC's | |
system, in order to show as much as possible the inherent capability on | |
the screen. (He noted parenthetically that he had become a supporter of | |
X-Windows as a result of the progress of the CORE Project.) WEIBEL also | |
illustrated the two major parts of the interface: 1) a control box that | |
allows one to generate lists of items, which resembles a small table of | |
contents based on key words one wishes to search, and 2) a document | |
viewer, which is a separate process in and of itself. He demonstrated | |
how to follow links through the electronic database simply by selecting | |
the appropriate button and bringing them up. He also noted problems that | |
remain to be accommodated in the interface (e.g., as pointed out by LESK, | |
what happens when users do not click on the icon for the figure). | |
Given the constraints of time, WEIBEL omitted a large number of ancillary | |
items in order to say a few words concerning storage requirements and | |
what will be required to put a lot of things on line. Since it is | |
extremely expensive to reconvert all of this data, especially if it is | |
just in paper form (and even if it is in electronic form in typesetting | |
tapes), he advocated building journals electronically from the start. In | |
that case, if one only has text, graphics, and indexing (which is all that | |
one needs with de novo electronic publishing, because there is no need to | |
go back and look at bit-maps of pages), one can get 10,000 journals of | |
full text, or almost 6 million pages per year. These pages can be put in | |
approximately 135 gigabytes of storage, which is not all that much, | |
WEIBEL said. For twenty years, something less than three terabytes would | |
be required. WEIBEL calculated the costs of storing this information as | |
follows: If a gigabyte costs approximately $1,000, then a terabyte costs | |
approximately $1 million to buy in terms of hardware. One also needs a | |
building to put it in and a staff like OCLC to handle that information. | |
So, to support a terabyte, multiply by five, which gives $5 million per | |
year for a supported terabyte of data. | |
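WEIBEL's storage and cost estimates can be checked with a few lines of arithmetic. The sketch uses only the figures he gave and assumes decimal units (1 terabyte = 1,000 gigabytes); the variable names are illustrative.

```python
PAGES_PER_YEAR = 6_000_000   # "almost 6 million pages per year"
GIGABYTES_PER_YEAR = 135     # WEIBEL's storage estimate for those pages
COST_PER_GIGABYTE = 1_000    # dollars, hardware only
SUPPORT_MULTIPLIER = 5       # building and staff

# Implied average size of one page of text, graphics, and indexing.
bytes_per_page = GIGABYTES_PER_YEAR * 1e9 / PAGES_PER_YEAR

# Twenty years of journals at the same annual volume.
terabytes_20_years = GIGABYTES_PER_YEAR * 20 / 1000

hardware_per_terabyte = COST_PER_GIGABYTE * 1000
supported_per_terabyte = hardware_per_terabyte * SUPPORT_MULTIPLIER

print(f"{bytes_per_page:,.0f} bytes per page")
print(f"{terabytes_20_years} terabytes for twenty years")
print(f"${supported_per_terabyte:,} per supported terabyte")
```

The figures are mutually consistent: 135 gigabytes a year implies about 22,500 bytes per page, and twenty years comes to 2.7 terabytes, which matches "something less than three terabytes."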
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Tapes saved by ACS are the typography files originally | |
supporting publication of the journal * Cost of building tagged text into | |
the database * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the question-and-answer period that followed WEIBEL's | |
presentation, these clarifications emerged. The tapes saved by the | |
American Chemical Society are the typography files that originally | |
supported the publication of the journal. Although they are not tagged | |
in SGML, they are tagged in very fine detail. Every single sentence is | |
marked, all the registry numbers, all the publications issues, dates, and | |
volumes. No cost figures on tagging material on a per-megabyte basis | |
were available. Because ACS's typesetting system runs from tagged text, | |
there is no extra cost per article. It was unknown what it costs ACS to | |
keyboard the tagged text rather than just keyboard the text in the | |
cheapest process. In other words, since one intends to publish things | |
and will need to build tagged text into a typography system in any case, | |
if one does that in such a way that it can drive not only typography but | |
an electronic system (which is what ACS intends to do--move to SGML | |
publishing), the marginal cost is zero. The marginal cost represents the | |
cost of building tagged text into the database, which is small. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
SPERBERG-McQUEEN * Distinction between texts and computers * Implications | |
of recognizing that all representation is encoding * Dealing with | |
complicated representations of text entails the need for a grammar of | |
documents * Variety of forms of formal grammars * Text as a bit-mapped | |
image does not represent a serious attempt to represent text in | |
electronic form * SGML, the TEI, document-type declarations, and the | |
reusability and longevity of data * TEI conformance explicitly allows | |
extension or modification of the TEI tag set * Administrative background | |
of the TEI * Several design goals for the TEI tag set * An absolutely | |
fixed requirement of the TEI Guidelines * Challenges the TEI has | |
attempted to face * Good texts not beyond economic feasibility * The | |
issue of reproducibility or processability * The issue of images as | |
simulacra for the text redux * One's model of text determines what one's | |
software can do with a text and has economic consequences * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor, | |
Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew | |
a distinction between texts and computers: Texts are abstract cultural | |
and linguistic objects while computers are complicated physical devices, | |
he said. Abstract objects cannot be placed inside physical devices; with | |
computers one can only represent text and act upon those representations. | |
The recognition that all representation is encoding, SPERBERG-McQUEEN | |
argued, leads to the recognition of two things: 1) The topic description | |
for this session is slightly misleading, because there can be no discussion | |
of pros and cons of text-coding unless what one means is pros and cons of | |
working with text with computers. 2) No text can be represented in a | |
computer without some sort of encoding; images are one way of encoding text, | |
ASCII is another, SGML yet another. There is no encoding without some | |
information loss, that is, there is no perfect reproduction of a text that | |
allows one to do away with the original. Thus, the question becomes, | |
What is the most useful representation of text for a serious work? | |
This depends on what kind of serious work one is talking about. | |
The projects demonstrated the previous day all involved highly complex | |
information and fairly complex manipulation of the textual material. | |
In order to use that complicated information, one has to calculate it | |
slowly or manually and store the result. It needs to be stored, therefore, | |
as part of one's representation of the text. Thus, one needs to store the | |
structure in the text. To deal with complicated representations of text, | |
one needs somehow to control the complexity of the representation of a text; | |
that means one needs a way of finding out whether a document or an | |
electronic representation of a document is legal or not; and that | |
means one needs a grammar of documents. | |
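
One way to picture a "grammar of documents" is as a table stating which
elements may contain which others, against which a given encoding can be
checked for legality. The following is a minimal sketch under stated
assumptions, not SGML itself: the element names and the toy grammar are
hypothetical, and well-formed XML parsed with Python's standard library
stands in for an SGML instance.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: each element name maps to the set of
# element names allowed as its direct children.
GRAMMAR = {
    "text": {"chapter"},
    "chapter": {"title", "p"},
    "title": set(),
    "p": set(),
}

def is_legal(element):
    """Return True if the element and all its descendants obey GRAMMAR."""
    allowed = GRAMMAR.get(element.tag)
    if allowed is None:
        return False  # element is not defined in the grammar at all
    return all(child.tag in allowed and is_legal(child) for child in element)

legal_doc = ET.fromstring(
    "<text><chapter><title>One</title><p>Words.</p></chapter></text>"
)
illegal_doc = ET.fromstring("<text><p>Words.</p></text>")

print(is_legal(legal_doc))    # paragraph sits inside a chapter
print(is_legal(illegal_doc))  # paragraph sits directly inside text
```

A real document-type declaration also constrains the order and repetition
of elements; this sketch checks containment only.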
SPERBERG-McQUEEN discussed the variety of forms of formal grammars, | |
implicit and explicit, as applied to text, and their capabilities. He | |
argued that these grammars correspond to different models of text that | |
different developers have. For example, one implicit model of the text | |
is that there is no internal structure, but just one thing after another, | |
a few characters and then perhaps a start-title command, and then a few | |
more characters and an end-title command. SPERBERG-McQUEEN also | |
distinguished several kinds of text that have a sort of hierarchical | |
structure that is not very well defined, which, typically, corresponds | |
to grammars that are not very well defined, as well as hierarchies that | |
are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely | |
complicated things such as SGML, which handle strictly hierarchical data | |
very nicely. | |
SPERBERG-McQUEEN conceded that one other model not illustrated on his two | |
displays was the model of text as a bit-mapped image, an image of a page, | |
and confessed to having been converted to a limited extent by the | |
Workshop to the view that electronic images constitute a promising, | |
probably superior alternative to microfilming. But he was not convinced | |
that electronic images represent a serious attempt to represent text in | |
electronic form. Many of their problems stem from the fact that they are | |
not direct attempts to represent the text but attempts to represent the | |
page, thus making them representations of representations. | |
In this situation of increasingly complicated textual information and the | |
need to control that complexity in a useful way (which raises the question | |
of the need for good textual grammars), one has the introduction of SGML. | |
With SGML, one can develop specific document-type declarations | |
for specific text types or, as with the TEI, attempt to generate | |
general document-type declarations that can handle all sorts of text. | |
The TEI is an attempt to develop formats for text representation that | |
will ensure the kind of reusability and longevity of data discussed earlier. | |
It offers a way to stay alive in the state of permanent technological | |
revolution. | |
It has been a continuing challenge in the TEI to create document grammars | |
that do some work in controlling the complexity of the textual object but | |
also allow one to represent the real text that one will find. | |
Fundamental to the notion of the TEI is that TEI conformance allows one | |
the ability to extend or modify the TEI tag set so that it fits the text | |
that one is attempting to represent. | |
SPERBERG-McQUEEN next outlined the administrative background of the TEI. | |
The TEI is an international project to develop and disseminate guidelines | |
for the encoding and interchange of machine-readable text. It is | |
sponsored by the Association for Computers and the Humanities, the | |
Association for Computational Linguistics, and the Association for | |
Literary and Linguistic Computing. Representatives of numerous other | |
professional societies sit on its advisory board. The TEI has a number | |
of affiliated projects that have provided assistance by testing drafts of | |
the guidelines. | |
Among the design goals for the TEI tag set, the scheme first of all must | |
meet the needs of research, because the TEI came out of the research | |
community, which did not feel adequately served by existing tag sets. | |
The tag set must be extensive as well as compatible with existing and | |
emerging standards. In 1990, version 1.0 of the Guidelines was released | |
(SPERBERG-McQUEEN illustrated their contents). | |
SPERBERG-McQUEEN noted that one problem besetting electronic text has | |
been the lack of adequate internal or external documentation for many | |
existing electronic texts. The TEI guidelines as currently formulated | |
contain few fixed requirements, but one of them is this: There must | |
always be a document header, an in-file SGML tag that provides | |
1) a bibliographic description of the electronic object one is talking | |
about (that is, who included it, when, what for, and under which title); | |
and 2) the copy text from which it was derived, if any. If there was | |
no copy text or if the copy text is unknown, then one states as much. | |
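
The header requirement lends itself to a mechanical check. The sketch below
is an illustration only: the element names (teiHeader, fileDesc, titleStmt,
title, sourceDesc) follow later releases of the TEI Guidelines and are an
assumption here, since the summary does not name the tags; XML parsed with
Python's standard library stands in for SGML, and the function name is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are assumed, not taken
# from the version of the Guidelines discussed at the Workshop.
SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An example electronic text</title></titleStmt>
      <sourceDesc><p>No copy text; created in electronic form.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><p>Body of the text.</p></text>
</TEI>
"""

def header_problems(xml_string):
    """List what is missing from the document header, if anything."""
    root = ET.fromstring(xml_string.strip())
    header = root.find("teiHeader")
    if header is None:
        return ["no document header"]
    problems = []
    if header.find("fileDesc/titleStmt/title") is None:
        problems.append("no bibliographic description (title)")
    if header.find("fileDesc/sourceDesc") is None:
        problems.append("no statement of the copy text")
    return problems

print(header_problems(SAMPLE))
print(header_problems("<TEI><text><p>No header here.</p></text></TEI>"))
```

Note that the sample states explicitly that there is no copy text, as the
requirement asks, rather than leaving the source description out.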
Version 2.0 of the Guidelines was scheduled to be completed in fall 1992 | |
and a revised third version is to be presented to the TEI advisory board | |
for its endorsement this coming winter. The TEI itself exists to provide | |
a markup language, not a marked-up text. | |
Among the challenges the TEI has attempted to face is the need for a | |
markup language that will work for existing projects, that is, handle the | |
level of markup that people are using now to tag only chapter, section, | |
and paragraph divisions and not much else. At the same time, such a | |
language also will be able to scale up gracefully to handle the highly | |
detailed markup which many people foresee as the future destination of | |
much electronic text, and which is not the future destination but the | |
present home of numerous electronic texts in specialized areas. | |
SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as | |
unable to support the kind of applications that draw people who have | |
never been in the public library regularly before, and make them come | |
back. He advocated more interesting text and more intelligent text. | |
Asserting that it is not beyond economic feasibility to have good texts, | |
SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags | |
contains tags that one is expected to enter every time the relevant | |
textual feature occurs. It contains all the tags that people need now, | |
and it is not expected that everyone will tag things in the same way. | |
The question of how people will tag the text is in large part a function | |
of their reaction to what SPERBERG-McQUEEN termed the issue of | |
reproducibility. What one needs to be able to reproduce are the things | |
one wants to work with. Perhaps a more useful concept than that of | |
reproducibility or recoverability is that of processability, that is, | |
what can one get from an electronic text without reading it again | |
in the original. He illustrated this contention with a page from | |
Jan Comenius's bilingual Introduction to Latin. | |
SPERBERG-McQUEEN returned at length to the issue of images as simulacra | |
for the text, in order to reiterate his belief that in the long run more | |
than images of pages of particular editions of the text are needed, | |
because just as second-generation photocopies and second-generation | |
microfilm degenerate, so second-generation representations tend to | |
degenerate, and one tends to overstress some relatively trivial aspects | |
of the text such as its layout on the page, which is not always | |
significant, despite what the text critics might say, and slight other | |
pieces of information such as the very important lexical ties between the | |
English and Latin versions of Comenius's bilingual text, for example. | |
Moreover, in many crucial respects it is easy to fool oneself concerning | |
what a scanned image of the text will accomplish. For example, in order | |
to study the transmission of texts, information concerning the text | |
carrier is necessary, which scanned images simply do not always handle. | |
Further, even the high-quality materials being produced at Cornell lose | |
much of the information that one would need if studying those books as | |
physical objects. It is a choice that has been made. It is an arguably | |
justifiable choice, but one does not know what color those pen strokes in | |
the margin are or whether there was a stain on the page, because it has | |
been filtered out. One does not know whether there were rips in the page | |
because they do not show up, and on a couple of the marginal marks one | |
loses half of the mark because the pen is very light and the scanner | |
failed to pick it up, and so what is clearly a checkmark in the margin of | |
the original becomes a little scoop in the margin of the facsimile. | |
These are standard problems for facsimile editions, not new to electronics | |
but also true of light-lens photography, and are remarked here because it | |
is important that we not fool ourselves: even if we produce a very nice | |
image of this page with good contrast, we are not replacing the | |
manuscript any more than microfilm has replaced the manuscript. | |
The TEI comes from the research community, where its first allegiance | |
lies, but it is not just an academic exercise. It has relevance far | |
beyond those who spend all of their time studying text, because one's | |
model of text determines what one's software can do with a text. Good | |
models lead to good software. Bad models lead to bad software. That has | |
economic consequences, and it is these economic consequences that have | |
led the European Community to help support the TEI, and that will lead, | |
SPERBERG-McQUEEN hoped, some software vendors to realize that if they | |
provide software with a better model of the text they can make a killing. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Implications of different DTDs and tag sets * ODA versus SGML * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
During the discussion that followed, several additional points were made. | |
Neither AAP (i.e., Association of American Publishers) nor CALS (i.e., | |
Computer-aided Acquisition and Logistics Support) has a document-type | |
definition for ancient Greek drama, although the TEI will be able to | |
handle that. Given this state of affairs and assuming that the | |
technical-journal producers and the commercial vendors decide to use the | |
other two types, then an institution like the Library of Congress, which | |
might receive all of their publications, would have to be able to handle | |
three different types of document definitions and tag sets and be able to | |
distinguish among them. | |
Office Document Architecture (ODA) has some advantages that flow from its | |
tight focus on office documents and clear directions for implementation. | |
Much of the ODA standard is easier to read and clearer at first reading | |
than the SGML standard, which is extremely general. What that means is | |
that if one wants to use graphics in TIFF with ODA, one is stuck, because | |
ODA defines its own graphics formats and TIFF is not among them, whereas SGML says the | |
world is not waiting for this work group to create another graphics format. | |
What is needed is an ability to use whatever graphics format one wants. | |
The TEI provides a socket that allows one to connect the SGML document to | |
the graphics. The notation that the graphics are in is clearly a choice | |
that one needs to make based on her or his environment, and that is one | |
advantage. SGML is less megalomaniacal in attempting to define formats | |
for all kinds of information, though more megalomaniacal in attempting to | |
cover all sorts of documents. The other advantage is that the model of | |
text represented by SGML is simply an order of magnitude richer and more | |
flexible than the model of text offered by ODA. Both offer hierarchical | |
structures, but SGML recognizes that the hierarchical model of the text | |
that one is looking at may not have been in the minds of the designers, | |
whereas ODA does not. | |
ODA is not really aiming for the kind of document that the TEI wants to | |
encompass. The TEI can handle the kind of material ODA has, as well as a | |
significantly broader range of material. ODA seems to be very much | |
focused on office documents, which is what it started out being called-- | |
office document architecture. | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
CALALUCA * Text-encoding from a publisher's perspective * | |
Responsibilities of a publisher * Reproduction of Migne's Latin series | |
whole and complete with SGML tags based on perceived need and expected | |
use * Particular decisions arising from the general decision to produce | |
and publish PLD * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
The final speaker in this session, Eric CALALUCA, vice president, | |
Chadwyck-Healey, Inc., spoke from the perspective of a publisher re | |
text-encoding, rather than as one qualified to discuss methods of | |
encoding data, and observed that the presenters sitting in the room, | |
whether they had chosen to or not, were acting as publishers: making | |
choices, gathering data, gathering information, and making assessments. | |
CALALUCA offered the hard-won conviction that in publishing very large | |
text files (such as PLD), one cannot avoid making personal judgments of | |
appropriateness and structure. | |
In CALALUCA's view, encoding decisions stem from prior judgments. Two | |
notions have become axioms for him in the consideration of future sources | |
for electronic publication: 1) electronic text publishing is as personal | |
as any other kind of publishing, and questions of if and how to encode | |
the data are simply a consequence of that prior decision; 2) all | |
personal decisions are open to criticism, which is unavoidable. | |
CALALUCA rehearsed his role as a publisher or, better, as an intermediary | |
between what is viewed as a sound idea and the people who would make use | |
of it. Finding the specialist to advise in this process is the core of | |
that function. The publisher must monitor and hug the fine line between | |
giving users what they want and suggesting what they might need. One | |
responsibility of a publisher is to represent the desires of scholars and | |
research librarians as opposed to bullheadedly forcing them into areas | |
they would not choose to enter. | |
CALALUCA likened the questions being raised today about data structure | |
and standards to the decisions faced by the Abbe Migne himself during | |
production of the Patrologia series in the mid-nineteenth century. | |
Chadwyck-Healey's decision to reproduce Migne's Latin series whole and | |
complete with SGML tags was also based upon a perceived need and an | |
expected use. In the same way that Migne's work came to be far more than | |
a simple handbook for clerics, PLD is already far more than a database | |
for theologians. It is a bedrock source for the study of Western | |
civilization, CALALUCA asserted. | |
In regard to the decision to produce and publish PLD, the editorial board | |
offered direct judgments on the question of appropriateness of these | |
texts for conversion, their encoding and their distribution, and | |
concluded that the best possible project was one that avoided overt | |
intrusions or exclusions in so important a resource. Thus, the general | |
decision to transmit the original collection as clearly as possible with | |
the widest possible avenues for use led to other decisions: 1) To encode | |
the data or not, SGML or not, TEI or not. Again, the expected user | |
community asserted the need for normative tagging structures of important | |
humanities texts, and the TEI seemed the most appropriate structure for | |
that purpose. Research librarians, who are trained to view the larger | |
impact of electronic text sources on 80 or 90 or 100 doctoral | |
disciplines, loudly approved the decision to include tagging. They see | |
what is coming better than the specialist who is completely focused on | |
one edition of Ambrose's De Anima, and they also understand that the | |
potential uses exceed present expectations. 2) What will be tagged and | |
what will not. Once again, the board realized that one must tag the | |
obvious. But in no way should one attempt to identify through encoding | |
schemes every single discrete area of a text that might someday be | |
searched. That was another decision. Searching by a column number, an | |
author, a word, a volume, permitting combination searches, and tagging | |
notations seemed logical choices as core elements. 3) How does one make | |
the data available? Tying it to a CD-ROM edition creates limitations, | |
but a magnetic tape file that is very large, is accompanied by the | |
encoding specifications, and that allows one to make local modifications | |
also allows one to incorporate any changes one may desire within the | |
bounds of private research, though exporting tag files from a CD-ROM | |
could serve just as well. Since no one on the board could possibly | |
anticipate each and every way in which a scholar might choose to mine | |
this data bank, it was decided to satisfy the basics and make some | |
provisions for what might come. 4) Not to encode the database would rob | |
it of the interchangeability and portability these important texts should | |
accommodate. For CALALUCA, the extensive options presented by full-text | |
searching require care in text selection and strongly support encoding of | |
data to facilitate the widest possible search strategies. Better | |
software can always be created, but summoning the resources, the people, | |
and the energy to reconvert the text is another matter. | |
PLD is being encoded, captured, and distributed, because to | |
Chadwyck-Healey and the board it offers the widest possible array of | |
future research applications that can be seen today. CALALUCA concluded | |
by urging the encoding of all important text sources in whatever way | |
seems most appropriate and durable at the time, without blanching at the | |
thought that one's work may require emendation in the future. (Thus, | |
Chadwyck-Healey produced a very large humanities text database before the | |
final release of the TEI Guidelines.) | |
****** | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
DISCUSSION * Creating texts with markup advocated * Trends in encoding * | |
The TEI and the issue of interchangeability of standards * A | |
misconception concerning the TEI * Implications for an institution like | |
LC in the event that a multiplicity of DTDs develops * Producing images | |
as a first step towards possible conversion to full text through | |
character recognition * The AAP tag sets as a common starting point and | |
the need for caution * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
HOCKEY prefaced the discussion that followed with several comments in | |
favor of creating texts with markup and on trends in encoding. In the | |
future, when many more texts are available for on-line searching, real | |
problems in finding what is wanted will develop, if one is faced with | |
millions of words of data. It therefore becomes important to consider | |
putting markup in texts to help searchers home in on the actual things | |
they wish to retrieve. Various approaches to refining retrieval methods | |
toward this end include building on a computer version of a dictionary | |
and letting the computer look up words in it to obtain more information | |
about the semantic structure or semantic field of a word, its grammatical | |
structure, and syntactic structure. | |
HOCKEY commented on the present keen interest in the encoding world | |
in creating: 1) machine-readable versions of dictionaries that can be | |
initially tagged in SGML, which gives a structure to the dictionary entry; | |
these entries can then be converted into a more rigid or otherwise | |
different database structure inside the computer, which can be treated as | |
a dynamic tool for searching mechanisms; 2) large bodies of text to study | |
the language. In order to incorporate more sophisticated mechanisms, | |
more about how words behave needs to be known, which can be learned in | |
part from information in dictionaries. However, the last ten years have | |
seen much interest in studying the structure of printed dictionaries | |
converted into computer-readable form. The information one derives about | |
many words from those is only partial, one or two definitions of the | |
common or the usual meaning of a word, and then numerous definitions of | |
unusual usages. If the computer is using a dictionary to help retrieve | |
words in a text, it needs much more information about the common usages, | |
because those are the ones that occur over and over again. Hence the | |
current interest in developing large bodies of text in computer-readable | |
form in order to study the language. Several projects are engaged in | |
compiling, for example, 100 million words. HOCKEY described one with | |
which she was associated briefly at Oxford University involving | |
compilation of 100 million words of British English: about 10 percent of | |
that will contain detailed linguistic tagging encoded in SGML; it will | |
have word class taggings, with words identified as nouns, verbs, | |
adjectives, or other parts of speech. This tagging can then be used by | |
programs which will begin to learn a bit more about the structure of the | |
language and can then go on to tag more text. | |
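
The bootstrapping HOCKEY describes can be illustrated with a deliberately
tiny sketch: a hand-tagged seed is used to learn each word's most frequent
word class, and that knowledge is then applied to untagged text. The seed
words, the tag names, and the most-frequent-tag rule are all assumptions
made for illustration; the summary does not describe the project's methods
at this level of detail.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged seed: (word, word class) pairs.
SEED = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("walk", "NOUN"),
    ("dogs", "NOUN"), ("walk", "VERB"),
    ("they", "PRON"), ("walk", "VERB"),
]

def train(tagged_words):
    """Map each word to the word class it most often carries in the seed."""
    counts = defaultdict(Counter)
    for word, word_class in tagged_words:
        counts[word.lower()][word_class] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, unknown="UNK"):
    """Tag new text; words never seen in the seed are marked as unknown."""
    return [(word, model.get(word.lower(), unknown)) for word in words]

model = train(SEED)
tagged = tag(["The", "dog", "walk", "quickly"], model)
print(tagged)
```

Words the seed has never seen come back marked as unknown; once a person
corrects them, they can be added to the seed, which is the sense in which
more accurate tagging allows a larger tagged body of text to be built.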
HOCKEY said that the more that is tagged accurately, the more one can | |
refine the tagging process and thus the bigger body of text one can build | |
up with linguistic tagging incorporated into it. Hence, the more tagging | |
or annotation there is in the text, the more one may begin to learn about | |
language and the more it will help accomplish more intelligent OCR. She | |
recommended the development of software tools that will help one begin to | |
understand more about a text, which can then be applied to scanning | |
images of that text in that format and to using more intelligence to help | |
one interpret or understand the text. | |
HOCKEY posited the need to think about common methods of text-encoding | |
for a long time to come, because building these large bodies of text is | |
extremely expensive and will only be done once. | |
In the more general discussion on approaches to encoding that followed, | |
these points were made: | |
BESSER identified the underlying problem with standards that all have to | |
struggle with in adopting a standard, namely, the tension between a very | |
highly defined standard that is very interchangeable but does not work | |
for everyone because something is lacking, and a standard that is less | |
defined, more open, more adaptable, but less interchangeable. Contending | |
that the way in which people use SGML is not sufficiently defined, BESSER | |
wondered 1) if people resist the TEI because they think it is too defined | |
in certain things they do not fit into, and 2) how progress with | |
interchangeability can be made without frightening people away. | |
SPERBERG-McQUEEN replied that the published drafts of the TEI had met | |
with surprisingly little objection on the grounds that they do not allow | |
one to handle X or Y or Z. Particular concerns of the affiliated | |
projects have led, in practice, to discussions of how extensions are to | |
be made; the primary concern of any project has to be how it can be | |
represented locally, thus making interchange secondary. The TEI has | |
received much criticism based on the notion that everything in it is | |
required or even recommended, which, as it happens, is a misconception | |
from the beginning, because none of it is required and very little is | |
actually actively recommended for all cases, except that one document | |
one's source. | |
SPERBERG-McQUEEN agreed with BESSER about this trade-off: all the | |
projects in a set of twenty TEI-conformant projects will not necessarily | |
tag the material in the same way. One result of the TEI will be that the | |
easiest problems will be solved--those dealing with the external form of | |
the information; but the problem that is hardest in interchange is that | |
one is not encoding what another wants, and vice versa. Thus, after | |
the adoption of a common notation, the differences in the underlying | |
conceptions of what is interesting about texts become more visible. | |
The success of a standard like the TEI will lie in the ability of | |
the recipient of interchanged texts to use some of what it contains | |
and to add the information that was not encoded that one wants, in a | |
layered way, so that texts can be gradually enriched and one does not | |
have to put in everything all at once. Hence, having a well-behaved | |
markup scheme is important. | |
STEVENS followed up on the paradoxical analogy that BESSER alluded to in | |
the example of the MARC records, namely, the formats that are the same | |
except that they are different. STEVENS drew a parallel between | |
document-type definitions and MARC records for books and serials and maps, | |
where one has a tagging structure and there is a text-interchange. | |
STEVENS opined that the producers of the information will set the terms | |
for the standard (i.e., develop document-type definitions for the users | |
of their products), creating a situation that will be problematical for | |
an institution like the Library of Congress, which will have to deal with | |
the DTDs in the event that a multiplicity of them develops. Thus, | |
numerous people are seeking a standard but cannot find the tag set that | |
will be acceptable to them and their clients. SPERBERG-McQUEEN agreed | |
with this view, and said that the situation was in a way worse: attempting | |
to unify arbitrary DTDs resembled attempting to unify a MARC record with a | |
bibliographic record done according to the Prussian instructions. | |
According to STEVENS, this situation occurred very early in the process. | |
WATERS recalled from early discussions on Project Open Book the concern | |
of many people that merely by producing images, POB was not really | |
enhancing intellectual access to the material. Nevertheless, not wishing | |
to overemphasize the opposition between imaging and full text, WATERS | |
stated that POB views getting the images as a first step toward possibly | |
converting to full text through character recognition, if the technology | |
is appropriate. WATERS also emphasized that encoding is involved even | |
with a set of images. | |
SPERBERG-McQUEEN agreed with WATERS that one can create an SGML document | |
consisting wholly of images. At first sight, organizing graphic images | |
with an SGML document may not seem to offer great advantages, but the | |
advantages of the scheme WATERS described would be precisely that | |
ability to move into something that is more of a multimedia document: | |
a combination of transcribed text and page images. WEIBEL concurred in | |
this judgment, offering evidence from Project ADAPT, where a page is | |
divided into text elements and graphic elements, and in fact the text | |
elements are organized by columns and lines. These lines may be used as | |
the basis for distributing documents in a network environment. As one | |
develops software intelligent enough to recognize what those elements | |
are, it makes sense to apply SGML initially to an image that may, in | |
fact, ultimately become more and more text, either through OCR or edited | |
OCR or even just through keying. For WATERS, the labor of composing the | |
document and saying this set of documents or this set of images belongs | |
to this document constitutes a significant investment. | |
WEIBEL also made the point that the AAP tag sets, while not excessively | |
prescriptive, offer a common starting point; they do not define the | |
structure of the documents, though. They have some recommendations about | |
DTDs one could use as examples, but they do just suggest tag sets. For | |
example, the CORE project attempts to use the AAP markup as much as | |
possible, but there are clearly areas where structure must be added. | |
That in no way contradicts the use of AAP tag sets. | |
SPERBERG-McQUEEN noted that the TEI prepared a long working paper early | |
on about the AAP tag set and what it lacked that the TEI thought it | |
needed, and a fairly long critique of the naming conventions, which has | |
led to a very different style of naming in the TEI. He stressed the | |
importance of the opposition between prescriptive markup, the kind that a | |
publisher or anybody can do when producing documents de novo, and | |
descriptive markup, in which one has to take what the text carrier | |
provides. In these particular tag sets it is easy to overemphasize this | |
opposition, because the AAP tag set is extremely flexible. Even if one | |
just used the DTDs, they allow almost anything to appear almost anywhere. | |
****** | |
SESSION VI. COPYRIGHT ISSUES | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
PETERS * Several cautions concerning copyright in an electronic | |
environment * Review of copyright law in the United States * The notion | |
of the public good and the desirability of incentives to promote it * | |
What copyright protects * Works not protected by copyright * The rights | |
of copyright holders * Publishers' concerns in today's electronic | |
environment * Compulsory licenses * The price of copyright in a digital | |
medium and the need for cooperation * Additional clarifications * Rough | |
justice oftentimes the outcome in numerous copyright matters * Copyright | |
in an electronic society * Copyright law always only sets up the | |
boundaries; anything can be changed by contract * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
Marybeth PETERS, policy planning adviser to the Register of Copyrights, | |
Library of Congress, made several general comments and then opened the | |
floor to discussion of subjects of interest to the audience. | |
Having attended several sessions in an effort to gain a sense of what | |
people did and where copyright would affect their lives, PETERS expressed | |
the following cautions: | |
* If one takes and converts materials and puts them in new forms, | |
then, from a copyright point of view, one is creating something and | |
will receive some rights. | |
* However, if what one is converting already exists, a question | |
immediately arises about the status of the materials in question. | |
* Putting something in the public domain in the United States offers | |
some freedom from anxiety, but distributing it throughout the world | |
on a network is another matter, even if one has put it in the public | |
domain in the United States. Re foreign laws, very frequently a | |
work can be in the public domain in the United States but protected | |
in other countries. Thus, one must consider all of the places a
work may reach, lest one unwittingly face a suit for copyright
infringement, or at least a letter demanding discussion of what one
is doing.
PETERS reviewed copyright law in the United States. The U.S. | |
Constitution effectively states that Congress has the power to enact | |
copyright laws for two purposes: 1) to encourage the creation and | |
dissemination of intellectual works for the good of society as a whole; | |
and, significantly, 2) to give creators and those who package and | |
disseminate materials the economic rewards that are due them. | |
Congress strives to strike a balance, which at times can become an | |
emotional issue. The United States has never accepted the notion of the | |
natural right of an author so much as it has accepted the notion of the | |
public good and the desirability of incentives to promote it. This state | |
of affairs, however, has created strains on the international level and | |
is the reason for several of the differences in the laws that we have. | |
Today the United States protects almost every kind of work that can be | |
called an expression of an author. The standard for gaining copyright | |
protection is simply originality. This is a low standard: it means that
a work is not copied from something else and that it shows a certain
minimal amount of authorship. One can also acquire copyright protection
for making a new version of preexisting material, provided it manifests | |
some spark of creativity. | |
However, copyright does not protect ideas, methods, systems--only the way | |
that one expresses those things. Nor does copyright protect anything | |
that is mechanical, anything that does not involve choice, or criteria | |
concerning whether or not one should do a thing. For example, the | |
results of a process called declicking, in which one mechanically removes | |
impure sounds from old recordings, are not copyrightable. On the other
hand, choosing to record a song digitally and to increase the sound of
violins or to bring up the tympani yields results of conversion that
are copyrightable. Moreover, if a work is protected by copyright in
the United States, one generally needs the permission of the copyright | |
owner to convert it. Normally, who will own the new--that is,
converted--material is a matter of contract. In the absence of a
contract, the
person who creates the new material is the author and owner. But people | |
do not generally think about the copyright implications until after the | |
fact. PETERS stressed the need when dealing with copyrighted works to | |
think about copyright in advance. One's bargaining power is much greater | |
up front than it is down the road. | |
PETERS next discussed works not protected by copyright. For example, any
work done by a federal employee as part of his or her official duties is
in the public domain in the United States. Whether such a work is also in
the public domain outside the United States is not wholly free of doubt.
Other materials in the public domain include:
any works published more than seventy-five years ago, and any work | |
published in the United States more than twenty-eight years ago, whose | |
copyright was not renewed. In talking about the new technology and | |
putting material in a digital form to send all over the world, PETERS | |
cautioned, one must keep in mind that while the rights may not be an | |
issue in the United States, they may be in different parts of the world, | |
where most countries previously employed a copyright term of the life of | |
the author plus fifty years. | |
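The two age-based rules of thumb just stated can be written out as a
short check. The Python sketch below is only a restatement of those two
rules as of the year of the Workshop; the function name and parameters
are hypothetical, and the sketch ignores works by federal employees,
foreign law, and every other factor a real determination would weigh.

```python
WORKSHOP_YEAR = 1992  # the rules below are stated as of this year


def likely_public_domain_us(pub_year: int, renewed: bool) -> bool:
    """Apply the two rules of thumb given for U.S.-published works.

    pub_year -- year of first publication in the United States
    renewed  -- whether the copyright was renewed
    """
    age = WORKSHOP_YEAR - pub_year
    if age > 75:
        return True                  # published more than 75 years ago
    return age > 28 and not renewed  # over 28 years old and never renewed


print(likely_public_domain_us(1900, renewed=True))
print(likely_public_domain_us(1950, renewed=False))
print(likely_public_domain_us(1950, renewed=True))
```

A result of True here says nothing about the status of the same work
outside the United States, which is the point of the caution above.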
PETERS next reviewed the economics of copyright holding. Simply put,
the first economic right is the right to control the reproduction of a
work in any form. Economic rights belong to the author or, in the case of
a work made for hire, the employer. The second right, which is critical
to conversion,
is the right to change a work. The right to make new versions is perhaps | |
one of the most significant rights of authors, particularly in an | |
electronic world. The third right is the right to publish the work and | |
the right to disseminate it, something that everyone who deals in an | |
electronic medium needs to know. The basic rule is that if a copy is
sold, all rights of distribution are extinguished with the sale of that copy.
The key is that it must be sold. A number of companies overcome this | |
obstacle by leasing or renting their product. These companies argue that | |
if the material is rented or leased and not sold, they control the uses | |
of a work. The fourth right, and one very important in a digital world, | |
is a right of public performance, which means the right to show the work | |
sequentially. For example, copyright owners control the showing of a | |
CD-ROM product in a public place such as a public library. The reverse | |
side of public performance is something called the right of public | |
display. Moral rights also exist, which at the federal level apply only | |
to very limited visual works of art, but in theory may apply under | |
contract and other principles. Moral rights may include the right of an | |
author to have his or her name on a work, the right of attribution, and | |
the right to object to distortion or mutilation--the right of integrity. | |
The way copyright law is worded gives much latitude to activities such as | |
preservation; to use of material for scholarly and research purposes when | |
the user does not make multiple copies; and to the generation of | |
facsimile copies of unpublished works by libraries for themselves and | |
other libraries. But the law does not allow anyone to become the | |
distributor of the product for the entire world. In today's electronic | |
environment, publishers are extremely concerned that the entire world is | |
networked and can obtain the information desired from a single copy in a | |
single library. Hence, if there is to be only one sale, which publishers | |
may choose to live with, they will obtain their money in other ways, for | |
example, from access and use. Hence, the development of site licenses | |
and other kinds of agreements to cover what publishers believe they | |
should be compensated for. Any solution that the United States takes | |
today has to consider the international arena. | |
Noting that the United States is a member of the Berne Convention and | |
subscribes to its provisions, PETERS described the permissions process. | |
She also defined compulsory licenses. A compulsory license, of which the | |
United States has had a few, builds into the law the right to use a work | |
subject to certain terms and conditions. In the international arena, | |
however, the ability to use compulsory licenses is extremely limited. | |
Thus, clearinghouses and other collectives comprise one option that has | |
succeeded in providing for use of a work. Often overlooked when one | |
begins to use copyrighted material and put products together is how | |
expensive the permissions process and managing it is. According to | |
PETERS, the price of copyright in a digital medium, whatever solution is | |
worked out, will include managing and assembling the database. She | |
strongly recommended that publishers and librarians or people with | |
various backgrounds cooperate to work out administratively feasible | |
systems, in order to produce better results. | |
In the lengthy question-and-answer period that followed PETERS's | |
presentation, the following points emerged: | |
* The Copyright Office maintains that anything mechanical and | |
totally exhaustive probably is not protected. In the event that | |
what an individual did in developing potentially copyrightable | |
material is not understood, the Copyright Office will ask about the | |
creative choices the applicant made or declined to make. As a
practical matter, if one believes she or he has made enough of those | |
choices, that person has a right to assert a copyright and someone | |
else must assert that the work is not copyrightable. The more | |
mechanical, the more automatic, a thing is, the less likely it is to | |
be copyrightable. | |
* Nearly all photographs are deemed to be copyrightable, but no one | |
worries about them much, because everyone is free to take the same | |
image. Thus, a photographic copyright represents what is called a | |
"thin" copyright. The photograph itself must be duplicated, in | |
order for copyright to be violated. | |
* The Copyright Office takes the position that X-rays are not | |
copyrightable because they are mechanical. It can be argued | |
whether or not image enhancement in scanning can be protected. One | |
must exercise care with material created with public funds and | |
generally in the public domain. An article written by a federal | |
employee, if written as part of official duties, is not | |
copyrightable. However, control over a scientific article written | |
by a National Institutes of Health grantee (i.e., someone who | |
receives money from the U.S. government), depends on NIH policy. If | |
the government agency has no policy (and that policy can be | |
contained in its regulations, the contract, or the grant), the | |
author retains copyright. If a provision of the contract, grant, or | |
regulation states that there will be no copyright, then it does not | |
exist. When a work is created, copyright automatically comes into | |
existence unless something exists that says it does not. | |
* An enhanced electronic copy of a print copy of an older reference | |
work in the public domain that does not contain copyrightable new | |
material is a purely mechanical rendition of the original work, and | |
is not copyrightable. | |
* Usually, when a work enters the public domain, nothing can remove | |
it. For example, Congress recently passed into law the concept of | |
automatic renewal, which means that copyright on any work published | |
between 1964 and 1978 does not have to be renewed in order to
receive a seventy-five-year term. But any work published before
1964 and not renewed is in the public domain.
* Concerning whether or not the United States keeps track of when | |
authors die, nothing was ever done, nor is anything being done at | |
the moment by the Copyright Office. | |
* Software that drives a mechanical process is itself copyrightable. | |
If one changes platforms, the software itself has a copyright. The | |
World Intellectual Property Organization will hold a symposium 28 | |
March through 2 April 1993, at Harvard University, on digital
technology, and will study this entire issue. If one purchases a | |
computer software package, such as MacPaint, and creates something | |
new, one receives protection only for that which has been added. | |
PETERS added that often in copyright matters, rough justice is the | |
outcome, for example, in collective licensing, ASCAP (i.e., American | |
Society of Composers, Authors, and Publishers), and BMI (i.e., Broadcast | |
Music, Inc.), where it may seem that the big guys receive more than their | |
due. Of course, people ought not to copy a creative product without | |
paying for it; there should be some compensation. But the truth of the | |
world, and it is not a great truth, is that the big guy gets played on | |
the radio more frequently than the little guy, who has to do much more | |
until he becomes a big guy. That is true of every author, every | |
composer, everyone, and, unfortunately, is part of life. | |
Copyright always originates with the author, except in cases of works | |
made for hire. (Most software falls into this category.) When an author | |
sends his article to a journal, he has not relinquished copyright, though | |
he retains the right to relinquish it. The author receives absolutely | |
everything. The less prominent the author, the more leverage the | |
publisher will have in contract negotiations. In order to transfer the | |
rights, the author must sign an agreement giving them away. | |
In an electronic society, it is important to be able to license a writer | |
and work out deals. With regard to use of a work, it usually is much | |
easier when a publisher holds the rights. In an electronic era, a real | |
problem arises when one is digitizing and making information available. | |
PETERS referred again to electronic licensing clearinghouses. Copyright | |
ought to remain with the author, but as one moves forward globally in the | |
electronic arena, a middleman who can handle the various rights becomes | |
increasingly necessary. | |
The notion of copyright law is that it resides with the individual, but | |
in an on-line environment, where a work can be adapted and tinkered with | |
by many individuals, there is concern. If changes are authorized and | |
there is no agreement to the contrary, the person who changes a work owns | |
the changes. To put it another way, the person who acquires permission | |
to change a work technically will become the author and the owner, unless | |
some agreement to the contrary has been made. It is typical for the | |
original publisher to try to control all of the versions and all of the | |
uses. Copyright law always only sets up the boundaries. Anything can be | |
changed by contract. | |
****** | |
SESSION VII. CONCLUSION | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
GENERAL DISCUSSION * Two questions for discussion * Different emphases in | |
the Workshop * Bringing the text and image partisans together * | |
Desiderata in planning the long-term development of something * Questions | |
surrounding the issue of electronic deposit * Discussion of electronic | |
deposit as an allusion to the issue of standards * Need for a directory | |
of preservation projects in digital form and for access to their | |
digitized files * CETH's catalogue of machine-readable texts in the | |
humanities * What constitutes a publication in the electronic world? * | |
Need for LC to deal with the concept of on-line publishing * LC's Network | |
Development Office exploring the limits of MARC as a standard in terms | |
of handling electronic information * Magnitude of the problem and the | |
need for distributed responsibility in order to maintain and store | |
electronic information * Workshop participants to be viewed as a starting | |
point * Development of a network version of AM urged * A step toward AM's | |
construction of some sort of apparatus for network access * A delicate | |
and agonizing policy question for LC * Re the issue of electronic | |
deposit, LC urged to initiate a catalytic process in terms of distributed | |
responsibility * Suggestions for cooperative ventures * Commercial | |
publishers' fears * Strategic questions for getting the image and text | |
people to think through long-term cooperation * Clarification of the | |
driving force behind both the Perseus and the Cornell Xerox projects * | |
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |
In his role as moderator of the concluding session, GIFFORD raised two | |
questions he believed would benefit from discussion: 1) Are there enough | |
commonalities among those of us that have been here for two days so that | |
we can see courses of action that should be taken in the future? And, if | |
so, what are they and who might take them? 2) Partly derivative from | |
that, but obviously very dangerous to LC as host, do you see a role for | |
the Library of Congress in all this? Of course, the Library of Congress | |
holds a rather special status in a number of these matters, because it is | |
not perceived as a player with an economic stake in them, but are there | |
roles that LC can play that can help advance us toward where we are heading? | |
Describing himself as an uninformed observer of the technicalities of the | |
last two days, GIFFORD detected three different emphases in the Workshop: | |
1) people who are very deeply committed to text; 2) people who are almost
passionate about images; and 3) a few people who are very committed to
what happens to the networks, in other words, to the new networking
dimension: the accessibility, the processability, and the portability
of all this across the networks. How do we pull those three together?
Adding a question that reflected HOCKEY's comment that this was the | |
fourth workshop she had attended in the previous thirty days, FLEISCHHAUER | |
wondered to what extent this meeting had reinvented the wheel, or if it | |
had contributed anything in the way of bringing together a different group | |
of people from those who normally appear on the workshop circuit. | |
HOCKEY confessed to being struck, at this meeting and at the one the
Electronic Peirce Consortium organized the previous week, that each was
a coming together of people working on texts and not images. Attempting to
bring the two together is something we ought to be thinking about for the | |
future: How one can think about working with image material to begin | |
with, but structuring it and digitizing it in such a way that at a later | |
stage it can be interpreted into text, and find a common way of building | |
text and images together so that they can be used jointly in the future, | |
with the network support to begin there because that is how people will | |
want to access it. | |
In planning the long-term development of something, which is what is | |
being done in electronic text, HOCKEY stressed the importance not only | |
of discussing the technical aspects of how one does it but particularly | |
of thinking about what the people who use the stuff will want to do. | |
But conversely, there are numerous things that people start to do with | |
electronic text or material that nobody ever thought of in the beginning. | |
LESK, in response to the question concerning the role of the Library of | |
Congress, remarked the often suggested desideratum of having electronic | |
deposit: Since everything is now computer-typeset, an entire decade of | |
material that was machine-readable exists, but the publishers frequently | |
did not save it; has LC taken any action to have its copyright deposit | |
operation start collecting these machine-readable versions? In the | |
absence of PETERS, GIFFORD replied that the question was being | |
actively considered but that that was only one dimension of the problem. | |
Another dimension is the whole question of the integrity of the original | |
electronic document. It becomes highly important in science to prove | |
authorship. How will that be done? | |
ERWAY explained that, under the old policy, to make a claim for a | |
copyright for works that were published in electronic form, including | |
software, one had to submit a paper copy of the first and last twenty | |
pages of code--something that represented the work but did not include | |
the entire work itself and had little value to anyone. As a temporary | |
measure, LC has claimed the right to demand electronic versions of | |
electronic publications. This measure entails a proactive role for the | |
Library to say that it wants a particular electronic version. Publishers | |
then have perhaps a year to submit it. But the real problem for LC is | |
what to do with all this material in all these different formats. Will | |
the Library mount it? How will it give people access to it? How does LC | |
keep track of the appropriate computers, software, and media? The situation | |
is so hard to control, ERWAY said, that it makes sense for each publishing | |
house to maintain its own archive. But LC cannot enforce that either. | |
GIFFORD acknowledged LESK's suggestion that establishing a priority | |
offered the solution, albeit a fairly complicated one. But who maintains | |
that register?, he asked. GRABER noted that LC does attempt to collect a | |
Macintosh version and the IBM-compatible version of software. It does | |
not collect other versions. But while true for software, BYRUM observed, | |
this reply does not speak to materials, that is, all the materials that | |
were published that were on somebody's microcomputer or driver tapes | |
at a publishing office across the country. LC does well to acquire | |
specific machine-readable products selectively that were intended to be | |
machine-readable. Materials that were in machine-readable form at one time, | |
BYRUM said, would be beyond LC's capability at the moment, insofar as | |
attempting to acquire, organize, and preserve them is concerned--and
preservation would be the most important consideration. In this | |
connection, GIFFORD reiterated the need to work out some sense of | |
distributive responsibility for a number of these issues, which | |
inevitably will require significant cooperation and discussion. | |
Nobody can do it all. | |
LESK suggested that some publishers may look with favor on LC beginning | |
to serve as a depository of tapes in an electronic manuscript standard. | |
Publishers may view this as a service that they did not have to perform | |
and they might send in tapes. However, SPERBERG-McQUEEN countered, | |
although publishers have had equivalent services available to them for a | |
long time, the electronic text archive has never turned tapes away or
been flooded with them and is forever sending feedback to the depositor.
Some publishers do send in tapes. | |
ANDRE viewed this discussion as an allusion to the issue of standards. | |
She recommended that the AAP standard and the TEI, which has already been | |
somewhat harmonized internationally and which also shares several | |
compatibilities with the AAP, be harmonized to ensure sufficient | |
compatibility in the software. She drew the line at saying LC ought to | |
be the locus or forum for such harmonization. | |
Taking the group in a slightly different direction, but one where at | |
least in the near term LC might play a helpful role, LYNCH remarked the | |
plans of a number of projects to carry out preservation by creating | |
digital images that will end up in on-line or near-line storage at some | |
institution. Presumably, LC will link this material somehow to its | |
on-line catalog in most cases. Thus, it is in a digital form. LYNCH had | |
the impression that many of these institutions would be willing to make | |
those files accessible to other people outside the institution, provided | |
that there is no copyright problem. This desideratum will require | |
propagating the knowledge that those digitized files exist, so that they | |
can end up in other on-line catalogs. Although uncertain about the | |
mechanism for achieving this result, LYNCH said that it warranted | |
scrutiny because it seemed to be connected to some of the basic issues of | |
cataloging and distribution of records. It would be foolish, given the | |
amount of work that all of us have to do and our meager resources, to | |
discover multiple institutions digitizing the same work. Re microforms, | |
LYNCH said, we are in pretty good shape. | |
BATTIN called this a big problem and noted that the Cornell people (who | |
had already departed) were working on it. At issue from the beginning | |
was to learn how to catalog that information into RLIN and then into | |
OCLC, so that it would be accessible. That issue remains to be resolved. | |
LYNCH rejoined that putting it into OCLC or RLIN was helpful insofar as | |
somebody who is thinking of performing preservation activity on that work | |
could learn about it. It is not necessarily helpful for institutions to | |
make that available. BATTIN opined that the idea was that it not only be | |
for preservation purposes but for the convenience of people looking for | |
this material. She endorsed LYNCH's dictum that duplication of this | |
effort was to be avoided by every means. | |
HOCKEY informed the Workshop about one major current activity of CETH, | |
namely a catalogue of machine-readable texts in the humanities. Held on | |
RLIN at present, the catalogue has been concentrated on ASCII as opposed | |
to digitized images of text. She is exploring ways to improve the | |
catalogue and make it more widely available, and welcomed suggestions | |
about these concerns. CETH owns the records, which are not just | |
restricted to RLIN, and can distribute them however it wishes. | |
Taking up LESK's earlier question, BATTIN inquired whether LC, since it | |
is accepting electronic files and designing a mechanism for dealing with | |
that rather than putting books on shelves, would become responsible for | |
the National Copyright Depository of Electronic Materials. Of course | |
that could not be accomplished overnight, but it would be something LC | |
could plan for. GIFFORD acknowledged that much thought was being devoted | |
to that set of problems and returned the discussion to the issue raised | |
by LYNCH--whether or not putting the kind of records that both BATTIN and | |
HOCKEY have been talking about in RLIN is not a satisfactory solution. | |
It seemed to him that RLIN answered LYNCH's original point concerning | |
some kind of directory for these kinds of materials. In a situation | |
where somebody is attempting to decide whether or not to scan this or | |
film that or to learn whether or not someone has already done so, LYNCH | |
suggested, RLIN is helpful, but it is not helpful in the case of a local, | |
on-line catalogue. Further, one would like to have her or his system be | |
aware that that exists in digital form, so that one can present it to a | |
patron, even though one did not digitize it, if it is out of copyright. | |
The only way to make those linkages would be to perform a tremendous | |
amount of real-time look-up, which would be awkward at best, or | |
periodically to yank the whole file from RLIN and match it against one's | |
own stuff, which is a nuisance. | |
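The second option LYNCH mentioned, periodically pulling the whole file
and matching it against a local catalogue, amounts to a batch join on
some shared identifier. The Python sketch below shows the idea under
loud assumptions: the record layout, the work_key identifier, the
function name, and the sample data are all hypothetical and do not
describe any actual RLIN file or format.

```python
from typing import Dict, Iterable, List, Tuple


def match_against_union_file(
    union_file: Iterable[Tuple[str, str]],
    local_catalogue: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map local item ids to institutions holding a digitized copy.

    union_file      -- (work_key, holding_institution) pairs from the dump
    local_catalogue -- local item id -> work_key
    Both structures, and the work_key itself, are hypothetical.
    """
    # Index the dump once so that each local item needs only one lookup.
    holders: Dict[str, List[str]] = {}
    for work_key, institution in union_file:
        holders.setdefault(work_key, []).append(institution)

    return {
        item_id: holders[work_key]
        for item_id, work_key in local_catalogue.items()
        if work_key in holders
    }


# Made-up sample data, for illustration only.
dump = [("work-0001", "Library A"), ("work-0002", "Library B")]
local = {"item-001": "work-0001", "item-002": "work-0099"}
print(match_against_union_file(dump, local))
```

The hard part in practice is the identifier: two institutions must
describe the same work in a way that can be matched, which is the
cataloguing question this discussion keeps returning to.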
But where, ERWAY inquired, does one stop including things that are | |
available with Internet, for instance, in one's local catalogue? | |
It almost seems that that is LC's means to acquire access to them. | |
That represents LC's new form of library loan. Perhaps LC's new on-line | |
catalogue is an amalgamation of all these catalogues on line. LYNCH | |
conceded that perhaps that was true in the very long term, but was not | |
applicable to scanning in the short term. In his view, the totals cited | |
by Yale, 10,000 books over perhaps a four-year period, and 1,000-1,500 | |
books from Cornell, were not big numbers, while searching all over | |
creation for relatively rare occurrences will prove to be less efficient. | |
As GIFFORD wondered if this would not be a separable file on RLIN and | |
could be requested from them, BATTIN interjected that it was easily | |
accessible to an institution. SEVERTSON pointed out that that file, cum | |
enhancements, was available with reference information on CD-ROM, which | |
makes it a little more available. | |
In HOCKEY's view, the real question facing the Workshop is what to put in | |
this catalogue, because that raises the question of what constitutes a | |
publication in the electronic world. (WEIBEL interjected that Eric Joule | |
in OCLC's Office of Research is also wrestling with this particular | |
problem, while GIFFORD thought it sounded fairly generic.) HOCKEY | |
contended that a majority of texts in the humanities are in the hands | |
of either a small number of large research institutions or individuals | |
and are not generally available for anyone else to access at all. | |
She wondered if these texts ought to be catalogued. | |
After argument proceeded back and forth for several minutes over why | |
cataloguing might be a necessary service, LEBRON suggested that this | |
issue involved the responsibility of a publisher. The fact that someone | |
has created something electronically and keeps it under his or her | |
control does not constitute publication. Publication implies | |
dissemination. While it would be important for a scholar to let other | |
people know that this creation exists, in many respects this is no | |
different from an unpublished manuscript. That is what is being
accessed, except that now one is looking at it not in hard copy but in
the electronic environment.
LEBRON expressed puzzlement at the variety of ways electronic publishing | |
has been viewed. Much of what has been discussed throughout these two | |
days has concerned CD-ROM publishing, whereas in the on-line environment | |
that she confronts, the constraints and challenges are very different. | |
Sooner or later LC will have to deal with the concept of on-line | |
publishing. Taking up the comment ERWAY made earlier about storing | |
copies, LEBRON gave her own journal as an example. How would she deposit | |
OJCCT for copyright?, she asked, because the journal will exist in the | |
mainframe at OCLC and people will be able to access it. Here the | |
situation is different, ownership versus access, and is something that | |
arises with publication in the on-line environment, faster than is | |
sometimes realized. Lacking clear answers to all of these questions | |
herself, LEBRON did not anticipate that LC would be able to take a role | |
in helping to define some of them for quite a while. | |
GREENFIELD observed that LC's Network Development Office is attempting, | |
among other things, to explore the limits of MARC as a standard in terms | |
of handling electronic information. GREENFIELD also noted that Rebecca | |
GUENTHER from that office gave a paper to the American Society for | |
Information Science (ASIS) summarizing several of the discussion papers | |
that were coming out of the Network Development Office. GREENFIELD said | |
he understood that that office had a list-server soliciting just the kind | |
of feedback received today concerning the difficulties of identifying and | |
cataloguing electronic information. GREENFIELD hoped that everybody | |
would be aware of that and somehow contribute to that conversation. | |
Noting two of LC's roles, first, to act as a repository of record for | |
material that is copyrighted in this country, and second, to make | |
materials it holds available in some limited form to a clientele that | |
goes beyond Congress, BESSER suggested that it was incumbent on LC to | |
extend those responsibilities to all the things being published in | |
electronic form. This would mean eventually accepting electronic | |
formats. LC could require that at some point they be in a certain | |
limited set of formats, and then develop mechanisms for allowing people | |
to access those in the same way that other things are accessed. This | |
does not imply that they are on the network and available to everyone. | |
LC does that with most of its bibliographic records, BESSER said, which | |
end up migrating to the utility (e.g., OCLC) or somewhere else. But just | |
as most of LC's books are available in some form through interlibrary | |
loan or some other mechanism, so electronic materials ought to be | |
available to others in some form, though with some copyright | |
considerations. BESSER was not suggesting that these mechanisms be | |
established tomorrow, only that they seemed to fall within LC's purview, | |
and that there should be long-range plans to establish them. | |
Acknowledging that those from LC in the room agreed with BESSER | |
concerning the need to confront difficult questions, GIFFORD underscored | |
the magnitude of the problem of what to keep and what to select. GIFFORD | |
noted that LC currently receives some 31,000 items per day, not counting | |
electronic materials, and argued for much more distributed responsibility | |
in order to maintain and store electronic information. | |
BESSER responded that the assembled group could be viewed as a starting | |
point, whose initial operating premise could be helping to move in this | |
direction and defining how LC could do so, for example, in areas of | |
standardization or distribution of responsibility. | |
FLEISCHHAUER added that AM was fully engaged, wrestling with some of the | |
questions that pertain to the conversion of older historical materials, | |
which would be one thing that the Library of Congress might do. Several | |
points mentioned by BESSER and several others on this question have a | |
much greater impact on those who are concerned with cataloguing and the | |
networking of bibliographic information, as well as preservation itself. | |
Speaking directly to AM, which he considered a largely uncopyrighted | |
database, LYNCH urged development of a network version of AM, or | |
consideration of making the data in it available to people interested in | |
doing network multimedia. On account of the current great shortage of | |
digital data that is both appealing and unencumbered by complex rights | |
problems, this course of action could have a significant effect on making | |
network multimedia a reality. | |
In this connection, FLEISCHHAUER reported on a fragmentary prototype in | |
LC's Office of Information Technology Services that attempts to associate | |
digital images of photographs with cataloguing information in ways that | |
work within a local area network--a step, so to say, toward AM's | |
construction of some sort of apparatus for access. Further, AM has | |
attempted to use standard data forms in order to help make that | |
distinction between the access tools and the underlying data, and thus | |
believes that the database is networkable. | |
A delicate and agonizing policy question for LC, however, one that comes | |
back to resources and unfortunately affects this work, is how to find | |
some appropriate, honorable, and legal cost-recovery possibilities. A | |
certain skittishness concerning cost-recovery has made people unsure | |
exactly what to do. AM would be highly receptive to discussing further | |
LYNCH's offer to test or demonstrate its database in a network | |
environment, FLEISCHHAUER said. | |
Returning the discussion to what she viewed as the vital issue of | |
electronic deposit, BATTIN recommended that LC initiate a catalytic | |
process in terms of distributed responsibility, that is, bring together | |
the distributed organizations and set up a study group to look at all | |
these issues and see where we as a nation should move. The broader | |
issues of how we deal with the management of electronic information will | |
not disappear, but only grow worse. | |
LESK took up this theme and suggested that LC attempt to persuade one | |
major library in each state to deal with its state equivalent publisher, | |
which might produce a cooperative project that would be equitably | |
distributed around the country, and one in which LC would be dealing with | |
a minimal number of publishers and minimal copyright problems. | |
GRABER remarked on the recent development in the scientific community of a | |
willingness to use SGML and either deposit or interchange in a fairly | |
standardized format. He wondered if a similar movement was taking place | |
in the humanities. Although the National Library of Medicine found only | |
a few publishers to cooperate in a like venture two or three years ago, a | |
new effort might generate a much larger number willing to cooperate. | |
KIMBALL recounted his unit's (Machine-Readable Collections Reading Room) | |
troubles with the commercial publishers of electronic media in acquiring | |
materials for LC's collections, in particular the publishers' fear that | |
they would not be able to cover their costs and would lose control of | |
their products, that LC would give them away or sell them and make | |
profits from them. He doubted that the publishing industry was prepared | |
to move into this area at the moment, given its resistance to allowing LC | |
to use its machine-readable materials as the Library would like. | |
The copyright law now addresses compact disk as a medium, and LC can | |
request one copy of that, or two copies if it is the only version, and | |
can request copies of software, but that fails to address magazines or | |
books or anything like that which is in machine-readable form. | |
GIFFORD acknowledged the thorny nature of this issue, which he illustrated | |
with the example of the cumbersome process involved in putting a copy of a | |
scientific database on a LAN in LC's science reading room. He also | |
acknowledged that LC needs help and could enlist the energies and talents | |
of Workshop participants in thinking through a number of these problems. | |
GIFFORD returned the discussion to getting the image and text people to | |
think through together where they want to go in the long term. MYLONAS | |
conceded that her experience at the Pierce Symposium the previous week at | |
Georgetown University and this week at LC had forced her to reevaluate | |
her perspective on the usefulness of text as images. MYLONAS framed the | |
issues in a series of questions: How do we acquire machine-readable | |
text? Do we take pictures of it and perform OCR on it later? Is it | |
important to obtain very high-quality images and text, etc.? | |
FLEISCHHAUER agreed with MYLONAS's framing of strategic questions, adding | |
that a large institution such as LC probably has to do all of those | |
things at different times. Thus, the trick is to exercise judgment. The | |
Workshop had added to his and AM's considerations in making those | |
judgments. Concerning future meetings or discussions, MYLONAS suggested | |
that screening priorities would be helpful. | |
WEIBEL opined that the diversity reflected in this group was a sign both | |
of the health and of the immaturity of the field, and more time would | |
have to pass before we convince one another concerning standards. | |
An exchange between MYLONAS and BATTIN clarified the point that the | |
driving force behind both the Perseus and the Cornell Xerox projects was | |
the preservation of knowledge for the future, not simply for particular | |
research use. In the case of Perseus, MYLONAS said, the assumption was | |
that the texts would not be entered again into electronically readable | |
form. SPERBERG-McQUEEN added that a scanned image would not serve as an | |
archival copy for purposes of preservation in the case of, say, the Bill | |
of Rights, in the sense that the scanned images are effectively the | |
archival copies for the Cornell mathematics books. | |
*** *** *** ****** *** *** *** | |
Appendix I: PROGRAM | |
WORKSHOP | |
ON | |
ELECTRONIC | |
TEXTS | |
9-10 June 1992 | |
Library of Congress | |
Washington, D.C. | |
Supported by a Grant from the David and Lucile Packard Foundation | |
Tuesday, 9 June 1992 | |
NATIONAL DEMONSTRATION LAB, ATRIUM, LIBRARY MADISON | |
8:30 AM Coffee and Danish, registration | |
9:00 AM Welcome | |
Prosser Gifford, Director for Scholarly Programs, and Carl | |
Fleischhauer, Coordinator, American Memory, Library of | |
Congress | |
9:15 AM Session I. Content in a New Form: Who Will Use It and What | |
Will They Do? | |
Broad description of the range of electronic information. | |
Characterization of who uses it and how it is or may be used. | |
In addition to a look at scholarly uses, this session will | |
include a presentation on use by students (K-12 and college) | |
and the general public. | |
Moderator: James Daly | |
Avra Michelson, Archival Research and Evaluation Staff, | |
National Archives and Records Administration (Overview) | |
Susan H. Veccia, Team Leader, American Memory, User Evaluation, | |
and | |
Joanne Freeman, Associate Coordinator, American Memory, Library | |
of Congress (Beyond the scholar) | |
10:30- | |
11:00 AM Break | |
11:00 AM Session II. Show and Tell. | |
Each presentation to consist of a fifteen-minute | |
statement/show; group discussion will follow lunch. | |
Moderator: Jacqueline Hess, Director, National Demonstration | |
Lab | |
1. A classics project, stressing texts and text retrieval | |
more than multimedia: Perseus Project, Harvard | |
University | |
Elli Mylonas, Managing Editor | |
2. Other humanities projects employing the emerging norms of | |
the Text Encoding Initiative (TEI): Chadwyck-Healey's | |
The English Poetry Full Text Database and/or Patrologia | |
Latina Database | |
Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc. | |
3. American Memory | |
Carl Fleischhauer, Coordinator, and | |
Ricky Erway, Associate Coordinator, Library of Congress | |
4. Founding Fathers example from Packard Humanities | |
Institute: The Papers of George Washington, University | |
of Virginia | |
Dorothy Twohig, Managing Editor, and/or | |
David Woodley Packard | |
5. An electronic medical journal offering graphics and | |
full-text searchability: The Online Journal of Current | |
Clinical Trials, American Association for the Advancement | |
of Science | |
Maria L. Lebron, Managing Editor | |
6. A project that offers facsimile images of pages but omits | |
searchable text: Cornell math books | |
Lynne K. Personius, Assistant Director, Cornell | |
Information Technologies for Scholarly Information | |
Sources, Cornell University | |
12:30 PM Lunch (Dining Room A, Library Madison 620. Exhibits | |
available.) | |
1:30 PM Session II. Show and Tell (Cont'd.). | |
3:00- | |
3:30 PM Break | |
3:30- | |
5:30 PM Session III. Distribution, Networks, and Networking: Options | |
for Dissemination. | |
Published disks: University presses and public-sector | |
publishers, private-sector publishers | |
Computer networks | |
Moderator: Robert G. Zich, Special Assistant to the Associate | |
Librarian for Special Projects, Library of Congress | |
Clifford A. Lynch, Director, Library Automation, University of | |
California | |
Howard Besser, School of Library and Information Science, | |
University of Pittsburgh | |
Ronald L. Larsen, Associate Director of Libraries for | |
Information Technology, University of Maryland at College | |
Park | |
Edwin B. Brownrigg, Executive Director, Memex Research | |
Institute | |
6:30 PM Reception (Montpelier Room, Library Madison 619.) | |
****** | |
Wednesday, 10 June 1992 | |
DINING ROOM A, LIBRARY MADISON 620 | |
8:30 AM Coffee and Danish | |
9:00 AM Session IV. Image Capture, Text Capture, Overview of Text and | |
Image Storage Formats. | |
Moderator: William L. Hooton, Vice President of Operations, | |
I-NET | |
A) Principal Methods for Image Capture of Text: | |
Direct scanning | |
Use of microform | |
Anne R. Kenney, Assistant Director, Department of Preservation | |
and Conservation, Cornell University | |
Pamela Q.J. Andre, Associate Director, Automation, and | |
Judith A. Zidar, Coordinator, National Agricultural Text | |
Digitizing Program (NATDP), National Agricultural Library | |
(NAL) | |
Donald J. Waters, Head, Systems Office, Yale University Library | |
B) Special Problems: | |
Bound volumes | |
Conservation | |
Reproducing printed halftones | |
Carl Fleischhauer, Coordinator, American Memory, Library of | |
Congress | |
George Thoma, Chief, Communications Engineering Branch, | |
National Library of Medicine (NLM) | |
10:30- | |
11:00 AM Break | |
11:00 AM Session IV. Image Capture, Text Capture, Overview of Text and | |
Image Storage Formats (Cont'd.). | |
C) Image Standards and Implications for Preservation | |
Jean Baronas, Senior Manager, Department of Standards and | |
Technology, Association for Information and Image Management | |
(AIIM) | |
Patricia Battin, President, The Commission on Preservation and | |
Access (CPA) | |
D) Text Conversion: | |
OCR vs. rekeying | |
Standards of accuracy and use of imperfect texts | |
Service bureaus | |
Stuart Weibel, Senior Research Specialist, Online Computer | |
Library Center, Inc. (OCLC) | |
Michael Lesk, Executive Director, Computer Science Research, | |
Bellcore | |
Ricky Erway, Associate Coordinator, American Memory, Library of | |
Congress | |
Pamela Q.J. Andre, Associate Director, Automation, and | |
Judith A. Zidar, Coordinator, National Agricultural Text | |
Digitizing Program (NATDP), National Agricultural Library | |
(NAL) | |
12:30- | |
1:30 PM Lunch | |
1:30 PM Session V. Approaches to Preparing Electronic Texts. | |
Discussion of approaches to structuring text for the computer; | |
pros and cons of text coding, description of methods in | |
practice, and comparison of text-coding methods. | |
Moderator: Susan Hockey, Director, Center for Electronic Texts | |
in the Humanities (CETH), Rutgers and Princeton Universities | |
David Woodley Packard | |
C.M. Sperberg-McQueen, Editor, Text Encoding Initiative (TEI), | |
University of Illinois-Chicago | |
Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc. | |
3:30- | |
4:00 PM Break | |
4:00 PM Session VI. Copyright Issues. | |
Marybeth Peters, Policy Planning Adviser to the Register of | |
Copyrights, Library of Congress | |
5:00 PM Session VII. Conclusion. | |
General discussion. | |
What topics were omitted or given short shrift that anyone | |
would like to talk about now? | |
Is there a "group" here? What should the group do next, if | |
anything? What should the Library of Congress do next, if | |
anything? | |
Moderator: Prosser Gifford, Director for Scholarly Programs, | |
Library of Congress | |
6:00 PM Adjourn | |
*** *** *** ****** *** *** *** | |
Appendix II: ABSTRACTS | |
SESSION I | |
Avra MICHELSON Forecasting the Use of Electronic Texts by | |
Social Sciences and Humanities Scholars | |
This presentation explores the ways in which electronic texts are likely | |
to be used by the non-scientific scholarly community. Many of the | |
remarks are drawn from a report the speaker coauthored with Jeff | |
Rothenberg, a computer scientist at The RAND Corporation. | |
The speaker assesses 1) current scholarly use of information technology | |
and 2) the key trends in information technology most relevant to the | |
research process, in order to predict how social sciences and humanities | |
scholars are apt to use electronic texts. In introducing the topic, | |
current use of electronic texts is explored broadly within the context of | |
scholarly communication. From the perspective of scholarly | |
communication, the work of humanities and social sciences scholars | |
involves five processes: 1) identification of sources, 2) communication | |
with colleagues, 3) interpretation and analysis of data, 4) dissemination | |
of research findings, and 5) curriculum development and instruction. The | |
extent to which computation currently permeates aspects of scholarly | |
communication represents a viable indicator of the prospects for | |
electronic texts. | |
The discussion of current practice is balanced by an analysis of key | |
trends in the scholarly use of information technology. These include the | |
trends toward end-user computing and connectivity, which provide a | |
framework for forecasting the use of electronic texts through this | |
millennium. The presentation concludes with a summary of the ways in | |
which the nonscientific scholarly community can be expected to use | |
electronic texts, and the implications of that use for information | |
providers. | |
Susan VECCIA and Joanne FREEMAN Electronic Archives for the Public: | |
Use of American Memory in Public and | |
School Libraries | |
This joint discussion focuses on nonscholarly applications of electronic | |
library materials, specifically addressing use of the Library of Congress | |
American Memory (AM) program in a small number of public and school | |
libraries throughout the United States. AM consists of selected Library | |
of Congress primary archival materials, stored on optical media | |
(CD-ROM/videodisc), and presented with little or no editing. Many | |
collections are accompanied by electronic introductions and user's guides | |
offering background information and historical context. Collections | |
represent a variety of formats including photographs, graphic arts, | |
motion pictures, recorded sound, music, broadsides and manuscripts, | |
books, and pamphlets. | |
In 1991, the Library of Congress began a nationwide evaluation of AM in | |
different types of institutions. Test sites include public libraries, | |
elementary and secondary school libraries, college and university | |
libraries, state libraries, and special libraries. Susan VECCIA and | |
Joanne FREEMAN will discuss their observations on the use of AM by the | |
nonscholarly community, using evidence gleaned from this ongoing | |
evaluation effort. | |
VECCIA will comment on the overall goals of the evaluation project, and | |
the types of public and school libraries included in this study. Her | |
comments on nonscholarly use of AM will focus on the public library as a | |
cultural and community institution, often bridging the gap between formal | |
and informal education. FREEMAN will discuss the use of AM in school | |
libraries. Use by students and teachers has revealed some broad | |
questions about the use of electronic resources, as well as definite | |
benefits gained by the "nonscholar." Topics will include the problem of | |
grasping content and context in an electronic environment, the stumbling | |
blocks created by "new" technologies, and the unique skills and interests | |
awakened through use of electronic resources. | |
SESSION II | |
Elli MYLONAS The Perseus Project: Interactive Sources and | |
Studies in Classical Greece | |
The Perseus Project (5) has just released Perseus 1.0, the first publicly | |
available version of its hypertextual database of multimedia materials on | |
classical Greece. Perseus is designed to be used by a wide audience, | |
composed of readers at the student and scholar levels. As such, it must | |
be able to locate information using different strategies, and it must | |
contain enough detail to serve the different needs of its users. In | |
addition, it must be delivered so that it is affordable to its target | |
audience. [These problems and the solutions we chose are described in | |
Mylonas, "An Interface to Classical Greek Civilization," JASIS 43:2, | |
March 1992.] | |
In order to achieve its objective, the project staff decided to make a | |
conscious separation between selecting and converting textual, database, | |
and image data on the one hand, and putting it into a delivery system on | |
the other. That way, it is possible to create the electronic data | |
without thinking about the restrictions of the delivery system. We have | |
made a great effort to choose system-independent formats for our data, | |
and to put as much thought and work as possible into structuring it so | |
that the translation from paper to electronic form will enhance the value | |
of the data. [A discussion of these solutions as of two years ago is in | |
Elli Mylonas, Gregory Crane, Kenneth Morrell, and D. Neel Smith, "The | |
Perseus Project: Data in the Electronic Age," in Accessing Antiquity: | |
The Computerization of Classical Databases, J. Solomon and T. Worthen | |
(eds.), University of Arizona Press, in press.] | |
Much of the work on Perseus is focused on collecting and converting the | |
data on which the project is based. At the same time, it is necessary to | |
provide means of access to the information, in order to make it usable, | |
and then to investigate how it is used. As we learn more about what | |
students and scholars from different backgrounds do with Perseus, we can | |
adjust our data collection, and also modify the system to accommodate | |
them. In creating a delivery system for general use, we have tried to | |
avoid favoring any one type of use by allowing multiple forms of access | |
to and navigation through the system. | |
The way text is handled exemplifies some of these principles. All text | |
in Perseus is tagged using SGML, following the guidelines of the Text | |
Encoding Initiative (TEI). This markup is used to index the text, and | |
process it so that it can be imported into HyperCard. No SGML markup | |
remains in the text that reaches the user, because currently it would be | |
too expensive to create a system that acts on SGML in real time. | |
However, the regularity provided by SGML is essential for verifying the | |
content of the texts, and greatly speeds all the processing performed on | |
them. The fact that the texts exist in SGML ensures that they will be | |
relatively easy to port to different hardware and software, and so will | |
outlast the current delivery platform. Finally, the SGML markup | |
incorporates existing canonical reference systems (chapter, verse, line, | |
etc.); indexing and navigation are based on these features. This ensures | |
that the same canonical reference will always resolve to the same point | |
within a text, and that all versions of our texts, regardless of delivery | |
platform (even paper printouts) will function the same way. | |
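
The idea of resolving a canonical reference independently of the delivery
platform can be illustrated with a minimal sketch. Everything in it is an
assumption made for illustration: the tag and attribute names, the sample
text, and the function names are hypothetical and are not the project's
actual TEI tag set or software.

```python
import re

# Toy SGML-like source. The element and attribute names (<div>, <l>, n=...)
# are illustrative assumptions, not the project's actual markup.
SOURCE = """<div type="book" n="1">
<l n="1">first line of book one</l>
<l n="2">second line of book one</l>
</div>
<div type="book" n="2">
<l n="1">first line of book two</l>
</div>"""

DIV_OPEN = re.compile(r'<div type="book" n="(\d+)">')
LINE = re.compile(r'<l n="(\d+)">(.*?)</l>')


def build_index(sgml):
    """Map canonical references such as '1.2' to markup-free line text."""
    index = {}
    book = None
    for raw in sgml.splitlines():
        opened = DIV_OPEN.match(raw)
        if opened:
            book = opened.group(1)
            continue
        line = LINE.match(raw)
        if line and book is not None:
            index[f"{book}.{line.group(1)}"] = line.group(2)
    return index


def resolve(index, ref):
    """Return the text for a canonical reference, or None if unknown."""
    return index.get(ref)


INDEX = build_index(SOURCE)
print(resolve(INDEX, "1.2"))
```

Because the index is keyed on the reference system carried by the markup
rather than on screen or page positions, the same reference yields the same
passage whatever the delivery system, and no markup reaches the reader.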
In order to provide tools for users, the text is processed by a | |
morphological analyzer, and the results are stored in a database. | |
Together with the index, the Greek-English Lexicon, and the index of all | |
the English words in the definitions of the lexicon, the morphological | |
analyses comprise a set of linguistic tools that allow users of all | |
levels to work with the textual information, and to accomplish different | |
tasks. For example, students who read no Greek may explore a concept as | |
it appears in Greek texts by using the English-Greek index, and then | |
looking up works in the texts and translations, or scholars may do | |
detailed morphological studies of word use by using the morphological | |
analyses of the texts. Because these tools were not designed for any one | |
use, the same tools and the same data can be used by both students and | |
scholars. | |
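
How one set of linguistic data can serve both kinds of reader may be
sketched as follows. This is a hedged illustration only: the transliterated
word forms are toy data, the references are invented, and the table and
function names are hypothetical rather than taken from Perseus itself.

```python
from collections import defaultdict

# Toy data, transliterated; a real lexicon and analyzer are far larger.
LEXICON = {  # lemma -> English definition
    "logos": "word, speech, reason",
    "polis": "city, state",
}
ANALYSES = {  # inflected form -> (lemma, morphological analysis)
    "logou": ("logos", "noun, genitive singular"),
    "logoi": ("logos", "noun, nominative plural"),
    "poleis": ("polis", "noun, nominative plural"),
}
# Hypothetical canonical references paired with the form found there.
TEXT = [("1.1", "logou"), ("1.2", "poleis"), ("2.1", "logoi")]

# Index of the English words in the definitions (English -> Greek lemmas).
ENGLISH_INDEX = defaultdict(set)
for lemma, definition in LEXICON.items():
    for word in definition.replace(",", " ").split():
        ENGLISH_INDEX[word].add(lemma)


def passages_for_english(word):
    """Student use: find passages for a concept without reading Greek."""
    lemmas = ENGLISH_INDEX.get(word, set())
    return sorted(ref for ref, form in TEXT if ANALYSES[form][0] in lemmas)


def forms_of(lemma):
    """Scholar use: list the attested inflected forms of a lemma."""
    return sorted(form for form, (lem, _) in ANALYSES.items() if lem == lemma)


print(passages_for_english("speech"))
print(forms_of("logos"))
```

Both functions read the same three tables; neither table was shaped around
one kind of question, which is the design point the abstract makes.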
NOTES: | |
(5) Perseus is based at Harvard University, with collaborators at | |
several other universities. The project has been funded primarily | |
by the Annenberg/CPB Project, as well as by Harvard University, | |
Apple Computer, and others. It is published by Yale University | |
Press. Perseus runs on Macintosh computers, under the HyperCard | |
program. | |
Eric CALALUCA | |
Chadwyck-Healey embarked last year on two distinct yet related full-text | |
humanities database projects. | |
The English Poetry Full-Text Database and the Patrologia Latina Database | |
represent new approaches to linguistic research resources. The size and | |
complexity of the projects present problems for electronic publishers, | |
but surmountable ones if they remain abreast of the latest possibilities | |
in data capture and retrieval software techniques. | |
The issues that had to be addressed before the projects began were | |
legion: | |
1. Editorial selection (or exclusion) of materials in each | |
database | |
2. Should a normative encoding structure be incorporated into | |
the databases? | |
A. If one is selected, should it be SGML? | |
B. If SGML, then the TEI? | |
3. Deliver as CD-ROM, magnetic tape, or both? | |
4. Can one produce retrieval software advanced enough for the | |
postdoctoral linguist, yet accessible enough for unattended | |
general use? Should one try? | |
5. Re fair and liberal networking policies, what are the risks to | |
an electronic publisher? | |
6. How does the emergence of national and international education | |
networks affect the use and viability of research projects | |
requiring high investment? Do the new European Community | |
directives concerning database protection necessitate two | |
distinct publishing projects, one for North America and one for | |
overseas? | |
From new notions of "scholarly fair use" to the future of optical media, | |
virtually every issue related to electronic publishing was aired. The | |
result is two projects that have been constructed to provide quality | |
research resources with the fewest encumbrances to use by teachers and | |
private scholars. | |
Dorothy TWOHIG | |
In spring 1988 the editors of the papers of George Washington, John | |
Adams, Thomas Jefferson, James Madison, and Benjamin Franklin were | |
approached by classics scholar David Packard on behalf of the Packard | |
Humanities Institute with a proposal to produce a CD-ROM edition of the | |
complete papers of each of the Founding Fathers. This electronic edition | |
will supplement the published volumes, making the documents widely | |
available to students and researchers at reasonable cost. We estimate | |
that our CD-ROM edition of Washington's Papers will be substantially | |
completed within the next two years and ready for publication. Within | |
the next ten years or so, similar CD-ROM editions of the Franklin, Adams, | |
Jefferson, and Madison papers also will be available. At the Library of | |
Congress's session on technology, I would like to discuss not only the | |
experience of the Washington Papers in producing the CD-ROM edition, but | |
the impact technology has had on these major editorial projects. | |
Already, we are editing our volumes with an eye to the material that will | |
be readily available in the CD-ROM edition. The completed electronic | |
edition will provide immense possibilities for the searching of documents | |
for information in a way never possible before. The kind of technical | |
innovations that are currently available and on the drawing board will | |
soon revolutionize historical research and the production of historical | |
documents. Unfortunately, much of this new technology is not being used | |
in the planning stages of historical projects, simply because many | |
historians are aware only in the vaguest way of its existence. At least | |
two major new historical editing projects are considering microfilm | |
editions, simply because they are not aware of the possibilities of | |
electronic alternatives and the advantages of the new technology in terms | |
of flexibility and research potential compared to microfilm. In fact, | |
too many of us in history and literature are still at the stage of | |
struggling with our PCs. There are many historical editorial projects in | |
progress presently, and an equal number of literary projects. While the | |
two fields have somewhat different approaches to textual editing, there | |
are ways in which electronic technology can be of service to both. | |
Since few of the editors involved in the Founding Fathers CD-ROM editions | |
are technical experts in any sense, I hope to point out in my discussion | |
of our experience how many of these electronic innovations can be used | |
successfully by scholars who are novices in the world of new technology. | |
One of the major concerns of the sponsors of the multitude of new | |
scholarly editions is the limited audience reached by the published | |
volumes. Most of these editions are being published in small quantities | |
and the publishers' price for them puts them out of the reach not only of | |
individual scholars but of most public libraries and all but the largest | |
educational institutions. However, little attention is being given to | |
ways in which technology can bypass conventional publication to make | |
historical and literary documents more widely available. | |
What attracted us most to the CD-ROM edition of The Papers of George | |
Washington was the fact that David Packard's aim was to make a complete | |
edition of all of the 135,000 documents we have collected available in an | |
inexpensive format that would be placed in public libraries, small | |
colleges, and even high schools. This would provide an audience far | |
beyond our present 1,000-copy, $45 published edition. Since the CD-ROM | |
edition will carry none of the explanatory annotation that appears in the | |
published volumes, we also feel that the use of the CD-ROM will lead many | |
researchers to seek out the published volumes. | |
In addition to ignorance of new technical advances, I have found that too | |
many editors--and historians and literary scholars--are resistant and | |
even hostile to suggestions that electronic technology may enhance their | |
work. I intend to discuss some of the arguments traditionalists are | |
advancing to resist technology, ranging from distrust of the speed with | |
which it changes (we are already wondering what is out there that is | |
better than CD-ROM) to suspicion of the technical language used to | |
describe electronic developments. | |
Maria LEBRON | |
The Online Journal of Current Clinical Trials, a joint venture of the | |
American Association for the Advancement of Science (AAAS) and the Online | |
Computer Library Center, Inc. (OCLC), is the first peer-reviewed journal | |
to provide full text, tabular material, and line illustrations on line. | |
This presentation will discuss the genesis and start-up period of the | |
journal. Topics of discussion will include historical overview, | |
day-to-day management of the editorial peer review, and manuscript | |
tagging and publication. A demonstration of the journal and its features | |
will accompany the presentation. | |
Lynne PERSONIUS | |
Cornell University Library, Cornell Information Technologies, and Xerox | |
Corporation, with the support of the Commission on Preservation and | |
Access, and Sun Microsystems, Inc., have been collaborating in a project | |
to test a prototype system for recording brittle books as digital images | |
and producing, on demand, high-quality archival paper replacements. The | |
project goes beyond that, however, to investigate some of the issues | |
surrounding scanning, storing, retrieving, and providing access to | |
digital images in a network environment. | |
The Joint Study in Digital Preservation began in January 1990. Xerox | |
provided the College Library Access and Storage System (CLASS) software, | |
a prototype 600-dots-per-inch (dpi) scanner, and the hardware necessary | |
to support network printing on the DocuTech printer housed in Cornell's | |
Computing and Communications Center (CCC). | |
The Cornell staff using the hardware and software became an integral part | |
of the development and testing process for enhancements to the CLASS | |
software system. The collaborative nature of this relationship is | |
resulting in a system that is specifically tailored to the preservation | |
application. | |
A digital library of 1,000 volumes (or approximately 300,000 images) has | |
been created and is stored on an optical jukebox that resides in CCC. | |
The library includes a collection of select mathematics monographs that | |
provides mathematics faculty with an opportunity to use the electronic | |
library. The remaining volumes were chosen for the library to test the | |
various capabilities of the scanning system. | |
One project objective is to provide users of the Cornell library and the | |
library staff with the ability to request facsimiles of digitized images | |
or to retrieve the actual electronic image for browsing. A prototype | |
viewing workstation has been created by Xerox, with input into the design | |
by a committee of Cornell librarians and computer professionals. This | |
will allow us to experiment with patron access to the images that make up | |
the digital library. The viewing station provides search, retrieval, and | |
(ultimately) printing functions with enhancements to facilitate | |
navigation through multiple documents. | |
Cornell currently is working to extend access to the digital library to | |
readers using workstations from their offices. This year is devoted to | |
the development of a network-resident image conversion and delivery | |
server, and client software that will support readers who use Apple | |
Macintosh computers, IBM Windows platforms, and Sun workstations. | |
Equipment for this development was provided by Sun Microsystems with | |
support from the Commission on Preservation and Access. | |
During the show-and-tell session of the Workshop on Electronic Texts, a | |
prototype view station will be demonstrated. In addition, a display of | |
original library books that have been digitized will be available for | |
review with associated printed copies for comparison. The fifteen-minute | |
overview of the project will include a slide presentation that | |
constitutes a "tour" of the preservation digitizing process. | |
The final network-connected version of the viewing station will provide | |
library users with another mechanism for accessing the digital library, | |
and will also provide the capability of viewing images directly. This | |
will not require special software, although a powerful computer with good | |
graphics will be needed. | |
The Joint Study in Digital Preservation has generated a great deal of | |
interest in the library community. Unfortunately, or perhaps | |
fortunately, this project serves to raise a vast number of other issues | |
surrounding the use of digital technology for the preservation and use of | |
deteriorating library materials, which subsequent projects will need to | |
examine. Much work remains. | |
SESSION III | |
Howard BESSER Networking Multimedia Databases | |
What do we have to consider in building and distributing databases of | |
visual materials in a multi-user environment? This presentation examines | |
a variety of concerns that need to be addressed before a multimedia | |
database can be set up in a networked environment. | |
In the past it has not been feasible to implement databases of visual | |
materials in shared-user environments because of technological barriers. | |
Each of the two basic models for multi-user multimedia databases has | |
posed its own problem. The analog multimedia storage model (represented | |
by Project Athena's parallel analog and digital networks) has required an | |
incredibly complex (and expensive) infrastructure. The economies of | |
scale that make multi-user setups cheaper per user served do not operate | |
in an environment that requires a computer workstation, videodisc player, | |
and two display devices for each user. | |
The digital multimedia storage model has required vast amounts of storage | |
space (as much as one gigabyte per thirty still images). In the past the | |
cost of such a large amount of storage space made this model a | |
prohibitive choice as well. But plunging storage costs are finally | |
making this second alternative viable. | |
If storage no longer poses such an impediment, what do we need to | |
consider in building digitally stored multi-user databases of visual | |
materials? This presentation will examine the networking and | |
telecommunication constraints that must be overcome before such databases | |
can become commonplace and useful to a large number of people. | |
The key problem is the vast size of multimedia documents, and how this | |
affects not only storage but telecommunications transmission time. | |
Anything slower than T-1 speed is impractical for files of 1 megabyte or | |
larger (which is likely to be small for a multimedia document). For | |
instance, even on a 56-kbps line it would take three minutes to transfer a | |
1-megabyte file. And these figures assume ideal circumstances, and do | |
not take into consideration other users contending for network bandwidth, | |
disk access time, or the time needed for remote display. Current common | |
telephone transmission rates would be completely impractical; few users | |
would be willing to wait the hour necessary to transmit a single image at | |
2400 baud. | |
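The transfer times quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the original presentation; it assumes that 1 megabyte means 1,048,576 bytes, that a 2400-baud modem carries 2,400 bits per second, that T-1 runs at 1.544 Mbps, and that there is no protocol overhead or contention. The function and constant names are illustrative only.

```python
def transfer_seconds(size_bytes: int, line_bits_per_second: float) -> float:
    """Ideal transfer time: payload bits divided by line rate, no overhead."""
    return size_bytes * 8 / line_bits_per_second


ONE_MEGABYTE = 2**20  # assumption: 1 megabyte = 1,048,576 bytes

# Assumed nominal line rates in bits per second.
LINE_RATES = {
    "2400 bps modem": 2_400,
    "56 kbps line": 56_000,
    "T-1 (1.544 Mbps)": 1_544_000,
}

for name, rate in LINE_RATES.items():
    seconds = transfer_seconds(ONE_MEGABYTE, rate)
    print(f"{name}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Under these idealized assumptions the 56-kbps figure comes out at about two and a half minutes and the 2400-bps figure at just under an hour, in line with the rounded numbers in the text; real transfers would be slower for the reasons the abstract lists.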
This necessitates compression, which itself raises a number of other | |
issues. In order to decrease file sizes significantly, we must employ | |
lossy compression algorithms. But how much quality can we afford to | |
lose? To date there has been only one significant study done of | |
image-quality needs for a particular user group, and this study did not | |
look at loss resulting from compression. Only after identifying | |
image-quality needs can we begin to address storage and network bandwidth | |
needs. | |
Experience with X-Windows-based applications (such as Imagequery, the | |
University of California at Berkeley image database) demonstrates the | |
utility of a client-server topology, but also points to the limitation of | |
current software for a distributed environment. For example, | |
applications like Imagequery can incorporate compression, but current X | |
implementations do not permit decompression at the end user's | |
workstation. Such decompression at the host computer alleviates storage | |
capacity problems while doing nothing to address problems of | |
telecommunications bandwidth. | |
We need to examine the effects on network through-put of moving | |
multimedia documents around on a network. We need to examine various | |
topologies that will help us avoid bottlenecks around servers and | |
gateways. Experience with applications such as these raises still broader | |
questions. How closely is the multimedia document tied to the software | |
for viewing it? Can it be accessed and viewed from other applications? | |
Experience with the MARC format (and more recently with the Z39.50 | |
protocols) shows how useful it can be to store documents in a form in | |
which they can be accessed by a variety of application software. | |
Finally, from an intellectual-access standpoint, we need to address the | |
issue of providing access to these multimedia documents in | |
interdisciplinary environments. We need to examine terminology and | |
indexing strategies that will allow us to provide access to this material | |
in a cross-disciplinary way. | |
Ronald LARSEN Directions in High-Performance Networking for | |
Libraries | |
The pace at which computing technology has advanced over the past forty | |
years shows no sign of abating. Roughly speaking, each five-year period | |
has yielded an order-of-magnitude improvement in price and performance of | |
computing equipment. No fundamental hurdles are likely to prevent this | |
pace from continuing for at least the next decade. It is only in the | |
past five years, though, that computing has become ubiquitous in | |
libraries, affecting all staff and patrons, directly or indirectly. | |
During these same five years, communications rates on the Internet, the | |
principal academic computing network, have grown from 56 kbps to 1.5 | |
Mbps, and the NSFNet backbone is now running 45 Mbps. Over the next five | |
years, communication rates on the backbone are expected to exceed 1 Gbps. | |
Both the population of network users and the volume of network | |
traffic have continued to grow geometrically, at rates approaching 15 | |
percent per month. This flood of capacity and use, likened by some to | |
"drinking from a firehose," creates immense opportunities and challenges | |
for libraries. Libraries must anticipate the future implications of this | |
technology, participate in its development, and deploy it to ensure | |
access to the world's information resources. | |
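To make the growth figures above concrete, the sketch below compounds them. It is an illustration only, assuming that "15 percent per month" compounds monthly and that "an order of magnitude every five years" is spread evenly across the five years; the function names are hypothetical.

```python
import math


def compound_factor(rate_per_period: float, periods: int) -> float:
    """Total growth factor after compounding a fixed rate over several periods."""
    return (1 + rate_per_period) ** periods


def doubling_periods(rate_per_period: float) -> float:
    """Number of periods needed for a quantity to double at a fixed rate."""
    return math.log(2) / math.log(1 + rate_per_period)


# 15 percent per month, compounded over twelve months.
yearly_traffic_factor = compound_factor(0.15, 12)
months_to_double = doubling_periods(0.15)

# A tenfold improvement every five years, expressed as a per-year factor.
annual_price_performance_factor = 10 ** (1 / 5)

print(f"Traffic growth per year: about {yearly_traffic_factor:.2f}x")
print(f"Traffic doubling time: about {months_to_double:.1f} months")
print(f"Price/performance gain per year: about {annual_price_performance_factor:.2f}x")
```

On these assumptions, traffic growing at 15 percent per month more than quintuples in a year, which is considerably faster than the roughly 1.6-fold annual gain implied by a tenfold improvement in computing price and performance every five years.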
The infrastructure for the information age is being put in place. | |
Libraries face strategic decisions about their role in the development, | |
deployment, and use of this infrastructure. The emerging infrastructure | |
is much more than computers and communication lines. It is more than the | |
ability to compute at a remote site, send electronic mail to a peer | |
across the country, or move a file from one library to another. The next | |
five years will witness substantial development of the information | |
infrastructure of the network. | |
In order to provide appropriate leadership, library professionals must | |
have a fundamental understanding of and appreciation for computer | |
networking, from local area networks to the National Research and | |
Education Network (NREN). This presentation addresses these | |
fundamentals, and how they relate to libraries today and in the near | |
future. | |
Edwin BROWNRIGG Electronic Library Visions and Realities | |
The electronic library has been a vision desired by many--and rejected by | |
some--since Vannevar Bush coined the term memex to describe an automated, | |
intelligent, personal information system. Variations on this vision have | |
included Ted Nelson's Xanadu, Alan Kay's Dynabook, and Lancaster's | |
"paperless library," with the most recent incarnation being the | |
"Knowledge Navigator" described by John Sculley of Apple. But the reality | |
of library service has been less visionary and the leap to the electronic | |
library has eluded universities, publishers, and information technology | |
firms. | |
The Memex Research Institute (MemRI), an independent, nonprofit research | |
and development organization, has created an Electronic Library Program | |
of shared research and development in order to make the collective vision | |
more concrete. The program is working toward the creation of large, | |
indexed, publicly available electronic image collections of published | |
documents in academic, special, and public libraries. This strategic | |
plan is the result of the first stage of the program, which has been an | |
investigation of the information technologies available to support such | |
an effort, the economic parameters of electronic service compared to | |
traditional library operations, and the business and political factors | |
affecting the shift from print distribution to electronic networked | |
access. | |
The strategic plan envisions a combination of publicly searchable access | |
databases, image (and text) document collections stored on network "file | |
servers," local and remote network access, and an intellectual property | |
management-control system. This combination of technology and | |
information content is defined in this plan as an E-library or E-library | |
collection. Some participating sponsors are already developing projects | |
based on MemRI's recommended directions. | |
The E-library strategy projected in this plan is a visionary one that can | |
enable major changes and improvements in academic, public, and special | |
library service. This vision is, though, one that can be realized with | |
today's technology. At the same time, it will challenge the political | |
and social structure within which libraries operate: in academic | |
libraries, the traditional emphasis on local collections, extending to | |
accreditation issues; in public libraries, the potential of electronic | |
branch and central libraries fully available to the public; and for | |
special libraries, new opportunities for shared collections and networks. | |
The environment in which this strategic plan has been developed is, at | |
the moment, dominated by a sense of library limits. The continued | |
expansion and rapid growth of local academic library collections is now | |
clearly at an end. Corporate libraries, and even law libraries, are | |
faced with operating within a difficult economic climate, as well as with | |
very active competition from commercial information sources. Public | |
libraries, meanwhile, may be seen as a desirable but not critical | |
municipal service in a time when the budgets of safety and health | |
agencies are being cut back. | |
Further, libraries in general have a very high labor-to-cost ratio in | |
their budgets, and labor costs are still increasing, notwithstanding | |
automation investments. It is difficult for libraries to obtain capital, | |
startup, or seed funding for innovative activities, and those | |
technology-intensive initiatives that offer the potential of decreased | |
labor costs can provoke the opposition of library staff. | |
However, libraries have achieved some considerable successes in the past | |
two decades by improving both their service and their credibility within | |
their organizations--and these positive changes have been accomplished | |
mostly with judicious use of information technologies. The advances in | |
computing and information technology have been well-chronicled: the | |
continuing precipitous drop in computing costs, the growth of the | |
Internet and private networks, and the explosive increase in publicly | |
available information databases. | |
For example, OCLC has become one of the largest computer network | |
organizations in the world by creating a cooperative cataloging network | |
of more than 6,000 libraries worldwide. On-line public access catalogs | |
now serve millions of users on more than 50,000 dedicated terminals in | |
the United States alone. The University of California MELVYL on-line | |
catalog system has now expanded into an index database reference service | |
and supports more than six million searches a year. And libraries have | |
become the largest group of customers of CD-ROM publishing technology; | |
more than 30,000 optical media publications such as those offered by | |
InfoTrac and Silver Platter are subscribed to by U.S. libraries. | |
This march of technology continues and in the next decade will result in | |
further innovations that are extremely difficult to predict. What is | |
clear is that libraries can now go beyond automation of their order files | |
and catalogs to automation of their collections themselves--and it is | |
possible to circumvent the fiscal limitations that appear to obtain | |
today. | |
This Electronic Library Strategic Plan recommends a paradigm shift in | |
library service, and demonstrates the steps necessary to provide improved | |
library services with limited capacities and operating investments. | |
SESSION IV-A | |
Anne KENNEY | |
The Cornell/Xerox Joint Study in Digital Preservation resulted in the | |
recording of 1,000 brittle books as 600-dpi digital images and the | |
production, on demand, of high-quality and archivally sound paper | |
replacements. The project, which was supported by the Commission on | |
Preservation and Access, also investigated some of the issues surrounding | |
scanning, storing, retrieving, and providing access to digital images in | |
a network environment. | |
Anne Kenney will focus on some of the issues surrounding direct scanning | |
as identified in the Cornell Xerox Project. Among those to be discussed | |
are: image versus text capture; indexing and access; image-capture | |
capabilities; a comparison to photocopy and microfilm; production and | |
cost analysis; storage formats, protocols, and standards; and the use of | |
this scanning technology for preservation purposes. | |
The 600-dpi digital images produced in the Cornell Xerox Project proved | |
highly acceptable for creating paper replacements of deteriorating | |
originals. The 1,000 scanned volumes provided an array of image-capture | |
challenges that are common to nineteenth-century printing techniques and | |
embrittled material, and that defy the use of text-conversion processes. | |
These challenges include diminished contrast between text and background, | |
fragile and deteriorated pages, uneven printing, elaborate type faces, | |
faint and bold text adjacency, handwritten text and annotations, non-Roman | |
languages, and a proliferation of illustrated material embedded in text. | |
The latter category included high-frequency and low-frequency halftones, | |
continuous tone photographs, intricate mathematical drawings, maps, | |
etchings, reverse-polarity drawings, and engravings. | |
The Xerox prototype scanning system provided a number of important | |
features for capturing this diverse material. Technicians used multiple | |
threshold settings, filters, line art and halftone definitions, | |
autosegmentation, windowing, and software-editing programs to optimize | |
image capture. At the same time, this project focused on production. | |
The goal was to make scanning as affordable and acceptable as | |
photocopying and microfilming for preservation reformatting. A | |
time-and-cost study conducted during the last three months of this | |
project confirmed the economic viability of digital scanning, and these | |
findings will be discussed here. | |
From the outset, the Cornell Xerox Project was predicated on the use of | |
nonproprietary standards and the use of common protocols when standards | |
did not exist. Digital files were created as TIFF images which were | |
compressed prior to storage using Group 4 CCITT compression. The Xerox | |
software is MS-DOS-based and utilizes off-the-shelf programs such as | |
Microsoft Windows and Wang Image Wizard. The digital library is designed | |
to be hardware-independent and to provide interchangeability with other | |
institutions through network connections. Access to the digital files | |
themselves is two-tiered: Bibliographic records for the computer files | |
are created in RLIN and Cornell's local system and access into the actual | |
digital images comprising a book is provided through a document control | |
structure and a networked image file-server, both of which will be | |
described. | |
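A rough sense of the storage involved can be had from a back-of-the-envelope estimate. The sketch below is illustrative only: the 600-dpi resolution and the figure of about 300,000 images come from these abstracts, but the 6-by-9-inch page size and the 15:1 Group 4 compression ratio are assumptions chosen for the example, not measurements from the Cornell/Xerox project.

```python
def bitonal_image_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a 1-bit-per-pixel scan, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


IMAGES = 300_000         # from the text: about 300,000 images in 1,000 volumes
DPI = 600                # from the text: 600-dpi scanning
PAGE_WIDTH_IN = 6.0      # assumption: typical monograph page width
PAGE_HEIGHT_IN = 9.0     # assumption: typical monograph page height
ASSUMED_G4_RATIO = 15.0  # hypothetical Group 4 compression ratio

per_image = bitonal_image_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI)
raw_total = per_image * IMAGES
compressed_total = raw_total / ASSUMED_G4_RATIO

print(f"Uncompressed bytes per image: {per_image:,}")
print(f"Uncompressed library: {raw_total / 1e9:.0f} GB")
print(f"With the assumed ratio: {compressed_total / 1e9:.1f} GB")
```

Actual Group 4 ratios vary widely with page content (clean text compresses far better than halftones), so the compressed figure should be read as an order-of-magnitude guide only.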
The presentation will conclude with a discussion of some of the issues | |
surrounding the use of this technology as a preservation tool (storage, | |
refreshing, backup). | |
Pamela ANDRE and Judith ZIDAR | |
The National Agricultural Library (NAL) has had extensive experience with | |
raster scanning of printed materials. Since 1987, the Library has | |
participated in the National Agricultural Text Digitizing Project (NATDP), | |
a cooperative effort between NAL and forty-five land grant university | |
libraries. An overview of the project will be presented, giving its | |
history and NAL's strategy for the future. | |
An in-depth discussion of NATDP will follow, including a description of | |
the scanning process, from the gathering of the printed materials to the | |
archiving of the electronic pages. The type of equipment required for a | |
stand-alone scanning workstation and the importance of file management | |
software will be discussed. Issues concerning the images themselves will | |
be addressed briefly, such as image format; black and white versus color; | |
gray scale versus dithering; and resolution. | |
Also described will be a study currently in progress by NAL to evaluate | |
the usefulness of converting microfilm to electronic images in order to | |
improve access. With the cooperation of Tuskegee University, NAL has | |
selected three reels of microfilm from a collection of sixty-seven reels | |
containing the papers, letters, and drawings of George Washington Carver. | |
The three reels were converted into 3,500 electronic images using a | |
specialized microfilm scanner. The selection, filming, and indexing of | |
this material will be discussed. | |
Donald WATERS | |
Project Open Book, the Yale University Library's effort to convert | |
10,000 books from microfilm to digital imagery, is currently in an advanced | |
state of planning and organization. The Yale Library has selected a | |
major vendor to serve as a partner in the project and as systems | |
integrator. In its proposal, the successful vendor helped isolate areas | |
of risk and uncertainty as well as key issues to be addressed during the | |
life of the project. The Yale Library is now poised to decide what | |
material it will convert to digital image form and to seek funding, | |
initially for the first phase and then for the entire project. | |
The proposal that Yale accepted for the implementation of Project Open | |
Book will provide at the end of three phases a conversion subsystem, | |
browsing stations distributed on the campus network within the Yale | |
Library, a subsystem for storing 10,000 books at 200 and 600 dots per | |
inch, and network access to the image printers. Pricing for the system | |
implementation assumes the existence of Yale's campus ethernet network | |
and its high-speed image printers, and includes other requisite hardware | |
and software, as well as system integration services. Proposed operating | |
costs include hardware and software maintenance, but do not include | |
estimates for the facilities management of the storage devices and image | |
servers. | |
Yale selected its vendor partner in a formal process, partly funded by | |
the Commission on Preservation and Access. Following a request for | |
proposal, the Yale Library selected two vendors as finalists to work with | |
Yale staff to generate a detailed analysis of requirements for Project | |
Open Book. Each vendor used the results of the requirements analysis to | |
generate and submit a formal proposal for the entire project. This | |
competitive process not only enabled the Yale Library to select its | |
primary vendor partner but also revealed much about the state of the | |
imaging industry, about the varying corporate commitments to the markets | |
for imaging technology, and about the varying organizational dynamics | |
through which major companies are responding to and seeking to develop | |
these markets. | |
Project Open Book is focused specifically on the conversion of images | |
from microfilm to digital form. The technology for scanning microfilm is | |
readily available but is changing rapidly. In its project requirements, | |
the Yale Library emphasized features of the technology that affect the | |
technical quality of digital image production and the costs of creating | |
and storing the image library: What levels of digital resolution can be | |
achieved by scanning microfilm? How does variation in the quality of | |
microfilm, particularly in film produced to preservation standards, | |
affect the quality of the digital images? What technologies can an | |
operator effectively and economically apply when scanning film to | |
separate two-up images and to control for and correct image | |
imperfections? How can quality control best be integrated into | |
a digitizing work flow that includes document indexing and storage? | |
The actual and expected uses of digital images--storage, browsing, | |
printing, and OCR--help determine the standards for measuring their | |
quality. Browsing is especially important, but the facilities available | |
for readers to browse image documents are perhaps the weakest aspect of | |
imaging technology and most in need of development. As it defined its | |
requirements, the Yale Library concentrated on some fundamental aspects | |
of usability for image documents: Does the system have sufficient | |
flexibility to handle the full range of document types, including | |
monographs, multi-part and multivolume sets, and serials, as well as | |
manuscript collections? What conventions are necessary to identify a | |
document uniquely for storage and retrieval? Where is the database of | |
record for storing bibliographic information about the image document? | |
How are basic internal structures of documents, such as pagination, made | |
accessible to the reader? How are the image documents physically | |
presented on the screen to the reader? | |
The Yale Library designed Project Open Book on the assumption that | |
microfilm is more than adequate as a medium for preserving the content of | |
deteriorated library materials. As planning in the project has advanced, | |
it is increasingly clear that the challenge of digital image technology | |
and the key to the success of efforts like Project Open Book is to | |
provide a means of both preserving and improving access to those | |
deteriorated materials. | |
SESSION IV-B | |
George THOMA | |
In the use of electronic imaging for document preservation, there are | |
several issues to consider, such as: ensuring adequate image quality, | |
maintaining substantial conversion rates (through-put), providing unique | |
identification for automated access and retrieval, and accommodating | |
bound volumes and fragile material. | |
To maintain high image quality, image processing functions are required | |
to correct the deficiencies in the scanned image. Some commercially | |
available systems include these functions, while some do not. The | |
scanned raw image must be processed to correct contrast deficiencies-- | |
both poor overall contrast resulting from light print and/or dark | |
background, and variable contrast resulting from stains and | |
bleed-through. Furthermore, the scan density must be adequate to allow | |
legibility of print and sufficient fidelity in the pseudo-halftoned gray | |
material. Borders or page-edge effects must be removed for both | |
compactibility and aesthetics. Page skew must be corrected for aesthetic | |
reasons and to enable accurate character recognition if desired. | |
Compound images consisting of both two-toned text and gray-scale | |
illustrations must be processed appropriately to retain the quality of | |
each. | |
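One of the corrections described above, compensating for poor overall contrast before the page is reduced to black and white, can be sketched in a few lines. This is a minimal illustration using a linear contrast stretch followed by a fixed threshold; it is not the processing used by any system named in these proceedings, and the tiny sample "page" and the cutoff of 128 are assumptions made for the example.

```python
def stretch_contrast(pixels, low=0, high=255):
    """Linearly rescale gray values so the darkest maps to low, the lightest to high."""
    flat = [value for row in pixels for value in row]
    darkest, lightest = min(flat), max(flat)
    if darkest == lightest:
        # A uniform image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in pixels]
    scale = (high - low) / (lightest - darkest)
    return [[round(low + (value - darkest) * scale) for value in row] for row in pixels]


def threshold(pixels, cutoff=128):
    """Convert gray values to two tones: 1 for ink (dark), 0 for paper (light)."""
    return [[1 if value < cutoff else 0 for value in row] for row in pixels]


# Hypothetical 3x3 scan of light print: "ink" at 140 on "paper" at 180-200.
faint_page = [
    [200, 180, 200],
    [180, 140, 180],
    [200, 180, 200],
]

print(threshold(faint_page))                    # ink is lost without correction
print(threshold(stretch_contrast(faint_page)))  # ink survives after stretching
```

Thresholding the raw scan loses the faint mark entirely, while stretching first lets the same cutoff separate ink from paper. Variable contrast from stains or bleed-through would need a locally adaptive method rather than this single global stretch.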
SESSION IV-C | |
Jean BARONAS | |
Standards publications being developed by scientists, engineers, and | |
business managers in Association for Information and Image Management | |
(AIIM) standards committees can be applied to electronic image management | |
(EIM) processes including: document (image) transfer, retrieval and | |
evaluation; optical disk and document scanning; and document design and | |
conversion. When combined with EIM system planning and operations, | |
standards can assist in generating image databases that are | |
interchangeable among a variety of systems. The applications of | |
different approaches for image-tagging, indexing, compression, and | |
transfer often cause uncertainty concerning EIM system compatibility, | |
calibration, performance, and upward compatibility, until standard | |
implementation parameters are established. The AIIM standards that are | |
being developed for these applications can be used to decrease the | |
uncertainty, successfully integrate imaging processes, and promote "open | |
systems." AIIM is an accredited American National Standards Institute | |
(ANSI) standards developer with more than twenty committees comprised of | |
300 volunteers representing users, vendors, and manufacturers. The | |
standards publications that are developed in these committees have | |
national acceptance and provide the basis for international harmonization | |
in the development of new International Organization for Standardization | |
(ISO) standards. | |
This presentation describes the development of AIIM's EIM standards and a | |
new effort at AIIM, a database on standards projects in a wide framework | |
of imaging industries including capture, recording, processing, | |
duplication, distribution, display, evaluation, and preservation. The | |
AIIM Imagery Database will cover imaging standards being developed by | |
many organizations in many different countries. It will contain | |
standards publications' dates, origins, related national and | |
international projects, status, key words, and abstracts. The ANSI Image | |
Technology Standards Board requested that such a database be established, | |
as did the ISO/International Electrotechnical Commission Joint Task Force | |
on Imagery. AIIM will take on the leadership role for the database and | |
coordinate its development with several standards developers. | |
Patricia BATTIN | |
Characteristics of standards for digital imagery: | |
* Nature of digital technology implies continuing volatility. | |
* Precipitous standard-setting not possible and probably not | |
desirable. | |
* Standards are a complex issue involving the medium, the | |
hardware, the software, and the technical capacity for | |
reproductive fidelity and clarity. | |
* The prognosis for reliable archival standards (as defined by | |
librarians) in the foreseeable future is poor. | |
Significant potential and attractiveness of digital technology as a | |
preservation medium and access mechanism. | |
Productive use of digital imagery for preservation requires a | |
reconceptualizing of preservation principles in a volatile, | |
standardless world. | |
Concept of managing continuing access in the digital environment | |
rather than focusing on the permanence of the medium and long-term | |
archival standards developed for the analog world. | |
Transition period: How long and what to do? | |
* Redefine "archival." | |
* Remove the burden of "archival copy" from paper artifacts. | |
* Use digital technology for storage, develop management | |
strategies for refreshing medium, hardware and software. | |
* Create acid-free paper copies for transition period backup | |
until we develop reliable procedures for ensuring continuing | |
access to digital files. | |
SESSION IV-D | |
Stuart WEIBEL The Role of SGML Markup in the CORE Project (6) | |
The emergence of high-speed telecommunications networks as a basic | |
feature of the scholarly workplace is driving the demand for electronic | |
document delivery. Three distinct categories of electronic | |
publishing/republishing are necessary to support access demands in this | |
emerging environment: | |
1.) Conversion of paper or microfilm archives to electronic format | |
2.) Conversion of electronic files to formats tailored to | |
electronic retrieval and display | |
3.) Primary electronic publishing (materials for which the | |
electronic version is the primary format) | |
OCLC has experimental or product development activities in each of these | |
areas. Among the challenges that lie ahead is the integration of these | |
three types of information stores in coherent distributed systems. | |
The CORE (Chemistry Online Retrieval Experiment) Project is a model for | |
the conversion of large text and graphics collections for which | |
electronic typesetting files are available (category 2). The American | |
Chemical Society has made available computer typography files dating from | |
1980 for its twenty journals. This collection of some 250 journal-years | |
is being converted to an electronic format that will be accessible | |
through several end-user applications. | |
The use of Standard Generalized Markup Language (SGML) offers the means | |
to capture the structural richness of the original articles in a way that | |
will support a variety of retrieval, navigation, and display options | |
necessary to navigate effectively in very large text databases. | |
An SGML document consists of text that is marked up with descriptive tags | |
that specify the function of a given element within the document. As a | |
formal language construct, an SGML document can be parsed against a | |
document-type definition (DTD) that unambiguously defines what elements | |
are allowed and where in the document they can (or must) occur. This | |
formalized map of article structure allows the user interface design to | |
be uncoupled from the underlying database system, an important step | |
toward interoperability. Demonstration of this separability is a part of | |
the CORE project, wherein user interface designs born of very different | |
philosophies will access the same database. | |
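The idea of parsing a document against a DTD can be illustrated with a small sketch. It is not the CORE project's DTD or software: the element names, the toy content models (written here as regular expressions over child element names), and the validate function are all assumptions made for illustration, and real SGML parsing handles far more (attributes, entities, tag omission).

```python
import re

# Hypothetical, much-simplified "DTD": each element name maps to a regular
# expression over the space-terminated names of its children. "#text" stands
# for character data. These names are illustrative assumptions only.
TOY_DTD = {
    "article": r"title (author )+(abstract )?(section )+",
    "title": r"#text ",
    "author": r"#text ",
    "abstract": r"(para )+",
    "section": r"title (para )+",
    "para": r"#text ",
}


def validate(node, dtd=TOY_DTD, path=""):
    """Return a list of error strings; an empty list means the tree conforms.

    A node is either a str (character data) or a (name, children) tuple.
    """
    if isinstance(node, str):
        return []
    name, children = node
    here = f"{path}/{name}"
    if name not in dtd:
        return [f"{here}: element not declared"]
    # Spell out the child sequence, e.g. "title author section ".
    sequence = "".join(
        ("#text" if isinstance(child, str) else child[0]) + " "
        for child in children
    )
    errors = []
    if re.fullmatch(dtd[name], sequence) is None:
        errors.append(f"{here}: children [{sequence.strip()}] not allowed")
    for child in children:
        errors.extend(validate(child, dtd, here))
    return errors


good = ("article", [
    ("title", ["An Example Article"]),
    ("author", ["A. Chemist"]),
    ("section", [("title", ["Results"]), ("para", ["Some text."])]),
])

# Same content, but the required article title is missing.
bad = ("article", [
    ("author", ["A. Chemist"]),
    ("section", [("title", ["Results"]), ("para", ["Some text."])]),
])

print(validate(good))
print(validate(bad))
```

Because the rules live in the table rather than in any one program, two quite different interfaces could rely on the same guarantees about article structure, which is the separability the abstract describes.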
NOTES: | |
(6) The CORE project is a collaboration among Cornell University's | |
Mann Library, Bell Communications Research (Bellcore), the American | |
Chemical Society (ACS), the Chemical Abstracts Service (CAS), and | |
OCLC. | |
Michael LESK The CORE Electronic Chemistry Library | |
A major on-line file of chemical journal literature complete with | |
graphics is being developed to test the usability of fully electronic | |
access to documents, as a joint project of Cornell University, the | |
American Chemical Society, the Chemical Abstracts Service, OCLC, and | |
Bellcore (with additional support from Sun Microsystems, Springer-Verlag, | |
Digital Equipment Corporation, Sony Corporation of America, and Apple | |
Computers). Our file contains the American Chemical Society's on-line | |
journals, supplemented with the graphics from the paper publication. The | |
indexing of the articles is from Chemical Abstracts. Documents are available | |
in both image and text format, and several different interfaces can be | |
used. Our goals are (1) to assess the effectiveness and acceptability of | |
electronic access to primary journals as compared with paper, and (2) to | |
identify the most desirable functions of the user interface to an | |
electronic system of journals, including in particular a comparison of | |
page-image display with ASCII display interfaces. Early experiments with | |
chemistry students on a variety of tasks suggest that searching tasks are | |
completed much faster with any electronic system than with paper, but | |
that, for reading, all versions of the articles are roughly equivalent. | |
Pamela ANDRE and Judith ZIDAR | |
Text conversion is far more expensive and time-consuming than image | |
capture alone. NAL's experience with optical character recognition (OCR) | |
will be related and compared with the experience of having text rekeyed. | |
What factors affect OCR accuracy? How accurate does full text have to be | |
in order to be useful? How do different users react to imperfect text? | |
These are questions that will be explored. For many, a service bureau | |
may be a better solution than performing the work in-house; this will also | |
be discussed. | |
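One common way to put a number on OCR accuracy is character accuracy: one minus the edit distance between the OCR output and a trusted transcription, divided by the length of the transcription. The sketch below assumes that measure; it is not necessarily the one NAL used, and the function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """1 minus edit errors per reference character, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical sample: the OCR engine read a lowercase "l" for the digit "1".
reference = "Alexandria, VA 22314"
ocr_output = "Alexandria, VA 223l4"
print(f"{character_accuracy(reference, ocr_output):.2%}")
```

A single figure of this kind hides where the errors fall. One wrong character in a ZIP code or telephone number may matter more to a user than several in running prose, which is part of why the question of usable accuracy is open.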
SESSION VI | |
Marybeth PETERS | |
Copyright law protects creative works. Protection granted by the law to | |
authors and disseminators of works includes the right to do or authorize | |
the following: reproduce the work, prepare derivative works, distribute | |
the work to the public, and publicly perform or display the work. In | |
addition, copyright owners of sound recordings and computer programs have | |
the right to control rental of their works. These rights are not | |
unlimited; there are a number of exceptions and limitations. | |
An electronic environment places strains on the copyright system. | |
Copyright owners want to control uses of their work and be paid for any | |
use; the public wants quick and easy access at little or no cost. The | |
marketplace is working in this area. Contracts, guidelines on electronic | |
use, and collective licensing are in use and being refined. | |
Issues concerning the ability to change works without detection are more | |
difficult to deal with. Questions concerning the integrity of the work | |
and the status of the changed version under the copyright law are to be | |
addressed. These are public policy issues which require informed | |
dialogue. | |
*** *** *** ****** *** *** *** | |
Appendix III: DIRECTORY OF PARTICIPANTS | |
PRESENTERS: | |
Pamela Q.J. Andre | |
Associate Director, Automation | |
National Agricultural Library | |
10301 Baltimore Boulevard | |
Beltsville, MD 20705-2351 | |
Phone: (301) 504-6813 | |
Fax: (301) 504-7473 | |
E-mail: INTERNET: PANDRE@ASRR.ARSUSDA.GOV | |
Jean Baronas, Senior Manager | |
Department of Standards and Technology | |
Association for Information and Image Management (AIIM) | |
1100 Wayne Avenue, Suite 1100 | |
Silver Spring, MD 20910 | |
Phone: (301) 587-8202 | |
Fax: (301) 587-2711 | |
Patricia Battin, President | |
The Commission on Preservation and Access | |
1400 16th Street, N.W. | |
Suite 740 | |
Washington, DC 20036-2217 | |
Phone: (202) 939-3400 | |
Fax: (202) 939-3407 | |
E-mail: CPA@GWUVM.BITNET | |
Howard Besser | |
Centre Canadien d'Architecture | |
(Canadian Center for Architecture) | |
1920, rue Baile | |
Montreal, Quebec H3H 2S6 | |
CANADA | |
Phone: (514) 939-7001 | |
Fax: (514) 939-7020 | |
E-mail: howard@lis.pitt.edu | |
Edwin B. Brownrigg, Executive Director | |
Memex Research Institute | |
422 Bonita Avenue | |
Roseville, CA 95678 | |
Phone: (916) 784-2298 | |
Fax: (916) 786-7559 | |
E-mail: BITNET: MEMEX@CALSTATE.2 | |
Eric M. Calaluca, Vice President | |
Chadwyck-Healey, Inc. | |
1101 King Street | |
Alexandria, VA 22314 | |
Phone: (800) 752-0515 | |
Fax: (703) 683-7589 | |
James Daly | |
4015 Deepwood Road | |
Baltimore, MD 21218-1404 | |
Phone: (410) 235-0763 | |
Ricky Erway, Associate Coordinator | |
American Memory | |
Library of Congress | |
Phone: (202) 707-6233 | |
Fax: (202) 707-3764 | |
Carl Fleischhauer, Coordinator | |
American Memory | |
Library of Congress | |
Phone: (202) 707-6233 | |
Fax: (202) 707-3764 | |
Joanne Freeman | |
2000 Jefferson Park Avenue, No. 7 | |
Charlottesville, VA 22903 | |
Prosser Gifford | |
Director for Scholarly Programs | |
Library of Congress | |
Phone: (202) 707-1517 | |
Fax: (202) 707-9898 | |
E-mail: pgif@seq1.loc.gov | |
Jacqueline Hess, Director | |
National Demonstration Laboratory | |
for Interactive Information Technologies | |
Library of Congress | |
Phone: (202) 707-4157 | |
Fax: (202) 707-2829 | |
Susan Hockey, Director | |
Center for Electronic Texts in the Humanities (CETH) | |
Alexander Library | |
Rutgers University | |
169 College Avenue | |
New Brunswick, NJ 08903 | |
Phone: (908) 932-1384 | |
Fax: (908) 932-1386 | |
E-mail: hockey@zodiac.rutgers.edu | |
William L. Hooton, Vice President | |
Business & Technical Development | |
Imaging & Information Systems Group | |
I-NET | |
6430 Rockledge Drive, Suite 400 | |
Bethesda, MD 20817 | |
Phone: (301) 564-6750 | |
Fax: (513) 564-6867 | |
Anne R. Kenney, Associate Director | |
Department of Preservation and Conservation | |
701 Olin Library | |
Cornell University | |
Ithaca, NY 14853 | |
Phone: (607) 255-6875 | |
Fax: (607) 255-9346 | |
E-mail: LYDY@CORNELLA.BITNET | |
Ronald L. Larsen | |
Associate Director for Information Technology | |
University of Maryland at College Park | |
Room B0224, McKeldin Library | |
College Park, MD 20742-7011 | |
Phone: (301) 405-9194 | |
Fax: (301) 314-9865 | |
E-mail: rlarsen@libr.umd.edu | |
Maria L. Lebron, Managing Editor | |
The Online Journal of Current Clinical Trials | |
1333 H Street, N.W. | |
Washington, DC 20005 | |
Phone: (202) 326-6735 | |
Fax: (202) 842-2868 | |
E-mail: PUBSAAAS@GWUVM.BITNET | |
Michael Lesk, Executive Director | |
Computer Science Research | |
Bell Communications Research, Inc. | |
Rm 2A-385 | |
445 South Street | |
Morristown, NJ 07960-1910 | |
Phone: (201) 829-4070 | |
Fax: (201) 829-5981 | |
E-mail: lesk@bellcore.com (Internet) or bellcore!lesk (uucp) | |
Clifford A. Lynch | |
Director, Library Automation | |
University of California, | |
Office of the President | |
300 Lakeside Drive, 8th Floor | |
Oakland, CA 94612-3350 | |
Phone: (510) 987-0522 | |
Fax: (510) 839-3573 | |
E-mail: calur@uccmvsa | |
Avra Michelson | |
National Archives and Records Administration | |
NSZ Rm. 14N | |
7th & Pennsylvania, N.W. | |
Washington, D.C. 20408 | |
Phone: (202) 501-5544 | |
Fax: (202) 501-5533 | |
E-mail: tmi@cu.nih.gov | |
Elli Mylonas, Managing Editor | |
Perseus Project | |
Department of the Classics | |
Harvard University | |
319 Boylston Hall | |
Cambridge, MA 02138 | |
Phone: (617) 495-9025, (617) 495-0456 (direct) | |
Fax: (617) 496-8886 | |
E-mail: Elli@IKAROS.Harvard.EDU or elli@wjh12.harvard.edu | |
David Woodley Packard | |
Packard Humanities Institute | |
300 Second Street, Suite 201 | |
Los Altos, CA 94002 | |
Phone: (415) 948-0150 (PHI) | |
Fax: (415) 948-5793 | |
Lynne K. Personius, Assistant Director | |
Cornell Information Technologies for | |
Scholarly Information Sources | |
502 Olin Library | |
Cornell University | |
Ithaca, NY 14853 | |
Phone: (607) 255-3393 | |
Fax: (607) 255-9346 | |
E-mail: JRN@CORNELLC.BITNET | |
Marybeth Peters | |
Policy Planning Adviser to the | |
Register of Copyrights | |
Library of Congress | |
Office LM 403 | |
Phone: (202) 707-8350 | |
Fax: (202) 707-8366 | |
C. Michael Sperberg-McQueen | |
Editor, Text Encoding Initiative | |
Computer Center (M/C 135) | |
University of Illinois at Chicago | |
Box 6998 | |
Chicago, IL 60680 | |
Phone: (312) 413-0317 | |
Fax: (312) 996-6834 | |
E-mail: u35395@uicvm.cc.uic.edu or u35395@uicvm.bitnet | |
George R. Thoma, Chief | |
Communications Engineering Branch | |
National Library of Medicine | |
8600 Rockville Pike | |
Bethesda, MD 20894 | |
Phone: (301) 496-4496 | |
Fax: (301) 402-0341 | |
E-mail: thoma@lhc.nlm.nih.gov | |
Dorothy Twohig, Editor | |
The Papers of George Washington | |
504 Alderman Library | |
University of Virginia | |
Charlottesville, VA 22903-2498 | |
Phone: (804) 924-0523 | |
Fax: (804) 924-4337 | |
Susan H. Veccia, Team leader | |
American Memory, User Evaluation | |
Library of Congress | |
American Memory Evaluation Project | |
Phone: (202) 707-9104 | |
Fax: (202) 707-3764 | |
E-mail: svec@seq1.loc.gov | |
Donald J. Waters, Head | |
Systems Office | |
Yale University Library | |
New Haven, CT 06520 | |
Phone: (203) 432-4889 | |
Fax: (203) 432-7231 | |
E-mail: DWATERS@YALEVM.BITNET or DWATERS@YALEVM.YCC.YALE.EDU | |
Stuart Weibel, Senior Research Scientist | |
OCLC | |
6565 Frantz Road | |
Dublin, OH 43017 | |
Phone: (614) 764-6081 | |
Fax: (614) 764-2344 | |
E-mail: INTERNET: Stu@rsch.oclc.org | |
Robert G. Zich | |
Special Assistant to the Associate Librarian | |
for Special Projects | |
Library of Congress | |
Phone: (202) 707-6233 | |
Fax: (202) 707-3764 | |
E-mail: rzic@seq1.loc.gov | |
Judith A. Zidar, Coordinator | |
National Agricultural Text Digitizing Program | |
Information Systems Division | |
National Agricultural Library | |
10301 Baltimore Boulevard | |
Beltsville, MD 20705-2351 | |
Phone: (301) 504-6813 or 504-5853 | |
Fax: (301) 504-7473 | |
E-mail: INTERNET: JZIDAR@ASRR.ARSUSDA.GOV | |
OBSERVERS: | |
Helen Aguera, Program Officer | |
Division of Research | |
Room 318 | |
National Endowment for the Humanities | |
1100 Pennsylvania Avenue, N.W. | |
Washington, D.C. 20506 | |
Phone: (202) 786-0358 | |
Fax: (202) 786-0243 | |
M. Ellyn Blanton, Deputy Director | |
National Demonstration Laboratory | |
for Interactive Information Technologies | |
Library of Congress | |
Phone: (202) 707-4157 | |
Fax: (202) 707-2829 | |
Charles M. Dollar | |
National Archives and Records Administration | |
NSZ Rm. 14N | |
7th & Pennsylvania, N.W. | |
Washington, DC 20408 | |
Phone: (202) 501-5532 | |
Fax: (202) 501-5512 | |
Jeffrey Field, Deputy to the Director | |
Division of Preservation and Access | |
Room 802 | |
National Endowment for the Humanities | |
1100 Pennsylvania Avenue, N.W. | |
Washington, DC 20506 | |
Phone: (202) 786-0570 | |
Fax: (202) 786-0243 | |
Lorrin Garson | |
American Chemical Society | |
Research and Development Department | |
1155 16th Street, N.W. | |
Washington, D.C. 20036 | |
Phone: (202) 872-4541 | |
E-mail: INTERNET: LRG96@ACS.ORG | |
William M. Holmes, Jr. | |
National Archives and Records Administration | |
NSZ Rm. 14N | |
7th & Pennsylvania, N.W. | |
Washington, DC 20408 | |
Phone: (202) 501-5540 | |
Fax: (202) 501-5512 | |
E-mail: WHOLMES@AMERICAN.EDU | |
Sperling Martin | |
Information Resource Management | |
20030 Doolittle Street | |
Gaithersburg, MD 20879 | |
Phone: (301) 924-1803 | |
Michael Neuman, Director | |
The Center for Text and Technology | |
Academic Computing Center | |
238 Reiss Science Building | |
Georgetown University | |
Washington, DC 20057 | |
Phone: (202) 687-6096 | |
Fax: (202) 687-6003 | |
E-mail: neuman@guvax.bitnet, neuman@guvax.georgetown.edu | |
Barbara Paulson, Program Officer | |
Division of Preservation and Access | |
Room 802 | |
National Endowment for the Humanities | |
1100 Pennsylvania Avenue, N.W. | |
Washington, DC 20506 | |
Phone: (202) 786-0577 | |
Fax: (202) 786-0243 | |
Allen H. Renear | |
Senior Academic Planning Analyst | |
Brown University Computing and Information Services | |
115 Waterman Street | |
Campus Box 1885 | |
Providence, R.I. 02912 | |
Phone: (401) 863-7312 | |
Fax: (401) 863-7329 | |
E-mail: BITNET: Allen@BROWNVM or | |
INTERNET: Allen@brownvm.brown.edu | |
Susan M. Severtson, President | |
Chadwyck-Healey, Inc. | |
1101 King Street | |
Alexandria, VA 22314 | |
Phone: (800) 752-0515 | |
Fax: (703) 683-7589 | |
Frank Withrow | |
U.S. Department of Education | |
555 New Jersey Avenue, N.W. | |
Washington, DC 20208-5644 | |
Phone: (202) 219-2200 | |
Fax: (202) 219-2106 | |
(LC STAFF) | |
Linda L. Arret | |
Machine-Readable Collections Reading Room LJ 132 | |
(202) 707-1490 | |
John D. Byrum, Jr. | |
Descriptive Cataloging Division LM 540 | |
(202) 707-5194 | |
Mary Jane Cavallo | |
Science and Technology Division LA 5210 | |
(202) 707-1219 | |
Susan Thea David | |
Congressional Research Service LM 226 | |
(202) 707-7169 | |
Robert Dierker | |
Senior Adviser for Multimedia Activities LM 608 | |
(202) 707-6151 | |
William W. Ellis | |
Associate Librarian for Science and Technology LM 611 | |
(202) 707-6928 | |
Ronald Gephart | |
Manuscript Division LM 102 | |
(202) 707-5097 | |
James Graber | |
Information Technology Services LM G51 | |
(202) 707-9628 | |
Rich Greenfield | |
American Memory LM 603 | |
(202) 707-6233 | |
Rebecca Guenther | |
Network Development LM 639 | |
(202) 707-5092 | |
Kenneth E. Harris | |
Preservation LM G21 | |
(202) 707-5213 | |
Staley Hitchcock | |
Manuscript Division LM 102 | |
(202) 707-5383 | |
Bohdan Kantor | |
Office of Special Projects LM 612 | |
(202) 707-0180 | |
John W. Kimball, Jr. | |
Machine-Readable Collections Reading Room LJ 132 | |
(202) 707-6560 | |
Basil Manns | |
Information Technology Services LM G51 | |
(202) 707-8345 | |
Sally Hart McCallum | |
Network Development LM 639 | |
(202) 707-6237 | |
Dana J. Pratt | |
Publishing Office LM 602 | |
(202) 707-6027 | |
Jane Riefenhauser | |
American Memory LM 603 | |
(202) 707-6233 | |
William Z. Schenck | |
Collections Development LM 650 | |
(202) 707-7706 | |
Chandru J. Shahani | |
Preservation Research and Testing Office (R&T) LM G38 | |
(202) 707-5607 | |
William J. Sittig | |
Collections Development LM 650 | |
(202) 707-7050 | |
Paul Smith | |
Manuscript Division LM 102 | |
(202) 707-5097 | |
James L. Stevens | |
Information Technology Services LM G51 | |
(202) 707-9688 | |
Karen Stuart | |
Manuscript Division LM 130 | |
(202) 707-5389 | |
Tamara Swora | |
Preservation Microfilming Office LM G05 | |
(202) 707-6293 | |
Sarah Thomas | |
Collections Cataloging LM 642 | |
(202) 707-5333 | |
END | |
************************************************************* | |
Note: This file has been edited for use on computer networks. This | |
editing required the removal of diacritics, underlining, and fonts such | |
as italics and bold. | |
kde 11/92 | |
[A few of the italics (when used for emphasis) were replaced by CAPS mh] | |
*End of The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS