Coding an Online, Open Access Edition of Poetry in TEI
By Donna Jan Pridmore. Presented at the Society for Textual Scholarship 2008 Conference, March 13, 2008

I’m going to be talking today about a project I undertook to teach myself the Text Encoding Initiative. I’m interested in online publishing, and especially inspired by the idea of the internet as a great free public library for the world. It was for that reason that I decided to come back and finally finish my PhD, which I left hanging at the ABD point about 15 years ago. I began this project already knowing html; in fact, I’ve been publishing a web site called literaryhistory.com in html for the past ten years. (HTML is the easy code for creating internet documents, but the code used in the TEI is xml, which is notoriously difficult.) I've known about the TEI for a long time, and knew I should learn it, so I created this project to give myself a chance to learn by doing. It was, and is, an independent study, for which Andrew Stauffer is my advisor. Dr. Stauffer worked on the Rossetti Archive as a graduate student and has written about it. His understanding of the Rossetti Archive has been a great help to me, and in fact I’ve ended up using the Rossetti Archive as the structural model for my project. The fact that a structural model like the Rossetti Archive already existed is really what made it possible for me to understand what I needed to do to create a good TEI project.

My project involved creating an online edition of Frank O’Hara’s Lunch Poems. It had an earlier iteration as a project for another class, when I created the O’Hara edition in html because I didn’t yet know TEI. It was a very different project then, and it has been reconceived in the TEI version. The html project was intended as an exploration of what could be done in an online version of a poem that would not be possible in a print version, and of how one could take advantage of the new resources of the internet.
The Lunch Poems project in html and the draft of the project in TEI can be seen online through a link at my web site, www.literaryhistory.com. Because it is a demonstration project and the poetry is online without permission, the public link was only added on March 1, 2008, and there will no longer be public access after March 31. [The link and all public access to the project were removed on March 30 after an angry email from the Frank O'Hara estate.] My main concerns in the first (html) version were how to transcribe the poems so that the digital version accurately represented the appearance of the printed page, how sound might be incorporated in a web page, and how to handle the notes in a digital edition: where to place them, how to indicate their presence without defacing the poetry, and how to let the reader navigate back and forth between text and notes with the least disruption. The annotations to the html version include about 60 image files as explanatory illustrations, none of which are used with permission. I mention these details about permissions and copyright because copyright is obviously a major problem that anyone interested in open access scholarly publishing will have to contend with, and it will place limits, which are legal rather than technical, on what we can do. While linking to another site is permitted under the fair use provisions of copyright, copying and pasting an image onto your web site is not. And of course, twentieth century literature is still covered by copyright and so off-limits to scholars, unless permissions are obtained. I chose Frank O'Hara as the subject of my first project because I liked his poetry and because, with his involvement with visual art and artists, his poetry seemed to lend itself well to incorporating images.
Because it was just a project for a class, and especially in the first version quite exploratory and amateurish, I did not seek permission from the estate to work with his poetry, though I did have in mind that once I had a good, completed project I might show it to the estate and ask for permission to do something really scholarly. At that point I had not given much thought to the way copyright law seriously limits, and probably prevents, textual scholarship in twentieth century literature. In fact, one of the main things I learned from this project is that it is just not worth it to spend one's time on a twentieth century author's text if one is interested in open access publishing. The second phase of my Frank O’Hara project involved coding the project so that it meets the requirements of the TEI. The Modern Language Association’s recommendations to vetters of online scholarly editions now include the expectation that those publications will be in conformance with the TEI. TEI is not a coding language the way html and xml are; the Text Encoding Initiative is a set of requirements for "correct" coding, which change and are revised over time. The TEI currently requires the use of xml coding. Until I grappled with xml as a coder this did not actually sink in with me, but it is important to note that xml is not a human-readable code. It is a machine-readable code. Well, that is not quite accurate; if you are a geek, you can read xml. To be precise, if you are a human using a browser on the internet to read documents there, which is the way humans often read documents on the internet, the document has to be in html. Perhaps someday browsers will be able to handle documents in xml, but they do not currently. The other option for reading documents on the internet is pdf, which is the format used by the Adobe reader. So creating a document under the standards of the TEI is a two-step process.
First you code the material in xml, a difficult markup code to learn but one that is machine-readable. Then you apply a translation process to the xml document to convert it to html, so that it can be read by humans on the internet. The value of all this is that these codes, both xml and html, are open and universal: the html that results can be read in any browser, so that it does not require special software like Microsoft Word or other proprietary, commercial, incessantly updated and obsoleted software investments to read an online document. Coding in xml introduces some additional concepts beyond those available in html. The explanation usually given is that html is a presentational code and xml is a structural code. In poetry, xml lets you code where a stanza begins and ends and where a line begins and ends. XML also allows you to tag otherwise ambiguous words. For example, in a document that contains references both to Emily Dickinson and to Fairleigh Dickinson University Press, you can code one as an author name and the other as a publisher, so that when readers do an electronic search for Emily Dickinson they will not retrieve hits that include Fairleigh Dickinson University Press. An advantage of having the xml (machine-readable) code separate from the visual, presentational code (the html) is that an editor can exercise a great deal of control over the presentation of the text. The same xml coded file could be displayed on the screen in many different ways, with notes at the end of pages or the end of volumes, in italic font or red font, and so on, since the presentation is completely separate from the core, structural data. The way the structure will be translated into a presentation is under the control of the translation, which is the function, in xml, of the stylesheet (written in XSL, or more precisely in XSLT, the XSL transformation language).
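The separation of structure from presentation described above can be sketched in a few lines. This is only an illustration, not the stylesheet used in the project: it uses Python's standard library in place of XSLT, the markup is a simplified, namespace-free imitation of TEI verse tagging (`lg` for a stanza, `l` for a line), and the poem text, function names, and HTML class names are all invented placeholders.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up poem in simplified TEI-style markup (no namespace, placeholder text).
TEI_FRAGMENT = """
<div type="poem">
  <head>Placeholder Title</head>
  <lg type="stanza">
    <l>first line of the first stanza</l>
    <l>second line of the first stanza</l>
  </lg>
  <lg type="stanza">
    <l>only line of the second stanza</l>
  </lg>
</div>
"""


def render_plain(poem):
    """Presentation 1: each stanza becomes a paragraph with line breaks."""
    parts = ["<h2>%s</h2>" % escape(poem.findtext("head", default=""))]
    for lg in poem.findall("lg"):
        lines = [escape(line.text or "") for line in lg.findall("l")]
        parts.append("<p>" + "<br/>".join(lines) + "</p>")
    return "\n".join(parts)


def render_numbered(poem):
    """Presentation 2: the same structure, with running line numbers."""
    parts = ["<h2>%s</h2>" % escape(poem.findtext("head", default=""))]
    n = 0
    for lg in poem.findall("lg"):
        parts.append('<div class="stanza">')
        for line in lg.findall("l"):
            n += 1
            parts.append('  <div class="line"><span class="n">%d</span> %s</div>'
                         % (n, escape(line.text or "")))
        parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    poem = ET.fromstring(TEI_FRAGMENT)
    print(render_plain(poem))
    print(render_numbered(poem))
```

Both functions read the same structural file; only the rendering rules differ, which is the role a stylesheet plays in a real TEI workflow.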
Stylesheet design is one of the more complicated parts of the TEI to master, and may be something editors will turn over to technical specialists, although some TEI editors like Laura Mandell do design their own stylesheets. The TEI consortium provides several free stylesheets on their web site, so it is not necessary for an editor to take on the task of stylesheet design at the outset of a TEI project. My Frank O’Hara TEI project currently posted at literaryhistory.com uses the Oxygen default stylesheet. XML also introduces the standardized use of detailed bibliographical information in the "header" of the coding, in which you provide information about who is publishing and distributing this electronic version, what its copyright status is, who to contact for more information, and full bibliographic information about the text that is being transcribed and the type of work it is. This information is undoubtedly helpful for librarians, although they will probably be required to verify it.
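As a rough illustration of the kind of information the header carries, the sketch below embeds a much-reduced header in a Python string and reads a few fields back out. The element names (`teiHeader`, `fileDesc`, `titleStmt`, `publicationStmt`, `availability`, `sourceDesc`, `bibl`) follow TEI usage, but the namespace is omitted for brevity, a real header has more required detail, and every value shown is an invented placeholder rather than the project's actual header.

```python
import xml.etree.ElementTree as ET

# Reduced, namespace-free header with placeholder values only.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Placeholder Poems: an electronic transcription</title>
      <author>Placeholder Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Publisher</publisher>
      <availability><p>Demonstration only; rights not cleared.</p></availability>
    </publicationStmt>
    <sourceDesc>
      <bibl>Placeholder Author, Placeholder Poems (Example City: Example Press, 1900).</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""


def summarize_header(xml_text):
    """Collect the fields a cataloguer would look for first."""
    root = ET.fromstring(xml_text)

    def text(path):
        node = root.find(path)
        if node is None:
            return None
        # Join nested text and normalize whitespace.
        return " ".join("".join(node.itertext()).split())

    return {
        "title": text("fileDesc/titleStmt/title"),
        "author": text("fileDesc/titleStmt/author"),
        "publisher": text("fileDesc/publicationStmt/publisher"),
        "availability": text("fileDesc/publicationStmt/availability"),
        "source": text("fileDesc/sourceDesc/bibl"),
    }


if __name__ == "__main__":
    for field, value in summarize_header(HEADER).items():
        print(field + ":", value)
```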
XML tagging to indicate document structure and remove ambiguities for search purposes is uncontroversial, but xml tagging can become controversial. In my Frank O’Hara project, in the beginning, I wondered if I should tag certain words as "places," since New York City locations are important in his poetry, and if I should tag names of his friends, and names of artists and musicians, and time references, all of which are distinguishing features of his poetry. Perhaps tagging will someday be practiced as a form of literary criticism. But after some thought and investigation of what other people are doing, I concluded that was a problem too advanced for me. I decided to stick to the simplest, most uncontroversial use of tags in my project.
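The simple, uncontroversial use of tags to remove ambiguity for searching might look something like the following sketch. It assumes a small bibliography in simplified, namespace-free TEI-style markup; the entries other than the two names from the Dickinson example are invented, and the two search functions are hypothetical helpers, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# Two invented bibliography entries; only the "Dickinson" names come from the talk.
SAMPLE = """
<listBibl>
  <bibl><author>Emily Dickinson</author>, <title>Poems</title>, <publisher>Example House</publisher></bibl>
  <bibl><author>A. Scholar</author>, <title>A Study</title>, <publisher>Fairleigh Dickinson University Press</publisher></bibl>
</listBibl>
"""


def plain_text_hits(root, word):
    """What a plain text search sees: every entry containing the word anywhere."""
    return [b for b in root.iter("bibl") if word in "".join(b.itertext())]


def tagged_hits(root, tag, word):
    """A tag-aware search: only entries where the word occurs inside the given tag."""
    return [b for b in root.iter("bibl")
            if any(word in (el.text or "") for el in b.iter(tag))]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(len(plain_text_hits(root, "Dickinson")), "plain hits")
    print(len(tagged_hits(root, "author", "Dickinson")), "hits as author")
    print(len(tagged_hits(root, "publisher", "Dickinson")), "hits as publisher")
```

The plain search returns both entries, while the tag-aware search separates the poet from the press.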
In the TEI version of my Frank O’Hara project I came to realize that one kind of project that is well worth doing on the internet, and that takes advantage of the internet’s huge storage capacity, is to provide the complete bibliographical record for a given author, as the Rossetti Archive does for Dante Gabriel Rossetti. My Frank O’Hara archive will not be completed, not by me anyway, but it has given me the chance to see how a complete Frank O’Hara archive, or a complete archive for any author, would be organized, planned, and executed. The ideal Frank O’Hara archive would contain facsimiles of all witnesses of each of his works, of each page, and it would include diplomatic transcriptions of all the facsimiles, which makes the material machine-searchable. A diplomatic transcription plus facsimile is the perfect solution for the inability of html to handle subtle formatting. An archive can also contain introductory essays, essays on the publishing history, annotations, recommended secondary reading, or whatever else seems desirable to the editor. But the heart of the archive is that it is a complete bibliographical reference for the author. It should digitize and make publicly accessible all the primary material that a textual editor uses to prepare an authorized edition. An internet archive does not have to stop at the point of presenting all the primary material, though they usually do. For this reason many of us think of them as do-it-yourself editions. But it is not impossible to create a program that can collate the digitized witnesses. Juxta, a program available through the Networked Infrastructure for Nineteenth-century Electronic Scholarship (NINES), at http://www.nines.org/tools/index.html, is one program already available that begins to do this. In my ideal Frank O’Hara archive I would use Juxta on the assembled witnesses to, at least, evaluate Juxta’s ability to help prepare a new edition. 
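To give a sense of what collating digitized witnesses involves, here is a very small sketch using Python's standard `difflib` module. It is not how Juxta works, which is not described here; it simply assumes two witnesses of the same poem with the same number of lines, already aligned line for line, and reports word-level differences. The witness texts are invented placeholders, not O’Hara's poetry.

```python
import difflib

# Two invented "witnesses" of the same placeholder poem.
WITNESS_A = [
    "the morning opens like a door",
    "and traffic hums along the avenue",
    "i buy a paper and a cup of tea",
]
WITNESS_B = [
    "the morning opens like a door",
    "and traffic hums along the avenue,",
    "I buy a paper and a cup of coffee",
]


def collate(a_lines, b_lines):
    """Return (line number, reading in A, reading in B) for each variant.

    Assumes both witnesses have the same lines in the same order;
    a real collation tool also has to handle added or missing lines.
    """
    variants = []
    for n, (a, b) in enumerate(zip(a_lines, b_lines), start=1):
        if a == b:
            continue
        a_words, b_words = a.split(), b.split()
        matcher = difflib.SequenceMatcher(None, a_words, b_words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                variants.append((n, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return variants


if __name__ == "__main__":
    for line_no, reading_a, reading_b in collate(WITNESS_A, WITNESS_B):
        print("line %d: %r vs %r" % (line_no, reading_a, reading_b))
```

Even this toy version shows why transcriptions matter: only text that has been transcribed can be compared by machine, while facsimiles alone cannot.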
Ideally a new edition would be the final product of the archive, with all the sources for that edition at hand for the reader to consult. This is the direction I plan to take with my future research in my dissertation, but applied to an earlier, out of copyright poet. My mini-archive project for Frank O’Hara is a chance to come face to face with the kinds of issues one has to deal with in the process of creating a digital archive. The xml coding is only one of the problems. Another is the organization of the huge amount of material on the back end, and the presentation of this huge amount of material in a way that makes sense to a human on the front end. It helps very much to have a model like the Rossetti Archive to imitate. The Rossetti Archive has already developed a structure for the many types of information that need to connect, and it is not too hard to reverse-engineer a good design once someone else has worked out the solution, or a solution. The concept of a "work" as distinguished from a "document," which is used at the Rossetti Archive, and also at the Whitman Archive, is greatly helpful for organizing and conceptualizing an archive. Take a poem by Frank O’Hara, say "Music": when one thinks of it as a "work," it becomes an organizing concept. Individual instances of this poem, for example its first publication in the periodical Yugen in 1959, its appearance as the first poem of Lunch Poems in 1965 and in The Collected Poems of Frank O’Hara in 1971, and any manuscript versions that exist, are "documents." This distinction between work and document may be debatable on theoretical grounds, but for practical purposes it is very useful for naming files and organizing a complicated file structure. In my Frank O’Hara archive in TEI there are two types of works. Individual poems are works, a concept expansive enough to embrace sound files of O’Hara reading a poem as an instance or document. His books are another type of work, since they could exist in several versions.
(A Frank O’Hara archive, to be complete, would have to include additional, unusual types of works, such as the films and art he collaborated on and his art criticism.) For this project I am only transcribing the 1965 City Lights edition of Lunch Poems and the pages of Collected Poems that contain the poems in Lunch Poems. The transcription of Collected Poems will be presented as an incomplete transcription, with large gaps. If the archive were complete, it would include transcriptions of all his books, and all other witnesses of all the poems, with facsimiles of all the pages. Even incomplete, the structure of the archive already shows how it could include the entire Frank O’Hara bibliographical record. When you consider the thousands of files you will end up with when you compile all the versions of all the works, it becomes obvious that some kind of clear and consistent file naming and organizing system is necessary, or you will never know how to find anything or how to instruct the computer, in your code, to find anything. In my archive there are contents pages, which list all the works. Works pages list the document instances of each work. The document instances are available in my small archive in only two versions, the version from Lunch Poems and the version from Collected Poems (CP). They are available in transcription and in facsimile. There are even two versions of the facsimiles, a thumbnail and a full-page version. All these files, or pages, or what may only be locations within larger files, have to be named so that they can be retrieved by the computer. As this one tiny example of one poem shows, the naming of files and locations within files in an archive takes thought and planning. The organization on the back end is this involved because so many things are available at the archive, and the computer has to have a way of locating them in case anyone wants to see them. Grubby as the details of naming files are, they are key to having an internet archive work.
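One way to picture such a naming system is sketched below. The scheme is entirely hypothetical: the identifiers (`music`, `lp1965`, `cp1971`), the folder layout, and the file extensions are assumptions made for illustration and are not the names used in this project or at the Rossetti Archive. The point is only that each file name is built from the work, the document, and the kind of file, so the computer can always work out where a given version lives.

```python
from dataclasses import dataclass

# Hypothetical file kinds and extensions: a transcription plus two facsimile sizes.
FORMATS = {"transcription": "xml", "facsimile": "jpg", "thumbnail": "jpg"}


@dataclass(frozen=True)
class Instance:
    """One retrievable file: a kind of file for one document instance of one work."""
    work: str      # hypothetical work identifier, e.g. "music"
    document: str  # hypothetical document identifier, e.g. "lp1965"
    kind: str      # one of the keys of FORMATS

    def path(self):
        if self.kind not in FORMATS:
            raise ValueError("unknown kind: %s" % self.kind)
        return "works/%s/%s.%s.%s.%s" % (
            self.work, self.work, self.document, self.kind, FORMATS[self.kind])


def works_page(work, documents):
    """List every file a works page would need to link to."""
    return [Instance(work, doc, kind).path() for doc in documents for kind in FORMATS]


if __name__ == "__main__":
    for p in works_page("music", ["lp1965", "cp1971"]):
        print(p)
```

With two document instances and three kinds of file, a single poem already accounts for six files, which gives some idea of how quickly a complete archive grows.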
When a reader wants to see "Music," she chooses whether to see a transcription from Collected Poems, or from Lunch Poems, or from the first magazine publication, or the manuscript, or a facsimile of any of those. The computer has to know where that version is, so it has to be named. But from the front end, from the reader’s perspective, all this may seem like too many choices. The reader may just want to read "Music" without particularly caring which version, and without wading through all the choices just to read the poem. Designing the front end, the reader’s experience of the archive, is the most difficult challenge of all. It will probably be the biggest problem for internet archives, and confusing user interfaces will probably be the main reason readers are dissatisfied with archives. Providing more choices than the average viewer wants is just one of the ways an internet archive can frustrate the reader, and other problems with designing the user’s experience will be mentioned later. An archive that yields a new edition might be a way of solving the problem of too many choices for the reader. The new edition could be the gateway into the source material; in effect, the new edition would be the interface. That way the casual reader could stop at the first level, reading only the edition’s version of "Music," but the more investigative reader would be able to drill down to the sources through certain markers in the edition. The interface design problem then shifts to the problem of how to design an edition that is visually intelligible to the reader when it contains so much material. But at least that approach implies subordinating many of the details. Annotations continue to present a problem for electronic editions. In theory one would think the ability to jump directly to a note by clicking a hyperlink would be a plus.
But because the transcriptions are in the form of a scroll (often a very long one), there is no good place to put the notes, as there is when notes are at the bottom of the page of a printed book. One solution has been to divide the screen into several panes or windows, with the poem in the largest window, and two smaller windows, one below the poem and one to the right, for textual and contextual notes. This is a fairly attractive design solution, and can remove the clutter of flags in the poems, since the notes can be listed by line number. Another solution is to place the notes at the end of the scroll. But this seems to entail a heavier use of flags in the text of the poem, to alert the reader that notes are available and to take him to the notes. When notes can be seen at a glance at the side of the page in the window pane approach, or at the bottom of the page in the printed book, their presence is partly intuited and they do not need to be as heavily flagged, and as disruptive, in the poetry. Another solution is mouse-overs, with the notes popping up automatically, whether you want them or not, any time you touch the word with your cursor. But many users find mouse-overs extremely annoying. The scroll can be a disorienting experience online. Its lack of physicality deprives the reader of a sense of where she is, and one cannot put a piece of paper at the table of contents to quickly flip back and forth. Heaven help you if you want to refer back to something in a 500-page transcription on a scroll. The clues that print editions use to orient the reader, like discreetly repeating the title of a long poem at the head of each page or printing the title of the book at the head of the verso and the title of the poem on the recto, are awkward and heavy-handed when placed mid-screen in the scroll. But these presentation problems will no doubt be solved eventually by clever screen designers. These are early days for online publishing.
This discussion has so far restricted itself to describing the "browse" experience of an internet text, but readers also expect to be able to "search" a digital text. Searching in TEI is not a straightforward issue. Simple searching is possible for anyone using a browser, simply by selecting the "search this document" command in the browser, but what is needed for an xml document is a search engine that is xml aware, one that can read the xml and recognize that, to return to an earlier example, "Dickinson" as an author name is distinguished from "Dickinson" as a publisher. There are many free search engines on the internet for use with html, including a widely used one from Google, and a few for use with xml. I have experience with some of the free html search engines, which tend to be imprecise, give the editor little control over search results, and inflict inappropriate ads on the user. I have not yet waded into the water of the free xml search engines that can be found online. The TEI consortium also provides links to a few search engines under development by TEI organizations. For now xml search engines seem to be developed for specific purposes, such as searching extensive and varied collections in library catalogs, and they seem to be the province of programmers. Searching is probably an area where the online textual editor will need to bring in computer experts. Internet archives on a single author are one of several ways documents coded in TEI are being published online. The majority of TEI projects are devoted to transcribing documents that do not have any necessary connection with each other. A library may be especially active in trying to transcribe its collections and make them publicly available. Obscure and hard to obtain works of little known authors are being transcribed by scholars, enriching the resources on the public internet and in university libraries.
But a digital archive on a single author is a special kind of online publishing project, distinguished by its thoroughness in locating, presenting, and describing all the relevant primary material. So although the internet often seems to be the publication venue for the obscure and the otherwise unpublishable, an internet archive on a single author, because it demands so much work and so much scholarly expertise, is probably most appropriate for a major, canonical author. For now, the most promising period to be working in for someone interested in internet editions seems to be the long nineteenth century. Thanks to the leadership of Jerome McGann and the IATH at the University of Virginia, the Romanticists associated with Romantic Circles and Romanticism on the Net, and Ed Folsom and Kenneth Price at the Whitman Archive, there appear to be more sources of help and support for aspiring online editors in nineteenth century studies than in other periods. Open access projects will probably always require public or foundation financing because they have little hope of earning revenue, but when projects are sponsored by academic departments they can offer in-kind resources like editorship by eminent scholars and graduate student labor, and obtain technical expertise and support from collaborative organizations like NINES. There seems to be a strong enough foundation for internet archives by now, especially in nineteenth century studies, for us to have confidence that they will be a part of the future of our profession.