Digitization Process

Startup
This project began in late 2006 with a meeting between Professor Grant and Andrew Rouner, Director of the Digital Library. At the time, Professor Grant had begun to engage some of her graduate students in transcribing the original Chinese sources of the English translations that appeared in her book, The Red Brush. The students keyed the transcriptions into Microsoft Word documents using the Microsoft IME, with the hope that these documents might be made web-viewable at some later date.

The first steps DLS took in working with Professor Grant were to develop a filenaming convention, to separate documents so that each text corresponded to exactly one file, and then to create a workflow based on what Professor Grant and her students were already doing. After some exploration and experimentation, we found that Microsoft Word had built-in functionality to save documents as Unicode UTF-16 encoded text, which would allow for ready transformation into XML documents. We decided that the shortest path between the existing workflow and our target of Unicode UTF-16 XML documents encoded in TEI was to generate a template for each of the 752 texts to be transcribed. First, we created a TEI document model for the texts, with basic tagging for prose, poetry and drama in the template. Next, we took the finding aid from the appendix to The Red Brush, which was also in MS Word format, and scripted the document to output an individual XML file for each of the 752 texts, each with a unique filename and the basic bibliographic information from the finding aid. The graduate students could then open these XML template files in MS Word, transcribe the texts, and save each file as a UTF-16 text document. A few minor subsequent changes, made in batches by DLS staff, then turned the files into valid TEI XML documents.
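The template generation step described above might be sketched as follows. This is an illustration only, not the script DLS used: the record fields, the filenames (rb_0001 and so on), the function names, and the skeletal element layout are all assumptions, and a real TEI template would carry a fuller header.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical finding-aid records; the real fields and filenames are not given in the text.
RECORDS = [
    {"file_id": "rb_0001", "title": "Example Title One", "author": "Example Author"},
    {"file_id": "rb_0002", "title": "Example Title Two", "author": "Example Author"},
]


def build_template(record):
    """Build a skeletal TEI-like tree: bibliographic header plus empty text slots."""
    root = ET.Element("TEI")
    header = ET.SubElement(root, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = record["title"]
    ET.SubElement(title_stmt, "author").text = record["author"]

    text = ET.SubElement(root, "text")
    body = ET.SubElement(text, "body")
    ET.SubElement(body, "p")                # empty slot for prose
    line_group = ET.SubElement(body, "lg")  # empty slot for poetry
    ET.SubElement(line_group, "l")
    speech = ET.SubElement(body, "sp")      # empty slot for drama
    ET.SubElement(speech, "speaker")
    ET.SubElement(speech, "p")
    return root


def write_templates(records, out_dir):
    """Write one UTF-16 XML template per record and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        xml_body = ET.tostring(build_template(record), encoding="unicode")
        xml_text = '<?xml version="1.0" encoding="UTF-16"?>\n' + xml_body
        path = out / (record["file_id"] + ".xml")
        # Python's "utf-16" codec writes a byte order mark at the start of the file.
        path.write_text(xml_text, encoding="utf-16")
        paths.append(path)
    return paths
```

Calling write_templates(RECORDS, "templates") would produce one file per record. Keeping the filename and the bibliographic header in the same record is what gives the one-to-one correspondence between texts and files.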

Project Workflow
Once the blank templates for all 752 texts were complete, graduate students working for Professor Grant began to encode the Chinese texts. Because of the great number of texts to be encoded, DLS hired additional graduate students to work on the project, beginning in the fall of 2007. Student assistants worked on the Red Brush project for anywhere from two to six months each, creating a rotating staff that included students in the East Asian studies program at Washington University and a number of native Chinese speakers. Over the following year, through fall 2008, the students encoded over 500 texts. The project continues to progress and the remaining texts will be available in the near future.

Working with a staff of six graduate students staggered over a year required effective communication and organization, particularly for those students who were not working in the DLS office. A wiki page was created for the Red Brush project which outlined the project workflow and instructions for students. The wiki allowed DLS staff to upload empty templates for the students, who could then key in the texts and re-upload the templates to the wiki for DLS staff to validate. A workflow spreadsheet was also created at the beginning of the project to track the encoding progress. The spreadsheet recorded the following information: the template title, whether the text had been keyboarded, the original filetype (e.g. pdf or scanned jpg), who encoded the template, the date the template was uploaded, whether the template had been corrected, whether the template had been encoded, and whether the template had been parsed. In addition, a notes column was added to document problems with individual templates. As the templates were completed, the spreadsheet was updated to indicate whether templates had been parsed and to document any problems. The spreadsheet was a valuable resource for students, helping to ensure that work was not duplicated. In addition, any problems encountered during encoding were documented on the wiki. As new problems arose, their solutions were also documented to maintain consistency across the files.
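The tracking columns listed above could be modeled along the following lines. This is a minimal sketch rather than a description of the actual spreadsheet: the field names, the class name TemplateStatus, and the helper functions are hypothetical, chosen only to mirror the columns named in the text.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TemplateStatus:
    """One spreadsheet row; field names are illustrative, not the project's own."""
    template_title: str
    keyboarded: bool
    original_filetype: str
    encoded_by: str
    date_uploaded: str
    corrected: bool
    encoded: bool
    parsed: bool
    notes: str = ""


def unfinished(rows):
    """Return the titles of templates that have not yet been parsed."""
    return [row.template_title for row in rows if not row.parsed]


def to_csv(rows):
    """Serialize the rows to CSV text with one header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(TemplateStatus)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()
```

A helper such as unfinished shows how the same record can answer the practical question the spreadsheet served, namely which templates still need attention, so that two students do not pick up the same text.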

Encoding
As the work shifted to student assistants working in the DLS office, they began to understand XML and the TEI templates. Using this knowledge, students were then able to suggest modifications to the XML markup. After all encoding was completed for the texts provided in the PDF files, hundreds of texts remained. A DLS graduate student, Ryan Grimm, used Interlibrary Loan to locate and acquire as many of the texts as possible. Student workers in DLS then scanned images of the texts for reference. These new texts became the sources for the remaining templates.

One modification to the markup was the introduction of ‹corr› and related tags. In many of the texts, some characters were difficult to read, either because a page was badly scanned or because the text was very old and its characters were not clearly printed. In addition, some characters were in ancient Chinese forms which lacked Unicode glyphs; in some cases a modern equivalent could be substituted. Scott Paul McGinnis, a student working for DLS, developed project guidelines for using these tags. The resulting markup noted whether the character was unclear or damaged, who corrected the character, and, for substitutions, how certain the corrector was. In addition, images of the unclear or damaged characters were scanned and given IDs. The ID of the image was then referred to in the tag in lieu of an encoded character.
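As a rough illustration of the kind of markup described above, the sketch below builds a corrected character and an image reference. It rests on several assumptions: the text names only the ‹corr› tag, so the wrapping unclear element with its reason attribute is borrowed from TEI purely as a stand-in, and the use of resp and cert for the corrector and the certainty, the facs attribute for the image ID, the function names, and all sample values are hypothetical rather than taken from the project guidelines.

```python
import xml.etree.ElementTree as ET


def mark_correction(char, resp, cert, reason):
    """Wrap a substituted character, recording who corrected it and how certain they were."""
    # "unclear" is an assumed wrapper element; the source names only the corr tag.
    unclear = ET.Element("unclear", {"reason": reason})
    corr = ET.SubElement(unclear, "corr", {"resp": resp, "cert": cert})
    corr.text = char
    return unclear


def mark_image_reference(image_id, reason):
    """Point to a scanned image of the character instead of encoding the character itself."""
    # The "facs" attribute is an assumption about how the image ID might be referenced.
    return ET.Element("unclear", {"reason": reason, "facs": "#" + image_id})


if __name__ == "__main__":
    # Sample values are invented for illustration.
    corrected = mark_correction("字", resp="encoder_a", cert="high", reason="illegible")
    print(ET.tostring(corrected, encoding="unicode"))
    pointer = mark_image_reference("img_0042", reason="damaged")
    print(ET.tostring(pointer, encoding="unicode"))
```

Separating the two cases reflects the distinction in the text: a substitution carries a character plus the corrector's identity and certainty, while a character with no usable encoding carries only a pointer to its scanned image.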