Encoding and File Types
Standardization in encoding, file type, file naming, and resolution is a critical part of digitization, ensuring the longevity and accessibility of your materials.
When digitizing materials, it is essential to create a master copy: the digital copy from which other copies may be derived. This will be the copy encoded in a preservation XML standard, or captured at the highest resolution determined for your project and saved in a lossless file format. Preservation-level XML encoding uses a set standard and is not tailored to any specific delivery system (DLXS, ContentDM, etc.). For images, files derived from the master copy for online delivery can be reduced in resolution and file size and saved in a lossy file format. Multiple files may be created at this level for multiple purposes or user communities. When creating service/deliverable quality files, the emphasis should be on efficient delivery to users. When generating preview/thumbnail images, the image size and quality can be reduced even further, with an emphasis on quick loading time in a browser and accessibility to users.
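The master, service, and thumbnail tiers described above can be sketched as a small planning routine. This is only an illustration: the tier names, the pixel limits (2000 and 200 on the long edge), and the helper names `scaled_size` and `plan_derivatives` are hypothetical and are not Scholarly Publishing specifications. The sketch computes target pixel dimensions only; it does not read or write image files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivativeSpec:
    name: str            # tier name, e.g. "service" or "thumbnail"
    max_long_edge: int   # hypothetical pixel limit for the longer side
    file_format: str     # lossy delivery format for this tier


# Hypothetical tiers; real limits should be set per project.
SPECS = [
    DerivativeSpec("service", 2000, "JPEG"),
    DerivativeSpec("thumbnail", 200, "JPEG"),
]


def scaled_size(width: int, height: int, max_long_edge: int) -> tuple[int, int]:
    """Shrink (width, height) so the longer side fits max_long_edge.

    The aspect ratio is preserved and images are never enlarged.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height

    def scale(dim: int) -> int:
        # Integer arithmetic with round-half-up, never below 1 pixel.
        return max(1, (dim * max_long_edge + long_edge // 2) // long_edge)

    return scale(width), scale(height)


def plan_derivatives(master_width: int, master_height: int) -> dict:
    """Return {tier name: (format, width, height)} for each derivative tier."""
    return {
        spec.name: (spec.file_format,
                    *scaled_size(master_width, master_height, spec.max_long_edge))
        for spec in SPECS
    }
```

For a 6000 x 4000 pixel master, `plan_derivatives(6000, 4000)` yields a 2000 x 1333 service copy and a 200 x 133 thumbnail under these assumed limits. The master itself is left untouched, so new derivatives can be generated later if delivery needs change.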
The formats below are recommended for master and deliverable quality images. Scholarly Publishing uses TIFF with lossless LZW compression and JPEG2000 files as master copies and typically delivers images as JPEGs or JPEG2000s. Scholarly Publishing staff are available to consult about proper format, resolution, and file naming conventions for your digital project.
- DOC (Microsoft Word Document): Word documents (DOC) are a proprietary Microsoft format, which is best supported in Microsoft applications. Because of this dependence on one software package, DOCs are not a good candidate for a preservation/master copy. DOCs may be used for access copies if necessary.
- JPG (Joint Photographic Experts Group): JPEGs are a lossy image format and compress image data. Compression lowers the quality of the image, but allows for a smaller file size, which downloads quickly in a browser. This makes JPEGs ideal as a delivery format.
- JP2 (Joint Photographic Experts Group 2000): JPEG2000s can be saved with either lossy or lossless compression. Losslessly compressed JPEG2000s are ideal as a preservation format because they retain all of the image data while keeping a smaller file size, saving valuable storage space. JPEG2000s may also be used as a delivery format, but not all browsers and systems support this standard.
- PDF (Portable Document Format): PDF files embed all of the information needed to display them, such as typefaces and images. PDF was originally a proprietary format owned by Adobe; however, it is now an open standard (ISO 32000). PDFs can be viewed in Adobe Reader, which is free to download, as well as in most modern browsers and many other viewers. PDFs are a good delivery format as they download quickly and display easily.
- TIF (Tagged Image File Format): TIFFs are used for high quality, high resolution images. Metadata about the image is embedded in the file and the files are easily manipulated. TIFFs can be saved uncompressed, with lossless compression (such as LZW), or with lossy compression. TIFFs are typically used as preservation-quality files because they can be saved without any loss of image data and at a high resolution. However, TIFFs are not typically used as a delivery format because their large file size requires a very long loading time in browsers.
- XML (Extensible Markup Language): XML is a markup language that allows documents to be encoded in a format that both humans and machines can read. XML can be displayed with any text editor or browser, making it highly interoperable and dependable as a preservation format. XML files can also be used as an online delivery format, with XSLT stylesheets to transform them to HTML or CSS stylesheets to style them for display. There are countless standards for XML depending upon the discipline and materials to be encoded. Monographs and books will typically be encoded using the TEI (Text Encoding Initiative) guidelines. The TEI is the de facto standard for marking up text-heavy documents, particularly in the humanities. DLS has used the VRA Core (Visual Resources Association) standard for encoding image-based projects and the EAD (Encoded Archival Description) standard for encoding finding aids. While these are the most commonly used standards in DLS projects thus far, our staff are knowledgeable about many other standards that may be applicable to your project, including Dublin Core, CDWA (Categories for the Description of Works of Art), and FGDC (Federal Geographic Data Committee).
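
To illustrate how a preservation XML file can stay independent of any delivery system while still producing HTML for the web, here is a minimal sketch. It uses Python's standard `xml.etree.ElementTree` module in place of an XSLT stylesheet, since the idea (read the encoded text, emit HTML) is the same. The sample document is a deliberately simplified TEI-like fragment, not a complete valid TEI file, and the function name `tei_to_html` is hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# The TEI namespace; element lookups below must be namespace-qualified.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, illustrative fragment (a real TEI header requires more elements).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Letter</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph.</p>
      <p>Second paragraph &amp; more.</p>
    </body>
  </text>
</TEI>"""


def tei_to_html(xml_text: str) -> str:
    """Build a small HTML fragment from the title and body paragraphs."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="Untitled", namespaces=TEI_NS
    )
    # itertext() gathers text from any nested inline elements as well.
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body/tei:p", TEI_NS)
    ]
    parts = [f"<h1>{escape(title)}</h1>"]
    parts += [f"<p>{escape(text)}</p>" for text in paragraphs]
    return "\n".join(parts)


if __name__ == "__main__":
    print(tei_to_html(SAMPLE))
```

Because the master XML is never modified, the same file could later be transformed differently for another delivery system or user community.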