The puzzling process of publishing images to the Web
In this ostensibly digital age, 80% of all new corporate documents created each day are digital; but more than 90% of the information in the world is still on paper, and 70% of the existing corporate memory still resides on paper. Many of those paper documents contain color graphics and/or photographs that represent significant invested value. But most of that rich content has yet to be published on the Internet. That is because scanning the documents and posting them to a Web site has been problematic at best. Reducing resolution to achieve satisfactory download speed often means forfeiting quality and legibility.
As more and more corporations and government agencies rush to convert their paper document archives into computer-usable form for posting on public Web sites or corporate intranets, document presentation file format becomes a major issue. At the high resolution necessary to ensure the readability of text and to preserve the quality of embedded images, file sizes can become too bulky for acceptable download speed. Publishing paper documents on the Web involves making trade-offs among the factors of cost, quality, download speed and file size. New compression algorithms for text and photographs make the decision all the more complex. Here is a rundown of the factors you must consider when publishing imaged files to the Web.
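To see why high-resolution scans get bulky, it helps to work out the uncompressed size of a single page. The sketch below is only an illustration: the US Letter page size (8.5 x 11 inches), the 300-dpi scan density and the function name raw_scan_bytes are assumptions chosen for the example, not figures taken from any particular scanner.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size, in bytes, of a page scanned at the given density."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Assumed example: a US Letter page at 300 dpi.
color = raw_scan_bytes(8.5, 11, 300, 24)   # 24-bit full color
bitonal = raw_scan_bytes(8.5, 11, 300, 1)  # 1-bit black-and-white

print(f"24-bit color:          {color:,} bytes")
print(f"1-bit black-and-white: {bitonal:,} bytes")
```

Under those assumptions a full-color page comes to roughly 25 MB before compression, and a black-and-white page to roughly 1 MB, which is why the choice of compression and format matters so much.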
It isn't the file size, it's the connection speed
Even though images can be compressed, they are significantly bigger than the average HTML page. The slower the connection, the more file size counts, and the more critical the choice of format and approach to sharing image files becomes. Nowadays, it is safe to assume a minimum modem speed of 28.8 Kbps for Internet users. The number of T1, DSL and cable modem installations is growing more rapidly each day, which is starting to diminish the leverage that file size exerts in limiting image presentation options, and, of course, intranet implementations can enjoy comparable speed. In all cases, it is essential to avoid making multiple requests for image data over the wire. A viewing application can circumvent that problem by generating a 72-dpi thumbnail for display and providing the full image upon demand.
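The relationship between file size and connection speed is simple arithmetic, sketched below. The function name download_seconds and the two sample file sizes are assumptions made for illustration; the modem is taken to run at 28,800 bits per second and the T1 line at its nominal 1.544 Mbps. The estimate ignores protocol overhead and line noise, so real transfers will be somewhat slower.

```python
def download_seconds(size_bytes, link_bits_per_second):
    """Best-case transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


# Nominal link speeds in bits per second.
LINKS = {
    "28.8 Kbps modem": 28_800,
    "T1 (1.544 Mbps)": 1_544_000,
}

# Hypothetical file sizes: a small compressed page and a larger image.
for size in (100_000, 1_000_000):
    for label, bps in LINKS.items():
        seconds = download_seconds(size, bps)
        print(f"{size:>9,} bytes over {label}: {seconds:7.1f} s")
```

On the modem, the 100,000-byte file takes close to half a minute and the 1,000,000-byte file several minutes, while both arrive within a few seconds on the T1 line. That gap is the reason a small thumbnail is worth sending first.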
It's also a question of the browser environment
The more combinations of browser, native operating system and related tools and plug-ins that are possible, the fewer the practical choices, because you must focus on common capabilities. It is an accepted axiom that users do not like downloading and installing browser plug-ins. Moreover, the lower the possible screen density, the less reasonable it becomes to send high-resolution images.
It also depends on the type of images you're using, and how they will be used. Are they images of text documents or drawings, or are they color photographs? Is this a viewing application, or will the user need to modify or annotate the image? What quality of image does the user require for the task?
Depending on your answers, you will want to consider the following image file formats:
- Tagged Image File Format (TIFF)
Most imaging systems use a version of TIFF to store their images. TIFF provides a number of different formats that cover the needs of both black-and-white and color applications, and a range of compression techniques. By using TIFF's strengths, such as tiling and banding, imaging vendors have been able to accommodate a range of application needs. The most popular formats seem to be those that encapsulate facsimile encoding.
Publishing images in TIFF has the benefit of eliminating the need for conversion, but it requires that the user have a viewer installed. Every Microsoft Windows system comes with the Wang/Eastman viewer, but that may not be sufficient. Your audience may include Macintosh and Unix users. TIFF viewers are notorious for not handling all of the many legal format variations, and many vendors have added proprietary extensions to handle issues like annotation. As a result, with TIFF you are going to need to specify the viewer to be used, and probably provide one on your Web site.
TIFF is not easily streamed over the Web, so the user may have to download the entire file before he or she gets to see it. For some applications that is fine, because TIFF files are excellent as a tool for local review and editing on the user's desktop.
- Graphics Interchange Format (GIF)
Originally developed as a proprietary format by CompuServe, GIF has evolved into a commonly used format for bitmap graphics on the Internet for the layperson. Browsers display GIF files natively; however, GIF is not the best choice for business imaging. Special features like animation are of no value in a business setting, and the format's poor compression, plus confusion over the use of Unisys' LZW patent, makes GIF less attractive than PNG.
- Portable Network Graphics (PNG)
PNG provides a patent-free replacement for GIF and can also replace many common uses of TIFF. All of the current major browsers will display it in native mode, which eliminates the issue of viewer download and installation. However, it has one great flaw: because it was designed as a single-image format, its utility for business imaging is limited. You cannot always guarantee that your documents are going to be one page, and breaking a document into single pages is an annoyance.
- Joint Photographic Experts Group (JPEG)
If you are storing and displaying color photographic material, there is no better format to use than JPEG. While a JPEG file is much larger than a TIFF of comparable density, adequate viewing can be had at a lower density. One nice aspect is that JPEG will begin to show a recognizable image as it streams, without requiring the entire file to be transmitted. It also has the advantage of supporting watermarks and signatures for information control.
Since JPEG is an approximate representation of the image, you shouldn't save an image as JPEG, edit it later and then save it again. You can expect progressive loss of quality each time you do that, especially with different JPEG quality settings. JPEG was developed for photographic images and produces smudgy line art. As a lossy format, JPEG is excellent for real-world scenes, but terrible for drawings and printed text. As a result, it is not a good choice for images of paper documents.
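The effect of repeated lossy saves can be illustrated with a toy model. Real JPEG quantizes frequency (DCT) coefficients block by block; the sketch below is a deliberate simplification that quantizes raw sample values instead, and the names quantize and total_error, the sample values and the step sizes are all assumptions invented for the example.

```python
def quantize(values, step):
    """Snap each value to the nearest multiple of step (the lossy part)."""
    return [step * round(v / step) for v in values]


def total_error(original, degraded):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(original, degraded))


# Hypothetical sample values standing in for image data.
original = [3, 14, 27, 41, 58, 66, 79, 92]

first_save = quantize(original, 10)     # save at one "quality" setting
same_again = quantize(first_save, 10)   # re-save at the same setting
mixed_save = quantize(first_save, 16)   # re-save at a different setting

print("error after first save:      ", total_error(original, first_save))
print("error after same-setting save:", total_error(original, same_again))
print("error after mixed-setting save:", total_error(original, mixed_save))
```

In this toy model, re-saving at the same step size adds no further error, while re-saving at a different step size compounds the damage. A real JPEG edit-and-save cycle also changes the pixels between saves, so quality usually drops even when the setting stays the same.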
- DjVu
Originated by AT&T Labs specifically for storing pages containing both text and pictures, DjVu requires that you download a viewer that works only with a browser. (A standalone desktop viewer is available for $250.) Scanned pages are broken down during the scanning process into two separate layers of searchable text and images that are each compressed using different lossy methods. By separating the text from the backgrounds, DjVu can keep the text at high resolution (thereby preserving the sharp edges and maximizing legibility), while at the same time compressing the backgrounds and pictures at lower resolution with a wavelet-based compression technique. The resulting files are 30% smaller than TIFF or PDF image-only format.
DjVu uses some clever schemes for achieving small file size; for example, if a given character appears X times in a document, DjVu will store the compressed image of the character only once, with X-1 pointers to the other locations. DjVu never decompresses the entire image, but instead keeps the image in memory in a compact form, and then decompresses the piece displayed on the screen in real time as the user views the image. Hence, the DjVu format is streamable. As with JPEG, users get an initial version of the page very quickly, and the visual quality of the page progressively improves as more bits arrive. DjVu produces files about five to 10 times smaller than existing methods such as JPEG and GIF for color documents, and three to eight times smaller than TIFF for black-and-white documents. Scanned pages at 300 dpi in full color can be compressed down to 30- to 100-KB files from 25 MB. Black-and-white pages at 300 dpi typically occupy five to 30 KB when compressed.
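The store-once, point-many idea for repeated characters can be sketched in a few lines. This is a toy illustration, not DjVu's actual encoder: it treats two character images as the same shape only when their bitmaps match exactly, it keeps a pointer for every occurrence (including the first) for simplicity, and the names build_shape_dictionary and render, along with the tiny bitmaps, are invented for the example.

```python
def build_shape_dictionary(glyphs):
    """Store each distinct bitmap once; record occurrences as pointers.

    glyphs is a list of (bitmap, x, y) tuples, where bitmap is a tuple of
    row strings. Returns (shapes, placements).
    """
    shapes = []      # each distinct bitmap, stored a single time
    index_of = {}    # bitmap -> its position in shapes
    placements = []  # one (shape_index, x, y) pointer per occurrence
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(shapes)
            shapes.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return shapes, placements


def render(shapes, placements):
    """Rebuild the original glyph list from the dictionary and pointers."""
    return [(shapes[i], x, y) for i, x, y in placements]


# Hypothetical 3x5 bitmaps for the letters E and L.
E = ("###", "#..", "###", "#..", "###")
L = ("#..", "#..", "#..", "#..", "###")

page = [(E, 0, 0), (E, 4, 0), (L, 8, 0)]  # the word "EEL"
shapes, placements = build_shape_dictionary(page)

print("distinct shapes stored:", len(shapes))
print("placements:", placements)
```

Here three characters on the page require only two stored bitmaps. On a real page of text, where the same letters recur hundreds of times, the saving from this kind of sharing grows accordingly.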
- Portable Document Format (PDF)
The de facto standard for sharing documents over the Internet is Adobe's Portable Document Format (PDF). It guarantees that the image seen by the viewer is the same across all platforms. While it requires that a free viewer be downloaded and installed, its use is virtually ubiquitous, so most users already have it. PDF files can have metadata, tables of contents and links, all of which can make your images more useful to the users. PDF files support locking, watermarking and signing, which means that you have the tools if needed to protect your intellectual capital. If produced properly, Acrobat 4 PDF files can be streamed, providing the responsiveness that browser users expect.
PDF does have some negatives. The cost of the tools to convert to PDF can be a deal killer, and the time required for conversions is noticeable in interactive applications. Adobe owns a (so far) benign monopoly on PDF technology, which is an item of concern to customers that want to limit future licensing costs. We are at the mercy of Adobe. Indeed, if history is any indication, this is not a good idea.
When they contain images, PDF files are even larger than typical image files. Finally, as with all of the formats, there is no single method of encoding a document in this format. PDF image files can be formatted as captured images, captured images with attached OCR information (useful for local scanning for text, but adding to the file size and requiring a better version of the viewer), or converted to an encoded version of the document where the creation tool attempts to represent the image as a set of fonts and text. That last case reduces the file size, but introduces the potential of breaking the reproducibility guarantee.
So what's the answer?
Obviously, one choice will not suit all problems. If your images can't tolerate being reduced to eight bits for GIF or losing precise accuracy for JPEG, then TIFF and PNG are your best options. Web browsers are beginning to support PNG, and many external viewers support both formats. When distributing images to an internal audience, particularly when the images may need to be recognized by an OCR device, annotated or edited, you should use TIFF. For simple image distribution to an unsophisticated audience, PNG can be an inexpensive and easily supported alternative.
JPEG is for photographic images. GIF is for line art images, such as icons, graphs and line art logos. If you must edit a photographic image, work with it in a lossless format until it is ready for publication, then convert it to JPEG for the Web. The vast majority of Web sites are using GIF for line art and JPEG for everything else, while migrating from GIF to PNG as users upgrade their browsers.
The primary advantage of PDF is that the presentation looks good printed on paper, especially when printed on a high-quality printer. The screen view depends on the resolution of your monitor and the equipment that drives it, but it is arguably the highest quality reproduction you can get from your computer. If the text will be printed for certain, and if the printout should look like conventional printing press output, then PDF seems to be the best option. For distribution of document images to a large viewing audience, PDF is the only reasonable choice.
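The recommendations in this section can be condensed into a rough decision helper. The sketch below is one possible reading of the guidance, not a definitive rule set: the function name suggest_format, its parameters, the content categories and the order in which the rules are checked are all assumptions made for illustration.

```python
def suggest_format(content, needs_editing=False,
                   large_audience=False, print_fidelity=False):
    """Suggest an image format for Web publishing.

    content is one of "photo", "line_art" or "document" (hypothetical
    categories). The rule order below is an assumed priority.
    """
    if content not in ("photo", "line_art", "document"):
        raise ValueError(f"unknown content type: {content!r}")
    # Images that will be OCRed, annotated or edited stay lossless.
    if needs_editing:
        return "TIFF"
    # Document images for a wide audience, or destined for print.
    if content == "document" and (large_audience or print_fidelity):
        return "PDF"
    if content == "photo":
        return "JPEG"
    if content == "line_art":
        return "GIF or PNG"
    # Simple distribution of document images to a small audience.
    return "PNG"


print(suggest_format("photo"))
print(suggest_format("document", large_audience=True))
print(suggest_format("document", needs_editing=True))
```

A helper like this is only a starting point. Factors such as the viewers your audience already has installed and the cost of conversion tools still have to be weighed case by case.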
Arthur Gingrande, partner, and Bernard Chester, principal, are members of IMERGE Consulting (imergeconsult.com). They can be reached respectively at firstname.lastname@example.org, 781-258-8181, and email@example.com, 206-979-7389.