Monday, February 20, 2006

Advanced Compression Making Its Mark


  • There is going to come a day when Group 4 compression is not good enough. For years, this fax-based standard, encased in a TIFF wrapper, has been the format of choice for document imaging applications. Recently, as document scanning has moved into the front office, PDF has been gaining traction as the wrapper. But, typically, inside the PDF we have seen either Group 4 [for bi-tonal] or JPEG [for color] compression. There are signs, however, that things may be a changin’.

    As document imaging moves from dedicated back-office applications with controlled sets of documents, to more ad hoc, distributed applications that cover a wider array of document types, traditional compression formats will no longer fit the bill. They often create files too large or unwieldy to be effective in many distributed, browser-based viewing applications. This is especially true when dealing with graphically rich and/or color documents. And, as color document output increases, due to the falling price of color printing technology, the problems associated with color input will only worsen.

    Yes, some day, the document imaging world will be forced to move en masse to advanced compression technology based on concepts such as MRC (mixed raster content). MRC involves the separation and segmentation of document images into pieces and/or layers. Each part is compressed separately using optimum methods, depending on the content of the part. Color graphics, for instance, can be compressed with a photographic-centric compression method such as JPEG. Bi-tonal textual areas, meanwhile, can be compressed with Group 4. This not only results in smaller file sizes, but if done effectively, better viewing characteristics.
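
    To make the layering idea concrete, here is a minimal sketch in Python. It is not any vendor's segmenter: the brightness threshold, the function names, and the tiny four-pixel "page" are all assumptions made for illustration. It only shows the separation step; a real product would then hand the mask to a bi-tonal codec such as Group 4 or JBIG2 and the background to a photographic codec such as JPEG.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]
WHITE: Pixel = (255, 255, 255)


def luminance(p: Pixel) -> float:
    """Approximate perceived brightness (ITU-R BT.601 weights)."""
    r, g, b = p
    return 0.299 * r + 0.587 * g + 0.114 * b


def split_mrc(image: Image, threshold: float = 96.0):
    """Split an image into (mask, foreground, background) layers.

    mask[y][x] is 1 where a pixel is dark enough to be treated as text
    or line art, else 0. The threshold is an illustrative assumption,
    not a tuned value; real segmenters are far more elaborate.
    """
    mask = [[1 if luminance(p) < threshold else 0 for p in row] for row in image]
    # Foreground keeps the color of masked pixels; other positions are filler.
    foreground = [[p if m else WHITE for p, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    # Background keeps everything else; masked positions are filled with white.
    background = [[WHITE if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return mask, foreground, background


def recombine(mask, foreground, background) -> Image:
    """Rebuild the page: the mask selects foreground or background per pixel."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


# Hypothetical 2x2 page: paper white, black ink, a yellow photo pixel, dark blue ink.
page = [[(255, 255, 255), (0, 0, 0)], [(230, 180, 40), (10, 10, 60)]]
mask, fg, bg = split_mrc(page)
```

    Because the mask decides which layer supplies each pixel, the three layers can be stored and compressed independently and still be recombined into the original page.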

    Over the past several years, we’ve done many articles in DIR on MRC, the adoption of which we felt would grow hand-in-hand with the adoption of color document scanning hardware. This has not necessarily proven to be the case, although many would argue that neither has been embraced with open arms by the end-user community, so maybe they are growing at similar rates. In recent weeks, however, we have seen an unusually large amount of news surrounding MRC, and we think the tide may finally be turning in its favor. Let’s take a brief look at what’s been happening.

    LuraTech Partners With Kodak
    We’ll start with LuraTech, the German compression specialist that recently announced a bundling agreement with Kodak. The newest version of Kodak Capture will include a demo version of LuraTech’s export module for creating MRC-based PDF or JPM files. JPM is the file format created by using JPEG 2000, Part 6 compression [see DIR 10/19/01].


    According to Robert Bijster, worldwide portfolio business manager for Kodak Document Imaging software products, Kodak’s German offices initiated the relationship with LuraTech. “In Germany, we have started to see increased demand for compressed output in PDF format,” said Bijster. “Kodak and LuraTech have been able to jointly promote the use of color to customers and the channel. We also have a cooperative agreement with LizardTech for DjVu software. Customers using our Capture Software now have an even wider choice of output formats.”

    The LizardTech agreement involves a free two-month trial. The LuraTech demo will be included in Kodak Capture 6.8, scheduled to begin shipping this month.

    JBIG2 Catching On For Bi-Tonal Files
    On our recent trip to New York for the HSA Capture Conference 2005, we stopped in Queens and spent an hour with Ari Gross, CTO and founder of CVision. CVision is a JBIG2 compression specialist that has compiled an impressive customer list for its PdfCompressor product line. This includes FedEx, BankOne, Merck, Boeing, and JPMorganChase, as well as several local governments and federal government agencies, and numerous law firms and corporate legal departments.

    JBIG2 is a bi-tonal compression methodology designed to be more effective than Group 4. CVision’s Web site has sample files that are six to 10 times smaller using JBIG2 than Group 4. To date, over 95% of CVision’s business has come from customers that want to create smaller bi-tonal PDFs, for both storage and portability. Boeing, for instance, has used PdfCompressor for the technical manuals it distributes on CDs.

    JBIG2 is included as a compression option in the specs for both PDF and JPM and can be leveraged as a component in MRC applications. Toward this end, CVision has spent three years developing a segmenter for color document image files. This segmenter was introduced last year in version 3.0 of PdfCompressor.

    “When doing MRC and JBIG2 compression with competitive products, you are running a risk of text-degradation,” Gross told DIR. “We have a test that proves our JBIG2 compression can actually improve text recognition by an OCR engine, which virtually guarantees there is no degradation. We first run OCR on a standard, 200 dpi JPEG file. We then compare those results to OCR results from a file we have segmented, compressed, decompressed, and then converted back to a 200 dpi JPEG. We find the results from the second file are often more accurate.”
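
    Gross did not say how CVision scores OCR accuracy, so the following Python sketch is only one plausible way to run such a comparison. It assumes a known-correct transcript of the page is available and measures each OCR output against it by character-level edit distance. The function names and the sample strings are invented for illustration and are not CVision results.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text: str, truth: str) -> float:
    """Share of the reference text recovered, clamped to the range 0..1."""
    if not truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(ocr_text, truth) / len(truth))


# Made-up strings standing in for the two OCR runs described above.
truth = "Group 4 compression"
ocr_from_plain_jpeg = "Gr0up 4 compressi0n"   # hypothetical output, two misreads
ocr_from_round_trip = "Group 4 compression"   # hypothetical output, no misreads

baseline = char_accuracy(ocr_from_plain_jpeg, truth)
round_trip = char_accuracy(ocr_from_round_trip, truth)
no_degradation = round_trip >= baseline
```

    Under this scoring, the "no degradation" claim amounts to the round-trip score never falling below the baseline score across a test set.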

    Gross credits this improved accuracy to his company’s having “the world’s best matcher for font-learning.” “Sure, there’s cheaper technology available,” said Gross. “But no one else can guarantee the accuracy we achieve.”

    To date, more than 90% of CVision’s sales have been direct, mainly through Web hits. Recently, however, the company has beefed up its OEM efforts. Captiva is already a reseller of PdfCompressor. CVision is also negotiating partnerships with several other capture software players. “Over the next couple years, I think you’ll see advanced compression move from a separate process to something that is done as an ordinary post-scanning process,” said Gross. “In the next four years, I expect it to start showing up on chips.”

    Gross initially expects JBIG2 compression to replace Group 4 in many embedded applications. “JBIG2 is still evolving,” he said. “We get significantly smaller file sizes now, but we are continuing to improve on that. Before we make the commitment to go to hardware, which is expensive, we want to be sure the technology is very mature.”

    As for color compression, Gross acknowledged that segmentation is still a very inexact science. “Segmenting might not be perfected for another 20 years,” he said.

    LizardTech Lands The New Yorker Deal
    Some of CVision’s initial success in the world of color image compression has come in the publishing industry. News distribution services are utilizing the company’s technology to scan articles from business magazines, which they then make available to subscribers. A couple of months ago, DIR did an article touting the publishing industry as a potential killer app for MRC.

    The article featured LizardTech, the developer of the MRC-based DjVu file format. LizardTech CEO Carlos Domingo hinted that his company was working on something really big. That turned out to be the digitization of 80 years’ worth of back issues of The New Yorker magazine. The project came to light last month when an eight-DVD set containing more than 4,000 digitized back issues of The New Yorker began shipping.

    “We have been working on this for more than a year,” Domingo told DIR. “It involved scanning more than a million pages, and quality was very important. Initially, the publisher was planning on using JPEG compression, which would have reduced 24 MB uncompressed color files—the average size of an issue scanned at 300 dpi—to 8 MB. This would have created a total of three to four terabytes of data for all 4,000 issues. Using DjVu, they were able to compress all the issues down to 30 GBs, which can be easily distributed on a DVD set.

    “Initially, with the JPEG files, the publisher was considering releasing the back issues in sets of five or 10 years. They were also going to have a couple of lower resolution versions for Web display and electronic distribution. By using DjVu technology, they’ve been able to take care of all their needs with the same files.”
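
    A quick back-of-the-envelope check on Domingo's figures, sketched in Python. The quoted numbers (24 MB raw, 8 MB JPEG, 30 GB total, more than a million pages) come from the interview. The letter-size page dimensions and 24-bit color depth are our assumptions, and treating a million as the page count makes the per-page DjVu figure an upper bound.

```python
# Figures quoted in the interview.
RAW_MB = 24                 # uncompressed color file
JPEG_MB = 8                 # same file after JPEG compression
DJVU_TOTAL_BYTES = 30e9     # "30 GBs" for the whole archive
PAGES = 1_000_000           # "more than a million pages" (lower bound)

# Assumption: an 8.5 x 11 inch page, 300 dpi, 3 bytes (24 bits) per pixel.
raw_page_bytes = int(8.5 * 300) * (11 * 300) * 3
raw_page_mib = raw_page_bytes / 2**20

# Average DjVu size per page, and compression ratios relative to raw.
djvu_bytes_per_page = DJVU_TOTAL_BYTES / PAGES
jpeg_ratio = RAW_MB / JPEG_MB
djvu_ratio = raw_page_bytes / djvu_bytes_per_page

print(f"raw page: {raw_page_mib:.1f} MiB")
print(f"DjVu average per page: {djvu_bytes_per_page / 1000:.0f} KB or less")
print(f"JPEG about {jpeg_ratio:.0f}:1, DjVu about {djvu_ratio:.0f}:1")
```

    Under those assumptions a single raw page works out to roughly 24 MB, which suggests the quoted 24 MB figure describes a scanned page rather than a whole issue. That reading is an inference, not something stated in the interview.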

    We haven’t seen a sample copy yet, but early reviews on Amazon.com are generally favorable (although one customer claims the discs badly damaged his computer). LizardTech worked closely with The New Yorker publisher Conde Nast and a Kansas City service bureau to produce the scans. LizardTech also appears to have developed a custom viewer for the files.

    “The service bureau would send us the images, we’d look at them and make any adjustments we thought necessary, and then run it by Conde Nast, who would give us their feedback,” said Domingo. “Depending on the quality of paper and ink used during a certain time period, our settings varied. Also, the way the magazines were stored, in boxes, created variances between those in the middle of the box and those on the ends.”

    The eight-DVD set contains an index file with article summaries that can be searched; no OCR or full-text indexing was used with the DjVu files. For more recent editions of the publication, Conde Nast is going straight from their pre-press PDF files to DjVu, a process used by other publishers for digital distribution of current issues [see DIR 12/17/04].

    Domingo indicated the application could potentially lead to other business for LizardTech with Conde Nast, which publishes magazines such as Glamour, GQ, and Vanity Fair. “The publishing industry is big, but most people involved in it know each other—especially in hubs such as New York,” said Domingo. “You will definitely be hearing more from us in this market.”

    PlanetDjVu Founder Embraces PDF
    Jim Rile of Jim Rile Associates (JRA) is also putting together some interesting MRC-based efforts targeting publishers. He has included options in PDF files like linking from selected photos to high-res versions stored on a Web server and linking and bookmarking options for article threads. “The cumulative effect is that a PDF, starting from scanned images, is now a bona fide e-book,” said Rile. “There is clearly a distinction between a collection of e-books and a mere collection of scanned images organized into issues.”

    Rile is the founder of the PlanetDjVu Web site, but with the latest version of his JRAPublish application, which was released last month, he has embraced PDF. “I am still carrying DjVu capabilities,” he told DIR, “but as far as I’m concerned, DjVu is pretty much dead as a commercial application. Even though DjVu files display a lot faster than PDF, everyone has heard of PDF, not DjVu—or even MRC.”

    With JRAPublish 3.0, Rile has introduced the ability to create four layers, rather than the three traditionally used for MRC. “In addition to a foreground, background, and mask layer, I’ve added a specific picture or graphics layer,” he said. “By separating pictures from the background, you can save them at a higher resolution while reducing the resolution of the rest of the background, which reduces the overall file size.”
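
    The saving Rile describes can be illustrated with a rough pixel count, sketched here in Python. Every number is a hypothetical chosen for the example (page size, photo size, and the two resolutions), none comes from JRAPublish, and the count covers only the continuous-tone layers before any compression; the mask and foreground layers are left out.

```python
def pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of samples in a rectangular region at a given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Hypothetical layout: one 4 x 3 inch photo on a letter-size page.
PAGE_W, PAGE_H = 8.5, 11.0
PHOTO_W, PHOTO_H = 4.0, 3.0
FULL_DPI, REDUCED_DPI = 300, 100

# Three-layer approach: the photo lives in the background, so the whole
# background has to stay at full resolution to keep the photo sharp.
single_background = pixels(PAGE_W, PAGE_H, FULL_DPI)

# Four-layer approach: the photo gets its own full-resolution layer and
# the rest of the background is stored at reduced resolution.
split_layers = (pixels(PAGE_W, PAGE_H, REDUCED_DPI)
                + pixels(PHOTO_W, PHOTO_H, FULL_DPI))

saving = single_background / split_layers
print(f"{single_background:,} vs {split_layers:,} samples ({saving:.1f}x fewer)")
```

    With these made-up dimensions the split layout stores roughly a quarter of the samples, while the photo itself keeps its full resolution.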

    Rile has leveraged tools from LizardTech, LuraTech, and ABBYY to create his new application. He sent us some imaged pages from a school yearbook as a sample file. The file was either 2 MB or 400K, depending on the level of compression used. Display was faster with the larger file. “It’s not always about size,” Rile said. “When employing MRC, there is a balance between quality and size that you need to maintain.”

    Rile concluded that when the market comes to a better understanding of the potential of MRC, then color document imaging will truly have arrived. “The price and performance issues associated with hardware have gone away,” he said. “However, the lack of a suitable presentation format for color images has kept the transition from being complete. The challenge is creating an MRC format both fast enough and small enough to use in everyday applications.”

    For more information: http://www.kodak.com/go/capturesoftware, http://www.cvisiontech.com, http://www.lizardtech.com, http://www.planetdjvu.com
