DjVu, a shining star in the field of fixed-layout documents

When I got up to watch the news in the morning, I was shocked by a report of ransomware:

Its Read Me.txt file states that the victim's files (photos, documents, databases, and all other data) have been encrypted into .djvu files, which can only be decrypted by purchasing a specific tool provided by BigBobRoss. According to the cybercriminals, the exact price is not stated and depends on how quickly the victim contacts them. Victims must get in touch via the [email protected] email address.

The ransomware is scary, but what shocked me was not BigBobRoss itself. It was the file extension ".djvu"! When did the DjVu format, which was a force in the publishing industry and the e-book field for more than 10 years and once challenged the dominance of PDF, become an accomplice of ransomware?

The author hurriedly looked into it and found that this is probably pure coincidence: the BigBobRoss ransomware simply gives the encrypted files the extension .djvu, and the files do not actually become DjVu documents. Why the ransomware uses this extension is unknown. The author guesses that it may be related to the French origin of the name: DjVu is pronounced like "déjà vu", which means "already seen" in French.

Readers born in the 1990s or later have probably not heard much about the DjVu format. It did gradually fade from view after 2010, and if you search for introductory material on the Internet today, almost all of it dates from before 2010. From its birth in 1996 to its gradual disappearance after 2010, the format lasted only about 15 years. Yet in the 10 years from 2000 to 2010 it was a real presence in the publishing industry and the e-book field, and a challenger to the PDF format. Let's talk about this once bright meteor.

Getting to know the DjVu format for the first time

The author was fortunate to come into contact with the DjVu format in 2006, in the digital archives project of the Zhejiang Provincial Archives. The scanned masters were stored in the original TIFF format, and the copies served online used the DjVu format because of its very high compression ratio. The application scenario at that time was relatively simple: DjVu was used only as an image format with a high compression ratio, and the text recognized by OCR was not merged into the DjVu files. The conversion process is shown in the following figure:

In fact, the DjVu format is not only an image format with a high compression ratio. Like PDF, it is also a fixed-layout document format that both preserves the page layout and supports text retrieval, as shown in the figure below (a DjVu file opened in the browser with a DjVu reader):

History of the DjVu format

DjVu was originally an image compression technology developed by AT&T Laboratories in the United States in 1996. Its principle is to separate the image into a foreground layer (text and lines) and a background layer (paper texture and pictures) and to compress each one separately. By separating the text from the background, DjVu can reproduce the text at high resolution, preserving sharp edges and maximizing legibility, while compressing the background image at a lower resolution, so that the quality of the image as a whole is preserved. Traditional image compression formats are acceptable for simple pictures, but they handle the sharp edges between strongly contrasting color areas poorly, which is why their results for text are unsatisfactory. In general, text and lines need a high resolution (typically 300 dpi) to stay crisp, whereas continuous-tone images and paper texture in the background need a much lower one (typically 100 dpi). The best way to keep the page clear at a small size is therefore to separate these elements into different layers and process each appropriately.
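To make the resolution argument concrete, here is a small back-of-the-envelope sketch. It only counts raw, uncompressed bytes, and it assumes a US Letter page, 24-bit color, and a 1-bit mask for the text layer; these figures are illustrative assumptions, not values from the DjVu specification, and the colors of the foreground layer are ignored for simplicity.

```python
# Raw (pre-entropy-coding) data budget for one scanned page.
# Assumed page: US Letter, 8.5 x 11 inches (illustrative only).
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0

def raw_bytes(dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of the page sampled at `dpi`."""
    width_px = round(PAGE_W_IN * dpi)
    height_px = round(PAGE_H_IN * dpi)
    return width_px * height_px * bits_per_pixel // 8

flat = raw_bytes(300, 24)        # single-layer color scan at 300 dpi
mask = raw_bytes(300, 1)         # 1-bit text/line shapes at 300 dpi
background = raw_bytes(100, 24)  # color background at 100 dpi
layered = mask + background

print(f"flat scan:    {flat:>10,} bytes")
print(f"layered scan: {layered:>10,} bytes")
print(f"ratio:        {flat / layered:.1f}x")
```

Under these assumptions the layered representation needs roughly 6.5 times less raw data before any actual compression is applied. The much larger ratios quoted for DjVu come from the dedicated coders applied to each layer on top of this.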

Around 2000, with the gradual popularization of the Internet and the gradual reduction in the cost of scanning storage devices, more and more files began to be processed, disseminated and stored on the Internet in digital form. People's need for instantaneous acquisition of information makes the computer screen the best display medium for various information. Yet more than 90% of the world's information is still on paper. Plenty of paper documents, including classic books, paintings, color pictures and photographs, are invaluable, yet very few of them are published online.

One bottleneck restricting the online distribution of this information is the file size of the scanned images. To keep text and images clear, you must scan at a high resolution, and the resulting files are often huge (usually tens or hundreds of MB), which makes them hard to download. Under the network bandwidth conditions of the time, the resolution had to be reduced to achieve acceptable download speeds, which in turn meant that image quality and legibility could not be guaranteed. Traditional web image formats, such as JPEG, GIF, and PNG, produce large files at the resolutions commonly needed.

It was against this background that DjVu, relying on its famously high compression ratio, made it possible to transmit almost any traditional printed material over the Internet at high speed, and thus quickly established its market position. In 2002, DjVu was selected, together with TIFF and PDF, by the Internet Archive's Million Books Project as a format for publishing scanned public domain books. The following is a slightly exaggerated comparison chart of image compression ratios provided by the DjVu vendor at that time:

Leaving aside the original scanned TIFF images, a DjVu file can be about 20 times smaller than the same content in the already compressed JPEG or PDF formats, with the image quality remaining basically unchanged. Its compression ratio is indeed impressive.

From the perspective of technical implementation, the DjVu format mainly adopts the layered compression approach of the ISO/IEC 16485 MRC (Mixed Raster Content) model. In a color scanned image, the different content elements have different characteristics, and if each is processed with a compression technique suited to it, a smaller file can be obtained. The DjVu format follows exactly this principle. According to the content of the scanned image, it separates the page into three layers (mask, background, and foreground) and processes them with methods such as JB2 compression and the IW44 wavelet codec, thereby greatly reducing the file size. The MRC model itself is no longer new technology: it was published as ISO/IEC 16485:2000, Information technology - Mixed Raster Content (MRC). As far as the author knows, some domestic OFD vendors have also built MRC compression into their own OFD tools in order to obtain a higher compression ratio than PDF.
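The three-layer idea can be illustrated with a toy sketch. This is not JB2, IW44, or the segmentation a real DjVu encoder performs: it assumes a tiny grayscale "scan", a fixed threshold to find ink, and a single average value standing in for each of the low-resolution foreground and background images. All names here are hypothetical.

```python
# Toy MRC-style split of a tiny grayscale "scan" (0 = black, 255 = white).
THRESHOLD = 128  # assumed cutoff between ink and paper

def split_layers(image):
    """Return (mask, foreground, background) for a 2-D list of gray values."""
    # Mask: 1 where the pixel is ink (text/lines), 0 where it is paper.
    mask = [[1 if px < THRESHOLD else 0 for px in row] for row in image]
    ink = [px for row in image for px in row if px < THRESHOLD]
    paper = [px for row in image for px in row if px >= THRESHOLD]
    # One representative value per layer stands in for the low-resolution
    # foreground and background images of a real encoder.
    foreground = sum(ink) // len(ink) if ink else 0
    background = sum(paper) // len(paper) if paper else 255
    return mask, foreground, background

def compose(mask, foreground, background):
    """Rebuild the page: the mask selects foreground or background per pixel."""
    return [[foreground if bit else background for bit in row] for row in mask]

scan = [
    [250, 248, 20, 251],
    [249, 18, 22, 250],
    [247, 252, 19, 249],
]
mask, fg, bg = split_layers(scan)
restored = compose(mask, fg, bg)
for row in restored:
    print(row)
```

The point of the sketch is the division of labor: the mask keeps the exact shape of the strokes at full resolution, while the two color layers carry only coarse information and can therefore be stored very cheaply.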

The DjVu format reached the peak of its adoption around 2006. After 10 years of development, it was no longer just an image compression technology but had gradually grown into a multi-functional fixed-layout document format like PDF. Besides displaying images clearly, it offers a text display mode, supports keyword search, full-text retrieval, and copying of selected text so that the text content of a document can be obtained quickly, and supports hyperlinks to both local and network paths. This led to its wide use in e-books, online publishing of printed materials, and the sharing and dissemination of publications. Its application fields included book archives, digitization of ancient books, electronic management of government documents, financial institution documents, digitized manuals, maintenance manuals, and drawings in processing and manufacturing, urban construction, and maps. It once posed a threat to the market position of PDF, and some even believed that "technically speaking, DjVu is better than PDF for converting paper documents into electronic documents, thanks to its small file size, high quality, and greater openness."

The decline of the DjVu format

So, what caused the rapid decline of DjVu? In fact, DjVu had serious inherent weaknesses as a commercial product. It was born in AT&T Labs, which is world-renowned for its innovation and R&D capabilities but is essentially a research laboratory rather than a commercial operation. Moreover, for AT&T, DjVu was only a by-product, and the overall investment in it was small, as can be seen from the source code and SDK that AT&T released. In 2000, AT&T sold the DjVu format to LizardTech, a company specializing in document scanning. LizardTech's contributions to the DjVu file format specification and to DjVuLibre are plain to see. Unfortunately, this did not last more than a few years. LizardTech was acquired by the Japanese company Celartem in 2007, and DjVu seems to have been transferred to another Japanese company, Caminova, around 2010. The author does not know who owns it now. Only the http://www.djvu.org/ website can still be accessed, and it can only be regarded as a relic of a bygone era!

The promotion of a general-purpose file format is inseparable from the power of capital and successful business operations. In this regard, DjVu did very badly. Not only could it not compare with the PDF and DOC formats (backed by Adobe and Microsoft respectively, both among the world's top ten software giants), it could not even match the domestic WPS and OFD formats (backed by the domestic software giant Kingsoft and by the Electronic Document Management Promotion Alliance of the Ministry of Industry and Information Technology). Its rapid decline is therefore not surprising.

In addition, in terms of the richness of the format itself, apart from DjVu's one killer feature (its high compression ratio), it lacks many capabilities that a document format needs and that PDF provides, including rich text formatting, vector graphics, 3D, multimedia, scripts, interactive forms, permission control, and security authentication. It is therefore better suited to representing scanned images, and using it as a general fixed-layout document format is something of a stretch and remains controversial. Moreover, after 2010, with the broad rollout of Internet connections of 100 Mbps and above and the spread of 4G smartphones, a high compression ratio was no longer as important as it had been around 2000, which is another important reason for DjVu's rapid decline.

DjVu is like a shooting star across the sky, bursting with gorgeous colors and dazzling light: short-lived, but brilliant all the same. Around the world, fixed-layout documents are still dominated by PDF today, but who knows, maybe the domestic OFD format will shake the absolute dominance of PDF in another 10 years!

Finally, we end this article with the retro homepage of the DjVu website ( http://www.djvu.org/ ), whose copyright notice at the bottom of the page is dated 2002, to commemorate this once brilliant shooting star!

One last thought: why did the BigBobRoss ransomware use the .djvu file extension? Could it also be a tribute to the legend?

The Digital Rosetta Project is committed to expressing its views on archives informatization objectively and impartially, as a neutral third party. Truth becomes clearer through debate, and we sincerely welcome more people to devote themselves to research on the management and preservation of digital archival resources, to share their insights, and to work together for the inheritance of human civilization!


Origin blog.csdn.net/weixin_56245650/article/details/130195318