
Algorithm based on LLMs doubles lossless data compression rates

[Figure: Comparison of the lossless compression rates of LMCompress against traditional state-of-the-art methods and the large-model-based method proposed independently by a DeepMind-Meta&INRIA team, across four data types: image, video, audio, and text. LMCompress consistently outperforms the others on all data types; the DeepMind result on video is not available. Credit: Li et al.]

People store large quantities of data in their electronic devices and transfer some of this data to others, whether for professional or personal reasons. Data compression methods are thus of the utmost importance, as they can boost the efficiency of devices and communications, making users less reliant on cloud data services and external storage devices.

Researchers at the Central China Institute of Artificial Intelligence, Peng Cheng Laboratory, Dalian University of Technology, the Chinese Academy of Sciences and University of Waterloo recently introduced LMCompress, a new data compression approach based on large language models (LLMs), such as the model underpinning the AI conversational platform ChatGPT.

Their proposed method, outlined in a paper published in Nature Machine Intelligence, was found to be significantly more powerful than classical data compression algorithms.

“In January 2023, when I taught a Kolmogorov complexity course at the University of Waterloo, I reflected on the idea that compression is understanding,” Ming Li, senior author of the paper, told Tech Xplore. “In other words, if you understand something, you can express it succinctly; and if you can express something in a very short expression or in a few words, then you must understand it.

“In this paper, we proved that compression implies the best learning and understanding. The opposite direction was proved in one of our other papers, which was a precursor to this work, while another paper by Google DeepMind independently obtained our initial results.”

[Figure: Illustration of the paper’s key insight. The equivalence of understanding and compression bridges a cognitive concept (comprehension) and a technological concept (compression), and sheds light on developing understanding-based technologies such as semantic communication. Credit: Li et al.]

As part of their recent study, Li and his colleagues set out to demonstrate that the better models grasp data, the better they can summarize it and compress it. This idea dates back to 1948, specifically to Claude Shannon’s renowned mathematical theory of communication.

“Shannon essentially proposed that if you understand the data to be communicated, then you can compress it, or in other words, shorten communication time,” explained Li. “For 80 years, this challenge remained open, until AI and large language models came along. Our paper essentially proposes that if a large language model can understand data well, it must be able to guess what we plan to write, which allows us to compress the data significantly better than the best classical lossless data compressors (e.g., bzip for text, JPEG-2000 for images).”
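Shannon's link between prediction and compression can be made concrete with a small calculation. This is a generic illustration of the source-coding bound, not code from the paper: a symbol predicted with probability p needs about -log2(p) bits, so a model with sharper (more confident and correct) predictions yields shorter codes.

```python
import math

def expected_bits(dist):
    """Average bits per symbol under Shannon's source-coding bound:
    a symbol with probability p costs about -log2(p) bits, so the
    expected cost is the entropy of the distribution."""
    return sum(p * -math.log2(p) for p in dist)

# Two hypothetical models predicting the same 4-symbol alphabet:
sharp = [0.85, 0.05, 0.05, 0.05]   # a model that understands the data
flat  = [0.25, 0.25, 0.25, 0.25]   # a model with no understanding at all

print(f"sharp model: {expected_bits(sharp):.2f} bits/symbol")
print(f"flat model:  {expected_bits(flat):.2f} bits/symbol")
```

The flat distribution costs a full 2 bits per symbol, while the confident model needs well under 1 bit, which is the sense in which better understanding directly buys better compression.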

The basic idea behind the researchers’ data compression algorithm is that if an LLM can predict what a user is about to write, little data needs to be transmitted: an identical model on the receiver’s device can simply regenerate the content on the other end. When Li and his colleagues tested their proposed approach, they found that it at least doubled compression rates for different types of data, including texts, images, videos and audio files.
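The prediction-based idea above can be sketched with a toy rank coder. This is not the LMCompress implementation (which couples a large model with an entropy coder); it is a minimal illustration, with a hypothetical letter-frequency "model" standing in for an LLM: the sender transmits only how far each true symbol fell down the model's ranked guess list, and a good model produces mostly small ranks, which any entropy coder compresses cheaply.

```python
def rank_encode(text, predict):
    """predict(prefix) -> symbols ordered from most- to least-likely.
    Encode each character as its rank in the model's guess list."""
    ranks = []
    for i, ch in enumerate(text):
        guesses = predict(text[:i])
        ranks.append(guesses.index(ch))
    return ranks

def rank_decode(ranks, predict):
    """The receiver runs the same model, so ranks suffice to rebuild the text."""
    out = ""
    for r in ranks:
        out += predict(out)[r]
    return out

# Hypothetical stand-in for an LLM: a fixed English letter-frequency order.
ORDER = " etaonirshdlucmfwypvbgkjqxz"

def freq_model(prefix):
    return list(ORDER)

text = "the rain in spain"
ranks = rank_encode(text, freq_model)
assert rank_decode(ranks, freq_model) == text
print(ranks)  # small ranks dominate, so the stream is cheap to entropy-code
```

The better the model's next-symbol guesses, the more the rank stream collapses toward zeros, which is the intuition behind letting a large model drive a lossless compressor.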

“This is amazing in the sense that after 80 years of research, if you just improve a lossless compression algorithm by even 1%, this is already remarkable, and we were able to double compression rates,” said Li. “LMCompress is a compression algorithm using large models (a large language model for texts, a large image model for images, etc.). It compresses texts more than two times better than classical algorithms, images and audio two times better, and video slightly less than two times better. Therefore, when you transmit data, you can go approximately two times faster.”

This recent paper by Li and his colleagues could inform future efforts aimed at developing increasingly advanced data compression techniques, inspiring other researchers to leverage LLMs. Moreover, the team’s LMCompress algorithm could soon be improved further and deployed in real-world settings.

“We demonstrated that understanding equals compression, and we think this is of crucial importance,” added Li. “We also paved the way for a new era of compressing data using LLMs. We think in the future, when these large models are on our cell phones and everywhere, our method of compressing data will replace the classical ones (e.g., .zip files). In our next studies, we also plan to use our methodology to compare large models and detect plagiarism.”

More information:
Ziguang Li et al, Lossless data compression by large models, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01033-7

© 2025 Science X Network

Citation:
Algorithm based on LLMs doubles lossless data compression rates (2025, May 14), retrieved 14 May 2025

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
