
Multimodality as the next big leap for AI

AI: "The next big leap will deal with multimodality"
Antoine Bosselut. Credit: EPFL/Alain Herzog

As the head of the Natural Language Processing Laboratory at EPFL, Antoine Bosselut keeps a close eye on the development of generative artificial intelligence tools such as ChatGPT. He looks back at their evolution over the past two years and suggests some avenues for the future.

We spoke two years ago, when ChatGPT became public. Looking back, would you say this was the beginning of a new era?

Yes, I think there was indeed a “ChatGPT moment” that changed the paradigm of AI in two ways. First, from a technical point of view: we went from task-based to instruction-based systems, or what is known as generative AI. Before that ChatGPT moment, individual AI systems were trained to perform very specific tasks.

ChatGPT was a game-changer: a single system could follow a multitude of instructions and produce outputs for a wide range of tasks, all based on the enormous amount of data used to train it. That technical shift created a perceptual shift as well. With instruction-based AI, anybody could use these systems, and the general public understood that AI could be integrated into many aspects of daily life.

Competitors were quite quick to launch their own solutions. Was OpenAI really a pioneer?

A lot of companies were already working on similar approaches. Anthropic, which launched Claude, was founded by a group of former OpenAI engineers a year before ChatGPT came out. Google, too, had been working on instruction-tuned models for many years.

The OpenAI release was a step up from what anybody else had done, but the real change was that they managed to turn the technology into a product. That changed users’ perception of the technology’s maturity and forced all the big tech players to shift their focus.

What about DeepSeek, launched in late 2024? Is it that different from other models?

It’s too soon to say whether it represents a jump like the one we saw two years ago. A lot of the excitement around DeepSeek centers on its cost, not necessarily on novel capabilities. The truth is, we still don’t know much about the model itself: the price tag the company announced covers only the final training run, and we don’t know what the pre-trained model cost to build.

Saying it’s “open-source” would be a stretch. You can use its code, integrate it into other applications and develop it further, but we don’t really know what its foundations are, since there’s little information about the training data. You don’t know what you’re building on top of.

We’re seeing a massive race to invest in AI: the US has announced 500 billion dollars, and Europe has mentioned 200 billion euros. Is it really worth spending that much money?

You’re going to spend this money anyway; the question is, who gets it? AI is not going anywhere and will continue to grow as a technology that people use every day. If Europe fails to develop convincing generative AI solutions, users will turn to U.S. or Chinese services, with all the risks this entails around sovereignty.

What about the place of Switzerland in all this?

Both EPFL and ETH Zurich are excellent at training the next generation of specialists, developing solid theoretical knowledge and making it available to society at large, thus providing a trusted alternative to foreign tools. This is exactly what the Swiss AI Initiative and the Swiss National AI Institute were created to do: train the next generation of engineers and scientists and put their skills at the service of society.

Let’s get back to how large models work. Is there a risk that the pollution of training data, particularly by content generated by AI itself, will degrade the quality of future models?

There is a theoretical risk. But paradoxically, thanks to the filtering and cleaning pipelines being developed in parallel, the synthetic data used for training are actually of very high quality. Conversely, a lot of unfiltered human-generated content can be false or biased. It’s therefore hard to say whether this fear is justified.

In which field do you foresee generative AI playing a major role?

It might be easier to think about the fields in which AI won’t play any role. In areas such as health, national security and anything involving confidential information, the data is too sensitive to simply transfer to the servers where generative AI systems are hosted. Trust in these systems and their owners will remain a question mark for many years.

So far, we’ve observed a technological leap every two to three years. What’s next?

Despite the ever-accelerating capabilities of these models, they remain fundamentally text-based. In concrete terms, everything today is built on a vocabulary of around 50,000 tokens, short fragments of text. That may be enough to give human users the impression that the machine is capable of reasoning. But human reasoning is far more complex and draws on other modes of perception too: sounds, images, even smells.
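
As context for that figure: a vocabulary of roughly 50,000 items corresponds to the subword tokenizers of GPT-2-era models (GPT-2 uses exactly 50,257 tokens). A minimal sketch, assuming the open-source tiktoken library is installed, makes the idea concrete:

    import tiktoken

    # Load the GPT-2 byte-pair encoding and check its vocabulary size.
    enc = tiktoken.get_encoding("gpt2")
    print(enc.n_vocab)  # 50257

    # Text is carved into subword tokens rather than whole words.
    tokens = enc.encode("Multimodality is the next big leap.")
    print([enc.decode([t]) for t in tokens])  # the individual text fragments

Everything such a model “perceives” arrives as a sequence drawn from this fixed inventory, which is the limitation the interview points to.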

I think the next big evolution will come when models can also directly integrate other types of content, such as images, sounds and videos. This “multimodal AI” will then come even closer to artificial “thinking”, even if that notion remains more philosophical than technical.
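
Early building blocks for this already exist. As a purely illustrative sketch, not a description of any particular product, a contrastive vision-language model such as CLIP can relate images and text in a shared embedding space; this assumes the Hugging Face transformers library and a hypothetical local image file, photo.jpg:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load a pretrained contrastive vision-language model.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # hypothetical example image
    texts = ["a photo of a cat", "a photo of a dog"]

    # Both modalities are encoded and compared in one embedding space.
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    print(outputs.logits_per_image.softmax(dim=1))  # image-text match probabilities

Fully multimodal generation, in which a single model consumes and produces text, images and audio natively, is the step beyond this kind of paired encoding.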

Provided by École Polytechnique Fédérale de Lausanne



