AI, the giants in the data battle


Generative artificial intelligence, capable of imitating human creativity, is like Pandora’s box. The algorithms that so worry society and governments, capable of generating texts, images, music and videos as a real person would, are in most cases hidden from view, locked inside “black boxes” that Big Tech has no intention of opening. That is because, in many cases, they represent a huge competitive advantage.

The mysteries of AI

Take ChatGPT as an example: one of the most famous and widely used generative AIs in the world, with more than 100 million monthly active users. Those who use it know well that by entering a given text input, for example “create an image of a cat”, they will receive an equally precise output in return: a drawing or photographic portrait of a feline. But how the AI achieves this is a mystery, even to its own developers. “If we opened ChatGPT or similar tools and looked inside, we would see nothing but millions of numbers changing hundreds of times per second,” explained Sam Bowman, a New York University professor and researcher at Anthropic, an AI development company reportedly valued at 15 billion dollars. “Most importantly, we would have no idea what was happening,” Bowman added.

What the main players are investing to develop and “train” AI

Self-learning to advance AI

In fact, AI progresses thanks to machine learning. This self-learning process, which lasts months and costs millions of dollars, pushes AI to explore complex and unpredictable paths. The “recipe” of an artificial intelligence, however, does have known ingredients: the initial data provided by humans. These are the billions of texts, images and videos that the AI studies. For those who develop artificial intelligence, this is a problem. First of all because the public databases from which to extract information, such as the free encyclopedia Wikipedia, are limited. And then because “quality” inputs, necessary for more accurate content generation, are often protected by copyright or belong to large companies. For these reasons, the data business has surged since the advent of ChatGPT (November 2022).

It all started when large publishing groups, along with major online photo and video banks, realized that companies like OpenAI (the creator of ChatGPT) could thrive by using their content without authorization. Getty moved first, in 2023, suing Stability AI, the company behind Stable Diffusion, a popular AI that generates images. A few months later George R.R. Martin, John Grisham and other writers started legal action against OpenAI, which allegedly fed the pages of their books to ChatGPT without paying or asking permission.

NYT against OpenAI

An identical reason prompted the New York Times to sue OpenAI, once again the target, at the beginning of 2024, alleging that it had illegally used the American newspaper’s articles to feed its AI. These are accusations, and suspicions, that are difficult to prove, precisely because of the “closed” nature of ChatGPT and the lack of transparency regarding its sources. Recently the Wall Street Journal interviewed Mira Murati, the Chief Technology Officer of OpenAI, i.e. the person responsible for the company’s technological strategy. But when reporter Joanna Stern asked her where the training data came from for Sora, the new artificial intelligence that generates extraordinarily realistic videos, Murati clammed up: “We used public and licensed data.” “And so also videos from YouTube, Facebook or Instagram?” pressed the journalist. “I’m not sure,” replied the top manager.

In reality, OpenAI has long understood that the Wild West era in the race for artificial intelligence has come to an end. All data used from here on out will have to be paid for, at its weight in gold. The company led by Mira Murati and by CEO Sam Altman has entered into agreements in recent months with several high-profile “suppliers” for a total value of 20 million dollars per year: with the Shutterstock media library for images, audio and video; and, for text, with Associated Press, the German giant Axel Springer (publisher, for example, of Politico and Business Insider), Le Monde and Prisa Media (publisher of El Pais).

From Google to Apple

The lawsuits that hit OpenAI were also a warning to competitors. For 60 million dollars a year, Google secured the data of the Reddit platform, a social network that boasts 850 million active users and channels with millions of subscribers, the so-called “subreddits”, in which the most disparate topics are discussed every day. Google will use Reddit content to train Gemini, its most advanced AI. Even Apple, which has yet to unveil the AI model for its devices, is investing in the quality data that publishers possess. The New York Times reports that the Cupertino company will spend at least 50 million dollars over the coming years to train its AI. To this end it has already entered into negotiations with the Condé Nast Group, which publishes for example Vogue and the New Yorker, and the broadcaster NBC News.

Then there are those who are lucky enough to have a huge database in-house. Meta was able to use tens of millions of photos shared on its social networks in “public” mode. Facebook and Instagram are in fact among the sources used to train an AI that generates images (it is called “Imagine with Meta AI” and is not yet available in Italy). But not everyone developing AI has the resources of Big Tech. Some have noticed, for example, that Midjourney, one of the most effective AIs at creating realistic shots, studied frames from films produced by Warner Bros (Joker, The Batman). There is no economic agreement between the two parties. But innovative companies have always taken risks. Twenty years ago a young Zuckerberg advised Harvard students: “Think about creating first, then about the lawyers.”

