The thousands of photos you store on Google or Amazon. Your likes on Spotify or Netflix. The commands you give to Siri or Alexa. Even those cursed “captchas” you struggle to decode. Often without knowing it, users perform actions every day that are a gold mine for tech giants, letting them accumulate valuable data to train their artificial intelligence (AI) systems.
“Almost everything we do on the internet is recorded,” says Laurent Charlin at the outset. He is a senior member of Mila, the Quebec Artificial Intelligence Institute, and an associate professor at HEC Montréal. “And the chances are probably growing, month after month and year after year, that this information will be used in one way or another, somewhere, to train an automated system.”
Effective because popular
Google’s search engine is itself a good illustration of what makes this user participation valuable. Without it, Google could never have set itself apart from earlier search engines, which did little more than archive and list sites by keyword. PageRank, a concept based on the popularity of web pages that Google founders Sergey Brin and Larry Page presented in 1998, launched the rise of a search engine that today holds 83.5% of the market, according to Statista.
In other words, the more Internet users use Google, the more relevant its search engine is.
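To make the popularity idea concrete, here is a minimal sketch of the textbook PageRank calculation on a toy web of four pages. The page names, the 0.85 damping factor and the iteration count are assumptions chosen for illustration; this is a simplified version of the published idea, not Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated averaging (power iteration).

    links maps each page to the list of pages it links to.
    All names and parameters here are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score, split evenly, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "home", nothing links to "blog".
toy_web = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home"],
    "blog": ["home"],
}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, {page: round(score, 3) for page, score in ranks.items()})
```

In this toy example, the page that the most other pages point to ends up with the highest score, while the page nobody links to ends up with the lowest.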
“Fundamentally, a human’s time is worth a lot,” explains Mr. Charlin. “Artificial intelligence is often trained with labeled data, so the more labels I have, the more datasets I have that will let me improve it.”
The magic of the recommendations at work on most popular sites, from Facebook to YouTube to Amazon, also relies on analyzing user behavior. “It’s still quite a long process; training a machine requires millions and millions of data points,” explains Jonas Colin, a researcher and doctoral student in cognitive computing at the Université du Québec à Montréal (UQAM).
“For example, you go to YouTube, you select different videos, and everything is kept behind the scenes in the machine’s memory,” Mr. Colin says. “If you watch car videos, the next time you go to YouTube, we’re going to offer you car videos. […] The machine knows its users much better and is able to respond much better to their preferences.”
Material to refine
On the web, it is as if technology platforms had permanent, free focus groups and surveys of their users. That is an advantage for these companies, of course, but it also benefits Internet users, says Louis-François Bouchard, an artificial intelligence science communicator and co-founder of Towards AI, an education platform.
“As soon as we use Google or Apple, as soon as we use a system, we help it improve. Which is good for us too: it makes what we use better.”
But it would be wrong to think this raw data is enough, he adds. “Sure, it’s great for them to have a lot of data, to have access to all that, except that it’s also a bit of a poison. […] After that there’s a lot of work, a lot of processing, a lot of engineering. It’s clear that we help them, but we shouldn’t say that we do everything for them either.”
The example of ChatGPT illustrates this well: the content on which this generative artificial intelligence was built, the billions of pages of text on the web, was accessible to everyone. “There is this first training, which provides so much data. I would say it is the first step, to understand a little about the world, before acquiring more refined knowledge,” summarizes Mr. Bouchard. “It’s like going to primary school first, then going into a more specialized program at CEGEP or university.”
Four training examples
Captchas
“Captchas” are those online tests where you are asked to identify a specific object in photos (a bicycle, a staircase, a traffic light) or to type out handwritten letters. Google’s reCAPTCHA, with a market share of 93% to 99.9% depending on the source, dominates this tool offered to websites. Its official role: to make sure that those trying to enter a site are indeed human. But as Google states online, reCAPTCHA is also used to train AI. “High-quality human-labeled images are compiled into datasets that can be used to train machine learning systems,” it reads. However, “CAPTCHAs have no longer been used to train AI since 2019,” a Google spokesperson clarified.
Recommendations
Whether on Netflix, Spotify, TikTok or YouTube, users’ choices are carefully compiled and used to suggest what they listen to or watch next. The algorithms at work are much more complex than a simple analysis of individual tastes. “On TikTok, we will recommend popular videos from people who have a profile somewhat similar to yours,” explains Louis-François Bouchard. “If you have watched an entire series on Netflix, it will be offered to several subscribers with the same profile. And if a lot of people watch it, they will produce a similar series.”
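The “similar profile” logic described above can be sketched in a few lines. This is a deliberately simple illustration that assumes each profile is just a set of watched videos and measures similarity by overlap (the Jaccard index). The user names, video titles and scoring rule are all hypothetical; real platforms use far more elaborate models.

```python
def jaccard(a, b):
    """Share of videos two users have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories, top_n=3):
    """Suggest unseen videos, weighted by how similar each other user is."""
    seen = histories[target]
    scores = {}
    for user, watched in histories.items():
        if user == target:
            continue
        similarity = jaccard(seen, watched)
        if similarity == 0:
            continue  # nothing in common: this user's tastes are ignored
        for video in watched - seen:
            scores[video] = scores.get(video, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda video: (-scores[video], video))
    return ranked[:top_n]


# Hypothetical viewing histories.
histories = {
    "ana": {"car_review", "engine_repair", "road_trip"},
    "ben": {"car_review", "engine_repair", "drag_race"},
    "chloe": {"car_review", "drag_race", "tire_change"},
    "dev": {"baking", "knitting"},
}
suggestions = recommend("ana", histories)
print(suggestions)  # ['drag_race', 'tire_change']
```

Here the viewer of car videos is offered what other car enthusiasts watched, and nothing from the user whose history has no overlap with hers.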
Photos
A class action recently authorized in Quebec against Google Photos has lifted the veil on how useful the billions of photos posted online can be. In Google’s case, these photos are analyzed by a facial recognition technology called FaceNet, which allows people to be tagged. Facebook has a similar technology, for which it had to pay US$650 million as part of a class action in Illinois in 2021. “These machines, if you train them with a few million photos, are capable of recognizing distinctive features that you and I would not be able to recognize,” explains Jonas Colin.
Voice assistants
No one likes to be told off or corrected, except voice assistants like Siri, Alexa and Google Assistant. These corrections are a gold mine for the engineers who designed these artificial intelligence systems, which can improve thanks to the free collaboration of their users. In 2019, Amazon confirmed to La Presse that it had called on “a few thousand French-speaking users” in Canada to help Alexa understand Quebec French. The autocorrect offered on mobile devices, notes Louis-François Bouchard, also improves through feedback from its users. “If it notices that words it doesn’t know come up often, it will end up accepting them.”