But here we’re talking about massive tech corporations. A cynic would point out that Google and Microsoft developed many of the very technologies designed to keep bots out of places they shouldn’t be. Most CAPTCHA puzzles, for example – those boxes full of images you need to click to prove you’re human – are run by Google. What would stop them taking everything?
A pair of class action lawsuits recently filed in the US allege that’s exactly what the tech giants have been doing.
Brought by Clarkson Law Firm, the cases allege that OpenAI, Microsoft and Google have been harvesting as much information from the internet as possible – including, illegally, private information and copyrighted works – to build their AI chatbots.
The lawsuits also allege that these actions have caused or will cause harm: that the companies ignored an existing, established market for purchasing information; that models trained on copyrighted works will be used to compete against traditional media; and that people will be sold products built with data that was taken from them in the first place.
“[Bing and ChatGPT] use stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed consent or knowledge,” the first lawsuit alleges, in part.
“Furthermore, defendants continue to unlawfully collect and feed additional personal data from millions of unsuspecting consumers worldwide, far in excess of any reasonably authorised use, in order to continue developing and training the products.”
A Microsoft spokesperson declined to comment.
Clarkson makes similar claims in its lawsuit against Google.
“Google illegally accessed restricted, subscription-based websites to take the content of millions without permission and infringed at least 200 million materials explicitly protected by copyright, including previously stolen property from websites known for pirated collections of books and other creative works,” that lawsuit says in part.
“Without this mass theft of private and copyrighted information belonging to real people, communicated to unique communities for specific purposes, and targeting specific audiences, many of Google’s AI products, including Bard, would not exist.”
In response to the allegations, Google reiterated that all its data collection practices were legal.
“We’ve been clear for years that we use data from public sources – like information published to the open web and public datasets – to train the AI models behind services like Google Translate, responsibly and in line with our AI principles,” said Google general counsel Halimah DeLaine Prado.
“American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”
This masthead does not suggest that Google, OpenAI or Microsoft have done anything illegal.
The question of how chatbots are trained and improved, and whether our own personal data is used, brings to mind a familiar problem – one we’ve already seen play out with social media.
The fundamental problem is that the inner workings of this technology – which seems bound to play a big part in our lives – are completely opaque, making it impossible to answer burning questions or truly understand what’s happening with the information we place online. It’s unclear whether even those operating the bots fully control or understand the scope of their data collection.
And as with social media, privacy policies and official FAQs don’t offer much illumination. For example, Google’s privacy policy makes it clear that it collects information about every email you send, photo you upload, and search you make, and that it’s used to develop, maintain and improve services. So does that mean an original novel that you have sitting in your Google Drive is helping train language models?
A Google spokesperson said no: personal data from Gmail, Photos and Google’s Workspace services is not used to train AI models, including Bard.
But the same might not be true if you happen to post your writing on a forum. A recent change to Google’s policy explicitly states it will use information that’s “publicly available or from other public sources” to train AI models and build products.
Discussion site Reddit recently seemed to realise that its service was being constantly crawled by bots, and changed its backend so that nobody could get unfettered access to its content without paying a hefty fee. The move sparked a site-wide protest in support of developers who needed that access to create accessible and custom-made versions of the site. But Reddit chief executive Steve Huffman said at the time he didn’t “need to give all of that value to some of the largest companies in the world for free”.
Data scraping was also supposedly behind recent changes at Twitter, where limits were imposed on how many tweets a given user could see per day. Owner Elon Musk said it was to “address extreme levels of data scraping [and] system manipulation”.
Elsewhere, several media companies have expressed concerns that their articles and journalism could be scraped to train bots, which eventually would be used to produce work that would compete with traditional media. Arguably, most news sites are “public sources”. Similar issues have been raised by writers as part of the film industry strikes in the US.
As with social media, the full weight of regulation and litigation will be a few years behind, it seems, as practically all rules that govern what data crawlers can do – or what information can be collected for which purpose – were written well before the advent of generative AI. In the meantime, it’s close to impossible to tell exactly what’s going on behind the scenes.
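For what it’s worth, the main rule that does exist is a decades-old voluntary convention: a plain-text file called robots.txt, in which a website asks crawlers to stay away from some or all of its pages, with no way to enforce the request. As a rough illustration – using Python’s standard library, with example.com standing in for a real site – this is how a well-behaved crawler checks that file before fetching anything:

    from urllib import robotparser

    # Load the site's robots.txt, the file where it states its crawling rules.
    # example.com is a placeholder, not a real target.
    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()

    # CCBot is the user agent of Common Crawl, whose web archives have been
    # used as AI training data. Compliance is entirely voluntary.
    can_crawl = parser.can_fetch("CCBot", "https://example.com/some-article")
    print("robots.txt permits fetching:", can_crawl)

The catch, as Reddit and Twitter have discovered, is that nothing in that file stops a crawler that simply chooses to ignore it.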