Australian universities and the New South Wales government are among the largest known data sources from Australia used to train artificial intelligence chatbots such as ChatGPT, but they receive no compensation for their material.
Most of the enormous volumes of data that train the powerful generative AI chatbots, which are poised to transform white-collar industries from media to education, remain secret. But at least two major AI companies – Google and Stability AI – draw some information from the Common Crawl, a non-profit project that scans the internet gathering text from billions of pages.
New South Wales government web pages, which include thousands of sites from schools, hospitals and local councils around the state, contribute more pages to the Common Crawl than any other Australian entity, according to its database of the top 500 registered domains. They are followed by the Australian National University, the University of Adelaide and the University of Melbourne.
Individually, these sites contribute only a small fraction of the overall Common Crawl database – which is measured in the thousands of terabytes – and rank far below major sources such as Wikipedia and Amazon-hosted pages. But their presence shows how websites created by millions of people, including in Australia, and intended for completely different purposes are being fed into artificial intelligence systems that have already generated billions for their handful of owners.
Social media services Reddit and Twitter, international media giant News Corp and photo library Getty Images are all demanding payment for the way artificial intelligence companies have used their data to train their generative image and text systems.
Meanwhile, Australian public institutions are only just coming to terms with the use of their data for AI. A spokesman for the Australian National University said the institution was looking at the issue closely but did not yet have a developed position.
“This is mainly because if the company is operating under US legislation, then they use website content consistent with the legislation of that country rather than Australian legislation,” the spokesman said.
“Our experts note that many technology innovations could not occur in Australia because of our legislation, which is fair-dealing rather than fair-use based.”