AI Training and Copyright
Are large corporations overstepping copyright boundaries while training their AI models? It seems they might be.
Industry leaders such as Meta, Google, and Microsoft routinely train their AI models on publicly available material, including website articles and social media posts. "Publicly available", however, is not the same as "public domain": such content typically remains under copyright protection unless explicitly stated otherwise.
Take Meta, for example. The company has openly acknowledged using user-generated content from its platforms, Facebook and Instagram, to refine its Llama 2 model, which, according to its developers, can interpret text, images, and even gestures. They argue that data sourced from such a diverse pool of users significantly improves the model's performance. Meta does say it is cautious here, excluding entries that contain sensitive information from the training dataset.
Beyond the copyright question, many posts also reveal users' personal preferences, which platforms already harvest under specific terms of service to analyze and deliver targeted ads. This often-overlooked mechanism is central to how these companies earn millions from advertising. Once the same data is fed into AI model training, however, it becomes a permanent part of the new product, with no comparable terms attached.
Company attorneys may invoke the "fair use" defense for building technology on openly available material, but charging subscriptions for the resulting applications undermines that justification. Moreover, unlike with targeted-ad settings, users rarely get the option to opt out of AI model training, a grievance repeatedly raised by the creative community: artists, actors, and musicians.
Some firms do offer data-governance tools. OpenAI, for instance, lets users request that their content be excluded from AI training. However, perhaps to curb the volume of such requests, the company requires specific legal documentation for each individual work, a complication that makes the feature largely impractical.
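For website owners, one mechanism OpenAI does publicly document is blocking its GPTBot web crawler via the site's robots.txt file, which keeps newly crawled pages out of future training data (though it does nothing for content already collected). A minimal sketch:

```
# robots.txt at the site root
# Block OpenAI's GPTBot crawler from the entire site
User-agent: GPTBot
Disallow: /
```

This is an all-or-nothing example; a site can instead disallow only specific directories by listing narrower paths.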
For now, technology giants handle publicly available data largely on their own terms. With no universally accepted guidelines or regulations, there is no standardized approach to speak of. Once such rules are in place, though, we could well see account settings that give users genuine control over their content and personal data.