OpenAI Signs Content Deal with Reddit for ChatGPT
KAKALI DAS

On May 17, Reddit and OpenAI announced a partnership intended to benefit both companies' user communities.
Global rules are desperately needed in the realm of Artificial Intelligence. For all the praise of AI's capabilities, there is a pressing need for regulation. AI bots excel at exams, tutor students, and even handle office and school tasks.
The capability of AI bots stems from data. They operate like vending machines, but instead of snacks they dispense information: users ask for what they need, and the bot supplies it.

However, unlike a vending machine, where we know where the food comes from, the origins of an AI bot's data remain largely undisclosed. This lack of transparency is a significant problem, because knowing what a bot was trained on is crucial to judging the information it produces.
In 2023, a study analyzed Google's C4 dataset, a compilation of some 15 million websites that serves as training data for numerous AI models, including those built by Google and Facebook. The analysis revealed that the second largest source in the dataset was Wikipedia, a platform familiar to many.
However, Wikipedia's open-editing policy means its content isn't always reliable or accurate. Despite this, AI bots are routinely trained on Wikipedia data.
Another notable entry on the list was b-ok.org, an illegal online marketplace for pirated books. This is no isolated case: AI bots draw data from 27 websites flagged for piracy by the US government, as well as from personal blogs hosted on platforms such as WordPress, Tumblr, and Blogspot.
The issue is clear: information on Wikipedia isn't rigorously scrutinized, and facts on personal blogs often go unverified. This opens the door to misinformation and to opinion being presented as fact.
Training AI bots on such data poses significant risks, and that is the primary concern. There is also the use of copyrighted material, such as songs on Spotify or videos on YouTube, which raises legal issues, since intellectual property requires permission for use.
Yet AI companies don't always obtain that permission.

Reports indicate that OpenAI transcribed millions of YouTube videos to help develop its video bot, Sora, which converts text prompts into video content.
The New York Times has filed a lawsuit against OpenAI, alleging that its articles were used without permission or payment to train ChatGPT. While the NYT has the resources to pursue legal action, the plight of smaller content creators is equally concerning. These creators invest time and money into producing valuable content, be it news, music, or art, only to have AI developers allegedly exploit it for their bots without permission.

One solution being explored is content companies striking deals with AI firms. Reddit's new agreement with OpenAI allows Reddit posts to be used for training AI models; Reddit earlier signed a similar agreement with Google, reportedly valued at $60 million.
Reddit isn't alone in pursuing this approach. Several news organizations, including The Financial Times and The Associated Press, have also permitted OpenAI to use their articles. Perhaps this collaborative model is the way forward.
Many news companies are grappling with financial difficulties, making AI licensing a potential new revenue stream. But significant challenges remain for anyone entering this arena.
Who will be held accountable for past intellectual-property infringements? AI companies amassed their wealth by using third-party data without proper compensation; it is their soaring valuations and billion-dollar investments that now let them afford licensing deals. That lack of accountability amounts to hijacking the system.

Governments should mandate that AI companies disclose their training data, a step many, including OpenAI, are reluctant to take. Transparency about training data is crucial for earning public trust in AI bots, because it reveals where their information comes from.
Consider encountering a news article online that presents facts and figures without citing any sources. Would you trust it? Most likely not, and you shouldn't. That is analogous to the current state of AI bots.
Secondly, accountability is essential for combating false information. AI bots sometimes generate false narratives, essentially "hallucinating" events that never occurred.
Yet no one is currently held accountable for such failures. They seem to be treated as an inevitable part of the AI development journey.

Fake news carries significant consequences, and its spread by humans is troubling enough. If AI amplifies that spread, the fallout could be far harder to contain. Accountability is crucial to preventing this scenario.
Companies cannot excuse themselves by labeling AI an emerging technology; they must take responsibility for their products. This might slow innovation, but prioritizing accuracy over speed is imperative for a more trustworthy world.
18-05-2024
Mahabahu.com is an Online Magazine with collection of premium Assamese and English articles and posts with cultural base and modern thinking. You can send your articles to editor@mahabahu.com / editor@mahabahoo.com (For Assamese article, Unicode font is necessary)