Mumsnet, a widely popular UK-based parenting forum for mothers, has posts on almost any imaginable topic related to raising children. With a history spanning more than two decades, Mumsnet has accumulated an archive of more than six billion words contributed by its actively engaged users, who discuss subjects ranging from dirty diapers to lazy husbands and even an unusual rant about dolphins.
This spring, Mumsnet discovered that AI companies were scraping its data. In response, the company decided to pursue licensing agreements with major AI entities, including OpenAI, which initially showed interest. However, negotiations with OpenAI ultimately fell through, leading Mumsnet to announce in July its intention to take legal action against OpenAI and other scrapers.
During the initial discussions, an OpenAI strategic partnership lead indicated interest in datasets exceeding one billion words. Mumsnet’s leadership was enthusiastic and engaged in detailed exchanges, including signing NDAs and providing extensive information. Yet, over a month later, OpenAI informed Mumsnet that it was no longer interested in a partnership. OpenAI’s rationale, as reviewed by WIRED, was that Mumsnet’s dataset was too small and consisted mostly of publicly accessible content. OpenAI prefers large datasets that are not publicly accessible and that capture a broad range of human experience.
This position was reiterated by OpenAI’s spokesperson, Kayla Wood, who stated, “We pursue partnerships for large-scale datasets that reflect human society and do not pursue partnerships solely for publicly available information.” Wood emphasized that OpenAI supports publisher and creator preferences regarding how their content interacts with AI in search results and in the training of generative AI models.
Justine Roberts, Mumsnet founder and CEO, expressed frustration at the development, noting that OpenAI originally seemed particularly interested because of Mumsnet’s primarily female-generated content, which is rare and of high quality.
OpenAI has established various data-licensing agreements with media outlets and platforms over the past year, including partnerships with Vox Media, The Atlantic, Axel Springer, Time, and WIRED’s parent company Condé Nast, as well as user-generated content platforms like Reddit. The specifics of these agreements remain undisclosed, so the size of the licensed corpora is unknown.
When asked by WIRED about the minimum dataset size for commercial licensing, OpenAI declined to specify, but Wood highlighted that the company’s partnerships aim to integrate publisher content into its products while driving traffic to the publishers’ sites.