[Chris] A free AI image dataset, removed over child sexual abuse images, has come under fire before

Date: 2023-12-21


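LAION-5B, a massive open-source AI dataset that has been used to train popular AI text-to-image generators such as Stable Diffusion and Google’s Imagen, contains at least 1,008 instances of child sexual abuse material, along with thousands more suspected instances, according to a new report from the Stanford Internet Observatory.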

The Stanford Internet Observatory is a program of the Cyber Policy Center, a joint initiative of the Freeman Spogli Institute for International Studies and Stanford Law School.

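According to the report, the LAION-5B dataset, which was released in March 2022 and contains more than 5 billion images and associated captions from the internet, may also include thousands of additional pieces of suspected child sexual abuse material (CSAM).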

The report warned that CSAM in the dataset could enable AI products built on this data to output new and potentially realistic child abuse content.

In response, LAION told 404 Media on Tuesday that, out of “an abundance of caution,” it was temporarily taking down its datasets “to ensure they are safe before republishing them.”

LAION datasets have been criticized before

But this is not the first time LAION’s image datasets have come under fire.

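As far back as October 2021, cognitive scientist Abeba Birhane, currently a senior fellow in trustworthy AI at Mozilla, published a paper, “Multimodal datasets: misogyny, pornography, and malignant stereotypes,” which examined LAION-400M, an earlier image dataset.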

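The paper found that the dataset contained “troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.”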


In September 2022, an artist discovered that private medical record photos taken by her doctor in 2013 were referenced in the LAION-5B image dataset.

The artist, Lapine, discovered the photos on the Have I Been Trained website, which lets people search popular AI training datasets for their work.

And a class-action lawsuit, Andersen et al. v. Stability AI LTD et al., was brought by visual artists Sarah Andersen, Kelly McKernan, and Karla Ortiz against Stability AI, Midjourney, and DeviantArt in January 2023. While LAION was not sued, it was named in the lawsuit, which said that “Stability is alleged to have ‘downloaded or otherwise acquired copies of billions of copyrighted images without permission to create Stable Diffusion’ known as ‘training images.’ Over five billion images were scraped (and thereby copied) from the internet for training purposes for Stable Diffusion through the services of an organization (LAION, Large-Scale Artificial Intelligence Open Network) paid by Stability.”

Ortiz, an award-winning artist who has worked for Industrial Light & Magic (ILM), Marvel Film Studios, Universal Studios and HBO, spoke at a virtual FTC panel in October and discussed the LAION-5B dataset.

“LAION-5B is a dataset that contains 5.8 billion text and image pairs, which…includes the entirety of my work and the work of almost everyone I know,” she said. “Beyond intellectual property, data sets like LAION-5B also contain deeply concerning material like private medical records, non consensual pornography, images of children, even social media pictures of our actual faces.”

AI pioneer Andrew Ng has criticized removing access to LAION

As VentureBeat reported in September, Andrew Ng, former co-founder and head of Google Brain, has made no bones about the fact that the latest advances in machine learning have depended on free access to large quantities of data, much of it scraped from the open internet.

In an issue of his DeepLearning.ai newsletter, The Batch, titled “It’s Time to Update Copyright for Generative AI,” he wrote that a lack of access to massive popular datasets such as Common Crawl, The Pile, and LAION would put the brakes on progress or at least radically alter the economics of current research.

“This would degrade AI’s current and future benefits in areas such as art, education, drug development, and manufacturing, to name a few,” he said.

And in the June 7 edition of The Batch, Ng admitted that the AI community is entering an era in which it will be called upon to be more transparent in our collection and use of data. “We shouldn’t take resources like LAION for granted, because we may not always have permission to use them,” he wrote.

LAION was founded to create an open-source dataset

Hamburg, Germany-based high school teacher and trained actor Christoph Schuhmann helped found LAION, short for “Large-scale AI Open Network.” According to an April 2023 Bloomberg article, Schuhmann was hanging out on a Discord server for AI enthusiasts and was inspired by the first iteration of OpenAI’s DALL-E to make sure there would be an open-source dataset to help train text-to-image diffusion models.

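“Within a few weeks, Schuhmann and his colleagues had 3 million image-text pairs. After three months, they released a dataset with 400 million pairs,” the Bloomberg article said. “That number is now over 5 billion, making LAION the largest free dataset of images and captions.”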

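Since then, the nonprofit LAION has spoken out publicly on open-source AI. For example, after a March 2023 open letter calling for an AI “pause” set off a heated debate over risks versus hype, LAION called for accelerating research and establishing a joint international computing cluster for large-scale open-source AI models.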