Major AI Training Data Set Exposes Millions of Personal Records

Introduction to Online Data Scraping

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the study’s coauthors, is that “anything you put online can [be] and probably has been scraped.” The statement underscores the risk of sharing personal information online: once posted, it can be collected and reused by others.

The Extent of Data Scraping

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.

Sensitive Information at Risk

A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

Examples of Scraped Data

Examples of identity-related documents found in CommonPool’s small-scale dataset include a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

The DataComp CommonPool Dataset

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

The Risks of Web-Scraped Data

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022. While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in downstream models trained on CommonPool data.

Good Intentions Are Not Enough

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab. That content can be personally identifiable information (PII), child sexual abuse imagery, or hate speech. This underscores the need for caution when sharing personal information online, since even well-intentioned data collection can sweep it up.

Conclusion

The risks associated with online data scraping are significant, and it is essential to be aware of the potential consequences of sharing personal information online. With the increasing use of web-scraped data to train AI models, it is crucial to take steps to protect sensitive information and ensure that it does not fall into the wrong hands.

FAQs

Q: What is data scraping?
A: Data scraping is the process of extracting data from websites, often using automated tools.
Q: What is the DataComp CommonPool dataset?
A: The DataComp CommonPool dataset is a large collection of publicly available image-text pairs, often used to train generative text-to-image models.
Q: What are the risks associated with online data scraping?
A: The risks associated with online data scraping include the potential for sensitive information to be accessed and used by others, as well as the risk of hate speech and child sexual abuse imagery being spread.
Q: How can I protect my personal information online?
A: To protect your personal information online, it is essential to be cautious when sharing sensitive information, use strong passwords, and keep your online presence private.
Q: What is the importance of being aware of online data scraping?
A: Being aware of online data scraping is crucial to protect sensitive information and ensure that it does not fall into the wrong hands.