• About Us
  • Contact Us
  • Terms & Conditions
  • Privacy Policy
Technology Hive
  • Home
  • Technology
  • Artificial Intelligence (AI)
  • Cyber Security
  • Machine Learning
  • More
    • Deep Learning
    • AI in Healthcare
    • AI Regulations & Policies
    • Business
    • Cloud Computing
    • Ethics & Society
No Result
View All Result
  • Home
  • Technology
  • Artificial Intelligence (AI)
  • Cyber Security
  • Machine Learning
  • More
    • Deep Learning
    • AI in Healthcare
    • AI Regulations & Policies
    • Business
    • Cloud Computing
    • Ethics & Society
No Result
View All Result
Technology Hive
No Result
View All Result
Home Artificial Intelligence (AI)

Major AI Training Data Set Exposes Millions of Personal Records

Adam Smith – Tech Writer & Blogger by Adam Smith – Tech Writer & Blogger
July 18, 2025
in Artificial Intelligence (AI)
0
Major AI Training Data Set Exposes Millions of Personal Records
0
SHARES
2
VIEWS
Share on FacebookShare on Twitter

Introduction to Online Data Scraping

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.” This statement highlights the risks associated with sharing personal information online, as it can easily be accessed and used by others.

The Extent of Data Scraping

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.

Sensitive Information at Risk

A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

Examples of Scraped Data

Examples of identity-related documents found in CommonPool’s small scale dataset include a credit card, social security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

The DataComp CommonPool Dataset

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use as well.

The Risks of Web-Scraped Data

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022. While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the datasets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data.

The Consequences of Good Intentions

Good Intentions Are Not Enough

“You can assume that any large scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech. This highlights the importance of being cautious when sharing personal information online, even if the intentions are good.

Conclusion

The risks associated with online data scraping are significant, and it is essential to be aware of the potential consequences of sharing personal information online. With the increasing use of web-scraped data to train AI models, it is crucial to take steps to protect sensitive information and ensure that it does not fall into the wrong hands.

FAQs

Q: What is data scraping?
A: Data scraping is the process of extracting data from websites, often using automated tools.
Q: What is the DataComp CommonPool dataset?
A: The DataComp CommonPool dataset is a large collection of publicly available image-text pairs, often used to train generative text-to-image models.
Q: What are the risks associated with online data scraping?
A: The risks associated with online data scraping include the potential for sensitive information to be accessed and used by others, as well as the risk of hate speech and child sexual abuse imagery being spread.
Q: How can I protect my personal information online?
A: To protect your personal information online, it is essential to be cautious when sharing sensitive information, use strong passwords, and keep your online presence private.
Q: What is the importance of being aware of online data scraping?
A: Being aware of online data scraping is crucial to protect sensitive information and ensure that it does not fall into the wrong hands.

Previous Post

“Smart Coach” Helps LLMs Switch Between Text and Code

Next Post

Netflix’s first show with generative AI is a sign of what’s to come in TV, film

Adam Smith – Tech Writer & Blogger

Adam Smith – Tech Writer & Blogger

Adam Smith is a passionate technology writer with a keen interest in emerging trends, gadgets, and software innovations. With over five years of experience in tech journalism, he has contributed insightful articles to leading tech blogs and online publications. His expertise covers a wide range of topics, including artificial intelligence, cybersecurity, mobile technology, and the latest advancements in consumer electronics. Adam excels in breaking down complex technical concepts into engaging and easy-to-understand content for a diverse audience. Beyond writing, he enjoys testing new gadgets, reviewing software, and staying up to date with the ever-evolving tech industry. His goal is to inform and inspire readers with in-depth analysis and practical insights into the digital world.

Related Posts

AI Video Generation Techniques
Artificial Intelligence (AI)

AI Video Generation Techniques

by Adam Smith – Tech Writer & Blogger
September 12, 2025
VMware starts down the AI route, but it’s not core business
Artificial Intelligence (AI)

VMware starts down the AI route, but it’s not core business

by Adam Smith – Tech Writer & Blogger
September 11, 2025
Collaborating with Generative AI in Finance
Artificial Intelligence (AI)

Collaborating with Generative AI in Finance

by Adam Smith – Tech Writer & Blogger
September 11, 2025
DoE selects MIT to establish a Center for the Exascale Simulation of Coupled High-Enthalpy Fluid–Solid Interactions
Artificial Intelligence (AI)

DoE selects MIT to establish a Center for the Exascale Simulation of Coupled High-Enthalpy Fluid–Solid Interactions

by Adam Smith – Tech Writer & Blogger
September 10, 2025
Therapist Caught Using ChatGPT
Artificial Intelligence (AI)

Therapist Caught Using ChatGPT

by Adam Smith – Tech Writer & Blogger
September 9, 2025
Next Post
Netflix’s first show with generative AI is a sign of what’s to come in TV, film

Netflix’s first show with generative AI is a sign of what’s to come in TV, film

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Latest Articles

NetEase to Shut Down Public Cloud Service Amid AI Competition

NetEase to Shut Down Public Cloud Service Amid AI Competition

March 11, 2025
What is Sam Altman’s IQ?

What is Sam Altman’s IQ?

March 1, 2025
Becoming a Chief AI Officer

Becoming a Chief AI Officer

April 11, 2025

Browse by Category

  • AI in Healthcare
  • AI Regulations & Policies
  • Artificial Intelligence (AI)
  • Business
  • Cloud Computing
  • Cyber Security
  • Deep Learning
  • Ethics & Society
  • Machine Learning
  • Technology
Technology Hive

Welcome to Technology Hive, your go-to source for the latest insights, trends, and innovations in technology and artificial intelligence. We are a dynamic digital magazine dedicated to exploring the ever-evolving landscape of AI, emerging technologies, and their impact on industries and everyday life.

Categories

  • AI in Healthcare
  • AI Regulations & Policies
  • Artificial Intelligence (AI)
  • Business
  • Cloud Computing
  • Cyber Security
  • Deep Learning
  • Ethics & Society
  • Machine Learning
  • Technology

Recent Posts

  • Create a Voice Agent in a Weekend with Realtime API, MCP, and SIP
  • AI Revolution in Law
  • Discovering Top Frontier LLMs Through Benchmarking — Arc AGI 3
  • Pulling Real-Time Website Data into Google Sheets
  • AI-Powered Agents with LangChain

Our Newsletter

Subscribe Us To Receive Our Latest News Directly In Your Inbox!

We don’t spam! Read our privacy policy for more info.

Check your inbox or spam folder to confirm your subscription.

© Copyright 2025. All Right Reserved By Technology Hive.

No Result
View All Result
  • Home
  • Technology
  • Artificial Intelligence (AI)
  • Cyber Security
  • Machine Learning
  • AI in Healthcare
  • AI Regulations & Policies
  • Business
  • Cloud Computing
  • Ethics & Society
  • Deep Learning

© Copyright 2025. All Right Reserved By Technology Hive.

Are you sure want to unlock this post?
Unlock left : 0
Are you sure want to cancel subscription?