Over the last few years, your digital footprint has become a goldmine for AI companies. At every opportunity, your tweets, snaps, videos and Facebook posts are being gleefully scraped by tech industry ‘innovators’, eager to develop the Next Big AI thing off the back of your data. It doesn’t matter whether it’s a drunken Facebook rant you posted back in 2013 or a cringey LinkedIn inspirational post you don’t even remember writing. If it’s still floating round cyberspace, the likelihood is it’s been gobbled up and used to train large language models (LLMs).

While GDPR offers UK users some protection, it isn’t a cast-iron guarantee that you have full control over how your data is used. It also doesn’t bode well that the giants of tech are seemingly able to bat away lawsuits like flies (yes, I’m looking at you, Google).

How does your data end up in AI training?

AI systems are trained on vast datasets scraped from a variety of publicly available sources, such as social media posts, forums and websites. Once your data is in the training mix, it can be hard to fish it out again.

If the idea of being an unwitting contributor to the AI models that will generate big tech’s next billion leaves a bad taste in your mouth, you’re not alone. The good news is, in this latest battle of Man vs the Machine, there are measures you can take to stop your data from falling into greedy robot hands.

How to prevent your data being used for AI training

Twitter/X

On 15th November, X rolled out an update to the platform’s terms of service, allowing it to gather user data from tweets, photos and videos to train its AI model ‘Grok’.

By posting on The Artist Formerly Known As Twitter, you’re granting it the right to “analyze text and other information you provide…for use with and training of our machine learning and artificial intelligence models, whether generative or another type”.

This latest move obviously didn’t go down well with some people, particularly at a time when X is shedding hundreds of thousands of users, due in no small part to the platform becoming a mouthpiece for Musk and his fellow Trump fanboys.

Bailing Twitter post image

At present, you *can* still opt out of having your data pinched (for how much longer though – who knows?) by following these steps:

  • Open ‘Settings and privacy’.
  • Select ‘Privacy and safety’.
  • Scroll down and select ‘Grok’.
  • Untick the box that says ‘allow your posts as well as your interactions, inputs, and results with Grok to be used for training and fine-tuning’.

(As a personal aside, we would then recommend deleting your account altogether – it feels very satisfying!)

Meta (Facebook and Instagram)

Meta doesn’t have an ‘opt out’ option per se. You do, however, have the right to ‘object’ to having your data used, and Meta then decides whether to ‘honour’ the objection or not. In the UK, the objection is upheld almost instantaneously (thanks to good ol’ GDPR). It’s less straightforward for users in the USA, where data protection seems to be more of a grey area. Regardless of where you are in the world, you do have to take a rather convoluted route to find the correct form. To save you the hassle, I have included the link to the form here, which covers both Facebook and Instagram. Once you’ve completed the form, you should then get the message below:

Facebook AI objection image

LinkedIn

Once again, GDPR offers a degree of protection for UK users, with LinkedIn stating: ‘Note that we do not currently train content-generating AI models from members located in the EU, EEA, UK, Switzerland, Hong Kong, or Mainland China’. For anyone outside of these areas though, LinkedIn ‘did an X’ and brought in data collection for AI back in September, a change that wasn’t very well publicised at the time. Even if you’re in the UK, it’s still worth following the steps outlined in this Mashable article to turn off the setting, as you never know when the goalposts will be surreptitiously moved.

Bluesky

The current darling of social media is keeping its halo well and truly polished by announcing that it has no intention of using users’ data to train AI, so if you have joined the millions of others who have flocked to this platform, your data is safe.

Blue Sky AI stance image

Gmail

Nestled within Gmail’s settings is a feature called Smart Compose, a predictive text tool that uses your own emails, chats and video content to predict what you might say next. This service is designed to ‘personalise’ your experience across Google products, not just your email (e.g. Google Docs, YouTube and Slides). Where it gets a bit hazy is the extent to which your email and chat content is being shared with third parties, including advertisers.

You can, however, opt out of letting Google use your email and chat data to train Smart Compose, which can be done via your desktop or phone app, as shown below (available in your Gmail Settings > General tab):

Smart Features Google image

Microsoft 365

Microsoft has firmly rebuffed claims that it uses customer data from Word and Excel to train its AI models. A blog post by Casey Lawrence suggested Microsoft had implemented an opt-out feature for AI training, raising concerns about document scraping. However, Microsoft quickly responded on social media, stating: 'In the M365 apps, we do not use customer data to train LLMs. This setting only enables features requiring internet access like co-authoring a document'.

Snap

Snapchat has its own chatbot called My AI, which is trained on information you share with the app. As with other platforms, if you engage with the chatbot directly, that information may be retained and used to train it. Other data, such as your location, may also end up being used to inform the chatbot, depending on your settings in the app. If you don’t want to give My AI access to your location, you will have to revoke location permissions. However, you may need to go a step further and delete your data from My AI altogether to prevent it being cached by the app.

Your website

As well as having personal data scraped to train the new machines, your website is very much ‘in play’ too. The web was built to be open for anyone and the new breed of tech companies have sucked up as much indexable content as they can to build their new businesses. Being part of the ChatGPT experience is great now, but what about in the future when it gives all of the answers to users without giving you the traffic or credit?

If you want to stop your website from being indexed by the likes of ChatGPT, Claude, Google and others, then there are options. Like traditional search indexing, you need to refer back to your robots.txt file and add in some extra blocks. This article by Neil Clarke outlines some of the things that you need to do to stop any more content from being indexed.
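As a rough illustration of what those extra blocks might look like, here is a minimal Python sketch that builds the robots.txt rules and checks them with the standard library’s robots.txt parser. The crawler names in the list (GPTBot, ClaudeBot, Google-Extended and CCBot) are assumptions based on the tokens these companies have published; they change over time, so check each company’s own documentation before relying on them. The function name and the example.com address are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed AI crawler user-agent tokens. Verify these against each
# company's current documentation, as the names change over time.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_ai_block_rules(user_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(blocks) + "\n"


# Text to append to the robots.txt file at the root of your site.
rules = build_ai_block_rules(AI_CRAWLERS)
print(rules)

# Sanity check: parse the rules and ask whether a crawler may fetch a page.
parser = RobotFileParser()
parser.parse(rules.splitlines())

# example.com is a placeholder address, not a real site setting.
ai_crawler_allowed = parser.can_fetch("GPTBot", "https://example.com/blog/")
search_crawler_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/")
```

With these rules in place, the listed AI crawlers are asked to stay away from the whole site, while ordinary search crawlers such as Googlebot are unaffected. Bear in mind that robots.txt is a request rather than a lock: well-behaved crawlers follow it, but nothing forces them to.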

Final thoughts

Without meaning to come off all dystopian, it does feel like our data has never been more vulnerable than it is right now. Whilst GDPR affords us some degree of protection here in the UK, it can’t do all the work for us, so it’s worth taking the steps above to retain control of our own information. Stay vigilant, read the fine print and remember this golden rule – more often than not, a digital footprint is for life, not just for Christmas.

Meet the author ...

Anna Heathcote

Content Manager

Based way up on the Northumbrian coast, Anna uses her creative copywriting expertise and SEO experience to ensure clients have fresh, relevant and optimised content on their ...