Google finally announced a robots.txt flag that lets a website opt out of their AI models without having to block the full Google crawler and drop out of Google results.
In theory this should do it:
User-agent: Google-Extended
Disallow: /
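If you want to sanity-check the rule before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The `is_allowed` helper and the `example.com` URL are made up for illustration. It only shows how a standards-following parser reads the file, not what Google does with it internally.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the post, as it would appear in robots.txt.
RULES = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Google-Extended should be blocked, the regular search crawler should not.
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed" if is_allowed(RULES, agent) else "blocked")
```

The point of the check is the second line of output: the regular Googlebot token is not matched by the Google-Extended group, so search crawling is left alone.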
@hugo I just added it to my site, but it doesn't feel very great that I have to opt out. This should be opt-in.
Sadly, that's not going to happen unless there are laws introduced in both the EU and US.
@loke yep, true. One option is to just block all bots in robots.txt and only opt in to the wanted ones.
@hugo That sounds like a good idea. I'll have to find out how to do that. Do you know of a list of bots that I can use to decide which ones I want?
@loke To disallow all is pretty simple, then you need to decide on the ones you want to allow and that could take a lot more work depending on what you want.
This page shows the basics of robots.txt and how to disallow all.
Now regarding a list of bots that you can decide to use or not, I don't know of one, no.
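A rough sketch of the "block everything, then opt in" idea, in Python so the result can be checked with the standard `urllib.robotparser`. The `WANTED` list and both helper names are assumptions for illustration; swap in whichever crawlers you actually want. Keep in mind robots.txt is only a request, so this works for bots that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist: replace with the crawlers you want to keep.
WANTED = ["Googlebot", "Bingbot", "DuckDuckBot"]


def allowlist_robots_txt(wanted: list[str]) -> str:
    """Build a robots.txt that allows only the named crawlers."""
    lines: list[str] = []
    if wanted:
        # One group: several User-agent lines sharing a single Allow rule.
        lines += [f"User-agent: {agent}" for agent in wanted]
        lines += ["Allow: /", ""]
    # Everyone else falls through to the wildcard group and is disallowed.
    lines += ["User-agent: *", "Disallow: /", ""]
    return "\n".join(lines)


def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    text = allowlist_robots_txt(WANTED)
    print(text)
    for agent in WANTED + ["GPTBot", "Google-Extended"]:
        print(agent, "allowed" if is_allowed(text, agent) else "blocked")
```

The hard part stays the same as described above: deciding which names belong in the list.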
@loke I just did a search and this one doesn't look extensive but maybe it's a start.
@hugo @loke As far as I remember, OpenAI has said GPTBot will only respect explicit disallows matching its user-agent. It'll ignore a generic disallow all, because they deem that's intended for search engines and they can just ignore it.
Other AI crawlers seem to do this too, so be wary of a disallow all without explicit opt-outs for GPTBot and Google-Extended.
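If that recollection is right (it is not verified here), the safe route is to name each AI crawler in its own group in addition to any catch-all. A small sketch that generates such a file; `AI_AGENTS` holds only the two tokens named in this thread, and `explicit_optouts` is a made-up helper name.

```python
# The two AI crawler tokens mentioned in this thread; extend as needed.
AI_AGENTS = ["GPTBot", "Google-Extended"]


def explicit_optouts(agents: list[str], catch_all: bool = True) -> str:
    """Return robots.txt text with one explicit Disallow group per agent.

    With catch_all=True a wildcard group is appended as well, so bots
    that do honour `User-agent: *` are blocked too.
    """
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    if catch_all:
        groups.append("User-agent: *\nDisallow: /\n")
    # Blank line between groups keeps the file readable.
    return "\n".join(groups)


if __name__ == "__main__":
    print(explicit_optouts(AI_AGENTS))
```

Pass `catch_all=False` if you only want the named crawlers blocked and everything else left alone.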
@hugo Thanks, added!
I have some more Google bots on my list:
User-agent: AdsBot-Google-Mobile
Disallow: /
User-agent: AdsBot-Google
Disallow: /
User-agent: Mediapartners-Google
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Storebot-Google
Disallow: /
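Since every entry in that list has the same rule, the groups can be folded into one: the robots.txt format allows several User-agent lines to share a single set of rules. A sketch that builds the compact form and checks it with Python's standard `urllib.robotparser`; the helper names are invented for illustration, and it assumes the crawlers in question parse grouped User-agent lines the way the standard describes.

```python
from urllib.robotparser import RobotFileParser

# The five Google tokens from the list above.
GOOGLE_EXTRAS = [
    "AdsBot-Google-Mobile",
    "AdsBot-Google",
    "Mediapartners-Google",
    "Google-Extended",
    "Storebot-Google",
]


def grouped_disallow(agents: list[str]) -> str:
    """One group: every agent listed, followed by a single Disallow rule."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines += ["Disallow: /", ""]
    return "\n".join(lines)


def blocked(robots_txt: str, agents: list[str]) -> list[str]:
    """Return the subset of `agents` that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [a for a in agents if not parser.can_fetch(a, "https://example.com/")]


if __name__ == "__main__":
    text = grouped_disallow(GOOGLE_EXTRAS)
    print(text)
    print("blocked:", blocked(text, GOOGLE_EXTRAS + ["Googlebot"]))
```

Googlebot is included in the check on purpose: it should not show up in the blocked list, since none of the five tokens match it.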
@hugo damn this needs to be opt in innit
@owl well, if you block every bot in robots.txt then you can opt in just the ones you want. But yes, I understand the feeling, and it's incomprehensible that they have been training their AI models on other people's data for a long time, using everything they wanted from their search index bot.
@hugo I opted out of the OpenAI GPTBot and got delisted by DDG and Bing.
@jan WOW, that's crazy!
@hugo ...good to realise that the large majority will not opt out, just as most websites don't opt out of Google Search. Personally I find it strange if a website would allow Google crawling for search but not for AI, especially if you know where search is heading strategically.
@ErikJonker I also don't think that the vast majority of people will opt out, but I think it's good to have the option for the ones who want to opt out of having their data in the models but want to continue to be on Google Search. But yes, there is a convergence of AI and search, and the distinction can become blurry.
@ErikJonker @hugo It's not really clear what the description means for the way the platform would represent my stuff to the world, is it? I may opt out of training, but without knowing how search data pools and the LLM interact, I have no idea what the consequences of a choice will be.
I don't have skin in the game here, I'm old and inconsequential. But I can feel Parnassus looking down its snout at me with this pattern.
Nice one Google: Acknowledging your blatant thievery whilst offering an olive branch.
Will I be adding this to my website? Probably. Will I continue to be mad that we weren't given a choice to opt out from the beginning? Also yes.
@hugo Boy howdy, do I ever disagree with that first sentence.
"Over the years advances in AI have enhanced our products, benefiting the people who use them, web content creators, and the overall web ecosystem."
AI just seems to have exacerbated and accelerated the enshittification of everything.
@hugo that really should be an opt-in. Too many people who have websites do not know all the settings to implement. And Google is pushing all the work to everyone else. Could be a nice target for the EU and extending the GDPR.
@hugo seems like this should be opt-in, not opt-out
@dgodon yep, like most bots should be, and I don't know a single one that is.
@hugo Wonder if adding that to Mastodon servers would make them not scrape content off of my particular server.
(Suspect they could always just use another one but still, every bit helps)
@tchambers Yep, in theory it should block them from adding the data to their AI models.
@hugo Everyone: disables google AI crawler
me: Fills website with lorem ipsum in every language.
@hugo Promptly added that for the rest of my websites (the main one has had it since I first became aware of this a couple of days ago)