I really do hate pouring fuel on the artificial intelligence (AI) fire, but I have seen quite a few panic-stricken cries this week about blocking ChatGPT bots from your site, e.g. https://www.linkedin.com/posts/analyticsnerd_chatgpt-activity-7044983361940938752-GSFK/
We had an interesting chat here in the office yesterday and I wanted to share my (possibly very wrong) thoughts on the issue.
Why block AI bots?
The argument for blocking access to the AI bots (mainly ChatGPT for now, but Bard is becoming more of a thing) is very simple – why allow them to use your carefully curated content to feed their content monster?
You have, after all, slaved over creating industry leading content and that takes time and a considerable amount of effort. Why on earth should you hand it over to the bots, who will then effectively plagiarise your genius? It feels overly generous to allow your masterpiece(s) to be used as training data for the various AI platforms so that they can produce decent content.
How can you block AI bots?
The easiest way to stop the crawlers munching up your content is to block the Common Crawl bot. It does appear that the spider honours such directives, which you can implement in a couple of ways.
Arguably the fastest method is to add the following to your robots.txt:
User-agent: CCBot
Disallow: /
Another approach is to add a nofollow robots meta tag to each page that you wish to protect. Although this gives you page-by-page control, it introduces the risk of accidentally leaving the tag off the very pages that you wish to hide from the bots, so I would personally recommend a blanket ban using the robots.txt directive.
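If you do want page-level control, the tag itself looks something like this (a minimal sketch, assuming you are targeting the Common Crawl spider specifically via its CCBot user agent name; other crawlers would need their own tag):

<meta name="CCBot" content="nofollow">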
Does blocking the bots actually work?
As mentioned, it does appear that the bots obey requests to prevent crawling.
A very substantial ‘but’ is the fact that there is currently no way to remove content from the Common Crawl dataset. The same is true for other datasets such as C4 and Open Data.
In other words, it is probably too late for the vast majority of the content that you have already published. I am sorry folks, but you have already helped stoke the AI content fire.
What do I think about blocking AI bots?
The discussion we had in the office yesterday was triggered by a (very valid) suggestion by Victoria that we should suggest to our clients that they may wish to prevent the AI bots from accessing their content.
Whilst I absolutely understand why this feels like a sensible approach, I found myself fighting the ‘free access to all content’ corner and, having slept on it, I still think that blocking the AI bots is probably a waste of time.
The first issue is that robots.txt and meta tag directives do not always work: it would be extremely easy for the crawlers to mask their real identity or simply choose to ignore such directives. I really don’t want to get into a running battle of banning each new incarnation of bots, and it feels fairly pointless when you consider that most of your content has already been crawled.
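To illustrate why that battle feels unwinnable, the firmer alternative to robots.txt is a server-level block, which might look something like the sketch below (a hypothetical example for an nginx server, matching only the self-reported CCBot user agent):

    # Hypothetical nginx rule: refuse requests whose User-Agent claims to be CCBot
    if ($http_user_agent ~* "CCBot") {
        return 403;
    }

Any crawler that changes or disguises its user agent sails straight past a rule like this, which is rather the point.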
I am very aware that I may be looking at things from a different perspective. As an agency, we work so hard to create and amplify excellent content that I find it unnatural to want to hide it away. The same goes for my view of gated content: there are many very valid reasons why you may wish to prevent free access to your content, but I typically err on the side of free access, as I am a little bit obsessed with domain authority and appreciate the potential that free content has to build natural links. Preventing access goes against the grain for most SEO-focussed people, who want to share content as far and wide as humanly possible.
Despite the incessant noise about AI, it is still relatively early days and we do not really know whether attribution will come in the future. A very brief play with Bard (Google’s effort) showed that some sources are cited. That is fairly significant in my humble opinion, and preventing access to your content could mean missing out on significant brand exposure.
When you combine that with the very real prospect of Google’s AI bots being used to help inform SEO / SERPs, you really wouldn’t want to miss out on that party.
Whilst I have sympathy for the plagiarism concerns, the reality is that your content is already being used to inspire other content. Research is a key phase of any copywriting project and the bots are just doing what humans have always done – use other content as inspiration. Rather than seeing that as a threat, you may wish to adopt the ‘imitation is the sincerest form of flattery’ mantra and celebrate the fact that your content is being recognised as being brilliant and therefore used as a stimulus for other content.
The future of AI
It feels extremely difficult to picture the future of AI.
A lot of what I see makes me think that we are already living in the future as some of it is incredibly clever. Worryingly so – I have no doubt that the rise of the machines will continue unabated and effectively make a lot of roles redundant.
I also remain adamant that the human brain will always ultimately trump a machine when it comes to content. AI is getting very close, and significantly speeds up the research phase of content production, but subtle nuances and key features of content such as irony remain the preserve of our grey cells.
I also think that it will always be possible to identify AI content. Google is at it themselves, so they will surely be able to spot the signs of spun content and *hopefully* reward the original?
One to watch. I hope that I am right!