You may have noticed an increase in referral traffic from ‘spam’ sites. Some common ones that tend to show up in Google Analytics under the referral report are:
And there are many, many more. These will no doubt be jacking up your reporting – not only inflating your visits, but also skewing engagement metrics like bounce rate and time on site.
Recently, some porn sites have been added to the mix – great news for anyone reporting on top sources of referral traffic for smaller clients! Wahey.
What do I need to be rid of?
There are 2 key types of referrers that you will need to identify and exclude:
Questionable crawlers (such as semalt.com) and fake referrers
These guys will often visit your site in the hope that you’ll notice it in the referral report and click the link to see what the source is. I’d strongly recommend that you don’t do this, as there is the risk that these sites may contain a virus.
Our old favourite semalt.com on the other hand seems to be crawling sites so that when you visit their site, they can promote their ‘fabulous’ SEO tool. If their crawler software was any cop it wouldn’t inflate your visits, or hog your bandwidth. While they may claim that this is not spammy, I’d disagree!
They can usually be identified as they will be using a valid hostname – i.e. they use your hostname, making it hard to filter them out of your data.
Ghost referrers
A real pain in the backside, as they don’t actually visit your site, meaning that you can’t block them at the server. Ghost referrers can usually be identified by fake hostnames.
So, how do I fix this?
Because these visits are utter cack, you’ll want to exclude them from your reports. The question is, how?
Before we begin, I have some sad news to share with you. There is no foolproof way of excluding 100% of your referral spam long term. Sorry.
A number of methods have been suggested to prevent them from showing up in Google Analytics, but they all come with issues, or do not work at all.
Block these sites in the robots.txt file
You can add them if you like, but bear in mind that most bad bots and crawlers will ignore this anyway, and remember, ghost referrers don’t actually visit your site either.
Block these sites in the .htaccess file (on Apache servers)
Blocking these sites via the .htaccess file will work for any low-level bots that actually hit your server, so if your site is on an Apache server, by all means exclude them. Unfortunately, many of these sites are ghost referrers that never hit your server at all, so blocking them here has no effect. The good news is that it is effective against questionable crawlers and fake referrers.
Always make a backup before attempting to make changes to the .htaccess file! Help on how to add these sites can be found in this fabulous post here.
Use the ‘Exclude hits from all known bots and spiders’ checkbox in Google Analytics
There are loads of bots and spiders cropping up on a daily basis, but this will exclude known bots and crawlers, so tick that lovely box. However, semalt.com is not excluded by ticking this box – and I’ve been whinging about them causing havoc with my data for at least a year! Womp.
Use the referral exclusion list in Google Analytics
This is not intended to be used to exclude this sort of referral. Its purpose is to prevent sources from starting a new session while a current session is active, e.g. third-party-hosted shopping or payment solutions like PayPal. Don’t use this to exclude referral spam.
A whole bit just on filters…brace yourselves.
Before adding any filters, create a new view – it’s a good idea to leave a raw data view untouched, even if it is full of rubbish data, just in case something goes wrong.
Also, each filter pattern has a 255 character limit. To filter out referrer spam, I currently have 5 separate filters set up as there are so many of these bad boys wrecking up my data. Fun times.
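If you'd rather not count characters by hand, here is a minimal Python sketch of how a long list of spam domains could be packed into several filter patterns that each stay under the 255 character limit. The function name and the `.test` domains are made up for illustration, and escaping only the dots is my own simplification rather than anything Google prescribes.

```python
FILTER_PATTERN_LIMIT = 255  # the character limit mentioned above


def build_filter_patterns(domains, limit=FILTER_PATTERN_LIMIT):
    """Pack domains into pipe-separated patterns, each no longer than `limit`."""
    patterns = []
    current = ""
    for domain in domains:
        # Escape the dots so they only match a literal "." (an assumption:
        # domains here contain no other regex special characters).
        escaped = domain.replace(".", r"\.")
        if len(escaped) > limit:
            raise ValueError(f"{domain!r} is too long for a single pattern")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) > limit:
            # The current pattern is full: store it and start a new one.
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


# semalt.com comes from the post; the .test domains are hypothetical.
spam_domains = ["semalt.com", "spam-referrer-one.test", "spam-referrer-two.test"]

for number, pattern in enumerate(build_filter_patterns(spam_domains), start=1):
    print(f"Filter {number} ({len(pattern)} chars): {pattern}")
```

Each printed pattern can then be pasted into its own filter, which is roughly how I ended up with 5 of them.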
Exclude locations of known spam sites using filters
This is a pretty extreme method of filtering out traffic, as it can remove legitimate visits from your data.
It may still be worthwhile setting this up as a filter on a different view if you do notice a lot of spam traffic coming from certain countries.
To set this up, create a new view in Google Analytics and name it something snazzy like ‘Exclude Spam Visits – Location’
Then go into filters under your account level settings and create a new filter. Name it, select the custom filter type, choose exclude, set the filter field to Country, enter the name of the location/s you want to exclude in the filter pattern box, and choose the view you want the filter to be applied to.
Include only valid hostnames using filters
OK, so this is as close to a perfect solution as you are going to get to be rid of ghost referrers; but it comes with risks. This should only be implemented if you have sufficient Google Analytics data to identify all valid hostnames – I’d recommend looking back over a year at least as messing this up will exclude valid traffic!
First of all, identify all valid hostnames that may use your website tracking ID; this could include your own domain and any other sites you track as part of your web ecosystem, such as payment gateways, ecommerce shopping carts, and all reserved domains. Also, ensure that you include webcache.googleusercontent.com and translate.googleusercontent.com, as these are used by Google.
There is a great guide on how to do this right here.
It is worth bearing in mind that this will not exclude the bot referrers that set the hostname to your own!
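To get a feel for what an include filter like this does before you let it loose on real data, here is a rough Python sketch. The `example.com` hostnames are placeholders for your own, the sample hostnames are invented, and the `^(...)$` anchoring is my own choice to avoid partial matches, so treat it as an illustration rather than the exact pattern to paste in.

```python
import re

# Hypothetical valid hostnames; swap in the ones from your own reports.
valid_hostnames = [
    "www.example.com",
    "example.com",
    "webcache.googleusercontent.com",
    "translate.googleusercontent.com",
]


def hostname_include_pattern(hostnames):
    """Build one anchored pattern that matches only the listed hostnames."""
    escaped = [hostname.replace(".", r"\.") for hostname in hostnames]
    return "^(" + "|".join(escaped) + ")$"


pattern = hostname_include_pattern(valid_hostnames)
hostname_re = re.compile(pattern, re.IGNORECASE)


def is_valid_hostname(hostname):
    """Return True if a hit with this hostname would be kept."""
    return hostname_re.search(hostname) is not None


# Invented sample hostnames to try the pattern against.
for sample in ["www.example.com", "(not set)", "spam-host.test"]:
    print(sample, "kept" if is_valid_hostname(sample) else "dropped")
```

Running a year's worth of hostnames from your reports through something like this first is a cheap way to spot a valid hostname you forgot. It still won't catch a bot that sets the hostname to your own.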
Exclude the spam sites using filters on referral or campaign source
This is another method that will work to an extent, but it will need to be updated frequently as new spam referrers crop up.
There are three ways of doing this – I’m yet to decide which is the most effective method, and a number of blogs will vouch for one method over another.
To set this up, you can:
Create a new view in Google Analytics and name it something like ‘Exclude Spam Visits – Referrals’
Then go into filters under your account level settings and create a new filter. Name it, select the custom filter type, choose exclude, set the filter field to Campaign Source, enter the names of the sites you want to exclude in the filter pattern box, and choose the view you want the filter to be applied to.
Another way of doing this is to follow the same method as above, but exclude by Referral or Request URI instead. Campaign Source seems to do the job just fine though, as it excludes at domain level so that you don’t have to match the full referrer – only the domain.
There is a great, up-to-date list of spam referrers that you can check to make sure you are blocking them all right here.
Note: This will need to be set up as multiple filters now as the list is so long!
Excluding referral spam retrospectively
Exclude the spam using custom segments
When it comes to reporting, thankfully you can set up a custom segment in pretty much the same way as the filter to remove spam visits retrospectively. Whoop!
Click on create segment, then add new segment.
Then under advanced, select conditions, exclude, source, matches regex, and enter those pesky sites with a pipe (|) separating each one.
Note: You don’t need to use the full regex for this (with the \ before the .) – just the domains.
Give it a cheeky preview to make sure it is working – you should see these visits disappear from the data in the table.
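If you are wondering why the unescaped version works, an unescaped dot simply matches any character, so it still matches the real dot in the domain (and, in theory, a bit more). A small Python sketch shows the difference. The `.test` domain and the sample sources are made up, and Python's regex engine is standing in for whatever Google Analytics uses.

```python
import re

# The same two domains, without and with the backslash before each dot.
unescaped = "semalt.com|spam-referrer.test"
escaped = r"semalt\.com|spam-referrer\.test"


def matches(pattern, source):
    """Return True if the pattern matches anywhere in the source."""
    return re.search(pattern, source) is not None


# Invented sample sources; "semaltxcom" is not a real referrer.
for source in ["semalt.com", "semalt.semalt.com", "semaltxcom", "google"]:
    print(source, matches(unescaped, source), matches(escaped, source))
```

The only source the two patterns disagree on is the made-up `semaltxcom`, which is why leaving the backslashes out is rarely a problem in practice.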
While far less of a problem, you may also want to exclude some funky keywords using the same method above, but selecting keyword instead of source.
Thanks for that.
Finally, add a filter that shows just valid hostname traffic.
In traffic > referrals apply hostname as a secondary dimension:
Once applied, you’ll probably see some instances where the hostn
So to recap, the segment needs to be set up as follows:
- Under the condition source the filter should be set to exclude spam referrers with regex selected from the drop down, with a list of all spam referrers, each separated by a |
- Under the condition keyword the filter should be set to exclude spam keywords with regex selected from the drop down, with a list of all spam keywords, each separated by a |
- And finally, on the condition hostname the filter should be set to include your hostname and other valid hostnames with regex selected from the drop down, each separated by a |
Click save and you will be good to go – make sure you use your correct domain name (www or non-www) and ensure there are no spaces in the regex entered, or it will not work.
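The three conditions in the recap can be sketched in Python to see how they combine. Everything named here is an assumption for illustration: the `Session` class, the `.test` domains, the spam keyword and the `example.com` hostname are all invented, and I'm assuming partial, case-insensitive regex matching.

```python
import re
from dataclasses import dataclass


@dataclass
class Session:
    source: str
    keyword: str
    hostname: str


def join_pattern(items):
    """Join items with a pipe and refuse patterns containing spaces."""
    pattern = "|".join(items)
    if " " in pattern:
        raise ValueError("Pattern contains a space, which would break it")
    return pattern


# Hypothetical lists; only semalt.com and the Google hostnames are from the post.
SPAM_SOURCES = join_pattern(["semalt.com", "spam-referrer.test"])
SPAM_KEYWORDS = join_pattern(["spam-keyword"])
VALID_HOSTNAMES = join_pattern([
    "www.example.com",
    "webcache.googleusercontent.com",
    "translate.googleusercontent.com",
])


def keep(session):
    """Apply the two exclude conditions, then the hostname include condition."""
    if re.search(SPAM_SOURCES, session.source, re.IGNORECASE):
        return False
    if re.search(SPAM_KEYWORDS, session.keyword, re.IGNORECASE):
        return False
    return re.search(VALID_HOSTNAMES, session.hostname, re.IGNORECASE) is not None


sessions = [
    Session("google", "(not provided)", "www.example.com"),
    Session("semalt.semalt.com", "", "www.example.com"),
    Session("news-site.test", "", "(not set)"),
    Session("partner.test", "spam-keyword", "www.example.com"),
]

for session in sessions:
    print(session.source, "kept" if keep(session) else "removed")
```

Only the first made-up session survives all three conditions, which is the behaviour you want to see in the segment preview.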
If you like, though, just import a segment from the solutions gallery (search for something like ‘exclude spam’) and some lovely person has done most of the hard work for you.
I hope that this post has helped in some way to prevent a nervous breakdown when trying to make sense of your data in Google Analytics. I wish I had some kind of magical ninja powers to provide an awesome, easy-to-implement solution to sort this spam issue, but sadly I don’t. Let us all pray that the almighty Google sorts it out, somehow.
If anybody has any other suggestions – let us know!
Also published on Medium.