Improving crawl efficiency

SEO has evolved, but some things never change. High-quality content, authoritative and relevant links, and a well-structured, well-maintained site are all crucial parts of an SEO strategy, but if search engines can’t crawl and index the pages you’ve invested time and effort in, none of this matters. This is why the crawlability of your site is important.

What is crawling?

Search engine spiders (also known as web crawlers and bots) crawl your site to fetch its contents, which are then added to their index. Once crawled and indexed, your website’s URLs can appear in search engine results pages (SERPs). For sites with more than 1,000 pages, crawl efficiency becomes more important: the bigger your site gets, the longer a crawl takes, so it’s important that your crawl budget is not wasted on pages you do not want indexed.

Why is it important for search engines to crawl a site efficiently?

Crawl budget is determined by two things: the number of pages a search engine spider wants to crawl (crawl demand) and the number of pages it is able to crawl (crawl rate). A spider will stop crawling a site once its crawl budget is used up, meaning that important pages may end up not being indexed.

According to Google, there are two factors that determine crawl demand:

Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in the index
Staleness: Google attempts to prevent URLs from becoming stale in the index

This demonstrates the importance of adding and updating content frequently – as well as promoting it on a regular basis to ensure your newest and most popular content is being indexed quickly.

What can prevent efficient crawling, and how can I fix crawl issues?

There are a number of issues that can affect how search engines crawl your site.

Errors

A site that has a lot of errors, for example, a high number of 404s and 500 server errors, is likely to be wasting crawl budget.

When a search engine spider hits a page with an error, it moves on to the next URL. This becomes a problem when it runs into error after error, particularly server errors, which can suggest that the site is unable to cope with being crawled. The spider will then slow its crawl rate to avoid causing further problems that might crash the site. These errors can be checked in Google Search Console and Bing Webmaster Tools and marked as resolved once fixed, and they should be monitored regularly.
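
As a rough way to see how much of a crawl is being spent on errors, the status codes from a crawl export can be tallied by class. This is only a sketch: the function names and the sample data are hypothetical, and it assumes the export has already been reduced to (URL, status code) pairs.

```python
from collections import Counter


def summarise_statuses(crawl_results):
    """Count crawled URLs by status class (2xx, 3xx, 4xx, 5xx)."""
    counts = Counter()
    for _url, status in crawl_results:
        counts[f"{status // 100}xx"] += 1
    return dict(counts)


def server_error_share(crawl_results):
    """Return the fraction of crawled URLs that answered with a 5xx error."""
    if not crawl_results:
        return 0.0
    errors = sum(1 for _url, status in crawl_results if 500 <= status < 600)
    return errors / len(crawl_results)


# Hypothetical crawl export: (URL, HTTP status code)
sample = [
    ("/womens-shoes", 200),
    ("/womens-shoes/trainers", 200),
    ("/old-sale", 404),
    ("/checkout", 500),
]

print(summarise_statuses(sample))
print(server_error_share(sample))
```

A rising share of 5xx responses between crawls is the pattern to watch for, since those are the errors most likely to slow the crawl rate.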

Excessive query parameters

This issue is rife on e-commerce sites that utilise query parameters when applying filters. For example, if a user starts on the URL www.chickadee.co.uk/womens-shoes/trainers, but then uses filters to see Converse, in black, and in a size 5, the URL ends up looking like this:

www.chickadee.co.uk/womens-shoes/trainers?brand=converse&colour=black&size=5

Now imagine every time any user adds a filter, a similar URL is served. This results in thousands and thousands of URLs that require crawling.
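
To get a feel for how quickly this adds up, the number of possible filtered URLs can be estimated from the number of options per filter. The figures below are invented for illustration, and the sketch assumes each filter is either unset or set to a single value, with parameters always appearing in the same order.

```python
from math import prod


def count_filter_urls(filter_options):
    """Count distinct filtered URLs for one category page.

    Each filter contributes (number of options + 1) choices, the extra one
    being "not applied". The unfiltered base URL is subtracted at the end.
    """
    return prod(n + 1 for n in filter_options.values()) - 1


# Hypothetical option counts for a single category page
options = {"brand": 20, "colour": 10, "size": 12}

print(count_filter_urls(options))  # 21 * 11 * 13 - 1 = 3002
```

That is one category page. If parameter order can vary, or several values can be selected per filter, the real number is higher still.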

Many people assume this can be fixed by adding the canonical tag <link rel="canonical" href="https://www.chickadee.co.uk/womens-shoes"> to each URL with a parameter. While this helps to keep the parameterised page out of the index, it won’t stop it from being crawled. URLs with parameters should be excluded from being crawled, either by marking the links themselves with a nofollow attribute, by blocking them via robots.txt, or by using the parameter tools within Google Search Console and Bing Webmaster Tools.
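
Before excluding anything, it helps to audit which crawled URLs actually carry filter parameters and which parameter-free URL each one collapses to. The sketch below uses Python’s standard urllib.parse module; the FILTER_PARAMS set is an assumption taken from the example URL above and would need to match your own site’s parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the filter parameters used in the example URL above
FILTER_PARAMS = {"brand", "colour", "size"}


def has_filter_params(url):
    """Return True if the URL's query string contains any filter parameter."""
    query = urlsplit(url).query
    return any(key in FILTER_PARAMS
               for key, _ in parse_qsl(query, keep_blank_values=True))


def strip_filter_params(url):
    """Return the URL with filter parameters removed, keeping any others."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in FILTER_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


url = ("https://www.chickadee.co.uk/womens-shoes/trainers"
       "?brand=converse&colour=black&size=5")
print(has_filter_params(url))
print(strip_filter_params(url))
```

Running this over a crawl export shows how many crawled URLs are just filtered views of the same page.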

Obviously, if your site relies on URL parameters to serve content (i.e. when a certain category/product does not have a static landing page) this will be a disaster, so only implement this if you are certain that excluding parameters will not affect your site negatively. You should consider adding static pages rather than relying on URL parameters for those that are a priority – this will help with internal and external linking and help to increase your visibility in SERPs too.

Example:

If parameters are excluded: www.chickadee.co.uk/womens-shoes/trainers?brand=converse

It is difficult to write optimised content for this URL. It can be linked to, but it will not be crawled and indexed, so it will not show in search results, and any links pointing to it lose their value.

If URL is static: www.chickadee.co.uk/womens-shoes/trainers/converse

This is a dedicated landing page, so optimised content can be written for it. It can be linked to, and it will be crawled and indexed, so it will show in search results. Any links pointing to the URL will pass link juice, contributing to page/domain authority.

Duplication and thin content

Again, the canonical tag can be used to prevent duplicate content from being indexed, but this does not prevent it from being crawled.

Entire domain duplication can occur when multiple versions of a website are accessible to search engine spiders.

http://chickadee.co.uk
http://www.chickadee.co.uk
https://chickadee.co.uk
https://www.chickadee.co.uk

301 redirects should be put in place from duplicate versions of the site to the preferred domain.
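
The redirect logic itself normally lives in the web server configuration, but the mapping can be sketched in a few lines to check which of the versions above need a 301 and where they should go. This sketch assumes the https www version is the preferred one; the constant and function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the https + www version is the preferred domain
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.chickadee.co.uk"
KNOWN_HOSTS = {"chickadee.co.uk", "www.chickadee.co.uk"}


def redirect_target(url):
    """Return the preferred URL to 301 to, or None if no redirect is needed.

    URLs on hosts outside KNOWN_HOSTS are left alone (None).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in KNOWN_HOSTS:
        return None
    if parts.scheme == PREFERRED_SCHEME and host == PREFERRED_HOST:
        return None
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST,
                       parts.path, parts.query, parts.fragment))


for candidate in ("http://chickadee.co.uk",
                  "http://www.chickadee.co.uk",
                  "https://chickadee.co.uk",
                  "https://www.chickadee.co.uk"):
    print(candidate, "->", redirect_target(candidate))
```

Each duplicate version maps straight to the preferred one in a single hop, which avoids creating a chain such as http to https to www.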

Thin content is another issue that can use up crawl budget unnecessarily. If pages with thin content cannot be, or do not need to be, rewritten to expand on the content, add a noindex, nofollow robots meta tag to those pages, and add a nofollow attribute to any links pointing to them, to prevent them from being crawled. Failing that, evaluate whether these pages are required at all, and if not, redirect them to a single, more useful page.

Excessive 301 redirects and redirect loops

A common issue solved with a 301 redirect is URLs with trailing vs. non-trailing slash:

http://www.chickadee.co.uk/womens-shoes
http://www.chickadee.co.uk/womens-shoes/

For a few URLs, no biggie. But if an entire website uses non-trailing slash URLs while all the links in the navigation and all external links use the trailing slash, that’s a lot of unnecessary redirects. Pick one or the other and update the links accordingly, or ask a developer to look into using a rewrite rule to add or remove trailing slashes. Another example of this could be when migrating a domain and not updating all of the internal links – again, make sure that these are updated to prevent unnecessary redirects.
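
One way to find the links that need updating is to compare each internal link path against the chosen convention. The sketch below is illustrative only: the function names are hypothetical, and it assumes the links are page paths rather than files such as /sitemap.xml, which should not be given a trailing slash.

```python
def normalise_slash(path, trailing=False):
    """Return the path in the chosen convention; the root path stays "/"."""
    if path in ("", "/"):
        return "/"
    stripped = path.rstrip("/")
    return stripped + "/" if trailing else stripped


def links_needing_update(link_paths, trailing=False):
    """List the link paths that do not match the chosen convention."""
    return [path for path in link_paths
            if normalise_slash(path, trailing) != path]


# Hypothetical navigation links on a site that serves non-trailing slash URLs
nav_links = ["/womens-shoes/", "/womens-shoes/trainers", "/"]

print(links_needing_update(nav_links, trailing=False))
```

Every path in the result is a link that currently costs a redirect each time it is followed.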

Redirect chains and loops are also a huge waste of crawl budget. A 301 redirect should be used to tell search engines page A points to page B – not page A points to page B, points to page C, points to page D (a redirect chain) or page A points to page B points to page A (a redirect loop).

Unnecessary redirects are bad, and redirects that do not follow a linear path should be resolved not only for the benefit of crawl budget, but also for users, who may get stuck in a redirect loop or experience issues with page load while going through several redirects.
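
Both patterns described above (A to B to C to D, and A to B back to A) can be found by following each redirect through a map of source to target URLs, such as one exported from a crawler. This is a minimal sketch; the function name, the labels it returns and the max_hops cut-off are assumptions made for illustration.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow `start` through a {source: target} redirect map.

    Returns (path, kind), where kind is "none" (no redirect), "single"
    (one hop), "chain" (two or more hops) or "loop" (a URL is revisited).
    Tracing stops after max_hops hops.
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects and len(path) <= max_hops:
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, "loop"
        seen.add(current)
    hops = len(path) - 1
    if hops == 0:
        return path, "none"
    return path, "single" if hops == 1 else "chain"


# Hypothetical redirect maps
chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
loop = {"/a": "/b", "/b": "/a"}

print(trace_redirects("/a", chain))
print(trace_redirects("/a", loop))
```

Anything reported as a chain can be fixed by pointing the first URL directly at the final destination; anything reported as a loop needs one of its redirects removed.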

Page load times

Having fast loading pages benefits SEO in many ways, and crawling is no different – a site is not going to be crawled if it times out. Make sure servers do not take too long to respond, and run page speed tests to identify issues and improvements.

What tools can help to identify crawl issues?

There are a number of crawlers that can be used to identify crawl issues. Screaming Frog’s free version will crawl up to 500 URLs, but is missing many useful features, so paid is the way to go. Xenu is a free crawler that does a pretty good job – and there is no limit on the number of URLs.

Note that you need to be careful with these tools – especially if your site is prone to slowing down or crashing.

You should also check Google Search Console and Bing Webmaster Tools for crawl errors on a regular basis, and keep your sitemaps up to date.

Think there is an issue with the way your site is being crawled? Have a chat with a member of our SEO team today.