Crawlability and the basics of SEO

It’s an ever-elusive acronym: SEO. But what is search engine optimisation? In truth, probably no-one fully understands how search engines evaluate internet content. It’s an ever-evolving game whose rules are continually being redefined, particularly in a search world increasingly dominated by AI.

Even those at Google sometimes struggle to explain how their algorithms evaluate and rank a piece of content, especially when there are more than 200 ranking factors at play. SEO brings with it new ideas, new knowledge and new concepts. Google’s AI-powered bots decide which content to show for which search query; it’s just a matter of understanding the language used to communicate that content across the internet.

To understand SEO, we need to first understand crawlability.


What is crawlability?

Before Google can index a piece of content, it must first be given access to it so that Google’s crawlers (or spiders) – the bots that scan content on a webpage – can determine its place in the search engine results pages (SERPs). If Google’s crawlers cannot find your content, they cannot list it.

Think about a time before the internet. We had listing services like the Yellow Pages. A person could choose to list their phone number for others to find, or choose not to list a number and remain unknown. It’s the same concept on Google. Your web page (whether that’s a blog post or otherwise) must grant crawlers permission before it can be indexed.

Robots.txt files: how do they work?

The internet uses a text file called robots.txt. It’s the standard that crawlers live by, and it outlines the permissions a crawler has on a website (i.e. what it can and cannot scan). Robots.txt is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robot crawlers access and index content on the web.

Want an example? Type a website’s URL into your browser’s address bar and add ‘/robots.txt’ to the end. You should find yourself with a text file that outlines the permissions a crawler has on that website. For example, here is Facebook’s robots.txt file:

[Screenshot: an excerpt from Facebook’s robots.txt file]

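In case the screenshot is hard to read, the relevant rule looks something like this – an illustrative excerpt only, as Facebook’s live file is far longer and changes over time:

    User-agent: Bingbot
    Disallow: /photo.php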
So what we see here is that Bingbot (the crawler used by Bing) cannot access any URL containing ‘/photo.php’. This means Bing cannot show users’ Facebook photos in its SERPs, unless those photos exist outside of the ‘/photo.php’ path.

By understanding robots.txt files, you can begin to comprehend the first stage a crawler (also called a spider; Googlebot is simply Google’s crawler) goes through to index your website. So, here’s an exercise for you:

Go to your website, open your robots.txt file and become familiar with what you do and don’t allow crawlers to do. Here’s some terminology so you can follow along, with a worked example after the list:

  • User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine).
  • Disallow: The command used to tell a user-agent not to crawl a particular URL.
  • Allow (Only applicable for Googlebot. Other search engines have different variations of bots and consequently, different commands): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
  • Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command.
  • Sitemap: Used to call out the location of any XML sitemaps associated with this URL.
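Put together, a small robots.txt file using these directives might look something like the sketch below. The paths and sitemap URL are purely illustrative – substitute your own site’s structure:

    User-agent: *
    Disallow: /private/
    Crawl-delay: 10

    User-agent: Googlebot
    Disallow: /private/
    Allow: /private/press-releases/

    Sitemap: https://www.example.com/sitemap.xml

Here, all crawlers are asked to stay out of the /private/ folder and wait ten seconds between requests, while Googlebot (which ignores Crawl-delay) is allowed into one subfolder of the otherwise disallowed parent.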

Crawlers (spiders): what do they look for?

A crawler is looking for specific technical factors on a web page to determine what the content is about and how valuable that content is. When a crawler enters a site, the first thing it does is read the robots.txt file to understand its permissions. Once it has permission to crawl a web page, it then looks at the following (a simplified example page follows this list):

  • HTTP (Hypertext Transfer Protocol) headers: these carry information about the visitor’s browser, the requested page, the server and more.
  • Meta tags: these are snippets of text that describe what a web page is about, much like the synopsis of a book.
  • Page titles and headings: H1 and H2 tags are read before the body copy, so crawlers get an early sense of what the content is about from these.
  • Images: images should carry alt text, a short descriptor telling crawlers what the image shows and how it relates to the content.
  • Body text: Of course, crawlers will read your body copy to help them understand what a web page is all about.
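To make this concrete, here is a stripped-back sketch of how those elements sit in a page’s HTML. The titles, description and file name are invented for illustration; the point is that each element gives a crawler another clue about the content:

    <html>
      <head>
        <title>Crawlability and the basics of SEO</title>
        <meta name="description" content="A plain-English introduction to how search engine crawlers read your pages.">
      </head>
      <body>
        <h1>Crawlability and the basics of SEO</h1>
        <h2>What is crawlability?</h2>
        <img src="laptop-research.jpg" alt="A person researching SEO on a laptop">
        <p>Body copy goes here, expanding on what the title and headings promise.</p>
      </body>
    </html>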

With this information, a crawler can build a picture of what a piece of content is saying and how valuable it is to a real human reading it.

But here’s the thing…

There are more than 200 ranking factors that Google’s algorithms will consider. It’s a complicated process, but so long as your technical checks are in place, you have a great chance of ranking in the SERPs. Backlinks, for example, are extremely important in determining how authoritative a piece of content is, as is the overall domain authority.

At its core, SEO is about ensuring your content has the correct technical checks in place. It’s about making sure you give crawlers permission in your robots.txt file, that crawlers can easily understand your meta tags, that your page headings are clear and relate to the body copy, and that what you provide your readers is valuable and worth reading. And this last point is quite possibly the most important: value is everything. Because let’s face it, if an algorithm isn’t going to read your content, a human certainly won’t.