Search engines and SEOs both use a robot/crawler to crawl URLs. It is thanks to this crawling that they can analyze pages and their content. However, SEO robots and search engine robots have important differences in their operation.
1. Page discovery
To discover new pages on the Internet, a search engine uses various information sources.
Google, for example, discovers URLs using:
- links encountered during crawls on known pages on any site
- URLs in an XML sitemap
- URLs submitted through the URL inspection tool
In addition, other search engines like Bing allow you to submit a list of URLs through an API.
The URLs of all these sources are added to the list of pages to crawl.
Conversely, an SEO robot only discovers URLs by crawling through the structure of your website. This gives the SEO robot a more restricted view of your site. For example, an AdWords landing page that has no inbound internal links, but that also served as a landing page for a social media campaign, will be unknown to SEO bots… but will be quickly found by a search engine!
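The difference in discovery can be pictured as simple set arithmetic. The sketch below is only an illustration: the URL paths and the three source sets are hypothetical, and a real search engine draws on more sources than these.

```python
# Hypothetical URL sets; none of these paths come from a real site.
linked_internally = {"/", "/products", "/blog"}
in_sitemap = {"/", "/products", "/blog", "/landing/spring-sale"}
linked_externally = {"/landing/spring-sale", "/blog"}


def search_engine_view(*sources):
    """Union of every discovery source a search engine may draw on."""
    known = set()
    for source in sources:
        known |= source
    return known


def seo_crawler_view(internal_links):
    """A link-following SEO crawler only knows internally linked URLs."""
    return set(internal_links)


engine_urls = search_engine_view(linked_internally, in_sitemap, linked_externally)
crawler_urls = seo_crawler_view(linked_internally)

# Pages the search engine knows about but the SEO crawler never reaches.
orphans = engine_urls - crawler_urls
print(sorted(orphans))
```

Any URL left in `orphans` is a page that only a crawl seeded with extra sources (sitemaps, URL lists, logs) would find.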
2. Exploration temporality
Not all at once!
URLs known to a search engine are added to a crawling list. As we have seen, they come from different sources, so consecutive pages of your site may not sit next to each other in this list. To confirm this, take a look at the bot hits in your log files.
Google has indicated that, per crawl session on a site, various elements can limit the number of pages crawled:
- Google does not crawl more than 5 links in a chain of redirects per session
- Google can shorten a crawl session if your website’s server is not responding quickly enough.
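As an illustration of the first limit, here is a minimal sketch of a crawler that gives up on a redirect chain after a fixed number of hops. The limit of 5 is the figure quoted above; the `redirects` mapping and the function name are hypothetical, and this is not a description of how Googlebot is actually implemented.

```python
def follow_redirects(start_url, redirects, max_hops=5):
    """Follow a redirect chain, giving up after max_hops hops.

    `redirects` maps a URL to the URL it redirects to (hypothetical data).
    Returns (last_url_reached, hops_followed, completed).
    """
    url = start_url
    hops = 0
    while url in redirects:
        if hops == max_hops:
            # The chain is longer than the budget for this session.
            return url, hops, False
        url = redirects[url]
        hops += 1
    return url, hops, True


# A chain of six redirects: /a -> /b -> ... -> /g
chain = {"/a": "/b", "/b": "/c", "/c": "/d", "/d": "/e", "/e": "/f", "/f": "/g"}
print(follow_redirects("/a", chain))
print(follow_redirects("/e", chain))
```

In this sketch the first call stops at `/f` without reaching the final page, while the second call, which starts closer to the end of the chain, completes.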
In addition, it prioritizes the pages in the list of pages to crawl. Page source, site or page importance factors, post frequency metrics, and other things can help a URL “move up” in the list of pages to crawl.
An SEO robot only has known pages of your website in its list of URLs to crawl. It therefore crawls them one after the other. Often, SEO crawlers follow the site’s internal link structure: they crawl all pages that are 1 click away from the page where they started, then all pages that are 2 clicks away, then all pages that are 3 clicks away, and so on, in the order in which the pages were discovered.
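This depth-by-depth order is a breadth-first traversal of the internal link graph. The sketch below assumes the site is available as an in-memory dictionary of links (the `site` data is hypothetical); a real crawler would fetch and parse each page instead.

```python
from collections import deque


def crawl_order(start, links):
    """Return (url, depth) pairs in breadth-first discovery order."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        for target in links.get(url, []):
            if target not in seen:  # each URL is queued only once
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical internal link graph: page -> pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c", "/d"],
    "/c": [],
    "/d": ["/a"],
}

for url, depth in crawl_order("/", site):
    print(depth, url)
```

The depth recorded for each URL is its click distance from the start page, which is the "depth" metric most SEO crawlers report.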
Therefore, unlike Google, an SEO robot that crawls too fast can saturate a website with too many URL requests in too short a time.
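One common way to avoid saturating a server is to enforce a minimum interval between requests. The class below is a minimal sketch of that idea, not the mechanism of any particular SEO crawler; the `Throttle` name and the `max_per_second` parameter are assumptions made for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Usage: call wait() before each request. The fetch itself is omitted here.
throttle = Throttle(max_per_second=50)
for url in ["/", "/a", "/b"]:
    throttle.wait()
    print("would fetch", url)
```

Lowering `max_per_second` slows the audit down but leaves more server capacity for human visitors.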
Google’s crawl is recurring: although Google’s crawl budget limits the frequency of Google bot visits (i.e. the number of pages crawled by Google in a given period), after crawling a few pages on your site, Google comes back later to revisit them or browse others.
On Google, a recent page can get indexed quickly, while other pages, updated before that one was published, remain indexed in their previous version until the robot returns to them and discovers the changes.
Apart from the directives that you can give to bots via meta robots tags, robots.txt and .htaccess files, search engines never stop visiting a site and will eventually discover pages that they had not seen (or that had not yet been published) on their first visits.
However, an SEO robot does not constantly update its list of known and crawled pages. It provides a snapshot of all the pages of the site that are accessible at the time of its single visit.
Even though it stops once it has crawled all known pages, an SEO crawl can take too long to be usable in situations where Google would have given up or split the exploration into several sessions:
- Very slow crawl speed
- Very large number of pages to crawl
- Robot traps that create an endless list of pages to crawl
To avoid a crawl that never ends, most SEO crawlers allow the robot to stop under the following conditions:
- When a maximum number of URLs have been crawled
- When a maximum depth (in number of clicks from the starting URL) has been reached
- When the user has decided to stop the crawl
This can produce “incomplete” crawls of the site where the crawler is aware of the existence of additional URLs that it has not crawled.
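The stop conditions above can be sketched as limits on a breadth-first crawl. This is a simplified illustration over a hypothetical in-memory link graph; the parameter names `max_urls` and `max_depth` are assumptions, not settings of a specific tool.

```python
from collections import deque


def limited_crawl(start, links, max_urls=None, max_depth=None):
    """Breadth-first crawl that stops at max_urls or max_depth.

    Returns (crawled, known_but_not_crawled), both in discovery order.
    """
    depth_of = {start: 0}  # every URL the crawler knows about
    queue = deque([start])
    crawled = []
    while queue:
        if max_urls is not None and len(crawled) >= max_urls:
            break  # URL budget exhausted
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                # Too-deep URLs are recorded as known but never queued.
                if max_depth is None or depth_of[target] <= max_depth:
                    queue.append(target)
    crawled_set = set(crawled)
    pending = [u for u in depth_of if u not in crawled_set]
    return crawled, pending


# Hypothetical internal link graph: page -> pages it links to.
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/d"], "/c": ["/e"]}

print(limited_crawl("/", graph, max_urls=3))
print(limited_crawl("/", graph, max_depth=2))
```

The second list in each result is what makes the crawl "incomplete": URLs the crawler has seen in links but has not visited.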
4. Compliance with robots.txt and meta directives
Most search engines follow the instructions given to robots: if a page or a folder is prohibited for robots in the meta directives of the pages or in the robots.txt file, they do not go there.
The only difficulty is knowing which pages the crawler should visit, and how to express that according to the complex rules of the robots.txt file.
In theory, SEO robots are not bound by any restrictions. In practice, most SEO crawlers offer friendly bots that respect these instructions, like search engine bots.
But this can pose a problem to marketers who want to know how Google sees their site: since instructions to robots can target a particular robot, a non-Google robot will not have the same access as a Google robot.
Search engines crawl with a well-defined User-Agent, a profile that serves as their identifier.
The robots.txt and meta robots’ rules can target a specific robot thanks to its User-Agent.
Example of robots.txt for Googlebot only:
Example of meta robots for Googlebot only:
<meta name="googlebot" content="noindex, nofollow">
An SEO robot crawls your site with its own identity and therefore does not obey directives aimed specifically at other robots.
This means that it may have a different view and experience on your site than a search engine crawler.
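The effect of User-Agent targeting can be checked with Python's standard `urllib.robotparser` module. In this sketch the robots.txt rules, the `/private/` path, and the `ExampleSEOBot` name are all hypothetical; only the parser itself is a real library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that restricts Googlebot only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/private/page"
for agent in ("Googlebot", "ExampleSEOBot"):
    # The same URL is blocked for one bot and open to the other.
    print(agent, parser.can_fetch(agent, url))
```

Because the rule group names Googlebot only, a crawler announcing itself under another name is not covered by it and sees a page that Googlebot is told to skip.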
6. New crawls
Google (as well as other search engines) periodically returns to visit the pages of a website. This aims to check if elements of the page have undergone modifications:
- Has a temporary HTTP status (503, 302…) changed?
- Has the content been updated?
- Has an error been corrected (404)?
- Has a new indexing been requested via Search Console?
- Is the page a candidate for a better position in the results pages?
An SEO robot, during a site audit, visits each URL only once.
Tips for bringing them together
Although the two types of robots are different, their differences are never impossible to resolve!
- Take advantage of SEO crawlers which offer other page discovery options: lists of URLs, sitemaps, connections to other tools, analysis of log files, etc.
- Focus on analyses that include backlinks, or inbound links from other sites, as well as SEO robots that are able to pick up the HTTP status code of the outbound link response, so you can identify broken outbound links.
- For SEO audits, you must find the right crawling speed: fast enough to get a quick analysis, but reasonable enough that the site server can keep up with requests from the robot and human visitors to the site.
- Remember to make several crawls of the same site and to compare them to reveal the elements which evolve regularly on this site.
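Comparing two crawls can be as simple as diffing two snapshots. The sketch below assumes each crawl has been exported as a mapping from URL to HTTP status code; the data and the `compare_crawls` name are hypothetical.

```python
def compare_crawls(previous, current):
    """Diff two crawl snapshots mapping URL -> HTTP status code."""
    prev_urls, curr_urls = set(previous), set(current)
    return {
        "added": sorted(curr_urls - prev_urls),
        "removed": sorted(prev_urls - curr_urls),
        "status_changed": sorted(
            url for url in prev_urls & curr_urls
            if previous[url] != current[url]
        ),
    }


# Hypothetical results of two crawls of the same site.
crawl_march = {"/": 200, "/old-offer": 200, "/blog": 503}
crawl_april = {"/": 200, "/blog": 200, "/new-offer": 200}

print(compare_crawls(crawl_march, crawl_april))
```

The same approach extends to any per-URL field a crawler records, such as titles, canonical tags, or depth.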
Limits of crawling
- It is better to run regular or even scheduled crawls to get an idea of the evolution of a site.
- In some cases, such as a multilingual site with translations on other domains or subdomains, it may be important to verify that the SEO robot crawls all the subdomains, or to launch it on several domains at the same time.
Instructions to robots and bot identity
SEO crawlers offer two main workarounds to bring together the behavior of SEO and Google bots:
- Possibility to ignore the real robots.txt file of the site and to take into account robots.txt rules specific to SEO crawl.
- Possibility of modifying all or part of the robot’s identity to “disguise” it as a search engine robot. For example, it is possible to replace the name in the User-Agent of the OnCrawl bot with “Googlebot”.
- Imitate the behavior of the search engine crawler by repeating or scheduling subsequent crawls of your site.
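At the HTTP level, changing a robot's identity comes down to sending a different User-Agent header. The sketch below only prepares a request with Python's standard library and does not send it; the helper name `build_request` is hypothetical, and the User-Agent string is one of the values Googlebot is known to use, shown here for illustration.

```python
from urllib.request import Request

GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url, user_agent):
    """Prepare (but do not send) a request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/", GOOGLEBOT_UA)
# urllib normalizes header names, hence the "User-agent" spelling here.
print(request.get_header("User-agent"))
```

A site that verifies crawlers by other means, such as reverse DNS lookups, may still treat the disguised robot differently from the real Googlebot.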
Rebecca works as Content Manager at OnCrawl. She is passionate about NLP, computer models of languages, systems and their general functioning. Rebecca is never short of curiosity when it comes to technical SEO topics. She believes in the evangelism of technology and the use of data to understand the performance of websites on search engines.