How to Discover and Understand Search Engines


If you are a website owner, you have probably had the experience of reading your own website’s analytics and noticing that they don’t make sense. In this article, we’ll explore how crawler data is collected, how it’s organized and what insights you might be able to gather from it.
This is an article about crawlers. Please post any questions in the replies section of this post.
Some users have asked how the crawler data on a crawler identification site is organized, so here we explain exactly how that data is collected and organized.

We can run a reverse DNS (rDNS) lookup on the crawler's IP address. For example, for the IP 116.179.32.160, a reverse DNS lookup tool returns the hostname baiduspider-116-179-32-160.crawl.baidu.com.

From this we can tentatively conclude that the visitor is a Baidu search engine crawler. Because the hostname can be forged, however, a reverse lookup alone is not conclusive. We also need a forward lookup: using the ping command, we find that baiduspider-116-179-32-160.crawl.baidu.com resolves to 116.179.32.160. Since the hostname resolves back to the original IP address, we can be sure that this is the Baidu search engine crawler.
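The reverse-then-forward check described above can be sketched in Python roughly as follows. This is a minimal sketch, not the site's actual implementation: the function name `forward_confirmed_rdns` and the `allowed_suffixes` parameter are assumptions made for illustration, and the resolver functions are passed in as arguments so the logic can be exercised without network access. The default resolvers handle IPv4 only.

```python
import socket


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the verified hostname for `ip`, or None if verification fails.

    Hypothetical helper: `allowed_suffixes` is a tuple of domain suffixes
    the crawler operator is expected to use, e.g. (".crawl.baidu.com",).
    """
    # Step 1: reverse lookup (IP -> hostname).
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    hostname = hostname.rstrip(".").lower()

    # Step 2: the hostname must belong to an expected crawler domain.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return None

    # Step 3: forward lookup (hostname -> IPs) must include the original IP,
    # because the owner of an IP controls its reverse record and can forge it.
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return None
    return hostname if ip in addresses else None
```

Calling `forward_confirmed_rdns("116.179.32.160", (".crawl.baidu.com",))` performs live DNS queries, so the result depends on the DNS records at the time of the call.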

Searching by ASN-related information

Not all crawlers follow the rule above. For most crawlers a reverse lookup returns no result, so we need to query the ASN information of the IP address to determine whether the crawler's claimed identity is correct.

For example, take the IP 74.119.118.20. By querying the IP information, we can see that this address is located in Sunnyvale, California, USA.

The ASN information shows that it is an IP address belonging to Criteo Corp.

The screenshot above shows a log record for the Criteo crawler. The part highlighted in yellow is the User-agent, followed by its IP address, and there is nothing wrong with this access (the IP is indeed an IP address of CriteoBot).
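To illustrate the ASN check, here is a small Python sketch that maps an IP address to an ASN by longest-prefix match against a prefix table. Everything in it is an assumption for illustration: the `lookup_asn` helper is hypothetical, and the table uses address ranges and AS numbers reserved for documentation rather than real routing data. In practice the table would come from an ASN or WHOIS data source.

```python
import ipaddress


def lookup_asn(ip, prefix_table):
    """Return (asn, organization) for the most specific matching prefix.

    `prefix_table` is an iterable of (cidr, asn, organization) tuples.
    Returns None when no prefix contains the address.
    """
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, asn, org in prefix_table:
        net = ipaddress.ip_network(cidr)
        if addr.version == net.version and addr in net:
            # Prefer the longest (most specific) prefix, as routing does.
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn, org)
    return None if best is None else (best[1], best[2])


# Placeholder data only: documentation ranges and documentation AS numbers.
EXAMPLE_TABLE = [
    ("192.0.2.0/24", 64500, "Example Crawler Corp."),
    ("192.0.2.128/25", 64501, "Example Hosting Ltd."),
]

print(lookup_asn("192.0.2.10", EXAMPLE_TABLE))
print(lookup_asn("192.0.2.200", EXAMPLE_TABLE))
```

If the organization returned for an IP matches the operator named in the User-agent, as with the Criteo example above, the access is likely genuine.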

IP address ranges published in the crawler’s official documentation

Some crawlers publish their IP address ranges. We save these officially published ranges directly to the database, which is an easy and fast way to identify them.
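One way to store published ranges in a database and query them is sketched below, using an in-memory SQLite database. The table name, column names and helper functions are all assumptions for illustration, and the range shown is a documentation range, not one published by any real crawler. The sketch covers IPv4 only, since each range is stored as a pair of integers.

```python
import ipaddress
import sqlite3


def create_range_db():
    """Create an in-memory database with a hypothetical crawler_ranges table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE crawler_ranges ("
        "crawler TEXT, cidr TEXT, first_ip INTEGER, last_ip INTEGER)"
    )
    return db


def add_range(db, crawler, cidr):
    """Store one published IPv4 range as its first and last address."""
    net = ipaddress.IPv4Network(cidr)
    db.execute(
        "INSERT INTO crawler_ranges VALUES (?, ?, ?, ?)",
        (crawler, cidr, int(net.network_address), int(net.broadcast_address)),
    )


def find_crawler(db, ip):
    """Return the crawler whose published range contains `ip`, or None."""
    value = int(ipaddress.IPv4Address(ip))
    row = db.execute(
        "SELECT crawler FROM crawler_ranges "
        "WHERE ? BETWEEN first_ip AND last_ip LIMIT 1",
        (value,),
    ).fetchone()
    return row[0] if row else None


db = create_range_db()
add_range(db, "examplebot", "198.51.100.0/24")  # placeholder range
print(find_crawler(db, "198.51.100.42"))
```

Storing the first and last address as integers lets the database answer the membership question with a plain range comparison.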

Via public logs

We can often find public logs on the Internet. For example, the following image is a public log record I found.

We can parse these log records and use the User-agent to determine which entries are crawlers and which are visitors, which greatly enriches our database of crawler records.
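The log parsing step might look something like the sketch below. It assumes the logs use the common "combined" access log format, which the article does not confirm. The keyword list in `BOT_HINTS` is a rough heuristic chosen for illustration, and the sample line is made up with a documentation IP address and a hypothetical bot name.

```python
import re

# Assumed layout: the "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Heuristic substrings that often appear in crawler User-agent strings.
BOT_HINTS = ("bot", "spider", "crawl", "slurp")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def classify(user_agent):
    """Label a User-agent as 'crawler' or 'visitor' by keyword match."""
    ua = user_agent.lower()
    return "crawler" if any(hint in ua for hint in BOT_HINTS) else "visitor"


# Made-up sample line for illustration.
SAMPLE = (
    '192.0.2.7 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot)"'
)

record = parse_line(SAMPLE)
print(record["ip"], classify(record["user_agent"]))
```

Because the User-agent can be set to anything by the client, a record classified this way is best cross-checked against the IP-based methods described earlier before it is added to the database.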

Summary

The four methods above describe how a crawler identification website collects and organizes crawler data, and how it ensures the accuracy and reliability of that data. In practice there are other methods besides these four, but they are used less often, so they are not covered here.
