memo

What Crawl Budget Means for Googlebot

Googlebot is designed to be a good citizen of the web. Crawl budget is defined as the number of URLs Googlebot can and wants to crawl, and it is determined by the crawl rate limit and crawl demand.

Crawl rate limit

The crawl rate limit changes mainly according to the following two factors.

Crawl health
- If the site responds quickly for a while, the limit goes up and Googlebot can use more connections to crawl
- Conversely, if the site slows down or the server returns errors, the limit goes down and Googlebot crawls less
Limit set in Search Console
- Website owners can explicitly reduce Googlebot's crawling of their site from Search Console
- Setting a higher limit in Search Console does not necessarily increase crawling
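
The crawl health behavior above can be sketched as a small feedback loop. This is only a toy model, not Googlebot's actual algorithm: the class name `CrawlRateLimiter`, the one-second slowness threshold, the halving on failure, and the step of one connection per healthy response are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlRateLimiter:
    """Toy model of a crawl rate limit driven by crawl health (hypothetical)."""

    connections: int = 2            # current number of parallel connections
    min_connections: int = 1        # never drop below one connection
    max_connections: int = 10       # assumed upper bound, e.g. an owner-set cap
    slow_threshold_s: float = 1.0   # assumed cutoff for a "slow" response

    def record(self, status: int, elapsed_s: float) -> int:
        """Update the limit from one fetch result and return the new limit."""
        unhealthy = status >= 500 or elapsed_s > self.slow_threshold_s
        if unhealthy:
            # Server error or slow response: back off by halving connections.
            self.connections = max(self.min_connections, self.connections // 2)
        else:
            # Fast, successful response: allow one more connection.
            self.connections = min(self.max_connections, self.connections + 1)
        return self.connections
```

For example, starting from the default of 2 connections, two fast `200` responses raise the limit to 4, and a single `503` then drops it back to 2. The upper bound plays the role of the Search Console setting: it can hold crawling down, but raising it does not by itself make the crawler use more connections.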

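Even if the crawl rate limit is not reached, Googlebot's activity stays low when there is no demand from indexing. The following two factors play a significant role in determining crawl demand.

Popularity
- URLs that are more popular on the Internet tend to be crawled more often to keep them fresh in the index
Staleness
- Google's systems try to prevent URLs from becoming stale in the index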

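Site-wide events such as changing or moving the site's URLs can also increase crawl demand, because the content has to be reindexed under the new URLs.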

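Having many low-value URLs is known to negatively affect a site's crawling and indexing. Low-value URLs fall into the following categories, listed in order of significance (the lower the number, the more significant).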

1. Faceted navigation / session identifiers
  - Faceted navigation
    - Official Google Webmaster Central Blog: Faceted navigation best (and 5 of the worst) practices
    - A feature that lets users filter items by attributes such as price or color
    - Faceted navigation often produces many URLs with nearly identical content, which hurts crawling
  - Session identifiers
    - Official Google Webmaster Central Blog: Google, duplicate content caused by URL parameters, and you
    - Query parameters carry information such as a session ID, so the URLs differ in their query string but display the same page
1. On-site duplicate content
  - The same content is reachable at more than one URL within the site
1. Soft error pages
  - Official Google Webmaster Central Blog: Crawl Errors now reports soft 404s
  - A page is treated as a soft 404 when it does not exist but the server still returns 200
1. Hacked pages
1. Infinite spaces and proxies
  - Official Google Webmaster Central Blog: To infinity and beyond? No!
  - Links that continue without end, such as a calendar's "next month" link
  - Add nofollow to such links
1. Low quality and spam content
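
To make the session identifier item in the list above concrete, here is a minimal sketch of how a site could collapse URLs that differ only in session parameters or parameter order into one canonical form. It uses only Python's standard `urllib.parse`. The function name `canonicalize` and the parameter names in `SESSION_PARAMS` are hypothetical; a real site would substitute whatever parameters it actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust to the site's real ones.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def canonicalize(url: str) -> str:
    """Drop session parameters and sort the rest so duplicates compare equal."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Sorting makes ?color=red&price=10 and ?price=10&color=red identical.
    params.sort()
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))
```

With this sketch, `https://example.com/shoes?color=red&sid=abc123` and `https://example.com/shoes?sid=zzz&color=red` both become `https://example.com/shoes?color=red`. The canonical form could then be used in internal links or a `rel="canonical"` tag, so crawlers see one URL per page instead of one per session.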