What LLM Web scrapers are there?

The rapid growth in Large Language Model (LLM) providers has brought a corresponding surge in bots scraping the web for training material. These bots crawl websites in much the same way as search engine crawlers like GoogleBot and BingBot. Indeed, both Google and Bing run bots that scrape the web to gather training material for their AI models.

Here at Peakhour we classify these as Grey bots, potentially even as Malicious. They provide next to no value for the vast majority of websites and are typically very aggressive at crawling your website. This aggressive crawling can severely impact website performance for your legitimate users and inflate your cloud bill.

Major LLM Training bots

GPTBot (OpenAI): GPTBot is a web scraper used by OpenAI to gather data for training its GPT models. It respects robots.txt directives.
ClaudeBot (Anthropic): Used by Anthropic for training its language models. ClaudeBot respects robots.txt directives.
CCBot (Common Crawl Bot): This bot collects data for the Common Crawl dataset, which is widely used for training language models. It respects robots.txt rules and aims to minimize disruption to websites.
MSBot (Microsoft): Used by Microsoft for various AI and language model training purposes. It adheres to robots.txt directives and is designed to gather useful data while respecting website owners’ preferences.
ByteSpider (ByteDance/TikTok): ByteDance uses this bot for training its language models. ByteSpider does not appear to respect robots.txt and is extremely aggressive.
PerplexityBot: Used by Perplexity AI for gathering data to improve their models. This bot respects robots.txt when crawling generally, but not when fetching pages to answer user-generated queries to its LLM. It has also been reported to disguise itself as an ordinary browser.
AmazonBot: While AmazonBot is not technically an LLM training bot, Amazon says it uses the bot to train Alexa to provide better responses.
ImagesiftBot: Owned by Hive, this bot scrapes the internet for publicly available images. While primarily used for reverse image search, the data can also be used to train image generation models. ImagesiftBot respects the robots.txt file.

Managing LLM Bots

You can try disallowing each of these bots in your robots.txt file, although this only works for bots that respect it. If your firewall service supports custom rules, you can also create a rule that denies each bot by its user agent value.
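
Both approaches can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the bot names are copied from the list above and the exact user agent token each vendor publishes should be confirmed before use, and the helper names build_robots_txt and is_llm_bot are hypothetical. The first function generates robots.txt rules; the second is the kind of user agent check a custom firewall rule or application middleware might perform.

```python
import re

# Bot names taken from the list above. These are assumptions about the
# user agent tokens; check each vendor's documentation for the exact value.
LLM_BOTS = [
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "MSBot",
    "ByteSpider",
    "PerplexityBot",
    "AmazonBot",
    "ImagesiftBot",
]


def build_robots_txt(bots=LLM_BOTS):
    """Return robots.txt text that disallows the whole site for each bot."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(groups) + "\n"


# Case-insensitive pattern matching any listed bot name inside a user agent.
_BOT_PATTERN = re.compile(
    "|".join(re.escape(bot) for bot in LLM_BOTS), re.IGNORECASE
)


def is_llm_bot(user_agent):
    """Return True if the User-Agent header contains a listed bot name."""
    return bool(user_agent) and _BOT_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    print(build_robots_txt())
    print(is_llm_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

A request for which is_llm_bot returns True could then be answered with a 403 response. Note that user agent matching only catches bots that identify themselves honestly; a bot that disguises itself as an ordinary browser will pass this check.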