A web robot claiming to be from SimilarPages has been crawling my sites, while completely ignoring robots.txt.
It's difficult to tell for certain that this robot really is from SimilarPages.com, because the crawling is coming from Amazon Web Services (AWS), Amazon's cloud computing platform.
The robot visits look like this:
2009-02-25 (Wed) 02:20:40 address is 126.96.36.199, hostname is ec2-67-202-52-58.compute-1.amazonaws.com, agent is SimilarPages/Nutch-1.0-dev (SimilarPages Nutch Crawler; http://www.similarpages.com; info at similarpages dot com)
If this bot really is from SimilarPages.com, I want to tell them that they aren't going to beat Google by building a search engine from an open source project (Nutch) and hosting it at Amazon Web Services.
If this is just another blackhat crawler hosted at Amazon, I want to tell Amazon that their customers are hurting their brand value.
I think that I will block Amazon Web Services' IP ranges on our external firewall.
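
Before touching the firewall, it may help to check which logged visitors would actually be caught by such a block. The following is only a rough sketch in Python: the names `ec2_address`, `is_blocked`, and `BLOCKED_NETWORKS` are made up for illustration, and the network listed is a placeholder chosen to contain the address encoded in the hostname above, not a verified list of Amazon's ranges.

```python
import ipaddress
import re

# Placeholder range for illustration only (an assumption, not Amazon's
# published list). Replace with the ranges you actually intend to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("67.202.0.0/18")]

# EC2 reverse-DNS names embed the address, e.g.
# ec2-67-202-52-58.compute-1.amazonaws.com -> 67.202.52.58
EC2_HOSTNAME = re.compile(r"^ec2-(\d+)-(\d+)-(\d+)-(\d+)\.[\w.-]*amazonaws\.com$")


def ec2_address(hostname):
    """Return the IP address encoded in an EC2-style hostname, or None."""
    match = EC2_HOSTNAME.match(hostname)
    if match is None:
        return None
    try:
        return ipaddress.ip_address(".".join(match.groups()))
    except ValueError:
        # The digits matched the pattern but do not form a valid address.
        return None


def is_blocked(address):
    """Return True if the address falls inside any of the listed networks."""
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    addr = ec2_address("ec2-67-202-52-58.compute-1.amazonaws.com")
    print(addr, is_blocked(addr))
```

Running the hostnames from the access log through a check like this shows how much traffic a given range would cut off, which is worth knowing before the rule goes onto the firewall.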