NetBuilders

Old 8 December, 2008, 15:39 PM   #1 (permalink)
Gozer
 
Will.Spencer's Avatar
 
Location: Singapore
iTrader: (45)
Blog Entries: 1
Thanked 1,622 Times in 890 Posts
Posts: 4,958
$NetBucks: 8,562
Join Date: Dec 2008
Last Online: Today 12:20 PM
Block Web Content Scrapers and Downloaders

I have found Bot Trap to be extremely effective in blocking the sneakiest of the web scraper bots.

Bot Trap works by placing a hidden link on your homepage. That link can only be seen in the source code to the page. Ergo, the only things that should see and follow that link are web robots. But, that link is Disallowed in robots.txt, so polite robots will never try to follow that link.
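To illustrate the robots.txt half of this (this is not Bot Trap's own code), here is a small Python sketch using the standard-library `urllib.robotparser`. It shows how a polite crawler would decide to skip the trap link. The path `/bot-trap/` and the sample robots.txt are assumptions made up for the example; substitute whatever path your install uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the trap directory name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""

def polite_robot_may_fetch(path, user_agent="ExampleBot"):
    """Return True if a robots.txt-respecting crawler may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)

# A polite robot skips the trap but can still crawl normal pages.
trap_allowed = polite_robot_may_fetch("/bot-trap/index.php")
home_allowed = polite_robot_may_fetch("/index.html")
```

Anything that requests the trap path anyway has, by definition, ignored robots.txt.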

Robots that do follow that link are automatically added, as Deny statements, to your site's .htaccess file.
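A minimal sketch of that blocking step, in Python and working on the .htaccess contents as a plain string. The function name `add_deny` and the example IP are hypothetical, and the real script may well do this differently; the only thing taken from the post is that a Deny statement ends up in .htaccess.

```python
def add_deny(htaccess_text, ip):
    """Return htaccess_text with a 'Deny from <ip>' line appended once."""
    deny_line = f"Deny from {ip}"
    lines = htaccess_text.splitlines()
    if deny_line in (line.strip() for line in lines):
        return htaccess_text  # already blocked; avoid duplicate entries
    lines.append(deny_line)
    return "\n".join(lines) + "\n"

# 203.0.113.7 is a documentation-only address used as a stand-in.
before = "Allow from 127.0.0.1\n"
after = add_deny(before, "203.0.113.7")
```

Calling it a second time with the same IP leaves the text unchanged, so repeat visits to the trap do not bloat the file.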

Sometimes legitimate web robots get out of sync with robots.txt and follow the link anyway, so to let the script run unattended, I recommend that you whitelist them in your .htaccess file.

Here's my current whitelist:
Code:
# Localhost
Allow from 127.0.0.1
# Google
Allow from 64.233.160.0/19
Allow from 66.249
Allow from 72.14.192.0/18
Allow from 74.125.0.0/16
# MSN
Allow from 65.55
# Yahoo!
Allow from 67.195
Allow from 72.30
Allow from 74.6
Allow from 202.160
# Baidu
Allow from 122.152.129.15
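If you want to check whether a given address is covered by the whitelist above before relying on it, something like the following Python sketch may help. It treats a partial address such as `66.249` the way Apache does, as "any address starting with these octets". The helper names are hypothetical and it handles IPv4 only.

```python
import ipaddress

# Entries copied from the whitelist above.
WHITELIST = [
    "127.0.0.1", "64.233.160.0/19", "65.55", "66.249", "67.195",
    "72.14.192.0/18", "72.30", "74.6", "74.125.0.0/16",
    "122.152.129.15", "202.160",
]

def to_network(entry):
    """Turn a full IP, CIDR block, or partial IP into an IPv4 network."""
    if "/" in entry:
        return ipaddress.ip_network(entry)
    octets = entry.rstrip(".").split(".")
    padded = octets + ["0"] * (4 - len(octets))
    # Each octet that was given fixes 8 bits of the prefix.
    return ipaddress.ip_network(f"{'.'.join(padded)}/{8 * len(octets)}")

def is_whitelisted(ip):
    """Return True if `ip` falls inside any whitelist entry."""
    addr = ipaddress.ip_address(ip)
    return any(addr in to_network(entry) for entry in WHITELIST)
```

For example, `is_whitelisted("66.249.66.1")` is true because of the `66.249` entry, while an address outside every listed range is not.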
This blocks people who steal your content to place on MFA (made-for-AdSense) sites, and it blocks people who download your entire website for offline reading. I constantly see people trying to download my fifty-thousand-page websites. It's a complete waste of bandwidth.

Bot Trap is friendly though. Users will see a message telling them that they are blocked and they only have to enter the word "access" into a form to be automatically unblocked.

Every blocking or unblocking action generates an email to the site admin.
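The unblock step could look roughly like this in Python. This is a sketch under assumptions, not the script's actual logic: the function name `unblock` is made up, the word is compared exactly after trimming whitespace, and the returned notice string stands in for the email that would go to the site admin.

```python
def unblock(htaccess_text, ip, submitted_word):
    """Remove 'Deny from <ip>' if the visitor typed the word "access".

    Returns (new_text, notice). notice is None when nothing changed;
    otherwise it is the text an admin notification would carry.
    """
    if submitted_word.strip() != "access":
        return htaccess_text, None
    deny_line = f"Deny from {ip}"
    lines = htaccess_text.splitlines()
    kept = [line for line in lines if line.strip() != deny_line]
    if len(kept) == len(lines):
        return htaccess_text, None  # this IP was not blocked
    return "\n".join(kept) + "\n", f"Unblocked {ip}"
```

Only the Deny line for the requesting IP is removed, so other blocked addresses and the whitelist stay as they were.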

It's really a beautiful script.
Old 9 December, 2008, 10:16 AM   #2 (permalink)
Unknown Net Builder
 
iTrader: (0)
Thanked 0 Times in 0 Posts
Posts: 9
$NetBucks: 2
Join Date: Dec 2008
Last Online: 8 September, 2009 07:43 AM

I don't get this. You allow some search bots and block all others, but how do you do that with real visitors? They have an IP address too.
Old 9 December, 2008, 13:01 PM   #3 (permalink)
Gozer
 
Will.Spencer's Avatar
 

Quote:
Originally Posted by freexs.org
I don't get this. You allow some search bots and block all others, but how do you do that with real visitors? They have an IP address too.
Only search bots which access the hidden bot-trap file get blocked.

Real visitors get blocked if they access that hidden file also, but they can unblock themselves just by typing "access" into the form.

Since search bots tend to be stupid, I specifically whitelist (allow) the important ones. This means that they can't get themselves blocked, even if they do access the hidden bot-trap file.
Old 10 December, 2008, 10:00 AM   #4 (permalink)
Super-Duper Moderator
 
DomainMagnate's Avatar
 
iTrader: (8)
Blog Entries: 2
Thanked 117 Times in 83 Posts
Posts: 622
$NetBucks: 446
Join Date: Dec 2008
Last Online: 21 February, 2010 11:52 AM

Why not just allow the known search engine bots and disallow the rest?
Old 10 December, 2008, 10:18 AM   #5 (permalink)
Gozer
 
Will.Spencer's Avatar
 

The normal way to do that would be to block them in robots.txt, but the bad guys just ignore robots.txt.

To deny them in .htaccess, you need to know their IP addresses, and the really bad guys change IPs frequently.

But it's not just bots. Web downloaders like wget, HTTrack, and WebCopier eat up huge amounts of bandwidth and provide almost no value to the website owner.
Old 11 December, 2008, 16:39 PM   #6 (permalink)
Super Moderator
 
firetown's Avatar
 
Location: shuddup
iTrader: (0)
Blog Entries: 3
Thanked 88 Times in 68 Posts
Posts: 501
$NetBucks: 206
Join Date: Dec 2008
Last Online: Today 15:59 PM

Someone on Sphinn asks:

Who other than the major search engines would you add to your white list? Will you be hurting your rank if you keep the scrapers from grabbing your content?
Old 11 December, 2008, 17:24 PM   #7 (permalink)
One is glad to be of service
 
Mr.Bill's Avatar
 
iTrader: (9)
Blog Entries: 1
Thanked 275 Times in 137 Posts
Posts: 715
Recent Blog: Diaper Rash
$NetBucks: 80
Join Date: Dec 2008
Last Online: 30 November, 2009 04:21 AM

I use a robot trap on a couple of sites. However, come the beginning of next year, it might not be as important anymore.


Scrapers are not going to fare well:
webpronews.com/topnews/2008/11/17/seo-about-to-get-turned-on-its-ear
Old 11 December, 2008, 18:45 PM   #8 (permalink)
Gozer
 
Will.Spencer's Avatar
 

I don't have much faith in personalized search, but that's a whole other thread.
Old 26 December, 2008, 11:07 AM   #9 (permalink)
Unknown Net Builder
 
iTrader: (0)
Thanked 5 Times in 4 Posts
Posts: 9
$NetBucks: 0
Join Date: Dec 2008
Last Online: 8 February, 2009 11:00 AM

I agree with Will's assessment of personalized search...but back onto the main topic...

Up until now, I have not worried too much about scrapers, but I just noticed one of my main real estate sites getting scraped for listings by a major national third-party site... time to take action, I guess...

Thanks for the how-to on this.

Eric
__________________
Eric Blackwell
SEO, Technologist, Web Entrepreneur

Old 2 February, 2009, 13:45 PM   #10 (permalink)
Net Builder
 
TopDogger's Avatar
 
iTrader: (3)
Thanked 184 Times in 127 Posts
Posts: 504
$NetBucks: 722
Join Date: Jan 2009
Last Online: Today 14:42 PM

Will, does your white list cover all of the Google spiders?

I've seen some lengthy lists of GoogleBot IP addresses.