Thread: Block Web Content Scrapers and Downloaders

  1. #1
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,033
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts

    Block Web Content Scrapers and Downloaders

    I have found Bot Trap to be extremely effective in blocking the sneakiest of the web scraper bots.

    Bot Trap works by placing a hidden link on your homepage. That link can only be seen in the source code to the page. Ergo, the only things that should see and follow that link are web robots. But, that link is Disallowed in robots.txt, so polite robots will never try to follow that link.

    Robots that do follow that link are automatically added to your site's .htaccess file as Deny statements.
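
    To make the mechanism concrete, here is a minimal Python sketch of the blocking step. It is not Bot Trap's actual code; the function name `block_ip` and the exact `Deny from` line format are assumptions chosen for illustration. The idea is simply that whatever requests the trap URL gets its address appended to .htaccess, once.

```python
import ipaddress
from pathlib import Path


def block_ip(htaccess: Path, ip: str) -> bool:
    """Append 'Deny from <ip>' to the .htaccess file (hypothetical helper).

    Returns True if a new line was added, False if the IP was already denied.
    """
    # Raises ValueError on anything that is not a valid IP address, so
    # untrusted input can never inject arbitrary directives into the file.
    ipaddress.ip_address(ip)

    line = f"Deny from {ip}"
    text = htaccess.read_text() if htaccess.exists() else ""
    if line in text.splitlines():
        return False

    # Make sure the new directive starts on its own line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    htaccess.write_text(text + separator + line + "\n")
    return True
```

    A trap page would call something like `block_ip(Path(".htaccess"), visitor_ip)` for every request it receives, after checking the whitelist.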

    Sometimes legitimate web robots get out of sync, so to make the script able to run unattended, I recommend that you whitelist those in your .htaccess file.

    Here's my current whitelist:
    Code:
    Allow from 127.0.0.1
    # Google
    Allow from 64.233.160.0/19
    Allow from 66.249
    Allow from 72.14.192.0/18
    Allow from 74.125.0.0/16
    # MSN
    Allow from 65.55
    Allow from 207.46
    # Yahoo!
    Allow from 67.195
    Allow from 72.30
    Allow from 74.6
    Allow from 202.160
    # Baidu
    Allow from 122.152.129.15
    (Apache does not support comments on the same line as a directive, so the labels go on their own lines.)
    This blocks people stealing your content to place on MFA (Made for AdSense) sites, and it blocks people downloading your entire website for offline reading. I constantly see people trying to download my fifty-thousand-page websites. It's a complete waste of bandwidth.

    Bot Trap is friendly though. Users will see a message telling them that they are blocked and they only have to enter the word "access" into a form to be automatically unblocked.

    Every blocking or unblocking action generates an email to the site admin.
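
    The self-unblock step can be sketched the same way. This is an illustration only, not the script's real code: the function name `unblock_ip`, the exact-match check on the word, and the `Deny from` line format are all assumptions, and the notification email is left out.

```python
from pathlib import Path


def unblock_ip(htaccess: Path, ip: str, typed_word: str) -> bool:
    """Remove 'Deny from <ip>' if the visitor typed the magic word.

    Hypothetical helper: returns True only when a Deny line was removed.
    """
    # The form asks for the word "access"; anything else leaves the block alone.
    if typed_word.strip() != "access":
        return False

    target = f"Deny from {ip}"
    lines = htaccess.read_text().splitlines()
    kept = [line for line in lines if line.strip() != target]
    if len(kept) == len(lines):
        return False  # this IP was not blocked in the first place

    # Rewrite the file without the matching Deny line; other rules are untouched.
    htaccess.write_text("".join(line + "\n" for line in kept))
    return True
```

    A human can pass this check trivially, while a scraper that never renders or fills in the form stays blocked.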

    It's really a beautiful script.
    Last edited by Will.Spencer; 25 December, 2010 at 01:05 AM.
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  2. Thanked by:

    Andy101 (1 November, 2010), Aziz (31 October, 2010), Mike30 (26 December, 2010), WebEvader (20 February, 2009)

  3. #2
    freexs.org is offline Unknown Net Builder
    Join Date
    Dec 2008
    Posts
    9
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't get this. You allow some search bots and block all the others, so how do you do that with real visitors? They have an IP address too.

  4. #3
    Will.Spencer
    Quote, originally posted by freexs.org:
    "I don't get this. You allow some search bots and block all the others, so how do you do that with real visitors? They have an IP address too."
    Only search bots which access the hidden bot-trap file get blocked.

    Real visitors also get blocked if they access that hidden file, but they can unblock themselves just by typing "access" into the form.

    Since search bots tend to be stupid, I specifically whitelist (allow) the important ones. This means that they can't get themselves blocked, even if they do access the hidden bot-trap file.
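
    For anyone who wants to sanity-check an address against the whitelist, here is a rough Python sketch. It assumes the entries from my first post and assumes that a partial address such as `65.55` is treated the way Apache treats it, as "every address starting with those octets". The helper names `to_network` and `is_whitelisted` are made up for this example.

```python
import ipaddress

# Entries copied from the whitelist in the first post.
WHITELIST = [
    "127.0.0.1",
    "64.233.160.0/19", "66.249", "72.14.192.0/18", "74.125.0.0/16",  # Google
    "65.55", "207.46",                                               # MSN
    "67.195", "72.30", "74.6", "202.160",                            # Yahoo!
    "122.152.129.15",                                                # Baidu
]


def to_network(entry: str) -> ipaddress.IPv4Network:
    """Convert a whitelist entry into a network object."""
    if "/" in entry:
        return ipaddress.ip_network(entry)
    # A partial address like "65.55" covers 65.55.0.0/16; a full one is a /32.
    octets = entry.split(".")
    padded = octets + ["0"] * (4 - len(octets))
    return ipaddress.ip_network(f"{'.'.join(padded)}/{8 * len(octets)}")


NETWORKS = [to_network(entry) for entry in WHITELIST]


def is_whitelisted(ip: str) -> bool:
    """True if the address falls inside any whitelisted range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in NETWORKS)
```

    A trap script would run a check like this first and skip the blocking step whenever it returns True.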

  5. #4
    DomainMagnate's Avatar
    DomainMagnate is offline Super-Duper Moderator
    Join Date
    Dec 2008
    Posts
    749
    Blog Entries
    2
    Thanks
    150
    Thanked 175 Times in 110 Posts
    Why not just allow the known search engine bots and disallow the rest?

  6. #5
    Will.Spencer
    The normal way to do that would be to block them in robots.txt, but the bad guys just ignore robots.txt.
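
    The reason robots.txt cannot stop them is that it is purely advisory: the robot itself decides whether to check it. As a small illustration, this Python sketch uses the standard library's `urllib.robotparser` to show what a polite crawler does before fetching a URL. The `/bot-trap/` path, the `ExampleBot` name, and the example.com URLs are placeholders, not the real ones.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that hides the trap directory from well-behaved robots.
# The "/bot-trap/" path is a placeholder for wherever the trap lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def polite_may_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """Return what a rule-following crawler would conclude about this URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(polite_may_fetch("https://example.com/articles/"))
    print(polite_may_fetch("https://example.com/bot-trap/index.php"))
```

    A scraper that never runs a check like this walks straight into the disallowed path, which is exactly what the trap relies on.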

    To deny them in .htaccess, you need to know their IP addresses, and the really bad guys change IPs frequently.

    But it's not just bots. Web downloaders like wget, HTTrack, and WebCopier eat up huge amounts of bandwidth and provide almost no value to the website owner.

  7. #6
    Mike Dammann's Avatar
    Mike Dammann is offline Super Moderator
    Join Date
    Dec 2008
    Location
    Geographically flexible
    Posts
    964
    Blog Entries
    3
    Thanks
    237
    Thanked 182 Times in 148 Posts
    Someone on Sphinn asks:

    Who other than the major search engines would you add to your white list? Will you be hurting your rank if you keep the scrapers from grabbing your content?

  8. #7
    Mr.Bill's Avatar
    Mr.Bill is offline One is glad to be of service
    Join Date
    Dec 2008
    Location
    Redmond, Oregon
    Posts
    828
    Blog Entries
    1
    Thanks
    72
    Thanked 350 Times in 182 Posts
    I use a robot trap on a couple of sites. However, come the beginning of next year, it might not be as important anymore.

    Scrapers are not going to fare well:
    webpronews.com/topnews/2008/11/17/seo-about-to-get-turned-on-its-ear


  9. #8
    Will.Spencer
    I don't have much faith in personalized search, but that's a whole other thread.

  10. #9
    EricBlackwell is offline Net Builder and all around nice guy
    Join Date
    Dec 2008
    Posts
    12
    Thanks
    1
    Thanked 6 Times in 5 Posts
    I agree with Will's assessment of personalized search...but back onto the main topic...

    Up until now, I have not worried too much about scrapers, but I just noticed one of my main real estate sites getting scraped for listings by a major national third-party site... time to take action, I guess...

    Thanks for the how-to on this.

    Eric

  11. #10
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,050
    Thanks
    345
    Thanked 909 Times in 694 Posts
    Will, does your white list cover all of the Google spiders?

    I've seen some lengthy lists of GoogleBot IP addresses.
