
Thread: Block Web Content Scrapers and Downloaders

  1. #11
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,034
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts
    Quote Originally Posted by TopDogger View Post
    Will, does your white list cover all of the Google spiders?
    It blocks every one I've seen. Each time an IP gets blocked by bot-trap, I get an email. I look at the email to see if the User-Agent is a bot that should not be blocked. If it is, I check its IP address in WHOIS to make sure it's not some content thief faking their User-Agent. If the User-Agent and the WHOIS data match, I add the new IP address range to my white list.
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  2. #12
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,141
    Thanks
    350
    Thanked 924 Times in 707 Posts
    I'm taking a look at this today.

    If I understand this correctly, in addition to copying the other files to the server and modifying the robots.txt file, I need to set 777 permissions on the .htaccess file in the root directory and add the whitelist IPs. Is that correct?

    I currently have the following list of spammer IPs blocked in the .htaccess on the test site. Do I just add the whitelist IPs to this block or keep them in a separate group?

    Code:
    <Limit GET HEAD POST>
    order allow,deny
    deny from 24.129.33.46
    deny from 69.94.108.180
    deny from 82.128.
    deny from 208.66.195.
    allow from all
    </Limit>
    Where is the blacklist being created? Could this system fill up the .htaccess file with blacklisted IPs over time?

    Are you using a single pixel image for the link trap or did you place a larger image on the page somewhere?

    BTW, .htaccess syntax does not allow comments on the same line as a directive. I was making notations just like those in your whitelist next to IPs that I was manually banning, and my server error log filled up with error messages. You have to place comments on separate lines.

  3. #13
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,034
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts
    Quote Originally Posted by TopDogger View Post
    If I understand this correctly, in addition to copying the other files to the server and modifying the robots.txt file, I need to set 777 permissions on the .htaccess file in the root directory and add the whitelist IPs. Is that correct?
    My .htaccess files seem to work with 644 (rw-r--r--) permissions, but that's because they are owned by the same user ID that the web server runs under.

    Quote Originally Posted by TopDogger View Post
    I currently have the following list of spammer IPs blocked in the .htaccess on the test site. Do I just add the whitelist IPs to this block or keep them in a separate group?
    Hmmm... you got me... I think they can be separate.
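    For illustration, here is roughly how the two groups can sit side by side (the IPs in the second group are placeholders from the documentation ranges, not bot-trap's actual output). Since the whitelist is checked by the trap script itself -- it decides whether a deny line ever gets written -- Apache just accumulates the deny lines the same way wherever they sit:

    Code:
    <Limit GET HEAD POST>
    order allow,deny
    # group 1: manually banned spammers
    deny from 24.129.33.46
    deny from 69.94.108.180
    # group 2: entries appended by the trap (placeholder IPs)
    deny from 203.0.113.45
    deny from 198.51.100.7
    allow from all
    </Limit>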

    Quote Originally Posted by TopDogger View Post
    Where is the blacklist being created? Could this system fill up the .htaccess file with blacklisted IPs over time?
    It's appended to the end of .htaccess. The .htaccess can grow large over time. Every month or two I delete the older deny statements.

    Quote Originally Posted by TopDogger View Post
    Are you using a single pixel image for the link trap or did you place a larger image on the page somewhere?
    I'm using a single pixel image.

    Quote Originally Posted by TopDogger View Post
    BTW, .htaccess syntax does not allow comments on the same line as a directive. I was making notations just like those in your whitelist next to IPs that I was manually banning, and my server error log filled up with error messages. You have to place comments on separate lines.
    That's odd -- it seems to be working fine here under Apache 2.2. Are you running Apache 1.3 or Apache 2.2?
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  4. #14
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,141
    Thanks
    350
    Thanked 924 Times in 707 Posts
    Thanks for the quick update.

    Quote Originally Posted by Will.Spencer View Post
    My .htaccess files seem to work with 644 (rw-r--r--) permissions, but that's because they are owned by the same user ID that the web server runs under.
    644 is the standard permission setting for .htaccess files. The instructions say to "Make blacklist.dat and .htaccess writable by the web server user." I interpret that to mean that the file needs to be writable by the script being run by a user, which 644 doesn't cover. Maybe I'm interpreting this wrong. I'm looking at it as being similar to a cache directory, where the permissions typically need to be set to 666 or 777. Are the permissions on the blacklist.dat file set to 644 also?

    Quote Originally Posted by Will.Spencer View Post
    That's odd -- it seems to be working fine here under Apache 2.2. Are you running Apache 1.3 or Apache 2.2?
    I'm running Apache 2.2.10. One of the techs at my hosting company pointed that out a few months back while we were troubleshooting a server issue. He noticed that the Apache error log was pretty fat and packed with hundreds of messages pointing to the .htaccess files. He took a look at one of the .htaccess files and said that comments had to be on separate lines. I never saw an error with my sites, so the messages may have been warnings. It was something new to me.

  5. #15
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,034
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts
    Quote Originally Posted by TopDogger View Post
    644 is the standard permission setting for .htaccess files. The instructions say to "Make blacklist.dat and .htaccess writable by the web server user." I interpret that to mean that the file needs to be writable by the script being run by a user, which 644 doesn't cover. Maybe I'm interpreting this wrong. I'm looking at it as being similar to a cache directory, where the permissions typically need to be set to 666 or 777. Are the permissions on the blacklist.dat file set to 644 also?
    I think it means "the script being run by the web server." On my system, the web server runs as the user www and .htaccess is owned by the user www.

    But, 777 works too -- no matter who owns the file.

    Quote Originally Posted by TopDogger View Post
    I'm running Apache 2.2.10. One of the techs at my hosting company pointed that out a few months back while we were troubleshooting a server issue. He noticed that the Apache error log was pretty fat and packed with hundreds of messages pointing to the .htaccess files. He took a look at one of the .htaccess files and said that comments had to be on separate lines. I never saw an error with my sites, so the messages may have been warnings. It was something new to me.
    Very odd -- not a single mention of this in my error logs.

    But I did some Googling and found other people with similar issues -- particularly when the comments contained forward slashes. So, it seems best to move the comments to separate lines.
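    To illustrate the difference (the IP and the note are placeholders):

    Code:
    # risky: a note after the directive, on the same line
    deny from 203.0.113.45 # scraper seen 2009-02-18

    # safe: the same note on its own line
    # scraper seen 2009-02-18
    deny from 203.0.113.45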
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  6. #16
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,141
    Thanks
    350
    Thanked 924 Times in 707 Posts
    Quote Originally Posted by Will.Spencer View Post
    I think it means "the script being run by the web server." On my system, the web server runs as the user www and .htaccess is owned by the user www.
    OK. After I see how it logs the first spider, I will try setting the permissions to 644 to see what happens. If it does not work, an error will probably show up in the site's error log. Personally, I prefer not setting anything to 777.


    Quote Originally Posted by Will.Spencer View Post
    Very odd -- not a single mention of this in my error logs.

    But I did some Googling and found other people with similar issues -- particularly when the comments contained forward slashes. So, it seems best to move the comments to separate lines.
    All I use is a hash for comments in the .htaccess. There might be some kind of obscure server configuration issue that causes it to log errors on some servers, but not on others.

  7. #17
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,034
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts
    Quote Originally Posted by TopDogger View Post
    OK. After I see how it logs the first spider, I will try setting the permissions to 644 to see what happens. If it does not work, an error will probably show up in the site's error log. Personally, I prefer not setting anything to 777.
    As an old Unix guy, 777 makes me feel weird.

    Quote Originally Posted by TopDogger View Post
    All I use is a hash for comments in the .htaccess. There might be some kind of obscure server configuration issue that causes it to log errors on some servers, but not on others.
    The forward slashes were apparently being misinterpreted as part of a CIDR (Classless Inter-Domain Routing) statement, like 192.168.0.0/24.
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  8. #18
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,141
    Thanks
    350
    Thanked 924 Times in 707 Posts
    Quote Originally Posted by Will.Spencer View Post
    As an old Unix guy, 777 makes me feel weird.
    I tested it again this morning. It would not work with 644, so I had to settle for 666.

    Quote Originally Posted by Will.Spencer View Post
    The forward slashes were apparently being misinterpreted as part of a CIDR (Classless Inter-Domain Routing) statement, like 192.168.0.0/24.
    I do not understand how the CIDR number works. What range of IPs does 64.233.160.0/19 cover?

    I have a long list of Google spider IPs. Most are in different IP ranges.

  9. #19
    Will.Spencer's Avatar
    Will.Spencer is offline Retired
    Join Date
    Dec 2008
    Posts
    5,034
    Blog Entries
    1
    Thanks
    1,010
    Thanked 2,329 Times in 1,259 Posts
    Quote Originally Posted by TopDogger View Post
    I do not understand how the CIDR number works. What range of IPs does 64.233.160.0/19 cover?
    CIDR notation uses binary math. /19 means "the first 19 bits identify the network and the remaining bits identify the hosts within it." When you make the number after the slash larger, the networks get smaller. When you make the number after the slash smaller, the networks get larger.

    10.0.0.0/8 is a traditional Class A network, i.e. 10.0.0.0 to 10.255.255.255.

    192.168.0.0/24 is a traditional Class C network, i.e. 192.168.0.0 to 192.168.0.255.

    Raising the number after the slash by one cuts the size of the network in half. Lowering it by one doubles the size of the network.

    I've never liked doing math, so I use a CIDR calculator for these calculations.
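    But to answer your specific question: /19 leaves 32 - 19 = 13 host bits, so the block holds 2^13 = 8,192 addresses. In the third octet, 160 is 10100000 in binary and /19 fixes its top three bits, so 64.233.160.0/19 covers 64.233.160.0 through 64.233.191.255. In .htaccess, that whole range fits on a single line:

    Code:
    # 64.233.160.0/19 = 64.233.160.0 - 64.233.191.255 (8,192 addresses)
    allow from 64.233.160.0/19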
    Submit Your Webmaster Related Sites to the NB Directory
    I swear, by my life and my love of it, that I will never live for the sake of another man, nor ask another man to live for mine.

  10. Thanked by:

    TopDogger (20 February, 2009)

  11. #20
    TopDogger's Avatar
    TopDogger is offline Über Hund
    Join Date
    Jan 2009
    Location
    Hellfire, AZ
    Posts
    3,141
    Thanks
    350
    Thanked 924 Times in 707 Posts
    I snagged my first spider yesterday.

    Code:
    address is 80.57.190.67, hostname is g190067.upc-g.chello.nl, agent is Java/1.6.0-oem
    I feel like I've gone fishing.

    Will, are you running the spider trap link on multiple pages? Right now, I just have mine on the home page.

