An Unfinished Symphony

It's about the internet and stuff.

Advanced spam control with mod_rewrite

I can't remember where I first got the idea from, but for some time now I've been using mod_rewrite to protect against spam and hack attempts, and it has worked quite well. Essentially, I have a number of rules in my .htaccess file designed to block attacks from "users" displaying common traits – one of those common traits being the absence of a user-agent string from the request headers.

As was pointed out to me yesterday, there's no obligation for any user-agent (UA) to send a user-agent string as part of its request headers. I have no quarrel with that statement at all – except, on this site there is. Considering the vast number of attempts (several thousand per month) by bots, spammers or hackers – whose request headers lack a user-agent string – to directly access comment and contact forms, or to request non-existent files with random-character file names, and considering that in my experience it is very rare for a legitimate visitor not to include one, I decided that a user-agent string was a requirement for visiting here and used the following code to block those without one:

The mod_rewrite code used to block visitors without a user-agent string
  1. RewriteEngine On
  2. RewriteCond %{HTTP_USER_AGENT} =""
  3. RewriteRule .* - [F,L]

Line 1 turns the rewrite engine on. Line 2 sets the condition to be checked for – in this case an empty user-agent string (denoted by the absence of content between the double quote marks) – and line 3 says what should happen when the condition is met, with the F flag stating that the request should fail, returning a 403 (Forbidden) error.
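As an aside, the same check can also be written as a regular expression match rather than the lexicographic comparison used above; a minimal sketch, with everything else left unchanged:

An alternative way of matching an empty user-agent string
  1. RewriteCond %{HTTP_USER_AGENT} ^$
  2. RewriteRule .* - [F,L]

Either form matches a request whose user-agent string is empty or missing altogether, so which one you use is largely a matter of taste.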

As I said above, that has worked quite well for some time and I've been happy with the effect it's had on the amount of spam I've experienced. However, when checking my access logs on a couple of occasions recently I noticed that something had been trying to access a file relating to the Text Link Ads service; in order to check that their adverts are working properly, their server periodically checks publishers' sites to make sure that the adverts are displayed. Whilst this is a reasonable, and sensible, thing to do, it appears that their server fails to include a user-agent string in its request headers – meaning that every attempt to check my site was being rejected by the server, which isn't so good. Consequently, either I had to stop blocking them, or they had to include a user-agent string in their headers.

As my attempts to explain the situation to their support people seemed to be met with misunderstanding, it turned out that I had to stop blocking them. This wasn't as simple as just removing the code from my .htaccess, though, as that would only result in my being bombarded with spam and hack attempts yet again. Instead I had to check for two conditions rather than one, with the extra condition being that the visitor wasn't them. To do that I also checked whether or not the visitor's IP belonged to their server, like so:

  1. RewriteCond %{REMOTE_ADDR} !^12\.34\.567\.89$

That line of code checks to make sure that the visitor's IP is not the one listed (nb. that is just a dummy IP address rather than their actual one). If both conditions are met (not the listed IP and no user-agent string) then the visitor gets blocked. When added to the previous code we get the following:

The amended mod_rewrite code
  1. RewriteEngine On
  2. RewriteCond %{HTTP_USER_AGENT} =""
  3. RewriteCond %{REMOTE_ADDR} !^12\.34\.567\.89$
  4. RewriteRule .* - [F,L]

While that snippet of code will allow them to access my site even when they have no user-agent string in their request headers, and while there's no obligation for one to be included (as mentioned previously), I personally feel that it would be wiser for them to fix their software so that it identifies itself when accessing remote servers. Not doing so makes it quite easy to confuse them with spammers and hackers, who do their best to disguise their actions and methods, and so leaves them liable to be blocked by many other site owners who might take similar measures. Hopefully the support person who responded to my queries will pass the matter on to someone who will understand the issue and be able to do something about it.


Comments on 'Advanced spam control with mod_rewrite'



[…] Dave's full post plus the htaccess code used can be found on his post – Advanced Spam Control. […]


[…] – they just don't care. Dave at ap4a has been back and forth with them because they're using an empty user agent for their bot, just like most spammers and […]


I think .htaccess files do some amazing things, anywhere from 301 redirects to blocking IP addresses. This is an interesting point you made; I think I will try doing this on my site also, making sure to block those bots and crazy spammers.


I think .htaccess files do some amazing things

They certainly do, Todd.

By the way, I remembered where I got the idea of doing this from originally; there's a very comprehensive list of spammer IPs and user-agents available, along with a walk-through on configuring them all. Very useful if you get a lot of junk coming through to your site. It's archived at Dive into Mark.
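To give a flavour of what that sort of list boils down to, here's a minimal sketch of blocking a few user-agents by name – the strings below are purely illustrative examples rather than a recommended or current list:

Blocking requests by user-agent string (illustrative examples only)
  1. RewriteEngine On
  2. RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
  3. RewriteCond %{HTTP_USER_AGENT} ExtractorPro [NC,OR]
  4. RewriteCond %{HTTP_USER_AGENT} WebBandit [NC]
  5. RewriteRule .* - [F,L]

The [NC] flag makes each match case-insensitive, and [OR] joins the conditions so that a match against any one of them is enough to trigger the block.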

Jan says:

Hm, cool idea! However, after the latest update of the pagerank databases I'd turn off the TLA. When I browsed one blog I found one post explaining the penalty given to that site just because of selling text links. Immediately after removing the ads, the traffic returned to normal.


Hi Jan, thanks for the comment!

I understand what you're saying; however, I don't place so much weight on Google PR that I'm going to allow Google to use it to dictate what appears on my site, or who/what I link to. To me such anti-competitive practices do nothing to instill anything other than a feeling of enmity towards them, and I'm not about to make any changes here out of fear for what they might do.

At the end of the day they provide a second-rate search service that is only at number one because the majority of users don't know any better than to use them. That'll change in time: the volume of rubbish results they provide will only continue to rise as they artificially alter them in ways like this, which will eventually result in more people switching to better services such as Yahoo's.


Since posting the above comment, Google has now stripped all PR from this site (that's after it had already dropped a point following the update mentioned by Jan earlier). It's good to see the impartial criteria they place on ranking in action for the benefit of the end user. Pathetic.

jwcoggin says:

Jan,
You are exactly right, but it is also important not to harp on page rank. A good quality blog such as this will have "regular" users no matter what.

jwcoggin says:

David, it is really good to hear from someone who has a page and is not concerned with page rank. Your credit will come in time.

Fine blog!


Cool idea, I may implement it on my site.

Are there any SEO-issues with disallowing empty user-agent strings?


Are there any SEO-issues with disallowing empty user-agent strings?

Not as far as I'm aware, Andreas. Most, if not all, search engine spiders identify themselves with a UA string, so there shouldn't be a problem.


Ok, thanks for sharing. Nice site, bookmarked.


Thanks Andreas 🙂


[…] This related link does a simple test to find if the User-Agent is empty and blocks it if it is: http://www.ap4a.co...with-mod_rewrite/ […]
