Sunday, November 16, 2008

PHP Robot Check

Over the years I've seen quite a few PHP methods for determining whether a request was made by a robot or search engine spider, and they always have something to do with checking the User-Agent header. That works, but you have to keep up with changing User-Agent strings, which gets annoying.
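For reference, the User-Agent approach usually looks something like this. This is a minimal sketch; the pattern list is illustrative and far from complete, which is exactly the maintenance problem.

<?php
// Typical User-Agent sniffing: match the request's User-Agent header
// against known bot signatures. This sample list is nowhere near
// exhaustive; real lists are long and need constant updating.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_robot = (bool) preg_match('/googlebot|slurp|msnbot|crawler|spider/i', $ua);
echo $is_robot ? 'ROBOT !!' : 'Not a robot';
?>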

Today I had an idea for checking for robots using PHP and .htaccess that doesn't involve the User-Agent string at all. Instead it uses sessions and exploits the fact that all well-behaved robots request robots.txt before they request anything else.

I start by making sure my usual robots.txt file is in place. Then I upload a robots.php file to the same location as robots.txt with a tiny bit of code in it.

<?php
// Flag this session as belonging to a robot, then serve the real
// robots.txt so the crawler sees exactly what it asked for.
session_start();
$_SESSION['robot'] = 1;
header('Content-Type: text/plain');
echo file_get_contents('robots.txt');
exit;
?>


All that does is start a session, set a robot flag in $_SESSION that I can check in subsequent scripts, then return the contents of robots.txt as plain text.

In .htaccess I have the following rule, which transparently rewrites requests for robots.txt to robots.php, which in turn returns the contents of robots.txt:

RewriteEngine on
RewriteRule ^robots\.txt$ robots.php [L]


Now in my applications I can easily check for robots and drop things like advertisement banners to speed up page loads, since spiders don't look at advertisements anyway.

<?php
// If the robot flag was set when robots.txt was fetched, this
// request belongs to the same (robot) session.
session_start();
echo isset($_SESSION['robot']) ? 'ROBOT !!' : 'Not a robot';
?>
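As a concrete example, here is a minimal sketch of how a page template might use that flag to skip ad markup. The ad_banner.php include is a hypothetical placeholder for whatever actually serves the ads.

<?php
session_start();
// Robots flagged via robots.php get the page without the ad banner.
$is_robot = isset($_SESSION['robot']);
?>
<html>
<body>
<?php if (!$is_robot): ?>
<?php include 'ad_banner.php'; // hypothetical ad-serving include ?>
<?php endif; ?>
<p>Page content goes here.</p>
</body>
</html>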


Here are some of the benefits of doing it this way.

  1. I can continue to modify robots.txt as I normally would
  2. I don't need to keep up with changing User-Agent strings
  3. Checking for the existence of a session variable is faster than pattern matching against a User-Agent string

4 comments:

Anonymous said...

So if I hit robots.txt on your site, I won't see ads?

Though in fairness this isn't a huge issue; what proportion of people are likely to do that?

Joe Kovar said...

Good point. I guess it would be less likely to happen than someone finding the print-version of a site to drop ads.

The ads bit was just one example of how this could be used.
Surely there are other practical applications.

Anonymous said...

I see a problem in this: oftentimes bots don't store cookies, which means their $_SESSION variables are never reloaded (because the cookie that says WHICH session it is never gets passed). Close, but no cigar. Good idea though, clever.

Mark said...

You can use an API, such as the one at www.atlbl.com, that will catch all the webcrawlers, including the ones that don't bother to check robots.txt.