Saturday, August 21, 2010

Handling 404 Errors

I had previously noted the lack of hard evidence about the ability of 404 error pages to salvage the user experience – but skipped any prescriptive information.

And so, here are some of the practices I’ve used in the past:

1. Paying Attention

Before the problem can be addressed, it must be understood. My approach has been to use an .htaccess file to redirect users to a CGI script that logs five bits of data (page requested, referring page, user-agent, IP address, and date stamp), then aggregating that data and considering it alongside other reports. Without this step, finding 404 errors is very difficult.
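As a rough illustration, here is a minimal sketch of such a logging script in Python, assuming Apache routes errors to it via an ErrorDocument directive in .htaccess (ErrorDocument 404 /cgi-bin/log404.py); the script name and log path are hypothetical:

    #!/usr/bin/env python
    # Minimal 404 logger (sketch). Assumes Apache routes errors here via:
    #   ErrorDocument 404 /cgi-bin/log404.py
    # The log path below is hypothetical.
    import os
    import datetime

    LOG_PATH = "/var/log/www/404.log"

    def main():
        # Apache passes the failed request in REDIRECT_URL when a script
        # is invoked through ErrorDocument.
        record = [
            os.environ.get("REDIRECT_URL", "-"),     # page requested
            os.environ.get("HTTP_REFERER", "-"),     # referring page
            os.environ.get("HTTP_USER_AGENT", "-"),  # user-agent
            os.environ.get("REMOTE_ADDR", "-"),      # IP address
            datetime.datetime.now().isoformat(),     # date stamp
        ]
        with open(LOG_PATH, "a") as log:
            log.write("\t".join(record) + "\n")
        # Send the Status header so the response still reads as a 404.
        print("Status: 404 Not Found")
        print("Content-Type: text/html")
        print()
        print("<html><body><h1>Page not found</h1></body></html>")

    if __name__ == "__main__":
        main()

Apache exposes the originally requested path to ErrorDocument scripts through the REDIRECT_URL environment variable, which is why the sketch reads that rather than the script's own path.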

2. Sorting It Out

The data in those logs are cleaned and sorted into three heaps, each of which requires a different remedy (a small classification sketch follows the three descriptions below):

The first heap is hacker traffic. There are people (or most often, spiders) that will comb a site looking for backdoors into maintenance programs that can be used to gain access to the site. For example, a handful of systems use the address http://yourhost/admin/ as an administrative login, and hackers regularly comb sites looking for that address. (A related tip: if you can help it, don’t put admin logins on the public site, or at the very least put them in locations that aren’t so easy to guess.)

The second heap is internal 404 errors. While there are various causes for this (e.g., a bad reference in your own HTML code), the cases of greatest interest are those where a user visits one of your pages, clicks a link to get to another page, and runs into a dead end. More on that later.

The third heap is external 404 errors. These occur when another Web site links to yours and the link is broken (there’s a typo, the file has moved, etc.), so the user gets a 404 error when they click through. These are the most difficult to address, but are likely the most important.
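As a rough sketch of that three-way sort, assuming the tab-separated log format from the logging script above, a hypothetical host name, and a deliberately short list of illustrative probe patterns:

    import re

    OWN_HOST = "www.example.com"  # hypothetical; your own domain
    # Illustrative probe patterns only; a real list grows over time.
    PROBE_RE = re.compile(r"^/(admin|administrator|phpmyadmin)\b", re.I)

    def classify(page, referrer):
        if PROBE_RE.search(page):
            return "hacker"    # someone probing for a back door
        if OWN_HOST in referrer:
            return "internal"  # one of our own pages carries the bad link
        return "external"      # a broken link on someone else's site

    with open("/var/log/www/404.log") as log:
        for line in log:
            page, referrer, agent, addr, stamp = line.rstrip("\n").split("\t")
            print(classify(page, referrer), page, sep="\t")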

3. Hacker Traffic

My approach to dealing with hacker traffic is to serve them as little content as possible. My standard 404 “redirect” script (not available online just now, as I haven’t taken the time to clean it up) serves up a blank page whenever there’s a file request that looks like someone attempting to find a back door.
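In CGI terms, “as little content as possible” can be as simple as the following sketch, which keeps the 404 status honest but sends an empty body (the probe test would be the PROBE_RE check from the classification sketch above):

    def respond_to_probe():
        # Keep the status honest, but give the prober nothing to read.
        print("Status: 404 Not Found")
        print("Content-Type: text/plain")
        print()  # end of headers; deliberately no body follows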

I’ve been chided for this once or twice by those who suggest it’s possible to turn a hacker into a customer by serving up some promotional content, but I’m not convinced that’s a good idea. Most of this traffic likely comes from programs that don’t bother to read the content – so it’s just wasted bandwidth. And even if it’s a real person, the kind of individual who attempts to hack into your Web site will likely try to take advantage of your business in other ways, so I don’t see the need to put out a welcome mat.

When I notice that a lot of this traffic comes from a specific IP address or user agent, I modify the access permissions on my site to block their access altogether. Again, there’s the argument that a given user may be a hacker one day and a customer the next, but my previous answer holds. The one problem worth considering is that an IP address may be dynamic, such that a legitimate customer might be using it at a later time. That’s a valid concern, but what level of nefarious behavior merits banishing a remote address is a separate question – sometimes, it’s entirely warranted.

4. Internal 404 Errors

Data pertaining to internal 404 errors is fairly simple to tidy up: since these errors come from bad links within your own Web site, you should be able to clean up your own house with minimal effort, using the data in the log file to identify the exact page and link that’s causing the problem.
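A sketch of pulling those cases out of the log, again assuming the tab-separated format and hypothetical host name used above:

    OWN_HOST = "www.example.com"  # hypothetical

    def internal_errors(log_path):
        """Yield (offending page, dead target) pairs for our own bad links."""
        with open(log_path) as log:
            for line in log:
                page, referrer, agent, addr, stamp = line.rstrip("\n").split("\t")
                # A referrer on our own host means one of our pages
                # carries the broken link.
                if OWN_HOST in referrer:
                    yield referrer, page

    for source, target in internal_errors("/var/log/www/404.log"):
        print(source, "links to missing", target)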

It’s worth noting that there are maintenance utilities that can be used to keep your site error-free in this regard, but they tend to choke when a site exceeds a few thousand pages of content, and they’re not very good at finding errors in pages whose content is dynamic. In these instances, a log file really helps.

Ideally, you shouldn’t have any internal errors, and should be able to clean them up in short order if they do arise. My practice has been to set up a maintenance script that would e-mail me this report twice a day so that I could react promptly. Most times, the report came back empty, which is the goal.
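As a sketch of that maintenance script, reusing the internal_errors generator above; the addresses are hypothetical, and the twice-daily schedule would come from cron (e.g., 0 6,18 * * *):

    import smtplib
    from email.message import EmailMessage

    def send_report(rows):
        msg = EmailMessage()
        msg["Subject"] = "Internal 404 report (%d errors)" % len(rows)
        msg["From"] = "webmaster@example.com"  # hypothetical
        msg["To"] = "me@example.com"           # hypothetical
        body = "\n".join("%s -> %s" % pair for pair in rows)
        msg.set_content(body or "No internal 404 errors. Good.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    send_report(list(internal_errors("/var/log/www/404.log")))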

5. External 404 Errors

Data pertaining to external 404 errors is harder to deal with, because the “broken links” are on other people’s Web sites and you have no ability to address them directly.

Fortunately, it’s been my experience that the majority of site operators are attentive to the problem of broken links, and will generally tidy up promptly if you send them a polite e-mail with specific information so that they can easily find and repair the link on their site.

However, not all are as prompt or conscientious as you’d prefer them to be, so there are two ways to deal with the problem yourself:

The best (but most labor intensive) method is to visit the other site to find the bad link, determine what they meant to link to, and set up your 404 error script to redirect the user to the appropriate content (and tweak the analysis program to flag it in future, as it’s been dealt with and should no longer distract you).
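A sketch of such a redirect table inside the 404 script; the paths shown are hypothetical examples:

    import os

    REDIRECT_MAP = {
        # broken incoming path -> intended destination (both hypothetical)
        "/prodcuts.html": "/products.html",
        "/old/pricing": "/pricing/",
    }

    target = REDIRECT_MAP.get(os.environ.get("REDIRECT_URL", ""))
    if target:
        # Issue a permanent redirect instead of an error page.
        print("Status: 301 Moved Permanently")
        print("Location: " + target)
        print()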

The easiest and least effective method is to create a single custom 404 error message that attempts to provide the user with a link to what they were seeking (not just a cute/funny error message). A general-purpose link to the site’s home page, site map, or search engine is better than nothing, but largely insufficient for the user’s needs. You can use the path/file name to get a fairly good idea of what they were searching for and provide a link (either run a search query and return matching pages, or keep a list of expected problems).
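A sketch of deriving a search suggestion from the failed path, assuming a hypothetical search page at /search:

    import os
    import re
    from urllib.parse import quote_plus

    def suggested_query(path):
        # "/articles/blue-widgets.html" -> "articles blue widgets"
        stem = re.sub(r"\.\w+$", "", path)    # drop any file extension
        words = re.split(r"[/_\-+.]+", stem)  # split on common separators
        return " ".join(w for w in words if w)

    query = suggested_query(os.environ.get("REDIRECT_URL", ""))
    if query:
        print('<p>Were you looking for '
              '<a href="/search?q=%s">%s</a>?</p>' % (quote_plus(query), query))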

When to Bother

As a final note to all of this, the obvious question is “why bother?” The purist might argue that this level of effort should be undertaken for every site you build, but I would disagree.

I don’t invest this level of effort in most of the personal or frivolous sites I operate, because they get a low level of traffic and I’m not making any income from them. That’s not to say I turn a blind eye to 404 errors completely, merely that there aren’t that many and there’s no return on investment, so I check the logs about once a month just to tidy up.

On the other hand, when a site gets a significant amount of traffic (I draw the line at 100,000 unique visitors per month) and generates significant revenue, then there is certainly value in attempting to salvage those visitors who have arrived at a dead-end.
