An Introduction to Log File Analysis for SEOs & Webmasters
It doesn’t matter what sort of web analytics suite you choose to use… If you’re not actively reviewing your log files, you’re missing out on some key data and reporting metrics. Today’s blog post is all about log files and how you can begin using them to learn more about what is happening on your web site.
My goal here is to encourage you to start using log files on a regular basis to better tune your hosting and marketing platforms.
Page Tagging Vs. Log Based Analytics
When you think about web analytics, what do you really think of? The majority of people I know tell me all about their nicely designed dashboards filled with telling graphs, growth charts and tables that sit online waiting to be viewed whenever needed. While that’s all well and good – those reports always come out of page tagging analytics programs. The problem? Page tagging analytics has limitations, and some of those limitations are simply unacceptable for hardened SEOs and webmasters.
That’s not a knock on page tagging analytics either. Page tagging is a popular method of acquiring data; its ease of use and on-demand reporting make these tools a required resource. Log files simply help me go that extra mile.
I want you to guess how many of the following analytic suites are providing you with reports and data generated from, in part, log files:
- Google Analytics
- Omniture
- Microsoft adCenter Analytics
- Sitemeter
- Quantcast
- Compete
- HitWise
Ready for the obvious answer?
— Zero! None of the above use data recorded by your server for statistical analysis.
Olivier Amar of CompuCall earned some kudos this morning. When I asked how many of my followers were not out there checking their logs, he tweeted a reply about ClickTracks – one of the few analytics suites out there for SEOs and site owners that actually integrates log files out of the box.
I don’t want to get into a whole lecture about the differences between page tagging analytics and log parsers (or hybrid solutions for that matter). What I do want you to realize is that no matter the hosting platform, there is more useful information you could be extracting about your web site and your visitors if you can acquire the logs.
Familiarize Yourself with Log Files
Before we jump too far in, it’s probably best for us to review what a server log file is, what it looks like, what data it contains, etc.
What is a Server Log File?
Wikipedia defines a server log as:
A server log is a log file (or several files) automatically created and maintained by a server of activity performed by it.
A typical example is a web server log which maintains a history of page requests. The W3C maintains a standard format[1] for web server log files, but other proprietary formats exist. More recent entries are typically appended to the end of the file. Information about the request, including client IP address, request date/time, page requested, HTTP code, bytes served, user agent, and referer are typically added. These data can be combined into a single file, or separated into distinct logs, such as an access log, error log, or referrer log.
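To make those fields concrete, here’s a minimal Python sketch of mine – not part of any package mentioned in this post – that pulls one line of a combined-format access log apart into named fields, assuming your server writes the combined format shown in the entries later in this post.

```python
import re

# One line of the Apache/NCSA combined log format:
# IP, identity, user, [timestamp], "request", status, bytes, "referer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one access log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    sample = ('66.249.72.136 - - [29/Mar/2009:01:12:28 -0700] "GET /robots.txt HTTP/1.1" '
              '200 508 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
    print(parse_line(sample))
```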
How Do I Retrieve Server Logs?
Each hosting provider or company handles this differently. My hosting company makes it easy for me by keeping logs available via FTP on a 7 day cycle before any logs are removed. I have adapted to just pull those logs down off my server once a week through an automated application. Set it up once, and now I can forget about the hassle.
I’ve seen other hosts that make log acquisition more… Trying. In any event, server logs have a number of different recording options, structures, formats and file types. This post is focused on using the logs you have available to you – not acquiring them. I highly recommend working with your server administrator or hosting provider to acquire access to logs if you do not have that already.
If you are ever presented with an option – push for the W3C Extended Log File Format and then quickly hand your hosting provider or server admin a copy of this resource from the W3C.
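If you’d like to automate the weekly pull I mentioned above, here’s a rough Python sketch of how that can look. The host name, credentials and paths are all placeholders – your host’s FTP layout will almost certainly differ.

```python
import ftplib
import os

# Placeholder values -- swap in your own host, login and log directory.
FTP_HOST = "ftp.example.com"
FTP_USER = "username"
FTP_PASS = "password"
REMOTE_LOG_DIR = "/logs"
LOCAL_LOG_DIR = "logs"

def pull_logs():
    """Download any remote log files we don't already have a local copy of."""
    os.makedirs(LOCAL_LOG_DIR, exist_ok=True)
    with ftplib.FTP(FTP_HOST) as ftp:
        ftp.login(FTP_USER, FTP_PASS)
        ftp.cwd(REMOTE_LOG_DIR)
        for name in ftp.nlst():
            local_path = os.path.join(LOCAL_LOG_DIR, name)
            if not os.path.exists(local_path):
                with open(local_path, "wb") as fh:
                    ftp.retrbinary("RETR " + name, fh.write)

if __name__ == "__main__":
    pull_logs()
```

Schedule something like this with cron or Task Scheduler and you really can set it up once and forget about the hassle.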
What Does a Server Log Look Like?
Here are five lines I pulled out of a server log file from my blog as recorded yesterday, March 29, 2009:
85.89.185.215 – – [29/Mar/2009:01:00:09 -0700] “GET /wp-content/uploads/2007/09/100cap006.jpg HTTP/1.1” 200 46012 “http://www.ironworksforum.com/forum/showthread.php?p=1200167” “Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/530.1 (KHTML, like Gecko) Chrome/2.0.169.1 Safari/530.1”
38.99.107.141 – – [29/Mar/2009:01:00:10 -0700] “GET /feed HTTP/1.1” 302 5 “-” “Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)”
193.252.149.15 – – [29/Mar/2009:01:07:24 -0700] “GET /276.html HTTP/1.1” 200 24655 “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (support.voilabot@orange-ftgroup.com)”
69.147.112.169 – – [29/Mar/2009:01:11:01 -0700] “GET /feed/rss HTTP/1.0” 302 0 “-” “Yahoo Pipes 1.0”
66.249.72.136 – – [29/Mar/2009:01:12:28 -0700] “GET /robots.txt HTTP/1.1” 200 508 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
In a 12-minute span on my web site, some pretty cool things happened that I would never have known about through any of the page tagging analytics suites I’m using.
Let’s dissect each of these five lines and I’ll show you what I mean.
Entry #1
85.89.185.215 – – [29/Mar/2009:01:00:09 -0700] “GET /wp-content/uploads/2007/09/100cap006.jpg HTTP/1.1” 200 46012 “http://www.ironworksforum.com/forum/showthread.php?p=1200167” “Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/530.1 (KHTML, like Gecko) Chrome/2.0.169.1 Safari/530.1”
Someone is stealing my images! Like a lazy webmaster, I don’t lock much down to prevent other people from using my files. This log entry gives me evidence of someone using one of my images (regularly, I might add) on another web site’s discussion board.
This log file entry tells me that this person is using this image on this discussion thread.
Not cool! Now for me, bandwidth isn’t much of an issue and I don’t really mind if someone is repurposing that image. If that were protected photography though – I’d want to keep it under lock and key. More on this later.
The key here though is that the actual “page” being loaded up (the discussion board thread or user profile page) is hosted elsewhere. Since I don’t own that site, I don’t have Google Analytics code on the site and without this log file, I never would have known that this was taking place.
When you consider how often this could happen with a large web site – you can probably see how quickly this can become a big issue.
Entry #2
The next log file entry was this:
38.99.107.141 – – [29/Mar/2009:01:00:10 -0700] “GET /feed HTTP/1.1” 302 5 “-” “Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)”
As the tail end may suggest to you, this is a FriendFeed bot that’s coming through my web site and pulling a copy of my blog’s feed. FriendFeed’s bot will then check whether there are any new entries and pull them via RSS to use on its own site, since I’ve allowed it to do so.
If you’re watching things like page views, this hit wouldn’t count in other analytics since the “visitor” requesting the data never actually loaded a tagged page on my web site. The other issue? The “user” here is actually a bot, and my guess is that, like Googlebot, it won’t bother executing the JavaScript code required for page tagging analytics to record the hit.
Entry #3
Next up:
193.252.149.15 – – [29/Mar/2009:01:07:24 -0700] “GET /276.html HTTP/1.1” 200 24655 “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8.1) VoilaBot BETA 1.2 (support.voilabot@orange-ftgroup.com)”
An old post on my blog on the Internet Marketer’s Charity Party at SES San Jose is being retrieved here by another bot, this time called VoilaBot. Ever heard of VoilaBot before? Sadly, I had not – which is more telling about my failures as an International SEO.
Voila is the provider for Wanadoo, which is a huge portal in France and one of the biggest European ISPs.
Voila itself is one of the best known web brands in France.
Where’d I get that information? From heini, a veteran user on Brett Tabke’s WebmasterWorld, silly.
Entry #4
Still with me? Good, because we’re going to go easy on these last two entries to review! Next is…
69.147.112.169 – – [29/Mar/2009:01:11:01 -0700] “GET /feed/rss HTTP/1.0” 302 0 “-” “Yahoo Pipes 1.0”
This is the footprint of Yahoo! Pipes, a fairly new RSS / News Aggregator that’s actually quite cool. All that was happening here is that a user of the Pipes program was loading up (or refreshing) my blog’s RSS feed. Again – this would never show up in anything like Google Analytics or Omniture. Why not? You know this. Just read the last three log dissections. :)
Entry #5
And finally… The staple of any SEO’s diet… Googlebot!
66.249.72.136 – – [29/Mar/2009:01:12:28 -0700] “GET /robots.txt HTTP/1.1” 200 508 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”
A well behaved Googlebot, too! The above request is the mark of GOOG coming through and requesting my blog’s robots.txt file for some more direction. It’s always nice when bots do what they’re supposed to do – and check robots.txt first – right?
Now, Onto YOUR Log Files…
You don’t really care about what’s happening here on my blog — you want to see what’s going on with your web site. So now we get to take a look at how to make these log files work for you!
What You’ll Need
1.) Server Log Files
2.) Server Log Parsing Application
3.) Curiosity
Again, I’m not helping you with item number one.
With item number two, I’d recommend WebLog Expert. It’s an application that I’ve been using for years and bought the professional version of some time ago. Considering the low cost, I’d recommend it – but there are certainly other log file analyzers available to you.
Just check out download.com or directory listings on the Yahoo! Directory or on DMOZ.
Since WebLog Expert offers a free BETA version with some filtering options though, I’ll use them for screen shots.
Here are some report ideas I’m going to demonstrate for you…
- Google, Yahoo & LiveSearch Spidering
- Stolen Content
- 400 Errors
- 300 Server Redirections
Again, I’m using WebLog Expert as the log file analyzer here because it’s a free solution and provides some easy filtering options. The key is in using these filters to look at very specific data.
Google, Yahoo & LiveSearch Spidering
Log files record the user agent of each request. When a human visitor comes to your site, the name of their web browser is recorded. Refer back to log entry #5 above to see how Googlebot identifies itself: it tells my web server that its user agent is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
In order to report on the spiders, you need to set up a filter that excludes all activity outside of the spiders. I’m going to take this one step further and show you how to set up filters in WebLog Expert that only pull activity on the big three – Google, Yahoo and LiveSearch.
Using the filters dialogue, you will need to add a new filter that includes activity based on spider name, and then select each of the appropriate spiders from the drop down list. To do this, you’ll need to set up three filters like so:
Once you set up three filters, one for each, you should see this:
And if you do, just click through the Finish button and then run your report. You’ll now get a wealth of information on your spidering activity.
Want to see what data is available? Click here to download the resulting report in PDF format. Here’s a hint at what you may find out… Yahoo! Slurp is sometimes a little more… aggressive than you may think:
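If you’d rather skip the GUI, here’s a rough Python sketch that approximates the same report by counting hits per crawler based on user-agent substrings. The labels and the “access.log” filename are mine, not WebLog Expert’s, and keep in mind user agents can be spoofed – a serious audit would also verify the crawlers’ IP ranges.

```python
from collections import Counter

# Substrings that identify the big three crawlers in the user-agent field.
SPIDERS = {
    "Googlebot": "Googlebot",
    "Yahoo! Slurp": "Slurp",
    "Live Search (msnbot)": "msnbot",
}

def spider_counts(log_path):
    """Count requests per crawler by looking for known substrings in each log line."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            for label, needle in SPIDERS.items():
                if needle in line:
                    counts[label] += 1
                    break
    return counts

if __name__ == "__main__":
    for label, hits in spider_counts("access.log").most_common():
        print(f"{label}: {hits} requests")
```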
Stolen Content
This report – even though I’m using it as an example for this post – really is one I need to take action on. The goal here is simply to find anyone out there who may be using my CSS or images on their own sites or for their own needs. If you were to run this same report, I’d suggest that as an action step you take measures to prevent your images from being hotlinked, and so on.
With WLE, you’ll want to create the following filters:
Replace ericlander.com with your own domain, and, add or subtract any files you’d like to see in there. Other popular files to be stolen and reused? mp3, pdf, swf, avi, mpg, mov, and css lead the way for me.
Again, a sample report output of the above can be found here in PDF format.
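For those working from raw logs instead of WLE, here’s roughly the same filter expressed as a Python sketch: it flags requests for watched file types whose referer is a page outside your own domain. The domain, extension list and log filename are just examples – adjust them for your own site.

```python
import re
from urllib.parse import urlparse

# File types worth watching for hotlinking; add or remove extensions to taste.
WATCHED_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".css", ".swf", ".pdf", ".mp3")
MY_DOMAIN = "ericlander.com"  # replace with your own domain

LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')

def hotlink_hits(log_path):
    """Yield (path, referer) pairs where a watched file was requested from an external page."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if not m:
                continue
            path, referer = m.group("path"), m.group("referer")
            if not path.lower().endswith(WATCHED_EXTENSIONS):
                continue
            host = urlparse(referer).netloc.lower()
            if referer not in ("", "-") and MY_DOMAIN not in host:
                yield path, referer

if __name__ == "__main__":
    for path, referer in hotlink_hits("access.log"):
        print(f"{path} hotlinked from {referer}")
```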
400 Errors
One of the most useful reports for me over the years has been this one, which only looks at 400-type responses. Any 400-level response from your server indicates a request that couldn’t be fulfilled. The most popular of these is the 404 error we’re all used to seeing – but there are other useful client errors to note, including the following table from HTML Goodies:
- 400 : There is a syntax error in the request. It is denied.
- 401 : The header in your request did not contain the correct authorization codes. You don’t get to see what you requested.
- 402 : Payment is required. Don’t worry about this one. It’s not in use yet.
- 403 : You are forbidden to see the document you requested. It can also mean that the server doesn’t have the ability to show you what you want to see.
- 404 : Document not found. The page you want is not on the server nor has it ever been on the server. Most likely you have misspelled the title or used an incorrect capitalization pattern in the URL.
- 405 : The method you are using to access the file is not allowed.
- 406 : The page you are requesting exists but you cannot see it because your own system doesn’t understand the format the page is configured for.
- 407 : The request must be authorized before it can take place.
- 408 : The request timed out. For some reason the server took too much time processing your request. Net congestion is the most likely reason.
- 409 : Conflict. Too many people wanted the same file at the same time. It glutted the server. Try again.
- 410 : The page used to be there, but now it’s gone.
- 411 : Your request is missing a Content-Length header.
- 412 : The page you requested has some sort of pre-condition set up. That means that if something is a certain way, you can have the page. If you get a 412, that condition was not met. Oops.
- 413 : Too big. What you requested is just too big to process.
- 414 : The URL you entered is too long. Really. Too long.
- 415 : The page is an unsupported media type, like a proprietary file made specifically for a certain program…
The filter setup here is super simple. Just create this one:
And the resulting report looks like this (again, PDF!).
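As a rough do-it-yourself alternative to the WLE filter, a short sketch like the one below can tally error responses per URL straight from the raw log. The function and file names are made up for illustration, and the status prefix is parameterized so the same code works for other response classes.

```python
import re
from collections import Counter

LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3})')

def status_report(log_path, status_prefix="4"):
    """Count requests per URL whose response code starts with the given digit (e.g. '4' for 4xx)."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if m and m.group("status").startswith(status_prefix):
                counts[(m.group("status"), m.group("path"))] += 1
    return counts

if __name__ == "__main__":
    for (status, path), hits in status_report("access.log").most_common(25):
        print(f"{status}  {hits:>5}  {path}")
```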
300 Redirections
Every SEO needs to have a grasp of 301 redirects, and reporting on the ones your server dishes out is super simple here. Just like the 400-responses, you’ll need to set up a quick filter that only pulls 300-level response codes. Easy!
The value here for an SEO is pretty obvious – so I’ll let you run with why this report is useful. To check out the sample in PDF format, just click here.
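(And if you’re working from the hypothetical status_report() sketch in the 400 section rather than WebLog Expert, calling it with status_prefix="3" gives you the same style of report for your redirects.)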
Wrapping Up…
Hopefully this post has given you some more insight on how you can begin analyzing server log files. If I’ve confused you at any point, please do drop a comment below and open up a discussion for us as others may have similar questions or hangups.
Don’t be afraid to get creative with filters, too, in WebLog Expert or any other application you may find yourself using. It’s very easy to use filters to extract in-depth metrics like time spent on site by visitors viewing movies, the path of visits referred from Digg, bounce rate for StumbleUpon referrals, etc.
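As one last example of that kind of creativity, a small referrer breakdown like the sketch below can isolate traffic from a specific source such as Digg or StumbleUpon. Again, the log filename is a placeholder, and this only approximates what a full analyzer’s referrer reports give you.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the quoted request, the status code, the byte count, then captures the referer field.
REFERER_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')

def referring_domains(log_path):
    """Count hits by referring domain -- handy for spotting Digg or StumbleUpon traffic."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = REFERER_RE.search(line)
            if not m:
                continue
            host = urlparse(m.group("referer")).netloc.lower()
            if host:
                counts[host] += 1
    return counts

if __name__ == "__main__":
    for host, hits in referring_domains("access.log").most_common(20):
        print(f"{host}: {hits}")
```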
Finally, this isn’t meant as a knock on page tagging analytics and the information they offer. Every successful web site marketer should rely on both regularly – but when it comes to running a clean site, don’t just assume the logs have nothing to provide to you.
37 thoughts on “An Introduction to Log File Analysis for SEOs & Webmasters”
Eric, that is awesome. I’m going to read through a few more times before letting it settle in. You have a lot of stuff up there and most of it is going to go over quite a few heads. A few years ago, I would have said the same for myself. I’ve since learned. :)
Another error code you’ll want to track is 500s. They will make or break a site. Those usually occur when the visitor takes the time to fill in requested information and then gets a blank screen or nothing happens. Maybe the session was interrupted, maybe this, maybe that. Either way, we track those 500s the same way we do 400s. I want to know every error encountered and be able to take a proactive approach to fixing them. My largest client no longer sends us emails when something may go wrong. They know the system is alerting the dev team the moment the error is encountered and someone is working to correct it. That’s what you call having total faith in your provider. We do serve friendly 400/500 pages and tend to get creative with some of the 400s. :)
Custom Error Pages – 403 Forbidden
http://www.SEOConsultants.com/errors/403/
We also have a nifty page in place for the 401 series too. You’ll have to invoke those. Be careful though, too many and our system will Blacklist your IP. We don’t take any prisoners either. ;)
Absolutely terrific. Are there Mac programs like WLE? I hope so….
Hey Eric, first of all I’d like to thank you for presenting things in such a clear way. Even though the topic is not familiar to me, I could follow you only because of your step by step explanation.
Very excellent information. Keep it up. Thanks a lot – such a nice post.
Thanks for the enjoyable article. Local logs are so often neglected and can provide a little more depth over web-based analytics, plus you don’t have to surrender your viewership information to a third party.
WebLog Expert is a wonderful program and it’s my log analyzer of choice, but a close second is the completely free and very extensible Analog. Analog is more difficult to configure but provides more flexibility in the end, so with a little investment in time it can provide only the metrics most important to you! It works on Windows and Linux, and perhaps Mac though I’ve never tried.
http://www.analog.cx/
Gr8 post. I have been using AWStats for a while and it seems to be pretty good. Will also try WebLog Expert
Thanks Eric. This was a great contribution to the community and I am sure it took you a while on this one.
Yes thanks Eric, I will try and digest this into little bite-sized chunks
Eric,
Head’s up: WordStream provides reports and data generated from your own server log files and integrates them with their keyword management tool.
http://www.wordstream.com/
Cheers,
Ken
Thanks Eric, this is extremely helpful – even for a dinosaur like me! I’m learning and digesting. So, here’s a question (and I apologize if it’s stupid): if a site has 301 redirects on its homepage, does this in any way affect the unique visitor count as reported by Google?
Thanks
This is really not my area. But deep down I know I should get around to it. So thanks for the intro… very helpful.
Thanks for the great post, Eric. The examples are very useful. This adds another thing to aggregate in a monthly statistics report for clients besides social media stats. It is important for having a whole picture to act on, though.
Shoot I didn’t realize this was an issue. Back in the day all analytics packages were log driven, even the old Urchin and Webtrends.
Been using WLE for years now. Their tool is quick at processing and very easy to use. Many of us out there have been too addicted to JavaScript solutions like GA, so it is nice to see someone encouraging everyone to dig into those logs and see what is really happening.
I am sure you opened a lot of people’s eyes with this post Eric.
Very good information. I was unaware of server log files, thank you for briefly illustrating server log files and how they can help us.
Thank you again…
Eric you keep astounding me!
I had to learn a bit about logfile analysis on my own a while ago because, due to pure laziness, all my test sites run on Cpanel, which uses AWstats. In order to make some serious sense of the data and dig deeper, I had to break out the log files.
Most of the tips you have given here are very useful to logfile n00bs like me.
Eric,
Great job. The lazier we get, due to easy tools like Analytics, the more we forget how valuable the raw data within a log file can be. Thanks for sharing!
roblaw
Great article Eric. I tried to download the “300-codes.pdf” and it was corrupt. Not sure if this was an issue on my end or with the file on the server, but all other PDF reports downloaded OK to my machine.
Sweet! Another reason why Eric is the man: log file analysis. A true player :)
We wrote a parsing script awhile back that will allow you to pull out bot crawls using the terminal. Unix chops required: http://www.audettemedia.com/blog/seo-diagnostics-tool
I use it every week. Crucial!
Nice Write-up Eric. I’m a fan of Nihuo Web Log Analyzer (www.loganalyzer.net/). It’s very similar to Web Log Expert, but Web Log Expert crashed on two different PCs that I tried it on.
Eric, thanks for this great tutorial. I didn’t realize log files were so valuable and could be used in conjunction with a Google Analytics-type tool. So glad I found this…
Good article about log file analysis and what’s missing in some popular tag-based web analytics packages. I like Weblog Expert. A few months back I compared several free log file tools and Weblog Expert was the best I found by a long shot.
I will have to take this post in doses and digest it. The more technical you can get, the bigger the advantage.
Nice explanation! Very interesting. I just somehow landed on this page, and after going through this I definitely want to go through all the other posts.