A Complete Guide to Log Analysis with Big Query

Curious about log analysis? This guide will cover everything from why you should do it, to getting logs, uploading and analysing them.

Will you need any programming skill to follow this? No, but we’ll provide some options for those who do.

Will you already need to know SEO? Yes. If you don’t then hopefully the why will be interesting to you, but log analysis is no good by itself. To take advantage of all the information it’s giving you, you’ll need to know your SEO basics and be able to crawl and understand the possible problems it will highlight.

What’s the TL;DR for this article? You should use analyse logs to gain valuable insights about how Google crawls and views your site, discover and monitor errors and  prioritise decisions, by using BigQuery your analysis will be easy and repeatable and you can use all the amazing queries we already wrote.


  1. What is a log?
  2. Why should you do log analysis?
    1. Diagnosing crawling and indexing
    2. Prioritisation
    3. Spotting bugs and checking site health
    4. Seeing how important Google sees parts of your site
    5. Measuring the freshness of your content
  3. How to get logs
    1. Is there PII
    2. Are all the log files in the same place?
      1. Mobile sites
      2. Website sections
      3. Website caches
  4. Different tools that can be used to analyse logs
    1. BigQuery
    2. Excel
    3. ELK
    4. Commercial Log Tools
    5. Your very own SQL server
  5. How to process logs
    1. Processing logs with Screaming Frog Log Analyser
    2. Uploading logs to BigQuery
  6. How to analyse logs
    1. Verifying your data
    2. A quick SQL tutorial
    3. Compare your data to Google Search Console
    4. What is the process for analyzing logs
  7. Asking the right questions

What is a log?

Every time you visit a website, the server sends you the page and writes down some basic information about the person asking for it. That’s one line for every single request from anyone, human or bot that accesses your website.

A log file entry looks like this: – – [23/Aug/2010:03:50:59 +0000] “GET /my_homepage HTTP/1.1” 200 2262 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://ift.tt/eSXNch)”

  • IP Address: This is the IP of the person or bot visiting the website. It’s their internet address.
  • Timestamp: This is the time the request was made (and the time zone)
  • Request Type: A GET request (which is what you see most of the time) is a person or bot asking for a page, a PUT request is someone sending information (e.g. sending a form).
  • Page: the page being requested, in this case, my_homepage.
  • The protocol: This will always show HTTP, if you have a mixed HTTP/HTTPS website you’ll need to get your developer to set-up tracking for this.
  • Status Code: A number indicating the server’s response type, e.g. 200 – request returned, 301 – request redirected etc.
  • Page size in bytes: How many bytes the file the user downloaded is.
  • User agent: The name of the bot, or the name of the browser version a user is using to access the website.

Why should you do log analysis?

In the queries section you’ll be able to see all the detailed questions you can ask with log analysis, but broadly most fit into 5 main buckets:

  1. Diagnosing crawling and indexing issues
  2. Prioritisation
  3. Spotting bugs and checking site health
  4. Seeing how important Google sees parts of your site
  5. Measure the freshness of your content

Diagnosing crawling and indexing issues

This is the bread and butter of log file analysis. If Google is having problems crawling your site then it can have a lot of negative effects:

  • If your content isn’t indexed, then you’re missing out on potential traffic.
  • Any changes you make can take a long time to take effect delaying results.
  • Following on from that if you’re attempting to measure the effect of changes on your site, and all the pages update at different times it’s difficult to measure the changes you’ve made.
  • Problems crawling are a common indicator that you may have accidentally generated a lot of thin or duplicate content.

Thankfully this is something log files are great at. By examing where Google is crawling and where it spends it’s time you can narrow down exactly where it is spending too much time, where it isn’t spending enough time, which sections of your site aren’t crawled often enough and where it shouldn’t be crawling at all.


One of the most difficult parts of SEO is prioritising the different issues, that you come across. For example:

  • Which is more of a problem on my site, a redirect chain or thin content?
  • I have two different sets of internal link 301 redirects, which is causing Google more problems?

By seeing where Google spends its time on your site you can prioritise the areas that will affect it the most.

Spotting bugs and checking site health

Particularly on large websites, lots of changes will be happening at once and it can be difficult to keep track of them. Even on small sites, this can be a problem as although the website is smaller, the teams are smaller and have less time to keep track of things.

All these changes can often break things and it can be weeks before you discover a set of links that 404 or part of the site that 302 redirects.

You could scrape your site every day, but that’s a lot of work and you may also be working a site that’s too large to crawl.

Google Search Console provides an interface for checking for these kinds of errors but it has a lot of problems:

  • You get limited to 1000 results for each error type
  • Old errors can stick around in the interface if you’re not keeping on top of it
  • There is no prioritisation
  • There is a delay between the errors being created and them appearing in search console.
  • You can only see errors, not redirects.

Log file analysis solves all the problems with the Google Search Console.

How important does Google see parts of your site?

We’re started to get more complex now. Looking into your log files can also give you an idea of how important Google views different pages or page templates on your site.

Broadly speaking, one of the larger factors that affect how often a page is crawled on your site is how valuable Google thinks it.

Does this mean if you line up all your pages in order from most crawled to least, you’ll see which pages are important? Not necessarily, as there are other things can affect it such as frequency of updates. For example RSS feeds are crawled a lot because they constantly update with content, not because they’re an important page Google wants to rank well.

For example, if you own an e-commerce and on average all your products are being crawled as often as your category pages, it seems likely Google hasn’t understood your hierarchy. On the other hand, if one product URL is getting crawled far more, perhaps it’s particularly controversial  and has received a lot of social traction or press mentions then it’s these other circumstances which are most likely to be causing this.

Measuring the freshness of your content

There’s an excellent post by Cyrus Shepard which explores why freshness might be important for your content, by looking at some of the techniques mentioned in a Google patent.

However often you update your content, though, it’s only fresh if Google is discovering those updates. If you update your page 3 times a day and Google only crawls it once a week, you’re wasting a lot of time. Either you need Google to crawl your page more frequently or you can save some time and update your content less.

Log files let you measure that freshness.

You can calculate how long it takes for Google to crawl a page when it’s published or updated or calculate the average number of times a page is crawled a day and measure the impact of any changes you make to improve that freshness.

How to get logs

If you reached here, you’ve been convinced or you were already convinced. So onto the how.

First, we need to get the logs and to do that, we need to ask our developers some information about the log set-up to make sure we get the correct logs. This information gathering also has the dual purpose of making it more likely that we get the logs, particularly if you’re working for an agency in which case some companies will be more reluctant to hand over the data.

So what do we need to find out?

Is there any PII?

Log files can contain personally identifiable information, things like emails, telephone numbers etc.

Companies are understandably careful with this information so the first thing we want to make clear is that we don’t want any of it.

If the log files contain any, ask for it to be removed before handing over the logs.

Are all the logs in the same place?

We need to make sure we get all the logs for everything we’re interested. These are some common places where logs can be different places and we need to make sure we explicitly check for it:

Mobile sites

If a website has a www. And an m. site e.g.

Sections of the website that look notably different

What looks like one website to you from the front might actually be two different websites at the backend.

If that’s the case each section might log to different places and we need to get both parts of the logs.

Website caches

The other common example is if the website is running some sort of caching, which can separate the logs. What on earth does that mean?

Take the example below:

When user 1 asks for a page, the server generates the page and sends to them. It goes through a cache which makes a copy of the page.

When user 2 asks for the same page, the cache spots this and sends them the copy it made without the server having to do anything.

This cache can log somewhere different from the server, so we have to make sure we get both sets of logs.

How long can we get the logs for?

We want enough log data to establish a useful baseline, typically three months is plenty but on particularly large sets where sets of pages may be crawled very rarely you might need more.

Some websites will only store one month of logs at a time, so we need to ask for those logs to be stored somewhere for us so we can access more than one month at a time.

Do they log hostname?

Let’s take an example e-commerce sites where they have two subdomains:

  1. www.example.com/tvs
  2. example.com/tvs
  3. blog.example.com/tvs

The first page is the category for TVs, the second is a duplicate version which hasn’t been correctly redirected and  the second is the blog category for content on TVs.

The crucial difference between these 3 URLs is the hostname, www, blog or *blank*.

Unfortunately in some cases when logs are aggregated together a website won’t store hostname which means all three of these pages look identical.

We need to ask if they record hostname and have it turned on if they don’t.

Is there too much log data?

Remember what we said earlier? Log files contain every request made to your server. That’s a lot of requests.  For very large websites, or websites who receive a lot of spam traffic  this can mean the files we get given are too large for us to deal with.

We need to find out how large the log files are. If they’re more than we can comfortably work with (for the non-technically minded, this limit is around 4-6GB), or your developer is complaining about the size of them, there is one easy action we can do to cut down on the size.

As we’re primarily interested in Googlebot (sorry Bing), we can have our developer strip out everything which isn’t Googlebot.

The following command (which you don’t need to understand) will catch 99.9% of Googlebot requests and make your files a lot easier to work with.

grep -i -E “googlebot|mediapartners-google|adsbot-google”

(Of course, if you are interested in a different search engine you’ll need to change the string between the apostrophes.)

An email for a developer

Here’s a copy of an email that I send to developers at the start of a project to find out about these areas and narrow down the problems.

Hi x

I’m {x} from {y} and we’ve been asked to do some log analysis to understand better how Google is behaving on the website and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).

What we’d ideally like is 3-6 months of historical logs for the website. Our goal is look at all the different pages search engines are crawling on our website, discover where they’re spending their time, the status code errors they’re finding etc.

There are also some things that are really helpful for us to know when getting logs.

Do the logs have any personal information in?

We’re just concerned about the various search crawler bots like Google and Bing, we don’t need any logs from users, so any logs with emails, or telephone numbers etc. can be removed.

Do you have any sort of caching which would create separate sets of logs?

If there is anything like Varnish running on the server, or a CDN which might create logs in different location to the rest of your server? If so then we will need those logs as well as just those from the server.  (Although we’re only concerned about a CDN if it’s caching pages, or serving from the same hostname; if you’re just using Cloudflare for example to cache external images then we don’t need it).

Are there any sub parts of your site which log to a different place?

Have you got anything like an embedded WordPress blog which logs to a different location? If so then we’ll need those logs as well.

Do you log hostname?

It’s really useful for us to be able to see hostname in the logs. By default a lot of common server logging set-ups don’t log hostname, so if it’s not turned on, then it would be very useful to have that turned on now for any future analysis.

Is there anything else we should know?



Different tools that can be used to analyse logs

Once we’ve got our logs we need to choose how to analyse them.

Broadly I think there are 6 criteria that should be used when judging a tool for log analysis:

  1. Can we ask powerful and complicated questions?
  2. Is the analysis repeatable?
  3. Can we combine with crawl data?
  4. Is the tool easy to setup?
  5. Is the tool easy to learn?
  6. Is it scalable?
  7. Is it cheap?

The most notable one which might not apply to smaller businesses is number 5.

Agencies will need a tool that’s scaleable, large or fast growing businesses will also need that, however for small some small to medium businesses, 5 will be less of a concern.


BigQuery is the tool that I’d recommend.  It’s Google’s cloud scalable database for data analysis.

For our purposes there are two important things about it:

  1. It’s a database already set-up in the cloud, so it needs no set-up, scales well and is dead cheap to use.
  2. It lets us use SQL to analyse our logs. Don’t worry if that sounds scary. It’s actually dead simple.

Why is SQL the right tool for the job?

It’s worth going over specifically why SQL is the right tool for this job. As  criteria 1,2 & 5 are solved by SQL.

First, it’s easy to learn. Writing SQL is like writing a request to a very logical human being. E.g

price_range = “affordable”

Without much thought I’m sure many of you could guess what that might return:









It’s  also very powerful. While that last query was very basic, SQL was designed to be able to support very complicated queries as well. We could write a single query to tell us on average how long it takes for our pages to be crawled by Google.

Lastl it’s repeatable. If we had a different list of cars, we could copy and paste our query, only changing the FROM and get the same set of information for our new table:

   price_range  = “affordable”

But there are various other options, which are worth touching on:


Sorry Excel, you’re just not right for log analysis. You struggle to handle all the data we’re going to through at you without slowing down and crashing.

Asking complicated questions like calculating how long on average it takes for a page to be crawled by Google, take a long time with multiple steps and it’s only repeatable if you can code in Excel. And if you can code VBA wouldn’t you rather code in something else instead?

ELK Stack

ELK stands for Elasticsearch, Logstash and Kibana.  This is a series of programs that can be set-up to monitor all the logs on your server.

Typically it will be set-up by a developer to monitor the health of  a server, however, it can be hijacked by an enterprising SEO to keep an eye on Googlebot in real-time.

The main 3 problems with the ELK stack are that

  1. It’s not set-up to allow you to easily combine in crawl data
  2. The query language it uses isn’t as intuitive as SQL, the syntax is more complicated. (Which is sort of like trying to learn a language, where one has very few grammar rules and the other has a lot. The second will be harder.)
  3. You’ll have to set it up yourself which can be difficult.

However, if your developers have already set-up, then it’s worth spending the little extra time it will take to learn the query language as it will let you do some of the tasks we’ll describe in the analysis (typically around looking at status codes) very quickly.

Commercial log tools

On the SEO specific side this includes things like, Botify, OnCrawl & Screaming Frog Log Analyzer and on the non-SEO side, it includes tools like Splunk.

Broadly speaking these tools fall down in one of two areas, either:

  1. They’re too expensive OR
  2. They can’t ask powerful questions in a repeatable easy way.

Your very own SQL server

We already championed SQL in the first section, so why not just use your own SQL server, rather than using Google’s one?

Mostly convenience. Setting up your own SQL server can be a pain and many of the common free interfaces for it like MySQL Workbench are cluttered and difficult to use for beginners.

Considering BigQuery is free for all but the largest pieces of log analysis why not use it rather than put yourself through all that extra effort?

Processing the logs

We need to process our logs to upload them into BigQuery (BQ from here on), where we’re going to be doing our log analysis.

There are two things we’re trying to do:

  1. Format our logs into a CSV so we can load them into BQ
  2. Remove anything which isn’t Googlebot

(We will assume for this that we’re only analysing Googlebot.)

There are two ways of doing this. If you can’t code then we’re going to use Screaming Frog analyser.  If you can code then read on (or are interested in the details), then jump over to our GitHub page where we go through the ins and outs of processing logs.

Processing logs with Screaming Frog Log Analyser

This is best shown with a video. So get your logs ready and here we go.

Uploading logs into BigQuery

Again this is best shown with a video.

Here is the schema and query that are referred to in the video:

Schema for uploading process

logs: full_url:STRING,timestamp:TIMESTAMP,ip_address:STRING,method:STRING,status_code:INTEGER,size:FLOAT,time_taken:FLOAT,user_agent:STRING,referer:STRING

Query for initial processing:

WHEN full_url CONTAINS "?" THEN 1
END AS has_params,
WHEN user_agent CONTAINS "Googlebot" THEN "googlebot"
WHEN user_agent CONTAINS "Mediapartners-Google" THEN "googlebot-ad"
WHEN user_agent CONTAINS "AdsBot-Google" THEN "googlebot-ad"
ELSE "not googlebot"
END AS google_agent,
FIRST(SPLIT(full_url, ':')) AS protocol,
FIRST(SPLIT(REGEXP_REPLACE(full_url, r'(http|https):\/\/(www.)?example\.com', ''), '?')) AS path,
NTH(2,SPLIT(REGEXP_REPLACE(full_url, r'(http|https):\/\/(www.)?example\.com', ''), '?')) AS query,
NTH(3, SPLIT(full_url, '/')) AS page_path_1,
NTH(4, SPLIT(full_url, '/')) AS page_path_2,
NTH(5, SPLIT(full_url, '/')) AS page_path_3,
NTH(6, SPLIT(full_url, '/')) AS page_path_4,
NTH(6, SPLIT(full_url, '/')) AS page_path_5
[insert your own table]

How to analyse logs

Verifying the data

Once we have our logs in BigQuery we need to make sure we have all the logs.

To do this we compare the total crawl numbers we’re getting for our logs to the crawl numbers we’re seeing on Google Search Console.

Time for our first SQL query:

Question: How many times did Googlebot crawl our site each day in our logs?


date(timestamp) as date,
count(*) as num_requests
date desc

A quick run through of that bit of SQL

As it’s our first query we’ll spend a little time going over it.

We’re selecting timestamp from our database, then converting it to a date. We’re getting that from the log_analysis table and ordering it so the date is in descending order.

That covers everything except lines 3, 6 & 7.  We don’t actually want the date from each individual log request, we want the total for each day. Lines 3,6 & 7 need to be read together.  We’re counting all the rows in line 3 and in line 6 & 7 we’re saying count where the date is the same.

Don’t worry if you couldn’t quite follow that, we’ve got a whole list of example queries coming at the end and the easiest way to get the hang of SQL is just to play around with it. Try queries and see what you get.  (We’ve got some notes at the end for common debugging errors.)

Comparing log results to GSC

We then want to compare this to Google Search Console. In my experience log data has never differed wildly from search console data. While the two graphs don’t match exactly they should look roughly the same shape (perhaps offset by a day depending on server timezone) and be in roughly the same ballpark.

For example, if you have 10,000 hits on a day in your logs, but 20k in GSC, you’re missing some logs.

If they don’t match time to go back to your developer and work out which logs you’re missing.

But assuming that they do. Congratulations we’ve finally reached the exciting part of this, the actual analysis.

What is the process of analysing logs?

Whatever you’re doing the process of log analysis will broadly go through the same three areas.

  1. You go and make some queries
  2. You find something worth investigating
  3. You go to your crawl data from a program like screaming frog or deep crawl or head to the actual website to understand what’s going on.

While logs are good you’ll still need that ability to look around someone’s website and understand the problems you’re finding.

Asking the right questions

With that process in mind we need to ask the right questions.  For each of these five areas we mentioned at the beginning:

  1. Diagnosing crawling and indexation
  2. Prioritisation
  3. Spotting bugs and viewing site health
  4. Seeing how important Google views various parts of the site
  5. Measuring the freshness of your content

We have questions (and the queries) that can help you dig into them.

If you’ve followed the steps up until this point you should be able to copy and paste these queries straight in, as long as you remember to change the the table in the FROM section.

If you’ve uploaded the logs from SF Log Analyser, you don’t just have Googlebot, you have all the bots (Bing, Yandex, etc.). If you want to look at a bot other than Googlebot, you’ll need to remove:

google_agent = "googlebot" OR
google_agent = "googlebot-ad"

from where it appears in the queries.

Utility Queries

Most of this section are queries that let you check your data:

Get the date range:

Why: Not sure what time period your logs are over? Don’t worry it happens. This query will tell you the first and last date in your logs.


min(timestamp), max(timestamp)

Get total number of requests Googlebot makes each day:

Why: This is the query we mentioned above, that we use to double check our data is good.


DATE(timestamp)as date, count(*) as num_requests
google_agent = "googlebot" OR
google_agent = "googlebot-ad"
date asc

See the crawls per user agent day by day

Why: Sometimes it’s useful to see the percentage of Googlebot crawling vs the percentage of Googlebot mobile crawling. Particularly with the upcoming mobile only index, it’s useful to understand how often Google is crawling your site with it’s mobile crawler and the desktop.


user_agent, date(timestamp) as date, count(user_agent) as num_requests

Diagnosing crawling and indexation issues & Importance

Get the top 100 most crawled URLs:

Why? A great place to begin finding places Google shouldnt be crawling that it is.


path, count(path) as num_requests
ORDER BY num_requests desc

Get the top 100 most crawled page_path_1 folders:

Why: It’s great to see at a high level how Google splits it’s time across your site. You can see from this the places it’s spending time that it shouldn’t be and so on.


  page_path_1, count(page_path_1) as num_requests
GROUP BY page_path_1
ORDER BY num_requests desc

Get the top 100 most crawled directories for combinations of page_path_1 and page_path_2

Why: Often a websites architecture means the interesting directories are two levels deep. Suppose most of your crawl budget is in the /shop/ folder and it’s actually those second level folders which are interesting, /shop/shirts/, /shop/trousers/ etc. This query will show you the most crawled combinations of your first and second level directories.


  page_path_1, page_path_2, count(page_path_1) as num_requests
 timestamp > TIMESTAMP('2015-11-08') and
 timestamp < TIMESTAMP('2015-12-08')
GROUP BY page_path_1, page_path_2
ORDER BY num_requests desc
LIMIT 1000

Get daily query numbers for path – from the top 20 crawled paths over the entire period

Why: As we mentioned above, before we make any recommendations, we need daily query numbers. This query gives us the day by day crawling numbers for the top 20 most crawled paths.


path, count(path) as num_requests, DATE(timestamp) as date
path IN (
(SELECT path, count(path) as num_requests
FROM [insert_your_table_here_2.log_analysis]
ORDER BY num_requests desc,
LIMIT 20))
GROUP BY path, date
ORDER BY path asc, date asc

Get the total number of URLs with parameters being crawled every day

Why: URL parameters are often a common sources of duplicate content (although obviously this may change from site to site), this query gives you a top level look into how much time Google is spending on them.


date(timestamp) as date,
count(*) as num_requests
google_agent = “googlebot” OR
google_agent = “googlebot-ads”
has_params, date

Get the total number of requests day by day to all the specified parameters

Why: Following on from the above, you may then want to investigate the crawling of specific parameters to understand which individual parameters are the most problematic and measure the effect of you changing how the parameters are accessed. I will typically use this in tandem with the list of found parameters in Search Console as you need to provide this query with the parameters you want to investigate. In the example below, I’ve called them param_1 & param_2, so when you copy and paste this query you’ll need to replace all instances of param_1 & param_2 with your own parameters. You could also add more parameters if you want to invetigate more than two.


SUM(param_1) AS param_1_count,
SUM(param_2) AS param_2_count,
DATE(timestamp) AS date
IF (query CONTAINS 'param_1=',INTEGER(num_requests), 0) AS param_1,
IF (query CONTAINS 'param_2=',INTEGER(num_requests), 0) AS param_2,
COUNT(query) AS num_requests,
(google_agent = "googlebot" OR
google_agent = "googlebot-ad")
timestamp ) )
date ASC

Spotting Bugs

Get the total number of each status code

Why: This query gives you a very top level overview of how many of each status code you’re seeing.

status_code, count(status_code) as num_requests
google_agent = "googlebot" OR
google_agent = "googlebot-ad"
GROUP BY status_code
ORDER BY status_code asc
LIMIT 1000

Get the total number of each status code day by day

Why: Once you’ve seen a top level overview, then you want to see the status codes day by day to see how they change and allow you to spot when major changes happened.

date(timestamp) as date,
count(status_code) as num_requests
google_agent = "googlebot" OR
google_agent = "googlebot-ad"
GROUP BY status_code
ORDER BY status_code asc
LIMIT 1000

Get the top 100 URLs which have returned the most of status code “xxx”

Why: Once you’ve found a problem with a particular status code you’ll want to investigate which URLs are causing it. For this query, replace xxx with the status code you’re investigating. (No need for quotes!)

path, status_code, has_params, count(path) as num_requests
(google_agent = "googlebot" OR
google_agent = "googlebot-ad")
AND status_code = xxx
path, status_code, has_params
num_requests desc

Common Alterations:

Searching by a specific date

You can filter most queries by adding in a date section to the WHERE clause. For example take the first query from above:

path, count(path) as num_requests
timestamp >= TIMESTAMP('2015-11-08') and
timestamp =< TIMESTAMP('2015-12-08')
ORDER BY num_requests desc
LIMIT 1000

via Distilled http://ift.tt/2egamN3

Tagged with: ,
Posted in News

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: