How To Query Google Analytics 360 Data Via BigQuery

by Colin Temple

In a previous post, I covered the basics of setting up the Google Analytics 360 BigQuery Export, and covered some details about the schema that export provides. In this post, I want to start digging in to the fun part: querying the Google Analytics data in Google BigQuery.

Now, this post requires a very basic understanding of Structured Query Language (SQL), which forms the basis of the languages you can use to query data using BigQuery. We’re not going to be updating the data in BigQuery with any of the queries we’ll be writing, so we really only need to understand the syntax for the SELECT statement, which is used to pull data rows from a database, such as the one that powers BigQuery.

SELECTing Data

In SQL, you obtain data from the database using a SELECT statement, which describes what data you want to look at. The basic anatomy of a SELECT statement is like this:

SELECT
  /* the columns you want */
FROM
  /* the table that contains the data */
WHERE
  /* some condition is met */
GROUP BY
  /* columns that, when their values match, should result in a rollup of the data */ 
HAVING
  /* some condition that is met after grouping */
ORDER BY
  /* some column you want to order results by */
LIMIT
  /* some number of results you want to limit the response to */

There are more things you can do, of course. You can create or rename columns, you can join multiple tables together. But the basic structure looks like the above.

When looking at these examples, take note that BigQuery currently has two query languages. One is Legacy SQL, the original BigQuery language. The other is called Standard SQL, and aims to bring BigQuery’s SQL dialect closer to the standard SQL used by most database servers. I’ve generally supplied examples in Standard SQL, unless the Legacy SQL method differs significantly, in which case I’ve provided both.

Date Ranges

The first thing you’ll want to understand about the Google Analytics data set in BigQuery is that it follows the <prefix><day>  format that BigQuery requires to save each day’s worth of data in its own table. This means that, in addition to querying the literal name of a table, you can also query multiple tables simultaneously by defining a date range. So, if you were pulling data only for a single day, you might do something like this in Legacy SQL:

SELECT
    SUM(totals.visits) as Sessions
FROM
    [project-name:VIEW_ID.ga_sessions_20171001]

Or, in Standard SQL:

SELECT 
  SUM(totals.visits) as Sessions
FROM
  `project-name.VIEW_ID.ga_sessions_20171001`

That basic query will get you a session count for a single day, in this case October 1, 2017. But what if I want to get all sessions in October? This is accomplished differently in the two dialects of BigQuery SQL. Using Legacy SQL, you can do this with the TABLE_DATE_RANGE() function and some TIMESTAMP() -encoded dates:

SELECT 
  SUM(totals.visits) as Sessions
FROM
  TABLE_DATE_RANGE(
    [project-name:VIEW_ID.ga_sessions_], 
    TIMESTAMP('2017-10-01'), 
    TIMESTAMP('2017-10-31')
  )

Notice that I left off the date portion of the table name when I supplied it to the TABLE_DATE_RANGE()  function. That function will automatically add the date portions when it builds the list of tables to query.

If you’re using Standard SQL, you can instead query the dates using a table suffix wildcard:

SELECT 
  SUM(totals.visits) as Sessions
FROM
  `project-name.VIEW_ID.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN "20171001" AND "20171031"

Top-Line Metrics

As I mentioned in my previous post, most of the data in the BigQuery export for Google Analytics 360 excludes derivative dimensions and metrics, although there are some totals that are kept for common top-line metrics. Those totals are summed up for each session (therefore, in each row). You can query them like this, in Standard SQL:

SELECT
	date,
	SUM(totals.totalTransactionRevenue)/1000000 AS revenue,
	SUM(totals.transactions) AS transactions,
	COUNT(DISTINCT fullVisitorId) AS users,
	SUM(totals.visits) AS sessions,
	SUM(totals.pageviews) AS pageviews,
	SUM(totals.bounces) AS bounces
FROM 
	`projectname.viewid.ga_sessions_*`
WHERE 
	_TABLE_SUFFIX BETWEEN "20171001" AND "20171031"
GROUP BY
	date
ORDER BY
	date ASC

This query will pull the total revenue, transactions, users, sessions, pageviews and bounces for each session and add them up, grouping them together by date. The result will be a table that looks like this:

daterevenuetransactionsuserssessionspageviewsbounces
2017100118567211686931203234567112536812180926
2017100221561023105445021146810024158801491855
. . .
201710313892105891529893621153251568841352102568

Some things to notice about this query:

  • We are defining new columns for each of our metrics, calculated using the SUM() function.
  • We are naming each column AS  some name.
  • We are grouping by date, which means that instead of having the date spread out over several rows, we are aggregating the results for each date (using the SUM()  function above).
  • We are ordering by ascending date, which means our output goes from the first to the 31st of October, in this case.
  • Sessions are called “visits” in the exported schema. This is an artefact of the days when Google Analytics used the term “visits” instead of “sessions”. We’ve re-named the column as “sessions” in our output.
  • Revenue here is divided by one million before returned. This is because Google Analytics stores currency amounts as micros, one millionths of the currency used in the view. We need to convert this back to single units before reporting it.
  • We’re using the COUNT(DISTINCT fullVisitorId) method to count users. This is roughly the old calculation that Google Analytics used to calculate its Users metric before early 2017 when it changed things. The old calculation is still used in unsampled reporting in Google Analytics 360. Analytics standard reports still use the old calculation by default, but if you want your Analytics 360 reports to match BigQuery exactly, you’ll have to disable the Enable Users Metric in Reporting feature in the property settings. This forces the old calculation to be used.

In this query, I used Date as a dimension. That gets us those metrics broken down by day, but you can use the same method to pull data by any other dimension. For instance, to pull data by medium and source:

SELECT
	medium,
	source,
	SUM(totals.totalTransactionRevenue)/1000000 AS revenue,
	SUM(totals.transactions) AS transactions,
	COUNT(DISTINCT fullVisitorId) AS users,
	SUM(totals.visits) AS sessions,
	SUM(totals.pageviews) AS pageviews,
	SUM(totals.bounces) AS bounces
FROM 
	`projectname.viewid.ga_sessions_*`
WHERE 
	_TABLE_SUFFIX BETWEEN "20171001" AND "20171031"
GROUP BY
	medium, source
ORDER BY
	revenue DESC

This produces a similar table to the above, but in this case we have data by medium and source, and it’s sorted by a metric this time (in this case, transaction revenue).

That covers off some basics when it comes to pulling this data, but we haven’t even scratched the surface. In my next post in this series, I’m going to look at re-creating the default channel groupings in Google Analytics 360 within the BigQuery-exported data. Stay tuned!

Colin Temple

VP, Product

Colin serves as VP, Product for Napkyn Analytics. A diverse background in data, software and marketing and an education in logic and philosophy give Colin a unique perspective on where the analytics practice is, and where it should be.

See more posts from Colin