Google Analytics 360 BigQuery Export: The Basics
September 21, 2017 -
Yesterday, Google announced a new feature that it started rolling out to Google Analytics 360 customers in August: data streaming into BigQuery. This offers Analytics 360 customers a great opportunity to start working with more complete near-real-time data much more rapidly than they could before. That’s especially exciting for online retailers nearing their peak season. However, many Analytics 360 customers aren’t yet taking advantage of the BigQuery export at all. If you’re in that camp, this post serves as an introduction to the export to help you get started.
If you’re a Google Analytics 360 customer, you’re probably using the platform for large-scale analytics work. And if you’re working at scale, occasionally you’re likely to come across questions that you can’t quite answer within the Google Analytics interface.
Google Analytics offers a wide range of reporting, of course, but you can also get at your data through custom reporting, or through the standard Analytics Reporting API and the Unsampled Reporting features in the Management API. To easily work with either of these APIs, there are tools like Analysis Engine out there to configure exports.
But sometimes even those are not enough, and you want to start to explore the complete Google Analytics data set on your own terms. For that, Google Analytics provides an optional daily export of your data into Google BigQuery.
BigQuery is a big data querying tool that allows you to import or stream data into its database, and then work on that data set through complex queries using SQL. Offered through the Google Cloud Platform, it’s a pay-per-use solution that allows you to pay only for the storage and the computational resources you use. With BigQuery, you can run ad hoc investigations on the Google Analytics 360 data set to answer questions that may not be addressed by the Analytics UI on its own.
Costs of BigQuery
BigQuery is part of the Google Cloud Platform, not the Google Analytics 360 Suite. As such, it has a different pricing model than the Analytics products and is not included with the Suite. For BigQuery, specifically, pricing is basically a function of how much you store in BigQuery and how much you query.
The first 10 GB of storage and 1 TB of query processing is free monthly. You may also have received a monthly allowance for BigQuery use with your Google Analytics 360 contract, which typically gives you $500 worth of BigQuery monthly. After that’s all used up, you begin paying for storage, querying, and, if you choose to have your Analytics data streamed into BigQuery for near-real-time reporting, you’ll also pay for the streaming service.
Overall, the cost is not that high for what you get, considering how powerful BigQuery can be in augmenting what you already get from Analytics 360. You can see current pricing here.
Setting it up
To get started pushing your data to Google BigQuery, the first thing you’ll need is a Google Cloud Platform project, which can be created in the Cloud Platform console if you don’t have one already. You’ll need to enable billing for the project, which can be done with a credit card through the Billing section in the console’s main navigation.
You’ll also need to give Google Analytics 360’s export service account Editor-level access to your project (not just to BigQuery). This can be done in the IAM & Admin > IAM section of the console. The service account email you’ll need to add is firstname.lastname@example.org.
Once billing has been enabled, and the appropriate access has been given to GA’s service account, you can move on to Google Analytics 360 to link it to BigQuery. In the Admin section of GA, under the property you want to link, you’ll see a “Product Linking” subheader. Click on All Products under that header, and find BigQuery in the list of products. From here, you can set up your link.
You’ll need to provide Google Analytics information about what data you want to send to BigQuery, and what project to send it to. You’ll be asked to provide the project name and choose the view you want data to come from. You can currently only link one view per property to BigQuery, and there are implications about what project and view you choose, so be careful about your choices here.
Finally, you’ll be asked to choose your current day streaming preferences. This determines how Google Analytics will push the data to BigQuery throughout the day, giving you access to data before the day is up. There are two options: to have your data exported 3 times per day (every 8 hours), or to have it streamed to BigQuery continuously.
The first option is the default and, as it says, pushes partial data from the day to BigQuery three times daily.
The second option is brand new. This makes use of BigQuery’s streaming import capabilities to push hits into BigQuery about ten to fifteen minutes after they are received. This gives you nearly real-time access to most of the Google Analytics data for quick ad hoc investigations. Notably this does not include any data from AdWords or DoubleClick integrations, and also incurs a small incremental cost as the BigQuery streaming service is pay-per-use (about 5 cents per GB at the time of this writing).
When you first initiate the link, Google Analytics will drop the last 13 months’ worth of data into BigQuery within about 24 hours of setting up the link. However, it will only do this once per view. That is, if you choose to break the link and then link the same view to another project, the last 13 months’ worth of data will not be imported, nor will moving the data back to the same project fill in any gaps in the data from when the link was broken.
What the data looks like
Once you’ve set up your link, you’ll see a new data set appear in BigQuery, named with the ID number of the view. In that, you’ll find there are multiple sets of tables.
The first one, named as “ga_sessions_YYYYMMDD”, contains the bulk of the data. You’ll have a table in that group for each day for which the BigQuery export has occurred. They tend to show up early in the morning, though there’s no SLA for when to expect these tables to arrive. This is the most complete data set.
If you chose the thrice-daily intraday option, you’ll have another group of tables named according to the schema “ga_sessions_intraday_YYYYMMDD”. The table for the current day will contain the most recent export with partial data for the day. These tables are eventually deleted as the permanent data for the day is put into place.
Finally, if you chose the continuous export option, you’ll have tables named “ga_realtime_sessions_YYYYMMDD” and “ga_realtime_sessions_view_YYYYMMDD”. These contain the data streamed from BigQuery. Do not query the “ga_realtime_sessions_YYYYMMDD” table — it is a staging table that will often contain duplicated sessions. These are de-duplicated in the view table, called “ga_realtime_sessions_view_YYYYMMDD” — that is the only table you should query to obtain data from the current day.
The full schema for the exported tables is provided in the documentation, but the structure is basically this:
- Every single row in the main table represents one session.
- There are various properties of the session recorded, such as the Client ID and User ID of the user, and there are nested tables that contain attribution (traffic source) information for the session, and the technical details of the session, such as browser and operating system of the visitor.
- There’s a nested “totals” table that contains some information about the session, such as the duration of the session, and some calculated totals, such as how many transactions occurred in the session and what the total revenue for it was.
- Finally, there’s a repeated nested “hits” table, which contains every hit that occurred during the session: all pageviews, events and social interaction hits and (in the case of classic ecommerce deployments), transactions and items. For app properties, this also includes app views and exceptions.
- The session and hit level data contains all of the custom dimension and metric values applied to each session or hit.
What the data does not contain
The data that Analytics 360 exports to BigQuery is specific to the view, meaning that all view filters have been applied by the time it’s exported. This means that anything you filter out in the view you’re exporting does not appear in the BigQuery data.
As for the dimensions and metrics available in the schema, there are some things that are noticeably absent. Most importantly, aside from the few totals that are provided for each session, you’ll notice that dimensions and metrics that are calculated or derived from the data are not included. For example, the following are not included:
- Custom channel groupings
- Days since last session
- Landing page path
- Goal names
- Bounce rate, conversion rate, % new/returning sessions, or any average calculated across multiple sessions or users
- Goal counts
Marketing cost data imports are not present in the data.
Extended data imports (formerly known as dimension widening) for processing time imports are present in the data. Because processing time imports apply the dimension/metric values to the imported data at the time the hit is recorded, those values show up in BigQuery.
However, query time extended data imports are not present in the data at all. Query time works by saving the dimension or metric values in a separate table and then applying them to the data when you query it in the Google Analytics UI. Consequently, that data doesn’t show up in BigQuery at all.
Refund data imports are present in the data, as they count as hits.
Next step: query and learn from your data
Once you have an understanding of what the BigQuery export and its schema look like, your next step is to start querying. In a follow-up post, I’ll go through some of the basics of querying this data set.