An Ultimate Guide to Google Analytics Sampling and How to Avoid It


Sampling is a method used to infer insights or make generalizations when it’s unfeasible or simply impractical to analyze all the data under consideration. Instead, smaller data subsets are culled from the data set and considered to represent the whole picture.

Sampling was a tried-and-tested practice long before the rise of Google Analytics. It’s used in public opinion polls, surveys, consumer tests, and the vast majority of other statistical studies. Whenever sampling kicks in, it introduces a certain level of uncertainty into the observations. Keep reading, and you’ll learn the reasons behind Google Analytics sampling, why it is an issue, and how you can avoid sampling in your reports.

Why and when sampling occurs

Even Google’s processing servers can’t always handle very large volumes of data in a finite amount of time (and the faster, the better). That’s why Google Analytics applies sampling to strike a balance between accuracy and processing speed. You can always check whether sampling is in effect by looking at the top of each report. If there’s a line saying that the report is based on less than 100% of sessions, the data is sampled.

Sampling notification in the Google Analytics interface

This most often happens when the amount of data for the selected date range exceeds 500,000 sessions. However, keep in mind that sampling is not just about the number of sessions in your reports. It’s about how deeply into your data you want Google Analytics to dig.

Default & Ad-hoc reports

Most of the default Google Analytics reports, the ones you see in the left pane in the Google Analytics interface, are always unsampled. For each reporting view, Google Analytics creates a set of pre-aggregated data tables with combinations of certain dimensions and metrics taken from the complete data, and processes this pre-aggregated data on a daily basis. This means that the metrics for the dimensions in default reports are already calculated and ready to go when you run the report.

The screenshot below demonstrates an example of a report based on the complete non-sampled data:

Without segments applied, the report is based on 100% of sessions

However, once you apply a segment, a filter, or a secondary dimension to a default report, Google Analytics has to issue a non-standard, ad-hoc query to return the information you’ve requested. The same happens when you create custom reports with combinations of dimensions and metrics that don’t exist in the default reports. Google Analytics first checks whether the new query can be fully satisfied by the existing tables of pre-aggregated data. If not, the query goes to the complete, raw data to compute the requested information. If the number of sessions for the selected date range exceeds the sampling threshold of 500,000 sessions, and sometimes even below it, ad-hoc reports may present sampled data.

Here’s an example of a report based on sampled data:

Applying a segment results in a sampled report based on 6.47 % of sessions

Flow reports

With flow-visualization reports, such as Users Flow, Behavior Flow, Events Flow, and Goals Flow, Google Analytics only allows for a maximum of 100,000 sessions for a selected date range. Above this threshold, the data is always sampled.

Since flow-visualization reports are based on a different sample set than the default reports, you may see discrepancies in the presented metrics. The total number of visits, users, exits, and so on may differ between the default Behavior and Conversion Overview reports and the Behavior Flow and Goal Flow reports.

You can find an example of a flow report below:

Multi-Channel Funnel and Attribution reports

In Multi-Channel Funnel and Attribution reports, no sampling is applied as long as you leave the report unmodified. If you modify the report in any way, Google Analytics will display a sample of 1,000,000 conversions.

Why sampling is an issue

If the sample size is, say, 90% of sessions, then the overall information in your reports is likely to be reliable. However, the smaller the data set used for the sample, the less accurate your results and interpretations may be. If you see 100 sessions in a report based on a 1% sample, your estimate is based on 1 session multiplied by 100. The other 99% of the data remains shrouded in mystery.
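The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this article, and the margin-of-error formula is a rough textbook approximation for counts (z divided by the square root of the sampled count), not something Google Analytics documents or reports.

```python
import math


def scale_up(sampled_count: int, sample_rate: float) -> float:
    """Estimate a full-data count from a count observed in a sample."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return sampled_count / sample_rate


def relative_margin(sampled_count: int, z: float = 1.96) -> float:
    """Rough relative margin of error (about 95% confidence for z=1.96).

    Assumption: the sampled count behaves like a Poisson count, so its
    relative error shrinks as 1 / sqrt(sampled_count).
    """
    if sampled_count <= 0:
        return math.inf
    return z / math.sqrt(sampled_count)


# 1 sampled session at a 1% sample rate is reported as about 100 sessions...
estimate = scale_up(1, 0.01)
# ...but the estimate rests on a single observation.
margin_one = relative_margin(1)
# The same 1% sample with 10,000 sampled sessions is far more stable.
margin_many = relative_margin(10_000)
```

Under this approximation, an estimate built on a single sampled session has a margin of error larger than the estimate itself, while one built on ten thousand sampled sessions is off by only a couple of percent. The reported number alone doesn’t tell you which of the two situations you are in.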

Without seeing the whole picture, you can’t fully trust your data. When a well-known brand of cat food claims that 8 out of 10 cats preferred their brand, can you be sure your tabby will love it? When a toothpaste producer claims 9 out of 10 dentists agree that their brand of toothpaste is best, could it be that these doctors were cherry-picked to provide favorable opinions? You never know. Your brain reads “8 out of 10” and interprets it as 80%, while, of course, much is left out of the picture. With sampling, Google Analytics works the same way. And the deeper you dig, the less you see.

Data accuracy might not be a great problem if you only look at the number of sessions. However, when it comes to money-related metrics such as goals, conversions, and revenue, sampling may cost you a fortune. Sampled reports deliver distorted metric values. Because of this, you risk overlooking valuable input from your advertising or, conversely, continuing to invest in underperforming ads. The result? Money lost.

How to avoid sampling

Whether or not sampling should concern you depends on how much ambiguity and uncertainty you can tolerate in your work. If you feel that sampling has become an issue, here’s a handful of practices you can adopt in your analytics routine to avoid or minimize sampling.

In the Google Analytics interface

Let’s start with the practices you can adopt to avoid sampling without even resorting to any additional tools, straight in your Google Analytics account.

1. Shorten the date range of your report to stay under the sampling threshold.

The longer the time span, the more data Google Analytics has to process, and the higher the risk of sampling. Conversely, shorter date ranges produce more accurate data. It’s that simple.

For example, if your website gets fewer than 500 thousand visits per month but a longer time span results in sampling, try looking at each month’s data separately. A more advanced approach is to aggregate the data for each month outside the Google Analytics interface; we discuss this approach below.

Choosing a shorter time range helps reduce or completely avoid sampling
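If you end up pulling each month separately, the date ranges themselves are easy to generate. Below is a minimal Python sketch; the function name `monthly_ranges` is made up for this example and is not part of any Google tool.

```python
from datetime import date, timedelta


def monthly_ranges(start: date, end: date):
    """Split the inclusive range [start, end] into calendar-month chunks."""
    if start > end:
        raise ValueError("start must not be after end")
    ranges = []
    cursor = start
    while cursor <= end:
        # First day of the month following the cursor's month.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        # A chunk ends on the last day of the month or at `end`, whichever is first.
        chunk_end = min(end, next_month - timedelta(days=1))
        ranges.append((cursor, chunk_end))
        cursor = next_month
    return ranges


for chunk_start, chunk_end in monthly_ranges(date(2017, 1, 15), date(2017, 3, 10)):
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Keep in mind that only additive metrics, such as sessions or pageviews, can be summed across chunks. Unique-count metrics such as users cannot, because the same user may appear in more than one month.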

2. Avoid ad-hoc reporting

As said above, most default reports in Google Analytics are not subject to sampling. Analysts are often tempted to use customized, ad-hoc reports where the default ones would do the same job. This means that you may get more accurate results by simply avoiding segments and secondary dimensions in your reports.

For example, say you want to estimate the amount of organic search traffic landing on your website. You can do this either by applying the Organic segment to the Landing Pages report, or by looking at the Organic Search report under Channels with “Landing page” as the primary dimension. The first report may be prone to sampling, while the second is based on 100% of sessions.

Note, though, that default Google Analytics reports based on pre-aggregated data are limited to 50,000 rows per day. In ad-hoc reports, this limit is 1,000,000 rows per day. When the number of dimension-value combinations exceeds the limit, the additional values are grouped into a single row labeled (other).

The default Google Analytics reports always deliver unsampled data, with a limit of 50,000 rows per day.
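To picture what the (other) row does to a report, here is a toy Python sketch. It assumes that the highest-volume rows are the ones kept, which this article does not spell out, and the function name `cap_rows` is purely illustrative.

```python
from typing import Dict


def cap_rows(rows: Dict[str, int], limit: int) -> Dict[str, int]:
    """Keep the `limit` largest rows and fold the rest into '(other)'.

    Assumption: rows are ranked by their metric value before truncation.
    """
    if len(rows) <= limit:
        return dict(rows)
    ranked = sorted(rows.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:limit])
    # Everything past the limit loses its own label and is summed together.
    kept["(other)"] = sum(value for _, value in ranked[limit:])
    return kept


pageviews = {"/home": 5, "/pricing": 3, "/blog/a": 1, "/blog/b": 1}
capped = cap_rows(pageviews, limit=2)
```

The totals stay the same, but the long tail can no longer be broken down by its original values.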

3. Apply view-level filters to only display the data you need the most often

Google Analytics samples data at the view level, after view-level filters have been applied. This means that the sample is taken from the sessions allowed by the filter. Again, say you want to see how many visitors land on your website through organic search, and using an organic medium segment leads to a sample. If this happens often, you could create a duplicate view, then apply a permanent view-level filter to allow only organic traffic in that view. This approach might not entirely resolve the sampling issue, as ad-hoc queries might still trigger sampling on large amounts of data, but the unsampled default reports will give you valid information without extra hassle on your part.

Note that it’s not recommended to filter the data by page-level dimensions. Say you have an Ecommerce website with different product categories and pages. If you use a different view for each page type, the session of one user will end up being split across different views, and the total number of sessions may be heavily inflated.

Applying view-level filters can help you prevent sampling, as Google Analytics samples data after view-level filters have been applied.
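The warning about page-level filters is easier to see with numbers. The Python sketch below uses a made-up hit log and a deliberately simplified model in which a view counts a session as soon as it receives at least one hit from it. Real session processing in Google Analytics is more involved.

```python
from collections import defaultdict

# Hypothetical hit log: (session_id, page_type of the page viewed).
hits = [
    ("s1", "category"), ("s1", "product"), ("s1", "checkout"),
    ("s2", "product"),
    ("s3", "category"), ("s3", "product"),
]


def sessions_per_view(hit_log):
    """Count distinct sessions seen by each page-type-filtered view."""
    seen = defaultdict(set)
    for session_id, page_type in hit_log:
        seen[page_type].add(session_id)
    return {view: len(ids) for view, ids in seen.items()}


true_sessions = len({session_id for session_id, _ in hits})
per_view = sessions_per_view(hits)
summed_sessions = sum(per_view.values())

print("actual sessions:", true_sessions)
print("sessions per view:", per_view)
print("sum across views:", summed_sessions)
```

Three real sessions turn into six once the per-view counts are added up, because every session that touches several page types is counted once in each view.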

4. Track each website with a different Property

It’s common practice to track multiple websites in a single Google Analytics property and use filters when you need to look at a single website. The more data you collect in one property, the higher the risk that you’ll get a sampled report. If that’s your case, consider using a separate property for each of your websites. This will reduce the amount of traffic to each property, thereby reducing the risk of sampling.

Outside the Google Analytics interface

Below, we explore how you can prevent sampling by exporting data from Google Analytics. Mind that it’s impossible to extract raw demographic data from Google Analytics, since this data is always aggregated in the system.

5. Use the Google Analytics API

Another way to deal with your sampling issues is to access the data programmatically outside the Google Analytics interface through the Google Analytics Reporting API. While API responses may contain sampled data over a long time span, the API allows you to specify how much data you want to retrieve in one request, and also adjust the sampling level. If you’re running a high-traffic website and your data is getting heavily sampled, it may require you to run hundreds of different requests to extract all the data you need. The API allows for up to 50,000 requests per project per day and returns up to 10,000 rows per request.
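To get a feel for how quickly requests add up, here is a back-of-the-envelope Python sketch. It uses only the two limits quoted above and assumes one query per day of data, paginated at the maximum page size. The function names are made up for this example.

```python
import math

MAX_ROWS_PER_REQUEST = 10_000   # rows returned per request, as quoted above
MAX_REQUESTS_PER_DAY = 50_000   # requests per project per day, as quoted above


def requests_needed(days: int, rows_per_day: int) -> int:
    """Requests required when each day is queried separately and paginated."""
    pages_per_day = max(1, math.ceil(rows_per_day / MAX_ROWS_PER_REQUEST))
    return days * pages_per_day


def fits_daily_quota(days: int, rows_per_day: int) -> bool:
    """True if the whole export can run within one day's request quota."""
    return requests_needed(days, rows_per_day) <= MAX_REQUESTS_PER_DAY


# A year of data at 25,000 rows per day needs 3 pages per day.
year_of_requests = requests_needed(days=365, rows_per_day=25_000)
```

One report over a year fits comfortably within the quota, but the count multiplies with every additional report, view, and segment you export the same way.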

The main drawback of this approach is that it takes your team’s time and programming skills. It’s impractical to run thousands of daily requests manually, so coding is needed to automate the process. Also note that the API allows a maximum of 7 dimensions and 10 metrics in any query, a query must always have at least one metric, and only certain dimensions can be queried together. See the official Reporting API documentation for more information.

Below is an example of an error you may see if you supply more than 7 dimensions:

While the Google Analytics Core Reporting API allows you to extract more data with a greater level of precision, it has a number of limitations to dimensions and metrics.
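As a rough illustration, the Python sketch below builds a request body in the shape used by version 4 of the Reporting API and checks the dimension and metric limits before anything is sent. It also shows one way to tell whether a returned report was sampled, using the `samplesReadCounts` and `samplingSpaceSizes` fields of the response. The helper names and the view ID are placeholders, and the sketch does no authentication or networking.

```python
MAX_DIMENSIONS = 7
MAX_METRICS = 10


def build_report_request(view_id, start_date, end_date, metrics, dimensions,
                         page_size=10_000):
    """Build a Reporting API v4 request body, enforcing the limits above."""
    if not metrics:
        raise ValueError("a query must have at least one metric")
    if len(metrics) > MAX_METRICS:
        raise ValueError("at most %d metrics per query" % MAX_METRICS)
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError("at most %d dimensions per query" % MAX_DIMENSIONS)
    return {
        "reportRequests": [{
            "viewId": view_id,
            "dateRanges": [{"startDate": start_date, "endDate": end_date}],
            "metrics": [{"expression": m} for m in metrics],
            "dimensions": [{"name": d} for d in dimensions],
            "samplingLevel": "LARGE",  # ask for the most precise sample
            "pageSize": page_size,
        }]
    }


def sampling_rate(report):
    """Return the share of sessions read, or None if the report is unsampled."""
    data = report.get("data", {})
    read = data.get("samplesReadCounts")
    space = data.get("samplingSpaceSizes")
    if not read or not space:
        return None  # the fields are only present when sampling occurred
    return int(read[0]) / int(space[0])


body = build_report_request(
    view_id="123456789",  # placeholder view ID
    start_date="2017-01-01",
    end_date="2017-01-31",
    metrics=["ga:sessions"],
    dimensions=["ga:medium", "ga:landingPagePath"],
)
```

With the google-api-python-client library, a body like this would typically be passed to `reports().batchGet(body=body).execute()` on an authorized `analyticsreporting` v4 service object. If `sampling_rate` returns a value for a report, shortening the date range of that request is the usual next step.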

6. Use the Google Analytics Spreadsheet add-on

The official add-on uses the Google Analytics Reporting API, enabling you to query Google Analytics data and bring it into Google Sheets without having to code. The add-on makes it possible to automatically extract the data you need from one or multiple Google Analytics views and then manipulate it in Google Sheets, perform custom calculations, create visualizations, and share them with your partners or colleagues. It’s also worth mentioning that the add-on allows 9 dimensions, 2 more than when accessing the API in any other way.

However, mind that Google Sheets has its own limits, such as a maximum of 400,000 cells across all sheets in a spreadsheet, which makes it impossible to use the add-on for really large amounts of data:

The Google Analytics Spreadsheets add-on allows you to extract unsampled or less sampled data for a chosen set of dimensions and metrics.

7. Upgrade to Google Analytics 360 (formerly Premium)

Google Analytics 360 provides lots of advanced solutions to the sampling issue, including:

  • A sampling threshold of 100 million sessions per view, that is, 200 times more data than the 500 thousand sessions per property in the standard version.
  • Unsampled Reports with up to 3 million unique rows of data that can be run on demand or on a scheduled basis.
  • Custom Tables with up to 1 million rows per day, enabling instant access to unsampled data for up to 6 dimensions, 25 metrics, 5 filters and 4 segments in a table.

Beyond that, Google Analytics 360 can be integrated with the Google BigQuery analytics data warehouse, with a $500 monthly credit to spend on Google BigQuery projects. The integration allows you to automatically import unsampled, hit-level, near real time data from Google Analytics to Google BigQuery, and use SQL-like queries to create even the most advanced and complex reports in a matter of seconds.
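As an illustration of what such a query can look like, the Python sketch below assembles a standard SQL statement over the daily `ga_sessions_*` tables of the BigQuery export. The project and dataset names are placeholders, the helper name is made up, and the snippet only builds the query text. Running it would require a BigQuery client and a Google Analytics 360 export.

```python
def sessions_by_medium_sql(table_prefix: str, start: str, end: str) -> str:
    """Build a query counting sessions and transactions per traffic medium.

    `table_prefix` is '<project>.<dataset>'; `start` and `end` are YYYYMMDD
    strings matching the suffix of the daily export tables.
    """
    for value in (start, end):
        # Only plain 8-digit dates are accepted, since they are interpolated.
        if not (len(value) == 8 and value.isdigit()):
            raise ValueError("dates must be YYYYMMDD strings")
    return f"""
SELECT
  trafficSource.medium AS medium,
  SUM(totals.visits) AS sessions,
  SUM(totals.transactions) AS transactions
FROM `{table_prefix}.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
GROUP BY medium
ORDER BY sessions DESC
""".strip()


query = sessions_by_medium_sql("my-project.123456789", "20170101", "20170131")
print(query)
```

Because the query runs over every exported session in the date range, the result involves no sampling, however the data is sliced.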

For all its indisputable benefits, Google Analytics 360 is designed as an enterprise-level solution and therefore requires a significant annual investment. If you’re thinking about switching to the paid version, consider these three conditions: you’re facing sampling all the time, your website gets more than 10 million hits per month, and your annual revenue allows investing in the license.

Google Analytics 360 boasts a significantly higher sampling threshold than the free version.

8. Use OWOX BI Pipeline

OWOX BI Pipeline provides a way to avoid sampling without having to invest in Google Analytics 360 or spend your company’s resources on coding an API solution. The tool allows you to harness the power of Google BigQuery to collect Google Analytics data, except the data will always be raw and collected in near real time. Since OWOX BI uses its own algorithm to compute sessions, the data will always be unsampled, no matter how many sessions you have in your Google Analytics. Take about 10 minutes to implement the tool, and it will do the rest. OWOX BI Pipeline starts at $115 per month and you can try it for free with a 14-day trial.

OWOX BI Smart Data

To wrap up the above solutions, here’s a table demonstrating which of the approaches would be the best fit for tackling your data sampling issues. In the table, we consider the feasible amounts of data for each of the above approaches. However, your decision shouldn’t be based only on the amounts of data. Each to their own: every organization’s got to find their best way to avoid sampling. We recommend starting with the simplest, and paying close attention to the shortcomings of your chosen approach. And if you have questions, don’t hesitate to ask them in the comments section.

Within the Google Analytics interface

Solution: Google Analytics 360 (formerly Google Analytics Premium)
Overview:
  • Sampling threshold: 100M sessions
  • Unsampled Reports
  • Custom Tables
  • BigQuery Export
Cons:
  • Expensive annual license
Recommended number of sessions a day: 0 to 1,000,000 or more

Solution: Default reports
Overview:
  • Always unsampled thanks to pre-calculated data
Cons:
  • Max. 2 dimensions
  • Limited set of reports
Recommended number of sessions a day: 0 to 500,000

Solution: Setting shorter date ranges
Overview:
  • The shorter the time span, the less data
Cons:
  • More effort to retrieve data for longer time span
  • Max. 5 dimensions
Recommended number of sessions a day: 0 to 500,000

Solution: View-level filters
Overview:
  • Less data, including only the traffic you want to see
Cons:
  • Page-level dimensions inflate user count
  • Max. 5 dimensions
Recommended number of sessions a day: 0 to 500,000

Outside the Google Analytics interface

Solution: Google BigQuery Export for Analytics 360
Overview:
  • Near real time hit data and unsampled session data export
  • Max. 200 dimensions
Cons:
  • Available for Google Analytics 360 only
Recommended number of sessions a day: 0 to 1,000,000 or more

Solution: OWOX BI Pipeline + Google BigQuery
Overview:
  • Raw real-time hit data
  • Unsampled session data
  • Unlimited number of dimensions
  • Free for 14 days
Cons:
  • AdWords data retrieved through BigQuery Data Transfer Service
Recommended number of sessions a day: 0 to 1,000,000

Solution: Google Analytics Core Reporting API
Overview:
  • Programmatic way to pull out unsampled data
Cons:
  • Coding required
  • Not all dimensions and metrics compatible
  • Max. 7 dimensions in a query
Recommended number of sessions a day: 0 to 1,000,000

Solution: Google Analytics Spreadsheet Add-on
Overview:
  • Up to 9 dimensions
  • No coding required
Cons:
  • Unfeasible to use with large amounts of data
Recommended number of sessions a day: 0 to 40,000

How often do you have to deal with sampling in Google Analytics? Make sure that your data is always crispy raw and unsampled, with OWOX BI Pipeline.
