An Ultimate Guide to Google Analytics Sampling and How to Avoid It
Sampling is a method used to infer insights or make generalizations when it’s unfeasible or simply impractical to analyze all the data under consideration. Instead, smaller data subsets are culled from the data set and considered to represent the whole picture.
Sampling was a tried-and-tested practice long before the glorious rise of Google Analytics. It’s used in public opinion polls, surveys, consumer tests, and the vast majority of other statistical studies. And whenever sampling kicks in, it introduces a certain level of uncertainty into the observations.
Table of contents
- Why and when sampling occurs in Google Analytics
  - Default & Ad-hoc reports
  - Flow reports
  - Multi-Channel Funnel and Attribution reports
- Why sampling is an issue
- How to avoid sampling
  - In the Google Analytics interface
  - Outside the Google Analytics interface
Why and when sampling occurs in Google Analytics
Even Google’s processing servers can’t always handle endlessly large volumes of data in a finite amount of time (and the faster, the better). That’s why Google Analytics applies sampling to strike a balance between accuracy and processing speed. You can always check whether sampling is in effect by looking at the top of a report. If there’s a line saying that the report is based on less than 100% of sessions, the data is sampled.
This most often happens when the amount of data for the selected date range exceeds 500,000 sessions. However, keep in mind that sampling is not just about the number of sessions in your reports. It’s about how deeply into your data you want Google Analytics to dig.
Default & Ad-hoc reports
Most of the default Google Analytics reports, the ones you see in the left pane in the Google Analytics interface, are always unsampled. For each reporting view, Google Analytics creates a set of pre-aggregated data tables with combinations of certain dimensions and metrics taken from the complete data, and processes this pre-aggregated data on a daily basis. This means that the metrics for the dimensions in default reports are already calculated and ready to go when you run the report.
The screenshot below demonstrates an example of a report based on the complete non-sampled data:
However, once you apply a segment, a filter or a secondary dimension to a default report, Google Analytics has to issue a non-standard, ad-hoc query to return the information you’ve requested. The same happens when you create custom reports with combinations of dimensions and metrics that don’t exist in the default reports.
First, Google Analytics checks whether the new query can be fully satisfied by the existing tables of pre-aggregated data. If not, the query goes to the complete, raw data to compute the requested information. If the number of sessions in the selected date range exceeds the sampling threshold of 500,000 (and sometimes even below it), ad-hoc reports may present sampled data.
Here’s an example of a report based on sampled data:
Flow reports
With flow-visualization reports, such as Users Flow, Behavior Flow, Events Flow, and Goals Flow, Google Analytics only allows for a maximum of 100,000 sessions for a selected date range. Above this threshold, the data is always sampled.
Since flow-visualization reports are based on a different sample set than the default reports, you may see discrepancies in the presented metrics. This means that the total number of visits, users, exits, etc. may be different between the default Behavior and Conversion Overview reports and the number of actions in the Behavior Flow and Goal Flow reports.
You can find an example of a flow report below:
Multi-Channel Funnel and Attribution reports
In Multi-Channel Funnel and Attribution reports, no sampling is applied unless you modify the report in any way. If you do modify it, Google Analytics returns a sample of at most 1,000,000 conversions.
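The thresholds described in this section can be summarized in a small sketch. The report-type labels and the function name below are made up for illustration, and the numbers are simply the ones quoted above. Treat the result as a rough indicator only, since ad-hoc reports can be sampled below the threshold as well.

```python
# Thresholds quoted in this section for standard Google Analytics.
# The dictionary keys are illustrative labels, not Google Analytics terms.
THRESHOLDS = {
    "ad_hoc": 500_000,          # sessions in the selected date range
    "flow": 100_000,            # sessions in the selected date range
    "mcf_modified": 1_000_000,  # conversions
}

# Report types that are served without sampling.
UNSAMPLED = {"default", "mcf_unmodified"}


def exceeds_threshold(report_type, volume):
    """Return True if a report of this type over this volume is expected to be sampled."""
    if report_type in UNSAMPLED:
        return False
    return volume > THRESHOLDS[report_type]


print(exceeds_threshold("default", 2_000_000))  # default reports use pre-aggregated data
print(exceeds_threshold("ad_hoc", 600_000))
print(exceeds_threshold("flow", 150_000))
```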
Why sampling is an issue
If the sample size is, say, 90% of sessions, then the overall information in your reports is likely to be reliable. However, the smaller the data set used for the sample, the less accurate your results and interpretations may be. If you’re seeing 100 sessions in a report based on a 1% sample, your estimate is based on 1 session, multiplied by 100. The other 99% of the data remains shrouded in complete mystery.
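To get a feel for how quickly accuracy degrades, here is a minimal simulation. It assumes simple random sampling of sessions, which is a simplification of what Google Analytics actually does, and it uses a made-up data set with a 2% conversion rate. All names and numbers are illustrative.

```python
import random


def estimate_from_sample(sessions, sample_rate, rng):
    """Estimate total conversions by scaling up a random sample of sessions."""
    sample_size = max(1, round(len(sessions) * sample_rate))
    sample = rng.sample(sessions, sample_size)
    # Scale the sampled count back up to the full population.
    return sum(sample) * len(sessions) / sample_size


def relative_spread(sessions, sample_rate, runs=50, seed=42):
    """Mean absolute relative error of the estimate across repeated samples."""
    rng = random.Random(seed)
    true_total = sum(sessions)
    errors = [
        abs(estimate_from_sample(sessions, sample_rate, rng) - true_total) / true_total
        for _ in range(runs)
    ]
    return sum(errors) / runs


# Hypothetical data set: 20,000 sessions, about 2% of which converted (1 = conversion).
data_rng = random.Random(0)
sessions = [1 if data_rng.random() < 0.02 else 0 for _ in range(20_000)]

for rate in (0.9, 0.1, 0.01):
    error = relative_spread(sessions, rate)
    print(f"sample rate {rate:>4.0%}: mean relative error {error:.1%}")
```

The exact figures depend on the random seed, but the error at a 1% sample should come out far larger than at a 90% sample, and rare events such as conversions suffer the most.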
Without seeing the whole picture, you can’t fully trust your data. When a well-known brand of cat food claims that 8 out of 10 cats preferred their brand, can you be sure your tabby will love it? When a toothpaste producer claims 9 out of 10 dentists agree that their brand of toothpaste is best, could it be that these doctors were cherry-picked to provide preferable opinions? You never know. Your brain reads “8 out of 10” and interprets it as 80%, while of course there’s much left out of the picture. With sampling, Google Analytics works the same way. And the deeper you dig, the less you see.
Data accuracy might not be a great problem if you only look at the number of sessions. However, when it comes to money-related metrics such as goals, conversions, and revenue, sampling may cost you a fortune. Sampled reports deliver distorted metric values. Because of this, you risk overlooking valuable input from your advertising or, conversely, continuing to invest in underperforming ads. The result? Money lost.
How to avoid sampling
Whether or not sampling should concern you depends on how much ambiguity and uncertainty you can tolerate in your work. If you feel that sampling has become an issue, here are a handful of practices you can adopt in your analytics routine to avoid or minimize it.
In the Google Analytics interface
Let’s start with the practices you can adopt to avoid sampling without even resorting to any additional tools, straight in your Google Analytics account.
1. Shorten the date range of your report to stay under the sampling threshold
The longer the time span, the more data Google Analytics has to process and the higher the risk of sampling. Conversely, shorter date ranges produce more accurate data. It’s that simple.
For example, if your website gets fewer than 500,000 sessions per month but a longer time span triggers sampling, try looking at one month of data at a time. A more advanced approach is to aggregate the monthly data outside of the Google Analytics interface; we discuss this approach below.
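As a rough sketch of the month-by-month approach, the helper below splits a long date range into calendar-month windows that can be queried one at a time. The function names are illustrative, and the per-month counts in the usage example are invented numbers, not real report output.

```python
from datetime import date, timedelta


def monthly_ranges(start, end):
    """Split [start, end] into consecutive (first_day, last_day) pairs, one per calendar month."""
    ranges = []
    cursor = start
    while cursor <= end:
        # Find the first day of the following month.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        last_day = min(next_month - timedelta(days=1), end)
        ranges.append((cursor, last_day))
        cursor = next_month
    return ranges


def total_sessions(monthly_counts):
    """Add up per-month session counts taken from separate reports."""
    return sum(monthly_counts.values())


for first, last in monthly_ranges(date(2023, 1, 15), date(2023, 3, 10)):
    print(first.isoformat(), "to", last.isoformat())

# Invented per-month session counts, for illustration only.
print(total_sessions({"2023-01": 410_000, "2023-02": 380_000, "2023-03": 120_000}))
```

Summing works for additive metrics such as sessions or pageviews. Unique users cannot be added up this way, because the same user may appear in several months.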
2. Avoid ad-hoc reporting
As mentioned above, most default reports in Google Analytics are not subject to sampling. Analysts are often tempted to use customized, ad-hoc reports where the default ones would do the same job. This means that you may get more accurate results by simply avoiding segments and secondary dimensions in your reports.
For example, say you want to estimate the amount of organic search traffic landing on your website. You can do this either by applying the Organic segment to the Landing Pages report, or by looking at the Organic Search report under Channels with “Landing page” as the primary dimension. In the first case the report may be prone to sampling, while in the second you’ll see a report based on 100% of sessions.
Note, though, that default Google Analytics reports based on pre-aggregated data can display at most 50,000 rows per day. In ad-hoc reports, this limit is 1,000,000 rows per day. When the number of dimension-value combinations exceeds the limit, the additional values are grouped into a single row labeled (other).
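The effect of the row limit can be illustrated with a toy roll-up. This sketch assumes the rows with the largest values are kept and everything else is folded into (other). How Google Analytics actually picks the rows to keep is not described here, so treat the selection rule as an assumption. The page paths and counts are invented.

```python
from collections import Counter


def roll_up(rows, limit):
    """Keep the `limit` largest rows and fold the remainder into a single '(other)' row."""
    ranked = Counter(rows).most_common()
    kept = dict(ranked[:limit])
    overflow = sum(count for _, count in ranked[limit:])
    if overflow:
        kept["(other)"] = overflow
    return kept


# Invented pageview counts per page path, with a deliberately tiny limit.
pageviews = {"/a": 50, "/b": 30, "/c": 15, "/d": 5}
print(roll_up(pageviews, limit=2))
```

The totals still add up, but the detail for the low-volume rows is no longer available.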
3. Apply view-level filters to only display the data you need the most often
Google Analytics samples data at the view level, after view-level filters have been applied. This means that the sample is taken from the sessions allowed by the filter. Again, say you want to see how many visitors land on your website through organic search, and using an organic medium segment leads to a sample. If this happens often, you could create a duplicate view, then apply a permanent view-level filter to allow only organic traffic in that view. This approach might not entirely resolve the sampling issue, as ad-hoc queries might still trigger sampling on large amounts of data, but the unsampled default reports will give you valid information without extra hassle on your part.
Note that it’s not recommended to filter the data by page-level dimensions. Say you have an Ecommerce website with different product categories and pages. If you use a different view for each page type, the session of one user will end up being split across different views, and the total number of sessions may be heavily inflated.
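A toy example shows why page-level filters inflate session counts. The hits, view names, and path prefixes are invented, and the filter is modeled as a simple path-prefix match, which is a simplification.

```python
def sessions_per_view(hits, view_filters):
    """Count sessions in each view when a view keeps only hits whose page matches its prefix."""
    counts = {}
    for view, prefix in view_filters.items():
        # A session shows up in a view if at least one of its hits survives the filter.
        session_ids = {session_id for session_id, page in hits if page.startswith(prefix)}
        counts[view] = len(session_ids)
    return counts


# Invented hits as (session_id, page_path). Session "s1" browses two categories.
hits = [
    ("s1", "/shoes/sneakers"),
    ("s1", "/bags/tote"),
    ("s2", "/shoes/boots"),
]
views = {"Shoes view": "/shoes/", "Bags view": "/bags/"}

counts = sessions_per_view(hits, views)
actual_sessions = len({session_id for session_id, _ in hits})
print(counts, "sum across views:", sum(counts.values()), "actual sessions:", actual_sessions)
```

Session "s1" is counted once in each view, so the per-view totals add up to more sessions than actually took place.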
4. Track each website with a different Property
It’s a common practice to track multiple websites in a single Google Analytics property and use filters whenever there’s a need to look at a single website. The more data you collect in one property, the higher the risk that you’ll get a sampled report. If that’s your case, consider using a separate property for each of your websites. This will reduce the amount of traffic to each property, thereby reducing the risk of sampling.
Outside the Google Analytics interface
Below, we explore how you can prevent sampling by exporting data from Google Analytics. Keep in mind that it’s impossible to extract raw demographic data from Google Analytics, since this data is always aggregated in the system.
1. Use OWOX BI Pipeline
OWOX BI collects data in Google BigQuery directly from the website. The service is independent of Google Analytics and avoids its restrictions, allowing you to build reports without sampling and based on any parameters.
At the same time, OWOX BI uses a data structure compatible with Google Analytics and includes many prewritten SQL queries, saving you time when preparing reports.
Collecting raw data with OWOX BI, you can:
- Build reports without sampling or restrictions. The service transfers data from your website to Google BigQuery in full and in a non-aggregated form. It also increases the maximum size of transmitted hits to 16 KB. This means you’ll get a full picture of user actions on your website.
- Collect an unlimited number of user parameters and key dimensions in BigQuery. This allows you to segment users by any characteristic and build deep reports for detailed analysis.
- Analyze your data in real time. With OWOX BI, you can quickly set up triggered emails and spot problems on your website, since user behavior data appears in your Google BigQuery project within 1 to 5 minutes of the behavior occurring.
- Compare the profitability of cohorts, landing pages and product groups. The service calculates the value of each session so you can determine the ROI/ROAS for new and returning users. Find out how much you spend and how much you earn on each group you advertise to. Evaluate advertising performance across regions, landing pages, mobile app versions, and applications.
- Account for completed orders and returns, or find out what a new subscriber did on your website in the 30 days before signing up. OWOX BI allows you to retrospectively update information about costs, users, and transactions that has already been uploaded to Google BigQuery.
- Ensure data quality and security. OWOX BI compares data in BigQuery with data in Google Analytics daily and reports on any significant discrepancies to make sure you don’t lose important data that third-party trackers can’t provide. The service also automatically saves data when your Google Analytics or Google Cloud project fails. Therefore, we’re ready to guarantee a level of data collection quality and processing above 96% in our service-level agreement (SLA).
- Collect personal data. Unlike in Google Analytics, in Google BigQuery you can collect and process personal customer data including emails and phone numbers.
Read more about all the benefits of collecting data from the website using OWOX BI in the article "How to avoid sampling and collect complete data for advanced analytics".
2. Use Google Analytics API
Another way to deal with your sampling issues is to access the data programmatically outside the Google Analytics interface through the Google Analytics Reporting API. While API responses may contain sampled data over a long time span, the API allows you to specify how much data you want to retrieve in one request, and also adjust the sampling level.
If you’re running a high-traffic website and your data is getting heavily sampled, it may require you to run hundreds of different requests to extract all the data you need. The API allows for up to 50,000 requests per project per day and returns up to 10,000 rows per request.
The main drawback of this approach is that it takes your team’s time and programming skills. Running thousands of daily requests manually is impractical, so you need code to automate the process. Also note that the API accepts a maximum of 7 dimensions and 10 metrics per query, a query must always have at least one metric, and only certain dimensions can be combined in one query. See the Reporting API documentation for more information.
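As a starting point for such automation, the sketch below builds a request body in the shape expected by the Reporting API v4 `reports.batchGet` method and checks the limits quoted above before anything is sent. It makes no network call, the view ID is a placeholder, and the helper names are illustrative. Authentication and paging are left out.

```python
MAX_DIMENSIONS = 7   # limit quoted in this article
MAX_METRICS = 10     # limit quoted in this article
PAGE_SIZE = 10_000   # rows per request quoted in this article


def build_report_request(view_id, start_date, end_date, metrics, dimensions):
    """Build one Reporting API v4 `reportRequests` entry as a plain dict."""
    if not metrics:
        raise ValueError("a query needs at least one metric")
    if len(metrics) > MAX_METRICS:
        raise ValueError(f"at most {MAX_METRICS} metrics per query")
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"at most {MAX_DIMENSIONS} dimensions per query")
    return {
        "viewId": view_id,
        "dateRanges": [{"startDate": start_date, "endDate": end_date}],
        "metrics": [{"expression": metric} for metric in metrics],
        "dimensions": [{"name": dimension} for dimension in dimensions],
        "samplingLevel": "LARGE",  # request the largest sample: slower, more precise
        "pageSize": PAGE_SIZE,
    }


def is_sampled(report):
    """A v4 report includes `samplesReadCounts` in its data only when it was sampled."""
    return bool(report.get("data", {}).get("samplesReadCounts"))


# "123456789" is a placeholder view ID.
body = {
    "reportRequests": [
        build_report_request(
            "123456789", "2023-01-01", "2023-01-31",
            metrics=["ga:sessions"], dimensions=["ga:date", "ga:medium"],
        )
    ]
}
print(body)
```

Combining this with short date ranges and checking `is_sampled` on every returned report makes it possible to shrink the range and retry whenever a response comes back sampled.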
Below is an example of the error you may see if you supply more than 7 dimensions:
3. Use Google Analytics Spreadsheet add-on
The official add-on uses the Google Analytics Reporting API, letting you query Google Analytics data and bring it into Google Sheets without having to code. The add-on can automatically extract the data you need from one or more Google Analytics views so that you can manipulate it in Google Sheets, perform custom calculations, create visualizations, and share them with your partners or colleagues. It’s also worth mentioning that the add-on allows 9 dimensions, 2 more than when accessing the API in any other way.
However, keep in mind that Google Sheets has its own limits, such as a maximum of 400,000 cells across all sheets in a spreadsheet, which makes it impossible to use the add-on for really large amounts of data:
4. Upgrade to Google Analytics 360 (formerly Premium)
Google Analytics 360 provides lots of advanced solutions to the sampling issue, including:
- Sampling threshold of 100 million sessions per view, that is, 200 times more than the 500,000 sessions per property in the standard version.
- Unsampled Reports with up to 3 million unique rows of data that can be run on demand or on a scheduled basis.
- Custom Tables with up to 1 million rows per day, enabling instant access to unsampled data for up to 6 dimensions, 25 metrics, 5 filters and 4 segments in a table.
Beyond that, Google Analytics 360 can be integrated with the Google BigQuery analytics data warehouse, with a $500 monthly credit to spend on Google BigQuery projects. The integration allows you to automatically import unsampled, hit-level, near-real-time data from Google Analytics to Google BigQuery, and use SQL-like queries to create even the most advanced and complex reports in a matter of seconds.
Google Analytics 360 is designed as an enterprise-level solution and therefore requires significant annual investment. If you’re thinking about switching to the paid version, consider these three conditions: you’re facing sampling all the time, your website gets more than 10 million hits per month, and your annual revenue allows investing in the license.
Note: Due to the release of Google Analytics 4, the Universal Analytics platform will stop processing data on July 1, 2023. Google Analytics 360 properties will stop processing hits on October 1, 2023.
5. Use Google BigQuery export in Google Analytics 4
One of the main advantages of the new Google Analytics 4 is the free export of raw, unsampled data to Google BigQuery. Previously, this option was available only in the paid Google Analytics 360.
In Google Analytics 4, the standard reports are always unsampled, but sampling can be applied in custom reports, for example when you compare data, use additional parameters and filters, or exceed the limit of 10 million events. Sampling is also applied when the date range is greater than 60 days.
By setting up the Google Analytics 4 integration with Google BigQuery, you’ll be able to collect raw, unsampled data from the site into cloud storage, where every user action is stored as a separate event row. By running SQL queries against this data, you can calculate any parameters and indicators you need.
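As an illustration of what such a query might look like, the sketch below builds a BigQuery Standard SQL statement that counts sessions per day. It assumes the standard GA4 export layout (daily `events_YYYYMMDD` tables, a `user_pseudo_id` column, and a `ga_session_id` event parameter). The project and dataset names are placeholders, and the helper only returns the query text without running it.

```python
def daily_sessions_query(dataset, start_suffix, end_suffix):
    """Return a BigQuery Standard SQL query that counts GA4 sessions per day."""
    for suffix in (start_suffix, end_suffix):
        # Table suffixes are dates in YYYYMMDD form; reject anything else.
        if not (len(suffix) == 8 and suffix.isdigit()):
            raise ValueError(f"expected a YYYYMMDD suffix, got {suffix!r}")
    return f"""
SELECT
  event_date,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value
          FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `{dataset}.events_*`
WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'
GROUP BY event_date
ORDER BY event_date
""".strip()


# Placeholder project and dataset; replace with your own export dataset.
print(daily_sessions_query("my-project.analytics_123456789", "20230101", "20230131"))
```

The resulting text can be pasted into the BigQuery console or passed to a BigQuery client library. Because it reads the exported events directly, the result is not subject to the sampling applied in the Google Analytics interface.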
By collecting website data in Google BigQuery, you can avoid sampling and other limitations of Google Analytics 4. You will analyze complete data, which means that decisions based on this data will be better informed.
The migration to Google Analytics 4 is an inevitable reality that most companies will face as early as July 1, 2023. OWOX will make this transition as painless as possible.
We will help you develop and implement a metrics system, as well as correctly set up data tracking. You will be able to save the reports you need and get new ones without having to deal with the new data structure and rewrite SQL queries.
Which way should you choose to avoid sampling? To each their own: every organization has to find its own best way. We recommend starting with the simplest approach and paying close attention to the shortcomings of whichever one you choose. And if you have questions, don’t hesitate to ask them in the comments section.