Statistics in Web Analytics, or How to Become a True Data Scientist
Why you need statistics in web analytics
To start with, how can you use the information you get from your website? The first and the simplest thing that comes to mind is to learn more about your traffic: where it comes from, number of visits, clicks, and so on. For this task, standard Google Analytics reports are enough.
However, Google Analytics alone is not the best fit for calculating relative metrics such as ROAS (return on ad spend: revenue from advertising / advertising costs × 100%) or CPC (the cost of one click on an ad). You can’t really rely on metrics that are calculated without considering the specifics of individual advertising channels, external factors, and running tests.
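To make the two metrics above concrete, here is a minimal sketch of how they could be computed per channel. The function names and the channel figures are made up for illustration; they don't come from any real account.

```python
def roas(ad_revenue: float, ad_cost: float) -> float:
    """Return on ad spend as a percentage: revenue / cost * 100."""
    if ad_cost <= 0:
        raise ValueError("ad_cost must be positive")
    return ad_revenue / ad_cost * 100


def cpc(ad_cost: float, clicks: int) -> float:
    """Cost per click: total ad cost divided by the number of clicks."""
    if clicks <= 0:
        raise ValueError("clicks must be positive")
    return ad_cost / clicks


# Hypothetical figures per channel: (ad revenue, ad cost, clicks).
channels = {
    "search": (5000.0, 1000.0, 2500),
    "social": (1200.0, 800.0, 4000),
}

for name, (revenue, cost, clicks) in channels.items():
    print(f"{name}: ROAS = {roas(revenue, cost):.0f}%, CPC = {cpc(cost, clicks):.2f}")
```

Computing the metrics per channel, rather than for the whole account, is what lets you compare channels with each other.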
Statistical methods can help meet many objectives and business challenges, including:
- Classify your users and manage advertising more efficiently.
- Evaluate the effect website redesign has on business performance. For example, use A/B testing to see how reducing the number of fields in a checkout form affects conversion rates.
- Evaluate the effect that an increase or decrease in a certain metric has on the business, once you have determined the permissible values for the main website performance indicators.
- Predict the behavior of website users by different indicators. Identify your potential buyers and run targeted advertising campaigns.
Roughly speaking, the whole process of data analysis can be divided into three main phases:
- Digital analytics: collecting and analyzing data. This phase mainly includes superficial analysis of user interactions with the website and evaluation of advertising performance. For example, Digital Analysts can identify the most popular web pages and product categories, and discover weak spots in website functionality.
- Data governance: managing the data. This phase includes coordinating responsibilities among different company departments, and managing access to the data.
- Data science: the art of data processing and management. This phase includes a deeper analysis of the collected data: website user behavior, offline transactions, purchases made over the phone, data from CRM systems. Data scientists can assess the impact of a particular factor (acquisition source, location, day of the week, etc.) on the observed results, such as conversion rates, and predict future outcomes.
The difference between a Digital Analyst and a Data Scientist
Digital analytics is, in fact, the first step towards Data Science. Data science is widely used in various fields: analytics, biology, medicine, psychology, political science, etc. Regardless of the field, any Data Scientist should:
- Be familiar with the subject area, and know how to analyze the available information.
- Be able to work with large amounts of data (have competence in such programming languages as R and Python, know how to apply machine learning).
- Have a good understanding of statistical analysis methods (have some math background).
If you try to sketch these requirements schematically, you’ll get a picture like this one, with a Data Scientist in the very center:
To better grasp the difference between a Digital Analyst and a Data Scientist, let’s take a look at one particular example. Let’s say the revenue generated by the website decreased by 3%, as compared to the average value for the previous week.
Digital Analysts would be able to:
- Point towards the source that generated significantly less traffic than before.
- Tell the time when the traffic began decreasing.
- Calculate the exact percentage of the traffic drop for different sources.
Data Scientists will consider the situation from a different perspective, using methods of mathematical statistics. They will start by checking whether the observed revenue falls outside the range of admissible values for this indicator (the so-called confidence interval), and whether the change should be considered critical. Perhaps no immediate action is needed if, for example, the revenue for the day was lower than on the same day of the previous week but not lower than the average value for the month.
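The check described above could be sketched roughly like this. It's a simplified illustration that assumes daily revenue is approximately normally distributed and takes the admissible range to be the mean plus or minus z standard deviations of past values. The function names and the revenue figures are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def admissible_range(history, confidence=0.95):
    """Return (low, high) as mean +/- z * sample standard deviation of past values."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(history)
    spread = z * stdev(history)
    return center - spread, center + spread


def is_critical(value, history, confidence=0.95):
    """True if the new value falls outside the admissible range."""
    low, high = admissible_range(history, confidence)
    return not (low <= value <= high)


# Hypothetical daily revenue for the previous week (in thousands).
last_week = [100, 102, 98, 101, 99, 103, 97]

# A day that is 3% below the weekly average of 100.
print(is_critical(97, last_week))
# A much larger drop.
print(is_critical(90, last_week))
```

With these made-up numbers, a 3% drop stays inside the range, so it wouldn't call for urgent action on its own, while a 10% drop would.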
In general, Data Science includes performing the following tasks:
- Manage risks, that is, recommend management decisions to reduce the likelihood of poor outcomes and minimize possible business losses.
- Make forecasts for any indicators that are important for the business. You can do this using the Box-Jenkins approach. Thanks to these forecasts, you can plan purchases, pricing, advertising campaigns, and promotions. For example, you can forecast sales of particular products over a given period of time.
- Classify users for different purposes, such as targeting, using logistic or probit regression, and assess the quality of the classification with a ROC curve.
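As a small illustration of the classification task, here is a sketch of how the area under a ROC curve (AUC) can be computed for scores produced by some classifier, for example a logistic regression. The function name, labels, and scores are hypothetical. The AUC is computed here as the share of (buyer, non-buyer) pairs in which the buyer received the higher score, which is equivalent to the area under the curve.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case (e.g. a buyer), 0 for a negative one.
    scores: the classifier's predicted score for each user.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair counts as 1, a tie counts as 0.5.
    wins = sum(
        (pos > neg) + 0.5 * (pos == neg)
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical data: two buyers, two non-buyers.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

An AUC of 0.5 means the scores rank users no better than chance, and 1.0 means every buyer is ranked above every non-buyer.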
Now, let’s take a closer look at a couple of tasks that can be solved using statistical methods.
How to apply statistical methods in A/B testing
A/B testing is probably one of the most common tasks in web analytics. The results of testing must be validated to make sure that you can rely on them. This is where statistics comes to your aid. When conducting A/B tests, it’s worth considering such concepts as statistical power, sample size, confidence interval, and statistical significance. Now, let’s look at a few examples of what these concepts mean, and how to apply them.
Statistical power is expressed as a percentage and describes how likely the test is to detect a difference between the two options when that difference really exists. Let’s say you want to test the hypothesis that men prefer green over red. If you show two different buttons to two different men, and one clicks on the red one while the other picks the green one, can you say that your hypothesis is refuted? Of course not. One of the two men could possibly be colorblind, or simply a fan of bright colors. However, if you show those buttons to a thousand men who visit your website, you’ll be able to identify which button is more likely to be clicked. That is, the larger the sample, the greater the statistical power of the test. It’s not recommended to rely on tests whose statistical power is less than 80%.
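To see how power grows with the sample, here is a rough sketch that estimates the power of a two-sided test comparing two click rates. It uses the normal approximation, which is an assumption on our side, and the function name and the click rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    effect = abs(p1 - p2) / se
    # Probability that the test statistic lands in either rejection region.
    return nd.cdf(effect - z_alpha) + nd.cdf(-effect - z_alpha)


# Hypothetical true click rates: 30% for green, 26% for red.
for n in (10, 100, 1000, 5000):
    print(n, round(power_two_proportions(0.30, 0.26, n), 3))
```

Running it for several sample sizes shows the power climbing as more visitors see each button.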
How large should the sample be to provide reliable results? It depends on the statistical power and significance (discussed below) you expect from the test, and on the smallest difference you want to detect. Fortunately, you don’t have to calculate sample size manually, as there are convenient online calculators, like this one.
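If you're curious what such a calculator does under the hood, here is a minimal sketch based on the common normal-approximation formula for comparing two proportions. Online calculators may use slightly different formulas, so treat the result as a ballpark. The function name and the click rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect p1 vs p2 (two-sided test)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # significance requirement
    z_power = nd.inv_cdf(power)          # power requirement
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p1 - p2) ** 2)


# Hypothetical goal: detect a 30% vs 26% click rate with 80% power at 5% significance.
print(sample_size_per_group(0.30, 0.26))
```

Note how quickly the required sample grows when the difference you want to detect gets smaller.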
Another aspect to consider in A/B testing is statistical significance. It describes how unlikely it is that the observed result occurred by chance alone. The conventional confidence level in A/B testing is 95%. The remaining 5% is the significance level: the accepted probability of declaring a difference when there is none. A result is considered significant when its p-value falls below that threshold. A quick way to judge the significance of a test is to compare the confidence intervals of the variants and see how much they intersect.
Confidence intervals (the range of values within which a population parameter is expected to fall at a given confidence level) tell you how stable the results of the test are, in other words, whether the results will stay the same if you increase the size of the sample. Let’s say a green button (variant A) was shown to a thousand website visitors and 30% of them clicked on it. Then you can calculate the margin of error (you can do this online), which is equal to ±2.8%. This means that, if you increase the size of the sample, there’s a 95% probability that 27.2% to 32.8% of visitors will click on the green button. Another 1,000 visitors were shown a red button (variant B), and 26% of them clicked on it. The confidence interval for this group is 23.3% to 28.7%.
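The margins of error in this example can be reproduced with the usual normal-approximation formula for a proportion, z × sqrt(p(1 − p) / n). Here is a minimal sketch; the function name is ours, and the numbers are the ones from the example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(clicks, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a click rate."""
    p = clicks / visitors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin


# Variant A: 300 clicks out of 1,000 visitors; variant B: 260 out of 1,000.
low_a, high_a = proportion_ci(300, 1000)
low_b, high_b = proportion_ci(260, 1000)
print(f"A: {low_a:.1%} to {high_a:.1%}")
print(f"B: {low_b:.1%} to {high_b:.1%}")
```

This approximation works well for samples of this size; for very small samples or rates close to 0% or 100%, other interval methods are usually preferred.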
If we compare the confidence intervals for variants A and B, we’ll see that they intersect in the range from 27.2% to 28.7%. The graphical presentation of the comparison looks as follows:
The KPI (in our example, it’s conversion rate) goes on the X-axis, and probability density (the density of a random variable) goes on the Y-axis. The smaller the intersection area of the confidence intervals, the higher the reliability of the test results. In our example, the intervals overlap by 1.5 percentage points. This figure does not exceed the 5% threshold, and therefore the test can be trusted.
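Comparing intervals by eye is a handy shortcut, but you can also put a number on the result. One common way, which is a different technique from the overlap reading above, is a two-proportion z-test. Here is a minimal sketch using the figures from the example; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided z-test for the difference between two click rates.

    Returns (z, p_value), using the pooled click rate for the standard error.
    """
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_a - rate_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Variant A: 30% of 1,000 visitors clicked; variant B: 26% of 1,000.
z, p_value = two_proportion_z_test(300, 1000, 260, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

For these numbers the p-value comes out just under 0.05, so the difference passes the 5% significance threshold, though not by much.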
There are a number of statistical criteria you can apply to decide whether or not to accept the hypothesis. One of the best-known and most widely used criteria is the t-test, also known as Student’s t-test. Strictly speaking, a t-test is any statistical hypothesis test in which the test statistic follows Student’s t-distribution. Here’s a calculator you can use to calculate the t-statistic and validate the test. Just copy the document and fill in the green cells with the values you got as a result of testing.
Rejoice, geeks, we’ve got some formulas for you :)
To start with, the t-test can only be applied under the following conditions:
- The source data should be normally distributed.
- If a two-sample t-test is used for independent samples, the variances should be equal. A two-sample t-test checks the hypothesis that the means of two samples are equal; it can be applied, for example, when you need to compare final exam scores at two different universities.
A two-sample t-test for independent samples
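If there’s little difference in sample sizes, a simplified formula can be applied for approximate calculations. In its standard form it reads:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $X_1$ and $X_2$ are random variables with sample means $\bar{X}_1$ and $\bar{X}_2$, $n_1$ and $n_2$ are the numbers of elements in each sample, and $s_1^2$ and $s_2^2$ are the sample variances.
The number of degrees of freedom is calculated as follows:

$$df = n_1 + n_2 - 2$$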
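As a minimal sketch, here is how the calculation could look in code. It assumes the standard simplified form of the statistic (the difference of the sample means divided by the square root of s1²/n1 + s2²/n2) and n1 + n2 − 2 degrees of freedom. The function name and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(sample_1, sample_2):
    """Return (t, df) for a two-sample t-test on independent samples.

    Assumes similar sample sizes and equal variances, as required above.
    """
    n1, n2 = len(sample_1), len(sample_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    # statistics.variance is the sample variance (divides by n - 1).
    standard_error = sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    t = (mean(sample_1) - mean(sample_2)) / standard_error
    df = n1 + n2 - 2
    return t, df


# Hypothetical data: order values for two groups of visitors.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
print(two_sample_t(group_a, group_b))
```

To decide whether to reject the hypothesis, compare the absolute value of t with the critical value of Student's t-distribution for the resulting degrees of freedom and your chosen significance level.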
The t-test approach has the following advantages:
- It works reliably with huge samples, as there’s no limit to the data volume.
- It takes account of the distribution and size of the sample.
- It’s suitable for measuring different parameters; quantitative indicators can be compared as well.
So, you want to become a Data Scientist. Where to begin?
Here, we made a compilation of useful resources to help you on your journey to becoming a Data Scientist.
There are lots of free online courses offering lectures and presentations on data science. These lectures are usually followed by tasks and topics for self-study. If something doesn’t work out, you can always ask your teacher a question on the forum. In addition, it’s possible to get certificates, usually for a fee. Here are a few websites worth visiting:
The R programming language is one of the most popular tools for working with big data. Here are a few resources that will help you learn the language and chat with professionals:
Another goodie — online games on probability theory and mathematical statistics:
We promised you a couple of examples of how you can use statistical methods, but this blog post only had room for A/B testing. Enter your email, and we’ll send you an example of how you can classify your users. You’ll be able to identify user segments with the highest and the lowest LTV (lifetime value: the predicted total revenue a business will get from its entire relationship with a customer) and use different marketing strategies for these segments.
That’s all for today. Hope you found this useful :) If you have any questions left, drop them in the comments below. We’ll be happy to answer.