Common data quality issues and how to overcome them

Though we live in the age of data-driven technologies, more than 50% of senior executives are generally dissatisfied with the value obtained from analytics. One of the main reasons for this is poor data quality. Poor-quality data at the very beginning of a task makes everything that follows useless: any decisions made will be based on faulty information, and for businesses, that's bad news.

That’s why we’ve decided to share our many years’ worth of experience and expertise working with data. Over the past year, we’ve talked a lot with our fellow analysts and conducted lots of interviews with world-class specialists. Now we want to provide guidance on the most frequent problems in working with data and how to overcome them.

In the article, we discuss the stages of a data workflow, what errors can occur, and why. And most importantly, we share approaches to solve data quality difficulties.


Some definitions before we begin:

What is quality data? In a nutshell (and in terms of marketing data), quality data is relevant, up-to-date data without errors and discrepancies. If we look up data quality on Wikipedia, we’ll see more than 10 (!) definitions. Furthermore, Wikipedia cites the latest research by DAMA NL into definitions of dimensions of data quality using ISO 9001 as a frame of reference.

Why is quality data a must for marketers, analysts, and decision-makers? Without quality and trustworthy data, it’s impossible to make decisions relying on marketing analytics.

Where do data quality issues come from?

In dealing with the vast amount of data that marketers and analysts use daily, it’s a challenge to completely eliminate errors and discrepancies. It’s extremely difficult to immediately provide quality data to an end user; however, data errors can be actively fought and proactively found. First of all, let’s look at the process of working with data and distinguish the steps where data quality issues can appear:

  1. Measurement planning
  2. Primary data collection
  3. Raw data normalization
  4. Business data preparation
  5. Data visualization

In our experience, factual data errors usually appear in the first two steps, whereas cases of data misuse happen more frequently during the last steps.

Data processing workflow

Let’s see in more detail what data quality problems can arise at these steps and how to solve them.

Step 1. Plan measurements

Even though there are no errors in data at this step, we cannot completely omit it. The devil is in the details, and collecting data for analysis begins with detailed planning. Our recommendation is to always start with an express analysis and carefully plan the collection of all the marketing data you need.

Skipping the planning step leads to an unstructured approach and fragmented data that isn't enough for new tasks or projects, whereas the goal is to collect complete data from all the data sources you work with. If you don't have all the data, your decisions and actions are flawed from the very beginning. Let's see what data you should collect before starting new projects:

  • User behavior data from your website and/or application
  • Cost data from advertising platforms
  • Call tracking, chatbot, and email data
  • Actual sales data from your CRM/ERP systems, etc.

Step 2. Collect primary data

Once you’ve created your measurement plan, let’s proceed to the primary data collection step. During this step, among all other challenges you need to overcome, you have to consider controlling access to your data (it’s all about data security) and preparing in advance for the creation of your data storage or data lake.

If you want to gain complete control over your raw data without modifying it, our recommendation is to use a single storage with automated data import. For marketing needs, Google BigQuery remains one of the best options.

Data quality difficulties you can come across at this step:

1.1 Getting incomplete and incorrect data from an advertising service’s API

What is this?

Advertising platforms and services collect vast amounts of valuable user behavior data, and the challenge is to retrieve all of this information from these data sources without losing any of it.

*An Application Programming Interface (API) is the part of a service that transmits data: it receives requests and sends responses whenever a user or another system interacts with it. Most modern websites use at least several third-party APIs.

What’s wrong with that?

An advertising service collects data about users' actions; however, data may change retrospectively after it's transferred, and the service's API may be updated or be unavailable for some time. As a result, part of the data simply won't be delivered, and the overall quality of the data will be corrupted. The analyst may not know about this and may use irrelevant data to prepare business data. But it's not possible to get useful insights or make the right decision based on incomplete and/or inaccurate data. Simply put, you can invest lots of time and effort and still find yourself back at square one, with neither useful data nor proof that your marketing is profitable for the business.

What are the reasons?

Data can be incomplete or incorrect due to data loss, discrepancies, or duplicates. Common reasons for these issues include:

  • A new account on an advertising platform is created, but analysts aren’t informed and data from it isn’t collected.
  • An API doesn’t support dynamic parameters in UTM tags and doesn’t collect or transfer them. If you use unsupported parameters in UTM tags, you’ll see the parameter names instead of their values — for example, {{geo}} instead of the actual campaign name.
  • Retrospective data updates. Advertising services often retrospectively update data (on costs, clicks, and impressions in advertising accounts). However, not all analysts are aware of this and take it into account.
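The unsubstituted-macro symptom from the second bullet is easy to detect automatically: if a value like `{{geo}}` survives into your data, the macro was never resolved. A minimal Python sketch (the field value and macro names are illustrative):

```python
import re

def find_unsubstituted_params(utm_value: str) -> list[str]:
    """Return any dynamic parameters left unsubstituted in a UTM value.

    Platforms substitute macros such as {{geo}} or {campaign_id} at click
    time; a macro surviving into your data means the platform or its API
    did not support it.
    """
    return re.findall(r"\{\{?\w+\}?\}", utm_value)

# A literal "{{geo}}" in utm_campaign signals a macro that was never resolved
assert find_unsubstituted_params("summer_sale_{{geo}}") == ["{{geo}}"]
assert find_unsubstituted_params("summer_sale_berlin") == []
```

Running such a check over incoming UTM values lets you flag broken tags before they poison reports.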

How to solve this problem?

Since you can't directly control the code behind third-party APIs, you can fight these problems by:

  • distributing areas of responsibility during data collection to gain better control over data harvesting.
  • using automated data import tools that support API changes, dynamic parameters, etc. In case data is unavailable, these tools can show existing data gaps and download data retrospectively.
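The gap detection such tools perform can be sketched in a few lines. Assuming you track which dates already have cost rows loaded, a hypothetical check might look like:

```python
from datetime import date, timedelta

def find_missing_dates(loaded: set, start: date, end: date) -> list:
    """Return dates in [start, end] for which no cost rows were loaded.

    Any date returned is a gap to re-request from the advertising API
    (or to download retrospectively once the API is available again).
    """
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - loaded)

loaded = {date(2024, 5, d) for d in (10, 11, 12, 14, 15)}
gaps = find_missing_dates(loaded, date(2024, 5, 10), date(2024, 5, 15))
assert gaps == [date(2024, 5, 13)]  # one day of cost data never arrived
```

The same scan, run daily, also catches retrospective updates: if a day's totals change after the fact, re-download that day rather than trusting the stale copy.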

1.2 Getting incomplete and incorrect data from a website

What is this?

By analyzing data from advertising services, we know how much we spend on advertising, while website user behavior data tells us how much we earn. Since business questions usually sound like “which advertising pays off, and which does not?”, it’s essential to know the income/expense ratio.

What's wrong with that?

Data on website user behavior differs from cost data collected by advertising services because data on website user behavior:

  • is collected directly by website owners
  • is significantly larger in volume than cost data from advertising services

In addition, if we don’t notice that user behavior data isn’t being collected and don’t eliminate the problem, that data is lost forever.

As with data collection problems from advertising service APIs, incomplete or incorrect information leads to erroneous solutions. Incorrect management decisions, in turn, lead to a loss of income.

What are the reasons?

Errors can occur when collecting website data because:

  • Not all website pages have a Google Tag Manager (GTM) container. GTM tags collect advertising campaign results and user behavior data, so if there’s no GTM container on a page, no data is collected from it.
  • The Google Cloud project isn’t paid for on time, and therefore data collection is put on hold.
  • The correctness of data entered into application forms by website users (name, address, email, etc.) is not validated.
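Validation of form input like that in the last bullet can run on the client or the server. A simplified Python sketch (the field names and patterns are illustrative, not production-grade validators):

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^\+?\d{10,15}$")

def validate_lead(form: dict) -> list:
    """Return a list of validation errors for a submitted application form."""
    errors = []
    if not form.get("name", "").strip():
        errors.append("name is empty")
    if not EMAIL_RE.match(form.get("email", "")):
        errors.append("email is malformed")
    # Strip common formatting before checking the digits
    if not PHONE_RE.match(re.sub(r"[\s()-]", "", form.get("phone", ""))):
        errors.append("phone is malformed")
    return errors

assert validate_lead({"name": "Ann", "email": "ann@example.com",
                      "phone": "+1 (202) 555-0144"}) == []
assert "email is malformed" in validate_lead(
    {"name": "Bob", "email": "bob@", "phone": "12025550144"})
```

Rejecting (or at least flagging) malformed entries at submission time is far cheaper than cleaning them out of the CRM later.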

How to solve this problem?

As with collecting data from an API, the solutions for website data collection include:

  • distributing areas of responsibility during data collection to gain better control over data harvesting
  • using automated data import tools; in case data is unavailable, these tools can warn you of existing data errors

2. Getting aggregated, sampled data

What is this?

Aggregated and sampled data is generalized data that appears in cases when not all data is processed and used for analysis and reporting. This happens when services like Google Analytics analyze only part of the data to reduce the load on servers and balance the speed and accuracy of data processing. Since sampling results in generalization, it leads to a lack of trust in the obtained results.

What’s wrong with that?

Sampled reports distort performance data, and that can cost you a fortune when it comes to money-related metrics such as goals, conversions, and revenue. Because of this, you risk not noticing a profitable advertising campaign and may turn it off due to distorted data in a report, or vice versa — you may spend all your money on inefficient campaigns.

In other words, every time sampling is applied, there’s a risk of obtaining inaccurate results. And if you try to make a decision based on only part of the data, you play guessing games with the company’s budget.

What are the reasons?

In an attempt to create reports as soon as possible and save resources, systems apply sampling, aggregation, and filtering instead of processing massive data arrays.

How to solve this problem?

The only thing you can do to avoid data sampling is to collect raw data and constantly check data completeness throughout all your reports. This monitoring is best done automatically to avoid human error. For example, you can apply automatic testing of correct metrics collection on your website, as one of our clients did with the help of OWOX BI.
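One way to check completeness is to compare raw, unsampled counts from your own storage against what a report shows for the same days. A hypothetical sketch:

```python
def completeness_ratio(raw_by_day: dict, reported_by_day: dict) -> dict:
    """Compare raw (unsampled) daily counts with what a report shows.

    A ratio well below 1.0 for a day suggests the report was built on
    sampled or incomplete data and should not be trusted as-is.
    """
    return {day: reported_by_day.get(day, 0) / raw
            for day, raw in raw_by_day.items() if raw}

raw = {"2024-05-10": 1000, "2024-05-11": 1200}       # sessions in raw storage
reported = {"2024-05-10": 1000, "2024-05-11": 600}   # sessions in the report
ratios = completeness_ratio(raw, reported)
assert ratios["2024-05-11"] == 0.5  # only half the sessions made it into the report
```

Alerting whenever a ratio drops below a threshold (say, 0.95) turns a silent sampling problem into a visible one.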

Step 3. Normalize raw data

After collecting all the necessary data, it’s time to normalize it. At this step, analysts turn available information into the form required by the business. For example, we must get phone numbers into a single format.
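As a sketch of such normalization, here is a simplified phone-number formatter. It's a toy under stated assumptions (10-digit numbers are treated as national); a real pipeline would use a dedicated library such as `phonenumbers` to handle country-specific rules:

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Normalize phone numbers to a single E.164-like form, e.g. +12025550144."""
    digits = re.sub(r"\D", "", raw)          # strip spaces, dashes, parentheses
    if raw.strip().startswith("+"):          # already has a country code
        return "+" + digits
    if len(digits) == 10:                    # assume a national number
        digits = default_country + digits
    return "+" + digits

assert normalize_phone("(202) 555-0144") == "+12025550144"
assert normalize_phone("+44 20 7946 0958") == "+442079460958"
```

Applied to every phone field on import, this ensures that "+1 (202) 555-0144" and "2025550144" land as the same customer record.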

Data normalization is manual, routine “monkey work” that keeps analysts from more exciting tasks such as extracting useful insights from data. Not to mention that normalization routinely takes up to 50% of an analyst’s total working time.

Data quality difficulties one can come across at this stage:

1. Insertion, updating, and deletion dependencies

What is this?

These are troublesome side effects that appear during the normalization of unstructured data.

What’s wrong with that?

The common outcome of these data dependencies is that reporting systems discard such incorrect data while analyzing it. As a result, we end up with inaccurate reports that aren’t based on full data. For example, say we have a sessions object and an advertisements object. In sessions, we have data for days 10 to 20, and in advertisements, there’s data from days 10 to 15 (for some reason there is no cost data for days 16 to 20). Accordingly, either we lose data from advertisements for days 16 to 20, or data from sessions will only be available for days 10 to 15.
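The sessions/advertisements example above comes down to how the two objects are joined. A full outer join keeps the mismatched days visible instead of silently dropping them; a minimal Python sketch:

```python
def merge_by_day(sessions: dict, costs: dict) -> dict:
    """Full outer join of daily sessions and ad costs so no day is dropped.

    An inner join would silently lose days 16-20 (sessions without costs);
    keeping them with cost=None makes the gap visible instead of hiding it.
    """
    days = sorted(set(sessions) | set(costs))
    return {d: {"sessions": sessions.get(d), "cost": costs.get(d)} for d in days}

sessions = {d: 100 + d for d in range(10, 21)}   # days 10-20
costs = {d: 50.0 for d in range(10, 16)}         # days 10-15 only
merged = merge_by_day(sessions, costs)
assert merged[18] == {"sessions": 118, "cost": None}  # the gap is kept, not dropped
assert len(merged) == 11
```

The `None` costs can then trigger an alert or a retrospective re-download instead of quietly shrinking the report.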

What are the reasons?

Undesirable side effects appear when an advertising service API is changed, isn’t available, or returns incorrect data.

How to solve this problem?

In the same way that you check for data collection errors, you should always verify the data you work with. Moreover, if the user doesn’t know the specifics of how the data is merged, chances are mistakes will occur while normalizing it. In practice, the best decision at this step is to develop a data quality monitoring system that alerts the person responsible for data quality in case of data anomalies. For this, you can use a service such as OWOX BI with its embedded data monitoring functionality.

Our clients
grow 22% faster

Grow faster by measuring what works best in your marketing

Analyze your marketing efficiency, find the growth areas, increase ROI

Get demo

2. Different data formats, structures, and levels of detail

What is this?

Each advertising platform or service uses different data formats, currencies, and structures. As a result, after having collected data from all sources, you have to normalize it into a single format/structure.

What’s wrong with that?

It’s a tough task to build a triangular fortress when all you have are round and oval pieces. When you have lots of data sets in different formats, you simply cannot create a report before unifying the data.

What are the reasons?

There are different data schemes across advertising services. For instance, the same column in different platforms can be named Product Name and Product Category. Another example is using different currencies in cost data: dollars for Twitter Ads and pounds on Facebook, for example.

How to solve this problem?

Before analyzing data, it must be converted to a single format; otherwise, nothing good will come out of your analysis. For example, you should merge user session data with advertising cost data to measure the impact of each particular traffic source or marketing channel and to see which advertising campaigns bring you more revenue. Of course, this can be done manually by utilizing scripts and SQL, but applying automated solutions is a better choice.
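As an illustration, unifying one platform's cost rows into a single schema and currency might look like this (the column map and exchange rates are hypothetical):

```python
# Hypothetical exchange rates into a single reporting currency (USD)
RATES_TO_USD = {"USD": 1.0, "GBP": 1.27, "EUR": 1.08}

def unify_cost_row(row: dict, column_map: dict) -> dict:
    """Map a platform-specific cost row onto one schema and one currency.

    column_map translates each platform's column names (e.g. 'Product Name'
    vs 'Product Category') to the names used in the unified schema.
    """
    unified = {column_map.get(k, k): v for k, v in row.items()}
    rate = RATES_TO_USD[unified.pop("currency")]
    unified["cost_usd"] = round(unified.pop("cost") * rate, 2)
    return unified

fb_row = {"Campaign": "spring_sale", "cost": 100.0, "currency": "GBP"}
unified = unify_cost_row(fb_row, {"Campaign": "campaign_name"})
assert unified == {"campaign_name": "spring_sale", "cost_usd": 127.0}
```

With one such mapping per platform, every source lands in the same table structure, and downstream reports no longer need per-platform special cases.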

Step 4. Prepare business-ready data

Business-ready data is a cleaned final dataset in a structure that corresponds to the business model. In other words, if you have gone through all the previous steps and completed everything, you should get a final dataset: ready-made data that can be sent to any data visualization service (Power BI, Tableau, Google Data Studio, etc.).

However, you shouldn’t confuse business-ready data with raw data on which you try to build a report directly. That’s an erroneous approach that entails many recurring problems: finding errors and the causes of discrepancies takes a long time, and business logic has to be constantly duplicated in all SQL queries. Such a volume of raw, unstructured data is also challenging to keep up to date, and it’s hard to control changes in the transformation logic. For example, you may encounter problems such as updating the cost data history after an advertising service has adjusted it. Another example is the redemption of completed transactions: it takes time to understand whether a transaction has been redeemed, but we have to attribute this status back to the moment of the order.

Unfortunately, it’s common practice to skip this step and jump right into creating the report. However, our experience demonstrates that using a final dataset and preparing business-ready data for the modern world of analytics is a must. In the long run, it’s cheaper and easier to work with prepared data instead of running around raw data doing the same things again and again.

What data quality difficulties can appear during this step:

1. Lack of data definitions leads to discrepancies

What is this?

It’s difficult to control changes in transformation logic due to inconsistent or absent definitions of the types of data required throughout data processing.

What’s wrong with that?

When data is not clearly defined, users aren’t on the same page about data use: they aren’t sure which table or column to query, which filter to use, or who to ask for information about data objects. Besides, it takes too long to navigate through and understand all data objects from raw data, including their attributes, their place in the data model, and their relevance to each other.

What are the reasons?

A business hasn’t clearly defined its core data and its data model. Therefore, the logic for merging data is incomprehensible.

How to solve this problem?

First of all, don’t apply business logic to each report or dataset but use data modeling at the company level. Within the company, there should be a transparent business data model and control of the data lifecycle. This means that all definitions used must be clear. For example, end users should be aware of what conversion and website visitor metrics represent.

Along with that, as it’s challenging to prepare and maintain an up-to-date modeled dataset by hand, the answer lies in applying automated solutions (e.g., data build tool (dbt)) that can collect, clean, normalize, and monitor the quality of your data so it’s business-ready.

Reconsider your relationship to data with OWOX BI

The OWOX BI team knows more than anyone how severe the data problem is, since each of our clients encounters it. We have made a product that allows analysts to automate the routine, deliver business value from data, and ensure data quality.

OWOX BI is a unified platform that empowers you to collect, prepare, and analyze all your marketing data. It automates data delivery from siloed sources to your analytics destination, ensuring data is always accurate and up to date.

By applying OWOX BI, you can get business-ready data according to your business model with transparent data quality monitoring and an easy-to-use report builder for unlocking insights without SQL or code.

Let’s look at how OWOX BI can help you at all the steps we mentioned above.

  • Plan your measurements. Create a measurement plan for your business or develop a system of metrics especially for your business needs with the help of our specialists.
  • Collect primary data. OWOX BI collects raw data from Google Analytics, advertising services, websites, offline stores, call tracking systems, and CRM systems in your data storage. The platform works smoothly with large ad accounts and uploads all data regardless of the number of campaigns. Besides automatically importing cost data, OWOX BI recognizes dynamic parameters in UTM tags, converts costs into one currency, and monitors the relevance of data.

You no longer need to look for various connectors and, if necessary, our team can even set up custom integrations for you.

  • Normalize raw data. When using OWOX BI, you don’t need to manually clean, structure, and process data. You’ll receive ready datasets in the clearest and most convenient structure. Moreover, at any time you can get a visual report on the relevance of data from advertising services uploaded to Google Analytics.

  • Prepare business data. With OWOX BI, you have trusted business-ready data at your fingertips. There’s no longer any need to create a new dataset for every new report, as you get prebuilt final datasets prepared according to your business data model. With up-to-date and unified data ready for further data segmentation, you can get insights into the speed of your business and increase the value of your data.

  • Visualize data. The OWOX BI platform lets you analyze and visualize your data wherever you want. Once your marketing data is ready, you can send it to the BI or visualization tool of your choice with a few clicks.

Book a free demo to see how OWOX BI guarantees data quality and how you can benefit from fully automated data management today!


Step 5. Visualize data

Visually presenting key metrics is the last step to make data work, so your data presentation should be both informative and user-friendly. Automated and properly configured visualizations can significantly reduce the time to find a problem; that is, you can perform more iterations with less effort over the same period to improve data quality.

Also, it’s important to remember that data visualization services like the popular Google Data Studio cannot merge or transform data. If you need reports based on many data sources, our recommendation is to collect all the data you need into a single data storage beforehand to avoid any difficulties.

Data quality difficulties you can come across at this step:

1. Factual data errors

What are these?

When something goes wrong at the previous levels of data collection and normalization, data errors can end up in reports shown by data visualization services.

What’s wrong with that?

Reports created with factual data errors are a waste of time and money. They provide no profit or valuable insights into risk and growth zones for the business. Sorry, Mario, but the princess is in another castle!

What are the reasons?

Simply put, the visualized data isn’t relevant. However, to find out the source of those errors, you should go all the way back to verifying your data again.

How to solve this problem?

The only way to solve this problem is to thoroughly prepare data and monitor its quality before creating reports.

2. Broken SQL queries or too many edits to reports (and/or SQL queries)

What is this?

Data requirements are constantly changing, and SQL queries also change. As a result, the more complicated the reporting system becomes, the easier it is for it to break down.

What’s wrong with that?

There’s nothing wrong with changes, unless there are so many that it’s impossible to remember what changes were made, where, and when. Eventually, even a carefully built reporting system can fall apart when its SQL queries stop working and there’s no correct data to visualize.

What are the reasons?

It’s quite a challenge to remember every small thing, so the typical mistake is to forget to apply edits on all datasets where they’re needed.

How to solve this problem?

Ideally, users should be able to generate marketing reports without using lots of SQL queries and without needing to apply lots of changes and/or edits.

3. Misunderstanding and misuse of collected data

What is this?

One of the most common problems is misunderstanding data (and, therefore, misusing it). This happens when a particular metric or parameter can be interpreted in more than one way. For example, say there’s a conversion metric in a report and this report is used by different users. One of the users thinks that a conversion means a website visit, while another thinks it means placing an order. However, there’s also a third person who thinks that this conversion metric is about delivered and purchased orders. As you can see, there are lots of potential interpretations, so you must make it clear what information is presented in the report.
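One lightweight remedy is a shared metric dictionary that every report links to. A hypothetical sketch (the definitions shown are examples, not a standard):

```python
# A minimal metric dictionary: one agreed definition per metric, shown
# alongside every report so all users read "conversion" the same way.
METRIC_DEFINITIONS = {
    "conversion": "A placed order (not a website visit, not a delivered order).",
    "session": "A visit with activity, expiring after 30 minutes of inactivity.",
}

def describe(metric: str) -> str:
    """Return the agreed definition, or flag an undefined metric loudly."""
    return METRIC_DEFINITIONS.get(
        metric, f"UNDEFINED METRIC: {metric!r} — agree on a definition first")

assert describe("conversion").startswith("A placed order")
assert "UNDEFINED" in describe("bounce")
```

Making the lookup fail loudly for undefined metrics forces the team to agree on a definition before the metric ever appears in a dashboard.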

What’s wrong with that?

If there’s no clear understanding of what data is used in reports and dashboards, there’s no guarantee that your decisions will be based on the facts on the ground.

What are the reasons?

An unclear explanation of metrics and parameters used in reports or an inappropriate type of data visualization can lead to poor decisions.

How to solve this problem?

Data verification doesn’t end once you make sure your input data is correct and relevant; that data can still be misused. To avoid this problem, end users must have access to complete, up-to-date, business-ready data with clear and precise explanations of what information is presented in the reports.


Key takeaways

Quality analytics starts with quality data. Your data efforts should be valuable and trusted. That’s why you should proactively identify and fight data errors and discrepancies to get holistic performance reporting and improve the quality of your data. In general, improving data quality is an iterative process in which three steps can be distinguished:

  1. Examining the status of your data
  2. Localizing a problem that degrades data quality
  3. Fixing the problem

The most time-consuming are the first two steps. It's difficult to identify that a problem exists and understand in which situations an error arises. After it becomes clear what the problem is and where it’s located, it turns out to be relatively easy to fix it.

Finally, you can create three documents to improve your data:

  • Data movement scheme
  • Responsibility matrix (who is responsible for what)
  • Business data model