So your data quality sucks: A beginner’s guide to data transformation


Let’s face it: data quality is awful at most startups. In the early days, you’re focused on hitting your numbers, not making them perfect. But sooner or later, questions like these will start to slow your team down:

  1. Why is there test data in these numbers?
  2. What are these weird characters I’m seeing in this query?
  3. Why is your number different from mine?
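Symptoms like these usually trace back to raw data nobody cleaned. As a minimal sketch (the field names and the "+test" email convention are hypothetical, not a standard), here is what a transformation step that strips test rows and repairs garbled characters might look like:

```python
# Hypothetical sketch: filter test records and repair mis-decoded text
# before the data reaches analysts. Field names and the "+test" email
# convention are invented for illustration.

def is_test_record(row):
    """Flag rows created by internal testing (hypothetical convention)."""
    return "+test" in row["email"] or row["email"].endswith("@example.com")

def fix_mojibake(text):
    """Repair UTF-8 bytes that were mis-decoded as Latin-1 — a common
    source of the 'weird characters' people see in queries."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean

raw = [
    {"email": "alice@acme.com",    "city": "SÃ£o Paulo"},  # garbled
    {"email": "bob+test@acme.com", "city": "Berlin"},      # test row
    {"email": "qa@example.com",    "city": "Lisbon"},      # test row
]

clean = [
    {**row, "city": fix_mojibake(row["city"])}
    for row in raw
    if not is_test_record(row)
]

print(clean)  # one real row, with "São Paulo" repaired
```

With a step like this baked into the pipeline, the test rows and mojibake never reach the dashboards in the first place.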

This is most likely the result of poor or non-existent data transformation. In this post, we’ll cover what data transformation is and why it’s important.

What is data transformation?

Data transformation (the “T” in “ELT”) is the process of turning data from one format to another, often from the format of the source to the format of the destination. Data transformation typically involves filtering, splitting, merging, enriching, and eliminating duplicate data to reach a clean and usable form.

There are a few types of transformations:

  1. Constructive (addition and replication of data)
  2. Destructive (deletion of records)
  3. Aesthetic (standardization of names)
  4. Structural (renaming and merging database columns)
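The four types can be shown side by side on a toy record set (all field names and rules here are invented for illustration):

```python
# Toy illustration of the four transformation types on a list of
# dicts. Field names and rules are invented for this sketch.

records = [
    {"First Name": "ada",   "country": "UK"},
    {"First Name": "ada",   "country": "UK"},   # duplicate
    {"First Name": "grace", "country": "USA"},
]

# 1. Constructive: add (derive) a new field.
for r in records:
    r["country_is_uk"] = r["country"] == "UK"

# 2. Destructive: delete duplicate records.
seen, deduped = set(), []
for r in records:
    key = (r["First Name"], r["country"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Aesthetic: standardize name casing.
for r in deduped:
    r["First Name"] = r["First Name"].title()

# 4. Structural: rename a column.
deduped = [
    {("first_name" if k == "First Name" else k): v for k, v in r.items()}
    for r in deduped
]

print(deduped)  # two records, cleaned and restructured
```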

What’s involved?

  1. Data discovery — The first step is determining what changes are necessary to transform the data into the destination format. This is done by identifying and understanding the data in its source format, typically with a data profiling tool.
  2. Data mapping — In this step, the transformation of the data is planned, usually with the help of data mapping tools. It’s the most time-consuming step, as it involves a number of processes: validation, translation, value derivation, enrichment, aggregation, and routing.
  3. Code generation — The code needed to run the data transformation process is generated in this step. To simplify this task, it is often done with the assistance of a data transformation tool or platform.
  4. Code execution — When the code is ready, it’s used to run the data transformation process that has been planned. The data is converted to the target format.
  5. Review — Lastly, the transformed data is checked for accuracy and precision.

After the five-step process, data analysts can perform additional layers of data transformation such as filtering and enriching, as necessary.
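The five steps above can be sketched end to end in a few lines. This is a simplified sketch: in practice, discovery and mapping are done with dedicated tools, so the dict-based mapping here is a hypothetical stand-in.

```python
# Simplified end-to-end sketch of the five-step process. The mapping
# dict is a hypothetical stand-in for what a mapping tool produces.

source_rows = [{"amt": "10.50", "cur": "usd"}, {"amt": "3.00", "cur": "eur"}]

# 1. Data discovery: profile the source to learn fields and types.
profile = {field: type(v).__name__ for field, v in source_rows[0].items()}

# 2. Data mapping: plan how each source field becomes a destination field.
mapping = {
    "amt": ("amount", float),        # translate the name, cast to float
    "cur": ("currency", str.upper),  # translate the name, standardize case
}

# 3. Code generation: build the transform function from the mapping.
def make_transform(mapping):
    def transform(row):
        return {dest: fn(row[src]) for src, (dest, fn) in mapping.items()}
    return transform

# 4. Code execution: run the transform over every source row.
transform = make_transform(mapping)
target_rows = [transform(r) for r in source_rows]

# 5. Review: check the output against simple expectations.
assert all(isinstance(r["amount"], float) for r in target_rows)
assert all(r["currency"].isupper() for r in target_rows)

print(target_rows)
```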

Data transformation — the (potentially) good

Better data management: Organizing massive volumes of data from different sources can result in irregularities in metadata, making data management more difficult. Through data transformation, these irregularities are resolved, and the data becomes easier to handle.

Improved accessibility: Transformation structures, standardizes, and stores data in a location where it’s easy to access.

Improved data quality: Bad data means higher risks and higher costs. Through data transformation, data quality is maintained so better insights can be extracted.

Enhanced data efficiency: Data transformation tools lift data efficiency by supporting end-to-end processes through one-to-many data exchanges across the ecosystem. They also eliminate bottlenecks in the integration process by consolidating integrations onto a single platform.

Data transformation — the (potentially) bad

Cost: Data transformation can be expensive. The price is dependent on the infrastructure, software, and tools necessary to transform the data to the desired format.

Resources: Performing the transformation process on-premises can place a heavy load on computational resources, which may negatively affect other operations.

Competency: Data transformation requires some expertise. Inexperienced data analysts may fail to notice errors in the data, especially if they’re unfamiliar with the range of accurate and acceptable values.

To build or to buy?

Building your own transformation pipeline might be tempting — you get full control, and on the surface it might look like the cheaper option. But using a data transformation tool is often cheaper and easier in the long run, thanks to fewer manual errors and lower maintenance costs.

Aside from giving you more bang for your buck, these tools make data flow easier to understand by creating a graphical representation known as a DAG (directed acyclic graph). Data transformation tools also increase the efficiency of the five-step process, cutting down the time needed to make raw data usable.
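To make the DAG idea concrete, here is a minimal sketch of a transformation DAG and the run order a tool would derive from it (the model names are invented; Python's standard-library `graphlib` does the ordering):

```python
# Minimal sketch of a transformation DAG: each model lists the models
# it depends on, and a topological sort yields a valid run order so
# that dependencies always run before dependents. Model names are
# invented for illustration.
from graphlib import TopologicalSorter

dag = {
    "stg_orders":      [],                             # raw -> staging
    "stg_customers":   [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "daily_revenue":   ["orders_enriched"],
}

run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

Tools draw this same structure as a diagram, which is what makes the flow of data from raw tables to final metrics easy to follow at a glance.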

Once your data is transformed, Canvas helps your business teams explore it. This tool allows business and data teams to work in the same workspace, enabling them to get from dbt model to chart within minutes instead of hours. Canvas also syncs your documentation and puts it directly in context for business teams for quick discovery. Sign up today and try Canvas for free.