Coalesce believes pipelines, rather than Common Table Expressions (or CTEs), are the ideal approach for data transformations. We've included a high level overview of each approach below, and our article CTEs vs Pipelines for Complex Data Processing details our philosophy on this approach as well as provides an introductory walkthrough of the approach. In this article, we dig deeper into the approach with a step by step example detailing how to migrate a CTE to a pipeline in Coalesce.
Common Table Expressions (CTEs) 101
Common table expressions, or CTEs, are used to embed complexity in several SQL blocks pulled together into a single SELECT or DML operation. Within the context of data transformation, by processing data this way, developers are able to consolidate logic within the context of a single SQL file. This means there are less files to manage overall, but managing the logic contained within those files becomes increasingly more complex as data transformation projects grow.
Pipelines allow for each logical process in a typical CTE to be broken into their own SQL blocks, or what Coalesce calls Nodes. In addition to reducing complexity, the pipeline approach provides lineage at the object and column level that is not possible with CTEs.
Build with a Pipeline Approach in Coalesce
Coalesce believes in a pipeline approach for data modeling and transformation. However, if you are migrating from a code first solution that uses CTEs to Coalesce, it is not always obvious where to start. In the following video, we show how you can take a CTE and what migrating that into Coalesce looks like:
We'll now walk you through the process of migrating a CTE to Coalesce. Here is the CTE code that we will be migrating to Coalesce:
with stg_location as (
upper(city) as city,
where location_id is not null
and location_id != 0
stg_customer_loyalty as (
upper(city) as city,
left(postal_code, 5) as postal_code,
coalesce(e_mail, phone_number) as contact_info
where customer_id is not null
and (e_mail is not null or phone_number is not null)
stg_truck as (
upper(make) as make,
upper(model) as model,
where truck_id is not null and TRUCK_OPENING_DATE::date < '2024-01-01'
stg_order_header as (
order_id || truck_id as primary_key,
order_ts::date as order_date,
where order_id is not null
and truck_id is not null
and order_total >= 0
stg_order_totals as (
sum(ORDER_AMOUNT) as total_order_amount,
sum(ORDER_TAX_AMOUNT) as total_tax_amount,
sum(ORDER_DISCOUNT_AMOUNT) as total_discount_amount,
sum(order_total) as total_order_value
group by order_id,
orders_cleaned as (
make as truck_make,
model as truck_model,
first_name as cust_first_name,
last_name as cust_last_name,
row_number() over (partition by primary_key order by order_ts) as dupe_count
from stg_order_header as soh
inner join stg_location as sl
on soh.LOCATION_ID = sl.LOCATION_ID
inner join stg_truck as st
on soh.TRUCK_ID = st.TRUCK_ID
inner join stg_customer_loyalty as scl
on soh.CUSTOMER_ID = scl.CUSTOMER_ID
inner join stg_order_totals as sot
on soh.order_id = sot.order_id
where order_date is not null
and contact_info is not null
qualify dupe_count = 1
Identify Your Data Sources
When migrating a CTE to Coalesce, the first thing you should do is identify all of the data sources that will need to be brought into your pipeline.
It is possible that a CTE will have a dependency that will need to be created first, before you can migrate the CTE. You should create any dependency(s) first within Coalesce.
Once you have identified your data sources, you can create them within your Coalesce workspace. In the example CTE used here, each of the sources is a raw data table coming from Snowflake, so these tables can be added in as sources into the Coalesce workspace.
In the case of a CTE having a dependency that is not a raw data source, you would create or use existing node(s) within your pipeline, to support the logic of the CTE.
Migrate Your Logic in Sequential Order
Once you have data sources added to your pipeline, you can begin migrating the logic from your CTE in the order in which the CTE is written (top to bottom). This will ensure that any dependencies created within the CTE are available to be processed as you complete your migration.
For example, in the image below, you can see the stg_order_totals CTE is dependent upon the stg_order_header CTE. If these were to be created out of sequential order, the stg_order_header CTE would not be available for the stg_order_totals CTE to reference.
Migrate each CTE
Once you have identified the order in which you are migrating your CTEs, you can begin migrating each CTE into Coalesce. Let’s take a look at migrating the first CTE from our code.
It can be helpful to think of each SELECT statement in a CTE as its own node in Coalesce. Although this will not always be the case, in most scenarios, a single CTE can be represented by a single node. Since this CTE is processing some raw data and applying some transformations, we can use a stage node to reflect these transformations in Coalesce.
Coalesce will automatically bring in all of the columns from the source table. From there, we can apply the UPPER function to the city column, within the Transform field of Coalesce.
There is a WHERE statement that has two filter conditions applied. We can apply the same logic within the JOIN tab of Coalesce.
With that, we’ve just migrated the stg_location CTE into Coalesce.
Let’s walk through one more example that’s a little more complex.
The order_cleaned CTE is joining multiple other CTEs together, adding a column, and applying some filters. Migrating this CTE to Coalesce should be straightforward.
First, we need to ensure that the upstream dependencies have been created
For this example, let’s assume you’ve already migrated the rest of these nodes:
With these nodes available, we can create a new node that joins all of these nodes together: just like in the CTE. You can hold down shift and select all of the nodes together, right click on any node, and select Join Nodes -> Stage.
Coalesce will automatically generate the join syntax in the join tab of the new node.
From here, all you need to do is replace the /*COLUMN*/ with the actual column reference for the join.
With the join complete, we can go back to the mapping grid and delete the columns we don’t need, so they can match the same columns from the CTE. Additionally, we can add the window function called DUPE_COUNT by double clicking on the Column Name area of the mapping grid.
With the window function added, we can apply the filters to the node.
With the filters added, we have successfully migrated a complex CTE into Coalesce, while taking advantage of several of the features of Coalesce that allow you to streamline development.
Once each CTE has been migrated to Coalesce, your pipeline using the code above, would look something like this:
Migrating CTEs to Coalesce is a streamlined process that allows you to gain additional transparency, auditability, reusability, and scalability. When migrating CTEs, it is important to follow a migration process to ensure a CTE can be fully created in Coalesce:
- Identify data sources
- Identify any dependencies
- Migrate CTEs in sequential order
By following this process, migrations can be quickly and easily completed, allowing you to gain all the advantages of building your data transformations in a pipeline pattern. For questions about migrating CTEs, you can open a support request or view our resources on our YouTube channel.