Pipelines execution in Apache Hop: Filter, Merge & Append streams
NOTE: this is a repost of an earlier post by Adalennis Buchillón Soris on the know.bi blog.
This post builds upon the content covered in our previous post (Pipelines execution in Apache Hop: Distribution & Copy), where we introduced the principles of pipeline execution and the two main data movements: distribution and copy.
#1 All transforms in a pipeline execute in parallel by default.
#2 The default data movement method is distribution, which sends rows to the connected transforms in turns (round robin).
#3 Explicitly selecting the copy option ensures all connected transforms receive all rows.
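As a quick illustration of #2 and #3, here’s a minimal Python sketch of the difference between distributing and copying rows. It’s a conceptual toy, not Hop’s actual row engine:

```python
from itertools import cycle

rows = [{"id": i} for i in range(6)]
targets = {"dummy-1": [], "dummy-2": []}

# Distribution (#2): rows are handed to the targets in turns (round robin).
for row, name in zip(rows, cycle(targets)):
    targets[name].append(row)
print({name: len(received) for name, received in targets.items()})
# {'dummy-1': 3, 'dummy-2': 3}

# Copy (#3): every connected target receives every row.
copies = {name: list(rows) for name in targets}
print({name: len(received) for name, received in copies.items()})
# {'dummy-1': 6, 'dummy-2': 6}
```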
This time, let’s explore some more essential tips about pipeline execution and data movements.
We’ll look at scenarios where the execution outcome depends on the specific transforms used.
Let’s get to work, starting with the first example.
In our previous post, we discussed the options of distributing or copying data when connecting a transform to two others.
Filter
However, what happens when using a filter transform?
In this example, we’ve introduced a condition to filter values greater than 5 in the data stream.
Now, notice that different options are displayed when you connect the filter to the first of the two target transforms.
In this scenario, you can select one transform to receive the rows that meet the condition, while the other transform receives the rows that don’t.
We select “dummy-1” for true and “dummy-2” for false.
When you run this pipeline, the filter distributes the rows, but not in turns: rows that evaluate to true are directed to “dummy-1”, while rows that evaluate to false are sent to “dummy-2”.
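Conceptually, the routing looks like the following minimal Python sketch. This is only an illustration of the idea, not Hop’s implementation:

```python
rows = [{"value": v} for v in [2, 7, 4, 9, 1, 6]]

true_rows, false_rows = [], []  # stand-ins for "dummy-1" and "dummy-2"

# Conditional routing: each row goes to exactly one target,
# chosen by the condition rather than by a round-robin turn.
for row in rows:
    (true_rows if row["value"] > 5 else false_rows).append(row)

print(true_rows)   # value > 5  -> "dummy-1"
print(false_rows)  # value <= 5 -> "dummy-2"
```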
So far, we’ve looked at scenarios with a single transform connecting to two subsequent transforms: distribution, copying, and distribution through filtering.
But what if you need to perform operations like joins or append streams?
Merge
In this simplified scenario, two small data streams are sorted and subsequently joined using a “Merge join” transform.
An inner join is performed with the data coming from “sort-id” and “sort-id-2” using the id field.
It’s crucial to note that the data must be sorted before conducting the join.
Hovering over the blue info icon shows the following message.
The message indicates that rows are sent to the “merge-rows” transform to serve as additional information.
It also mentions that the target transform “sort-id-2” is treated as a special case and that the standard reading rules don’t apply here.
But how does this work in practice?
In this scenario, the “Merge join” transform does not process any rows until they have been processed by the “Sort rows” transforms.
It uses the sorted data from “sort-id” as a reference and performs the join with the sorted data from “sort-id-2”.
During execution, using a similar example with 5 000 000 rows in each data stream, it looks like this.
Note that the “merge-rows” transform doesn’t start executing until all rows are sorted, at which point it performs the join and gradually sends the rows to the “Dummy” transform.
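To make the mechanics concrete, here’s a minimal Python sketch of an inner merge join over two streams already sorted on id. It’s a simplified illustration that assumes unique keys per stream, not Hop’s actual “Merge join” code:

```python
def merge_inner_join(left, right, key):
    """Inner join of two row streams already sorted on `key`.

    Simplified: assumes key values are unique within each stream.
    """
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1  # advance the reference stream
        elif left[i][key] > right[j][key]:
            j += 1  # advance the lookup stream
        else:
            joined.append({**left[i], **right[j]})  # keys match: emit a joined row
            i += 1
            j += 1
    return joined

sorted_id = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "c"}]
sorted_id_2 = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}, {"id": 4, "city": "z"}]
print(merge_inner_join(sorted_id, sorted_id_2, "id"))
# [{'id': 2, 'name': 'b', 'city': 'x'}, {'id': 4, 'name': 'c', 'city': 'z'}]
```

The single forward pass over both inputs is what makes the join cheap, and it only works because both streams arrive sorted on the join key.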
Append
Now, let’s examine a third and final example.
Consider a scenario where you concatenate two data streams into one.
In such a case, all transforms start simultaneously.
If you sort the data before appending, the execution occurs as in the previous example, with parallel execution resuming once the rows are sorted.
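As a rough Python analogy (not Hop internals), the difference between an ordered append and an order-agnostic union looks like this:

```python
stream_1 = [{"id": 1}, {"id": 3}]
stream_2 = [{"id": 2}, {"id": 4}]

# Ordered append: all of stream_1 first, then all of stream_2.
# This is the guarantee an explicit "Append streams" transform gives you.
appended = stream_1 + stream_2
print(appended)

# Order-agnostic union: rows from both streams end up in one stream,
# but in a parallel engine their interleaving depends on arrival order.
union = [*stream_1, *stream_2]  # deterministic here, unlike a parallel run
print(union)
```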
Remarks
#1 In this post we covered some examples of how a pipeline is executed when data is filtered, joined or appended. Keep in mind that the transforms used here are not the only options Apache Hop offers for those operations.
- For example, to filter data you can also use the “Java filter” or “Switch / case” transforms.
- To perform joins, you can also use the “Join rows”, “Merge rows”, or “Database join” transforms, among others.
#2 The “Filter rows” transform enables you to sift through rows based on specified conditions and comparisons, and it shapes the data flow when you link the True or False outcomes to other transforms. There are three available output paths: True, False, and Main. When no True or False targets are set, the Main output only carries the rows that match the condition, i.e. it follows the True path.
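In plain Python terms, the Main-only case behaves like a simple filter. Again, this is a conceptual sketch rather than Hop code:

```python
rows = [{"value": v} for v in [2, 7, 4, 9]]

# With no explicit True/False targets, the Main output acts as a plain
# filter: only rows matching the condition continue downstream.
main_output = [row for row in rows if row["value"] > 5]
print(main_output)  # [{'value': 7}, {'value': 9}]
```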
#3 The “Merge join” transform executes a traditional merge join between datasets sourced from two distinct input transforms. This transform operates under the assumption that your data is sorted based on the join keys. Join options encompass INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
#4 If you care about the order in which the output rows occur, use the “Append streams” transform. If the order of output rows is not significant, any transform can receive two or more incoming data streams, effectively appending them into a union. We’ll cover this topic in a future post.
Don’t miss the YouTube video for a step-by-step walkthrough.
Stay connected
If you have any questions or run into issues, contact us and we’ll be happy to help.