Workflow execution in Apache Hop: parallel, sequential, and hybrid
NOTE: this is a repost of an earlier post on the know.bi blog by Adalennis Buchillón Soris.
One of the fundamental concepts that new Apache Hop users encounter is the parallel execution of pipelines and sequential execution of workflows.
This concept refers to how Apache Hop schedules work within a project. When we talk about “parallel execution of pipelines,” it means that the transforms within a pipeline run simultaneously, allowing for faster and more efficient processing of large datasets.
On the other hand, “sequential execution of workflows” entails that different actions within a workflow are executed one after another in a predefined order, allowing for more precise control over process flow and ensuring dependencies are met before moving to the next action. In summary, while pipelines can run concurrently to maximize performance, workflows are executed sequentially to ensure consistency and integrity of processed data.
Nevertheless, there are instances where you might want to override these defaults and run pipelines sequentially while executing workflows in parallel. Let’s delve deeper into this scenario and explore how you can achieve parallel execution of actions within a workflow.
Sequential execution
As you’re aware, actions within a workflow typically execute sequentially, with each action having an exit code (success or failure) dictating the workflow’s path. However, this exit code can be disregarded in the case of an unconditional hop.
While a workflow action can have multiple outgoing hops, this doesn’t imply parallel execution. By default, if an action has multiple outgoing hops, the actions those hops lead to are executed sequentially, in the order in which they were added to the workflow.
For instance, in the example below, the workflow will first execute sample-pip1.hpl, followed by sample-pip2.hpl.
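The default ordering can be sketched outside of Hop in a few lines of Python. This is a conceptual model only, not Hop's API: `run_sequentially` and `make_pipeline` are hypothetical helpers, and the pipeline names are reused from the example purely as labels.

```python
def make_pipeline(name, log):
    """Return a stand-in for a pipeline action that records when it finishes."""
    def run():
        log.append(f"{name} finished")
    return run


def run_sequentially(targets):
    """Run each target action in the order its hop was added.

    The next action starts only after the previous one has returned.
    """
    executed = []
    for name, action in targets:
        action()
        executed.append(name)
    return executed


log = []
targets = [
    ("sample-pip1.hpl", make_pipeline("sample-pip1.hpl", log)),
    ("sample-pip2.hpl", make_pipeline("sample-pip2.hpl", log)),
]
order = run_sequentially(targets)
print(order)  # ['sample-pip1.hpl', 'sample-pip2.hpl']
```

The second action cannot begin until the first has returned, which is the behavior described above for multiple outgoing hops.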
But what if you need to run actions in parallel?
Parallel execution
Parallel execution within a workflow is indeed feasible but requires explicit specification. To achieve this, you can click on an action’s icon and select the “parallel execution” option.
Once activated, the hop line will appear dotted and double-crossed.
However, it’s crucial to note that parallel execution means sharing resources within the Java Virtual Machine (JVM).
Small pipelines and workflow actions may finish sooner when run in parallel, but larger tasks that demand significant resources compete for the same memory and CPU and may actually run faster when executed sequentially.
When running this workflow, the log message will indicate that both actions have started in parallel.
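As a rough analogy for what "started in parallel" means, the sketch below starts two actions as threads inside one process, much as parallel actions share one JVM. It is plain Python rather than anything Hop exposes; `run_in_parallel` is a hypothetical helper, and the barrier is only there to demonstrate that both actions are running at the same time.

```python
import threading


def run_in_parallel(names, timeout=5.0):
    """Start one thread per action and return the names that overlapped in time."""
    barrier = threading.Barrier(len(names))
    overlapped = []
    lock = threading.Lock()

    def action(name):
        try:
            # Every thread blocks here until all the others have also started,
            # so passing the barrier shows the actions were running together.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            return  # the other actions never started alongside this one
        with lock:
            overlapped.append(name)

    threads = [threading.Thread(target=action, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(overlapped)


parallel_result = run_in_parallel(["sample-pip1.hpl", "sample-pip2.hpl"])
print(parallel_result)  # ['sample-pip1.hpl', 'sample-pip2.hpl']
```

If the two actions ran one after the other, the first would wait at the barrier until the timeout expired and would not appear in the result.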
Hybrid execution
Once you initiate parallel execution from a given action within a workflow, the actions that follow on each branch also run in parallel with the other branches.
Consider the simplified workflow below, where both “Pipeline” actions start in parallel. After the pipelines, the respective “Write to log” actions are executed before proceeding to the “Dummy” action.
In many cases, you may prefer to execute specific parts of a workflow in parallel. In such instances, isolating parallel processing into a separate child workflow is advisable.
In the provided screenshot, the section of the workflow intended for parallel execution has been isolated into a child workflow. When this workflow runs, the child workflow (parallel-execution.hwf) will execute both actions in parallel. Once the last of these two pipelines concludes, the parent workflow resumes its sequential execution.
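The parent/child split can be modeled as a function that fans out, waits for every branch, and only then returns control. The following is a minimal Python sketch of that idea, not Hop code: the function names and the parent's log labels are hypothetical, and `child_workflow` merely stands in for parallel-execution.hwf.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

events = []
events_lock = threading.Lock()


def record(message):
    """Append a message to the shared event list in a thread-safe way."""
    with events_lock:
        events.append(message)


def pipeline(name):
    record(f"{name} finished")


def child_workflow():
    """Stand-in for parallel-execution.hwf: run both pipelines concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(pipeline, name)
            for name in ("sample-pip1.hpl", "sample-pip2.hpl")
        ]
        for future in futures:
            future.result()  # block until this pipeline has finished
    record("child workflow finished")


def parent_workflow():
    """Sequential parent: it resumes only after the child has fully returned."""
    record("parent: start")
    child_workflow()
    record("parent: next sequential action")


parent_workflow()
print(events)
```

The order of the two pipeline entries can vary between runs, but the parent's next action is always recorded last, after both pipelines have finished.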
Avoid this pitfall in workflow design
One important note is to avoid combining parallel execution flows back into a single line of execution within a workflow unless you really need it.
For example, if you have two parallel actions that both connect to a single action afterward, it may seem like this would synchronize the flows and continue sequentially. However, that’s not how Apache Hop handles this scenario.
What actually happens?
When two parallel actions finish, both will independently trigger the next action. As a result, the subsequent action will execute once for each completed parallel flow.
For example:
- If you have two parallel pipelines that both connect to a workflow (e.g., sample-workflow.hwf), that workflow will execute twice — once for each pipeline that completes.
This behavior may be desirable in some scenarios, but it is easy to overlook and can lead to unintended duplicate runs.
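The double execution is easy to reproduce in a small model. In the sketch below, written in plain Python with hypothetical names and no Hop API involved, each parallel branch follows its own hop to the same downstream action, so that action runs once per branch.

```python
import threading


def run_merged_branches(branch_names):
    """Run branches in parallel; each one triggers the shared downstream action."""
    downstream_runs = []
    lock = threading.Lock()

    def downstream(triggered_by):
        # Stand-in for the action both branches connect to
        # (sample-workflow.hwf in the example above).
        with lock:
            downstream_runs.append(triggered_by)

    def branch(name):
        # The pipeline's own work would happen here.
        downstream(name)  # each branch follows its own hop when it finishes

    threads = [threading.Thread(target=branch, args=(n,)) for n in branch_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return downstream_runs


runs = run_merged_branches(["sample-pip1.hpl", "sample-pip2.hpl"])
print(len(runs))  # 2
```

Nothing in this model waits for both branches before continuing. That is the synchronization a child workflow provides when the parallel part is moved into it.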
Remember to separate sequential and parallel execution into different workflows. For example, use a sequential parent workflow to manage the overall flow and a child workflow to handle parallel tasks.
Summary
In this article, we’ve explored the options available for running workflow actions in Apache Hop.
- By default, workflow actions are executed sequentially, one after another in a predefined order.
- Parallel execution allows different workflow actions to run concurrently.
- You also learned how to combine parallel and sequential execution using child workflows.
- Remember to avoid merging parallel execution paths directly back into a single action to ensure clarity and proper execution flow.
Don’t miss the YouTube video for a step-by-step walkthrough.
Stay connected
If you have any questions or run into issues, contact us and we’ll be happy to help.