Avoiding deadlocks in Apache Hop pipelines using “Stream lookup”

Bart Maertens
5 min read · Jan 16, 2025


NOTE: this is a repost of an earlier post by Adalennis Buchillón Soris on the know.bi blog.

Apache Hop is a powerful, flexible tool for building data projects, but certain designs can run into deadlocks (also known as blocking, stalling, or hanging).

Deadlocks happen when different transforms within a pipeline prevent each other from finishing, causing the entire pipeline to stall indefinitely. This blog post focuses on a common cause of deadlock when using the “Stream lookup” transform and offers practical workarounds to avoid it.

Understanding pipeline deadlocks

Deadlocks in Apache Hop can occur for several reasons:

  1. External locks: When the database itself places locks on a table, it can prevent the pipeline from progressing.
  2. Pipeline design issues: Transforms configured to block until previous transforms complete can run into deadlocks, especially when working with large datasets.
  3. Buffer limits and rowset size: In pipelines where streams split and then rejoin, setting an appropriate rowset size is critical to avoid deadlocks.

How “Stream lookup” can cause deadlocks

One of the most common deadlock scenarios arises when using the “Stream lookup” transform, especially if the pipeline processes a large number of rows. Here’s how it happens and what you can do to prevent it.

A deadlock often arises when the number of rows being processed exceeds the buffer capacity between transforms, defined by the Rowset size setting. In Apache Hop, each hop (the data link between transforms) has a limited buffer capacity based on the rowset size. The default rowset size is typically 10,000 rows, meaning that a hop can temporarily store up to 10,000 rows between transforms.

Scenario

Imagine a pipeline where data flows from a Generate Rows transform, splits into two streams, and then recombines in the “Stream lookup” transform.

One stream flows directly to the “Stream lookup”, while the other passes through an intermediate transform, such as Group By.

Here, 10,001 rows are generated, but the current rowset size for the local Pipeline Run Configuration used to run this pipeline is set to 10,000.

When the rowset buffer reaches its capacity (say, 10,000 rows), the pipeline halts the flow until downstream transforms can process some of these rows.

If the “Stream lookup” transform is waiting for data from both streams and one stream has a buffer that’s full, neither stream can proceed, leading to a complete stall of the pipeline.
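The stall described above can be reproduced outside of Hop with a minimal Python sketch: bounded queues stand in for hops, and three threads stand in for the Generate Rows, Group By, and Stream lookup transforms. The names and structure here are illustrative assumptions, not Hop's actual internals. With `ROWSET_SIZE` set to 10,000 or less, the producer blocks on the full direct hop while Group By waits for row 10,001, and the pipeline hangs exactly as described; with a larger buffer it completes.

```python
import threading
import queue

ROWS = 10_001          # rows generated, as in the scenario above
ROWSET_SIZE = 20_000   # hop buffer capacity; set to 10_000 to reproduce the stall

def generate(direct_hop, lookup_hop):
    # "Generate rows": copies each row to both outgoing hops.
    # put() blocks when a hop buffer is full -- this is where the
    # producer stalls once the direct hop fills up.
    for i in range(ROWS):
        direct_hop.put(i)
        lookup_hop.put(i)
    direct_hop.put(None)   # end-of-stream sentinels
    lookup_hop.put(None)

def group_by(lookup_hop, grouped_hop):
    # "Group by": must read ALL input rows before it can emit anything.
    rows = []
    while (row := lookup_hop.get()) is not None:
        rows.append(row)
    grouped_hop.put(len(rows))   # a single aggregated row
    grouped_hop.put(None)

def stream_lookup(grouped_hop, direct_hop, out):
    # "Stream lookup": loads the lookup stream fully, only then
    # starts draining the main (direct) input.
    lookup = []
    while (row := grouped_hop.get()) is not None:
        lookup.append(row)
    while (row := direct_hop.get()) is not None:
        out.append((row, lookup[0]))

direct = queue.Queue(maxsize=ROWSET_SIZE)
lookup = queue.Queue(maxsize=ROWSET_SIZE)
grouped = queue.Queue(maxsize=ROWSET_SIZE)
out = []

threads = [
    threading.Thread(target=generate, args=(direct, lookup)),
    threading.Thread(target=group_by, args=(lookup, grouped)),
    threading.Thread(target=stream_lookup, args=(grouped, direct, out)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=30)

print(len(out))  # 10001 when ROWSET_SIZE > ROWS; the threads hang at 10_000
```

Because Stream lookup never touches the direct hop until the lookup stream is finished, the direct hop is the first buffer to fill, and the whole cycle of transforms ends up waiting on each other.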

Solutions to avoid deadlocks

1. Adjust rowset size (with caution)

Increasing the rowset size can provide a short-term fix by allowing more rows to be buffered. However, this should be approached cautiously because as data volumes grow, a larger rowset may lead to increased memory usage and potentially worsen performance in the long term.

The rowset size in Apache Hop determines the number of rows temporarily stored in memory between pipeline transforms. By default, it’s set to 10,000 rows, but this value can be customized based on your data volume and pipeline structure.

Key points:

  • A pipeline runs using a Pipeline Run Configuration.
  • A Pipeline Run Configuration specifies which engine is used to run the pipeline.
  • If the selected engine type is local, then you have the option to modify the “Rowset size” configuration.

2. Separate input streams

Another solution is to split the input data streams, creating two parallel copies of the original data.

Each stream can then independently perform necessary operations, preventing the deadlock caused by bottlenecked transforms in a single stream.

3. Divide pipelines into smaller units

Breaking the pipeline into separate, smaller pipelines allows you to process data in stages. Intermediate data can be written to a temporary table or file, allowing each stage to complete without depending on the others.

This setup is especially effective in preventing buffer-related deadlocks in complex pipelines.
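The staged approach can be sketched in a few lines of Python. Stage 1 writes the aggregated lookup data to a temporary CSV file; stage 2 only starts once stage 1 has finished, so there are no shared in-memory buffers to fill up. The file name and column are illustrative assumptions, not part of Hop.

```python
import csv
import os
import tempfile

rows = list(range(10_001))

# Stage 1: aggregate the lookup stream and persist the result to a
# temporary file (in Hop, this would be its own pipeline).
stage_file = os.path.join(tempfile.gettempdir(), "lookup_stage.csv")
with open(stage_file, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["row_count"])      # hypothetical aggregated column
    writer.writerow([len(rows)])

# Stage 2: runs only after stage 1 has completed -- no hop buffers are
# shared between the stages, so buffer size no longer matters.
with open(stage_file) as f:
    reader = csv.reader(f)
    next(reader)                        # skip the header row
    row_count = int(next(reader)[0])

out = [(r, row_count) for r in rows]
print(len(out))
```

In Hop itself, a workflow would run the two pipelines in sequence, with the first writing to a staging table or file and the second reading it back.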

4. Use the Blocking transform

For pipelines that require sequential processing, the Blocking transform can be a valuable addition.

Configure the Blocking transform with the option “Pass all rows” to ensure that all rows in a stream are fully processed before moving on to the next transform.

Adjust settings like cache size in the Blocking transform to optimize performance according to your needs.
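Conceptually, "Pass all rows" means the Blocking transform buffers its entire input before emitting anything, so downstream transforms only see data once the upstream stream has fully completed. A rough Python sketch of that behavior (the function name and `cache_size` parameter are illustrative; the real transform can spill rows beyond its cache size to disk):

```python
from collections import deque

def blocking_pass_all_rows(rows, cache_size=5000):
    # Buffer the whole input stream first; nothing is forwarded until
    # the upstream producer has finished.
    cache = deque()
    for row in rows:
        cache.append(row)   # a real implementation would spill rows
                            # beyond cache_size to temporary files
    # Only now start emitting, in the original order.
    while cache:
        yield cache.popleft()

blocked = list(blocking_pass_all_rows(range(5)))
print(blocked)  # [0, 1, 2, 3, 4]
```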

Key takeaways

  • Pipelines with the “Stream lookup” transform are prone to deadlocks when processing large datasets.
  • Rowset size determines the maximum rows buffered between transforms. Setting an appropriate rowset size is essential for managing data flow in pipelines.
  • Practical solutions to avoid deadlocks include splitting streams, dividing pipelines, and using the Blocking transform for sequential data handling.

Don’t miss the YouTube video for a step-by-step walkthrough!

Stay connected

If you have any questions or run into issues, contact us and we’ll be happy to help.



Written by Bart Maertens

Data architect, data engineer with 20+ years of experience. Co-founder at Apache Hop and Lean With Data, founder at know.bi.
