Combining Data Streams in Apache Hop: Direct Hops vs. Append Streams

Bart Maertens
Jan 16, 2025


NOTE: this is a repost of an earlier post by Adalennis Buchillón Soris on the know.bi blog.

Hey everyone! In this post, we’re exploring how to efficiently combine data from multiple sources in Apache Hop. We’ll look at when you can keep things simple by directly connecting streams and when you need the Append Streams transform for precise control over data order. Let’s get started!

Introduction

Imagine this scenario: you’re building a data pipeline in Apache Hop, and you’ve got three different data sources feeding into it.

These data streams are identical — they have the same columns, the same data types, everything matches perfectly. So, the big question is: how do you combine these streams into a single workflow efficiently?

The simple approach

If your data streams are identical and you don’t care about the order in which the data is processed, there’s a straightforward solution. You can simply create direct hops from each of your input streams to a single downstream transform. This method is clean, efficient, and keeps your pipeline easy to manage.

Example without append streams:

Let’s consider a scenario where you have two CSV files, each containing flight data for January and February. Both files have the same columns: Flight Number, Departure, Arrival, Date, and Airline.

Pipeline overview:

  1. Input files: Start with two Text File Input transforms — one for each CSV file (January and February).
  2. Direct hops: Create direct hops from each of these Text File Input transforms to a single downstream transform, such as ‘Select Values’ or ‘Sort Rows’.

In this setup, Apache Hop automatically handles the data union, merging all streams into one without the need for any extra transforms. This method is perfect for cases where you just want to aggregate data without worrying about the sequence in which the records appear.
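To make the behavior concrete, here’s a minimal plain-Python sketch of what a direct-hop union amounts to. This is a conceptual analogy, not Apache Hop’s API, and the sample rows are made up for illustration.

```python
# Hypothetical sample rows modeling the two CSV inputs (not Hop's API).
january = [
    {"Flight Number": "AH100", "Departure": "BRU", "Arrival": "JFK",
     "Date": "2025-01-05", "Airline": "AH Air"},
    {"Flight Number": "AH101", "Departure": "JFK", "Arrival": "BRU",
     "Date": "2025-01-06", "Airline": "AH Air"},
]
february = [
    {"Flight Number": "AH200", "Departure": "BRU", "Arrival": "LHR",
     "Date": "2025-02-03", "Airline": "AH Air"},
]

# A plain concatenation models the union the downstream transform sees.
# In Hop, the input transforms run concurrently, so the relative order of
# January and February rows is NOT guaranteed -- only that every row
# arrives exactly once.
combined = january + february
print(len(combined))  # 3
```

The takeaway: the union is complete and correct, but you should treat the row order as arbitrary.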

Key points:

  • Efficiency: This approach is very efficient since it avoids the overhead of managing the order of the streams.
  • Simplicity: The pipeline is simple, with minimal transforms and an easy-to-understand data flow.
  • Flexibility: You can easily scale this approach by adding more identical streams, all feeding into the same downstream transform.

This pipeline design is ideal when you’re dealing with multiple identical data sources and need to combine them quickly without caring about the order in which rows are processed. It’s a great solution for large-scale data processing where performance and simplicity are key.

When to use the Append Streams transform

But what if the order of your data matters? In the previous example, when the pipeline starts, all transforms are initialized simultaneously. As a result, data from the first and second inputs is processed concurrently, meaning the order of the data may not be preserved.

In some cases, though, you need to ensure that data from one source is fully processed before the next source’s data begins. This is where the ‘Append Streams’ transform comes into play.

Example requiring append streams:

Imagine you have the same two flight data files: flights_january.csv and flights_february.csv. Both files have identical columns. However, the “Operating Airline” column in January’s file contains extra characters, like parentheses, that need to be cleaned up.

You want to make sure that all the January data is processed and cleaned before the pipeline starts processing February’s data. This order is critical because the cleanup logic applies only to January’s data.

Pipeline overview:

  1. Input files: Start with two separate Text File Input transforms — one for flights_january.csv and another for flights_february.csv.
  2. Data cleaning for January: Use a ‘Replace in String’ transform to clean the “Operating Airline” column in the January data.
  3. Append Streams: Use the Append Streams transform to combine the cleaned January data with the February data.

By using the Append Streams transform, you ensure that data from the first stream (January) is fully processed before the second stream (February) starts.
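The three steps above can be sketched in plain Python as well. Again, this is an analogy to make the ordering guarantee tangible, not Hop’s engine; the rows and the “Operating Airline” values are hypothetical.

```python
import re

# Hypothetical sample rows; January's "Operating Airline" values carry
# parenthesized suffixes that need to be stripped before combining.
january = [
    {"Flight Number": "AH100", "Operating Airline": "AH Air (AH)"},
    {"Flight Number": "AH101", "Operating Airline": "AH Air (AH)"},
]
february = [
    {"Flight Number": "AH200", "Operating Airline": "AH Air"},
]

# Step 2: 'Replace in String' analog -- drop any parenthesized suffix.
for row in january:
    row["Operating Airline"] = re.sub(r"\s*\(.*?\)", "", row["Operating Airline"])

# Step 3: 'Append Streams' analog -- January is fully emitted before
# February's rows follow, so the output order is deterministic.
combined = january + february
print([r["Flight Number"] for r in combined])  # ['AH100', 'AH101', 'AH200']
```

Unlike the direct-hop union, the output order here is guaranteed: all cleaned January rows, then all February rows.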

This is crucial when the sequence of rows affects the outcome of your data processing or where maintaining the order of data is necessary for downstream operations.

If you hover over the blue info icon, you’ll see a note explaining that round-robin row distribution is not applicable for sequential processing. This ensures that all rows from the first input are processed before moving to the next.

Key points:

  • Order preservation: This approach is crucial when the order of processing is important.
  • Control: The pipeline gives you fine-grained control over the sequence of data processing, which can be important for certain types of data analysis or reporting.
  • Handling multiple streams: If you have more than two streams to combine, remember that the Append Streams transform only accepts two input streams. You’ll need to chain multiple Append Streams transforms to handle more than two inputs.
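The chaining pattern for more than two streams can be sketched like this (plain Python with hypothetical data; each `append_streams` call stands in for one Append Streams transform):

```python
def append_streams(head, tail):
    """Analog of one Append Streams transform: all of head's rows first, then tail's."""
    return list(head) + list(tail)

# Three streams require two chained Append Streams transforms:
# (january + february) first, then march appended to that result.
january, february, march = ["jan-1", "jan-2"], ["feb-1"], ["mar-1", "mar-2"]
combined = append_streams(append_streams(january, february), march)
print(combined)  # ['jan-1', 'jan-2', 'feb-1', 'mar-1', 'mar-2']
```

Each transform in the chain preserves the head-before-tail guarantee, so the final order is fully deterministic.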

This pipeline design is ideal when you have multiple data sources with identical schemas and need to strictly control the order of data processing. The Append Streams transform is the perfect tool for this scenario, ensuring that your data is processed in the correct sequence.

Use cases recap

To recap: if you’re dealing with identical streams and don’t need to worry about processing order, direct hops to a single transform are your best bet. It’s quick, easy, and keeps things simple.

But when order matters, or if you need to ensure data is processed in a specific sequence, the Append Streams transform is the tool you need.

Conclusion

Understanding when and how to use these methods will help you build more efficient and effective pipelines in Apache Hop. Thanks for reading! If you found this helpful, stay tuned for more tips on mastering your data workflows. See you in the next post!

Don’t miss the YouTube video for a step-by-step walkthrough!

Stay connected

If you have any questions or run into issues, contact us and we’ll be happy to help.



Written by Bart Maertens

Data architect, data engineer with 20+ years of experience. Co-founder at Apache Hop and Lean With Data, founder at know.bi.
