Apache Hop (Incubating) 1.0 — a milestone in data orchestration

Bart Maertens
Oct 7, 2021

The Apache Hop (Incubating) community released Apache Hop 1.0, the first major release of the platform.

Apache Hop (Incubating) 1.0 is available!

Apache Hop, an incubating project at the Apache Software Foundation, is exploring the future of data orchestration and data integration. More often than not, data platforms target highly skilled developers and focus on data science rather than on the underlying problem of integrating, preparing and orchestrating data. Nevertheless, these humble tasks still require large amounts of time and effort and should be neither neglected nor taken lightly.

Data orchestration and integration for everyone

Apache Hop is different. The platform aims to enable non-developers to be as productive as, if not more productive than, “real” developers writing “real” code. To do this, Hop offers a set of tools that enable Hop data developers to design, run, test and monitor data processes.

Visual development and a uniform set of tools

Apache Hop offers a development environment to visually design workflows and pipelines. Pipelines perform the heavy data lifting: they read data from a variety of sources, then clean, combine and enrich that data before writing it to one or more target platforms. Workflows are where the orchestration happens. In a workflow, you’ll check if your environment is ready to start processing, run child workflows and pipelines, and handle errors.
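To make the split between pipelines and workflows concrete, here is a minimal sketch in plain Python. It is not Hop's API and not how Hop is implemented: in Hop you would draw this visually. Every function name and all of the sample data below are hypothetical.

```python
# Conceptual sketch only: plain Python, not Hop's API. All names are hypothetical.

def read_orders():
    # Stand-in for a source; a real pipeline would read files, databases, etc.
    return [
        {"id": 1, "country": "be", "amount": "10.50"},
        {"id": 2, "country": "BR", "amount": None},
        {"id": 3, "country": "jp", "amount": "7.25"},
    ]

def clean(rows):
    # Drop rows with a missing amount and normalise the remaining fields.
    for row in rows:
        if row["amount"] is None:
            continue
        yield {**row, "country": row["country"].upper(), "amount": float(row["amount"])}

def enrich(rows, country_names):
    # Combine each row with reference data.
    for row in rows:
        yield {**row, "country_name": country_names.get(row["country"], "unknown")}

def run_pipeline(target):
    # The pipeline does the data lifting: read, clean, enrich, write.
    names = {"BE": "Belgium", "JP": "Japan"}
    for row in enrich(clean(read_orders()), names):
        target.append(row)

def run_workflow(environment_ready):
    # The workflow orchestrates: check the environment, run the pipeline, handle errors.
    target = []
    if not environment_ready:
        return "skipped", target
    try:
        run_pipeline(target)
    except Exception as exc:
        return f"failed: {exc}", target
    return "ok", target

status, loaded = run_workflow(environment_ready=True)
print(status, len(loaded))
```

The pipeline functions only know about rows, while the workflow only knows whether to start and what to do when something goes wrong.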

Apache Hop (Incubating) — Hop Gui

History and architecture

Hop started as a fork of the Pentaho Data Integration (Kettle) open source data integration platform. From day one, the project team decided to part ways with the original platform. Compatibility was not a goal; things had to change dramatically. After almost two years of development, literally not a single source file was left untouched.

The newly designed Apache Hop platform has a very strong focus on metadata. Every single item in a Hop project is metadata: workflows, pipelines, (relational or NoSQL) database connections and about twenty other metadata types. The architecture was reworked to a kernel architecture: Hop now runs a small but reliable and flexible engine that lets data flow through workflows and pipelines. All non-kernel functionality is added through plugins. The Hop 1.0 release contains over 400 plugins, adding over 20 different types of functionality. This architecture gives Hop unparalleled flexibility: it can handle anything from IoT data on edge devices to petabytes of data, in streaming, batch or hybrid modes, on-premises, in the cloud or in hybrid architectures.
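The idea of a small kernel with everything else added through plugins, driven by metadata, can be sketched in a few lines of Python. This is a conceptual illustration only, not Hop's plugin system: the registry, the decorator and the two sample transforms are all hypothetical.

```python
# Conceptual sketch of a kernel-plus-plugins design in plain Python.
# This is not Hop's plugin API; every name here is hypothetical.

PLUGINS = {}

def plugin(plugin_type, plugin_id):
    # Decorator that registers a function under (type, id) in the registry.
    def register(func):
        PLUGINS[(plugin_type, plugin_id)] = func
        return func
    return register

@plugin("transform", "uppercase")
def uppercase(row, field):
    return {**row, field: row[field].upper()}

@plugin("transform", "add_constant")
def add_constant(row, field, value):
    return {**row, field: value}

def run(rows, steps):
    # The "kernel": it only moves rows through the steps it looks up by id.
    for row in rows:
        for step in steps:
            transform = PLUGINS[("transform", step["id"])]
            row = transform(row, **step.get("options", {}))
        yield row

# The pipeline itself is plain metadata, not code.
pipeline = [
    {"id": "uppercase", "options": {"field": "city"}},
    {"id": "add_constant", "options": {"field": "source", "value": "demo"}},
]
result = list(run([{"city": "ghent"}], pipeline))
print(result)
```

Adding functionality means registering another plugin; the `run` function never has to change.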

Design once, run anywhere

Data projects are dynamic in nature. Requirements change, data volumes change and architectures evolve.

Hop workflows and pipelines are designed in Hop Gui, but are not limited to the local Hop runtime. Flexible runtime configurations allow data developers to take their workflows and pipelines where the data is. Workflows can be run on the native Hop runtime engine, either on a local machine or on a remote server. Pipelines have these options too but can also run on an Apache Spark, Apache Flink or Google Dataflow cluster through Apache Beam.

This ability to design a workflow or pipeline once and run it on the platform where it makes most sense gives Hop data projects a huge advantage. With little or no changes, your workflows and pipelines can follow the data and scale with the volume you need to process.
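As a rough illustration of "design once, run anywhere", the sketch below keeps one piece of logic and swaps the engine that executes it by name. It is plain Python with hypothetical names throughout. It does not use Hop's run configurations or Apache Beam, and the thread pool merely stands in for a remote or distributed engine.

```python
# Conceptual sketch: one pipeline definition, two interchangeable engines.
# Not Hop's run configuration API; all names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def to_celsius(reading):
    # The designed logic, independent of where it runs.
    celsius = round((reading["fahrenheit"] - 32) * 5 / 9, 1)
    return {"sensor": reading["sensor"], "celsius": celsius}

def local_engine(transform, rows):
    # Processes rows one by one in the current thread.
    return [transform(row) for row in rows]

def parallel_engine(transform, rows):
    # Spreads rows over a thread pool; map() preserves the input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(transform, rows))

RUN_CONFIGURATIONS = {"local": local_engine, "parallel": parallel_engine}

def execute(transform, rows, run_configuration):
    # Pick the engine by name; the transform itself is untouched.
    return RUN_CONFIGURATIONS[run_configuration](transform, rows)

readings = [
    {"sensor": "a", "fahrenheit": 212.0},
    {"sensor": "b", "fahrenheit": 32.0},
]
local_result = execute(to_celsius, readings, "local")
parallel_result = execute(to_celsius, readings, "parallel")
print(local_result == parallel_result)
```

Only the name passed to `execute` changes between runs, which is the property the run configurations described above are meant to provide.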

Life Cycle Management

Managing data or analytics projects as they mature and evolve is not a trivial task. Keeping track of changes and making sure the code continues to work and performance doesn’t degrade typically require a lot of work and resources. Hop offers all the tools a data team needs to efficiently develop and maintain long-term projects.

Projects and environments

Data and analytics teams typically work on multiple projects simultaneously, with different versions of a project deployed on development, testing, production and CI/CD environments.

Hop makes managing multiple projects and environments a breeze. A project in Hop is a collection of workflows, pipelines and other metadata items. Environments contain the configuration for a system where these projects are deployed. This not only allows projects to be quickly deployed to new environments, it also guarantees a strict separation between code (projects) and configuration (environments).
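The separation between code (projects) and configuration (environments) can be pictured with a small Python sketch. It does not reproduce Hop's project or environment file formats. The variable names, hosts and connection URL are made up for illustration, and Python's standard `string.Template` stands in for variable resolution.

```python
# Conceptual sketch: not Hop's file formats or variable handling.
# All names, hosts and URLs below are hypothetical.
from string import Template

# The project: the "code", identical in every environment.
project = {
    "name": "sales",
    "connection_url": Template("jdbc:postgresql://${DB_HOST}:${DB_PORT}/sales"),
}

# The environments: configuration only, one entry per target system.
environments = {
    "development": {"DB_HOST": "localhost", "DB_PORT": "5432"},
    "production": {"DB_HOST": "db.example.com", "DB_PORT": "5432"},
}

def resolve(project, environment_name):
    # substitute() raises KeyError if the environment lacks a variable,
    # so an incomplete configuration fails early instead of silently.
    variables = environments[environment_name]
    return project["connection_url"].substitute(variables)

dev_url = resolve(project, "development")
prod_url = resolve(project, "production")
print(dev_url)
print(prod_url)
```

Deploying to a new environment then only means adding one more entry of variables; nothing in the project changes.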

Switching between projects and environments in Hop Gui is as easy as selecting a project or environment from a drop-down list. In the various command line tools, projects and environments can be passed as command line arguments.

Integrated git version control

A strict separation between code and configuration makes Hop projects easy to manage in version control. The file explorer perspective in Hop Gui lets data developers perform the most common git operations (pull, push, commit, etc.) directly from the IDE, without losing their train of thought. A visual diff even gives a quick indication of which items in a workflow or pipeline were added, changed or deleted between two versions.

Apache Hop (Incubating) — git visual diff

Test, test, test

Once a project goes beyond the initial phases of development, it needs to be run frequently. Hop has functionality to gracefully handle errors in workflows and pipelines, but that only tells you whether any errors occurred. It doesn’t tell you anything about whether your data was processed correctly. To ensure workflows and pipelines produce exactly the expected results, Hop data developers can add unit tests to their pipelines. In these unit tests, a sample data set is used to execute a pipeline. The generated result of that pipeline is then compared to an expected (golden) data set. If the generated output matches the golden data set, the test passes. If there’s any difference between the generated and expected results, the test fails. Building a library of unit, integration and regression tests allows Hop teams to guarantee correct data processing and as such significantly increase the reliability of the process.
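The golden data set comparison is easy to picture in code. The sketch below is plain Python, not Hop's unit test metadata. The sample pipeline, the `run_unit_test` helper and the data are hypothetical and only illustrate the pass/fail rule described above.

```python
# Conceptual sketch of a golden data set test; not Hop's unit test metadata.
# The pipeline, helper and data are hypothetical.
from itertools import zip_longest

def pipeline_under_test(rows):
    # Sample logic: total amount per customer, sorted by customer name.
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return [{"customer": name, "total": total} for name, total in sorted(totals.items())]

def run_unit_test(pipeline, input_rows, golden_rows):
    # Run the pipeline on the sample input and compare row by row with the
    # golden data set; zip_longest also catches missing or extra rows.
    actual = pipeline(input_rows)
    differences = [
        (index, expected, produced)
        for index, (expected, produced) in enumerate(zip_longest(golden_rows, actual))
        if expected != produced
    ]
    return len(differences) == 0, differences

sample_input = [
    {"customer": "acme", "amount": 10},
    {"customer": "zenith", "amount": 5},
    {"customer": "acme", "amount": 15},
]
golden = [
    {"customer": "acme", "total": 25},
    {"customer": "zenith", "total": 5},
]

passed, differences = run_unit_test(pipeline_under_test, sample_input, golden)
print(passed, differences)
```

Returning the list of differences alongside the verdict makes a failing test point straight at the rows that deviate from the golden data set.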

The Hop developers actually eat their own dog food. Our library of integration tests has allowed us to identify and fix bugs that have been in the code base for over a decade. Similarly, a number of regressions have been detected and fixed during the aggressive development phases that led to Hop 1.0. These unit and integration tests helped tremendously to make Hop 1.0 the stable and robust release you can download today.

Full life cycle management

The separation of code (projects) and configuration (environments), version control and testing, ideally combined in a CI/CD pipeline, gives Hop teams all the tools they need to manage a project through its entire life cycle.

Community

Incubating Apache projects have two main tasks: learning to build software “The Apache Way” and building a community. Hop 1.0 is the fourth release as an incubating project. The Hop team knows how to build software “The Apache Way” by now.

The second task, community building, is maybe even more important. The Hop community started as a handful of developers in late 2019. Two years later, Hop has hundreds of followers on each of its social media accounts (Twitter, LinkedIn, YouTube). Well over 200 people participate in the discussions on the Mattermost chat channels. Local user groups have started popping up all over the world. Active groups have organized events in at least Brazil, Spain, Italy and Japan.

Great communities build great software. The Hop community is a vibrant group of enthusiasts who share experiences, discuss problems, and register bug and feature request tickets. All of these are very welcome contributions that help to take Hop forward every single day.

What’s next?

Work has already started on the next release (1.1.0). The Hop PPMC (Podling Project Management Committee) intends to release frequently, shipping new features as they are developed.

After just over a year and four releases in the incubator, Hop is ready to start working towards graduating as an Apache Top Level Project. With this graduation, the Apache Software Foundation will take ownership of all Hop code, documentation and other source materials.

Hop 1.0 and the upcoming graduation as a Top Level Project will mark the end of the beginning for Hop. The future looks bright. The entire community is eager to make Hop the go-to platform for data orchestration and integration, and hopes to join the ranks of Spark, Flink, Beam, Kafka, Airflow and others as successful Apache projects.

Apache Hop (Incubating) useful links

Written by Bart Maertens

Data architect, data engineer with 20+ years of experience. Co-founder at Apache Hop and Lean With Data, founder at know.bi.
