Step-by-step guide to Pipeline Logging
NOTE: this is a repost of an earlier post by Adalennis Buchillón Soris on the know.bi blog.
Pipeline Log
Apache Hop allows data engineers and data developers to visually design workflows and data pipelines to build powerful solutions.
After your project has gone through initial development and testing, knowing what is going on at runtime becomes important.
Apache Hop’s Pipeline Log feature enables logging pipeline activity with another pipeline: a Pipeline Log streams the logging information of a running pipeline, in JSON format, to a pipeline of your choice.
Hop will pass the logging information for each pipeline you run to the pipeline(s) you specify as pipeline log metadata objects. In this post, we’ll look at an example of how to configure and use the pipeline log metadata to write pipeline logging information to a relational database.
The examples here use variables to separate code from configuration, in line with Apache Hop project best practices.
Step 1: Create a Pipeline Log metadata object
To create a Pipeline Log, click on the New -> Pipeline Log option or on the Metadata -> Pipeline Log option.
The system displays the New Pipeline Log view with the fields to be configured. For this example, configure the Pipeline Log as follows:
- Name: The name of the metadata object (pipelines-log).
- Enabled: Logging is active by default (enabled).
- Logging parent pipelines only: Ensures that only parent pipelines are logged, avoiding redundancy in nested executions (enabled).
- Pipeline executed to capture logging: Select or create the pipeline to be used for logging the activity (${PROJECT_HOME}/code/logging/pipeline-log-database.hpl). Remember that ${PROJECT_HOME} represents the root directory of your project. You can select an existing pipeline, specify a custom path where you plan to create your logging pipeline, or create a new one directly; we’ll customize it in the next step. The only requirement is that the pipeline must start with a Pipeline Logging transform to ensure proper log capture and processing.
- Execute at the start of a pipeline: Logs are captured at the beginning of pipeline execution (enabled).
- Execute at the end of a pipeline: Logs are captured at the completion of the pipeline (enabled).
- Execute periodically during execution: This option logs the pipeline’s progress at regular intervals while it is running. In this case, periodic logging is turned off (disabled).
- Interval in Seconds: If periodic execution is enabled, logs would be captured every 30 seconds during the pipeline’s execution, providing real-time insights into its progress (30).
Periodic logging is useful for monitoring long-running pipelines in real time. In this example, it’s disabled as periodic logs are not needed. If enabled, set an appropriate interval to balance real-time insights with system performance.
Finally, save the Pipeline Log metadata.
Pipeline logging will apply to any pipeline you run in the current project. That may not be necessary or even desirable. If you only want to capture logging information for selected pipelines, add them to the table below the configuration options (Capture output of the following pipelines).
The screenshot below shows logging configured for just the clean-transform.hpl pipeline in the default Apache Hop samples project.
Step 2: Create a new pipeline with the Pipeline Logging transform
To create the pipeline, click on the New button in the New Pipeline Log dialog. Then choose a folder and a name for the pipeline.
Using the ‘New’ button in the Pipeline Log dialog ensures that the required Pipeline Logging transform is included automatically. When creating a pipeline manually, you must add the Pipeline Logging transform as the first step to capture logs properly.
From the New Pipeline Log dialog, a new pipeline is automatically created with a Pipeline Logging transform connected to a Dummy transform (Save logging here).
Now it’s time to configure the Pipeline Logging transform. This configuration is very simple: open the transform and set your values as in the following example:
- Transform name: Choose a name for your transform; just remember that it should be unique in your pipeline (pipeline-log).
- Also log transform: Selected by default. When enabled, this option logs detailed information about each transform in the pipeline, which can produce a large amount of data; we uncheck it here to focus on high-level pipeline execution logs (unchecked).
Step 3: Add and configure a Table output transform
The Table Output transform allows you to load data into a database table. Table Output is equivalent to the DML operator INSERT, as sketched below. This transform provides configuration options for the target table and a number of housekeeping and performance-related options such as Commit Size and Use batch update for inserts.
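A minimal, purely illustrative sketch of that equivalence, using the table and logging fields from this example (the values are made up):

```sql
-- Conceptually, Table Output issues an INSERT for every incoming row.
-- Table and column names follow this example; the values are illustrative.
INSERT INTO "pipelines-logging" (logging_date, pipelinefilename, pipelinestart, pipelineend)
VALUES (CURRENT_TIMESTAMP, 'clean-transform.hpl', '2024-01-01 10:00:00', '2024-01-01 10:00:05');
```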
TIP: In this example we log to a relational database connection, but you can also use output files. If you decide to use a database connection, check the database’s installation and availability as a prerequisite.
Add a Table Output transform by clicking anywhere in the pipeline canvas, then search for ‘table output’ and select Table Output.
Now it’s time to configure the Table Output transform. Open the transform and set your values as in the following example:
- Transform name: Choose a name for your transform; just remember that it should be unique in your pipeline (pipelines logging).
- Connection: The database connection to which data will be written (logging-connection). The connection was configured using the logging-connection.json environment file, which contains the database connection variables; see the sketch below.
Ensure that the target database is accessible, the user has appropriate permissions, and the database schema is prepared to handle logs.
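A minimal sketch of such an environment file, assuming Hop’s standard environment file layout (a variables array) and hypothetical variable names; adapt the names and values to whatever your connection metadata references:

```json
{
  "variables": [
    { "name": "LOGGING_DB_HOSTNAME", "value": "localhost", "description": "hypothetical: logging database host" },
    { "name": "LOGGING_DB_PORT",     "value": "5432",      "description": "hypothetical: logging database port" },
    { "name": "LOGGING_DB_NAME",     "value": "logging",   "description": "hypothetical: database name" },
    { "name": "LOGGING_DB_USER",     "value": "hop",       "description": "hypothetical: database user" },
    { "name": "LOGGING_DB_PASSWORD", "value": "********",  "description": "hypothetical: database password" }
  ]
}
```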
- Target table: The name of the table to which data will be written (pipelines-logging).
- Click on the SQL option to generate the SQL to create the output table automatically:
- Execute the SQL statements:
- Open the created table in your favorite database explorer (e.g. DBeaver) to see all the logging fields:
The logging table includes fields such as logging_date (timestamp of the log entry), pipelinefilename (the executed file), pipelinestart (timestamp of the start), and pipelineend (timestamp of the end); a rough sketch of the corresponding DDL follows below. For more information, refer to the Apache Hop documentation.
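As a rough sketch, the generated DDL resembles the following (illustrative only; the exact column list and data types depend on your database, and Hop generates the full statement for you):

```sql
-- Illustrative sketch: column names and types vary per database;
-- use the SQL that Hop's SQL option generates for your connection.
CREATE TABLE "pipelines-logging" (
  logging_date     TIMESTAMP,    -- timestamp of the log entry
  pipelinefilename VARCHAR(255), -- executed file
  pipelinestart    TIMESTAMP,    -- start of the pipeline run
  pipelineend      TIMESTAMP     -- end of the pipeline run
  -- ... plus the other logging fields Hop generates
);
```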
- Close and save the pipeline.
Step 4: Run a pipeline and check the logs
Finally, run a pipeline by clicking on the Run -> Launch option. In this case, we use a pipeline (clean-transform.hpl) from my-hop-project.
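If you prefer the command line over the GUI, you can start the same run with Hop’s hop-run tool. A minimal sketch, assuming a ‘local’ pipeline run configuration exists and the pipeline sits in the project root (adjust the path to your layout):

```bash
# Minimal sketch: run the pipeline with hop-run instead of the GUI.
# Assumes a 'local' run configuration and that clean-transform.hpl sits in
# the project root; adjust to your project layout. The quoted variable is
# resolved by Hop, not by the shell.
./hop-run.sh --project my-hop-project \
             --file '${PROJECT_HOME}/clean-transform.hpl' \
             --runconfig local
```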
The data of the pipeline execution will be recorded in the pipelines-logging table; check the data there after the run.
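For a quick check, a query like the following sketch works (column names as created in Step 3):

```sql
-- Inspect the most recent pipeline runs; the table name is quoted because of the dash.
SELECT logging_date, pipelinefilename, pipelinestart, pipelineend
FROM "pipelines-logging"
ORDER BY logging_date DESC;
```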
Remarks
- Default logging behavior → Step 1: By default, all pipelines are logged. However, users can customize which pipelines they want to log by enabling or disabling logging for specific pipelines in the pipeline log settings.
- Pipeline Logging transform → Step 2: When the “New” option is used from the Pipeline Log dialog, a pipeline with the Pipeline Logging transform is automatically created. If the pipeline is created from scratch, you must manually add the Pipeline Logging transform as the first transform to ensure logging functionality.
- Log storage options → Step 3: While relational database connections are commonly used for logging, users can also choose alternative storage options, such as output files, to write logs, providing flexibility based on project requirements.
Don’t miss the YouTube video for a step-by-step walkthrough!
Next steps
You now know how to use the Pipeline Log metadata type and everything Apache Hop has to offer to process your pipeline logging information.
Feel free to reach out if you’d like to find out more or to discuss how we can help with pipeline logging or any other aspect of your data engineering projects with Apache Hop.