Lesson 12 — Deployment#

So far, we have seen how to create, execute and debug a project in Flowman. But this still leaves open the question of what a development workflow could look like. Of course, you could simply install Flowman on some server and then copy all project files to that server before using flowexec to execute some jobs.

But Flowman also offers a more streamlined process using Apache Maven as a build system. This workflow integrates easily into an existing CI/CD infrastructure.

1. What to Expect#

Objectives#

  • You learn a robust development workflow including creating and deploying artifacts

  • You know how to use the Flowman Maven Plugin

You can find the full source code of this lesson on GitHub

Description#

Since this lesson focuses on the workflow and not on core features, we will reuse the project from Lesson 5.

We will restructure the project to be processed by the Flowman Maven plugin. The result will support the following workflow:

  1. Development on your local machine.

  2. Build a deployable artifact. This can be done on your local machine, but also on some CI/CD server like Jenkins.

  3. Deploy artifact to some remote location.

Prerequisites#

This lesson is not executed within the Docker container. It should be executed directly on your local machine. You need Java 11 and Maven installed on your machine.
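Before starting, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch; the helper name check_tool is made up for this illustration and is not part of Flowman or Maven.

```shell
#!/bin/sh
# Report whether a required command-line tool is on the PATH.
# check_tool is a hypothetical helper name used only in this sketch.
# Returns 0 if the tool is found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Typical use on your own machine (not executed here):
#   check_tool java && check_tool mvn && java -version && mvn -version
```

The sketch only checks that the commands exist. Whether the installed Java is version 11 still has to be read from the output of java -version.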

2. Project Setup#

In order to use Maven with the Flowman plugin, we need to slightly restructure the project: We move all project-related files into a subdirectory weather (the name of the project). We also add a directory conf containing the default-namespace.yml configuration file. Finally, we add the files pom.xml for Maven and deployment.yml for the Flowman Maven plugin.

2.1 Project Structure#

The final directory structure looks as follows:

├── conf
│   └── default-namespace.yml
├── weather
│   ├── config
│   │   ├── aws.yml
│   │   ...
│   ├── job
│   │   └── main.yml
│   ├── mapping
│   │   ├── measurements.yml
│   │   ...
│   ├── model
│   │   ├── measurements-raw.yml
│   │   ...
│   ├── project.yml
│   ├── schema
│   │   ├── measurements.json
│   │   ...
│   └── target
│       ├── aggregates.yml
│       ...
├── deployment.yml
├── pom.xml
└── README.md
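The restructuring itself is just a handful of file moves. The sketch below assumes that the old project keeps project.yml and the directories config, job, mapping, model, schema and target at its top level; that layout is an assumption, so adjust the list to your own project. The function name restructure_project is hypothetical.

```shell
#!/bin/sh
# Move the files of a flat Flowman project into a "weather" subdirectory
# and create the "conf" directory next to it.
# Assumption: project.yml and the directories listed below live at the
# top level of the old project.
restructure_project() {
    root="$1"
    mkdir -p "$root/weather" "$root/conf"
    for entry in project.yml config job mapping model schema target; do
        # Skip entries that do not exist in this particular project
        if [ -e "$root/$entry" ]; then
            mv "$root/$entry" "$root/weather/"
        fi
    done
}

# Typical use (not executed here):
#   restructure_project /path/to/project
```

Run this only once and before the first Maven build, because Maven writes its own build output into a top-level directory that is also called target. The files default-namespace.yml, pom.xml and deployment.yml still have to be added by hand.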

2.2 Maven Build Process#

The pom.xml of the project looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.dimajix.flowman.tutorial</groupId>
  <artifactId>flowman-tutorial-weather</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <name>Flowman Weather Data</name>
  <description>Small demo project for Flowman using publicly available weather data</description>

  <properties>
    <!-- Encoding related settings -->
    <encoding>UTF-8</encoding>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <flowman.version>1.1.0</flowman.version>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>com.dimajix.flowman.maven</groupId>
        <artifactId>flowman-maven-plugin</artifactId>
        <version>0.4.0</version>
        <extensions>true</extensions>
        <configuration>
          <deploymentDescriptor>deployment.yml</deploymentDescriptor>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

As you can see, the Maven project looks almost trivial, but the flowman-maven-plugin will take care of lots of functionality.

2.3 Deployment Descriptor#

In addition to the Maven pom.xml you will also find a deployment.yml file which contains the packaging details for the Flowman Maven plugin. Its contents look as follows:

flowman:
  version: ${flowman.version}
  plugins:
    - flowman-avro
    - flowman-aws

# List of subdirectories containing Flowman projects
projects:
  - weather

# List of packages to be built
packages:
  # The first package is called "dist"
  dist:
    kind: dist

  # The second package is called "jar"
  jar:
    # The package is a "fatjar" package, i.e. a single jar file containing both Flowman and your project
    kind: fatjar

execution:
  javaOptions:
    - -Dhttp.proxyHost=${http.proxyHost}
    - -Dhttp.proxyPort=${http.proxyPort}
    - -Dhttps.proxyHost=${https.proxyHost}
    - -Dhttps.proxyPort=${https.proxyPort}

This deployment descriptor will create two packages, both using the Maven coordinates (groupId, artifactId and version) of the pom.xml file. The packages are distinguished by their classifier:

  • The jar package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar, i.e.

    Property    Value
    groupId     com.dimajix.flowman.tutorial
    artifactId  flowman-tutorial-weather
    version     1.0.0-SNAPSHOT
    classifier  jar
    packaging   jar

The jar file is a so-called “fat jar” and contains both the Flowman code and your project files. This self-contained file can be run directly with spark-submit.

  • The dist package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist, i.e.

    Property    Value
    groupId     com.dimajix.flowman.tutorial
    artifactId  flowman-tutorial-weather
    version     1.0.0-SNAPSHOT
    classifier  dist
    packaging   tar.gz

The dist package will create a tar.gz file, which contains all Flowman libraries, executables and plugins along with your project. For running Flowman from this package, you first need to unpack the tar.gz file, and then use the Flowman binaries like flowexec.

We will later use these Maven coordinates in the deployment step to retrieve the desired artifact from the artifact repository (like Nexus).
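To make the mapping from package name to coordinates concrete, here is a small sketch that assembles the coordinate string for either package. The function name package_coordinates and the three variables are made up for this illustration; the values are taken from the pom.xml and the deployment descriptor shown above.

```shell
#!/bin/sh
# Values from the pom.xml of the example project
GROUP_ID="com.dimajix.flowman.tutorial"
ARTIFACT_ID="flowman-tutorial-weather"
VERSION="1.0.0-SNAPSHOT"

# Print <groupId>:<artifactId>:<version>:<packaging>:<classifier>
# for one of the two packages defined in deployment.yml.
# The classifier is the package name; the packaging depends on the kind.
package_coordinates() {
    case "$1" in
        dist) packaging="tar.gz" ;;
        jar)  packaging="jar" ;;
        *)    echo "unknown package: $1" >&2; return 1 ;;
    esac
    printf '%s:%s:%s:%s:%s\n' "$GROUP_ID" "$ARTIFACT_ID" "$VERSION" "$packaging" "$1"
}

package_coordinates dist
package_coordinates jar
```

Keeping the coordinates in one place like this avoids typos when the same string is needed again in a deployment script.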

3. Building#

Once you are happy with your results, you can build a self-contained redistributable package with Maven via

mvn clean install

This will run all tests and create one or more packages inside the target directory. The type and details of each package are defined in the deployment.yml file. The example above will create the following two artifacts:

  • The jar package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar, i.e.

    Property    Value
    groupId     com.dimajix.flowman.tutorial
    artifactId  flowman-tutorial-weather
    version     1.0.0-SNAPSHOT
    classifier  jar
    packaging   jar

  • The dist package will create a Maven artifact with coordinates com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist, i.e.

    Property    Value
    groupId     com.dimajix.flowman.tutorial
    artifactId  flowman-tutorial-weather
    version     1.0.0-SNAPSHOT
    classifier  dist
    packaging   tar.gz

Which type of package is preferable (dist or fatjar) depends on your infrastructure and deployment pipelines. People with a dedicated Hadoop cluster (Cloudera, AWS EMR) will probably be happy with a dist package, while folks with a serverless infrastructure (Azure Synapse, AWS EMR Serverless) will probably prefer a completely self-contained fatjar package.

Note for Windows users: Maven will also execute all tests in your Flowman project. The Hadoop dependency will require the so-called Winutils to be installed on your machine.

4. Publishing#

This step should preferably be performed by a CI/CD pipeline (for example, Jenkins). Of course, the details heavily depend on your infrastructure, but basically the following command will do the job:

mvn deploy

This will deploy the packaged self-contained redistributable archive to a remote repository manager like Nexus. Of course, you will need to configure appropriate credentials in your Maven settings.xml (this is a user-specific settings file, and not part of the project).
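On a CI server, the settings file is often generated on the fly rather than stored in a home directory. The following sketch writes a minimal settings file whose credentials are read from environment variables when Maven runs. The function name write_settings, the server id nexus and the variables NEXUS_USER and NEXUS_PASSWORD are hypothetical placeholders.

```shell
#!/bin/sh
# Write a minimal Maven settings file with credentials for one server.
# Assumptions: the server id and the environment variable names
# NEXUS_USER / NEXUS_PASSWORD are placeholders chosen for this sketch.
write_settings() {
    settings_file="$1"
    server_id="$2"
    cat > "$settings_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <servers>
    <server>
      <id>${server_id}</id>
      <!-- The env.* placeholders are resolved by Maven at run time -->
      <username>\${env.NEXUS_USER}</username>
      <password>\${env.NEXUS_PASSWORD}</password>
    </server>
  </servers>
</settings>
EOF
}

# Typical use in a pipeline (not executed here):
#   write_settings ci-settings.xml nexus
#   mvn deploy --settings ci-settings.xml
```

The server id has to match the id of the repository that Maven deploys to. That repository is usually declared in a distributionManagement section of the pom.xml, which the example pom.xml shown earlier does not contain, so you would have to add it for your own repository manager.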

5. Deploying to Production#

This is the most difficult part and completely depends on your build and deployment infrastructure and on your target environment (Kubernetes, Cloudera, EMR, …). But generally, the following steps need to be performed:

5.1 Fetch redistributable package from remote repository#

You can use Maven again to retrieve the correct package via

mvn dependency:get -Dartifact=<groupId>:<artifactId>:<version>:<packaging>:<classifier> -Ddest=<your-dest-directory>

For example, to download the tar.gz package of our example into the /tmp directory, you would run the following command:

mvn dependency:get -Dartifact=com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist -Ddest=/tmp

Similarly, for fetching the fat jar, you need to run the following Maven command:

mvn dependency:get -Dartifact=com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:jar:jar -Ddest=/tmp
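In a deployment script, the two commands above differ only in the coordinates, so they can be wrapped in one function. This is a minimal sketch; fetch_package and the DRY_RUN switch are names invented for the illustration, and the version in the example call is the one declared in the pom.xml.

```shell
#!/bin/sh
# Fetch one artifact into a directory, using the same command as above.
# With DRY_RUN=1 the Maven command is only printed, not executed,
# which is handy for checking the coordinates first.
fetch_package() {
    coordinates="$1"
    dest="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mvn dependency:get -Dartifact=$coordinates -Ddest=$dest"
    else
        mvn dependency:get "-Dartifact=$coordinates" "-Ddest=$dest"
    fi
}

# Print the command for the dist package without contacting a repository
DRY_RUN=1
fetch_package "com.dimajix.flowman.tutorial:flowman-tutorial-weather:1.0.0-SNAPSHOT:tar.gz:dist" /tmp
```

Unset DRY_RUN (or set it to 0) to actually run Maven, which then needs access to the repository that holds the artifact.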

5.2 Unpack redistributable package at appropriate location#

If you pulled a tar.gz file containing a full Flowman “dist” package, then you will need to install it. You can easily unpack the package, which will provide a complete Flowman installation (minus Spark and Hadoop):

tar xvzf <artifactId>-<version>-dist-bin.tar.gz
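If the package should end up in a dedicated installation directory rather than the current one, tar can unpack it there directly. The sketch below assumes nothing about the archive beyond it being a gzipped tar file; the function name install_dist and the paths in the usage line are hypothetical.

```shell
#!/bin/sh
# Unpack a dist archive into an installation directory.
# Fails early with a message if the archive does not exist.
install_dist() {
    archive="$1"
    install_dir="$2"
    if [ ! -f "$archive" ]; then
        echo "no such archive: $archive" >&2
        return 1
    fi
    mkdir -p "$install_dir"
    # -C switches to the installation directory before extracting
    tar xzf "$archive" -C "$install_dir"
}

# Typical use (not executed here; both paths are placeholders):
#   install_dist /tmp/<archive-file>.tar.gz /opt/flowman
```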

5.3 Run on your infrastructure#

Within the installation directory, you can easily run Flowman via

bin/flowexec -f flow test run

Or you can, of course, also start the Flowman Shell via

bin/flowshell -f flow
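For scheduled or automated runs, the flowexec call can be wrapped so that it works from any working directory. This is a small sketch under the assumption that the installation directory contains bin/flowexec and the flow directory as described above; the function name run_flowman_tests is invented for the illustration.

```shell
#!/bin/sh
# Run the project's tests from inside a Flowman installation directory.
# The subshell keeps the caller's working directory unchanged.
run_flowman_tests() {
    install_dir="$1"
    if [ ! -x "$install_dir/bin/flowexec" ]; then
        echo "flowexec not found in $install_dir/bin" >&2
        return 1
    fi
    ( cd "$install_dir" && bin/flowexec -f flow test run )
}

# Typical use (not executed here; the path is a placeholder):
#   run_flowman_tests /opt/flowman
```

The function returns the exit status of flowexec, so a scheduler or CI job can react to failing tests.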

6. Next Lesson#

In the next lesson, we will learn what kind of execution metrics are collected by Flowman, how to define new data dependent metrics, and how to publish them to Prometheus.