Lesson 6 — Flowshell#

Implementing complex data transformations using Flowmans declarative approach can reduce development efforts significantly. But using the flowexec command for testing the syntax and logic during development turns out to be quite inefficient. Therefore, Flowman comes with an interactive shell, with is a valuable tool for development and debugging.

1. What to Expect#

Objectives#

  • You will learn how to use Flowman Shell to support development

  • You will know the most important command of Flowman Shell

  • You will understand job contexts and why they are required

We will use the source code of the last lesson, which is available on GitHub

Description#

Since this chapter only describes an additional development tool for Flowman, we will simply reuse the project of the last chapter. The project includes one job with multiple targets and relations.

2. Using the Flowman interactive shell#

We can simply start the Flowman shell via the following command:

cd /home/flowman
flowshell -f lessons/05-multiple-targets

As you can see, we only specify the project to be loaded, but we leave out any specific command to be executed. You will be presented with a Flowman logo and a prompt

[INFO] Using Flowman default system settings
[INFO] Reading namespace file /opt/flowman/conf/default-namespace.yml
[INFO] Reading plugin descriptor /opt/flowman/plugins/flowman-avro/plugin.yml
[INFO] Loading Plugin flowman-avro
[INFO] Reading plugin descriptor /opt/flowman/plugins/flowman-aws/plugin.yml
[INFO] Loading Plugin flowman-aws
[INFO] Reading plugin descriptor /opt/flowman/plugins/flowman-mssqlserver/plugin.yml
[INFO] Loading Plugin flowman-mssqlserver
[INFO] Reading project from file:/home/flowman/lessons/05-multiple-targets
[INFO] Reading all module files in directory file:/home/flowman/lessons/05-multiple-targets/mapping
[INFO] Reading all module files in directory file:/home/flowman/lessons/05-multiple-targets/job
[INFO] Reading all module files in directory file:/home/flowman/lessons/05-multiple-targets/model
[INFO] Reading all module files in directory file:/home/flowman/lessons/05-multiple-targets/config
[INFO] Reading all module files in directory file:/home/flowman/lessons/05-multiple-targets/target
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/target/aggregates.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/config/aws.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/mapping/aggregates.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/model/aggregates.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/model/measurements-raw.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/model/measurements.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/job/main.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/config/environment.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/model/stations-raw.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/model/stations.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/mapping/facts.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/target/measurements.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/target/stations.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/mapping/measurements.yml
[INFO] Reading module from file:/home/flowman/lessons/05-multiple-targets/mapping/stations.yml
[INFO] Loaded project 'weather' version 1.0

Welcome to
______  _
|  ___|| |
| |_   | |  ___ __      __ _ __ ___    __ _  _ __
|  _|  | | / _ \\ \ /\ / /| '_ ` _ \  / _` || '_ \
| |    | || (_) |\ V  V / | | | | | || (_| || | | |
\_|    |_| \___/  \_/\_/  |_| |_| |_| \__,_||_| |_|    1.0.0

Using Spark 3.3.2 and Hadoop 3.3.2 and Scala 2.12.15 (Java 11.0.15)

Type in 'help' for getting help
flowman:weather> 

The prompt understands all commands, which are also available in flowexec, plus some additional commands only available in the Flowman interactive shell.

2.1 Inspecting the project#

The shell provides some simple commands for inspecting the project.

Listing entities#

For example, you can view a list of all jobs in the project with the following command:

flowman:weather> job list 
main

You can request a list of all targets, relations, mappings etc. with similar commands

flowman:weather> target list 
aggregates
measurements
stations
flowman:weather> relation list
aggregates
measurements
measurements_raw
stations
stations_raw
flowman:weather> mapping list 
aggregates
facts
measurements
measurements_extracted
measurements_joined
measurements_raw
stations
stations_raw

Inspecting entities#

The shell also provides a couple of inspect commands, which show more detailed information for a single entity. For example, the following command will show more detailed information for the relation stations:

flowman:weather> relation inspect stations
 Relation:
    name: stations
    kind: file
  Requires - CREATE:
    file:file:/home/flowman/lessons/05-multiple-targets/schema/stations.avsc
  Provides - CREATE:
    file:file:/tmp/weather/stations
  Requires - READ:
    file:file:/tmp/weather/stations
  Provides - READ:
  Requires - WRITE:
  Provides - WRITE:
    file:file:/tmp/weather/stations

Similar commands also exist for other entity types, for example for inspecting a mapping, you would type the following:

flowman:weather> mapping inspect stations
Mapping:
    name: stations
    kind: relation
    inputs: 
    outputs: main
    cache: StorageLevel(1 replicas)
    broadcast: false
    checkpoint: false
  Requires:
    file:file:/tmp/weather/stations

Inspect schema#

Diving into more details, you can also retrieve the data schema of a relation or mapping as follows:

flowman:weather> relation describe stations
root
 |-- usaf: string (nullable = false)
 |-- wban: string (nullable = false)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
 |-- icao: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- elevation: float (nullable = true)
 |-- date_begin: date (nullable = true)
 |-- date_end: date (nullable = true)

Showing records#

The Flowman shell can also peek into some entities, like mappings and relations and show some records:

flowman:weather> mapping show stations_raw
[INFO] Instantiating mapping 'weather/stations_raw' with outputs 'main' (broadcast=false, cache='None')
[INFO] Reading from relation 'stations_raw' with partitions () and filter ''
[INFO] Reading file relation 'weather/stations_raw' at 's3a://dimajix-training/data/weather/isd-history'  for partitions ()
[INFO] Loading schema from file file:/home/flowman/lessons/05-multiple-targets/schema/stations.avsc
+------+-----+----------+-------+-----+----+--------+---------+---------+----------+----------+
|  usaf| wban|      name|country|state|icao|latitude|longitude|elevation|date_begin|  date_end|
+------+-----+----------+-------+-----+----+--------+---------+---------+----------+----------+
|007018|99999|WXPOD 7018|   null| null|null|     0.0|      0.0|   7018.0|2011-03-09|2013-07-30|
|007026|99999|WXPOD 7026|     AF| null|null|     0.0|      0.0|   7026.0|2012-07-13|2017-08-22|
|007070|99999|WXPOD 7070|     AF| null|null|     0.0|      0.0|   7070.0|2014-09-23|2015-09-26|
|008260|99999| WXPOD8270|   null| null|null|     0.0|      0.0|      0.0|2005-01-01|2010-09-20|
|008268|99999| WXPOD8278|     AF| null|null|   32.95|   65.567|   1156.7|2010-05-19|2012-03-23|
|008307|99999|WXPOD 8318|     AF| null|null|     0.0|      0.0|   8318.0|2010-04-21|2010-04-21|
|008411|99999|      XM20|   null| null|null|    null|     null|     null|2016-02-17|2016-02-17|
|008414|99999|      XM18|   null| null|null|    null|     null|     null|2016-02-16|2016-02-17|
|008415|99999|      XM21|   null| null|null|    null|     null|     null|2016-02-17|2016-02-17|
|008418|99999|      XM24|   null| null|null|    null|     null|     null|2016-02-17|2016-02-17|
+------+-----+----------+-------+-----+----+--------+---------+---------+----------+----------+
only showing top 10 rows

2.2 Job context#

When playing around with the shell, you might notice that the commands above do not work with all entities of the project. In particular, they fail for entities, which access the job parameter year. This is quite natural, because this variable is not defined globally, but only inside the main job. The same problem is true for all additional environment variables defined within a job - they are not globally available.

In order to work with all entities, even if they rely on some parameters or variables defined within a job, you can enter a job context, which will apply any job specific setting to the execution environment:

flowman:weather> job enter main year=2019
[INFO] Job argument year=2019

Now you also access entities, which (possibly indirectly) access the job parameter year. For example, the relation measuerements can now also be inspected:

flowman:weather/main> relation inspect measurements
Relation:
    name: measurements
    kind: file
  Requires - CREATE:
    file:s3a://dimajix-training/data/weather/2019
  Provides - CREATE:
    file:file:/tmp/weather/measurements
  Requires - READ:
    file:file:/tmp/weather/measurements/year=2019
  Provides - READ:
  Requires - WRITE:
  Provides - WRITE:
    file:file:/tmp/weather/measurements/year=2019

Finally, you can also leave the context again:

flowman:weather/main> job leave

2.3 SQL#

The Flowman shell also provides a simple way to execute arbitrary SQL commands, where tables refer to mappings. For example, in order to count the number of records in the stations_raw mapping, you can simply execute the following command:

flowman:weather> sql "SELECT COUNT(*) FROM stations_raw"
+--------+
|count(1)|
+--------+
|   29744|
+--------+

2.4 Executions#

Of course, you can also execute jobs or individual targets inside the shell:

flowman:weather> job build main year=2019
[INFO] 
[INFO] -------------------------------------------------------------------------------------------------------------
[INFO] Executing phases 'VALIDATE','CREATE','BUILD' for job 'main' in project 'weather' (version 1.0) with isolation
[INFO] Job argument year=2019
[INFO] 
[INFO] -------------------------------------------------------------------------------------------------------------
...

3. Next Lesson#

In the next lesson, we will talk about the lifecycle model of Flowman.