Relations#

Physical data sources (like Hive tables, relational SQL databases, files etc.) are specified using so-called relations. Data flows can read and write from and to relations via appropriate mappings (Read Mapping) or build targets (Relation Target).

A relation always contains all required information to connect to the data source (for example the table name and a JDBC Connection in case of a JDBC relation).

Some relations (like files, Hive tables and views) can also be directly created from Flowman, thereby providing a single tool for schema management and data ingestion into tables.

Schema Conversion#

Most relations support implicit schema conversions, which means that

when reading from a relation with a user specified schema, the actual schema of the physical data source is converted to the desired user defined schema
when writing to a relation, the schema of the data written is automatically converted to the actual schema of the data sink

The details of these conversions can be controlled via some config variables

flowman.default.relation.input.columnMismatchPolicy (default is IGNORE)
flowman.default.relation.input.typeMismatchPolicy (default is IGNORE)
flowman.default.relation.input.charVarcharPolicy (default is PAD_AND_TRUUNCATE)
flowman.default.relation.output.columnMismatchPolicy (default is ADD_REMOVE_COLUMNS)
flowman.default.relation.output.typeMismatchPolicy (default is CAST_ALWAYS)
flowman.default.relation.output.charVarcharPolicy (default is PAD_AND_TRUUNCATE)

The schema conversion is implemented using two aspects. The first is a mismatch between column (names). This can be configured using the columnMismatchPolicy as follows. Basically the idea is that

IGNORE will simply pass through the input columns unchanged
ERROR will fail the build once a mismatch between actual and requested schema is detected
ADD_COLUMNS_OR_IGNORE will add (NULL) columns from the requested schema to the input schema, and will keep columns in the input schema which are not present in the requested schema
ADD_COLUMNS_OR_ERROR will add (NULL) columns from the requested schema to the input schema, but will fail the build if the input schema contains columns not present in the requested schema
REMOVE_COLUMNS_OR_IGNORE will remove columns from the input schema which are not present in the requested schema
REMOVE_COLUMNS_OR_ERROR will remove columns from the input schema which are not present in the requested schema and will fail if the input schema is missing requested columns
ADD_REMOVE_COLUMNS will essentially pass through the requested schema as is

Column Mismatch Policy	Input Column	Requested Column	Result
`IGNORE`	present	missing	present
`IGNORE`	missing	present	missing
`ERROR`	present	missing	build error
`ERROR`	missing	present	build error
`ADD_COLUMNS_OR_IGNORE`	present	missing	present
`ADD_COLUMNS_OR_IGNORE`	missing	present	present
`ADD_COLUMNS_OR_ERROR`	present	missing	build error
`ADD_COLUMNS_OR_ERROR`	missing	present	missing
`REMOVE_COLUMNS_OR_IGNORE`	present	missing	missing
`REMOVE_COLUMNS_OR_IGNORE`	missing	present	missing
`REMOVE_COLUMNS_OR_ERROR`	present	missing	missing
`REMOVE_COLUMNS_OR_ERROR`	missing	present	build error
`ADD_REMOVE_COLUMNS`	present	missing	missing
`ADD_REMOVE_COLUMNS`	missing	present	present

The second aspect is the conversion of data types. Again this can be configured using the typeMismatchPolicy as follows:

Type Mismatch Policy	Input vs Requested Data Type	Result
`IGNORE`	Source can be cast to destination	Input Data Type
`IGNORE`	Source cannot be safely cast to destination	Input Data Type
`ERROR`	Source can be cast to destination	build error
`ERROR`	Source cannot be safely cast to destination	build error
`CAST_COMPATIBLE_OR_ERROR`	Source can be cast to destination	Target Data Type
`CAST_COMPATIBLE_OR_ERROR`	Source cannot be safely cast to destination	build error
`CAST_COMPATIBLE_OR_IGNORE`	Source can be safely cast to destination	Target Data Type
`CAST_COMPATIBLE_OR_IGNORE`	Source cannot be safely cast to destination	Input Data Type
`CAST_ALWAYS`	Source can be safely cast to destination	Target Data Type
`CAST_ALWAYS`	Source cannot be safely cast to destination	Target Data Type

The two options flowman.default.relation.input.charVarcharPolicy and flowman.default.relation.output.charVarcharPolicy control how Flowman will treat VARCHAR(n) and CHAR(n) data types. The possible values are

IGNORE - Do not apply any length restrictions
PAD_AND_TRUNCATE - Truncate VARCHAR(n)/CHAR(n) strings which are too long and pad CHAR(n) strings which are too short
PAD - Pad CHAR(n) strings which are too short
TRUNCATE - Truncate VARCHAR(n)/CHAR(n) strings which are too long

Relation Types#

Flowman directly provides support for the most important data sources, which are also supported directly by Spark. Additional data sources can be implemented as plugins if required.