# Hive Union Table

The `hiveUnionTable` relation is a compound target for storing data in Hive that also provides extended schema
migration capabilities. In addition to the schema changes supported out of the box by Hive, this relation type
supports more invasive changes such as dropping columns or changing data types. This is accomplished by creating
a UNION view on top of possibly multiple Hive tables, each of them having a different, incompatible schema.
## Example

```yaml
relations:
  some_table:
    kind: hiveUnionTable
    # Specify the Hive database where the UNION view will be created
    viewDatabase: "crm"
    # Specify the name of the Hive UNION view
    view: "my_table"
    # Specify the Hive database where the underlying tables are to be created
    tableDatabase: "crm"
    # Specify the prefix of all Hive tables. Flowman will append numbers like 1,2,3,... to the prefix for
    # different schema versions
    tablePrefix: "zz_my_table"
    # Specify the location prefix of all Hive tables. Flowman will append numbers like 1,2,3,... to the prefix for
    # different schema versions
    locationPrefix: "/hive/crm/zz_my_table"
    external: true
    # Select file format
    format: parquet
    # Add partition column
    partitions:
      - name: landing_date
        type: string
        description: "The date on which the contract event was generated"
    # Explicitly specify the schema, which is mandatory for this relation type. In this case the schema is inferred
    # from a mapping called `some_mapping`
    schema:
      kind: mapping
      mapping: some_mapping
```
## Fields

* `kind` (mandatory) (string): `hiveUnionTable`
* `view` (mandatory) (string): Name of the view to be created and managed by Flowman.
* `viewDatabase` (optional) (string) (default: empty): Name of the Hive database where the view should be created.
* `tablePrefix` (mandatory) (string): Prefix of all tables which will be created and managed by Flowman. A version number will be appended to the prefix to form the full table name.
* `tableDatabase` (optional) (string) (default: empty): Name of the Hive database where the tables should be created.
* `locationPrefix` (optional) (string) (default: empty): Specifies the location prefix of the files stored in this Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `LOCATION` in a `CREATE TABLE` statement.
* `format` (optional) (string) (default: empty): Specifies the format of the files stored in this Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `FORMAT` in a `CREATE TABLE` statement.
* `options` (optional) (map:string) (default: empty): All key-value pairs specified in `options` are directly passed to Apache Spark for reading and/or writing to this relation.
* `rowFormat` (optional) (string) (default: empty): Specifies the row format of the files stored in this Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `ROW FORMAT` in a `CREATE TABLE` statement.
* `inputFormat` (optional) (string) (default: empty): Specifies the input format of the files stored in this Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `INPUT FORMAT` in a `CREATE TABLE` statement.
* `outputFormat` (optional) (string) (default: empty): Specifies the output format of the files stored in this Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `OUTPUT FORMAT` in a `CREATE TABLE` statement.
* `partitions` (optional) (list:partition) (default: empty): Specifies all partition columns. This is used both for creating Hive tables and for writing to and reading from them. Therefore, if you are working with partitioned Hive tables, you have to specify partition columns, even if Flowman is not used for creating the table.
* `properties` (optional) (map:string) (default: empty): Specifies additional properties of the Hive table. This setting is only used when Flowman is used to create the Hive table and is ignored otherwise. This corresponds to the `TBLPROPERTIES` in a `CREATE TABLE` statement.
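Several of the storage-related fields above map directly to clauses of Hive's `CREATE TABLE` statement. The following sketch shows how they could be combined for a text-based table; the relation name and the specific SerDe, input format, and output format classes are illustrative assumptions (standard Hadoop/Hive classes), not values prescribed by this page:

```yaml
relations:
  csv_table:                 # hypothetical relation name
    kind: hiveUnionTable
    viewDatabase: "crm"
    view: "my_csv_table"
    tableDatabase: "crm"
    tablePrefix: "zz_my_csv_table"
    # `rowFormat`, `inputFormat` and `outputFormat` take fully qualified class names,
    # as they would appear in a CREATE TABLE statement (illustrative choices)
    rowFormat: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
    inputFormat: "org.apache.hadoop.mapred.TextInputFormat"
    outputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    # Key-value pairs in `properties` end up in TBLPROPERTIES
    properties:
      "skip.header.line.count": "1"
    schema:
      kind: mapping
      mapping: some_mapping
```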
## Automatic Migrations

The core idea of the `hiveUnionTable` relation type is to support a broad range of non-destructive migrations. The
following changes to a data schema are supported:

* New columns can be added. This will be performed without creating a new underlying table.
* Existing columns can be dropped, although the previously existing columns will still be included in the Hive view on top of all physical tables.
* Column nullability can be changed. If changing from nullable to non-nullable, the current underlying table will be reused (since the new type is stricter than before); otherwise a new table will be created.
* The data type of an existing column can be changed. Depending on the change (i.e. a more or less restrictive data type), either the existing physical table will be reused, or a new underlying table will be created and the view will be adjusted.
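A type change, for example, can be expressed simply by changing the schema of the relation (or of the mapping it references). The inline schema below is a hypothetical illustration, widening `customer_id` from `integer` to `long`; Flowman would then decide whether the current physical table can be reused or whether a new table version and an adjusted view are required:

```yaml
# Hypothetical inline schema for the relation, replacing the mapping-based schema
# from the example above (field names and types are illustrative)
schema:
  kind: inline
  fields:
    - name: customer_id
      type: long        # was `integer` in the previous schema version
    - name: amount
      type: double
```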
## Output Modes

The `hiveUnionTable` relation supports the following output modes in a `relation` target:

| Output Mode      | Supported | Comments                                              |
|------------------|-----------|-------------------------------------------------------|
| `errorIfExists`  | yes       | Throw an error if the Hive table already exists       |
| `ignoreIfExists` | yes       | Do nothing if the Hive table already exists           |
| `overwrite`      | yes       | Overwrite the whole table or the specified partitions |
| `append`         | yes       | Append new records to the existing table              |
| `update`         | no        | -                                                     |
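As a sketch, a `relation` target writing to the relation from the example above could select one of the supported output modes via `mode`; the target name and the partition value are hypothetical placeholders:

```yaml
targets:
  some_table:                  # hypothetical target name
    kind: relation
    # Write the output of `some_mapping` into the `some_table` relation
    mapping: some_mapping
    relation: some_table
    # One of the supported output modes from the table above
    mode: overwrite
    # When the relation is partitioned, the target partition can be specified
    partition:
      landing_date: "2023-06-01"
```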
## Remarks

When using Hive union tables as data sinks in a `relation` target, Flowman will manage the whole lifecycle for you.
This means that

* Hive tables will be created and migrated during the `create` phase
* Hive tables will be populated with records and partitions will be added during the `build` phase
* Hive tables will be truncated or individual partitions will be dropped during the `clean` phase
* Hive tables will be removed during the `destroy` phase
## Supported Data Types

Please note that depending on the tools used for accessing the data and on the Spark version in use, not all data
types are supported:

* `VARCHAR(n)` and `CHAR(n)` require Spark 3.1+ to be supported. Older versions will fall back to writing `STRING` columns instead.
* `DATE` types might not be supported in older Impala versions (relevant if you plan to query the Hive tables using Impala).