Load Strategies #
The mutation pipeline supports three load strategies that control how data is stored on S3 and how Hive external tables are managed. Choose a strategy based on your data lifecycle needs.
full_load #
Replaces all existing data on every write. The pipeline deletes the previous file, uploads a fresh snapshot, and recreates the Hive table.
File layout #
1
2
3
<basePath>/
latest.parquet ← single file, overwritten on each write
Hive tables #
| Table name | Points to |
|---|---|
<tableName> | <basePath>/latest.parquet |
When to use #
- Dimension tables — small lookup data that changes infrequently (countries, categories, status codes)
- Reference data — configuration or mapping tables where you always want the full current state
- Small datasets — tables where the complete dataset fits comfortably in a single Parquet file
Pipeline steps #
-
Delete existing data at
<basePath>/latest.parquet -
Upload new Parquet file to
<basePath>/latest.parquet - DROP TABLE IF EXISTS + CREATE TABLE pointing to latest
Example #
1
2
3
4
5
6
7
8
9
{
"mutation": {
"loadStrategy": "full_load",
"type": "s3",
"bucket": "my-datalake",
"basePath": "warehouse/config/status_codes"
}
}
full_load_append #
Maintains both a latest snapshot and a historical log. Each write updates the current state and appends a timestamped copy to the history partition.
File layout #
1
2
3
4
5
6
7
<basePath>/
latest.parquet ← current snapshot (overwritten)
all.parquet/
year=2024/month=06/day=15/<uuid>.parquet ← historical partition
year=2024/month=06/day=16/<uuid>.parquet
year=2024/month=07/day=01/<uuid>.parquet
The partition structure above reflects the default partitioningFormat of "year/month/day". When partitioning is disabled (partitioning: false), historical files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value instead of write timestamp.
Hive tables #
| Table name | Points to |
|---|---|
<tableName>_latest | <basePath>/latest.parquet |
<tableName>_all | <basePath>/all.parquet/ |
When to use #
- Datasets needing both current state and history — product catalogs, pricing tables, inventory snapshots
- Slowly changing dimensions — where you want to query "what was the state on date X?"
- Compliance scenarios — where you need to prove what data looked like at a given point in time
Pipeline steps #
-
Delete existing data at
<basePath>/latest.parquet -
Upload new Parquet file to
<basePath>/latest.parquet -
Upload same Parquet file to
<basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet -
DROP + CREATE both
_latestand_alltables (with rollback on partial failure)
Example #
1
2
3
4
5
6
7
8
9
{
"mutation": {
"loadStrategy": "full_load_append",
"type": "s3",
"bucket": "my-datalake",
"basePath": "warehouse/products/catalog"
}
}
_all table is partitioned by date, so you can query historical snapshots
efficiently using WHERE year = '2024' AND month = '06'.append #
Only adds data. Never deletes or overwrites existing files. Each write creates a new partition file.
File layout #
1
2
3
4
5
6
<basePath>/
all.parquet/
year=2024/month=06/day=15/<uuid-1>.parquet
year=2024/month=06/day=15/<uuid-2>.parquet ← multiple writes per day
year=2024/month=06/day=16/<uuid-3>.parquet
The partition structure depends on the partitioningFormat setting. When partitioning is disabled (partitioning: false), files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value.
Hive tables #
| Table name | Points to |
|---|---|
<tableName> | <basePath>/all.parquet/ |
When to use #
- Event streams — clickstream data, page views, user interactions
- Time-series data — sensor readings, metrics, measurements
- Audit logs — system events, change logs, access records
- Any write-once data — where historical records should never be modified
Pipeline steps #
-
Upload Parquet file to
<basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet -
Recreate single Hive table pointing to
<basePath>/all.parquet/
Example #
1
2
3
4
5
6
7
8
9
{
"mutation": {
"loadStrategy": "append",
"type": "s3",
"bucket": "my-datalake",
"basePath": "warehouse/raw/click_events"
}
}
Choosing a Strategy #
| Question | Recommendation |
|---|---|
| Do I need historical versions of the data? | full_load_append |
| Is the data write-once (events, logs)? | append |
| Is this a small lookup/config table? | full_load |
| Do I need both "current state" and "history"? | full_load_append |
| Is the dataset large and ever-growing? | append |
| Will I always send the full dataset? | full_load |
Default Behavior #
When no loadStrategy is specified in the mutation configuration, the pipeline defaults to full_load. This is the safest default for most use cases — it ensures the table always reflects the latest complete dataset.
Partition Path Format #
For strategies that use partitioned storage (full_load_append and append), the partition path format is configurable via the partitioningFormat field:
partitioningFormat | Path structure |
|---|---|
"year/month/day" | year=2024/month=06/day=15/<uuid>.parquet |
"year/month" | year=2024/month=06/<uuid>.parquet |
"year" | year=2024/<uuid>.parquet |
The default is "year/month/day". The UUID filename prevents collisions when multiple writes occur within the same partition. The date-based partitioning enables efficient pruning when querying historical data.
Partitioning modes #
The partitioning field controls which date value is used for partition paths:
-
true(default) — Partitions by the write timestamp. Aload_timestampfield (DateTime, readOnly) is added to the endpoint definition and populated automatically at runtime. Additionally,load_timestamp_year(Integer, readOnly) andload_timestamp_month(Integer, readOnly) are injected as materialized partition columns for direct Parquet filtering without Hive metastore awareness. -
false— Disables partitioning entirely. Files are written flat underall.parquet/<uuid>.parquetwith no date directories. -
"field_name"— Partitions by the value of the named date/datetime field in the record data. Noload_timestampfield is added. -
Custom format string
(e.g.
"customer_id/event_date:year/event_date:month") — Partitions by a combination of fields and date components. See the Mutations guide for syntax details.
full_load strategy ignores partitioning entirely — it always writes a
single latest.parquet file regardless of the partitioning or
partitioningFormat settings.