LakeQL - Guides - Load Strategies

Load Strategies #

The mutation pipeline supports three load strategies that control how data is stored on S3 and how Hive external tables are managed. Choose a strategy based on your data lifecycle needs.

full_load #

Replaces all existing data on every write. The pipeline deletes the previous file, uploads a fresh snapshot, and recreates the Hive table.

File layout #

1
2
3
<basePath>/
  latest.parquet        ← single file, overwritten on each write

Hive tables #

Table name	Points to
`<tableName>`	`<basePath>/latest.parquet`

When to use #

Dimension tables — small lookup data that changes infrequently (countries, categories, status codes)
Reference data — configuration or mapping tables where you always want the full current state
Small datasets — tables where the complete dataset fits comfortably in a single Parquet file

Pipeline steps #

Delete existing data at <basePath>/latest.parquet
Upload new Parquet file to <basePath>/latest.parquet
DROP TABLE IF EXISTS + CREATE TABLE pointing to latest

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/config/status_codes"
  }
}

full_load_append #

Maintains both a latest snapshot and a historical log. Each write updates the current state and appends a timestamped copy to the history partition.

File layout #

1
2
3
4
5
6
7
<basePath>/
  latest.parquet                              ← current snapshot (overwritten)
  all.parquet/
    year=2024/month=06/day=15/<uuid>.parquet   ← historical partition
    year=2024/month=06/day=16/<uuid>.parquet
    year=2024/month=07/day=01/<uuid>.parquet

The partition structure above reflects the default partitioningFormat of "year/month/day". When partitioning is disabled (partitioning: false), historical files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value instead of write timestamp.

Hive tables #

Table name	Points to
`<tableName>_latest`	`<basePath>/latest.parquet`
`<tableName>_all`	`<basePath>/all.parquet/`

When to use #

Datasets needing both current state and history — product catalogs, pricing tables, inventory snapshots
Slowly changing dimensions — where you want to query "what was the state on date X?"
Compliance scenarios — where you need to prove what data looked like at a given point in time

Pipeline steps #

Delete existing data at <basePath>/latest.parquet
Upload new Parquet file to <basePath>/latest.parquet
Upload same Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
DROP + CREATE both _latest and _all tables (with rollback on partial failure)

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load_append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/products/catalog"
  }
}

The _all table is partitioned by date, so you can query historical snapshots efficiently using WHERE year = '2024' AND month = '06'.

append #

Only adds data. Never deletes or overwrites existing files. Each write creates a new partition file.

File layout #

1
2
3
4
5
6
<basePath>/
  all.parquet/
    year=2024/month=06/day=15/<uuid-1>.parquet
    year=2024/month=06/day=15/<uuid-2>.parquet   ← multiple writes per day
    year=2024/month=06/day=16/<uuid-3>.parquet

The partition structure depends on the partitioningFormat setting. When partitioning is disabled (partitioning: false), files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value.

Hive tables #

Table name	Points to
`<tableName>`	`<basePath>/all.parquet/`

When to use #

Event streams — clickstream data, page views, user interactions
Time-series data — sensor readings, metrics, measurements
Audit logs — system events, change logs, access records
Any write-once data — where historical records should never be modified

Pipeline steps #

Upload Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
Recreate single Hive table pointing to <basePath>/all.parquet/

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/raw/click_events"
  }
}

Choosing a Strategy #

Question	Recommendation
Do I need historical versions of the data?	`full_load_append`
Is the data write-once (events, logs)?	`append`
Is this a small lookup/config table?	`full_load`
Do I need both "current state" and "history"?	`full_load_append`
Is the dataset large and ever-growing?	`append`
Will I always send the full dataset?	`full_load`

Default Behavior #

When no loadStrategy is specified in the mutation configuration, the pipeline defaults to full_load. This is the safest default for most use cases — it ensures the table always reflects the latest complete dataset.

Partition Path Format #

For strategies that use partitioned storage (full_load_append and append), the partition path format is configurable via the partitioningFormat field:

`partitioningFormat`	Path structure
`"year/month/day"`	`year=2024/month=06/day=15/<uuid>.parquet`
`"year/month"`	`year=2024/month=06/<uuid>.parquet`
`"year"`	`year=2024/<uuid>.parquet`

The default is "year/month/day". The UUID filename prevents collisions when multiple writes occur within the same partition. The date-based partitioning enables efficient pruning when querying historical data.

Partitioning modes #

The partitioning field controls which date value is used for partition paths:

true (default) — Partitions by the write timestamp. A load_timestamp field (DateTime, readOnly) is added to the endpoint definition and populated automatically at runtime. Additionally, load_timestamp_year (Integer, readOnly) and load_timestamp_month (Integer, readOnly) are injected as materialized partition columns for direct Parquet filtering without Hive metastore awareness.
false — Disables partitioning entirely. Files are written flat under all.parquet/<uuid>.parquet with no date directories.
"field_name" — Partitions by the value of the named date/datetime field in the record data. No load_timestamp field is added.
Custom format string (e.g. "customer_id/event_date:year/event_date:month" ) — Partitions by a combination of fields and date components. See the Mutations guide for syntax details.

The full_load strategy ignores partitioning entirely — it always writes a single latest.parquet file regardless of the partitioning or partitioningFormat settings.

Load Strategies #

The mutation pipeline supports three load strategies that control how data is stored on S3 and how Hive external tables are managed. Choose a strategy based on your data lifecycle needs.

full_load #

Replaces all existing data on every write. The pipeline deletes the previous file, uploads a fresh snapshot, and recreates the Hive table.

File layout #

1
2
3
<basePath>/
  latest.parquet        ← single file, overwritten on each write

Hive tables #

Table name	Points to
`<tableName>`	`<basePath>/latest.parquet`

When to use #

Dimension tables — small lookup data that changes infrequently (countries, categories, status codes)
Reference data — configuration or mapping tables where you always want the full current state
Small datasets — tables where the complete dataset fits comfortably in a single Parquet file

Pipeline steps #

Delete existing data at <basePath>/latest.parquet
Upload new Parquet file to <basePath>/latest.parquet
DROP TABLE IF EXISTS + CREATE TABLE pointing to latest

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/config/status_codes"
  }
}

full_load_append #

Maintains both a latest snapshot and a historical log. Each write updates the current state and appends a timestamped copy to the history partition.

File layout #

1
2
3
4
5
6
7
<basePath>/
  latest.parquet                              ← current snapshot (overwritten)
  all.parquet/
    year=2024/month=06/day=15/<uuid>.parquet   ← historical partition
    year=2024/month=06/day=16/<uuid>.parquet
    year=2024/month=07/day=01/<uuid>.parquet

Hive tables #

Table name	Points to
`<tableName>_latest`	`<basePath>/latest.parquet`
`<tableName>_all`	`<basePath>/all.parquet/`

When to use #

Datasets needing both current state and history — product catalogs, pricing tables, inventory snapshots
Slowly changing dimensions — where you want to query "what was the state on date X?"
Compliance scenarios — where you need to prove what data looked like at a given point in time

Pipeline steps #

Delete existing data at <basePath>/latest.parquet
Upload new Parquet file to <basePath>/latest.parquet
Upload same Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
DROP + CREATE both _latest and _all tables (with rollback on partial failure)

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load_append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/products/catalog"
  }
}

The _all table is partitioned by date, so you can query historical snapshots efficiently using WHERE year = '2024' AND month = '06'.

append #

Only adds data. Never deletes or overwrites existing files. Each write creates a new partition file.

File layout #

1
2
3
4
5
6
<basePath>/
  all.parquet/
    year=2024/month=06/day=15/<uuid-1>.parquet
    year=2024/month=06/day=15/<uuid-2>.parquet   ← multiple writes per day
    year=2024/month=06/day=16/<uuid-3>.parquet

Hive tables #

Table name	Points to
`<tableName>`	`<basePath>/all.parquet/`

When to use #

Event streams — clickstream data, page views, user interactions
Time-series data — sensor readings, metrics, measurements
Audit logs — system events, change logs, access records
Any write-once data — where historical records should never be modified

Pipeline steps #

Upload Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
Recreate single Hive table pointing to <basePath>/all.parquet/

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/raw/click_events"
  }
}

Choosing a Strategy #

Question	Recommendation
Do I need historical versions of the data?	`full_load_append`
Is the data write-once (events, logs)?	`append`
Is this a small lookup/config table?	`full_load`
Do I need both "current state" and "history"?	`full_load_append`
Is the dataset large and ever-growing?	`append`
Will I always send the full dataset?	`full_load`

Default Behavior #

Partition Path Format #

For strategies that use partitioned storage (full_load_append and append), the partition path format is configurable via the partitioningFormat field:

`partitioningFormat`	Path structure
`"year/month/day"`	`year=2024/month=06/day=15/<uuid>.parquet`
`"year/month"`	`year=2024/month=06/<uuid>.parquet`
`"year"`	`year=2024/<uuid>.parquet`

Partitioning modes #

The partitioning field controls which date value is used for partition paths:

true (default) — Partitions by the write timestamp. A load_timestamp field (DateTime, readOnly) is added to the endpoint definition and populated automatically at runtime. Additionally, load_timestamp_year (Integer, readOnly) and load_timestamp_month (Integer, readOnly) are injected as materialized partition columns for direct Parquet filtering without Hive metastore awareness.
false — Disables partitioning entirely. Files are written flat under all.parquet/<uuid>.parquet with no date directories.
"field_name" — Partitions by the value of the named date/datetime field in the record data. No load_timestamp field is added.
Custom format string (e.g. "customer_id/event_date:year/event_date:month" ) — Partitions by a combination of fields and date components. See the Mutations guide for syntax details.

The full_load strategy ignores partitioning entirely — it always writes a single latest.parquet file regardless of the partitioning or partitioningFormat settings.