LakeQL
Introduction
  • Overview
  • Key Concepts
  • Package Map
Getting Started
  • Prerequisites
  • Quickstart
  • Environment Configuration
  • First Run
Architecture
  • System Overview
  • Data Flow
  • Request Lifecycle
Configuration
  • Environment Variables
  • Authentication
  • Trino Connection
create-app
  • Usage
  • Template Structure
  • Post Creation
Contributing
  • Local Development
  • Contribution Guide
Guides
  • Custom Resolvers
  • Extending Schema
  • Deploying
  • Mutations
  • Load Strategies
GitHub
LakeQL
  1. LakeQL
  2. Guides
  3. Load Strategies

On this page

  1. Load Strategies
  2. full_load
    1. File layout
    2. Hive tables
    3. When to use
    4. Pipeline steps
    5. Example
  3. full_load_append
    1. File layout
    2. Hive tables
    3. When to use
    4. Pipeline steps
    5. Example
  4. append
    1. File layout
    2. Hive tables
    3. When to use
    4. Pipeline steps
    5. Example
  5. Choosing a Strategy
  6. Default Behavior
  7. Partition Path Format
    1. Partitioning modes

Load Strategies

Understand the three load strategies (full_load, full_load_append, append) and when to use each for your mutation pipeline.

Load Strategies #

The mutation pipeline supports three load strategies that control how data is stored on S3 and how Hive external tables are managed. Choose a strategy based on your data lifecycle needs.

full_load #

Replaces all existing data on every write. The pipeline deletes the previous file, uploads a fresh snapshot, and recreates the Hive table.

File layout #

1
2
3
<basePath>/
  latest.parquet        ← single file, overwritten on each write

Hive tables #

Table namePoints to
<tableName><basePath>/latest.parquet

When to use #

  • Dimension tables — small lookup data that changes infrequently (countries, categories, status codes)
  • Reference data — configuration or mapping tables where you always want the full current state
  • Small datasets — tables where the complete dataset fits comfortably in a single Parquet file

Pipeline steps #

  1. Delete existing data at <basePath>/latest.parquet
  2. Upload new Parquet file to <basePath>/latest.parquet
  3. DROP TABLE IF EXISTS + CREATE TABLE pointing to latest

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/config/status_codes"
  }
}

full_load_append #

Maintains both a latest snapshot and a historical log. Each write updates the current state and appends a timestamped copy to the history partition.

File layout #

1
2
3
4
5
6
7
<basePath>/
  latest.parquet                              ← current snapshot (overwritten)
  all.parquet/
    year=2024/month=06/day=15/<uuid>.parquet   ← historical partition
    year=2024/month=06/day=16/<uuid>.parquet
    year=2024/month=07/day=01/<uuid>.parquet

The partition structure above reflects the default partitioningFormat of "year/month/day". When partitioning is disabled (partitioning: false), historical files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value instead of write timestamp.

Hive tables #

Table namePoints to
<tableName>_latest<basePath>/latest.parquet
<tableName>_all<basePath>/all.parquet/

When to use #

  • Datasets needing both current state and history — product catalogs, pricing tables, inventory snapshots
  • Slowly changing dimensions — where you want to query "what was the state on date X?"
  • Compliance scenarios — where you need to prove what data looked like at a given point in time

Pipeline steps #

  1. Delete existing data at <basePath>/latest.parquet
  2. Upload new Parquet file to <basePath>/latest.parquet
  3. Upload same Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
  4. DROP + CREATE both _latest and _all tables (with rollback on partial failure)

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "full_load_append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/products/catalog"
  }
}
The _all table is partitioned by date, so you can query historical snapshots efficiently using WHERE year = '2024' AND month = '06'.

append #

Only adds data. Never deletes or overwrites existing files. Each write creates a new partition file.

File layout #

1
2
3
4
5
6
<basePath>/
  all.parquet/
    year=2024/month=06/day=15/<uuid-1>.parquet
    year=2024/month=06/day=15/<uuid-2>.parquet   ← multiple writes per day
    year=2024/month=06/day=16/<uuid-3>.parquet

The partition structure depends on the partitioningFormat setting. When partitioning is disabled (partitioning: false), files are written flat under all.parquet/<uuid>.parquet. When using field-based partitioning, files are grouped by the specified field's date value.

Hive tables #

Table namePoints to
<tableName><basePath>/all.parquet/

When to use #

  • Event streams — clickstream data, page views, user interactions
  • Time-series data — sensor readings, metrics, measurements
  • Audit logs — system events, change logs, access records
  • Any write-once data — where historical records should never be modified

Pipeline steps #

  1. Upload Parquet file to <basePath>/all.parquet/year=YYYY/month=MM/day=DD/<uuid>.parquet
  2. Recreate single Hive table pointing to <basePath>/all.parquet/

Example #

1
2
3
4
5
6
7
8
9
{
  "mutation": {
    "loadStrategy": "append",
    "type": "s3",
    "bucket": "my-datalake",
    "basePath": "warehouse/raw/click_events"
  }
}

Choosing a Strategy #

QuestionRecommendation
Do I need historical versions of the data?full_load_append
Is the data write-once (events, logs)?append
Is this a small lookup/config table?full_load
Do I need both "current state" and "history"?full_load_append
Is the dataset large and ever-growing?append
Will I always send the full dataset?full_load

Default Behavior #

When no loadStrategy is specified in the mutation configuration, the pipeline defaults to full_load. This is the safest default for most use cases — it ensures the table always reflects the latest complete dataset.

Partition Path Format #

For strategies that use partitioned storage (full_load_append and append), the partition path format is configurable via the partitioningFormat field:

partitioningFormatPath structure
"year/month/day"year=2024/month=06/day=15/<uuid>.parquet
"year/month"year=2024/month=06/<uuid>.parquet
"year"year=2024/<uuid>.parquet

The default is "year/month/day". The UUID filename prevents collisions when multiple writes occur within the same partition. The date-based partitioning enables efficient pruning when querying historical data.

Partitioning modes #

The partitioning field controls which date value is used for partition paths:

  • true (default) — Partitions by the write timestamp. A load_timestamp field (DateTime, readOnly) is added to the endpoint definition and populated automatically at runtime. Additionally, load_timestamp_year (Integer, readOnly) and load_timestamp_month (Integer, readOnly) are injected as materialized partition columns for direct Parquet filtering without Hive metastore awareness.
  • false — Disables partitioning entirely. Files are written flat under all.parquet/<uuid>.parquet with no date directories.
  • "field_name" — Partitions by the value of the named date/datetime field in the record data. No load_timestamp field is added.
  • Custom format string (e.g. "customer_id/event_date:year/event_date:month" ) — Partitions by a combination of fields and date components. See the Mutations guide for syntax details.
The full_load strategy ignores partitioning entirely — it always writes a single latest.parquet file regardless of the partitioning or partitioningFormat settings.

Previous page

Mutations