[DE Design Pattern]02-2. Incremental Load

2025-01-31

2026-01-31

02. Incremental Load Pattern#

Incremental Load is a pattern that fetches only the data changed or added since the last run.

Delta Column-Based Implementation#

from pyspark.sql import SparkSession, functions as F


def incremental_load_delta_column(
    spark: SparkSession,
    input_path: str,
    output_path: str,
    date_from: str,   # e.g. "2024-01-01 10:00:00"
    date_to: str,     # e.g. "2024-01-01 11:00:00"
) -> None:
    """
    Delta Column-based incremental loader.
    """
    # Each input line is expected to hold one JSON document.
    raw_data = spark.read.text(input_path)

    parsed = raw_data.select(
        F.from_json(F.col("value"), "id STRING, ingestion_time TIMESTAMP, payload STRING")
        .alias("data")
    ).select("data.*")

    # Key point: filter by a time range (the Ingestion Window).
    # Column.between() is the SQL BETWEEN predicate and includes both endpoints.
    incremental = parsed.filter(
        F.col("ingestion_time").between(date_from, date_to)
    )

    incremental.write.mode("append").parquet(output_path)

The key is to bound the Ingestion Window with a BETWEEN clause.
If the filter is only ingestion_time > last_run_time, a backfill pulls the entire dataset in one go, which makes it no different from a Full Load.
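
To make the volume difference concrete, here is a minimal pure-Python sketch that needs no Spark. The sample timestamps and the helper names `rows_after` and `rows_in_window` are hypothetical, introduced only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sample: one row ingested every 10 minutes over 24 hours.
start = datetime(2024, 1, 1)
ingestion_times = [start + timedelta(minutes=10 * i) for i in range(144)]


def rows_after(times, last_run_time):
    """Open-ended filter: ingestion_time > last_run_time."""
    return [t for t in times if t > last_run_time]


def rows_in_window(times, date_from, date_to):
    """Bounded filter, inclusive on both ends like SQL BETWEEN."""
    return [t for t in times if date_from <= t <= date_to]


# A backfill restarts from the beginning of the day.
open_ended = rows_after(ingestion_times, start - timedelta(seconds=1))
bounded = rows_in_window(ingestion_times, start, start + timedelta(hours=1))

print(len(open_ended))  # 144: every row in the sample
print(len(bounded))     # 7: only the rows of the first hour
```

The open-ended filter grows with the amount of history being replayed, while the bounded one stays proportional to the window size.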

Partition-Based Implementation#

Use this when the source data is already physically separated into time-based partitions.

# Partition-based incremental loader
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

with DAG(
    "incremental_partition_loader",
    schedule_interval="@hourly",
    start_date=datetime(2024, 1, 1),  # example value; a start_date is required for scheduling
) as dag:

    # 1) Check that the partition exists (readiness)
    wait_for_partition = FileSensor(
        task_id="wait_for_partition",
        # execution_date is named logical_date from Airflow 2.2 onward
        filepath="/data/input/date={{ ds }}/hour={{ execution_date.hour }}",
        mode="reschedule",  # does not hold a worker slot while waiting
    )

    # 2) Load only that partition
    load_partition = SparkKubernetesOperator(
        task_id="load_partition",
        application_file="load_job_spec.yaml",
        # use immutable macros such as {{ ds }} in the job arguments
    )

    wait_for_partition >> load_partition
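
The reason no state has to be stored is that the partition to process is a pure function of the run's execution time. The sketch below illustrates this outside Airflow; `partition_path` is a hypothetical helper, and the `date=.../hour=...` layout is assumed from the sensor's `filepath`.

```python
from datetime import datetime


def partition_path(base_path: str, logical_date: datetime) -> str:
    """Build the hourly partition path for one run (layout is an assumption)."""
    return (
        f"{base_path}/date={logical_date:%Y-%m-%d}"
        f"/hour={logical_date.hour}"
    )


# Re-running the 10:00 run of 2024-01-01 always targets the same partition.
path = partition_path("/data/input", datetime(2024, 1, 1, 10))
print(path)  # /data/input/date=2024-01-01/hour=10
```

Because the same input always yields the same path, a rerun or backfill of a given hour reads exactly the partition the original run read.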

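The decisive differences between the two approaches:

Aspect	Delta Column	Partition-based
State management	Must store the last ingestion_time	Not needed (inferred from the execution time)
Backfill	Turns into a Full Load without a window limit	Naturally split per partition
Source requirement	A delta column exists	Time-based partition layout
Concurrent backfill	Possible with a window limit	Each partition can run independently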

3. The Hard Delete Problem#

The fundamental limitation is that physical deletes cannot be detected.

from pyspark.sql import SparkSession, functions as F

# Source DB: row_id=5 is DELETEd → the row physically disappears
# Target DB: row_id=5 still exists → out of sync with the source

# Fix 1: Soft Delete. The producer issues an UPDATE instead of a DELETE
# UPDATE devices SET is_deleted = true, updated_at = NOW() WHERE id = 5


def incremental_load_with_soft_delete(
    spark: SparkSession, input_path: str, output_path: str, date_from: str, date_to: str
) -> None:
    """Incremental load from a source that applies Soft Delete."""
    incremental = (
        spark.read.parquet(input_path)
        .filter(F.col("updated_at").between(date_from, date_to))
    )

    # Rows with is_deleted=true are loaded as well → the target can reflect the deletion
    incremental.write.mode("append").parquet(output_path)


# Fix 2: Insert-Only (Append-Only) table
# Record every change as an INSERT; consumers reconstruct the latest state
# | id | action  | updated_at |
# | 5  | created | 10:00      |
# | 5  | updated | 10:30      |
# | 5  | deleted | 11:00      |  ← the DELETE is recorded as an INSERT too
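
With an Insert-Only table, the consumer has to rebuild the current state from the change log. The following is a minimal pure-Python sketch of that step; the `change_log` rows and the `latest_state` helper are hypothetical and only mirror the shape of the table in the comments.

```python
from typing import Dict, List, Tuple

# Hypothetical change log: (id, action, updated_at).
# updated_at uses zero-padded HH:MM strings, so string comparison orders them correctly.
change_log: List[Tuple[int, str, str]] = [
    (5, "created", "10:00"),
    (7, "created", "10:10"),
    (5, "updated", "10:30"),
    (5, "deleted", "11:00"),
]


def latest_state(log: List[Tuple[int, str, str]]) -> Dict[int, str]:
    """Return the last action per id, dropping ids whose last action is 'deleted'."""
    latest: Dict[int, Tuple[str, str]] = {}
    for row_id, action, updated_at in log:
        # Keep the entry with the greatest updated_at for each id.
        if row_id not in latest or updated_at >= latest[row_id][1]:
            latest[row_id] = (action, updated_at)
    return {rid: action for rid, (action, _) in latest.items() if action != "deleted"}


print(latest_state(change_log))  # {7: 'created'}
```

Row 5 disappears from the reconstructed state because its most recent entry is a delete, which is exactly the information a hard delete would have lost.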

4. Backfilling and the Ingestion Window#

The Delta Column approach is especially risky during a backfill.

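from datetime import datetime, timedelta

from pyspark.sql import SparkSession, functions as F


def safe_backfill_with_window(
    spark: SparkSession,
    input_path: str,
    output_path: str,
    date_from: str,
    date_to: str,
    window_hours: int = 1,
) -> None:
    """
    Bound the Ingestion Window so data volume stays stable even during a backfill.

    Key idea: delta_column BETWEEN window_start AND window_start + INTERVAL '1 HOUR'
    → a backfill and a regular run always process the same-sized time slice
    → several backfill jobs can run concurrently
    """
    # Reject windows wider than window_hours so one run cannot turn into a Full Load.
    span = datetime.fromisoformat(date_to) - datetime.fromisoformat(date_from)
    if span > timedelta(hours=window_hours):
        raise ValueError(f"window exceeds {window_hours} hour(s): {date_from} to {date_to}")

    data = spark.read.parquet(input_path)

    windowed = data.filter(
        F.col("ingestion_time").between(date_from, date_to)
    )

    windowed.write.mode("append").parquet(output_path)


# Backfill example: each hourly window can run in parallel
# safe_backfill_with_window(spark, path, out, "2024-01-01 00:00", "2024-01-01 01:00")
# safe_backfill_with_window(spark, path, out, "2024-01-01 01:00", "2024-01-01 02:00")
# safe_backfill_with_window(spark, path, out, "2024-01-01 02:00", "2024-01-01 03:00")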
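
Rather than writing each window by hand, the backfill range can be split programmatically. This is a sketch under stated assumptions: `split_into_windows` is a hypothetical helper, not part of Spark or Airflow, and it emits strings in the same format as the example calls above.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_into_windows(
    date_from: datetime, date_to: datetime, window_hours: int = 1
) -> List[Tuple[str, str]]:
    """Split the range from date_from to date_to into consecutive windows."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    step = timedelta(hours=window_hours)
    fmt = "%Y-%m-%d %H:%M"
    windows = []
    cursor = date_from
    while cursor < date_to:
        # The last window is shortened if the range is not a multiple of the step.
        end = min(cursor + step, date_to)
        windows.append((cursor.strftime(fmt), end.strftime(fmt)))
        cursor = end
    return windows


windows = split_into_windows(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 3))
for w_from, w_to in windows:
    print(w_from, "->", w_to)
```

Each tuple can be passed as `date_from` and `date_to` to `safe_backfill_with_window`. Adjacent windows share a boundary timestamp and BETWEEN includes both endpoints, so a row stamped exactly on a boundary would be loaded by both runs; a half-open filter (`>= date_from AND < date_to`) is one way to avoid that.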

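Caveats When Using Event Time as the Delta Column#

Using event_time instead of ingestion_time as the delta column introduces the Late Data problem.
When an event arrives later than it actually occurred, it belongs to a time range that has already been processed, so the loader misses it.
Where possible, it is safer to use ingestion_time (the time the system received the data) as the delta column.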
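
The effect is easy to reproduce with a few rows. In this minimal sketch the `events` data and the `load` helper are hypothetical; event "b" occurred at 09:50 but arrived at 11:05, after the run covering 09:00 to 10:00 had already finished.

```python
from datetime import datetime

# Hypothetical events: (id, event_time, ingestion_time).
events = [
    ("a", datetime(2024, 1, 1, 10, 20), datetime(2024, 1, 1, 10, 21)),
    ("b", datetime(2024, 1, 1, 9, 50), datetime(2024, 1, 1, 11, 5)),   # late arrival
    ("c", datetime(2024, 1, 1, 11, 10), datetime(2024, 1, 1, 11, 11)),
]


def load(rows, column, date_from, date_to):
    """Return ids whose chosen delta column is >= date_from and < date_to."""
    idx = {"event_time": 1, "ingestion_time": 2}[column]
    return [r[0] for r in rows if date_from <= r[idx] < date_to]


# The run that covers 11:00 to 12:00.
run_from, run_to = datetime(2024, 1, 1, 11), datetime(2024, 1, 1, 12)
by_event = load(events, "event_time", run_from, run_to)
by_ingestion = load(events, "ingestion_time", run_from, run_to)

print(by_event)      # ['c']
print(by_ingestion)  # ['b', 'c']
```

Filtering on event_time never picks up "b" in this run, and the earlier run could not see it because it had not arrived yet. Filtering on ingestion_time loads it in the run during which it actually arrived.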

Concept

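Incremental Load : A pattern that loads only the data changed or added since the last run. Suited to large, continuously growing datasets
Delta Column : A column that records when a row changed (ingestion_time, updated_at). Used as the filter condition for incremental loads
Ingestion Window : A technique that explicitly bounds the time range processed by a single run. Prevents data volume from exploding during a backfill and enables parallel runs
Partition-based load : An approach that uses the time-partition layout to derive the processing target implicitly from the execution time. No state management required
Hard Delete : Physical deletion of a row. A fundamental limitation, because an incremental loader cannot detect it
Soft Delete : Expressing deletion with an is_deleted flag instead of a physical delete. DELETE becomes UPDATE
Insert-Only (Append-Only) Table : A table that records every change (create, update, delete) as an INSERT. Consumers must reconstruct the latest state
Backfilling : Reprocessing historical data. Without a window limit, an incremental load risks turning into a Full Load
Late Data : Data that arrives later than the time the event occurred. Using event_time as the delta column risks missing it
Immutable Execution Time : Execution-time macros such as Airflow's {{ ds }}. They do not change during a backfill, which guarantees reproducibility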

Further Review

Airflow's execution_date vs logical_date : How the notion of execution time changed in Airflow 2.x and how that affects backfills
Watermark : The mechanism Spark Structured Streaming uses to handle Late Data