# Batch ETL Integration Guide
Learn how to process large volumes of AL3 files using Integra in batch mode.
## Use Case
You have thousands of historical AL3 files (e.g., daily dumps from an agency management system) that need to be converted to a modern format for warehousing.
## Strategy 1: Sequential Scripting (Simple)
For moderate volumes (< 10GB), a simple shell script or Python script running locally or on a single VM is sufficient. Integra processes typical AL3 files in milliseconds.
```bash
#!/bin/bash
# Bulk process all .AL3 files in a directory
INPUT_DIR="./raw_al3"
OUTPUT_DIR="./processed_json"
INTEGRA_URL="http://localhost:8080/v1/parse"

mkdir -p "$OUTPUT_DIR"

for file in "$INPUT_DIR"/*.AL3; do
  filename=$(basename "$file")
  echo "Processing $filename..."

  # --fail (-f) makes curl exit non-zero on HTTP error responses,
  # so the check below catches parse failures, not just network errors
  if curl -sf -X POST \
    --data-binary "@$file" \
    "$INTEGRA_URL" > "$OUTPUT_DIR/${filename}.json"; then
    echo "✓ Success"
  else
    echo "✗ Failed"
  fi
done
```
## Strategy 2: Parallel Processing (High Volume)
For high volumes, leverage parallel workers. Integra handles concurrent requests efficiently.
### Using GNU Parallel
```bash
mkdir -p ./processed
find ./raw_al3 -name "*.AL3" | parallel -j 8 \
  "curl -sf --data-binary @{} http://localhost:8080/v1/parse > ./processed/{/.}.json"
```

- `-j 8`: Runs 8 concurrent uploads.
- `{/.}`: GNU Parallel's replacement string for the input path with directory and extension stripped, so `raw_al3/policy.AL3` becomes `processed/policy.json`.
### Using Python Multiprocessing
```python
from multiprocessing import Pool
import os

import requests

files = [f for f in os.listdir('./raw') if f.endswith('.AL3')]

def process_file(filename):
    """Send one AL3 file to Integra and persist the JSON response."""
    with open(f"./raw/{filename}", 'rb') as f:
        resp = requests.post('http://localhost:8080/v1/parse', data=f.read())
    if resp.ok:
        with open(f"./processed/{filename}.json", 'w') as out:
            out.write(resp.text)
        return True
    return False

if __name__ == '__main__':
    os.makedirs('./processed', exist_ok=True)  # ensure the output dir exists
    with Pool(processes=16) as pool:
        results = pool.map(process_file, files)
    print(f"Processed {sum(results)} files.")
```
## Strategy 3: Pipeline Orchestration (Airflow/Prefect)
Integrate Integra as a task in your ETL DAG.
**Airflow Example:**

1. **Sensor**: Watch an S3 bucket for new zip archives.
2. **Unzip**: Extract AL3 files to a temporary location.
3. **Map Task**: Send each AL3 file to Integra (HTTP Operator).
4. **Load**: Upload the resulting JSONs to Snowflake/Redshift.
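Below is a minimal TaskFlow sketch of steps 2–4, assuming Airflow 2.4+ (for `schedule` and dynamic task mapping). The S3 sensor and unzip logic are stubbed out, the sketch calls Integra with `requests` inside a mapped task rather than an HTTP Operator, and names such as `al3_batch_etl`, `INTEGRA_URL`, and `load_to_warehouse` are illustrative placeholders, not part of Integra:

```python
import glob

import pendulum
import requests
from airflow.decorators import dag, task

INTEGRA_URL = "http://integra:8080/v1/parse"  # assumed internal service URL

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def al3_batch_etl():
    @task
    def list_al3_files() -> list[str]:
        # Stub for steps 1-2: in practice, an S3 sensor plus an unzip task
        # would populate this staging directory.
        return glob.glob("/tmp/al3_staging/*.AL3")

    @task(retries=2)
    def parse_file(path: str) -> str:
        # Step 3: send one AL3 file to Integra; raising on HTTP errors
        # lets Airflow's retry machinery handle transient failures.
        with open(path, "rb") as f:
            resp = requests.post(INTEGRA_URL, data=f.read())
        resp.raise_for_status()
        out_path = path + ".json"
        with open(out_path, "w") as out:
            out.write(resp.text)
        return out_path

    @task
    def load_to_warehouse(paths: list[str]):
        # Step 4 placeholder: COPY the JSON outputs into Snowflake/Redshift.
        print(f"Loading {len(paths)} parsed files")

    # Dynamic task mapping fans parse_file out over every extracted file.
    load_to_warehouse(parse_file.expand(path=list_al3_files()))

al3_batch_etl()
```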
## Recommendations

- **Resource Sizing**: Integra is CPU-bound. If running locally, allocate 1 CPU core per concurrent worker.
- **Memory**: Per-request memory usage is low, but the JSON response for a large file can be sizable. Request NDJSON output (`Accept: application/x-ndjson`) so the client can stream the response and minimize memory overhead during ETL; see the example after this list.
- **Aggregation**: If loading into a data warehouse, concatenate multiple small JSON outputs into a single large NDJSON/JSONL file before running the `COPY` command, which improves import performance (also shown below).
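A sketch of both recommendations, assuming Integra honors the `Accept` header described above (file names and paths are illustrative):

```bash
# Request newline-delimited JSON so the client can process the response
# line by line instead of buffering one large JSON document.
curl -s -H "Accept: application/x-ndjson" \
  --data-binary "@policy.AL3" \
  http://localhost:8080/v1/parse > policy.ndjson

# Aggregate many small outputs into one JSONL file before the warehouse COPY;
# `jq -c .` emits each top-level JSON value as a single compact line.
jq -c . ./processed_json/*.json > batch.jsonl
```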