Features Overview

Hubio Sync is a data synchronization tool for moving data between databases, APIs, data warehouses, and cloud storage. This guide covers all major features and capabilities.

Core Features

1. Multi-Source Data Synchronization

Sync data from multiple source types to multiple destinations.

Supported Sources:

  • Relational Databases: MySQL, PostgreSQL, SQLite, SQL Server, Oracle
  • REST APIs: Any HTTP/HTTPS API with authentication support
  • GraphQL APIs: Query-based data extraction
  • NoSQL Databases: MongoDB, Cassandra, DynamoDB
  • Cloud Storage: S3, GCS, Azure Blob (read files as data sources)
  • SaaS Platforms: Salesforce, HubSpot, Stripe (via connectors)

Supported Destinations:

  • Data Warehouses: Snowflake, BigQuery, Redshift, Databricks
  • Cloud Storage: S3, GCS, Azure Blob Storage
  • Relational Databases: PostgreSQL, MySQL, SQL Server
  • Data Lakes: S3 Parquet/Delta Lake, GCS, ADLS
  • Local Filesystem: JSON, CSV, Parquet, Avro files

Example:

# Sync from MySQL to Snowflake
[source]
type = "mysql"
host = "prod-db.example.com"
database = "production"

[destination]
type = "snowflake"
account = "xy12345.us-east-1"
warehouse = "COMPUTE_WH"
database = "ANALYTICS"

2. Incremental Sync

Only sync new or updated records to minimize data transfer and processing time.

How It Works:

  • Tracks the last synced timestamp or ID
  • Queries only changed records using the incremental column (see the illustrative query after the example below)
  • Maintains state between syncs
  • Supports timestamp, integer, and date-based incremental columns

Example:

[sync]
mode = "incremental"
incremental_column = "updated_at"
incremental_type = "timestamp"
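
Conceptually, each run reads the high-water mark stored by the previous sync and fetches only rows past it. A minimal sketch of the kind of query this produces (illustrative SQL only; the timestamp literal stands in for the saved state):

-- Illustrative: fetch only rows changed since the last recorded sync point
SELECT *
FROM orders
WHERE updated_at > '2025-11-25 14:30:00'  -- high-water mark from the previous run
ORDER BY updated_at;

After the batch completes, the largest updated_at value seen becomes the new high-water mark for the next run.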

Benefits:

  • 🚀 Faster syncs - Process only changed data
  • 💰 Lower costs - Reduced data transfer and compute
  • ⏱️ Real-time ready - Enable frequent syncs (every 5-15 minutes)
  • 🔄 Idempotent - Safe to re-run without duplicates

Use Cases:

  • Continuous database replication
  • Near real-time analytics pipelines
  • Event streaming to data warehouses
  • Audit log archival

3. Data Transformations

Transform data during sync without separate ETL tools.

Column-Level Transformations

Anonymization:

[[transformations]]
table = "users"
transform = "anonymize"
columns = ["email", "phone", "ssn"]
method = "hash"  # Options: hash, mask, null

Type Casting:

[[transformations]]
table = "orders"
transform = "cast"
columns = { "total" = "decimal", "quantity" = "integer", "created_at" = "timestamp" }

Column Renaming:

[[transformations]]
table = "customers"
transform = "rename"
columns = { "cust_id" = "customer_id", "cust_email" = "email" }

Column Derivation:

[[transformations]]
table = "orders"
transform = "derive"
new_column = "order_month"
expression = "DATE_FORMAT(created_at, '%Y-%m')"

Row-Level Transformations

Filtering:

[[transformations]]
table = "orders"
transform = "filter"
condition = "status = 'completed' AND total > 100"

Deduplication:

[[transformations]]
table = "events"
transform = "deduplicate"
key_columns = ["user_id", "event_type", "timestamp"]
strategy = "keep_latest"  # Options: keep_first, keep_latest, keep_all

Custom SQL Transformations

[[transformations]]
table = "orders"
transform = "sql"
query = """
  SELECT
    id,
    customer_id,
    total,
    CASE
      WHEN total >= 1000 THEN 'enterprise'
      WHEN total >= 100 THEN 'business'
      ELSE 'individual'
    END as customer_segment,
    DATE(created_at) as order_date
  FROM orders
  WHERE status = 'completed'
"""

4. Scheduled Syncs

Automate data syncs with flexible scheduling.

Cron Expressions:

[sync]
# Every 15 minutes
schedule = "*/15 * * * *"

# Other common schedules (a sync takes a single schedule key):
# schedule = "30 * * * *"   # Every hour at :30
# schedule = "0 2 * * *"    # Daily at 2 AM
# schedule = "0 9 * * 1-5"  # Weekdays at 9 AM
# schedule = "0 0 * * 0"    # Weekly on Sunday at midnight
# schedule = "0 0 1 * *"    # Monthly on the 1st at midnight

# Timezone support
timezone = "America/New_York"

Scheduler Management:

# Start scheduler daemon
hubio-sync scheduler start

# List scheduled jobs
hubio-sync scheduler list

# Pause all jobs
hubio-sync scheduler pause

# Resume jobs
hubio-sync scheduler resume

# Stop scheduler
hubio-sync scheduler stop

# View job history
hubio-sync scheduler history

Advanced Scheduling:

[sync]
schedule = "0 2 * * *"

# Retry on failure
max_retries = 3
retry_delay = 300  # seconds between retries

# Timeout
timeout = 7200  # 2 hours max runtime

# Skip if previous run still active
skip_if_running = true

5. Performance Optimization

Parallel Processing

[performance]
# Sync multiple tables concurrently
max_parallel_tables = 4

# Parallel batches within a table
max_parallel_batches = 8

# Worker threads per batch
workers_per_batch = 4
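
With these example values, up to 4 tables sync at once and up to 8 batches run concurrently within each table, so as many as 32 batches may be in flight at a time (assuming the per-table limit composes across tables); size these limits against source connection pools and destination write quotas.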

Batch Size Tuning

[sync]
# Global batch size
batch_size = 10000

# Table-specific overrides
[[sync.table_config]]
name = "large_table"
batch_size = 50000  # Larger batches for big tables

[[sync.table_config]]
name = "small_table"
batch_size = 1000   # Smaller batches for frequent updates
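
As a worked example, batch_size = 50000 splits a 10-million-row table into 200 batches, while batch_size = 1000 keeps individual batches small and cheap to retry for tables that change frequently.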

Memory Management

[performance]
# Memory limits
max_memory = "4GB"
buffer_size = "500MB"

# Spill to disk when memory full
enable_disk_spillover = true
temp_directory = "/tmp/hubio-sync"

Compression

[destination]
type = "s3"
format = "parquet"
compression = "snappy"  # Options: none, snappy, gzip, zstd, lz4
compression_level = 6   # 1-9 (zstd/gzip only)

6. Data Quality & Validation

Schema Validation

[validation]
# Validate schema before sync
validate_schema = true

# Fail on schema drift
strict_schema = true

# Allow new columns
allow_new_columns = true

# Warn on missing columns
warn_missing_columns = true

Data Quality Checks

[[validation.checks]]
table = "users"
check = "not_null"
columns = ["id", "email", "created_at"]

[[validation.checks]]
table = "orders"
check = "unique"
columns = ["order_id"]

[[validation.checks]]
table = "products"
check = "range"
column = "price"
min = 0
max = 1000000

[[validation.checks]]
table = "users"
check = "regex"
column = "email"
pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

Row Count Validation

[validation]
# Compare source vs destination row counts
validate_row_counts = true

# Acceptable difference threshold
row_count_tolerance = 0.01  # 1% tolerance

# Action on mismatch
on_row_count_mismatch = "warn"  # Options: warn, fail, ignore
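
As a worked example, with row_count_tolerance = 0.01 a 1,000,000-row source table passes validation when the destination holds between 990,000 and 1,010,000 rows; anything outside that range triggers the on_row_count_mismatch action.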

7. Monitoring & Observability

Metrics Export

[metrics]
enabled = true
port = 9090
format = "prometheus"

# Metrics to track
track_row_counts = true
track_sync_duration = true
track_error_rates = true
track_bytes_transferred = true

Prometheus Metrics:

  • hubio_sync_rows_synced_total - Total rows synced
  • hubio_sync_duration_seconds - Sync duration
  • hubio_sync_errors_total - Error count
  • hubio_sync_bytes_transferred - Data volume
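
With the [metrics] block above, these counters are served over HTTP on the configured port. Prometheus exporters conventionally expose them at /metrics, so a quick manual check might look like this (the exact path is an assumption, not confirmed by this guide):

# Scrape the metrics endpoint on the port from [metrics]
curl http://localhost:9090/metrics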

Logging

[logging]
level = "info"  # debug, info, warn, error
format = "json" # json, text, structured

# Outputs
outputs = ["stdout", "file"]
log_file = "/var/log/hubio-sync/sync.log"

# Rotation
max_file_size = "100MB"
max_backups = 10
compress = true

# Structured fields
include_timestamp = true
include_caller = true
include_stacktrace = true  # On errors

Alerts & Notifications

[alerts]
enabled = true

# Microsoft Teams notifications
teams_webhook_url = "${TEAMS_WEBHOOK_URL}"
notify_on = ["failure", "warning", "slow_sync"]

# Email notifications
smtp_host = "smtp.gmail.com"
smtp_port = 587
smtp_username = "${SMTP_USER}"
smtp_password = "${SMTP_PASS}"
email_to = ["team@example.com"]

# Alert thresholds
slow_sync_threshold = 3600  # seconds
error_rate_threshold = 0.05  # 5%

Health Checks

# HTTP health endpoint
curl http://localhost:8080/health

# Response
{
  "status": "healthy",
  "last_sync": "2025-11-25T14:30:00Z",
  "uptime_seconds": 86400,
  "active_syncs": 0
}

8. Security Features

Encryption at Rest

[security]
# Encrypt local cache and temp files
encrypt_cache = true
encryption_algorithm = "AES-256-GCM"
encryption_key_file = "/secure/path/to/key.pem"

Encryption in Transit

[security]
# TLS for all connections
require_tls = true
tls_min_version = "1.2"

# Certificate verification
verify_certificates = true
ca_bundle = "/path/to/ca-bundle.crt"

# Custom certificates
client_cert = "/path/to/client.crt"
client_key = "/path/to/client.key"

Credential Management

[security]
# Use secrets manager
secrets_provider = "aws_secrets_manager"  # AWS Secrets Manager, GCP, Azure, HashiCorp Vault
secrets_prefix = "hubio-sync/"

[source]
type = "mysql"
host = "db.example.com"
username = "sync_user"
# Reference secret instead of plaintext
password = "${secret:mysql_password}"

Audit Logging

[security.audit]
enabled = true
log_file = "/var/log/hubio-sync/audit.log"

# Events to audit
log_config_changes = true
log_authentication = true
log_data_access = true
log_schema_changes = true

9. High Availability & Reliability

Automatic Retries

[reliability]
# Retry failed syncs
max_retries = 3
retry_delay = 300  # seconds
retry_backoff = "exponential"  # linear, exponential

# Retry specific errors
retry_on_errors = ["connection_timeout", "deadlock", "rate_limit"]
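
With the values above, the first retry fires 300 seconds after a failure; under exponential backoff each subsequent delay grows, for example 300 s, then 600 s, then 1,200 s assuming a doubling factor (the exact multiplier is not specified in this guide).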

State Persistence

[state]
# Store sync state for recovery
backend = "dynamodb"  # dynamodb, postgres, mysql, redis, filesystem
table_name = "hubio_sync_state"

# Automatic state backups
backup_enabled = true
backup_interval = 3600  # seconds

Disaster Recovery

[disaster_recovery]
# Automatic snapshots before sync
snapshot_before_sync = true
snapshot_retention_days = 7

# Recovery point objective
max_data_loss_minutes = 15
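
Note that a 15-minute recovery point objective only holds if syncs actually run at least that often, for example with schedule = "*/15 * * * *" from the scheduling section.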

10. Advanced Features

Change Data Capture (CDC)

[source]
type = "mysql"
enable_cdc = true
cdc_method = "binlog"  # binlog, triggers, polling

# Capture operations
capture_inserts = true
capture_updates = true
capture_deletes = true

# Real-time streaming
stream_to_destination = true

Data Partitioning

[destination]
type = "s3"
format = "parquet"

# Time-based partitioning
partition_by = ["year", "month", "day"]
partition_column = "created_at"

# Custom partitioning
# partition_by = ["region", "product_category"]
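
With the time-based settings above, output objects typically land under Hive-style key=value prefixes derived from partition_column. An illustrative layout (bucket, prefix, and file names are placeholders):

s3://my-bucket/orders/year=2025/month=11/day=25/part-00000.snappy.parquet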

Data Deduplication

[sync]
enable_deduplication = true
dedup_key = ["user_id", "timestamp"]
dedup_strategy = "keep_latest"  # keep_first, keep_latest, keep_all
dedup_window = "7d"  # Look-back window

Schema Evolution

[schema]
# Automatic schema evolution
auto_evolve = true

# Evolution strategies
allow_new_columns = true
allow_column_type_changes = false
allow_column_drops = false

# Schema versioning
track_schema_versions = true
version_storage = "s3://bucket/schemas/"

Use Case Examples

Real-Time Analytics Pipeline

# MySQL → Snowflake every 5 minutes
[source]
type = "mysql"
host = "prod-db.example.com"

[destination]
type = "snowflake"
account = "xy12345"

[sync]
mode = "incremental"
schedule = "*/5 * * * *"
incremental_column = "updated_at"

Data Lake Archival

# PostgreSQL → S3 Parquet daily
[source]
type = "postgres"
host = "archive-db.example.com"

[destination]
type = "s3"
format = "parquet"
compression = "snappy"
partition_by = ["year", "month"]

[sync]
schedule = "0 2 * * *"

Multi-Cloud Replication

# MySQL (AWS) → BigQuery (GCP)
[source]
type = "mysql"
host = "aws-db.example.com"

[destination]
type = "bigquery"
project_id = "my-gcp-project"
dataset = "analytics"

[sync]
mode = "incremental"
schedule = "0 */4 * * *"

Performance Benchmarks

Typical throughput (with default settings):

  • MySQL → S3 Parquet: ~50,000 rows/second
  • PostgreSQL → Snowflake: ~30,000 rows/second
  • REST API → BigQuery: ~10,000 rows/second
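
As a rough worked example, at the MySQL → S3 Parquet rate above a one-time load of 100 million rows takes about 2,000 seconds (roughly 33 minutes), before the factors below come into play.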

Factors affecting performance:

  • Network bandwidth
  • Source database query performance
  • Destination write throughput
  • Data transformation complexity
  • Record size and schema complexity

Support

For technical support, contact support@hubio.com.