Features Overview
Hubio Sync is a data synchronization tool for moving data between databases, APIs, data warehouses, and cloud storage. This guide covers its major features and capabilities.
Core Features
1. Multi-Source Data Synchronization
Sync data from multiple source types to multiple destinations.
Supported Sources:
- Relational Databases: MySQL, PostgreSQL, SQLite, SQL Server, Oracle
- REST APIs: Any HTTP/HTTPS API with authentication support
- GraphQL APIs: Query-based data extraction
- NoSQL Databases: MongoDB, Cassandra, DynamoDB
- Cloud Storage: S3, GCS, Azure Blob (read files as data sources)
- SaaS Platforms: Salesforce, HubSpot, Stripe (via connectors)
Supported Destinations:
- Data Warehouses: Snowflake, BigQuery, Redshift, Databricks
- Cloud Storage: S3, GCS, Azure Blob Storage
- Relational Databases: PostgreSQL, MySQL, SQL Server
- Data Lakes: S3 Parquet/Delta Lake, GCS, ADLS
- Local Filesystem: JSON, CSV, Parquet, Avro files
Example:
# Sync from MySQL to Snowflake
[source]
type = "mysql"
host = "prod-db.example.com"
database = "production"
[destination]
type = "snowflake"
account = "xy12345.us-east-1"
warehouse = "COMPUTE_WH"
database = "ANALYTICS"
2. Incremental Sync
Only sync new or updated records to minimize data transfer and processing time.
How It Works:
- Tracks last sync timestamp or ID
- Queries only changed records using incremental column
- Maintains state between syncs
- Supports timestamp, integer, and date-based incremental columns
Example:
[sync]
mode = "incremental"
incremental_column = "updated_at"
incremental_type = "timestamp"
Benefits:
- 🚀 Faster syncs - Process only changed data
- 💰 Lower costs - Reduced data transfer and compute
- ⏱️ Real-time ready - Enable frequent syncs (every 5-15 minutes)
- 🔄 Idempotent - Safe to re-run without duplicates
Use Cases:
- Continuous database replication
- Near real-time analytics pipelines
- Event streaming to data warehouses
- Audit log archival
3. Data Transformations
Transform data during sync without separate ETL tools.
Column-Level Transformations
Anonymization:
[[transformations]]
table = "users"
transform = "anonymize"
columns = ["email", "phone", "ssn"]
method = "hash" # Options: hash, mask, null
Type Casting:
[[transformations]]
table = "orders"
transform = "cast"
columns = { "total" = "decimal", "quantity" = "integer", "created_at" = "timestamp" }
Column Renaming:
[[transformations]]
table = "customers"
transform = "rename"
columns = { "cust_id" = "customer_id", "cust_email" = "email" }
Column Derivation:
[[transformations]]
table = "orders"
transform = "derive"
new_column = "order_month"
expression = "DATE_FORMAT(created_at, '%Y-%m')"
Row-Level Transformations
Filtering:
[[transformations]]
table = "orders"
transform = "filter"
condition = "status = 'completed' AND total > 100"
Deduplication:
[[transformations]]
table = "events"
transform = "deduplicate"
key_columns = ["user_id", "event_type", "timestamp"]
strategy = "keep_latest" # Options: keep_first, keep_latest, keep_all
Custom SQL Transformations
[[transformations]]
table = "orders"
transform = "sql"
query = """
SELECT
id,
customer_id,
total,
CASE
WHEN total >= 1000 THEN 'enterprise'
WHEN total >= 100 THEN 'business'
ELSE 'individual'
END as customer_segment,
DATE(created_at) as order_date
FROM orders
WHERE status = 'completed'
"""
4. Scheduled Syncs
Automate data syncs with flexible scheduling.
Cron Expressions:
[sync]
# Pick one schedule; repeating the key is not valid TOML.
# Every 15 minutes
schedule = "*/15 * * * *"
# Every hour at :30
# schedule = "30 * * * *"
# Daily at 2 AM
# schedule = "0 2 * * *"
# Weekdays at 9 AM
# schedule = "0 9 * * 1-5"
# Weekly on Sunday at midnight
# schedule = "0 0 * * 0"
# Monthly on the 1st at midnight
# schedule = "0 0 1 * *"
# Timezone support
timezone = "America/New_York"
Scheduler Management:
# Start scheduler daemon
hubio-sync scheduler start
# List scheduled jobs
hubio-sync scheduler list
# Pause all jobs
hubio-sync scheduler pause
# Resume jobs
hubio-sync scheduler resume
# Stop scheduler
hubio-sync scheduler stop
# View job history
hubio-sync scheduler history
Advanced Scheduling:
[sync]
schedule = "0 2 * * *"
# Retry on failure
max_retries = 3
retry_delay = 300 # seconds between retries
# Timeout
timeout = 7200 # 2 hours max runtime
# Skip if previous run still active
skip_if_running = true
5. Performance Optimization
Parallel Processing
[performance]
# Sync multiple tables concurrently
max_parallel_tables = 4
# Parallel batches within a table
max_parallel_batches = 8
# Worker threads per batch
workers_per_batch = 4
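These settings map to a worker-pool model: at most max_parallel_tables tables are in flight at once. A rough sketch of that idea, where sync_table is a hypothetical stand-in for the real per-table work:
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-table work; the real sync reads, transforms, and writes.
def sync_table(name: str) -> str:
    return f"{name}: done"

tables = ["users", "orders", "events", "products", "invoices"]

# max_parallel_tables = 4 -> at most four tables are processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(sync_table, tables):
        print(result)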
Batch Size Tuning
[sync]
# Global batch size
batch_size = 10000
# Table-specific overrides
[[sync.table_config]]
name = "large_table"
batch_size = 50000 # Larger batches for big tables
[[sync.table_config]]
name = "small_table"
batch_size = 1000 # Smaller batches for frequent updates
Memory Management
[performance]
# Memory limits
max_memory = "4GB"
buffer_size = "500MB"
# Spill to disk when memory full
enable_disk_spillover = true
temp_directory = "/tmp/hubio-sync"
Compression
[destination]
type = "s3"
format = "parquet"
compression = "snappy" # Options: none, snappy, gzip, zstd, lz4
compression_level = 6 # 1-9 (zstd/gzip only)
6. Data Quality & Validation
Schema Validation
[validation]
# Validate schema before sync
validate_schema = true
# Fail on schema drift
strict_schema = true
# Allow new columns
allow_new_columns = true
# Warn on missing columns
warn_missing_columns = true
Data Quality Checks
[[validation.checks]]
table = "users"
check = "not_null"
columns = ["id", "email", "created_at"]
[[validation.checks]]
table = "orders"
check = "unique"
columns = ["order_id"]
[[validation.checks]]
table = "products"
check = "range"
column = "price"
min = 0
max = 1000000
[[validation.checks]]
table = "users"
check = "regex"
column = "email"
pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
Row Count Validation
[validation]
# Compare source vs destination row counts
validate_row_counts = true
# Acceptable difference threshold
row_count_tolerance = 0.01 # 1% tolerance
# Action on mismatch
on_row_count_mismatch = "warn" # Options: warn, fail, ignore
7. Monitoring & Observability
Metrics Export
[metrics]
enabled = true
port = 9090
format = "prometheus"
# Metrics to track
track_row_counts = true
track_sync_duration = true
track_error_rates = true
track_bytes_transferred = true
Prometheus Metrics:
- hubio_sync_rows_synced_total - Total rows synced
- hubio_sync_duration_seconds - Sync duration
- hubio_sync_errors_total - Error count
- hubio_sync_bytes_transferred - Data volume
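As a quick sanity check you can pull the exporter output directly; this sketch assumes the metrics are served in Prometheus text format at /metrics on the configured port:
import urllib.request

# Fetch the exporter output and print only Hubio Sync series.
# Assumes the endpoint is http://localhost:9090/metrics.
with urllib.request.urlopen("http://localhost:9090/metrics", timeout=5) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        if line.startswith("hubio_sync_"):
            print(line)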
Logging
[logging]
level = "info" # debug, info, warn, error
format = "json" # json, text, structured
# Outputs
outputs = ["stdout", "file"]
log_file = "/var/log/hubio-sync/sync.log"
# Rotation
max_file_size = "100MB"
max_backups = 10
compress = true
# Structured fields
include_timestamp = true
include_caller = true
include_stacktrace = true # On errors
Alerts & Notifications
[alerts]
enabled = true
# Microsoft Teams notifications
teams_webhook_url = "${TEAMS_WEBHOOK_URL}"
notify_on = ["failure", "warning", "slow_sync"]
# Email notifications
smtp_host = "smtp.gmail.com"
smtp_port = 587
smtp_username = "${SMTP_USER}"
smtp_password = "${SMTP_PASS}"
email_to = ["team@example.com"]
# Alert thresholds
slow_sync_threshold = 3600 # seconds
error_rate_threshold = 0.05 # 5%
Health Checks
# HTTP health endpoint
curl http://localhost:8080/health
# Response
{
  "status": "healthy",
  "last_sync": "2025-11-25T14:30:00Z",
  "uptime_seconds": 86400,
  "active_syncs": 0
}
8. Security Features
Encryption at Rest
[security]
# Encrypt local cache and temp files
encrypt_cache = true
encryption_algorithm = "AES-256-GCM"
encryption_key_file = "/secure/path/to/key.pem"
Encryption in Transit
[security]
# TLS for all connections
require_tls = true
tls_min_version = "1.2"
# Certificate verification
verify_certificates = true
ca_bundle = "/path/to/ca-bundle.crt"
# Custom certificates
client_cert = "/path/to/client.crt"
client_key = "/path/to/client.key"
Credential Management
[security]
# Use secrets manager
secrets_provider = "aws_secrets_manager" # aws, gcp, azure, vault
secrets_prefix = "hubio-sync/"
[source]
type = "mysql"
host = "db.example.com"
username = "sync_user"
# Reference secret instead of plaintext
password = "${secret:mysql_password}"
Audit Logging
[security.audit]
enabled = true
log_file = "/var/log/hubio-sync/audit.log"
# Events to audit
log_config_changes = true
log_authentication = true
log_data_access = true
log_schema_changes = true
9. High Availability & Reliability
Automatic Retries
[reliability]
# Retry failed syncs
max_retries = 3
retry_delay = 300 # seconds
retry_backoff = "exponential" # linear, exponential
# Retry specific errors
retry_on_errors = ["connection_timeout", "deadlock", "rate_limit"]
State Persistence
[state]
# Store sync state for recovery
backend = "dynamodb" # dynamodb, postgres, mysql, redis, filesystem
table_name = "hubio_sync_state"
# Automatic state backups
backup_enabled = true
backup_interval = 3600 # seconds
Disaster Recovery
[disaster_recovery]
# Automatic snapshots before sync
snapshot_before_sync = true
snapshot_retention_days = 7
# Recovery point objective
max_data_loss_minutes = 15
10. Advanced Features
Change Data Capture (CDC)
[source]
type = "mysql"
enable_cdc = true
cdc_method = "binlog" # binlog, triggers, polling
# Capture operations
capture_inserts = true
capture_updates = true
capture_deletes = true
# Real-time streaming
stream_to_destination = true
Data Partitioning
[destination]
type = "s3"
format = "parquet"
# Time-based partitioning
partition_by = ["year", "month", "day"]
partition_column = "created_at"
# Custom partitioning
# partition_by = ["region", "product_category"]
Data Deduplication
[sync]
enable_deduplication = true
dedup_key = ["user_id", "timestamp"]
dedup_strategy = "keep_latest" # keep_first, keep_latest, keep_all
dedup_window = "7d" # Look-back window
Schema Evolution
[schema]
# Automatic schema evolution
auto_evolve = true
# Evolution strategies
allow_new_columns = true
allow_column_type_changes = false
allow_column_drops = false
# Schema versioning
track_schema_versions = true
version_storage = "s3://bucket/schemas/"
Use Case Examples
Real-Time Analytics Pipeline
# MySQL → Snowflake every 5 minutes
[source]
type = "mysql"
host = "prod-db.example.com"
[destination]
type = "snowflake"
account = "xy12345"
[sync]
mode = "incremental"
schedule = "*/5 * * * *"
incremental_column = "updated_at"
Data Lake Archival
# PostgreSQL → S3 Parquet daily
[source]
type = "postgres"
host = "archive-db.example.com"
[destination]
type = "s3"
format = "parquet"
partition_by = ["year", "month"]
[sync]
schedule = "0 2 * * *"
compression = "snappy"
Multi-Cloud Replication
# MySQL (AWS) → BigQuery (GCP)
[source]
type = "mysql"
host = "aws-db.example.com"
[destination]
type = "bigquery"
project_id = "my-gcp-project"
dataset = "analytics"
[sync]
mode = "incremental"
schedule = "0 */4 * * *"
Performance Benchmarks
Typical throughput (with default settings):
- MySQL → S3 Parquet: ~50,000 rows/second
- PostgreSQL → Snowflake: ~30,000 rows/second
- REST API → BigQuery: ~10,000 rows/second
Factors affecting performance:
- Network bandwidth
- Source database query performance
- Destination write throughput
- Data transformation complexity
- Record size and schema complexity
Learn More
- Configuration Guide - Detailed configuration options
- Performance Tuning - Optimization techniques
- Security Best Practices - Secure your data pipelines
- CLI Reference - All commands and options
Support
For technical support, contact support@hubio.com.