Introduction: From 1h50m to Under 10 Minutes—How Careful Engineering Cut Data Load Time by an Order of Magnitude
Data import pipelines for large, distributed operations often evolve incrementally. Over time, they accumulate assumptions, one-off adapters, manual safety checks, and platform constraints that make performance tuning difficult. When a pipeline grows to handle multiple file sources, formats, and schedules, the latency added by sequential processing and non-selective file transfers becomes significant. In the case we describe here, a dairy enterprise in the ANZ region needed predictable, fast data refreshes across several farm software sources without sacrificing correctness or maintainability. The incumbent process took approximately one hour and fifty minutes to complete a full refresh. The goal was to reduce this to minutes while preserving the existing data contracts and operational safety.
This article explains how we approached the work: understanding the current system, deciding what to retain, and upgrading only what would measurably improve performance. The core adjustments included switching file synchronization from SFTP to Azure Blob Storage APIs, using parallel threads for selective downloads, and adopting a configuration-driven ingestion layer with JSON-based rules. We also implemented a multi–file-type loader that supports CSV, free-form text, special formats, and grain feed files used by proprietary farming software. Lastly, we added daily notifications to convey success or failure states to stakeholders. The result was a reduction in end-to-end loading time to under 10 minutes, with the Azure download phase completing in under 10 seconds. Throughout this write-up, we avoid speculation and focus on concrete design choices, measurable effects, and practical considerations.
Background and Discovery: Mapping the Constraints Before Changing Anything
The engagement spanned 20 days from discovery to production handover. Most of the early effort went into understanding the existing system and the constraints around it. Rather than replacing everything at once, we prioritized learning what had to remain stable. This was important both for risk control and for ensuring that downstream consumers—reports, dashboards, or other analytical processes—would continue to function without rework.
Initial Observations
The legacy approach retrieved files sequentially via SFTP and then loaded them into a data store, where a series of stored procedures transformed and reconciled the content. The SFTP transfer pattern synchronized entire directories without filtering and downloaded files one at a time. Ingestion then ran as a single-threaded process, creating a coupled flow where the slowest step determined overall latency. In aggregate, this resulted in about one hour and fifty minutes of elapsed time on typical data volumes. The aim was to improve performance without altering established data semantics or downstream expectations.
Our first step was to separate concerns into distinct phases—acquisition, ingestion, and post-load processing—and to identify measurable bottlenecks in each phase. This decomposition made it clear that file acquisition (time spent pulling data from remote storage) and linear ingestion were the two biggest opportunities for improvement. While stored procedure execution also consumed time, the majority of the delay stemmed from the way inputs were acquired and staged for transformation.
Constraints and Non-Goals
Several design constraints influenced our plan:
- Maintain existing data contracts, including the expected schemas and naming conventions used downstream.
- Increase throughput without introducing nondeterministic behavior that would make reconciliation difficult.
- Support multiple file types and farm software sources (Jantec, DairyPlan, EasyDairy, DelPro) without hardcoding parser logic where possible.
- Implement changes that could be validated and observed quickly, given the 20-day timeline.
Non-goals included redesigning core business logic or re-implementing stored procedures unless a clear performance issue was discovered. This guardrail kept the scope focused on bottlenecks that offered the largest return with minimal disruption.
Architecture and Data Flow: From SFTP to Azure Blob, From Sequential to Parallel
The redesigned pipeline is organized around three core stages: selective acquisition from Azure Blob Storage, dynamic ingestion using JSON-based configuration, and post-load transformation through stored procedures. A notification subsystem ties into the pipeline to report daily outcomes to stakeholders. In what follows, we describe each component and the reasoning behind it.
1) Selective Acquisition via Azure Blob Storage APIs
The most impactful improvement came from replacing SFTP-based synchronization with Azure Blob Storage API calls. Instead of mirroring entire directories, the new approach enumerates objects in the target container and filters them based on configuration and run-time criteria (for example, last-modified timestamps or explicit include/exclude patterns, depending on the agreed rules). This ensures that only the files needed for a given cycle are downloaded. We avoided making assumptions about file naming conventions beyond what was present in the configuration and previously documented patterns.
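To make the enumeration concrete, here is a minimal sketch of the selective listing step, assuming the azure-storage-blob v12 Python SDK; the container name, prefix, and cutoff time are illustrative placeholders rather than the production values.

```python
from datetime import datetime
from azure.storage.blob import ContainerClient  # assumed: azure-storage-blob v12

def list_new_blobs(conn_str: str, container: str, prefix: str, since: datetime) -> list:
    """Enumerate blobs under `prefix` and keep only those modified after `since`.
    `since` must be timezone-aware to compare against Azure's last_modified values."""
    client = ContainerClient.from_connection_string(conn_str, container_name=container)
    selected = []
    # list_blobs() filters server-side by prefix and returns metadata only,
    # so the decision to download is made without touching file contents.
    for props in client.list_blobs(name_starts_with=prefix):
        if props.last_modified > since:
            selected.append(props.name)
    return selected

# Hypothetical usage: consider only DairyPlan files changed since the last cycle.
# names = list_new_blobs(CONN_STR, "farm-data", "dairyplan/", last_run_utc)
```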
Two techniques drove the observed speedup:
- Server-side listing with narrow filters: We rely on Azure’s object listing to minimize client-side work, retrieving only metadata necessary to decide whether a file should be downloaded.
- Parallelized downloads: Downloads are executed in parallel threads. Correctness is preserved by treating each file as an independent unit of work and using atomic handoff to the ingestion stage (for example, staging files under a temporary name and finalizing once complete).
With parallelism and selectivity, the download portion decreased to under 10 seconds for the observed workloads. The exact number of threads used is tuned per environment and balanced against available network and storage bandwidth, but the main point is that concurrency is applied where it is safe: at the level of independent files.
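The download side can be sketched in the same spirit: each blob is an independent unit of work, written to a temporary name and atomically renamed once complete so ingestion never observes a partial file. The SDK, worker count, and paths below are assumptions for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient  # assumed: azure-storage-blob v12

def download_one(client: ContainerClient, blob_name: str, dest_dir: str) -> str:
    """Download one blob to a temporary name, then finalize it atomically."""
    final_path = os.path.join(dest_dir, os.path.basename(blob_name))
    tmp_path = final_path + ".partial"
    with open(tmp_path, "wb") as handle:
        client.download_blob(blob_name).readinto(handle)
    os.replace(tmp_path, final_path)  # atomic handoff: ingestion never sees partial files
    return final_path

def download_all(client: ContainerClient, blob_names, dest_dir: str, workers: int = 8) -> list:
    """Fetch independent files concurrently; the worker count is tuned per environment."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda name: download_one(client, name, dest_dir), blob_names))
```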
2) Dynamically Configurable Ingestion
The ingestion system is governed by JSON configuration files that define how to handle each file type and source. Rather than hardcoding logic for individual farm systems, we externalized behavior wherever possible. The configuration includes elements such as the following (a minimal sketch appears after this list):
- Source identification (e.g., Jantec, DairyPlan, EasyDairy, DelPro) and associated parsing rules.
- File type definitions (CSV, text, special format files, grain feed files for proprietary farming software) and per-type parsing options (delimiter, quoting, encoding, header row expectations, record terminators, and field-level coercion rules).
- Column mapping and type coercion directives to normalize incoming data into the target schema.
- Validation checks (required fields, acceptable value ranges, basic referential checks where applicable) with explicit handling strategies for failures (quarantine, reject, or default).
- Batching parameters (maximum rows per batch, memory thresholds) to safely stream large files without exhausting resources.
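To ground the idea, the following hypothetical sketch shows a small configuration fragment and a lookup that hands the declared parsing options to the ingestion engine; the keys and values are illustrative, not the production schema.

```python
import json

# Hypothetical configuration fragment; real profiles carry more detail per source.
CONFIG = json.loads("""
{
  "sources": {
    "DairyPlan": {
      "file_type": "csv",
      "delimiter": ",",
      "encoding": "utf-8",
      "has_header": true,
      "column_map": {"CowID": "cow_id", "MilkKg": "milk_kg"},
      "batch_rows": 5000,
      "on_invalid": "quarantine"
    }
  }
}
""")

def parsing_profile(source: str) -> dict:
    """Return the declarative parsing profile for a source. Onboarding a new provider
    or adjusting a delimiter is a reviewed configuration change, not a code deploy."""
    try:
        return CONFIG["sources"][source]
    except KeyError:
        raise ValueError(f"No ingestion profile configured for source '{source}'")

# profile = parsing_profile("DairyPlan")
# Reader options (delimiter, encoding, header handling) are then taken from `profile`.
```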
This approach makes it straightforward to onboard a new file format or tweak parsing behavior without deploying new code. It also improves transparency; configuration is version-controlled and reviewable, which helps when auditing changes or diagnosing data anomalies.
3) Post-Load Stored Procedures
After ingestion, we invoke existing stored procedures to perform transformations, merges, and any necessary reconciliation. Preserving these procedures minimized risk and aligned with the mandate to avoid changing domain semantics unless a measurable issue was identified. Because file acquisition and ingestion are faster and more parallelized, the stored procedure phase starts sooner. Additionally, by ensuring that data is normalized at ingestion-time according to the configuration, downstream work remains stable and predictable.
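As a rough illustration of the handoff, the post-load step can be invoked with an ordinary database call; the sketch below assumes a SQL Server-style database reached via pyodbc, and the procedure name and connection string are placeholders.

```python
import pyodbc  # assumed database driver; connection details are placeholders

def run_post_load(conn_str: str, batch_id: int) -> None:
    """Invoke the existing transformation/reconciliation procedure for one batch.
    The procedure itself is unchanged; only the path to it got faster."""
    conn = pyodbc.connect(conn_str)
    try:
        cursor = conn.cursor()
        # 'usp_ProcessImportBatch' is a placeholder, not the real procedure name.
        cursor.execute("{CALL usp_ProcessImportBatch (?)}", batch_id)
        conn.commit()
    finally:
        conn.close()
```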
4) Notifications and Operational Feedback
The pipeline sends daily notifications summarizing import outcomes—success or failure, and where available, counts and basic diagnostics. The goal is straightforward: make status visible to stakeholders without requiring them to inspect logs. On failures, the notification indicates the relevant stage so that investigation can begin promptly. For recurring operations, this lightweight feedback loop encourages steady-state monitoring with minimal overhead.
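The notification channel itself is environment-specific. As one hypothetical example, a daily summary could be emailed with the standard library; the SMTP host and addresses below are placeholders.

```python
import smtplib
from email.message import EmailMessage

def send_daily_summary(status: str, details: str) -> None:
    """Send a one-line daily heartbeat; host and addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = f"Daily farm data import: {status}"
    msg["From"] = "pipeline@example.com"
    msg["To"] = "stakeholders@example.com"
    msg.set_content(details)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

# Hypothetical usage after a run completes:
# send_daily_summary("SUCCESS", "All sources ingested; no files quarantined.")
```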
Performance Engineering: Why the Load Time Dropped So Sharply
Cutting the end-to-end time from around one hour and fifty minutes to under ten minutes required attention to both data movement and compute scheduling. No single optimization accounted for the entire win; the aggregate effect came from removing sequential bottlenecks and stopping unnecessary work.
Eliminating Unnecessary Transfers
With SFTP synchronization, the pipeline previously considered broad directories and often handled more files than necessary for a given run. Even when checks existed to avoid reprocessing unchanged data, the cost of negotiating connections and traversing directory trees accumulated. By shifting to Azure Blob Storage APIs and enumerating objects with filters, the system now avoids downloading unneeded files entirely. This removes latency and reduces I/O contention on the ingestion machine.
Parallelism Where It’s Safe
We introduced parallelism in two places: file downloading and per-file ingestion. Concurrency was applied conservatively—only to operations that are independent and have clearly defined inputs and outputs. This ensures reproducible results while taking advantage of available CPU and network resources. The parallel download layer relies on thread-safe operations and temporary staging to avoid partially written files being read by ingestion. Once a file is fully downloaded and verified (size check or checksum where defined), it is handed off.
For ingestion, depending on file size and configured rules, the pipeline can process multiple files concurrently. The dynamic configuration determines whether certain file types need sequential handling (for example, if ordering guarantees are required) or whether they can be safely parallelized. This strikes a balance between speed and correctness.
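One way to express that decision in code is to let the configuration profile gate concurrency, as in the sketch below; the "sequential" flag, the process_file callable, and the worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_files(files, profile: dict, process_file, max_workers: int = 4) -> list:
    """Process files sequentially when the profile demands ordering guarantees,
    otherwise fan out across a bounded thread pool."""
    if profile.get("sequential", False):
        # Ordering matters (e.g., reference data before transactions): keep it serial.
        return [process_file(path, profile) for path in files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Independent files: safe to process concurrently.
        return list(pool.map(lambda path: process_file(path, profile), files))
```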
Streaming and Batching
The ingestion layer streams data into staging tables and applies batching to control memory usage. By bounding batch sizes and using backpressure signals (for example, limiting the number of active batches per file), the pipeline avoids pauses due to resource exhaustion. The exact thresholds are configuration-driven, allowing tuning without code changes. Combined with column-level coercion and early validation, this reduces rework and improves throughput.
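A minimal streaming-and-batching sketch with the standard library shows the shape of this approach: rows are read lazily and yielded in bounded batches, so memory stays flat regardless of file size. The batch size shown here would normally come from configuration.

```python
import csv
from typing import Iterator, List

def batched_rows(path: str, delimiter: str = ",", batch_rows: int = 5000) -> Iterator[List[dict]]:
    """Stream a CSV file and yield fixed-size batches of parsed rows.
    Nothing larger than one batch is held in memory at a time."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        batch: List[dict] = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_rows:
                yield batch
                batch = []
        if batch:
            yield batch

# Hypothetical usage with a placeholder loader that writes to the staging table:
# for batch in batched_rows("dairyplan_milkings.csv", batch_rows=5000):
#     load_into_staging(batch)
```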
Preserving the Downstream Contract
A common source of performance regressions is unintentional schema drift or changes to naming conventions that affect downstream tooling. We avoided that risk by keeping the target schemas stable and ensuring that any normalization occurs before data lands in the staging area used by stored procedures. This allowed us to shorten the critical path without ripple effects.
Dynamic Configuration: JSON as the Control Plane for Ingestion
Treating configuration as data allows the ingestion system to adapt quickly to new file structures and sources. The JSON files act as a control plane for how the pipeline reads, validates, and maps content. They do not contain executable code; rather, they define a declarative specification that the ingestion engine interprets.
Key Elements of the JSON Configuration
- Source descriptors: Identify providers (Jantec, DairyPlan, EasyDairy, DelPro) and associate them with default parsing profiles.
- Format definitions: For CSV and text formats, the configuration can specify delimiters, quote characters, escape handling, encoding (e.g., UTF-8), header presence, and expected column counts. For special formats and grain feed files, the configuration can define record structures, field separators, and line-level patterns used to distinguish record types.
- Field mapping and coercion: Map source fields to target columns with type coercion rules (integer, float, date parsing with specified formats). Invalid conversions can trigger configurable actions: skip the row, substitute a default, or quarantine the row with diagnostics (a short enforcement sketch follows this list).
- Validation blocks: Declare required fields, enumerations, and value ranges. Each validation has a corresponding disposition to ensure the pipeline behaves predictably under imperfect inputs.
- Batch and memory guidance: Define batch sizes, maximum concurrency, and read-ahead buffers, enabling environment-specific tuning without code changes.
- Ordering and dependencies: Indicate whether certain file types must be ingested in sequence (for example, reference data before transactional data) to preserve referential integrity across stored procedure runs.
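The enforcement side of the field mapping and validation elements might look like the sketch below; the rule structure and field names are illustrative, and the dispositions mirror the skip, default, and quarantine options described above.

```python
from datetime import datetime
from typing import Optional

def coerce_field(raw: str, rule: dict):
    """Apply a declared type coercion; raises ValueError/KeyError on failure so the
    caller can apply the configured disposition (skip, default, or quarantine)."""
    kind = rule.get("type", "string")
    if kind == "int":
        return int(raw)
    if kind == "float":
        return float(raw)
    if kind == "date":
        return datetime.strptime(raw, rule["format"]).date()
    return raw

def apply_row(row: dict, rules: dict, quarantine: list) -> Optional[dict]:
    """Coerce each mapped field; on failure, follow the field's declared disposition."""
    out = {}
    for field, rule in rules.items():
        try:
            out[rule.get("target", field)] = coerce_field(row.get(field, ""), rule)
        except (ValueError, KeyError):
            action = rule.get("on_invalid", "quarantine")
            if action == "default":
                out[rule.get("target", field)] = rule.get("default")
            elif action == "skip":
                return None                  # drop the row per configuration
            else:
                quarantine.append({"row": row, "field": field, "rule": rule})
                return None                  # hold the row aside with diagnostics
    return out
```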
Benefits of Declarative Control
The main advantages are resilience and adaptability. When a provider adds a column or changes a delimiter, configuration updates can be reviewed and deployed quickly. Versioning the configuration also provides traceability—helpful when investigating anomalies or comparing behavior across days.
Because the configuration is explicit, it is easier to reason about the pipeline’s behavior. Operationally, this reduces the risk of accidental coupling between sources and makes onboarding new file types or providers more predictable.
Multi–File-Type Ingestion: Handling CSV, Text, Special Formats, and Grain Feed Files
A consistent ingestion engine can parse multiple data structures as long as its parsing strategies are carefully isolated and guided by configuration. The system we implemented supports the following categories:
- CSV: Handles configurable delimiter, quoting, and header rows. It supports large files by streaming rows and using fixed-size batches.
- Plain text: For row-based formats where records are delimited by line breaks but fields may require custom splitting logic, the configuration describes parsing rules and expected patterns.
- Special formats: For files that deviate from standard CSV or line-based structures, the configuration defines specific record schemas, field boundaries, and any pre-processing steps needed to interpret the data.
- Grain feed files (proprietary farming software): These may include domain-specific structures. The engine uses configuration to map fields into the target schema while preserving integrity checks.
The ingestion pipeline does not assume more than what is codified in configuration. Where ambiguity exists, conservative defaults are used (for example, rejecting a record that cannot be parsed according to the declared format rather than inferring a new rule). This avoids silent corruption and supports accurate downstream processing.
Integration with Farm Software Sources
The system supports files originating from Jantec, DairyPlan, EasyDairy, and DelPro by associating each source with the appropriate configuration profile. In practice, this means a file placed in a designated location or tagged with a particular path can be routed through the correct parsing and validation steps. When a provider changes an output layout, only the corresponding configuration needs updating, reducing the lead time to accommodate changes.
Reliability and Observability: Daily Notifications and Safe Failure Modes
Performance without observability is fragile. In parallel with the speedups, we added operational feedback mechanisms to ensure stakeholders could quickly assess data freshness and address issues.
Daily Notifications
The system emits daily notifications indicating whether the data import succeeded or failed. The message includes sufficient context to guide investigation (for example, which stage failed: acquisition, ingestion, or post-load processing). This daily heartbeat helps teams validate that the pipeline is functioning without requiring manual inspection of logs.
Safe Failure Modes
The ingestion engine quarantines problematic files or records rather than allowing partial, silent ingestion. For example, if a file does not match the declared schema, the system can hold it aside and note the reason in the logs, while continuing to process other files. This isolation prevents a single malformed input from blocking the entire pipeline.
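A minimal illustration of that isolation, using only the standard library; the directory layout, logger name, and log wording are assumptions.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger("ingestion")  # assumed logger name

def quarantine_file(path: str, reason: str, quarantine_dir: str = "quarantine") -> None:
    """Move a problematic file aside and record why, so the rest of the run continues."""
    dest = Path(quarantine_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / Path(path).name
    shutil.move(path, str(target))
    logger.warning("Quarantined %s: %s", target, reason)

# Example: a file whose header does not match its declared schema is set aside with a
# reason, while the remaining files continue through ingestion and post-load processing.
# quarantine_file("downloads/easydairy_daily.csv", "header mismatch vs configured schema")
```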
Practical Tips and Considerations: What Mattered Most in the 20-Day Window
Several practical decisions helped us achieve measurable improvements within the limited timeframe. While every environment differs, the following considerations were central to the observed gains:
- Measure first, then optimize: Establish baseline timings for acquisition, ingestion, and post-load processing. Target the slowest stage with changes that carry the least risk.
- Prioritize selectivity: Avoid moving bytes you do not need. Switching from directory sync to filtered object listing and selective downloads on Azure saved minutes immediately.
- Parallelize independent work: If files are independent, they are candidates for parallel handling. Constrain concurrency using configuration so it can be tuned per environment.
- Stream data: Favor streaming reads and writes with bounded memory usage. Configure batch sizes and backpressure to keep throughput stable.
- Keep contracts stable: Preserve schemas and naming conventions unless there is a clear, agreed reason to change them. Stability reduces coordination costs.
- Externalize behavior: Use JSON configuration to define parsing and validation. This makes changes auditable and reduces deploy cycles.
- Automate feedback: Notifications minimize guesswork about pipeline health and encourage quick intervention on failure.
Implementations that emphasize selectivity and parallelism usually deliver immediate wins. However, it is equally important to define safe defaults and failure behavior to avoid trading performance for fragility. In our case, careful scoping and incremental validation allowed performance improvements without compromising correctness.
Results and Validation: What Changed and How We Verified It
The measurable outcomes were:
- Download phase: Reduced to under 10 seconds via selective Azure Blob Storage API calls and parallel threads, compared to prior sequential SFTP transfers.
- End-to-end load time: Decreased from approximately 1 hour 50 minutes to under 10 minutes for the observed data volumes and formats.
Validation steps included:
- Comparative runs: Running the legacy and new pipelines against the same data to confirm that downstream outputs remained consistent.
- Incremental rollout: Enabling the new acquisition step first, then the dynamic ingestion layer, verifying results at each stage before introducing higher concurrency.
- Operational monitoring: Observing daily notifications and logs to confirm stability across successive runs.
We avoided changing stored procedures unless necessary. This decision reduced risk and limited the surface area for differences in downstream behavior. Instead, we focused on accelerating the path to reach those procedures with clean, normalized inputs.
Security and Governance Considerations
Any change to data movement patterns should be reviewed with security and governance in mind. While the details of environment-specific security configurations are out of scope here, the following principles informed our approach:
- Least privilege: Access to Azure Blob containers, queues, or tables should be limited to what the pipeline requires. Rotate credentials according to organizational policy (a sketch of a narrowly scoped access token follows this list).
- Separation of duties: Configuration changes can be peer-reviewed and version-controlled, allowing audit trails and approvals.
- Data minimization: By selectively downloading only required files, the pipeline reduces exposure surface and potential data sprawl.
- Robust logging: Logs should record access events and data handling decisions (e.g., why a file was skipped) to support investigations and compliance checks.
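As one way to apply least privilege in this setup, the acquisition step can be handed a short-lived, read-and-list-only SAS token scoped to the source container; the sketch assumes the azure-storage-blob v12 SDK, and the account details are placeholders.

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerSasPermissions, generate_container_sas  # assumed v12 SDK

def read_only_sas(account_name: str, account_key: str, container: str, hours: int = 8) -> str:
    """Issue a short-lived SAS token limited to read and list on one container,
    so the pipeline never holds write or delete rights on the source data."""
    return generate_container_sas(
        account_name=account_name,
        container_name=container,
        account_key=account_key,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours),
    )
```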
These practices are straightforward to layer onto the technical design and help sustain performance improvements without compromising organizational requirements.
Extensibility: Preparing for New Sources and Formats
Because the ingestion logic is configuration-driven, adding a new data source or modifying an existing one involves updating the relevant JSON and validating the change against test data. This structure scales better than embedding provider-specific logic throughout code paths. For organizations that routinely onboard new partners or farms, the approach keeps lead times short and changes auditable.
The key is to keep tight control over the boundary between configuration and code. Configuration should describe what to parse and how to interpret it; code should remain responsible for execution semantics, validation enforcement, and error handling. When these responsibilities are well-defined, complexity stays manageable even as the number of supported formats grows.
Lessons Learned: What We Would Do the Same Again
Several decisions proved especially effective:
- Start with careful discovery: Understanding existing assumptions avoided breaking downstream consumers and kept scope realistic within 20 days.
- Focus on selective acquisition first: Removing unnecessary file transfers is often the fastest route to significant gains.
- Use parallelism judiciously: Concurrency helped, but only where independence was clear. Guardrails prevented race conditions and partial reads.
- Externalize parsing rules: JSON configurations simplified change management and increased transparency.
- Preserve core business logic: Leaving stored procedures intact reduced risk while still producing a substantial performance improvement.
If we were to extend this work further, natural areas would include richer observability (for example, per-source dashboards), tighter SLA monitoring, and automated configuration validation to catch mismatches earlier in the lifecycle. These possibilities are compatible with the current design and can be layered in incrementally.
Conclusion and How to Get in Touch
This case study illustrates how measurable performance gains can come from improving data acquisition strategies, applying parallelism at the right boundaries, and treating configuration as a first-class asset. By moving from SFTP to Azure Blob Storage API calls with selective, parallel downloads, implementing a configuration-driven ingestion layer, and preserving existing stored procedures, the end-to-end data load time dropped from about one hour and fifty minutes to under ten minutes. Daily notifications now provide operational assurances without manual effort, and the system remains adaptable to multiple file types and providers, including Jantec, DairyPlan, EasyDairy, and DelPro.
If you are planning a similar modernization—whether you need to cut load times, onboard new file formats, or add observability—we can help. Our approach emphasizes careful discovery, conservative changes where correctness matters, and targeted optimizations where they have the greatest effect. We would be happy to discuss your environment, constraints, and goals, and outline a practical plan tailored to your needs.
Ready to explore what a configuration-driven, parallelized ingestion pipeline could do for your data operations? Get in touch with us to start a conversation about your project. We’ll help you assess your current setup, identify low-risk, high-impact improvements, and implement a solution that prioritizes reliability and measurable outcomes.