Data Export: Streaming Large Datasets
Welcome to TopperBlog! 👋
I'm a tech content creator passionate about helping developers level up their careers and master cutting-edge technologies.
🎯 What I Write About:
• AI/ML Engineering & LLMs
• Web3 & Blockchain Development
• System Design & Architecture
• Interview Preparation (FAANG)
• Freelancing & Remote Work
• Modern Tech Stacks (Next.js, React, Rust, TypeScript)
• Performance Optimization & Best Practices
💼 Mission: Sharing practical, actionable insights that accelerate your tech career and maximize your earning potential.
📚 15+ In-Depth Guides covering everything from earning $10k/month as a freelancer to cracking FAANG interviews.
🌐 Let's connect and grow together in this amazing tech journey!
#TechBlogger #SoftwareEngineering #CareerGrowth #WebDevelopment #AIEngineering
Streaming Large Datasets Efficiently: Export at Scale
When your analytics platform needs to export 500 million rows to a customer's data warehouse, or your SaaS application must deliver a complete historical dataset spanning terabytes, traditional data export approaches collapse under their own weight. Loading entire result sets into memory, buffering complete responses before transmission, and synchronous processing patterns that worked for megabyte-scale exports in 2020 now cause out-of-memory crashes, timeout failures, and cascading system degradation when streaming large datasets efficiently becomes a production requirement rather than an optimization.
The consequences are immediate and measurable. A fintech company recently reported that their batch export jobs consumed 96GB of memory per worker, requiring expensive vertical scaling and still failing on datasets exceeding 2 billion rows. An e-commerce platform's customer data export feature timed out after 30 seconds for any merchant with more than 100,000 orders, forcing manual intervention and damaging customer trust. These failures stem from architectural assumptions that no longer hold: data volumes have grown exponentially, compliance requirements demand complete audit trails, AI training pipelines require full historical datasets, and customers expect sub-minute response times regardless of data size.
Why Traditional Export Approaches Fail at Scale
The conventional pattern of querying a database, collecting all results into an array, serializing to JSON or CSV, and returning a complete response worked adequately when datasets measured in thousands of rows. Modern applications routinely handle billions of events, petabytes of logs, and continuous data generation from IoT devices, user interactions, and automated systems.
Traditional approaches fail across multiple dimensions. Memory consumption scales linearly with dataset size—a 10GB result set requires at least 10GB of RAM, often 3-5x more after deserialization and object overhead. Database connections remain open for the entire export duration, exhausting connection pools and blocking other operations. Network timeouts kill long-running requests before completion. Client applications cannot begin processing until the entire response arrives, adding unnecessary latency.
Regulatory requirements in 2025 have intensified these challenges. GDPR, CCPA, and sector-specific regulations require complete data portability within strict timeframes. Healthcare systems must export patient records spanning decades. Financial institutions face audit requirements demanding transaction histories with millisecond timestamps across distributed systems. These aren't edge cases—they're standard operational requirements that traditional batch exports cannot satisfy reliably.
Modern Streaming Architecture for Data Export
Streaming large datasets efficiently requires fundamentally different architectural patterns. Instead of materializing complete results, streaming architectures process data in continuous flows, maintaining constant memory usage regardless of total dataset size. The core principle: never hold more than a small working set in memory at any time.
The architecture consists of four key components: a cursor-based data source that yields records incrementally, a transformation pipeline that processes records in chunks, a backpressure-aware transport layer that adapts to consumer speed, and a client-side handler that processes data as it arrives.
Here's a production-grade implementation using Node.js streams and TypeScript:
import { Readable, Transform } from 'stream';
import { pipeline } from 'stream/promises';
import { createGzip } from 'zlib';
interface ExportConfig {
batchSize: number;
maxConcurrency: number;
compressionEnabled: boolean;
}
class DatabaseCursorStream extends Readable {
private offset = 0;
private exhausted = false;
constructor(
private query: string,
private batchSize: number,
private dbClient: any
) {
super({ objectMode: true, highWaterMark: 2 });
}
async _read() {
if (this.exhausted) {
this.push(null);
return;
}
try {
const results = await this.dbClient.query(
`${this.query} LIMIT ${this.batchSize} OFFSET ${this.offset}`
);
if (results.rows.length === 0) {
this.exhausted = true;
this.push(null);
return;
}
for (const row of results.rows) {
if (!this.push(row)) {
// Backpressure detected, pause reading
break;
}
}
this.offset += results.rows.length;
} catch (error) {
this.destroy(error as Error);
}
}
}
class JSONLinesTransform extends Transform {
private isFirstChunk = true;
constructor() {
super({ objectMode: true });
}
_transform(chunk: any, encoding: string, callback: Function) {
try {
const line = JSON.stringify(chunk) + '\n';
callback(null, line);
} catch (error) {
callback(error);
}
}
}
class ChunkedExportStream {
constructor(private config: ExportConfig) {}
async export(query: string, dbClient: any, outputStream: NodeJS.WritableStream) {
const cursorStream = new DatabaseCursorStream(
query,
this.config.batchSize,
dbClient
);
const transformStream = new JSONLinesTransform();
const streams: NodeJS.ReadWriteStream[] = [
cursorStream,
transformStream
];
if (this.config.compressionEnabled) {
streams.push(createGzip({ level: 6 }));
}
streams.push(outputStream);
await pipeline(streams);
}
}
// Usage in an HTTP endpoint
async function handleExportRequest(req: any, res: any) {
res.setHeader('Content-Type', 'application/x-ndjson');
res.setHeader('Content-Encoding', 'gzip');
res.setHeader('Transfer-Encoding', 'chunked');
const exporter = new ChunkedExportStream({
batchSize: 1000,
maxConcurrency: 4,
compressionEnabled: true
});
try {
await exporter.export(
'SELECT * FROM events WHERE created_at > $1',
dbClient,
res
);
} catch (error) {
if (!res.headersSent) {
res.status(500).json({ error: 'Export failed' });
} else {
// Headers already sent, log error and close connection
console.error('Export stream error:', error);
res.end();
}
}
}
This implementation maintains constant memory usage through several mechanisms. The highWaterMark: 2 setting limits buffering to two batches. Object mode streams process individual records rather than string chunks. The cursor-based query pattern fetches data incrementally using LIMIT/OFFSET, though production systems should use keyset pagination for better performance on large offsets.
Handling Backpressure and Flow Control
Backpressure occurs when data producers generate data faster than consumers can process it. Without proper handling, buffers overflow, memory usage spikes, and systems crash. Streaming large datasets efficiently requires explicit backpressure management at every layer.
Node.js streams implement backpressure through return values: when push() returns false, the producer should stop generating data until the _read() method is called again. This signal propagates through the entire pipeline, creating a pull-based system where consumers control the flow rate.
For HTTP-based exports, TCP flow control provides automatic backpressure. When the client's receive buffer fills, the TCP stack stops acknowledging packets, causing the server's send buffer to fill, which eventually blocks the write operation. This natural backpressure mechanism works well for direct client connections but breaks down with intermediate proxies, load balancers, or CDNs that buffer responses.
Production systems need explicit timeout and cancellation handling:
class ResilientExportStream {
private abortController = new AbortController();
private heartbeatInterval?: NodeJS.Timeout;
async exportWithTimeout(
query: string,
dbClient: any,
outputStream: NodeJS.WritableStream,
timeoutMs: number
) {
const timeoutId = setTimeout(() => {
this.abortController.abort();
}, timeoutMs);
// Send periodic heartbeat comments to prevent proxy timeouts
this.heartbeatInterval = setInterval(() => {
if (!outputStream.destroyed) {
outputStream.write('# heartbeat\n');
}
}, 15000);
try {
await this.performExport(query, dbClient, outputStream);
} finally {
clearTimeout(timeoutId);
clearInterval(this.heartbeatInterval);
}
}
private async performExport(
query: string,
dbClient: any,
outputStream: NodeJS.WritableStream
) {
const cursor = dbClient.query(query, {
signal: this.abortController.signal
});
for await (const row of cursor) {
if (this.abortController.signal.aborted) {
throw new Error('Export aborted due to timeout');
}
const written = outputStream.write(JSON.stringify(row) + '\n');
if (!written) {
// Wait for drain event before continuing
await new Promise(resolve => outputStream.once('drain', resolve));
}
}
}
}
Optimizing Database Query Patterns
The database query pattern significantly impacts streaming performance. Sequential LIMIT/OFFSET queries degrade as offset increases—fetching rows 1,000,000 to 1,001,000 requires scanning and discarding the first million rows on every batch.
Keyset pagination using indexed columns provides consistent performance:
class KeysetCursorStream extends Readable {
private lastId: string | null = null;
private exhausted = false;
async _read() {
if (this.exhausted) {
this.push(null);
return;
}
const whereClause = this.lastId
? `WHERE id > $1 ORDER BY id LIMIT $2`
: `ORDER BY id LIMIT $1`;
const params = this.lastId
? [this.lastId, this.batchSize]
: [this.batchSize];
const results = await this.dbClient.query(
`SELECT * FROM events ${whereClause}`,
params
);
if (results.rows.length === 0) {
this.exhausted = true;
this.push(null);
return;
}
this.lastId = results.rows[results.rows.length - 1].id;
for (const row of results.rows) {
if (!this.push(row)) break;
}
}
}
For PostgreSQL, server-side cursors provide even better performance by maintaining query state on the database:
async function* postgresServerCursor(
query: string,
client: any,
batchSize: number
) {
const cursorName = `export_cursor_${Date.now()}`;
await client.query('BEGIN');
await client.query(`DECLARE ${cursorName} CURSOR FOR ${query}`);
try {
while (true) {
const result = await client.query(
`FETCH ${batchSize} FROM ${cursorName}`
);
if (result.rows.length === 0) break;
for (const row of result.rows) {
yield row;
}
}
} finally {
await client.query(`CLOSE ${cursorName}`);
await client.query('COMMIT');
}
}
Distributed Export for Multi-Terabyte Datasets
Single-node streaming works well up to hundreds of gigabytes, but multi-terabyte exports require distributed processing. The pattern involves partitioning the dataset, processing partitions in parallel across multiple workers, and merging results in a coordinated stream.
Time-based partitioning works well for event data:
interface ExportPartition {
startTime: Date;
endTime: Date;
workerId: string;
}
class DistributedExportCoordinator {
async createPartitions(
startDate: Date,
endDate: Date,
partitionCount: number
): Promise<ExportPartition[]> {
const totalMs = endDate.getTime() - startDate.getTime();
const partitionMs = totalMs / partitionCount;
return Array.from({ length: partitionCount }, (_, i) => ({
startTime: new Date(startDate.getTime() + i * partitionMs),
endTime: new Date(startDate.getTime() + (i + 1) * partitionMs),
workerId: `worker-${i}`
}));
}
async exportPartition(
partition: ExportPartition,
outputStream: NodeJS.WritableStream
) {
const query = `
SELECT * FROM events
WHERE created_at >= $1 AND created_at < $2
ORDER BY created_at, id
`;
const cursor = new KeysetCursorStream(
query,
[partition.startTime, partition.endTime],
1000,
dbClient
);
await pipeline(cursor, new JSONLinesTransform(), outputStream);
}
}
For cloud environments, object storage provides an efficient intermediate layer. Workers export partitions to S3/GCS, then a final merge step streams from multiple objects:
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
async function* mergeS3Partitions(
s3Client: S3Client,
bucket: string,
partitionKeys: string[]
) {
for (const key of partitionKeys) {
const command = new GetObjectCommand({ Bucket: bucket, Key: key });
const response = await s3Client.send(command);
const stream = response.Body as NodeJS.ReadableStream;
for await (const chunk of stream) {
yield chunk;
}
}
}
Common Pitfalls and Failure Modes
Memory leaks occur when event listeners aren't properly cleaned up. Always remove listeners in error handlers and use once() instead of on() for single-use events. Monitor memory usage with process.memoryUsage() and implement circuit breakers that abort exports exceeding memory thresholds.
Connection pool exhaustion happens when long-running exports hold database connections. Use dedicated connection pools for exports with lower max connection limits. Implement connection timeouts and automatic retry logic with exponential backoff.
Partial export failures leave clients with incomplete data. Include metadata headers indicating total expected records and implement checksums for verification:
class VerifiableExportStream extends Transform {
private recordCount = 0;
private checksum = 0;
_transform(chunk: any, encoding: string, callback: Function) {
this.recordCount++;
this.checksum = (this.checksum + this.hashRecord(chunk)) % 2147483647;
callback(null, JSON.stringify(chunk) + '\n');
}
_flush(callback: Function) {
const footer = JSON.stringify({
_metadata: {
totalRecords: this.recordCount,
checksum: this.checksum
}
}) + '\n';
callback(null, footer);
}
private hashRecord(record: any): number {
const str = JSON.stringify(record);
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = ((hash << 5) - hash) + str.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash);
}
}
Timezone and encoding issues corrupt data during export. Always export timestamps in ISO 8601 format with explicit timezone information. Use UTF-8 encoding consistently and validate character encoding at boundaries.
Best Practices for Production Streaming Exports
Implement comprehensive observability from the start. Emit metrics for records processed per second, current memory usage, active connection count, and error rates. Use distributed tracing to track exports across multiple services:
import { trace, context } from '@opentelemetry/api';
class InstrumentedExportStream extends Transform {
private tracer = trace.getTracer('export-service');
private recordsProcessed = 0;
_transform(chunk: any, encoding: string, callback: Function) {
const span = this.tracer.startSpan('process-record', {
attributes: {
'export.record_number': this.recordsProcessed++
}
});
try {
const result = this.processRecord(chunk);
span.setStatus({ code: 1 }); // OK
callback(null, result);
} catch (error) {
span.setStatus({ code: 2, message: (error as Error).message });
callback(error);
} finally {
span.end();
}
}
private processRecord(chunk: any): string {
return JSON.stringify(chunk) + '\n';
}
}
Configure appropriate batch sizes based on record size and network latency. For small records (< 1KB), use batches of 1000-5000. For large records (> 100KB), reduce to 100-500. Monitor and adjust based on actual throughput metrics.
Implement graceful degradation for overload scenarios. When system load exceeds thresholds, reduce batch sizes, increase delays between batches, or queue export requests for later processing:
class AdaptiveExportStream {
private currentBatchSize: number;
private readonly minBatchSize = 100;
private readonly maxBatchSize = 5000;
constructor(initialBatchSize: number) {
this.currentBatchSize = initialBatchSize;
}
async adaptBatchSize(metrics: { cpuUsage: number; memoryUsage: number }) {
if (metrics.cpuUsage > 80 || metrics.memoryUsage > 85) {
this.currentBatchSize = Math.max(
this.minBatchSize,
Math.floor(this.currentBatchSize * 0.8)
);
} else if (metrics.cpuUsage < 50 && metrics.memoryUsage < 60) {
this.currentBatchSize = Math.min(
this.maxBatchSize,
Math.floor(this.currentBatchSize * 1.2)
);
}
}
}
Use compression for network-bound exports. Gzip compression typically reduces JSON payload sizes by 70-90% with minimal CPU overhead. For extremely large exports, consider columnar formats like Parquet that provide better compression ratios and faster client-side processing.
Implement resume capability for long-running exports. Include a resume token in responses that clients can use to continue from the last successfully received record:
interface ResumeToken {
lastRecordId: string;
timestamp: number;
checksum: string;
}
function generateResumeToken(lastRecord: any): string {
const token: ResumeToken = {
lastRecordId: lastRecord.id,
timestamp: Date.now(),
checksum: hashRecord(lastRecord)
};
return Buffer.from(JSON.stringify(token)).toString('base64');
}
function parseResumeToken(token: string): ResumeToken {
return JSON.parse(Buffer.from(token, 'base64').toString('utf-8'));
}
Frequently Asked Questions
What is the most memory-efficient way to export billions of database rows?
Use cursor-based streaming with keyset pagination and a small batch size (1000-5000 records). Process records through a streaming pipeline without materializing the complete result set. This maintains constant memory usage regardless of total dataset size, typically under 100MB for exports of any size.
How does backpressure handling work in streaming data exports?
Backpress