Streaming Response APIs: Real-Time Data Delivery

Server-sent events and chunked transfer for progressive content loading

Traditional request-response APIs force clients to wait for complete data processing before receiving any output. When generating AI responses, processing large datasets, or performing complex computations, this creates poor user experiences with blank screens and timeout errors. Users abandon applications that feel unresponsive, even when backend processing is working correctly.

The fundamental problem is buffering. A conventional request handler accumulates the entire payload in memory before sending any of it. For a 30-second LLM inference or a 5-minute data export, clients receive nothing until completion, or worse, hit gateway timeouts that commonly sit at 30-60 seconds. This architectural constraint breaks down for modern applications requiring immediate feedback and progressive disclosure.

Streaming response APIs solve this by transmitting data incrementally as it becomes available. Instead of waiting for complete processing, clients receive chunks immediately, enabling real-time UI updates, early error detection, and better resource utilization. This pattern has become essential for AI applications, live analytics dashboards, and any system where perceived performance matters as much as actual throughput.

Why Traditional Approaches Fail

Standard REST APIs use a simple contract: client sends request, server processes completely, client receives response. This works perfectly for fast operations under 1-2 seconds. Beyond that threshold, the model breaks down in several ways.

Timeout cascades occur when API gateways, load balancers, and proxies each enforce their own timeout policies. Your application server might allow 5-minute processing, but if Cloudflare terminates connections at 100 seconds, clients never see results. Increasing timeouts across every infrastructure layer is fragile and doesn't scale.

Memory pressure builds when buffering large responses. Generating a 50MB CSV export requires holding the entire file in memory before transmission. With 100 concurrent requests, you're managing 5GB of buffered data, which causes garbage collection pauses, OOM errors, and degraded performance for all users.
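
The arithmetic behind that figure, and the contrast with streaming, can be sketched in a few lines. The 64,000-byte per-connection buffer below is an assumed value chosen for illustration (in practice it would be something like a stream's highWaterMark), not a number from any specific runtime.

```typescript
const MB = 1_000_000; // decimal megabytes, matching the figures in the text

// Peak response memory when every request holds its full payload.
function bufferedBytes(responseBytes: number, concurrent: number): number {
  return responseBytes * concurrent;
}

// Peak response memory when each connection holds at most one write buffer.
// perConnectionBytes is an assumed cap, not a measured value.
function streamedBytes(perConnectionBytes: number, concurrent: number): number {
  return perConnectionBytes * concurrent;
}

const buffered = bufferedBytes(50 * MB, 100); // 5,000,000,000 bytes = 5 GB
const streamed = streamedBytes(64_000, 100);  // 6,400,000 bytes = 6.4 MB
console.log(buffered / streamed);             // 781.25
```

Under these assumptions the buffered approach holds several hundred times more response data in memory than the streamed one, and the gap grows with response size rather than with concurrency alone.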

User experience degradation is the most visible failure. Users see loading spinners with no progress indication, no way to know if processing is working or stalled. They refresh pages, retry requests, and create duplicate work. The lack of incremental feedback makes applications feel broken even when functioning correctly.

Error detection delays mean failures discovered late in processing waste all previous computation. If a streaming API fails at 10% progress, you know immediately and can retry or adjust. With buffered responses, you discover errors only after investing full processing time.

Modern Streaming Patterns

Two primary approaches dominate streaming response APIs in 2025: Server-Sent Events (SSE) for unidirectional server-to-client streaming, and chunked transfer encoding for HTTP/1.1 compatibility with progressive responses.

Server-Sent Events Implementation

SSE provides a standardized protocol for server-to-client event streams over HTTP. It's built into browsers via the EventSource API and offers automatic reconnection, event typing, and clean error handling.
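
On the wire, each event is a handful of "field: value" lines terminated by a blank line. The sketch below shows that framing in isolation; formatSSE and the SSEFrame type are hypothetical names introduced here for illustration, and the caller is assumed to have already serialized the payload (for example with JSON.stringify).

```typescript
// Fields of one event in the text/event-stream format.
interface SSEFrame {
  data: string;    // payload, already serialized by the caller
  event?: string;  // event name; clients treat a missing name as "message"
  id?: string;     // becomes the client's last event ID
  retry?: number;  // reconnection delay hint in milliseconds
}

// Hypothetical helper: serialize one event into an SSE frame.
function formatSSE({ data, event, id, retry }: SSEFrame): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  if (retry !== undefined) lines.push(`retry: ${retry}`);
  // A payload containing line breaks must be split across several data: lines;
  // the client joins them back together with "\n".
  for (const line of data.split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  // A blank line terminates the event.
  return lines.join('\n') + '\n\n';
}

// Example frame:
// formatSSE({ event: 'message', data: JSON.stringify({ content: 'hi' }) })
```

Centralizing the framing in one function avoids the most common SSE bug: a payload with an embedded newline that silently ends the event early.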

// Node.js streaming endpoint with SSE (Fastify)
import { FastifyInstance, FastifyRequest, FastifyReply } from 'fastify';

interface StreamRequest {
  query: string;
  model: string;
}

export async function registerStreamingRoutes(app: FastifyInstance) {
  app.get<{ Querystring: StreamRequest }>(
    '/api/stream',
    async (request: FastifyRequest<{ Querystring: StreamRequest }>, reply: FastifyReply) => {
      const { query, model } = request.query;

      // Take over the raw response so Fastify does not try to send its own reply
      reply.hijack();

      // Set SSE headers
      reply.raw.setHeader('Content-Type', 'text/event-stream');
      reply.raw.setHeader('Cache-Control', 'no-cache');
      reply.raw.setHeader('Connection', 'keep-alive');
      reply.raw.setHeader('X-Accel-Buffering', 'no'); // Disable nginx buffering

      // flush() is not part of Node's ServerResponse; it only exists when a
      // compression middleware has added it, so type it as optional
      const raw = reply.raw as typeof reply.raw & { flush?: () => void };

      // Send initial connection event
      raw.write('event: connected\n');
      raw.write(`data: ${JSON.stringify({ timestamp: Date.now() })}\n\n`);

      try {
        // Simulate streaming data source (replace with actual LLM, database, etc.)
        for await (const chunk of processStreamingQuery(query, model)) {
          const eventData = {
            content: chunk.text,
            metadata: chunk.metadata,
            progress: chunk.progress
          };

          raw.write('event: message\n');
          raw.write(`data: ${JSON.stringify(eventData)}\n\n`);

          // Flush compressed output, if any, so each event leaves immediately
          raw.flush?.();
        }

        // Send completion event
        raw.write('event: complete\n');
        raw.write(`data: ${JSON.stringify({ status: 'success' })}\n\n`);
      } catch (error) {
        // Headers are already sent, so report the failure as an event
        const err = error as { message?: string; code?: string };
        raw.write('event: error\n');
        raw.write(`data: ${JSON.stringify({
          message: err.message ?? 'Unknown error',
          code: err.code
        })}\n\n`);
      } finally {
        raw.end();
      }
    }
  );
}

async function* processStreamingQuery(query: string, model: string) {
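  // Example: streaming from an LLM or database cursor.
  // getDataChunks is a placeholder for your own data source; it is assumed
  // to resolve to an array of strings.
  const chunks = await getDataChunks(query, model);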

  for (let i = 0; i < chunks.length; i++) {
    yield {
      text: chunks[i],
      metadata: { chunkIndex: i, model },
      progress: ((i + 1) / chunks.length) * 100
    };

    // Simulate processing delay
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}

The client-side implementation uses the native EventSource API:

// Client-side SSE consumer
class StreamingAPIClient {
  private eventSource: EventSource | null = null;

  async streamQuery(
    query: string,
    model: string,
    callbacks: {
      onMessage: (data: any) => void;
      onError: (error: any) => void;
      onComplete: () => void;
    }
  ): Promise<void> {
    // Encode both parameters; an unencoded model name could break the URL
    const url = `/api/stream?query=${encodeURIComponent(query)}&model=${encodeURIComponent(model)}`;

    this.eventSource = new EventSource(url);

    this.eventSource.addEventListener('connected', (event) => {
      console.log('Stream connected:', JSON.parse(event.data));
    });

    this.eventSource.addEventListener('message', (event) => {
      const data = JSON.parse(event.data);
      callbacks.onMessage(data);
    });

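    // 'error' fires both for server-sent `event: error` frames (which carry
    // data) and for connection failures (which do not), so handle both here
    // instead of parsing unconditionally.
    this.eventSource.addEventListener('error', (event) => {
      const raw = (event as MessageEvent).data;
      if (typeof raw === 'string') {
        callbacks.onError(JSON.parse(raw));
      } else {
        callbacks.onError({ message: 'Connection lost' });
      }
      // Closing also stops EventSource's automatic reconnection
      this.close();
    });

    this.eventSource.addEventListener('complete', () => {
      callbacks.onComplete();
      this.close();
    });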
  }

  close(): void {
    if (this.eventSource) {
      this.eventSource.close();
      this.eventSource = null;
    }
  }
}

Chunked Transfer Encoding for Binary Streams

For binary data or scenarios requiring more control, chunked transfer encoding provides lower-level streaming without SSE's text-based protocol overhead.

// Streaming binary data with chunked encoding
import { Readable } from 'stream';

export async function streamLargeFile(
  reply: FastifyReply,
  fileGenerator: AsyncGenerator<Buffer>
) {
  reply.raw.setHeader('Content-Type', 'application/octet-stream');
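  // Node.js applies chunked encoding on HTTP/1.1 automatically when no
  // Content-Length is set; the header is stated here for clarity only
  reply.raw.setHeader('Transfer-Encoding', 'chunked');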
  reply.raw.setHeader('X-Content-Type-Options', 'nosniff');

  const stream = Readable.from(fileGenerator);

  stream.on('error', (error) => {
    console.error('Stream error:', error);
    if (!reply.raw.headersSent) {
      reply.code(500).send({ error: 'Stream failed' });
    } else {
      reply.raw.end();
    }
  });

  reply.raw.on('close', () => {
    stream.destroy();
  });

  stream.pipe(reply.raw);
}

// Usage: streaming CSV export
async function* generateCSVRows(query: string): AsyncGenerator<Buffer> {
  yield Buffer.from('id,name,value,timestamp\n');

  const cursor = await database.query(query).cursor();

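  for await (const row of cursor) {
    // Values are interpolated as-is: fields containing commas, quotes, or
    // line breaks need CSV quoting before they are written
    const csvLine = `${row.id},${row.name},${row.value},${row.timestamp}\n`;
    yield Buffer.from(csvLine);
  }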
}
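
The CSV generator above writes values directly into each line, which breaks as soon as a field contains a comma, a double quote, or a line break. A minimal sketch of the usual quoting convention (RFC 4180 style) follows; escapeCSVField and toCSVLine are hypothetical helper names, not part of any library.

```typescript
// Hypothetical helper: quote a single CSV field when it needs quoting.
function escapeCSVField(value: unknown): string {
  const text = value === null || value === undefined ? '' : String(value);
  // Quote when the field contains a comma, a double quote, or a line break,
  // and double any embedded quotes.
  if (/[",\r\n]/.test(text)) {
    return `"${text.replace(/"/g, '""')}"`;
  }
  return text;
}

// Hypothetical helper: build one CSV line from a list of field values.
function toCSVLine(fields: unknown[]): string {
  return fields.map(escapeCSVField).join(',') + '\n';
}

// toCSVLine([1, 'Smith, Jane', 'say "hi"', null])
// produces: 1,"Smith, Jane","say ""hi""",
```

In the generator, each row would then be emitted as Buffer.from(toCSVLine([row.id, row.name, row.value, row.timestamp])).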

Common Pitfalls and Solutions

Buffering at infrastructure layers is the most frequent issue. Reverse proxies like nginx buffer responses by default. Either send the X-Accel-Buffering: no response header or set proxy_buffering off for the streaming location in nginx. CDNs and load balancers such as Cloudflare and AWS ALB bring their own buffering and idle-timeout behavior, so check each layer's documentation and test streaming end to end.

Missing backpressure handling causes unbounded memory growth when producers generate data faster than consumers process it. Use Node.js streams with a suitable highWaterMark and respect the return value of write(), which is false once the internal buffer is full:

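async function streamWithBackpressure(reply: FastifyReply, generator: AsyncGenerator<string>) {
  for await (const chunk of generator) {
    // Stop producing once the client has gone away
    if (reply.raw.destroyed) break;

    const canContinue = reply.raw.write(chunk);

    if (!canContinue) {
      // Buffer is full: wait for 'drain', or for 'close' if the client
      // disconnects, so the loop cannot hang on a dead connection
      await new Promise<void>(resolve => {
        const done = () => {
          reply.raw.off('drain', done);
          reply.raw.off('close', done);
          resolve();
        };
        reply.raw.once('drain', done);
        reply.raw.once('close', done);
      });
    }
  }
  if (!reply.raw.destroyed) reply.raw.end();
}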

Connection lifecycle mismanagement leads to resource leaks. Always clean up resources in finally blocks and handle client disconnections:

reply.raw.on('close', () => {
  // Client disconnected - clean up resources
  // Stop the producer; this also runs any finally blocks inside the generator
  void generator.return(undefined);
  database.releaseConnection();
});

Error handling mid-stream requires careful design. Once headers are sent, you cannot change status codes. Send errors as data events:

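try {
  // streaming logic
} catch (error) {
  // A caught value is `unknown` in TypeScript, so narrow it before reading .message
  const message = error instanceof Error ? error.message : 'Unknown error';
  if (!reply.raw.headersSent) {
    reply.code(500).send({ error: message });
  } else {
    // Send error as SSE event
    reply.raw.write(`event: error\ndata: ${JSON.stringify({ error: message })}\n\n`);
    reply.raw.end();
  }
}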

Best Practices Checklist

Set explicit headers: Content-Type, Cache-Control: no-cache, X-Accel-Buffering: no
Implement heartbeats: Send periodic keep-alive events every 15-30 seconds to prevent timeout
Handle reconnection: Use event IDs with Last-Event-ID header for resumable streams
Monitor memory: Track active connections and implement connection limits
Test infrastructure: Verify streaming works through all proxies and CDNs in your stack
Implement timeouts: Set maximum stream duration to prevent infinite connections
Log stream metrics: Track chunk count, duration, errors, and client disconnections
Use compression carefully: gzip can buffer chunks; prefer identity encoding for real-time streams
Validate early: Check authentication and input validation before starting streams
Document event schema: Clearly specify event types and data structures for consumers
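
The heartbeat item in the checklist can be sketched as a small timer helper. In the SSE format, a line starting with a colon is a comment that EventSource ignores, which makes it a convenient keep-alive. startHeartbeat is a hypothetical name, and the write parameter stands in for something like reply.raw.write.

```typescript
// Hypothetical helper: write an SSE comment line on a fixed interval.
// Returns a function that stops the heartbeat.
function startHeartbeat(
  write: (chunk: string) => unknown,
  intervalMs = 15_000
): () => void {
  const timer = setInterval(() => {
    // Comment lines keep the connection active without triggering
    // any event handler on the client.
    write(': heartbeat\n\n');
  }, intervalMs);
  return () => clearInterval(timer);
}

// Usage sketch inside a streaming handler (names assumed):
//   const stopHeartbeat = startHeartbeat((chunk) => reply.raw.write(chunk));
//   reply.raw.on('close', stopHeartbeat);
```

Stopping the timer on close matters as much as starting it; a forgotten interval keeps writing to a dead connection and holds the response object in memory.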

Frequently Asked Questions

When should I use SSE versus WebSockets? Use SSE for unidirectional server-to-client streaming like AI responses, notifications, or live updates. SSE works over standard HTTP, passes through corporate firewalls easily, and reconnects automatically. Choose WebSockets only when you need bidirectional communication with client-to-server messages during the stream.

How do I handle authentication with streaming APIs? Validate authentication before starting the stream. EventSource doesn't support custom headers, so for SSE use cookie-based auth or pass a short-lived token as a query parameter; URLs tend to end up in access logs, so avoid long-lived credentials there. For fetch-based streaming, use standard Authorization headers. Never send auth tokens in SSE data events.
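
A fetch-based consumer has to parse the event stream itself, since only EventSource does that automatically. The following is a minimal sketch: createSSEParser and streamWithAuth are hypothetical names, the parser handles only the event, data, and id fields, and it assumes the server uses LF line endings, as the server code earlier in this article does.

```typescript
interface ParsedEvent {
  event: string;
  data: string;
  id: string | null;
}

// Hypothetical incremental parser: feed it decoded text chunks in arrival
// order and it returns every event completed so far.
function createSSEParser(): (chunk: string) => ParsedEvent[] {
  let buffer = '';
  return (chunk: string): ParsedEvent[] => {
    buffer += chunk;
    const events: ParsedEvent[] = [];
    let boundary: number;
    // A blank line (two consecutive LFs) ends one event.
    while ((boundary = buffer.indexOf('\n\n')) !== -1) {
      const block = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      let event = 'message';
      let id: string | null = null;
      const data: string[] = [];
      for (const line of block.split('\n')) {
        if (line === '' || line.startsWith(':')) continue; // comment or heartbeat
        const colon = line.indexOf(':');
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? '' : line.slice(colon + 1);
        if (value.startsWith(' ')) value = value.slice(1);
        if (field === 'event') event = value;
        else if (field === 'data') data.push(value);
        else if (field === 'id') id = value;
      }
      if (data.length > 0) events.push({ event, data: data.join('\n'), id });
    }
    return events;
  };
}

// Sketch of a consumer that sends the token in a header instead of the URL.
async function streamWithAuth(
  url: string,
  token: string,
  onEvent: (event: ParsedEvent) => void
): Promise<void> {
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${token}`, Accept: 'text/event-stream' },
  });
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parse = createSSEParser();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    for (const event of parse(decoder.decode(value, { stream: true }))) {
      onEvent(event);
    }
  }
}
```

The trade-off is that fetch gives up EventSource's automatic reconnection, so retry logic becomes the caller's responsibility.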

What's the maximum practical stream duration? It depends on the idle and maximum-duration timeouts of every proxy and load balancer in the path. As a rule of thumb, streams of up to 5-10 minutes are usually reliable. Beyond that, implement resumable streams with event IDs. For very long operations (hours), consider job queues with polling or webhooks instead of continuous streaming.
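
Resumable streams need the server to remember recent events so it can replay whatever a reconnecting client missed. The sketch below keeps them in memory; ReplayBuffer is a hypothetical class, and it assumes numeric, monotonically increasing event IDs. A production system would more likely back this with a shared store so it survives restarts and works across instances.

```typescript
interface StoredEvent {
  id: number;
  data: string;
}

// Hypothetical in-memory replay buffer with a fixed capacity.
class ReplayBuffer {
  private events: StoredEvent[] = [];
  private nextId = 1;
  private capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  // Store an event and return it with its assigned ID.
  append(data: string): StoredEvent {
    const stored: StoredEvent = { id: this.nextId++, data };
    this.events.push(stored);
    // Drop the oldest event once the buffer is full.
    if (this.events.length > this.capacity) this.events.shift();
    return stored;
  }

  // Events the client has not seen yet, given its Last-Event-ID header value.
  // A missing or malformed ID replays everything still in the buffer.
  since(lastEventId: string | undefined): StoredEvent[] {
    const last = Number(lastEventId);
    if (lastEventId === undefined || !Number.isInteger(last)) {
      return [...this.events];
    }
    return this.events.filter((event) => event.id > last);
  }
}
```

On reconnect, the handler would read the Last-Event-ID request header, write everything returned by since() first, and then continue with live events, sending each one with its id field set.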

How do I test streaming endpoints? Use curl --no-buffer for SSE testing. Write integration tests that consume streams and verify event ordering, error handling, and completion. Load test with tools like k6 that support SSE and measure concurrent connection limits.

Can I use streaming APIs with serverless functions? It depends on the platform and on how the function is invoked. Many serverless setups follow a strict request-response model and return the response only after the function finishes, which rules out streaming. AWS Lambda supports it through function URLs with response streaming enabled; otherwise, deploy to container-based services like Cloud Run, ECS, or Kubernetes.

How do I implement rate limiting for streaming endpoints? Apply rate limits before starting the stream based on user identity. Track active streams per user and enforce concurrent connection limits. For data volume limits, track bytes sent and terminate streams exceeding quotas.
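
Tracking active streams per user can be sketched with a counter map. StreamLimiter is a hypothetical class that lives in a single process; with several server instances, the counts would need to move to a shared store.

```typescript
// Hypothetical per-user limiter for concurrent streams (single process only).
class StreamLimiter {
  private active = new Map<string, number>();
  private maxPerUser: number;

  constructor(maxPerUser: number) {
    this.maxPerUser = maxPerUser;
  }

  // Returns a release function, or null when the user is at the limit.
  acquire(userId: string): (() => void) | null {
    const current = this.active.get(userId) ?? 0;
    if (current >= this.maxPerUser) return null;
    this.active.set(userId, current + 1);

    let released = false;
    return () => {
      if (released) return; // guard against releasing twice
      released = true;
      const remaining = (this.active.get(userId) ?? 1) - 1;
      if (remaining <= 0) this.active.delete(userId);
      else this.active.set(userId, remaining);
    };
  }
}

// Usage sketch inside a handler (names assumed):
//   const release = limiter.acquire(userId);
//   if (!release) return reply.code(429).send({ error: 'Too many streams' });
//   reply.raw.on('close', release);
```

Returning a release function rather than exposing a separate release(userId) method makes it harder to decrement the wrong counter or to decrement twice.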

What's the performance overhead of streaming versus buffered responses? Streaming reduces memory usage significantly but adds slight CPU overhead for chunking and flushing. For large responses, streaming improves overall throughput by allowing parallel processing and transmission. The perceived performance improvement for users far outweighs any minor technical overhead.
