Content Role: pillar

Federated Learning: Privacy-First ML Training

Decentralized model training without centralizing sensitive data

Traditional machine learning requires aggregating training data in a central location. This approach creates privacy risks, regulatory compliance challenges, and data transfer bottlenecks. Healthcare providers cannot share patient records. Financial institutions face strict data residency requirements. Mobile applications generate sensitive user data that should never leave devices.

Federated learning architecture solves this problem by training models where data lives. Instead of moving data to the model, the model moves to the data. Clients train locally on their data, then share only model updates with a central server. The server aggregates these updates without ever accessing raw data.

This article explains how federated learning architecture works, demonstrates implementation patterns with TypeScript, and provides practical guidance for building production systems.

The Core Problem with Centralized Training

Centralized machine learning creates three fundamental problems:

Privacy exposure: Collecting raw data in one location creates a single point of failure. Data breaches expose all training data simultaneously. Even with encryption at rest, the central server must decrypt data for training.

Regulatory barriers: GDPR, HIPAA, and other regulations restrict data movement across jurisdictions. Healthcare data cannot leave hospital networks. Financial data must remain in specific geographic regions. These constraints make centralized training legally impossible for many use cases.

Infrastructure costs: Transferring gigabytes or terabytes of training data consumes bandwidth and storage. Edge devices with limited connectivity struggle to upload large datasets. Network costs scale linearly with data volume.

Federated learning architecture addresses all three issues by keeping data distributed and training models collaboratively.

How Federated Learning Architecture Works

The federated learning architecture consists of four key components:

Central server: Maintains the global model and orchestrates training rounds. The server never accesses raw training data—only aggregated model updates.

Clients: Individual devices, hospitals, or organizations that hold private data. Each client trains a local model copy on their data.

Communication protocol: Defines how clients receive model weights, compute updates, and transmit gradients back to the server.

Aggregation algorithm: Combines client updates into a single global model update. The most common approach is Federated Averaging (FedAvg).

The Training Process

A typical federated learning round follows this sequence:

Server broadcasts current global model weights to selected clients
Each client downloads the model and trains on local data for several epochs
Clients compute model updates (weight deltas or gradients)
Clients send updates back to the server
Server aggregates updates using weighted averaging
Server updates global model and repeats

This process continues until the model converges or reaches a performance threshold.

Implementing Federated Learning Architecture

Here's a practical implementation of a federated learning server in TypeScript:

interface ModelWeights {
  layers: number[][];
  version: number;
}

interface ClientUpdate {
  clientId: string;
  weights: number[][];
  sampleCount: number;
  loss: number;
}

class FederatedLearningServer {
  private globalModel: ModelWeights;
  private clients: Set<string>;
  private minClientsPerRound: number;

  constructor(initialWeights: number[][], minClients: number = 3) {
    this.globalModel = {
      layers: initialWeights,
      version: 0
    };
    this.clients = new Set();
    this.minClientsPerRound = minClients;
  }

  registerClient(clientId: string): void {
    this.clients.add(clientId);
  }

  getGlobalModel(): ModelWeights {
    return {
      layers: this.globalModel.layers.map(layer => [...layer]),
      version: this.globalModel.version
    };
  }

  async aggregateUpdates(updates: ClientUpdate[]): Promise<ModelWeights> {
    if (updates.length < this.minClientsPerRound) {
      throw new Error(`Insufficient clients: ${updates.length} < ${this.minClientsPerRound}`);
    }

    const totalSamples = updates.reduce((sum, u) => sum + u.sampleCount, 0);

    // Federated Averaging: weighted average by sample count
    const aggregatedWeights = this.globalModel.layers.map((layer, layerIdx) => {
      return layer.map((_, weightIdx) => {
        const weightedSum = updates.reduce((sum, update) => {
          const weight = update.weights[layerIdx][weightIdx];
          const clientWeight = update.sampleCount / totalSamples;
          return sum + (weight * clientWeight);
        }, 0);
        return weightedSum;
      });
    });

    this.globalModel = {
      layers: aggregatedWeights,
      version: this.globalModel.version + 1
    };

    return this.getGlobalModel();
  }

  selectClientsForRound(fraction: number = 0.3): string[] {
    const clientArray = Array.from(this.clients);
    const numClients = Math.max(
      this.minClientsPerRound,
      Math.floor(clientArray.length * fraction)
    );

    // Random selection without replacement
    const shuffled = clientArray.sort(() => Math.random() - 0.5);
    return shuffled.slice(0, numClients);
  }
}

Client-side training implementation:

class FederatedClient {
  private clientId: string;
  private localData: { features: number[][], labels: number[] };

  constructor(clientId: string, data: { features: number[][], labels: number[] }) {
    this.clientId = clientId;
    this.localData = data;
  }

  async trainLocal(
    globalWeights: number[][],
    epochs: number = 5,
    learningRate: number = 0.01
  ): Promise<ClientUpdate> {
    // Clone weights to avoid modifying global model
    let localWeights = globalWeights.map(layer => [...layer]);
    let totalLoss = 0;

    // Simple gradient descent training loop
    for (let epoch = 0; epoch < epochs; epoch++) {
      for (let i = 0; i < this.localData.features.length; i++) {
        const features = this.localData.features[i];
        const label = this.localData.labels[i];

        // Forward pass (simplified)
        const prediction = this.forward(features, localWeights);
        const loss = Math.pow(prediction - label, 2);
        totalLoss += loss;

        // Backward pass and weight update (simplified)
        const gradients = this.computeGradients(features, prediction, label);
        localWeights = this.updateWeights(localWeights, gradients, learningRate);
      }
    }

    return {
      clientId: this.clientId,
      weights: localWeights,
      sampleCount: this.localData.features.length,
      loss: totalLoss / (epochs * this.localData.features.length)
    };
  }

  private forward(features: number[], weights: number[][]): number {
    // Simplified forward pass
    return features.reduce((sum, f, i) => sum + f * weights[0][i], 0);
  }

  private computeGradients(
    features: number[],
    prediction: number,
    label: number
  ): number[][] {
    const error = prediction - label;
    return [features.map(f => 2 * error * f)];
  }

  private updateWeights(
    weights: number[][],
    gradients: number[][],
    learningRate: number
  ): number[][] {
    return weights.map((layer, i) =>
      layer.map((w, j) => w - learningRate * gradients[i][j])
    );
  }
}

Orchestrating a complete training round:

async function runFederatedTrainingRound(
  server: FederatedLearningServer,
  clients: Map<string, FederatedClient>
): Promise<void> {
  // Select subset of clients
  const selectedClientIds = server.selectClientsForRound(0.3);

  // Broadcast global model
  const globalModel = server.getGlobalModel();

  // Parallel local training
  const updatePromises = selectedClientIds.map(async (clientId) => {
    const client = clients.get(clientId);
    if (!client) throw new Error(`Client ${clientId} not found`);

    return await client.trainLocal(globalModel.layers, 5, 0.01);
  });

  const updates = await Promise.all(updatePromises);

  // Aggregate updates
  const newModel = await server.aggregateUpdates(updates);

  console.log(`Round complete. Model version: ${newModel.version}`);
}

Common Pitfalls

Non-IID data distribution: Clients often have non-identically distributed data. A hospital specializing in cardiology has different patient demographics than a general practice. This skew causes model divergence. Solution: Use client sampling strategies and adaptive learning rates.

Communication bottlenecks: Sending full model weights every round consumes bandwidth. For large models, this becomes prohibitive. Solution: Implement gradient compression, quantization, or sparse updates.

Stragglers: Slow clients delay training rounds. Waiting for all clients creates bottlenecks. Solution: Set timeouts and aggregate updates from available clients only.

Model poisoning: Malicious clients can send corrupted updates to degrade model performance. Solution: Implement Byzantine-robust aggregation algorithms and anomaly detection.

Privacy leakage through gradients: Model updates can leak information about training data through gradient analysis. Solution: Add differential privacy noise to updates before transmission.

Client selection bias: Always selecting the same clients creates sampling bias. Solution: Rotate client selection and track participation statistics.

Best Practices Checklist

[ ] Implement secure aggregation to prevent server from seeing individual updates
[ ] Add differential privacy noise with calibrated epsilon values
[ ] Use gradient clipping to bound update magnitudes
[ ] Implement client authentication and authorization
[ ] Monitor convergence metrics across heterogeneous clients
[ ] Set appropriate timeouts for client responses
[ ] Validate update sizes and ranges before aggregation
[ ] Log participation rates and detect anomalous clients
[ ] Use compression for model weight transmission
[ ] Implement checkpointing for fault tolerance
[ ] Test with realistic non-IID data distributions
[ ] Profile bandwidth and compute requirements per client

FAQ

What's the difference between federated learning and distributed training?

Distributed training splits a single dataset across multiple machines for parallel computation. All machines access the same data. Federated learning trains on separate, private datasets that never leave their original locations. The key distinction is data ownership and privacy.

How many clients do I need for effective federated learning?

Minimum 10-20 clients for meaningful aggregation. Production systems typically use hundreds to millions of clients. More clients improve model generalization but increase coordination complexity. Start with 20-50 clients for prototyping.

Can federated learning work with deep neural networks?

Yes. Federated learning works with any model architecture that supports gradient-based optimization. CNNs, RNNs, and transformers all work. Large models require gradient compression or partial updates to manage communication costs.

How do I handle clients with different computational capabilities?

Implement tiered model architectures where powerful clients train larger models and resource-constrained clients train smaller variants. Alternatively, use adaptive computation where clients train for different numbers of local epochs based on their capabilities.

What happens if clients drop out mid-training?

The server should set timeouts and aggregate updates from available clients only. Track client reliability and adjust selection probabilities. Implement checkpointing so clients can resume from their last state.

How do I debug federated learning systems?

Start with centralized training as a baseline. Simulate federated learning locally with multiple processes. Log convergence metrics per client. Visualize weight distributions across clients. Test with synthetic data before production deployment.

Is federated learning slower than centralized training?

Yes, typically 2-10x slower due to communication overhead and client heterogeneity. The tradeoff is privacy preservation and regulatory compliance. Optimize by reducing communication frequency, compressing updates, and selecting faster clients.

Federated Learning: Privacy-First ML Training

Federated Learning: Privacy-First ML Training

Decentralized model training without centralizing sensitive data

The Core Problem with Centralized Training

How Federated Learning Architecture Works

The Training Process

Implementing Federated Learning Architecture

Common Pitfalls

Best Practices Checklist

FAQ

Comments

More from this blog

Embedding-First Architecture for Real-World LLM Apps

AI/ML Modern Patterns

Containers/K8s Modern Patterns

Containers/K8s Modern Patterns

Containers/K8s Modern Patterns

Command Palette

Federated Learning: Privacy-First ML Training

Decentralized model training without centralizing sensitive data

The Core Problem with Centralized Training

How Federated Learning Architecture Works

The Training Process

Implementing Federated Learning Architecture

Common Pitfalls

Best Practices Checklist

FAQ

Comments

More from this blog