graphwiz.ai

Scaling Self-Hosted Cloud Applications: From 1K to 100K+ Users

Categories: DevOps, Infrastructure
Tags: DevOps, scaling, Kubernetes, Nextcloud, OpenDesk, infrastructure

Introduction

Self-hosting cloud applications gives you control over data sovereignty—but that control comes with scaling responsibility. Unlike SaaS platforms that abstract infrastructure away, self-hosted solutions like Nextcloud, OpenDesk, or Matrix require deliberate architecture decisions as user counts grow.

This guide maps four scaling tiers, using Nextcloud and OpenDesk as practical case studies:

Tier    Users     Architecture
------  --------  ----------------------------
Tier 1  1,000     Single server, optimized
Tier 2  10,000    Multi-service, load balanced
Tier 3  50,000    Clustered, distributed
Tier 4  100,000+  Multi-region, enterprise

Scaling Tiers Overview


Tier 1: 1,000 Users — Single Server, Optimized

At 1,000 users with ~10-15% concurrent usage, a well-tuned single server suffices. The focus is on optimization, not distribution.

[Diagram: Tier 1 single-server architecture]

Hardware Baseline

CPU:     8 vCPU
RAM:     32 GB
Storage: 500 GB NVMe SSD (or S3-compatible object storage)
Network: 1 Gbps

Database Configuration

PostgreSQL (recommended for performance):

# postgresql.conf
shared_buffers = 8GB
effective_cache_size = 24GB
max_connections = 200
work_mem = 64MB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
wal_buffers = 64MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
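The two memory settings above follow the common rule of thumb of roughly 25% of RAM for shared_buffers and roughly 75% for effective_cache_size. A small helper (a sketch of that heuristic, not an official formula) makes the arithmetic explicit:

```python
def pg_memory_settings(ram_gb: int) -> dict:
    """Rule-of-thumb PostgreSQL sizing: ~25% of RAM for shared_buffers,
    ~75% for effective_cache_size (a planner hint, not an allocation)."""
    return {
        "shared_buffers": f"{ram_gb // 4}GB",
        "effective_cache_size": f"{ram_gb * 3 // 4}GB",
    }

print(pg_memory_settings(32))  # the 32 GB Tier 1 server from above
```

For 32 GB of RAM this reproduces the 8GB/24GB values in the config block.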

MySQL/MariaDB alternative:

[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
max_connections = 200
# Query cache applies to MariaDB only; it was removed entirely in MySQL 8.0
query_cache_type = 1
query_cache_size = 64M
transaction_isolation = READ-COMMITTED

Caching Stack

Single-server caching uses APCu for local cache and Redis for locking:

// Nextcloud config.php
'memcache.local' => '\OC\Memcache\APCu',
'memcache.locking' => '\OC\Memcache\Redis',
'redis' => [
    'host' => 'localhost',
    'port' => 6379,
],

PHP-FPM Tuning

[www]
pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 500

Memory calculation: each PHP-FPM worker consumes ~50-100 MB. With 32 GB RAM and 8 GB reserved for the database, roughly 150-200 workers would fit in theory; the pm.max_children = 50 above is deliberately conservative, leaving headroom for Redis, OPcache, and the OS page cache.
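That headroom figure can be sanity-checked with a quick calculation (the 100 MB worst-case per-worker figure and the 4 GB OS reservation are assumptions):

```python
def max_fpm_workers(total_ram_gb: int, db_ram_gb: int,
                    os_reserve_gb: int = 4, worker_mb: int = 100) -> int:
    """Upper bound on PHP-FPM workers: RAM left over after the database
    and OS, divided by worst-case per-worker memory."""
    available_mb = (total_ram_gb - db_ram_gb - os_reserve_gb) * 1024
    return available_mb // worker_mb

print(max_fpm_workers(32, 8))  # ~204 workers fit; pm.max_children = 50 is conservative
```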

OPcache Configuration

opcache.enable = 1
opcache.memory_consumption = 256
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 10000
opcache.revalidate_freq = 60
; opcache.fast_shutdown was removed in PHP 7.2; omit it on modern PHP

Storage Strategy

Option A: Local NVMe — Fastest for small deployments

Storage: /var/lib/nextcloud/data → 500GB NVMe

Option B: Object Storage — Better for growth, simpler backup

'objectstore' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'bucket' => 'nextcloud-primary',
        'hostname' => 'minio.internal.example.com',
        'key' => 'access-key',
        'secret' => 'secret-key',
        'use_path_style' => true,
    ],
],

Background Jobs

Use systemd timers instead of web-based AJAX:

# /etc/systemd/system/nextcloudcron.timer
[Unit]
Description = Run Nextcloud cron every 5 minutes

[Timer]
OnBootSec = 5min
OnUnitActiveSec = 5min

[Install]
WantedBy = timers.target

Enable: systemctl enable --now nextcloudcron.timer
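The timer only fires a unit of the same name, so a matching nextcloudcron.service must also exist. A minimal sketch, assuming a standard www-data user and a /var/www/nextcloud install path:

```ini
# /etc/systemd/system/nextcloudcron.service
[Unit]
Description = Run Nextcloud cron

[Service]
Type = oneshot
User = www-data
ExecStart = /usr/bin/php -f /var/www/nextcloud/cron.php
```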


Tier 2: 10,000 Users — Multi-Service, Load Balanced

At 10,000 users, single-server bottlenecks emerge. You need load balancing and read replicas.

[Diagram: Tier 2 load-balanced architecture]

Hardware Sizing

Component          Spec               Count
-----------------  -----------------  -----------
Web Nodes          8 vCPU, 16 GB RAM  3
Database Primary   8 vCPU, 32 GB RAM  1
Database Replicas  4 vCPU, 16 GB RAM  2
Redis              4 vCPU, 16 GB RAM  3 (cluster)
Object Storage     MinIO cluster      4+ nodes

Load Balancer Configuration (HAProxy)

frontend nextcloud_https
    bind *:443 ssl crt /etc/ssl/nextcloud.pem
    acl url_discovery path /.well-known/caldav /.well-known/carddav
    http-request redirect location /remote.php/dav/ code 301 if url_discovery
    default_backend nextcloud_servers

backend nextcloud_servers
    balance leastconn
    option httpchk
    http-check send meth HEAD uri /status.php hdr Host nextcloud.example.com
    http-check expect status 200
    server web1 10.0.1.1:9000 check inter 5s fall 3 rise 2
    server web2 10.0.1.2:9000 check inter 5s fall 3 rise 2
    server web3 10.0.1.3:9000 check inter 5s fall 3 rise 2

Database Read Replicas

Nextcloud supports native read/write splitting (since v29):

// config.php
'dbreplica' => [
    ['user' => 'nc_user', 'password' => 'pass1', 'host' => 'db-replica-1', 'dbname' => 'nextcloud'],
    ['user' => 'nc_user', 'password' => 'pass2', 'host' => 'db-replica-2', 'dbname' => 'nextcloud'],
],

Read queries automatically route to replicas; writes go to primary.

Redis Cluster

Distributed caching and file locking:

'memcache.local' => '\OC\Memcache\APCu',
'memcache.distributed' => '\OC\Memcache\Redis',
'memcache.locking' => '\OC\Memcache\Redis',
'redis.cluster' => [
    'seeds' => [
        'redis-1:7000',
        'redis-2:7000',
        'redis-3:7000',
    ],
],

Session Storage

Web nodes are stateless; sessions go to Redis:

// php.ini
session.save_handler = redis
session.save_path = "tcp://redis-1:6379?weight=1,tcp://redis-2:6379?weight=1"

Critical: Shared Configuration

All web nodes must share:

  • Same Redis cluster (distributed cache + locking)
  • Same database (primary + replicas)
  • Same object storage (not local disk)
  • Same config.php (synced via rsync or shared volume)

Tier 3: 50,000 Users — Kubernetes, Clustered

At 50,000 users, Kubernetes becomes essential for orchestration, auto-scaling, and resilience.

[Diagram: Tier 3 Kubernetes clustered architecture]

Architecture Overview

flowchart TD
    subgraph K8s["Kubernetes Cluster"]
        subgraph Ingress["Ingress Layer"]
            ING["NGINX Ingress + cert-manager"]
        end
        subgraph App["Application Layer"]
            P1["Nextcloud Pod 1"]
            P2["Nextcloud Pod 2"]
            P3["Nextcloud Pod 3"]
            PN["Nextcloud Pod N"]
            HPA["HPA: min=5, max=20 (CPU > 70%)"]
        end
        subgraph Data["Data Layer"]
            PG["PostgreSQL Cluster<br/>(Patroni)"]
            RD["Redis Cluster<br/>(6 nodes)"]
            MN["MinIO Cluster<br/>(4+ nodes)"]
        end
        ING --> P1
        ING --> P2
        ING --> P3
        ING --> PN
        P1 --> PG
        P1 --> RD
        P1 --> MN
    end

Kubernetes Deployment (OpenDesk Pattern)

OpenDesk demonstrates production Kubernetes architecture:

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nextcloud-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nextcloud
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Database Clustering (Patroni/PostgreSQL)

# Patroni cluster for high availability
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
spec:
  serviceName: postgresql-headless
  replicas: 3
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql  # must match spec.selector above
    spec:
      containers:
      - name: postgresql
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
          limits:
            cpu: "8"
            memory: "32Gi"

Multi-Bucket Object Storage

For 50K+ users, distribute files across multiple S3 buckets:

'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 64,
        'bucket' => 'nextcloud-',
        'hostname' => 'minio.internal.example.com',
        'key' => 'access-key',
        'secret' => 'secret-key',
        'use_path_style' => true,
    ],
],
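Conceptually, multi-bucket mode assigns each user a stable bucket by hashing the user id, so all of a user's files land in one bucket. A simplified illustration (not Nextcloud's exact mapper implementation):

```python
import hashlib

def bucket_for_user(user_id: str, num_buckets: int = 64,
                    prefix: str = "nextcloud-") -> str:
    """Stable user-to-bucket assignment: hash the user id and take it
    modulo the bucket count."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return f"{prefix}{int(digest, 16) % num_buckets}"

print(bucket_for_user("alice"))  # deterministic: same user, same bucket
```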

Component Resource Guidelines

Based on OpenDesk scaling documentation:

Component       Per X Users      CPU      RAM    Notes
--------------  ---------------  -------  -----  -------------------------
Nextcloud       500 concurrent   2 vCPU   4 GB   Scale horizontally
Collabora       15 active users  1 vCPU   50 MB  Stateful; sticky sessions
Jitsi (JVB)     200 concurrent   4 vCPU   8 GB   Video transcoding
Matrix/Element  10K total        15 vCPU  12 GB  Federation doubles load
PostgreSQL      Cluster-wide     16 vCPU  64 GB  Primary + 2 replicas
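These per-component figures turn capacity planning into arithmetic. A sketch (the 15% concurrency assumption is carried over from the Tier 1 section):

```python
def pods_needed(total_users: int, users_per_pod: int,
                concurrency_pct: int = 15) -> int:
    """Estimate pod count from total users, expected concurrency, and the
    per-pod capacity figures in the table above (uses ceiling division)."""
    concurrent = -(-total_users * concurrency_pct // 100)
    return max(1, -(-concurrent // users_per_pod))

print(pods_needed(50_000, 500))  # Nextcloud at 50K users -> 15 pods
```

The result lands comfortably inside the HPA bounds (min=5, max=20) configured earlier.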

Monitoring Stack

# kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi

alertmanager:
  config:
    route:
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'

grafana:
  additionalDataSources:
  - name: Loki
    type: loki
    url: http://loki:3100

Key Metrics to Monitor

  • Pod scaling events — HPA triggers indicate capacity pressure
  • Database connection pool saturation — Approaching max_connections
  • Redis memory usage — Cache eviction rates
  • Object storage latency — S3/MinIO response times
  • PHP-FPM queue length — Requests waiting for workers

Tier 4: 100,000+ Users — Multi-Region, Enterprise

At 100K+ users, single-region deployments hit limits. You need multi-region architecture, global load balancing, and sophisticated failure handling.

[Diagram: Tier 4 multi-region enterprise architecture]

Architecture Overview

flowchart TD
    GSLB["Global Load Balancer (GSLB)<br/>Route53 / CloudFlare / PowerDNS"]

    subgraph EU["Region EU"]
        EU_K8s["K8s Cluster (20+ nodes)"]
        EU_PG["PostgreSQL Primary"]
        EU_MN["MinIO Cluster (sync)"]
    end

    subgraph US["Region US"]
        US_K8s["K8s Cluster (20+ nodes)"]
        US_PG["PostgreSQL Primary"]
        US_MN["MinIO Cluster (sync)"]
    end

    subgraph AP["Region AP"]
        AP_K8s["K8s Cluster (20+ nodes)"]
        AP_PG["PostgreSQL Primary"]
        AP_MN["MinIO Cluster (sync)"]
    end

    GSLB --> EU
    GSLB --> US
    GSLB --> AP
    EU_MN <--> US_MN
    US_MN <--> AP_MN

Tenant Isolation Strategy

Option A: Namespace per Tenant (Kubernetes)

# Each organization gets isolated namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme-corp
  labels:
    tenant: acme-corp
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme-corp
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: acme-corp

Option B: Database per Tenant

-- Tenant isolation at database level
CREATE DATABASE nextcloud_acme;
CREATE DATABASE nextcloud_globex;

-- Row-level security for a shared database
-- (current_tenant() is an application-defined helper, not a built-in)
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON files USING (tenant_id = current_tenant());
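Since current_tenant() must be supplied by the application, one common pattern (a sketch, with app.tenant_id as an assumed per-session setting name) wraps current_setting():

```sql
-- Application-defined tenant lookup backed by a session variable
CREATE FUNCTION current_tenant() RETURNS text
    AS $$ SELECT current_setting('app.tenant_id') $$
    LANGUAGE sql STABLE;

-- Each connection declares its tenant before touching tenant-scoped tables
SET app.tenant_id = 'acme-corp';
```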

Global Database Strategy

Synchronous replication within region, async between regions:

flowchart TD
    subgraph RegionEU["Region EU"]
        EU_P["Primary"]
        EU_R1["Replica 1"]
        EU_R2["Replica 2"]
        EU_P --> EU_R1 --> EU_R2
    end

    subgraph RegionUS["Region US"]
        US_S["Standby (Promotable)"]
        US_R["Replica"]
        US_S --> US_R
    end

    EU_P -.->|"Async Stream"| US_S

CDN and Edge Caching

# CloudFlare / Fastly CDN rules
rules:
  - match:
      path: "/remote.php/dav/files/*"
    caching:
      enabled: false  # DAV is dynamic
  - match:
      path: "/apps/files/*"
    caching:
      enabled: true
      ttl: 3600
  - match:
      paths:
        - "/css/*"
        - "/js/*"
    caching:
      enabled: true
      ttl: 86400

Capacity Planning at Scale

Per-region sizing for 100K users:

Component          Instances  Each Spec
-----------------  ---------  ---------------------
Web/API Pods       50-100     4 vCPU, 8 GB
Database Primary   1          32 vCPU, 128 GB
Database Replicas  4-6        16 vCPU, 64 GB
Redis Cluster      9 (3x3)    8 vCPU, 32 GB
MinIO Nodes        12+        16 vCPU, 64 GB, NVMe
Load Balancers     3          8 vCPU, 16 GB

Failure Modes and Mitigations

Failure                   Impact        Mitigation
------------------------  ------------  -----------------------------------------
Single pod crash          None          K8s recreates automatically
Node failure              Minimal       Pods reschedule; PDB ensures availability
AZ failure                Degraded      Cross-AZ deployment, multi-AZ storage
Region failure            Failover      GSLB routes to healthy region
Database primary failure  Brief outage  Patroni failover to replica (<30s)
Object storage failure    Severe        Multi-region replication
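The PDB in the table is a PodDisruptionBudget, which caps voluntary disruptions such as node drains and rolling upgrades. A minimal sketch; the name and app label are assumptions matching the earlier deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nextcloud-pdb
spec:
  minAvailable: 3          # never drain below 3 Nextcloud pods
  selector:
    matchLabels:
      app: nextcloud
```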

OpenDesk vs Vanilla Nextcloud: Scaling Comparison

OpenDesk (German government's open-source workspace) provides a reference architecture for scaled deployments:

Aspect        Vanilla Nextcloud         OpenDesk
------------  ------------------------  -------------------------------
Deployment    VM or containers          Kubernetes-only
Architecture  Monolithic PHP            Modular microservices
Database      MySQL/MariaDB/PostgreSQL  PostgreSQL with clustering
Auth          Built-in or external      Keycloak + OpenLDAP (decoupled)
Scaling       Manual configuration      Helm charts with autoscaling
Office Suite  Optional app              Collabora/OnlyOffice integrated
Video         External (BigBlueButton)  Jitsi Meet (integrated)

OpenDesk Component Scaling

From official documentation:

Collabora (Document Editing):
  Per 15 active users: 1 vCPU, 50 MB RAM

Jitsi (Video Conferencing):
  Per JVB (200 concurrent): 4 vCPU, 8 GB RAM
  Scale JVBs horizontally, use Octo for load balancing

Matrix/Element (Chat):
  Per 10K users (federation ON): 15 vCPU, 12 GB RAM
  Per 10K users (federation OFF): 10 vCPU, 8 GB RAM
  Federation adds 2-5x resource overhead

Scaling Decision Matrix

Factor                    Scale Vertically      Scale Horizontally
------------------------  --------------------  --------------------------
Application type          Stateful, monolithic  Stateless, microservices
Traffic pattern           Steady, predictable   Variable, bursty
Availability requirement  Best effort           High availability (99.9%+)
Data consistency          Strong, immediate     Eventual acceptable
Operational complexity    Keep simple           Accept complexity
Budget                    Limited               Flexible
Geographic distribution   Single region         Multi-region

General Principles

  1. Start vertical, then horizontal — Optimize single instance before adding complexity
  2. Stateless first — 12-factor app principles enable horizontal scaling
  3. Shared nothing — Each process independent, state in backing services
  4. Cache aggressively — Multi-layer: CDN → App → DB
  5. Monitor everything — You can't scale what you can't measure
  6. Automate scaling — HPA, VPA, Cluster Autoscaler
  7. Design for failure — Components will fail; plan for it
  8. Right-size continuously — Over-provisioning wastes money

Quick Reference: Configuration by Tier

Tier 1 (1K Users)

// Minimal config
'memcache.local' => '\OC\Memcache\APCu',
'memcache.locking' => '\OC\Memcache\Redis',

Tier 2 (10K Users)

// Load balanced with replicas
'dbreplica' => [
    ['host' => 'db-replica-1'],
    ['host' => 'db-replica-2'],
],
'memcache.distributed' => '\OC\Memcache\Redis',
'redis.cluster' => ['seeds' => ['redis-1:7000', 'redis-2:7000', 'redis-3:7000']],

Tier 3 (50K Users)

// Kubernetes with multi-bucket object storage
'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 64,
        'bucket' => 'nextcloud-',
    ],
],

Tier 4 (100K+ Users)

// Multi-region with global load balancing
'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 128,
        'bucket' => 'nextcloud-',
        'region' => 'eu-west-1',
    ],
],
'trusted_proxies' => ['10.0.0.0/8', '172.16.0.0/12'],
'overwriteprotocol' => 'https',

Conclusion

Scaling self-hosted cloud applications follows predictable patterns:

  • 1K users: Tune the stack, optimize single server
  • 10K users: Add load balancing and read replicas
  • 50K users: Kubernetes orchestration, clustered databases
  • 100K+ users: Multi-region, tenant isolation, global load balancing

The key insight: scaling is architecture, not just hardware. Decisions made at 1K users—storage backend, caching strategy, state management—determine how smoothly you reach 100K.

Start with 12-factor principles: stateless processes, attached resources, horizontal scaling. Then layer on product-specific optimizations (Nextcloud's Redis locking, OpenDesk's Kubernetes-native components).

Invest in observability early. You cannot scale what you cannot measure.