Scaling Self-Hosted Cloud Applications: From 1K to 100K+ Users
Table of Contents
- Introduction — Overview of scaling challenges
- Tier 1: 1,000 Users — Single server optimization
- Tier 2: 10,000 Users — Load balancing setup
- Tier 3: 50,000 Users — Kubernetes clustering
- Tier 4: 100,000+ Users — Enterprise architecture
- OpenDesk vs Vanilla Nextcloud — Platform comparison
- Scaling Decision Matrix — Decision framework
- Quick Reference — Configuration templates
- Conclusion — Key takeaways
Introduction
Self-hosting cloud applications gives you control over data sovereignty—but that control comes with scaling responsibility. Unlike SaaS platforms that abstract infrastructure away, self-hosted solutions like Nextcloud, OpenDesk, or Matrix require deliberate architecture decisions as user counts grow.
This guide maps four scaling tiers, using Nextcloud and OpenDesk as practical case studies:
| Tier | Users | Architecture |
|---|---|---|
| Tier 1 | 1,000 | Single server, optimized |
| Tier 2 | 10,000 | Multi-service, load balanced |
| Tier 3 | 50,000 | Clustered, distributed |
| Tier 4 | 100,000+ | Multi-region, enterprise |
Tier 1: 1,000 Users — Single Server, Optimized
At 1,000 users with ~10-15% concurrent usage, a well-tuned single server suffices. The focus is on optimization, not distribution.
Hardware Baseline
CPU: 8 vCPU
RAM: 32 GB
Storage: 500 GB NVMe SSD (or S3-compatible object storage)
Network: 1 Gbps
Database Configuration
PostgreSQL (recommended for performance):
# postgresql.conf
shared_buffers = 8GB
effective_cache_size = 24GB
max_connections = 200
work_mem = 64MB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
wal_buffers = 64MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
MySQL/MariaDB alternative:
[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
max_connections = 200
# Note: the query cache was removed in MySQL 8.0 and is deprecated/disabled in modern MariaDB
transaction_isolation = READ-COMMITTED
Caching Stack
Single-server caching uses APCu for local cache and Redis for locking:
// Nextcloud config.php
'memcache.local' => '\OC\Memcache\APCu',
'memcache.locking' => '\OC\Memcache\Redis',
'redis' => [
'host' => 'localhost',
'port' => 6379,
],
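The two-tier lookup this config produces (fast local APCu first, shared Redis second) can be sketched with plain dictionaries standing in for the two caches; the key name is made up for illustration:

```python
# Toy two-tier cache: check the fast local tier (APCu role) before
# the shared distributed tier (Redis role), promoting hits locally.
from typing import Optional

local: dict[str, str] = {}
shared: dict[str, str] = {"fileid:42": "/alice/report.pdf"}

def cache_get(key: str) -> Optional[str]:
    """Return a cached value, preferring the local tier."""
    if key in local:
        return local[key]
    value = shared.get(key)
    if value is not None:
        local[key] = value  # promote to the local tier for next time
    return value

print(cache_get("fileid:42"))  # fetched from shared, now cached locally
print("fileid:42" in local)    # True
```

The promotion step is why APCu stays useful even on clustered setups: repeat reads never leave the web node.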
PHP-FPM Tuning
[www]
pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 500
Memory calculation: each PHP-FPM worker consumes ~50-100MB. With 32GB of RAM, ~8GB reserved for the database, and headroom for Redis and the OS, roughly 20GB remains for PHP-FPM, an upper bound of about 200 workers at 100MB each. The pm.max_children = 50 above is therefore a conservative starting point; raise it once monitoring confirms real per-worker memory use.
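That sizing arithmetic is worth automating; a small sketch (the 4GB OS/Redis reserve is an assumption, tune it to your host):

```python
# Estimate an upper bound for pm.max_children from available RAM.
# Assumes the worst-case ~100 MB per PHP-FPM worker from the text.

def max_php_fpm_workers(total_ram_gb: float,
                        db_ram_gb: float,
                        system_reserve_gb: float = 4.0,
                        worker_mb: int = 100) -> int:
    """Upper bound on PHP-FPM workers after reserving RAM for the
    database and for the OS/Redis."""
    available_mb = (total_ram_gb - db_ram_gb - system_reserve_gb) * 1024
    return int(available_mb // worker_mb)

# 32 GB server, 8 GB for PostgreSQL, 4 GB reserve -> ~20 GB for PHP-FPM
print(max_php_fpm_workers(32, 8))  # 204
```

Setting pm.max_children well below this bound leaves room for memory spikes during large file operations.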
OPcache Configuration
opcache.enable = 1
opcache.memory_consumption = 256
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 10000
opcache.revalidate_freq = 60
; Note: opcache.fast_shutdown was removed in PHP 7.2; omit it on modern PHP
Storage Strategy
Option A: Local NVMe — Fastest for small deployments
Storage: /var/lib/nextcloud/data → 500GB NVMe
Option B: Object Storage — Better for growth, simpler backup
'objectstore' => [
'class' => '\\OC\\Files\\ObjectStore\\S3',
'arguments' => [
'bucket' => 'nextcloud-primary',
'hostname' => 'minio.internal.example.com',
'key' => 'access-key',
'secret' => 'secret-key',
'use_path_style' => true,
],
],
Background Jobs
Run background jobs via systemd timers instead of the default web-based AJAX cron:
# /etc/systemd/system/nextcloudcron.timer
[Unit]
Description = Run Nextcloud cron every 5 minutes
[Timer]
OnBootSec = 5min
OnUnitActiveSec = 5min
[Install]
WantedBy = timers.target
Enable: systemctl enable --now nextcloudcron.timer
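The timer triggers a matching nextcloudcron.service unit, which is not shown above; a minimal sketch (the install path, PHP binary location, and www-data user are assumptions for a typical Debian/Ubuntu install):

```ini
# /etc/systemd/system/nextcloudcron.service (assumed paths)
[Unit]
Description = Nextcloud cron.php job

[Service]
Type = oneshot
User = www-data
ExecStart = /usr/bin/php -f /var/www/nextcloud/cron.php
```

Remember to also switch the background jobs mode to "Cron" in the Nextcloud admin settings so the AJAX fallback stops firing.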
Tier 2: 10,000 Users — Multi-Service, Load Balanced
At 10,000 users, single-server bottlenecks emerge. You need load balancing and read replicas.
Hardware Sizing
| Component | Spec | Count |
|---|---|---|
| Web Nodes | 8 vCPU, 16GB RAM | 3 |
| Database Primary | 8 vCPU, 32GB RAM | 1 |
| Database Replicas | 4 vCPU, 16GB RAM | 2 |
| Redis | 4 vCPU, 16GB RAM | 3 (cluster) |
| Object Storage | MinIO cluster | 4+ nodes |
Load Balancer Configuration (HAProxy)
frontend nextcloud_https
bind *:443 ssl crt /etc/ssl/nextcloud.pem
acl url_discovery path /.well-known/caldav /.well-known/carddav
http-request redirect location /remote.php/dav/ code 301 if url_discovery
default_backend nextcloud_servers
backend nextcloud_servers
balance leastconn
option httpchk HEAD /status.php HTTP/1.1\r\nHost:\ nextcloud.example.com
http-check expect status 200
server web1 10.0.1.1:9000 check inter 5s fall 3 rise 2
server web2 10.0.1.2:9000 check inter 5s fall 3 rise 2
server web3 10.0.1.3:9000 check inter 5s fall 3 rise 2
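The balance leastconn directive routes each new request to the backend with the fewest in-flight connections, which suits Nextcloud's mix of short API calls and long file transfers. Greatly simplified:

```python
# Toy "leastconn" selection: pick the backend with the fewest
# active connections (what HAProxy's balance leastconn does,
# ignoring weights and slow-start).

def pick_backend(active: dict[str, int]) -> str:
    """Return the backend name with the lowest connection count."""
    return min(active, key=active.get)

print(pick_backend({"web1": 12, "web2": 7, "web3": 12}))  # web2
```

Compare with round-robin, which would keep sending requests to a node already saturated by a few large uploads.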
Database Read Replicas
Nextcloud supports native read/write splitting (since v29):
// config.php
'dbreplica' => [
['user' => 'nc_user', 'password' => 'pass1', 'host' => 'db-replica-1', 'dbname' => 'nextcloud'],
['user' => 'nc_user', 'password' => 'pass2', 'host' => 'db-replica-2', 'dbname' => 'nextcloud'],
],
Read queries automatically route to replicas; writes go to primary.
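The routing behaviour can be illustrated with a toy dispatcher (hypothetical names and logic, not Nextcloud's internal implementation):

```python
import random

# Toy read/write splitter: SELECTs go to a random replica,
# everything else to the primary. Illustration only.

PRIMARY = "db-primary"
REPLICAS = ["db-replica-1", "db-replica-2"]

def route(query: str) -> str:
    """Pick a database host for a SQL statement."""
    if query.lstrip().upper().startswith("SELECT"):
        return random.choice(REPLICAS)
    return PRIMARY

print(route("SELECT * FROM oc_filecache"))   # one of the replicas
print(route("UPDATE oc_users SET uid = 1"))  # db-primary
```

Real splitters also pin a session to the primary after its first write, so a client never reads stale data from a lagging replica.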
Redis Cluster
Distributed caching and file locking:
'memcache.local' => '\OC\Memcache\APCu',
'memcache.distributed' => '\OC\Memcache\Redis',
'memcache.locking' => '\OC\Memcache\Redis',
'redis.cluster' => [
'seeds' => [
'redis-1:7000',
'redis-2:7000',
'redis-3:7000',
],
],
Session Storage
Web nodes are stateless; sessions go to Redis:
; php.ini
session.save_handler = redis
session.save_path = "tcp://redis-1:6379?weight=1,tcp://redis-2:6379?weight=1"
Critical: Shared Configuration
All web nodes must share:
- Same Redis cluster (distributed cache + locking)
- Same database (primary + replicas)
- Same object storage (not local disk)
- Same config.php (synced via rsync or shared volume)
Tier 3: 50,000 Users — Kubernetes, Clustered
At 50,000 users, Kubernetes becomes essential for orchestration, auto-scaling, and resilience.
Architecture Overview
flowchart TD
subgraph K8s["Kubernetes Cluster"]
subgraph Ingress["Ingress Layer"]
ING["NGINX Ingress + cert-manager"]
end
subgraph App["Application Layer"]
P1["Nextcloud Pod 1"]
P2["Nextcloud Pod 2"]
P3["Nextcloud Pod 3"]
PN["Nextcloud Pod N"]
HPA["HPA: min=5, max=20 (CPU > 70%)"]
end
subgraph Data["Data Layer"]
PG["PostgreSQL Cluster<br/>(Patroni)"]
RD["Redis Cluster<br/>(6 nodes)"]
MN["MinIO Cluster<br/>(4+ nodes)"]
end
ING --> P1
ING --> P2
ING --> P3
ING --> PN
P1 --> PG
P1 --> RD
P1 --> MN
end
Kubernetes Deployment (OpenDesk Pattern)
OpenDesk demonstrates production Kubernetes architecture:
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: nextcloud-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: nextcloud
minReplicas: 5
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
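The HPA's core scaling decision is a simple ratio, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch using the manifest's values:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 5,
                     max_replicas: int = 20) -> int:
    """Kubernetes HPA core formula, clamped to the bounds from
    the manifest above."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 5 pods averaging 95% CPU against a 70% target -> scale out to 7
print(desired_replicas(5, 95, 70))  # 7
```

With multiple metrics (CPU and memory above), the HPA evaluates each and takes the largest desired replica count.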
Database Clustering (Patroni/PostgreSQL)
# Patroni cluster for high availability
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
spec:
serviceName: postgresql-headless
replicas: 3
selector:
matchLabels:
app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
containers:
- name: postgresql
resources:
requests:
cpu: "4"
memory: "16Gi"
limits:
cpu: "8"
memory: "32Gi"
Multi-Bucket Object Storage
For 50K+ users, distribute files across multiple S3 buckets:
'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 64,
        'bucket' => 'nextcloud-',
        'hostname' => 'minio.internal.example.com',
        'key' => 'access-key',
        'secret' => 'secret-key',
        'use_path_style' => true,
    ],
],
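Each user is assigned one of the num_buckets buckets by hashing the user id, so a user's files always land in the same bucket. A simplified modulo-hash mapping (Nextcloud's actual mapper differs in detail):

```python
import hashlib

def bucket_for_user(user_id: str, prefix: str = "nextcloud-",
                    num_buckets: int = 64) -> str:
    """Map a user to a stable bucket by hashing the user id.
    Simplified illustration of a multibucket mapper."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return f"{prefix}{int(digest, 16) % num_buckets}"

print(bucket_for_user("alice"))
print(bucket_for_user("alice"))  # same user -> same bucket (stable)
```

Stability matters: changing num_buckets after deployment would remap users away from their existing data, so pick the bucket count before onboarding users.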
Component Resource Guidelines
Based on OpenDesk scaling documentation:
| Component | Per X Users | CPU | RAM | Notes |
|---|---|---|---|---|
| Nextcloud | 500 concurrent | 2 vCPU | 4 GB | Scale horizontally |
| Collabora | 15 active users | 1 vCPU | 50 MB | Stateful - sticky sessions |
| Jitsi (JVB) | 200 concurrent | 4 vCPU | 8 GB | Video transcoding |
| Matrix/Element | 10K total | 15 vCPU | 12 GB | Federation doubles load |
| PostgreSQL | Cluster-wide | 16 vCPU | 64 GB | Primary + 2 replicas |
Monitoring Stack
# kube-prometheus-stack
prometheus:
prometheusSpec:
retention: 30d
resources:
requests:
cpu: 500m
memory: 2Gi
alertmanager:
config:
route:
receiver: 'slack-notifications'
routes:
- match:
severity: critical
receiver: 'pagerduty'
grafana:
additionalDataSources:
- name: Loki
type: loki
url: http://loki:3100
Key Metrics to Monitor
- Pod scaling events — HPA triggers indicate capacity pressure
- Database connection pool saturation — Approaching max_connections
- Redis memory usage — Cache eviction rates
- Object storage latency — S3/MinIO response times
- PHP-FPM queue length — Requests waiting for workers
Tier 4: 100,000+ Users — Multi-Region, Enterprise
At 100K+ users, single-region deployments hit limits. You need multi-region architecture, global load balancing, and sophisticated failure handling.
Architecture Overview
flowchart TD
GSLB["Global Load Balancer (GSLB)<br/>Route53 / Cloudflare / PowerDNS"]
subgraph EU["Region EU"]
EU_K8s["K8s Cluster (20+ nodes)"]
EU_PG["PostgreSQL Primary"]
EU_MN["MinIO Cluster (sync)"]
end
subgraph US["Region US"]
US_K8s["K8s Cluster (20+ nodes)"]
US_PG["PostgreSQL Primary"]
US_MN["MinIO Cluster (sync)"]
end
subgraph AP["Region AP"]
AP_K8s["K8s Cluster (20+ nodes)"]
AP_PG["PostgreSQL Primary"]
AP_MN["MinIO Cluster (sync)"]
end
GSLB --> EU
GSLB --> US
GSLB --> AP
EU_MN <--> US_MN
US_MN <--> AP_MN
Tenant Isolation Strategy
Option A: Namespace per Tenant (Kubernetes)
# Each organization gets isolated namespace
apiVersion: v1
kind: Namespace
metadata:
name: tenant-acme-corp
labels:
tenant: acme-corp
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: tenant-isolation
namespace: tenant-acme-corp
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
tenant: acme-corp
Option B: Database per Tenant
-- Tenant isolation at database level
CREATE DATABASE nextcloud_acme;
CREATE DATABASE nextcloud_globex;
-- Row-level security for a shared database
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
-- Each session identifies its tenant first, e.g.: SET app.current_tenant = 'acme';
CREATE POLICY tenant_isolation ON files
  USING (tenant_id = current_setting('app.current_tenant'));
Global Database Strategy
Synchronous replication within region, async between regions:
flowchart TD
subgraph RegionEU["Region EU"]
EU_P["Primary"]
EU_R1["Replica 1"]
EU_R2["Replica 2"]
EU_P --> EU_R1 --> EU_R2
end
subgraph RegionUS["Region US"]
US_S["Standby (Promotable)"]
US_R["Replica"]
US_S --> US_R
end
EU_P -.->|"Async Stream"| US_S
CDN and Edge Caching
# Cloudflare / Fastly CDN rules
rules:
- match:
path: "/remote.php/dav/files/*"
caching:
enabled: false # DAV is dynamic
- match:
path: "/apps/files/*"
caching:
enabled: true
ttl: 3600
  - match:
      path: "/css/*"
    caching:
      enabled: true
      ttl: 86400
  - match:
      path: "/js/*"
    caching:
      enabled: true
      ttl: 86400
Capacity Planning at Scale
Per-region sizing for 100K users:
| Component | Instances | Each Spec |
|---|---|---|
| Web/API Pods | 50-100 | 4 vCPU, 8 GB |
| Database Primary | 1 | 32 vCPU, 128 GB |
| Database Replicas | 4-6 | 16 vCPU, 64 GB |
| Redis Cluster | 9 (3x3) | 8 vCPU, 32 GB |
| MinIO Nodes | 12+ | 16 vCPU, 64 GB, NVMe |
| Load Balancers | 3 | 8 vCPU, 16 GB |
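Summing the low-end counts in the table gives a feel for the per-region compute footprint (vCPU only, a back-of-envelope sketch):

```python
# Per-region vCPU total at the low end of the sizing table above.
components = {
    "web_pods":    (50, 4),   # (instances, vCPU each)
    "db_primary":  (1, 32),
    "db_replicas": (4, 16),
    "redis":       (9, 8),
    "minio":       (12, 16),
    "lbs":         (3, 8),
}

total_vcpu = sum(count * cpu for count, cpu in components.values())
print(total_vcpu)  # 584
```

Roughly 600 vCPU per region at the low end, before monitoring, logging, and Kubernetes control-plane overhead.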
Failure Modes and Mitigations
| Failure | Impact | Mitigation |
|---|---|---|
| Single pod crash | None | K8s recreates automatically |
| Node failure | Minimal | Pods reschedule, PDB ensures availability |
| AZ failure | Degraded | Cross-AZ deployment, multi-AZ storage |
| Region failure | Failover | GSLB routes to healthy region |
| Database primary failure | Brief outage | Patroni failover to replica (<30s) |
| Object storage failure | Severe | Multi-region replication |
OpenDesk vs Vanilla Nextcloud: Scaling Comparison
OpenDesk (German government's open-source workspace) provides a reference architecture for scaled deployments:
| Aspect | Vanilla Nextcloud | OpenDesk |
|---|---|---|
| Deployment | VM or containers | Kubernetes-only |
| Architecture | Monolithic PHP | Modular microservices |
| Database | MySQL/MariaDB/PostgreSQL | PostgreSQL with clustering |
| Auth | Built-in or external | Keycloak + OpenLDAP (decoupled) |
| Scaling | Manual configuration | Helm charts with autoscaling |
| Office Suite | Optional app | Collabora/OnlyOffice integrated |
| Video | External (BigBlueButton) | Jitsi Meet (integrated) |
OpenDesk Component Scaling
From official documentation:
Collabora (Document Editing):
Per 15 active users: 1 vCPU, 50 MB RAM
Jitsi (Video Conferencing):
Per JVB (200 concurrent): 4 vCPU, 8 GB RAM
Scale JVBs horizontally, use Octo for load balancing
Matrix/Element (Chat):
Per 10K users (federation ON): 15 vCPU, 12 GB RAM
Per 10K users (federation OFF): 10 vCPU, 8 GB RAM
Federation materially increases resource needs (~1.5x in the baselines above; busy federated rooms can push this considerably higher)
Scaling Decision Matrix
| Factor | Scale Vertically | Scale Horizontally |
|---|---|---|
| Application type | Stateful, monolithic | Stateless, microservices |
| Traffic pattern | Steady, predictable | Variable, bursty |
| Availability requirement | Best effort | High availability (99.9%+) |
| Data consistency | Strong, immediate | Eventual acceptable |
| Operational complexity | Keep simple | Accept complexity |
| Budget | Limited | Flexible |
| Geographic distribution | Single region | Multi-region |
General Principles
- Start vertical, then horizontal — Optimize single instance before adding complexity
- Stateless first — 12-factor app principles enable horizontal scaling
- Shared nothing — Each process independent, state in backing services
- Cache aggressively — Multi-layer: CDN → App → DB
- Monitor everything — You can't scale what you can't measure
- Automate scaling — HPA, VPA, Cluster Autoscaler
- Design for failure — Components will fail; plan for it
- Right-size continuously — Over-provisioning wastes money
Quick Reference: Configuration by Tier
Tier 1 (1K Users)
// Minimal config
'memcache.local' => '\OC\Memcache\APCu',
'memcache.locking' => '\OC\Memcache\Redis',
Tier 2 (10K Users)
// Load balanced with replicas
'dbreplica' => [
['host' => 'db-replica-1'],
['host' => 'db-replica-2'],
],
'memcache.distributed' => '\OC\Memcache\Redis',
'redis.cluster' => ['seeds' => ['redis-1:7000', 'redis-2:7000', 'redis-3:7000']],
Tier 3 (50K Users)
// Kubernetes with object storage
'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 64,
        'bucket' => 'nextcloud-',
    ],
],
Tier 4 (100K+ Users)
// Multi-region with global load balancing
'objectstore_multibucket' => [
    'class' => '\\OC\\Files\\ObjectStore\\S3',
    'arguments' => [
        'num_buckets' => 128,
        'bucket' => 'nextcloud-',
        'region' => 'eu-west-1',
    ],
],
'trusted_proxies' => ['10.0.0.0/8', '172.16.0.0/12'],
'overwriteprotocol' => 'https',
Conclusion
Scaling self-hosted cloud applications follows predictable patterns:
- 1K users: Tune the stack, optimize single server
- 10K users: Add load balancing and read replicas
- 50K users: Kubernetes orchestration, clustered databases
- 100K+ users: Multi-region, tenant isolation, global load balancing
The key insight: scaling is architecture, not just hardware. Decisions made at 1K users—storage backend, caching strategy, state management—determine how smoothly you reach 100K.
Start with 12-factor principles: stateless processes, attached resources, horizontal scaling. Then layer on product-specific optimizations (Nextcloud's Redis locking, OpenDesk's Kubernetes-native components).
Invest in observability early. You cannot scale what you cannot measure.