VPN Service Scaling

The Task
The Client, a foreign VPN service provider operating via a Telegram bot, approached us with performance and scalability challenges. The existing service was built on simple synchronous code using SQLite as its database, leading to frequent failures as the user base grew.

Project Goals:

  • Conduct an audit and refactoring to enhance performance.
  • Implement a scalable architecture supporting 10,000+ concurrent users.
  • Demonstrate scaling across 5 countries (Russia, Germany, Netherlands, Finland, USA) with load balancers.
  • Integrate with modern development and infrastructure tools.
System Audit
The audit revealed several critical bottlenecks in the original system:
  • Database: SQLite, a file-based database not designed for concurrent writes. Simultaneous user requests caused lock contention, leading to frequent timeouts.
  • Bot Architecture: Synchronous code (lacking async/await) blocked during I/O operations (API requests, payment processing), limiting throughput to just 10-20 requests/second.
  • VPN Management: The bot directly interacted with VPN servers (creating keys, configurations) without a centralized backend, complicating monitoring and scalability.
  • Security: Absence of strict IP controls, API vulnerabilities, and lack of rate-limiting mechanisms.
  • Scalability: The system supported neither database replication nor horizontal scaling of VPN nodes.
  • Expected Growth Issues: Projected service downtime, payment transaction losses, and account blocks due to IP changes.
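The throughput ceiling described in the audit can be illustrated with a minimal sketch. It is not the Client's code: `handle_request` is a hypothetical stand-in for any I/O-bound call (an API request or a payment check), and the delay value is arbitrary. The synchronous version waits for each call in turn, while the asynchronous version lets all the waits overlap.

```python
import asyncio
import time


async def handle_request(delay: float) -> str:
    # Hypothetical stand-in for an I/O-bound call (API request, payment check).
    await asyncio.sleep(delay)
    return "ok"


async def run_concurrently(n: int, delay: float) -> float:
    """Serve n requests on one event loop; the waits overlap."""
    start = time.perf_counter()
    results = await asyncio.gather(*(handle_request(delay) for _ in range(n)))
    assert all(r == "ok" for r in results)
    return time.perf_counter() - start


def run_sequentially(n: int, delay: float) -> float:
    """Serve n requests one after another; each wait blocks the next request."""
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(delay)  # blocking I/O stand-in
    return time.perf_counter() - start
```

Calling `run_sequentially(10, 0.02)` takes at least 0.2 seconds, while `asyncio.run(run_concurrently(10, 0.02))` finishes in roughly the time of a single request, which is the effect the move to async/await relies on.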
Proposed Solution
We proposed a complete system overhaul with a transition to a microservices architecture:
  • Backend:
    FastAPI (asynchronous Python framework) for business logic, API, and integrations.
  • Bot:
    Aiogram (async Telegram Bot API) for user interface, with Redis-based FSM (Finite State Machine) for state management.
  • Communication Protocol (MCP):
    REST API between the bot and backend secured with API keys and TLS. Request structure follows JSON format with method/params/request_id fields.
  • Database:
    PostgreSQL with primary/replica replication for high availability.
  • VPN Management:
    Marzban for user and configuration management, V2IpLimit for IP change detection.
  • Background Tasks:
    Celery + Beat for automation (subscription checks, notifications, renewals).
  • Payments:
    Integration with payment systems and Telegram Stars.
  • Monitoring:
    Prometheus + Grafana for metrics collection, with alerts routed through a dedicated bot.
Advantages: The asynchronous architecture was expected to deliver a 5-10x performance improvement, PostgreSQL provides ACID-compliant transactions, and the MCP protocol simplifies maintenance while leaving room for future integration of AI agents for development and user support.
The centralized backend also makes node expansion straightforward: new countries are integrated via the Marzban API, with load balancers configured automatically.
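The MCP request structure mentioned above (JSON with method/params/request_id fields) might look like the following minimal sketch. Only the three field names come from the project description; the `McpRequest` class, the `parse_request` helper, and the example method name `subscription.renew` are hypothetical and shown for illustration.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class McpRequest:
    method: str
    params: dict[str, Any]
    # A unique id lets the backend correlate logs and detect duplicate requests.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        # Sorted keys and compact separators give a stable byte representation.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))


def parse_request(raw: str) -> McpRequest:
    """Validate an incoming JSON body and return a typed request."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    missing = {"method", "params", "request_id"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["params"], dict):
        raise ValueError("params must be a JSON object")
    return McpRequest(data["method"], data["params"], data["request_id"])
```

For example, the bot could build `McpRequest("subscription.renew", {"user_id": 42})`, send `to_json()` as the POST body, and the backend would call `parse_request` before dispatching on `method`.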
  • Backend (FastAPI)
    • Models & CRUD: SQLAlchemy
    • Admin Panel: Bootstrap/Jinja2 for user management and bulk messaging
    • Services: Marzban integration and payment processing
    • Security: JWT authentication, rate limiting, HMAC signing of MCP requests
  • Telegram-Bot (Aiogram)
    • Handlers: For basic commands, subscriptions, payments, and settings (including broadcasts)
    • Keyboards: Inline menus (get config, renew subscription, unsubscribe)
    • MCP Integration: POST requests to backend with API key authentication
    • Broadcasts: Receives user_id batches from backend, sends messages with "Support"/"Unsubscribe" buttons
    • FSM: RedisStorage for state management
    • Localization: Multi-language support
  • Deployment
    Docker and Docker Compose for the backend and database, with Nginx as the reverse proxy handling SSL.
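The HMAC signing of MCP requests listed under backend security can be sketched with the Python standard library. This is an assumption about how such signing is typically done, not the project's actual scheme: the choice of SHA-256, the hex encoding, and the function names `sign_body` and `verify_body` are illustrative.

```python
import hashlib
import hmac


def sign_body(body: bytes, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_body(body: bytes, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_body(body, secret)
    return hmac.compare_digest(expected, signature)
```

The bot would send the signature alongside the body (for instance in a request header), and the backend would reject any request for which `verify_body` returns False. Signing the exact bytes that go over the wire avoids mismatches caused by re-serializing the JSON.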
Scaling

To demonstrate scalability, five countries were added, each behind an Nginx round-robin load balancer with 2-3 nodes for traffic distribution. The backend (Marzban) centrally manages all nodes and handles geo-based user allocation.
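Round-robin distribution across the nodes of one country can be illustrated with a small sketch. In the project itself this is handled by Nginx and Marzban; the `NodePool` class and the node names below are hypothetical and only show the rotation logic.

```python
from itertools import cycle


class NodePool:
    """Hands out nodes for a country in round-robin order."""

    def __init__(self, nodes_by_country: dict[str, list[str]]):
        # One independent rotation per country; countries without nodes are skipped.
        self._cycles = {
            country: cycle(nodes)
            for country, nodes in nodes_by_country.items()
            if nodes
        }

    def assign(self, country: str) -> str:
        try:
            return next(self._cycles[country])
        except KeyError:
            raise ValueError(f"no nodes configured for {country!r}") from None
```

With `NodePool({"DE": ["de-1", "de-2"]})`, three consecutive calls to `assign("DE")` return `de-1`, `de-2`, and then `de-1` again, so new subscriptions are spread evenly across the country's nodes.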

  • Distribution:
    Automated node assignment during subscription creation (based on user preferences).
  • Load Balancing:
    Nginx on the load balancer proxies traffic to backend nodes, while Prometheus monitors node health.
  • Replication:
    A read-only PostgreSQL replica handles read queries (statistics, broadcast operations).
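Routing reads to the replica can be sketched as a simple decision on the statement type. This is a deliberately simplified assumption: the `QueryRouter` class and the connection names are hypothetical, and a production router would also need to keep statements such as `SELECT ... FOR UPDATE` and reads that must see fresh writes on the primary, since a replica can lag behind it.

```python
from itertools import cycle
from typing import Optional


class QueryRouter:
    """Sends plain SELECT statements to a replica and everything else to the primary."""

    def __init__(self, primary: str, replicas: Optional[list[str]] = None):
        self.primary = primary
        # Rotate across replicas if more than one is configured.
        self._replicas = cycle(replicas) if replicas else None

    def route(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        # Writes, and all traffic when no replica exists, go to the primary.
        return self.primary
```

Statistics queries and the user lookups behind broadcasts would then resolve to the replica, while subscription updates and payments stay on the primary.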
Results and Benefits

Following implementation:
  • Performance: API response time reduced from 500ms to under 100ms (asynchronous architecture + PostgreSQL). Bot throughput reached 200+ requests/second.
  • Scalability: Support for 10,000+ active users with zero downtime achieved through replication and Celery (automatic subscription renewals, broadcast capacity of 5,000 messages/minute).
  • Security: Strict IP control (automatic blocking upon IP change), 99.9% uptime, DDoS protection (rate-limiting).
  • Business Metrics: 40% increase in user retention (renewal notifications), new payment integration via Telegram Stars (+20% conversion), 100% data migration without loss.
  • Monitoring: Grafana dashboards for country-level traffic analysis, alerts for IP changes/system errors.
The project was completed in 4 phases (infrastructure, backend, bot, migration/launch) and is currently operating successfully.
«SECURITY HOUSE»