
Payment Orchestration Engine Architecture: Advanced Implementation Strategies

  • Writer: David Pop
  • 2 days ago
  • 19 min read


In our last article, Payment Orchestration Engine Architecture Guide, we explored the fundamentals of payment orchestration engines: what they are, why businesses need them, and how they solve critical problems like multi-provider complexity, high transaction costs, and fragmented data. We covered the eight core architectural components (from API Gateway to Security & Compliance Modules), walked through a complete end-to-end payment flow, and analyzed the build vs. buy decision with detailed cost breakdowns and ROI calculations.


Now, in Part 2, we dive deeper into the advanced implementation strategies that separate basic orchestration platforms from production-grade, enterprise-scale systems. We'll explore microservices architecture patterns that enable independent scaling and fault isolation, intelligent routing strategies from simple rules to ML-powered optimization, performance optimization techniques for sub-500ms latency, and real-world deployment specifications for handling 10M+ transactions per month.


Whether you're building a custom solution with specialized payment engineering expertise or evaluating enterprise providers like Crafting Software, this guide provides the technical depth and proven patterns needed to architect a resilient, high-performance payment orchestration engine that scales with your business while maintaining security, compliance, and reliability.


Let's start with the architectural foundation that makes enterprise-scale orchestration possible: microservices.


Microservices Architecture for Payment Orchestration

Modern payment orchestration engines are built using microservices to achieve scalability, reliability, and independent deployment. Unlike monolithic architectures where a single codebase handles all payment operations, microservices split functionality into independent services that can be developed, deployed, and scaled separately.



Why Microservices?

Independent Scaling: Scale PSP connectors independently based on transaction volume per provider. If Stripe processes 60% of your transactions while Adyen handles 30%, you can run 6 Stripe connector instances and 3 Adyen instances, optimizing resource allocation and costs.


Fault Isolation: If one PSP connector crashes due to a bug or memory leak, other PSPs continue working normally. This isolation prevents a single component failure from taking down your entire payment system.


Technology Flexibility: Use the right tool for each job—Elixir/Erlang for fault-tolerant core routing, Python for machine learning models, Go for high-throughput connectors that need maximum performance. Each service can use the language and framework best suited to its requirements.


Continuous Deployment: Update your Stripe connector to support a new payment method without redeploying your Adyen connector, fraud detection service, or orchestration core. Deploy changes to production multiple times per day with zero downtime using rolling deployments.


Typical Microservices Breakdown

| Service | Responsibility | Technology | Scaling |
| --- | --- | --- | --- |
| API Gateway | Authentication, rate limiting, validation | Elixir/Phoenix, Nginx | Horizontal (10+ instances) |
| Orchestration Core | Routing decisions, fallback logic | Elixir/Erlang | Horizontal (5+ instances) |
| Stripe Connector | Stripe API integration | Elixir/Go | Horizontal (3+ instances) |
| Adyen Connector | Adyen API integration | Elixir/Go | Horizontal (3+ instances) |
| Fraud Detection | Real-time fraud scoring | Python/FastAPI | Horizontal (2+ instances) |
| 3DS Service | 3D Secure authentication | Elixir | Horizontal (2+ instances) |
| Reporting Service | Analytics, dashboards | Elixir/PostgreSQL | Vertical (database) |
| Event Publisher | Outbox pattern, Kafka publishing | Elixir | Horizontal (2+ instances) |

Inter-Service Communication

Synchronous (HTTP/gRPC): For real-time operations requiring immediate response

  • API Gateway → Orchestration Core

  • Orchestration Core → PSP Connectors

  • Orchestration Core → Fraud Detection


Asynchronous (Kafka): For non-blocking operations and event streaming

  • Payment events (initiated, succeeded, failed)

  • Analytics data ingestion

  • Webhook notifications


Example: gRPC Service Definition


Intelligent Routing: Rules, Algorithms, and Machine Learning

The routing engine is the brain of your orchestration platform. It is what transforms a simple multi-PSP integration into an intelligent system that actively optimizes every transaction. Without smart routing, you're just randomly distributing payments across providers. With it, you're making data-driven decisions that directly impact your bottom line: lower fees, higher acceptance rates, and better customer experience.


Routing strategies exist on a spectrum from simple (rule-based) to complex (ML-powered). Most businesses start with rules, graduate to algorithmic optimization, and eventually layer in machine learning as transaction volume justifies the investment. Let's explore each approach, when to use it, and the measurable impact on your business.


1. Rule-Based Routing (Simplest)

Rule-based routing uses explicit, hardcoded logic defined by your payment team. Think of it as a decision tree: "If transaction is in EUR and under €100, route to Adyen. If customer is in the US, route to Stripe. Otherwise, use Checkout.com as fallback."


Merchants define explicit rules:
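
As a minimal sketch, the decision tree quoted above could be written as an ordered list of checks. The `Transaction` fields and the PSP identifiers are hypothetical names chosen for illustration, not part of any real API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    amount: Decimal
    currency: str
    customer_country: str


def route_by_rules(txn: Transaction) -> str:
    """Return the PSP chosen by the first rule that matches."""
    # Rule 1: EUR payments under 100 go to Adyen.
    if txn.currency == "EUR" and txn.amount < Decimal("100"):
        return "adyen"
    # Rule 2: customers in the US go to Stripe.
    if txn.customer_country == "US":
        return "stripe"
    # Fallback when no rule matches.
    return "checkout_com"
```

Rule order matters here: a US customer paying €50 matches the first rule and goes to Adyen, which is exactly the kind of interaction that makes rule sets harder to reason about as they grow.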

Pros: Simple, predictable, easy to debug 

Cons: Static, doesn't adapt to changing conditions


2. Cost-Optimized Routing

Cost-optimized routing automatically calculates the total cost of processing each transaction through every available PSP, then routes to the cheapest option. This accounts for not just the advertised rates (2.9% + $0.30) but also hidden fees like foreign exchange markups, cross-border fees, and card scheme assessments.


Automatically selects PSP with lowest fees:
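
A rough sketch of the idea, assuming each PSP's pricing can be reduced to a percentage, a fixed fee, and surcharges for cross-border and currency conversion. The fee values below are invented for illustration and are not any provider's real pricing.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FeeSchedule:
    percent: Decimal       # headline rate, e.g. 0.029 for 2.9%
    fixed: Decimal         # flat fee per transaction
    cross_border: Decimal  # extra rate for cross-border cards
    fx_markup: Decimal     # extra rate when currency conversion applies


# Hypothetical fee schedules, for illustration only.
FEES = {
    "stripe": FeeSchedule(Decimal("0.029"), Decimal("0.30"), Decimal("0.015"), Decimal("0.010")),
    "adyen": FeeSchedule(Decimal("0.025"), Decimal("0.12"), Decimal("0.020"), Decimal("0.012")),
}


def total_cost(fees: FeeSchedule, amount: Decimal, cross_border: bool, needs_fx: bool) -> Decimal:
    """Sum the headline rate and any surcharges that apply to this transaction."""
    rate = fees.percent
    if cross_border:
        rate += fees.cross_border
    if needs_fx:
        rate += fees.fx_markup
    return amount * rate + fees.fixed


def cheapest_psp(amount: Decimal, cross_border: bool, needs_fx: bool) -> str:
    """Pick the PSP with the lowest total cost for this transaction."""
    return min(FEES, key=lambda psp: total_cost(FEES[psp], amount, cross_border, needs_fx))
```

With these example numbers, a domestic $100 payment is cheapest on the second schedule (2.62 vs. 3.20), but once cross-border and FX surcharges apply the ranking flips (5.70 vs. 5.82). That flip is why the calculation has to include the hidden fees rather than the advertised rate alone.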

Real-World Impact: Merchants can reduce payment costs by 20-40% by dynamically selecting cheapest PSP per transaction.


3. Success-Rate-Optimized Routing

Success-rate-optimized routing analyzes historical transaction data to identify which PSP has the highest approval rate for transactions matching specific characteristics (card type, country, amount range, time of day). It then routes new transactions to the PSP most likely to approve them.


Selects PSP with highest historical acceptance rate:
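
One way to sketch this is a small in-memory tracker keyed by PSP and transaction segment. The class name, the `(card_brand, country)` segment shape, and the `min_samples` guard are assumptions for illustration; a production system would likely compute these aggregates over a trailing time window in a datastore.

```python
from collections import defaultdict


class SuccessRateRouter:
    """Tracks approval counts per (PSP, segment) and picks the best performer."""

    def __init__(self, min_samples=100):
        self.min_samples = min_samples
        self._stats = defaultdict(lambda: [0, 0])  # (psp, segment) -> [successes, attempts]

    def record(self, psp, segment, succeeded):
        stats = self._stats[(psp, segment)]
        stats[0] += 1 if succeeded else 0
        stats[1] += 1

    def success_rate(self, psp, segment):
        successes, attempts = self._stats.get((psp, segment), (0, 0))
        if attempts == 0 or attempts < self.min_samples:
            return None  # not enough history to trust the rate
        return successes / attempts

    def best_psp(self, candidates, segment, default):
        rated = [(self.success_rate(psp, segment), psp) for psp in candidates]
        rated = [(rate, psp) for rate, psp in rated if rate is not None]
        if not rated:
            return default  # fall back when no PSP has enough data
        return max(rated)[1]
```

The `min_samples` guard keeps a PSP with three lucky approvals from outranking one with thousands of observations.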

Example: If Stripe has 96% success rate for US Visa cards but Adyen has 98%, automatically route to Adyen.


4. Machine Learning Routing (Most Advanced)

ML routing trains predictive models on historical transaction data to forecast the probability that each PSP will successfully approve a specific transaction. Rather than relying on simple aggregated success rates, ML models learn complex patterns like "French Visa cards on Tuesday afternoons with amounts €50-€100 have 97% success rate with Adyen but only 92% with Stripe."


Train ML models on historical data to predict transaction success:

Features Used:

  • Transaction amount and currency

  • Customer country and IP address

  • Card type (Visa, Mastercard, Amex)

  • Time of day and day of week

  • Customer's past transaction history

  • PSP's recent performance metrics


Model Training (Python):
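
As a rough illustration only, the sketch below fits one tiny logistic-regression model per PSP in pure Python and routes to the PSP with the highest predicted approval probability. Logistic regression is swapped in here to keep the example dependency-free; a real system would more likely use a library such as scikit-learn or XGBoost with the richer feature set listed above. The training data is synthetic and the single feature (a normalized amount) is an assumption.

```python
import math


def predict(weights, features):
    """Predicted approval probability; the last weight is the bias term."""
    z = weights[-1] + sum(w * v for w, v in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            for j, value in enumerate(features):
                grads[j] += error * value
            grads[-1] += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Synthetic history: one feature, the amount scaled to 0..1.
amounts = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
history = {
    "adyen": [1, 1, 1, 1, 0, 0, 0, 0],   # pretend Adyen approves small amounts
    "stripe": [0, 0, 0, 0, 1, 1, 1, 1],  # pretend Stripe approves large amounts
}
models = {psp: train_logistic(amounts, labels) for psp, labels in history.items()}


def best_psp(models, features):
    """Route to the PSP whose model predicts the highest approval probability."""
    return max(models, key=lambda psp: predict(models[psp], features))
```

Training one model per PSP keeps the models independent, so adding a new provider does not require retraining the others.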

Real-World Results: ML routing can increase overall acceptance rates by 2-5% compared to static rules.


5. Hybrid Routing (Recommended)

Hybrid routing combines multiple strategies in a prioritized decision tree. It applies business rules first (hard constraints), then uses ML predictions (soft optimization), then applies cost constraints (economic efficiency), balancing multiple objectives simultaneously.


Combine multiple strategies with priority:

1. Check Circuit Breaker: If PSP is failing, exclude it

2. Apply Business Rules: If merchant has exclusion list, filter out those PSPs

3. Run ML Model: Predict success probability for each remaining PSP

4. Apply Cost Constraint: If multiple PSPs have >95% predicted success, choose cheapest

5. Select Final PSP: Return PSP with best score

6. Record Decision: Log why this PSP was selected for future analysis
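
The six steps above can be sketched as a single selection function. This is a simplified illustration: the `Candidate` fields are hypothetical, `predicted_success` is assumed to come from an ML model that has already run, and the returned reason string stands in for the decision log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    psp: str
    circuit_open: bool        # True when the circuit breaker has tripped
    predicted_success: float  # assumed output of the ML model, 0.0 to 1.0
    cost: float               # estimated fee for this transaction


def choose_psp(candidates, excluded_psps, success_floor=0.95):
    """Return (psp, reason) following the prioritized steps."""
    # Steps 1-2: drop PSPs with an open circuit or on the merchant's exclusion list.
    eligible = [c for c in candidates
                if not c.circuit_open and c.psp not in excluded_psps]
    if not eligible:
        raise LookupError("no eligible PSP for this transaction")
    # Steps 3-4: among PSPs predicted above the floor, cost decides.
    strong = [c for c in eligible if c.predicted_success > success_floor]
    if strong:
        chosen = min(strong, key=lambda c: c.cost)
        reason = "cheapest above success floor"
    else:
        chosen = max(eligible, key=lambda c: c.predicted_success)
        reason = "highest predicted success"
    # Steps 5-6: return the choice with a reason the caller can log.
    return chosen.psp, reason
```

Returning the reason alongside the PSP makes it cheap to record why each decision was made, which is what later analysis of routing quality depends on.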


Ensuring Low Latency and High Performance

Payment orchestration adds a layer between merchants and PSPs, so minimizing latency is critical. Every millisecond counts: our research shows that each 100ms of latency reduces conversion rates by approximately 1.1%. When your orchestration engine sits in the critical path between "customer clicks pay" and "payment confirmed," you can't afford to add significant overhead.


The challenge is balancing the intelligence of orchestration (smart routing, fraud checks, fallback logic) with the speed customers expect. A well-architected orchestration engine should add no more than 50-100ms to the total payment flow, barely perceptible to users while delivering significant value through optimized routing and automatic failover.


Below, we establish target latencies for each component and explore five battle-tested optimization techniques that keep your orchestration engine fast even under heavy load.


Target Latencies

Understanding where time is spent in your payment flow is essential for optimization. The table below shows realistic targets and typical observed latencies for each component in a production orchestration engine:

| Component | Target Latency | Typical Latency |
| --- | --- | --- |
| API Gateway | < 10ms | 5ms |
| Routing Engine | < 20ms | 12ms |
| PSP Connector | < 500ms | 200-400ms (depends on PSP) |
| Database Write | < 10ms | 5ms |
| Total (P95) | < 600ms | 250-500ms |

A payment flow without orchestration (direct PSP integration) typically takes 200-450ms (just the PSP call + minimal application overhead). A well-optimized orchestration engine adds 50-100ms to this, bringing total time to 250-550ms—a small price for 20-40% cost savings and 2-5% higher acceptance rates.


Performance Optimization Techniques


1. Connection Pooling

Connection pooling reuses established HTTP connections to PSPs instead of creating new TCP connections for each request. Without pooling, every request incurs a full TCP handshake (1 round trip) plus TLS negotiation (1 round trip with TLS 1.3, 2 with TLS 1.2), which can add up to 100-200ms of pure overhead before any data is transmitted.

Impact: Typically reduces latency by 50-100ms per request by eliminating repeated TCP and TLS handshakes.
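
A minimal sketch of the pooling idea, independent of any particular HTTP client. The `ConnectionPool` class and its `factory` argument are hypothetical; in practice most HTTP client libraries ship their own pool, and you would configure that rather than write one.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that creates connections lazily and reuses them."""

    def __init__(self, factory, size):
        self._factory = factory
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(None)  # empty slot; filled on first use
        self.created = 0  # illustrative counter, not synchronized

    @contextmanager
    def connection(self, timeout=1.0):
        conn = self._idle.get(timeout=timeout)  # waits if every slot is in use
        try:
            if conn is None:
                conn = self._factory()  # pay the handshake cost once per slot
                self.created += 1
            yield conn
        except Exception:
            conn = None  # drop a possibly broken connection; slot refills lazily
            raise
        finally:
            self._idle.put(conn)
```

With the standard library, the factory could be something like `lambda: http.client.HTTPSConnection(host)` for a given PSP host. The last-in, first-out queue hands back the most recently used connection, which is the one least likely to have been closed by an idle timeout.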


2. Caching

Caching stores frequently accessed, slow-to-compute data in memory (Redis) so subsequent requests avoid hitting the database or recomputing expensive operations. In orchestration engines, prime caching candidates include PSP configurations, routing rules, and recent fraud scores.

Impact: Reduces database queries by 80-90%, cutting latency by 10-20ms per request.


Track cache hit rate (target: >95%). If hit rate drops below 90%, either increase TTL or increase cache memory allocation. Use Redis with eviction policy allkeys-lru (least recently used) to automatically evict old entries when memory fills.
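
To make the hit-rate metric concrete, here is a small in-process TTL cache that counts hits and misses. It is a stand-in for Redis used purely for illustration; the class name and the `loader` callback are assumptions.

```python
import time


class TTLCache:
    """In-memory cache with per-entry expiry and hit-rate tracking."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: fall through to the slow source and cache the result.
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_rate` as a metric gives you the number to alert on when it drifts below the target.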


3. Asynchronous Processing

Asynchronous processing moves non-critical operations out of the critical path—the sequence of steps that must complete before returning a response to the customer. Operations like analytics logging, webhook notifications, and dashboard updates don't need to block payment confirmation.

Impact: Reduces P95 latency by 50-100ms by not waiting for analytics and webhooks.


Async tasks can fail silently. Use the Outbox Pattern (covered earlier) for mission-critical events like webhooks: write events to the database first, then let background workers process them with retries and dead-letter queues for permanent failures.
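
The Outbox Pattern can be sketched with SQLite standing in for the production database. The table layout follows the `outbox_events` description in the FAQ at the end of this article; the `publish` callback is a hypothetical stand-in for a Kafka producer.

```python
import json
import sqlite3


def init_db(conn):
    conn.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER, status TEXT);
        CREATE TABLE outbox_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT, aggregate_id TEXT, payload TEXT,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def record_payment(conn, payment_id, amount_cents):
    with conn:  # one transaction: both rows commit together or not at all
        conn.execute("INSERT INTO payments VALUES (?, ?, 'succeeded')",
                     (payment_id, amount_cents))
        conn.execute(
            "INSERT INTO outbox_events (event_type, aggregate_id, payload) VALUES (?, ?, ?)",
            ("payment.succeeded", payment_id, json.dumps({"amount_cents": amount_cents})))


def publish_pending(conn, publish, batch_size=100):
    """Publish unpublished events in order; stop at the first failure."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE published = 0 ORDER BY id LIMIT ?", (batch_size,)).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        try:
            publish(event_type, json.loads(payload))
        except Exception:
            break  # leave the row unpublished; the next poll retries it
        with conn:
            conn.execute("UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,))
        sent += 1
    return sent
```

Because a row is only marked after `publish` returns, a crash between the two steps results in a duplicate delivery rather than a lost event, so consumers should be idempotent.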


4. Database Optimization

Database optimization in orchestration engines focuses on separating read and write workloads. Transactional writes (creating payment records) must go to the primary database for consistency. Read-heavy operations (analytics queries, transaction history lookups) can use replicas with eventual consistency (typically 100-500ms lag).

Impact: Offloads 70-80% of queries to replicas, reducing primary database load.


Use PgBouncer with 50-100 connections to the primary and 50-100 connections per replica. This prevents connection exhaustion when traffic spikes; with proper pooling, PostgreSQL handles 200-500 concurrent connections gracefully.


5. Circuit Breaker Pattern

The circuit breaker pattern prevents cascading failures by detecting when a PSP is failing and immediately returning errors instead of waiting for timeouts. Without circuit breakers, if Stripe's API goes down, every request to Stripe waits 5 seconds for timeout before failing, during which your orchestrator's connection pool fills up and new requests queue behind the failing ones.

Impact: Prevents cascading failures and reduces latency during PSP outages from 5s (timeout) to 10ms (circuit open check).


Track circuit state changes with alerts. circuit_opened events should trigger PagerDuty immediately, because they indicate a PSP outage affecting your payment processing. Track time-in-open-state to measure PSP reliability and justify a multi-PSP strategy to leadership.
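
A single-threaded sketch of the state machine, using the closed, open, and half-open states described in the FAQ at the end of this article. The class and parameter names are hypothetical, and a shared deployment would keep this state somewhere all orchestrator instances can see.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the PSP while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, func, *args):
        if self.state == "open":
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpenError("PSP circuit is open")  # fail fast, no PSP call
            self.state = "half_open"  # cooldown elapsed: let one trial request through
        try:
            result = func(*args)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self._threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self._failures = 0
        return result
```

The fast path in the open state is a timestamp comparison, which is where the drop from a multi-second timeout to a few milliseconds comes from.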


Security & Compliance: Building Trust into the Architecture

Payment orchestration engines handle sensitive financial data, making security and compliance non-negotiable.


PCI DSS Compliance

Payment Card Industry Data Security Standard (PCI DSS) defines requirements for storing, processing, and transmitting cardholder data.


Key Requirements:

Requirement 3: Protect Stored Cardholder Data

  • Never store full card numbers in plaintext

  • Use tokenization to replace card data with tokens

  • Encrypt all sensitive data at rest (AES-256)


Implementation:
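
As a conceptual sketch only, tokenization swaps the card number for a random token that has no mathematical relationship to it. The `TokenVault` class below is a hypothetical in-memory stand-in: a real vault would encrypt the mapping at rest, run as an isolated service inside PCI scope, and must never be replaced by code like this.

```python
import secrets


def _digits(pan: str) -> str:
    cleaned = pan.replace(" ", "")
    if not cleaned.isdigit() or not 12 <= len(cleaned) <= 19:
        raise ValueError("not a plausible card number")  # basic sanity check only
    return cleaned


class TokenVault:
    """Illustrative only: holds the token-to-PAN mapping in memory."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan: str) -> str:
        token = "tok_" + secrets.token_urlsafe(16)  # random, not derived from the PAN
        self._pan_by_token[token] = _digits(pan)
        return token

    def detokenize(self, token: str) -> str:
        return self._pan_by_token[token]


def mask_pan(pan: str) -> str:
    """Keep only the last four digits for display and logs."""
    cleaned = _digits(pan)
    return "*" * (len(cleaned) - 4) + cleaned[-4:]
```

Everything outside the vault (orchestration core, logs, analytics) then handles only tokens and masked values.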

Requirement 4: Encrypt Transmission of Cardholder Data

  • All API communication must use TLS 1.2 or higher

  • Use strong cipher suites (TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)

Requirement 8: Identify and Authenticate Access

  • Multi-factor authentication for admin access

  • Role-based access control (RBAC)

Requirement 10: Track and Monitor All Access

  • Log every payment operation (create, update, refund)

  • Retain logs for at least 1 year


PSD2 Strong Customer Authentication (SCA)

The Payment Services Directive 2 (PSD2) is a European regulation that strengthens the security of online payments. It requires Strong Customer Authentication (SCA) to verify a payer’s identity using at least two of three factor types: something they know (like a password or PIN), something they have (like a phone or hardware token), or something they are (like a fingerprint or facial recognition). This ensures that even if one security factor is compromised, unauthorized transactions are still prevented.


Two-Factor Authentication Requirements:

  • Something you know (password, PIN)

  • Something you have (phone, token device)

  • Something you are (fingerprint, face scan)


Implementation:
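
The core check can be sketched in a few lines: SCA needs verified factors from at least two different categories, so two factors of the same kind do not count. The enum and function names are hypothetical, and a real integration would delegate the challenge itself to a 3D Secure flow.

```python
from enum import Enum


class Factor(Enum):
    KNOWLEDGE = "knowledge"    # something you know
    POSSESSION = "possession"  # something you have
    INHERENCE = "inherence"    # something you are


def satisfies_sca(verified_factors) -> bool:
    """True when the verified factors span at least two distinct categories."""
    return len(set(verified_factors)) >= 2
```

For example, a password plus a PIN are both knowledge factors and fail the check, while a password plus a code confirmed on the customer's phone passes.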


Fraud Detection Integration

Fraud detection systems analyze transaction patterns and user behavior to identify potentially fraudulent activity before payments are processed. Instead of relying only on static rules, modern systems evaluate factors like transaction frequency, device fingerprints, and location mismatches in real time. This helps businesses block or review suspicious transactions without disrupting legitimate ones.


Building vs. Buying: Cost and Complexity Analysis

A custom-built Payment Orchestration Platform (POP) gives you complete control over your payment strategy, designed around your specific operations and long-term goals. It lets you create a centralized system to handle everything, from provider integrations and transaction routing to reporting, fraud management, and compliance processes.


Unlike off-the-shelf platforms, you’re not restricted by another company’s roadmap or limitations. However, this freedom also comes with full responsibility for the platform’s architecture, codebase, maintenance, compliance, and scalability.


Build (Custom Development)

Total Cost of Ownership (TCO) - Year 1:

| Component | Cost | Notes |
| --- | --- | --- |
| Development Team (6 months) | $500,000 | 4 engineers × $125k avg. salary |
| Infrastructure (AWS/GCP) | $50,000 | Kubernetes, databases, Kafka |
| PSP Integration Fees | $20,000 | Setup fees for 5-7 PSPs |
| Security & Compliance | $75,000 | PCI DSS audit, penetration testing |
| Maintenance (6 months) | $150,000 | Ongoing development and bug fixes |
| Total Year 1 | $795,000 | |


Ongoing Annual Costs:

  • Maintenance & Feature Development: $300,000/year

  • Infrastructure: $100,000/year

  • Compliance Audits: $50,000/year

  • Total Ongoing: $450,000/year


Pros:

  • Full control over features and roadmap

  • Custom routing logic optimized for your use case

  • No per-transaction fees to third-party orchestrator

  • Can integrate with proprietary internal systems


Cons:

  • High upfront investment ($800k+)

  • 6-12 month time-to-market

  • Requires specialized payment engineering expertise (e.g., Crafting Software)

  • Ongoing maintenance burden

  • Compliance responsibility falls entirely on you


Buy (Third-Party Solution)

Popular Orchestration Platforms:


Pricing Model (Typical):

  • Monthly Platform Fee: $2,000-$10,000 depending on transaction volume

  • Per-Transaction Fee: $0.05-$0.15 per transaction

  • Setup Fee: $10,000-$50,000 (one-time)


Example TCO - Year 1: Assume 100,000 transactions/month = 1.2M transactions/year

| Component | Cost | Calculation |
| --- | --- | --- |
| Setup Fee | $25,000 | One-time |
| Monthly Platform Fee | $60,000 | $5,000 × 12 months |
| Per-Transaction Fees | $120,000 | $0.10 × 1.2M transactions |
| Total Year 1 | $205,000 | |


Ongoing Annual Costs (Year 2+):

  • Monthly Platform Fee: $60,000

  • Per-Transaction Fees: $120,000 (scales with volume)

  • Total Ongoing: $180,000/year


Pros:

  • Fast time-to-market (weeks, not months)

  • Lower upfront investment ($200k vs. $800k)

  • Compliance handled by vendor (PCI DSS Level 1 certified)

  • Pre-built integrations with 50+ PSPs

  • Automatic updates and new features


Cons:

  • Ongoing per-transaction fees (can get expensive at scale)

  • Less control over routing logic and features

  • Vendor dependency (lock-in risk)

  • May not support niche PSPs or custom integrations


Decision Framework

Choose BUILD if:

  • You process >10M transactions/year (cost of buy becomes prohibitive)

  • You have specialized routing requirements not supported by existing platforms

  • You have in-house payment engineering expertise

  • You need deep integration with proprietary internal systems


Choose BUY if:

  • You process <5M transactions/year (cost of build is too high)

  • You need to go live quickly (3-6 months faster)

  • You lack in-house payment expertise

  • You want to offload compliance burden


Hybrid Approach:

  • Start with a third-party platform to validate product-market fit

  • Build custom orchestration later once you reach scale (>10M txns/year)

  • Many companies follow this path (e.g., Uber started with Braintree, later built internal payment systems)
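
To see where the two options cross over, the example figures from the cost tables above can be plugged into a short calculation. This is a sketch using only those illustrative numbers (Year 1 build cost, ongoing build cost, setup fee, platform fee, and a $0.10 per-transaction fee); your own quotes will differ.

```python
BUILD_YEAR1 = 795_000
BUILD_ONGOING = 450_000
BUY_SETUP = 25_000
BUY_PLATFORM_PER_YEAR = 60_000
BUY_FEE_CENTS = 10  # $0.10 per transaction, kept in cents to avoid float rounding


def build_tco(years: int) -> int:
    """Cumulative cost of building, in dollars."""
    return BUILD_YEAR1 + BUILD_ONGOING * (years - 1)


def buy_tco(years: int, txns_per_year: int) -> int:
    """Cumulative cost of buying, in dollars."""
    per_year = BUY_PLATFORM_PER_YEAR + BUY_FEE_CENTS * txns_per_year // 100
    return BUY_SETUP + years * per_year


def breakeven_txns_per_year(years: int) -> int:
    """Annual volume at which cumulative build and buy costs are equal."""
    fixed_gap = build_tco(years) - BUY_SETUP - years * BUY_PLATFORM_PER_YEAR
    return fixed_gap * 100 // (years * BUY_FEE_CENTS)
```

Under these assumptions the break-even over three years lands just under 5M transactions per year, and at 10M per year buying costs roughly $3.2M over three years against about $1.7M for building.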


Migration Strategy: From Legacy to Orchestration

Migrating from a legacy payment system to an orchestration engine requires careful planning to avoid downtime and lost revenue.


Phase 1: Assessment & Planning (Weeks 1-4)

Map Current State:

  • Document all PSP integrations (APIs, SDKs, credentials)

  • Identify payment flows (checkout, recurring billing, refunds)

  • Catalog stored payment methods and customer data

  • Review compliance requirements (PCI DSS, PSD2)


Define Target State:

  • Select orchestration platform (build vs. buy decision)

  • Choose initial PSPs to integrate (usually 2-3 primary + 2 backup)

  • Design routing strategy (rules-based or cost-optimized)


Phase 2: Parallel Integration (Weeks 5-12)

Shadow Mode:

  • Build orchestration platform in parallel with existing system

  • Send a copy of each transaction to both old system and new orchestrator

  • Compare results to verify consistency

  • Do NOT charge customers twice (shadow mode is read-only)


Example Implementation:
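
One possible shape for shadow mode, sketched under the assumption that the orchestrator exposes a dry-run entry point that makes routing decisions without contacting a PSP in charge mode. The function names, the dictionary fields, and the compared keys are all hypothetical.

```python
COMPARED_FIELDS = ("psp", "amount", "currency")  # hypothetical comparison keys


def process_with_shadow(payment, legacy_charge, orchestrator_dry_run, mismatches):
    """Charge through the legacy path and compare against a dry run."""
    result = legacy_charge(payment)  # the only call that actually charges the customer
    try:
        shadow = orchestrator_dry_run(payment)  # must never move money
        diff = {
            field: (result.get(field), shadow.get(field))
            for field in COMPARED_FIELDS
            if result.get(field) != shadow.get(field)
        }
        if diff:
            mismatches.append({"payment_id": payment["id"], "diff": diff})
    except Exception as exc:
        # A shadow failure is recorded but never affects the customer's payment.
        mismatches.append({"payment_id": payment["id"], "error": repr(exc)})
    return result
```

In production the shadow call would normally run asynchronously so it cannot add latency to the live payment; it is inline here only to keep the sketch short.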

Duration: Run shadow mode for 4-8 weeks to validate accuracy.


Phase 3: Gradual Rollout (Weeks 13-20)

Traffic Split: Route percentage of live traffic to orchestrator:

  • Week 13-14: 5% of transactions

  • Week 15-16: 25% of transactions

  • Week 17-18: 50% of transactions

  • Week 19-20: 100% of transactions


Implementation Using Feature Flags:
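
A percentage rollout is often implemented by hashing a stable key into one of 100 buckets. The sketch below assumes the customer ID is the key and that the rollout percentage comes from your feature-flag system; both function names are hypothetical.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_orchestrator(customer_id: str, rollout_percent: int) -> bool:
    """True when this customer falls inside the current rollout percentage."""
    return rollout_bucket(customer_id) < rollout_percent
```

Hashing instead of random sampling keeps each customer on the same path across requests, and raising the percentage from 5 to 25 only adds customers; nobody who was already on the orchestrator is moved back to legacy.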


Monitoring: Track key metrics for orchestrator vs. legacy:

  • Success rate (should be equal or higher)

  • Average latency (should be within 10% of legacy)

  • Error rates (should be equal or lower)


Phase 4: Decommission Legacy (Weeks 21-24)

Final Cutover:

  • Route 100% of traffic to orchestrator for 2 weeks

  • Monitor for any issues

  • Keep legacy system running in read-only mode for reference


Data Migration:

  • Migrate stored payment methods (tokenized cards) to orchestrator's vault

  • Migrate transaction history for reporting and reconciliation

  • Update customer records to point to new tokens


Decommission:

  • Shut down legacy system

  • Archive data for compliance (retain for 7 years)

  • Remove old PSP integrations


Real-World Architecture: High-Volume Payment Platform

Let's design a complete orchestration engine for a high-volume e-commerce platform processing 10 million transactions/month.


Requirements

  • Volume: 10M transactions/month ≈ 333k/day ≈ 3.9 transactions/second average, 15 TPS peak

  • Availability: 99.95% uptime (< 22 minutes downtime/month)

  • Latency: P95 < 500ms, P99 < 1s

  • PSPs: Integrate with 5 primary PSPs (Stripe, Adyen, Checkout.com, Braintree, PayPal)

  • Geographic Coverage: North America, Europe, Asia-Pacific

  • Compliance: PCI DSS Level 1, PSD2 SCA, GDPR
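
The volume and availability figures above follow from simple arithmetic, sketched here assuming a 30-day month. The helper names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def average_tps(transactions_per_month: int, days: int = 30) -> float:
    """Average transactions per second over the month."""
    return transactions_per_month / (days * SECONDS_PER_DAY)


def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime per month allowed by an availability target."""
    return (1 - availability_percent / 100) * days * MINUTES_PER_DAY
```

For 10M transactions a month this gives an average just under 4 TPS, so the 15 TPS peak is roughly four times the average, and a 99.95% target leaves about 21.6 minutes of downtime per month.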


Architecture Diagram

[Figure: Architecture diagram for the high-volume payment platform]

Technology Stack

Every technology choice below is justified by production requirements. We prioritize proven, battle-tested technologies over trendy frameworks—payment systems demand reliability over novelty.

| Component | Technology | Justification |
| --- | --- | --- |
| API Gateway | Elixir/Phoenix, Plug | Low-latency, high concurrency (2M connections per node) |
| Orchestration Core | Elixir/OTP, GenServer | Fault tolerance, supervision trees, hot code reloading |
| PSP Connectors | Elixir with Tesla HTTP client | Connection pooling, circuit breakers |
| Database | PostgreSQL 15 | ACID compliance, JSON support, mature replication |
| Cache | Redis | Sub-millisecond latency, built-in expiration |
| Event Streaming | Apache Kafka | High throughput, durable, ordered event log |
| Monitoring | Prometheus + Grafana | Real-time metrics, alerting |
| Logging | ELK Stack (Elasticsearch, Logstash, Kibana) | Centralized logging, full-text search |
| Infrastructure | Kubernetes (EKS/GKE) | Auto-scaling, rolling deployments, self-healing |

Deployment Architecture

Below is the exact hardware configuration for handling 10M transactions/month with headroom for 3-5x traffic spikes during peak events like Black Friday:

| Component | Nodes | Resources per Node | Total Capacity |
| --- | --- | --- | --- |
| API Gateway | 3 | 4 vCPU, 8GB RAM | 12 vCPU, 24GB RAM |
| Orchestration Core | 5 | 8 vCPU, 16GB RAM | 40 vCPU, 80GB RAM |
| PSP Connectors (total) | 9 (3 per major PSP) | 2 vCPU, 4GB RAM | 18 vCPU, 36GB RAM |
| Kafka Brokers | 3 | 4 vCPU, 16GB RAM | 12 vCPU, 48GB RAM |
| PostgreSQL Primary | 1 | 16 vCPU, 64GB RAM | 16 vCPU, 64GB RAM |
| PostgreSQL Replicas | 2 | 8 vCPU, 32GB RAM | 16 vCPU, 64GB RAM |
| Redis | 2 (primary + replica) | 4 vCPU, 16GB RAM | 8 vCPU, 32GB RAM |

Total Infrastructure:

  • vCPUs: 122

  • RAM: 348GB

  • Estimated Monthly Cost (AWS): $8,000-$12,000


Scaling Strategy

This architecture supports both horizontal scaling (add more nodes) and vertical scaling (use bigger nodes) depending on the bottleneck. Most scaling is horizontal because it's easier to automate via Kubernetes HPA.


Horizontal Scaling (Add More Nodes):

  • API Gateway: Auto-scale from 3 to 10 nodes during peak hours

  • Orchestration Core: Scale from 5 to 15 nodes for Black Friday / Cyber Monday

  • PSP Connectors: Add nodes dynamically based on per-PSP traffic


Vertical Scaling (Bigger Nodes):

  • PostgreSQL: Upgrade to 32 vCPU, 128GB RAM if query performance degrades

  • Redis: Upgrade to 8 vCPU, 32GB RAM if cache hit rate drops below 95%


Kubernetes HPA (Horizontal Pod Autoscaler) Configuration:
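
The scaling rule the HPA applies can be illustrated in a few lines of Python. The formula (current replicas scaled by the ratio of observed to target metric, rounded up, then clamped) follows the Kubernetes documentation; the 60% CPU target used in the example is an assumption, and the sketch ignores the HPA's tolerance band and stabilization window.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count the autoscaler would aim for, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

With the Orchestration Core bounds from the scaling strategy above (5 to 15 nodes), five replicas running at 90% CPU against a 60% target would scale to 8, while a much larger spike would be capped at 15.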


Key Performance Indicators (KPIs) for Payment Orchestration

Tracking the right metrics is essential to measure the success of your orchestration engine. Without proper measurement, you can't validate whether your intelligent routing is actually reducing costs, whether your fallback logic is recovering failed transactions, or whether your infrastructure investments are delivering the expected performance improvements.


The KPIs below are organized into four categories: transaction health (measuring payment success), system performance (measuring speed and reliability), cost efficiency (measuring financial impact), and business outcomes (measuring customer and operational impact). Establish baselines before implementing orchestration, then track these metrics continuously to demonstrate ROI and identify optimization opportunities.


Transaction KPIs for Payment Orchestration

Transaction KPIs measure the health and success of your payment processing. Acceptance rate is your primary indicator: every percentage point of improvement translates directly into revenue recovered from previously declined transactions.

| Metric | Definition | Target | Calculation |
| --- | --- | --- | --- |
| Acceptance Rate | % of transactions successfully authorized | >95% | (Successful / Total) × 100 |
| Decline Rate | % of legitimate transactions declined | <5% | (Declined / Total) × 100 |
| Failure Rate | % of transactions that failed due to errors | <1% | (Failed / Total) × 100 |
| Retry Success Rate | % of initially failed transactions that succeeded on retry | >40% | (Retry Success / Initial Failures) × 100 |

Performance KPIs for Payment Orchestration

Performance KPIs ensure your orchestration layer isn't adding unacceptable latency to the checkout experience. Every 100ms of delay can reduce conversion rates by 1%, making speed critical to revenue.

| Metric | Definition | Target |
| --- | --- | --- |
| P50 Latency | 50th percentile response time | <250ms |
| P95 Latency | 95th percentile response time | <500ms |
| P99 Latency | 99th percentile response time | <1000ms |
| Uptime | % of time system is operational | >99.95% |

Cost KPIs for Payment Orchestration

Cost KPIs validate the financial return on your orchestration investment. These metrics prove whether intelligent routing is actually saving money versus using a single PSP.

| Metric | Definition | Target | Calculation |
| --- | --- | --- | --- |
| Average Cost per Transaction | Average fees paid to PSPs | Minimize | Total PSP Fees / Total Transactions |
| Cost Savings vs. Baseline | Savings from intelligent routing | >20% | (Baseline Cost - Actual Cost) / Baseline Cost × 100 |

Example:

  • Baseline (using only Stripe): $100,000/month in fees

  • With Orchestration: $75,000/month in fees

  • Savings: 25%
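
The KPI formulas from the tables above are straightforward to compute once the raw counts are available. A minimal sketch, with hypothetical function names:

```python
def rate_percent(count: int, total: int) -> float:
    """Generic (count / total) x 100, used for acceptance, decline, and failure rates."""
    if total == 0:
        return 0.0
    return 100 * count / total


def cost_savings_percent(baseline_cost: float, actual_cost: float) -> float:
    """(Baseline Cost - Actual Cost) / Baseline Cost x 100."""
    return 100 * (baseline_cost - actual_cost) / baseline_cost


def average_cost_per_transaction(total_fees: float, total_transactions: int) -> float:
    """Total PSP fees divided by the number of transactions."""
    return total_fees / total_transactions
```

Plugging in the example above, `cost_savings_percent(100_000, 75_000)` gives the 25% savings figure.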


Business KPIs for Payment Orchestration

Business KPIs measure operational efficiency and customer impact, the outcomes that matter to executive leadership and directly affect company growth.

| Metric | Definition | Target |
| --- | --- | --- |
| Revenue Recovery | Additional revenue from retry logic | >$50k/month |
| Customer Satisfaction | CSAT score for checkout experience | >4.5/5.0 |
| Time to Integrate New PSP | Days to add a new payment provider | <5 days |


Top Payment Orchestration Engine Architecture Providers

Some of the top payment orchestration engine architecture providers include Adyen, Spreedly, Juspay, IXOPAY, and Crafting Software, offering solutions for enterprises, SaaS, and online businesses with flexibility, scalability, and compliance features. Each provider brings unique strengths for different business needs, from global acquiring to no-code automation and white-label options.


Enterprise-Focused Providers

  • Adyen: Enterprise-grade orchestration with global acquiring and direct connections to card networks.

  • IXOPAY: Scalable, PCI-certified platform with advanced routing and risk management.

  • Crafting Software: Customizable enterprise orchestration engine supporting multi-PSP integration, PCI compliance, and high-volume processing.

  • Cybersource: Customer-friendly orchestrator offering global payment and fraud management options.

  • Gr4vy: No-code platform enabling enterprises to automate and optimize payment strategies.


Flexible and Modern Solutions

  • Spreedly: Flexible platform for connecting multiple payment providers and modernizing payment stacks.

  • Juspay: Comprehensive suite with open-source options, global payouts, and local payment methods.

  • Primer.io: Unified infrastructure with no-code automation and multiple payment method support.

  • Crafting Software: Provides modular, flexible orchestration for businesses needing custom workflows and integrations.


Specialized Providers

  • Openpay: SaaS-focused platform with smart routing and AI-driven retention tools.

  • Akurateco: White-label, PCI-compliant solution with global and local connectors.

  • MYFUNDBOX: Subscription-focused orchestration with automated workflows.

  • Inai: Automated platform for managing payments across multiple vendors.

  • Crafting Software: Offers specialized features for enterprise and high-volume merchants with configurable workflows and risk management.


Conclusion

Payment orchestration has become an essential part of managing modern digital payments. As payment systems grow more complex, with multiple providers, payment methods, and compliance requirements, having a single orchestration layer helps businesses simplify operations and improve performance.


The architecture patterns discussed in this article show that effective orchestration depends on finding the right balance between performance, reliability, security, flexibility, and cost. Whether a company builds its own solution or adopts a third-party platform, the key components stay the same: smart routing, reliable failover, strong monitoring, and solid security.

For businesses handling large transaction volumes, the investment in orchestration, through either internal development or vendor tools, usually pays off in higher authorization rates, lower processing costs, and faster rollout of new payment options. Technologies like Elixir/OTP, PostgreSQL, Kafka, and microservices have proven reliable for high-scale systems that need both low latency and strong uptime.


In the future, orchestration systems will likely use machine learning for smarter routing, tighter fraud prevention, and support for newer payment types like crypto and central bank digital currencies. Companies that build these capabilities now will be better prepared to adapt to new payment technologies while maintaining efficiency and customer trust.

Choosing between building and buying depends on your transaction volume, in-house expertise, and business goals, but having a payment orchestration layer is becoming a standard requirement for operating efficiently at scale.


If you have specific questions about designing or validating a payment orchestration architecture, or want an experienced perspective from an enterprise-focused provider, feel free to reach out to Crafting Software. Our team has helped multiple companies build flexible, modern payment solutions tailored to their business needs.


Payment Orchestration Engine Architecture FAQ


1. How do you implement the Outbox Pattern to ensure atomic transaction writes and event publishing in a payment orchestration engine?

The Outbox Pattern solves the dual-write problem where you need to both save a transaction to the database and publish an event to Kafka atomically. Implementation uses Ecto.Multi in Elixir to wrap both operations in a single database transaction. First, insert the transaction record into the transactions table. Second, in the same transaction, insert an event record into the outbox_events table with fields including event_type (e.g., "payment.initiated"), aggregate_id (transaction ID), payload (JSON-encoded event data), and published (boolean, initially false). A separate background worker (GenServer) polls the outbox table every 1-2 seconds, retrieves unpublished events (using WHERE published = false LIMIT 100), publishes each to Kafka using a producer client, and marks them as published. This guarantees that events are never lost even if Kafka is temporarily unavailable. Delivery is at-least-once rather than exactly-once: if the worker crashes after publishing but before marking the row as published, the event is sent again, so consumers should deduplicate on the event ID. The worker should implement exponential backoff for failed publications and dead-letter queuing for permanently failing events.


2. What's the optimal approach for implementing circuit breakers in PSP connectors, and how do you tune the failure threshold and timeout parameters?

Circuit breakers prevent cascading failures by failing fast when a PSP is degraded. Implement them as GenServer state machines with three states: :closed (healthy), :open (failing), and :half_open (testing recovery). Track consecutive failures per PSP and transition from :closed to :open after 5 consecutive failures within a 60-second window. In the :open state, immediately return {:error, :circuit_open} without calling the PSP API, which cuts the latency of requests aimed at the failing PSP from about 5000ms (waiting for the timeout) to under 10ms (a state check). After a cooldown period (typically 30-60 seconds), transition to :half_open and allow one test request through. If it succeeds, transition back to :closed and reset the failure count. If it fails, return to :open for another cooldown cycle. For tuning, monitor your PSP's typical error patterns: if you see intermittent timeouts, use a lower threshold (3 failures) and shorter cooldown (30s). For rate-limiting errors, use a higher threshold (10 failures) and longer cooldown (120s). Implement per-PSP configuration since failure characteristics vary between providers. If you run multiple orchestrator instances, store circuit state in Redis so that all instances share it.
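The three-state machine above can be sketched as follows. This is a minimal, single-process illustration in Python rather than a GenServer, and it keeps state in memory instead of Redis; the class name, the return tuples, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Per-PSP breaker: closed -> open -> half_open -> closed (not thread-safe)."""

    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.first_failure_at = 0.0
        self.opened_at = 0.0

    def call(self, psp_request):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return ("error", "circuit_open")  # fail fast, PSP is not called
            self.state = "half_open"  # cooldown elapsed: let one probe through
        try:
            result = psp_request()
        except Exception:
            self._record_failure()
            return ("error", "psp_failure")
        # Any success closes the circuit and clears the failure streak.
        self.state = "closed"
        self.failures = 0
        return ("ok", result)

    def _record_failure(self):
        now = self.clock()
        # Start a new streak if there is none or the old one fell out of the window.
        if self.failures == 0 or now - self.first_failure_at > self.window_seconds:
            self.failures = 0
            self.first_failure_at = now
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

One breaker instance per PSP, each with its own threshold and cooldown, matches the per-PSP tuning advice above. Injecting the clock keeps the cooldown logic testable without sleeping.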


3. When implementing ML-based routing, what features and model architecture provide the best prediction accuracy for transaction success rates?

ML-based routing can improve acceptance rates by roughly 1-5% by predicting which PSP will most likely approve each transaction. Use a Random Forest or Gradient Boosting classifier trained separately per PSP to predict binary outcomes (success/failure). Key features include: transaction amount (normalized by currency), card BIN (first 6 digits, one-hot encoded), customer country (ISO code, embedded), transaction currency, merchant category code (MCC), time features (hour of day and day of week as cyclical encodings), customer lifetime transaction count, customer average transaction amount, and recent PSP-specific success rates (calculated over trailing 7-day windows). For feature engineering, create interaction terms like amount_x_country, since approval patterns differ geographically by amount. Use 90 days of historical transaction data for training, with an 80/20 train-test split stratified by PSP. Target a minimum of 10,000 transactions per PSP for stable models. Retrain models weekly using Apache Airflow or a similar orchestration tool to capture evolving approval patterns. During inference, run predictions for all available PSPs in parallel (<20ms per model), then route to the PSP with the highest predicted probability. Implement A/B testing to measure the actual lift versus rule-based routing; where you land in that range depends on feature quality and on keeping the retraining schedule.
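Two of the steps above are easy to get wrong, so here is a small sketch of them: the cyclical encoding of time features and the final "pick the PSP with the highest predicted probability" step. The models are represented by hypothetical callables that return an approval probability; in practice each would wrap a trained classifier, which is outside the scope of this example.

```python
import math


def cyclical(value, period):
    """Map a periodic value (hour, weekday) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def build_time_features(hour, weekday):
    """Encode hour of day (0-23) and day of week (0-6) so 23:00 sits next to 00:00."""
    hour_sin, hour_cos = cyclical(hour, 24)
    day_sin, day_cos = cyclical(weekday, 7)
    return {
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "day_sin": day_sin,
        "day_cos": day_cos,
    }


def choose_psp(features, models):
    """Score every available PSP and return the best one plus all scores.

    `models` maps a PSP name to a callable returning P(approval) for `features`.
    """
    scores = {psp: predict(features) for psp, predict in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Encoding the hour as a raw integer would tell the model that 23:00 and 00:00 are far apart; the sine and cosine pair keeps them adjacent. Returning all scores alongside the winner makes it straightforward to log the runner-up for A/B analysis or to use it as a fallback PSP.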


4. How do you handle PSD2 Strong Customer Authentication (SCA) exemptions in the orchestration layer while maintaining compliance?

PSD2 SCA requires two-factor authentication for customer-initiated electronic payments in the EEA, but several exemptions can reduce friction. Implement exemption logic in a dedicated SCA module that evaluates each transaction before routing. Check exemptions in order: (1) Low-value exemption - transactions under €30 are exempt, but track cumulative value per card (max €100 or 5 transactions since last SCA). (2) Low-risk exemption - integrate with your fraud detection system; if the fraud score is below 0.1 and the amount is within the acquirer's Transaction Risk Analysis (TRA) limit, request the TRA exemption from the acquirer. That limit is €100, €250, or €500 depending on the acquirer's fraud rate, so €500 is a ceiling rather than a default. (3) Recurring payment exemption - if the transaction has a recurring: true flag and the customer completed SCA on the initial payment, store the initial authentication reference and pass it with subsequent charges. (4) Corporate card exemption - identify corporate cards via BIN ranges and exempt if both issuer and acquirer support it. For implementation, create an sca_exemptions table storing exemption type, cumulative counters, and last authentication timestamp per card token. When an exemption is applied, flag it in the request. In EMV 3DS, threeDSRequestorChallengeIndicator "05" signals that TRA has already been performed (version 2.2 and later) and "02" requests no challenge; "04" means a challenge is mandated, so it must not be sent with an exemption. Exemption flags in the authorization message itself are scheme- and acquirer-specific, so map them per PSP connector. If the issuer soft-declines and requests SCA (soft_decline_code: "sca_required"), automatically trigger the 3DS flow via redirect, then retry authorization with the authentication data. Monitor exemption approval rates per issuer, since some issuers decline exemptions more aggressively. Maintain audit logs of all exemption decisions for regulatory review.
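The low-value exemption in step (1) depends on per-card counters, which can be sketched as below. This is an illustrative Python example, assuming amounts in euro cents and an in-memory state object standing in for a row of the sca_exemptions table; the names are hypothetical. It applies both limits conservatively, even though the issuer makes the final decision and may track only one of them.

```python
from dataclasses import dataclass

LOW_VALUE_LIMIT_CENTS = 3_000    # single transaction must be under EUR 30
CUMULATIVE_LIMIT_CENTS = 10_000  # at most EUR 100 exempted since the last SCA
MAX_EXEMPT_TRANSACTIONS = 5      # at most 5 exempted transactions since the last SCA


@dataclass
class CardScaState:
    """Counters kept per card token since the last successful SCA."""
    exempt_total_cents: int = 0
    exempt_count: int = 0


def try_low_value_exemption(state, amount_cents):
    """Return True and update the counters if the exemption can be requested."""
    eligible = (
        amount_cents < LOW_VALUE_LIMIT_CENTS
        and state.exempt_total_cents + amount_cents <= CUMULATIVE_LIMIT_CENTS
        and state.exempt_count < MAX_EXEMPT_TRANSACTIONS
    )
    if eligible:
        state.exempt_total_cents += amount_cents
        state.exempt_count += 1
    return eligible


def record_successful_sca(state):
    """Reset both counters once the cardholder completes SCA."""
    state.exempt_total_cents = 0
    state.exempt_count = 0
```

In production the check and the counter update would need to happen atomically in the database, since two concurrent payments on the same card could otherwise both pass the check.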


5. What database schema design and indexing strategy supports sub-10ms transaction writes and efficient analytics queries on 100M+ transaction records?

Design a hybrid schema optimizing for both transactional writes (OLTP) and analytical queries (OLAP). The core transactions table uses a UUID primary key (generated with gen_random_uuid() in PostgreSQL, so IDs can be assigned without coordinating on a shared sequence), with columns: merchant_id, customer_id, psp, psp_transaction_id, amount, currency, status, payment_method_type, created_at, updated_at, completed_at, error_code, error_message. Use a JSONB metadata column for extensible key-value data to avoid schema migrations for new fields. Critical indexes include: (1) (merchant_id, status, created_at DESC) for merchant dashboard queries, (2) (psp, created_at DESC) for PSP performance analytics, (3) (customer_id, created_at DESC) for customer transaction history, and (4) a GIN index on the metadata JSONB column for flexible querying. Partition the table by created_at using range partitioning with monthly partitions. PostgreSQL requires the partition key to be part of the primary key, so the key becomes (id, created_at). Partition pruning can speed up time-range queries by 10-50x, and old data can be archived by detaching partitions. For writes, use connection pooling (50-100 connections via PgBouncer) and send all writes to the primary, targeting 5-10ms P95 write latency. For analytics, maintain 2 read replicas with streaming replication (typically 100-500ms lag). Complex analytical queries (success rates by PSP, revenue by currency) run against replicas to avoid impacting transactional performance. For dashboards that can tolerate up to a minute of staleness, use materialized views refreshed every 30-60 seconds; for real-time dashboards requiring <1s lag, use Kafka + ksqlDB to maintain denormalized aggregation tables updated via event streams. At 100M+ records, consider the pg_partman extension for automated partition management and pg_repack for online table maintenance without long locks.
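To make the partitioning scheme concrete, here is a hedged sketch of the parent table DDL and a small Python helper that generates the monthly partition statements. Only a subset of the columns is shown, and the column types, the index name, and the helper names are assumptions; pg_partman would normally replace the helper in production.

```python
from datetime import date

# Parent table, trimmed to a few of the columns listed above (types are assumptions).
PARENT_DDL = """
CREATE TABLE transactions (
    id          uuid        NOT NULL DEFAULT gen_random_uuid(),
    merchant_id uuid        NOT NULL,
    psp         text        NOT NULL,
    amount      bigint      NOT NULL,  -- minor units, e.g. cents
    currency    char(3)     NOT NULL,
    status      text        NOT NULL,
    metadata    jsonb       NOT NULL DEFAULT '{}',
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)       -- must include the partition key
) PARTITION BY RANGE (created_at);

CREATE INDEX transactions_merchant_status_idx
    ON transactions (merchant_id, status, created_at DESC);
"""


def month_bounds(year, month):
    """Return the first day of the month and the first day of the next month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def monthly_partition_ddl(year, month, parent="transactions"):
    """Build the CREATE TABLE statement for one monthly range partition."""
    start, end = month_bounds(year, month)
    name = f"{parent}_{start:%Y_%m}"
    # Range bounds are inclusive at FROM and exclusive at TO.
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )
```

Creating partitions a few months ahead avoids insert failures at a month boundary, because rows that match no partition are rejected unless a default partition exists.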


