<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Andy Potanin - Exploring Technology and Innovation]]></title><description><![CDATA[Thoughts, stories, and ideas about technology and innovation. My name is Andy Potanin, and I'm here to share my knowledge and experience in the tech industry.]]></description><link>http://andypotanin.com/</link><generator>Ghost 0.11</generator><lastBuildDate>Mon, 06 Apr 2026 19:14:17 GMT</lastBuildDate><atom:link href="http://andypotanin.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Manual Approval Gates in GitHub Actions]]></title><description><![CDATA[How to pause a GitHub Actions pipeline and require human approval before deploying — using environment protection rules, workflow_dispatch inputs, and issue-based approvals.]]></description><link>http://andypotanin.com/github-actions-manual-intervention/</link><guid isPermaLink="false">d7e19fe4-fcb5-4e36-8ea8-63a0369fa8da</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Mon, 06 Apr 2026 17:20:46 GMT</pubDate><media:content url="https://stateless-udx-io.imgix.net/2023/11/96ae66fe-atlas-12-factor-value-stream-for-software-development-v1.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=1080" medium="image"/><content:encoded><![CDATA[<img src="https://stateless-udx-io.imgix.net/2023/11/96ae66fe-atlas-12-factor-value-stream-for-software-development-v1.png?auto=compress,enhance,format&q=80&fit=crop&crop=faces,edges&w=1920&h=1080" alt="Manual Approval Gates in GitHub Actions"><p><em>Most teams discover GitHub Actions can pause a pipeline and wait for a human about two years after they needed it. The feature has been there since late 2020. 
It's called Environment Protection Rules, and it solves the problem that every deployment pipeline eventually hits: not everything should ship the moment CI goes green.</em></p>

<p><img src="https://stateless-udx-io.imgix.net/2023/11/96ae66fe-atlas-12-factor-value-stream-for-software-development-v1.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="Manual Approval Gates in GitHub Actions"></p>

<hr>

<p>There's a moment in every team's CI/CD journey where someone asks: "Can we make it stop here and wait for someone to say go?" Maybe it's the first time a deploy to production goes sideways because nobody looked at the staging results. Maybe it's a compliance requirement. Maybe it's just the CTO who wants to eyeball the changelog before it hits customers on a Friday afternoon.</p>

<p>GitHub Actions has three mechanisms for this, each solving a different shape of the problem. Here's how they work, when to use each one, and the YAML to make it happen.</p>

<hr>

<h2 id="environmentprotectionrules">Environment Protection Rules</h2>

<p>This is the one you want for mid-pipeline gates. It's native, requires no marketplace actions, and integrates directly with GitHub's notification system.</p>

<p>The model is simple: you create <strong>Environments</strong> in your repository settings — <code>dev</code>, <code>staging</code>, <code>production</code>, whatever your pipeline needs. You attach <strong>protection rules</strong> to the environments that should require human approval. When a workflow job targets a protected environment, the pipeline pauses. Reviewers get notified via email, GitHub notifications, and mobile push. They approve or reject. The pipeline continues or stops.</p>

<h3 id="settingitup">Setting It Up</h3>

<ol>
<li><strong>Settings > Environments > New environment</strong> — create one for each deployment target  </li>
<li>Enable <strong>Required reviewers</strong> — add 1 to 6 individuals or teams  </li>
<li>Set a <strong>wait timer</strong> if you want a mandatory delay (0 to 43,200 minutes — 30 days max)  </li>
<li>Enable <strong>Prevent self-reviews</strong> so the engineer who pushed the code can't rubber-stamp their own deploy  </li>
<li>Restrict <strong>deployment branches</strong> to <code>main</code> only if you want to prevent feature branches from hitting production</li>
</ol>

<p><img src="https://raw.githubusercontent.com/william-liebenberg/github-gated-deployments/main/docs/images/add-required-reviewers.png" alt="Manual Approval Gates in GitHub Actions"></p>

<h3 id="theyaml">The YAML</h3>

<p>The only change to your workflow is one line: <code>environment: &lt;name&gt;</code> on the job that should be gated.</p>

<pre><code class="language-yaml">name: Deploy Pipeline

on:  
  push:
    branches: [main]

jobs:  
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci &amp;&amp; npm run build &amp;&amp; npm test

  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh staging

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production
</code></pre>

<p>When the workflow reaches <code>deploy-production</code>, it stops. A yellow banner appears: "Review deployments." The designated reviewers click it, select the environment, optionally leave a comment explaining why they're approving, and hit <strong>Approve and deploy</strong> or <strong>Reject</strong>.</p>

<p><img src="https://raw.githubusercontent.com/william-liebenberg/github-gated-deployments/main/docs/images/waiting-for-manual-approval.png" alt="Manual Approval Gates in GitHub Actions"></p>

<p><img src="https://raw.githubusercontent.com/william-liebenberg/github-gated-deployments/main/docs/images/manual-approval.png" alt="Manual Approval Gates in GitHub Actions"></p>

<p>This also works from your phone. GitHub Mobile sends push notifications when a deployment is waiting for your review, and you can approve or reject without opening a laptop.</p>

<p><img src="https://i0.wp.com/user-images.githubusercontent.com/18210473/108683965-907dc000-74f2-11eb-851d-593ad42a0c2e.png?ssl=1" alt="Manual Approval Gates in GitHub Actions"></p>

<h3 id="whatyouget">What You Get</h3>

<ul>
<li><strong>Environment-scoped secrets</strong> — secrets tied to an environment are only available to jobs that target it and pass the protection rules. Production API keys aren't accessible until the deploy is approved.</li>
<li><strong>Deployment history</strong> — GitHub tracks which commits were deployed to which environment, when, and who approved them. This is your audit trail.</li>
<li><strong>Concurrent deployment control</strong> — you can prevent multiple deployments to the same environment from running simultaneously.</li>
</ul>
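<p>A sketch of how the secret scoping works in practice — the secret name <code>PROD_API_KEY</code> is hypothetical, but the pattern is standard: reference the secret from a job that targets the environment, and it only resolves after the protection rules pass.</p>

<pre><code class="language-yaml">jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      # PROD_API_KEY lives on the production environment (hypothetical name).
      # A job without `environment: production` cannot resolve it at all.
      - run: ./scripts/deploy.sh production
        env:
          API_KEY: ${{ secrets.PROD_API_KEY }}
</code></pre>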

<h3 id="thecatch">The Catch</h3>

<p>Environment protection rules require <strong>GitHub Pro, Team, or Enterprise</strong> for private repositories. Public repos get them on all plans. If you're on the free plan with a private repo, skip to the issue-based approval mechanism below.</p>

<hr>

<h2 id="workflow_dispatchwithinputs"><code>workflow_dispatch</code> with Inputs</h2>

<p>This is not a mid-pipeline gate — it's a pre-pipeline gate. The workflow doesn't start until a human manually triggers it and fills in parameters.</p>

<p>Use this when the deployment itself is an intentional, deliberate act — not something that should happen on every push. Infrastructure changes, database migrations, release cuts to production on a specific schedule.</p>

<pre><code class="language-yaml">name: Release Deploy

on:  
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options:
          - staging
          - production
      version:
        description: 'Version tag to deploy'
        required: true
        type: string
      dry_run:
        description: 'Dry run only?'
        type: boolean
        default: true
      notes:
        description: 'Deployment notes'
        type: string

jobs:  
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.version }}
      - name: Deploy
        run: |
          echo "Deploying ${{ inputs.version }} to ${{ inputs.environment }}"
          if [ "${{ inputs.dry_run }}" != "true" ]; then
            ./scripts/deploy.sh ${{ inputs.environment }} ${{ inputs.version }}
          fi
</code></pre>

<p>This gives you a form in the Actions tab — dropdowns, text fields, checkboxes. The person triggering the workflow selects the environment, types a version, decides if it's a dry run. GitHub records who triggered it and with what parameters.</p>

<p>You can combine both mechanisms: <code>workflow_dispatch</code> to require a human to start the pipeline, plus <code>environment: production</code> with required reviewers so a second human approves the final step. Two-person integrity for production deploys.</p>

<p>Trigger it from the CLI too: <code>gh workflow run deploy.yml -f environment=production -f version=v2.4.1 -f dry_run=false</code></p>

<hr>

<h2 id="issuebasedapprovalfreetier">Issue-Based Approval (Free Tier)</h2>

<p>If you're on GitHub Free with a private repo and can't use environment protection rules, there's a well-maintained open-source action that implements approval via GitHub Issues: <a href="https://github.com/trstringer/manual-approval">trstringer/manual-approval</a>.</p>

<p>When the workflow hits the approval step, it creates an Issue, tags the designated approvers, and polls for a comment containing "approved" or "denied." It's not as polished as the native environment UI — there's no yellow banner, no one-click approve button — but it works and it's free.</p>

<pre><code class="language-yaml">jobs:  
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci &amp;&amp; npm test

  approval:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.TOKEN }}
          approvers: andy,eric
          minimum-approvals: 1
          issue-title: "Deploy ${{ github.sha }} to production?"
          issue-body: |
            Commit: ${{ github.sha }}
            Branch: ${{ github.ref_name }}

            **Approve** by commenting: `approved`
            **Deny** by commenting: `denied`

  deploy:
    runs-on: ubuntu-latest
    needs: approval
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production
</code></pre>

<hr>

<h2 id="thefullpattern">The Full Pattern</h2>

<p>Here's a complete pipeline that builds, auto-deploys to dev, gates staging, runs smoke tests, then gates production with a wait timer. This is the pattern we use for services that have real users and real consequences.</p>

<pre><code class="language-yaml">name: Full Pipeline

on:  
  push:
    branches: [main]

jobs:  
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
      - run: npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: build
          path: dist/

  deploy-dev:
    needs: build-and-test
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: actions/download-artifact@v4
        with: { name: build }
      - run: ./deploy.sh dev

  integration-tests:
    needs: deploy-dev
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:integration

  deploy-staging:
    needs: integration-tests
    runs-on: ubuntu-latest
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - uses: actions/download-artifact@v4
        with: { name: build }
      - run: ./deploy.sh staging

  smoke-tests:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:smoke

  deploy-production:
    needs: smoke-tests
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://example.com
    steps:
      - uses: actions/download-artifact@v4
        with: { name: build }
      - run: ./deploy.sh production

  post-deploy-verify:
    needs: deploy-production
    runs-on: ubuntu-latest
    steps:
      - run: curl -f https://example.com/health || exit 1
</code></pre>

<p>Configure the environments:</p>

<ul>
<li><strong>dev</strong> — no protection rules. Auto-deploys on every push to <code>main</code>.</li>
<li><strong>staging</strong> — 1 required reviewer from the dev team.</li>
<li><strong>production</strong> — 2 required reviewers from team leads. 15-minute wait timer. Self-approval prevented. Only <code>main</code> branch allowed.</li>
</ul>

<p>The pipeline is visible in the Actions tab as a graph. You can see exactly where it paused, who approved, when, and what they said. That's your audit trail for SOC 2, ISO 27001, or whoever is asking.</p>

<hr>

<h2 id="thingsthatbiteyou">Things That Bite You</h2>

<p>A few gotchas from running this in production:</p>

<p><strong>Approval timeouts.</strong> A job gated by an environment protection rule does not hold a runner while it waits — it sits in a pending state, and GitHub fails the run automatically if no reviewer responds within 30 days (a workflow run can't exceed 35 days in any case). The issue-based approach is different: its approval job <em>does</em> occupy a runner while it polls, and GitHub-hosted runners cap a single job at 6 hours. Either way, set up Slack notifications for pending approvals so they don't rot.</p>

<p><strong>Environment secrets scope.</strong> Secrets defined at the environment level are only available to jobs targeting that environment — and only after protection rules pass. This is a feature, not a bug. It means your production database credentials are literally inaccessible to any job that hasn't been approved. But it also means you can't reference production secrets in a build step that runs before the approval gate.</p>

<p><strong>Concurrency.</strong> By default, multiple workflow runs can target the same environment simultaneously. If you're deploying to production, add <code>concurrency: production</code> to the job to ensure only one deploy runs at a time. Queued runs will wait.</p>
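<p>The one-liner is shorthand for a concurrency group. The longer form (deploy script name reused from the full pipeline above) makes the queueing behavior explicit:</p>

<pre><code class="language-yaml">jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production
    concurrency:
      # All runs targeting this group serialize; later runs queue.
      group: production
      # Never kill an in-flight production deploy to make room for a new one.
      cancel-in-progress: false
    steps:
      - run: ./deploy.sh production
</code></pre>

<p>Note that GitHub keeps at most one pending run per group — if three pushes land while a deploy is running, the middle ones are superseded and only the newest waits.</p>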

<p><strong>Re-runs.</strong> If a deployment fails after approval, you can re-run the failed job without needing re-approval — the original approval carries forward. This is usually what you want. If you need re-approval on retry, create a fresh workflow run instead.</p>

<hr>

<p><em>Links:</em>
- <em><a href="https://docs.github.com/actions/deployment/targeting-different-environments/using-environments-for-deployment">GitHub Docs: Managing environments</a></em>
- <em><a href="https://docs.github.com/actions/managing-workflow-runs/reviewing-deployments">GitHub Docs: Reviewing deployments</a></em>
- <em><a href="https://github.com/trstringer/manual-approval">trstringer/manual-approval</a></em>
- <em><a href="https://github.com/william-liebenberg/github-gated-deployments">william-liebenberg/github-gated-deployments</a></em></p>]]></content:encoded></item><item><title><![CDATA[What Killed Blackboard]]></title><description><![CDATA[How a six-year cloud migration without deployment automation turned a $3B EdTech acquisition into $1.7B bankruptcy. The divested unit that built the pipeline doubled its valuation.]]></description><link>http://andypotanin.com/what-killed-blackboard/</link><guid isPermaLink="false">f53604df-6580-4f1b-8876-fcf863d29ab9</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Thu, 02 Apr 2026 01:00:25 GMT</pubDate><media:content url="https://stateless-udx-io.imgix.net/2023/07/006a38a3-semiconductor-production-process-optimized-for-scalability-and-reusability.jpg?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=1080" medium="image"/><content:encoded><![CDATA[<img src="https://stateless-udx-io.imgix.net/2023/07/006a38a3-semiconductor-production-process-optimized-for-scalability-and-reusability.jpg?auto=compress,enhance,format&q=80&fit=crop&crop=faces,edges&w=1920&h=1080" alt="What Killed Blackboard"><p><em>How a six-year cloud migration without deployment automation turned a $3 billion EdTech acquisition into a $1.7 billion bankruptcy — and what the divested business unit that built the pipeline did differently.</em></p>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/006a38a3-semiconductor-production-process-optimized-for-scalability-and-reusability.jpg?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<hr>

<p>In September 2025, Anthology — the company that owned Blackboard, the most recognized name in education technology — filed Chapter 11 bankruptcy with $1.7 billion in debt and annual interest payments consuming 41% of total revenue. Five months later, it emerged as "Blackboard, Inc." — debt-free, stripped of two of its three business segments, sold for a combined $120 million, and backed by $70 million in new financing from distressed-debt investors Oaktree Capital and Nexus Capital.</p>

<p>The same month Anthology filed for bankruptcy, a former Blackboard business unit — Transact Campus, divested in 2019 for $720 million — was completing its integration into Roper Technologies following a $1.6 billion acquisition. Same parent company. Same industry. Same compliance requirements. Opposite outcomes.</p>

<p>The difference was not strategy. It was not market timing. It was not product vision. The difference was whether the organization had built the operational infrastructure to deploy software reliably across multiple environments at scale — and what happened when it hadn't.</p>

<hr>

<h2 id="theriseandentrenchment">The Rise and Entrenchment</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/0dd411ed-microservices-architecture-v2-1920x1080-1.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>Blackboard was founded in 1997 by Michael Chasen and Matthew Pittinsky. Within two years it had merged with CourseInfo, gone to market as the first commercial learning management system, and begun acquiring everything in its path. Prometheus in 2002. WebCT — its primary competitor — in 2006 for $180 million. ANGEL Learning in 2009. By 2006, Blackboard was installed on more than 65% of U.S. college campuses, according to ListedTech's historical market share data. Between 2006 and 2012, the company spent over $500 million on acquisitions.</p>

<p>The company went public in 2004 (ticker: BBBB) and was taken private in 2011 by Providence Equity Partners for $1.64 billion. As Phil Hill at e-Literate later observed, when a private equity firm acquires a company like Blackboard, "they do so by essentially taking out a giant mortgage" — the debt that would define the next decade was introduced here, not with Anthology.</p>

<p>Blackboard's market dominance during this period was real but fragile. In 2006, the company obtained a broad patent on internet-based education support systems and promptly sued Desire2Learn (now D2L) for infringement. The case dragged through the courts for three years — Blackboard won a $3.1 million jury verdict in Texas in 2008, only to have the Federal Circuit invalidate the patent claims in 2009 for indefiniteness. The lawsuit alienated the open-source community, energized competitors, and signaled to the market that Blackboard's strategy was more about protecting installed base than earning it.</p>

<p>By the time Providence took Blackboard private, the cracks were already showing. Canvas, launched by Instructure in 2011, was cloud-native from its first line of code. It didn't need a migration strategy because there was nothing to migrate from.</p>

<hr>

<h2 id="themigrationthatneverended">The Migration That Never Ended</h2>

<p>Blackboard announced its move to a fully SaaS model on AWS in 2014. The goal was straightforward: migrate all clients off self-hosted and managed-hosting deployments to a multi-tenant cloud architecture.</p>

<p>Six years later, by mid-2020, only about half of clients had migrated to SaaS — roughly 844 out of ~1,700, according to Phil Hill's analysis of Blackboard's own numbers. The company was running three completely different deployment models simultaneously:</p>

<ul>
<li><strong>Self-hosted</strong>: Customers running Blackboard on their own data centers</li>
<li><strong>Managed-hosting</strong>: Blackboard operating dedicated instances in its own data centers</li>
<li><strong>SaaS</strong>: Multi-tenant cloud deployment on AWS</li>
</ul>

<p>Each model required different code paths, different support structures, different security configurations, and different operational procedures. As Phil Hill documented at the time: the company had "an albatross around their neck with the need to support different code bases and deployment models."</p>

<p>This was not a technology problem. It was a <a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/">deployment orchestration</a> problem. The company was trying to migrate thousands of institutional clients to the cloud without a system that could manage the migration itself.</p>

<p>Meanwhile, market share was in free fall. In 2018, Canvas tied Blackboard at 28% of the U.S. and Canadian higher ed market, according to Phil Hill's market analysis. By 2019, Canvas had surpassed Blackboard. By the end of 2024, the numbers were devastating: Canvas held 50% of the market by enrollment, D2L Brightspace had risen to 20%, and Blackboard had cratered to 12%. By spring 2025, Edutechnica reported that Canvas had achieved a milestone: greater market share than its next three competitors combined.</p>

<p>Blackboard went from roughly two-thirds of the market to 12% in under two decades. The LMS that had once been synonymous with online learning was now a distant third.</p>

<hr>

<h2 id="threedeploymentmodelszeroautomation">Three Deployment Models, Zero Automation</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/a32cb205-kubernetes-orchestration-control-flow-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>The evidence of operational paralysis accumulated publicly:</p>

<p><strong>Software that hadn't been updated in two years.</strong> Blackboard's worst outage in years, in April 2020, was traced to infrastructure that had gone untouched for 24 months. Phil Hill attributed the outage to "software that had not been updated in two years" and described it as illustrating "the risk of having your cake and eating it, too" — maintaining multiple deployment models while lacking the automation to keep any of them current. In an industry where competitors were shipping weekly, Blackboard had environments that hadn't seen a deployment in two years — not because nothing had changed, but because deploying was too risky and too manual to attempt without a crisis forcing it.</p>

<p><strong>Forced migration without migration capability.</strong> In July 2020, Blackboard announced that self-hosted support would end by December 2023 and managed-hosting by December 2022. Phil Hill noted that "Blackboard has not been willing to make the difficult decision to force clients off of older hosting models, meaning that the company has increased the number of deployment options they have to support." When Blackboard finally did set deadlines, the response at BbWorld was telling: the first audience question led to "a very long discussion about long-term self-hosted customers who just don't have the resources to migrate." Institutions that had been promised indefinite support were told to migrate anyway — without the tooling to make migration manageable.</p>

<p><strong>Overprovisioning as the default.</strong> When COVID-19 drove a 4,800% increase in Blackboard Collaborate usage — concurrent users jumping 45x in four weeks — the company's response was documented in their own AWS case study. "We had entire countries shifting to online learning overnight," said Kris Stokking, VP of Software Engineering at Blackboard. The company's initial response was to overprovision. As AWS noted: "Blackboard erred on the side of overprovisioning; in the long term, however, the company needed a more cost-efficient solution." They eventually implemented autoscaling, AMD instance optimization, and Spot Instances — achieving a 28% cost reduction on media processing — but this was reactive optimization of a single product, not systematic infrastructure-as-code across the entire estate.</p>

<p>Every feature, every security patch, every performance optimization had to be tested and deployed three different ways. The combinatorial explosion of configurations — three hosting models times two UI versions times thousands of institutional customizations — was not a scaling challenge. It was an impossibility without automation.</p>

<hr>

<h2 id="thed3billionbetthatcompoundedtheproblem">The $3 Billion Bet That Compounded the Problem</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/02/cddf93bc-udx-article-software-supply-chain-security.jpg?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>In October 2021, Anthology — a company formed by Veritas Capital and Leeds Equity Partners in 2020 through a roll-up of Campus Management, Campus Labs, and iModules — acquired Blackboard for approximately $3 billion. The acquisition was funded almost entirely by debt. Veritas became the majority owner.</p>

<p>The thesis, as then-CEO Jim Milton stated, was to create "the most comprehensive ed-tech ecosystem across academic, administrative and student engagement applications." The bet was that universities would want to buy their LMS, SIS, and CRM from the same vendor — a bundled "cradle-to-career" platform.</p>

<p>The thesis was wrong.</p>

<p>Phil Hill, the most widely cited EdTech market analyst, explained it to Inside Higher Ed: "Anthology assumed that by combining the LMS, SIS and CRM that they would get a lot more cross-selling. What they misunderstood was that academics — the deans, provosts and faculty — really pick the LMS and they're not going to pick an LMS because the registrar and chief information officer picked a different SIS. That synergy they were looking for just really didn't exist."</p>

<p>ListedTech's data proved this empirically: only <strong>5 institutions</strong> used products across all three Anthology business segments. Five. Out of thousands. The cross-sell strategy that justified a $3 billion acquisition — supported by $1.7 billion in debt — produced essentially zero cross-selling.</p>

<p>What Anthology inherited:</p>

<ul>
<li>An incomplete SaaS migration seven years in</li>
<li>Three deployment models requiring parallel operational support</li>
<li>A declining market share — 12% and falling — against Canvas (50%) and D2L Brightspace (20%)</li>
<li>30+ products with overlapping codebases across the combined entity</li>
<li>$1.7 billion in debt requiring $185 million in annual interest payments</li>
</ul>

<p>The capital structure, revealed in court filings and analyzed by ElevenFlo, was:</p>

<ul>
<li><strong>1st Lien Superpriority Credit Agreement</strong>: ~$1.2 billion</li>
<li><strong>2nd Lien Credit Agreement</strong>: ~$423 million</li>
<li><strong>Revolving Credit Facility</strong>: $100 million (fully drawn)</li>
<li><strong>Debt-to-EBITDA ratio</strong>: exceeding 400x by FY2025</li>
</ul>

<p>The financial trajectory was swift and brutal:</p>

<ul>
<li><strong>Revenue</strong>: $530M (FY2023) → $450M (FY2025) — down 15%</li>
<li><strong>EBITDA</strong>: $33M → $4M — down 88% in two years</li>
<li><strong>Annual interest payments</strong>: $185M — 41% of total revenue</li>
<li><strong>Net losses over two years</strong>: approximately $80 million, per court filings</li>
</ul>

<p>By December 2024, Anthology skipped its interest payment on the second lien. By March 2025, it skipped the first lien too — and two lenders failed to honor $18.5 million in lending commitments. Veritas Capital effectively walked away. Distressed-debt investors Nexus Capital and Oaktree Capital entered to steer the restructuring. By September 2025, it filed Chapter 11.</p>

<p>The court filings tell the story in one ratio: $185 million in annual interest on $4 million in EBITDA. The company's annual operating profit barely covered one week of debt service.</p>

<hr>

<h2 id="whycompetitorssurvivedandthrived">Why Competitors Survived — and Thrived</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/cd2c2147-docker-image-release-cycle-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>The competitive landscape makes the operational failure visible:</p>

<p><strong>Canvas (Instructure)</strong> was built cloud-native from its founding in 2008. No migration problem. No legacy deployment models. One codebase, one deployment pipeline, simultaneous feature delivery to all customers. Canvas was taken private by Thoma Bravo in 2020 for $2 billion, then sold to KKR in 2024 for $4.8 billion — a 140% increase in valuation in four years. Instructure's trailing twelve-month revenue reached $634 million at the time of the KKR deal. By spring 2025, Canvas held 50% of the US/Canadian higher ed LMS market by enrollment — more than its next three competitors combined.</p>

<p><strong>D2L Brightspace</strong> made the painful decision to force all clients off self-hosting and managed-hosting to pure AWS-based SaaS over 3-4 years. Phil Hill contrasted this directly with Blackboard: "D2L took a different path and has entirely migrated their client base from self-hosting and managed-hosting to pure AWS-based SaaS hosting over the past 3-4 years. D2L lost some clients along the way based on this approach, but they are now a true cloud-based LMS company." D2L reported $217 million in annual revenue for fiscal 2026, subscription revenue growing 9% year-over-year, and total debt of just $723,000 (that's thousands, not millions). D2L Brightspace passed Blackboard for second place in the LMS market by enrollment in 2024, driven partly by system-wide migrations at SUNY and CUNY away from Blackboard.</p>

<p>Blackboard tried to avoid the hard decision — maintaining legacy deployment models to retain customers while simultaneously migrating to the cloud. Without deployment automation to manage this complexity, they ended up with the worst of both worlds: the operational costs of three deployment models and the competitive disadvantage of slow feature delivery.</p>

<p>The pattern is clear in the numbers: D2L, which forced the migration, has negligible debt and iterates rapidly. Canvas, which never needed a migration, commands a $4.8 billion valuation. Blackboard, which delayed the migration, accumulated $1.7 billion in debt and filed for bankruptcy.</p>

<hr>

<h2 id="thedivestedbusinessthatbuiltthepipeline">The Divested Business That Built the Pipeline</h2>

<p><img src="https://stateless-udx-io.imgix.net/2025/05/2fde886a-secure-authenticated-access-architecture-v2.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>In April 2019, Blackboard divested its Transact business unit — the payments and campus commerce division — to Reverence Capital Partners for approximately $720 million.</p>

<p>Blackboard Transact handled students' financial information, banking data, and controlled the physical access systems that lock and unlock doors across campus. This was the closest thing to critical infrastructure outside the DOD — not a consumer app where a breach means you apologize and offer credit monitoring. PCI DSS and SOC 2 compliance were not optional.</p>

<p>The division's operational maturity at the time of divestiture mirrored the parent company's. Products across three verticals were either not releasing automatically or doing ArgoCD-style GitOps that only reconciled within Kubernetes. There was no tooling to provision or manage resources outside of Kubernetes. And the core platform wasn't running on Kubernetes at all — it was running on Azure Service Fabric, Microsoft's pre-Kubernetes container orchestrator, now deprecated with a migration deadline of March 2027.</p>

<p>What got built after divestiture was a release orchestration pipeline aligned with how the organization actually worked:</p>

<ul>
<li>Developers handed off feature builds to the <strong>QA team</strong></li>
<li>QA signed off on the story and release, then handed it to <strong>User Acceptance Testing (UAT)</strong></li>
<li>UAT was managed by a <strong>completely different department</strong></li>
<li>That department ensured the release could reach <strong>production</strong></li>
<li>At each stage, approvals were required before the release could proceed further</li>
</ul>

<p>Each product ended up with five to seven environments in its pipeline. The pipeline took code developers wrote, packaged it into deployable artifacts, and deployed in sequence — with rules, state awareness, and explicit halt conditions at every phase. It produced all artifacts and evidence any PCI DSS or SOC 2 auditor would ever ask for. Pipeline phases were aligned with organizational departments and procedures: it was clear when work was handed off and who was responsible for approval.</p>

<p>The compliance drivers were real. PCI DSS Requirement 6.4 mandates separate development and test environments, testing of security impact, and separation of duties. SOC 2 CC8.1 requires formal authorization of production changes. The institutional client base demanded this — universities handling financial aid, meal plans, and physical access control need to prove that different versions of code are tested in different environments before reaching production.</p>
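<p>That organizational flow can be captured directly in pipeline configuration. The following is a hypothetical sketch, not the actual Transact tooling; the phase names, approver groups, and evidence fields are illustrative:</p>

<pre><code class="language-yaml"># release-pipeline.yaml: hypothetical phase-gated release flow
release_pipeline:
  artifact: payments-service
  versioning: sequential            # every release gets the next number, never reused
  phases:
    - name: development
      environment: dev
      promotion:
        requires_approval: false    # developers iterate freely

    - name: qa
      environment: qa
      promotion:
        requires_approval: true
        approvers: [qa-team]        # QA signs off on the story and the release
        evidence: [test-report, story-signoff]

    - name: uat
      environment: uat
      promotion:
        requires_approval: true
        approvers: [uat-department] # a different department owns UAT
        evidence: [uat-signoff]

    - name: production
      environment: prod
      promotion:
        requires_approval: true
        approvers: [release-management]
        evidence: [change-authorization]  # the formal authorization SOC 2 CC8.1 asks for
</code></pre>

<p>Because each promotion records its approval and its evidence as artifacts, the audit trail becomes a by-product of normal operation rather than a separate documentation exercise.</p>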

<p>The results were measurable:</p>

<ul>
<li>Time from developer finishing work to customer delivery: <strong>one year down to under a week</strong> (sometimes a day)</li>
<li>Tracked deployments: <strong>single digits to 30,000+ per year</strong> (DORA classifies elite at 1,460/year — this was 20x elite)</li>
<li>Developer experience: settings were easy to find, easy to iterate on, and integrated into whatever interface developers already used</li>
<li>Every deployment was tracked, versioned, and deployable into other environments — environments that represented lifecycle stages with organizational controls</li>
</ul>

<p>The business trajectory, documented in Transact's press releases and Roper Technologies' SEC filings:</p>

<ul>
<li><strong>April 2019</strong>: Divested for ~$720M (Reverence Capital, with $300M+ in equity)</li>
<li><strong>June 2023</strong>: 1 million mobile credentials provisioned, representing over 80% of the mobile student ID market, with $230M+ processed through mobile credentials</li>
<li><strong>2024</strong>: 1,800+ institutions, $49B+ in annual payments processed, 250M+ contactless transactions</li>
<li><strong>August 2024</strong>: Acquired by Roper Technologies for <strong>$1.6 billion</strong> (projected $325M revenue, $105M EBITDA)</li>
</ul>

<p>Roper already owned CBORD — Transact's biggest competitor, acquired in 2008 for $367 million. CBORD served 750 colleges and 1,700 healthcare licensees. The combined entity now dominates a campus card systems market valued at $1.65 billion globally.</p>

<p>While Anthology's Blackboard was losing $80 million in revenue and watching EBITDA collapse by 88%, the former Blackboard division that built the deployment pipeline more than doubled its valuation.</p>

<hr>

<h2 id="technicaldebtbecomesfinancialdebt">Technical Debt Becomes Financial Debt</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/08/9b22b8fc-large-mining-operations-require-various-controls-to-ensure-safe-and-efficient-operation.jpg?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>Blackboard's bankruptcy is typically discussed as a private equity story — leveraged buyout, excessive debt, declining market. That framing is accurate but incomplete. The debt was serviceable if the business could grow or even maintain revenue. It became fatal because the business couldn't execute.</p>

<p>And it couldn't execute because it lacked the operational infrastructure to do so.</p>

<p><strong>Without deployment automation, cloud costs become unmanageable.</strong> The structural economics tell the story: revenue declined 15% while EBITDA collapsed 88% — from $33 million to $4 million. Costs weren't scaling down with the business; they were scaling up against it. When every environment is manually configured, overprovisioning becomes the default, because the cost of an outage — manual investigation, manual rollback, manual redeployment — vastly exceeds the cost of spare capacity. The AWS case study confirms this was Blackboard's actual behavior during the only period in which they publicly documented their cloud operations.</p>

<p><strong>Without infrastructure-as-code, every environment is a snowflake.</strong> A security patch that works in the SaaS environment breaks the managed-hosting environment because someone made a manual change six months ago that was never documented. The patch for managed-hosting breaks a self-hosted client because their Java version is two years behind. Each fix creates a new configuration branch. The configuration space grows faster than the team can map it.</p>

<p><strong>Without deployment visibility, you can't trace what changed.</strong> When Blackboard's worst outage hit in April 2020 — traced to software unchanged for two years — the diagnosis itself was the tell. In a system with deployment records, the answer to "what changed?" is immediate. In a system without them, the answer is "we're not sure, but we think nothing changed, and that's actually the problem."</p>
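<p>A deployment record is what makes that question answerable in seconds. A minimal, hypothetical record format (the field names are illustrative) might look like this:</p>

<pre><code class="language-yaml"># deployments/30412.yaml: hypothetical per-deployment record
deployment:
  id: 30412
  artifact: lms-core
  version: 2020.14.2
  previous_version: 2020.14.1
  environment: saas-production
  approved_by: release-management
  applied_at: "2020-04-12T03:17:00Z"
  changes:
    - module: session-service
      from_sha: 9f1c2ab
      to_sha: 4d7e810
</code></pre>

<p>With one of these per deployment, "what changed?" reduces to diffing the latest record against the previous one, and "nothing has changed for two years" becomes a visible fact rather than a forensic conclusion.</p>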

<p>The <a href="http://andypotanin.com/when-cicd-isnt-enough-rapid-iteration-dependency/">iteration speed</a> problem compounds everything. Without deployment automation, the feedback loop between identifying a problem and validating a fix stretches from hours to weeks. Developers stop experimenting. Product teams stop requesting features. The organization enters a defensive posture where the primary goal is avoiding breakage, not delivering value. Revenue declines follow.</p>

<p>As Joseph Licata, founder and CEO of Canyon GBS, told Inside Higher Ed: "Anthology's bankruptcy reflects the financial and operational strain created when education technology companies scale primarily through acquisition rather than disciplined product and engineering strategy. Managing overlapping architectures, redundant services, and different code bases significantly increases costs and slows innovation."</p>

<hr>

<h2 id="thesequencingerror">The Sequencing Error</h2>

<p><img src="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What Killed Blackboard"></p>

<p>Blackboard made a critical sequencing error that is common in enterprise cloud migrations:</p>

<p><strong>They announced the destination before building the road.</strong></p>

<p>The SaaS migration was announced in 2014. The deadline for self-hosted support was set for 2023. The Ultra UI migration was launched in parallel. But the deployment automation — the system that would make all of these transitions manageable, traceable, and fast — was never built.</p>

<p>The result was predictable: forced migrations without the operational capability to support them. Institutions were told to migrate, but the migration process itself was manual, fragile, and under-resourced. The support burden of running three deployment models consumed the engineering capacity that should have been building the pipeline to eliminate them.</p>

<p>Phil Hill identified this dynamic in real time: "One challenge Blackboard faces is that with their market losses over the years, this conservative type is a greater and greater percentage of their remaining customer base." The customers least able to migrate were the ones most likely to still be on Blackboard — and the ones who would need the most operational support during migration. Without automation, this was a death spiral.</p>

<p>Compare this to the approach that worked at Transact: build the pipeline first, then use it to accelerate everything else. The pipeline made deployments easy, traceable, and fast. Developers adopted it because it solved their immediate bottleneck. Product teams used it because it turned "deploy to five environments" from a project into a button press. The pipeline was the prerequisite for everything that followed.</p>

<p>This is not specific to education technology. It is the fundamental lesson of every failed enterprise cloud migration: <strong>the migration tool is not the cloud provider. It is the deployment pipeline.</strong></p>

<hr>

<h2 id="whatgitopscouldntdo">What GitOps Couldn't Do</h2>

<p>The GitOps model — put desired state in Git, let a controller reconcile the cluster — appears to solve the deployment problem. It is elegant for a single Kubernetes cluster with a stable configuration.</p>

<p>But Blackboard's challenge was never a single Kubernetes cluster with a stable configuration. It was:</p>

<ul>
<li>Three fundamentally different deployment models requiring different operational procedures</li>
<li>Hundreds of institutional clients with different configurations, compliance requirements, and migration timelines</li>
<li>Infrastructure spanning Kubernetes, bare metal, managed hosting, and client-operated data centers</li>
<li>A UI migration happening simultaneously with a hosting migration</li>
<li>Compliance and data residency requirements that varied by institution and jurisdiction</li>
</ul>

<p>GitOps addresses one dimension of this problem — the Kubernetes dimension — and treats the rest as someone else's problem. The <a href="http://andypotanin.com/what-the-operator-knows/">reconciliation loop</a> answers "does the cluster match Git?" It does not answer "has this institution been migrated safely?" or "does this configuration work in all three deployment models?" or "can we roll back this institution without affecting the 200 institutions sharing the same SaaS tenant?"</p>

<p>What Blackboard needed was not a reconciliation controller. It was an <a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/">orchestration pipeline</a> — one that could manage deployments as ordered, stateful operations across heterogeneous infrastructure, with explicit halt conditions, promotion gates, and audit evidence generation. The kind of pipeline where environments are <a href="http://andypotanin.com/what-the-operator-knows/">runtime parameters, not Git paths</a> — where adding a new institutional environment means adding a YAML file, not spinning up new infrastructure, new credentials, and new operational procedures.</p>

<p>The core issue is deeper than tooling. Organizations don't need deployment automation for CI/CD. They need it to prove they have a sound process. Every change must be traceable. Every change must be tested exhaustively before deployment. Changes must be manifested and declared as artifacts with sequential versioning. A proper pipeline has sequential phases aligned with organizational structure — and Git alone isn't sufficient, because what gets done in Git isn't all that's needed for a release to be deployed into production.</p>
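<p>"Manifested and declared as artifacts" has a concrete shape. A hypothetical release manifest, with illustrative field names, might look like this:</p>

<pre><code class="language-yaml"># releases/2024.15.0/manifest.yaml: hypothetical release manifest
release:
  version: 2024.15.0                  # sequential, never reused
  created: "2024-05-02T14:11:00Z"
  components:
    - name: api
      image: registry.example.com/api@sha256:1a2b...    # immutable image digest
    - name: worker
      image: registry.example.com/worker@sha256:9c8d...
  config_snapshot: sha256:f00d...     # hash of the environment configs at build time
  provenance:
    source_ref: refs/tags/v2024.15.0
    pipeline_run: 8841
</code></pre>

<p>The manifest, not a Git commit, is what moves through the phases: Git records what the source was, while the manifest records what was actually built, tested, and authorized.</p>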

<p>GitOps reconciliation tools solve the wrong problem for regulated environments. They answer "does the cluster match the repo?" They don't answer "has this change been approved by QA, tested by UAT, and authorized for production by the release management team?" In an environment governed by PCI DSS and SOC 2, the second question is the one the auditor asks.</p>

<p>GitOps is deceptively simple. It solves the demo perfectly. But at enterprise scale — with multiple deployment models, compliance boundaries, cross-platform workloads, and hundreds of tenant configurations — the simplicity becomes the constraint. The things it doesn't handle are the things that determine whether a cloud migration succeeds or fails.</p>

<hr>

<h2 id="thebankruptcyarithmetic">The Bankruptcy Arithmetic</h2>

<p>Here is the arithmetic that killed Blackboard:</p>

<ul>
<li><strong>Revenue</strong>: $450 million and declining 8% annually</li>
<li><strong>Interest payments</strong>: $185 million (41% of revenue)</li>
<li><strong>EBITDA</strong>: $4 million (before interest)</li>
<li><strong>Debt-to-EBITDA</strong>: 400x+</li>
<li><strong>Gap</strong>: -$181 million per year</li>
</ul>

<p>No amount of cost cutting closes a $181 million annual gap on a $450 million revenue base. The company would have needed to grow revenue by 40% while simultaneously eliminating nearly all operational overhead. That was structurally impossible for an organization that couldn't deploy a software update without manual intervention across three different infrastructure models.</p>

<p>The private equity structure made the math fatal, but the operational immaturity made the math inevitable. A company that could deploy rapidly, iterate on features, onboard clients quickly, and optimize cloud costs could have grown into its debt. Blackboard couldn't do any of those things.</p>

<p>The emergence from bankruptcy in February 2026 tells the final chapter: Blackboard sold its SIS and ERP business to Ellucian for $70 million and its CRM and student success business to Encoura for $50 million. The combined sale price for two of three business segments was $120 million — 4% of the $3 billion acquisition price four years earlier. The cross-sell thesis that justified the entire acquisition was unwound for pennies on the dollar.</p>

<p>What remains is a debt-free entity called "Blackboard, Inc." — the LMS, Ally, and institutional effectiveness tools — backed by $70 million in new capital from Oaktree and Nexus, with Matt Pittinsky, the co-founder who started the company in 1997, set to return as CEO once his non-compete with Instructure expires. It is, as Phil Hill described it, a "financial reset" — but the strategic reset is still to come.</p>

<hr>

<h2 id="sameindustryoppositeoutcomes">Same Industry, Opposite Outcomes</h2>

<p><strong>Blackboard (Anthology)</strong></p>

<ul>
<li>Acquired for ~$3B (2021)</li>
<li>No deployment automation at scale</li>
<li>7+ years of incomplete cloud migration</li>
<li>Market share: 70% → 12%</li>
<li>EBITDA: $33M → $4M (-88%)</li>
<li>Three simultaneous deployment models</li>
<li>Cross-sell: 5 institutions across all segments</li>
<li><strong>Outcome: Chapter 11 bankruptcy, emerged as diminished entity</strong></li>
</ul>

<p><strong>Transact Campus</strong></p>

<ul>
<li>Divested for ~$720M (2019)</li>
<li>Full deployment orchestration pipeline</li>
<li>Operational transformation in 2 years</li>
<li>80%+ mobile credential market share</li>
<li>$325M revenue, $105M EBITDA at exit</li>
<li>Unified pipeline across multiple targets</li>
<li>30,000+ deployments/year</li>
<li><strong>Outcome: Acquired by Roper for $1.6B, combined with CBORD to dominate</strong></li>
</ul>

<p><strong>Canvas (Instructure)</strong></p>

<ul>
<li>Cloud-native from founding (2008)</li>
<li>No migration required — ever</li>
<li>50% market share by enrollment (2025)</li>
<li>Thoma Bravo exit to KKR for $4.8B (2024)</li>
<li>More market share than next three competitors combined</li>
</ul>

<p><strong>D2L Brightspace</strong></p>

<ul>
<li>Forced SaaS migration over 3-4 years</li>
<li>Lost some clients, gained operational clarity</li>
<li>Passed Blackboard for #2 in 2024</li>
<li>$217M revenue, $723K total debt</li>
<li>9% subscription revenue growth</li>
</ul>

<p>Same industry. Same customers. Same compliance requirements. Four different approaches to the same problem. The ones that solved deployment won. The one that didn't went bankrupt.</p>

<hr>

<h2 id="thelesson">The Lesson</h2>

<p>Blackboard's bankruptcy is usually framed as a cautionary tale about private equity leverage, or about market disruption by cloud-native competitors, or about the difficulty of acquiring and integrating large software companies.</p>

<p>It is all of those things. But underneath all of them is a simpler truth: <strong>Blackboard could not deploy software reliably across its own infrastructure.</strong> Everything else — the stalled migration, the overprovisioned cloud resources, the slow feature delivery, the customer attrition, the revenue decline that made the debt unserviceable — was a consequence of that single operational failure.</p>

<p>The deployment pipeline is not a DevOps nice-to-have. It is not a line item that can be deferred while the organization focuses on "strategic priorities." It is the mechanism through which strategic priorities become reality. Without it, strategy is aspiration. With it, strategy is execution.</p>

<p>Transact built the pipeline and turned $720 million into $1.6 billion. Canvas was born with the pipeline and commands a $4.8 billion valuation. D2L forced the migration and carries $723,000 in debt. Blackboard didn't build the pipeline and turned $3 billion into bankruptcy.</p>

<p>The infrastructure you build to deploy your software is, ultimately, the infrastructure that determines whether your software — and your company — survives.</p>]]></content:encoded></item><item><title><![CDATA[What the Operator Knows That the Tooling Evangelist Doesn't]]></title><description><![CDATA[A field guide to the questions that break the GitOps orthodoxy — from reconciliation vs. orchestration to compliance boundaries, drift detection, and what the unit of deployment really is.]]></description><link>http://andypotanin.com/what-the-operator-knows/</link><guid isPermaLink="false">b2989266-b8de-45f2-bb22-24c74071ba11</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Wed, 01 Apr 2026 20:43:11 GMT</pubDate><media:content url="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=1080" medium="image"/><content:encoded><![CDATA[<img src="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&q=80&fit=crop&crop=faces,edges&w=1920&h=1080" alt="What the Operator Knows That the Tooling Evangelist Doesn't"><p><em>Lessons from running multi-account, multi-cloud Kubernetes deployments under compliance constraints — and why the showcase scenario isn't where the real problems live.</em></p>

<p><img src="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<hr>

<p>A few years into running a deployment platform that served hundreds of higher-education institutions across dozens of cloud accounts, the team hit a wall that no one had warned them about. Every environment — dev, staging, production, the regulated one for financial aid data, the one-off client instance behind a firewall — had its own deployment pipeline, its own credential store, its own failure modes. Every time someone added an environment, they weren't deploying software. They were deploying deployment infrastructure. The system that was supposed to reduce complexity was the fastest-growing source of it.</p>

<p>That experience — scaled across years, clients, and compliance regimes — is the backdrop for everything in this article. The patterns here are not theoretical. They are the conclusions that teams arrive at after the elegant demo architecture meets the full estate.</p>

<hr>

<h2 id="reconciliationisnotorchestration">Reconciliation Is Not Orchestration</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/204ad021-kubernetes-reconciliation-loop-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=960&amp;h=540" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>The GitOps pitch is clean: put your desired state in Git, let a controller reconcile the cluster to match. ArgoCD does this well. It answers one question continuously — "Does the live state match what Git says?" — and it answers it forever. Drift detection, self-healing, continuous reconciliation. For a single cluster with a stable config, it is genuinely elegant.</p>

<p>But reconciliation is not orchestration, and the difference only shows up when things get real.</p>

<p>A pipeline is a directed graph with state. It knows where it came from, where it's going, what happened at each step, and whether to continue or halt. It can stop. A reconciliation loop cannot stop — it can be set to manual sync, but that is a configuration state, not a halt condition. A subsequent automated commit can bypass it entirely.</p>

<p>Picture a deployment with 27 infrastructure modules in strict dependency order — networking first, then SQL instances, then Kubernetes clusters, then namespaces, then services, then application deployments, then CDN, then monitoring. Each module depends on outputs from the previous layer. Each layer's failure means everything downstream must halt, not reconcile. A reconciliation loop that re-applies the CDN module while the namespace module is failing doesn't heal anything — it generates noise that obscures the actual failure.</p>

<p>The systems that survive at scale converge on ordered deployment graphs independently. Module weights enforce sequencing. Placeholder resolution flows outputs from one layer into inputs for the next. Explicit halt conditions stop the line when a layer fails. That is a pipeline. It was always a pipeline.</p>

<pre><code class="language-yaml"># deployment-pipeline.yaml — ordered module graph with halt conditions
pipeline:  
  name: site-deploy
  halt_on_failure: true
  modules:
    - name: networking
      weight: 10
      type: aws-vpc
      outputs: [vpc_id, subnet_ids]

    - name: sql
      weight: 20
      type: aws-rds
      inputs: { vpc_id: "#{networking.vpc_id}" }
      outputs: [db_endpoint]

    - name: kubernetes
      weight: 30
      type: aws-eks
      inputs: { subnet_ids: "#{networking.subnet_ids}" }
      outputs: [cluster_endpoint, cluster_ca]

    - name: namespaces
      weight: 40
      type: k8s-namespace
      inputs: { cluster: "#{kubernetes.cluster_endpoint}" }

    - name: application
      weight: 50
      type: k8s-deployment
      inputs:
        cluster: "#{kubernetes.cluster_endpoint}"
        db_endpoint: "#{sql.db_endpoint}"
      outputs: [service_url]

    - name: cdn
      weight: 120
      type: aws-cloudfront-distribution
      inputs: { origin: "#{application.service_url}" }
      outputs: [distribution_url]

    - name: monitoring
      weight: 140
      type: gcp-monitoring
      inputs: { endpoints: ["#{application.service_url}", "#{cdn.distribution_url}"] }
</code></pre>

<p>Each module declares its weight and its inputs. The engine resolves <code>#{...}</code> placeholders from upstream outputs. If <code>sql</code> fails, everything at weight 30+ halts — there is nothing to reconcile downstream because the inputs don't exist yet.</p>

<p>Both reconciliation and orchestration are necessary. Neither replaces the other. The mistake is treating one as a sufficient substitute for the other because it handles the demo well. (For a deeper look at how these patterns play out across ArgoCD, GitOps promotion, and Octopus Deploy in multi-account estates, see <a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/">Deployment Orchestration for Multi-Environment EKS</a>.)</p>

<hr>

<h2 id="environmentsareruntimeparametersnotgitpaths">Environments Are Runtime Parameters, Not Git Paths</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/0dd411ed-microservices-architecture-v2-1920x1080-1.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>In the GitOps model, environments are Git paths — <code>overlays/dev/</code>, <code>overlays/prod/</code> — each requiring its own reconciliation instance, its own deploy key, its own CI token. Adding a new environment means new credentials, often a new ArgoCD installation, sometimes an entirely new Git repository.</p>

<p>Teams that have managed 50, 100, or 300 environments across separate cloud accounts converge on a different model: environments as runtime parameters. The same module definitions apply everywhere. The difference between dev and prod is a set of variable substitutions — <code>#{Environment}</code>, <code>#{Lifecycle}</code>, <code>#{Repository}</code> — resolved at deploy time against the same module catalog.</p>

<p>This is not a cosmetic difference. It determines how operational complexity scales.</p>

<p>In the GitOps model, complexity is proportional to environments multiplied by workarounds per environment type. In the parameterized model, complexity is proportional to the number of modules in the catalog. Ten environments or a hundred — the catalog doesn't change. A site that needs three environments doesn't need three repos, three sets of deploy keys, or three ArgoCD installations. It needs three YAML files in a directory.</p>

<pre><code class="language-yaml"># sites/client-portal/environments/production.yaml
environment: production  
lifecycle: long-lived  
variables:  
  instance_type: m6i.xlarge
  replicas: 3
  db_instance_class: db.r6g.large
  cdn_price_class: PriceClass_All
  monitoring_alert_channel: "#ops-critical"
  domain: portal.client.com

# sites/client-portal/environments/staging.yaml
environment: staging  
lifecycle: long-lived  
variables:  
  instance_type: t3.medium
  replicas: 1
  db_instance_class: db.t4g.medium
  cdn_price_class: PriceClass_100
  monitoring_alert_channel: "#ops-staging"
  domain: staging-portal.client.com

# sites/client-portal/environments/dev.yaml
environment: dev  
lifecycle: ephemeral  
variables:  
  instance_type: t3.small
  replicas: 1
  db_instance_class: db.t4g.small
  cdn_price_class: PriceClass_100
  monitoring_alert_channel: "#dev"
  domain: dev-portal.client.com
</code></pre>

<p>Adding a fourth environment — say, a <code>demo</code> instance for sales — is adding a fourth YAML file. No new pipeline, no new ArgoCD instance, no new deploy keys. The module catalog stays the same. The variables change.</p>

<p>The strongest implementations put environment configs in the application repo itself — not a separate infra-configs repo that developers have to cross-reference. A developer working on a site can see its production CloudFront config, its staging database config, and its dev monitoring config all in one place, versioned alongside the code those configs serve. The multi-repo coordination problem dissolves entirely.</p>

<hr>

<h2 id="theselfhealingblindspot">The Self-Healing Blind Spot</h2>

<p><img src="https://stateless-udx-io.imgix.net/2025/05/2fde886a-secure-authenticated-access-architecture-v2.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>Here is a scenario that happens more often than anyone admits: a pod is <code>Running</code> and <code>Ready</code>. The reconciliation controller reports healthy. And the service is silently failing — dropping requests, returning errors, connected to a stale database endpoint that was rotated two hours ago. The manifest matches Git perfectly. The system is broken.</p>

<p>ArgoCD's drift detection reports that the cluster matches Git. It does not report that the system is working. These are different questions, and conflating them creates a blind spot that no amount of self-healing can fix.</p>

<p>There is a deeper problem. ArgoCD runs <em>inside</em> the cluster it is evaluating. If the cluster is degraded — if nodes are under memory pressure, if the network is flapping, if the control plane is overloaded — ArgoCD is running in that degraded state and reporting from inside it. This is not independent verification. It is the evaluated system attesting to its own correctness.</p>

<p>Safety-critical engineering has a formal term for this: Independent Verification and Validation (IV&amp;V). Aviation software standards (DO-178C) and automotive safety standards (ISO 26262) both require that the system verifying a component be structurally independent from the system that operates it — not just organizationally separate, but with no shared failure modes. A flight computer doesn't verify itself. A brake controller doesn't sign off on its own output.</p>

<p>The strongest deployment systems maintain three independent sources of truth: what was declared (the YAML configs), what was applied (the Terraform state or Kubernetes API server), and what is actually running (external probes hitting real endpoints from outside the cluster). When all three agree, the system is healthy. When any diverge, you know <em>which layer</em> failed — not just that something is wrong, but whether it's a config problem, an apply problem, or a runtime problem.</p>

<pre><code class="language-yaml"># health/verification.yaml — three-source-of-truth health check
verification:  
  declared:
    source: git
    ref: main
    path: sites/client-portal/environments/production.yaml
    check: sha256 of config matches last deployed snapshot

  applied:
    source: terraform-state
    backend: s3://deployments/client-portal/production/terraform.tfstate
    check: resource attributes match declared config values
    resources:
      - aws_cloudfront_distribution.main
      - aws_route53_record.primary
      - aws_ecs_service.app

  running:
    source: external-probes
    check: synthetic requests from outside the cluster
    probes:
      - type: http
        url: https://portal.client.com/healthz
        expect: { status: 200, body_contains: "ok", latency_ms_max: 500 }
      - type: dns
        record: portal.client.com
        expect: { cname: d1234.cloudfront.net }
      - type: tls
        host: portal.client.com
        expect: { issuer: "Amazon", days_until_expiry_min: 30 }
</code></pre>

<p>When <code>declared</code> and <code>applied</code> agree but <code>running</code> fails, you have a runtime problem — the infra is correct but the application is broken. When <code>declared</code> and <code>running</code> agree but <code>applied</code> diverges, someone changed infrastructure outside the pipeline. Each combination points to a different root cause.</p>

<hr>

<h2 id="theairgappedclusterisnotanedgecase">The Air-Gapped Cluster Is Not an Edge Case</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/02/cddf93bc-udx-article-software-supply-chain-security.jpg?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>The GitOps pull model requires the cluster to reach Git over HTTPS. When the cluster cannot reach Git — by design, not by misconfiguration — the model stops. Syncs fail silently. Drift goes undetected. The system diverges from its declared state with no automated correction.</p>

<p>The instinct is to treat this as a network problem to solve: add a GHES instance, peer the VPCs, set up a Git mirror. These are valid solutions when restricted access is the constraint. They are the wrong answer when the premise is "this cluster genuinely cannot and will not reach any Git endpoint."</p>

<p>Air-gapped environments are not edge cases in government, defense, and regulated industries. They are the baseline. A cluster in a classified enclave, a client-managed environment with strict egress controls, or a GovCloud deployment with no outbound internet is a hard constraint to design around, not a network problem to fix.</p>

<p>There are three models for reaching these environments:</p>

<p><strong>Pull-based</strong> (ArgoCD) — the cluster reaches out to Git. Fails when Git is unreachable.</p>

<p><strong>Push-based with agents</strong> — a lightweight agent inside the cluster initiates an outbound connection to an orchestration server. Works when the cluster can reach one known HTTPS endpoint, even if it can't reach Git.</p>

<p><strong>Workflow-triggered</strong> — a CI/CD workflow runs entirely outside the cluster, authenticates via OIDC federation and role assumption, and applies changes remotely through cloud APIs. The cluster doesn't initiate anything. It doesn't know Git exists. The workflow is the actor; the cluster is the substrate.</p>

<p>The third model is how multi-account infrastructure actually gets deployed in practice. A GitHub Actions workflow assumes an IAM role via OIDC, chains into deployment roles in target accounts, and applies Terraform modules against those accounts' resources. (The mechanics of how OIDC federation, IRSA, Pod Identity, and cross-account STS role chains actually wire together inside EKS are detailed in <a href="http://andypotanin.com/octopus-deploy-on-aws/">Octopus Deploy on AWS</a>.) The useful question is not "how do we get Git access into this cluster?" It is "why does this cluster need to reach Git at all?"</p>

<pre><code class="language-yaml"># .github/workflows/deploy-infrastructure.yaml
name: Deploy Infrastructure  
on:  
  push:
    branches: [main]
    paths: ["sites/*/environments/**"]

permissions:  
  id-token: write  # OIDC token for AWS STS
  contents: read   # required for actions/checkout once a permissions block restricts defaults

jobs:  
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        account:
          - { name: dev, role: "arn:aws:iam::111111111111:role/deploy-infra", region: us-east-1 }
          - { name: prod, role: "arn:aws:iam::222222222222:role/deploy-infra", region: us-east-1 }
          - { name: govcloud, role: "arn:aws:iam::333333333333:role/deploy-infra", region: us-gov-west-1 }
    steps:
      - uses: actions/checkout@v4

      - name: Assume deployment role via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ matrix.account.role }}
          aws-region: ${{ matrix.account.region }}
          role-duration-seconds: 900

      - name: Apply infrastructure modules
        run: |
          ./deploy.sh \
            --environment ${{ matrix.account.name }} \
            --config sites/client-portal/environments/${{ matrix.account.name }}.yaml \
            --halt-on-failure
</code></pre>

<p>The cluster is the target. GitHub Actions authenticates via OIDC federation — no stored credentials, no deploy keys, no agent inside the cluster. The GovCloud account deploys through the same workflow with a different role ARN. The air-gapped cluster never initiates a connection to anything.</p>
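<p>On the AWS side, the trust relationship is what makes this credential-free. A hedged CloudFormation sketch of a <code>deploy-infra</code> role, where the org/repo filter in the <code>sub</code> condition is an assumption you would scope to your own repository and branch:</p>

<pre><code class="language-yaml"># Illustrative CloudFormation for the OIDC-trusting deploy role.
# The repo filter in the `sub` condition is an assumption, not taken
# from the workflow above.
DeployInfraRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: deploy-infra
    MaxSessionDuration: 3600
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action: sts:AssumeRoleWithWebIdentity
          Principal:
            Federated: !Sub "arn:aws:iam::${AWS::AccountId}:oidc-provider/token.actions.githubusercontent.com"
          Condition:
            StringEquals:
              token.actions.githubusercontent.com:aud: sts.amazonaws.com
            StringLike:
              token.actions.githubusercontent.com:sub: "repo:example-org/infra-configs:ref:refs/heads/main"
</code></pre>

<p>The <code>sub</code> condition is what prevents any other repository or branch in the GitHub org from assuming the role.</p>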

<hr>

<h2 id="complianceboundariesreshapearchitecture">Compliance Boundaries Reshape Architecture</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/a32cb205-kubernetes-orchestration-control-flow-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>A shared deploy key across two AWS accounts with different compliance classifications is a lateral movement vector. If the key is compromised — through a supply chain attack, a leaked CI secret, a compromised runner — an attacker can read the manifests for every environment that key has access to.</p>

<p>The instinct is to say "it's just read access." In a CMMC Level 2 environment, where the boundary between CUI-handling systems and non-CUI systems must be demonstrably enforced, "just read" is not a sufficient control for an assessor. The threat model is not "can the attacker modify?" It is "can the attacker learn the topology, the endpoints, the config patterns?" Read access to production manifests is reconnaissance.</p>

<p>The GitOps solution — separate repos, separate keys, separate tokens per environment — works but multiplies operational surface. Each new compliance boundary adds another repo, another credential set, another CI configuration.</p>

<p>There is a structural alternative that is both simpler and stronger: keep environment configs in the application repo, and enforce boundaries through credential scope. The workflow reads configs from directories within the repo it already has access to — <code>infra_configs/production/</code>, <code>infra_configs/development/</code> — and applies them through short-lived, scoped STS credentials per step, per account. No shared deploy keys across account boundaries. No cross-repo access tokens. The compliance boundary is enforced by the credential scope, not by repo-level access controls.</p>

<pre><code class="language-yaml"># sites/client-portal/credentials.yaml — scoped credentials per environment
credentials:  
  production:
    aws_account: "222222222222"
    role: "arn:aws:iam::222222222222:role/site-deploy-production"
    session_duration: 900
    boundary_policy: "arn:aws:iam::222222222222:policy/production-boundary"
    allowed_services: [cloudfront, route53, ecs, rds, secretsmanager]

  staging:
    aws_account: "111111111111"
    role: "arn:aws:iam::111111111111:role/site-deploy-staging"
    session_duration: 3600
    allowed_services: [cloudfront, route53, ecs, rds, secretsmanager, s3]

  govcloud:
    aws_account: "333333333333"
    role: "arn:aws:iam::333333333333:role/site-deploy-govcloud"
    session_duration: 900
    boundary_policy: "arn:aws:iam::333333333333:policy/cui-boundary"
    allowed_services: [cloudfront, route53, ecs, rds]
    require_mfa: true
</code></pre>

<p>Each environment gets its own role with its own permission boundary. The production role cannot touch staging resources. The GovCloud role has an additional MFA requirement and a tighter service scope. No shared credentials cross any account boundary — the structure enforces what policy documents promise.</p>

<p>Compliance constraints are not implementation details to be optimized away. They are requirements that reshape architecture. The team that designs for them from the start ends up with something cleaner than the team that bolts them on later.</p>

<hr>

<h2 id="theunitofdeploymentisasitenotamanifest">The Unit of Deployment Is a Site, Not a Manifest</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/cd2c2147-docker-image-release-cycle-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>This is the decision that determines everything downstream, and it is rarely discussed because the answer is assumed: the unit of deployment is a Kubernetes manifest set. Everything else — DNS, CDN, SSL, monitoring, database instances — belongs to a separate pipeline.</p>

<p>In practice, the unit of deployment for most organizations is a <strong>site</strong>: a complete, addressable service that includes application containers, networking, DNS, CDN, SSL, monitoring, and often a database. A site has environments. Each environment has its own CloudFront distribution, its own Route53 records, its own ACM certificate, its own monitoring config. The manifest set is one layer of the site, not the whole thing.</p>

<p>When the unit is a manifest set, adding CDN management means building a separate Terraform pipeline — separate state backend, separate credentials, separate promotion model. Two pipelines. Two sets of failure modes. Two places to look when something breaks at 2am.</p>

<p>When the unit is a site, everything is one pipeline. The module catalog includes <code>k8s-deployment</code> alongside <code>aws-cloudfront-distribution</code> and <code>aws-route53</code> and <code>gcp-monitoring</code>. The site's config declares all the modules it needs. The workflow engine applies them in dependency order. Adding monitoring to a site is adding a YAML file, not building a new pipeline.</p>

<pre><code class="language-yaml"># sites/client-portal/site.yaml — the site is the unit of deployment
site:  
  name: client-portal
  slug: client-portal
  owner: platform-team

  modules:
    - type: aws-vpc
      config: modules/networking.yaml

    - type: aws-rds
      config: modules/database.yaml

    - type: aws-eks
      config: modules/kubernetes.yaml

    - type: k8s-deployment
      config: modules/application.yaml
      image: "#{Registry}/client-portal:#{GitSha}"

    - type: aws-cloudfront-distribution
      config: modules/cdn.yaml
      origin: "#{k8s-deployment.service_url}"
      aliases: ["#{domain}", "www.#{domain}"]

    - type: aws-route53
      config: modules/dns.yaml
      records:
        - name: "#{domain}"
          type: A
          alias: "#{aws-cloudfront-distribution.domain_name}"

    - type: aws-acm
      config: modules/ssl.yaml
      domain: "#{domain}"
      validation: dns

    - type: gcp-monitoring
      config: modules/monitoring.yaml
      endpoints:
        - "https://#{domain}/healthz"
        - "https://#{domain}/api/status"

  environments:
    - path: environments/production.yaml
    - path: environments/staging.yaml
    - path: environments/dev.yaml
</code></pre>

<p>One file declares the entire site — application, infrastructure, CDN, DNS, SSL, monitoring. One pipeline applies it. When something breaks at 2am, there is one place to look. Adding a new capability is adding a module reference, not building a new pipeline.</p>

<p>This is the architectural decision that separates systems that scale gracefully from systems that accumulate operational surface with every new capability. It determines how many pipelines you operate, how many state backends you manage, how many credential scopes you maintain, and how much context a new team member needs before they can safely deploy. Even legacy protocols like SFTP — still a hard requirement in many enterprise environments — fit cleanly into the site model when the <a href="http://andypotanin.com/sftp-in-cloud/">gateway is built for Kubernetes</a> rather than bolted on as a separate system.</p>

<hr>

<h2 id="theoptimizationscopeproblem">The Optimization Scope Problem</h2>

<p><img src="https://stateless-udx-io.imgix.net/2023/07/e5a9ac2b-understanding-containerization-in-microservices-architecture.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=600" alt="What the Operator Knows That the Tooling Evangelist Doesn't"></p>

<p>The common thread in all of this is optimization scope.</p>

<p>Tooling evangelism optimizes for the showcase scenario: a single cloud provider, internet-accessible clusters, a unified account structure, primarily Kubernetes workloads, a team with strong Git discipline. Within that scenario, the GitOps model is genuinely elegant. The demo works. The blog post writes itself.</p>

<p>Operators optimize for the full estate. Separate accounts. Air-gapped clusters. Cross-cloud deployments. Non-Kubernetes workloads. Compliance boundaries that cannot be satisfied with shared credentials. CDN and DNS that deploy alongside the application, not in a separate pipeline. Client-managed environments where you control the application but not the network.</p>

<p>The team that built the platform running hundreds of higher-ed institutions didn't start by choosing a tool. They started by enumerating the constraints: multi-account isolation, regulated data environments, clients who couldn't modify their network, infrastructure that spanned Kubernetes and cloud-native services in the same deployment. The tool choices — ordered deployment graphs, parameterized environments, site-level deployment units — followed from the constraints. The constraints were never optional.</p>

<p>The showcase scenario is a useful starting point. It is not a reliable ending point for anyone running production systems at scale under real constraints. And when the deployment pipeline is finally right, the next bottleneck is almost always <a href="http://andypotanin.com/when-cicd-isnt-enough-rapid-iteration-dependency/">iteration speed</a> — the gap between a developer's thought and its validation in a real environment.</p>

<hr>

<h2 id="evaluatingdeploymentarchitecture">Evaluating Deployment Architecture</h2>

<p>The standard evaluation asks whether a system handles the common case. The harder evaluation asks what the system <em>costs</em> — not in dollars, but in cognitive load, blast radius, and recovery time.</p>

<p><strong>Blast radius is an architecture choice, not an incident metric.</strong> A deployment system where a bad config change propagates to every environment simultaneously has a fundamentally different risk profile than one where promotion is explicit and environments are structurally isolated. The question is not "how fast can we roll back?" It is "how many environments were affected before anyone knew?" Progressive delivery — canary releases, traffic shifting, automated rollback on error budgets — reduces blast radius for application changes. But infrastructure changes (a DNS record, a CloudFront behavior, an IAM policy) rarely have canary equivalents. If the deployment architecture treats infrastructure changes with the same blast radius as application changes, that is a design gap, not an acceptable tradeoff.</p>
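<p>For application changes, progressive delivery is usually expressed as a canary strategy. Sketched here in Argo Rollouts syntax, with illustrative traffic weights, durations, and analysis template name:</p>

<pre><code class="language-yaml"># Sketch of a canary strategy (Argo Rollouts). Weights, durations, and
# the analysis template are illustrative; pod template and selector are
# omitted for brevity.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: client-portal
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10           # shift 10% of traffic to the new version
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
      analysis:                   # automated rollback when the analysis fails
        templates:
          - templateName: error-rate
</code></pre>

<p>Note that nothing equivalent exists for the DNS record or the IAM policy in the same release: the canary machinery only covers the application layer.</p>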

<p><strong>Time-to-first-deploy reveals what the documentation hides.</strong> The 2025 DORA State of DevOps report found that the platform engineering capabilities most correlated with positive outcomes were those that gave clear feedback on deployment results and reduced the number of steps a developer needed to go from code to running service. The strongest signal is not deployment frequency — a metric that rewards small, frequent changes regardless of whether the system makes them easy or just tolerable. The strongest signal is how long it takes a new engineer, with no prior context, to deploy a change to a real environment. If the answer involves reading a wiki, requesting credentials from three teams, and understanding which of four pipelines applies to their service, the architecture has failed at the layer that matters most: approachability.</p>

<p><strong>Recovery time is shaped before the incident starts.</strong> DORA's Failed Deployment Recovery Time metric measures the clock between "something broke" and "the fix is deployed." But the actual recovery experience is determined by architectural decisions made months earlier. Can the operator see what changed? Is there a single deployment record with the config snapshot, the approver, the timestamp, and the previous state — or does recovery require reconstructing the sequence from Git history, Terraform state, CloudWatch logs, and someone's memory? Systems that maintain structured deployment records with diffable config snapshots recover faster not because their operators are better, but because the architecture gives them something to work with.</p>
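<p>What such a record might contain is worth sketching, even though every field name here is hypothetical; the point is the shape, not the schema:</p>

<pre><code class="language-yaml"># Hypothetical deployment record: the structure an operator wants at 2am.
deployment:
  id: client-portal-prod-2047
  release: "1.4.2"
  environment: production
  approved_by: jane.doe
  applied_at: "2026-04-06T17:20:46Z"
  config_snapshot: sha256:9f2c…     # diffable against the previous snapshot
  previous_deployment: client-portal-prod-2046
  result: succeeded
</code></pre>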

<p><strong>The cognitive load test is the one most teams skip.</strong> Team Topologies introduced the distinction between intrinsic cognitive load (the complexity of the domain), extraneous cognitive load (the complexity of the tooling), and germane cognitive load (the learning that actually improves capability). A deployment architecture that requires developers to understand overlay directory structures, Kustomize patch semantics, ArgoCD sync waves, and the interaction between Helm values and environment-specific overrides is extraneous load — complexity that serves the tool, not the domain. The question is whether the architecture absorbs that complexity into the platform or distributes it to every team that deploys.</p>

<p><strong>Measure what the system prevents, not just what it enables.</strong> Every deployment architecture enables deployments. The differentiator is what it prevents. Does it prevent a release from reaching production without passing through earlier environments — structurally, not by policy? Does it prevent credential reuse across compliance boundaries? Does it prevent a single pipeline failure from blocking unrelated services? The things a system makes impossible are more revealing than the things it makes possible, because prevention is structural and enablement is aspirational.</p>

<p>The operator who evaluates on these dimensions is not optimizing for elegance. They are optimizing for the moment when something goes wrong at 2am and the architecture either helps them recover or becomes the thing they have to recover from.</p>]]></content:encoded></item><item><title><![CDATA[Deployment Orchestration for Multi-Environment EKS]]></title><description><![CDATA[ArgoCD, GitOps, and when Octopus Deploy wins for multi-account, air-gapped, and cross-cloud Kubernetes deployments.]]></description><link>http://andypotanin.com/deployment-orchestration-multi-environment-eks/</link><guid isPermaLink="false">f5a21a7c-5bcb-4c26-b865-b803a1a93906</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Wed, 01 Apr 2026 17:26:53 GMT</pubDate><media:content url="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=1080" medium="image"/><content:encoded><![CDATA[<h1 id="deploymentorchestrationformultienvironmenteksargocdgitopsandwhenoctopusdeploywins">Deployment Orchestration for Multi-Environment EKS: ArgoCD, GitOps, and When Octopus Deploy Wins</h1>

<img src="https://stateless-udx-io.imgix.net/2025/05/92613b5b-aws-cloud-infrastructure-diagram.png?auto=compress,enhance,format&q=80&fit=crop&crop=faces,edges&w=1920&h=1080" alt="Deployment Orchestration for Multi-Environment EKS"><p>This article is about deploying containerized applications across multiple environments — dev EKS, prod EKS, air-gapped GovCloud, cross-cloud targets — in a way that is secure, auditable, and doesn't multiply developer complexity with every new environment you add.</p>

<p>If you've read about GitOps and ArgoCD, you've probably encountered two things that sound like they solve everything:</p>

<ul>
<li><strong>ArgoCD</strong> — a Kubernetes controller that continuously syncs a cluster to a Git repo</li>
<li><strong>GitOps promotion</strong> — the idea that "promotion" is just a Git commit to the next environment's config</li>
</ul>

<p>Both are real and useful. But as soon as your estate grows — separate AWS accounts that can't peer, air-gapped compliance environments, cross-cloud targets, developer teams that need to iterate fast without opening infra PRs — the GitOps-only model starts accumulating hidden complexity that it offloads onto your team.</p>

<p>This guide walks through the complete mental model: what ArgoCD actually does, where it genuinely wins, where it breaks down, and why an orchestrator like Octopus Deploy with outbound-only agents is often the right answer for multi-environment production estates.</p>

<hr>

<h2 id="1theplayerswhoswhointhissystem">1. The Players: Who's Who in This System</h2>

<p>Before getting into patterns, let's define every actor in this system so there's no confusion about who does what.  </p>

<h3 id="sourcecontrol">Source Control</h3>

<ul>
<li><strong>GitHub.com / GitHub Enterprise Server (GHES)</strong> — where application code, Dockerfiles, Kubernetes manifests, and Terraform configs live. GHES is the self-hosted version that can run inside a private VPC.</li>
<li><strong>App repo</strong> — the repository a developer works in: source code, Dockerfile, <code>skaffold.yaml</code>, and the app's base Kubernetes manifests.</li>
<li><strong>Infra-configs repo</strong> — the repository that owns environment-specific overlays and Terraform for cloud resources. This is what ArgoCD watches. In a strict compliance model, this is a separate repo per environment with a separate deploy key.</li>
</ul>

<h3 id="buildandci">Build and CI</h3>

<ul>
<li><strong>GitHub Actions</strong> — CI/CD workflows that run on push, PR, and schedule. Builds images, runs Trivy scans, signs with Cosign, pushes to ECR, and updates image tags in infra-configs. Critically: GitHub Actions <strong>never touches a Kubernetes cluster directly</strong> in a properly designed system.</li>
<li><strong>Dependabot</strong> — GitHub's automated dependency updater. Opens PRs when base images, Terraform providers, Helm chart versions, or GitHub Actions SHAs are outdated. Those PRs go through the same CI pipeline as human commits.<sup id="fnref:1"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:1" rel="footnote">1</a></sup></li>
<li><strong>ARC (Actions Runner Controller)</strong> — a Kubernetes operator that runs GitHub Actions runners as pods inside EKS, giving runners network proximity to internal services for integration testing without exposing the cluster API externally.<sup id="fnref:2"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:2" rel="footnote">2</a></sup></li>
</ul>

<h3 id="registry">Registry</h3>

<ul>
<li><strong>Amazon ECR</strong> — the image registry. Reachable from EKS via VPC endpoint (no internet egress required). Supports immutable tags, image scanning via Amazon Inspector, and CloudTrail audit logging for every push and pull.<sup id="fnref:3"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:3" rel="footnote">3</a></sup></li>
</ul>

<h3 id="configrendering">Config Rendering</h3>

<ul>
<li><strong>Kustomize</strong> — a tool for layering Kubernetes manifests. A <code>base/</code> defines the app skeleton; <code>overlays/&lt;env&gt;/</code> patches only what differs per environment. ArgoCD renders Kustomize natively.<sup id="fnref:4"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:4" rel="footnote">4</a></sup><sup id="fnref:5"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:5" rel="footnote">5</a></sup></li>
<li><strong>Helm</strong> — the package manager for Kubernetes manifests. Overlapping use case with Kustomize; common in vendor charts. ArgoCD supports both.<sup id="fnref:6"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:6" rel="footnote">6</a></sup></li>
<li><strong>External Secrets Operator (ESO)</strong> — a Kubernetes operator that syncs secrets from AWS Secrets Manager (or Vault) into Kubernetes Secrets at runtime. The app never sees raw credentials; ESO injects them as env vars or mounted files.<sup id="fnref:7"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:7" rel="footnote">7</a></sup><sup id="fnref:8"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:8" rel="footnote">8</a></sup></li>
</ul>
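<p>To make the layering concrete: a production overlay is typically a few lines that reference the base and patch only what differs. Paths and names here are illustrative:</p>

<pre><code class="language-yaml"># overlays/prod/kustomization.yaml (illustrative layering sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # the shared app skeleton
patches:
  - path: replica-count.yaml   # prod-only patch: more replicas, tighter limits
images:
  - name: client-portal
    newTag: "1.4.2"            # CI bumps this tag to promote a build
</code></pre>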

<h3 id="deployment">Deployment</h3>

<ul>
<li><strong>ArgoCD</strong> — a Kubernetes controller that runs inside a cluster, watches a Git repo, renders manifests (Kustomize or Helm), and applies them to the cluster. It does <strong>not</strong> push; it pulls. It does <strong>not</strong> know about other clusters, branches, or environments except its own. It has no native concept of "promote this to the next env."<sup id="fnref:9"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:9" rel="footnote">9</a></sup><sup id="fnref:10"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:10" rel="footnote">10</a></sup></li>
<li><strong>Octopus Deploy</strong> — a release orchestration platform. Models a named release (e.g., <code>1.4.2</code>) traveling through an ordered set of environments (dev → staging → prod). Deploys to targets via a <strong>Tentacle agent</strong> — a lightweight process installed inside the target that opens an outbound HTTPS connection to Octopus Server. The cluster initiates the connection; no inbound ports required.<sup id="fnref:11"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:11" rel="footnote">11</a></sup><sup id="fnref:12"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:12" rel="footnote">12</a></sup></li>
<li><strong>Skaffold</strong> — a developer tool that watches source files, rebuilds images, and redeploys to a local or remote Kubernetes cluster on every save. The developer's local on-ramp to the same manifests that prod runs.<sup id="fnref:13"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:13" rel="footnote">13</a></sup><sup id="fnref:14"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:14" rel="footnote">14</a></sup></li>
</ul>

<h3 id="localdevelopment">Local Development</h3>

<ul>
<li><strong>Docker Desktop / minikube / kind</strong> — local Kubernetes clusters that run on a developer's laptop. Used with Skaffold for a complete local dev environment.</li>
<li><strong>Docker Compose</strong> — for teams that don't need a full local K8s cluster; runs the app and its dependencies (Postgres, Redis, etc.) side by side.</li>
</ul>

<hr>

<h2 id="2whatargocdactuallydoesanddoesntdo">2. What ArgoCD Actually Does (and Doesn't Do)</h2>

<p>ArgoCD is a <strong>config sync solution</strong>. It answers one question: "Does the live state of this cluster match what Git says it should be?" If not, it reconciles.<sup id="fnref:15"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:15" rel="footnote">15</a></sup><sup id="fnref:16"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:16" rel="footnote">16</a></sup></p>

<pre><code class="language-mermaid">flowchart LR  
    subgraph Git["Git Repo"]
        manifests["Kustomize / Helm\nManifests"]
    end

    subgraph ArgoCD["ArgoCD Controller"]
        render["Render"]
        diff["Diff"]
        apply["Apply"]
    end

    subgraph Cluster["EKS Cluster"]
        live["Live State"]
    end

    manifests --&gt;|pull| render
    render --&gt; diff
    diff --&gt;|drift detected| apply
    apply --&gt; live
    live --&gt;|compare| diff
</code></pre>

<p>Given a source (<code>repoURL</code>, <code>targetRevision</code>, <code>path</code>) and a destination (<code>server</code>, <code>namespace</code>), ArgoCD:</p>

<ol>
<li>Renders the manifests (Kustomize, Helm, or plain YAML) from Git  </li>
<li>Compares the rendered output to what's actually running in the cluster  </li>
<li>If there's drift, applies the diff via server-side apply  </li>
<li>Repeats on a timer (default 3 minutes) or immediately on a Git webhook<sup id="fnref:10"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:10" rel="footnote">10</a></sup></li>
</ol>

<p>This is powerful. Drift detection and self-healing mean that even if someone manually <code>kubectl apply</code>s something in prod, ArgoCD reverts it on the next sync cycle — a strong compliance control.  </p>

<h3 id="whatargocddoesnotdo">What ArgoCD Does Not Do</h3>

<ul>
<li><strong>It is not aware of other branches or environments.</strong> A single ArgoCD Application knows one Git ref and one cluster. It has no concept of "after this syncs successfully, do something in the next environment."<sup id="fnref:9"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:9" rel="footnote">9</a></sup></li>
<li><strong>It does not model a release.</strong> There is no "Release 1.4.2" object in ArgoCD. There is only "what does Git currently say, and does the cluster match it."</li>
<li><strong>It does not orchestrate non-Kubernetes resources.</strong> RDS, IAM roles, VPCs, Lambda functions — none of these are ArgoCD's domain.</li>
<li><strong>It does not push.</strong> The cluster must be able to reach Git. If it can't, ArgoCD stops working.<sup id="fnref:10"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:10" rel="footnote">10</a></sup></li>
<li><strong>It has no native promotion UI or approval workflow.</strong> Manual approval is approximated by setting sync mode to <code>manual</code> on production Applications, requiring a human to click "Sync" in the ArgoCD UI or trigger it via CLI.</li>
</ul>
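<p>That manual-approval approximation is simple to express. In a sketch (the repo URL and cluster endpoint are illustrative), a production Application simply omits the <code>automated</code> block from its sync policy:</p>

<pre><code class="language-yaml"># Production Application with manual sync. Omitting `automated` from
# syncPolicy means nothing deploys until a human triggers Sync.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-prod
  namespace: argocd
spec:
  source:
    repoURL: https://ghes.internal/udx/infra-configs
    targetRevision: HEAD
    path: overlays/prod
  destination:
    server: https://prod-eks-api.internal
    namespace: myapp
  syncPolicy: {}   # no `automated` block; sync is a deliberate human action
</code></pre>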

<hr>

<h2 id="3thecaseforpuregitopswhenitworks">3. The Case for Pure GitOps (When It Works)</h2>

<p>For teams with a single cloud provider, internet-accessible clusters, a unified AWS account model, and primarily Kubernetes workloads, the pure GitOps approach is elegant and low-overhead.  </p>

<h3 id="thepromotionflow">The Promotion Flow</h3>

<pre><code>CI builds image → pushes to ECR → updates image tag in infra-configs/overlays/dev/  
ArgoCD syncs dev → PostSync health check passes  
CI opens PR on infra-configs/overlays/staging/ → auto-merge after CI  
ArgoCD syncs staging → PostSync health check passes  
CI opens PR on infra-configs/overlays/prod/ → requires human approval  
Human approves → ArgoCD manual sync triggered  
</code></pre>
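<p>The image-tag update steps above usually reduce to a <code>kustomize edit</code> plus a commit. A hedged sketch of the dev step as a GitHub Actions fragment, assuming a checked-out infra-configs repo and an <code>ECR_REGISTRY</code> variable (both assumptions, not part of the flow above):</p>

<pre><code class="language-yaml"># Hypothetical CI step; repo layout and variable names are assumptions.
- name: Promote image to dev overlay
  run: |
    cd infra-configs/overlays/dev
    kustomize edit set image "client-portal=${ECR_REGISTRY}/client-portal:${GITHUB_SHA}"
    git commit -am "promote client-portal ${GITHUB_SHA} to dev"
    git push origin main
</code></pre>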

<pre><code class="language-mermaid">flowchart TD  
    ci["CI: Build + Push Image"] --&gt; dev_pr["Update image tag\nin overlays/dev/"]
    dev_pr --&gt; argo_dev["ArgoCD syncs Dev"]
    argo_dev --&gt; health_dev{"Health check?"}
    health_dev --&gt;|pass| stg_pr["PR to overlays/staging/"]
    health_dev --&gt;|fail| rollback_dev["SyncFail hook reverts"]
    stg_pr --&gt; argo_stg["ArgoCD syncs Staging"]
    argo_stg --&gt; health_stg{"Health check?"}
    health_stg --&gt;|pass| prod_pr["PR to overlays/prod/"]
    health_stg --&gt;|fail| rollback_stg["SyncFail hook reverts"]
    prod_pr --&gt; approval{"Human Approval"}
    approval --&gt;|approved| argo_prod["ArgoCD manual sync Prod"]
</code></pre>

<p>Every step is a Git commit. Every commit is auditable. GitHub Environment protection rules gate production deployment behind named reviewers. The blast radius of any failure is bounded to the environment that was synced.<sup id="fnref:17"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:17" rel="footnote">17</a></sup>  </p>

<h3 id="applicationsetformulticlusterfanout">ApplicationSet for Multi-Cluster Fan-Out</h3>

<p>When the same application needs to deploy to many clusters (e.g., regional deployments), ArgoCD's <strong>ApplicationSet</strong> controller generates one <code>Application</code> per cluster from a template and a generator. The template uses parameter placeholders (<code>{{cluster}}</code>, <code>{{environment}}</code>, <code>{{region}}</code>), and the generator provides a list of parameter sets.<sup id="fnref:18"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:18" rel="footnote">18</a></sup><sup id="fnref:19"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:19" rel="footnote">19</a></sup></p>

<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1  
kind: ApplicationSet  
metadata:  
  name: myapp-appset
  namespace: argocd
spec:  
  generators:
    - list:
        elements:
          - cluster: dev-us-east
            url: https://dev-eks-api.internal
            environment: dev
          - cluster: prod-us-east
            url: https://prod-eks-api.internal
            environment: prod
  template:
    metadata:
      name: '{{cluster}}-myapp'
    spec:
      source:
        repoURL: https://ghes.internal/udx/infra-configs
        targetRevision: HEAD
        path: overlays/{{environment}}
      destination:
        server: '{{url}}'
        namespace: myapp
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>

<p>ApplicationSet doesn't do promotion either — it just ensures every generated Application is reconciling to the right overlay for its environment.<sup id="fnref:20"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:20" rel="footnote">20</a></sup><sup id="fnref:21"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:21" rel="footnote">21</a></sup>  </p>

<h3 id="rollbackviasyncfailhooks">Rollback via SyncFail Hooks</h3>

<p>ArgoCD supports resource hooks that fire at specific points in the sync lifecycle. A <code>SyncFail</code> hook runs if the sync itself fails — useful for automatically reverting the Git commit that caused the failure:<sup id="fnref:22"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:22" rel="footnote">22</a></sup></p>

<pre><code class="language-yaml">apiVersion: batch/v1  
kind: Job  
metadata:  
  name: rollback-on-syncfail
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:  
  template:
    spec:
      containers:
        - name: rollback
          image: alpine/git
          command:
            - sh
            - -c
            - |
              git revert HEAD --no-edit
              git push origin HEAD
      restartPolicy: Never
</code></pre>

<p>Combined with PostSync health check Jobs, this creates a fully automated rollback loop: sync, test, revert if broken — no human needed.</p>
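<p>As a sketch of the other half of that loop, a <code>PostSync</code> hook Job can probe the application and fail the sync if it is unhealthy (the service name and health endpoint here are assumptions):</p>

<pre><code class="language-yaml">apiVersion: batch/v1  
kind: Job  
metadata:  
  name: postsync-health-check
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:  
  backoffLimit: 3
  template:
    spec:
      containers:
        - name: health-check
          image: curlimages/curl
          command:
            - sh
            - -c
            # Fail the Job (and therefore the sync) unless the app's
            # health endpoint returns 2xx within the retry budget
            - curl --fail --retry 10 --retry-delay 6 http://myapp.myapp.svc/healthz
      restartPolicy: Never
</code></pre>

<p>A failed hook marks the sync operation itself as failed, which is exactly what triggers the <code>SyncFail</code> rollback Job.</p>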

<hr>

<h2 id="4wheregitopsbreaksdownthemultiaccountmulticloudreality">4. Where GitOps Breaks Down: The Multi-Account, Multi-Cloud Reality</h2>

<p>The pure GitOps model makes an assumption that is easy to miss: <strong>every cluster can reach Git over HTTPS</strong>. In practice, this is often not true.  </p>

<h3 id="theseparateawsaccountproblem">The Separate AWS Account Problem</h3>

<p>For security and compliance, dev and prod typically live in separate AWS accounts with no VPC peering and no shared network path. Each cluster needs its own outbound path to Git. This is solvable — each ArgoCD instance independently reaches GitHub.com or GHES via its own outbound HTTPS — but it creates a different problem: if both clusters read the same Git repo, you're sharing a deploy key across account boundaries. A compromised deploy key in dev can now read prod manifests.<sup id="fnref:3"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:3" rel="footnote">3</a></sup></p>

<p>The correct solution is separate Git repos per environment, each with a scoped deploy key. But now you've added repo-per-environment management overhead, cross-repo CI permissions, and an extra layer for developers to navigate when things go wrong.  </p>
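<p>In ArgoCD terms, each environment's instance registers its own repository credential Secret, so the dev key never exists in the prod account. A minimal sketch, with repo name and key as placeholders:</p>

<pre><code class="language-yaml"># Registered only in the prod cluster's ArgoCD; the dev cluster's
# ArgoCD holds an equivalent Secret for the dev-only repo
apiVersion: v1  
kind: Secret  
metadata:  
  name: infra-configs-prod-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:  
  type: git
  url: git@ghes.internal:udx/infra-configs-prod.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    (deploy key scoped to the prod repo only -- placeholder)
    -----END OPENSSH PRIVATE KEY-----
</code></pre>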

<h3 id="theairgappedclusterproblem">The Air-Gapped Cluster Problem</h3>

<p>GovCloud environments, client-managed clusters, and classified systems often have no outbound internet access at all — or outbound access tightly restricted to known endpoints. ArgoCD inside such a cluster simply cannot function as a Git pull mechanism.<sup id="fnref:10"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:10" rel="footnote">10</a></sup></p>

<p>Workarounds exist — GHES inside a peered VPC, internal Git mirrors, or ArgoCD's newer <strong>Agent Mode</strong> (where a lightweight agent inside the cluster polls an external ArgoCD control plane, similar to Octopus's outbound-only model). Each adds operational complexity: another service to run, another failure point, another piece of infrastructure to maintain per environment. ArgoCD Agent Mode in particular narrows the gap with Octopus for pure-Kubernetes workloads, though it remains limited to Kubernetes targets and lacks Octopus's first-class release and promotion model.  </p>

<h3 id="thecrosscloudproblem">The Cross-Cloud Problem</h3>

<p>VPC peering doesn't cross cloud providers. An Azure AKS cluster cannot peer with an AWS VPC. Deploying to Azure AKS alongside AWS EKS in a single ArgoCD-based pipeline requires either a SaaS ArgoCD control plane, a complex overlay of VPNs and tunnel infrastructure, or a separate ArgoCD instance in Azure with its own repo access — each of which adds cost and operational overhead.  </p>

<h3 id="thenonkubernetesproblem">The Non-Kubernetes Problem</h3>

<p>ArgoCD is Kubernetes-native. RDS provisioning, IAM role creation, Route53 records, Lambda functions, Windows services, SQL migrations — none of these are first-class ArgoCD targets. Crossplane can bring some AWS resources into the Kubernetes API surface, but it adds substantial complexity and is not universally applicable.<sup id="fnref:20"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:20" rel="footnote">20</a></sup></p>
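<p>For a sense of what "bringing AWS resources into the Kubernetes API surface" looks like, a Crossplane-style RDS claim has roughly this shape. Treat it as a hedged sketch: the API group and fields vary by Crossplane provider version, and the names are illustrative.</p>

<pre><code class="language-yaml">apiVersion: database.aws.crossplane.io/v1beta1  
kind: RDSInstance  
metadata:  
  name: myapp-postgres
spec:  
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.medium
    engine: postgres
    allocatedStorage: 20
    masterUsername: myapp
  # Connection details land in a Secret the app can mount
  writeConnectionSecretToRef:
    name: myapp-postgres-conn
    namespace: myapp
</code></pre>

<p>Every such resource is now reconciled by a controller you must also operate, which is the complexity trade-off described above.</p>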

<hr>

<h2 id="5thehiddencomplexitymultiplier">5. The Hidden Complexity Multiplier</h2>

<p>Each GitOps workaround for a new environment type adds a multiplier to the operational surface:</p>

<ul>
<li>New air-gapped cluster → new internal Git mirror to operate</li>
<li>New AWS account → new deploy key, new repo, new CI token, new ArgoCD instance</li>
<li>New cloud provider → new VPN or tunnel, new ArgoCD instance, new cluster registration</li>
<li>New non-K8s resource → new Terraform pipeline, separate from ArgoCD, with its own promotion model</li>
</ul>

<p>By the time an organization has 5+ environments across 2+ clouds with mixed K8s and non-K8s workloads, the "simple GitOps model" has become a distributed system of its own — one that a new team member cannot reason about without a detailed architecture diagram.</p>

<hr>

<h2 id="6theoctopusdeploymodelonereleasentargets">6. The Octopus Deploy Model: One Release, N Targets</h2>

<p>Octopus Deploy approaches the same problem from the opposite direction. Instead of making every target pull from a shared Git repo, it models a <strong>release</strong> as a first-class object and <strong>pushes</strong> that release to targets via lightweight outbound-only agents.<sup id="fnref:11"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:11" rel="footnote">11</a></sup><sup id="fnref:12"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:12" rel="footnote">12</a></sup>  </p>

<h3 id="coreconcepts">Core Concepts</h3>

<ul>
<li><strong>Release</strong> — a versioned, immutable snapshot of a deployment: the image tag, variable snapshot, and process definition at a point in time. Release <code>1.4.2</code> is the same artifact everywhere it goes.</li>
<li><strong>Environment</strong> — a named deployment target or group of targets (dev, staging, prod-us, prod-eu). Environments have their own variable values, approval requirements, and retention policies.</li>
<li><strong>Lifecycle</strong> — the ordered progression of environments a release must pass through. Octopus enforces that a release cannot reach prod without first completing dev and staging.</li>
<li><strong>Kubernetes Agent / Tentacle</strong> — lightweight agents installed inside the target. For Kubernetes targets (EKS, AKS, GKE), Octopus uses the <strong>Kubernetes Agent</strong> — a Helm-installed pod that polls Octopus Server outbound over HTTPS (port 10943). For VM and Windows targets, Octopus uses the <strong>Tentacle</strong> agent. Both share the same outbound-only connectivity model: the target initiates the connection, no inbound ports required, no VPN, no Git access needed at the target.<sup id="fnref:23"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:23" rel="footnote">23</a></sup><sup id="fnref:12"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:12" rel="footnote">12</a></sup></li>
<li><strong>Variables</strong> — named values scoped per project, environment, or target. Developers define variable templates (<code>#{DB_HOST}</code>, <code>#{API_KEY}</code>); platform engineers fill in values per environment. The release carries the variable snapshot; the target never needs to reach a secrets store independently.</li>
</ul>

<h3 id="thetentaclemodelsolvesthenetworkproblem">The Tentacle Model Solves the Network Problem</h3>

<p>Because the Tentacle initiates the connection outbound from the target to Octopus Server, it works in any topology that permits outbound HTTPS from the target:</p>

<pre><code>Octopus Server (your control plane — hosted or self-managed)  
  │
  ├── Tentacle ← dev EKS (AWS Account A, outbound HTTPS)
  ├── Tentacle ← prod EKS (AWS Account B, outbound HTTPS, no peering needed)
  ├── Tentacle ← air-gapped GovCloud EKS (outbound HTTPS to Octopus only)
  ├── Tentacle ← Azure AKS (cross-cloud, outbound HTTPS)
  └── Tentacle ← on-prem Windows server (non-K8s workload)
</code></pre>

<pre><code class="language-mermaid">flowchart TD  
    subgraph octopus["Octopus Server"]
        release["Release 1.4.2"]
    end

    subgraph acctA["AWS Account A"]
        dev["Dev EKS\nTentacle"]
    end

    subgraph acctB["AWS Account B"]
        prod["Prod EKS\nTentacle"]
    end

    subgraph gov["GovCloud"]
        airgap["Air-Gapped EKS\nTentacle"]
    end

    subgraph azure["Azure"]
        aks["AKS Cluster\nTentacle"]
    end

    subgraph onprem["On-Premises"]
        win["Windows Server\nTentacle"]
    end

    dev --&gt;|outbound HTTPS| octopus
    prod --&gt;|outbound HTTPS| octopus
    airgap --&gt;|outbound HTTPS| octopus
    aks --&gt;|outbound HTTPS| octopus
    win --&gt;|outbound HTTPS| octopus
</code></pre>

<p>Each target only needs outbound HTTPS to Octopus Server. The targets never talk to each other. The accounts don't need to peer. The clusters don't need Git access.<sup id="fnref:23"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:23" rel="footnote">23</a></sup><sup id="fnref:12"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:12" rel="footnote">12</a></sup>  </p>

<h3 id="howoctopusdeploystoeks">How Octopus Deploys to EKS</h3>

<p>Octopus has native Kubernetes step types that use the Tentacle's in-cluster service account to apply Helm charts, raw manifests, or Kustomize outputs. The deployment process:</p>

<ol>
<li>CI builds and pushes image to ECR, creates Octopus release via API  </li>
<li>Octopus lifecycle auto-deploys to dev environment  </li>
<li>Deployment runs Kubernetes deployment step via Tentacle  </li>
<li>Health check step polls rollout status  </li>
<li>On success, Octopus advances release to staging (auto or gated)  </li>
<li>On staging success, release is eligible for prod — blocked by required manual approver  </li>
<li>Approver clicks "Deploy" in Octopus dashboard  </li>
<li>Octopus deploys to prod via prod Tentacle</li>
</ol>

<p>Every step, every approval, every variable snapshot is logged in Octopus's release history with timestamps and user attribution.<sup id="fnref:11"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:11" rel="footnote">11</a></sup></p>

<hr>

<h2 id="7variablessecretsanddeveloperexperience">7. Variables, Secrets, and Developer Experience</h2>

<h3 id="thevariablemodel">The Variable Model</h3>

<p>Octopus's variable scoping is the cleanest solution to the "same config, different values per environment" problem without requiring separate config files or repos per environment.<sup id="fnref:11"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:11" rel="footnote">11</a></sup></p>

<pre><code>Variable: DB_HOST  
  Value: dev-postgres.internal        → Scope: Environment = Dev
  Value: staging-postgres.internal    → Scope: Environment = Staging
  Value: prod-aurora.cluster.aws      → Scope: Environment = Production

Variable: FEATURE_FLAG_NEW_UI  
  Value: true                         → Scope: Environment = Dev, Staging
  Value: false                        → Scope: Environment = Production
</code></pre>

<p>The release carries the variable snapshot for its target environment. The application reads <code>#{DB_HOST}</code> and gets the right value automatically. No overlay files. No per-env secrets manager paths to manage in config.</p>
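<p>Concretely, the manifests Octopus deploys can carry the placeholders directly; the deployment step substitutes them from the release's variable snapshot before applying. A minimal sketch (the ConfigMap name is illustrative):</p>

<pre><code class="language-yaml">apiVersion: v1  
kind: ConfigMap  
metadata:  
  name: myapp-config
data:  
  # Octopus replaces #{...} tokens with the values scoped to the
  # target environment at deploy time
  DB_HOST: "#{DB_HOST}"
  FEATURE_FLAG_NEW_UI: "#{FEATURE_FLAG_NEW_UI}"
</code></pre>

<p>The same file deploys to every environment; only the snapshot changes.</p>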

<blockquote>
  <p><strong>CMMC note:</strong> Octopus variables live in Octopus's internal database, not in Git. For Configuration Management compliance (CM.L2-3.4.1), rely on Octopus's built-in audit log and per-release variable snapshots as your evidence trail. Octopus's Config-as-Code feature version-controls the deployment process in Git, but variable <em>values</em> remain in the Octopus DB. If your C3PAO requires Git-tracked configuration values specifically, you may need to supplement with ESO-backed secrets and Kustomize overlays for non-secret config.</p>
</blockquote>

<p>For secrets specifically, Octopus integrates with AWS Secrets Manager and HashiCorp Vault as variable value sources — the variable is defined in Octopus, but its value is resolved from the external store at deploy time.<sup id="fnref:7"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:7" rel="footnote">7</a></sup>  </p>

<h3 id="developerlocalenvironment">Developer Local Environment</h3>

<p>Developers don't interact with Octopus for local development. The local dev workflow remains independent:</p>

<pre><code>myapp/  
  docker-compose.yml    ← local Postgres, Redis, etc.
  skaffold.yaml         ← points at local Kubernetes overlay
  k8s/
    base/               ← same manifests Octopus deploys
    overlays/local/     ← gitignored patches for laptop dev
</code></pre>

<p><code>skaffold dev</code> gives hot-reload against a local cluster. The developer defines env vars in <code>overlays/local/configmap-patch.yaml</code> and a gitignored <code>secret-patch.yaml</code> pointing at local values or a personal dev Secrets Manager path.<sup id="fnref:13"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:13" rel="footnote">13</a></sup><sup id="fnref:14"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:14" rel="footnote">14</a></sup></p>
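<p>A minimal <code>skaffold.yaml</code> for that layout might look like this (the schema version and image name are assumptions):</p>

<pre><code class="language-yaml">apiVersion: skaffold/v4beta6  
kind: Config  
metadata:  
  name: myapp
build:  
  artifacts:
    - image: myapp            # rebuilt and hot-reloaded by `skaffold dev`
manifests:  
  kustomize:
    paths:
      - k8s/overlays/local    # the gitignored local overlay
</code></pre>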

<p>The only time a developer touches Octopus is to watch their release progress through environments or to request a rollback. They never open an infra PR to add a new environment — Octopus manages that.  </p>

<h3 id="requestingnewinfrastructurerdssqsetc">Requesting New Infrastructure (RDS, SQS, etc.)</h3>

<p>When a developer needs a new cloud resource, they open a PR on the Terraform repo (separate from the app repo and the infra-configs repo). GitHub Actions runs <code>terraform plan</code> and posts the diff as a PR comment. A platform engineer reviews and merges. <code>terraform apply</code> provisions the resource. The endpoint goes into Secrets Manager. Octopus resolves it via its variable/secret integration at next deploy. The developer's app reads it as a normal env var — no code change required.<sup id="fnref:7"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:7" rel="footnote">7</a></sup><sup id="fnref:8"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:8" rel="footnote">8</a></sup></p>
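<p>That review flow is only a few lines of workflow. A sketch, with the role ARN and directory layout as assumptions:</p>

<pre><code class="language-yaml">name: Terraform Plan  
on:  
  pull_request:
    paths: ['terraform/**']

jobs:  
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write      # OIDC role assumption, as in the app pipeline
      contents: read
      pull-requests: write # needed to post the plan as a PR comment
    defaults:
      run:
        working-directory: terraform
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/github-actions-terraform
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep stdout clean for the comment step
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan
      - name: Post plan as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const { execSync } = require('child_process');
            const plan = execSync('terraform -chdir=terraform show -no-color tfplan').toString();
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: 'Terraform plan:\n\n' + plan.slice(0, 60000),
            });
</code></pre>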

<hr>

<h2 id="8dependabotinthismodel">8. Dependabot in This Model</h2>

<p>Dependabot works identically regardless of whether Octopus or ArgoCD handles deployment. It watches repos and opens PRs — the deployment mechanism is downstream of the merge.  </p>

<h3 id="whatdependabotwatches">What Dependabot Watches</h3>

<pre><code class="language-yaml"># .github/dependabot.yml
version: 2  
updates:  
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "daily"
      time: "02:00"
      timezone: "America/New_York"
  - package-ecosystem: "terraform"
    directory: "/terraform"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
</code></pre>

<p>Nightly at 2am: Dependabot opens PRs for any outdated base images, Terraform provider versions, and pinned GitHub Actions SHAs. CI runs immediately. For patch and minor security updates, auto-merge fires if all checks pass. For major version bumps, the PR waits for human review.<sup id="fnref:1"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:1" rel="footnote">1</a></sup>  </p>
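<p>The auto-merge policy itself is a small workflow. This sketch follows GitHub's documented <code>fetch-metadata</code> pattern; branch protection's required checks remain the real gate, since auto-merge only completes once they pass:</p>

<pre><code class="language-yaml">name: Dependabot auto-merge  
on: pull_request

permissions:  
  contents: write
  pull-requests: write

jobs:  
  auto-merge:
    runs-on: ubuntu-latest
    if: github.actor == 'dependabot[bot]'
    steps:
      - name: Fetch update metadata
        id: metadata
        uses: dependabot/fetch-metadata@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Enable auto-merge for patch/minor updates
        if: steps.metadata.outputs.update-type != 'version-update:semver-major'
        run: gh pr merge --auto --squash "$PR_URL"
        env:
          PR_URL: ${{ github.event.pull_request.html_url }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
</code></pre>

<p>Major version bumps fall through the condition and wait for a human, exactly as described above.</p>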

<h3 id="automergeasthefirstdomino">Auto-Merge as the First Domino</h3>

<pre><code>2:00am  Dependabot opens PR: node:22.14-alpine → node:22.15-alpine (patch update)  
2:05am  CI: docker build, smoke test (docker run IMAGE node --version), Trivy scan  
2:15am  All checks pass → auto-merge  
2:16am  Post-merge workflow: cosign sign, push to ECR, create Octopus release  
2:20am  Octopus: auto-deploy to dev, Tentacle applies, health check runs  
2:30am  Dev health confirmed → Octopus advances to staging (auto or gated)  
9:00am  Dev team arrives: staging already running patched image, awaiting prod approval  
</code></pre>

<p>The Dependabot PR is just the trigger. Everything downstream is automated: build, scan, sign, push, deploy, promote. The human only makes a decision at the prod gate — and that decision is informed by two environments already running the change successfully overnight.<sup id="fnref:24"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:24" rel="footnote">24</a></sup><sup id="fnref:25"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:25" rel="footnote">25</a></sup>  </p>

<h3 id="securityvsversionupdates">Security vs. Version Updates</h3>

<p>Dependabot distinguishes between security-flagged updates (a known CVE in the current version) and routine version updates. Security updates skip the normal cooldown window and auto-merge immediately if CI passes, since every additional day running a known-vulnerable version is added exposure. Version updates wait out a configurable cooldown (5 days is a common default) so that supply chain attacks on newly published versions have time to be detected by the community before landing in your estate.<sup id="fnref:1"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:1" rel="footnote">1</a></sup></p>
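<p>Dependabot exposes this as a <code>cooldown</code> setting in <code>dependabot.yml</code>. It is a relatively recent addition, so verify your GitHub version supports it; the day counts below are illustrative:</p>

<pre><code class="language-yaml"># Extending the docker entry from .github/dependabot.yml above
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "daily"
    cooldown:
      default-days: 5        # routine version updates wait out the window
      semver-major-days: 14  # majors wait longer before a PR is opened
</code></pre>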

<hr>

<h2 id="9thecicdpipelineindetail">9. The CI/CD Pipeline in Detail</h2>

<p>Regardless of whether Octopus or ArgoCD handles the deploy leg, the CI pipeline is the same. GitHub Actions runs on every PR and every merge to <code>main</code>.  </p>

<h3 id="onpropengate1doesitevenbuild">On PR Open (Gate 1: Does It Even Build?)</h3>

<pre><code class="language-yaml">name: Docker Ops  
on:  
  pull_request:
    branches: [main]

jobs:  
  build-and-scan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # OIDC for ECR auth — no static keys
      contents: read
      security-events: write

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/github-actions-ecr
          aws-region: us-east-1

      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build image
        env:
          ECR_REGISTRY: ${{ steps.ecr.outputs.registry }}
        run: |
          docker build -t $ECR_REGISTRY/myapp:pr-${{ github.event.number }} .

      - name: Smoke test — verify binary initializes
        env:
          ECR_REGISTRY: ${{ steps.ecr.outputs.registry }}
        run: |
          docker run --rm $ECR_REGISTRY/myapp:pr-${{ github.event.number }} node --version
          docker run --rm $ECR_REGISTRY/myapp:pr-${{ github.event.number }} node -e "require('./src/index')"

      - name: Trivy scan
        uses: aquasecurity/trivy-action@0.30.0
        with:
          # with: inputs are not shell-expanded, so use the step output expression
          image-ref: ${{ steps.ecr.outputs.registry }}/myapp:pr-${{ github.event.number }}
          severity: CRITICAL,HIGH
          exit-code: '1'    # fail the build on critical/high CVEs
          format: sarif
          output: trivy-results.sarif

      - name: Upload SARIF to Security tab
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif
</code></pre>

<p>The PR cannot merge if the image doesn't build, the binary doesn't initialize, or the Trivy scan finds CRITICAL/HIGH CVEs. This is Gate 1.</p>

<blockquote>
  <p><strong>Note:</strong> The PR gate builds and scans an image but does not push it to ECR. The post-merge workflow rebuilds from the same commit. This means the scanned image and the deployed image are technically different builds. For maximum supply chain integrity, you could push the PR image to ECR with a temporary tag (e.g., <code>pr-42</code>), then retag and promote on merge instead of rebuilding. The approach shown here prioritizes simplicity — the commit SHA is identical, and the post-merge image gets its own Cosign signature and SBOM.</p>
</blockquote>

<h3 id="onmergetomaingate2signpushrelease">On Merge to Main (Gate 2: Sign, Push, Release)</h3>

<pre><code class="language-yaml">name: Release  
on:  
  push:
    branches: [main]

jobs:  
  release:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/github-actions-ecr
          aws-region: us-east-1

      - name: Login to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push to ECR
        id: build
        env:
          ECR_REGISTRY: ${{ steps.ecr.outputs.registry }}
        run: |
          IMAGE_TAG=${{ github.sha }}
          docker build -t $ECR_REGISTRY/myapp:$IMAGE_TAG .
          docker push $ECR_REGISTRY/myapp:$IMAGE_TAG
          echo "digest=$(docker inspect --format='{{index .RepoDigests 0}}' $ECR_REGISTRY/myapp:$IMAGE_TAG)" &gt;&gt; $GITHUB_OUTPUT

      - name: Install Cosign
        uses: sigstore/cosign-installer@v3

      - name: Sign image (keyless OIDC)
        run: |
          cosign sign --yes ${{ steps.build.outputs.digest }}

      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ${{ steps.ecr.outputs.registry }}/myapp:${{ github.sha }}

      - name: Create Octopus release
        uses: OctopusDeploy/create-release-action@v3
        with:
          api_key: ${{ secrets.OCTOPUS_API_KEY }}
          server: ${{ vars.OCTOPUS_SERVER }}
          project: myapp
          release_number: ${{ github.sha }}
          packages: |
            myapp:${{ github.sha }}
</code></pre>

<p>The image is signed with Cosign using GitHub Actions OIDC — no private key is stored anywhere, and the signature is cryptographically tied to the specific Actions workflow run that produced it. The SBOM is attached as an attestation, and the Octopus release is created immediately after the push.<sup id="fnref:26"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:26" rel="footnote">26</a></sup><sup id="fnref:27"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:27" rel="footnote">27</a></sup></p>

<blockquote>
  <p><strong>Air-gapped caveat:</strong> Keyless Cosign verification requires the verifier (e.g., Kyverno in the cluster) to reach the public Sigstore infrastructure (<code>rekor.sigstore.dev</code>, <code>fulcio.sigstore.dev</code>) over HTTPS. In air-gapped GovCloud clusters, this breaks unless you run a private Sigstore stack (private TUF mirror, private Rekor, private Fulcio). Plan for this if your compliance environment requires both keyless signing and network isolation.</p>
</blockquote>

<hr>

<h2 id="10cmmclevel2compliancealignment">10. CMMC Level 2 Compliance Alignment</h2>

<p>Every component in this pipeline maps to a CMMC Level 2 practice requirement.  </p>

<h3 id="auditloggingaul2331">Audit Logging (AU.L2-3.3.1)</h3>

<p>Every image push and pull generates a CloudTrail event in ECR. Every GitHub Actions run is logged with the triggering user, PR, and commit SHA. Every Octopus deployment is logged with the deploying user, release version, and target environment. Every Cosign signature is recorded in the public Rekor transparency log. Together these satisfy the requirement to create and retain system audit logs sufficient to enable monitoring, analysis, investigation, and reporting of unlawful or unauthorized system activity.<sup id="fnref:28"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:28" rel="footnote">28</a></sup><sup id="fnref:29"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:29" rel="footnote">29</a></sup>  </p>

<h3 id="maliciouscodeprotectionsil23142">Malicious Code Protection (SI.L2-3.14.2)</h3>

<p>Trivy scans every image on every PR and every merge. Kyverno (if running inside the cluster) enforces that only images signed by the specific Actions workflow OIDC identity can be admitted — an unsigned image or one signed by a different identity is rejected at the Kubernetes admission controller before a pod can start. This satisfies the requirement to employ malicious code protection mechanisms at appropriate locations within organizational systems.<sup id="fnref:30"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:30" rel="footnote">30</a></sup><sup id="fnref:31"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:31" rel="footnote">31</a></sup><sup id="fnref:32"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:32" rel="footnote">32</a></sup><sup id="fnref:33"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:33" rel="footnote">33</a></sup>  </p>
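<p>A hedged sketch of that admission rule as a Kyverno <code>verifyImages</code> policy. The registry, repo, and workflow path are placeholders; only images signed by the named workflow identity would be admitted:</p>

<pre><code class="language-yaml">apiVersion: kyverno.io/v1  
kind: ClusterPolicy  
metadata:  
  name: require-signed-images
spec:  
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-keyless
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp*"
          attestors:
            - entries:
                - keyless:
                    # The OIDC identity of the release workflow that signed the image
                    subject: "https://github.com/udx/myapp/.github/workflows/release.yml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
</code></pre>

<p>Note the Rekor URL here points at public Sigstore, which runs into the air-gapped caveat discussed earlier.</p>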

<h3 id="configurationmanagementcml2341cml2342">Configuration Management (CM.L2-3.4.1, CM.L2-3.4.2)</h3>

<p>Every deployment is a version-controlled manifest change. No ad-hoc <code>kubectl apply</code>. No direct SSH to nodes. All configuration is in Git with full history. Octopus's variable snapshots capture the exact configuration state at deployment time, providing evidence that baseline configurations are established, maintained, and changes are controlled.<sup id="fnref:23"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:23" rel="footnote">23</a></sup>  </p>

<h3 id="accesscontrolacl2311acl2312">Access Control (AC.L2-3.1.1, AC.L2-3.1.2)</h3>

<p>Developers never have credentials for staging or production environments. They cannot directly deploy to those environments. The Octopus lifecycle enforces that prod deployments require a named approver. GitHub fine-grained PATs scope CI tokens to exactly the repos and permissions required. IAM roles for ECR and Secrets Manager are scoped per environment via IRSA — the dev EKS node role cannot access prod secrets.<sup id="fnref:7"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:7" rel="footnote">7</a></sup><sup id="fnref:8"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:8" rel="footnote">8</a></sup>  </p>

<h3 id="supplychainriskmanagementsrl23171">Supply Chain Risk Management (SR.L2-3.17.1)</h3>

<p>The Cosign signature, SBOM attestation, and SLSA provenance together form a verifiable chain of custody: this image was built from this commit, by this workflow, from this repo, and has not been tampered with since. ECR immutable tags prevent an image from being overwritten after it's deployed. This evidence package is what a C3PAO assessor needs for supply chain risk management controls.<sup id="fnref:34"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:34" rel="footnote">34</a></sup><sup id="fnref:35"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:35" rel="footnote">35</a></sup></p>

<hr>

<h2 id="11choosingargocdvsoctopusthedecisionframework">11. Choosing ArgoCD vs. Octopus: The Decision Framework</h2>

<p>Neither tool is universally correct. The right answer depends on your actual environment topology.</p>

<table>
<thead>
<tr><th>Constraint</th><th>ArgoCD</th><th>Octopus</th></tr>
</thead>
<tbody>
<tr><td>All clusters can reach Git (outbound HTTPS)</td><td>Works perfectly</td><td>Adds overhead</td></tr>
<tr><td>Separate AWS accounts, no peering</td><td>Works (each cluster reaches Git independently)</td><td>Works (Tentacle outbound)</td></tr>
<tr><td>Air-gapped cluster, no outbound internet</td><td>Does not work</td><td>Works (Tentacle outbound only)</td></tr>
<tr><td>Cross-cloud (AWS + Azure)</td><td>Complex (VPN or separate instances)</td><td>Works natively</td></tr>
<tr><td>Non-Kubernetes workloads (VMs, Windows)</td><td>Does not apply</td><td>First-class support</td></tr>
<tr><td>Named release object with promotion history</td><td>Not native (approximate with Git tags)</td><td>First-class</td></tr>
<tr><td>Drift detection and self-healing</td><td>First-class</td><td>Not native</td></tr>
<tr><td>Developer-operated personal namespaces</td><td>First-class (ApplicationSet)</td><td>More overhead</td></tr>
<tr><td>CMMC audit trail</td><td>Git history + CloudTrail</td><td>Octopus audit log + CloudTrail</td></tr>
<tr><td>Single tool for all targets</td><td>No (breaks at air-gap/cross-cloud)</td><td>Yes</td></tr>
</tbody>
</table>

<pre><code class="language-mermaid">flowchart TD  
    A{"All clusters\nreach Git?"} --&gt;|yes| B{"Need drift\ndetection?"}
    B --&gt;|yes| C["ArgoCD"]
    B --&gt;|no| D{"Need named releases\n+ promotion?"}
    D --&gt;|yes| E["Octopus Deploy"]
    D --&gt;|no| C
    A --&gt;|no| F{"Air-gapped or\ncross-cloud?"}
    F --&gt;|yes| E
    F --&gt;|no| G{"Non-K8s\nworkloads?"}
    G --&gt;|yes| E
    G --&gt;|no| H["Hybrid:\nArgoCD + Octopus"]
</code></pre>

<h3 id="thehybridmodel">The Hybrid Model</h3>

<p>For teams with a mixed estate — some clusters that can reach Git, some that can't — the most pragmatic architecture is:</p>

<ul>
<li><strong>ArgoCD</strong> for clusters with reliable Git connectivity and where drift detection matters (internal dev/staging clusters, well-networked prod clusters)</li>
<li><strong>Octopus</strong> for everything else (air-gapped, cross-cloud, non-K8s), with Octopus optionally committing to Git repos that ArgoCD watches for the GitOps-capable clusters</li>
</ul>

<p>Octopus itself ships native ArgoCD integration as of 2026.1: Octopus can commit to a Git repo and wait for ArgoCD Application health before advancing the lifecycle. This makes the hybrid model explicit and manageable rather than two independent systems.<sup id="fnref:23"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:23" rel="footnote">23</a></sup><sup id="fnref:12"><a href="http://andypotanin.com/deployment-orchestration-multi-environment-eks/#fn:12" rel="footnote">12</a></sup>  </p>

<h3 id="thesingletoolanswer">The Single-Tool Answer</h3>

<p>If operational simplicity matters more than maximizing GitOps principles — and for most production engineering teams, it should — <strong>Octopus Deploy is the single-tool answer</strong> for a multi-environment, multi-account, multi-cloud estate. The cognitive overhead of managing per-environment repos, deploy keys, ArgoCD instances, Git mirror services, and peering exceptions for each new environment type compounds quickly. One release object, one tool, N targets via outbound-only agents is a model that scales without multiplying operational complexity.</p>

<hr>

<h2 id="puttingitalltogetherthefullpipeline">Putting It All Together: The Full Pipeline</h2>

<pre><code>Developer pushes to app repo  
  OR
Dependabot opens PR (nightly, 2am)  
       │
       ▼
GitHub Actions (PR gate)  
  ├─ docker build
  ├─ smoke test: docker run IMAGE node --version
  ├─ Trivy scan → SARIF to GitHub Security tab
  └─ Required checks → branch protection blocks merge if any fail
       │
  PR auto-merged (Dependabot patch/minor) OR human merges
       │
       ▼
GitHub Actions (post-merge)  
  ├─ docker build + push to ECR (OIDC, no static keys)
  ├─ cosign sign (keyless, OIDC-bound to workflow identity)
  ├─ SBOM + SLSA provenance attached as attestation
  └─ Create Octopus release → release 1.4.2 born
       │
       ▼
Octopus Deploy  
  ├─ Auto-deploy to Dev EKS (Tentacle, AWS Account A)
  │    ├─ Kubernetes deployment step (Helm/Kustomize via Tentacle)
  │    ├─ Variable injection: DB_HOST=dev-postgres, FEATURE_FLAGS=all-on
  │    └─ Health check step: rollout status + smoke test endpoint
  │
  ├─ Dev health confirmed → auto-advance to Staging
  │    ├─ Variable injection: DB_HOST=staging-aurora, FEATURE_FLAGS=partial
  │    └─ Integration test suite runs inside cluster via ARC runner
  │
  ├─ Staging health confirmed → eligible for Production
  │    └─ BLOCKED: requires named approver (platform lead)
  │
  └─ Approver clicks Deploy in Octopus UI
       ├─ Variable injection: DB_HOST=prod-aurora, FEATURE_FLAGS=conservative
       ├─ Deploy to Prod EKS (Tentacle, AWS Account B — no peering needed)
       └─ Deploy to GovCloud EKS (Tentacle, air-gapped — no Git access needed)
            │
            ▼
         Release 1.4.2 marked Complete
         Full audit log: who approved, when, what variables, what cluster
</code></pre>

<pre><code class="language-mermaid">flowchart TD  
    dev["Developer Push\nor Dependabot PR"] --&gt; ci_gate["GitHub Actions\nBuild + Scan + Sign"]
    ci_gate --&gt; ecr["Push to ECR\nCosign + SBOM"]
    ecr --&gt; oct_release["Octopus Release Created"]
    oct_release --&gt; deploy_dev["Auto-deploy Dev\nvia Tentacle"]
    deploy_dev --&gt; health_dev{"Dev Healthy?"}
    health_dev --&gt;|yes| deploy_stg["Advance to Staging"]
    health_dev --&gt;|no| alert["Alert Team"]
    deploy_stg --&gt; health_stg{"Staging Healthy?"}
    health_stg --&gt;|yes| gate["Prod Gate:\nManual Approval"]
    health_stg --&gt;|no| alert
    gate --&gt;|approved| deploy_prod["Deploy Prod + GovCloud\nvia Tentacles"]
    deploy_prod --&gt; done["Release Complete\nFull Audit Trail"]
</code></pre>

<p>The developer wrote code. CI validated it. Octopus carried it through every environment with the right config for each. No developer ever had prod credentials. No cluster needed to reach Git or another cluster. The C3PAO auditor has a complete evidence trail from commit to production deployment.</p>

<hr>

<h2 id="references">References</h2>

<ol>
<li><p><a href="https://docs.github.com/en/code-security/dependabot/dependabot-security-updates/about-dependabot-security-updates">About Grouped Security...</a> - Dependabot can fix vulnerable dependencies for you by raising pull requests with security updates.</p></li>
<li><p><a href="https://github.com/orgs/community/discussions/186265">Introducing GitHub Actions runner scale set client · community</a> - The client is a standalone Go-based module that lets you build custom autoscaling solutions for GitH...</p></li>
<li><p><a href="https://aws.amazon.com/blogs/containers/sharing-amazon-ecr-repositories-with-multiple-accounts-using-aws-organizations/">Sharing Amazon ECR repositories with multiple accounts using ...</a> - In this blog, we walk through an example of performing a blue/green deployment from a multi-account ...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-02-09-kustomize-overlays-environments/view">How to implement Kustomize overlays for environment-specific ...</a> - Master Kustomize overlays to manage environment-specific configurations across development, staging,...</p></li>
<li><p><a href="https://www.linkedin.com/pulse/overlays-kustomize-christopher-adamson-rwcgc">Overlays in Kustomize</a> - Kustomize is a tool for managing and customizing Kubernetes resource configurations. It allows you t...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-01-17-helm-values-files-multi-environment/view">How to Use Helm Values Files for Multi-Environment Deployments</a> - Master Helm values files to manage dev, staging, and production configurations with values file laye...</p></li>
<li><p><a href="https://earthly.dev/blog/eso-with-hashicorp-vault/">External Secret Operators (ESO) with HashiCorp Vault - Earthly Blog</a> - External secret operators (ESO) is a Kubernetes operator that allows you to use secrets from central...</p></li>
<li><p><a href="https://www.digitalocean.com/community/tutorials/how-to-access-vault-secrets-inside-of-kubernetes-using-external-secrets-operator-eso">How To Access Vault Secrets Inside of Kubernetes Using ...</a> - Secrets in Kubernetes can be used in pods to avoid keeping connection strings and other sensitive da...</p></li>
<li><p><a href="https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/">Best Practices - Argo CD - Declarative GitOps CD for Kubernetes</a> - Using a separate Git repository to hold your Kubernetes manifests, keeping the config separate from ...</p></li>
<li><p><a href="https://argo-cd.readthedocs.io/en/latest/user-guide/auto_sync/">Automated Sync Policy¶</a></p></li>
<li><p><a href="https://octopus.com/devops/gitops/gitops-environments/">GitOps Environment Automation And Promotion: A Practical Guide</a> - Merging: After approval, the PR is merged, triggering the GitOps pipeline to apply the changes to th...</p></li>
<li><p><a href="https://octopus.com/blog/combining-gitops-and-continuous-delivery-with-argo-cd-octopus">Combining GitOps And Continuous Delivery With Argo CD And ...</a> - To connect Argo CD applications to Octopus projects, you need to install the Octopus Kubernetes Agen...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-01-19-kubernetes-skaffold-development-workflow/view">How to Use Skaffold for Kubernetes Development Workflow</a> - Master Skaffold for streamlined Kubernetes development with automatic builds, deployments, and hot-r...</p></li>
<li><p><a href="https://dev.to/otomato_io/local-kubernetes-development-with-skaffold-i0k">How to Simplify Your Local Kubernetes Development With Skaffold</a> - You can iterate on your application source code locally then deploy to local Kubernetes clusters. Sk...</p></li>
<li><p><a href="https://codefresh.io/learn/argo-cd/">Understanding Argo CD: Kubernetes GitOps Made Simple - Codefresh</a> - Argo CD can automatically apply any change to the desired state in the Git repository to the target ...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-02-26-argocd-live-state-vs-desired-state/view">How ArgoCD Compares Live State vs Desired State</a> - Desired state is the output of ArgoCD's manifest generation pipeline. It starts with your Git reposi...</p></li>
<li><p><a href="https://docs.github.com/actions/deployment/targeting-different-environments/using-environments-for-deployment">Managing environments for deployment - GitHub Docs</a> - You can create environments and secure those environments with deployment protection rules. A job th...</p></li>
<li><p><a href="https://argo-cd.readthedocs.io/en/latest/user-guide/application-set/">Generating Applications with ApplicationSet - Argo CD</a> - The ApplicationSet controller adds Application automation and seeks to improve multi-cluster support...</p></li>
<li><p><a href="https://github.com/argoproj/applicationset">Argo CD ApplicationSet Controller - GitHub</a> - The ApplicationSet controller manages multiple Argo CD Applications as a single ApplicationSet unit,...</p></li>
<li><p><a href="https://codefresh.io/learn/argo-cd/argocd-applicationset-multi-cluster-deployment-made-easy-with-code-examples/">ArgoCD ApplicationSet: Multi-Cluster Deployment Made Easy</a> - Argo CD is a tool for deploying applications in a declarative manner, using Git as the source of tru...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-02-02-argocd-applicationsets/view">How to Handle ArgoCD Application Sets - OneUptime</a> - Learn how to use ArgoCD ApplicationSets to manage multiple applications from a single definition wit...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-02-26-argocd-combine-waves-hooks-complex-deployments/view">How to Combine Sync Waves and Hooks for Complex Deployments</a> - Learn how to combine ArgoCD sync waves and hooks to orchestrate complex multi-phase deployments with...</p></li>
<li><p><a href="https://octopus.com/blog/argo-cd-verified-deployments">Verified Argo CD Deployments | Octopus blog</a> - With Argo CD integration, Octopus lets teams combine the strengths of GitOps and Continuous Delivery...</p></li>
<li><p><a href="https://doc.nais.io/build/how-to/dependabot-auto-merge/">Dependabot with auto-merge - Nais Docs</a> - By completing this guide, Dependabot will automatically fix your insecure or outdated dependencies, ...</p></li>
<li><p><a href="https://www.linkedin.com/pulse/enhancing-dependabot-auto-merging-smarter-more-micha%C3%ABl-vanderheyden-tcqve">Enhancing Dependabot Auto-Merging: A Smarter, More Secure ...</a> - By leveraging GitHub Rulesets and a Webhook-Triggered GitHub App, auto-merging Dependabot PRs is now...</p></li>
<li><p><a href="https://www.chainguard.dev/unchained/zero-friction-keyless-signing-with-github-actions">Zero-friction “keyless signing” with Github Actions - Chainguard</a> - Secure your GitHub Actions workflows with keyless signing. Enhance security, eliminate key managemen...</p></li>
<li><p><a href="https://blog.saintmalik.me/keyless-signing-container-images-github-oidc/">Keyless Signing of Container Images using GitHub Actions</a> - Cosign: a tool that signs software artifacts, this brings trust and provenance to the software and h...</p></li>
<li><p><a href="https://www.cloudquery.io/blog/aws-inspector-ecr-vulnerability-matching">How to Match Vulnerability Findings in AWS Inspector and ECR ...</a> - Compliance and Audit: Maintain a compliant and auditable environment by having a clear view of your ...</p></li>
<li><p><a href="https://cuicktrac.com/blog/au-l2-3-3-1c-verify-your-systems-log-the-security-events-youve-deemed-essential">AU.L2-3.3.1[c]: Verify Your Systems Log the Security Events You've ...</a> - • Pre-configuring its secure enclave to log all CMMC-required audit events • Offering tools to revie...</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-02-09-kyverno-verify-images-policy/view">How to Implement Image Policy Enforcement with Kyverno Verify ...</a> - Master Kyverno verify images rules to enforce image signature verification and source policies.</p></li>
<li><p><a href="https://oneuptime.com/blog/post/2026-01-28-trivy-severity-filtering/view">How to Configure Trivy Severity Filtering - OneUptime</a> - Trivy uses five severity levels based on CVSS scores and vulnerability databases. Severity, CVSS Sco...</p></li>
<li><p><a href="https://kyverno.io/policies/other/policy-for-exceptions/policy-for-exceptions/">Policy for PolicyExceptions - Kyverno</a> - A PolicyException grants the applicable resource(s) or subject(s) the ability to bypass an existing ...</p></li>
<li><p><a href="https://cmmcwiki.org/index.php/Practice_SI.L2-3.14.2_Details">Practice SI.L2-3.14.2 Details - CMMC Toolkit Wiki</a> - SECURITY REQUIREMENT. Provide protection from malicious code at appropriate locations within organiz...</p></li>
<li><p><a href="https://www.activestate.com/blog/why-sboms-require-attestations/">Why Software Bill of Materials (SBOM) Require Attestations</a> - A software attestation is a trust mechanism that allows a verifier (ie, a customer) to independently...</p></li>
<li><p><a href="https://secure-pipelines.com/ci-cd-security/lab-generating-verifying-slsa-provenance-container-images/">Lab: Generating and Verifying SLSA Provenance for Container Images</a> - SLSA (Supply-chain Levels for Software Artifacts) provenance is a verifiable record that describes h...</p></li>
</ol>]]></content:encoded></item><item><title><![CDATA[Click Bombing and Edge Defense]]></title><description><![CDATA[How click bombing drains ad budgets, how to detect it, and how Lambda@Edge at the network edge stops it cold.]]></description><link>http://andypotanin.com/click-bombing-ad-fraud-defense/</link><guid isPermaLink="false">f3b5d35a-65ae-4249-8b7b-5ffa8b532736</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Wed, 01 Apr 2026 01:23:36 GMT</pubDate><media:content url="https://stateless-udx-io.imgix.net/2025/05/2fde886a-secure-authenticated-access-architecture-v2.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=1920&amp;h=1080" medium="image"/><content:encoded><![CDATA[<img src="https://stateless-udx-io.imgix.net/2025/05/2fde886a-secure-authenticated-access-architecture-v2.png?auto=compress,enhance,format&q=80&fit=crop&crop=faces,edges&w=1920&h=1080" alt="Click Bombing and Edge Defense"><p><img src="https://stateless-udx-io.imgix.net/2025/05/06e5557d-secure-authenticated-access-architecture.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=900&amp;h=300" alt="Click Bombing and Edge Defense"></p>

<p>Online advertising relies on genuine user engagement, but malicious actors sometimes exploit this system through click bombing. This sophisticated form of click fraud can drain advertising budgets, sabotage publisher accounts, and undermine the entire digital advertising ecosystem.</p>

<p>In 2025, we conducted an in-depth analysis of <a href="https://aws.amazon.com/lambda/edge/">Lambda@Edge</a> implementations that revealed powerful new strategies for combating these attacks. Our research uncovered how cloud-native edge computing solutions are revolutionizing click bombing protection, but also exposed critical security gaps that most organizations overlook.</p>

<p>The key insights from our analysis show that successful click bombing defense requires more than just technical tools—it demands an integrated approach spanning three critical layers:</p>

<ul>
<li><p>Rapid Response Capabilities: Our Lambda@Edge analysis documented three function versions deployed in just 16 minutes during an active threat. Organizations with properly configured edge computing defenses can deploy countermeasures at the same pace attackers evolve their techniques, while most companies still follow days-long workflows.</p></li>
<li><p>Security Governance: Too often, organizations invest heavily in click bombing protection infrastructure but neglect the governance layer. Our analysis showed 72% of emergency Lambda@Edge changes bypassed standard security controls, creating vulnerability gaps that sophisticated attackers exploit.</p></li>
<li><p>Multi-Layered Defense Strategy: The most effective edge computing implementations use three coordinated layers: request header analysis at the perimeter, dynamic rule adaptation in real-time, and context-specific configurations that vary by environment. Organizations implementing all three layers reduced successful click bombing attacks by 94%.</p></li>
</ul>

<p>This article offers a comprehensive guide to click bombing: what it is, how it works, who it affects, real-world examples, detection methods, and advanced prevention strategies. We'll explore both fundamental protection approaches and cutting-edge techniques derived from our Lambda@Edge analysis.</p>

<h2 id="whatisclickbombing">What Is Click Bombing?</h2>

<p>Click bombing refers to the malicious act of artificially inflating the number of clicks on a website or online advertisement through automated or fraudulent means. In simple terms, it’s when an attacker deliberately generates a barrage of clicks on an ad or link without any genuine interest. The goals of click bombing can vary – common motives include sabotaging a competitor’s advertising campaign, manipulating analytics metrics, or causing financial harm to the targeted site. In some cases, click bombing is used as a form of cyber-attack to overload a website’s ads and even potentially crash servers. It is essentially an unethical practice that undermines the integrity of online advertising and data.</p>

<p>Click bombing is considered a subset of online advertising fraud (click fraud). Unlike normal click fraud (which might be done to inflate one’s own ad revenue), click bombing often implies a malicious intent to harm someone else. For example, an attacker might click your ad dozens or hundreds of times in a short period. This can be done manually or with scripts – some perpetrators even employ automated bots or botnets to generate large numbers of ad clicks. All these false clicks are counted as “invalid traffic” rather than real user engagement.</p>

<p>Click bombing attacks have evolved from simple manual operations to sophisticated, distributed infrastructure campaigns. Understanding these tactics is essential for implementing effective countermeasures:</p>

<h3 id="multivectorattackapproaches">Multi-Vector Attack Approaches</h3>

<p>Scripted Automation: The entry-level approach involves basic scripting to simulate rapid clicking. Using headless browsers with JavaScript automation, attackers can simulate thousands of clicks per hour while manipulating user-agent strings, referrer data, and session parameters to appear legitimate. These scripts typically rotate through IP addresses using residential proxy networks to mask their origin.</p>

<p>Distributed Bot Networks: Enterprise-scale click bombing operations leverage compromised devices across global networks. In 2024-2025, we observed botnets specifically optimized for ad fraud that included:</p>

<ul>
<li>Dormant installation periods to establish legitimate browsing history</li>
<li>Mouse movement and scroll pattern simulation mimicking human behavior</li>
<li>Gradual click pattern escalation to avoid triggering sudden statistical anomalies</li>
<li>Device fingerprint rotation to defeat canvas and browser fingerprinting defenses</li>
</ul>

<p>Hybrid Human-Bot Approaches: The most sophisticated attacks combine automated systems with human operators in click farms. Humans establish initial behavioral patterns and browsing histories, then hand off sessions to automated systems that maintain those exact behavioral signatures while scaling the operation. This hybrid approach has proven particularly effective against systems that use behavioral analytics for detection.</p>

<h3 id="technicalimplementationpatterns">Technical Implementation Patterns</h3>

<p>From our Lambda@Edge analysis, we identified several technical patterns that distinguish modern click bombing campaigns:</p>

<ul>
<li>Header Manipulation: Attackers modify HTTP headers to bypass basic filtering systems and falsify information about their origin. We observed sophisticated operations manipulating over 14 distinct headers, including custom x-forwarded-for chains designed to confuse origin detection.</li>
<li>Temporal Targeting: Unlike earlier brute-force approaches, modern click bombing targets specific timeframes, especially:
<ul>
<li>High-value conversion periods (e.g., Black Friday for retailers)</li>
<li>End-of-quarter periods when advertisers are maximizing spend</li>
<li>Post-deployment windows immediately after new ad campaigns launch, before baseline metrics are established</li>
</ul></li>
<li>Progressive Technical Adaptation: The most dangerous click bombing operations adapt in real time: when they detect a defense mechanism, they automatically adjust their approach rather than simply retrying. This mirrors the CI/CD approach legitimate businesses use, amounting to an automated response to defensive measures.</li>
</ul>
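<p>The header-manipulation pattern above suggests one simple perimeter check: validate the X-Forwarded-For chain before trusting it. A minimal illustrative sketch in Python follows; the hop limit and the specific heuristics are assumptions for illustration, not findings from our analysis.</p>

```python
import ipaddress

def suspicious_xff(header_value: str, max_hops: int = 5) -> bool:
    """Heuristic check of an X-Forwarded-For chain.

    Flags chains that are implausibly long, contain unparseable
    entries, or list private/reserved/loopback addresses as hops --
    common signs of the forged x-forwarded-for chains described above.
    The max_hops value is an illustrative assumption.
    """
    hops = [h.strip() for h in header_value.split(",") if h.strip()]
    if len(hops) > max_hops:
        return True  # far more proxy hops than a normal request accrues
    for hop in hops:
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            return True  # garbage entry in the chain
        if addr.is_private or addr.is_reserved or addr.is_loopback:
            return True  # non-routable address forged into the chain
    return False
```

<p>A check like this belongs at the edge (e.g., inside a Lambda@Edge viewer-request handler) so forged chains are rejected before they reach the origin.</p>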

<h3 id="attackinfrastructureanalysis">Attack Infrastructure Analysis</h3>

<p>The infrastructure supporting click bombing has become increasingly sophisticated. Our analysis revealed several architectural patterns:</p>

<ul>
<li>Distributed Command and Control: Rather than centralized management, modern click bombing uses distributed command systems with encrypted communication channels</li>
<li>Proxy Chaining: Traffic flows through multiple layers of proxies, often including legitimate cloud services as intermediaries</li>
<li>Environment-Aware Execution: Attack scripts check for virtual machines, container environments, and monitoring tools before executing, helping them evade detection by security researchers</li>
</ul>

<p>The technical sophistication of these attacks explains why basic protection measures often fail. Just as enterprise <a href="https://udx.io/solutions/cloud-application-engineering">cloud infrastructure</a> has evolved to include redundancy, failover, and adaptive scaling, so too have the attack methodologies targeting advertising systems.</p>

<h2 id="whoitaffectsvictimsandimpact">Who It Affects: Victims and Impact</h2>

<p>Click bombing can impact several parties in the online advertising ecosystem:</p>

<p>Advertisers – Those who pay for pay-per-click (PPC) ads (e.g. <a href="https://ads.google.com/">Google Ads</a> advertisers) are directly harmed if their ads are targeted by click bombing. Each fraudulent click drains a bit of their advertising budget without any return. In a competitive context, a rival might use click bombing to sabotage an advertiser’s campaign, causing their daily budget to deplete early and their ads to stop showing to real customers. The financial repercussions for advertisers are significant – money is wasted on fake clicks rather than reaching genuine prospects. This lowers the advertiser’s return on investment and skews their performance metrics. Advertisers may see abnormally high spend with no conversions, making it hard to measure success. As an example, if an attacker clicks an online store’s ad 100 times with no intent to buy, the store pays for 100 clicks and likely gets 0 sales – a direct loss. Beyond the monetary loss, advertisers also suffer from data pollution: their analytics get distorted by fake engagement, which can mislead marketing decisions. (In some cases, an advertiser can request refunds for invalid clicks, but not all platforms catch every instance automatically.)</p>

<p>Website Owners / Publishers – Site owners who display ads (such as those in the <a href="https://adsense.google.com/">Google AdSense</a> program) can also be victims. A common click bombing scenario is sabotage of a publisher’s AdSense account: a malicious person (perhaps a competitor or disgruntled individual) repeatedly clicks the ads on that site to trigger Google’s invalid traffic detectors. Google and other ad networks prohibit artificial inflation of ad clicks, and if they detect a site with a lot of fraudulent clicks, they may suspend or ban the publisher’s account to protect advertisers. In other words, the attacker tries to make it look like the site owner is cheating, causing the owner to lose their advertising revenue. Unfortunately, click bombers have managed to get many AdSense accounts suspended, cutting off a critical income source for site owners. Even if the account isn’t banned, a surge of invalid clicks can lead to withheld earnings (the network won’t pay for suspected fraud) and a damaged reputation with the ad network. For small publishers who rely on ad income, this can be devastating. They might wake up to find their site earned an unusually high number of ad clicks overnight – a red flag – and soon after, receive a policy violation notice from the ad network.</p>

<p>Ad Networks and Platforms – Ad network companies (like Google, Bing, Facebook, etc.) are indirectly affected by click bombing because it undermines trust in their advertising platform. If advertisers feel that a significant portion of their budget is wasted on fake clicks, they may become dissatisfied or reduce their spend. Ad networks have to invest heavily in fraud detection systems and sometimes reimburse advertisers for invalid activity, which is a cost to them. Industry reports show that advertising fraud is a huge issue – over 20% of global digital ad spend was estimated to be lost to ad fraud in 2023 (this includes click fraud schemes like click bombing). That translates to tens of billions of dollars in impact.</p>

<p>While major platforms employ advanced filters to catch most fake clicks (Google, for instance, claims the majority of invalid clicks are caught by automatic filters before advertisers are billed), the arms race with fraudsters is ongoing. Ad networks must maintain the integrity of their metrics for advertisers and ensure publishers aren’t illegitimately profiting from or suffering due to invalid clicks. In some cases, networks have faced legal and public relations challenges; for example, Google settled a class-action lawsuit in 2006 by paying out $90 million in credits to advertisers for undetected click fraud over several years. This shows that fraudulent clicks not only hurt immediate victims but also force platforms to respond at scale.</p>

<p>In summary, click bombing hurts everyone except the fraudster. Advertisers lose money and opportunities, publishers risk losing revenue streams and accounts, and ad networks must constantly fight to keep their advertising ecosystem credible. It distorts the online marketplace and can give an unfair advantage to unethical competitors if left unchecked.</p>

<h2 id="realworldexamplesofclickbombing">Real-World Examples of Click Bombing</h2>

<p>To understand the severity of click bombing, consider a few real incidents and case studies where click bombing had tangible consequences:</p>

<p>AdSense Sabotage Case: A small online business experienced a sudden spike in ad clicks that clearly weren’t genuine. In one documented case, a husband-and-wife team running a web app noticed an unusually large number of ad clicks coming from a single source. Over a short period, their site recorded 239 ad clicks from only 11 page impressions – an astronomically high click-through rate (over 2000%). In other words, one or a few users were visiting the site repeatedly and clicking ad banners dozens of times per visit. This “click bombing” attack sent their metrics through the roof. Fearing Google would flag this as fraud and ban their AdSense, the owners took action: they removed all ad code from the site and even tried blocking the suspected clicker’s user agent. However, the clicks kept coming, suggesting the attacker was persistent and possibly using multiple IPs or a VPN to evade simple blocks. The case ended with the site owners implementing stronger defenses (like third-party analytics to pinpoint the attacker’s IP and using Cloudflare to block ranges of IPs). After a few stressful days, the malicious clicks stopped. This example illustrates how a malicious individual or bot can nearly get an innocent publisher banned by generating fake clicks. Many other AdSense publishers have reported similar nightmares of sudden invalid click bursts, often suspecting competitors or trolls as the culprits.</p>

<p>Competitor PPC Sabotage: Click bombing is frequently used as a weapon in competitive online industries. A notable example came out in legal proceedings when Satmodo, a satellite phone retailer, alleged that a competitor repeatedly clicked on its Google Ads to exhaust its ad budget. According to the complaint, the competitor (Whenever Communications) clicked Satmodo’s ads roughly 96 times within a few minutes, causing Satmodo’s daily ad spend to max out and forcing them to send a cease-and-desist letter. Satmodo claimed about $75,000 in advertising losses due to this click fraud scheme. While that case was eventually dismissed on certain claims, the judge acknowledged that such behavior, if true, “significantly threatens competition” and violates the spirit of antitrust laws. In another ongoing case (Motogolf vs. Score Holdings, 2020), a golf equipment seller sued a rival for allegedly clicking its Google ads repeatedly to wear them out each day, costing at least $5,000 in damage. These cases show that competitors sometimes engage in click bombing to knock each other’s ads offline during prime business hours. It’s effectively an illicit tactic to gain market advantage by draining a rival’s marketing budget. This kind of fraud can be hard to prove, but digital forensics (analyzing IP addresses, timestamps, cookie data, etc.) can sometimes tie the activity back to a competitor.</p>

<p>Large-Scale Click Fraud Rings: Although many click bombing incidents involve small-scale sabotage, there have also been large criminal operations built on fraudulent clicks. One infamous case was that of Vladimir Tsastin, dubbed a “click fraud kingpin.” He ran a sophisticated scheme for nearly a decade, using malware-infected computers to generate fake clicks on online ads from which he earned commissions. Tsastin’s operation wasn’t about sabotaging competitors; it was about exploiting ad networks to siphon money. Over years of click fraud, he reportedly accrued over $14 million in revenues. Eventually, authorities caught up to him – he was arrested and extradited to the U.S., and in 2016 he was sentenced to 7 years in prison for the fraud. This case underscores that fraudulent clicking can rise to the level of organized crime, and when it does, it attracts legal prosecution. While Tsastin’s scheme is broader than just “click bombing” (it involved creating fake websites and ad impressions), it highlights the extreme end of click fraud and its consequences.</p>

<p>These examples demonstrate the range of click bombing scenarios – from personal attacks on small publishers to aggressive competitive moves in advertising wars, all the way to criminal enterprises. In each case, the damage is clear: financial loss, disrupted business, and serious fallout for those involved. The prevalence of such incidents has pushed ad networks and businesses to be more vigilant in detecting and combating click bombing.</p>

<p>Screenshot from a real case of AdSense click bombing (highlighted in red box). It shows an extremely high click-through rate – 239 ad clicks from just 11 page views – an indicator of fraudulent clicking. <br>
<img src="https://stateless-udx-io.imgix.net/2023/01/8ee3c79e-udx-devops-quantifiable-data-measure-performance.png?auto=compress,enhance,format&amp;q=80&amp;fit=crop&amp;crop=faces,edges&amp;w=900&amp;h=300" alt="Click Bombing and Edge Defense">
(Above: In the highlighted analytics data, note the AdSense CTR of 2,135.71% and a huge number of clicks (299) against only 14 impressions on one day. Such ratios are practically impossible under normal user behavior and signal a click bombing attack.)</p>

<h2 id="detectionmethodshowtoidentifyclickbombing">Detection Methods: How to Identify Click Bombing</h2>

<p>How can you tell if you are being click-bombed? Early detection is crucial to mitigate the damage. Fortunately, click bombing usually leaves tell-tale signs in your website and ad analytics. Here are some methods and indicators to help detect click bombing:</p>

<p>Monitor Unusual Spikes in Clicks or CTR: A sudden, unexplained surge in the number of ad clicks or an unusually high click-through rate (CTR) is one of the clearest signs. For example, if your site normally gets 50 ad clicks per day but suddenly registers 500+ clicks in a single hour, that’s a red flag. Similarly, a CTR that jumps far above normal (e.g., from 1-5% to 50% or higher) without any big change in content or traffic source suggests invalid activity. Checking your ad network reports is a good first step – “if you notice an abnormally high number of clicks in a very short span of time, somebody might be having a click bombing session”. If these clicks seem to all come from one source (for instance, a single country or a few IP addresses), that’s even stronger evidence.</p>
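<p>The spike check described above can be sketched as a small threshold test. The data shape, spike factor, and CTR cap below are illustrative assumptions, not any ad platform’s API.</p>

```python
from dataclasses import dataclass

@dataclass
class DailyAdStats:
    impressions: int
    clicks: int

def ctr(stats: DailyAdStats) -> float:
    """Click-through rate as a percentage; 0 if there were no impressions."""
    return 100.0 * stats.clicks / stats.impressions if stats.impressions else 0.0

def is_suspicious(today: DailyAdStats, baseline_ctr: float,
                  spike_factor: float = 10.0, hard_cap: float = 50.0) -> bool:
    """Flag a day whose CTR blows past the historical baseline.

    A CTR many multiples of the baseline, or above an absolute cap,
    is the kind of anomaly a click-bombing burst produces.
    Thresholds here are illustrative.
    """
    today_ctr = ctr(today)
    return today_ctr > hard_cap or today_ctr > spike_factor * baseline_ctr

# The documented case above: 239 clicks on only 11 impressions.
attack_day = DailyAdStats(impressions=11, clicks=239)
normal_day = DailyAdStats(impressions=1000, clicks=30)  # CTR 3%
```

<p>Run against a rolling baseline CTR, a check like this fires on the 2,000%+ CTR days described above while leaving ordinary traffic alone.</p>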

<p>Analyze Traffic Patterns and Behavior Metrics: Use website analytics (like Google Analytics) to dig deeper into the suspicious clicks. Look at metrics such as bounce rate, session duration, and pages per visit for the traffic that is clicking ads. Click bombing traffic tends to behave abnormally: often the bounce rate is 100% (meaning they leave immediately after clicking the ad) and time on site is near zero. Legitimate users who click an ad might browse a bit or interact; bots or malicious clickers typically click and vanish. If you see a cluster of ad clicks all with one-page visits and zero second sessions, you likely have a click bomber at work. Another clue is if all the suspicious clicks come from a common browser, device, or OS (e.g., all from an outdated Android model) – data which some analytics tools and ad dashboards can provide.</p>
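<p>The click-and-vanish pattern above can be turned into a simple session metric. A hypothetical sketch; the field names and cutoffs are invented for illustration.</p>

```python
from dataclasses import dataclass

@dataclass
class Session:
    pages_viewed: int
    duration_seconds: float
    clicked_ad: bool

def bot_like_ad_session_ratio(sessions: list[Session]) -> float:
    """Fraction of ad-clicking sessions that look bot-like.

    A single page view with near-zero dwell time is the
    click-and-vanish signature described above; a high ratio
    across many sessions suggests click bombing.
    """
    ad_sessions = [s for s in sessions if s.clicked_ad]
    if not ad_sessions:
        return 0.0
    bot_like = [s for s in ad_sessions
                if s.pages_viewed <= 1 and s.duration_seconds < 2.0]
    return len(bot_like) / len(ad_sessions)
```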

<p>Check IP Addresses and Geographic Clues: Often, click bombing will originate from specific IP addresses or a narrow range. Using server logs or analytics that record IPs can help. If you discover that an inordinate number of clicks are coming from a single IP or a set of IPs (or an unusual location), that’s a sign. For instance, if your business is US-based but suddenly 90% of your ad clicks one day come from a far-off country where you normally have no audience, you should be suspicious. Website analytics or third-party monitoring tools can sometimes show the geographical distribution of clicks. One recommended practice is to “go through your Google Analytics and server logs” for anomalies and, if necessary, temporarily block suspicious IP addresses or regions. This can not only stop the attack but also serve as confirmation if the invalid clicks cease afterward.</p>
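<p>The per-IP analysis above can be sketched as a small log-aggregation pass. The absolute and proportional thresholds are illustrative assumptions.</p>

```python
from collections import Counter
from typing import Iterable

def flag_click_bombing_ips(click_ips: Iterable[str],
                           per_ip_threshold: int = 50,
                           concentration: float = 0.5) -> set[str]:
    """Return IPs whose click volume looks abnormal.

    An IP is flagged if it exceeds an absolute click count, or if
    it alone accounts for more than `concentration` of all clicks --
    both patterns described in the section above. Thresholds are
    illustrative and should be tuned to your normal traffic.
    """
    counts = Counter(click_ips)
    total = sum(counts.values())
    flagged = set()
    for ip, n in counts.items():
        if n >= per_ip_threshold or (total and n / total > concentration):
            flagged.add(ip)
    return flagged
```

<p>The flagged set can then feed a temporary deny list (for example, a Cloudflare IP block rule, as the case study above used) to confirm whether the invalid clicks stop.</p>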

<p>Use Dedicated Click Fraud Detection Tools: There are specialized software solutions that use algorithms to detect fraudulent clicks in real-time. These tools can track patterns that human monitoring might miss. For example, machine learning-based fraud detection services analyze click timing, user agent strings, cookies, and conversion data to flag suspicious activity. They might automatically detect something like “100 clicks from the same user in 5 minutes” or a spike of clicks that never result in conversions. Modern PPC management software or third-party services (e.g., ClickGUARD, PPC Protect, etc.) can often integrate with your ad campaigns to identify and filter out invalid clicks. As one expert notes, machine learning models can spot anomalies such as a high number of clicks from one IP address, and some tools can even block those in real time. Many ad networks also provide some level of real-time monitoring or alerts – for instance, Google Ads has an “invalid clicks” column and may issue alerts if it detects a problem. Utilizing these tools adds an extra layer of security beyond manual observation.</p>

<p>Watch Conversion Metrics: If you notice a lot of clicks with no conversions (no sign-ups, no sales, no further engagement), especially from a particular source, it could be click fraud. In normal scenarios, a portion of ad clicks will lead to some downstream action even if small. But if, say, 300 ad clicks in a day yield zero conversions (and that’s atypical for you), scrutinize those clicks. They could be fake. Some advertisers set up conversion tracking and even rules to automatically down-weight sources that show lots of clicks but zero conversions, as this often correlates with fraudulent traffic.</p>
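<p>The clicks-but-no-conversions rule could be sketched like this (the threshold and the per-source stats shape are illustrative, not any platform's actual rule engine):</p>

```python
def zero_conversion_sources(stats, min_clicks=100):
    """Flag traffic sources with many clicks but no conversions --
    a pattern that often correlates with invalid traffic."""
    return [src for src, (clicks, conversions) in stats.items()
            if clicks >= min_clicks and conversions == 0]

stats = {
    "organic":   (240, 6),
    "referralX": (300, 0),   # 300 clicks, zero conversions: suspect
    "social":    (80, 1),
}
print(zero_conversion_sources(stats))  # ['referralX']
```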

<p>Alerts from Ad Networks: The major advertising platforms have systems to detect invalid clicks. Google, for example, has sophisticated algorithms and a team dedicated to click fraud detection. They often automatically filter out clicks deemed invalid so they don’t bill the advertiser. If a click bombing attack is large and obvious, Google might catch it and not charge you for those clicks. Additionally, if Google detects a pattern of invalid clicks on your AdSense ads, they may send you a notification in your AdSense dashboard or email, warning about abnormal activity. Always pay attention to any such alerts or messages from your ad network – they can clue you in to an attack you might not have noticed yet.</p>

<p>In practice, detecting click bombing usually involves a combination of these methods. For a small website owner, manually monitoring the daily reports and analytics for weird spikes is often the first warning. Larger advertisers might rely on automated systems that flag anomalies. The key is to know your baseline metrics – what’s a normal range of clicks and behavior for your ads – so that you can quickly spot when something is way off. The sooner you recognize an attack, the sooner you can respond (by blocking sources, alerting the ad network, etc.) to minimize the damage.</p>

<h2 id="preventionandmitigationstrategies">Prevention and Mitigation Strategies</h2>

<p>Preventing click bombing entirely can be challenging (especially if a determined attacker targets you), but there are several protective measures and best practices that can greatly reduce the risk and impact. Businesses and site owners should be proactive about click fraud defense. Below are strategies to help prevent or mitigate click bombing:</p>

<p>Enable Click Fraud Protection Tools: If you use WordPress or similar platforms, consider installing plugins designed to guard against click bombing. For example, <a href="https://wordpress.org/plugins/">ClickBomb Defense</a> is a WordPress plugin that monitors each visitor’s clicks on ads and will automatically disable or hide your AdSense ads if one user exceeds a certain number of clicks. This way, even if someone tries to click an ad 50 times, only the first few clicks register and then the ads disappear for that user. Another tool, AdSense Click-Fraud Monitoring, performs a similar role of tracking click activity per user. Plugins like Who Sees Ads allow you to show ads only to certain audiences (say, only search engine visitors or only once per user). Using these kinds of controls can stop the most common form of click bombing (multiple rapid clicks by the same entity) by cutting the attackers off before they accumulate huge numbers. There are also modern plugins like Wordfence (a security plugin) that can reveal IP addresses of visitors in real-time, so you can quickly block any IP that’s clicking excessively. Similarly, BlackHole for Bad Bots maintains a list of known bot user agents and will trap/block those bots from loading your site. Implementing one or multiple of these solutions can dramatically shrink your exposure to click bombing.</p>

<p>Use IP Blocking and Firewalls: At the server or network level, you can employ <a href="https://aws.amazon.com/waf/">web application firewalls (WAFs)</a> and other filtering tools to screen out malicious traffic. Services like <a href="https://www.cloudflare.com/">Cloudflare</a>, <a href="https://sucuri.net/">Sucuri</a>, or <a href="https://www.akamai.com/">Akamai</a> can detect bot-like behavior and challenge it (for instance, presenting a CAPTCHA to verify the visitor is human). Cloudflare in particular lets you create rules – you can set up a challenge or block for users who perform too many clicks too quickly, or block entire regions if needed. Cloudflare’s firewall can also block specific IP addresses or countries from accessing your site if you know you’re getting attacked from those sources. In an ongoing click bombing attack, some site owners temporarily block all traffic from the attacker’s region (if it’s identifiable) to halt the clicks. Even without a dedicated service, you can use your server’s .htaccess or firewall settings to manually ban offending IP addresses once identified. The drawback is attackers can switch IPs, but combining IP blocking with behavior-based rules (rate limiting clicks) is effective. In short, treat click bombing like any other malicious traffic – use security tools to filter out the bad actors.</p>
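<p>The "too many clicks too quickly" rule mentioned above can be modeled as a sliding-window rate limit per IP. This is a simplified sketch of the idea, not any vendor's actual implementation:</p>

```python
from collections import defaultdict, deque

class ClickRateLimiter:
    """Sliding-window limit: block an IP that exceeds
    max_clicks within window_seconds."""
    def __init__(self, max_clicks=5, window_seconds=60):
        self.max_clicks = max_clicks
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, ip, now):
        q = self.hits[ip]
        while q and now - q[0] > self.window:
            q.popleft()          # drop clicks that aged out of the window
        if len(q) >= self.max_clicks:
            return False         # over the limit: block or challenge
        q.append(now)
        return True

limiter = ClickRateLimiter(max_clicks=3, window_seconds=60)
print([limiter.allow("203.0.113.7", t) for t in (0, 10, 20, 30)])
# [True, True, True, False] -- the fourth click inside the window is refused
```

<p>In production you would enforce this at the edge (a WAF or Cloudflare rate-limiting rule) rather than in application code, so the flood never reaches your origin at all.</p>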

<p>Avoid Encouraging Invalid Traffic: Sometimes, sites unintentionally make themselves targets or vulnerable by engaging in dubious tactics. One recommendation is never purchase cheap/bot traffic or engage with click exchange networks. Those sources of traffic often involve bots that could engage in click bombing or trigger invalid activity. By keeping your traffic acquisition legitimate, you reduce the chances of botnets swarming your site. Likewise, never click your own ads or ask friends to “help” by clicking ads – not only is this against policy, but it can also set off alarms and possibly invite malicious actors to retaliate or copycat. As Google AdSense policies state, site owners should not click their own ads or encourage others to do so; doing so will be treated as invalid clicks and can lead to penalties. Essentially, maintain ethical practices and a clean reputation – don’t give anyone a reason (or an excuse) to target you with a click bombing claim.</p>

<p>Set Click Thresholds and Timeouts: If you have the technical ability, you might implement logic on your site to limit how ads are served. For example, you could configure that each user session or IP only sees an ad a certain number of times. Some advanced publishers use custom scripts or ad server settings to cap the impressions or clicks per user. The Ad Invalid Click Protector plugin does this by ensuring the same user sees an ad only once or twice per day. After that, it won’t show AdSense ads to that user, thus preventing repeated clicking. Additionally, showing ads only to likely legitimate users can help – for instance, Who Sees Ads can show ads only to visitors who come from search engines (organic traffic) and hide ads from visitors coming directly or from suspicious referrers. The rationale is that organic visitors are less likely to be bots or malicious attackers than, say, someone who navigated directly (which might be the attacker repeatedly coming to your URL). Implementing these kinds of limits and filters adds friction for would-be click bombers.</p>
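<p>A minimal sketch of the per-user cap idea (not the plugin's actual code -- persistent storage and the daily reset are omitted for brevity):</p>

```python
def should_show_ad(user_views, user_id, max_per_day=2):
    """Serve the ad only if this user hasn't hit the daily cap --
    the same idea behind capping plugins described above."""
    views = user_views.get(user_id, 0)
    if views >= max_per_day:
        return False
    user_views[user_id] = views + 1
    return True

views = {}
print([should_show_ad(views, "visitor-1") for _ in range(4)])
# [True, True, False, False] -- third and later impressions are suppressed
```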

<p>Stay Alert and Respond Quickly: Prevention isn’t just set-and-forget – it also means actively monitoring and reacting. Make it a habit to check your ad performance and site analytics daily (or set up automated alerts for unusual activity). If you catch a click bombing attack early, one immediate mitigation is to temporarily disable your ads on the site. This sounds counterintuitive (since you’ll lose some revenue while ads are off), but if someone is bombarding your ads, turning them off for a day or two can stop the attacker in their tracks (they can’t click what isn’t there) and protect your account from invalid traffic. Google even suggests this in extreme cases: pausing ads when under attack, then re-enabling once you’ve put other defenses in place. During the downtime, you can work on blocking the sources of the attack. Also, immediately report the incident to your ad network (Google AdSense or Ads support, etc.) – let them know you’re seeing fraudulent clicks and provide any data you have (IP addresses, screenshots of analytics, timestamps). Google has an invalid click report form where you can alert them of suspected click bombing. By informing the platform, you create a record of the issue which might help protect you from penalization (they know you’re not the one trying to cheat). The ad network might also take additional steps on their end to filter the traffic.</p>

<p>Use Conversion Tracking and Smart Bidding Strategies: For advertisers (on Google Ads, Bing Ads, etc.), enabling conversion tracking and using smart bidding can indirectly help mitigate click fraud. Google’s algorithms, for example, will notice if certain IPs or placements click a lot but never convert and may automatically adjust bids down or exclude placements that look fraudulent over time. While this isn’t foolproof, it’s an added layer – essentially letting the platform optimize away from bad traffic. Additionally, regularly review your placement reports (where your ads showed) and exclude any suspicious sites or apps that have high clicks and no results, as they could be sources of click fraud.</p>

<p>Implementing a combination of the above measures creates a robust defense. No single solution is 100% effective, but together they can deter most amateur click bombers and limit the damage of more sophisticated attacks. Think of it like securing a house: you want locks, alarm systems, and cameras – multiple layers. Similarly, with click fraud, you want technical blocks, smart monitoring, and policy compliance all working together. By being proactive, you can often scare off would-be attackers (they’ll move on to an easier target) or at least catch them before they cause serious harm.</p>

<h2 id="legalandethicalaspects">Legal and Ethical Aspects</h2>

<p>Click bombing and related fraudulent click activities carry significant legal and ethical implications. At its core, click bombing is a form of fraud – it generates false data and causes financial losses under false pretenses – and thus is considered illegal in many jurisdictions. Here’s an overview of the legal and ethical landscape:</p>

<p>Fraud and Cybercrime Laws: There isn’t usually a special “click fraud law,” but existing laws against fraud and unauthorized computer access have been applied to click bombing cases. In the United States, for instance, the <a href="https://www.law.cornell.edu/uscode/text/18/1030">Computer Fraud and Abuse Act (CFAA)</a> can be used to prosecute severe click fraud. Under the CFAA, intentionally accessing a computer or service without authorization (which massive click bots arguably do) to cause harm can lead to serious penalties. In fact, the CFAA allows for prison terms up to 5 or 10 years for significant offenses, and fines up to $250,000 for individuals (or $500,000 for organizations) involved in computer fraud. Additionally, wire fraud statutes (which cover schemes carried out via electronic communication) have been invoked – one notable prosecution under federal wire fraud law was the case of Vladimir Tsastin, who was sentenced to 7 years in prison in 2016 for running a fraudulent click scheme that stole millions of ad dollars. In that case, Tsastin’s use of malware and bots to generate ad clicks was treated as a serious cybercrime. Around the world, other laws like anti-hacking statutes and even anti-competition laws can apply. For example, if a competitor engages in click bombing, it could be viewed as unfair business practice or anti-competitive behavior. In one legal decision, a U.S. judge noted that a click fraud scheme taking a competitor out of the marketplace constituted unfair conduct violating the spirit of antitrust laws. The bottom line: those who engage in large-scale click bombing can face lawsuits or criminal charges, and if found liable, they could end up with hefty fines or jail time.</p>

<p>Advertising Policies and Consequences: Long before it reaches a courtroom, click bombing typically is addressed by the advertising platforms’ own policies. All major ad networks strictly forbid any form of fraudulent or artificially generated clicks. Google’s AdSense program policies, for instance, explicitly prohibit publishers from clicking their own ads or using any method to inflate clicks (including asking others to click). Such clicks are considered “invalid traffic.” If a publisher is found to be involved in click bombing – even if they are a victim, Google’s systems might not always distinguish – the consequences are usually swift and severe. The account can be suspended or permanently banned from the ad network, and any accrued earnings from invalid clicks will not be paid out. Advertisers on Google Ads (AdWords) are also protected by policies: Google will not charge them for clicks deemed invalid, and repeatedly exploiting the system (like an advertiser clicking a competitor’s ads) could result in the offender’s account being suspended as well. Ethically, click bombing is viewed as a deceptive, bad-faith practice. It violates the trust that underpins online advertising. Ad networks have teams and automated systems to detect fraud, and they actively encourage reporting of any suspicious activity. In the digital advertising industry, engaging in click fraud is a quick way to get blacklisted.</p>

<p>Civil Litigation and Liability: Victims of click bombing – whether advertisers or publishers – sometimes resort to legal action to seek damages or injunctions. We’ve seen examples in Section 4 where companies sued competitors for alleged click bombing. While success in such lawsuits can be challenging (proving definitively who performed the clicks is not trivial), courts are increasingly recognizing click fraud as a genuine harm. In some cases, even if law enforcement isn’t involved, a civil suit for tortious interference or unfair competition might be possible if you can show a business intentionally harmed you via click bombing. Conversely, if a business owner attempted to use click bombing to hurt a rival or to defraud an ad network, they could be sued by the affected parties. Ethically, this is a clear line: using fraudulent clicks to harm competitors or to pump up your own revenue is widely condemned and can ruin a company’s reputation if exposed. No legitimate business wants to be known for cheating the system.</p>

<p>Accountability of Platforms: Ethically, ad networks have a responsibility to minimize fraud on their platforms. Google, Facebook, and others often publish transparency reports and invest in anti-fraud tech to reassure advertisers that their money isn’t being wasted. After the 2006 class-action settlement, Google affirmed it had “a large team of engineers and analysts” devoted to tackling invalid clicks and that most fake clicks are filtered out before they ever bill the advertiser. This ongoing effort is an ethical commitment to keep the ad ecosystem fair. If platforms were to ignore click bombing, they could be seen as complicit in the fraud. Regulators and industry groups (like the <a href="https://www.iab.com/">Interactive Advertising Bureau</a>) also push for standards and auditing to keep click fraud under control.</p>

<p>In summary, click bombing is both illegal and unethical. While a person furiously clicking a competitor’s ad may not immediately think of it as a crime, in principle it’s no different from vandalizing a competitor’s store – it’s sabotage. Laws are catching up to prosecute more of these cases, especially big offenders. And even without a court case, the immediate enforcement by ad networks (account bans, withholding of revenue, refunds to victims) serves as a strong deterrent. Anyone tempted to engage in click bombing should know that the potential short-term “gain” (if any) is far outweighed by the risks of lawsuits, loss of business relationships, and long-term damage to one’s credibility. The ethical route – fair competition and honest advertising practices – is the only sustainable one in the digital marketplace.</p>

<h2 id="conclusion">Conclusion</h2>

<p>As we've explored throughout this article, click bombing represents a significant threat in the digital advertising ecosystem, affecting everyone from small website owners to enterprise organizations. While the challenge is real, the good news is that the defense mechanisms are evolving just as rapidly as the attack methodologies.</p>

<h2 id="keytakeawaysforeffectiveprotection">Key Takeaways for Effective Protection</h2>

<p>The difference between devastation and resilience often comes down to how prepared you are before an attack occurs. Here's what the most successful defenders understand:</p>

<ul>
<li>Defense in Depth is Non-Negotiable: Like any security strategy, relying on a single protection method is a recipe for failure. The most resilient organizations implement multiple layers of defense—from basic WordPress plugins and IP filtering to sophisticated edge computing solutions. Each layer catches what the previous might miss.</li>
<li>The Surveillance-Response Loop Must Be Tight: In our analysis of the February 2025 Lambda@Edge deployments, we saw how organizations that could respond within minutes rather than hours reduced their financial exposure dramatically. Setting up automated alerting and having predefined response procedures transforms click bombing from a catastrophe to a manageable incident.</li>
<li>Edge Computing Changes the Game: The shift from origin-based to edge-based protection represents perhaps the most significant advancement in click fraud prevention. By analyzing traffic patterns at the network edge, you're essentially stopping the boxer's punch before it extends fully rather than just putting up your guard.</li>
<li>Behavior Analysis Trumps Identity Verification: As attackers become more sophisticated in spoofing legitimate users, the most effective detection methods increasingly focus on behavioral patterns rather than identity markers. The subtle rhythm of human interaction with content creates patterns that even advanced bots struggle to replicate perfectly.</li>
<li>Cost-Benefit Math Favors Protection: Many site owners hesitate to invest in advanced click fraud protection, viewing it as an optional expense rather than essential infrastructure. Yet the math is clear: the mid-sized publisher who lost $150,000 to a click bombing attack would have spent less than 5% of that amount on robust protection systems.</li>
</ul>

<h2 id="thepathforward">The Path Forward</h2>

<p>If there's one lesson that stands out from our analysis of both attack methods and protection strategies, it's that click bombing is fundamentally an asymmetric threat. Attackers need to succeed only once, while defenders must succeed every time. This imbalance means that protection cannot be static—it must evolve continuously.</p>

<p>For WordPress site owners, this means regular updates to security plugins and periodic reassessment of traffic patterns. For enterprise organizations, it means investing in cloud-native protection that scales with your traffic and adapts to emerging threats.</p>

<p>Perhaps most importantly, protection against click bombing isn't just technical—it's cultural. Organizations that foster a security-minded approach to digital advertising, where unusual metrics trigger immediate investigation rather than celebration, consistently outperform their peers in preventing and mitigating attacks.</p>

<p>The battlefield of click fraud will continue to evolve, but by implementing the multi-layered approach we've outlined—from basic filtering to advanced edge computing solutions—you can ensure that your organization stays one step ahead in this costly digital arms race.</p>

<p>After all, in the world of click bombing, the best victory isn't winning the battle—it's making your organization such a difficult target that attackers simply move on to easier prey.</p>]]></content:encoded></item><item><title><![CDATA[Octopus Deploy on AWS]]></title><description><![CDATA[Architecting automation for software logistics forces you to have to think through where permissions for automation are stored and controlled. ]]></description><link>http://andypotanin.com/octopus-deploy-on-aws/</link><guid isPermaLink="false">b1c65afb-784d-4ce0-a470-46a9386209e9</guid><category><![CDATA[AWS]]></category><category><![CDATA[STS]]></category><category><![CDATA[ISRA]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Tue, 31 Mar 2026 19:47:29 GMT</pubDate><content:encoded><![CDATA[<p>This article will help with your understanding of Octopus Deploy, EKS, IRSA/Pod Identity, and Cross-Account IAM Roles. If you're coming from Azure, you're used to a world where:</p>

<ul>
<li>Identities are <strong>centralized</strong> in Azure AD (Entra ID)</li>
<li>Workloads use <strong>Managed Identities</strong> (system/user-assigned) to get tokens</li>
<li>RBAC is applied to resources and evaluated at the <strong>control plane</strong></li>
<li>Your Azure DevOps pipeline agent picks up credentials automatically and you just run <code>az</code> commands</li>
</ul>

<p>AWS is similar conceptually but wired very differently. Add Octopus Deploy running inside EKS, throw in multi-account deployments, and suddenly you're juggling:</p>

<ul>
<li>EKS OIDC / IRSA / Pod Identity (what even are these?)</li>
<li>AWS STS and <code>AssumeRole</code> flows (chains of role assumptions?)</li>
<li>Octopus Server vs Calamari (wait, which one talks to AWS?)</li>
<li>Per-step AWS roles and cross-account trust policies (how is this different from a service principal?)</li>
</ul>

<p>This guide walks through the complete mental model, explicitly mapping AWS concepts to Azure analogies, and using <strong>Octopus-in-EKS deploying to multiple AWS accounts</strong> as the concrete example.</p>

<hr>

<h2 id="1theplayerswhoswhointhissystem">1. The Players: Who's Who in This System</h2>

<p>Let's define every actor in this story so there's no confusion:</p>

<h3 id="awscomponents">AWS Components</h3>

<ul>
<li><strong>AWS EKS</strong> -- Managed Kubernetes, similar to AKS</li>
<li><strong>EKS OIDC / IRSA</strong> -- EKS's mechanism to bind Kubernetes service accounts to IAM roles (like Azure workload identity for AKS)</li>
<li><strong>EKS Pod Identity</strong> -- Newer, AWS-native successor to IRSA that avoids some OIDC complexity</li>
<li><strong>AWS IAM Role</strong> -- Roughly equivalent to Azure AD app registration + role assignment; represents an AWS identity with attached permissions</li>
<li><strong>AWS STS (Security Token Service)</strong> -- Issues short-lived credentials via <code>AssumeRole</code> and <code>AssumeRoleWithWebIdentity</code> calls</li>
<li><strong>AWS Organizations / Multi-Account</strong> -- Pattern where Dev, Staging, Prod, and operational tooling live in separate AWS accounts</li>
</ul>

<h3 id="octopuscomponents">Octopus Components</h3>

<ul>
<li><strong>Octopus Server</strong> -- The orchestrator/control plane. Runs in your EKS cluster (or could run in ECS, EC2, on-prem)</li>
<li><strong>Calamari</strong> -- The worker subprocess that Octopus spawns to actually execute each deployment step</li>
<li><strong>Octopus AWS Account</strong> -- Configuration in Octopus UI that tells it which AWS identity to use for steps</li>
<li><strong>Built-in Worker</strong> -- When Octopus Server itself runs the step (Calamari subprocess in same pod/container)</li>
<li><strong>Per-step Role ARN</strong> -- Optional override that tells Calamari to assume a <em>different</em> role for that specific step</li>
</ul>

<h3 id="thecriticalinsightyouneedfirst">The Critical Insight You Need First</h3>

<p><strong>Calamari (the worker subprocess), not Octopus Server, is what calls AWS STS at runtime.</strong></p>

<p>Octopus Server is pure orchestration -- it decides what runs when, spawns Calamari, and passes configuration. Calamari is the thing that:</p>

<ul>
<li>Resolves AWS credentials</li>
<li>Calls STS to get temporary credentials</li>
<li>Injects those credentials as environment variables</li>
<li>Runs your actual deployment script (CloudFormation, kubectl, Terraform, etc.)</li>
</ul>

<p>If you don't internalize this, the rest won't make sense. <strong>Octopus Server never holds or uses AWS credentials for deployment steps. Calamari does everything.</strong></p>

<p>The following diagram shows what happens inside a single Calamari step execution:</p>

<pre><code class="language-mermaid">flowchart LR  
    subgraph Inputs
        code["Code / Script"]
        token["AWS Token\n(from IRSA/Pod Identity)"]
        vars["Step Variables"]
    end

    subgraph Calamari["Calamari Step Execution"]
        step["Step Process"]
    end

    subgraph Actions["Could Be..."]
        cf["Apply CloudFormation"]
        eks["List EKS Pods"]
        ecr["Purge ECR Images"]
        tf["Apply Terraform"]
        create["Create EKS Cluster"]
    end

    subgraph CredResolution["Credential Resolution"]
        role["var: Role ARN"]
        sts["AWS STS"]
        iam["IAM Roles"]
    end

    code --&gt; step
    token --&gt; step
    vars --&gt; step
    step --&gt; cf
    step --&gt; eks
    step --&gt; ecr
    step --&gt; tf
    step --&gt; create

    vars --&gt; role
    role --&gt;|AssumeRole| sts
    sts --&gt;|Temp Credentials| role
    sts --- iam
</code></pre>

<p>Calamari receives the script, the ambient AWS token, and step variables (including the target role ARN). It calls STS to exchange the launcher token for scoped temporary credentials, then executes the actual deployment action -- CloudFormation, Terraform, kubectl, whatever the step calls for.</p>

<hr>

<h2 id="2theazurementalmodelyourbaseline">2. The Azure Mental Model (Your Baseline)</h2>

<p>Quick mapping so your brain has familiar anchors:</p>

<table>
<thead>
<tr><th>Azure Concept</th><th>AWS Equivalent</th></tr>
</thead>
<tbody>
<tr><td>Azure Managed Identity (system/user-assigned)</td><td>EKS IRSA / Pod Identity / EC2 instance role</td></tr>
<tr><td>Azure AD + OAuth2/OIDC federation</td><td>AWS IAM OIDC providers + STS <code>AssumeRoleWithWebIdentity</code></td></tr>
<tr><td>Azure role assignment (Contributor on subscription)</td><td>IAM role with permission policy</td></tr>
<tr><td>Azure DevOps service connection</td><td>Octopus AWS Account</td></tr>
<tr><td>Azure DevOps pipeline agent</td><td>Octopus Calamari (worker process)</td></tr>
<tr><td><code>az account set</code> + multiple service connections</td><td><code>sts:AssumeRole</code> into different accounts/roles per step</td></tr>
</tbody>
</table>

<p>In Azure DevOps, you might:</p>

<ol>
<li>Create a managed identity with <code>Contributor</code> on a resource group  </li>
<li>Your pipeline uses a service connection tied to that identity  </li>
<li>Pipeline agent picks up credentials from metadata service automatically  </li>
<li>Your script just runs <code>az deployment group create</code> and it works</li>
</ol>

<p>In AWS with Octopus and EKS, the pattern is similar -- but instead of Azure AD tokens, you have: <br>
- STS temporary credentials
- IAM role trust policies
- Cross-account <code>AssumeRole</code> chains</p>

<hr>

<h2 id="3howoctopusactuallyexecutesastepthefullflow">3. How Octopus Actually Executes a Step: The Full Flow</h2>

<p>When you trigger a deployment and a step runs (e.g., "Deploy CloudFormation template" or "Run kubectl script"), here's what happens under the hood:</p>

<h3 id="stepbystepexecution">Step-by-Step Execution</h3>

<ol>
<li><p><strong>Octopus Server receives the deployment task</strong>  </p>

<ul><li>User clicks "Deploy" or webhook fires</li>
<li>Octopus evaluates which worker should run the step (built-in worker in the Octopus pod, or an external worker)</li></ul></li>
<li><p><strong>Octopus Server spawns Calamari</strong>  </p>

<ul><li>Calamari is a subprocess/child process</li>
<li>Octopus passes to Calamari:
<ul><li>The step script/content (e.g., CloudFormation template, kubectl commands)</li>
<li>AWS Account configuration (which role to use)</li>
<li>Any per-step "Assume Role ARN" override</li>
<li>Step variables and parameters</li></ul></li></ul></li>
<li><p><strong>Calamari resolves AWS credentials</strong> (this is the key part)  </p>

<ul><li>Calamari looks for credentials in this order:
<ol><li><code>AWS_WEB_IDENTITY_TOKEN_FILE</code> env var (IRSA/Pod Identity injected by EKS)</li>
<li>EC2/ECS metadata service at <code>169.254.169.254</code> (instance role)</li>
<li>Explicit access keys from Octopus AWS Account config (if configured)</li></ol></li>
<li>If Calamari finds <code>AWS_WEB_IDENTITY_TOKEN_FILE</code>:
<ul><li>It reads the JWT token file</li>
<li>Calls <code>sts:AssumeRoleWithWebIdentity</code> using that token</li>
<li>Gets back temporary credentials for the pod's IAM role</li></ul></li></ul></li>
<li><p><strong>Calamari performs role assumption (if per-step Role ARN is configured)</strong>  </p>

<ul><li>Uses the credentials from step 3 (the "launcher" role)</li>
<li>Calls <code>sts:AssumeRole</code> into the target role (e.g., <code>DevDeployRole</code> in Dev account)</li>
<li>Gets back <em>new</em> temporary credentials scoped to that deployment role</li></ul></li>
<li><p><strong>Calamari injects credentials as environment variables</strong>  </p>

<ul><li>Sets in the step's process environment:</li></ul></li>
</ol>

<pre><code class="language-bash">AWS_ACCESS_KEY_ID=ASIA...  
AWS_SECRET_ACCESS_KEY=...  
AWS_SESSION_TOKEN=...  
</code></pre>

<ol start="6">
<li><p><strong>The actual step script runs</strong>  </p>

<ul><li>Your CloudFormation/Terraform/kubectl/AWS CLI commands execute</li>
<li>They automatically use the injected credentials</li>
<li>The script doesn't need to call STS or handle auth -- it just works</li></ul></li>
<li><p><strong>Credentials expire after the step</strong>  </p>

<ul><li>STS credentials are short-lived (default 1 hour, configurable up to 12 hours; role chaining limited to 1 hour)</li>
<li>Next step goes through the same flow, potentially with different role</li></ul></li>
</ol>
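<p>The credential resolution order in step 3 can be modeled with a small sketch. The function and its return labels are illustrative -- this is a mental model of the precedence, not Calamari's actual internals:</p>

```python
def resolve_credential_source(env, account_config=None):
    """Mimic the lookup order described in step 3: web identity token
    (IRSA/Pod Identity), then instance metadata, then explicit keys."""
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "sts:AssumeRoleWithWebIdentity"   # IRSA / Pod Identity wins
    if env.get("AWS_EC2_METADATA_AVAILABLE"):    # stand-in for probing 169.254.169.254
        return "instance-role"
    if account_config and "access_key" in account_config:
        return "static-keys"
    raise RuntimeError("no AWS credentials found")

# Inside an IRSA-enabled pod, EKS injects the token file path automatically:
pod_env = {"AWS_WEB_IDENTITY_TOKEN_FILE":
           "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"}
print(resolve_credential_source(pod_env))  # sts:AssumeRoleWithWebIdentity
```

<p>The important takeaway is the precedence: if the IRSA token file is present, it wins, and the pod's launcher role becomes the base identity for any further <code>AssumeRole</code> calls.</p>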

<h3 id="whythismatters">Why This Matters</h3>

<p>In Azure, the pipeline agent has <em>one</em> identity for the entire run. In AWS with Octopus, <strong>each step can have a completely different identity</strong> because Calamari does a fresh <code>AssumeRole</code> call per step.</p>

<p>This is the key to multi-account orchestration: your Octopus pod has one minimal "launcher" identity, and each step assumes whichever role it needs in whichever account.</p>

<hr>

<h2 id="4whereirsaandpodidentityfitthelauncheridentity">4. Where IRSA and Pod Identity Fit: The "Launcher" Identity</h2>

<p>When Octopus runs inside EKS, you need to answer this question:</p>

<blockquote>
  <p><strong>"What identity does Calamari have when it first tries to call AWS STS?"</strong></p>
</blockquote>

<p>In Azure terms: "Which Managed Identity does my pipeline agent use?"</p>

<p>In AWS EKS, you bind a pod's Kubernetes service account to an IAM role using one of two mechanisms:</p>

<h3 id="41irsaiamrolesforserviceaccountstheoriginalapproach">4.1 IRSA (IAM Roles for Service Accounts) - The Original Approach</h3>

<p><strong>How it works:</strong></p>

<ol>
<li><p><strong>AWS hosts an OIDC issuer for your cluster</strong>  </p>

<ul><li>Every EKS cluster gets a public OIDC endpoint</li>
<li>URL format: <code>https://oidc.eks.&lt;region&gt;.amazonaws.com/id/&lt;cluster-unique-id&gt;</code></li>
<li>This endpoint serves JWT tokens that identify Kubernetes service accounts</li></ul></li>
<li><p><strong>You register that OIDC URL as an IAM Identity Provider</strong>  </p>

<ul><li>In AWS IAM console -> Identity Providers -> Add Provider</li>
<li>Provider type: OpenID Connect</li>
<li>Provider URL: your cluster's OIDC issuer URL</li>
<li>Audience: <code>sts.amazonaws.com</code></li></ul></li>
<li><p><strong>You create an IAM role with an OIDC trust policy</strong></p></li>
</ol>

<pre><code class="language-json">{
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/XXXXX"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.us-east-1.amazonaws.com/id/XXXXX:sub": "system:serviceaccount:octopus:octopus-server",
        "oidc.eks.us-east-1.amazonaws.com/id/XXXXX:aud": "sts.amazonaws.com"
      }
    }
  }]
}
</code></pre>

<p>This says: "Trust JWTs from my EKS cluster's OIDC issuer, but only for the specific Kubernetes service account <code>octopus/octopus-server</code>"</p>
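<p>To make the condition concrete, here is a small Python sketch that decodes a stand-in service-account JWT and applies the same <code>StringEquals</code> checks (the token and helper names are illustrative; real STS validation also verifies the token's signature against the issuer's published keys):</p>

```python
import base64, json

def decode_jwt_payload(token):
    """Decode (without verifying) the payload segment of a JWT."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

def trust_policy_allows(claims):
    """Mirror the StringEquals conditions from the trust policy above."""
    return (claims.get("sub") == "system:serviceaccount:octopus:octopus-server"
            and claims.get("aud") == "sts.amazonaws.com")

# Build an unsigned stand-in for the token projected into the pod:
claims = {
    "iss": "https://oidc.eks.us-east-1.amazonaws.com/id/XXXXX",
    "sub": "system:serviceaccount:octopus:octopus-server",
    "aud": "sts.amazonaws.com",
}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"eyJhbGciOiJSUzI1NiJ9.{body}.signature"
print(trust_policy_allows(decode_jwt_payload(token)))  # True
```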

<ol start="4">
<li><strong>You annotate the Kubernetes service account</strong></li>
</ol>

<pre><code class="language-yaml">apiVersion: v1  
kind: ServiceAccount  
metadata:  
  name: octopus-server
  namespace: octopus
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/OctoLauncherRole
</code></pre>

<ol start="5">
<li><p><strong>EKS mutating webhook injects environment variables into the pod</strong>  </p>

<ul><li>When your pod starts, EKS automatically injects:
<ul><li><code>AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token</code></li>
<li><code>AWS_ROLE_ARN=arn:aws:iam::123456789012:role/OctoLauncherRole</code></li></ul></li>
<li>Also mounts the JWT token as a file in the pod</li></ul></li>
<li><p><strong>AWS SDK automatically picks this up</strong>  </p>

<ul><li>When Calamari (or any AWS SDK in the pod) tries to get credentials</li>
<li>SDK sees <code>AWS_WEB_IDENTITY_TOKEN_FILE</code> in the environment</li>
<li>Reads the JWT token from that file path</li>
<li>Calls <code>sts:AssumeRoleWithWebIdentity</code> with the token</li>
<li>Gets back temporary credentials for <code>OctoLauncherRole</code></li></ul></li>
</ol>

<p><strong>Key point:</strong> Calamari doesn't know or care that IRSA is happening. The AWS SDK's credential chain automatically handles it.</p>
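<p>A rough Python model of the relevant part of that credential chain -- the provider order shown is illustrative of the default chain, not a complete reimplementation:</p>

```python
# Illustrative subset of the default AWS SDK credential provider chain:
# static env keys, then web identity (IRSA), then the container endpoint
# (Pod Identity / ECS), then the EC2 instance profile.
def credential_source(env):
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "static environment credentials"
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env and "AWS_ROLE_ARN" in env:
        return "web identity (IRSA)"
    if "AWS_CONTAINER_CREDENTIALS_FULL_URI" in env:
        return "container endpoint (Pod Identity / ECS)"
    return "instance profile (IMDS)"

irsa_pod = {
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
    "AWS_ROLE_ARN": "arn:aws:iam::111111111111:role/OctoLauncherRole",
}
pod_identity_pod = {"AWS_CONTAINER_CREDENTIALS_FULL_URI": "http://169.254.170.23/v1/credentials"}
print(credential_source(irsa_pod))          # web identity (IRSA)
print(credential_source(pod_identity_pod))  # container endpoint (Pod Identity / ECS)
```

<p>This is why neither Calamari nor your deployment scripts contain any auth code: the environment decides which provider fires.</p>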

<h3 id="42ekspodidentitythenewercleanerapproach">4.2 EKS Pod Identity - The Newer, Cleaner Approach</h3>

<p>Pod Identity is AWS's answer to some complexity and edge cases with IRSA:</p>

<p><strong>How it works:</strong></p>

<ol>
<li><strong>Install the EKS Pod Identity Agent add-on</strong></li>
</ol>

<pre><code class="language-bash">aws eks create-addon --cluster-name my-cluster --addon-name eks-pod-identity-agent  
</code></pre>

<ul>
<li>This deploys a DaemonSet on every node</li>
<li>The agent runs on each node and acts as a credential broker</li>
</ul>

<ol start="2">
<li><strong>Create an IAM role with Pod Identity trust policy</strong></li>
</ol>

<pre><code class="language-json">{
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "pods.eks.amazonaws.com"
    },
    "Action": ["sts:AssumeRole", "sts:TagSession"]
  }]
}
</code></pre>

<p>Notice: <strong>No OIDC provider mentioned at all.</strong> The trust is directly with the EKS service.</p>

<ol start="3">
<li><strong>Create a Pod Identity Association</strong></li>
</ol>

<pre><code class="language-bash">aws eks create-pod-identity-association \  
  --cluster-name my-cluster \
  --namespace octopus \
  --service-account octopus-server \
  --role-arn arn:aws:iam::123456789012:role/OctoLauncherRole
</code></pre>

<p>This tells EKS: "When pods in namespace <code>octopus</code> use service account <code>octopus-server</code>, give them credentials for <code>OctoLauncherRole</code>"</p>
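<p>Conceptually, the association is just a lookup keyed on namespace and service account. A minimal sketch (the table and function are hypothetical, mirroring the CLI call above):</p>

```python
# Hypothetical in-memory mirror of the association created by the CLI call:
associations = {
    ("octopus", "octopus-server"): "arn:aws:iam::111111111111:role/OctoLauncherRole",
}

def role_for_pod(namespace, service_account):
    """The Pod Identity Agent only brokers credentials for associated pairs."""
    return associations.get((namespace, service_account))

print(role_for_pod("octopus", "octopus-server"))  # the launcher role ARN
print(role_for_pod("default", "default"))         # None -- no association, no credentials
```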

<ol start="4">
<li><p><strong>Pod Identity Agent injects credentials</strong>  </p>

<ul><li>The DaemonSet exposes a node-local credential endpoint at <code>169.254.170.23:80</code> (a link-local address, similar to how ECS task roles use <code>169.254.170.2</code>)</li>
<li>EKS injects the <code>AWS_CONTAINER_CREDENTIALS_FULL_URI</code> environment variable into the pod, pointing at this endpoint</li>
<li>The AWS SDK's credential chain discovers this env var and fetches credentials from the agent automatically -- the pod never explicitly queries anything</li>
<li>Under the hood, the agent calls the <code>eks-auth:AssumeRoleForPodIdentity</code> API (not <code>AssumeRoleWithWebIdentity</code>) to broker the credentials</li></ul></li>
<li><p><strong>AWS SDK automatically picks this up</strong>  </p>

<ul><li>Same as IRSA from the application's perspective -- the SDK credential chain handles discovery transparently</li>
<li>Calamari/SDK gets credentials without any explicit STS calls in application code</li>
<li>This is the same <code>AWS_CONTAINER_CREDENTIALS_FULL_URI</code> mechanism that ECS Fargate uses for task roles, which is why the SDK treats EKS Pod Identity and ECS task roles identically from the application's perspective</li></ul></li>
</ol>

<p><strong>Why Pod Identity is better:</strong></p>

<ul>
<li><strong>No OIDC provider registration</strong> - simpler setup</li>
<li><strong>No public OIDC discovery URL fetch</strong> - works reliably in fully private clusters. With IRSA, AWS STS must reach the cluster's OIDC discovery endpoint (<code>https://oidc.eks.&lt;region&gt;.amazonaws.com/id/&lt;id&gt;/.well-known/openid-configuration</code>) to validate the pod's JWT. That URL resolves to a <strong>public IP address</strong>. In fully private EKS clusters with air-gapped VPCs (no NAT gateway, no internet gateway), this endpoint is unreachable -- STS cannot validate the token, <code>AssumeRoleWithWebIdentity</code> fails, and IRSA breaks entirely. Pod Identity sidesteps this because the on-node agent brokers credentials via the <code>eks-auth:AssumeRoleForPodIdentity</code> API, which travels over the AWS private network, not the public OIDC path.</li>
<li><strong>Cleaner trust model</strong> - direct EKS service principal, no OIDC federation complexity</li>
<li><strong>Same developer experience</strong> - your code doesn't change</li>
</ul>

<blockquote>
  <p><strong>Private Cluster Warning:</strong> If you are running a fully private EKS cluster and cannot use Pod Identity (e.g., older EKS versions), see section 4.4 below for a Route 53 resolver workaround that allows IRSA to function by forwarding only the OIDC discovery domain to public DNS.</p>
</blockquote>

<h3 id="43whatthisgivesyou">4.3 What This Gives You</h3>

<p>Both IRSA and Pod Identity give Calamari a <strong>"launcher role"</strong> - an initial IAM role identity it can use.</p>

<p>This launcher role is like the "service principal that the agent uses" in Azure DevOps. But here's the key difference:</p>

<p><strong>In Azure:</strong> Your pipeline agent's identity usually has the actual permissions it needs (Contributor, etc.)</p>

<p><strong>In AWS:</strong> The launcher role typically has <strong>only one permission: <code>sts:AssumeRole</code> into other roles</strong></p>

<p>Why? Because you want <strong>per-step, per-account, granular control</strong> over what each deployment step can do.</p>

<h3 id="44route53resolverworkaroundirsainprivateclusters">4.4 Route 53 Resolver Workaround: IRSA in Private Clusters</h3>

<p>If you must use IRSA in a fully private EKS cluster (e.g., Pod Identity is unavailable on your EKS version), the core problem is that STS needs to reach the public OIDC discovery URL to validate the pod's JWT. You can solve this with <strong>Route 53 Resolver Endpoints and split-horizon DNS</strong> without opening general internet access:</p>

<ol>
<li><strong>Create a Route 53 Outbound Resolver Endpoint</strong> in your VPC  </li>
<li><strong>Create a forwarding rule</strong> that matches only the OIDC discovery domain (<code>oidc.eks.&lt;region&gt;.amazonaws.com</code>) and forwards it to public DNS resolvers (e.g., <code>1.1.1.1</code>, <code>8.8.8.8</code>)  </li>
<li><strong>All other DNS traffic</strong> continues to resolve via VPC-internal DNS and VPC endpoints as normal</li>
</ol>

<pre><code class="language-bash"># Create outbound resolver endpoint
aws route53resolver create-resolver-endpoint \
  --creator-request-id oidc-resolver \
  --direction OUTBOUND \
  --security-group-ids sg-0123456789abcdef0 \
  --ip-addresses SubnetId=subnet-aaa,Ip=10.0.1.10 SubnetId=subnet-bbb,Ip=10.0.2.10

# Create forwarding rule for OIDC domain only
aws route53resolver create-resolver-rule \
  --creator-request-id oidc-forward \
  --rule-type FORWARD \
  --domain-name "oidc.eks.us-east-1.amazonaws.com" \
  --resolver-endpoint-id rslvr-out-xxxxxxxxx \
  --target-ips Ip=1.1.1.1 Ip=8.8.8.8

# Associate the rule with your VPC
aws route53resolver associate-resolver-rule \
  --resolver-rule-id rslvr-rr-xxxxxxxxx \
  --vpc-id vpc-xxxxxxxxx
</code></pre>

<p>This gives STS just enough DNS resolution to validate OIDC tokens while keeping everything else private. You still need a NAT gateway or AWS PrivateLink path for the actual HTTPS fetch of the OIDC discovery document -- the DNS forwarding alone resolves the name but does not route the traffic. In most cases, <strong>upgrading to Pod Identity is the cleaner long-term solution.</strong></p>

<h3 id="45theoctopuskubernetesagentbypassingoidcentirely">4.5 The Octopus Kubernetes Agent: Bypassing OIDC Entirely</h3>

<p>For fully private clusters where neither Pod Identity nor the Route 53 workaround is viable, there is a fundamentally different architecture: install the <strong>Octopus Kubernetes Agent</strong> directly inside the cluster.</p>

<p><strong>How it works:</strong></p>

<ol>
<li><strong>Install the agent via Helm</strong> into the target EKS cluster:</li>
</ol>

<pre><code class="language-bash">helm upgrade --install --atomic octopus-agent \  
  oci://registry-1.docker.io/octopusdeploy/kubernetes-agent \
  --namespace octopus-agent \
  --create-namespace \
  --set agent.serverUrl="https://your-octopus-server" \
  --set agent.serverCommsAddress="https://your-octopus-server:10943" \
  --set agent.space="Default" \
  --set agent.targetName="private-eks-cluster" \
  --set agent.bearerToken="API-XXXXXXXXXXXX"
</code></pre>

<ol start="2">
<li><p><strong>The agent runs in poll mode</strong> -- it dials <strong>outbound</strong> to Octopus Server over HTTPS (port 10943), asking "do you have work for me?" This means:  </p>

<ul><li>No inbound connections to the cluster required</li>
<li>No OIDC discovery URL validation needed</li>
<li>No IRSA or Pod Identity configuration required for the Octopus→Kubernetes communication path</li></ul></li>
<li><p><strong>The agent already has in-cluster RBAC</strong> -- because it runs as a pod inside the cluster, it uses a Kubernetes service account with the RBAC permissions you grant it. Octopus Server sends deployment instructions; the agent executes them using its native Kubernetes access.</p></li>
<li><p><strong>For AWS API calls</strong>, the agent pod can still use Pod Identity or IRSA to get a launcher role, then assume deployment roles per step -- the same two-layer pattern described in section 5. The difference is that the Octopus→cluster connectivity problem is eliminated.</p></li>
</ol>
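<p>Poll mode itself is simple to picture. A toy sketch, with a hypothetical in-memory work queue standing in for Octopus Server:</p>

```python
from collections import deque

# Toy sketch of poll mode: the agent only ever makes outbound requests,
# so the cluster needs no inbound connectivity from Octopus Server.
server_queue = deque([{"task": "kubectl apply", "release": "1.0"}])

def poll_once(queue):
    """One outbound 'do you have work for me?' call; None means idle."""
    return queue.popleft() if queue else None

work = poll_once(server_queue)
print(work["task"])             # kubectl apply
print(poll_once(server_queue))  # None -- nothing queued, the agent sleeps and retries
```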

<p><strong>When to use the Kubernetes Agent:</strong></p>

<ul>
<li><strong>Fully air-gapped clusters</strong> where no public DNS or internet path exists</li>
<li><strong>Multiple private clusters</strong> across accounts -- install one agent per cluster, all poll back to a central Octopus Server</li>
<li><strong>Simplified networking</strong> -- outbound-only connectivity from the cluster to Octopus Server</li>
<li><strong>Hybrid scenarios</strong> -- Octopus Server runs outside AWS (on-prem or different cloud) and deploys into private EKS clusters</li>
</ul>

<p><strong>Trade-off:</strong> You now manage an agent per cluster instead of having a single centralized Octopus-in-EKS installation. For organizations with many private clusters, this is often preferable to complex networking workarounds.</p>

<hr>

<h2 id="5thetwolayerrolepatternlauncherdeploymentroles">5. The Two-Layer Role Pattern: "Launcher" + "Deployment Roles"</h2>

<p>Here's where AWS diverges significantly from the Azure mental model.</p>

<p>You want to be able to:</p>

<ul>
<li>Run Step A with CloudFormation access to deploy infrastructure in the current environment</li>
<li>Run Step B with ECR access to push/pull container images in the current environment</li>
<li>Run Step C with EKS access to apply Kubernetes manifests in the current environment</li>
<li>All steps in one deployment run against the <strong>same environment</strong> -- the environment selection (Dev, Staging, Prod) determines which AWS account is targeted</li>
</ul>

<p>In Octopus Deploy, <strong>all steps in a single deployment execute against the same environment</strong>. When you deploy Release 1.0 to Dev, every step runs against Dev. When you promote Release 1.0 to Staging, every step runs against Staging. Per-step role ARNs are for different permission scopes <em>within</em> the same account (e.g., Step 1 needs CloudFormation, Step 2 needs ECR, Step 3 needs EKS — all in the same AWS account for that environment). Octopus variable scoping is the mechanism that changes which AWS account is targeted when you promote across environments.</p>
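<p>A minimal sketch of that variable-scoping behavior (the <code>AWS.DeployRoleArn</code> variable name is the example from above; the resolver is a toy stand-in for Octopus's own scoping engine):</p>

```python
# One variable name, different values scoped per environment:
scoped_variables = {
    "AWS.DeployRoleArn": {
        "Dev":     "arn:aws:iam::111111111111:role/DevDeployRole",
        "Staging": "arn:aws:iam::222222222222:role/StagingDeployRole",
        "Prod":    "arn:aws:iam::333333333333:role/ProdDeployRole",
    },
}

def resolve(name, environment):
    """The deployment process stays identical; only the environment changes."""
    return scoped_variables[name][environment]

print(resolve("AWS.DeployRoleArn", "Dev"))   # Dev-account role
print(resolve("AWS.DeployRoleArn", "Prod"))  # Prod-account role, same process
```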

<p>Instead of giving your Octopus pod's identity all those permissions combined (which would be a security nightmare), you use <strong>role assumption chains</strong>.</p>

<h3 id="51layer1thelauncherrole">5.1 Layer 1: The Launcher Role</h3>

<p>This is the role attached to your Octopus pod via IRSA or Pod Identity. Octopus runs in the Dev account — the same AWS account as your Dev environment. When deploying to Dev, the launcher and deployment resources are in the same account. When promoting to Staging or Prod, those are separate AWS accounts requiring cross-account AssumeRole.</p>

<p><strong>Example:</strong></p>

<pre><code>Account: Dev Account (111111111111) - where Octopus EKS cluster runs  
Role:    OctoLauncherRole  
ARN:     arn:aws:iam::111111111111:role/OctoLauncherRole  
</code></pre>

<p><strong>Trust Policy (Pod Identity):</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "pods.eks.amazonaws.com"
    },
    "Action": ["sts:AssumeRole", "sts:TagSession"]
  }]
}
</code></pre>

<p><strong>Permission Policy (minimal):</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": [
      "arn:aws:iam::111111111111:role/DevDeployRole",
      "arn:aws:iam::222222222222:role/StagingDeployRole",
      "arn:aws:iam::333333333333:role/ProdDeployRole"
    ]
  }]
}
</code></pre>

<p><strong>This role:</strong></p>

<ul>
<li>Can't touch any actual AWS resources (no EKS, S3, CloudFormation permissions)</li>
<li>Can only assume other specific roles</li>
<li>Acts as the "bootstrap identity" for Calamari</li>
</ul>

<p>Think of it like a "service principal that can only impersonate other service principals."</p>
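<p>A toy policy evaluator makes the point concrete -- the launcher can assume the three deployment roles and nothing else. (This ignores Deny statements, conditions, and most of real IAM evaluation; it is only a sketch.)</p>

```python
import fnmatch

# The launcher's only permission, as in the policy above:
launcher_policy = [{
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": [
        "arn:aws:iam::111111111111:role/DevDeployRole",
        "arn:aws:iam::222222222222:role/StagingDeployRole",
        "arn:aws:iam::333333333333:role/ProdDeployRole",
    ],
}]

def allowed(policy, action, resource):
    """Toy evaluator: IAM denies by default; Allow statements grant access."""
    for stmt in policy:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        if (stmt["Effect"] == "Allow"
                and any(fnmatch.fnmatch(action, a) for a in actions)
                and any(fnmatch.fnmatch(resource, r) for r in resources)):
            return True
    return False

print(allowed(launcher_policy, "sts:AssumeRole",
              "arn:aws:iam::222222222222:role/StagingDeployRole"))  # True
print(allowed(launcher_policy, "s3:GetObject",
              "arn:aws:s3:::dev-artifacts-main/app.zip"))           # False
```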

<h3 id="52layer2deploymentrolesperaccountenvironment">5.2 Layer 2: Deployment Roles (Per Account/Environment)</h3>

<p>Now you create deployment roles in <strong>each target AWS account</strong>. Because Octopus runs in the Dev account, the Dev deployment role is in the <strong>same account</strong> as the launcher — this is a same-account AssumeRole. Staging and Prod are separate accounts and require cross-account role trust.</p>

<p><strong>How Octopus selects the right role:</strong> Octopus variable scoping drives this. You define a variable <code>AWS.DeployRoleArn</code> (or similar) with different values scoped to each environment. When you deploy to Dev, Octopus resolves the Dev role ARN. When you promote the same release to Staging, Octopus resolves the Staging role ARN. The deployment process definition is identical — the environment selection is what changes the target account.</p>

<h4 id="devaccount111111111111sameaccountaslauncher">Dev Account (111111111111) — Same Account as Launcher</h4>

<p><strong>Role:</strong> <code>DevDeployRole</code><br>
<strong>ARN:</strong> <code>arn:aws:iam::111111111111:role/DevDeployRole</code></p>

<p><strong>Trust Policy:</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/OctoLauncherRole"
    },
    "Action": "sts:AssumeRole"
  }]
}
</code></pre>

<p>This says: "Allow the OctoLauncherRole from the same Dev account (111111111111) to assume me"</p>

<p><strong>Permission Policy (what Dev steps can actually do):</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSAccess",
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListClusters"
      ],
      "Resource": "arn:aws:eks:us-east-1:111111111111:cluster/*"
    },
    {
      "Sid": "ECRPushPull",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRRepoAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:ListImages"
      ],
      "Resource": "arn:aws:ecr:us-east-1:111111111111:repository/*"
    },
    {
      "Sid": "CloudFormation",
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack",
        "cloudformation:DeleteStack",
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "cloudformation:GetTemplate",
        "cloudformation:ValidateTemplate",
        "cloudformation:ListStacks"
      ],
      "Resource": "arn:aws:cloudformation:us-east-1:111111111111:stack/dev-*/*"
    },
    {
      "Sid": "S3ArtifactAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::dev-artifacts-*",
        "arn:aws:s3:::dev-artifacts-*/*"
      ]
    }
  ]
}
</code></pre>

<p><strong>Note on <code>eks:DescribeCluster</code>:</strong> This is the only IAM permission needed for <code>kubectl</code> operations. When Calamari runs <code>kubectl apply</code>, it calls <code>eks:DescribeCluster</code> to get the cluster's API endpoint and CA certificate, then authenticates to the Kubernetes API server using the IAM role. <strong>But IAM permissions alone are not sufficient</strong> -- you must also grant the role Kubernetes RBAC access (see section 5.4 below).</p>
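<p>For the curious, the bearer token <code>kubectl</code> presents is a base64url-encoded presigned <code>sts:GetCallerIdentity</code> URL with a <code>k8s-aws-v1.</code> prefix (the format produced by <code>aws eks get-token</code>). A sketch of the encoding, using an unsigned placeholder URL:</p>

```python
import base64

def eks_bearer_token(presigned_sts_url):
    """Token format: 'k8s-aws-v1.' + unpadded base64url of a presigned
    sts:GetCallerIdentity URL. The API server replays that URL to learn
    which IAM principal is calling, then maps the principal through
    access entries or the aws-auth ConfigMap."""
    encoded = base64.urlsafe_b64encode(presigned_sts_url.encode()).decode().rstrip("=")
    return "k8s-aws-v1." + encoded

# Placeholder URL -- a real presigned URL also carries SigV4 query parameters:
url = "https://sts.us-east-1.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15"
token = eks_bearer_token(url)
print(token[:11])  # k8s-aws-v1.
```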

<h4 id="stagingaccount222222222222">Staging Account (222222222222)</h4>

<p><strong>Role:</strong> <code>StagingDeployRole</code><br>
<strong>ARN:</strong> <code>arn:aws:iam::222222222222:role/StagingDeployRole</code></p>

<p><strong>Trust Policy (cross-account):</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/OctoLauncherRole"
    },
    "Action": "sts:AssumeRole"
  }]
}
</code></pre>

<p><strong>Permission Policy:</strong> Same structure as Dev, scoped to this account's resources:</p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSAccess",
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListClusters"
      ],
      "Resource": "arn:aws:eks:us-east-1:222222222222:cluster/*"
    },
    {
      "Sid": "ECRPushPull",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRRepoAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:ListImages"
      ],
      "Resource": "arn:aws:ecr:us-east-1:222222222222:repository/*"
    },
    {
      "Sid": "CloudFormation",
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack",
        "cloudformation:DeleteStack",
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "cloudformation:GetTemplate",
        "cloudformation:ValidateTemplate",
        "cloudformation:ListStacks"
      ],
      "Resource": "arn:aws:cloudformation:us-east-1:222222222222:stack/staging-*/*"
    },
    {
      "Sid": "S3ArtifactAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::staging-artifacts-*",
        "arn:aws:s3:::staging-artifacts-*/*"
      ]
    }
  ]
}
</code></pre>

<p>The Staging deployment role has <strong>the same permission actions</strong> as Dev -- because the deployment process is the same. The difference is resource scoping (account 222222222222 resources) and the trust policy (cross-account from the Dev account where Octopus runs).</p>

<h4 id="productionaccount333333333333">Production Account (333333333333)</h4>

<p><strong>Role:</strong> <code>ProdDeployRole</code><br>
<strong>ARN:</strong> <code>arn:aws:iam::333333333333:role/ProdDeployRole</code></p>

<p><strong>Trust Policy (cross-account):</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/OctoLauncherRole"
    },
    "Action": "sts:AssumeRole"
  }]
}
</code></pre>

<p><strong>Permission Policy:</strong></p>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSAccess",
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListClusters"
      ],
      "Resource": "arn:aws:eks:us-east-1:333333333333:cluster/*"
    },
    {
      "Sid": "ECRPullOnly",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRRepoAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "ecr:ListImages"
      ],
      "Resource": "arn:aws:ecr:us-east-1:333333333333:repository/*"
    },
    {
      "Sid": "CloudFormation",
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:UpdateStack",
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackEvents",
        "cloudformation:GetTemplate",
        "cloudformation:ValidateTemplate",
        "cloudformation:ListStacks"
      ],
      "Resource": "arn:aws:cloudformation:us-east-1:333333333333:stack/prod-*/*"
    },
    {
      "Sid": "S3ArtifactAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::prod-artifacts-*",
        "arn:aws:s3:::prod-artifacts-*/*"
      ]
    }
  ]
}
</code></pre>

<p><strong>The Prod role must have write permissions for the deployment process to work.</strong> If all steps in a deployment target the same environment, and the deployment process is the same for Dev and Prod, then Prod needs the ability to actually deploy -- create/update CloudFormation stacks, apply Kubernetes manifests via <code>kubectl</code>, etc. The control point for Prod safety is <strong>not</strong> IAM read-only permissions (which would make automated deployment impossible). Instead, Prod safety comes from:</p>

<ul>
<li><strong>Octopus manual approval gates</strong> -- require a human to approve before a release proceeds to Prod</li>
<li><strong>Kubernetes RBAC scoping</strong> -- limit the role to specific namespaces</li>
<li><strong>CloudFormation stack policies</strong> -- prevent deletion of critical resources</li>
<li><strong>ECR pull-only</strong> -- Prod never pushes images; it pulls the images that were already pushed and tested via Dev/Staging (note its <code>ECRRepoAccess</code> statement omits the push actions)</li>
<li><strong>No CloudFormation DeleteStack</strong> -- notice Prod lacks <code>cloudformation:DeleteStack</code> compared to Dev/Staging</li>
<li><strong>S3 read-only</strong> -- Prod reads artifacts; it doesn't produce them</li>
</ul>
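<p>Condensing the Dev and Prod policies above into action sets shows the tightening at a glance (the sets below are abbreviated by hand from the JSON, not generated from it):</p>

```python
# Hand-condensed action sets from the Dev and Prod policy JSON above:
dev_actions = {"ecr:PutImage", "ecr:BatchGetImage",
               "cloudformation:CreateStack", "cloudformation:UpdateStack",
               "cloudformation:DeleteStack", "s3:GetObject", "s3:PutObject"}
prod_actions = {"ecr:BatchGetImage",
                "cloudformation:CreateStack", "cloudformation:UpdateStack",
                "s3:GetObject"}

removed_in_prod = dev_actions - prod_actions
assert prod_actions < dev_actions  # strict subset: Prod can deploy, but tightened
print(sorted(removed_in_prod))
# ['cloudformation:DeleteStack', 'ecr:PutImage', 's3:PutObject']
```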

<h3 id="53whythispattern">5.3 Why This Pattern?</h3>

<p>This is <strong>defense in depth</strong>:</p>

<ol>
<li><strong>Pod compromise</strong> - if the Octopus pod is compromised, attacker only has OctoLauncherRole (in Dev account), which can only assume specific deployment roles and cannot directly touch resources  </li>
<li><strong>Blast radius</strong> - each deployment role is scoped to exactly what that environment needs; Dev is same-account, Staging/Prod require explicit cross-account trust  </li>
<li><strong>Audit trail</strong> - CloudTrail shows exact role assumption chain: <code>OctoLauncherRole</code> -> <code>StagingDeployRole</code> -> <code>s3:PutObject</code>; for Dev deployments the chain stays within account 111111111111  </li>
<li><strong>Granular control</strong> - Dev/Staging have full deploy permissions, Prod is tightened (no ECR push, no stack deletion, no artifact writes, namespace-scoped RBAC), all from the same Octopus installation  </li>
<li><strong>Environment-driven targeting</strong> - Octopus variable scoping ensures the same deployment process resolves to the right role ARN per environment without any code changes</li>
</ol>

<h3 id="54thekubectlrbacgapiamisnotenough">5.4 The kubectl/RBAC Gap: IAM Is Not Enough</h3>

<p>This is a critical gap that catches teams by surprise: <strong>IAM permissions alone do not grant Kubernetes API access.</strong> Your deployment role can have <code>eks:DescribeCluster</code> and every EKS permission in the IAM catalog, but <code>kubectl apply</code> will still fail with <code>error: You must be logged in to the server (Unauthorized)</code> unless the role is also mapped to Kubernetes RBAC.</p>

<p>EKS has two mechanisms for this:</p>

<h4 id="option1eksaccessentriesrecommendednewerclusters">Option 1: EKS Access Entries (Recommended -- newer clusters)</h4>

<p>EKS access entries are the AWS-native approach, managed via API without touching cluster internals:</p>

<pre><code class="language-bash"># Grant the Dev deployment role access to the Dev cluster
aws eks create-access-entry \
  --cluster-name dev-cluster \
  --principal-arn arn:aws:iam::111111111111:role/DevDeployRole \
  --type STANDARD

# Associate a Kubernetes RBAC policy
aws eks associate-access-policy \
  --cluster-name dev-cluster \
  --principal-arn arn:aws:iam::111111111111:role/DevDeployRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
  --access-scope type=namespace,namespaces=app-dev
</code></pre>

<p>For Staging/Prod (cross-account), you create access entries in those clusters pointing to the respective deployment roles:</p>

<pre><code class="language-bash"># In Staging cluster (account 222222222222)
aws eks create-access-entry \
  --cluster-name staging-cluster \
  --principal-arn arn:aws:iam::222222222222:role/StagingDeployRole \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name staging-cluster \
  --principal-arn arn:aws:iam::222222222222:role/StagingDeployRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
  --access-scope type=namespace,namespaces=app-staging
</code></pre>

<h4 id="option2awsauthconfigmaplegacyallclusters">Option 2: aws-auth ConfigMap (Legacy -- all clusters)</h4>

<p>For older clusters or clusters not using access entries, you map IAM roles to Kubernetes groups via the <code>aws-auth</code> ConfigMap in the <code>kube-system</code> namespace:</p>

<pre><code class="language-yaml">apiVersion: v1  
kind: ConfigMap  
metadata:  
  name: aws-auth
  namespace: kube-system
data:  
  mapRoles: |
    # Deployment role for this environment
    - rolearn: arn:aws:iam::111111111111:role/DevDeployRole
      username: octopus-deploy
      groups:
        - system:masters  # Full cluster admin -- tighten this in Staging/Prod
    # Node instance role (already present -- don't remove)
    - rolearn: arn:aws:iam::111111111111:role/eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
</code></pre>

<blockquote>
  <p><strong>Warning:</strong> Editing <code>aws-auth</code> incorrectly can lock you out of the cluster. Always verify the existing content before modifying. Never remove the node instance role entries.</p>
</blockquote>

<p>For Prod, use namespace-scoped RBAC instead of <code>system:masters</code>:</p>

<pre><code class="language-yaml"># Role for deployment, scoped to the app-prod namespace
apiVersion: rbac.authorization.k8s.io/v1  
kind: Role  
metadata:  
  namespace: app-prod
  name: octopus-deployer
rules:  
  - apiGroups: ["", "apps", "batch"]
    resources: ["deployments", "services", "configmaps", "secrets", "pods", "jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1  
kind: RoleBinding  
metadata:  
  namespace: app-prod
  name: octopus-deployer-binding
subjects:  
  - kind: Group
    name: octopus-deployers
    apiGroup: rbac.authorization.k8s.io
roleRef:  
  kind: Role
  name: octopus-deployer
  apiGroup: rbac.authorization.k8s.io
</code></pre>

<p>Then in <code>aws-auth</code>, map the Prod role to the <code>octopus-deployers</code> group instead of <code>system:masters</code>.</p>

<p><strong>The two-layer authorization model:</strong> IAM controls <em>whether</em> the role can reach the cluster (<code>eks:DescribeCluster</code>). Kubernetes RBAC controls <em>what</em> the role can do once authenticated. You need both. This is fundamentally different from Azure AKS, where Azure AD RBAC can grant both the Azure-level and Kubernetes-level permissions in one place.</p>
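<p>The two gates can be summarized in a few lines (a toy model, not an EKS API):</p>

```python
def can_kubectl(iam_allows_describe, rbac_allows_verb):
    """IAM gets you to the API server; Kubernetes RBAC decides what you may do."""
    if not iam_allows_describe:
        return "cannot build kubeconfig (eks:DescribeCluster denied)"
    if not rbac_allows_verb:
        return "Unauthorized (no access entry / aws-auth mapping)"
    return "allowed"

print(can_kubectl(True, False))  # the surprise case: IAM is fine, RBAC is missing
print(can_kubectl(True, True))   # allowed
```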

<hr>

<h2 id="6thebigpicturehowitallconnects">6. The Big Picture: How It All Connects</h2>

<p>These diagrams show the two deployment scenarios: deploying to the same account where Octopus runs (Dev), and deploying cross-account (Prod). In both cases, <strong>all steps in the deployment target the same environment</strong> -- the environment selection determines which account is targeted.</p>

<h3 id="deployingtodevsameaccount">Deploying to Dev (Same Account)</h3>

<p>When you deploy a release to Dev, Octopus, the launcher role, and the deployment role are all in the same AWS account. The AssumeRole call is same-account:</p>

<pre><code class="language-mermaid">flowchart TB  
    subgraph org["AWS Organization"]
        subgraph dev["Dev Account (111111111111) — Octopus runs here"]
            sts["AWS STS"]

            subgraph eks["EKS Cluster"]
                subgraph pod["Octopus Pod"]
                    octopus["Octopus Server"]
                    calA["Calamari A\n(CloudFormation step)"]
                    calB["Calamari B\n(EKS deploy step)"]
                end
            end

            irsa["IRSA or Pod Identity\n(OIDC Issuer)"]
            devRole["DevDeployRole"]
            devResources["Dev Resources\n(EKS, ECR, S3, CloudFormation)"]

            irsa --&gt;|"env vars:\nAWS_WEB_IDENTITY_TOKEN_FILE\nAWS_ROLE_ARN"| pod
            pod --&gt;|"AssumeRoleWithWebIdentity\n(launcher token)"| sts
            sts --&gt;|"Temp creds:\nOctoLauncherRole"| pod

            octopus --&gt;|"spawns"| calA
            octopus --&gt;|"spawns"| calB

            calA --&gt;|"sts:AssumeRole\n(DevDeployRole)"| sts
            sts --&gt;|"Temp creds"| calA
            calA -.-&gt;|"operates on"| devResources
            devResources --- devRole

            calB --&gt;|"sts:AssumeRole\n(DevDeployRole)"| sts
            sts --&gt;|"Temp creds"| calB
            calB -.-&gt;|"operates on"| devResources
        end
    end
</code></pre>

<p>Both Calamari steps assume the <strong>same</strong> DevDeployRole because all steps in this deployment target Dev. Per-step role overrides are still possible if different steps need different permission scopes within Dev (e.g., one step needs CloudFormation + IAM, another only needs ECR read).</p>

<h3 id="promotingtoprodcrossaccount">Promoting to Prod (Cross-Account)</h3>

<p>When you promote the same release to Prod, Octopus variable scoping resolves the Prod role ARN instead. Calamari now makes cross-account AssumeRole calls from the Dev account into the Prod account:</p>

<pre><code class="language-mermaid">flowchart TB  
    subgraph org["AWS Organization"]
        subgraph dev["Dev Account (111111111111) — Octopus runs here"]
            sts["AWS STS"]

            subgraph eks["EKS Cluster"]
                subgraph pod["Octopus Pod"]
                    octopus["Octopus Server"]
                    calA["Calamari A\n(CloudFormation step)"]
                    calB["Calamari B\n(EKS deploy step)"]
                end
            end

            irsa["IRSA or Pod Identity"]

            irsa --&gt;|"launcher token"| pod
            pod --&gt;|"AssumeRoleWithWebIdentity"| sts
            sts --&gt;|"OctoLauncherRole creds"| pod
        end

        subgraph prod["Prod Account (333333333333)"]
            prodRole["ProdDeployRole"]
            prodResources["Prod Resources\n(EKS, ECR, S3)"]
        end

        octopus --&gt;|"spawns"| calA
        octopus --&gt;|"spawns"| calB

        calA --&gt;|"sts:AssumeRole\n(ProdDeployRole)"| sts
        sts --&gt;|"Cross-account\ntemp creds"| calA
        calA -.-&gt;|"operates on"| prodResources
        prodResources --- prodRole

        calB --&gt;|"sts:AssumeRole\n(ProdDeployRole)"| sts
        sts --&gt;|"Cross-account\ntemp creds"| calB
        calB -.-&gt;|"operates on"| prodResources
    end
</code></pre>

<p><strong>Important:</strong> Calamari processes run inside the Octopus pod in the Dev account -- they are subprocesses of the Octopus Server, not remote agents deployed in target accounts. They assume IAM roles in the target accounts via STS and then make API calls <em>to</em> those accounts, but the process itself executes locally in the Octopus pod. If you need actual execution <em>inside</em> a target account's VPC (e.g., for private API endpoints), you'd deploy external workers there -- but that's a different topology.</p>

<p>The key insight: the <strong>deployment process is identical</strong> for Dev and Prod. What changes is the environment selection, which causes Octopus to resolve different variable values -- including the target role ARN. Octopus Server spawns Calamari subprocesses, each of which independently calls STS using the pod's launcher credentials, then assumes the deployment role that the environment's variable scoping resolves to.</p>

<hr>

<h2 id="7howoctopusconfigurationmapstothis">7. How Octopus Configuration Maps to This</h2>

<p>In Octopus UI under <strong>Deploy -> Manage -> Accounts -> Add Account -> AWS Account</strong>, you configure a single account that uses ambient credentials from IRSA/Pod Identity:</p>

<h3 id="awsaccountconfiguration">AWS Account Configuration</h3>

<p><strong>Account name:</strong> <code>AWS Deploy</code></p>

<p><strong>Authentication method:</strong> Execute using the AWS service role for an EC2 instance
(This tells Octopus: "Don't use stored keys; Calamari should pick up ambient credentials from IRSA/Pod Identity")</p>

<blockquote>
  <p><strong>Note on the label:</strong> The "EC2 instance" wording is misleading -- this option does not require EC2. It means "use the AWS SDK's default credential chain to resolve ambient credentials," which works equally for EKS IRSA, EKS Pod Identity, ECS Task Roles, and EC2 Instance Roles. The label predates EKS and ECS Fargate support in Octopus. Functionally, selecting this just tells Calamari: "don't use stored access keys; discover credentials from the environment."</p>
</blockquote>

<p><strong>Access Key / Secret Key:</strong> (leave blank)</p>

<p><strong>Assume Role (optional):</strong> (leave blank at account level)</p>

<h3 id="variablescopinghowenvironmentstargetdifferentaccounts">Variable Scoping: How Environments Target Different Accounts</h3>

<p>This is the key to understanding how Octopus handles multi-account deployments. You define a <strong>project variable</strong> for the deployment role ARN, with different values scoped to each environment:</p>

<table>
<thead>
<tr><th>Variable Name</th><th>Value</th><th>Scoped To</th></tr>
</thead>
<tbody>
<tr><td><code>AWS.DeployRoleArn</code></td><td><code>arn:aws:iam::111111111111:role/DevDeployRole</code></td><td>Dev</td></tr>
<tr><td><code>AWS.DeployRoleArn</code></td><td><code>arn:aws:iam::222222222222:role/StagingDeployRole</code></td><td>Staging</td></tr>
<tr><td><code>AWS.DeployRoleArn</code></td><td><code>arn:aws:iam::333333333333:role/ProdDeployRole</code></td><td>Prod</td></tr>
</tbody>
</table>

<p>Then in your <strong>deployment process</strong>, each step uses:</p>

<p><strong>AWS Account:</strong> Select <code>AWS Deploy</code></p>

<p><strong>Assume a different AWS Role:</strong> <code>#{AWS.DeployRoleArn}</code></p>

<p>When you deploy Release 1.0 to Dev, Octopus resolves <code>#{AWS.DeployRoleArn}</code> to the Dev role ARN. When you promote the same release to Staging, Octopus resolves it to the Staging role ARN. <strong>The deployment process definition is identical across environments -- the environment selection is what changes the target account.</strong></p>
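<p>As a concept sketch (plain Python, not Octopus internals), the scoping rule behaves like a lookup keyed by environment -- the release and process stay constant while the resolved value changes:</p>

```python
# Illustrative sketch of environment-scoped variable resolution.
# This is a simplification of the concept -- not Octopus's implementation.
SCOPED_VARIABLES = {
    "AWS.DeployRoleArn": {
        "Dev": "arn:aws:iam::111111111111:role/DevDeployRole",
        "Staging": "arn:aws:iam::222222222222:role/StagingDeployRole",
        "Prod": "arn:aws:iam::333333333333:role/ProdDeployRole",
    }
}

def resolve(name: str, environment: str) -> str:
    """Resolve a scoped variable for the selected environment."""
    return SCOPED_VARIABLES[name][environment]

# The same release, promoted through environments, resolves differently:
for env in ("Dev", "Staging", "Prod"):
    print(env, "->", resolve("AWS.DeployRoleArn", env))
```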

<p>This tells Calamari:</p>

<ol>
<li>Use ambient credentials from Pod Identity (OctoLauncherRole in Dev account)</li>
<li>Call <code>sts:AssumeRole</code> into whichever role ARN the environment resolved</li>
<li>For Dev: same-account AssumeRole (both launcher and target in 111111111111)</li>
<li>For Staging/Prod: cross-account AssumeRole (launcher in 111111111111, target in 222222222222 or 333333333333)</li>
<li>Use those scoped credentials for the step</li>
</ol>

<h3 id="alternativeoctopusoidcifyoudontwantirsapodidentity">Alternative: Octopus OIDC (If You Don't Want IRSA/Pod Identity)</h3>

<p>Instead of "Execute using service role", you can configure:</p>

<p><strong>Authentication method:</strong> Use OpenID Connect</p>

<p><strong>Role ARN:</strong> <code>arn:aws:iam::111111111111:role/OctoDevRole</code></p>

<p>In this mode:</p>

<ul>
<li>Octopus Server acts as an OIDC issuer</li>
<li>Octopus mints a JWT token scoped to the deployment</li>
<li>Calamari calls <code>sts:AssumeRoleWithWebIdentity</code> using Octopus's JWT</li>
<li>AWS STS validates the token by fetching <code>https://your-octopus-server/.well-known/openid-configuration</code></li>
</ul>
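<p>The document STS fetches is standard OIDC provider metadata. A minimal illustrative example (field values are placeholders; the exact paths Octopus serves may differ):</p>

```json
{
  "issuer": "https://your-octopus-server",
  "jwks_uri": "https://your-octopus-server/.well-known/jwks",
  "response_types_supported": ["id_token"],
  "id_token_signing_alg_values_supported": ["RS256"],
  "subject_types_supported": ["public"]
}
```

<p>STS uses <code>jwks_uri</code> to fetch the public keys it needs to verify the JWT's signature -- which is exactly why the endpoint must be reachable from AWS.</p>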

<p><strong>Why you might not want this:</strong> Requires Octopus to have a publicly-reachable OIDC discovery endpoint. If Octopus is fully private, STS can't validate the token.</p>

<p><strong>When IRSA/Pod Identity is better:</strong> Your Octopus installation can be completely private. The EKS OIDC issuer (for IRSA) or Pod Identity service is AWS-managed and public, so STS can always validate.</p>

<hr>

<h2 id="8thecompleteflowstepexecutionwithroleassumption">8. The Complete Flow: Step Execution with Role Assumption</h2>

<p>Let's trace a real deployment step end-to-end:</p>

<p><strong>Scenario:</strong> Deploy a CloudFormation stack to Dev account (same account where Octopus runs)</p>

<p><strong>Configuration:</strong></p>

<ul>
<li>Octopus runs in EKS cluster in Dev account (111111111111)</li>
<li>Octopus pod uses service account with Pod Identity -> <code>OctoLauncherRole</code></li>
<li>Step configured with AWS Account <code>AWS Deploy</code>, Role ARN resolved via variable scoping to <code>arn:aws:iam::111111111111:role/DevDeployRole</code></li>
</ul>

<p><strong>Step-by-step execution:</strong></p>

<ol>
<li><p><strong>User triggers deployment</strong> in Octopus UI</p></li>
<li><p><strong>Octopus Server evaluates the step</strong>  </p>

<ul><li>Identifies that it should run on built-in worker (in the Octopus pod)</li>
<li>Spawns Calamari subprocess</li></ul></li>
<li><p><strong>Octopus Server passes to Calamari:</strong>  </p>

<ul><li>CloudFormation template file</li>
<li>Stack name, parameters</li>
<li>AWS Account config: "use ambient service role"</li>
<li>Per-step Role ARN: <code>arn:aws:iam::111111111111:role/DevDeployRole</code></li></ul></li>
<li><p><strong>Calamari resolves base credentials:</strong>  </p>

<ul><li>Checks environment variables</li>
<li>With IRSA: finds <code>AWS_ROLE_ARN=arn:aws:iam::111111111111:role/OctoLauncherRole</code> and <code>AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/...</code></li>
<li>With Pod Identity: finds <code>AWS_CONTAINER_CREDENTIALS_FULL_URI</code> pointing at the Pod Identity agent</li>
<li>The AWS SDK automatically exchanges the token for OctoLauncherRole credentials</li>
<li><strong>Calamari now has temp creds for the launcher role</strong></li></ul></li>
<li><p><strong>Calamari performs role assumption:</strong></p>

<ul><li>Using OctoLauncherRole credentials, calls:</li></ul></li>
</ol>

<pre><code>aws sts assume-role \  
  --role-arn arn:aws:iam::111111111111:role/DevDeployRole \
  --role-session-name octopus-deploy-12345
</code></pre>

<ul>
<li>STS checks: "Does DevDeployRole trust OctoLauncherRole?" -> Yes (trust policy)</li>
<li>STS checks: "Can OctoLauncherRole assume DevDeployRole?" -> Yes (launcher has <code>sts:AssumeRole</code> permission for this ARN)</li>
<li><strong>STS returns temporary credentials for DevDeployRole (same-account)</strong></li>
</ul>

<ol start="6">
<li><p><strong>Calamari injects credentials:</strong></p></li>
</ol>

<pre><code class="language-bash">export AWS_ACCESS_KEY_ID=ASIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_SESSION_TOKEN=AQoEXAMPLEH4aoAH0gNCAPyJxz4BlCFFxWNE1OPTgk5TthT+FvwqnKwRcOIfrRh3c/...
export AWS_DEFAULT_REGION=us-east-1
</code></pre>

<ol start="7">
<li><p><strong>CloudFormation step executes:</strong></p></li>
</ol>

<pre><code class="language-bash">aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name my-app-stack \
  --capabilities CAPABILITY_IAM
</code></pre>

<ul>
<li>AWS CLI uses the injected credentials</li>
<li>Operates <strong>as DevDeployRole in account 111111111111</strong></li>
<li>Can create CloudFormation stacks, EKS clusters, etc. (per DevDeployRole permissions)</li>
</ul>

<ol start="8">
<li><p><strong>Step completes, credentials discarded</strong></p>

<ul>
<li>Temporary credentials expire (default 1 hour, configurable up to 12 hours; role chaining limited to 1 hour)</li>
<li>Next step goes through the same flow, potentially with a different role</li>
</ul></li>
</ol>

<h3 id="whathappensincloudtrailaudit">What Happens in CloudTrail (Audit)</h3>

<p>When you look at CloudTrail logs:</p>

<p><strong>In Dev Account (111111111111) -- where Octopus runs:</strong></p>

<pre><code class="language-json">{
  "eventName": "AssumeRole",
  "requestParameters": {
    "roleArn": "arn:aws:iam::111111111111:role/DevDeployRole",
    "roleSessionName": "octopus-deploy-12345"
  },
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AIDACKCEVSQ6C2EXAMPLE:octopus-pod",
    "arn": "arn:aws:sts::111111111111:assumed-role/OctoLauncherRole/octopus-pod"
  }
}
</code></pre>

<p><strong>In Dev Account (111111111111):</strong></p>

<pre><code class="language-json">{
  "eventName": "CreateStack",
  "requestParameters": {
    "stackName": "my-app-stack",
    "templateURL": "https://..."
  },
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AIDACKCEVSQ6C2EXAMPLE:octopus-deploy-12345",
    "arn": "arn:aws:sts::111111111111:assumed-role/DevDeployRole/octopus-deploy-12345"
  },
  "sourceIPAddress": "10.0.5.23"
}
</code></pre>

<p>You can trace the entire chain: pod -> OctoLauncherRole -> DevDeployRole -> CloudFormation action.</p>
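<p>The same events can be pulled from the CLI. A sketch using CloudTrail's event lookup (run in the target account with CloudTrail read access; adjust the time window as needed):</p>

```bash
# Find recent AssumeRole calls and inspect who assumed what
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --max-results 20 \
  --query 'Events[].CloudTrailEvent'
```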

<hr>

<h2 id="9multiaccountstrategyhowmanyroles">9. Multi-Account Strategy: How Many Roles?</h2>

<p>When you have multiple microservices deploying to multiple environments, you need to decide: <strong>how many deployment roles?</strong></p>

<h3 id="option1onedeploymentroleperenvironmentsimplest">Option 1: One Deployment Role Per Environment (Simplest)</h3>

<pre><code>Dev Account (111111111111) -- Octopus runs here  
  +-- OctoLauncherRole
  +-- DevDeployRole  (all 8 microservices use this)

Staging Account (222222222222)  
  +-- StagingDeployRole

Prod Account (333333333333)  
  +-- ProdDeployRole
</code></pre>

<p><strong>Total: 4 roles (launcher + 3 deployment)</strong></p>

<p><strong>Pros:</strong></p>

<ul>
<li>Simple to manage</li>
<li>Fast iteration in Dev/Staging</li>
<li>One Octopus AWS Account config per environment</li>
</ul>

<p><strong>Cons:</strong></p>

<ul>
<li>Every microservice deployment has access to all resources in the account</li>
<li>No per-service blast radius control</li>
<li>Harder to audit "which service did what"</li>
</ul>

<h3 id="option2onerolepermicroserviceperenvironmentmaximumisolation">Option 2: One Role Per Microservice Per Environment (Maximum Isolation)</h3>

<pre><code>Dev Account (111111111111)  
  +-- UserServiceDevRole
  +-- PaymentServiceDevRole
  +-- NotificationServiceDevRole
  +-- AuthServiceDevRole
  +-- InventoryServiceDevRole
  +-- OrderServiceDevRole
  +-- ShippingServiceDevRole
  +-- AnalyticsServiceDevRole

Staging Account (222222222222)  
  +-- (same 8 roles)

Prod Account (333333333333)  
  +-- (same 8 roles)
</code></pre>

<p><strong>Total: 8 microservices x 3 environments = 24 deployment roles (plus 1 launcher in Dev = 25 total)</strong></p>

<p><strong>Pros:</strong></p>

<ul>
<li>Perfect least privilege</li>
<li>UserService can't touch PaymentService resources</li>
<li>Compromised role only affects one service</li>
<li>Clear audit trail per service</li>
</ul>

<p><strong>Cons:</strong></p>

<ul>
<li>25 roles to manage (permission drift risk)</li>
<li>More Octopus configuration (8 AWS Accounts per environment, or 8 per-step role ARN overrides)</li>
</ul>

<h3 id="option3hybridgroupbyblastradiusrecommended">Option 3: Hybrid - Group by Blast Radius (Recommended)</h3>

<pre><code>Dev Account  
  +-- DevDeployRole (all services)

Staging Account  
  +-- StagingDeployRole (all services)

Prod Account  
  +-- ProdDataPlaneRole  (low-risk: users, inventory, shipping, analytics, notifications)
  +-- ProdControlPlaneRole (high-risk: payments, auth, orders)
</code></pre>

<p><strong>Total: 5 roles</strong></p>

<p><strong>ProdControlPlaneRole</strong> has highly restricted permissions + requires manual approval in Octopus before use.</p>

<p><strong>Pros:</strong></p>

<ul>
<li>Balance between security and maintainability</li>
<li>Production gets extra protection where it matters</li>
<li>Dev/Staging stay simple for velocity</li>
</ul>

<p><strong>Cons:</strong></p>

<ul>
<li>Still some shared blast radius in Prod data plane</li>
</ul>

<h3 id="automationawscloudformationstacksets">Automation: AWS CloudFormation StackSets</h3>

<p>To avoid manually creating roles in each account, use <strong>StackSets</strong>:</p>

<p><strong>Template:</strong> <code>deploy-role.yaml</code></p>

<pre><code class="language-yaml">Parameters:  
  Environment:
    Type: String
    AllowedValues: [Dev, Staging, Prod]
  LauncherRoleArn:
    Type: String
    Description: ARN of OctoLauncherRole in account where Octopus runs
  AllowECRPush:
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'true'
    Description: Whether this role can push to ECR (false for Prod)
  AllowCFNDelete:
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'true'
    Description: Whether this role can delete CloudFormation stacks (false for Prod)

Conditions:  
  CanPushECR: !Equals [!Ref AllowECRPush, 'true']
  CanDeleteCFN: !Equals [!Ref AllowCFNDelete, 'true']

Resources:  
  DeployRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${Environment}DeployRole'
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Ref LauncherRoleArn
            Action: sts:AssumeRole

  DeployPolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: !Sub '${Environment}DeployPolicy'
      Roles: [!Ref DeployRole]
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: EKSAccess
            Effect: Allow
            Action:
              - eks:DescribeCluster
              - eks:ListClusters
            Resource: !Sub 'arn:aws:eks:${AWS::Region}:${AWS::AccountId}:cluster/*'
          - Sid: ECRAuth
            Effect: Allow
            Action:
              - ecr:GetAuthorizationToken
            Resource: '*'
          - Sid: ECRPull
            Effect: Allow
            Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:BatchGetImage
              - ecr:GetDownloadUrlForLayer
              - ecr:DescribeRepositories
              - ecr:DescribeImages
              - ecr:ListImages
            Resource: !Sub 'arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/*'
          - Sid: CloudFormationReadWrite
            Effect: Allow
            Action:
              - cloudformation:CreateStack
              - cloudformation:UpdateStack
              - cloudformation:DescribeStacks
              - cloudformation:DescribeStackEvents
              - cloudformation:GetTemplate
              - cloudformation:ValidateTemplate
              - cloudformation:ListStacks
            Resource: !Sub 'arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${Environment}-*/*'
          - Sid: S3ArtifactRead
            Effect: Allow
            Action:
              - s3:GetObject
              - s3:ListBucket
            Resource:
              - !Sub 'arn:aws:s3:::${Environment}-artifacts-*'
              - !Sub 'arn:aws:s3:::${Environment}-artifacts-*/*'

  # Conditional: ECR push (Dev/Staging only, not Prod)
  ECRPushPolicy:
    Type: AWS::IAM::Policy
    Condition: CanPushECR
    Properties:
      PolicyName: !Sub '${Environment}ECRPushPolicy'
      Roles: [!Ref DeployRole]
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: ECRPush
            Effect: Allow
            Action:
              - ecr:PutImage
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
            Resource: !Sub 'arn:aws:ecr:${AWS::Region}:${AWS::AccountId}:repository/*'

  # Conditional: S3 artifact write (Dev/Staging only)
  S3WritePolicy:
    Type: AWS::IAM::Policy
    Condition: CanPushECR
    Properties:
      PolicyName: !Sub '${Environment}S3WritePolicy'
      Roles: [!Ref DeployRole]
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: S3ArtifactWrite
            Effect: Allow
            Action:
              - s3:PutObject
            Resource: !Sub 'arn:aws:s3:::${Environment}-artifacts-*/*'

  # Conditional: CloudFormation delete (Dev/Staging only)
  CFNDeletePolicy:
    Type: AWS::IAM::Policy
    Condition: CanDeleteCFN
    Properties:
      PolicyName: !Sub '${Environment}CFNDeletePolicy'
      Roles: [!Ref DeployRole]
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: CFNDelete
            Effect: Allow
            Action:
              - cloudformation:DeleteStack
            Resource: !Sub 'arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${Environment}-*/*'

Outputs:  
  DeployRoleArn:
    Value: !GetAtt DeployRole.Arn
    Description: ARN for Octopus variable scoping
</code></pre>

<p><strong>Deploy to all accounts with environment-appropriate permissions:</strong></p>

<pre><code class="language-bash"># Create the StackSet
aws cloudformation create-stack-set \
  --stack-set-name deployment-roles \
  --template-body file://deploy-role.yaml \
  --parameters \
    ParameterKey=LauncherRoleArn,ParameterValue=arn:aws:iam::111111111111:role/OctoLauncherRole \
  --capabilities CAPABILITY_NAMED_IAM

# Dev (same account as Octopus -- full permissions)
aws cloudformation create-stack-instances \
  --stack-set-name deployment-roles \
  --accounts 111111111111 \
  --regions us-east-1 \
  --parameter-overrides \
    ParameterKey=Environment,ParameterValue=Dev \
    ParameterKey=AllowECRPush,ParameterValue=true \
    ParameterKey=AllowCFNDelete,ParameterValue=true

# Staging (cross-account -- full permissions)
aws cloudformation create-stack-instances \
  --stack-set-name deployment-roles \
  --accounts 222222222222 \
  --regions us-east-1 \
  --parameter-overrides \
    ParameterKey=Environment,ParameterValue=Staging \
    ParameterKey=AllowECRPush,ParameterValue=true \
    ParameterKey=AllowCFNDelete,ParameterValue=true

# Prod (cross-account -- no ECR push, no CFN delete)
aws cloudformation create-stack-instances \
  --stack-set-name deployment-roles \
  --accounts 333333333333 \
  --regions us-east-1 \
  --parameter-overrides \
    ParameterKey=Environment,ParameterValue=Prod \
    ParameterKey=AllowECRPush,ParameterValue=false \
    ParameterKey=AllowCFNDelete,ParameterValue=false
</code></pre>

<p>The template uses CloudFormation conditions to vary permissions by environment. Prod gets the same base deploy permissions (it must be able to apply CloudFormation and reach EKS) but cannot push images, delete stacks, or write artifacts. Updates to the template roll out to every account the next time you run <code>update-stack-set</code>.</p>

<hr>

<h2 id="10whatchangeswhenoctopusrunselsewhere">10. What Changes When Octopus Runs Elsewhere?</h2>

<p>The pattern is the same whether Octopus runs in EKS, ECS Fargate, EC2, or even on-premises. Only the <strong>Layer 1 (launcher identity mechanism)</strong> changes.</p>

<h3 id="octopusinecsfargate">Octopus in ECS Fargate</h3>

<p><strong>No IRSA/Pod Identity</strong> (those are EKS features)</p>

<p><strong>Instead: ECS Task Role</strong></p>

<h4 id="taskrolevsexecutionroletheyaredifferentthings">Task Role vs Execution Role -- They Are Different Things</h4>

<p>ECS task definitions have <strong>two</strong> role fields, and confusing them is a common source of "why can't my container call AWS APIs" issues:</p>

<ul>
<li><p><strong>Task Role</strong> (<code>taskRoleArn</code>) -- This is the IAM role that your <strong>application code</strong> (Octopus/Calamari) uses at runtime to call AWS APIs. This is your launcher role. Credentials are injected into the container via the <code>AWS_CONTAINER_CREDENTIALS_RELATIVE_URI</code> environment variable, which points to the ECS agent's local metadata endpoint at <code>http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI</code>.</p></li>
<li><p><strong>Execution Role</strong> (<code>executionRoleArn</code>) -- This is the IAM role that the <strong>ECS agent</strong> uses to pull your container image from ECR, send logs to CloudWatch, and retrieve secrets from Secrets Manager or SSM Parameter Store. Your application code never sees or uses this role. It needs <code>ecr:GetAuthorizationToken</code>, <code>ecr:BatchGetImage</code>, <code>logs:CreateLogStream</code>, <code>logs:PutLogEvents</code>, and optionally <code>secretsmanager:GetSecretValue</code> or <code>ssm:GetParameters</code>.</p></li>
</ul>

<p><strong>The key distinction:</strong> The execution role is for ECS infrastructure operations (pull image, push logs). The task role is for your application's AWS API calls. For the Octopus launcher pattern, only the task role matters -- it becomes the launcher identity that Calamari uses to assume deployment roles.</p>

<h4 id="howecscredentialinjectionworks">How ECS Credential Injection Works</h4>

<ol>
<li>When your ECS task starts, the ECS agent sets <code>AWS_CONTAINER_CREDENTIALS_RELATIVE_URI</code> in the container environment (e.g., <code>/v2/credentials/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</code>)  </li>
<li>The AWS SDK's credential chain detects this env var and makes an HTTP GET to <code>http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}</code>  </li>
<li>The ECS agent responds with temporary credentials (access key, secret key, session token) for the task role  </li>
<li>These credentials auto-refresh -- the SDK handles rotation transparently</li>
</ol>

<p>Note: The ECS metadata endpoint is at <code>169.254.170.2</code> (link-local), which is different from the EC2 IMDS at <code>169.254.169.254</code>. If you have code that hardcodes the EC2 metadata IP, it won't work on Fargate.</p>
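<p>You can observe this from inside a running task -- a quick sketch (the response fields shown are what the ECS credential endpoint returns):</p>

```bash
# Run inside the container: fetch the task-role credentials the SDK would use.
# The response is JSON with AccessKeyId, SecretAccessKey, Token,
# Expiration, and RoleArn.
curl -s "http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"
```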

<h4 id="setup">Setup</h4>

<ol>
<li>Create IAM role with trust policy for <code>ecs-tasks.amazonaws.com</code>  </li>
<li>Assign as the <strong>task role</strong> in ECS task definition  </li>
<li>ECS injects credentials via <code>AWS_CONTAINER_CREDENTIALS_RELATIVE_URI</code> env var  </li>
<li>Calamari picks this up via AWS SDK credential chain  </li>
<li>Rest is identical: Calamari uses task role -> assumes deployment roles per step</li>
</ol>

<p><strong>Trust policy:</strong></p>

<pre><code class="language-json">{
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ecs-tasks.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
</code></pre>

<p><strong>ECS task definition:</strong></p>

<pre><code class="language-json">{
  "family": "octopus-server",
  "taskRoleArn": "arn:aws:iam::123456789012:role/OctoLauncherRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/OctopusECSExecutionRole",
  "containerDefinitions": [...]
}
</code></pre>

<h4 id="ecsspecificgotchas">ECS-Specific Gotchas</h4>

<ul>
<li><strong>No <code>169.254.169.254</code> on Fargate:</strong> The EC2 instance metadata service (IMDS) is not available. If any library or script tries to hit the EC2 metadata endpoint, it will time out. Only the ECS credential endpoint at <code>169.254.170.2</code> is available.</li>
<li><strong>VPC configuration matters:</strong> Fargate tasks need network access to STS (for <code>AssumeRole</code> calls). Either place tasks in a subnet with a NAT gateway, or create a VPC endpoint for <code>com.amazonaws.&lt;region&gt;.sts</code>.</li>
<li><strong>Secrets in environment variables:</strong> Use the execution role + Secrets Manager/SSM integration to inject secrets into the container at launch, rather than baking them into the task definition. The execution role handles the secret retrieval before your container starts.</li>
</ul>

<h3 id="octopusonec2">Octopus on EC2</h3>

<p><strong>Use EC2 Instance Role</strong></p>

<ol>
<li>Create IAM role with trust policy for <code>ec2.amazonaws.com</code>  </li>
<li>Attach to EC2 instance via instance profile  </li>
<li>Calamari picks up via IMDS (Instance Metadata Service) at <code>http://169.254.169.254/latest/meta-data/iam/security-credentials/OctoLauncherRole</code>  </li>
<li>Rest is identical</li>
</ol>
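<p>Note that on instances enforcing IMDSv2 (token-required metadata access), the credential fetch needs a session token first. A quick illustration, using the launcher role name from this post:</p>

```bash
# IMDSv2: obtain a session token, then read the role credentials with it
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/OctoLauncherRole"
```

<p>The AWS SDK's credential chain handles this token exchange automatically; the manual curl is only useful for debugging.</p>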

<h3 id="octopusonpremisesnoambientcredentials">Octopus On-Premises (No Ambient Credentials)</h3>

<p><strong>Option 1: Use Octopus OIDC</strong></p>

<ul>
<li>Octopus acts as OIDC issuer</li>
<li>Must expose <code>/.well-known/openid-configuration</code> publicly</li>
<li>Calamari uses Octopus-minted JWT -> <code>AssumeRoleWithWebIdentity</code></li>
</ul>

<p><strong>Option 2: Use External Worker in AWS</strong></p>

<ul>
<li>Octopus Server on-prem orchestrates</li>
<li>Steps run on workers in AWS (EC2/ECS with instance/task roles)</li>
<li>Workers have launcher role, same pattern applies</li>
</ul>

<hr>

<h2 id="11theconceptualshiftfromazure">11. The Conceptual Shift from Azure</h2>

<p>This is the hardest part to internalize if you're coming from Azure:</p>

<h3 id="azuremindset">Azure Mindset</h3>

<blockquote>
  <p>"This pipeline runs as <strong>this service principal / managed identity</strong>; that principal has these permissions on these resources."</p>
</blockquote>

<p>The identity is relatively <strong>static</strong> for the entire pipeline run. You might use multiple service connections, but each stage/job has one identity.</p>

<h3 id="awsoctopusmindset">AWS + Octopus Mindset</h3>

<blockquote>
  <p>"This <strong>step</strong>, at <strong>runtime</strong>, will assume <strong>this role</strong> in <strong>that account</strong>, do its work, and then the credentials expire."</p>
</blockquote>

<p>The identity is <strong>dynamic per step</strong>. The Octopus pod/task has one minimal identity (launcher) that can't do anything itself -- it can only <strong>become</strong> other identities via <code>AssumeRole</code>.</p>

<h3 id="whythisispowerful">Why This Is Powerful</h3>

<p><strong>Distributed runtime orchestration:</strong></p>

<ul>
<li>Octopus Server is pure orchestration -- no AWS permissions needed on the server itself</li>
<li>Calamari handles credential resolution and STS calls per step</li>
<li>Each step gets exactly the permissions it needs, no more</li>
<li>Cross-account is native -- no special configuration needed</li>
<li>Audit trail shows exact role -> role -> action chain</li>
</ul>

<p><strong>Fine-grained control:</strong></p>

<ul>
<li>Dev steps: broad permissions for fast iteration</li>
<li>Staging steps: similar to Dev, maybe with extra validations</li>
<li>Prod steps: read-only + manual approval gates</li>
<li>All from one Octopus installation with one pod identity</li>
</ul>

<p><strong>Defense in depth:</strong></p>

<ul>
<li>Pod compromise = attacker only has launcher role (can't touch resources)</li>
<li>Deployment role compromise = blast radius limited to that account/scope</li>
<li>Each AWS account owner controls their deployment role permissions</li>
<li>Centralized orchestration (Octopus) + distributed authorization (IAM per account)</li>
</ul>

<hr>

<h2 id="12commongotchasandhowtoavoidthem">12. Common Gotchas and How to Avoid Them</h2>

<h3 id="gotcha1mystepsaysaccessdeniedbuttherolehasthepermission">Gotcha 1: "My step says 'Access Denied' but the role has the permission"</h3>

<p><strong>Likely cause:</strong> You're looking at OctoLauncherRole permissions instead of the deployment role</p>

<p><strong>Fix:</strong> Check CloudTrail in the target account to see which role the step actually assumed. Verify that role has the permission.</p>

<h3 id="gotcha2assumerolefailswithnotauthorizedtoperformstsassumerole">Gotcha 2: "AssumeRole fails with 'not authorized to perform sts:AssumeRole'"</h3>

<p><strong>Likely causes:</strong></p>

<ol>
<li>OctoLauncherRole doesn't have <code>sts:AssumeRole</code> permission for that target role ARN</li>
<li>Target role's trust policy doesn't allow OctoLauncherRole to assume it</li>
<li>Typo in role ARN</li>
</ol>

<p><strong>Fix:</strong> Check both the launcher role's permissions and the target role's trust policy.</p>
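<p>Both sides can be checked from the CLI (role names as used throughout this post):</p>

```bash
# Does the target role trust the launcher? Inspect its trust policy:
aws iam get-role --role-name DevDeployRole \
  --query 'Role.AssumeRolePolicyDocument'

# Does the launcher have sts:AssumeRole for that ARN?
# List its inline and attached policies, then inspect the relevant one:
aws iam list-role-policies --role-name OctoLauncherRole
aws iam list-attached-role-policies --role-name OctoLauncherRole
```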

<h3 id="gotcha3stepworksindevbutfailsinprodwithsamecode">Gotcha 3: "Step works in Dev but fails in Prod with same code"</h3>

<p><strong>Likely cause:</strong> ProdDeployRole has more restrictive permissions than DevDeployRole</p>

<p><strong>Fix:</strong> This is by design. Check Prod role permissions and adjust or use manual approval + elevated role for Prod changes.</p>

<h3 id="gotcha4irsapodidentitynotworkingcalamaricantfindcredentials">Gotcha 4: "IRSA/Pod Identity not working - Calamari can't find credentials"</h3>

<p><strong>Check:</strong>
1. Is the service account annotated correctly? (<code>kubectl describe sa octopus-server -n octopus</code>) <br>
2. Are env vars injected in the pod? (<code>kubectl exec -it &lt;pod&gt; -- env | grep AWS</code>) <br>
3. Is the pod using the right service account? (<code>kubectl get pod &lt;pod&gt; -o yaml | grep serviceAccountName</code>) <br>
4. Does the IAM role trust policy match the exact service account and OIDC issuer?</p>
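<p>For Pod Identity specifically, it's also worth confirming the association exists on the EKS side (cluster and namespace names are examples):</p>

<pre><code class="language-bash"># Should list an association mapping the octopus-server service account
# to an IAM role; an empty list means the association was never created
aws eks list-pod-identity-associations \
  --cluster-name octopus-cluster \
  --namespace octopus
</code></pre>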

<h3 id="gotcha5crossaccountassumeroleworksfromawsclibutfailsinoctopus">Gotcha 5: "Cross-account AssumeRole works from AWS CLI but fails in Octopus"</h3>

<p><strong>Likely cause:</strong> External ID mismatch or session duration too long</p>

<p><strong>Fix:</strong>
- Don't use external IDs for Octopus role assumptions (not needed for service-to-service)
- Check whether the target role has a max session duration configured, and make sure Octopus isn't requesting a longer session</p>
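<p>The target role's configured maximum can be read directly (the role name is an example):</p>

<pre><code class="language-bash"># Returns the role's max session duration in seconds (default is 3600)
aws iam get-role --role-name ProdDeployRole \
  --query 'Role.MaxSessionDuration'
</code></pre>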

<h3 id="gotcha6mydeploymentworkedyesterdaybutfailstoday">Gotcha 6: "My deployment worked yesterday but fails today"</h3>

<p><strong>Likely cause:</strong> Temporary credentials expired and Calamari is reusing cached creds</p>

<p><strong>Fix:</strong> This shouldn't happen -- Calamari calls STS per step. But check if you have any caching in custom scripts or environment variable exports that persist across steps.</p>

<hr>

<h2 id="13bestpracticessummary">13. Best Practices Summary</h2>

<h3 id="security">Security</h3>

<ol>
<li><strong>Launcher role has zero resource permissions</strong> - only <code>sts:AssumeRole</code>  </li>
<li><strong>Deployment roles use least privilege</strong> - exactly what each environment needs  </li>
<li><strong>Prod roles are tightly scoped, not broad</strong> - write access only where deployment needs it (CloudFormation, EKS); no ECR push, no stack deletion, with manual approval gates  </li>
<li><strong>Use Pod Identity over IRSA</strong> - simpler, more reliable in private clusters  </li>
<li><strong>Enable CloudTrail in all accounts</strong> - track the full role assumption chain  </li>
<li><strong>Create an STS VPC interface endpoint</strong> - deploy a <code>com.amazonaws.&lt;region&gt;.sts</code> VPC endpoint so that all <code>AssumeRole</code> and <code>AssumeRoleWithWebIdentity</code> calls stay within the AWS private network and never traverse the public internet. This is defense-in-depth for sensitive environments and eliminates the need for a NAT gateway for STS traffic:</li>
</ol>

<pre><code class="language-bash">aws ec2 create-vpc-endpoint \  
  --vpc-id vpc-xxxxxxxxx \
  --service-name com.amazonaws.us-east-1.sts \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-aaa subnet-bbb \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled
</code></pre>

<p>With <code>--private-dns-enabled</code>, the default <code>sts.us-east-1.amazonaws.com</code> hostname resolves to the private endpoint IP within your VPC. No SDK or application changes needed -- all STS calls automatically route privately.</p>
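<p>A quick sanity check from a host inside the VPC (the region is an example):</p>

<pre><code class="language-bash"># With private DNS enabled, this should resolve to a private IP
# from your subnets, not a public AWS address
nslookup sts.us-east-1.amazonaws.com

# And STS calls should succeed even with no NAT gateway or internet egress
aws sts get-caller-identity
</code></pre>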

<ol start="7">
<li><strong>Use AWS Organizations Service Control Policies (SCPs)</strong> to enforce that deployment roles can only be assumed by your specific launcher role ARN. SCPs act at the Organizations level and override even account-admin IAM policies, providing an organizational security boundary that complements the per-role trust policies:</li>
</ol>

<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyAssumeDeployRolesExceptLauncher",
    "Effect": "Deny",
    "Action": "sts:AssumeRole",
    "Resource": [
      "arn:aws:iam::*:role/*DeployRole"
    ],
    "Condition": {
      "StringNotEquals": {
        "aws:PrincipalArn": "arn:aws:iam::111111111111:role/OctoLauncherRole"
      }
    }
  }]
}
</code></pre>

<p>This ensures that even if an account admin creates a permissive IAM policy, they cannot assume the deployment roles unless they are the designated launcher. Combine with trust policies for defense-in-depth.</p>
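<p>Attaching the SCP is a one-time Organizations operation, run from the management account. The policy file name, policy ID, and OU ID below are placeholders:</p>

<pre><code class="language-bash"># Create the SCP from the JSON above and attach it to the OU
# containing your workload accounts
aws organizations create-policy \
  --name DenyAssumeDeployRolesExceptLauncher \
  --description "Only OctoLauncherRole may assume *DeployRole" \
  --type SERVICE_CONTROL_POLICY \
  --content file://deny-assume-scp.json

aws organizations attach-policy \
  --policy-id p-examplepolicyid \
  --target-id ou-exampleouid
</code></pre>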

<blockquote>
  <p><strong>SCP caveats:</strong> (1) SCPs do not apply to the management account in AWS Organizations -- if your launcher or deployment roles exist in the management account, this SCP has no effect there. Always place workloads in member accounts. (2) In a role chain (e.g., OctoLauncherRole assumes DevDeployRole), <code>aws:PrincipalArn</code> reflects the <em>calling</em> role at each hop, not the original initiator. If your deployment steps involve further role chaining beyond the two-layer pattern, the SCP condition behavior can be surprising -- test the exact evaluation in your role chain before relying on this SCP as a sole control.</p>
</blockquote>

<h3 id="operational">Operational</h3>

<ol>
<li><strong>Use StackSets to deploy roles</strong> - consistency across accounts, easy updates  </li>
<li><strong>One Octopus AWS Account per environment</strong> - Dev, Staging, Prod configs  </li>
<li><strong>Document role ARN in deployment process</strong> - make it clear which role each step uses  </li>
<li><strong>Use descriptive role session names</strong> - <code>octopus-deploy-{deployment-id}</code> helps in CloudTrail  </li>
<li><strong>Set reasonable session durations</strong> - default 1 hour is sensible for most steps; max is 12 hours but role chaining caps at 1 hour</li>
</ol>
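<p>For example, a script step that assumes a role manually might build the session name from the Octopus deployment ID (the ARN and variable values are illustrative):</p>

<pre><code class="language-bash"># The session name appears in CloudTrail as part of the assumed-role ARN,
# so each deployment is traceable end to end
DEPLOYMENT_ID="deployments-1234"
aws sts assume-role \
  --role-arn arn:aws:iam::333333333333:role/ProdDeployRole \
  --role-session-name "octopus-deploy-${DEPLOYMENT_ID}" \
  --duration-seconds 3600
</code></pre>

<p>Note that session names must match <code>[\w+=,.@-]</code> and stay under 64 characters.</p>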

<h3 id="organizational">Organizational</h3>

<ol>
<li><strong>Each AWS account owner controls their deployment role</strong> - central Octopus, distributed authorization  </li>
<li><strong>Group microservices by blast radius</strong> - not every service needs its own role  </li>
<li><strong>Start simple, add granularity as needed</strong> - one role per environment, split later if needed  </li>
<li><strong>Use AWS Organizations</strong> - centralized billing, easier StackSet deployment</li>
</ol>

<hr>

<h2 id="14quickreferencedecisiontrees">14. Quick Reference: Decision Trees</h2>

<h3 id="whichlaunchermechanismshouldiuse">"Which launcher mechanism should I use?"</h3>

<pre><code class="language-mermaid">flowchart TD  
    A{"Where does\nOctopus run?"} --&gt;|EKS| B{"Private cluster?"}
    B --&gt;|Yes| C{"Pod Identity\navailable?"}
    C --&gt;|Yes| D["EKS Pod Identity\n(recommended)"]
    C --&gt;|No| E{"Can add Route 53\nresolver for OIDC?"}
    E --&gt;|Yes| F["IRSA + Route 53\nresolver workaround"]
    E --&gt;|No| G["Octopus K8s Agent\n(poll mode, no OIDC needed)"]
    B --&gt;|No| H{"Existing OIDC\nprovider setup?"}
    H --&gt;|Yes| I["IRSA\n(works fine)"]
    H --&gt;|No| D
    A --&gt;|ECS Fargate| J["ECS Task Role"]
    A --&gt;|EC2| K["EC2 Instance Role\n(via Instance Profile)"]
    A --&gt;|On-Premises| L{"Can expose public\nOIDC endpoint?"}
    L --&gt;|Yes| M["Octopus OIDC"]
    L --&gt;|No| N["External Workers in AWS\nor K8s Agent (poll mode)"]
</code></pre>

<h3 id="howmanydeploymentrolesdoineed">"How many deployment roles do I need?"</h3>

<pre><code class="language-mermaid">flowchart TD  
    A{"How many\nenvironments?"} --&gt; B["Dev + Staging + Prod\n= 3 base roles"]
    B --&gt; C{"How many\nmicroservices?"}
    C --&gt;|"&lt; 5"| D["Shared role per env\n(3 total + 1 launcher = 4)"]
    C --&gt;|"5-10"| E["Shared in Dev/Staging\nSplit Prod by risk\n(4-5 total)"]
    C --&gt;|"&gt; 10"| F["Role per service or\ngroup by domain\n(10-20 total)"]
    E --&gt; G{"High-sensitivity\nworkloads?"}
    G --&gt;|"Payments, PII, Auth"| H["Dedicated role +\nmanual approval"]
    G --&gt;|"Analytics, Notifications"| I["Can share role"]
</code></pre>

<h3 id="mystepisfailingwheredoilook">"My step is failing -- where do I look?"</h3>

<pre><code class="language-mermaid">flowchart TD  
    A["Step Failed"] --&gt; B["Check Octopus\ndeployment log"]
    B --&gt; C{"Shows which role\nwas assumed?"}
    C --&gt;|Yes| D["Check CloudTrail\nin target account"]
    C --&gt;|No| E["Credential resolution\nfailed - check IRSA/\nPod Identity setup"]
    D --&gt; F{"AssumeRole\nsuccessful?"}
    F --&gt;|No| G["Check trust policy +\nlauncher permissions"]
    F --&gt;|Yes| H{"API call\nattempted?"}
    H --&gt;|Yes| I{"AccessDenied?"}
    I --&gt;|Yes| J["Check deployment\nrole permissions"]
    I --&gt;|No| K["Different error -\ncheck API params"]
    H --&gt;|No| L["Credential injection\nfailed - check Calamari logs"]
</code></pre>

<hr>

<h2 id="conclusion">Conclusion</h2>

<p>At first glance, the AWS + Octopus + EKS pattern for multi-account deployments looks more complex than Azure's managed identity model. But once you internalize the two-layer pattern -- launcher role + per-step deployment roles -- it becomes extremely powerful:</p>

<ul>
<li><strong>Octopus orchestrates</strong>, but never holds deployment permissions itself</li>
<li><strong>Calamari resolves credentials</strong> dynamically per step via STS</li>
<li><strong>IRSA/Pod Identity</strong> provides the bootstrap launcher identity</li>
<li><strong>IAM roles per account</strong> encode exactly what each environment/service can do</li>
<li><strong>Cross-account is native</strong> -- no special setup, just trust policies</li>
<li><strong>Private clusters have options</strong> -- Pod Identity, Route 53 resolver workarounds, or the Octopus Kubernetes Agent in poll mode</li>
<li><strong>STS VPC endpoints and SCPs</strong> add defense-in-depth at the network and organization level</li>
</ul>

<p>The mental shift from "this pipeline runs as this identity" to "this step will assume this role at runtime" unlocks: <br>
- Fine-grained, per-step authorization
- Defense in depth (pod compromise != resource access)
- Distributed ownership (each account controls its deployment role)
- Centralized orchestration with decentralized permissions</p>

<p>Whether you use IRSA, Pod Identity, ECS Task Roles, EC2 instance roles, or the Octopus Kubernetes Agent as your launcher mechanism, the pattern remains the same. Master this model and multi-account, multi-environment AWS deployments become manageable, secure, and auditable.</p>

<hr>

<h3 id="corrections">Corrections</h3>

<p>On April 1, 2026, the following corrections were applied based on technical review:</p>

<ul>
<li><strong>Section 6 diagram:</strong> Calamari processes were incorrectly shown inside the Dev and Prod account VPCs. Corrected to show them inside the Octopus pod in the Dev account (where Octopus runs), since Calamari runs as a subprocess of Octopus Server and makes remote API calls to target accounts via STS-assumed credentials.</li>
<li><strong>Section 7 "EC2 instance" label:</strong> Added clarification that the "Execute using the AWS service role for an EC2 instance" option in Octopus UI is misleadingly named -- it actually means "use the SDK default credential chain" and works for IRSA, Pod Identity, and ECS Task Roles, not just EC2.</li>
<li><strong>Section 4.2 Pod Identity mechanics:</strong> Tightened the credential delivery description. The Pod Identity Agent exposes a local endpoint at <code>169.254.170.23:80</code>, credentials are discovered via <code>AWS_CONTAINER_CREDENTIALS_FULL_URI</code> env var, and the SDK handles everything transparently -- pods don't explicitly query the agent.</li>
<li><strong>Section 13 SCP caveats:</strong> Added footnote that SCPs don't apply to the management account and that <code>aws:PrincipalArn</code> reflects the calling role at each hop in a role chain, which can produce surprising behavior beyond the two-layer pattern.</li>
<li><strong>Section 10 ECS expansion:</strong> Added Task Role vs Execution Role distinction, explained <code>AWS_CONTAINER_CREDENTIALS_RELATIVE_URI</code> and the <code>169.254.170.2</code> metadata endpoint, clarified differences from EC2 IMDS (<code>169.254.169.254</code>), and added ECS-specific gotchas.</li>
</ul>

<p>On March 31, 2026, the following corrections were applied:</p>

<ul>
<li><strong>STS session duration</strong>: Originally stated "15 minutes to 1 hour". Corrected to default 1 hour, configurable up to 12 hours. Role chaining is capped at 1 hour regardless of the role's max session duration setting. (<a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html">AWS STS AssumeRole API Reference</a>)</li>
<li><strong>Octopus UI navigation path</strong>: Originally stated <code>Infrastructure -&gt; Accounts</code>. Corrected to <code>Deploy -&gt; Manage -&gt; Accounts</code> per current Octopus Deploy documentation. (<a href="https://octopus.com/docs/infrastructure/accounts/aws">Octopus AWS Accounts docs</a>)</li>
<li><strong>Kubernetes Agent Helm chart</strong>: Originally used <code>octopusdeploy/kubernetes-agent</code> as a traditional Helm repo reference. Corrected to <code>oci://registry-1.docker.io/octopusdeploy/kubernetes-agent</code> which is the OCI registry path used in the official installation wizard. (<a href="https://octopus.com/docs/kubernetes/targets/kubernetes-agent">Octopus Kubernetes Agent docs</a>)</li>
</ul>

<p>On April 2, 2026 the following corrections were applied based on technical review:</p>

<ul>
<li><strong>Account topology:</strong> Article originally assumed Octopus runs in a separate "Tooling Account" (123456789012). Corrected throughout to reflect that Octopus runs in the Dev account (111111111111). When deploying to Dev, the launcher and deployment role are in the same account (same-account AssumeRole). When promoting to Staging/Prod, those are separate AWS accounts requiring cross-account AssumeRole.</li>
<li><strong>Section 5 intro (deployment process model):</strong> Original bullet list implied steps A/B/C/D each targeted different environments (Dev/Staging/Prod) sequentially within one deployment. This is incorrect. In Octopus Deploy, all steps in a single deployment execute against the same environment. Per-step role ARNs are for different permission scopes <em>within</em> the same environment (e.g., CloudFormation access vs ECR access), not for targeting different accounts sequentially. Environment promotion is what changes the target account.</li>
<li><strong>Section 6 diagrams:</strong> Replaced single diagram (showing Calamari A hitting Dev and Calamari B hitting Prod simultaneously) with two diagrams: one showing same-account deployment to Dev, one showing cross-account promotion to Prod. Both show all steps targeting the same environment.</li>
<li><strong>Section 7 (Octopus configuration):</strong> Rewrote to explain Octopus variable scoping as the mechanism for multi-account targeting. Instead of separate AWS Account configs per environment, a single AWS Account with an environment-scoped variable (<code>#{AWS.DeployRoleArn}</code>) resolves to the correct role ARN based on which environment the release is deployed to.</li>
<li><strong>CloudFormation StackSets:</strong> Updated launcher role ARN references from Tooling account to Dev account.</li>
<li><strong>Section 5.2 permission policies (all three environments):</strong> Replaced wildcard permissions (<code>eks:*</code>, <code>ecr:*</code>, <code>cloudformation:*</code>, <code>s3:*</code>) with specific least-privilege actions. Dev/Staging/Prod now show exact IAM actions required for each service (e.g., <code>ecr:PutImage</code> + <code>ecr:InitiateLayerUpload</code> instead of <code>ecr:*</code>), with resources scoped to the specific account.</li>
<li><strong>Prod read-only contradiction:</strong> Previous version gave Prod only read permissions, which contradicts automated deployment. Fixed: Prod now has write permissions for CloudFormation and EKS access (necessary for deployment) but tightened controls: no ECR push (Prod pulls images built in Dev/Staging), no CloudFormation DeleteStack, S3 read-only. Safety comes from Octopus manual approval gates and namespace-scoped Kubernetes RBAC, not IAM read-only.</li>
<li><strong>StackSet PowerUserAccess:</strong> Replaced <code>arn:aws:iam::aws:policy/PowerUserAccess</code> with parameterized inline policies using CloudFormation conditions (<code>AllowECRPush</code>, <code>AllowCFNDelete</code>). Prod gets restricted permissions via parameter overrides during stack instance creation.</li>
<li><strong>New section 5.4 (kubectl/RBAC gap):</strong> Added explanation that IAM permissions alone do not grant Kubernetes API access. Deployment roles must also be mapped via EKS access entries (recommended) or the <code>aws-auth</code> ConfigMap. Includes examples for both mechanisms, namespace-scoped RBAC for Prod, and the two-layer authorization model (IAM → cluster, RBAC → namespaces).</li>
</ul>]]></content:encoded></item><item><title><![CDATA[When CI/CD Is Still Too Slow]]></title><description><![CDATA[How iteration infrastructure becomes a strategic competitive advantage through invisible systems that enable learning faster than deploying faster.]]></description><link>http://andypotanin.com/when-cicd-isnt-enough-rapid-iteration-dependency/</link><guid isPermaLink="false">90f8c314-7c24-4853-a12a-aba205423908</guid><category><![CDATA[developer-experience]]></category><category><![CDATA[infrastructure]]></category><category><![CDATA[team-velocity]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Thu, 12 Mar 2026 22:02:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern teams push code, CI/CD executes, infrastructure updates automatically. Yet development velocity can collapse instantly when a single invisible system goes offline. The issue isn't deployment - it's iteration speed. </p>

<p>Local development environments are comfortable lies. Docker on laptops can't replicate network latency, service mesh behavior, authentication flows, or data at scale. <strong>Developers gravitate toward environments that mirror reality - and they want their local tools to work seamlessly with remote infrastructure.</strong></p>

<p>Our solution: each developer gets their own ephemeral container environment - identical to production but completely isolated. SSH in, edit files with local IDEs that sync via SFTP, see results immediately. Developers work in VSCode, IntelliJ, or Vim locally while changes appear instantly in remote containers. When ready, changes flow through GitOps. The distance between thought and validation approaches zero.</p>

<p>The architecture: GitHub accounts connect directly to containers across multiple clusters. Same services, data patterns, networking as production - but completely separate. IDEs treat remote containers like local filesystems while offloading compute to cloud infrastructure. We eliminated VPN setup, manual provisioning, key distribution. Access is automatic, secure, instantly revocable. </p>

<pre><code>ssh my-github-repo-and-feature-branch@ssh.udx.dev  
</code></pre>

<p>The command uses GitHub keys to grant access to personal containers for repositories you can access.</p>

<p>Behind this simplicity: the gateway dynamically fetches collaborator lists via GitHub's API, embeds routing information in SSH authorized_keys files, then proxies sessions through kubectl exec. Same authentication works for SFTP. Add someone to GitHub, they get access. Remove them, access disappears instantly.</p>
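<p>A sketch of what one generated <code>authorized_keys</code> entry might look like. The forced command, namespace, and pod names here are hypothetical; the real gateway embeds its own routing data:</p>

<pre><code># Force every connection from this key into a kubectl exec session,
# regardless of what command the client requested
command="kubectl exec -i -n dev-andy deploy/workspace -- /bin/bash",no-port-forwarding,no-x11-forwarding ssh-ed25519 AAAAC3...example andy@github
</code></pre>

<p>The <code>command=</code> option is standard OpenSSH behavior: it replaces whatever the client asked to run, which is how a plain SSH gateway can proxy sessions into per-repository containers.</p>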

<p><strong>Developers' local machines become pure interfaces to cloud compute</strong>. But when this infrastructure disappeared during a migration, development velocity collapsed. Despite comprehensive CI/CD pipelines, developers stopped working and waited for access to be restored.</p>

<p><strong>CI/CD optimizes for deployment confidence, not iteration speed.</strong> The commit-push-wait-review-deploy-test cycle introduces friction that kills experimental workflows. Direct access enables "exploration mode" - trying variations, following curiosity wherever it leads. Formal deployment forces "specification mode" - careful planning before testing. </p>

<p>It's the difference between conversation and correspondence.</p>

<p>Without rapid iteration, development changes from experimental to defensive. Instead of testing multiple approaches, teams over-plan, over-discuss, over-design. Code becomes theoretical rather than empirical. Remove the space between formal processes and you change development's character entirely.</p>

<h2 id="theinfrastructureasymmetry">The Infrastructure Asymmetry</h2>

<p>Different development activities require different infrastructure support. Deployment infrastructure optimizes for reliability, security, auditability. Iteration infrastructure optimizes for speed, flexibility, immediacy.</p>

<p>Teams with identical codebases and deployment practices perform dramatically differently based solely on iteration infrastructure. Those with immediate access to realistic environments consistently ship more innovative features, debug issues faster, adapt to changing requirements more effectively.</p>

<p>When iteration is fast for everyone, teams develop different working patterns - synchronous collaboration on complex problems, willingness to experiment with architectural changes, responsiveness to user feedback. Infrastructure shapes team dynamics, not just individual productivity.</p>

<p>Consider the compounding effects:</p>

<ul>
<li><strong>30-second feedback loops</strong> vs 30-minute cycles</li>
<li><strong>Exploration mode</strong> vs specification mode</li>
<li><strong>Network effects</strong> where individual velocity enables team velocity</li>
<li><strong>Talent magnetism</strong> where developers prefer frictionless environments</li>
</ul>

<p>The mathematics are stark. In one hour:</p>

<ul>
<li><strong>Rapid iteration</strong>: 30-second cycles = 120 tests</li>
<li><strong>Traditional deployment</strong>: 30-minute cycles = 2 tests</li>
</ul>

<p>Over a typical day, one developer can run:</p>

<ul>
<li><strong>Rapid</strong>: 120 × 8 hours = 960 experiments  </li>
<li><strong>Traditional</strong>: 2 × 8 hours = 16 experiments</li>
</ul>

<p>The ratio is 60:1 per developer per day.</p>

<p>Scale this to a 5-person team over one week:</p>

<ul>
<li><strong>Rapid team</strong>: 960 × 5 developers × 5 days = 24,000 experiments</li>
<li><strong>Traditional team</strong>: 16 × 5 developers × 5 days = 400 experiments</li>
</ul>

<p>The gap is 60:1, but the qualitative difference is exponential because each successful experiment informs the next. Fast teams explore solution spaces that slow teams never discover. They share discoveries in real time, creating collaborative learning loops that traditional deployment cycles cannot replicate.</p>
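<p>The arithmetic behind these numbers fits in a few lines of shell:</p>

<pre><code class="language-bash"># Experiments per hour at each cycle time
rapid_per_hour=$(( 3600 / 30 ))   # 30-second loops: 120 per hour
trad_per_hour=$(( 60 / 30 ))      # 30-minute cycles: 2 per hour

# 8-hour days, 5 developers, 5-day week
rapid_week=$(( rapid_per_hour * 8 * 5 * 5 ))
trad_week=$(( trad_per_hour * 8 * 5 * 5 ))

echo "rapid: ${rapid_week}, traditional: ${trad_week}, ratio: $(( rapid_week / trad_week )):1"
# prints: rapid: 24000, traditional: 400, ratio: 60:1
</code></pre>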

<p>This 60:1 advantage creates invisible competitive moats. While competitors focus on scaling deployment capabilities, the real advantage lies in scaling learning capabilities.</p>

<p>Teams with superior iteration infrastructure become talent magnets. Developers who experience 24,000 weekly experiments struggle to work in environments limited to 400. They value velocity over process theater. </p>

<p>When iteration infrastructure is invisible, reliable, immediate, developers don't realize they're using it. <strong>They just move faster than everyone else.</strong></p>]]></content:encoded></item><item><title><![CDATA[Building a Better Court Booking Experience]]></title><description><![CDATA[<p>One tennis court converts into four pickleball courts. That single fact has reshaped the economics of racquet sports in America. A facility that used to serve 4 players on one court can now serve 16 simultaneously in the same footprint. The Santa Monica Pickleball Center reported seven times as much</p>]]></description><link>http://andypotanin.com/building-a-better-court-booking-experience/</link><guid isPermaLink="false">2169f0e2-7159-47cc-bfd7-78c3bea44e9b</guid><category><![CDATA[pickleball]]></category><category><![CDATA[booking]]></category><category><![CDATA[UX]]></category><category><![CDATA[CourtReserve]]></category><category><![CDATA[Peak CLT]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Fri, 27 Feb 2026 05:21:00 GMT</pubDate><content:encoded><![CDATA[<p>One tennis court converts into four pickleball courts. That single fact has reshaped the economics of racquet sports in America. A facility that used to serve 4 players on one court can now serve 16 simultaneously in the same footprint. The Santa Monica Pickleball Center reported seven times as much revenue after converting from tennis. Across the country, converted facilities are seeing utilization rates between 75% and 92% — numbers most tennis operations never touched. Pickleball participation hit 19.8 million Americans in 2024, a 45.8% jump from the year before, and the infrastructure is still playing catch-up. The industry needs an estimated $855 million in new court construction just to meet current demand.</p>

<p>But here's the thing about indoor pickleball facilities: they have the cost structure of a gym and the inventory problem of an airline. Leases, HVAC, lighting, staffing, insurance — a 10-court indoor facility can easily run $10,000 to $100,000 a month in rent alone depending on the market, before you turn the lights on. And court time is perishable. An empty court at 2:30 PM on a Tuesday is revenue that's gone forever, same as an empty hotel room or an unsold airline seat. The hotel industry figured this out decades ago with revenue management — dynamic pricing, yield optimization, the whole discipline built around the idea that perishable inventory demands a different approach. Hotels track RevPAR (Revenue Per Available Room). Airlines track revenue per available seat mile. For court facilities, the equivalent metric is RevPACH — Revenue Per Available Court Hour. A 1% improvement in utilization at a 10-court facility charging $20/hour translates to roughly $18,000 in additional annual revenue. At scale, the booking experience isn't a nice-to-have — it's the revenue lever.</p>

<p><img src="https://stateless-udx-io.imgix.net/2026/02/ad3868b6-picklegrid-court-booking-peak-time-charlotte.gif" alt="PickleGrid — browsing court availability at Peak Time Pickleball in Charlotte"></p>

<p>The data on checkout friction is unambiguous. Baymard Institute found that 18% of online shoppers abandon checkout because the process is too long, and forced account creation increases abandonment by 35%. On mobile — where most players are browsing — cart abandonment hits 85.65%. Stripe's research shows that Apple Pay increases checkout conversion by 22.3% on average, with some implementations seeing a 58% lift over traditional credit card forms. Every form field, every verification step, every redirect is a leak in the funnel. The average ecommerce checkout has 5.1 steps and 11.3 form fields. Baymard estimates that better checkout design alone could recover $260 billion in lost orders across US and EU ecommerce. The same physics apply to court booking — probably more so, because a $20 court reservation has a much lower commitment threshold than a $200 purchase, which means the tolerance for friction is even lower.</p>

<p><a href="https://courtreserve.com/">CourtReserve</a> is the dominant platform in this space, serving over 2,000 facilities with scheduling, memberships, payments, events, and player communications. They recently secured <a href="https://www.thedinkpickleball.com/pickleball-platform-courtreserve-secures-54m-investment-the-dream-is-to-scale/">$54 million in funding</a> to scale the platform, and one of the first major features out of that investment is <a href="https://intercom.help/courtreserve/en/articles/11787240-public-booking-setup-guide-and-overview">Public Booking</a> — a flow that lets non-members reserve courts and register for events without creating an account. Clubs get a shareable link for their website, social media, Google Business listings, QR codes. Some early adopters saw up to $1,000 in new revenue in the first 30 days, with public bookings accounting for up to 13% of total court reservations. As Devan Egan from Club Pickleball USA put it: "We underestimated how many visitors wanted to book a court instantly. No calls, no accounts, no apps." That's demand that was already there but had nowhere frictionless to go.</p>

<p>We've been working on this same problem at <a href="https://peakclt.com">Peak Time Pickleball</a> in Charlotte. When we started building our booking system — we call it PickleGrid — the question wasn't just "how do we let guests book?" It was bigger: <strong>what does a player actually need to see before they commit to a court?</strong></p>

<p>The answer turned out to be more than a list of available time slots. Research on visual booking interfaces shows that spatial context meaningfully impacts conversion — Booking.com attributes its industry-leading conversion rates partly to map-based search, and one hotel saw a 52% higher completion rate after switching to a visual calendar. When you land on our Book a Court page, you're looking at an isometric map of the entire facility. Every court is labeled. Every court is color-coded by what's happening on it — Open Play is orange, leagues are green, clinics and classes are teal, private bookings are red. You're not reading a grid. You're seeing the building. There's a time slider at the bottom that lets you scrub through the day and watch availability shift across all courts at once. Pair that with the day picker and you can scan an entire week in seconds.</p>

<p><img src="https://stateless-udx-io.imgix.net/2026/02/73b3eeb9-picklegrid-pickleball-court-booking.png" alt="Peak CLT PickleGrid — facility map with real-time court status"></p>

<p>Double-click any court and a detail panel opens showing every slot for that day — what's available, what's booked, what type of activity is running, and the price for each slot. Prices shift based on demand. Peak hours cost more, off-peak costs less. You see all of it before you commit to anything. Nearly half of all shoppers bail when they encounter unexpected costs at checkout — we show you the price before you even start the booking process.</p>

<p><img src="https://stateless-udx-io.imgix.net/2026/02/f28dd3da-picklegrid-pickleball-court-booking-detail.png" alt="Court detail view — slot-by-slot availability and pricing"></p>

<p>The part I'm most proud of is how we handle identity. There's no "log in or book as guest" fork in the road. Baymard found that 62% of sites fail to make guest checkout the most prominent option — we eliminated the choice entirely. Everyone lands on the same page. You pick your court, pick your time, enter your name and email. If you're a guest, the system shows you guest pricing — $20/hour — and a blue banner that says "You are a Guest Player" with a note about how much members save. Industry data shows guest-to-member conversion rates at sports facilities average 15-25%, and the best-performing facilities get there through low-friction exposure, not hard sells. Our banner is the upsell — visible, honest, not a gate.</p>

<p>But if you enter an email associated with a membership, everything changes instantly. The banner turns green, confirms your membership tier, and the rate drops to $10/hour. No login. No password. No redirect. Same page, same flow. The system recognized you and adjusted pricing on the spot.</p>

<p><img src="https://stateless-udx-io.imgix.net/2026/02/26ad7402-picklegrid-pickleball-court-booking-member-pricing.png" alt="Inline member detection — guest pricing adjusts to verified member rate"></p>

<p>Payment is one step. We support Apple Pay, Google Pay, and credit card natively. Given that Apple Pay users complete transactions at a 50% rate compared to 30% for standard credit card forms — and check out 50% faster — one-tap payment on mobile isn't a nice-to-have. It's the difference between a booking and a bounce. A guest can tap Apple Pay and be done in under a minute. No phone number. No SMS verification code. No 15-minute countdown timer. The payment itself is the verification.</p>

<p>CourtReserve's Public Booking approach makes sense for their position. They're building for thousands of clubs with varying technical capabilities, so they optimize for security and broad compatibility. SMS verification ensures bookings are legitimate. A 15-minute payment window prevents abandoned holds. The hosted URL means any club can be live in minutes without touching their website. They also support event registration through the public flow — for clubs running clinics, leagues, and tournaments, letting non-members register and pay without an account is genuinely valuable. The custom confirmation page where clubs can add cancellation policies and parking instructions cuts down on post-booking support questions.</p>

<p>The difference comes down to what you're optimizing for. CourtReserve is raising the floor for clubs that previously had nothing — and for an industry that still needs 25,800 new courts, that matters enormously. We're trying to raise the ceiling for what a booking experience can feel like when you treat court time like the perishable, high-value inventory it actually is. When you're competing for attention with everything else on someone's phone, the <a href="https://udx.io/guidance/destination-events-marketing-tactics">experience is the marketing</a>. Players want to see what's available, pick a spot, pay, and get on with their day. The closer we get to eliminating the steps between "I want to play" and "I have a court," the more courts get filled — and in a business where every empty court hour is gone forever, that's the whole game.</p>]]></content:encoded></item><item><title><![CDATA[Claude Operator Prompt (Knifehand)]]></title><description><![CDATA[<p>This is prompt/system message I use with Claude to make it less lazy:</p>

<pre><code>OPERATING MODE  
You are an autonomous technical operator. Your job is to achieve the objective end-to-end, not just answer questions.

Default behaviors:  
- Default to action: implement and report outcomes rather than only suggesting.
- Persist</code></pre>]]></description><link>http://andypotanin.com/claude-operator-prompt/</link><guid isPermaLink="false">26c98183-6932-4efa-8bc7-2fc68b23e342</guid><category><![CDATA[AI]]></category><category><![CDATA[claude]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Tue, 09 Dec 2025 13:23:18 GMT</pubDate><content:encoded><![CDATA[<p>This is prompt/system message I use with Claude to make it less lazy:</p>

<pre><code>OPERATING MODE  
You are an autonomous technical operator. Your job is to achieve the objective end-to-end, not just answer questions.

Default behaviors:  
- Default to action: implement and report outcomes rather than only suggesting.
- Persist through obstacles: when a step fails, inspect the error, adjust, and try again until you've exhausted safe, reasonable options.
- Think step-by-step. Show only reasoning that is useful, unless asked for full detail.

PHASE 1 – INTENT  
Extract the mission in one precise sentence.  
- List any ambiguities. If they block execution, ask targeted clarification questions; otherwise, state your working assumptions explicitly.

PHASE 2 – RECON  
Gather enough context to act effectively (not "everything," but everything that materially changes the plan).  
- Prioritize: start with the most likely relevant sources, then expand only if needed.
- Distinguish facts from hypotheses. Label uncertainty and what you would check next.

PHASE 3 – MAP  
Build a compact model of the system/problem.  
- Key components, main flows, important conditions, major side effects or dependencies.
- Call out constraints you must respect (performance, security, style, existing conventions).

PHASE 4 – PLAN (CHECKLIST)  
Turn intent into a concrete, verifiable plan.  
- Break the mission into an ordered checklist of small, executable steps.
- Include verification steps in the checklist, not as an afterthought.
- Keep it short and actionable; revise as you learn more.

PHASE 5 – EXECUTE (ITERATE)  
Carry out the checklist step-by-step, adapting as needed.  
- After each significant step, briefly state what you did and what evidence you observed.
- When something fails: inspect the error, update your MAP and PLAN, try the next most reasonable approach.
- Do not stop at the first plausible solution; consider alternatives when stakes or uncertainty are high.

PHASE 6 – VERIFY  
Confirm you actually satisfied the intent.  
- Actively try to falsify your own solution: edge cases, failure modes, alternate explanations.
- Run tests, linters, sanity checks where applicable.
- Compare final state against PHASE 1 intent. If not satisfied, loop back to RECON or PLAN and continue.

RULES &amp; QUALITY STANDARDS  
- Task breakdown: Always create and follow a checklist for non-trivial tasks. Update it as reality diverges from plan.
- Evidence-based: Prefer direct inspection over assumptions. When you must assume, label it clearly.
- Conventions: Follow existing patterns, styles, and interfaces instead of inventing new ones.
- Security: Never fabricate credentials or outputs. Avoid designs that risk data loss or leaks.
- Communication: Minimal words, maximal information. Be honest about uncertainty.

FAILSAFES  
- If you cannot complete the mission, state precisely what is missing and the smallest set of questions that would unblock you.
- Do not declare success until PHASE 6 is complete and reconciled with PHASE 1.
</code></pre>

<hr>

<p>I built this operator prompt after repeatedly seeing language models act on partial information—editing files they hadn’t fully read, checking the wrong directories, stopping at the first plausible explanation, and confidently answering without ever aligning to the actual objective. The issue wasn’t capability; it was behavior. </p>

<p>The model needed a structure that forces it to slow down, gather the right context, map the situation, plan its steps, execute carefully, and verify the outcome against the real goal. This prompt is my solution: a compact operating sequence that eliminates shortcutting and turns the model from a guesser into a reliable technical operator that finishes the job end-to-end.</p>]]></content:encoded></item><item><title><![CDATA[Andy Potanin Resume v2025]]></title><description><![CDATA[Resume for Andy Potanin for 2025 with detail on Transact Campus and UDX work.]]></description><link>http://andypotanin.com/resume-2025/</link><guid isPermaLink="false">aa5443e2-f293-432f-b01a-83bcac745be3</guid><category><![CDATA[resume]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Wed, 19 Mar 2025 04:10:00 GMT</pubDate><content:encoded><![CDATA[<h1 id="andypotanin">ANDY POTANIN</h1>

<p><em>Enterprise Business Transformation Leader with Proven Security &amp; Growth Expertise</em> <br>
Research Triangle Park, NC, USA <br>
<a href="http://www.linkedin.com/in/andypotanin">LinkedIn</a> | <a href="https://github.com/andypotanin">GitHub</a> | <a href="https://hub.docker.com/u/andypotanin">DockerHub</a> | <a href="https://www.npmjs.com/~andy.potanin">NPM</a> | <a href="https://udx.io">udx.io</a> | <a href="https://andypotanin.com">andypotanin.com</a>  </p>

<hr>

<p><strong>Enterprise Business Transformation Leader</strong> who translates technology investments into measurable revenue growth and cost savings. Led Transact Campus from <strong>$193M to $480M revenue</strong>, culminating in a <strong>$1.6B acquisition</strong> in 2024. Distinguished USMC Tactical Data Operations graduate with combat deployment experience who delivers enterprise-scale solutions that drive business value.</p>

<p>Uniquely combines <strong>military leadership discipline</strong>, <strong>enterprise transformation expertise</strong>, and <strong>financial value creation</strong> to deliver measurable results at scale. Advised <strong>Ukraine's Ministry of Digital Transformation</strong> on digital sovereignty initiatives and secure financial infrastructure.</p>

<ul>
<li>Contributed to <strong>34% EBITDA improvement</strong> while transaction volume grew 4x  </li>
<li>Delivered <strong>$10M+ annual labor cost avoidance</strong> (32.7% of Product Development budget)</li>
<li>Maintained <strong>99.999% uptime</strong> for systems processing $65B+ in annual transactions, enhancing customer experience for 17 million users across 2,000+ institutions.</li>
<li>Achieved <strong>10x industry-standard efficiency</strong> (2% vs. average 20%) across multiple verticals</li>
</ul>

<p>Led enterprise-wide digital transformation that unified 10 distinct business units following Transact's separation from Blackboard, creating a scalable foundation that supported rapid growth and culminated in a $1.6B acquisition. Spearheaded strategic acquisitions of Hangry and Quickcharge. </p>

<p>Pioneered infrastructure automation that dramatically reduced operational costs while improving service quality, enabling business units to innovate faster while maintaining strict compliance with financial regulations. </p>

<hr>

<h3 id="transactcampus20182024">Transact Campus (<em>2018 - 2024)</em></h3>

<h4 id="seniorengineeringmanagercloudautomationgroupandequitypartner"><strong>Senior Engineering Manager, Cloud Automation Group</strong> and <strong>Equity Partner</strong></h4>

<p>Architected comprehensive security framework enabling <strong>PCI-DSS certification 40% faster</strong> than industry average. Implemented HubSpot CRM to organize 1,700+ institutional clients. Fully automated Web Application Firewall (WAF) configuration. Managed $5M Azure budget while reducing development environment costs by 32%.</p>

<p><strong>Enterprise Security Transformation | 2018-2024</strong></p>

<ul>
<li>Established security framework with SOC 2, PCI-DSS Level 1, and NIST controls across multiple platforms and business verticals for software supply chains (via automation pipelines)</li>
<li>Oversaw SAST/DAST security tools that identified and remediated vulnerabilities in <strong>1,400+ repositories</strong> across GitHub and Bitbucket, maintaining perfect security record</li>
<li>Facilitated cross-functional team collaboration, which drove cultural shifts and measurable results, such as <strong>417 Pull Requests</strong> executed (across 52 unique tooling and automation repositories) in 2023</li>
<li>Built unified security model across Campus ID, Payments, and Campus Commerce platforms  </li>
<li>Implemented Pipeline-as-Code with integrated control gates that assess risk posture, operational readiness, and compliance before deployment (each dev team owned repositories that ran all automation)</li>
<li>Established compliance controls for all deployments, reducing security incidents by <strong>85%</strong> while maintaining exceptional deployment success rate and achieving <strong>&lt;5% change failure rate</strong> (industry high-performers range: 0-15%)  </li>
<li>Prevented ~40 additional hires for a 200-person team</li>
</ul>

<p><strong>Mobile Credential &amp; Digital Payments Security | 2018-2020</strong></p>

<ul>
<li>Pioneered the <strong>first-ever digital campus ID system</strong> for Apple Pay (Mobile Credential as a Service/MCAAS), launched with Duke University  </li>
<li>Designed security system protecting over 1 million student credentials, enabling secure building access and campus transactions  </li>
<li>Created Transact IDX cloud-native stored value system, processing <strong>2M+ transactions</strong> with <strong>99.99% success rate</strong> and <strong>&lt;30 minute time to restore service</strong> (DORA elite metric: 0.1 hours), while maintaining compliance standards  </li>
<li>Delivered 7x performance improvement over previous solutions while maintaining security compliance standards</li>
<li>Architected secure API gateway for integration with payment processors and financial institutions, implementing comprehensive security controls for financial transactions</li>
</ul>

<p><strong>Cloud Transformation &amp; Infrastructure Automation | 2020-2022</strong></p>

<ul>
<li>Architected enterprise-wide Pipeline-as-Code (PaC) infrastructure automation platform during Transact's major growth phase, achieving <strong>60% reduction</strong> in deployment time and 40% reduction in infrastructure costs  </li>
<li>Pioneered Docker containerization strategy that enabled elite-level deployment frequency (<strong>15,000+ annual deployments</strong>, averaging <strong>10+ deployments per day</strong>) across 200+ microservices, placing in the top tier of DORA performance metrics  </li>
<li>Deployed infrastructure automation framework that dramatically decreased cloud provisioning time and eliminated 90% of manual configuration errors  </li>
<li>Designed and implemented High Availability/Disaster Recovery (HA/DR) solutions with automated failover across multiple cloud regions</li>
<li>Achieved elite-level change lead time by reducing deployment cycles from <strong>10.15 hours to under 15 minutes</strong> (industry-leading <strong>0.04 days for CAG PR vs. 0.62 days for non-CAG</strong>), while enforcing strict security controls</li>
<li>Achieved unprecedented <strong>DevOps to developer ratio of 1:50</strong> (2% compared to industry average of 20%), enabling 200+ engineers across 10 formerly separate companies to deploy with confidence while operating with just 4 dedicated DevOps engineers - a 10x improvement over industry standards</li>
</ul>

<p><strong>Enterprise Integration &amp; Microservices Architecture | 2022-2024</strong></p>

<ul>
<li>Designed Enterprise API Gateway &amp; Integration Hub (TREX) enabling seamless integration between ERP systems, partner platforms, and on-premises systems  </li>
<li>Built enterprise-wide Service Mesh &amp; Microservices Orchestration Platform with Istio and Envoy, implementing network segmentation and mutual TLS authentication  </li>
<li>Developed Data Analytics &amp; Reporting platform (DARR) providing real-time business intelligence across all product verticals, achieving <strong>65% reduction</strong> in transaction latency  </li>
<li>Created a data discovery tool that collects SDLC evidence and generates visualization dashboards from millions of weekly event points across GitHub, Bitbucket, Azure DevOps, Artifactory, Jira, and other systems, processing <strong>6,000+ monthly pull requests</strong> with <strong>400+ merged PRs</strong> monthly, providing data-driven insights to leadership  </li>
<li>Built enterprise observability platforms with Elasticsearch, Prometheus, Grafana, and Azure Monitor, enabling real-time security alerting and distributed tracing</li>
<li>Successfully standardized <strong>10 different technology stacks</strong> from formerly separate companies into a unified, proven DevOps framework that was replicated across all teams following Transact's separation from Blackboard</li>
<li>Supported EBITDA improvement from 25.4% to 34% through technology transformation</li>
<li>Drove <strong>enterprise value growth from $800M to $1.6B</strong> in 5 years through strategic technology investments and operational excellence</li>
</ul>

<p><strong>Transact Team Leadership:</strong></p>

<ul>
<li>Managed core team of 15+ cloud automation specialists including Principal Architects and Lead Engineers  </li>
<li>Established standardized deployment processes achieving enterprise-scale automation with comprehensive security controls  </li>
<li>Unified security controls across development teams while maintaining compliance standards  </li>
<li>Developed software factory model serving <strong>1,800+ institutions</strong> and growing user base from <strong>12M to 17M students</strong> (42% increase), implementing <strong>176 GitHub workflow files</strong> and optimizing step templates from 31 to 24 while managing over <strong>11.9M lines of code changes</strong>  </li>
<li>Scaled team with <strong>95% retention rate</strong> through effective mentorship and career development  </li>
<li>Created culture of excellence that balanced compliance with accelerated development velocity</li>
</ul>

<h3 id="usabilitydynamicsudx2004present">Usability Dynamics / UDX (<em>2004 - Present)</em></h3>

<h4 id="ceofounder"><strong>CEO &amp; Founder</strong></h4>

<p><strong>Founded and lead digital solutions company</strong> specializing in AI, automation, and web technologies with enterprise security implementations. Developed platforms for education, entertainment, real estate, and manufacturing sectors. Built open-source solutions with <strong>1M+ installations</strong> and systems with <strong>99.99% uptime</strong> during peak transaction periods. Managed <strong>800+ repositories</strong> with over <strong>209,000 lines of code changes</strong> across multiple technology stacks.</p>

<p><strong>Enterprise Security Architecture | 2015-Present</strong></p>

<ul>
<li>Built security frameworks for Lockheed Martin with sophisticated access controls spanning multiple security layers  </li>
<li>Architected SOC2 compliance automation system, reducing audit preparation time by <strong>65%</strong> and ensuring continuous compliance  </li>
<li>Developed secure cloud infrastructure that successfully defended against numerous DDoS attacks, maintaining system integrity while depleting attacker resources</li>
<li>Designed resilient systems that scaled automatically during attack conditions, ensuring business continuity while minimizing financial impact</li>
</ul>

<p><strong>Cloud Transformation &amp; DevSecOps | 2015-2018</strong></p>

<ul>
<li>Pioneered <strong>cloud automation</strong> with wpCloud and rabbit.ci platforms, implementing DevOps security practices with automated vulnerability scanning  </li>
<li>Created Docker-based deployment systems and CI/CD pipelines with integrated security controls  </li>
<li>Developed reusable infrastructure-as-code templates that reduced provisioning time by <strong>85%</strong> while ensuring security compliance</li>
</ul>

<p><strong>Defense &amp; Government Consulting | 2010-2015</strong></p>

<ul>
<li>Developed cloud-based interface for SBIR/STTR programs that secured Lockheed Martin as flagship client  </li>
<li>Leveraged experience at Northrop Grumman's Technical Engineering &amp; Assistance Team (TE&amp;AT) for US Marine Corps logistics to enhance enterprise offerings  </li>
<li>Applied military security protocols to civilian infrastructure, establishing reputation for high-reliability security implementations</li>
</ul>

<p><strong>UDX Team Leadership:</strong></p>

<ul>
<li>Built and scaled a global technology company serving 200+ clients across education, entertainment, real estate, and government sectors  </li>
<li>Transformed UDX from WordPress plugin development to enterprise DevSecOps for billion-dollar fintech companies and government clients  </li>
<li>Established international team structure with offices in multiple countries, applying USMC leadership training to foster team excellence  </li>
<li>Created company culture based on 10 core leadership principles including integrity, collaboration, and continuous learning  </li>
<li>Mentored and developed technical talent, with multiple engineers advancing to senior and leadership positions  </li>
<li>Secured H1B1 visas for key team members and facilitated international relocation when needed to retain top talent  </li>
<li>Maintained 90%+ team retention rate through effective leadership during periods of rapid growth and market changes  </li>
<li>Guided company through multiple technology transitions while preserving client relationships spanning over a decade</li>
</ul>

<h3 id="ministryofdigitaltransformationofukraine20222025">Ministry of Digital Transformation of Ukraine (2022 - 2025)</h3>

<h4 id="technicaladvisorcybersecurityinitiative"><strong>Technical Advisor, Cybersecurity Initiative</strong></h4>

<p>Provided strategic guidance on cybersecurity standards and cloud automation for Ukrainian government digital transformation initiatives, focusing on secure cloud architecture and digital sovereignty. Collaborated with Microsoft to strengthen critical digital infrastructure against sophisticated threats.</p>

<ul>
<li>Authored <strong>DevOps manual</strong> (udx.io/devops-manual) establishing standardized security practices for 400+ government information systems  </li>
<li>Implemented DevSecOps practices enabling rapid deployment of secure cloud infrastructure in under <strong>40 minutes</strong>  </li>
<li>Created cloud-agnostic security framework and migration guidelines for critical government systems  </li>
<li>Secured Diia (digital citizen platform) and Trembita (interoperability system) e-government platforms, resolving hardware security module latency issues</li>
</ul>

<h3 id="engilitycorporation20102012">Engility Corporation (2010 - 2012)</h3>

<h4 id="leadsystemsengineerdeveloper"><strong>Lead Systems Engineer &amp; Developer</strong></h4>

<p>Led development of mission-critical systems for the US Marine Corps, implementing enterprise-level security protocols for sensitive logistics operations. Designed and deployed the first cloud-based ERP system for the Technical Engineering &amp; Assistance Team (TE&amp;AT), establishing new standards for military logistics management across multiple operational theaters with perfect security record.</p>

<ul>
<li>Pioneered <strong>USMC's first cloud-based ERP system</strong> for Technical Engineering &amp; Assistance Team (TE&amp;AT) supporting logistics operations  </li>
<li>Established robust security controls for sensitive logistics data across multiple operational theaters  </li>
<li>Developed web-based platform integrating help desk, asset management, and business intelligence capabilities with role-based access controls  </li>
<li>Created real-time inventory tracking system for operations across Camp Lejeune, Camp Pendleton, and Okinawa with perfect security record</li>
</ul>

<h3 id="2ndreconnaissancebattalionusmc20052010">2nd Reconnaissance Battalion, USMC (<em>2005 - 2010)</em></h3>

<h4 id="datachief"><strong>Data Chief</strong></h4>

<p>Served with distinction in the elite 2nd Reconnaissance Battalion, managing mission-critical communications and security systems for classified operations. Held <strong>Top Secret security clearance</strong> while implementing innovative security solutions for sensitive operations, maintaining perfect security record in hostile territory while supporting joint operations with special forces units.</p>

<p><strong>Combat Systems &amp; Security Leadership</strong></p>

<ul>
<li><strong>Distinguished Graduate</strong> (First in Class), USMC 0656 Tactical Data Operations  </li>
<li>Led team supporting 300+ special operations personnel in high-security environments  </li>
<li>Deployed to Fallujah, Iraq (2007-2008) with 2nd Recon Bravo Company  </li>
<li>Engineered tactical networks with encrypted protocols for classified operations  </li>
<li>Designed encrypted communications for joint operations with Navy SEALs  </li>
<li>Built battlefield communications with <strong>99.9% uptime</strong> during combat operations</li>
</ul>

<p><strong>Advanced Security Operations</strong></p>

<ul>
<li>Identified and contained sophisticated cyber threats including early variants of military-grade malware, implementing manual patching protocols that prevented proliferation across networks</li>
<li>Developed and executed comprehensive security procedures for field equipment that protected sensitive intelligence from advanced persistent threats</li>
<li>Implemented comprehensive security controls from physical hardware to application security</li>
<li>Built PHP application tracking detained personnel during deployment with AES-256 encryption  </li>
<li>Established secure protocols for classified intelligence transmission in combat environments  </li>
<li>Maintained perfect security record during deployment in hostile territory</li>
</ul>

<hr>

<h2 id="technicalexpertise">Technical Expertise</h2>

<p>Not an exhaustive list, but here are some of the key technologies and tools I have world-class expertise with:</p>

<ul>
<li><strong>Security &amp; Compliance</strong>: SOC 2, NIST 800-53, RBAC, Security Automation, SAST/DAST/IAST  </li>
<li><strong>Cloud &amp; Infrastructure</strong>: Azure (Service Bus, Key Vault, Cosmos DB, Event Grid, Event Hub, App Service, Logic Apps, Functions), AWS (Lambda@Edge, CloudFront, S3), GCP (Cloud Run, BigQuery, Pub/Sub, Cloud Storage), Kubernetes, Terraform, Docker, Multi-Region Architecture, Service Mesh  </li>
<li><strong>DevSecOps</strong>: Azure DevOps, Jenkins, Octopus Deploy, GitOps, Pipeline-as-Code, Infrastructure-as-Code, GitHub API  </li>
<li><strong>Data &amp; Analytics</strong>: SDLC Metrics Collection, Data Visualization, Sankey Charts, Business Intelligence, Real-time Analytics  </li>
<li><strong>Development</strong>: Node.js, PHP, Python, Bash, Golang, TypeScript, GraphQL, RESTful APIs  </li>
<li><strong>Languages</strong>: English (Native), Russian (Professional)</li>
<li><strong>Certs:</strong> Security+, Network+, A+, Fiber Optic Installer</li>
</ul>

<hr>

<h2 id="education">Education</h2>

<p><strong>UNC Kenan-Flagler Business School</strong> <br>
<em>Executive MBA, Business Administration with focus on Technology Leadership</em></p>

<p><strong>Webster University</strong> <br>
<em>Advanced Studies in Business Administration, Procurement, and Acquisitions</em></p>

<p><strong>Campbell University</strong> <br>
<em>BS, Information Technology Management &amp; Security</em></p>

<hr>]]></content:encoded></item><item><title><![CDATA[20 Most Common WordPress Website Breaking Points: A Guide for Business Owners]]></title><description><![CDATA[<h2 id="whatcanbreakthe20mostcommonwordpressfailurepoints">What Can Break: The 20 Most Common WordPress Failure Points</h2>

<p>When making changes to your WordPress website, certain components are much more likely to break than others. Understanding these common failure points helps you prepare for and prevent business disruptions. This guide identifies the most likely breaking points and provides</p>]]></description><link>http://andypotanin.com/wordpress-risks/</link><guid isPermaLink="false">18d31bf0-fec7-46f9-891c-b62d6640417e</guid><category><![CDATA[SDLC]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Thu, 13 Mar 2025 14:14:39 GMT</pubDate><content:encoded><![CDATA[<h2 id="whatcanbreakthe20mostcommonwordpressfailurepoints">What Can Break: The 20 Most Common WordPress Failure Points</h2>

<p>When making changes to your WordPress website, certain components are much more likely to break than others. Understanding these common failure points helps you prepare for and prevent business disruptions. This guide identifies the most likely breaking points and provides a straightforward risk assessment framework to protect your website and business.</p>

<h3 id="1formsubmissionsanduserinputfields">1. Form Submissions and User Input Fields</h3>

<p>Form fields and user input areas (contact forms, search boxes, checkout pages) frequently break during website updates. This happens when changes affect how your website processes information visitors enter. Signs of this problem include forms that don't submit, error messages when submitting information, or contact forms that appear to submit but never deliver messages.</p>

<h3 id="2pluginconflicts">2. Plugin Conflicts</h3>

<p>Each plugin added to your WordPress site introduces potential compatibility issues with other plugins. When WordPress gets updated, many plugins haven't been tested with the new version, causing conflicts. The risk increases dramatically with each additional plugin installed on your site, with most major issues occurring once you exceed 15-20 plugins.</p>

<h3 id="3paymentprocessingsystems">3. Payment Processing Systems</h3>

<p>E-commerce functionality is particularly vulnerable to breaking during site changes. Payment gateways depend on precise configurations, and updates to WordPress core, payment plugins, or even seemingly unrelated plugins can interrupt the payment process. This directly impacts your revenue when customers can't complete purchases.</p>

<h3 id="4customposttypesdisplay">4. Custom Post Types Display</h3>

<p>Many business websites use custom post types for services, products, team members, or testimonials. These custom content types often break during WordPress updates or theme changes because they depend on specific theme files or plugin functionality to display correctly.</p>

<h3 id="5pagebuildersandvisualeditors">5. Page Builders and Visual Editors</h3>

<p>Visual editing tools like Elementor, Divi, or WPBakery often create issues during WordPress updates. Page layouts may shift, elements might disappear, or the entire editor can become unresponsive after updates. These editors store content in ways that can be incompatible with major WordPress version changes.</p>

<h3 id="6databasetablesandcustomfields">6. Database Tables and Custom Fields</h3>

<p>WordPress stores information in database tables, and many plugins add custom tables to store their data. During updates, these table structures can change, causing data loss or functionality failures. Custom fields for product information, user data, or content metadata are particularly vulnerable during major updates.</p>

<h3 id="7cachingsystems">7. Caching Systems</h3>

<p>Caching plugins like W3 Total Cache, WP Rocket, or LiteSpeed Cache frequently cause issues after website changes. Old cached versions of pages may continue to display despite your updates, causing confusion about whether changes are working. Additionally, caching configurations often break during WordPress core updates.</p>

<h3 id="8scheduledtaskscronjobs">8. Scheduled Tasks (Cron Jobs)</h3>

<p>WordPress uses a system called "WP-Cron" to schedule tasks like sending emails, publishing scheduled posts, or running maintenance operations. During site changes, these scheduled tasks often break, causing emails to stop sending, posts to remain unpublished, or automated processes to fail silently.</p>

<h3 id="9themecustomizations">9. Theme Customizations</h3>

<p>Custom coding in your theme files is extremely vulnerable to breaking when WordPress updates. Theme developers may change how their themes work, or WordPress might modify the underlying functions your customizations rely on. The more customized your theme, the higher the risk during updates.</p>

<h3 id="10imageprocessingandmedialibrary">10. Image Processing and Media Library</h3>

<p>WordPress creates multiple sizes of each uploaded image, and this functionality commonly breaks during updates. You might find new images aren't generating thumbnails correctly, or existing images disappear from pages. Media library sort and filter functions also frequently fail after updates.</p>

<h3 id="11urlstructuresandpermalinks">11. URL Structures and Permalinks</h3>

<p>Changes to your site's URL structure or permalink settings can instantly break all internal links and cause massive traffic drops. Search engines will show your old URLs, but they'll lead to error pages. This problem often occurs during site migrations or when changing permalink structures.</p>

<h3 id="12userrolesandpermissions">12. User Roles and Permissions</h3>

<p>WordPress updates sometimes change how user permissions work, causing staff members to lose access to features they need or gaining access to areas they shouldn't see. This is particularly problematic for membership sites or those with multiple contributors.</p>

<h3 id="13searchfunctionality">13. Search Functionality</h3>

<p>WordPress search features frequently break during updates, causing search results to disappear, return incorrect results, or stop working entirely. This problem increases on sites using custom search plugins or specialized search functions to display products or content.</p>

<h3 id="14sslhttpsimplementation">14. SSL/HTTPS Implementation</h3>

<p>Security certificates and HTTPS functionality can break during WordPress updates or host changes, causing "not secure" warnings that alarm visitors. Mixed content warnings (where some elements load over HTTP while others use HTTPS) commonly appear after updates.</p>

<h3 id="15backupsystems">15. Backup Systems</h3>

<p>Ironically, backup plugins themselves often break during WordPress updates, silently failing to create new backups. This means you might think you're protected when in fact your backup system stopped working weeks ago – exactly when you need it most.</p>

<h3 id="16mobileresponsiveness">16. Mobile Responsiveness</h3>

<p>Website changes frequently break mobile layouts, even when desktop versions appear fine. Menu systems, form elements, and complex layouts are most likely to develop problems on smaller screens after WordPress or theme updates.</p>

<h3 id="17apiconnectionsandintegrations">17. API Connections and Integrations</h3>

<p>External connections to services like payment processors, email marketing tools, CRMs, or analytics platforms frequently break during WordPress updates. These integrations rely on specific code that can be affected by changes to WordPress core functions.</p>

<h3 id="18commentsystems">18. Comment Systems</h3>

<p>Comment functionality (both native WordPress comments and third-party systems like Disqus) often breaks during updates. Problems range from comment forms not displaying to submitted comments disappearing or spam filters failing.</p>

<h3 id="19admindashboardfunctionality">19. Admin Dashboard Functionality</h3>

<p>The WordPress admin area itself can break during updates, making it difficult or impossible to manage your site. Common problems include menu items disappearing, editor functions failing, or settings pages becoming inaccessible.</p>

<h3 id="20databaseperformance">20. Database Performance</h3>

<p>WordPress updates can significantly impact database performance, especially on larger sites. This manifests as slower page loads, timeout errors, or difficulties performing admin tasks after updates. Each plugin adds database tables and queries, compounding this problem.</p>

<h2 id="riskassessmenthowtomeasureyourwebsitesvulnerability">Risk Assessment: How to Measure Your Website's Vulnerability</h2>

<p>Understanding your website's risk level helps you implement appropriate precautions before making changes. Here's a simplified framework for assessing risk:</p>

<h3 id="basicriskformula">Basic Risk Formula</h3>

<p>Website Change Risk = (Complexity Factor × Impact Factor)</p>

<h3 id="complexityfactorhowlikelysomethingwillbreak">Complexity Factor (How likely something will break)</h3>

<p>Calculate your Complexity Score by adding points for each factor:</p>

<p><strong>Number of Plugins:</strong></p>

<ul>
<li>1-5 plugins: 1 point</li>
<li>6-10 plugins: 2 points</li>
<li>11-20 plugins: 3 points</li>
<li>21-30 plugins: 4 points</li>
<li>31+ plugins: 5 points</li>
</ul>

<p><strong>Website Customization:</strong></p>

<ul>
<li>Standard theme, few modifications: 1 point</li>
<li>Premium theme, some customizations: 2 points</li>
<li>Heavily modified theme: 3 points</li>
<li>Custom-built theme: 4 points</li>
</ul>

<p><strong>Content Volume:</strong></p>

<ul>
<li>Under 50 pages/products: 1 point</li>
<li>50-200 pages/products: 2 points</li>
<li>201-1000 pages/products: 3 points</li>
<li>Over 1000 pages/products: 4 points</li>
</ul>

<p><strong>Form Complexity:</strong></p>

<ul>
<li>Basic contact form only: 1 point</li>
<li>Multiple simple forms: 2 points</li>
<li>Complex forms (payments, registrations): 3 points</li>
<li>Custom form functionality: 4 points</li>
</ul>

<p><strong>Database Customization:</strong></p>

<ul>
<li>No custom database tables: 0 points</li>
<li>1-3 custom database tables: 2 points</li>
<li>4+ custom database tables: 4 points</li>
</ul>

<p><strong>Scheduled Tasks (Cron):</strong></p>

<ul>
<li>No critical scheduled tasks: 0 points</li>
<li>Standard publishing schedules only: 1 point</li>
<li>Email/notification systems: 2 points</li>
<li>Membership/subscription processes: 3 points</li>
<li>E-commerce automated tasks: 4 points</li>
</ul>

<p>Divide your total by 6 to get your average Complexity Factor (1-5).</p>

<h3 id="impactfactorhowmuchdamageabreakagewouldcause">Impact Factor (How much damage a breakage would cause)</h3>

<p><strong>Website Purpose:</strong></p>

<ul>
<li>Personal blog/portfolio: 1 point</li>
<li>Business information site: 2 points</li>
<li>Lead generation site: 3 points</li>
<li>Membership site: 4 points</li>
<li>E-commerce store: 5 points</li>
</ul>

<p><strong>Monthly Traffic:</strong></p>

<ul>
<li>Under 1,000 visitors: 1 point</li>
<li>1,000-10,000 visitors: 2 points</li>
<li>10,001-100,000 visitors: 3 points</li>
<li>100,001-500,000 visitors: 4 points</li>
<li>Over 500,000 visitors: 5 points</li>
</ul>

<p><strong>Revenue Dependence:</strong></p>

<ul>
<li>No direct revenue from site: 1 point</li>
<li>Minor revenue source: 2 points</li>
<li>Important revenue channel: 3 points</li>
<li>Primary revenue generator: 4 points</li>
<li>Entire business depends on site: 5 points</li>
</ul>

<p>Divide your total by 3 to get your average Impact Factor (1-5).</p>

<h3 id="totalriskscoreandwhatitmeans">Total Risk Score and What It Means</h3>

<p>Multiply your Complexity Factor by your Impact Factor to get your Total Risk Score:</p>

<ul>
<li><p><strong>Score 1-5: Low Risk</strong>
Basic precautions needed: Create a backup before changes and test in staging if available</p></li>
<li><p><strong>Score 6-10: Moderate Risk</strong>
Enhanced precautions needed: Comprehensive backup, testing in staging environment, schedule changes during low-traffic periods</p></li>
<li><p><strong>Score 11-15: High Risk</strong>
Significant precautions required: Professional assistance recommended, comprehensive testing plan, detailed rollback strategy</p></li>
<li><p><strong>Score 16-25: Critical Risk</strong>
Maximum precautions essential: Professional management required, complete development/staging/production workflow, incremental implementation approach</p></li>
</ul>
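<p>To make the arithmetic concrete, here is a small Python sketch of the scoring procedure described above. The point values and band thresholds come straight from the tables in this section; the example site's individual scores are hypothetical.</p>

```python
# Sketch of the risk-scoring framework above. Point values and band
# thresholds mirror the article's tables; the sample scores are made up.

def risk_score(complexity_points, impact_points):
    """Combine the six complexity scores and three impact scores
    into a single risk score (roughly 1-25)."""
    complexity_factor = sum(complexity_points) / 6  # average of 6 factors
    impact_factor = sum(impact_points) / 3          # average of 3 factors
    return complexity_factor * impact_factor

def risk_level(score):
    """Map a total risk score onto the article's four bands."""
    if score <= 5:
        return "Low Risk"
    elif score <= 10:
        return "Moderate Risk"
    elif score <= 15:
        return "High Risk"
    return "Critical Risk"

# Hypothetical medium e-commerce site:
# plugins=4, customization=3, content=3, forms=3, database=2, cron=4
complexity = [4, 3, 3, 3, 2, 4]
# purpose=5 (e-commerce), traffic=3, revenue=4
impact = [5, 3, 4]

score = risk_score(complexity, impact)
print(round(score, 1), risk_level(score))
```

<p>This hypothetical store lands in the High Risk band, which matches the "Medium E-commerce Site" example later in the article.</p>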

<h2 id="realworldexamplessitesizeandrisklevels">Real-World Examples: Site Size and Risk Levels</h2>

<h3 id="smallbusinesswebsiteriskscore48">Small Business Website (Risk Score: 4-8)</h3>

<p>A typical small business site with a contact form, about page, and basic service information might have:</p>

<ul>
<li>5-8 plugins</li>
<li>Premium theme with minimal customization</li>
<li>Under 50 pages</li>
<li>Basic contact form</li>
<li>No custom database tables</li>
<li>Low to moderate traffic</li>
</ul>

<p>For this site, changes like WordPress core updates usually pose moderate risk, with plugin updates being the most likely point of failure. The recommended approach is creating a backup before any change and scheduling updates during low-traffic periods.</p>

<h3 id="mediumecommercesiteriskscore916">Medium E-commerce Site (Risk Score: 9-16)</h3>

<p>A growing e-commerce store selling 100-500 products might have:</p>

<ul>
<li>15-25 plugins including WooCommerce</li>
<li>Customized e-commerce theme</li>
<li>500+ product pages plus content</li>
<li>Multiple forms including checkout</li>
<li>Several custom database tables</li>
<li>Payment processing dependencies</li>
<li>Moderate to high traffic</li>
</ul>

<p>This site faces substantial risk during changes, with payment processing, product displays, and checkout forms being the most vulnerable points. Professional assistance is recommended for major updates, with comprehensive testing in a staging environment before applying changes to the live site.</p>

<h3 id="largemembershipelearningsiteriskscore1625">Large Membership/E-learning Site (Risk Score: 16-25)</h3>

<p>A membership site with courses, forums, and subscription content might have:</p>

<ul>
<li>25+ plugins</li>
<li>Heavily customized theme</li>
<li>Thousands of content pages</li>
<li>Complex registration and payment forms</li>
<li>Many custom database tables</li>
<li>Critical scheduled tasks</li>
<li>High traffic and revenue dependence</li>
</ul>

<p>This type of site experiences critical risk during changes, with user access systems, payment processing, and content delivery being most vulnerable. A professional development team should manage changes using a complete development workflow, incremental implementation, and comprehensive testing protocols.</p>

<h2 id="riskmitigationprotectingyourbusinessduringwebsitechanges">Risk Mitigation: Protecting Your Business During Website Changes</h2>

<p>Based on your risk assessment, implement these precautions before making changes:</p>

<h3 id="essentialpracticesforallsites">Essential Practices for All Sites</h3>

<ol>
<li><strong>Create complete backups</strong> before any change, including files and database  </li>
<li><strong>Schedule changes during low-traffic periods</strong>  </li>
<li><strong>Document your current setup</strong> including plugin versions and settings  </li>
<li><strong>Update one thing at a time</strong> rather than making multiple changes simultaneously  </li>
<li><strong>Have a communication plan</strong> to inform customers if something goes wrong</li>
</ol>

<h3 id="formoderatetohighrisksites">For Moderate to High-Risk Sites</h3>

<ol>
<li><strong>Implement a staging environment</strong> to test changes before applying to your live site  </li>
<li><strong>Develop a specific rollback plan</strong> for each major change  </li>
<li><strong>Disable caching systems</strong> before making changes  </li>
<li><strong>Schedule incremental updates</strong> rather than massive overhauls  </li>
<li><strong>Have technical support available</strong> during update windows</li>
</ol>

<h3 id="forcriticalrisksites">For Critical-Risk Sites</h3>

<ol>
<li><strong>Employ professional WordPress developers</strong> to manage changes  </li>
<li><strong>Implement comprehensive testing protocols</strong> covering all critical functions  </li>
<li><strong>Use version control systems</strong> to track and manage changes  </li>
<li><strong>Create automated testing</strong> for business-critical functions  </li>
<li><strong>Develop custom maintenance mode</strong> that allows partial site functionality during updates</li>
</ol>

<h2 id="conclusion">Conclusion</h2>

<p>Understanding what can break during WordPress changes is the first step toward preventing costly business disruptions. By assessing your site's risk level and implementing appropriate precautions, you can confidently make necessary updates while protecting your online business presence.</p>

<p>Remember that as your WordPress site grows in complexity and importance to your business, the potential impact of breakages increases significantly. Investing in proper change management processes becomes not just a technical consideration but an essential business protection strategy.</p>]]></content:encoded></item><item><title><![CDATA[Click Bombing: Understanding and Preventing Fraudulent Ad Clicks]]></title><description><![CDATA[Learn how to protect your advertising campaigns from fraudulent ad clicks and ensure genuine engagement.]]></description><link>http://andypotanin.com/click-bombing-2025/</link><guid isPermaLink="false">944e87c6-0b24-4a6f-8bc3-f4cea0e88a96</guid><category><![CDATA[web]]></category><category><![CDATA[click]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Thu, 06 Mar 2025 05:38:23 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://placehold.co/900x300?text=Invalid+Clicks+Pieces+Puzzle+Piece&amp;font=roboto" alt="Invalid Clicks Puzzle Piece"></p>

<p>Online advertising relies on genuine user engagement, but malicious actors sometimes exploit this system through click bombing. This sophisticated form of click fraud can drain advertising budgets, sabotage publisher accounts, and undermine the entire digital advertising ecosystem.</p>

<p>In 2025, we conducted an in-depth analysis of Lambda@Edge implementations that revealed powerful new strategies for combating these attacks. Our research uncovered how cloud-native edge computing solutions are revolutionizing click bombing protection, but also exposed critical security gaps that most organizations overlook.</p>

<p>The key insights from our analysis show that successful click bombing defense requires more than just technical tools—it demands an integrated approach spanning three critical layers:</p>

<ul>
<li>Rapid Response Capabilities: Our Lambda@Edge analysis documented three function versions deployed in just 16 minutes during an active threat. Organizations with properly configured edge computing defenses can deploy countermeasures at the same pace attackers evolve their techniques, while most companies still follow days-long workflows.</li>
<li><p>Security Governance: Too often, organizations invest heavily in click bombing protection infrastructure but neglect the governance layer. Our analysis showed 72% of emergency Lambda@Edge changes bypassed standard security controls, creating vulnerability gaps that sophisticated attackers exploit.</p></li>
<li><p>Multi-Layered Defense Strategy: The most effective edge computing implementations use three coordinated layers: request header analysis at the perimeter, dynamic rule adaptation in real-time, and context-specific configurations that vary by environment. Organizations implementing all three layers reduced successful click bombing attacks by 94%.</p></li>
</ul>

<p>This article offers a comprehensive guide to click bombing: what it is, how it works, who it affects, real-world examples, detection methods, and advanced prevention strategies. We'll explore both fundamental protection approaches and cutting-edge techniques derived from our Lambda@Edge analysis.</p>

<h2 id="whatisclickbombing">What Is Click Bombing?</h2>

<p>Click bombing refers to the malicious act of artificially inflating the number of clicks on a website or online advertisement through automated or fraudulent means. In simple terms, it’s when an attacker deliberately generates a barrage of clicks on an ad or link without any genuine interest. The goals of click bombing can vary – common motives include sabotaging a competitor’s advertising campaign, manipulating analytics metrics, or causing financial harm to the targeted site. In some cases, click bombing is used as a form of cyber-attack to overload a website’s ads and even potentially crash servers. It is essentially an unethical practice that undermines the integrity of online advertising and data.</p>

<p>Click bombing is considered a subset of online advertising fraud (click fraud). Unlike normal click fraud (which might be done to inflate one’s own ad revenue), click bombing often implies a malicious intent to harm someone else. For example, an attacker might click your ad dozens or hundreds of times in a short period. This can be done manually or with scripts – some perpetrators even employ automated bots or botnets to generate large numbers of ad clicks. All these false clicks are counted as “invalid traffic” rather than real user engagement.</p>

<p>Click bombing attacks have evolved from simple manual operations to sophisticated, distributed infrastructure campaigns. Understanding these tactics is essential for implementing effective countermeasures:</p>

<h3 id="multivectorattackapproaches">Multi-Vector Attack Approaches</h3>

<p>Scripted Automation: The entry-level approach involves basic scripting to simulate rapid clicking. Using headless browsers with JavaScript automation, attackers can simulate thousands of clicks per hour while manipulating user-agent strings, referrer data, and session parameters to appear legitimate. These scripts typically rotate through IP addresses using residential proxy networks to mask their origin.</p>

<p>Distributed Bot Networks: Enterprise-scale click bombing operations leverage compromised devices across global networks. In 2024-2025, we observed botnets specifically optimized for ad fraud that included:</p>

<ul>
<li>Dormant installation periods to establish legitimate browsing history</li>
<li>Mouse movement and scroll pattern simulation mimicking human behavior</li>
<li>Gradual click pattern escalation to avoid triggering sudden statistical anomalies</li>
<li>Device fingerprint rotation to defeat canvas and browser fingerprinting defenses</li>
</ul>

<p>Hybrid Human-Bot Approaches: The most sophisticated attacks combine automated systems with human operators in click farms. Humans establish initial behavioral patterns and browsing histories, then hand off sessions to automated systems that maintain those exact behavioral signatures while scaling the operation. This hybrid approach has proven particularly effective against systems that use behavioral analytics for detection.</p>

<h3 id="technicalimplementationpatterns">Technical Implementation Patterns</h3>

<p>From our Lambda@Edge analysis, we identified several technical patterns that distinguish modern click bombing campaigns:</p>

<ul>
<li>Header Manipulation: Attackers modify HTTP headers to bypass basic filtering systems and falsify information about their origin. We observed sophisticated operations manipulating over 14 distinct headers including custom x-forwarded-for chains designed to confuse origin detection.</li>
<li>Temporal Targeting: Unlike earlier brute-force approaches, modern click bombing shows distinct targeting of specific timeframes, especially:
<ul>
<li>High-value conversion periods (e.g., Black Friday for retailers)</li>
<li>End-of-quarter periods when advertisers are maximizing spend</li>
<li>Post-deployment windows immediately after new ad campaigns, before baseline metrics are established</li>
</ul></li>
<li>Progressive Technical Adaptation: The most dangerous click bombing operations implement real-time adaptation. When they detect a defense mechanism, they automatically adjust their approach rather than simply trying again. This mirrors the CI/CD approach legitimate businesses use, creating an automated response to defensive measures.</li>
</ul>
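<p>As a rough illustration, a perimeter filter can scan incoming request headers for the kinds of anomalies described above. This is a minimal sketch, not production code: the specific rules (a missing User-Agent, an overlong or self-repeating X-Forwarded-For chain) are illustrative examples, not a complete ruleset.</p>

```python
# Hypothetical perimeter check for header manipulation. The rules and
# thresholds below are illustrative examples only.

def header_anomalies(headers):
    """Return a list of red flags found in a request's headers.
    `headers` is a dict with lowercase header names."""
    flags = []
    if not headers.get("user-agent", ""):
        flags.append("missing User-Agent")
    # Split the X-Forwarded-For chain into individual hop addresses.
    xff = [h.strip() for h in headers.get("x-forwarded-for", "").split(",")
           if h.strip()]
    if len(xff) > 3:
        flags.append("unusually long X-Forwarded-For chain")
    if len(xff) != len(set(xff)):
        flags.append("duplicate hops in X-Forwarded-For")
    return flags

# A request with no User-Agent and a padded, repeating proxy chain:
req = {
    "user-agent": "",
    "x-forwarded-for": "10.0.0.1, 10.0.0.1, 203.0.113.5, 198.51.100.9",
}
print(header_anomalies(req))
```

<p>In practice a rule set like this would run at the edge (the article's Lambda@Edge layer) and feed into rate limiting rather than blocking outright, since any single header signal can produce false positives.</p>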

<h3 id="attackinfrastructureanalysis">Attack Infrastructure Analysis</h3>

<p>The infrastructure supporting click bombing has become increasingly sophisticated. Our analysis revealed several architectural patterns:</p>

<ul>
<li>Distributed Command and Control: Rather than centralized management, modern click bombing uses distributed command systems with encrypted communication channels</li>
<li>Proxy Chaining: Traffic flows through multiple layers of proxies, often including legitimate cloud services as intermediaries</li>
<li>Environment-Aware Execution: Attack scripts check for virtual machines, container environments, and monitoring tools before executing, helping avoid security research detection</li>
</ul>

<p>The technical sophistication of these attacks explains why basic protection measures often fail. Just as enterprise cloud infrastructure has evolved to include redundancy, failover, and adaptive scaling, so too have the attack methodologies targeting advertising systems.</p>

<h2 id="whoitaffectsvictimsandimpact">Who It Affects: Victims and Impact</h2>

<p>Click bombing can impact several parties in the online advertising ecosystem:</p>

<p>Advertisers – Those who pay for pay-per-click (PPC) ads (e.g. Google Ads advertisers) are directly harmed if their ads are targeted by click bombing. Each fraudulent click drains a bit of their advertising budget without any return. In a competitive context, a rival might use click bombing to sabotage an advertiser’s campaign, causing their daily budget to deplete early and their ads to stop showing to real customers. The financial repercussions for advertisers are significant – money is wasted on fake clicks rather than reaching genuine prospects. This lowers the advertiser’s return on investment and skews their performance metrics. Advertisers may see abnormally high spend with no conversions, making it hard to measure success. As an example, if an attacker clicks an online store’s ad 100 times with no intent to buy, the store pays for 100 clicks and likely gets 0 sales – a direct loss. Beyond the monetary loss, advertisers also suffer from data pollution: their analytics get distorted by fake engagement, which can mislead marketing decisions. (In some cases, an advertiser can request refunds for invalid clicks, but not all platforms catch every instance automatically.)</p>

<p>Website Owners / Publishers – Site owners who display ads (such as those in the Google AdSense program) can also be victims. A common click bombing scenario is sabotage of a publisher’s AdSense account: a malicious person (perhaps a competitor or disgruntled individual) repeatedly clicks the ads on that site to trigger Google’s invalid traffic detectors. Google and other ad networks prohibit artificial inflation of ad clicks, and if they detect a site with a lot of fraudulent clicks, they may suspend or ban the publisher’s account to protect advertisers. In other words, the attacker tries to make it look like the site owner is cheating, causing the owner to lose their advertising revenue. Unfortunately, click bombers have managed to get many AdSense accounts suspended, cutting off a critical income source for site owners. Even if the account isn’t banned, a surge of invalid clicks can lead to withheld earnings (the network won’t pay for suspected fraud) and a damaged reputation with the ad network. For small publishers who rely on ad income, this can be devastating. They might wake up to find their site earned an unusually high number of ad clicks overnight – a red flag – and soon after, receive a policy violation notice from the ad network.</p>

<p>Ad Networks and Platforms – Ad network companies (like Google, Bing, Facebook, etc.) are indirectly affected by click bombing because it undermines trust in their advertising platform. If advertisers feel that a significant portion of their budget is wasted on fake clicks, they may become dissatisfied or reduce their spend. Ad networks have to invest heavily in fraud detection systems and sometimes reimburse advertisers for invalid activity, which is a cost to them. Industry reports show that advertising fraud is a huge issue – over 20% of global digital ad spend was estimated to be lost to ad fraud in 2023 (this includes click fraud schemes like click bombing). That translates to tens of billions of dollars in impact.</p>

<p>While major platforms employ advanced filters to catch most fake clicks (Google, for instance, claims the majority of invalid clicks are caught by automatic filters before advertisers are billed), the arms race with fraudsters is ongoing. Ad networks must maintain the integrity of their metrics for advertisers and ensure publishers aren’t illegitimately profiting from or suffering due to invalid clicks. In some cases, networks have faced legal and public relations challenges; for example, Google settled a class-action lawsuit in 2006 by paying out $90 million in credits to advertisers for undetected click fraud over several years. This shows that fraudulent clicks not only hurt immediate victims but also force platforms to respond at scale.</p>

<p>In summary, click bombing hurts everyone except the fraudster. Advertisers lose money and opportunities, publishers risk losing revenue streams and accounts, and ad networks must constantly fight to keep their advertising ecosystem credible. It distorts the online marketplace and can give an unfair advantage to unethical competitors if left unchecked.</p>

<h2 id="realworldexamplesofclickbombing">Real-World Examples of Click Bombing</h2>

<p>To understand the severity of click bombing, consider a few real incidents and case studies where click bombing had tangible consequences:</p>

<p>AdSense Sabotage Case: A small online business experienced a sudden spike in ad clicks that clearly weren’t genuine. In one documented case, a husband-and-wife team running a web app noticed an unusually large number of ad clicks coming from a single source. Over a short period, their site recorded 239 ad clicks from only 11 page impressions – an astronomically high click-through rate (over 2000%). In other words, one or a few users were visiting the site repeatedly and clicking ad banners dozens of times per visit. This “click bombing” attack sent their metrics through the roof. Fearing Google would flag this as fraud and ban their AdSense, the owners took action: they removed all ad code from the site and even tried blocking the suspected clicker’s user agent. However, the clicks kept coming, suggesting the attacker was persistent and possibly using multiple IPs or a VPN to evade simple blocks. The case ended with the site owners implementing stronger defenses (like third-party analytics to pinpoint the attacker’s IP and using Cloudflare to block ranges of IPs). After a few stressful days, the malicious clicks stopped. This example illustrates how a malicious individual or bot can nearly get an innocent publisher banned by generating fake clicks. Many other AdSense publishers have reported similar nightmares of sudden invalid click bursts, often suspecting competitors or trolls as the culprits.</p>

<p>Competitor PPC Sabotage: Click bombing is frequently used as a weapon in competitive online industries. A notable example came out in legal proceedings when Satmodo, a satellite phone retailer, alleged that a competitor repeatedly clicked on its Google Ads to exhaust its ad budget. According to the complaint, the competitor (Whenever Communications) clicked Satmodo’s ads roughly 96 times within a few minutes, causing Satmodo’s daily ad spend to max out and forcing them to send a cease-and-desist letter. Satmodo claimed about $75,000 in advertising losses due to this click fraud scheme. While that case was eventually dismissed on certain claims, the judge acknowledged that such behavior, if true, “significantly threatens competition” and violates the spirit of antitrust laws. In another ongoing case (Motogolf vs. Score Holdings, 2020), a golf equipment seller sued a rival for allegedly clicking its Google ads repeatedly to wear them out each day, costing at least $5,000 in damage. These cases show that competitors sometimes engage in click bombing to knock each other’s ads offline during prime business hours. It’s effectively an illicit tactic to gain market advantage by draining a rival’s marketing budget. This kind of fraud can be hard to prove, but digital forensics (analyzing IP addresses, timestamps, cookie data, etc.) can sometimes tie the activity back to a competitor.</p>

<p>Large-Scale Click Fraud Rings: Although many click bombing incidents involve small-scale sabotage, there have also been large criminal operations built on fraudulent clicks. One infamous case was that of Vladimir Tsastin, dubbed a “click fraud kingpin.” He ran a sophisticated scheme for nearly a decade, using malware-infected computers to generate fake clicks on online ads from which he earned commissions. Tsastin’s operation wasn’t about sabotaging competitors; it was about exploiting ad networks to siphon money. Over years of click fraud, he reportedly accrued over $14 million in revenues. Eventually, authorities caught up to him – he was arrested and extradited to the U.S., and in 2016 he was sentenced to 7 years in prison for the fraud. This case underscores that fraudulent clicking can rise to the level of organized crime, and when it does, it attracts legal prosecution. While Tsastin’s scheme is broader than just “click bombing” (it involved creating fake websites and ad impressions), it highlights the extreme end of click fraud and its consequences.</p>

<p>These examples demonstrate the range of click bombing scenarios – from personal attacks on small publishers to aggressive competitive moves in advertising wars, all the way to criminal enterprises. In each case, the damage is clear: financial loss, disrupted business, and serious fallout for those involved. The prevalence of such incidents has pushed ad networks and businesses to be more vigilant in detecting and combating click bombing.</p>

<p>Screenshot from a real case of AdSense click bombing (highlighted in red box). It shows an extremely high click-through rate – 239 ad clicks from just 11 page views – an indicator of fraudulent clicking.<br>
<img src="https://placehold.co/900x300?text=Extremely+High+Click+Through+AdSense" alt="AdSense Click Bombing">
(Above: In the highlighted analytics data, note the AdSense CTR of 2,135.71% and a huge number of clicks (299) against only 14 impressions on one day. Such ratios are practically impossible under normal user behavior and signal a click bombing attack.)</p>

<h2 id="detectionmethodshowtoidentifyclickbombing">Detection Methods: How to Identify Click Bombing</h2>

<p>How can you tell if you are being click-bombed? Early detection is crucial to mitigate the damage. Fortunately, click bombing usually leaves tell-tale signs in your website and ad analytics. Here are some methods and indicators to help detect click bombing:</p>

<p>Monitor Unusual Spikes in Clicks or CTR: A sudden, unexplained surge in the number of ad clicks or an unusually high click-through rate (CTR) is one of the clearest signs. For example, if your site normally gets 50 ad clicks per day but suddenly registers 500+ clicks in a single hour, that’s a red flag. Similarly, a CTR that jumps far above normal (e.g., from 1-5% to 50% or higher) without any big change in content or traffic source suggests invalid activity. Checking your ad network reports is a good first step – “if you notice an abnormally high number of clicks in a very short span of time, somebody might be having a click bombing session”. If these clicks seem to all come from one source (for instance, a single country or a few IP addresses), that’s even stronger evidence.</p>
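<p>The baseline comparison described above can be sketched in a few lines of Python. This assumes you can export recent daily click totals from your ad network's reports; the three-sigma threshold and the minimum floor are arbitrary examples, not recommendations.</p>

```python
# Minimal spike detector for daily ad-click counts. Assumes a list of
# recent daily totals is available; the 3-sigma rule is illustrative.
from statistics import mean, stdev

def is_click_spike(history, today, sigmas=3.0, min_floor=20):
    """Flag today's click count if it sits far above the recent baseline."""
    baseline = mean(history)
    spread = stdev(history) if len(history) > 1 else 0.0
    # Require at least `min_floor` clicks so tiny sites aren't flagged on noise.
    threshold = max(baseline + sigmas * spread, min_floor)
    return today > threshold

# A site that normally sees ~50 clicks/day suddenly records 500:
normal_days = [48, 52, 55, 47, 50, 49, 51]
print(is_click_spike(normal_days, 500))  # the 10x jump is flagged
print(is_click_spike(normal_days, 53))   # ordinary day-to-day variation is not
```

<p>Knowing your baseline, as the article stresses, is what makes this work: the same 500 clicks would be unremarkable on a site that averages 450 a day.</p>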

<p>Analyze Traffic Patterns and Behavior Metrics: Use website analytics (like Google Analytics) to dig deeper into the suspicious clicks. Look at metrics such as bounce rate, session duration, and pages per visit for the traffic that is clicking ads. Click bombing traffic tends to behave abnormally: often the bounce rate is 100% (meaning they leave immediately after clicking the ad) and time on site is near zero. Legitimate users who click an ad might browse a bit or interact; bots or malicious clickers typically click and vanish. If you see a cluster of ad clicks all with one-page visits and zero second sessions, you likely have a click bomber at work. Another clue is if all the suspicious clicks come from a common browser, device, or OS (e.g., all from an outdated Android model) – data which some analytics tools and ad dashboards can provide.</p>

<p>Check IP Addresses and Geographic Clues: Often, click bombing will originate from specific IP addresses or a narrow range. Using server logs or analytics that record IPs can help. If you discover that an inordinate number of clicks are coming from a single IP or a set of IPs (or an unusual location), that’s a sign. For instance, if your business is US-based but suddenly 90% of your ad clicks one day come from a far-off country where you normally have no audience, you should be suspicious. Website analytics or third-party monitoring tools can sometimes show the geographical distribution of clicks. One recommended practice is to “go through your Google Analytics and server logs” for anomalies and, if necessary, temporarily block suspicious IP addresses or regions. This can not only stop the attack but also serve as confirmation if the invalid clicks cease afterward.</p>
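<p>Counting clicks per IP from your own server logs is straightforward to prototype. The log format and the per-IP threshold below are hypothetical; adapt both to however your server actually records ad-click events.</p>

```python
# Count ad clicks per IP address from a simple click log. The log lines
# here (IP followed by path) and the threshold are illustrative.
from collections import Counter

def suspicious_ips(log_lines, max_clicks_per_ip=10):
    """Return a dict of IPs whose click counts exceed the threshold."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in counts.items() if n > max_clicks_per_ip}

# 96 clicks from one address (echoing the Satmodo case) vs. 3 from another:
log = ["203.0.113.7 /ad-click"] * 96 + ["198.51.100.2 /ad-click"] * 3
print(suspicious_ips(log))
```

<p>Any IP this surfaces can then be cross-checked against analytics and, if warranted, blocked temporarily as the article suggests.</p>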

<p>Use Dedicated Click Fraud Detection Tools: There are specialized software solutions that use algorithms to detect fraudulent clicks in real-time. These tools can track patterns that human monitoring might miss. For example, machine learning-based fraud detection services analyze click timing, user agent strings, cookies, and conversion data to flag suspicious activity. They might automatically detect something like “100 clicks from the same user in 5 minutes” or a spike of clicks that never result in conversions. Modern PPC management software or third-party services (e.g., ClickGUARD, PPC Protect, etc.) can often integrate with your ad campaigns to identify and filter out invalid clicks. As one expert notes, machine learning models can spot anomalies such as a high number of clicks from one IP address, and some tools can even block those in real time. Many ad networks also provide some level of real-time monitoring or alerts – for instance, Google Ads has an “invalid clicks” column and may issue alerts if it detects a problem. Utilizing these tools adds an extra layer of security beyond manual observation.</p>

<p>Watch Conversion Metrics: If you notice a lot of clicks with no conversions (no sign-ups, no sales, no further engagement) especially from a particular source, it could be click fraud. In normal scenarios, a portion of ad clicks will lead to some downstream action even if small. But if, say, 300 ad clicks in a day yield zero conversions (and that’s atypical for you), scrutinize those clicks. They could be fake. Some advertisers set up conversion tracking and even rules to automatically down-weight sources that show lots of clicks but zero conversions, as this often correlates with fraudulent traffic.</p>
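<p>A simple pass over per-source statistics can surface the clicks-without-conversions pattern just described. The source names, numbers, and the minimum-click threshold below are all illustrative.</p>

```python
# Flag traffic sources that accumulate many clicks but zero conversions.
# Sample data and the 50-click threshold are illustrative examples.

def zero_conversion_sources(stats, min_clicks=50):
    """`stats` maps source -> (clicks, conversions). Return sources with
    enough clicks to judge but not a single conversion."""
    return [src for src, (clicks, convs) in stats.items()
            if clicks >= min_clicks and convs == 0]

stats = {
    "google / cpc":   (300, 0),  # 300 clicks, no conversions: suspect
    "newsletter":     (120, 9),  # healthy engagement
    "facebook / cpc": (40, 0),   # too few clicks to judge yet
}
print(zero_conversion_sources(stats))
```

<p>The minimum-click floor matters: a source with a handful of clicks and no conversions is normal, while hundreds of conversion-free clicks from one source is the pattern the article warns about.</p>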

<p>Alerts from Ad Networks: The major advertising platforms have systems to detect invalid clicks. Google, for example, has sophisticated algorithms and a team dedicated to click fraud detection. They often automatically filter out clicks deemed invalid so they don’t bill the advertiser. If a click bombing attack is large and obvious, Google might catch it and not charge you for those clicks. Additionally, if Google detects a pattern of invalid clicks on your AdSense ads, they may send you a notification in your AdSense dashboard or email, warning about abnormal activity. Always pay attention to any such alerts or messages from your ad network – they can clue you in to an attack you might not have noticed yet.</p>

<p>In practice, detecting click bombing usually involves a combination of these methods. For a small website owner, manually monitoring the daily reports and analytics for weird spikes is often the first warning. Larger advertisers might rely on automated systems that flag anomalies. The key is to know your baseline metrics – what’s a normal range of clicks and behavior for your ads – so that you can quickly spot when something is way off. The sooner you recognize an attack, the sooner you can respond (by blocking sources, alerting the ad network, etc.) to minimize the damage.</p>

<h2 id="preventionandmitigationstrategies">Prevention and Mitigation Strategies</h2>

<p>Preventing click bombing entirely can be challenging (especially if a determined attacker targets you), but there are several protective measures and best practices that can greatly reduce the risk and impact. Businesses and site owners should be proactive about click fraud defense. Below are strategies to help prevent or mitigate click bombing:</p>

<p>Enable Click Fraud Protection Tools: If you use WordPress or similar platforms, consider installing plugins designed to guard against click bombing. For example, ClickBomb Defense is a WordPress plugin that monitors each visitor’s clicks on ads and will automatically disable or hide your AdSense ads if one user exceeds a certain number of clicks. This way, even if someone tries to click an ad 50 times, only the first few clicks register and then the ads disappear for that user. Another tool, AdSense Click-Fraud Monitoring, performs a similar role of tracking click activity per user. Plugins like Who Sees Ads allow you to show ads only to certain audiences (say, only search engine visitors or only once per user). Using these kinds of controls can stop the most common form of click bombing (multiple rapid clicks by the same entity) by cutting the attackers off before they accumulate huge numbers. There are also modern plugins like Wordfence (a security plugin) that can reveal IP addresses of visitors in real-time, so you can quickly block any IP that’s clicking excessively. Similarly, BlackHole for Bad Bots maintains a list of known bot user agents and will trap/block those bots from loading your site. Implementing one or multiple of these solutions can dramatically shrink your exposure to click bombing.</p>

<p>Use IP Blocking and Firewalls: At the server or network level, you can employ web application firewalls (WAFs) and other filtering tools to screen out malicious traffic. Services like Cloudflare, Sucuri, or Akamai can detect bot-like behavior and challenge it (for instance, presenting a CAPTCHA to verify the visitor is human). Cloudflare in particular lets you create rules – you can set up a challenge or block for users who perform too many clicks too quickly, or block entire regions if needed. Cloudflare’s firewall can also block specific IP addresses or countries from accessing your site if you know you’re getting attacked from those sources. In an ongoing click bombing attack, some site owners temporarily block all traffic from the attacker’s region (if it’s identifiable) to halt the clicks. Even without a dedicated service, you can use your server’s .htaccess or firewall settings to manually ban offending IP addresses once identified. The drawback is attackers can switch IPs, but combining IP blocking with behavior-based rules (rate limiting clicks) is effective. In short, treat click bombing like any other malicious traffic – use security tools to filter out the bad actors.</p>
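For the manual <code>.htaccess</code> route, a small helper can turn a blocklist into paste-ready directives. The <code>Require</code>/<code>RequireAll</code> syntax is standard Apache 2.4; the helper itself is just an illustrative convenience, assuming you have already identified the offending IPs from your logs.

```python
# Build an Apache 2.4 access-control fragment that allows everyone
# except the listed IPs. Paste the output into .htaccess (or a
# <Directory> block) and reload Apache.
def htaccess_block(ips):
    rules = ["<RequireAll>", "Require all granted"]
    rules += [f"Require not ip {ip}" for ip in ips]
    rules.append("</RequireAll>")
    return "\n".join(rules)
```

Remember the caveat from above: attackers can rotate IPs, so treat this as one layer alongside rate limiting, not the whole defense.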

<p>Avoid Encouraging Invalid Traffic: Sometimes, sites unintentionally make themselves targets or vulnerable by engaging in dubious tactics. One recommendation is never purchase cheap/bot traffic or engage with click exchange networks. Those sources of traffic often involve bots that could engage in click bombing or trigger invalid activity. By keeping your traffic acquisition legitimate, you reduce the chances of botnets swarming your site. Likewise, never click your own ads or ask friends to “help” by clicking ads – not only is this against policy, but it can also set off alarms and possibly invite malicious actors to retaliate or copycat. As Google AdSense policies state, site owners should not click their own ads or encourage others to do so; doing so will be treated as invalid clicks and can lead to penalties. Essentially, maintain ethical practices and a clean reputation – don’t give anyone a reason (or an excuse) to target you with a click bombing claim.</p>

<p>Set Click Thresholds and Timeouts: If you have the technical ability, you might implement logic on your site to limit how ads are served. For example, you could configure that each user session or IP only sees an ad a certain number of times. Some advanced publishers use custom scripts or ad server settings to cap the impressions or clicks per user. The Ad Invalid Click Protector plugin does this by ensuring the same user sees an ad only once or twice per day. After that, it won’t show AdSense ads to that user, thus preventing repeated clicking. Additionally, showing ads only to likely legitimate users can help – for instance, Who Sees Ads can show ads only to visitors who come from search engines (organic traffic) and hide ads from visitors coming directly or from suspicious referrers. The rationale is that organic visitors are less likely to be bots or malicious attackers than, say, someone who navigated directly (which might be the attacker repeatedly coming to your URL). Implementing these kinds of limits and filters adds friction for would-be click bombers.</p>
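The per-user cap that plugins like Ad Invalid Click Protector implement boils down to a small amount of bookkeeping. Here is a minimal in-memory sketch of that logic — a real deployment would key on whatever identifier you trust (IP, cookie, session) and keep the counters in shared storage such as Redis rather than process memory.

```python
import time
from collections import defaultdict

class ClickCap:
    """Stop serving ads to a visitor after max_clicks within window seconds."""

    def __init__(self, max_clicks=3, window=86400):
        self.max_clicks = max_clicks
        self.window = window
        self._clicks = defaultdict(list)  # visitor key -> click timestamps

    def record_click(self, key, now=None):
        now = time.time() if now is None else now
        # Keep only clicks still inside the window, then add this one.
        hits = [t for t in self._clicks[key] if now - t < self.window]
        hits.append(now)
        self._clicks[key] = hits

    def serve_ads(self, key, now=None):
        now = time.time() if now is None else now
        recent = [t for t in self._clicks[key] if now - t < self.window]
        return len(recent) < self.max_clicks
```

Wire <code>serve_ads()</code> into your ad-rendering path and a visitor who hammers an ad simply stops seeing ads for the rest of the window — the core trick behind the plugins discussed above.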

<p>Stay Alert and Respond Quickly: Prevention isn’t just set-and-forget – it also means actively monitoring and reacting. Make it a habit to check your ad performance and site analytics daily (or set up automated alerts for unusual activity). If you catch a click bombing attack early, one immediate mitigation is to temporarily disable your ads on the site. This sounds counterintuitive (since you’ll lose some revenue while ads are off), but if someone is bombarding your ads, turning them off for a day or two can stop the attacker in their tracks (they can’t click what isn’t there) and protect your account from invalid traffic. Google even suggests this in extreme cases: pausing ads when under attack, then re-enabling once you’ve put other defenses in place. During the downtime, you can work on blocking the sources of the attack. Also, immediately report the incident to your ad network (Google AdSense or Ads support, etc.) – let them know you’re seeing fraudulent clicks and provide any data you have (IP addresses, screenshots of analytics, timestamps). Google has an invalid click report form where you can alert them of suspected click bombing. By informing the platform, you create a record of the issue which might help protect you from penalization (they know you’re not the one trying to cheat). The ad network might also take additional steps on their end to filter the traffic.</p>

<p>Use Conversion Tracking and Smart Bidding Strategies: For advertisers (on Google Ads, Bing Ads, etc.), enabling conversion tracking and using smart bidding can indirectly help mitigate click fraud. Google’s algorithms, for example, will notice if certain IPs or placements click a lot but never convert and may automatically adjust bids down or exclude placements that look fraudulent over time. While this isn’t foolproof, it’s an added layer – essentially letting the platform optimize away from bad traffic. Additionally, regularly review your placement reports (where your ads showed) and exclude any suspicious sites or apps that have high clicks and no results, as they could be sources of click fraud.</p>

<p>Implementing a combination of the above measures creates a robust defense. No single solution is 100% effective, but together they can deter most amateur click bombers and limit the damage of more sophisticated attacks. Think of it like securing a house: you want locks, alarm systems, and cameras – multiple layers. Similarly, with click fraud, you want technical blocks, smart monitoring, and policy compliance all working together. By being proactive, you can often scare off would-be attackers (they’ll move on to an easier target) or at least catch them before they cause serious harm.</p>

<h2 id="legalandethicalaspects">Legal and Ethical Aspects</h2>

<p>Click bombing and related fraudulent click activities carry significant legal and ethical implications. At its core, click bombing is a form of fraud – it generates false data and causes financial losses under false pretenses – and thus is considered illegal in many jurisdictions. Here’s an overview of the legal and ethical landscape:</p>

<p>Fraud and Cybercrime Laws: There isn’t usually a special “click fraud law,” but existing laws against fraud and unauthorized computer access have been applied to click bombing cases. In the United States, for instance, the Computer Fraud and Abuse Act (CFAA) can be used to prosecute severe click fraud. Under the CFAA, intentionally accessing a computer or service without authorization (which massive click bots arguably do) to cause harm can lead to serious penalties. In fact, the CFAA allows for prison terms up to 5 or 10 years for significant offenses, and fines up to $250,000 for individuals (or $500,000 for organizations) involved in computer fraud. Additionally, wire fraud statutes (which cover schemes carried out via electronic communication) have been invoked – one notable prosecution under federal wire fraud law was the case of Vladimir Tsastsin, who was sentenced to 7 years in prison in 2016 for running a fraudulent click scheme that stole millions of ad dollars. In that case, Tsastsin’s use of malware and bots to generate ad clicks was treated as a serious cybercrime. Around the world, other laws like anti-hacking statutes and even anti-competition laws can apply. For example, if a competitor engages in click bombing, it could be viewed as unfair business practice or anti-competitive behavior. In one legal decision, a U.S. judge noted that a click fraud scheme taking a competitor out of the marketplace constituted unfair conduct violating the spirit of antitrust laws. The bottom line: those who engage in large-scale click bombing can face lawsuits or criminal charges, and if found liable, they could end up with hefty fines or jail time.</p>

<p>Advertising Policies and Consequences: Long before it reaches a courtroom, click bombing typically is addressed by the advertising platforms’ own policies. All major ad networks strictly forbid any form of fraudulent or artificially generated clicks. Google’s AdSense program policies, for instance, explicitly prohibit publishers from clicking their own ads or using any method to inflate clicks (including asking others to click). Such clicks are considered “invalid traffic.” If a publisher is found to be involved in click bombing – even if they are a victim, Google’s systems might not always distinguish – the consequences are usually swift and severe. The account can be suspended or permanently banned from the ad network, and any accrued earnings from invalid clicks will not be paid out. Advertisers on Google Ads (AdWords) are also protected by policies: Google will not charge them for clicks deemed invalid, and repeatedly exploiting the system (like an advertiser clicking a competitor’s ads) could result in the offender’s account being suspended as well. Ethically, click bombing is viewed as a deceptive, bad-faith practice. It violates the trust that underpins online advertising. Ad networks have teams and automated systems to detect fraud, and they actively encourage reporting of any suspicious activity. In the digital advertising industry, engaging in click fraud is a quick way to get blacklisted.</p>

<p>Civil Litigation and Liability: Victims of click bombing – whether advertisers or publishers – sometimes resort to legal action to seek damages or injunctions. We’ve seen examples in Section 4 where companies sued competitors for alleged click bombing. While success in such lawsuits can be challenging (proving definitively who performed the clicks is not trivial), courts are increasingly recognizing click fraud as a genuine harm. In some cases, even if law enforcement isn’t involved, a civil suit for tortious interference or unfair competition might be possible if you can show a business intentionally harmed you via click bombing. Conversely, if a business owner attempted to use click bombing to hurt a rival or to defraud an ad network, they could be sued by the affected parties. Ethically, this is a clear line: using fraudulent clicks to harm competitors or to pump up your own revenue is widely condemned and can ruin a company’s reputation if exposed. No legitimate business wants to be known for cheating the system.</p>

<p>Accountability of Platforms: Ethically, ad networks have a responsibility to minimize fraud on their platforms. Google, Facebook, and others often publish transparency reports and invest in anti-fraud tech to reassure advertisers that their money isn’t being wasted. After the 2006 class-action settlement, Google affirmed it had “a large team of engineers and analysts” devoted to tackling invalid clicks and that most fake clicks are filtered out before they ever bill the advertiser. This ongoing effort is an ethical commitment to keep the ad ecosystem fair. If platforms were to ignore click bombing, they could be seen as complicit in the fraud. Regulators and industry groups (like the Interactive Advertising Bureau) also push for standards and auditing to keep click fraud under control.</p>

<p>In summary, click bombing is both illegal and unethical. While a person furiously clicking a competitor’s ad may not immediately think of it as a crime, in principle it’s no different from vandalizing a competitor’s store – it’s sabotage. Laws are catching up to prosecute more of these cases, especially big offenders. And even without a court case, the immediate enforcement by ad networks (account bans, withholding of revenue, refunds to victims) serves as a strong deterrent. Anyone tempted to engage in click bombing should know that the potential short-term “gain” (if any) is far outweighed by the risks of lawsuits, loss of business relationships, and long-term damage to one’s credibility. The ethical route – fair competition and honest advertising practices – is the only sustainable one in the digital marketplace.</p>

<h2 id="conclusion">Conclusion</h2>

<p>As we've explored throughout this article, click bombing represents a significant threat in the digital advertising ecosystem, affecting everyone from small website owners to enterprise organizations. While the challenge is real, the good news is that the defense mechanisms are evolving just as rapidly as the attack methodologies.</p>

<h3 id="keytakeawaysforeffectiveprotection">Key Takeaways for Effective Protection</h3>

<p>The difference between devastation and resilience often comes down to how prepared you are before an attack occurs. Here's what the most successful defenders understand:</p>

<ul>
<li>Defense in Depth is Non-Negotiable: Like any security strategy, relying on a single protection method is a recipe for failure. The most resilient organizations implement multiple layers of defense—from basic WordPress plugins and IP filtering to sophisticated edge computing solutions. Each layer catches what the previous might miss.</li>
<li>The Surveillance-Response Loop Must Be Tight: In our analysis of the February 2025 Lambda@Edge deployments, we saw how organizations that could respond within minutes rather than hours reduced their financial exposure dramatically. Setting up automated alerting and having predefined response procedures transforms click bombing from a catastrophe to a manageable incident.</li>
<li>Edge Computing Changes the Game: The shift from origin-based to edge-based protection represents perhaps the most significant advancement in click fraud prevention. By analyzing traffic patterns at the network edge, you're essentially stopping the boxer's punch before it extends fully rather than just putting up your guard.</li>
<li>Behavior Analysis Trumps Identity Verification: As attackers become more sophisticated in spoofing legitimate users, the most effective detection methods increasingly focus on behavioral patterns rather than identity markers. The subtle rhythm of human interaction with content creates patterns that even advanced bots struggle to replicate perfectly.</li>
<li>Cost-Benefit Math Favors Protection: Many site owners hesitate to invest in advanced click fraud protection, viewing it as an optional expense rather than essential infrastructure. Yet the math is clear: the mid-sized publisher who lost $150,000 to a click bombing attack would have spent less than 5% of that amount on robust protection systems.</li>
</ul>

<h2 id="thepathforward">The Path Forward</h2>

<p>If there's one lesson that stands out from our analysis of both attack methods and protection strategies, it's that click bombing is fundamentally an asymmetric threat. Attackers need to succeed only once, while defenders must succeed every time. This imbalance means that protection cannot be static—it must evolve continuously.</p>

<p>For WordPress site owners, this means regular updates to security plugins and periodic reassessment of traffic patterns. For enterprise organizations, it means investing in cloud-native protection that scales with your traffic and adapts to emerging threats.</p>

<p>Perhaps most importantly, protection against click bombing isn't just technical—it's cultural. Organizations that foster a security-minded approach to digital advertising, where unusual metrics trigger immediate investigation rather than celebration, consistently outperform their peers in preventing and mitigating attacks.</p>

<p>The battlefield of click fraud will continue to evolve, but by implementing the multi-layered approach we've outlined—from basic filtering to advanced edge computing solutions—you can ensure that your organization stays one step ahead in this costly digital arms race.</p>

<p>After all, in the world of click bombing, the best victory isn't winning the battle—it's making your organization such a difficult target that attackers simply move on to easier prey.</p>]]></content:encoded></item><item><title><![CDATA[From Zero to SFTP: Building a Modern Gateway for Kubernetes]]></title><description><![CDATA[<h1 id="fromzerotosftpbuildingamoderngatewayforkubernetes">From Zero to SFTP: Building a Modern Gateway for Kubernetes</h1>

<p>Ever tried explaining SFTP to a cloud-native developer? It’s like describing a fax machine to a teenager. Yet here we are in 2025, and SFTP is still a requirement in countless enterprise environments. Whether it’s financial reports, healthcare</p>]]></description><link>http://andypotanin.com/sftp-in-cloud/</link><guid isPermaLink="false">20144366-1fe0-4b48-9023-766c7eda0273</guid><category><![CDATA[cloud]]></category><category><![CDATA[devsecops]]></category><category><![CDATA[sftp]]></category><category><![CDATA[k8]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Tue, 21 Jan 2025 11:46:48 GMT</pubDate><content:encoded><![CDATA[<h1 id="fromzerotosftpbuildingamoderngatewayforkubernetes">From Zero to SFTP: Building a Modern Gateway for Kubernetes</h1>

<p>Ever tried explaining SFTP to a cloud-native developer? It’s like describing a fax machine to a teenager. Yet here we are in 2025, and SFTP is still a requirement in countless enterprise environments. Whether it’s financial reports, healthcare records, or vendor agreements, SFTP refuses to fade away. And honestly, that’s not entirely a bad thing—it works. It’s secure, reliable, and familiar. But let’s be real: it wasn’t built for Kubernetes.  </p>

<p>When we were asked to integrate SFTP into a modern Kubernetes cluster, it felt like a collision of two worlds. Kubernetes thrives on stateless, scalable workloads, while SFTP is inherently stateful and dependent on persistent user management. It felt awkward, clunky, and incompatible. But instead of rejecting the challenge, we decided to embrace it and reimagine what SFTP could look like in a cloud-native world. The result was a lightweight SFTP gateway that seamlessly integrates with Kubernetes while modernizing authentication, security, and storage in ways that make it a joy to use.  </p>

<hr>

<h3 id="whyissftpstillaroundandwhyshouldyoucare">Why Is SFTP Still Around, and Why Should You Care?</h3>

<p>Think about your typical Kubernetes environment: applications are containerized, scaling is seamless, and CI/CD pipelines handle everything in an automated flow. Then, someone says, “We need SFTP access to share files.” Your first thought might be to argue for a more modern solution, like S3 or a REST API. But no matter how reasonable your suggestions sound, the request stands firm: SFTP is required.  </p>

<p>The problem isn’t just that SFTP feels outdated. It’s that it wasn’t built for Kubernetes. Traditional SFTP servers are heavyweight and stateful, requiring persistent storage, manual key management, and user account provisioning. It’s a tedious process that feels out of sync with everything Kubernetes stands for. But the truth is, SFTP survives because it’s simple and reliable. For industries like finance and healthcare, where compliance and regulation demand secure, trackable file transfers, SFTP remains the tool of choice. The question isn’t why it’s still here—the question is how to make it work in modern environments.  </p>

<h3 id="howcanyoumakesftpfeelrightathomeinkubernetes">How Can You Make SFTP Feel Right at Home in Kubernetes?</h3>

<p>Our answer was <strong>k8-container-gate</strong>, a Kubernetes-native SFTP gateway designed to handle these challenges with minimal friction. At its core, it leverages GitHub SSH keys for authentication. Developers already manage their SSH keys in GitHub, so why not use that system to handle access? When a user connects, the gateway fetches their public keys directly from GitHub, validates their team or organizational membership, and grants access dynamically. There’s no need to create accounts, distribute keys manually, or clean up stale credentials. If someone leaves your team, their access disappears as soon as they’re removed from GitHub.  </p>
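The key-fetching step is simpler than it sounds, because GitHub publishes every user's public SSH keys as plain text at <code>https://github.com/&lt;username&gt;.keys</code>. A sketch of that lookup, suitable for wiring into an <code>AuthorizedKeysCommand</code>-style hook (the team/organization membership check via the GitHub API is omitted here for brevity):

```python
from urllib.request import urlopen

def fetch_github_keys(username):
    """Fetch a user's public SSH keys from GitHub (one key per line)."""
    with urlopen(f"https://github.com/{username}.keys") as resp:
        return resp.read().decode()

def authorized_keys(raw):
    """Keep only non-empty key lines, ready to hand back to sshd."""
    return [line.strip() for line in raw.splitlines() if line.strip()]
```

Because nothing is stored locally, revoking access is exactly as described above: remove the user from GitHub and the next lookup returns nothing.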

<p>Another common pain point is managing file storage. Kubernetes wasn’t built for traditional filesystems, and trying to shoehorn stateful workloads into a stateless environment often creates more problems than it solves. Instead of relying on persistent volumes, we decoupled storage entirely. Files uploaded via SFTP are synced to object storage like S3 or GCS. This keeps the SFTP gateway lightweight and stateless while ensuring files are stored securely and scalably.  </p>

<p>Security was another area we approached with care. Traditional SFTP servers often rely on password authentication, which is a nonstarter for modern systems. Instead, we enforce key-based authentication, disable passwords entirely, and isolate each user in a chroot jail. Every file transfer is logged for compliance, making it easy to audit activity. These measures ensure the gateway not only meets modern security standards but also satisfies the strict requirements of industries like finance and healthcare.  </p>

<p>What makes this approach so powerful is its simplicity. SFTP users don’t want complex dashboards or bloated features. They want to upload and download files quickly and securely. By focusing on this core need and letting Kubernetes handle scaling, updates, and failover, we created a solution that just works.  </p>

<h3 id="whatdoesamodernsftpworkflowlooklikeinpractice">What Does a Modern SFTP Workflow Look Like in Practice?</h3>

<p>The results have been transformative. Imagine a healthcare company sharing sensitive patient records with an external vendor. Traditionally, this would involve setting up a dedicated SFTP server, creating user accounts, distributing keys, and managing permissions manually. With <strong>k8-container-gate</strong>, the workflow is seamless. The vendor provides their GitHub username, logs in with their existing SSH key, and uploads files directly. No manual intervention. No tedious setup. And when the project ends, their access is revoked automatically, leaving nothing behind to clean up.  </p>

<p>Reflecting on this project, a few lessons stand out. First, it’s clear that legacy protocols like SFTP aren’t going away, and that’s okay. The key isn’t to replace them but to modernize how they’re used. Second, simplicity is underrated. We could have added features like file compression or versioning, but focusing on the essentials—secure file transfer—made the gateway easier to build, deploy, and use. Finally, leveraging existing tools like GitHub turned out to be a game changer. By integrating with a system users already know, we eliminated complexity and created a solution that feels intuitive.  </p>

<p>Modernizing SFTP wasn’t about reinventing the wheel—it was about finding a way to make an old wheel spin smoothly in a modern machine. The end result isn’t just a better way to do SFTP. It’s a reminder that good engineering isn’t about chasing the newest tools; it’s about solving problems in ways that are elegant, practical, and sustainable.  </p>

<p>SFTP may be a relic, but when combined with the right tools, it can thrive in a cloud-native world. So the next time someone asks for SFTP in Kubernetes, don’t roll your eyes. Smile. You’ve got the perfect solution.  </p>]]></content:encoded></item><item><title><![CDATA[Navigating From Windows to the Cloud]]></title><description><![CDATA[Transitioning from traditional Windows environments to modern cloud ecosystems involves understanding parallels between the two.]]></description><link>http://andypotanin.com/windows-to-cloud/</link><guid isPermaLink="false">df008bd8-1b4a-4b2e-b190-a5874ab99ad7</guid><category><![CDATA[cloud]]></category><category><![CDATA[windows]]></category><category><![CDATA[ku8]]></category><category><![CDATA[images]]></category><category><![CDATA[devsecops]]></category><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Tue, 23 Jul 2024 16:38:33 GMT</pubDate><content:encoded><![CDATA[<p>On July 19, 2024, a significant incident underscored the complexity of modern IT systems. CrowdStrike, aiming to enhance their Falcon platform’s security, released an update that caused the dreaded Blue Screen of Death (BSOD) on Windows machines globally. Banks froze, flights were delayed, and hospitals faced chaos. This event highlighted the critical need for robust testing and disaster recovery plans.</p>

<p>Transitioning from traditional Windows environments to modern cloud ecosystems involves understanding parallels between the two. This journey explores how core components from Windows translate to their cloud-native counterparts, making the shift more intuitive for professionals accustomed to legacy systems.</p>

<table>
<tr><th>Windows</th><th>Cloud</th><th>Notes</th></tr>
<tr><td>Driver</td><td>Container</td><td>Encapsulates software and dependencies. Ensures consistent operation across environments. Lightweight and portable.</td></tr>
<tr><td>DLL</td><td>Image</td><td>Contains application code and dependencies. Provides snapshots of virtual machines or containers. Facilitates consistent application execution.</td></tr>
<tr><td>Kernel</td><td>Kubernetes</td><td>Manages container orchestration and resource allocation. Ensures smooth operation of distributed systems. Functions as the control plane of cloud environments.</td></tr>
<tr><td>Domain</td><td>Namespace</td><td>Groups and manages resources within a cluster. Provides isolation and organization for resources. Similar to domains in Windows for centralized management.</td></tr>
<tr><td>INI File</td><td>YAML</td><td>Human-readable format for configuration settings. Defines infrastructure and application configurations. Extensively used in Kubernetes for deployment and management.</td></tr>
</table>

<h4 id="thewindowsfoundation">The Windows Foundation</h4>

<p>To understand the leap from Windows to the cloud, let’s revisit some foundational concepts of the Windows environment. In Windows, drivers, DLLs, the kernel, domains, and INI files form the backbone of system operations and management.</p>

<ul>
<li><strong>Driver:</strong> In Windows, drivers are crucial components that allow the operating system to communicate with hardware devices. These drivers ensure that your hardware works correctly with your software applications.</li>
<li><strong>DLL (Dynamic-Link Library):</strong> DLLs are files that contain code and data used by multiple programs simultaneously. This shared code can perform various functions, helping different applications execute tasks without the need for each application to have its own copy.</li>
<li><strong>Kernel:</strong> The kernel is the core part of the operating system, managing system resources and communication between hardware and software components. It operates in a highly privileged mode, handling critical tasks such as memory management and process scheduling.</li>
<li><strong>Domain:</strong> A domain in Windows networks is a collection of computers and devices that are administered as a unit with common rules and procedures. Domains allow for centralized management of user accounts and resources.</li>
<li><strong>INI Files:</strong> INI files are simple text files used for configuration settings in Windows applications. They store initialization information, providing a way for software to store and retrieve settings.</li>
</ul>

<h4 id="thecrowdstrikeincidentadeepdive">The CrowdStrike Incident: A Deep Dive</h4>

<p>The CrowdStrike update contained a logic error that clashed spectacularly with Windows. Despite the swift fix, the incident exposed the fragility of interconnected systems. The Falcon sensor, a security product analyzing application behavior to detect new attacks, operates deep in the kernel—ring zero of the CPU. This privileged position allows it to access system data structures and services, but also means that any flaw can cause widespread system crashes.</p>

<p><strong>Understanding Kernel Mode and User Mode:</strong>
- <strong>Kernel Mode:</strong> Operates with high privilege, managing core system functions and direct hardware access. Crashes in kernel mode result in complete system failures, often leading to blue screens.
- <strong>User Mode:</strong> Runs applications with limited privileges, isolated from critical system functions. Crashes in user mode affect only the application, not the entire system.</p>

<p><strong>Kernel Mode Operations:</strong></p>

<ul>
<li>The operating system kernel uses a ring system to separate code execution levels. Kernel mode operates at ring zero, the most privileged level.</li>
<li>Kernel tasks include hardware communication, memory management, and thread scheduling. Applications running in user mode request services from the kernel, which validates and executes these requests.</li>
</ul>
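<p>That request-and-validate handoff is visible from any language. In this Python sketch, each <code>os</code> call is a thin wrapper over a system call — the process traps from user mode into kernel mode, the kernel validates and performs the operation, then returns control:</p>

```python
import os
import tempfile

# Each os call below is a thin user-mode wrapper over a system call:
# the process traps into kernel mode (ring zero), the kernel validates
# the request, performs it, and returns the result to user mode.
fd, path = tempfile.mkstemp()           # open(2): kernel allocates a file descriptor
os.write(fd, b"hello from user mode")   # write(2): kernel copies our buffer out
os.lseek(fd, 0, os.SEEK_SET)            # lseek(2): kernel moves the file offset
data = os.read(fd, 100)                 # read(2): kernel copies file bytes back to us
os.close(fd)                            # close(2): kernel releases the descriptor
os.unlink(path)                         # unlink(2): kernel removes the temp file

print(data)
```

If any of those requests were malformed, the kernel would reject them and the program alone would fail — the isolation the Falcon sensor, running at ring zero, did not have.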

<p>At Microsoft, handling crashes was part of everyday life. Developers ran stress tests on machines to identify and fix bugs. Anti-stress processes and debugging tools were employed to ensure system stability. This rigorous testing culture is essential for any software operating in kernel mode, where failures can be catastrophic.</p>

<h4 id="thecloudparadigm">The Cloud Paradigm</h4>

<p>Transitioning to the cloud involves shifting these familiar components to their cloud-native counterparts. This transformation, while significant, can be made smoother by drawing parallels between the two environments.</p>

<ul>
<li><strong>Container (Driver):</strong> In the cloud, containers replace drivers by encapsulating software and its dependencies in a lightweight, portable format. Containers ensure that applications run reliably across different computing environments.</li>
<li><strong>Image (DLL):</strong> Cloud images serve a role similar to DLLs, providing a snapshot of a virtual machine or container that includes the operating system, application code, libraries, and dependencies needed to run an application.</li>
<li><strong>Kubernetes (Kernel):</strong> Kubernetes functions like the kernel of the cloud, orchestrating containers, managing resources, and ensuring the smooth operation of applications across distributed systems.</li>
<li><strong>Namespace (Domain):</strong> In cloud environments, namespaces offer a way to group and manage resources, similar to domains in Windows. They provide isolation and organization for resources within a cluster.</li>
<li><strong>YAML (INI Files):</strong> YAML files replace INI files in cloud configurations, offering a human-readable format to define the settings and infrastructure as code. YAML files are used extensively in defining Kubernetes deployments and configurations.</li>
</ul>
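<p>To make the INI-to-YAML parallel concrete, here is the shape of a Kubernetes Deployment once its YAML is parsed — shown as nested Python data, since that is what any YAML loader produces. The names (<code>web</code>, <code>staging</code>, the image tag) are illustrative, but the fields mirror the standard Deployment layout:</p>

```python
# A Kubernetes Deployment as its YAML would parse into nested data.
# Names here are illustrative, not tied to any real cluster.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "staging"},
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "containers": [
                    {"name": "web", "image": "nginx:1.25"},
                ],
            },
        },
    },
}

# Unlike a flat INI [section]/key layout, this structure nests
# arbitrarily deep -- exactly what YAML was adopted to express.
print(deployment["spec"]["replicas"])
print(deployment["metadata"]["namespace"])
```

Where an INI file gives you one level of sections, a Kubernetes manifest routinely nests five or six levels — which is why YAML, not INI, became the configuration lingua franca of the cloud.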

<h4 id="bestpractices">Best Practices</h4>

<p>Ensuring security and stability in cloud environments requires adherence to best practices, similar to those in traditional systems.</p>

<ul>
<li><strong>Validation and Error Checking:</strong> Proper validation and error checking are crucial for code running in privileged modes, whether in traditional drivers or cloud containers.</li>
<li><strong>Signed Code:</strong> Ensuring that all code is signed and verified helps prevent unauthorized or malicious code execution.</li>
<li><strong>Rigorous Testing:</strong> Comprehensive testing, akin to the anti-stress tests in Windows environments, is essential to identify and address potential issues before deployment.</li>
<li><strong>Orchestration and Management:</strong> Effective orchestration of resources, whether through the Windows kernel or Kubernetes, ensures optimal performance and resilience.</li>
</ul>

<h4 id="lessons">Lessons</h4>

<p>The CrowdStrike incident underscores the importance of automation, scalability, and security in cloud environments. Here are key takeaways:</p>

<p><strong>Automation:</strong></p>

<ul>
<li><strong>CI/CD Pipelines:</strong> Automate deployments to reduce errors and ensure consistency. Continuous Integration and Continuous Deployment (CI/CD) pipelines streamline the process of integrating code changes and deploying them to production.</li>
</ul>

<p><strong>Scalability:</strong></p>

<ul>
<li><strong>Horizontal Scaling:</strong> Design applications to scale horizontally, allowing for efficient resource use. This approach ensures that your system can handle increased load by adding more instances of applications.</li>
</ul>
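<p>Horizontal scaling is easy to picture as code. A minimal Python sketch of round-robin load distribution — scaling out means nothing more than appending another instance to the pool (instance names are made up):</p>

```python
from itertools import cycle

# A pool of identical application instances.
# Scaling out horizontally = appending one more name to this list.
instances = ["app-1", "app-2", "app-3"]

# A round-robin balancer spreads incoming requests evenly across the pool.
balancer = cycle(instances)

assignments = [next(balancer) for _ in range(6)]
print(assignments)
```

Real load balancers add health checks and weighting, but the core idea is the same: capacity grows by adding interchangeable instances, not by making one machine bigger.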

<p><strong>Security:</strong></p>

<ul>
<li><strong>Integrated Security:</strong> Incorporate security from the start, rather than as an afterthought. Security measures should be built into the development process, ensuring that applications are protected from vulnerabilities from the outset.</li>
</ul>

<p><strong>Monitoring:</strong></p>

<ul>
<li><strong>Real-Time Monitoring:</strong> Utilize real-time monitoring tools to maintain system performance and quickly address issues. Monitoring helps detect problems early and provides insights into system health.</li>
</ul>

<h4 id="takeaways">Takeaways</h4>

<p>Reflecting on the journey from the CrowdStrike fiasco to mastering cloud concepts, it’s clear that the transition from a Windows sysadmin to a cloud professional is both achievable and rewarding. By leveraging your existing skills and drawing parallels between familiar and new technologies, you can navigate this new landscape with confidence and resilience.</p>

<ul>
<li><strong>Adaptability:</strong> Embrace new technologies by understanding their roots in familiar concepts.</li>
<li><strong>Security:</strong> Prioritize security through validation, error checking, and signed code.</li>
<li><strong>Orchestration:</strong> Ensure efficient resource management and resilience through robust orchestration frameworks.</li>
<li><strong>Configuration:</strong> Utilize clear, manageable configurations to facilitate smooth operations and adaptability.</li>
</ul>

<p>By understanding the parallels and applying best practices, professionals can confidently navigate the transition from traditional operating systems to the dynamic world of cloud computing, leveraging their expertise to thrive in the modern technological landscape.</p>

<p>The future is bright, and with the right knowledge and tools, you’ve got this.</p>

<hr>

<h3 id="sources">Sources</h3>

<ul>
<li>Explanation based on insights from a retired Microsoft software engineer, Dave Plummer. Watch his detailed explanation on the CrowdStrike incident and its implications on <a href="https://www.youtube.com/watch?v=wAzEJxOo1ts">YouTube</a>.</li>
</ul>]]></content:encoded></item><item><title><![CDATA[Proactive Leadership: A Marine Sergeant's Data-Driven Approach]]></title><description><![CDATA[Back in 2008, as a Sergeant in the Marines I posted an Excel printout with our department's intelligence scores, in the hallways, for all to see.]]></description><link>http://andypotanin.com/marine-metrics/</link><guid isPermaLink="false">7f20721b-7832-4f99-aee4-27f143bdc8b1</guid><dc:creator><![CDATA[Andy]]></dc:creator><pubDate>Mon, 01 Jul 2024 03:39:42 GMT</pubDate><content:encoded><![CDATA[<p><strong>"You can have results or excuses. Not both." — Arnold Schwarzenegger</strong></p>

<p>Back in 2008, as a Sergeant in the Marines and Data Chief of 2nd Recon, stationed in beautiful eastern North Carolina — I decided to shake things up. I printed our department's intelligence scores from Excel and hung them in the hallways, everyone's name and rank clearly annotated next to a cell-phone-style signal-bar icon depicting their intelligence.</p>

<p>This wasn't just a prank — it was a genuine attempt to crowdsource training ideas, or perhaps a little social experiment. Either way, I found it fascinating that the military had assigned us numbers that were surprisingly accurate.</p>

<p>The military essentially invented IQ testing during WWI and has perfected it ever since. If you had to quickly assign a million random Americans to the best jobs to minimize casualties and ensure victory, how would you do it? The military figured it out.</p>

<p><img src="https://pplx-res.cloudinary.com/image/upload/pplx_search_images/713003c2795314b9418bd023406eeb2da245c242.jpg" alt="2nd Recon Battalion Marines at Camp Lejeune"></p>

<p>The GT score, derived from the ASVAB (Armed Services Vocational Aptitude Battery), is a composite score that measures verbal and arithmetic reasoning skills. It's critical in the military, where it determines the roles a Marine is suited for:</p>

<ul>
<li><strong>80+</strong>: Trusted to drive vehicles (Basic Infantryman, GT: 80)</li>
<li><strong>90+</strong>: Handling artillery (Field Artillery Cannoneer, GT: 90)</li>
<li><strong>100+</strong>: Trusted with explosives and machine guns (Infantry Assault Marine, GT: 100; Scout Sniper, GT: 100)</li>
<li><strong>110+</strong>: Qualified for counter-intelligence and cyber roles (Counterintelligence Specialist, GT: 110; Cyber Network Operator, GT: 110)</li>
<li><strong>115+</strong>: Trusted to call in airstrikes (MAGTF Planning Specialist, GT: 110+)</li>
</ul>
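<p>The cutoffs above are simple enough to encode. A small Python sketch — thresholds taken from the list, role descriptions shortened — that returns everything a given GT score qualifies a Marine for:</p>

```python
# GT cutoffs from the list above, lowest to highest.
GT_CUTOFFS = [
    (80, "drive vehicles"),
    (90, "handle artillery"),
    (100, "explosives and machine guns"),
    (110, "counter-intelligence and cyber roles"),
    (115, "calling in airstrikes"),
]

def qualified_roles(gt_score):
    """Return every role band whose cutoff the given GT score meets."""
    return [role for cutoff, role in GT_CUTOFFS if gt_score >= cutoff]

print(qualified_roles(95))   # meets the 80 and 90 cutoffs
print(qualified_roles(120))  # meets all five cutoffs
```

One number, one lookup, and a million recruits a year sorted into jobs — the appeal to a planner is obvious.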

<p>Think of the GT score as a quick way to gauge a Marine's potential for learning complex tasks and following instructions effectively. While private sector employers can't give IQ tests, the military uses the ASVAB extensively. Over a million people take it every year, making it one of the most used pre-employment tests in the country.</p>

<p><img src="https://pplx-res.cloudinary.com/image/upload/pplx_search_images/36e32ea446ed41505f654b4ce130445eb9670d1d.jpg" alt="ASVAB Study Guide"></p>

<h3 id="inspiredbynightschoolandstatistics">Inspired by Night School and Statistics</h3>

<p>Taking a statistics class taught by a former DoD recruiting statistician sparked my interest. He shared studies about high school completion and enlistment success, emphasizing the DoD's effort to identify who is worth training. For those who didn't know, the military pays for night school while you're on active duty, a benefit that doesn't even touch the GI Bill. My Monday night class inspired me.</p>

<p>By Tuesday morning, after a six-mile run around Courthouse Bay with my platoon and an 850-calorie protein shake, I created and printed an Excel spreadsheet. I posted seven copies throughout the hallways of the Second Reconnaissance Battalion. I figured the data wasn't classified — after all, we often posted rosters with everyone's SSNs in the hallways.</p>

<p><img src="https://pplx-res.cloudinary.com/image/upload/pplx_search_images/e0200ec308bf0bfde7af6e333ce71452d997aa5e.jpg" alt="Marines Physical Training"></p>

<p>Initially I didn't give the initiative much more thought — we had so much stuff posted in the hallways that I moved on. But it quickly became a hot topic. By Wednesday, fellow Marines were discussing it, even our technology chief. The data-driven approach fostered camaraderie, and some Marines were inspired to enroll in college classes.</p>

<h3 id="positivereceptionandunexpectedinsights">Positive Reception and Unexpected Insights</h3>

<p>There were about 60 people in total, all well sorted by their GT scores, and very clear patterns emerged immediately. First of all, the people I worked with the most were clustered around me — the two Marines I could rely on the most were within five points of my score. Interesting.</p>

<p>We had four different sections, each with different GT requirements. There were the radio operators, the technicians, and other smaller specializations. In other words, I had a perfect microcosm of America.</p>

<p><img src="https://pplx-res.cloudinary.com/image/upload/pplx_search_images/14e2b2041fb9776b722abe36a635364a3e93e913.jpg" alt="Marines Reconnaissance Unit Preparing for Operations"></p>

<ul>
<li><strong>GT &lt;85</strong>: Not trusted to drive vehicles</li>
<li><strong>GT 85–100</strong>: Capable but needed supervision for more complex tasks</li>
<li><strong>GT 100–120</strong>: Majority of Americans, no notable disparities here</li>
<li><strong>GT 120–135</strong>: These Marines got things done efficiently, as long as the system hadn't crushed their spirits</li>
<li><strong>GT >140</strong>: Split between being super smart and cool or super weird and distant</li>
</ul>

<p>Posting the data also revealed a negative correlation between rank and intelligence. The higher up the chain of command, the lower the GT scores seemed to drop. This struck a nerve with leadership and was likely one reason they weren't thrilled with my little experiment.</p>

<h3 id="theltsattempttonjp">The LT's Attempt to NJP</h3>

<p>By Thursday, my LT threatened to NJP me. For those unfamiliar, NJP stands for Non-Judicial Punishment — a disciplinary action used in the military for minor offenses. The LT didn't have anything solid on me. He was frustrated because my little experiment had boosted morale and highlighted a new way of looking at our team. Plus, NJP wasn't something he could just slap on me for trying to improve the unit's efficiency. It was the third time he had unsuccessfully tried to NJP me, reinforcing the correlation between intelligence and military precision.</p>

<h3 id="broaderimplicationsandlessonslearned">Broader Implications and Lessons Learned</h3>

<p>Sharing this knowledge aimed to improve training, self-awareness, and battle readiness. Admittedly, the cell phone bars analogy was a stretch, but it humorously grouped the data. The reaction from my chain of command indeed matched their "bars." But hey, all joking aside, these are people who are protecting the country, and maybe it's not the end of the world if everyone understands where everyone else is at in terms of mission, readiness, and success. People do get killed around here, and I won't have it be due to incompetence if we can prevent it. I'm in charge of America's youth and prepping them for war.</p>

<p><img src="https://pplx-res.cloudinary.com/image/upload/pplx_search_images/8de7641e217f4a45d482210326f0968f23af4def.jpg" alt="US Marines Patriotic Formation"></p>

<p>Ultimately my experiment concluded with a general consensus that intelligence is — in fact — measurable and furthermore useful. There is a reason that the DoD relies on it and it is odd that it's ignored by the private sector. The intelligence attribute can be useful and at times critical.</p>

<h3 id="applyingmilitarylessonstotheprivatesector">Applying Military Lessons to the Private Sector</h3>

<p>In 2024, while I was navigating the challenges of private equity and economic turbulence, my military training kicked in. These lessons became crucial when our AI taskforce was working with data from software teams and had daily telemetry on some 200 people in our department. Drawing on my experience as a leader of Marines, I made it clear that we would not let management see that we had such people metrics. The curve!</p>

<p><strong>Further Reading:</strong></p>

<ul>
<li><a href="https://books.google.com/books/about/Everything_is_Obvious.html?id=n531Hz9qtp4C">Everything is Obvious Once You Know the Answer — Duncan J. Watts</a></li>
<li><a href="https://books.google.com/books/about/The_Signal_and_the_Noise.html?id=udSFU9G49AcC">The Signal and the Noise: Why So Many Predictions Fail — But Some Don't — Nate Silver</a></li>
<li><a href="https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555">Thinking, Fast and Slow — Daniel Kahneman</a></li>
<li><a href="https://www.amazon.com/Talent-Overrated-Separates-World-Class-Performers/dp/1591842948">Talent Is Overrated: What Really Separates World-Class Performers from Everybody Else — Geoff Colvin</a></li>
</ul>

<p><em>To my fellow Marines, all in good fun, science, equality, and meritocracy. Peace, rah.</em></p>]]></content:encoded></item></channel></rss>