Polar Squad Blog

Check your Kubernetes deployments! — Polar Squad

Polar Squad — Wed, 03 Jun 2026 11:11:09 GMT

Here’s a post about deploying applications to Kubernetes and associated things to take into account. This post was originally published in 2019, but it is still good stuff today. If you encounter something that should be updated, please let us know!

—

When writing and setting up software, it’s natural for us to focus on just the happy path. After all, that’s the path that everyone wants. Unfortunately, software can fail quite often, so we need to give the unhappy paths some attention as well.

Kubernetes is no exception here. When deploying software to Kubernetes, it’s easy to focus on the happy path without properly checking that everything went as expected. In this article, I’ll talk about what is typically missing when deploying applications to Kubernetes, and demonstrate how to improve it.

Typical flow for deploying applications to Kubernetes

In Kubernetes, most service-style applications are run using Deployments. Using Deployments, you can describe how to run your application container as a Pod in Kubernetes and how many replicas of the application to run. Kubernetes will then take care of running as many replicas as specified.

Here’s an example deployment manifest in YAML format for running three instances of a simple hello world web app:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - image: polarsquad/hello-world-app:master
          name: hello-world
          ports:
            - containerPort: 3000

One of the key features of Deployments is how they manage application updates. By default, updating the Deployment manifest in Kubernetes causes the application to be updated in a rolling fashion. This way you’ll have the previous version of the deployment running while the new one is brought up. In the Deployment manifest, you can specify how many replicas to bring up and down at once during updates.

For example, we can add a rolling update strategy to the spec section of the manifest where we bring one replica up at a time, and make sure there are no missing healthy replicas at any point during the upgrade.

spec:
 strategy:
   type: RollingUpdate
   rollingUpdate:
     maxUnavailable: 0
     maxSurge: 1
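
To make the effect of these two settings concrete, here is a small TypeScript sketch that computes the Pod-count bounds they imply during an update. The helper names (resolve, rolloutBounds) are made up for illustration and are not part of Kubernetes or kubectl. It assumes the documented rounding for percentage values: maxSurge rounds up and maxUnavailable rounds down.

```typescript
// Hypothetical helpers for illustration; not part of Kubernetes or kubectl.
type IntOrPercent = number | `${number}%`;

// Turn an absolute count or a percentage of the desired replicas into a count.
function resolve(value: IntOrPercent, replicas: number, roundUp: boolean): number {
  if (typeof value === "number") return value;
  const fraction = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(fraction) : Math.floor(fraction);
}

function rolloutBounds(
  replicas: number,
  maxSurge: IntOrPercent,
  maxUnavailable: IntOrPercent,
): { maxTotalPods: number; minAvailablePods: number } {
  const surge = resolve(maxSurge, replicas, true); // percentages round up
  const unavailable = resolve(maxUnavailable, replicas, false); // percentages round down
  return {
    maxTotalPods: replicas + surge, // upper bound on old + new Pods
    minAvailablePods: replicas - unavailable, // lower bound on available Pods
  };
}

// The strategy above: 3 replicas, maxSurge 1, maxUnavailable 0.
console.log(rolloutBounds(3, 1, 0)); // { maxTotalPods: 4, minAvailablePods: 3 }
```

With our settings, the update never drops below three available replicas and never runs more than four Pods at once.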

The update is usually performed either by patching the manifest directly or by applying a full Deployment manifest from the file system. From Kubernetes’ point of view, it makes no difference. If the contents of the manifest update are valid, then Kubernetes will happily accept the update. Most of the time, an application update contains little more than a change in the container image tag or in some of the environment variable configurations you might have.

To automate the process, you might choose to deploy your app in your CI pipeline using kubectl.

kubectl apply -f deployment.yaml

So now you have a pattern and a flow for getting your app to run on Kubernetes. Everything good, right? Unfortunately, no!

It’s a great start, but it’s usually not enough. The kubectl apply command returns as soon as Kubernetes has accepted the manifest, not when the rollout has finished. It does not verify that your application even starts. This deployment flow is demonstrated in the picture below.

In order to properly check that the update proceeds as expected, we need assistance from another kubectl command.

Rollout to the rescue!

This is where kubectl’s rollout command becomes handy! We can use it to check how our deployment is doing.

By default, the command waits until all of the Pods in the deployment have been started successfully. When the deployment succeeds, the command exits with return code zero to indicate success.

$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 1 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 2 of 3 updated replicas are available…
deployment "myapp" successfully rolled out

If the deployment fails, the command exits with a non-zero return code to indicate a failure.

If you’re already using kubectl to deploy applications from CI, using rollout to verify your deployment in CI will be a breeze. By running rollout directly after deploying changes, we can block the CI task from completing until the application deployment finishes. We can then use the return code from rollout to either pass or fail the CI task.

So far so good, but how does Kubernetes know when an application deployment succeeds?

Readiness probes and deadlines

In order for Kubernetes to know when an application is ready, it needs some help from the application. Kubernetes uses readiness probes to examine how the application is doing. Once an application instance starts responding to the readiness probe with a positive response, the instance is considered ready for use.

For web services, the simplest implementation is an HTTP GET endpoint that starts responding with a 200 OK status code when the server starts. In our hello world app, we could consider the app healthy when the index page can be loaded. Here’s the readiness probe configuration for our hello world app:

readinessProbe:
  httpGet:
    path: /
    port: 3000

A more sophisticated implementation of the health check might perform some background checks to verify that everything is ready for the application to serve requests, and serve that information through a dedicated health endpoint (e.g. /health or /ready). It’s up to the application developers to figure out when the application is ready, and how to respond back to probes.
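
As a rough illustration of such a dedicated endpoint, the TypeScript sketch below shows only the decision logic behind it: given the results of some background checks, which status code should the endpoint return? The check names (database, cache) and the function name are hypothetical, and wiring this into a real HTTP server is left out. It relies on HTTP probes treating status codes from 200 up to (but not including) 400 as success.

```typescript
// Hypothetical check names; a real app would probe its own dependencies.
type Checks = Record<string, boolean>;

interface ProbeResponse {
  status: number;
  body: { ready: boolean; failing: string[] };
}

function readinessResponse(checks: Checks): ProbeResponse {
  // Collect the names of all checks that did not pass.
  const failing = Object.keys(checks).filter((name) => !checks[name]);
  const ready = failing.length === 0;
  // 200 counts as probe success; 503 tells Kubernetes "not ready yet".
  return { status: ready ? 200 : 503, body: { ready, failing } };
}

console.log(readinessResponse({ database: true, cache: false }).status); // 503
```

Returning the list of failing checks in the body is a design choice that makes a stuck rollout easier to debug; Kubernetes itself only looks at the status code.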

Readiness probes tell Kubernetes when an application is ready, but not if the application will ever become ready. If the application keeps failing, it may never respond with a positive response to Kubernetes. How does Kubernetes then know when the deployment is going nowhere?

In our Deployment manifest, we can specify how long Kubernetes should wait for the deployment to progress before it considers the deployment to have failed. This is the progressDeadlineSeconds field in the Deployment spec. If the deployment makes no progress before the deadline expires, Kubernetes marks the deployment status as failed, which the rollout status command will be able to pick up.

$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 1 out of 3 new replicas have been updated…
error: deployment "myapp" exceeded its progress deadline

What makes the deadline fantastic is that if the deployment manages to proceed within the deadline, Kubernetes will reset the deadline timer, and start waiting again. This way you don’t have to estimate a deadline for the entire deployment, but just a single instance of the application.

For example, if we set a deadline of 30 seconds, Kubernetes will wait 30 seconds for the application to become ready. If the application becomes ready, Kubernetes will wait another 30 seconds for the next instance to become ready.
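
The resetting timer can be modelled in a few lines of TypeScript. This is a deliberately simplified model and not how the Deployment controller is actually implemented: it assumes we know the times at which the rollout made progress (for example, when each new replica became ready), with the last entry marking completion. The function name is hypothetical.

```typescript
// Simplified model for illustration; not the Deployment controller's real logic.
// progressAtSec: seconds since the rollout started at which progress was made,
// in ascending order. The last entry marks the rollout completing.
function exceedsProgressDeadline(progressAtSec: number[], deadlineSec: number): boolean {
  let lastProgress = 0;
  for (const t of progressAtSec) {
    if (t - lastProgress > deadlineSec) return true; // timer ran out before this event
    lastProgress = t; // progress resets the timer
  }
  return false;
}

// Three replicas become ready at 20 s, 45 s and 70 s with a 30 s deadline.
// The whole rollout takes 70 s, but no single wait exceeds 30 s.
console.log(exceedsProgressDeadline([20, 45, 70], 30)); // false

// The second replica needs 40 s, so the deadline is exceeded.
console.log(exceedsProgressDeadline([20, 60], 30)); // true
```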

Scripting automated rollback

Currently, when a deployment fails in Kubernetes, the deployment process stops, but the pods from the failed deployment are kept around. On deployment failure, your environment may contain pods from both the old and new deployments.

To get back to a stable, working state, we can use the rollout undo command to bring back the working pods and clean up the failed deployment.

$ kubectl rollout undo deployment myapp
deployment.extensions/myapp
$ kubectl rollout status deployment myapp
deployment "myapp" successfully rolled out

Awesome! Now that we have a way to determine when our deployments fail and how to revert the deployment, we can automate the deployment and rollback process with a simple shell script.

kubectl apply -f myapp.yaml
if ! kubectl rollout status deployment myapp; then
    kubectl rollout undo deployment myapp
    kubectl rollout status deployment myapp
    exit 1
fi

We first roll out the changes, and then immediately wait for the rollout status. If the rollout succeeds, we continue normally. If it fails, we undo the deployment, wait for undo to finish, and report back a failure with exit code 1. This flow is demonstrated in the picture below.

There is one major caveat not addressed in the script: kubectl commands may fail because of the network conditions! The script above doesn’t account for any connection failures, which means that the script may interpret a network failure in the rollout command as a failed deployment. Kubectl does retry retriable errors automatically, but it will fail eventually if the Kubernetes API is not available for a long period of time.
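
One way to soften this caveat is to roll back only when the failure is clearly a failed deployment, and to retry the status check otherwise. The TypeScript sketch below shows that decision logic under a loud assumption: it recognises a failed deployment by matching the "exceeded its progress deadline" message from the rollout status output shown earlier. Matching on message text is brittle, and the type and function names are hypothetical. How the exit code and stderr are captured from kubectl is left out.

```typescript
type RolloutOutcome = "succeeded" | "deployment-failed" | "unknown";
type NextStep = "continue" | "rollback" | "retry-status-check";

// exitCode and stderr are whatever the CI wrapper captured from
// `kubectl rollout status`; capturing them is outside this sketch.
function classifyRollout(exitCode: number, stderr: string): RolloutOutcome {
  if (exitCode === 0) return "succeeded";
  // Assumption: this message text identifies a genuinely failed deployment.
  if (stderr.includes("exceeded its progress deadline")) return "deployment-failed";
  return "unknown"; // possibly a network or API problem
}

function nextStep(outcome: RolloutOutcome): NextStep {
  switch (outcome) {
    case "succeeded":
      return "continue";
    case "deployment-failed":
      return "rollback";
    case "unknown":
      return "retry-status-check"; // do not roll back blindly
  }
}

console.log(nextStep(classifyRollout(1, 'error: deployment "myapp" exceeded its progress deadline'))); // rollback
```

A real pipeline would also cap the number of retries so that a long API outage still fails the CI task eventually.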

Conclusion

In this article, I’ve talked about the typical deployment flow used with service style applications in Kubernetes, and how it’s not enough to ensure safe deployments. I’ve presented a way to extend the deployment flow with status checks and an automated rollback procedure.

One area I haven’t covered is how the same checks and automated rollback are achieved in Helm. Helm deployments have their own additional quirks when it comes to detecting failed deployments. I’ve covered these quirks in this article.

I’ve published the code examples in a GitHub Gist. Thanks for reading!

Jaakko is the CTO of Polar Squad, a software developer turned SRE / DevOps Consultant. He is dedicated to bringing people together to deliver exactly what is needed in a rapid, reliable, and stress-free way. Given a bit of time, he’s probably able to solve all the problems.

Cloud cost and resource optimization: How Polar Squad can save you money while reducing your carbon footprint — Polar Squad

Polar Squad — Wed, 03 Jun 2026 10:52:09 GMT

In these challenging times, many businesses from the travel sector to supply chain and retail platforms are looking to reduce costs while ensuring their actions are sustainable.

Yet when it comes to tech, the actual cloud is often overlooked as a source of emissions – and major potential savings.

“You wouldn’t leave your faucets open when you leave home, or your car running when you’re not driving it, but many businesses treat their clouds like limitless resources,” says Polar Squad DevOps consultant Katariina (Kata) Vakkuri.

A recent peer-reviewed study suggests that ICT accounts for as much as 3.9 per cent of global emissions, exceeding commercial aviation, which is estimated at about 2.4 per cent.

“Perhaps the word ‘cloud’ conjures up something eco-friendly,” says Vakkuri, “but the reality is far from that. Major resources such as upkeep of buildings, heating and cooling go into keeping your data in the cloud. We need to educate people about the environmental impacts of data storage and offer viable solutions.”

DevOps solutions

Getting DevOps support pays for itself many times over, as one of its main goals is improving existing systems and structures.

“Instead of just building new systems, which is what many consultancies do, we often help to improve and optimize what already exists,” says Tero Kiminki, a senior Polar Squad DevOps consultant.

DevOps cloud optimization can save up to 10 per cent of costs a month, which can add up to a significant amount of money, according to Kiminki. Large companies’ monthly cloud costs can run to several hundred thousand euros.

One of the current challenges is that many businesses have outsourced data management and platform knowledge, says Kiminki. This means no one is responsible for staying on top of costs and sustainability issues, let alone basic management.

“As not all partners building solutions know how to get the most out of the cloud and there’s often a lack of knowledge about the cloud within an organization, the result can be non-optimal set-ups that waste time, money and the environment,” he says.

Polar Squad to the rescue

This is where Polar Squad steps in by helping clients optimize both old and new systems and get back their knowledge (in the form of data).

When companies have outsourced their knowledge, it’s not always a priority for a partner to question how the system has been built. It’s easy to take shortcuts and just follow through on someone else’s orders. Polar Squad helps clients to challenge solutions set up or proposed by others that may not be optimal or even in the client’s best interests.

“We also help clients to get forgotten or lost knowledge back and raise their awareness for the future so they can utilize cost and sustainable solutions from Day One going forward,” says Kiminki.

“Knowing what you’re running in the cloud and understanding your business also frees up resources for developers, who can then focus on developing and creating new things,” he adds.

When applications are done correctly from the beginning, it reduces the need for optimization work which can take away resources from future projects. According to Erno Aapa, Polar Squad co-founder and COO, when implementing FinOps (financial operations) for managing cloud costs, “it’s important not to kill innovation, for example, by not allowing testing of new services. When there’s good communication and all the costs are visible to the teams, everyone can make better financial choices.”

You can also learn more about optimising cloud infra costs with this quick checklist we compiled with our friends at NetApp.

—

If you think you might benefit from evaluating the possibilities to save costs in your cloud environment, contact us!

Craftsperson's Guide to GitHub Actions #3: Building and Releasing — Polar Squad

Niko Heikkilä — Wed, 07 Jan 2026 22:00:00 GMT

In the previous chapter, we gained confidence through comprehensive testing, including unit tests, property-based tests, and mutation tests. Our action quality is rock-solid.

But quality means nothing until the product reaches users. An action that only exists on your machine is just an expensive science experiment. In this final chapter, we'll build for production and create an automated release pipeline that safely delivers your action to the world.

Building Your Action

For JavaScript-based GitHub Actions, we need a build process. GitHub doesn't compile or bundle the code before execution, which means we must ship a ready-to-run artefact.

This means transpiling TypeScript to JavaScript, bundling all dependencies, and checking the build artefact into Git. Yes, you read that correctly. Unlike typical projects where you ignore your build artefacts from version control, GitHub Actions requires you to commit them to the repository.

Use whichever bundler you prefer. The example repository utilises Bun for its excellent bundling feature, but esbuild or Rolldown can also be used with similar results.

As discussed in Chapter 1, keep your action entry point separate from source files and tests. This separation makes configuring bundling easier.

The key is defining a reproducible build command that works identically everywhere — on your laptop, on your colleague's machine, in GitHub Actions. I use Taskfile for orchestration, but npm scripts, Makefiles, or shell scripts are equally valid choices. Pick your tool; just make it consistent.

build:
  desc: Build GitHub Action
  sources:
    - bin/**/*.ts
    - src/**/*.ts
  generates:
    - dist/index.js
  cmd: >
      bun build bin/index.ts
      --production
      --target node
      --outdir dist
      --format esm

This builds our TypeScript entry point into a minified Node.js script in modern ESM format.

Running task build produces a 510 KB bundle containing all dependencies. That might seem large for a simple ROT-13 action, but GitHub Actions runners download it in no time.

The hardest part is remembering to commit the build artefact. Being human — and thus forgetful — we automate this with a pre-commit hook using Husky.

task -p lint build
git add dist README.md

This hook runs linting and building in parallel, then stages both the dist directory and README.md.

Why the README? We generate action documentation during the build using action-docs, keeping documentation synchronised with code.

Building and committing is just the first step. Now we need to verify the action actually works and release it safely.

Trust, But Verify the Action

Our CI/CD pipeline runs the same tests you run locally: unit tests, property-based tests, and mutation tests. But that's not enough.

Despite covering a lot of ground with existing tests, we still need to verify the action works in its actual runtime environment: GitHub Actions. Let's create an acceptance testing workflow. It's verbose, so we'll break it down piece by piece.

name: Acceptance Tests
  
on:  
  pull_request:  
    branches: [main]  
  push:  
    branches: [main]

env:
  original: "Hello, World!"
  transformed: "Uryyb, Jbeyq!"

jobs:
  test-unit:
    # Unit, property-based, and mutation tests
    ...

  test-local-action:
    name: Test local action
    permissions:
      contents: read
    strategy:
      matrix:
        runner: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: $
    steps:
      - name: Checkout
        uses: actions/checkout@v5

      - name: Test with valid input
        uses: ./
        id: valid
        with:
          string: $

      - name: Fail if output is incorrect
        if: steps.valid.outputs.result != env.transformed
        run: |
          echo "::error::Expected result of transformation was '$', but got '$'"
          exit 1

      - name: Test with empty input
        uses: ./
        id: invalid
        continue-on-error: true
        with:
          string: ""

      - name: Fail if empty input succeeds
        if: steps.invalid.outcome != 'failure'
        run: |
          echo "::error::Expected action to fail when given empty input, but it succeeded."
          exit 1

Why testing matrices? While our action is platform-agnostic, many aren't. File system operations, for instance, often work similarly on Linux and macOS but break on Windows. Developing on a single platform while mocking the entire filesystem effectively hides these issues. Thus, we catch defects before our users do by testing across many platforms.

Since we haven't released the action yet, we use relative notation to reference the repository root. Remember to check out the repository first. GitHub Actions won't do it automatically.

We test both valid input (should succeed) and invalid input (should fail). The assertion steps use conditional execution. They only run when verification fails, resulting in the workflow failing.

Semantic Versioning Done Right

When the verification passes, it’s time to release. Unlike some release processes that feel like organising a conference, releasing GitHub Actions is refreshingly simple: we tag the verified commit and push.

GitHub Actions recommends semantic versioning with a simple twist. Instead of one, we publish three tags for each release:

Full version: v1.2.3 (patch-level precision)
Minor version: v1.2 (minor updates included)
Major version: v1 (the convenient default)

This approach lets users choose their comfort level. Want automatic updates? Use v1. Need stability? Pin to v1.2.3. The major version tag is what most users reference, and we keep it updated automatically.
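
To illustrate how the three tags relate to each other, here is a minimal TypeScript sketch that derives them from the full version. The function name releaseTags is hypothetical, and it assumes the tag always has the shape vMAJOR.MINOR.PATCH.

```typescript
// Hypothetical helper: derives the three release tags from the full version tag.
function releaseTags(fullVersion: string): [string, string, string] {
  const parts = fullVersion.split(".");
  if (parts.length !== 3) {
    throw new Error(`expected vMAJOR.MINOR.PATCH, got "${fullVersion}"`);
  }
  const [major, minor] = parts;
  // Full version, minor version, major version.
  return [fullVersion, `${major}.${minor}`, `${major}`];
}

console.log(releaseTags("v1.2.3")); // [ 'v1.2.3', 'v1.2', 'v1' ]
```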

Here's the workflow:

jobs:
 test-unit: ...
 test-local-action: ...
 
 release:
   name: Release
   if: github.event_name == 'push' && github.ref == 'refs/heads/main'
   needs: [test-unit, test-local-action]
   runs-on: ubuntu-latest
   permissions:
     contents: write
   
   steps:
     - name: Checkout
       uses: actions/checkout@v5  
       with:  
         fetch-depth: 0  
   
     - name: Determine next version  
       id: version  
       uses: mathieudutour/github-tag-action@v6.2  
       with:  
         github_token: $  
         default_bump: patch  
         create_annotated_tag: true
         dry_run: true
   
     - name: Release new version  
       if: steps.version.outputs.new_version != steps.version.outputs.previous_version  
       run: |  
         function push() {  
           local tag="$1"  
           git tag -fa "$tag" -m "Release $tag"  
           git push origin "$tag" --force 
         }  
           
         git config user.name "$USERNAME"  
         git config user.email "$EMAIL"  
           
         push "$TAG"  
         push "$(echo "$TAG" | cut -d . -f 1)"  
         push "$(echo "$TAG" | cut -d . -f 1-2)"  
           
         gh release create "$TAG" \
           --title "Release $TAG" \
           --notes "$CHANGELOG" \
           --verify-tag 
       env:
         USERNAME: github-actions[bot]  
         EMAIL: github-actions[bot]@users.noreply.github.com  
         GITHUB_TOKEN: $  
         TAG: $  
         CHANGELOG: $

We use mathieudutour/github-tag-action to parse the next version from Conventional Commit messages. It runs in dry-run mode to generate the version without actually pushing it. If your organisation bans external actions, you'll need to implement version logic with a custom action yourself.

The release step creates three tags and force-pushes them. Yes, force-pushing is usually considered extreme, but we're moving tag pointers, not rewriting history. This is safe in the pipeline, but don’t do it on your machine. The consequence of tag mutation is that other developers need to run git pull --force to sync updated tags.

We generate a basic changelog from commits. I don't endorse maintaining CHANGELOG.md files as they're often out of sync. Instead, create a commit log during release, and edit the release notes afterwards if needed.

Post-Release Verification: Test Like a User

The release is tagged and pushed. But we're not done yet. We need one final check: verifying the action works exactly as users will use it by referencing the released tag, not local files.

jobs:
 test-unit: ...
 test-local-action: ...
 release: ...

 test-tagged-action:
   name: Test tagged action  
   runs-on: ubuntu-latest
   needs: [release]  
   permissions:  
     contents: read  
   
   steps:  
     - name: Test happy case
       uses: nikoheikkila/rot-13-action@v1  
       id: valid  
       with:  
         string: $  
   
     - name: Test sad case
       uses: nikoheikkila/rot-13-action@v1  
       id: invalid  
       continue-on-error: true
       with:  
         string: ""

Notice the critical difference from earlier verification: we reference the action using nikoheikkila/rot-13-action@v1, not ./. This tests exactly what users will run.

If this job is successful, you can have high confidence that your release works correctly. It's not just tagged, but also verified.

Conclusion: Actions Are Software

When this series began, you might have viewed GitHub Actions as simple automation scripts too trivial for serious software engineering practices.

I hope you now see the light: GitHub Actions are software. They deserve the same software engineering rigour as any production system: clean architecture, comprehensive testing, automated verification, and safe delivery pipelines.

The investment pays off in reliability, maintainability, and confidence. Instead of "push and pray" development, you have a fast feedback loop that catches bugs before users do. Instead of fragile scripts that break mysteriously, you have well-tested components that adapt to change.

What You've Learned

Through this series, you've mastered:

Design: Separating business logic from infrastructure using dependency injection
Testing: Unit tests, property-based tests, and mutation testing for genuine confidence
Building: Creating reproducible production artefacts
Releasing: Semantic versioning with automated verification
Delivery: Safe deployment with post-release verification

Next Steps

Clone the example repository and use it as a foundation for your own actions. The code is production-ready, battle-tested, and follows the principles we've discussed.

Found something to improve? Submit an issue or pull request. All contributions are welcome. After all, continuous improvement is what software craftsmanship is all about.

Now go build something great. Your users will thank you for the reliability, and your future self will thank you for the maintainability.

Craftsperson's Guide to GitHub Actions #2: Scaling Up the Testing — Polar Squad

Niko Heikkilä — Tue, 09 Dec 2025 22:00:00 GMT

In the previous post, we improved the design of our GitHub Action to make it testable. But having testable code is just the beginning. We're still far from delivering a truly reliable solution to our users.

In this chapter, we'll enhance our testing approach with two powerful techniques that many developers overlook: property-based testing and mutation testing. These aren't just advanced techniques for the sake of it. They're practical tools that uncover bugs traditional testing misses.

Before diving into advanced testing techniques, let me introduce the example we'll use throughout this post.

The ROT-13 Action

Our example action is deliberately simple: it transforms an input string using ROT-13 and outputs the result. ROT-13 is a letter substitution cipher that replaces each letter with the letter 13 positions after it in the alphabet. Simple enough to understand, yet complex enough to demonstrate testing challenges.
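
For readers who have not met ROT-13 before, a minimal TypeScript sketch of the cipher follows. This is an illustration written for this explanation, not necessarily the implementation used in the example repository, and the function name rot13 is an assumption.

```typescript
// Illustrative ROT-13: rotates A-Z and a-z by 13 places, leaves everything else alone.
function rot13(text: string): string {
  return text.replace(/[A-Za-z]/g, (letter) => {
    const base = letter <= "Z" ? 65 : 97; // char code of "A" or "a"
    const rotated = (letter.charCodeAt(0) - base + 13) % 26; // wrap around the alphabet
    return String.fromCharCode(rotated + base);
  });
}

console.log(rot13("Hello, World!")); // Uryyb, Jbeyq!
```

Because the alphabet has 26 letters, rotating by 13 twice lands back on the original letter, so the same function both encodes and decodes.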

Using the action in a workflow is straightforward:

- name: ROT-13
  uses: nikoheikkila/rot-13-action@v1
  id: rot-13
  with:
    string: "Hello, World!"

We set a display name for the logs, assign an ID for accessing output in subsequent steps, and reference the action using its repository name and version tag.

When executed, the action logs its transformation:

▽ Run nikoheikkila/rot-13-action@v1
  with:
    string: Hello, World!
    
Hello, World! -> Uryyb, Jbeyq!

This represents the minimum functionality we need to test. However, as professional software engineers, we should strive for more than the bare minimum.

GitHub Actions also supports composite actions, which are essentially shell scripts split into multiple steps. While they can simplify long workflows, they're notoriously difficult to test correctly. I only recommend them for trivial use cases.

Let's start with practical unit tests.

Unit Testing: Fast Feedback Without Waiting

Yes, the ultimate test environment for GitHub Actions is GitHub Actions itself. But that doesn't mean we should test there exclusively.

As we discussed in the previous chapter, the key challenge is testing action logic without building and pushing code after every change. With proper design, testing becomes fast and even enjoyable.

Every non-trivial GitHub Action has a core where the business logic lives. You might be tempted to test only this core in isolation, but that's a mistake. We need to verify the complete behaviour from input to output.

I'm not talking about end-to-end tests that span multiple systems. Instead, think of this as behaviour-focused sociable testing. These tests verify how components work together to produce the expected behaviour.

We're not testing individual components in isolation. We're verifying complete behaviour from input to output while the components interact. Hence, we refer to it as sociable testing.

"But is this unit testing or integration testing?", you might ask.

If the question bothers you, I highly recommend Ted M. Young's article on why the distinction doesn't matter as much as you think. What matters is that our tests are fast, reliable, and verify the expected behaviour.

Here's what happy path tests look like for our action. They arrange the action with predefined input, execute it, assert the output matches expectations, and verify the log message. No spies or mocks needed. The tests are data-driven, readable, and blazingly fast:

it.each([
  ["A", "N"],
  ["M", "Z"],
  ["N", "A"],
  ["Z", "M"],
  ["a", "n"],
  ["m", "z"],
  ["n", "a"],
  ["z", "m"],
  ["HELLO", "URYYB"],
  ["WORLD", "JBEYQ"],
  ["ROT13", "EBG13"],
  ["123", "123"],
  ["!@#$%", "!@#$%"],
  ["Hello, World!", "Uryyb, Jbeyq!"],
  ["Héllo", "Uéyyb"],
  ["🔒 secret", "🔒 frperg"],
  ["Тест", "Тест"],
  ["مرحبا", "مرحبا"],
  ["こんにちは", "こんにちは"],
])("transforms %s to %s", (input, expectedResult) => {
 core.setInput("string", input);

 action.run();

 const actualResult = core.getOutput("result");
 expect(actualResult).toBe(expectedResult);
 expect(core.eventsOf("info")).toContain(`${input} -> ${expectedResult}`);
});

Edge cases matter, too. While the ROT-13 transformation of an empty string is technically valid, let's demonstrate input validation by requiring input length between 1 and 1,048,576 characters (1 MB).

it("fails with empty string input", () => {
  const input = "";
  core.setInput("string", input);

  expect(() => action.run()).toThrowError(
    "input field 'string' cannot be empty",
  );
});

it("fails with input exceeding 1 MB", () => {
  const maxSize = 1024 * 1024;
  const input = "*".repeat(maxSize + 1);
  core.setInput("string", input);

  expect(() => action.run()).toThrowError(
    `input field 'string' cannot exceed ${maxSize} characters`,
  );
});
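
The validation these two tests exercise could look roughly like the sketch below. The error messages are taken from the tests above; the function name validateInput and the constant MAX_LENGTH are assumptions, and the real action may structure this differently.

```typescript
// Assumed names; the error messages mirror the tests above.
const MAX_LENGTH = 1024 * 1024;

function validateInput(input: string): string {
  if (input.length === 0) {
    throw new Error("input field 'string' cannot be empty");
  }
  // Note: .length counts UTF-16 code units, not bytes.
  if (input.length > MAX_LENGTH) {
    throw new Error(`input field 'string' cannot exceed ${MAX_LENGTH} characters`);
  }
  return input;
}

console.log(validateInput("Hello, World!")); // Hello, World!
```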

Our unit tests now pass with 100% coverage. Time to celebrate? Not quite. Traditional code coverage is a vanity metric: it tells us which lines were executed, not whether our tests actually verify the correct behaviour. Let's do better.

Property-Based Testing: Testing What You Can't Imagine

Think about the essential properties of ROT-13 transformation:

Length preservation: transformation doesn't alter the input length
Inverse operation: applying ROT-13 twice returns the original string
Case preservation: uppercase letters stay uppercase, lowercase letters stay lowercase
Character selectivity: only alphabetic characters rotate

You could write hundreds of example-based tests to cover these properties and still miss a myriad of edge cases. Even LLMs would struggle to generate comprehensive enough examples.

There's a better way, and it’s called property-based testing. Libraries like Hypothesis (Python) or QuickCheck (Haskell) have demonstrated the power of this approach. For JavaScript, we'll use fast-check, which generates test data automatically based on properties we define.

A property-based test in fast-check is elegant:

type Predicate = (s: string) => boolean;

it("does not change text length", () => {
 const preservesLength: Predicate = (s) => transform(s).length === s.length;

 assert(property(string(), preservesLength));
});

We assert that a property holds true for all strings.

The predicate function returns true when the transformation preserves length. Fast-check then generates hundreds of random strings and checks our predicate against each one. If all pass, the test passes.

When a test fails, fast-check doesn't just throw up its hands. It shrinks the failing input to find the minimal example that demonstrates the bug. Instead of debugging "xKj9!@mNqP#$wZ", you might get "A". This shrinking process is invaluable for understanding the causes of failures.
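
To get an intuition for shrinking, here is a naive shrinker in TypeScript. It is only an illustration of the idea and is not how fast-check implements shrinking: it keeps deleting single characters for as long as the input still fails. The function names and the example bug are hypothetical.

```typescript
// Naive shrinker for illustration only; fast-check's real strategy is more sophisticated.
function shrink(failing: string, stillFails: (s: string) => boolean): string {
  let current = failing;
  let improved = true;
  while (improved) {
    improved = false;
    for (let i = 0; i < current.length; i++) {
      // Try the input with the character at position i removed.
      const candidate = current.slice(0, i) + current.slice(i + 1);
      if (stillFails(candidate)) {
        current = candidate; // smaller input that still shows the bug
        improved = true;
        break;
      }
    }
  }
  return current;
}

// Hypothetical bug: the property breaks whenever the input contains "A".
const breaksOnUpperA = (s: string): boolean => s.includes("A");

console.log(shrink("xKj9A!@mNqP", breaksOnUpperA)); // A
```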

Once you've identified the minimal failing case, write it as a traditional unit test, fix it, then return to property-based testing to verify the fix. This workflow integrates beautifully with test-driven development.

Testing the inverse property is equally straightforward:

it("is its own inverse", () => {
 const isItsOwnInverse: Predicate = (s) => transform(transform(s)) === s;

 assert(property(string(), isItsOwnInverse));
});

Sometimes we need to constrain inputs to test specific properties. Fast-check's string().filter() method lets us generate only strings matching certain criteria:

it("preserves uppercase", () => {
 const isUpperCase: Predicate = (s) => s === s.toUpperCase();
 const preservesUpperCase: Predicate = (s) =>
  [...transform(s)].every(isUpperCase);

 assert(property(string().filter(isUpperCase), preservesUpperCase));
});

it("preserves lowercase", () => {
 const isLowerCase: Predicate = (s) => s === s.toLowerCase();
 const preservesLowercase: Predicate = (s) =>
  [...transform(s)].every(isLowerCase);

 assert(property(string().filter(isLowerCase), preservesLowercase));
});

it("only transforms alphabetic characters", () => {
 const isSpecialCharacter: Predicate = (s) => !/[A-Za-z]/.test(s);
 const skipsTransformation: Predicate = (s) => transform(s) === s;

 assert(property(string().filter(isSpecialCharacter), skipsTransformation));
});
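For reference, a `transform` function that satisfies the properties tested above could look like the following. This is a minimal ROT-13 sketch written for illustration; the implementation in the example repository may differ.

```typescript
// Rotate each ASCII letter by 13 places, leaving every other character untouched.
function transform(text: string): string {
  return text.replace(/[A-Za-z]/g, (letter) => {
    const base = letter <= "Z" ? 65 : 97; // char code of "A" or "a"
    const rotated = (letter.charCodeAt(0) - base + 13) % 26;
    return String.fromCharCode(rotated + base);
  });
}

console.log(transform("Hello, World!")); // "Uryyb, Jbeyq!"
```

Because 13 is half of 26, applying the function twice returns the original text, which is exactly what the "is its own inverse" property checks.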

Not every GitHub Action benefits from property-based testing. If your action doesn't involve mathematical properties or transformations, traditional tests might suffice.

However, for many actions involving data, cryptography, parsing, or any domain with distinct invariants, fast-check saves enormous amounts of time while uncovering bugs you'd never have imagined.

We now have more test code than production code. "Isn't this overkill for a simple ROT-13 transformation?" you might ask.

No. Quality software often has significantly more test code than production code. We're building confidence that our action behaves correctly under all circumstances. And we're not done yet: the most powerful testing technique is still ahead.

Mutation Testing: The Ultimate Reality Check

Mutation testing is the most humbling technique in a developer's toolkit. Why? Because it tests your tests.

Testing tests might sound recursive and pointless, but it's critical, especially if you write tests after the code. We've all been there until we learn Test-Driven Development. Mutation testing reveals whether your tests actually verify behaviour or just exercise code.

Here's how it works: a mutation testing tool modifies your source code in subtle ways, such as switching > to >=, changing && to ||, removing conditionals, or tweaking regular expressions. The rules that make these changes are called mutators, and each modified version of the code is a mutant.

After mutation, the tool runs your tests. If tests fail, the so-called mutant is killed, which is good. If tests still pass, the mutant survives, which is bad since your tests didn't catch the bug. The mutation score is the percentage of killed mutants out of all mutants.
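The scoring described above can be sketched by hand. The mutants below are written manually to imitate typical mutators, and the `isAdult`-style function and both test suites are hypothetical examples, not output from a real tool.

```typescript
type Impl = (age: number) => boolean;

// Production code under test.
const original: Impl = (age) => age >= 18;

// Hand-written mutants imitating common mutators.
const mutants: Impl[] = [
  (age) => age > 18, // boundary changed: >= became >
  (age) => age < 18, // operator flipped
  (_age) => true, // conditional replaced with a constant
];

// A suite returns true when all of its assertions pass for an implementation.
type Suite = (impl: Impl) => boolean;

// Executes the only line of code, so line coverage is already 100%.
const weakSuite: Suite = (impl) => impl(30) === true;

// Also checks both sides of the boundary.
const strongSuite: Suite = (impl) =>
  impl(30) === true && impl(18) === true && impl(17) === false;

// Percentage of mutants that make the suite fail.
function mutationScore(suite: Suite): number {
  const killed = mutants.filter((mutant) => !suite(mutant)).length;
  return (killed / mutants.length) * 100;
}

console.log(mutationScore(weakSuite).toFixed(1)); // "33.3"
console.log(mutationScore(strongSuite).toFixed(1)); // "100.0"
```

Both suites pass against the original code and cover the same line, yet only the second one notices when the behaviour changes.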

Mutation testing exposes the harsh truth: traditional code coverage is a vanity metric. Many teams treat 100% coverage as proof of quality, but mutation testing tells a different story.

I've seen codebases that enforce up to 100% coverage scores through quality gates, yet when I run mutation tests, numerous mutants still survive because the tests were written hastily. You might have experienced this yourself: refactor some code — or let an LLM do it — and then see all tests pass, only to watch bugs appear in production. Your tests touched every line but verified little.

Improving mutation scores in legacy codebases is a challenging task. Even modern LLMs struggle with this. The pragmatic approach is to start with a lower threshold and gradually increase it as you improve tests. For new projects, aim for 100% from the start and enforce it throughout your pipeline.

Our ROT-13 action is new, so we'll choose perfection.

Setting up mutation testing for our GitHub Action is straightforward with StrykerJS:

{
  commandRunner: {
    command: "bun test",
  },
  checkers: ["typescript"],
  mutate: ["src/**/*.ts"],
  reporters: ["clear-text", "progress"],
  thresholds: {
    high: 100,
    low: 100,
    break: 100,
  }
}

Key configuration points:

Command runner: Specifies the test command. We use Bun, but any test runner works.
Checkers: TypeScript checker eliminates mutants that cause type errors, saving time.
Mutate: Defines which source files to mutate. Be specific: mutating tests or dependencies wastes time.
Thresholds: We set all thresholds to 100%, meaning anything less than perfect fails the build.

The high/low thresholds control report colours (green/yellow/red), but since ours are identical, we get either green or failure. There is no middle ground.
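The interplay of the three thresholds can be modelled in a few lines. This is a sketch based on my understanding of the documented StrykerJS semantics (a score at or above `high` is green, between `low` and `high` is yellow, below `low` is red, and a score below `break` fails the run); the `classify` function is a hypothetical helper, not part of Stryker.

```typescript
interface Thresholds {
  high: number;
  low: number;
  break: number | null; // null means the build never fails on score alone
}

interface Verdict {
  colour: "green" | "yellow" | "red";
  failsBuild: boolean;
}

function classify(score: number, thresholds: Thresholds): Verdict {
  const colour =
    score >= thresholds.high ? "green" : score >= thresholds.low ? "yellow" : "red";
  const failsBuild = thresholds.break !== null && score < thresholds.break;
  return { colour, failsBuild };
}

const strict: Thresholds = { high: 100, low: 100, break: 100 };

console.log(classify(100, strict)); // { colour: "green", failsBuild: false }
console.log(classify(99.9, strict)); // { colour: "red", failsBuild: true }
```

With identical thresholds the yellow band has zero width, so the only possible outcomes are a green report or a failed build.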

Mutation tests add only a few seconds to our test suite, making them perfect for pre-push hooks. Run unit tests, property-based tests, and mutation tests together. Push the commit only when everything is green.

Conclusion

Building quality into GitHub Actions requires discipline, but the payoff is worth it. Follow this testing pyramid:

Start with unit tests using test-driven development
Add property-based tests to uncover edge cases you'd never think to write
Verify with mutation testing to ensure your tests actually test what they claim to test
Add a few workflow tests to verify your action is callable

The foundation for all of this is writing actions that do one thing well. Simple logic is more straightforward to test than complex logic. Introduce these practices early, ideally before writing much production code. Retrofitting quality is always more difficult.

Remember that GitHub Actions are just functions. They take inputs and produce outputs. The same testing principles that apply to your backend or frontend code apply here. No exceptions, no excuses.

In the next chapter, we'll address the final piece: building and releasing your action. We'll create a CI/CD pipeline that verifies your action works correctly in a real GitHub Actions environment, then releases it safely to users. We'll also explore how to handle external dependencies and asynchronous operations without sacrificing test speed.

It’s where all the design and testing practices come together to create automation you can trust in production. This is the kind of delivery culture we help teams build every day at Polar Squad.

Let us ship this action!

Craftsperson's Guide to GitHub Actions #1: Designing for Success — Polar Squad

Niko Heikkilä — Wed, 03 Dec 2025 22:00:00 GMT

Discover how to design testable GitHub Actions by avoiding common pitfalls like implicit dependencies and global state. Learn to separate business logic from infrastructure using dependency injection for fast, reliable testing.

Like many other types of scripts, GitHub Actions suffer from a common problem: it's tempting to take a legacy Bash or PowerShell script and convert it into JavaScript without considering best practices like modularity and separation of concerns. The result is code that's difficult to understand, test, and maintain.

Typically, actions become tightly coupled with the underlying infrastructure. They make network requests haphazardly, write to and read from disk, and invoke arbitrary shell commands — all without proper abstraction.

Even seemingly straightforward actions, such as downloading and adding a binary executable to the path, quickly reveal their complexity. What if the download fails? What if permissions prevent writing to the directory? What if we download the wrong version? What if the executable doesn't launch at all?

If you've ever written a script for installing custom software from the internet, you know how exhaustive the logic can be, even when it looks as simple as piping curl output to bash. This is why we can't afford to skip quality practices, even for simple scripts.

The Hidden Cost of Implicit Dependencies

To streamline action development, GitHub provides the actions/toolkit monorepo, which publishes several helpful packages under the @actions scope. While these tools are powerful, I've seen them used in ways that create maintenance nightmares.

Take a look at this code snippet from GitHub's official tutorial. If you're familiar with software design principles, it might make you wince:

import * as core from "@actions/core";
import * as github from "@actions/github";

try {
  const nameToGreet = core.getInput("who-to-greet");
  core.info(`Hello ${nameToGreet}!`);

  const time = new Date().toTimeString();
  core.setOutput("time", time);

  const payload = JSON.stringify(github.context.payload, undefined, 2);
  core.info(`The event payload: ${payload}`);
} catch (error) {
  core.setFailed(error.message);
}

Three significant issues stand out:

Direct third-party library usage without abstraction
Global mutable state dependencies
Uncontrolled side effects (date construction)

While I won't judge a tutorial too harshly, these patterns are spreading rapidly, especially as developers use LLMs to generate action code. The resulting actions become difficult to test and maintain.

Import Smells: When Dependencies Use You

Issue #1 prevents safe side-effect handling and clean testing. The core object performs multiple responsibilities: logging to the console, setting output data, and terminating the script with exit codes.

Running this code in tests without a proper abstraction creates a miserable experience. Environment variables for inputs are missing, outputs can't be asserted, and your terminal floods with logs.

"But we can always use spies and mocks!" some might argue.

True, but at what cost? Your business logic is still entangled with infrastructure, tests become fragile, and refactoring turns into a risky endeavour. When tests depend heavily on implementation details, they require increasingly complex mocking setups. Coming back six months later to fix a bug through tests? Good luck.

Importing functionality from third-party packages without abstraction creates what I call import smells — dependencies that pollute your code and make testing painful.

Global Mutable State: The Silent Killer

Issue #2 introduces dependency on global mutable state that changes during pipeline runs. This creates the dreaded scenario: tests pass locally but fail after pushing. Debugging becomes a guessing game.

Worse still, your action might behave differently between pull request and release workflows because the global context contains different data. What worked in one scenario breaks in another.

Using context for simple debug logging? That's fine. However, once complexity increases, confidence in your actions tends to evaporate. As a maintainer, prepare yourself for a steady stream of bug reports and confused users.

The Time Problem

Issue #3 is subtler but equally important.

When business logic calculates the current timestamp directly, testing becomes unreliable because each test run produces a different timestamp.

You can't test the timestamp cleanly and end up either mocking the system clock, or writing weak assertions that check the string vaguely resembles a timestamp. That's not testing but wishful thinking.

This ties back to the first issue: even without explicit imports, you have an implicit dependency on the Date class.

To address the issues above, we need to design actions that are easy to test and not coupled to the infrastructure. How do we solve that?

The Solution: Dependency Injection

The fix is simpler than you might think. Let's separate business logic from infrastructure by splitting our code into two files:

// action.mjs
export function run({ core, github, date }) {
  const nameToGreet = core.getInput("who-to-greet");
  core.info(`Hello ${nameToGreet}!`);

  const time = date.toTimeString();
  core.info(`The current time is ${time}`);

  const payload = github.payload.dump();
  core.info(`The event payload: ${payload}`);
}

// index.mjs
import * as core from "@actions/core";
import * as github from "@actions/github";
import * as action from "./action.mjs";

// Thin adapter exposing only what run() needs from the GitHub context.
const payload = {
  dump: () => JSON.stringify(github.context.payload, undefined, 2),
};

try {
  action.run({ core, github: { payload }, date: new Date() });
} catch (error) {
  core.setFailed(error.message);
}

The structure looks similar, but the improvement is dramatic. Our business logic is now infrastructure-independent. The logic doesn't import anything from the Actions toolkit. This makes the code portable to other systems with similar interfaces.

The infrastructure hasn't disappeared; it's been separated through dependency injection. The core, github, and date objects are now parameters we can easily substitute in tests.

That's it! Our business logic and dependencies are now cleanly separated, and, most importantly, they are testable.

In place of infrastructure dependencies, we use infrastructure test doubles (fake objects), which have no side effects, making them ideal for tests.

I've added a few helper methods to the fake objects to make testing more ergonomic. JavaScript and TypeScript allow this flexibility as long as objects implement the required interface. Since these test doubles never leave the test suite, there's no risk of them appearing in production code.

The assertions simply inspect an in-memory log, ensuring tests pass for the right reasons, produce clean reports, and allow confident refactoring of business logic.
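A test built on such fake objects might look roughly like the sketch below. The `FakeCore` class, the narrow `Core` and `GitHub` interfaces, and the sample payload are assumptions made for this illustration; the `run` function repeats the logic from action.mjs with types added so the example is self-contained.

```typescript
// Narrow interfaces covering only what run() uses (assumed shapes for this sketch).
interface Core {
  getInput(name: string): string;
  info(message: string): void;
}
interface GitHub {
  payload: { dump(): string };
}

// Same logic as action.mjs, repeated here with types.
function run({ core, github, date }: { core: Core; github: GitHub; date: Date }): void {
  const nameToGreet = core.getInput("who-to-greet");
  core.info(`Hello ${nameToGreet}!`);

  const time = date.toTimeString();
  core.info(`The current time is ${time}`);

  const payload = github.payload.dump();
  core.info(`The event payload: ${payload}`);
}

// Fake core: serves canned inputs and records log lines in memory.
class FakeCore implements Core {
  inputs: Record<string, string>;
  messages: string[] = [];

  constructor(inputs: Record<string, string>) {
    this.inputs = inputs;
  }
  getInput(name: string): string {
    return this.inputs[name] ?? "";
  }
  info(message: string): void {
    this.messages.push(message);
  }
}

const fakeCore = new FakeCore({ "who-to-greet": "Craftsperson" });
const fakeGitHub: GitHub = { payload: { dump: () => '{"action":"opened"}' } };
const fixedDate = new Date(2025, 11, 3, 12, 0, 0); // a known, repeatable timestamp

run({ core: fakeCore, github: fakeGitHub, date: fixedDate });

console.log(fakeCore.messages[0]); // "Hello Craftsperson!"
```

Nothing is written to the real console by the action itself, no environment variables are needed, and the timestamp is fully under the test's control.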

Dependency injection scales extremely well. No matter how complex a dependency is, you can always pick out the interesting parts, write an abstraction, and use a test double in its place.

Conclusion

Before writing your next GitHub Action, ask yourself one question:

Can I test this logic without pushing to GitHub and running a workflow?

If the answer is yes — using the techniques described here — you've unlocked fast feedback loops that reduce defects and make changing business logic cheaper and safer.

But we're just getting started. In the next chapter, we'll explore advanced testing techniques, including property-based testing to uncover edge cases you might never think to write, and mutation testing to verify that your tests actually test what they claim to test.

It’s the next step toward building automation you can truly trust.

See you there!

Craftsperson's Guide to GitHub Actions — Polar Squad

Niko Heikkilä — Tue, 02 Dec 2025 22:00:00 GMT

Learn to build production-ready GitHub Actions with clean architecture, comprehensive testing strategies, and reliable release pipelines, transforming fragile automation scripts into maintainable software.

Have you ever rushed to push a GitHub Action, only to watch it fail in the pipeline? Or inherited an action so tightly coupled to infrastructure that testing it locally felt impossible? You're not alone.

GitHub Actions have become the backbone of modern CI/CD workflows, yet we often treat them as throwaway scripts rather than production-grade software. The usual result is brittle automation, slow feedback loops, and extensive trial and error in your pipelines.

This year at Polar Squad, I've spent considerable time assisting a platform engineering team in building reusable GitHub Actions used daily by hundreds of developers.

Here's what I learned: the difference between fragile scripts and reliable automation lies in applying the same modern software engineering principles you'd use for any critical production system.

What You'll Learn

This series demonstrates how to build GitHub Actions that are:

Testable: Run comprehensive tests locally in seconds, not minutes.
Maintainable: Separate business logic from infrastructure concerns using clean architecture.
Reliable: Achieve confidence through unit tests, property-based tests, and mutation testing.
Production-ready: Build, release, and verify your actions with automated pipelines.

The techniques I share aren't specific to GitHub Actions. They're fundamental software engineering practices applied to CI/CD. Clean architecture ensures your code is easy to understand and cheap to change. Fast-running tests provide immediate feedback on changes. The ability to refactor safely means you can adapt to evolving requirements without fear.

Who This Guide Is For

If you're a software engineer building JavaScript- or TypeScript-based GitHub Actions and want to move beyond "push and pray" development, this guide is for you. Whether you're creating actions for your team or publishing them publicly, these practices will save you time and headaches.

I've prepared an example repository with a fully functional GitHub Action that demonstrates every principle discussed in this series. Clone it, experiment with it, and use it as a foundation for your own actions.

Series Chapters

I have divided the series into three focused chapters, each building on the previous one:

Chapter 1: Designing for Success
Chapter 2: Scaling Up the Testing
Chapter 3: Building and Releasing

Follow along over the next few weeks as we explore how to build automation that’s testable, maintainable and actually enjoyable to work with.

In a struggle to tame Large Language Models — Polar Squad

Anoop Vijayan — Tue, 13 May 2025 21:00:00 GMT

Initial excitement around the predictive abilities of Artificial Intelligence (AI) led to rapid investment in machine learning research and the development of Large Language Models (LLMs). This sparked a wave of innovation across many fields and, at the same time, revealed both the potential and the limitations of early AI systems.

Introduction

This blog post briefly recaps, from the author’s perspective, what has been happening in the LLM space since its beginning, and then dives a little deeper into the Model Context Protocol (MCP), which has been generating a lot of hype recently. MCP was introduced in late 2024 by Anthropic, a California-based AI startup. Why is MCP getting so popular, and why are so many LLM researchers, companies and individuals looking into this topic? Let’s dive in.

The initial WOW phase

When researchers and scientists found that LLMs can predict the next word or phrase reliably, that was the eye-opening moment. For example, when the user types just the letters “Glo”, the model offers relevant suggestions.

This simple word prediction trick when applied to huge datasets unlocked new and valuable application areas. For end users this meant that the LLMs were not only able to predict what a user might think, but also answer hard questions. From that point till now we have been trying to tame LLMs to make them work the way we want.

Initially, the focus was on chat-based language models because they excelled at token and word prediction. These models were trained on vast amounts of text data and learned to understand patterns in language, making them highly effective at tasks like conversation, text completion, and natural language understanding.

As the technology progressed, researchers and developers began expanding beyond pure text applications, leading to the development of models that could understand, generate and manipulate visual content.

Augmentation phase

Early on, it became clear that retraining large models was not feasible for everyone due to the immense computational resources and expertise required. As a solution, a method called Retrieval-Augmented Generation (RAG) was developed, which enables the model to access relevant data sources in real time rather than relying solely on pre-trained knowledge.

This approach allowed the model to retrieve relevant information from private datasets and incorporate that context into its responses, improving accuracy and relevance. By using RAG, the model can generate more informed and contextually aware outputs, without the need for constant retraining, making it more flexible and adaptable to specific tasks or queries. However, RAG has challenges with retrieval accuracy, adds more latency and is harder to debug.

Function-calling phase

Another significant breakthrough occurred with the introduction of function (or tool) calling, where LLMs were able to invoke functions dynamically, based on specific needs or contexts. This addressed some of the drawbacks of RAG: the model can invoke precise methods, retrieval accuracy improves, latency drops, and debugging becomes easier.

Eventually Anthropic standardized the way the invocation happens and MCP came into existence. This integration opened up new possibilities for automating workflows and enhanced decision-making processes by enabling the model to seamlessly interact with external systems and execute tasks autonomously or on demand. Let's dive a little deeper.

MCP under the hood

How are AI models able to invoke tools, also known as functions?

Models are trained to make a "function call" when they need more data. The model in use must have been trained for this purpose, which is the case for most models these days. Function calling can be configured to run in auto or explicit mode. The model generates JSON output containing the input parameters the function needs, and it receives the function's output in return. Each function must be declared to the model using a JSON Schema so the model can understand what the function does, which inputs it requires and what it returns.
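As an illustration of such a declaration, the sketch below describes a tool with a JSON Schema style `parameters` object and checks a model's JSON output against it. The `git_log` tool, its fields, and the `missingArguments` helper are all hypothetical; real providers each define their own envelope around the schema.

```typescript
// Simplified shape of a tool declaration using a JSON Schema subset.
interface ToolDeclaration {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// Hypothetical tool; names and fields are illustrative only.
const gitLogTool: ToolDeclaration = {
  name: "git_log",
  description: "List the most recent commits of a repository",
  parameters: {
    type: "object",
    properties: {
      repoPath: { type: "string", description: "Path to the repository" },
      maxCount: { type: "number", description: "How many commits to return" },
    },
    required: ["repoPath"],
  },
};

// The model answers with JSON naming the tool and the arguments it chose.
const modelOutput =
  '{"name":"git_log","arguments":{"repoPath":"/tmp/repo","maxCount":5}}';

// Returns the required arguments the model failed to provide.
function missingArguments(tool: ToolDeclaration, rawCall: string): string[] {
  const call = JSON.parse(rawCall) as {
    name: string;
    arguments: Record<string, unknown>;
  };
  if (call.name !== tool.name) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return tool.parameters.required.filter((key) => !(key in call.arguments));
}

console.log(missingArguments(gitLogTool, modelOutput)); // []
```

The application, not the model, executes the function; the model only produces the structured request and later receives the result as additional context.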

You will also often add a system prompt to guide the model towards the functions you have made available.

So let's take a look at what steps are needed to get an MCP server in place.

Essentially, in this case, the MCP server is an API server that can perform some git actions. For your model to call it, an MCP server first needs to be created and made accessible over the network; community-hosted servers can be used as well. After that, a config JSON needs to be created and provided to the LLM. The LLM then invokes the server whenever the user’s query calls for it. Learn more about function calling in the OpenAI documentation.
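As a rough illustration of the config JSON mentioned above, the sketch below builds a client configuration listing three servers. The `mcpServers` layout follows a convention used by some MCP clients, but the exact format depends on the client you use. All executable names and paths here are hypothetical placeholders.

```typescript
// One entry per server: how the client should launch it.
interface McpServerEntry {
  command: string;
  args: string[];
}

// Hypothetical executables; substitute the servers you actually use.
const clientConfig: { mcpServers: Record<string, McpServerEntry> } = {
  mcpServers: {
    git: {
      command: "example-git-mcp-server",
      args: ["--repository", "/path/to/repo"],
    },
    web: { command: "example-web-mcp-server", args: [] },
    files: {
      command: "example-files-mcp-server",
      args: ["/path/to/workspace"],
    },
  },
};

// Serialised form that would be saved as the client's config file.
const configJson = JSON.stringify(clientConfig, null, 2);
console.log(configJson);
```

Once the client has loaded a configuration like this, it can advertise each server's tools to the model, which then decides when to call them.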

This has since grown into an ecosystem, with many MCP servers implemented by commercial and community groups. Some of them are collected in the MCP servers collection on GitHub.

Applications of MCP

Multiple MCP servers can be connected to a single LLM or to multiple LLMs, each capable of being triggered autonomously with different parameters, all working together coherently. For example, they could power a search engine optimiser (SEO) for a single website or an aggregator across a number of websites. Technically speaking, any software product can implement an MCP server to enable direct integration with LLMs, and LLMs can then augment their knowledge and make additions and modifications directly in those products.

Consider a git tool, a web tool and a file tool configured for a given LLM. As a developer, you provide the requirements in plain text; the model uses its own knowledge, reaches out to the tools it needs, and finally returns finished work committed to version control. This could be placed in a continuous loop in which the LLM improves its results autonomously, without human intervention. Before this concept existed, developing a software feature used to take days, if not weeks. As a result, coding editors like Cursor, Windsurf, Microsoft Copilot and Bolt gained popularity, and this type of coding was named vibe coding by Andrej Karpathy.

Where are we today

Taking a step back, we have so far reached a level where we can invoke functions based on the scenario or input provided to LLMs. This capability has fundamentally transformed how AI systems can interact with the world around them, and it has already opened up wide possibilities, enabling LLMs to perform complex actions based on natural language understanding. There are now multimodal and multi-agent systems that can process images alongside text, understand spoken commands, and generate content across different formats simultaneously. Multi-agent architectures have further revolutionized the field by enabling specialized AI entities to collaborate, debate, and collectively solve problems that would be challenging for a single person or even a development team.

Challenges

As this is still at a very early stage, there are a significant number of errors, bugs, vulnerabilities, license violations, and other issues. Robust solutions addressing security vulnerabilities and privacy concerns remain in early development. The lack of tooling for monitoring and troubleshooting these systems also causes challenges with traceability and predictability. Despite the hype, these agents are not autonomous enough to perform complex tasks without human intervention, partly because AI agents cannot yet reason through a problem the way humans do. This makes AI still not very useful for mission-critical or common real-life use cases involving emotional intelligence, creative ability, and ethical reasoning or judgement.

Where are we heading to

The environment is still dynamic and moving at a fast pace. It's hard to predict which direction it will take in the coming weeks, months or years. Some companies, such as Nvidia and Google, claim that we will reach AGI (Artificial General Intelligence, sometimes called human-level AI) in less than five years, but my feeling is that we are still in the process of taming LLMs to make them work the way we want.

That said, companies like OpenAI, Anthropic, Meta, Google and others are working hard to release new features every month and improve their models, each aiming to be the first to achieve AGI and lead the AI market.

Horseless carriage: AI is not just for faster coding — Polar Squad

Risto Laurikainen — Sun, 04 May 2025 21:00:00 GMT

Vibe coding in progress.

We often talk about AI in software development as a way to write code faster, but that’s only part of the story. As tools evolve, so do our workflows, expectations, and roles. From low-stakes experimentation through “vibe coding” to structured multi-agent systems, this blog post explores how AI-augmented programming is reshaping how we code and think about building software altogether.

There’s no shortage of opinions on AI-augmented programming. Either it will, or it won’t replace software engineers. It’s either the best thing ever for productivity or leads to wasting time debugging weird bugs created by the AI. Opinions on which specific large language model is best are usually based on the person’s anecdotal experience.

One recurring theme in many of these opinions is that they are based on, at best, the current state of AI-augmented programming or, at worst, an outdated understanding of the state from last year. The field is moving so rapidly that it is challenging to keep up and make good arguments about where it is headed.

Perhaps because it’s hard to understand the direction of AI-augmented programming, many arguments look at it in terms of how things have worked in the past. By now, it should be clear that with such a massive paradigm shift, this doesn’t always work.

More than just faster coding

Just like people of the past thought of cars as horseless carriages, people today might think of AI-augmented programming as something that makes you write code faster. This thinking is partly correct but is somewhat limited and can be misleading. Writing code faster does not automatically lead to delivering higher-quality software faster, and it certainly does not lead to better outcomes on its own.

The software industry has a long-standing tradition of overemphasizing output over outcomes. New methodologies like agile development and DevOps have often been seen by many as tools for delivering more software faster. It looks like this same trend is also continuing with AI-augmented programming, with a lot of the conversation revolving around productivity as measured by lines of code produced. Less attention is given to how we could make use of AI for better outcomes. However, a simple program with good product-to-market fit is better than a complex one without.

Rather than thinking about AI-augmented programming in terms of how programming has worked in the past, we could also look at what it enables that wasn’t possible before. What new use cases does it allow that were either impossible or impractical in the past? How should we change our behaviors due to these new use cases? Thinking this way can lead to better insights.

Explosion of experimentation

There’s been a lot of talk lately about vibe coding. The term was coined by Andrej Karpathy and defined as follows:

“There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. … It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”

Commentators often dismiss this type of coding as producing low-quality code and various security issues. That’s true, at least with the current models – you probably shouldn’t run this code in production, though some will do so anyway. But what if we think of what new use cases this type of programming enables?

One of the core principles of DevOps listed in The DevOps Handbook is a culture of continual experimentation and learning. When you can generate code much faster, you can build many more demos, prototypes, and proofs-of-concept. This experimentation allows you to discover the right thing to build quicker and to demonstrate and communicate ideas more effectively to your coworkers.

However, you need to be mindful not to consider these experiments production-quality and be sure to throw away the code once it has served its purpose. Sometimes, there can be an aversion to throwing away the prototype since it’s basically working, and effort went into building it. If the effort needed is significantly reduced, that might also help to let go of the prototypes more easily.

Show, don’t just tell

As a practical example of this approach, I recently created a demo of a web application for a customer using what could be called a structured approach to vibe coding. I had a concrete description of the data model I wanted to use in JSON format. I specified the technologies I wanted to use based on my familiarity with them, and iteratively built the web app with the AI writing most of the code. I finished the application in about one workday’s worth of hours while also working on other stuff, and I recorded a demo of the application and shared it in Slack for comments. I could have done this without AI, but it probably would have taken a week or two, which is quite a lot for what I intended to be a throwaway demo. The exercise served its purpose well: Showing what the intended workflow with this application could look like in the future, and getting people excited about the possibilities of automation.

Another benefit is that you don’t need to be a programmer to create these prototypes. Anyone can create something to communicate an idea quickly or to turn a vague idea in their head into something more concrete. You can start with a prompt to generate your concept, then refine the app the AI created a couple of times before showing it to someone else for feedback. You can either do this from scratch with a fresh codebase or add features to an existing codebase.

Once you know better what should be built based on some quick vibe coding experiments, you can write the production code more easily. Of course, you can still have the AI assist with the production code. However, the productivity increase there won’t be as dramatic since you need to understand the code and be more mindful of architecture and maintainability.

From a programmer to a product manager

When moving beyond demos and experimentation, vibe coding doesn’t cut it anymore, and a more structured approach is needed. For common problems and programming languages, the current state-of-the-art large language models are already good enough to write significant portions of code, leaving the programmer to review the code and set up rules and workflows for the AI. Managing the context for the AI and providing good guardrails and clear instructions are essential. The work of a programmer moves closer to that of a product manager: Create an unambiguous roadmap for what the product you’re building should do and provide context and feedback for the AI to implement it.

In addition to defining a roadmap for the AI, the programmer must provide a clear set of rules as guardrails. Agentic coding tools like Cursor, Roo Code, and Cline implement a rules system for this. For example, the programmer can use these rules to enforce a specific workflow for the AI agent, like using test-driven development or a particular coding style. They can also prevent the AI from making certain common mistakes that these agentic tools will make without explicit guidance.

In this phase, you do need to understand what good code looks like, how to test it, and how to structure it. At least for now, the AI will regularly get these parts wrong, though good rules and context help a lot to reduce clear mistakes. Experience in software engineering is still needed to review code, write it yourself when necessary, and give good instructions for the AI.

Multi-agentic workflows

Taking a step further from the programmer becoming a product manager for an AI assistant, some current and likely future workflows involve orchestrating a set of AI agents with different roles. For instance, one model writes the code while another reviews it, and a third model is responsible for planning and breaking down the implementation into logical, incremental steps. What’s left to the programmer is to supervise the AI agents and provide good, clear instructions. With its recently released Boomerang Tasks, Roo Code is an example of a tool that is already implementing this workflow.

Using multiple models allows for selecting the most suitable model for each type of task. For example, some models may be best suited for planning, while others are excellent for producing code. This leads to better results with fewer mistakes, and can also be cheaper since the most expensive model doesn’t need to be used for everything.
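The cost argument can be made concrete with a little arithmetic. The following is only a sketch: the model names, the per-task costs, and the task mix are all hypothetical placeholders, not real products or prices.

```python
# Hypothetical model names and per-task costs in cents; real prices differ.
COST_CENTS = {"planner-large": 100, "coder-medium": 30, "reviewer-small": 10}

# Which (hypothetical) model handles which kind of task.
ROUTES = {"plan": "planner-large", "code": "coder-medium", "review": "reviewer-small"}


def routed_cost(task_kinds):
    """Total cost when each task goes to the model assigned to its kind."""
    return sum(COST_CENTS[ROUTES[kind]] for kind in task_kinds)


def single_model_cost(task_kinds, model):
    """Total cost when one model handles every task."""
    return COST_CENTS[model] * len(task_kinds)


tasks = ["plan", "code", "code", "code", "review", "review"]
print(routed_cost(tasks))                         # 210
print(single_model_cost(tasks, "planner-large"))  # 600
```

With these made-up numbers, routing by role costs roughly a third of sending everything to the most expensive model. Whether quality also improves depends on how well each model actually fits its role.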

What’s left for humans to do?

Software is ultimately built for humans; only humans can tell what they want from that software. Quite often, we don’t even know exactly what we want from software. In a way, articulating good instructions and managing context for a bunch of AI agents is a different interface for programming computers. In his blog post “The End of Programming as We Know It,” Tim O’Reilly argues that there have been similar shifts in programming paradigms before, like moving from flipping switches on a computer to writing assembly and later higher-level compiled languages, and that large language models are just the next iteration of this development.

Not all software engineering fundamentals will change when AI agents are introduced into the workflow. My colleague Niko recently wrote an excellent blog post about how lean principles interact with AI-augmented coding and why more output doesn’t necessarily lead to better outcomes.

Risto Laurikainen is a DevOps consultant who has worked on platform engineering before it was called platform engineering. He has worked on building and using these platforms for more than a decade in various roles from architecture to team leading.

AI-augmented Software Development: Hype, Vibes and Smoking Production Environments — Polar Squad

Niko Heikkilä — Mon, 24 Mar 2025 22:00:00 GMT

Generative AI tools have changed how we developers approach our daily work. Today, headlines tout the arrival of AI-augmented software development and vibe coding as silver bullets, making development teams orders of magnitude more effective. However, the promised gains are shallow if teams do not also pay attention to the software delivery aspects.

As a software engineer actively using GitHub Copilot and regularly consulting Anthropic's Claude, I have witnessed the power of AI in specific contexts where frequent experimentation and prototyping are valued. AI recalibrates knowledge work so that what once was necessary for humans to handle is now secondary. Meanwhile, the importance of the rest of the knowledge work aspects has grown a thousandfold.

However, it's crucial to acknowledge that using AI comes with tradeoffs we must not overlook. While AI-augmented software development can be beneficial, it's equally important to recognize that AI-augmented software delivery is still a distant goal. Understanding these limitations is key to being well-prepared in our work.

The True Nature of Software Delivery

There are significant hazards in using AI to accelerate your development, stemming from the fact that programming is only a tiny part of the software delivery process. This process, more suitably called a value stream, encompasses all the thinking, discussing, experimenting, and learning involved in delivering a working software product to end users.

Understanding the Value Streams in Software Delivery

As defined by Lean Enterprise Institute, the value stream includes all of the actions, both value-creating and nonvalue-creating, required to bring a product from concept to launch (development value stream) and from order to delivery (operational value stream). Nonvalue-creating actions include unnecessary handoffs, rework due to poor initial design, or delays due to resource constraints.

In software delivery, the value stream encompasses the entire lifecycle from ideation to production deployment and operations. Programming, or codifying requirements and expectations, is only one step. It typically accounts for less than half of the overall delivery. The remaining work involves understanding user needs, designing solutions, testing hypotheses, collaborating across disciplines, and, most importantly, operating amidst organizational barriers.

These non-programming activities are precisely where the most significant delivery challenges and non-value-creating actions emerge. Teams struggle with requirement ambiguity, stakeholder alignment, integration issues, and organizational power dynamics that ultimately determine whether software delivers the intended value. Organizational barriers could include resistance to change, lack of cross-functional collaboration, or unclear decision-making processes.

AI excels at brainstorming mostly syntactically correct code with some logical defects but remains fundamentally limited in navigating these human-centered aspects of software delivery.

Lean Principles and Identifying Waste in Software Delivery

The most notable impediments to optimizing value streams and team flow have been researched as part of Lean manufacturing and are called waste (muda). We can regard everything that does not create customer value as waste.

Mary and Tom Poppendieck, in their book Lean Software Development: An Agile Toolkit, popularised the mapping of different types of Lean waste to software delivery impediments.

According to the Poppendiecks, the primary waste in our work includes:

Partially done work (work-in-progress) or every backlog item sitting in the work queue between the backlog and production.
Overproduction, or all the extra features we write while solving the problem.
Relearning and reworking involved for the backlog items moving back and forth between inspection points.
Internal and external handoffs.
Context switching when you need to drop a task and focus on a new one.
Defects that hinder user experience and endanger businesses.

There is only so much that AI can do to help overcome this waste. On the contrary, it often exacerbates it. For instance, AI may generate code that solves a problem but introduces new bugs, leading to rework. It may also encourage overproduction by suggesting unnecessary features. Understanding these limitations is crucial for effectively integrating AI into software delivery processes.

As we enter a new era in software development, we must remain vigilant about the effectiveness of our delivery processes and value streams. While AI-augmented development provides the potential to streamline coding and ignite creativity, it cannot replace the fundamental human elements that drive successful software delivery.

By acknowledging AI's limitations and committing to optimizing our value streams, we can use AI not as a crutch but as a companion for success. In doing so, we not only enhance our effectiveness as developers but also ensure that the software we deliver meets the needs of our users.

The path forward demands a novel approach integrating AI's strengths while exposing and eliminating the waste typically associated with software delivery projects.

Partially Done Work

Thanks to its powerful autocomplete features, AI has the potential to revolutionize software development by completing tasks orders of magnitude faster. However, this added speed also carries multiple risks.

Completing one work cycle faster—in Lean manufacturing terms, moving work from one station to another—can jam the team flow as bottlenecks emerge. The quicker we work in the implementation phase, the more work we pile in front of the bottleneck ahead.

Following Eli Goldratt's Theory of Constraints, AI-augmented programming is a perfect candidate for the illusion of local optima where we attempt to improve the total performance of the system by improving the performance of an individual cog only to find out it doesn’t yield the expected results.
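To see why speeding up one station does not speed up the whole system, consider a deliberately simple two-stage model: coding feeds a review queue, and review can only finish a fixed number of items per week. This is a sketch with made-up rates, not a measurement from any real team.

```python
def simulate(coding_rate, review_rate, weeks):
    """Return (delivered, waiting_for_review) after `weeks` of work.

    Each week, coding adds `coding_rate` items to the review queue and
    review finishes at most `review_rate` of the items waiting there.
    """
    waiting_for_review = 0
    delivered = 0
    for _ in range(weeks):
        waiting_for_review += coding_rate
        reviewed = min(waiting_for_review, review_rate)
        waiting_for_review -= reviewed
        delivered += reviewed
    return delivered, waiting_for_review


# Hypothetical rates: review capacity stays at 5 items per week.
print(simulate(coding_rate=4, review_rate=5, weeks=10))   # (40, 0)
print(simulate(coding_rate=12, review_rate=5, weeks=10))  # (50, 70)
```

In this toy model, tripling the coding rate raises delivery from 40 to 50 items over ten weeks, while 70 items pile up as partially done work in front of the review step. Throughput is capped by the bottleneck, not by the fastest station.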

Bottlenecks and the Illusion of Speed

Let's see an example of how that can happen. Consider a team enthusiastically embracing AI coding assistance. Within a couple of weeks, the developers generate code at incredible velocity. However, their testing and review columns are piling up with unfinished work. Usually, many pull requests are waiting for review, which the team resolves by rubber-stamping them, discarding the value of code review altogether. Meanwhile, the UI/UX designers are struggling to keep pace and stress that the developers are moving too fast for them.

Without exposing and improving the bottleneck found in testing and review capacity, the team effectively turns their work into a warehouse of partially done work, delaying delivery. Even further delay happens when the code passes the review and it's time to merge the work, but it proves slow and painful due to numerous merge conflicts.

Can AI help with merge conflicts? It is unlikely unless you provide it with full context from both sides of the merge, which is not trivial. Furthermore, having AI review and test features has been the talk of the town for years now, but at the same time, many regulated companies follow the so-called principle of four eyes, where a human outside the work context must review the changes.

As DevOps people have been trying to sell continuous delivery mechanisms and reliable pipelines to these companies for a decade with varying success, I'm skeptical that these companies would activate AI code review and testing and automatically let all the changes move to production.

Overproduction

AI answers often provide excessive detail and redundant lines of code when we desire simplicity. While experienced developers can fine-tune their prompts, many users, especially juniors, could unquestioningly accept and use overly complex solutions.

While I have nothing against explaining the answers in detail, I seek only a few lines of code when using coding assistance. The less code I receive, the better I follow its intent. Due to the overly helpful nature baked into AI system prompts, they often spit out large parts resembling online tutorials aiding with project setup, file structure, and documentation.

Even though you can work around this problem with careful prompting, junior developers often have little experience and few second thoughts about the answers. So, they are happy to copy and paste the answers into their work.

A recent study showed that younger people who rely heavily on AI tools had more problems with critical thinking than older people. Therefore, it’s not far-fetched to assume that they could accept AI-generated code without criticism, using code that looks technically feasible but, in reality, delivers too much. The result is gross overproduction, over-polishing, and often tight coupling between components. Most of the models have not been trained well on clean software design principles and, by default, tend to produce complete solutions instead of small, iterable experiments.

AI's Tendency Towards Complexity

Suppose I ask the AI for help while implementing a date picker component. If it generates hundreds of lines of code, including validation logic, internationalization support, and calendar-like navigation, what should have been an MVP allowing users to select a date has become a tangled mess of features you won't need now—if ever.

The above example perfectly illustrates overproduction. The code works but includes features users didn't request or need. When the requirements change soon afterwards, the team spends three times longer modifying the complex initial implementation than it would have with a minimal solution. The AI optimized for completeness rather than simplicity, creating waste that had to be carried forward or refactored away.

Relearning and Rework

Accepting AI-generated code without careful review can lead to more work in the future. It's crucial to understand and be comfortable with the code you're working with, as the person changing the code is often someone else from your team.

Having to relearn often leads to heavy code refactoring, which becomes more complex as more overproduction occurs. Of course, you can have AI do the rework, but without pausing to understand the changes you're about to make, you're only placing an order for more rework in the future.

The Cost of Unreviewed AI-Generated Code

A junior developer might use AI to generate a user authentication system. Without fully understanding the generated code, they integrate it into the codebase. Sometime later, when the team needs to add social login capabilities, the team can't grasp the architecture embedded in the generated code. Thus, the team spends days refactoring the logic before extending it.

The team could have saved the time investment had they engaged more deeply with the design in the first place. In the worst codebases, this pattern repeats, with black-box AI solutions requiring extensive rework whenever they need modification.

Handoffs

No matter how effectively we use AI, handoffs to other people will inevitably happen for many teams. Often, our products move from the initial development team to the maintenance team so the original team can focus on building new business-critical features. In more old-fashioned environments, products move from development to operations for deploying and running those.

Imagine handing over a codebase where AI has generated 90 % of the code. If anything, that is a ticking time bomb. Sure, for handover, you can generate the required documentation and align the roadmap with AI, but with what context and cost?

The Challenge of Domain-Specific Knowledge Transfer

AI struggles with domain-specific knowledge transfer, which is the most critical information during handoffs between teams. This limitation stems from how we train AI. While it excels at identifying patterns and generating coherent text, it often lacks sufficient understanding of specialized domain contexts.

For example, a payroll system handling the salaries for municipal employees isn't your average code as it embodies significant regulatory knowledge, compliance requirements, and institution-specific business rules of which AI is unaware. Likewise, healthcare has a bottomless pit of laws, regulations, and essential complexity. Training custom models is possible, but the return on investment is not likely to be profitable as we uncover more complexity during the training.

When AI generates documentation for complex systems, it can describe the technical architecture and surface-level functionality but cannot capture the many whys behind critical decisions.

In the payroll example, why are transactions handled differently on weekends? Why must specific hour reports undergo additional verification steps? These domain-specific rationales are firmly rooted in the institutional knowledge of the team and domain experts who built the system. Past incidents and interpretations often shape this knowledge, which might not be public information.

The most valuable documentation for handoffs addresses these domain-specific nuances: the edge cases, historical context, and business justifications. They explain why a system works as it does. Here, AI falls short, creating a dangerous knowledge gap during team transitions.

Context Switching

As I explained with partially done work, AI-augmented development creates review bottlenecks, forcing team members to switch contexts between other work items rather than focusing on their work.

Even if we adjust the AI to solve the waste of overproduction and produce only minimum code, we aggravate the problem because a higher number of smaller batches demanding review chokes the throughput and causes even more context-switching.

More Pull Requests, Less Throughput

In a team heavily utilizing AI, we see a troubling pattern in the pull request process. Team members submit smaller and more frequent pull requests — averaging 12 per week instead of the previous 3–4.

While smaller pull requests generally sound better, the increased volume quickly overwhelms the system capacity. Other team members hop between multiple pull requests daily while their work suffers.

The team suffers from a bottleneck where people write code faster than they can review it. Everyone suffers from a high cognitive load and feels exhausted despite AI supposedly making their work easier. The team has optimized its AI tools for individual speed at the expense of team flow.

Defects

Whenever I've asked AI to generate a solution to a problem, it has consistently left out unit tests unless I specifically asked it to write them.

I'm unsurprised, having seen many hastily written codebases during my career. The company behind the AI model has likely trained it with a significant subset of public codebases lacking tests. How could it know tests are needed in every serious programming context if many teams do not lead by example?

If AI does not, by default, write tests for you, then all the AI-augmented 10x developers will only deliver more defects than value. Reflecting on the possibility of AI-reviewed code, I doubt it would know to block a pull request when it lacks tests. That is solely the responsibility of the continuous delivery pipeline. Furthermore, when automated tests mean AI testing the code written and reviewed by another AI, it's another ticking time bomb causing havoc and defects in your production environment.
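As an illustration of what such a pipeline responsibility could look like, here is a minimal sketch of a check that flags changed source files arriving without a matching test change. The repository layout (`src/<module>.py` paired with `tests/test_<module>.py`) and the function name are assumptions made for this example, not a convention from any particular tool.

```python
from pathlib import PurePosixPath


def source_files_without_tests(changed_files):
    """Return changed source files whose matching test file is not in the change set.

    Assumes a hypothetical layout: src/<module>.py is tested by tests/test_<module>.py.
    """
    changed = {PurePosixPath(p) for p in changed_files}
    missing = []
    for path in sorted(changed):
        # Only top-level Python files under src/ count as source here.
        if path.parts[:1] != ("src",) or path.suffix != ".py":
            continue
        expected_test = PurePosixPath("tests") / f"test_{path.name}"
        if expected_test not in changed:
            missing.append(str(path))
    return missing


change_set = ["src/api.py", "src/auth.py", "tests/test_auth.py", "README.md"]
print(source_files_without_tests(change_set))  # ['src/api.py']
```

A pipeline step could fail the build whenever the returned list is non-empty. The check only proves that a test file was touched, not that the tests are meaningful, so it complements rather than replaces human review.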

Risks of Neglecting Tests

In an example scenario, a project that used AI-generated modules extensively passed all the quality assurance checks, and the team deployed the changes to production. Within the first week, the team discovered critical vulnerabilities in the API layer the AI had implemented.

Despite looking professional and working correctly for the happy path, the code lacked functional verification for edge cases, error handling, and security headers. The developers had asked for an API fulfilling acceptance criteria but hadn't specifically defined the level of testing they needed. The AI obliged by generating code that functionally worked, but its fundamental insecurities were not caught in the delivery pipeline. The rework cost the team weeks of emergency patching and a security incident review, not to mention losing their reputation along with clients.

Making AI Work Meaningful

Though this blog post might make the current situation look bleak, there is still plenty of hope. Making AI-augmented software delivery work is a matter of using the correct tools correctly. That means that instead of accruing waste, we must use AI to expose and eliminate it.

Establishing Quality and Guidelines

Teams must establish guardrails that ensure AI builds quality rather than accelerates output. Have clear guidelines for how and where AI tools fit your delivery workflow.

First, we must establish quality gates that AI-generated code must pass before integration. Those include automated checks in the form of unit and preferably mutation tests, complexity metrics via static analysis, security vulnerability and software bill of materials scanning, and adherence to architectural patterns. When you practice continuous delivery and automate these gates into your CI/CD pipeline, you create safeguards against the quality issues inherent in AI-generated solutions.
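A gate of this kind boils down to comparing measured values against agreed thresholds and blocking integration on any failure. The sketch below assumes the measurements have already been collected by other tools. The `GateReport` fields and the threshold values are hypothetical and would need to be agreed on by each team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateReport:
    """Measurements collected earlier in the pipeline (hypothetical fields)."""
    tests_passed: bool
    mutation_score: float          # fraction of mutants killed, 0.0 to 1.0
    max_complexity: int            # highest cyclomatic complexity of any function
    critical_vulnerabilities: int  # findings from a security scan


# Hypothetical thresholds; tune these to your own context.
MIN_MUTATION_SCORE = 0.8
MAX_COMPLEXITY = 10


def failed_gates(report):
    """Return the names of all gates the change fails, in a fixed order."""
    failures = []
    if not report.tests_passed:
        failures.append("unit tests")
    if report.mutation_score < MIN_MUTATION_SCORE:
        failures.append("mutation score")
    if report.max_complexity > MAX_COMPLEXITY:
        failures.append("complexity")
    if report.critical_vulnerabilities > 0:
        failures.append("vulnerabilities")
    return failures


report = GateReport(tests_passed=True, mutation_score=0.65,
                    max_complexity=14, critical_vulnerabilities=0)
print(failed_gates(report))  # ['mutation score', 'complexity']
```

Reporting every failed gate at once, instead of stopping at the first one, gives the developer the full picture in a single pipeline run.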

Second, implement and document a team-wide protocol for AI tool usage. For example, consider a policy of not deploying AI-generated code without peer-reviewing your prompt and the generated answers. Teams should require their developers to explain the functionality of AI-generated code, ensuring they understand what they're adding to the product. Using agentic tools such as Cursor and Warp, which propose and implement changes incrementally while keeping the developers in the loop, is helpful. Defining and documenting recommended prompting techniques as a shared library is another trick worth considering.

Third, invest in AI literacy across your delivery pipeline. If you have quality assurance specialists, they should share an understanding of how and why they augmented test automation code with AI. Product owners should learn to write requirements without embedding a hallucinated well of wishes from AI. Most importantly, tech leads should become experts in identifying and removing AI waste described in this post.

The Path Forward with AI-Augmented Software Delivery

Thinking of AI as a companion instead of a driver plays a key role here. We have already seen increased perceived throughput using AI while sacrificing quality and stability. We must train, tune, and prompt AI to provide quality solutions to balance the situation. At the same time, we still need to continuously educate ourselves on modern software engineering principles. Regard AI as someone with access to virtually unlimited knowledge resources but lacking a deep understanding of how to apply those to your context.

Thus, the lesson here is that quality does not fall into our hands unless we demand it. To be able to demand it, we must learn to recognize it. More than training the AI, we still need to train ourselves in fundamentals. Tools amplify our capabilities but also our limitations. An undisciplined team with access to a powerful AI produces poorly architected systems faster. Conversely, a team grounded in solid engineering principles will benefit even from a contextually limited AI.

At Polar Squad, we believe that treating AI as a companion rather than a driver, using it responsibly, and using it to serve the entire delivery lifecycle instead of as a mere coding assistant can make working with it meaningful. If you require human assistance to master AI assistance, we are ready to help you.

Seriously, can we help you out with AI assistance? Let’s chat and see how we can help you! tuomas.lindholm@polarsquad.com – +358 40 177 1719

Where's the undo button? (Part III) — Polar Squad

Polar Squad — Tue, 23 Apr 2024 21:00:00 GMT

This is the third part of the blog series where we examine the relationship between DevOps and safety. My name is Tuomo Niemelä and I work as a DevOps consultant at Polar Squad which operates in the intersection of people and technology.

You can read the first part from here and the second part here.

Everyday safety

If the previous parts were too technical or academic, don’t worry! There are still plenty of ways to “do DevOps” in everyday situations. I’m going to list a couple of real-life examples relating to the points I listed in the previous part. While some might see these methods and ideas as obvious, I still think these things are worth saying out loud.

“Safety is not mere emotional weather but rather the foundation on which strong culture is built. The deeper questions are, Where does it come from? And how do you go about building it?”

— Daniel Coyle

Getting over fear of failure

One of my all-time favorites happens during daily meetups. You’re going through who is focusing on what today and whether there are any blockers. Imagine a situation where your colleague says in a lower tone that one fix he is implementing isn’t going his way. He might even be hinting there’s something wrong with his intelligence or skills. Pause here: This is the exact moment when you need to catch the exposed vulnerability, which happens in milliseconds, and embrace it. You could say something like “That’s ok, I find such things hard all the time - there was this one time when…” and continue from there with something that expresses your own vulnerability in exchange.

The previous scenario can also happen the other way around. Once the time is right, I might ridicule my own doings in order to give my team members a chance to pick up on my vulnerability and meet me halfway. Sometimes it works, sometimes not. These things take time and some trust that people want to do the right thing most of the time.

Something about innovation and creativity

I love brainstorming! You know the situations when it’s allowed to throw ideas no matter how crazy or stupid. Sometimes your colleague catches something essential from that and synthesizes a whole new solution from new ideas combined! Maybe while conducting an ordinary planning or problem solving session we’re not allowed to be as wild as in brainstorming sessions since the amount of useless or incorrect information could distract us from the actual solution.

Still there is something to learn from brainstorming: there is more freedom to fail. When you and your teammates are starting to jell this freedom emerges into any session. Just start with the classics like “I know this sounds stupid but…” or “I know I’m an idiot but…”, works like a charm! All team members should remember the following: Don't get hung up on little details or some syntax errors right away, there is always a chance to refine the solution after.

About team learning

Sometimes there can be this stigma towards two or more people doing the same thing. Since time is money and to maximize throughput every expensive developer should do only their own thing all the time right? Wrong! If we’re going to work as a team - an antifragile team - there needs to be cooperation and knowledge sharing. It can even happen in any mundane task.

For example while doing a bigger (small is better I know) production deployment with database manipulation I might ask a teammate to join me as an extra pair of eyes. Just in case. Not only does this reduce stress by sharing the burden, it also offers us a chance to share our viewpoints about the whole process and the state of the system or tools.

It starts with communication

Sarcasm: it’s an art form. I’m sure your intention isn’t to hurt anyone. Anyway if you don’t know when it’s the right place and time to use it on people, then don’t. There is this unwritten rule for when to use sarcasm among close friends or colleagues. Just note that even though you might feel that it is okay to use sarcasm on a person, that person might not feel the same. Sarcasm is dangerous. “But that person just doesn’t understand humor!” No! Now you’re just an asshole.

Lastly, there are a couple of things we need to remember: we need to stop downplaying the problems at hand and the successes in the end. Firstly, there's no such thing as a trivial problem. If there were, we wouldn't call them problems. Every person is at a different stage of their career path. Something which is trivial to you might not be so trivial to others. Secondly, remember to celebrate even the small wins out loud - together. Software is never going to be fully ready or perfect. Its life cycle continues to evolve after production launch. My point being: don’t ever rob people of feeling good about themselves and don’t get blinded by the continuous improvement cycles.

It comes down to trust

But how do we people build trust? That could be yet another topic on its own. In the meantime all I can say is that it usually helps if one isn't a complete asshole. Listen to people and build from there. Showing vulnerability is a leap of faith but around the right people it’s always worth it.

Once the team starts to jell, the details mentioned earlier become more natural and automatic. I would still recommend keeping an eye on them, especially while the team is new or a new team member is introduced to the pack. Also: take care of your juniors, and someday they might be your seniors. If you want to dig deeper into these topics, I highly recommend the book “The Culture Code” by Daniel Coyle. Now let’s end this.

There’s no such thing as 100% safety: the unspeakable happens

In this journey I wanted to share my viewpoint that DevOps represents first and foremost safety. We’ve been examining some technological practices and solutions which one could implement into systems to gain more safety, possibly making the system more welcoming and manageable for anyone who wishes to learn it.

We went through 5 key points in the psychological safety framework and ended it all with real-life examples and practical ideas. Since psychological safety is such a huge part of the whole picture, I could almost call technological safety “everything else” safety just for balance. Now I want you to go through one last thought experiment.

“Everyone has a plan until they get punched in the mouth.”

— Mike Tyson

I want you to imagine a situation where you are making some major changes to the production environment. Similar changes went well in the test environment earlier, the tests show green, and you feel pretty confident that everything will go just fine. There were processes in place and you followed them. There were some safety measures in place, but somehow, by accident, you went around them. Maybe some critical part of the automation, like the database backup, failed as well. The inevitable human error happens, and now you have wrecked the live production environment beyond repair.

Here's where processes and tools end, and culture begins. How do your team and organization react to these kinds of events? Is there redemption? What kind of company are you in? What are your values? Systems will fail, but that doesn't mean the people around you must also fail you.

Bonus tip: If you’re conducting “blameless postmortems” while being tense or with overly mechanical efficiency you might still be doing just plain postmortems.

Polar Squad Blog

Check your Kubernetes deployments! — Polar Squad

Typical flow for deploying applications to Kubernetes

Rollout to the rescue!

Readiness probes and deadlines

Scripting automated rollback

Conclusion

Cloud cost and resource optimization: How Polar Squad can save you money while reducing your carbon footprint — Polar Squad

DevOps solutions

Polar Squad to the rescue

Craftsperson's Guide to GitHub Actions #3: Building and Releasing — Polar Squad

Building Your Action

Trust, But Verify the Action

Semantic Versioning Done Right

Post-Release Verification: Test Like a User

Conclusion: Actions Are Software

What You've Learned

Next Steps

Craftsperson's Guide to GitHub Actions #2: Scaling Up the Testing — Polar Squad

The ROT-13 Action

Unit Testing: Fast Feedback Without Waiting

Property-Based Testing: Testing What You Can't Imagine

Mutation Testing: The Ultimate Reality Check

Conclusion

Craftsperson's Guide to GitHub Actions #1: Designing for Success — Polar Squad

The Hidden Cost of Implicit Dependencies

Import Smells: When Dependencies Use You

Global Mutable State: The Silent Killer

The Time Problem

The Solution: Dependency Injection

Conclusion

Craftsperson's Guide to GitHub Actions — Polar Squad

What You'll Learn

Who This Guide Is For

Series Chapters

In a struggle to tame Large Language Models — Polar Squad

Introduction

The initial WOW phase

Augmentation phase

Function-calling Phase

MCP under the hood

Applications of MCP

Where are we today

Challenges

Where are we heading to

More reading

Horseless carriage: AI is not just for faster coding — Polar Squad

More than just faster coding

Explosion of experimentation

Show, don’t just tell

From a programmer to a product manager

Multi-agentic workflows

What’s left for humans to do?

AI-augmented Software Development: Hype, Vibes and Smoking Production Environments — Polar Squad

The True Nature of Software Delivery

Understanding the Value Streams in Software Delivery

Lean Principles and Identifying Waste in Software Delivery

Partially Done Work

Bottlenecks and the Illusion of Speed

Overproduction

AI's Tendency Towards Complexity

Relearning and Rework

The Cost of Unreviewed AI-Generated Code

Handoffs

The Challenge of Domain-Specific Knowledge Transfer

Context Switching

More Pull Requests, Less Throughput

Defects

Risks of Neglecting Tests

Making AI Work Meaningful

Establishing Quality and Guidelines

The Path Forward with AI-Augmented Software Delivery

Where's the undo button? (Part III) — Polar Squad

Everyday safety

Getting over fear of failure

Something about innovation and creativity

About team learning

It starts with communication

It comes down to trust

There’s no such thing as 100% safety: the unspeakable happens