Do your tests communicate?

Osher El-Netanany
Published in Israeli Tech Radar
11 min read · Jul 9, 2023


The key to moving fast is the confidence to make changes.

The confidence to make changes depends on test coverage.

Ever since we figured that out, automated tests have become essential. This led to the mass adoption of getting-started snippets.

However, getting started is not enough.

Mind the gap (image from here)

What’s wrong with getting-started snippets?

Most examples available aim to get one started quickly. This is true for Stack Overflow, CodeWhisperer, Copilot, Bard, and ChatGPT. Even the tutorials of test frameworks share the blame.

Why?

  1. They aim for the lowest common denominator of readers. As such they skip over professional details.
  2. They defer the steep part of the learning curve. You will meet it only after you’re already hooked on their framework.
  3. There are opposing opinions about best practices and styles. So tutorials either try to stay unopinionated, or present their opinion as fact.
    Both ways avoid the discussion and the thought it should inspire.
  4. Many optimizations depend on the team’s capabilities, culture, and concrete case.

The results usually work, but they are student-grade at best.

The next parts explain the pitfall in our culture, why it's broken, and what leads out of it.
For the how and the code samples — skip here.

Missing Reality

We live the reality of rushing to the finish line of some value delivered to a customer. We forget that this finish line is an arbitrary lap in a longer race.

“We’re all mad here” / Cheshire Cat. (image from here)

The reality we miss is:

A codebase is established once but is tested continuously throughout its life.

T.T.R. — Time to Recovery

Codebases are bound to grow, and tests are bound to fail.
But how much time does it take to recover from a failed test?

Poorly placed. (image from here)

In this sense, a lot of our work is not writing new code, but getting into existing code. We need to decipher both the production code and the test code to judge between them:

  • Does the test code guard a valid spec that must be upheld?
    — or —
  • Has a new spec rendered the older obsolete?

The less deciphering you need, the better your TTR. There is an even better reality:

What if the tests could communicate precisely what requirements they are guarding?

What if the test output was all you need to know what you have to fix without deciphering any code?

What if it could do that for you with the same clarity even 6 months later?

Well, you could skip that most frustrating part of the recovery...

Cultural Key

Passing on the knowledge, 5th element (image from here)

Our industry is in a perpetual state of inexperience (check this link of Uncle Bob explaining it), and so culture fails to propagate.

This means that a lot of us have to reinvent the same wheel over and over. And the fact that we end up with similar results suggests it's the right answer.

A good culture nourishes continuous improvement.

Treating a failed test like an outage is another cultural element.
Optimizing TTR is a cultural element.

But how to get there?

Leading Factors

What leads there? (image from here)

What’s common between a notification on a service outage, and a notification on a failed build?

  • Both are notifications that disrupt your workflow and that have to be handled.
  • Both are likely to get you started on a troubleshooting course to deduce the context.
  • Any minute you spend on either of them is putting out fires instead of making progress.

Sure, not the same size of flame, but in essence — the same lousy feeling of waste and miscommunication.

The magic happens once the team decides to act on the similarity between a failed test and a down-time outage.

And the lessons are

Relatively speaking... (image from here)

When you treat a failed test like an outage, optimizing TTR leads to a few conclusions:

  1. Aspire for the test output to be verbose enough to save you from reading the test code, i.e., spit out all the context you need along with the errors.
  2. Do not assume the developer knows. Take the extra step to describe the use-case and the context.
  3. Divert the cognitive load away from scaffold and instrumentation.
  4. Focus the cognitive load on the meaningful details of the test case.

Let’s start from the worst case and improve on it, a few small steps at a time.

Level (-5) — the naïve getting started

Not funny. (image from here)

Unfortunately, as a consultant I still get to see test suites in this spirit:

const myModule = ... // require or import the System-Under-Test

it("should work", async () => {
  await setup(...);

  await step1(...);
  expect(...)...;

  await step2(...);
  expect(...)...;

  await step3(...);
  expect(...)...;

  // and a load more of those in the same function
});

This is the bare minimum that can stop erroneous code from being deployed.

To the most improved form of this post — skip here.

Want to live there? (image from here)

This is the test-world equivalent of unorganized, poorly named, unarchitected, deeply case-nested code, full of copy-pastes and with no regard to isolation or separation of concerns. Many would disdainfully call it a script, as if a script is not code (what a lame self-deceit…)!

And yet, too many teams do not require their test code to be more.

By the end of this post, you should be able to tell just how horrible it is. Well, sure, it’s better than no tests at all; however, at the pace of our industry, it will soon clog your progress.

What’s wrong with level (-5)?

Several things.

  1. It pays homage to the BDD directive to “follow the English wording of the API”. But it does so in a way that does not provide any information about the test case or the scenario.
  2. When any step fails — the entire scenario breaks and any following steps do not get to run.
    (ℹ️) Sometimes I see a try-catch with a cleanup attempt (see the sketch below). It is not much better, because you still have to bail out or rethrow the errors the test is supposed to produce. This makes you work for the test-runner instead of having it work for you.
  3. When the scenario fails, all the indication you get is the error. When the error is raw — it usually is cryptic and generic and does not provide much useful information.
  4. When a few steps in the scenario could throw a similar error, it gets confusing. This makes it hard to note the point of failure.
    (ℹ️) Sometimes I see console.log calls that try to help identify the point in the test flow. But that is, again, working for the test runner and the assertion libraries instead of letting them work for you.
  5. When there is a failure — you have no idea where is the culprit. Is the problem in the test code? I.e. did the test fail to arrange for, to interact with, or to clean up after the S.U.T (System-Under-Test)?
    Or is it because the S.U.T failed, i.e. a breaking change in production code?
    (ℹ️) Sometimes I see comments like //arrange or //setup and //cleanup or //teardown. But these are comments visible only in the test code, while the goal is to rid us of reading the test code.
Facepalm. A rather old gesture. (image from here)
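
To make point 2 concrete, here is a minimal sketch of the try-catch-with-cleanup pattern; everything in it (setup, step1, step2, cleanup) is a hypothetical name for illustration, not from any real codebase:

it('should work', async () => {
  const ctx = await setup();
  try {
    const result = await step1(ctx);
    expect(result.ok).to.equal(true);

    await step2(ctx);
    expect(ctx.state).to.equal('done');
  } catch (err) {
    await cleanup(ctx); // the cleanup does run, but...
    throw err;          // ...you must remember to rethrow, or the failure is swallowed
  }
  await cleanup(ctx);   // ...and the happy path needs its own cleanup call
});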

Given a failure, you’re most likely to spend valuable time jumping between test code and production code, trying to make sense of all that jabber before you can even judge which of them is right.

Addressing all the “(ℹ️) sometimes” mentions above might get you from level (-5) to level (-2), and still leave you far behind.

A structure like that fails any decent job interview.

Level 0 — Using titles

The next level I get to see is in the following spirit.

describe('my-module', () => {
  // async api_one(...)
  it('should do this when called with ...', async () => { ...
  it('should do that when called after ...', async () => { ...
  it('should throw that error when ...', async () => { ...

  // async api_two(...)
  it('should do this when called with ...', async () => { ...
  it('should do that when called after ...', async () => { ...
  it('should throw that error when ...', async () => { ...

There are a lot of examples like this over the internet, in tutorials, and in tests of open-source packages that are often used as a reference.

Here we’re in a far better state than the previous snippet:

  • It is organized
  • There’s an apparent thought about the case matrix
  • It is a basis for test isolation: a failure in one test will not prevent other cases from running
  • When any test fails, an English explanation appears in the test output along with the rejecting error.

What’s still wrong?

First, let’s start small: the order of the text is reversed.

If you look at a case matrix (or any truth table, for that matter), the conditions come first. In the natural order, you set the preconditions and then expect a behavior. You do not observe a behavior and then match it with the preconditions it worked under… at least at the human level, that form is confusing.

And what if, under similar conditions, you’d like to validate a few requirements? Should you repeat the conditions in every one of their titles? Will you repeat the asynchronous act each time just to check a different property?

That’s fuzzy... (image from here)

Second: when a test fails, you still have to read the test code.

Popular assertion libraries like should, expect, or chai try to imitate English. This allows coders to convey information about the test in the code. But it also brings the illusion that you can forgo good titles, sending you to read the test code on each failure.

Test code is still code, and code tends to have a poor signal-to-noise ratio. Even the things you may not consider noise demand cognitive effort.

Facepalm. Naturally (image from here)

Third — the sections are divided by comments, which are not accessible to spec-reporters.

The Reporter is the part the test runner uses to emit results to the test output. Most reporters emit a summary of failures in the end.

A Spec-reporter is a reporter that prints your entire test-case tree using your descriptions and titles, usually before the failures summary.
It marks each test in the tree with a pass/fail/skip notation. This lets you read the narrative your test tree is telling, and see where the failures fall in it.

Comments are inaccessible to test reporters — sending you back to reading the test code.

Still facepalm, though modern. (image from here)

With some discipline, this report can act as the software specification, i.e., readable documentation that comes right out of your test code.

Spec reporters work well in conjunction with pending tests. Using pending tests means adding spec titles without providing their test handlers, which makes them appear in the tree as skipped.
This is useful for writing down in the suite all the cases you mean to implement, and getting to them later, one by one.
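
For illustration, here is a minimal sketch in mocha style; chai and the parseDuration module are assumptions of the example, not part of the post. In mocha, an it() call with a title and no handler is registered as pending; in Jest, it.todo('title') serves the same purpose.

const { expect } = require('chai');
const { parseDuration } = require('../src/parse-duration'); // hypothetical System-Under-Test

describe('parseDuration(text)', () => {
  // an implemented case
  it('should parse "5m" as 300,000 ms', () => {
    expect(parseDuration('5m')).to.equal(300000);
  });

  // pending cases: a title with no handler is reported as skipped,
  // so the spec report already tells the story of what is planned
  it('should parse compound values like "1h30m"');
  it('should throw a descriptive error on unsupported units');
});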

I will always be nostalgic for the clean look & feel of mocha.js

The spec reporter is the default reporter for mocha, is built into tap, and is supported by node’s built-in test runner. It works with Jest using a plugin package.
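
If you want to pin the choice explicitly, mocha reads it from a config file; a minimal sketch (spec is mocha’s default anyway, and with newer node versions the built-in runner takes a similar flag):

// .mocharc.js (picked up automatically by mocha)
module.exports = {
  reporter: 'spec', // print the whole suite tree with pass/fail/skip marks
};

// with node's built-in runner, the equivalent is a CLI flag:
//   node --test --test-reporter=spec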

(ℹ️) Mind that you do not have to write tests first or work with TDD/BDD to use spec reporters. Run the spec reporter whenever you want to see what narrative your test tree is telling :)

Last — there are a few things you can ask the test runner to do for you:

  1. execute setup and cleanup for you.
  2. make sure that if a test fails, the cleanup will still happen
  3. fail a test whose setup or cleanup failed
  4. notify you of the error if the test failed on the setup or during the test itself.

Fixing these 4 issues will bring you to level (4).

Level (4) — basic professional

The professional cornerstone (image from here)

describe('my-module', () => {
  context('when used in case A…', () => {
    before(async () => { ... // case setup
    it('should fulfil requirement 1…', () => { ...
    it('should fulfil requirement 2…', () => { ...
    ...
    after(async () => { ... // cleanup
  })

  context('when used in case B…', () => {
    before(async () => { ... // case setup
    it('should fulfil requirement 1…', () => { ...
    it('should fulfil requirement 2…', () => { ...
    ...
    after(async () => { ... // cleanup
Mocha’s BDD interface offers the context API for describing a case’s context; in fact, it’s an alias for describe. Jest supports only describe and lets you nest it like mocha does, so using describe keeps things uniform between the two.

The Arrange and Act stages are performed in the asynchronous before hooks, e.g., inject test data and perform an HTTP request. Then all the Assert steps operate on the obtained response object and happen synchronously.
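
For example, here is a minimal sketch of that structure; supertest, chai, and the app and fixture modules are assumptions of the example, not something the post prescribes:

const request = require('supertest');
const { expect } = require('chai');
const app = require('../src/app'); // hypothetical System-Under-Test
const { seedUsers, clearUsers } = require('./fixtures/users'); // hypothetical fixture helpers

describe('GET /users/:id', () => {
  describe('when the user exists', () => {
    let response;

    before(async () => {
      // Arrange: inject test data
      await seedUsers([{ id: 42, name: 'Alice' }]);
      // Act: perform the HTTP request once
      response = await request(app).get('/users/42');
    });

    after(async () => {
      // runs even when one of the tests below fails
      await clearUsers();
    });

    // Assert: each requirement is a synchronous check on the obtained response
    it('should respond with status 200', () => {
      expect(response.status).to.equal(200);
    });

    it('should return the user name', () => {
      expect(response.body).to.have.property('name', 'Alice');
    });
  });
});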

Recap

You don’t have to pick one (image from here)

What have we accomplished so far?

  1. The test-runner makes sure that setup & teardown code runs even when the scenario fails. No try-catch, no console.log. On each failure, the test-runner will tell you exactly which handler failed, noting whether it’s a setup/teardown hook or the test itself.
  2. The context and case are communicated with titles that will be printed for each failed test.
  3. Each context is its own closure with its own variables. You can use it to hold a state that is relevant to the test case.

(ℹ️) mocha also lets you hold state on this, but then you’ll have to write all your handlers as old-school functions instead of arrow functions.
Personally, I abhor the use of this in JavaScript and prefer to hold my state in closure members, but you do you…
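
A minimal sketch of the this form, where act() is a hypothetical async action and expect is chai’s:

describe('my-module', function () {
  before(async function () {
    this.response = await act(); // state lives on mocha's shared test context
  });

  it('should respond with status 200', function () {
    expect(this.response.status).to.equal(200);
  });
});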

The road is still long; this is hardly halfway.
This is not the first post about tests (here are links to one, two, and three). There may be more parts in this series.

So, what’s next?

Heads up (Image from here)

There are many more levels to score.
For example:

  1. You can master mocking with spies and stubs.
    (ℹ️) But be careful not to get lost in the scope of units and miss testing that the system works as a whole.
  2. You can use test-case factories (see the sketch after this list). These let you express your tests with a sole focus on inputs, outputs, and expectations, which is much better than copy-pasting a whole structure and hacking inside it.
  3. You can organize data fixtures: facilitate their setup and cleanup, and organize them as modules so you can import their setup/teardown hooks as well as the injected data itself. This lets you use and assert against logical entities instead of values hardcoded in your test.
  4. You can produce coverage reports and integrate code-smells detection. You can then use moving ratchets of quality bars.
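
To give a taste of point 2, here is a minimal sketch of a test-case factory; slugify and its module path are hypothetical:

const { expect } = require('chai');
const { slugify } = require('../src/slugify'); // hypothetical System-Under-Test

// the factory turns a plain data record into a registered test case,
// so each case is expressed as inputs and expectations only
const shouldSlugify = ({ input, expected }) =>
  it(`should turn '${input}' into '${expected}'`, () => {
    expect(slugify(input)).to.equal(expected);
  });

describe('slugify(text)', () => {
  [
    { input: 'Hello World', expected: 'hello-world' },
    { input: '  trim me  ', expected: 'trim-me' },
    { input: 'Crème Brûlée', expected: 'creme-brulee' },
  ].forEach(shouldSlugify);
});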

Now it’s your turn to teach me:
- LMK: which of them should I cover first?
- Show me in claps, between 1 and 50, how you liked this work.
💙 I appreciate your engagement and time 💙

Special thanks to Yonatan Kra, a good man I worked with in the past whose video finally kicked me back to sit and write all this.
