Key Aspects of a Well Designed Test

A badly written test can lead to false passes or false failures, as well as inaccurate interpretations of the specs. It is therefore important that all tests be of a high standard. All tests must follow the test format guidelines, and well-designed tests should meet the following criteria:

  • The test passes when it's supposed to pass
  • The test fails when it's supposed to fail
  • It's testing what it claims to be testing

Self-Describing Tests

As the tests are likely to be used by many other people, making them easy to understand is very important. Ideally, tests are self-describing: the test page itself describes what the page should look like when the test has passed. A human examining the test page can then determine from the description whether the test has passed or failed.

Note: The terms “the test has passed” and “the test has failed” refer to whether the user agent has passed or failed a particular test — a test can pass in one web browser and fail in another. In general, the language “the test has passed” is used when it is clear from context that a particular user agent is being tested, and the term “this-or-that-user-agent has passed the test” is used when multiple user agents are being compared.

Self-describing tests have some advantages:

  • They can be run easily on any layout engine.
  • They can test areas of the spec that are not precise enough to be comparable to a reference rendering. (For example, underlining cannot be compared to a reference because the position and thickness of the underline are UA-dependent.)
  • Failures can (should) be easily determined by a human viewing the test without needing special tools.

Manual Tests

While it is highly encouraged to write automatable tests either as reftests or script tests, in rare cases a test can only be executed manually. All manual tests must be self-describing tests. Additionally, manual tests should be:

  • Easy and quick to evaluate
  • Self-explanatory, requiring no understanding of the specification to determine the result
  • Short (a paragraph or so), and certainly not requiring scrolling on even the most modest of screens, unless the test is specifically for scrolling or paginating behaviour.


Reftests

Reftests should be self-describing tests wherever possible. This means the descriptive statement included in the test file must also appear in the reference file so their renderings may be automatically compared.
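As a minimal sketch of that pairing, the two hypothetical files below carry the same statement, so the statement itself never causes a mismatch. The file names, and the use of `<link rel="match">` to point at the reference (the web-platform-tests convention), are assumptions for illustration and are not specified by this section.

Test file (hypothetical name `green-box.html`):

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical reftest: padding adds to the content height</title>
<link rel="match" href="green-box-ref.html">
<style>
  /* 50px of content height plus 50px of padding should paint 100px of green. */
  .box { width: 100px; height: 50px; padding-bottom: 50px; background: green; }
</style>
<p>Test passes if there is a filled green square below.</p>
<div class="box"></div>
```

Reference file (hypothetical name `green-box-ref.html`):

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Reference: 100px green square</title>
<style>
  /* Same rendering, built without the feature under test. */
  .box { width: 100px; height: 100px; background: green; }
</style>
<p>Test passes if there is a filled green square below.</p>
<div class="box"></div>
```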

Script Tests

Script tests may also be self-describing, but rather than including a supplemental statement on the page, this is generally done in the test results output from testharness.js.
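A minimal sketch of such a script test is shown below. The `/resources/` script paths are the usual web-platform-tests locations and are an assumption here; the element id and the test name are hypothetical. The string passed to `test()` is what appears in the results output, so it plays the role of the descriptive statement.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical script test: computed color of a paragraph</title>
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<style>
  #target { color: green; }
</style>
<p id="target">Sample text</p>
<div id="log"></div>
<script>
  // The second argument is the test name reported in the results output.
  test(function () {
    var target = document.getElementById("target");
    assert_equals(getComputedStyle(target).color, "rgb(0, 128, 0)");
  }, "The paragraph's computed color is green");
</script>
```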

Self-Describing Test Examples

The following are some examples of self-describing tests, using some common techniques to identify passes:


In addition to the self-describing statement visible in the test, there are many techniques commonly used to add clarity and robustness to tests. Particularly for reftests, which rely wholly on how the page is rendered, the following should be considered and used when designing new tests.

Indicating success

The green paragraph

This is the simplest form of test, and is most often used when testing the things that are independent of the rendering, like the CSS cascade or selectors. Such tests consist of a single line of text describing the pass condition, which will be one of the following:

This line should be green.

This line should have a green border.

This line should have a green background.
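A minimal sketch of a green paragraph test for selectors might look like this. The class name and the choice of feature (a class selector overriding a type selector) are assumptions for illustration.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: class selector overrides type selector</title>
<style>
  /* If the class selector is not applied, the line stays red. */
  p { color: red; }
  .pass { color: green; }
</style>
<p class="pass">This line should be green.</p>
```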

The green page

This is a variant on the green paragraph test. Certain parts of CSS affect the entire page, and this category of test may be used when testing them. Care must be taken that such a test does not result in a single green paragraph if it fails. This is usually done by forcing the short descriptive paragraph to have a neutral color (e.g. white).

This example is poorly designed, because it does not look red when it has failed.

The green square

This is the best type of test for cases where a particular rendering rule is being tested. The test usually consists of two boxes of some kind that are (through the use of positioning, negative margins, zero line height, transforms, or other mechanisms) carefully placed over each other. The bottom box is colored red, and the top box is colored green. Should the top box be misplaced by a faulty user agent, it will cause the red to be shown. (These tests sometimes come in pairs, one checking that the first box is no bigger than the second, and the other checking the reverse.) These tests frequently look like:
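One possible shape, sketched here with a negative margin as the placement mechanism (the class names and the 100px size are assumptions for illustration):

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: negative top margin</title>
<style>
  .red   { width: 100px; height: 100px; background: red; }
  /* Pulled up by its own height so that it exactly covers the red box. */
  .green { width: 100px; height: 100px; margin-top: -100px; background: green; }
</style>
<p>Test passes if there is a filled green square and no red.</p>
<div class="red"></div>
<div class="green"></div>
```

If the negative margin is ignored or miscalculated, some or all of the red box is left uncovered.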

The green paragraph and the blank page

These tests appear to be identical to the green paragraph tests mentioned above. In reality, however, they actually have more in common with the green square tests, but with the green square colored white instead. This type of test is used when the displacement that could be expected in the case of failure is likely to be very small, and so any red must be made as obvious as possible. Because of this, the test would appear totally blank when the test has passed. This is a problem because a blank page is the symptom of a badly handled network error. For this reason, a single line of green text is added to the top of the test, reading something like:
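For instance, the line might read as in the sketch below. The exact wording of the statement, the class names, and the sizes are assumptions for illustration.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: white cover over red</title>
<style>
  p      { color: green; }
  .red   { width: 100px; height: 100px; background: red; }
  /* A white box hides the red, so a passing page looks blank below the text. */
  .cover { width: 100px; height: 100px; margin-top: -100px; background: white; }
</style>
<p>This line should be green and there should be no red on this page.</p>
<div class="red"></div>
<div class="cover"></div>
```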

The two identical renderings

It is often hard to make a test that is purely green when the test passes and visibly red when the test fails. For these cases, it may be easier to make a particular pattern using the feature that is being tested, and then have a reference rendering next to the test showing exactly what the test should look like.

The reference rendering could be either an image, in the case where the rendering should be identical, to the pixel, on any machine, or the same pattern made using different features. (Doing the second has the advantage of making the test a test of both the feature under test and the features used to make the reference rendering.)
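A minimal sketch of the second approach follows: the same pattern is built once with borders (the feature assumed to be under test) and once with padding and a nested box. The class names, colors, and sizes are assumptions for illustration.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: solid borders compared with a reference pattern</title>
<style>
  /* Test: a 60px orange box with a 20px teal border (100px overall). */
  .test { width: 60px; height: 60px; border: 20px solid teal; background: orange; }
  /* Reference: the same pattern made from padding and a nested box. */
  .ref     { width: 60px; padding: 20px; margin-top: 10px; background: teal; }
  .ref div { height: 60px; background: orange; }
</style>
<p>Test passes if the two patterns below are identical.</p>
<div class="test"></div>
<div class="ref"><div></div></div>
```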

Visual Example 1

Visual Example 2

Text-only Example

Indicating failure

In addition to having clearly defined characteristics when they pass, well designed tests should have some clear signs when they fail. It can sometimes be hard to make a test do something only when the test fails, because it is very hard to predict how user agents will fail! Furthermore, in a rather ironic twist, the best tests are those that catch the most unpredictable failures!

Having said that, here are the best ways to indicate failures:


Red

Using the color red is probably the best way of highlighting failures. Tests should be designed so that if the rendering is a few pixels off some red is uncovered or otherwise rendered on the page.

Visual Example

Text-only Example

View the pages' source to see the usage of the color red to denote failure.

Overlapped text

Tests of the line-height, font-size and similar properties can sometimes be devised in such a way that a failure will result in the text overlapping.

The word “FAIL”

Some properties lend themselves well to this kind of test, for example quotes and content. The idea is that if the word “FAIL” appears anywhere, something must have gone wrong.


View the page's source to see the usage of the word FAIL.
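A minimal sketch of the technique with the content property is given below. The class name, the attribute name, and the choice of an attribute selector as the feature under test are assumptions for illustration.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: attribute selector with generated content</title>
<style>
  /* Fallback text, shown only if the more specific rule below is not applied. */
  .result::before { content: "FAIL"; }
  .result[data-state]::before { content: "PASS"; }
</style>
<p>Test passes if the word "PASS" appears below and the word "FAIL" does not.</p>
<p class="result" data-state="on"></p>
```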

Special Fonts


Todd Fahrner has developed a font called Ahem, which consists of some very well defined glyphs of precise sizes and shapes. This font is especially useful for testing font and text properties. Without this font it would be very hard to use the overlapping technique with text.

The font's em-square is exactly square. Its ascent and descent together are exactly the size of the em square. This means that the font's extent is exactly the same as its line-height, meaning that it can be exactly aligned with padding, borders, margins, and so forth.

The font's alphabetic baseline is 0.2em above its bottom, and 0.8em below its top.

The font has four glyphs:

  • X U+0058 A square exactly 1em in height and width.
  • p U+0070 A rectangle exactly 0.2em high, 1em wide, and aligned so that its top is flush with the baseline.
  • É U+00C9 A rectangle exactly 0.8em high, 1em wide, and aligned so that its bottom is flush with the baseline.
  • (space) U+0020 A transparent space exactly 1em high and wide.

Most other US-ASCII characters in the font have the same glyph as X.

Ahem Usage

If the test uses the Ahem font, make sure its computed font-size is a multiple of 5px, otherwise baseline alignment may be rendered inconsistently (due to rounding errors introduced by certain platforms' font APIs). We suggest using a minimum computed font-size of 20px.

E.g. Bad:

{font: 1in/1em Ahem;}  /* Computed font-size is 96px */
{font: 1in Ahem;}
{font: 1em/1em Ahem} /* with computed 1em font-size being 16px */
{font: 1em Ahem;} /* with computed 1em font-size being 16px */

E.g. Good:

{font: 100px/1 Ahem;}
{font: 1.25em/1 Ahem;} /* with computed 1.25em font-size being 20px */

If the test uses the Ahem font, make sure the line-height on block elements is specified; avoid line-height: normal. Also, for absolute reliability, the difference between computed line-height and computed font-size should be divisible by 2.

E.g. Bad:

{font: 1.25em Ahem;} /* computed line-height value is 'normal' */
{font: 20px Ahem;} /* computed line-height value is 'normal' */
{font-size: 25px; line-height: 50px;} /* the difference between
computed line-height and computed font-size is not divisible by 2. */

E.g. Good:

{font-size: 25px; line-height: 51px;} /* the difference between
computed line-height and computed font-size is divisible by 2. */

Example test using Ahem

View the page's source to see how the Ahem font is used.
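As a minimal sketch, the test below applies the overlapping technique to text using Ahem. It assumes the font is installed on the test machine; the class names are hypothetical. The font-size of 20px is a multiple of 5px, and the line-height of 1 makes the difference between line-height and font-size zero, which satisfies the guidance above.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: negative margin with Ahem text</title>
<style>
  /* Each X glyph is a filled 1em square, so each line is a solid 100px by 20px bar. */
  div    { font: 20px/1 Ahem; }
  .red   { color: red; }
  /* Pulled up by exactly one line so the green bar covers the red one. */
  .green { color: green; margin-top: -1em; }
</style>
<p>Test passes if there is a green bar and no red.</p>
<div class="red">XXXXX</div>
<div class="green">XXXXX</div>
```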

Installing Ahem
  1. Download the TrueType version of Ahem.
  2. Open the folder where you downloaded the font file.
  3. Right-click the downloaded font file and select “Install”.

Explanatory Text

For tests that must be long (e.g. scrolling tests), it is important to make it clear that the filler text is not relevant; otherwise the tester may think they are missing something and waste time reading it. Good text for use in these situations is, quite simply, “This is filler text. This is filler text. This is filler text.” If it looks boring, it's working!


Color

In general, using colors in a consistent manner is recommended. Specifically, the following convention has been developed:

Red

Any red indicates failure.

Green

In the absence of any red, green indicates success.

Blue

Tests that do not use red or green to indicate success or failure should use blue to indicate that the tester should read the text carefully to determine the pass conditions.

Black

Descriptive text is usually black.

Fuchsia, Yellow, Teal, Orange

These are useful colors when making complicated patterns for tests of the two identical renderings type.

Dark Gray

Descriptive lines, such as borders around nested boxes, are usually dark gray. These lines come in useful when trying to reduce the test for engineers.

Silver / Light Gray

Sometimes used for filler text to indicate that it is irrelevant.

Methodical testing

Some web features can be tested quite thoroughly with a very methodical approach. For example, testing that all the length units work for each property taking lengths is relatively easy, and can be done methodically simply by creating a test for each property/unit combination.

In practice, the important thing to decide is when to be methodical and when to simply test, in an ad hoc fashion, a cross section of the possibilities.
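As a minimal sketch of the methodical approach, the test below exercises one property (width) with each absolute length unit in turn; the same pattern would be repeated for every other property that takes lengths. The class names are hypothetical. Each green bar starts at width 0, so a unit that is not recognized leaves its red track fully visible.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical methodical test: absolute length units on width</title>
<style>
  /* Each red track is 96px wide; each green bar starts at width 0. */
  .track       { width: 96px; height: 20px; margin-bottom: 4px; background: red; }
  .track > div { width: 0; height: 20px; background: green; }
  /* One rule per unit. Every value below is defined to equal 96px. */
  .track > .px { width: 96px; }
  .track > .in { width: 1in; }
  .track > .pt { width: 72pt; }
  .track > .pc { width: 6pc; }
  .track > .cm { width: 2.54cm; }
  .track > .mm { width: 25.4mm; }
</style>
<p>Test passes if there are six green bars of equal width and no red.</p>
<div class="track"><div class="px"></div></div>
<div class="track"><div class="in"></div></div>
<div class="track"><div class="pt"></div></div>
<div class="track"><div class="pc"></div></div>
<div class="track"><div class="cm"></div></div>
<div class="track"><div class="mm"></div></div>
```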

This is an example of a methodical test of the :not() pseudo-class with each attribute selector in turn, first for long values and then for short values.


Overlapping

This technique should not be cast aside as a curiosity -- it is in fact one of the most useful techniques for testing CSS, especially for areas like positioning and the table model.

The basic idea is that a red box is first placed using one set of properties, e.g. the block box model's margin, height and width properties, and then a second box, green, is placed on top of the red one using a different set of properties, e.g. using absolute positioning.

This idea can be extended to any kind of overlapping, for example overlapping two lines of identical text of different colors.
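A minimal sketch of the box version follows: the red box is placed with margin, width, and height, and the green box is placed over it with absolute positioning. The class names and the 40px offset are assumptions for illustration.

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>Hypothetical test: left offset matches left margin</title>
<style>
  .container { position: relative; height: 100px; }
  /* Placed with the block box model: margin, width, and height. */
  .red   { margin-left: 40px; width: 100px; height: 100px; background: red; }
  /* Placed with a different set of properties: absolute positioning. */
  .green { position: absolute; top: 0; left: 40px; width: 100px; height: 100px; background: green; }
</style>
<p>Test passes if there is a filled green square and no red.</p>
<div class="container">
  <div class="red"></div>
  <div class="green"></div>
</div>
```

If the two placement mechanisms disagree by even a pixel, a strip of red becomes visible.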

Tests to avoid

The long test

Any manual test that is so long that it needs to be scrolled to be completed is too long. The reason for this becomes obvious when you consider how manual tests will be run. Typically, the tester will be running a program (such as “Loaderman”) which cycles through a list of several hundred tests. Whenever a failure is detected, the tester will do something (such as hit a key) that takes a note of the test case name. Each test will be on the screen for about two or three seconds. If the tester has to scroll the page, they have to stop the test to do so.

Of course, there are exceptions -- the most obvious one being any tests that examine the scrolling mechanism! However, these tests are considered tests of user interaction and are not run with the majority of the tests.

Any test that is so long that it needs scrolling can usually be split into several smaller tests, so in practice this isn't much of a problem.

This is an example of a test that is too long.

The counterintuitive “this should be red” test

As mentioned many times in this document, red indicates a bug, so nothing should ever be red in a test.

There is one important exception to this rule... the test for the red value for the color properties!

Unobvious tests

A test that has half a sentence of normal text with the second half bold if the test has passed is not very obvious, even if the sentence in question explains what should happen.

There are various ways to avoid this kind of test, but no general rule can be given since the affected tests are so varied.

The last subtest on this page shows this problem.