Measuring The Performance Of Typefaces For Users (Part 2)

In the first part of this article, we saw that measuring and comparing typefaces is not a simple task. Testing them (subjectively or objectively) also depends on the context, which can be very tricky. We saw how important it is to keep the typographic design parameters and variables the same to get a more accurate result while testing different typefaces.

Did you miss out on it? Don’t worry! Take a look at the first part of this article so you can get all the context you need to fully enjoy this one.

Measuring, comparing, and testing typefaces may guide your project decisions towards a more accessible and highly legible typeface. Let’s dive into the specifics of typeface aspects so you can get the best out of your tests. So what aspects of typefaces could we measure?

Aspects Specific To The Typeface Itself For Extended Reading Typefaces

Ascender And Descender Height In Relation To X-height

Validation:
There seems to be some truth in a larger x-height and medium-length ascenders and descenders being ideal.
Description:
The most efficient typefaces with the best ratio of x-height to cap height seem to be Wayfinding Sans Pro and Johnston Underground, with an x-height between 67% and 69% of the cap height.
(Learn more in the paper “Does A Large X-Height Make Fonts More Legible?” by Ralf Hermann, director of Typography.Guru.)
Measurement:
Millimeters.
Measure quality type:
Strong (objective).
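
As a rough illustration of the ratio described above, the following sketch computes x-height as a percentage of cap height from two measurements in millimeters. The function names and the sample measurements are assumptions made for illustration; only the 67% to 69% range comes from the text.

```python
def x_height_ratio(x_height_mm: float, cap_height_mm: float) -> float:
    """Return x-height as a percentage of cap height."""
    if cap_height_mm <= 0:
        raise ValueError("cap height must be positive")
    return 100.0 * x_height_mm / cap_height_mm


def in_reported_range(ratio_percent: float, low: float = 67.0, high: float = 69.0) -> bool:
    """Check a ratio against the 67% to 69% range mentioned in the text."""
    return low <= ratio_percent <= high


# Hypothetical measurements taken from a printed sample (not real font data).
ratio = x_height_ratio(x_height_mm=2.72, cap_height_mm=4.0)
print(f"x-height is {ratio:.1f}% of cap height; in range: {in_reported_range(ratio)}")
```

Measuring both heights at the same point size keeps the ratio independent of how large the sample was printed.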

Aesthetic Quality In Relation To Other Similar Typefaces

Validation:
There would probably be other typefaces that the typeface would be similar to or would fit with, so how well does it compare?
Description:
Expert review or opinion.
Measurement:
Not good, okay, very good.
Measure quality type:
Weak (subjective).

Aesthetic Quality In Relation To Historical Revival Or Similarity

Validation:
There would probably be other typefaces that the typeface would be similar to or fit with. Therefore, how well does it perform when compared to other typefaces that are considered to be historically well done or have been revived well?
Description:
Expert review or opinion.
Measurement:
Not good, okay, very good.
Measure quality type:
Weak (subjective).

Character, Symbol, And Language Support

Validation:
To find out how usable the typeface is, for different characters, symbols, languages, and information types.
Description:
We know character and symbol support, like for maths and different languages, is desirable and needed.
Measurement:
Numerical score or tick list against features and languages.
Measure quality type:
Strong (objective).
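
A tick list like this can be turned into a numerical score fairly easily. The sketch below is one possible approach, assuming the supported characters have already been extracted from the font's character map. The character sets and requirement lists are hypothetical and far shorter than a real language audit would need.

```python
def coverage(supported: set[str], required: str) -> float:
    """Fraction of the required characters that the typeface supports."""
    needed = set(required)
    if not needed:
        return 1.0
    return len(needed & supported) / len(needed)


# Hypothetical character set of a typeface under test.
supported_chars = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-=äöüß"
)

# Hypothetical requirement lists; a real audit would use full per-language sets.
requirements = {
    "German extras": "äöüßÄÖÜ",
    "Basic maths": "+-=×÷±",
}

for name, chars in requirements.items():
    print(f"{name}: {coverage(supported_chars, chars):.0%} covered")
```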

Kerning

Validation:
To find out how well the typeface has been kerned — because kerning leads to better and more legible typography.
Description:
A kerning test based on the example in the article “Quality type: How to spot fonts worth your money” by Veronika Burian and José Scaglione, directors of the font foundry TypeTogether.

Measurement:
Not good, okay, very good; possibly a percentage value or a precise numerical score based on a standard test.
Measure quality type:
Strong (objective).

Accessibility (Children)

Validation:
We know that accessible characters and symbols lead to better, more legible, and easier-to-read typography.
Description:
Infant characters as in my paper “Letter and symbol misrecognition in highly legible typefaces for general, children, dyslexic, visually impaired and aging readers — 2019 fourth edition.”

Measurement:
Numerical score based on the number of characters and symbols. Research and design effort based on not good, okay, very good.
Measure quality type:
Strong (objective).

Accessibility (Dyslexia)

Validation:
We know that accessible characters and symbols lead to better, more legible, and easier-to-read typography.
Description:
Dyslexic characters as in my paper “Letter and symbol misrecognition in highly legible typefaces for general, children, dyslexic, visually impaired and aging readers — 2019 fourth edition” and Robert Hillier’s (typeface designer and academic) PhD “A typeface for the adult dyslexic reader.”

Measurement:
Numerical score based on the number of characters and symbols. Research and design effort based on not good, okay, very good.
Measure quality type:
Strong (objective).

Accessibility (Vision Impairment)

Validation:
We know that accessible characters and symbols lead to better, more legible, and easier-to-read typography.
Description:
Visually impaired characters as in my paper “Letter and symbol misrecognition in highly legible typefaces for general, children, dyslexic, visually impaired and aging readers — 2019 fourth edition.”

Measurement:
Numerical score based on the number of characters and symbols. Research and design effort based on not good, okay, very good.
Measure quality type:
Strong (objective).

Aspects Specific To The Reader/User For Extended Reading Typefaces

Comprehension

Validation:
To find out how much and how well information is absorbed, retained, and recalled from a typeface or different typefaces.
Description:
This is a very difficult area to measure, and I would like to explain why:
1. Not everyone will be able to recall accurately everything they know or do not know in an exam or in test questions;
2. What people say they know, what they fail to communicate, and what they actually do in the real world are very different things. Just because someone cannot recall something or write it in an exam paper, it does not mean they do not know it.

“Many previous reading studies investigated the effect of typography on reading speed. But we know that faster speed does not always equate to better comprehension. In fact, better comprehension is often associated with slower reading speed.”

— Sofie Beier (legibility expert), “Bringing together science and typography”

Measurement:
A paragraph of text, or pages of text and information, followed by questions or by tasks for the users to complete based on that information.
Measure quality type:
Weak (subjective).

Speed

Validation:
To find out how much, or how much more, readers can read compared to what is considered normal/average for a typical typeface.
Description:
This is another very difficult area to measure. Let me explain why. I could quickly scan a single page of a book and, in theory, have read all the content in about 6 seconds (because I have scanned my eyes across all the text quickly). However, just because I have, in theory, read (or scanned) the text, it does not necessarily mean I have understood or absorbed it. Even so, the result would be taken into consideration, and there would probably be clear and strong differences in performance between a script typeface (like Snell Roundhand) and a highly legible typeface (like the Unit typeface): Unit would be much easier and quicker to read than Snell Roundhand.
Measurement:
Eye-tracking (time, speed, and behavior) recording and data collection.
Measure quality type:
Strong (objective).
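
Because a fast scan is not the same as understanding, it may help to report reading rate alongside a comprehension score rather than on its own. The following is a minimal sketch under that assumption. The function names and the sample session are hypothetical, and the timing would come from whatever recording method is used.

```python
def words_per_minute(word_count: int, seconds: float) -> float:
    """Reading rate for one passage."""
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    return word_count * 60.0 / seconds


def comprehension(correct_answers: int, total_questions: int) -> float:
    """Share of follow-up questions answered correctly."""
    if total_questions <= 0:
        raise ValueError("at least one question is required")
    return correct_answers / total_questions


# Hypothetical session: a 300-word passage read in 90 seconds,
# followed by 10 questions of which 7 were answered correctly.
wpm = words_per_minute(300, 90.0)
score = comprehension(7, 10)
print(f"{wpm:.0f} words per minute at {score:.0%} comprehension")
```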

Facial Muscle Activation

Validation:
The zygomatic muscle activity (which controls smiling) is positively associated with positive emotional stimuli and a positive mood state.
Description:
By placing tiny sensors over certain facial muscles, one can measure the minute changes in the muscles’ electrical activity, which reflects changes in muscle tension. Facial EMG (electromyography) studies have found that activity of the corrugator muscle (which lowers the eyebrow and is involved in producing frowns) varies inversely with the emotional valence of the presented stimuli and reports of mood state.
See the paper “Microexpressive facial actions as a function of affective stimuli: Replication and extension” by the academics John Cacioppo, Lauren Bush, and Louis Tassinary, and the academic Ulf Dimberg’s writing “Facial electromyography and emotional reactions.”
Measurement:
Use an electromyography (EMG) sensor placed on top of a muscle to measure the muscle’s electrical activity. You get frequency readings.
Measure quality type:
Weak (subjective) and maybe strong (objective).

Character, Symbol, Or Word-finding Test

Validation:
To find out how quickly readers can find information.
Description:
The participants were asked to locate a specific character in a text with a color pen. The specific character was shown at the bottom of the sheet for easy referral. The response times were recorded. Find out more about this method in the academic Brian Sze-Hang Kwok’s paper “Legibility of medicine labels.”
Measurement:
A numerical score of correct and incorrect.
Measure quality type:
Strong (objective).

Searching a Phrase Test

Validation:
To find out how quickly readers can find information.
Description:
The participants were required to locate a phrase in the context of a medicine label. The specific phrase was shown at the bottom of the sheet for easy referral. The response times were recorded. Find more about this method in the academic Brian Sze-Hang Kwok’s paper “Legibility of medicine labels.”
Measurement:
A numerical score of correct and incorrect.
Measure quality type:
Strong (objective).
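
Both search tests record whether the target was found and how long it took, so the results can be summarized per typeface. The sketch below shows one way to do that. The trial log is invented for illustration, and averaging only the correct trials is an assumption, not something prescribed by the cited paper.

```python
from statistics import mean

# Hypothetical trial log: (typeface, answered correctly, response time in seconds).
trials = [
    ("Typeface A", True, 4.0),
    ("Typeface A", True, 6.0),
    ("Typeface A", False, 9.0),
    ("Typeface B", True, 3.0),
    ("Typeface B", True, 5.0),
    ("Typeface B", True, 4.0),
]


def summarize(trials, typeface):
    """Return (accuracy, mean response time of correct trials) for one typeface."""
    rows = [t for t in trials if t[0] == typeface]
    if not rows:
        raise ValueError(f"no trials recorded for {typeface}")
    correct_times = [seconds for _, ok, seconds in rows if ok]
    accuracy = len(correct_times) / len(rows)
    # With no correct trials there is no meaningful time to report.
    mean_time = mean(correct_times) if correct_times else None
    return accuracy, mean_time


for name in ("Typeface A", "Typeface B"):
    accuracy, seconds = summarize(trials, name)
    print(f"{name}: {accuracy:.0%} correct, mean time {seconds} s")
```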

Read-aloud

Validation:
To find out if there are any issues.
Description:
The subject producing a reader protocol is requested to read the text aloud and to immediately express any thoughts about the document. More about this method can be seen in the academics Leo Lentz’s and Henk Pander Maat’s paper “Reading aloud and the delay of feedback.”
Measurement:
Notes and recordings based on not good, okay, very good, and notes of specific problems.
Measure quality type:
Weak (subjective).

Think-aloud

Validation:
To find out if there are any issues.
Description:
Ask people to perform specific tasks with the document while vocalizing their thinking. From the academics Leo Lentz’s and Henk Pander Maat’s paper “Reading aloud and the delay of feedback.”
Measurement:
Notes and recordings based on not good, okay, very good, and notes of specific problems.
Measure quality type:
Weak (subjective).

At-a-glance

Validation:
To find out if they can correctly identify a word, letter, or symbol and not misread it as another word, letter, or symbol in a quick response environment.
Description:
Typefaces were individually sized to a height of 4 mm using the letter “H” as the reference. Participants viewed the monitor at a distance of approximately 70 cm, measured with a tape measure at the start of the session. Each individual trial followed the same sequence of presentation: a large fixation rectangle signifying the start of the new trial (400 ms), a masking stimulus composed of non-letter characters (200 ms), the stimulus of interest (variable timing, according to the staircase rules described in the paper), a second masking stimulus of non-letter characters (200 ms), and then a response prompt (up to 5000 ms). You can see more about this method in the paper “The great typography bake-off: comparing legibility at-a-glance,” by Ben Sawyer (academic), Jonathan Dobres (academic), Nadine Chahine (typeface designer), and Bryan Reimer (academic).
Measurement:
A numerical score based on the number of characters or symbols. The time measure is also checked.
Measure quality type:
Strong (objective).
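
A letter height and a viewing distance can be combined into a single visual angle, which makes setups with different sizes and distances easier to compare. The sketch below applies the standard visual angle formula to the 4 mm and 70 cm figures quoted above. The function name is an assumption, and the calculation presumes the letter is viewed straight on.

```python
import math


def visual_angle_deg(size_mm: float, distance_mm: float) -> float:
    """Angle subtended at the eye by an object of the given size, viewed head-on."""
    if size_mm <= 0 or distance_mm <= 0:
        raise ValueError("size and distance must be positive")
    return math.degrees(2 * math.atan(size_mm / (2 * distance_mm)))


# 4 mm letter height at roughly 70 cm, as in the study described above.
angle = visual_angle_deg(size_mm=4.0, distance_mm=700.0)
print(f"{angle:.3f} degrees ({angle * 60:.1f} arcminutes)")
```

The same function can be reused for distance tests: doubling both the letter height and the distance leaves the visual angle unchanged.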

Questionnaire

Validation:
People’s opinions, preferences, thoughts, concerns, views, likes, and dislikes.
Description:
Question asking, interviews, and real-time observations.
Measurement:
Notes and recordings.
Measure quality type:
Weak (subjective).

The Radner Reading Charts

Validation:
Test the person’s vision accuracy and acuity with a typeface.
Description:
The Radner reading chart is a highly standardized multilingual reading test system that is valid, reliable, and available in numerous languages. The reading chart consists of sentence optotypes, which are optimized reading test items standardized by construction and statistical selection. Sentence optotypes consist of short sentences that are highly comparable in terms of the number of words (14 words), the word length, the position of words, the lexical difficulty, and the syntactic complexity. Language-specific characteristics were considered, as were the number of letters and syllables per word, line, and sentence. Get to know more from the legibility experts Sofie Beier and Kevin Larson’s paper “How does typeface familiarity affect reading performance and reader preference?”

Measurement:
Correct or incorrect response. Note the point of failure or inability to proceed.
Measure quality type:
Strong (objective).

Legibility (Misrecognition)

Validation:
To find out if they can correctly identify a letter, number, word, or symbol and not misread it as another letter, word, or symbol.
Description:
As in my paper, “Letter and symbol misrecognition in highly legible typefaces for general, children, dyslexic, visually impaired and aging readers — 2019 fourth edition.”
Measurement:
Score for correct or incorrect identification. Time is also measured.
Measure quality type:
Strong (objective).
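
One way to score a misrecognition test is to tally which characters were mistaken for which. The sketch below is only an illustration of that bookkeeping. The response list is invented, and the pairs of shown and reported characters are assumed to have been recorded by the tester.

```python
from collections import Counter

# Hypothetical responses: (character shown, character the participant reported).
responses = [
    ("I", "I"),
    ("I", "l"),
    ("l", "1"),
    ("l", "l"),
    ("O", "0"),
    ("O", "O"),
    ("I", "l"),
]


def accuracy(responses) -> float:
    """Share of trials where the reported character matched the one shown."""
    if not responses:
        raise ValueError("no responses recorded")
    return sum(shown == reported for shown, reported in responses) / len(responses)


def confusions(responses) -> Counter:
    """Count each (shown, reported) pair that was a misrecognition."""
    return Counter(pair for pair in responses if pair[0] != pair[1])


print(f"Accuracy: {accuracy(responses):.0%}")
for (shown, reported), count in confusions(responses).most_common():
    print(f"'{shown}' read as '{reported}': {count}")
```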

Legibility (At a Very Small Typeface Size)

Validation:
To push a person’s eyesight to the maximum and see what happens at a very small size.
Description:
At small sizes, less than 8pt, for instance.
Measurement:
x-height size measurement preferred over pt size. Also, maybe a rating of difficulty and time to read like: easy, reasonable, and hard.
Measure quality type:
Strong (objective).

Legibility (Distance)

Validation:
To see at what distance a letter, symbol, or word can still be read, and at what distance it can no longer be recognized.
Description:
In his article “Comparing typefaces for airport signs,” Robert Waller (information designer) says that you could use a screen, physical sign, or printed paper to display a word, letter, or symbol. A person stands far away and then moves closer to the display until they can correctly identify the word, letter, or symbol. If a screen is being used, the person can instead stay at a fixed distance while you make the word, letter, or symbol bigger until they can correctly identify it. This would give us a legibility score and a distance measurement in relation to the correct identification. In a related study by Sofie Beier (legibility expert) and Kevin Larson (legibility expert), the first presented character was the letter “d.” As identified in Miles Tinker’s book Legibility of print, this character is one of the most easily recognizable letters. The purpose of this first exposure was to locate the individual vision threshold. The participant was placed at a distance of 10 meters from the screen and asked to move slowly forward until the presented letter was at the threshold of being identifiable. This was the distance at which the individual participant was tested, varying from 4.5 to 9 meters (with an average of 6 meters) from the screen. You can read more about this method in their paper “Design improvements for frequently misrecognised letters.”
Measurement:
Measurement in mm, cm or m.
Measure quality type:
Strong (objective).

Legibility (Rotated Information)

Validation:
To push a typeface and person’s eyesight to the maximum and see what happens at these extreme angles. Also, these angles are common in VR (virtual reality) software and products.
Description:
At an angle: -45 degrees horizontally left and +45 degrees horizontally right, -45 degrees vertically up and +45 degrees vertically down.

Measurement:
Legibility test (character, symbol, word).
Measure quality type:
Strong (objective).

Legibility (Degrading, Distortion, and Blurring)

Validation:
To push a typeface and person’s eyesight to the maximum and see what happens under these extreme conditions.
Description:
Legibility degrading test as in Ralf Hermann’s (director of Typography.Guru) paper “Designing the ultimate wayfinding typeface.”

Measurement:
The score for correct or incorrect identification. Time measure is also checked.
Measure quality type:
Strong (objective).

Appeal (Typeface Fitting Subject and Content)

Validation:
How well does it fit and suit the content?
Description:
Appeal in relation to the content. For example, for content on gardening, a slightly more organic, chiseled, and wavy typeface might communicate and fit the content better.
Measurement:
Score based on not good, okay, very good.
Measure quality type:
Strong (objective).

Appeal (User Feedback and Responses In Relation to Other Typefaces They May Know)

Validation:
How well does it fit and suit the content?
Description:
Appeal in relation to the typeface itself. What does the user say they like or dislike about this typeface in relation to other typefaces they use and know about? This method could produce interesting observations and data, albeit highly subjective.
Measurement:
Notes and recordings.
Measure quality type:
Weak (subjective).

Fixation Duration

Validation:
How briefly or how long does the eye have to fixate to understand the information?
Description:
Fixation duration is a period of time when the focus of the participant’s gaze is relatively still on an area and taking in information about that which is looked at, as in the academics Ivan Burmistrov, Tatiana Zlokazova, Iuliia Ishmuratova, and Maria Semenova’s paper “Legibility of light and ultra-light fonts: eyetracking study.”
Measurement:
Milliseconds (ms).
Measure quality type:
Weak (subjective) and maybe strong (objective).

Saccadic Amplitude

Validation:
To find out what behavior, movements, or patterns are happening.
Description:
Saccades are quick, simultaneous movements of both eyes, like when you are reading a line of text, and the saccadic amplitude is the size of one such movement. In this method, we monitor what happens with saccadic eye movements when reading, as in the academics Ivan Burmistrov, Tatiana Zlokazova, Iuliia Ishmuratova, and Maria Semenova’s paper “Legibility of light and ultra-light fonts: eyetracking study.”
Measurement:
Degrees (°).
Measure quality type:
Weak (subjective) and maybe strong (objective).
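
Since the measurement is given in degrees, it may be useful to see how an amplitude could be derived from two gaze positions on a screen. The sketch below is a geometric illustration only, not the method of the cited study. It assumes gaze positions are in millimeters relative to the point on the screen directly in front of the eye, and that the viewing distance is known.

```python
import math


def saccade_amplitude_deg(start_mm, end_mm, viewing_distance_mm: float) -> float:
    """Angle in degrees between two gaze directions.

    Gaze points are (x, y) positions on the screen in millimeters, measured
    from the point directly in front of the eye (an assumption of this sketch).
    """
    ax, ay = start_mm
    bx, by = end_mm
    d = viewing_distance_mm
    # Treat each gaze direction as a 3D vector from the eye to the screen point.
    dot = ax * bx + ay * by + d * d
    norm_a = math.sqrt(ax * ax + ay * ay + d * d)
    norm_b = math.sqrt(bx * bx + by * by + d * d)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical saccade: a 20 mm horizontal jump, centered, viewed from 700 mm.
amplitude = saccade_amplitude_deg((-10.0, 0.0), (10.0, 0.0), 700.0)
print(f"Saccadic amplitude: {amplitude:.2f} degrees")
```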

Aspects Specific to The Users’ Environment and Situation

Light

Validation:
To see how a typeface performs, and how people respond, in different lighting conditions.
Description:
Test in low light and in good light, and see how each condition affects the information and performance.
Measurement:
The score for correct or incorrect identification. Time measure is also checked. The light strength is measured in lumens (lm).
Measure quality type:
Strong (objective).

Stress

Validation:
To find out how typefaces would work better or worse, under stress and high-pressure situations, with quick stressed eye movements.
Description:
Set up situations such as booking a ticket and going through an airport, doing tasks after a 6-hour working day, or reading or doing tasks late at night, when people are more likely to be tired.
Measurement:
Accuracy and efficiency of users’ actions. Maybe blood pressure testing.
Measure quality type:
Weak (subjective).

Time Pressure

Validation:
To find out how typefaces would work better (or worse) under time pressure and quick stressed eye movements.
Description:
Set up situations like finding information within a certain timeframe, booking a taxi very quickly, or finding something in a telephone directory.
Measurement:
Time measuring. Maybe a blood pressure test.
Measure quality type:
Weak (subjective).

In Diverse Situations (Driving In a Car At Distance On a Road Sign Or Airport Sign)

Validation:
To push a person’s vision (length of view) and agility and see how typefaces respond.
Description:
To test the extremes of people’s vision and ability, like driving in a car and reading a road sign where distance and orientation are factors. How does weather affect information and communication?
Measurement:
In meters, centimeters, or millimeters.
Measure quality type:
Strong (objective).

Aspects Specific to Technology for Extended Reading Typefaces

Range of Weights

Validation:
It is always appreciated and helpful to use a typeface with a range of weights.
Description:
Range of weights offered.
Measurement:
Numerical score based on the number of weights.
Measure quality type:
Strong (objective).

On-screen Rendering and Hinting

Validation:
Bad hinting and screen rendering lead to hard-to-read on-screen typography and illegibility.
Description:
Analyze the hinting by taking a screengrab and zooming in, or by using a magnifying device (such as a magnifying glass).
Measurement:
A score based on not good, okay, and very good. Also, use 3 different types of screens (low-resolution, HD, and 4K+).
Measure quality type:
Strong (objective).

Font File Size

Validation:
Larger font files can take up more bandwidth, especially across larger websites, and can delay a webpage’s first contentful paint.
Description:
Look at the file size of the font.
Measurement:
The file size (KB) would give a score, although this measure is not very useful on its own: a typeface with large symbol and language support cannot really be made much smaller in file size.
Measure quality type:
Strong (objective).
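
To give the file size some practical meaning, it can be translated into an idealized transfer time. The sketch below is a back-of-the-envelope estimate only. It assumes 1 KB is 1024 bytes and 1 Mbps is one million bits per second, and it ignores latency, compression, and caching. The 200 KB file and the 1.6 Mbps connection are hypothetical.

```python
def transfer_seconds(file_size_kb: float, bandwidth_mbps: float) -> float:
    """Idealized transfer time: payload bits divided by link speed."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    bits = file_size_kb * 1024 * 8  # assumes 1 KB = 1024 bytes
    return bits / (bandwidth_mbps * 1_000_000)


# Hypothetical 200 KB font file on a slow 1.6 Mbps connection.
seconds = transfer_seconds(file_size_kb=200, bandwidth_mbps=1.6)
print(f"Estimated transfer time: {seconds:.2f} s")
```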

OpenType Features/Variable?

Validation:
If a typeface has more desirable features (such as small caps, different number styles, ligatures, and so on), it makes the typography better and typographically more usable.
Description:
A typeface is better if it has the features required by users and information.
Measurement:
Numerical score or tick list of features.
Measure quality type:
Strong (objective).

Specific Typographic Design Variables Affecting Performance

Typographic Design

Validation:
They affect typeface and typographic communication.
Description:
Tracking, leading, kerning, typeface weight, line length, word spacing, condensed weight, typeface size, typeface color, OpenType features.
Measurement:
Various possible (must be controlled and precise, as mentioned in Ralf Hermann’s (director of Typography.Guru) paper “What makes letters legible?”).
Measure quality type:
Weak (subjective); hard to measure accurately.

What Would Typeface Performance Measurements, Results, and Scores Potentially Look Like?

A data table, infographic, or some kind of graph could be used.
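
As one possible starting point for such a data table, the sketch below lays out a few of the measures from this article as rows, with one column per typeface. The typeface names and all the scores are invented placeholders, not test results.

```python
# Hypothetical scores; every value here is a placeholder for illustration.
results = {
    "Typeface A": {"x-height ratio (%)": 68, "Kerning": "very good", "Weights": 8},
    "Typeface B": {"x-height ratio (%)": 72, "Kerning": "okay", "Weights": 4},
}


def format_table(scores: dict[str, dict[str, object]]) -> str:
    """Render measures as rows and typefaces as columns in plain text."""
    typefaces = list(scores)
    measures = list(next(iter(scores.values())))
    rows = [["Measure", *typefaces]]
    for measure in measures:
        rows.append(
            [measure, *(str(scores[t].get(measure, "n/a")) for t in typefaces)]
        )
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


table = format_table(results)
print(table)
```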

Scientists And Designers Needing to Work Better Together

Sofie Beier (legibility expert) in her paper “Letterform research: An academic orphan” touches upon the different issues and constraints designers and academics have faced in the past:

“To produce findings that are relevant for the practicing designer, scientists benefit from consulting designers in the development of the experiments. While designers can contribute with design skills, they cannot always contribute with scientific rigor. Hence, researchers will profit from adopting a methodological approach that ensures both control of critical typographical variables and scientific validation. An interdisciplinary collaboration where scientists provide valid test methods and analysis and designers identify relevant research questions and develop test materials, will enable a project to reach more informed findings than what the two fields would be able to produce in isolation.”

— Sofie Beier in Letterform Research: An Academic Orphan

To recap, designers have tended to produce information lacking scientific rigor. In contrast, scientists have tended to produce information that is hard to understand, full of equations, and lacking practical application. So both sides, whichever you are on, have their weaknesses and gaps in expertise.

Am I Making Typeface Designers’ Job Harder?

It is not my aim to make a typeface designer’s job any harder. It is commonly known that any typeface takes at least one year of hard work. The typeface designer Martin Majoor states that it took him over 15 years to design the Questa typeface. I have nothing but respect for typeface designers and the amazingly hard job they do. In fact, I have so much respect for the time and difficulty of designing a typeface, that I refuse even to try to attempt the task.

What Now?

Research what is legible and what characteristics make letters and symbols more legible: go to the library and research online. For example, the academic journal Visible Language has all its issues available for free on its website. Some incredible research and work was done more than 50 years ago;
Speak with people and speak with other typeface designers;
Avoid designing and releasing typefaces entirely on your own;
Test typefaces. Try to test accurately, and compare what you are designing with another typeface to see where the strengths and weaknesses lie in the results with people and in different contexts and environments. How is it working (or not working) better than another typeface?
Test typefaces with different categories of people in different contexts and environments;
Make your findings, design intentions, and tactical fixes available for free, as part of the typeface release, as a publication, or as some kind of central public list (like on GitHub), so we can start better and get to where we need to be quicker;
Maybe a completely new typeface might not be as good an idea as you think. Maybe extension, improvement, or modification to an existing typeface might be smarter. New is not automatically better.

Conclusions

Do we, as users and designers, really need to assess typefaces and find out how they perform? Is it necessary? Well, whatever your thoughts are, in 2022, with a mass of typefaces available and hundreds of years of designing and manufacturing typefaces, it is time to consider this topic. I think the time has come, and we are there. This is especially true for highly legible typefaces. Some kind of measure or measures would be a start, even if new typefaces were released with only one performance measure (or, say, three).

We may also need some shared, cross-measurable tests, so everyone tests against the same thing (or as near to it as possible). As previously mentioned, if someone tests their typeface against a sans serif (like Arial), and another person tests theirs against a different typeface (like Helvetica), the data will not be cross-compatible. Furthermore, the typographic design and typesetting values the two people use would most likely differ, when it would be highly desirable and more accurate for them to be the same.
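
One modest step toward cross-comparable results would be to record the test conditions in a shared, explicit form so that two studies can be checked against each other. The sketch below is a hypothetical record format, not an existing standard. The field names and the values are assumptions chosen from the typographic variables discussed earlier.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TypesettingConditions:
    """Hypothetical record of the conditions a typeface test was run under."""

    reference_typeface: str
    x_height_mm: float
    leading_mm: float
    line_length_chars: int
    tracking: float
    viewing_distance_cm: float


def differing_fields(a: TypesettingConditions, b: TypesettingConditions) -> list[str]:
    """Names of the conditions that differ between two tests."""
    first, second = asdict(a), asdict(b)
    return [name for name in first if first[name] != second[name]]


# Two invented studies that differ only in the comparison typeface.
study_one = TypesettingConditions("Arial", 1.8, 4.5, 65, 0.0, 40.0)
study_two = TypesettingConditions("Helvetica", 1.8, 4.5, 65, 0.0, 40.0)

print("Differences:", differing_fields(study_one, study_two) or "none")
```

An empty list would suggest the two sets of results were gathered under matching conditions, at least for the variables recorded.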

I hope I have made your life more difficult and more confusing! Maybe I have asked a lot of pointless questions? Furthermore, in theory, just because a typeface does not score well (or score just as well as another typeface) does not necessarily mean it is an ineffective or bad typeface. It just means that, in theory, it may not score as well as another typeface in the same context. Nor does it mean that it would not perform well in reality. What did I say? I told you this was a difficult area!

To confuse things even more, legibility expert Kevin Larson, academic Richard Hazlett, usability expert Barbara Chaparro and academic Rosalind Picard in their paper “Measuring the aesthetics of reading,” found that, when the typeface was set with no OpenType features in a normal body text paragraph (as typically found in a book), the users could read faster and understood more of the text. In other words, there were no ligatures, no small caps, no old-style figures, no real fractions, and no real superscripts and subscripts. I am not saying you should start typographically undesigning, disregarding years of best practice knowledge, but it goes to show that few things are certain in graphic communication.

Acknowledgements

Alma Hoffmann (editing and feedback), Kevin Larson (feedback), Karel van der Waarde (extensive comments and feedback), and Erik Spiekermann (feedback).
