Comparison of Readability Scores for Written Health Information Across Formulas Using Automated vs Manual Measures
Introduction
Assessing the readability of written health information is a common way to evaluate whether patients are likely to understand it.1 Readability is an objective measure that estimates a text’s equivalent school-grade reading level and is increasingly recommended globally in health policies.2,3 Several formulas for calculating readability exist, and scores can vary substantially depending on the formula applied.4 There has also been a proliferation of automated online calculators that provide readability estimates within seconds. However, the accuracy and consistency of automated calculators have not been evaluated.
The aims of this study were to assess (1) the variability of readability scores across automated calculators, (2) the association of text preparation with score variability, and (3) the level of agreement of automated readability scores with the reference standard (manually calculated scores) using the Simple Measure of Gobbledygook (SMOG) Index, the Flesch-Kincaid Grade Level (FKGL), and the Automated Readability Index (ARI).
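For context, the standard published forms of these 3 formulas can be expressed as simple functions of text counts. The sketch below, written in R (the software later used for analysis), is illustrative only; individual automated calculators may count syllables, sentences, and characters differently, and this is not the implementation used by any particular calculator in this study.

    # Standard published forms of the 3 readability formulas, written as
    # R functions of simple text counts (illustrative only).
    smog_index <- function(polysyllables, sentences) {
      # polysyllables: number of words with 3 or more syllables
      1.0430 * sqrt(polysyllables * (30 / sentences)) + 3.1291
    }
    fkgl <- function(words, sentences, syllables) {
      0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    }
    ari <- function(characters, words, sentences) {
      4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    }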
Methods
This cross-sectional study followed the STROBE reporting guideline. Ethical approval was not required because all information was in the public domain, and no human participants were involved.
In April 2022, we identified automated readability calculators that were cited in the published literature or widely used in Australia (eMethods and eTable in Supplement 1). We selected 2 webpages from each of 5 health topics linked on the Centers for Disease Control and Prevention website: COVID-19, attention-deficit/hyperactivity disorder, chronic obstructive pulmonary disease, diabetes, and cancer.
Two scores were obtained from each calculator: one using the unedited text and one using text prepared according to guidelines for readability assessment (eg, removal of incomplete sentences and midsentence periods) (eMethods in Supplement 1).5 We calculated the proportion of text excluded during preparation. We reported the SMOG Index, FKGL, and ARI to capture the formulas used across all included calculators.
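As a simple illustration of how the proportion of excluded text could be quantified, the R sketch below compares word counts before and after preparation; the function name and word-based counting are assumptions for illustration, and the preparation rules actually applied are those described in the eMethods.

    # Hypothetical helper: estimates the proportion of text excluded during
    # preparation by comparing word counts of the original and prepared text.
    proportion_excluded <- function(original_text, prepared_text) {
      n_original <- length(strsplit(trimws(original_text), "\\s+")[[1]])
      n_prepared <- length(strsplit(trimws(prepared_text), "\\s+")[[1]])
      (n_original - n_prepared) / n_original
    }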
To provide a reference standard for determining the accuracy of automated scores, we manually calculated SMOG Index, FKGL, and ARI scores for the prepared text (eMethods in Supplement 1). Agreement with the reference standard was assessed using Bland-Altman plots.6 Comparisons were made across formulas, calculators, and methods of text preparation. Ninety-five percent limits of agreement within 1 grade of the reference standard were considered good agreement; limits of 2 grades or more were considered poor agreement and therefore inaccurate. Data were analyzed using R, version 4.1.2 (R Foundation for Statistical Computing), and Bland-Altman plots were constructed using the ggplot2 package.
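To illustrate the agreement analysis, a minimal Bland-Altman sketch using ggplot2 is shown below. The data frame scores and its columns automated and manual are hypothetical placeholders for the grade-level scores of one calculator-formula combination; the published plots were produced from the study data as described above.

    # Minimal Bland-Altman sketch with ggplot2 (hypothetical data frame
    # 'scores' with columns 'automated' and 'manual' holding grade levels).
    library(ggplot2)
    scores$diff <- scores$automated - scores$manual       # automated minus manual
    scores$avg  <- (scores$automated + scores$manual) / 2
    bias <- mean(scores$diff)
    loa  <- bias + c(-1.96, 1.96) * sd(scores$diff)       # 95% limits of agreement
    ggplot(scores, aes(x = avg, y = diff)) +
      geom_point() +
      geom_hline(yintercept = bias, linetype = "dashed") +
      geom_hline(yintercept = loa, linetype = "dotted") +
      labs(x = "Mean of automated and manual scores (grade level)",
           y = "Automated minus manual score (grade levels)")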
Results
We identified 8 readability calculators: Microsoft Word, Online Utility, Readable, Readability Studio, Readability Formula, WebFX, Hemingway App, and the Sydney Health Literacy Lab (SHeLL) Health Literacy Editor. There were 16 combinations of calculator and formula (4 for the SMOG Index, 6 for FKGL, and 6 for ARI).
Across all calculators, the same text produced scores that varied by up to 12.9 grade reading levels, even when the same formula was used (Table). For all but 3 calculations, text preparation decreased variability among calculators (range, 2.1 grade levels) (Table). However, for 5 of 10 texts, preparation involved omitting more than 20% of the text (range, 4%-25%).
Bland-Altman plots for SMOG Index scores are displayed in the Figure. The SMOG Index scores from Readability Studio and the SHeLL Editor and the FKGL scores from Microsoft Word showed good agreement with the reference standard; all other calculator-formula combinations showed poor agreement. For example, the 95% limits of agreement for ARI scores from WebFX ranged from 7.1 grades below to 6.0 grades above the reference standard (narrowing to 0.3-2.1 grades below the reference standard after text preparation).
Discussion
Our findings suggest that automated readability scores are inconsistent and often inaccurate, meaning that, despite good intentions, health information that is revised to meet health literacy guidelines may still be too complex for people to understand. A limitation of this study is that although a difference of 2 reading grade levels from the reference standard is large (our definition of poor agreement), a minimally important difference has not been defined. Comprehensive guidance on the conduct and reporting of readability assessments is needed to improve accuracy of readability scores and the accessibility of written information for patients.