maartenjan.dev

Hi! I'm Maarten-Jan, a Software Engineer focused on Java, Kotlin, Event Sourcing, Kubernetes and of course Bash! I sometimes write down what I do.

A spell checker in the pipeline

In our current project, we have a fairly large public documentation repository. We use Docusaurus to combine Markdown with React components. Because several people commit to this repository, spelling errors sneak in now and again. So I wanted to add a build pipeline step that does some spell checking. How hard could it be? As it turns out, harder than I thought, so I decided to write down what I did and why.

Before any spell checking can take place, we need to get rid of the 'code' in our files so that only the plain text is checked. I did not find a really good way to strip React elements from MDX, so I wrote some Bash functions that sort of work:

remove_content_between_backticks() {
    local c="$1"
    # Remove code blocks (triple backticks)
    echo "$c" | sed -E '/```/,/```/d' |
    # Remove inline code blocks (single backticks)
    sed -E 's/`[^`]*`//g'
}

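mdx_to_md() {
  local c="$1"
  # 1. Delete the front matter body (the '---' marker lines themselves are kept)
  # 2. Delete every line that mentions @site (typically component imports)
  # 3. Delete from a line containing '<' up to the next line containing '>'
  echo "$c" | sed -e '1,/^---$/{ /^---$/!d; }' \
                  -e '/@site/d' \
                  -e '/</,/>/d'
}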
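To see what the mdx_to_md filter actually removes, here is a small sketch that runs it on a made-up MDX snippet. The sample document, the Tabs component and its import path are hypothetical; only the sed expressions come from the function above.

```shell
# Same sed expressions as in the post, repeated so the sketch is self-contained.
mdx_to_md() {
  local c="$1"
  echo "$c" | sed -e '1,/^---$/{ /^---$/!d; }' \
                  -e '/@site/d' \
                  -e '/</,/>/d'
}

# Hypothetical MDX input: front matter, an @site import and a multi-line tag.
sample=$(cat <<'EOF'
---
title: Intro
---
import Tabs from '@site/src/components/Tabs';

# Getting started

<Tabs
  groupId="os">

Done.
EOF
)

cleaned=$(mdx_to_md "$sample")
printf '%s\n' "$cleaned"
```

The front matter body, the import line and the two-line tag are gone, while the heading and the closing sentence survive (the two '---' marker lines stay as well). One thing to keep in mind: the last expression is a range, so a tag that opens and closes on a single line still starts it, and sed only looks for the closing '>' from the next line on. Everything up to the next line containing '>' is then deleted too, which is part of why this only "sort of" works.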

After the React components are filtered out, Pandoc converts the Markdown to plain text, and grep then extracts the individual words, one per line:

echo "$content" | pandoc -f markdown -t plain --wrap=none | grep -oP '\p{L}+' > plain.txt
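The grep at the end of that pipeline is what turns the text into a word list. As a rough illustration, the sketch below runs just that part on a made-up sentence standing in for Pandoc's output; it assumes GNU grep built with -P (PCRE) support.

```shell
# Hypothetical stand-in for the plain text that Pandoc would produce.
plain='Install the CLI, then run it 2 times: see docs/setup-guide.'

# \p{L}+ matches runs of letters only, so digits and punctuation are dropped
# and every word ends up on its own line.
words=$(printf '%s\n' "$plain" | grep -oP '\p{L}+')
printf '%s\n' "$words"
```

Digits and punctuation never reach the spell checker this way. The flip side is that anything joined by a hyphen or a slash is split into separate words, so "setup-guide" is checked as "setup" and "guide".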

Now the spell checking can be done. For that I used hunspell, in our case with a Dutch, an English and a custom dictionary (see here for how to set this up):

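# -l makes hunspell print only the words it does not recognise
local errors
errors=$(hunspell -l -d nl,en_US,spelling/custom-words plain.txt)

# Clean up before returning, so the file is also removed when errors are found
rm -f plain.txt

if [ -n "$errors" ]; then
  echo "Spelling errors found in: $file"
  echo "$errors"
  return 1
fi

return 0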
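The check above handles a single file, so something has to call it for every md and mdx file. A minimal sketch of such a driver could look like the following; the name check_all, the injectable checker argument and the node_modules exclusion are my assumptions, not taken from the actual script. It also assumes paths contain no newlines and that the checker does not read from standard input.

```shell
# Hypothetical driver: runs "$checker <file>" for every .md and .mdx file
# under "$root" and returns non-zero if any file failed the check.
check_all() {
  local root="$1" checker="$2"
  find "$root" -type f \( -name '*.md' -o -name '*.mdx' \) \
       ! -path '*/node_modules/*' -print |
  (
    failed=0
    while IFS= read -r file; do
      # Keep going after a failure so every file gets reported in one run.
      "$checker" "$file" || failed=1
    done
    # The subshell's exit status becomes the function's return status.
    exit "$failed"
  )
}
```

With the per-file steps wrapped in a function, say check_file, the script would end with something like check_all . check_file, and the non-zero status is what makes the pipeline job fail. Collecting failures instead of stopping at the first one means a single run lists the spelling errors of all files.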

The spell check can run in an image that has all the required packages installed and the locale set to UTF-8:

FROM ubuntu:latest

# Set locale to UTF-8
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

RUN apt update && apt install -y \
    locales \
    hunspell hunspell-nl hunspell-en-us pandoc && \
    locale-gen en_US.UTF-8 && \
    update-locale LANG=en_US.UTF-8

WORKDIR /home/admin

Putting it all together

I made a folder in the project with the required scripting, see here. It contains the custom dictionary and a script that runs the spell check for each md and mdx file in the project. In our GitLab CI I added a build step, essentially building the container on every build (this could be optimized):

spellcheck:
  stage: build
  image: ubuntu:latest
  script:
    - | #/bin/sh

      export LANG=C.UTF-8
      export LC_ALL=C.UTF-8

      apt update && apt install -y \
        locales \
        hunspell hunspell-nl hunspell-en-us pandoc && \
        locale-gen en_US.UTF-8 && \
        update-locale LANG=en_US.UTF-8

      mv spelling/spellcheck.sh spellcheck.sh

      ./spellcheck.sh

And that's it! After fixing all the current spelling errors and adding custom words, we should be free of spelling errors!