Calculate The Estimated Reading Time For Your Posts

Extend the markdown to HTML generator with a module for estimating reading time

Florian Dahlitz
6 min
May 25, 2020

Introduction

This is part two of the Markdown To HTML series. The article shows you how to extend the markdown to HTML conversion pipeline we built in part one to calculate the estimated reading time per post. The estimated reading time will then be shown in the article's heading.

But before we jump in, here is an overview of the series. It consists of three parts:

  • Part 1 presents the implementation of the whole generation pipeline (link).
  • Part 2 (current article) extends the implemented pipeline with a module that computes the estimated reading time for a given article (link).
  • Part 3 demonstrates how you can use the pipeline to produce RSS feeds (link).

The code used in all three parts is available on GitHub.

Calculate the Estimated Reading Time

Let's start by implementing a module that calculates the estimated reading time for a given text. The question arising at the beginning is how reading time can actually be estimated. To break it down, we need to know how many words a person usually reads per minute (WPM = words per minute).

Depending on the source you consult, the average lies between 130 and 370 WPM. My articles contain a fair amount of (mostly Python) source code, and source code usually takes longer to read than a normal sentence with the same number of words. That is why I go with 200 WPM. You can test it and adjust the number later, but let's go with 200 for now.

Because I am dealing with source code, I also do not want to count each word on its own. Words in source code are usually shorter than in normal text, so we estimate the word count from the number of characters and an ordinary average word length instead. I write my articles in English and found a forum post in which people stated that the average word length in English is around 5 characters [1]. So far, so good.
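
To make the arithmetic concrete, here is a small sketch using the two numbers chosen above. The 6,000-character post is an invented example, and the variable names are only illustrative. Integer division is assumed because the functions later in this article use it.

```python
# Hypothetical example: a post with 6,000 visible characters.
characters = 6000
average_word_length = 5   # characters per word, as discussed above
words_per_minute = 200    # reading speed chosen above

# Estimate the word count from the character count.
estimated_words = characters // average_word_length

# Convert the estimated word count into whole minutes.
estimated_minutes = estimated_words // words_per_minute

print(estimated_words)    # 1200
print(estimated_minutes)  # 6
```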

In order to calculate the estimated reading time, we need a module in which the corresponding functions can live. Let's create a new file called blog.py in the services directory. The first thing we define are two constants holding the values we discussed earlier.

# blog.py

WPM = 200
WORD_LENGTH = 5

The next thing we implement is a function that calculates the number of words in a given text by dividing its length by the average word length.

def _count_words_in_text(text: str) -> int:
    return len(text) // WORD_LENGTH

Notice that we added a leading underscore to the function's name to indicate that it is a private function, which should not be used directly. It is only meant to be used internally.
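
The following sketch may help show how a character-based estimate differs from counting whitespace-separated tokens. Both sample strings are invented for illustration, and whitespace is stripped by hand here because the filtering step is only introduced further below.

```python
WORD_LENGTH = 5


def _count_words_in_text(text: str) -> int:
    return len(text) // WORD_LENGTH


def strip_whitespace(text: str) -> str:
    # Hypothetical helper for this example only.
    return "".join(text.split())


code_line = "x = a + b"
prose_line = "reading time estimation"

# Naive token counts: every whitespace-separated piece is one "word".
code_tokens = len(code_line.split())    # 5
prose_tokens = len(prose_line.split())  # 3

# Character-based estimates: short code tokens weigh less, long words more.
code_estimate = _count_words_in_text(strip_whitespace(code_line))    # 1
prose_estimate = _count_words_in_text(strip_whitespace(prose_line))  # 4

print(code_tokens, code_estimate)
print(prose_tokens, prose_estimate)
```

Five one-character tokens count as a single word, while three fairly long words count as four.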

Sometimes, I need to use HTML tags in my markdown files. These tags are not visible text, so they need to be filtered out. Therefore, we implement a new function called _filter_visible_text, which does exactly that by utilizing Python's re module.

import re


def _filter_visible_text(text: str) -> str:
    clear_html_tags = re.compile("<.*?>")
    text = re.sub(clear_html_tags, "", text)

    return "".join(text.split())

First, we compile a regular expression pattern matching HTML tags into a regular expression object. Next, we replace all HTML tags with an empty string. Lastly, we split the whole text on whitespace and join the elements of the resulting list using an empty string. This results in a single string without spaces, newline characters, tabs and HTML tags.
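
As a quick check of what the function returns, the sketch below runs it on two invented sample strings. The second sample is there to illustrate a limitation of this simple pattern, not something covered in the article itself.

```python
import re


def _filter_visible_text(text: str) -> str:
    clear_html_tags = re.compile("<.*?>")
    text = re.sub(clear_html_tags, "", text)

    return "".join(text.split())


# Invented sample containing tags, a newline and spaces.
sample = "Some <strong>bold</strong> text\nwith a <br/> line break."
filtered_sample = _filter_visible_text(sample)
print(filtered_sample)  # Someboldtextwithalinebreak.

# Anything between "<" and the next ">" on the same line is treated as a tag,
# so comparison operators in code samples are removed as well.
code_line = "if a < b and c > d:"
filtered_code = _filter_visible_text(code_line)
print(filtered_code)  # ifad:
```

For an estimate in whole minutes, losing a few characters this way hardly matters, but it is worth keeping in mind if you reuse the function elsewhere.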

Note: If you want to learn more about regular expressions in Python, make sure to consult Python's documentation [2] or to visit DataCamp's regular expression tutorial [3].

Now, we are able to implement the estimate_reading_time() function.

def estimate_reading_time(text: str) -> int:
    filtered_text = _filter_visible_text(text)
    total_words = _count_words_in_text(filtered_text)

    return total_words // WPM

In the end, this function will

  1. get the markdown text of an article supplied,
  2. filter out its visible text,
  3. calculate the number of words in it, and
  4. compute the estimated reading time, which is returned afterwards.

Here you can see the whole module again:

# blog.py

import re

WPM = 200
WORD_LENGTH = 5


def _count_words_in_text(text: str) -> int:
    return len(text) // WORD_LENGTH


def _filter_visible_text(text: str) -> str:
    clear_html_tags = re.compile("<.*?>")
    text = re.sub(clear_html_tags, "", text)
    return "".join(text.split())


def estimate_reading_time(text: str) -> int:
    filtered_text = _filter_visible_text(text)
    total_words = _count_words_in_text(filtered_text)
    return total_words // WPM
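
To see the module in action, here is a minimal sketch that repeats the functions above and calls them on two generated texts. Because of the floor division, any post with fewer than 200 estimated words is reported as 0 minutes. The function estimate_reading_time_rounded_up is a hypothetical variant, not part of the original module, that rounds up and reports at least one minute.

```python
import math
import re

WPM = 200
WORD_LENGTH = 5


def _count_words_in_text(text: str) -> int:
    return len(text) // WORD_LENGTH


def _filter_visible_text(text: str) -> str:
    clear_html_tags = re.compile("<.*?>")
    text = re.sub(clear_html_tags, "", text)
    return "".join(text.split())


def estimate_reading_time(text: str) -> int:
    filtered_text = _filter_visible_text(text)
    total_words = _count_words_in_text(filtered_text)
    return total_words // WPM


def estimate_reading_time_rounded_up(text: str) -> int:
    # Hypothetical variant: round up and never report less than one minute.
    total_words = _count_words_in_text(_filter_visible_text(text))
    return max(1, math.ceil(total_words / WPM))


# Generated sample texts: 600 and 6,000 visible characters respectively.
short_post = "word " * 150
long_post = "word " * 1500

print(estimate_reading_time(short_post))             # 0
print(estimate_reading_time(long_post))              # 6
print(estimate_reading_time_rounded_up(short_post))  # 1
print(estimate_reading_time_rounded_up(long_post))   # 6
```

Whether you prefer rounding down or up is a matter of taste. The rest of this article continues with the original estimate_reading_time() function.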

Awesome, we implemented a module which calculates the estimated reading time for us! Let's move on and integrate it into the existing conversion pipeline.

Integrate the Module Into the Pipeline

First, we need to import the module. Hence, add the corresponding import statement to the convert.py script.

# convert.py import statements
import blog
# the rest of the code

Next, we need to identify which text has to be supplied to the estimate_reading_time() function. You may remember that we have two with-blocks in the for-loop (if you don't, have a look at it here). The first with-block is responsible for reading the markdown text from a file and converting it into HTML. The content of the markdown file is stored in a local variable called content. We invoke the blog.estimate_reading_time() function with the content variable as argument to compute the estimated reading time for the currently processed post.

estimated_reading_time = blog.estimate_reading_time(content)

Later, we will display the estimated reading time in the heading of the corresponding post. Therefore, we need to pass it to the document's environment. We can do so by adding it as a keyword argument to the render() method two lines later.

doc = env.get_template(str(BLOG_TEMPLATE_FILE)).render(
    content=html,
    baseurl=BASE_URL,
    estimated_reading_time=estimated_reading_time,
    url=url,
    **_md.Meta,
)

Great! There is only one thing left to do: Display the estimated reading time for each article in the corresponding article heading.

Display the Estimated Reading Time in the Post's Heading

In order to display the estimated reading time, we need to modify the layout.html template. The body of the document contains only one h1 tag. We change its content as follows.

<h1>{{ title[0] }} <em>({{ estimated_reading_time }} min)</em></h1>

The estimated reading time will now be displayed in parentheses (and emphasised) after the article's title.

Note: Make sure to run the convert.py script from the command-line to update the articles' HTML files.

Summary

Congratulations, you have made it through the whole article! While reading it, you have learned how to calculate the estimated reading time for a given text and how to integrate that calculation into the conversion pipeline you implemented in part one.

I hope you enjoyed reading the article. Make sure to share it with your friends and colleagues. If you have not already, consider following me on Twitter, where I am @DahlitzF, or subscribing to my newsletter so you will not miss any upcoming articles. Stay curious and keep coding!

References