Build A Markdown To HTML Conversion Pipeline Using Python

Table of Contents

Introduction
Project Setup
Flask Setup
Writing the Posts
Markdown to HTML Converter
Summary
References

Introduction

A few months ago, I wanted to serve my own blog instead of using websites like Medium. It was a pretty basic blog, and I wrote all my articles in HTML. One day, however, I came across the idea of writing my own markdown to HTML generator, which would allow me to write my articles in markdown. Furthermore, extending it with features like an estimated reading time would be easier. Long story short, I implemented my own markdown to HTML generator and I really like it!

In this article series, I want to show you how you can build your own markdown to HTML generator. The series consists of three parts:

Part 1 (current article) presents the implementation of the whole generation pipeline (link).
Part 2 extends the implemented pipeline with a module that computes the estimated reading time for a given article (link).
Part 3 demonstrates how you can use the pipeline to produce your own RSS feeds (link).

The code used in all three parts is available on GitHub.

Note: The idea of a markdown to HTML generator for my articles is based on an implementation Anthony Shaw uses to generate his articles.

Project Setup

In order to follow the current article, you need to install a few packages. We put them into a requirements.txt file.

# requirements.txt

Flask==1.1.2
Markdown==3.2.1

Markdown is a package that transforms your markdown code into HTML. We use Flask to serve the static files afterwards.

But before you install them, create a virtual environment to not mess up your Python installation:

$ python -m venv .venv
$ source .venv/bin/activate

Once activated, you can install the dependencies from the requirements.txt file via pip.

$ python -m pip install -r requirements.txt

Great! Let's create a few directories to better organize our code. First, we create a directory app. This directory contains our Flask app serving the blog. All subsequent directories will be created inside the app directory. Second, we create a directory called posts. This directory contains the markdown files we want to convert into HTML files. Next, we create a directory templates, which will contain the templates we serve later using Flask. Inside the templates directory, we create two more directories:

posts contains the resulting HTML files, which correspond to the ones in the posts directory in the application's root.
shared contains the HTML templates that are used across many pages.

Furthermore, we create a directory called services. The directory will contain modules we use in our Flask application or to generate certain things for it. Last but not least, a directory called static is created with two subdirectories images and css. Custom CSS files and the thumbnails for the posts will be stored here.

Your final project structure should look like this:

$ tree .
.
├── app
│   ├── posts
│   ├── services
│   ├── static
│   │   ├── css
│   │   └── images
│   └── templates
│       ├── posts
│       └── shared
└── requirements.txt


9 directories, 1 file
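
If you would rather not create each directory by hand, the layout above can be scripted. The following is a minimal sketch and not part of the original repository; the helper name create_layout and its root argument are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Directories from the tree above, relative to the project root.
DIRECTORIES = [
    "app/posts",
    "app/services",
    "app/static/css",
    "app/static/images",
    "app/templates/posts",
    "app/templates/shared",
]


def create_layout(root: Path) -> None:
    """Create the project skeleton below root (hypothetical helper)."""
    for directory in DIRECTORIES:
        # parents=True also creates app, app/static and app/templates;
        # exist_ok=True makes the call safe to repeat.
        (root / directory).mkdir(parents=True, exist_ok=True)
    (root / "requirements.txt").touch()


# Try it out in a throwaway directory instead of the real project.
with tempfile.TemporaryDirectory() as tmp:
    create_layout(Path(tmp))
    created = sorted(
        path.relative_to(tmp).as_posix()
        for path in Path(tmp).rglob("*")
        if path.is_dir()
    )

print(len(created))  # 9
```

To apply it to a real project, you would call create_layout(Path(".")) from the project's root directory.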

Awesome! We finished the general project setup. Let's head over to the Flask setup.

Flask Setup

Routing

We already installed Flask in the last section. However, we still need a Python file which defines the endpoints the users can access. Create a new file in your app directory called main.py and copy the following content into it.

# main.py

from flask import Flask
from flask import render_template

app = Flask(__name__)


@app.route("/")
def home():
    return render_template("index.html")


@app.route("/posts/<string:name>")
def blog_post(name: str):
    return render_template(f"posts/{name}.html")


if __name__ == "__main__":
    app.run("0.0.0.0", port=5000, debug=True)

The file defines a basic Flask application with two endpoints. The first endpoint, which the user can access using the / route, returns the index page, where all posts are listed.

The second endpoint is a more generic one. It accepts a post's name and returns the corresponding HTML file.

Next, we turn the app directory into a Python package by adding an __init__.py file to it. This file is empty. If you are on a UNIX machine, you can run the following command from your project's root directory:

$ touch app/__init__.py

Templates

Now, we create two templates index.html and layout.html. We store both in the templates/shared directory. The layout.html template will be used for a single blog entry, whereas the index.html template is used to generate the index page from where we can access each post. Let's start with the index.html template.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <title>Overview Of All Available Posts</title>
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
          integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
    <div class="container">
        <div class="row justify-content-center">
            <h1>Overview Of All Available Posts</h1>
        </div>
        <div class="row row-cols-1 row-cols-md-3">
            {% for post in posts %}
                <div class="col mb-4 d-flex justify-content-center">
                    <a class="card-link d-flex justify-content-center"
                       href="{{ post.rel_link }}">
                        <div class="card col-12 col-sm-9 col-md-12">
                            <img src="/static/images/{{ post.image[0] }}"
                                 class="card-img-top" alt="{{ post.title[0] }} Thumbnail">
                            <div class="card-body">
                                <h5 class="card-title">{{ post.title[0] }}</h5>
                                <p class="card-text">{{ post.subtitle[0] }}</p>
                            </div>
                            <div class="card-footer text-muted">
                                    {{ post.published[0] }}
                            </div>
                        </div>
                    </a>
                </div>
            {% endfor %}
        </div>
    </div>
</body>
</html>

It is a basic HTML file, where we have two meta-tags, a title, and two style sheets. Notice that we use a remote style sheet and a local one. The remote style sheet is utilized to enable the Bootstrap [¹] classes. The second one is for custom styles. We define them later.

The body of the HTML file encloses a single container, which contains Jinja2 [²] logic to generate a Bootstrap card [³] for each post. Did you notice that we do not access the values directly by their variable names but append [0] to them? This is because the metadata values parsed from the posts are lists. In essence, each metadata value is a list with exactly one element. We will have a look at it later. So far, so good. Let's take a look at the layout.html template.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

    <title>{{ title[0] }} - My Blog</title>

    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
          integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
    <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
    <div class="container">
        <h1>{{ title[0] }}</h1>
        {{ content | safe }}
    </div>
</body>
</html>

As you can see, it is a bit shorter and simpler than the previous one. The head of the file is similar to the one in index.html, except that it has a different title. Of course, we could use a common template for both, but I do not want to make things more complex at this point.

The container in the body defines only an h1-tag. Afterwards, the content we supply to the template is inserted and rendered.

Styling

As promised in the last section, we now have a look at the custom CSS file called style.css. We place the file in static/css and customize our page as needed. Here is the content we will use for our basic example:

blockquote {
    padding: 10px 20px;
    margin: 0 0 20px;
    font-size: 17.5px;
    border-left: 5px solid #3e3c3c;
}

blockquote > p {
    margin-bottom: 0;
}

.card {
    padding-left: 0;
    padding-right: 0;
}

I do not like the default appearance of blockquotes in Bootstrap, so we add a bit more spacing and a border on the left. Additionally, we remove the bottom margin of the paragraph inside the blockquote, because the blockquote looks unnatural with it.

Last but not least, the padding on the left and right of the cards is removed. With the additional padding on both sides, the thumbnails are not aligned properly.

So far so good. We finished everything concerning the Flask setup. Let's start writing some posts!

Writing the Posts

As the title promises, you can write your posts in markdown - yeah! There is nothing else you need to take care of while writing your posts, except that they need to be valid markdown.

After finishing the article, we need to add some metadata to the post. This metadata is added before the actual article and is separated from it by three dashes ---. Here is an extract of an example post (post1.md):

title: Post Number 1
subtitle: This is the first post
category: Python
published: May 3, 2020
image: placeholder.jpg
---
## First Heading

First paragraph including useful information followed by a code block.

Note: You can find the complete sample article in the GitHub repository at app/posts/post1.md.

In our case the metadata consists of a title, subtitle, category, the date it will be/was published and the path to the corresponding thumbnail for the card in index.html.

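We used the metadata in the HTML files, do you remember? The metadata is parsed by the meta extension of the Markdown package. Its syntax resembles YAML, but it is not parsed as YAML: each line consists of a key, a colon, and the value. Because a value may span several lines, the extension stores every value as a list of lines. In our posts, each value fits on one line, so it ends up as the first and only element of its list. That is why we accessed the values with the index operator [0] in the templates.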
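
To make the list behaviour concrete, here is a deliberately simplified, standard-library-only imitation of what the metadata step produces. It is not the Markdown package's actual implementation, and the function name parse_front_matter is hypothetical; it only handles single-line key: value pairs that are terminated by --- or a blank line.

```python
def parse_front_matter(text: str):
    """Split a post into (metadata, body); simplified, hypothetical helper."""
    meta = {}
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() in ("---", ""):
            # Metadata ends at the delimiter or at the first blank line.
            body = "\n".join(lines[index + 1:])
            return meta, body
        # Split at the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        # Keys are lowercased and every value is collected in a list.
        meta.setdefault(key.strip().lower(), []).append(value.strip())
    return meta, ""


post = """title: Post Number 1
subtitle: This is the first post
published: May 3, 2020
---
## First Heading
"""

meta, body = parse_front_matter(post)
print(meta["title"])     # ['Post Number 1']
print(meta["title"][0])  # Post Number 1
```

The second print shows why the templates need [0]: without it, Jinja2 would render the list itself instead of the string inside it.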

Let's suppose we finished writing our articles. Before we can move on to the conversion part, there is one thing left to do: We need thumbnails for our posts! To make things easier, just pick a random picture you have on your computer or from the web, name it placeholder.jpg and put it in the static/images directory. The metadata of each of the two posts in the GitHub repository contains an image key-value pair with placeholder.jpg as value.

Note: In the GitHub repository you can find the two sample articles I am referring to.

Markdown to HTML Converter

Finally, we can start implementing the markdown to HTML converter. For this, we use the third-party package Markdown we installed at the beginning. Let's start by creating a new module in which our conversion service will live: a new file named converter.py in our services directory. We go through the whole script step by step. You can view the whole script at once in the GitHub repository.

First, we import everything we need and create a few constants:

import os

from datetime import datetime
from email.utils import formatdate, format_datetime  # for RFC2822 formatting
from pathlib import Path

import jinja2
import markdown

ROOT = Path(__file__).parent.parent.parent
POSTS_DIR = ROOT / "app" / "posts"
TEMPLATE_DIR = ROOT / "app" / "templates"
BLOG_TEMPLATE_FILE = "shared/layout.html"
INDEX_TEMPLATE_FILE = "shared/index.html"
BASE_URL = os.environ.get("DOMAIN", "http://0.0.0.0:5000/")

ROOT points to the root of our project, that is, the directory which contains the app directory.
POSTS_DIR is the path to the posts written in markdown.
TEMPLATE_DIR points to the templates directory.
BLOG_TEMPLATE_FILE stores the relative path to the layout.html file (relative to the loader's search path).
INDEX_TEMPLATE_FILE is the relative path to the index.html file (relative to the loader's search path).
BASE_URL is the base url of our project, e.g. https://florian-dahlitz.de. By default (if it is not provided via the environment variable DOMAIN) the value is http://0.0.0.0:5000/. Note the trailing slash: the post's relative url is appended to it directly.
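
To see how the chained .parent calls and the environment lookup resolve, consider the following small sketch. The location /home/user/blog is a made-up example, and PurePosixPath is used instead of Path(__file__) only so that the snippet behaves the same on every machine.

```python
import os
from pathlib import PurePosixPath

# Hypothetical location of converter.py, used only for illustration.
converter_file = PurePosixPath("/home/user/blog/app/services/converter.py")

# Each .parent strips one path component:
# converter.py -> services -> app -> project root.
root = converter_file.parent.parent.parent
posts_dir = root / "app" / "posts"
template_dir = root / "app" / "templates"

# Falls back to the local development address when DOMAIN is not set.
base_url = os.environ.get("DOMAIN", "http://0.0.0.0:5000/")

print(root)          # /home/user/blog
print(posts_dir)     # /home/user/blog/app/posts
print(template_dir)  # /home/user/blog/app/templates
```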

Next, we create a new function called generate_entries(). It is the only function we define in order to convert the posts.

def generate_entries():
    posts = POSTS_DIR.glob("*.md")

    extensions = ['extra', 'smarty', 'meta']
    loader = jinja2.FileSystemLoader(searchpath=TEMPLATE_DIR)
    env = jinja2.Environment(loader=loader, autoescape=True)

    all_posts = []
    for post in posts:
        pass

Inside the function, we start by collecting the paths of all markdown files in the POSTS_DIR directory. pathlib's glob() method helps us with that.
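
If you have not used glob() before, the following sketch shows its behaviour in isolation. It works in a temporary directory, and the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    posts_dir = Path(tmp)
    # Two markdown posts and one unrelated file (all names are made up).
    for name in ("post1.md", "post2.md", "notes.txt"):
        (posts_dir / name).write_text("placeholder")

    # glob() returns a lazy generator; sorted() consumes it and gives
    # a stable order, which glob() itself does not guarantee.
    found = sorted(path.name for path in posts_dir.glob("*.md"))

print(found)  # ['post1.md', 'post2.md']
```

Only the files matching the pattern are returned, so other files in the posts directory are ignored by the converter.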

Furthermore, we define the extensions we want the Markdown package to use. All the extensions used in this article ship with the Markdown package by default.

Note: You can find out more about the extensions in the documentation [⁴].

Additionally, we instantiate a new file loader and create an environment used while converting the articles. Subsequently, an empty list called all_posts is created. This list will contain all posts after we have processed them. Now, we enter the for-loop and iterate over all posts we found in POSTS_DIR.

print("rendering {0}".format(post))

url = Path("posts") / f"{post.stem}"
url_html = f"{url}.html"
target_file = TEMPLATE_DIR / url_html

_md = markdown.Markdown(extensions=extensions, output_format='html5')

We start the for-loop by printing the path to the post we are currently processing. This is especially helpful if something breaks, because we then know which post's conversion failed.

Next, we create the part of the url right after the base url. Let's say we have an article with the heading "Python For Beginners". We store the post in a file called python-for-beginners.md, so the resulting url will be http://0.0.0.0:5000/posts/python-for-beginners.

The variable url_html stores the same string as url except that we add .html at the end. We use this variable to define another one called target_file, which points to the location where the corresponding HTML file will be stored.
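
The path handling described above can be tried out on its own. This is a small sketch with hypothetical paths; in converter.py the values come from POSTS_DIR and TEMPLATE_DIR. It uses PurePosixPath so the output is identical on every platform, whereas Path renders backslashes on Windows.

```python
from pathlib import PurePosixPath

# Hypothetical paths, chosen to match the example in the text.
post = PurePosixPath("app/posts/python-for-beginners.md")
template_dir = PurePosixPath("app/templates")

# stem is the file name without its directory and without the suffix.
url = PurePosixPath("posts") / post.stem
url_html = f"{url}.html"
target_file = template_dir / url_html

print(url)          # posts/python-for-beginners
print(url_html)     # posts/python-for-beginners.html
print(target_file)  # app/templates/posts/python-for-beginners.html
```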

Last but not least, we define a variable _md. It is an instance of the markdown.Markdown class, which converts the markdown code to HTML. You might wonder why we create this instance inside the for-loop rather than once before it. For our small example, it would make no difference apart from a slightly shorter execution time. However, a Markdown instance keeps state between conversions. If you use an extension such as footnotes, the footnotes collected for one post are not removed from the instance, so every later post would contain them as well, even though it never defined them. Creating a fresh instance per post avoids this. Alternatively, you can reuse a single instance and call its reset() method between conversions.

Let's move on to the first with-block in the for-loop.

with open(post) as post_f:
    content = post_f.read()
    html = _md.convert(content)
    doc = env.get_template(str(BLOG_TEMPLATE_FILE)).render(content=html, baseurl=BASE_URL, url=url, **_md.Meta)

In essence, the with-block opens the current post and reads the content of it into the variable content. Afterwards, _md.convert() is invoked to convert the content written in markdown into HTML. Subsequently, the environment env is used to render the resulting HTML code based on the supplied template BLOG_TEMPLATE_FILE (which is layout.html if you remember).

with open(target_file, "w") as post_html_f:
    post_html_f.write(doc)

The second with-block is used to write the document created in the first with-block to the target_file.

The following three lines of code take the publishing date (published) from the metadata, parse it, and store it together with the post's metadata in a new dictionary post_dict, both as a datetime object and as a string in RFC 2822 format. Furthermore, the resulting post_dict is added to the all_posts list.

post_date = datetime.strptime(_md.Meta['published'][0], "%B %d, %Y")
post_dict = dict(**_md.Meta, date=post_date, rfc2822_date=format_datetime(post_date), rel_link=f"/{url}", link="{0}{1}".format(BASE_URL, url))

all_posts.append(post_dict)
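
The date handling can be checked in isolation with the standard library alone. The sketch below uses the published value from the sample post. Note that %B depends on the active locale, so the snippet assumes English month names.

```python
from datetime import datetime
from email.utils import format_datetime

published = "May 3, 2020"  # value as written in the post's metadata

# %B = full month name, %d = day of the month, %Y = four-digit year.
post_date = datetime.strptime(published, "%B %d, %Y")

# The parsed datetime is naive (it carries no timezone), so the
# formatted string ends in "-0000".
rfc2822_date = format_datetime(post_date)

print(post_date.isoformat())  # 2020-05-03T00:00:00
print(rfc2822_date)           # Sun, 03 May 2020 00:00:00 -0000
```

If the published value does not match the format string, strptime() raises a ValueError, which is one of the cases where the printed post path helps to find the faulty post.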

We are now outside the for-loop, which means we have iterated over all posts found in POSTS_DIR and processed them. Let's have a look at the three remaining lines of code in the generate_entries() function.

all_posts.sort(key=lambda item: item['date'], reverse=True)

with open(TEMPLATE_DIR / "index.html", "w") as index_f:
    index_f.write(env.get_template(str(INDEX_TEMPLATE_FILE)).render(posts=all_posts, template_path=str(BLOG_TEMPLATE_FILE)))

We sort the posts by date but reversed, so the latest posts are shown first. Subsequently, we write the posts to a newly created index.html file in the templates directory. Do not mistake this index.html for the one in the templates/shared directory. The one in the templates/shared directory is the template, this one is the generated one we want to serve using Flask.
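
Sorting on the parsed datetime object rather than on the published string matters, because strings such as "May 3, 2020" would be ordered alphabetically instead of chronologically. A small sketch with made-up entries, shaped like the post_dict objects built in the loop:

```python
from datetime import datetime

# Made-up entries; only the keys needed for the example are included.
all_posts = [
    {"title": ["Post Number 1"], "date": datetime(2020, 5, 3)},
    {"title": ["Post Number 3"], "date": datetime(2020, 6, 14)},
    {"title": ["Post Number 2"], "date": datetime(2020, 5, 17)},
]

# Sort in place by the parsed date; reverse=True puts the newest first.
all_posts.sort(key=lambda item: item["date"], reverse=True)

titles = [post["title"][0] for post in all_posts]
print(titles)  # ['Post Number 3', 'Post Number 2', 'Post Number 1']
```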

The last thing we add to the script is the following if-statement after the function generate_entries().

if __name__ == "__main__":
    generate_entries()

This means if we execute the file via the command-line, it calls the function generate_entries().

Awesome, we finished the converter.py script! Let's try it out by running the following command from your project's root directory:

$ python app/services/converter.py

You should see some output printing the paths of the files it converted. Assuming you wrote two articles or used the two posts from the GitHub repository, you should find three newly created files in the templates directory. First, the index.html, which is directly located in the templates directory, and secondly, two HTML files in the templates/posts directory, which correspond to the markdown files.

You can view them in the browser by starting your Flask application and going to http://0.0.0.0:5000.

$ python app/main.py

Summary

Awesome, you made it through the first part of the series! In this article, you have learned how to utilize the Markdown package to create your own markdown to HTML generator. You implemented the whole pipeline, which is highly extensible, as you will see in the upcoming posts.

I hope you enjoyed reading the article. Make sure to share it with your friends and colleagues. If you have not already, consider following me on Twitter, where I am @DahlitzF, or subscribing to my newsletter so you will not miss any upcoming article. Stay curious and keep coding!

References

Python third-party tutorial
