
Create Your Own Diff-Tool Using Python

How to create your own simple diff-tool using nothing but Python

Florian Dahlitz
9 min
April 27, 2020

Why Do I Need My Own Diff-Tool?

I heavily use git to track my coding projects, articles, business work, and more. One of the beautiful things about git is that you can easily compare different states of your work by simply using its built-in diff functionality. Only two constraints need to be met in order to use this functionality: First, you need a git repository and second, the file needs to be tracked by the git repository.

But what if you only want to modify a single file and compare it to an older version, all without the need for a git repository? This is where this article comes into play. The goal is to create a diff-tool that allows you to compare two versions of a file:

  1. without the need for a git repository and
  2. built upon the Python standard library!

Furthermore, our diff-tool should be able to export the computed diff to an HTML file. To follow this article, you need nothing but Python. This article was written specifically for Python 3.8.2 (CPython). The source code can be found on GitHub. Without further introduction, let's jump in!

The unified_diff() Function

The Python standard library contains a module called difflib. According to the documentation, this module provides classes and functions for comparing sequences. Furthermore, various output formats are available [1].

While inspecting the module, the unified_diff() function stands out from the rest. Judging by the provided examples and generated outputs, it looks very much like the diff-computing function we are looking for. It takes up to eight arguments, but only two are required:

  • a: list of strings (required)
  • b: list of strings (required)
  • fromfile: used to display the first file's name (default: '')
  • tofile: used to display the second file's name (default: '')
  • fromfiledate: modification time of the first file (default: '')
  • tofiledate: modification time of the second file (default: '')
  • n: number of context lines* (default: 3)
  • lineterm: string appended to control lines (those with ---, +++, or @@) so that a diff of input read with io.IOBase.readlines() can be written with io.IOBase.writelines() (default: '\n')

* context lines are unchanged lines shown around each change, so the reader can see where the change happened
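To get a feeling for the optional arguments, here is a minimal sketch using two hypothetical in-memory lists (old and new, with the made-up labels old.txt and new.txt) instead of real files. Since the strings have no trailing newlines, lineterm is set to an empty string, and n=1 limits the context to one line on each side of the change:

```python
import difflib

# Hypothetical in-memory inputs without trailing newlines
old = ["one", "two", "three", "four", "five", "six", "seven"]
new = ["one", "two", "three", "FOUR", "five", "six", "seven"]

# n=1 keeps a single context line around the change;
# lineterm="" keeps the control lines free of newlines as well
delta = list(
    difflib.unified_diff(
        old, new, fromfile="old.txt", tofile="new.txt", n=1, lineterm=""
    )
)

print("\n".join(delta))
```

With n=1, only three and five appear as context around the changed line; with the default of 3, all seven lines would be part of the hunk.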

In essence, the unified_diff() function takes two lists of strings and compares them. If they are equal, the delta is empty. If there are any differences, a respective delta is returned. Let's take a simple example: You are planning a pizza party together with your best friend and write down some ingredients you need to buy first. For simplicity, the shopping list is a simple text file (my_shopping_list.txt), which looks like this:

cheese
tomates

You send it over to your friend and he adds one ingredient, as neither of you has it at home: salami. Furthermore, he notices your typo and corrects it as well. To be able to pass the changes properly to our diff-tool, he makes a copy of the list and renames it to friends_shopping_list.txt. Here is the final shopping list:

cheese
tomatoes
salami

Of course, this is a fairly simple example and the texts are not very long, so it can be easily processed by a human. But let's have fun and follow the example. To compute the diff between both files, we read them into memory and pass them to the unified_diff() function:

# shopping_list_diff.py

import difflib
import sys

# read both files as lists of lines; "with" closes each file afterwards
with open("my_shopping_list.txt") as f:
    file1 = f.readlines()
with open("friends_shopping_list.txt") as f:
    file2 = f.readlines()

delta = difflib.unified_diff(file1, file2)
sys.stdout.writelines(delta)

First, we import difflib and sys. Second, we read the content of both files and save them to separate variables, file1 and file2. As we need lists of strings, we use readlines(). Subsequently, we compute the delta of both lists and write it to stdout through sys.stdout.writelines().

Executing the script at hand results in the following output:

$ python shopping_list_diff.py
---
+++
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami

The output shows that the word cheese was not touched, but it is included as a context line because the following line was modified. tomates has been removed, and tomatoes as well as salami were added. If you look at the header of the printed delta, you can see that we do not get any information about what --- and +++ stand for or which files they represent. Let's adjust our script by adding the file names to the unified_diff() function:

delta = difflib.unified_diff(file1, file2, "my_shopping_list.txt", "friends_shopping_list.txt")

Now, running the script results in:

$ python shopping_list_diff.py
--- my_shopping_list.txt
+++ friends_shopping_list.txt
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami

Great! We implemented a simple script computing and printing the difference between two file contents. Let's move on and turn it into a command-line tool.
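Two details are easy to check without touching any files: unified_diff() returns a generator, and for identical inputs that generator yields nothing at all. The following sketch works on in-memory lists that mirror the shopping lists from above (the variable names are chosen only for illustration):

```python
import difflib

my_list = ["cheese\n", "tomates\n"]
friends_list = ["cheese\n", "tomatoes\n", "salami\n"]

# unified_diff() is lazy, so we materialise the result with list()
same = list(difflib.unified_diff(friends_list, friends_list.copy()))
changed = list(difflib.unified_diff(my_list, friends_list))

print(same)          # an empty list: no header, no hunks
print(len(changed))  # two header lines, one hunk header, four content lines
```

Because the delta is a generator, it can only be consumed once. If you want to print it and process it further, convert it to a list first.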

Building a Command-Line Tool

To turn our script into a useful command-line tool, we utilize Python's argparse module. First of all, we put our previously written code into a function called create_diff(), which accepts two arguments, old_file and new_file. Both are Path objects [2]. We use the passed Path objects to read the files' content and get their names via the Path object's name attribute. As our little script is now general-purpose and no longer limited to shopping lists, we put our code into a new file called diff_tool.py (which is a more suitable name for our script). So far, the script looks like this:

# diff_tool.py

import difflib
import sys

from pathlib import Path


def create_diff(old_file: Path, new_file: Path):
    # read both files as lists of lines; "with" closes each file afterwards
    with open(old_file) as f:
        file_1 = f.readlines()
    with open(new_file) as f:
        file_2 = f.readlines()

    delta = difflib.unified_diff(file_1, file_2, old_file.name, new_file.name)
    sys.stdout.writelines(delta)

Next, we define a new function main(), which is responsible for the general workflow.

# previous diff_tool.py code

import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("old_file_version")
    parser.add_argument("new_file_version")
    args = parser.parse_args()

    old_file = Path(args.old_file_version)
    new_file = Path(args.new_file_version)

    create_diff(old_file, new_file)


if __name__ == "__main__":
    main()

First, a new argument parser is defined. We tell the parser to accept two arguments, old_file_version and new_file_version. Both are required. Calling parse_args() parses the command-line input and makes the values available as attributes of the returned object. Subsequently, both command-line arguments are accessed and converted into Path objects. Afterwards, create_diff() is called with old_file and new_file as arguments.

Note: If you want to learn more about the argparse module, I can highly recommend Python's argparse tutorial [3], which provides a more gentle introduction to Python command-line parsing.
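If you want to experiment with the parser without invoking the script from a shell, you can pass a list of strings to parse_args() instead of letting it read sys.argv. The sketch below does exactly that; the drafts/ directory is a hypothetical path used only to show what the name attribute returns:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("old_file_version")
parser.add_argument("new_file_version")

# Passing a list bypasses sys.argv; "drafts/" is a made-up directory
args = parser.parse_args(
    ["drafts/my_shopping_list.txt", "drafts/friends_shopping_list.txt"]
)

# argparse hands us plain strings, which we convert into Path objects
old_file = Path(args.old_file_version)
new_file = Path(args.new_file_version)

print(type(args.old_file_version).__name__)  # str
print(old_file.name)                         # my_shopping_list.txt
```

The name attribute strips the directory part, which is why the diff header only shows the bare file names.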

Now, if we execute the script without any arguments, it shows us which arguments are required:

$ python diff_tool.py
usage: diff_tool.py [-h] old_file_version new_file_version
diff_tool.py: error: the following arguments are required: old_file_version, new_file_version

Providing both shopping lists still results in the desired output:

$ python diff_tool.py my_shopping_list.txt friends_shopping_list.txt
--- my_shopping_list.txt
+++ friends_shopping_list.txt
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami

So far, we have built a simple diff-tool by turning our short script from the beginning into a command-line tool - cool! Now, we will add some more lines to support HTML output.

Providing the Diff in HTML Format

The difflib module provides an HtmlDiff class, which can be used to create an HTML table (or a complete HTML file containing the table) showing a side-by-side, line-by-line comparison of text with inter-line and intra-line change highlights. In our example, we use the HtmlDiff.make_file() method, which returns a string representing a complete HTML file. The generated page highlights any differences line by line.

Therefore, we extend our script as follows:

# diff_tool.py

import argparse
import difflib
import sys

from pathlib import Path


def create_diff(old_file: Path, new_file: Path, output_file: Path = None):
    # read both files as lists of lines; "with" closes each file afterwards
    with open(old_file) as f:
        file_1 = f.readlines()
    with open(new_file) as f:
        file_2 = f.readlines()

    if output_file:
        delta = difflib.HtmlDiff().make_file(
            file_1, file_2, old_file.name, new_file.name
        )
        with open(output_file, "w") as f:
            f.write(delta)
    else:
        delta = difflib.unified_diff(file_1, file_2, old_file.name, new_file.name)
        sys.stdout.writelines(delta)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("old_file_version")
    parser.add_argument("new_file_version")
    parser.add_argument("--html", help="specify html to write to")
    args = parser.parse_args()

    old_file = Path(args.old_file_version)
    new_file = Path(args.new_file_version)

    if args.html:
        output_file = Path(args.html)
    else:
        output_file = None

    create_diff(old_file, new_file, output_file)


if __name__ == "__main__":
    main()

The create_diff() function now takes an additional third parameter, output_file, which is also a Path object. This is the file we write our HTML diff into. We check whether an output_file was passed. If so, we compute the diff in HTML format and save it to the passed file.

Note: We use the w mode for writing. If the file already exists, it is truncated beforehand [4].

If no output_file was passed, we compute the unified diff and write it to stdout.

We extend the main() function by registering an additional, optional command-line argument --html taking a filename as input. If a filename is provided, it is converted into a Path object and passed to create_diff().
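The behaviour of the optional argument can be checked in isolation. In this sketch, a.txt and b.txt are placeholder file names that are never opened. It shows that args.html is None when the flag is omitted, which is what the if/else in main() relies on:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("old_file_version")
parser.add_argument("new_file_version")
parser.add_argument("--html", help="specify html to write to")

# a.txt and b.txt are placeholders; argparse only sees them as strings
without_html = parser.parse_args(["a.txt", "b.txt"])
with_html = parser.parse_args(["a.txt", "b.txt", "--html", "diff.html"])

print(without_html.html)  # None, because --html was not given
print(with_html.html)     # diff.html

# One possible shorthand for the if/else block in main()
output_file = Path(with_html.html) if with_html.html else None
```

Whether you prefer the explicit if/else or the conditional expression is a matter of taste; both pass either a Path object or None to create_diff().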

After executing the following command, you have a diff.html file in your current working directory, which you can open with your favourite browser to see the actual diff.

$ python diff_tool.py my_shopping_list.txt friends_shopping_list.txt --html diff.html
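If you only need the comparison table, for example to embed it into an existing page, HtmlDiff also offers make_table(). The following sketch compares both methods on in-memory lists that mirror the shopping lists, so nothing is written to disk:

```python
import difflib

old = ["cheese\n", "tomates\n"]
new = ["cheese\n", "tomatoes\n", "salami\n"]

html_diff = difflib.HtmlDiff()

# make_file() returns a complete HTML document as a string
page = html_diff.make_file(
    old, new, "my_shopping_list.txt", "friends_shopping_list.txt"
)

# make_table() returns only the <table> markup
table = html_diff.make_table(
    old, new, "my_shopping_list.txt", "friends_shopping_list.txt"
)

print("<html" in page)   # True
print("<html" in table)  # False
print(len(page) > len(table))
```

Keep in mind that the table relies on the CSS classes which make_file() ships along with the document, so an embedding page would need to provide its own styling.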

Summary

Congratulations, you have made it through the article! While reading the article you learned how to compute a simple diff using Python's difflib module. Furthermore, you were able to turn your little diff-script into a command-line tool using Python's argparse module. Subsequently, you added a few lines of code to also support HTML as an output format.

What's next? You can check out the difflib documentation [1], get to know other ways to compute diffs, search for other kinds of diffs, and extend your diff-tool even further. Additionally, you can check out the article's GitHub repository and compute the diff between file.md and file_update.md. Can you find all the changes?
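As a starting point for exploring other kinds of diffs, here is a small sketch using difflib.ndiff(), again on in-memory copies of the shopping lists. In contrast to unified_diff(), it prints every line and adds guide lines starting with ? that point at the changed characters within a line:

```python
import difflib

old = ["cheese\n", "tomates\n"]
new = ["cheese\n", "tomatoes\n", "salami\n"]

# ndiff() yields every line with a two-character prefix:
# "  " unchanged, "- " removed, "+ " added, "? " intra-line hint
lines = list(difflib.ndiff(old, new))

print("".join(lines), end="")
```

The ? lines are meant for human readers, so this format is less suitable if you want to feed the result into other tools.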

I hope you enjoyed reading the article. Make sure to share it with your friends and colleagues. If you have not already, consider following me on Twitter where I am @DahlitzF. Stay curious and keep coding!

References