I heavily use git to track my coding projects, articles, business work, and more. One of the beautiful things about git is that you can easily compare different states of your work by simply using its built-in diff functionality. Only two constraints need to be met in order to use this functionality: First, you need a git repository and second, the file needs to be tracked by the git repository.
But what if you only want to modify a single file and compare it to an older version, all without the need for a git repository? This is where this article comes into play. The goal is to create a diff-tool that allows you to compare two versions of a file.
Furthermore, our diff-tool should be able to export the computed diff to an HTML file. To follow this article, you need nothing but Python. This article was specifically written for Python 3.8.2 (CPython). The source code can be found on GitHub. Without further introduction, let's jump in!
unified_diff() Function
The Python standard library contains a module called difflib.
According to the documentation, this module provides classes and functions for comparing sequences.
Furthermore, various output formats are available [1].
While inspecting the module, the unified_diff() function stands out from the other functions.
Judging by the provided examples and generated outputs, it looks very much like the diff-computing function we are looking for.
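It takes up to eight arguments, but only two are required:
- a: the first list of strings to compare (required)
- b: the second list of strings to compare (required)
- fromfile: name of the first file, shown in the --- header line (default: '')
- tofile: name of the second file, shown in the +++ header line (default: '')
- fromfiledate: modification time of the first file (default: '')
- tofiledate: modification time of the second file (default: '')
- n: number of context lines* (default: 3)
- lineterm: line terminator appended to the control lines (those with ---, +++, or @@) so that they can be processed properly by io.IOBase.readlines() and io.IOBase.writelines() (default: '\n')
* Context lines give the user some context about where the changes happened.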
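To get a feeling for the optional arguments, here is a minimal sketch that sets n and lineterm explicitly. The file names old.txt and new.txt are placeholders chosen for illustration, and the input lines deliberately have no trailing newlines.

```python
import difflib

# Lines without trailing newlines, e.g. as produced by str.splitlines()
old = ["cheese", "tomates"]
new = ["cheese", "tomatoes", "salami"]

# n=0 drops the context lines; lineterm="" keeps the control lines newline-free
result = list(
    difflib.unified_diff(
        old, new, fromfile="old.txt", tofile="new.txt", n=0, lineterm=""
    )
)

# Since no line carries a newline, we add them ourselves when printing
print("\n".join(result))
```

With n=0 the unchanged line cheese no longer shows up in the delta, which can be handy for long files with few changes.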
In essence, the unified_diff() function takes two lists of strings and compares them.
If they are equal, the delta is empty.
If there are any differences, a respective delta is returned.
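Both cases can be checked in a few lines. The following sketch uses small made-up lists; note that unified_diff() returns a generator, so we turn it into a list to inspect it.

```python
import difflib

lines = ["cheese\n", "tomatoes\n"]
changed = ["cheese\n", "tomatoes\n", "salami\n"]

# Equal inputs: the generator yields nothing at all, not even a header
same_delta = list(difflib.unified_diff(lines, lines))
print(same_delta)

# Different inputs: header lines, a hunk marker, and the changed lines
delta = list(difflib.unified_diff(lines, changed))
print("".join(delta), end="")
```

Because the delta is a generator, it can only be consumed once. If you need it more than once, store it in a list as done here.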
Let's take a simple example:
You are planning a pizza party together with your best friend and write down some ingredients you need to buy first.
For simplicity, the shopping list is a simple text file (my_shopping_list.txt), which looks like this:
cheese
tomates
You send it over to your friend and he adds one ingredient that neither of you has at home: salami.
Furthermore, he notices your typo and corrects it as well.
To be able to pass the changes properly to our diff-tool, he makes a copy of the list and names it friends_shopping_list.txt.
Here is the final shopping list:
cheese
tomatoes
salami
Of course, this is a fairly simple example and the texts are not very long, so it can be easily processed by a human.
But let's have fun and follow the example.
To compute the diff between both files, we read them into memory and pass them to the unified_diff() function:
# shopping_list_diff.py
import difflib
import sys

# Read both files into lists of lines (trailing newlines are kept)
with open("my_shopping_list.txt") as f:
    file1 = f.readlines()
with open("friends_shopping_list.txt") as f:
    file2 = f.readlines()

delta = difflib.unified_diff(file1, file2)
sys.stdout.writelines(delta)
First, we import difflib and sys.
Second, we read the content of both files and save them to separate variables, file1 and file2.
As we need lists of strings, we use readlines().
Subsequently, we compute the delta of both lists and write it to stdout through sys.stdout.writelines().
Executing the script at hand results in the following output:
$ python shopping_list_diff.py
---
+++
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami
The output shows that the line cheese was not touched but appears as a context line because the following line was modified.
tomates has been removed, and tomatoes as well as salami were added.
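The prefixes of the delta lines can also be evaluated programmatically. The following is a simplified sketch that sorts the lines into three lists based on their first character; the list names removed, added, and context are chosen for illustration only.

```python
import difflib

old = ["cheese\n", "tomates\n"]
new = ["cheese\n", "tomatoes\n", "salami\n"]

removed, added, context = [], [], []
for line in difflib.unified_diff(old, new):
    # Skip the header and hunk control lines.
    # Simplification: a removed line whose content starts with "--"
    # would be skipped here as well.
    if line.startswith(("---", "+++", "@@")):
        continue
    content = line[1:].rstrip("\n")
    if line.startswith("-"):
        removed.append(content)
    elif line.startswith("+"):
        added.append(content)
    else:
        # Context lines start with a single space
        context.append(content)

print("removed:", removed)
print("added:", added)
print("context:", context)
```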
If you look at the header of the printed delta, you can see that we do not get any information about what --- and +++ stand for or which files they represent.
Let's adjust our script by passing the file names to the unified_diff() function:
delta = difflib.unified_diff(file1, file2, "my_shopping_list.txt", "friends_shopping_list.txt")
Now, running the script results in:
$ python shopping_list_diff.py
--- my_shopping_list.txt
+++ friends_shopping_list.txt
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami
Great! We implemented a simple script computing and printing the difference between two file contents. Let's move on and turn it into a command-line tool.
To turn our script into a useful command-line tool, we utilize Python's argparse module.
First of all, we put our previously written code into a function called create_diff(), which accepts two arguments, old_file and new_file.
Both are Path objects [2].
We use the passed Path objects to read their content and get the names of the provided files using the Path object's name attribute.
As our little script is now more general-purpose and no longer limited to shopping lists, we put our code into a new file called diff_tool.py (which is a more suitable name for our script).
So far, the script looks like this:
# diff_tool.py
import difflib
import sys
from pathlib import Path


def create_diff(old_file: Path, new_file: Path):
    # Read both files into lists of lines
    with open(old_file) as f:
        file_1 = f.readlines()
    with open(new_file) as f:
        file_2 = f.readlines()

    delta = difflib.unified_diff(file_1, file_2, old_file.name, new_file.name)
    sys.stdout.writelines(delta)
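If you want to check such a function without looking at the terminal, one option is a variant that returns the delta instead of printing it. The following sketch is not part of the article's tool: diff_lines() is a hypothetical helper, and the two shopping lists are written to a temporary directory so the example runs on its own.

```python
import difflib
import tempfile
from pathlib import Path


def diff_lines(old_file: Path, new_file: Path) -> list:
    """Hypothetical helper: return the unified diff as a list of lines."""
    with open(old_file) as f:
        old = f.readlines()
    with open(new_file) as f:
        new = f.readlines()
    return list(difflib.unified_diff(old, new, old_file.name, new_file.name))


with tempfile.TemporaryDirectory() as tmp:
    old_path = Path(tmp) / "my_shopping_list.txt"
    new_path = Path(tmp) / "friends_shopping_list.txt"
    old_path.write_text("cheese\ntomates\n")
    new_path.write_text("cheese\ntomatoes\nsalami\n")
    result = diff_lines(old_path, new_path)

print("".join(result), end="")
```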
Next, we define a new function main(), which is responsible for the general workflow.
# previous diff_tool.py code
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("old_file_version")
    parser.add_argument("new_file_version")
    args = parser.parse_args()

    old_file = Path(args.old_file_version)
    new_file = Path(args.new_file_version)
    create_diff(old_file, new_file)


if __name__ == "__main__":
    main()
First, a new argument parser is defined.
We tell the parser to accept two arguments, old_file_version and new_file_version.
Both are required.
Calling parse_args() parses the command-line input and makes the values available as attributes of the returned object.
Subsequently, both command-line arguments are accessed and converted into Path objects.
Afterwards, create_diff() is called with old_file and new_file as arguments.
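As a possible variation, argparse can perform the conversion to Path objects itself via the type parameter of add_argument(). The sketch below assumes a hypothetical build_parser() helper and passes an explicit argument list to parse_args(), so it can be tried without a real command line.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper that only builds the parser."""
    parser = argparse.ArgumentParser()
    # type=Path lets argparse convert the raw strings for us
    parser.add_argument("old_file_version", type=Path)
    parser.add_argument("new_file_version", type=Path)
    return parser


# Passing a list instead of reading sys.argv is handy for experiments
args = build_parser().parse_args(
    ["my_shopping_list.txt", "friends_shopping_list.txt"]
)
print(args.old_file_version.name)
print(args.new_file_version.name)
```

Whether you convert the values in main() or let argparse do it is largely a matter of taste; the behavior of the tool stays the same.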
Note: If you want to learn more about the argparse module, I can highly recommend Python's argparse tutorial [3], which provides a more gentle introduction to Python command-line parsing.
Now, if we execute the script without any arguments, it shows us which arguments are required:
$ python diff_tool.py
usage: diff_tool.py [-h] old_file_version new_file_version
diff_tool.py: error: the following arguments are required: old_file_version, new_file_version
Providing both shopping lists still results in the desired output:
$ python diff_tool.py my_shopping_list.txt friends_shopping_list.txt
--- my_shopping_list.txt
+++ friends_shopping_list.txt
@@ -1,2 +1,3 @@
 cheese
-tomates
+tomatoes
+salami
So far, we have built a simple diff-tool by turning our short script from the beginning into a command-line tool. Cool! Now, we will add some more lines to support HTML output.
The difflib module provides an HtmlDiff class, which can be used to create an HTML table (or a complete HTML file containing the table) showing a side-by-side, line-by-line comparison of text with inter-line and intra-line change highlights.
In our example, we use the HtmlDiff.make_file() method, which returns a string representing a complete HTML file. The latter highlights any differences line by line.
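To see the difference between the table and the complete file, the following sketch calls both make_table() and make_file() on the shopping list example. The descriptions "old" and "new" are placeholders for the column headers.

```python
import difflib

old = ["cheese\n", "tomates\n"]
new = ["cheese\n", "tomatoes\n", "salami\n"]

html_diff = difflib.HtmlDiff()

# Only the comparison table, suitable for embedding into an existing page
table = html_diff.make_table(old, new, "old", "new")

# A complete HTML document that contains the table plus styling
page = html_diff.make_file(old, new, "old", "new")

print(len(table), len(page))
```

If you already have a page layout and only need the comparison itself, make_table() may be the better fit.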
Therefore, we extend our script as follows:
# diff_tool.py
import argparse
import difflib
import sys
from pathlib import Path
from typing import Optional


def create_diff(old_file: Path, new_file: Path, output_file: Optional[Path] = None):
    with open(old_file) as f:
        file_1 = f.readlines()
    with open(new_file) as f:
        file_2 = f.readlines()

    if output_file:
        # Render a complete HTML page with a side-by-side comparison
        delta = difflib.HtmlDiff().make_file(
            file_1, file_2, old_file.name, new_file.name
        )
        with open(output_file, "w") as f:
            f.write(delta)
    else:
        # Print a unified diff to the terminal
        delta = difflib.unified_diff(file_1, file_2, old_file.name, new_file.name)
        sys.stdout.writelines(delta)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("old_file_version")
    parser.add_argument("new_file_version")
    parser.add_argument("--html", help="specify html to write to")
    args = parser.parse_args()

    old_file = Path(args.old_file_version)
    new_file = Path(args.new_file_version)

    if args.html:
        output_file = Path(args.html)
    else:
        output_file = None

    create_diff(old_file, new_file, output_file)


if __name__ == "__main__":
    main()
The create_diff() function now takes an additional third parameter, output_file, which is also a Path object.
This is the file we write our HTML diff into.
We check whether an output_file was passed.
If so, we compute the diff in HTML format and save it to the passed file.
Note: We use the w mode for writing. If the file already exists, it is truncated beforehand [4].
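The truncating behavior is easy to verify. This small sketch writes to a file in a temporary directory twice; the file name diff.html and the contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "diff.html"

    # First write: a longer, older report
    with open(target, "w") as f:
        f.write("an old report that is much longer")

    # Second write in "w" mode: the previous content is discarded first
    with open(target, "w") as f:
        f.write("new")

    content = target.read_text()

print(content)
```

Nothing of the first report survives, so an existing HTML diff is silently replaced every time the tool runs with the same output file.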
If no output_file was passed, we compute the unified diff and write it to stdout.
We extend the main() function by registering an additional, optional command-line argument --html taking a filename as input.
If a filename is provided, it is converted into a Path object and passed to create_diff().
After executing the following command, you have a diff.html file in your current working directory, which you can open with your favourite browser to see the actual diff.
$ python diff_tool.py my_shopping_list.txt friends_shopping_list.txt --html diff.html
Congratulations, you have made it through the article!
While reading the article, you learned how to compute a simple diff using Python's difflib module.
Furthermore, you turned your little diff script into a command-line tool using Python's argparse module.
Subsequently, you added a few lines of code to also support HTML as an output format.
What's next?
You can check out the difflib documentation [1], get to know other ways to compute diffs, search for other kinds of diffs, and extend your diff-tool even further.
Additionally, you can check out the article's GitHub repository and compute the diff between file.md and file_update.md.
Can you find all the changes?
I hope you enjoyed reading the article. Make sure to share it with your friends and colleagues. If you have not already, consider following me on Twitter where I am @DahlitzF. Stay curious and keep coding!