Some line-oriented diff tools also highlight word changes (e.g. GitHub
or git's --word-diff). They still don't understand the code
though. Difftastic will always find matched delimiters: you can see
the closing ) from or_else has been highlighted.
Difftastic publishes GitHub releases with pre-built binaries for
Windows, macOS and Linux. Open the latest release page, download the
file matching your OS and CPU architecture, and extract the difft
executable.
Difftastic uses the cc crate for building C/C++ dependencies. This
allows you to use environment variables CC and CXX to control the
compiler used (see the cc
docs).
See contributing for instructions on debug
builds.
If a MIME database is available, difftastic will use it to detect
binary files more accurately. This is the same database used by the
file command, so you probably already have it.
This page contains recommendations for people creating a difftastic
package.
Note that the difftastic author only provides the source code and the
pre-built binaries on GitHub. Packages have been created by other
people -- thank you!
Difftastic's build script (the build.rs file) uses Rayon to build C
libraries in parallel, which can lead to minor ordering changes in the
final binary.
You can avoid this by disabling Rayon parallelism.
Difftastic depends on
tree_magic_mini,
which accesses the MIME database on the current system. The MIME
database is used to recognise file types, so difftastic does not try
to compare binary files as text.
This means that the difftastic package should depend on a MIME
database package, if available.
Difftastic respects the XDG Base Directory
Specification
to find the MIME database files. These files are typically at
/usr/share/mime/, /usr/local/share/mime/ or
/opt/homebrew/share/mime/.
Please consider including the difftastic manual with your
package. The manual consists of HTML files that can be generated with
mdbook. The following commands generate HTML at manual/book/.
$ cd manual
$ mdbook build
manual/book.toml also references a script
replace_version_placeholder.sh that replaces occurrences of
DFT_VERSION_HERE in the manual. For packaging, it may be easier to
remove the configuration from book.toml and replace the text
directly.
This page describes how to use the difft binary directly. See also
the Git, Mercurial,
Fossil, or Jujutsu pages for instructions on how to configure
them to use difftastic.
If you have a file with <<<<<<< conflict markers, you can pass it as
a single argument to difftastic. Difftastic will construct the two
file states and diff those.
$ difft FILE-WITH-CONFLICTS
# For example:
$ difft sample_files/conflicts.el
Every difftastic option can be set with a command line argument or an
environment variable. For example, DFT_BACKGROUND=light is equivalent to
--background=light.
Environment variables are often useful when using VCS tools like git,
because they invoke the difft binary directly.
For a full list of configuration options, see --help.
$ difft --help
...
OPTIONS:
--background <BACKGROUND>
Set the background brightness. Difftastic will prefer brighter colours on dark backgrounds.
[env: DFT_BACKGROUND=]
[default: dark]
[possible values: dark, light]
...
2: Difftastic was given invalid arguments. This includes invalid usage
(e.g. the wrong number of arguments) as well as paths that difftastic
cannot read (e.g. non-existent paths or insufficient permissions).
1: When called with --exit-code, difftastic will return an exit code
of 1 when it finds any syntactic changes (in text files) or byte changes
(in binary files).
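A script can branch on these exit codes. The following is a minimal sketch of a Python wrapper, not part of difftastic: the helper names describe_exit_code and run_difft are hypothetical, difft is assumed to be on your $PATH, and exit code 0 is assumed to cover every case not listed above.

```python
import subprocess


def describe_exit_code(code: int) -> str:
    """Map a difft exit code to a short description (illustrative helper)."""
    if code == 2:
        return "invalid arguments or unreadable path"
    if code == 1:
        return "changes found (only reported with --exit-code)"
    if code == 0:
        # Assumption: 0 covers every case not described above.
        return "no changes reported"
    return f"unexpected exit code {code}"


def run_difft(old_path: str, new_path: str) -> str:
    """Run `difft --exit-code OLD NEW` and describe the outcome."""
    result = subprocess.run(["difft", "--exit-code", old_path, new_path])
    return describe_exit_code(result.returncode)
```

Keeping the mapping in a separate function means it can be checked without invoking the binary.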
If you like difftastic, we recommend that you configure git aliases
so you can use difftastic more easily.
[alias]
# Difftastic aliases, so `git dlog -p` is `git log -p`
# with difftastic and likewise for the other subcommands.
dlog = -c diff.external=difft log --ext-diff
dshow = -c diff.external=difft show --ext-diff
ddiff = -c diff.external=difft diff
The author likes the following additional aliases to reduce typing:
[alias]
# `git log` with patches shown with difftastic.
dl = -c diff.external=difft log -p --ext-diff
# Show the most recent commit with difftastic.
ds = -c diff.external=difft show --ext-diff
# `git diff` with difftastic.
dft = -c diff.external=difft diff
Git also has a difftool
feature which allows users to
invoke CLI or GUI comparison tools.
For best results, we recommend using -c diff.external=difft as
described above. Git passes more information to the external diff,
including file permission changes and rename information, so
difftastic can show more information.
To define a difftool named difftastic, add the following to your
~/.gitconfig.
[difftool "difftastic"]
# See `man git-difftool` for a description of MERGED, LOCAL and REMOTE.
cmd = difft "$MERGED" "$LOCAL" "abcdef1" "100644" "$REMOTE" "abcdef2" "100644"
You can now use difftastic as a difftool:
$ git difftool -t difftastic
For the best results when using difftastic as a difftool, we recommend
the following additional git configuration:
[difftool]
# Run the difftool immediately, don't ask 'are you sure' each time.
prompt = false
[pager]
# Use a pager if the difftool output is larger than one screenful,
# consistent with the behaviour of `git diff`.
difftool = true
[diff]
# Set difftastic as the default difftool, so we don't need to specify
# `-t difftastic` every time.
tool = difftastic
Mercurial supports external diff
tools with the
Extdiff extension. Enable it by adding an entry to extensions in
your .hgrc.
[extensions]
extdiff =
You can then run hg extdiff -p difft instead of hg diff
(assumes the difft binary is on your $PATH).
You can also define an alias to run difftastic with hg. Add the
following to your .hgrc to run difftastic with hg dft.
[extdiff]
cmd.dft = difft
# You can add further options which will be passed to the command line, e.g.
# opts.dft = --background light
All options of hg diff are also supported by hg dft; for example,
hg dft --stat will show statistics of changed lines and hg dft -r 42 -r 45
will show the diff between two revisions.
This page lists all the languages supported by difftastic. You can
also view the languages supported by your currently installed version
with difft --list-languages.
You can override language detection for specific file globs using the
--override option.
$ difft --override=GLOB:NAME FIRST-FILE SECOND-FILE
# For example, treating .h files as C rather than C++:
$ difft --override=*.h:c sample_files/preprocessor_1.h sample_files/preprocessor_2.h
See difft --help for more examples of --override usage.
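To illustrate the GLOB:NAME shape, here is a rough Python sketch of how such an override could be interpreted. It is not difftastic's implementation: parse_override and language_for are hypothetical helpers, the glob is matched against the file name only, and Python's fnmatch may differ from difftastic's glob semantics.

```python
from fnmatch import fnmatch
from pathlib import PurePath


def parse_override(value: str) -> tuple[str, str]:
    """Split "GLOB:NAME" at the last colon into (glob, language name)."""
    glob, sep, name = value.rpartition(":")
    if not sep or not glob or not name:
        raise ValueError(f"expected GLOB:NAME, got {value!r}")
    return glob, name


def language_for(path: str, overrides: list[str], detected: str) -> str:
    """Return the first override whose glob matches the file name.

    Falls back to `detected` (the result of normal language detection)
    when no override applies.
    """
    file_name = PurePath(path).name
    for value in overrides:
        glob, name = parse_override(value)
        if fnmatch(file_name, glob):
            return name
    return detected


print(language_for("sample_files/preprocessor_1.h", ["*.h:c"], "c++"))  # c
```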
Difftastic converts the tree-sitter parse tree to a simplified syntax
tree. The syntax tree is a uniform representation where everything is
either an atom (e.g. integer literals, comments, variable names) or a
list (consisting of the open delimiter, children and the close
delimiter).
The flag --dump-syntax will display the syntax tree generated for a
file.
The simple representation of the difftastic parse tree makes diffing
much easier. Converting the detailed tree-sitter parse tree is a
recursive tree walk, treating tree-sitter leaf nodes as atoms. There
are two exceptions.
(1) Tree-sitter parse trees sometimes include unwanted structure. Some
grammars consider string literals to be a single token, whereas others
treat strings as a complex structure where the delimiters are
separate.
tree_sitter_parser.rs uses atom_nodes to mark specific tree-sitter
node names as flat atoms even if the node has children.
(2) Tree-sitter parse trees include open and closing delimiters as
tokens. A list [1] will have a parse tree that includes [ and ]
as nodes.
tree_sitter_parser.rs uses open_delimiter_tokens to ensure that
[ and ] are used as delimiter content in the enclosing list,
rather than converting them to atoms.
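The conversion described above can be sketched in a few lines of Python. This is a simplified model rather than the code in tree_sitter_parser.rs: ParseNode is a stand-in for a tree-sitter node, and ATOM_NODES and DELIMITER_TOKENS are hypothetical configuration values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    """Stand-in for a tree-sitter node: a kind, source text, and children."""
    kind: str
    text: str = ""
    children: list["ParseNode"] = field(default_factory=list)


@dataclass
class Atom:
    content: str


@dataclass
class SyntaxList:
    open_delimiter: str
    children: list
    close_delimiter: str


# Hypothetical per-language configuration for this example.
ATOM_NODES = {"string"}
DELIMITER_TOKENS = [("[", "]"), ("(", ")")]


def source_text(node: ParseNode) -> str:
    """Concatenate the text of all leaves under `node`."""
    if not node.children:
        return node.text
    return "".join(source_text(child) for child in node.children)


def to_syntax(node: ParseNode):
    # Leaves, and nodes listed in ATOM_NODES, become flat atoms.
    if not node.children or node.kind in ATOM_NODES:
        return Atom(source_text(node))
    kids = node.children
    for open_tok, close_tok in DELIMITER_TOKENS:
        if len(kids) >= 2 and kids[0].text == open_tok and kids[-1].text == close_tok:
            # Delimiters are stored on the list instead of becoming atoms.
            return SyntaxList(open_tok, [to_syntax(k) for k in kids[1:-1]], close_tok)
    # No known delimiters: fall back to empty delimiters.
    return SyntaxList("", [to_syntax(k) for k in kids], "")


# Parse tree for the source text: [1, "a"]
array = ParseNode("array", children=[
    ParseNode("[", "["),
    ParseNode("number", "1"),
    ParseNode(",", ","),
    ParseNode("string", children=[
        ParseNode('"', '"'),
        ParseNode("string_content", "a"),
        ParseNode('"', '"'),
    ]),
    ParseNode("]", "]"),
])
print(to_syntax(array))
```

The string node collapses to a single atom even though it has three children, and the brackets end up as delimiters on the list.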
Difftastic can match up atoms that occur in different parts of the
simplified syntax tree. If e.g. a [ is treated as an atom,
difftastic might match it with another [ elsewhere. The resulting
diff would be unbalanced, highlighting different numbers of open and
close delimiters.
The simplified syntax tree only stores node content and node
position. It does not store whitespace between nodes, and position is
ignored during diffing.
A vertex in the graph represents a position in two syntax trees.
The start vertex has both positions pointing to the first syntax node
in both trees. The end vertex has both positions just
after the last syntax node in both trees.
Consider comparing A with X A.
START
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
END
+---------------------+
| Left: A Right: X A |
| ^ ^|
+---------------------+
From the start vertex, we have two options:
we can mark the first syntax node on the left as novel, and advance
to the next syntax node on the left (vertex 1 below), or
we can mark the first syntax node on the right as novel, and advance
to the next syntax node on the right (vertex 2 below).
START
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
/ \
Novel atom L / \ Novel atom R
1 v 2 v
+---------------------+ +---------------------+
| Left: A Right: X A | | Left: A Right: X A |
| ^ ^ | | ^ ^ |
+---------------------+ +---------------------+
Choosing "novel atom R" to vertex 2 will turn out to be the best
choice. From vertex 2, we can see three routes to the end vertex.
2
+---------------------+
| Left: A Right: X A |
| ^ ^ |
+---------------------+
/ | \
Novel atom L / | \ Novel atom R
v | v
+---------------------+ | +---------------------+
| Left: A Right: X A | | | Left: A Right: X A |
| ^ ^ | | | ^ ^|
+---------------------+ | +---------------------+
| | |
| Novel atom R | Nodes match | Novel atom L
| | |
| END v |
| +---------------------+ |
+-------->| Left: A Right: X A |<---------+
| ^ ^|
+---------------------+
We assign a cost to each edge. Marking a syntax node as novel is worse
than finding a matching syntax node, so the "novel atom" edge has a
higher cost than the "syntax nodes match" edge.
The best route is the lowest cost route from the start vertex to the
end vertex.
Difftastic uses Dijkstra's algorithm to find the best (i.e. lowest cost)
route.
One big advantage of this algorithm is that we don't need to construct
the graph in advance. Constructing the whole graph would require
exponential memory relative to the number of syntax nodes. Instead,
vertex neighbours are constructed as the graph is explored.
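As a rough illustration, the sketch below runs Dijkstra's algorithm over flat sequences of atoms, generating neighbours only when a vertex is visited. It is a simplification under stated assumptions: nested lists are ignored, the function names are hypothetical, and the cost values are made up for the example rather than taken from difftastic's cost model.

```python
import heapq

NOVEL_COST = 10  # illustrative values, not difftastic's real costs
MATCH_COST = 1


def neighbours(vertex, lhs, rhs):
    """Yield (edge label, cost, next vertex), built on demand."""
    i, j = vertex
    if i < len(lhs) and j < len(rhs) and lhs[i] == rhs[j]:
        yield "match", MATCH_COST, (i + 1, j + 1)
    if i < len(lhs):
        yield "novel-left", NOVEL_COST, (i + 1, j)
    if j < len(rhs):
        yield "novel-right", NOVEL_COST, (i, j + 1)


def shortest_route(lhs, rhs):
    """Dijkstra from (0, 0) to (len(lhs), len(rhs)).

    Returns (total cost, list of edge labels along the route).
    """
    start, end = (0, 0), (len(lhs), len(rhs))
    best = {start: 0}
    came_from = {}
    queue = [(0, start)]
    while queue:
        cost, vertex = heapq.heappop(queue)
        if vertex == end:
            break
        if cost > best[vertex]:
            continue  # stale queue entry
        for label, edge_cost, nxt in neighbours(vertex, lhs, rhs):
            new_cost = cost + edge_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                came_from[nxt] = (vertex, label)
                heapq.heappush(queue, (new_cost, nxt))
    # Walk backwards from the end vertex to recover the route.
    labels = []
    vertex = end
    while vertex != start:
        vertex, label = came_from[vertex]
        labels.append(label)
    return best[end], labels[::-1]


print(shortest_route(["A"], ["X", "A"]))
# (11, ['novel-right', 'match'])
```

For the example of comparing A with X A, the route found is "novel atom R" followed by "nodes match". Only the vertices reached during the search are ever created.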
This is tricky because x has changed its depth in the tree, but x
itself is unchanged.
Not all tree diff algorithms handle this case. It is also challenging
to display this case clearly: we want to highlight the changed
delimiters, but not their content. This is challenging in larger
expressions.
Difftastic: Difftastic considers nodes to be equal even at
different depths, achieving the desired result in this case.
Difftastic: Difftastic currently shows result 2, but this case is
sensitive to the cost model. Some previous versions of difftastic have
shown result 1.
We want to highlight [[foo]] being moved inside the
parentheses. However, a naive syntax differ prefers to consider a removal
of () in the before and an addition of () in the after to be a more
minimal diff.
// Before
foo(bar(123))
// After
foo(extra(bar(123)))
Desired result: foo(extra(bar(123)))
We want to consider both foo and bar to be unchanged. This case is
challenging for diffing algorithms that do a bottom-up then top-down
matching of trees.
There are two atoms inside the () that we could consider as
unchanged, either the bar or the ,. (We can't consider both to be
unchanged as they're reordered.)
We want to consider bar to be unchanged, as it's a more important
atom than the , punctuation atom. Doing this in a
language-agnostic way is difficult, so difftastic has a small list of
punctuation characters that always get lower priority than other
atoms.
In some cases, reformatting code can change the trailing punctuation
without changing the meaning of the code. We don't want to show a diff
in this case.
# Before
foo(x, y)
# After (semantically identical)
foo(
x,
y,
)
This is language-specific. For example, a trailing comma can change the meaning
of code in Python.
# Before (the value 1)
(1)
# After (a tuple)
(1,)
Desired result: (1,)
However, we can't simply discard this punctuation before diffing.
# Before
[2,]
# After
[2,3,]
Possible result: [2,3,]
Desired result: [2,3,]
If the diffing logic effectively sees [2] and [2,3] because we've
discarded the punctuation, we don't get the desired result here.
Difftastic: Difftastic solves this problem by considering trailing
punctuation during diffing, and then post-processing known syntactic
elements that aren't significant.
Sliders are a common problem in text based diffs, where lines are
matched in a confusing way.
They typically look like this. The diff has to arbitrarily choose a
line containing a delimiter, and it chooses the wrong one.
+ }
+
+ function foo () {
}
git-diff has some heuristics to reduce the risk of this (e.g. the
"patience diff"), but it can still occur.
There's a similar problem in tree diffs.
;; Before
A B
C D
;; After
A B
A B
C D
Possible result (the first B and the second A marked as novel):
A B
A B
C D
Preferred result (the second A B marked as novel):
A B
A B
C D
Ideally we'd prefer marking contiguous nodes as novel. From the
perspective of a longest-common-subsequence algorithm, these two
choices are equivalent.
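To see why the choices are equivalent, the following Python sketch enumerates every way of marking two nodes in the "after" sequence as novel so that the remainder equals the "before" sequence. It is a brute-force illustration only (novel_choices is a hypothetical helper), and it assumes the change is a pure insertion.

```python
from itertools import combinations


def novel_choices(before, after):
    """Every set of positions in `after` whose removal yields `before`.

    Each set is a minimal way to mark nodes as novel, assuming `before`
    is a subsequence of `after` (a pure insertion).
    """
    extra = len(after) - len(before)
    choices = []
    for positions in combinations(range(len(after)), extra):
        kept = [tok for k, tok in enumerate(after) if k not in positions]
        if kept == before:
            choices.append(positions)
    return choices


before = ["A", "B", "C", "D"]
after = ["A", "B", "A", "B", "C", "D"]
print(novel_choices(before, after))
# [(0, 1), (1, 2), (2, 3)]
```

All three choices mark exactly two nodes, so a longest-common-subsequence score cannot separate them. Choice (1, 2) marks a B and an A that sit on different lines, whereas (2, 3) marks the whole second line.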
// Before
function foo(x) { return x + 1; }
// After
function bar(y) { baz(y); }
Possible result: function bar(y) { baz(y); }
In this example, we've deleted a function and written a completely
different one. A tree-based diff could match up the function and the
outer delimiters, resulting in a confusing display showing lots of
small changes.
As with sliders, the replacement problem can also occur in textual
line-based diffs. Line-diffs struggle if there are a small number of
common lines. The more precise, granular behaviour of tree diffs makes
this problem much more common though.
// Before
/* The quick brown fox. */
foobar();
// After
/* The slow brown fox. */
foobaz();
foobar and foobaz are completely different, and their common
prefix fooba should not be matched up. However, matching common
prefixes or suffixes for comments is desirable.
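One way to picture the difference is to diff a comment word by word while still comparing identifiers as whole atoms. The sketch below uses Python's difflib for the comment; it illustrates the idea and is not how difftastic itself diffs comments, and word_changes is a hypothetical helper.

```python
from difflib import SequenceMatcher


def word_changes(before: str, after: str):
    """Return (removed words, added words) between two comment texts."""
    old_words, new_words = before.split(), after.split()
    removed, added = [], []
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return removed, added


print(word_changes("/* The quick brown fox. */", "/* The slow brown fox. */"))
# (['quick'], ['slow'])

# Identifiers are compared as whole atoms, never by shared prefix.
print("foobar" == "foobaz")  # False
```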
// Before
"""A very long string
with lots of words about
lots of stuff."""
// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""
It would be correct to highlight the entire string literal as being
removed and replaced with a new string literal. However, this makes it
hard to see what's actually changed.
It's clear that variable names should be treated atomically, and
comments are safe to show subword changes. It's not clear how to
handle a small change in a 20 line string literal.
It's tempting to split strings on spaces and diff that, but users
still want to know when whitespace changes inside strings. " " and
"  " are not the same.
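A short Python illustration of the problem: splitting on whitespace makes two different strings look identical, while keeping the whitespace runs as tokens preserves the change. The helper split_keeping_whitespace is hypothetical and only demonstrates the tokenisation trade-off.

```python
import re


def split_keeping_whitespace(text: str) -> list[str]:
    """Split into words and whitespace runs, dropping nothing."""
    return [piece for piece in re.split(r"(\s+)", text) if piece]


one_space = "a b"
two_spaces = "a  b"

# Splitting on whitespace alone hides the change...
print(one_space.split() == two_spaces.split())  # True
# ...while keeping the whitespace runs as tokens preserves it.
print(split_keeping_whitespace(one_space))   # ['a', ' ', 'b']
print(split_keeping_whitespace(two_spaces))  # ['a', '  ', 'b']
```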
Users may expect difftastic to find no changes here. This is difficult
for several reasons.
For programming languages, side effects might make the order
relevant. set(foo(), bar()) might behave differently to set(bar(), foo()).
For configuration languages like JSON or YAML, some parser
implementations do actually expose ordering information
(e.g. object_pairs_hook=OrderedDict in Python, or serde_json's
preserve_order feature in Rust).
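The Python case can be demonstrated with the standard library alone. In this small example, two JSON documents that differ only in key order compare equal as plain dicts, but not once the parser is asked to expose ordering:

```python
import json
from collections import OrderedDict

first = '{"a": 1, "b": 2}'
second = '{"b": 2, "a": 1}'

# Plain dicts compare equal regardless of key order.
print(json.loads(first) == json.loads(second))  # True

# With object_pairs_hook=OrderedDict, the same documents are distinguishable.
ordered_first = json.loads(first, object_pairs_hook=OrderedDict)
ordered_second = json.loads(second, object_pairs_hook=OrderedDict)
print(ordered_first == ordered_second)  # False
```

A program consuming the second form could behave differently after a reorder, so a diff tool cannot safely report "no changes".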
To make matters worse, unordered tree diffing is NP-hard.
For the unordered case, it turns out that all of the problems in
general are NP-hard. Indeed, the tree edit distance and alignment
distance problems are even MAX SNP-hard.
There's no guarantee that the input we're given is valid syntax. Even
if the code is valid, it might use syntax that isn't supported by the
parser.
Difftastic: Difftastic will fall back to a line-oriented diff if
any parse errors occur, to avoid diffing incomplete syntax trees. When
this occurs, the file header reports the error count.
Users may opt in to syntactic diffing by setting
DFT_PARSE_ERROR_LIMIT to a larger value. In this mode, difftastic
treats tree-sitter error nodes as atoms and performs a tree diff as
normal.
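The fallback decision can be pictured as a simple threshold check. This Python sketch is an assumption-laden model, not difftastic's code: the helper names are hypothetical, the default limit is assumed to be 0 (consistent with falling back on any parse error), and the comparison is assumed to be inclusive.

```python
import os


def parse_error_limit(environ=os.environ) -> int:
    """Read DFT_PARSE_ERROR_LIMIT, assuming a default of 0 when unset."""
    return int(environ.get("DFT_PARSE_ERROR_LIMIT", "0"))


def choose_diff_mode(error_count: int, limit: int) -> str:
    """Stay syntactic while the error count is within the limit."""
    # Assumption: a file with exactly `limit` errors is still diffed syntactically.
    return "syntactic" if error_count <= limit else "line-oriented"


print(choose_diff_mode(error_count=1, limit=0))  # line-oriented
print(choose_diff_mode(error_count=3, limit=5))  # syntactic
```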
This website is generated with
mdbook. mdbook can be
installed with Cargo.
Note: difftastic uses an older Rust toolchain version. You have to run cargo install mdbook outside of the repository directory. Otherwise, installation fails.
$ cargo install mdbook
You can then use the mdbook binary to build and serve the site
locally.
Ideally, the parser should be available as a Rust crate on crates.io.
If that's the case, add it to Cargo.toml in the alphabetically sorted list
of parser dependencies. For instance:
Many parser repositories include a highlights query in the repository without
exposing it in the Rust crate. In that case you can include it as
vendored_parsers/highlights/json.scm in the repository.
atom_nodes is a list of tree-sitter node names that should be
treated as atoms even though the nodes have children. This is common
for things like string literals or interpolated strings, where the
node might have children for the opening and closing quote.
If you don't set atom_nodes, you may notice added/removed content
shown in white. This is usually a sign that a child node should have
its parent treated as an atom.
delimiter_tokens are delimiters that difftastic stores on
the enclosing list node. This allows difftastic to distinguish
delimiter tokens from other punctuation in the language.
If you don't set delimiter_tokens, difftastic will consider the
tokens in isolation, and may think that a ( was added but the )
was unchanged.
You can use difft --dump-ts foo.json to see the results of the
tree-sitter parser, and difft --dump-syntax foo.json to confirm that
you've set atoms and delimiters correctly.
sub-languages is empty for most languages: see the code documentation for details.
Update language_name in guess_language.rs to detect your new
language. Insert a match arm like:
Json => "json",
There may also be file names or shebangs associated with your language; configure those
by adapting the language_globs, from_emacs_mode_header and from_shebang functions
in that file.
GitHub's linguist definitions
are a useful source of common file extensions.
Atom
: An atom is an item in difftastic's syntax tree structure
that has no children. It represents things like literals, variable
names, and comments. See also 'list'.
Delimiter
: A paired piece of syntax. A list has an open delimiter
and a close delimiter, such as [ and ]. Delimiters are not necessarily
punctuation (e.g. begin and end) and may be empty strings (e.g. infix
syntax converted to difftastic's syntax tree).
Hunk
: A group of lines displayed together in the diff
output. Increasing the number of context lines increases the size of
the hunk.
LHS
: Left-hand side. Difftastic compares two items, and LHS refers
to the first item. See also 'RHS'.
Line-oriented
: A traditional diff that compares which lines have
been added or removed, unlike difftastic. For example, GNU diff or the
diffs displayed on GitHub.
List
: A list is an item in difftastic's syntax tree structure that
has an open delimiter, children, and a close delimiter. It represents
things like expressions and function definitions. See also 'atom'.
Novel
: An addition or a removal. Syntax is novel if it occurs
in only one of the two items being compared.
RHS
: Right-hand side. Difftastic compares two items, and RHS
refers to the second item. See also 'LHS'.
Root
: A syntax tree without a parent node. Roots represent
top-level definitions in the file being diffed.
Slider
: A diffing situation where there are multiple minimal diffs
possible, due to adjacent content. It is possible to 'slide' to
produce better results in this situation. See the discussion in Tricky
Cases.
Syntax node
: An item in difftastic's syntax tree structure. Either
an atom or a list.
Token
: A small piece of syntax tracked by difftastic (e.g. $x,
function or ]), for highlighting and aligned display. This is
either an atom or a non-empty delimiter.
This page summarises some of the other tree diffing tools available.
If you're in a hurry, start by looking at Autochrome. It's extremely
capable, and has an excellent description of the design.
If you're interested in a summary of the academic literature, this
blog post (and its accompanying paper -- mirrored under a CC BY-NC
license) are great resources.
json-diff performs a
structural diff of JSON files. It considers subtrees to be different
if they don't match exactly, so e.g. "foo" and ["foo"] are
entirely different.
json-diff is also noteworthy for its extremely readable display of
results.
Languages: ~10 programming
languages
Parser: Several, including srcML
Algorithm: Top-down, then bottom-up
Output: HTML, Swing GUI, or text
GumTree can parse several
programming languages and then performs a tree-based diff, outputting
an HTML display.
The GumTree algorithm is described in the associated paper
'Fine-grained and accurate source code differencing' by Falleri et al
(DOI,
PDF). It
performs a greedy top-down search for identical subtrees, then
performs a bottom-up search to match up the rest.
Languages: S-expression data format
Algorithm: A* search
Output: Merged s-expression file
Tristan Hume wrote a tree diffing algorithm during his 2017 internship
at Jane Street. The source code is not available, but he has a blog
post discussing the design
in depth.
This project finds minimal diffs between s-expression files used as
configuration by Jane Street. It uses A* search to find the minimal
diff between them, and builds a new s-expression with a section marked
with :date-switch for the differing parts.
(Jane Street also has patdiff, but that seems to be a line-oriented
diff with some whitespace/integer display polish. It doesn't
understand that e.g. whitespace in "foo " is meaningful.)
Autochrome parses Clojure
with a custom parser that preserves comments. Autochrome uses
Dijkstra's algorithm to compare syntax trees.
Autochrome's webpage includes worked examples of the algorithm and a
discussion of design tradeoffs. It's a really great resource for
understanding tree diffing techniques in general.
graphtage
compares structured data by parsing into a generic file format, then
displaying a diff. It even allows things like diffing JSON against
YAML.
As with json-diff, it does not consider ["foo"] and "foo" to have
any similarities.