Michael K. Johnson
``Kernel patches'' may sound like magic, but the two tools used to create and apply patches are simple and easy to use---if they weren't, some Linux developers would be too lazy to use them... Best of all, they can be very useful to you, even if you never touch a line of source code.
Diff is designed to show you the differences between
files, line by line. It is fundamentally simple to use, but
takes a little practice. Don't let the length of this article
scare you; you can get some use out of diff by reading only the
first page or two. The rest of the article is for those who
aren't satisfied with very basic uses.
While diff is often used by developers to show differences
between different versions of a file of source code, it is
useful for far more than source code. For example,
diff comes in handy when editing a document which
is passed back and forth between multiple people, perhaps via
e-mail. At Linux Journal, we have experience with this.
Often both the editor and an author are working on an
article at the same time, and we need to make sure that each (correct)
change made by each person makes its way into the final version
of the article being edited. The changes can be found by
looking at the differences between two files.
However, it is hard to show off how helpful diff can be in
finding these kinds of differences. To demonstrate with files
large enough to really show off diff's capabilities would require that we
devote the entire magazine to this one article. Instead,
because few of our readers are likely to be fluent
in Latin, at least compared to those fluent in English, we
will give a Latin example from Winnie Ille Pu, a
translation by Alexander Leonard of A. A. Milne's Winnie The
Pooh (ISBN 0-525-48335-7). This will make it harder for the
average reader to see differences at a glance and show how
useful these tools can be in finding changes in much larger
documents.
Quickly now, find the differences between these two passages:
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus desendendi, non nunquam autem
sentit, etiam alterum modum exstare, dummodo
pulsationibus desinere et de no modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus descendendi, nonnunquam autem
sentit, etiam alterum modum exstare, dummodo
pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
You may be able to find one or two changes after some
careful comparison, but are you sure you have found
every change? Probably not: tedious, character-by-character comparison
of two files should be the computer's job, not yours.
Use the diff program to avoid eyestrain and insanity:
diff -u 1 2
--- 1 Sat Apr 20 22:11:53 1996
+++ 2 Sat Apr 20 22:12:01 1996
@@ -1,9 +1,9 @@
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
+modus gradibus descendendi, nonnunquam autem
sentit, etiam alterum modum exstare, dummodo
-pulsationibus desinere et de no modo meditari
+pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
There are several things to notice here:
- The file names and last dates of modification
are shown in a ``header'' at the top. The dates may not mean
anything if you are comparing files that have been passed back
and forth by e-mail, but they become very useful in other
circumstances.
- The file names (in this case, 1
and 2) are preceded by --- and
+++.
- After the header comes a line that includes
numbers. We will discuss that line later.
- The lines that did not change between files are
shown preceded by spaces; those that are different in the
different files are shown preceded by a character which shows
which file they came from. Lines which exist only in a file
whose name is preceded by --- in the header are
preceded by a - character, and vice-versa for lines
preceded by a + character. Another way to remember
this is to see that the lines preceded by a -
character were removed from the first (---)
file, and those preceded by a + character were
added to the second (+++) file.
- Three spelling changes have been made:
``desendendi'' has been corrected to ``descendendi'',
``non nunquam'' has been corrected to ``nonnunquam'',
and ``no'' has been corrected to ``eo''.
Perhaps the main thing to notice is that you didn't need
this description of how to interpret diff's output in order to
find the differences. It is rather easy to compare two adjacent
lines and see the differences.
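If you would like to try this without typing in any Latin, here is a minimal sketch. The file names old.txt, new.txt, and changes.diff are made up for the illustration, and the sketch assumes a Bourne-style shell with mktemp available:

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Two hypothetical three-line files that differ only on the second line.
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\n' > new.txt

# diff exits with 0 when the files match, 1 when they differ, 2 on trouble.
status=0
diff -u old.txt new.txt > changes.diff || status=$?

cat changes.diff
echo "diff exit status: $status"

# Count lines removed from the first file and lines added to the second,
# skipping the --- and +++ header lines.
removed=$(grep -c '^-[^-]' changes.diff)
added=$(grep -c '^+[^+]' changes.diff)
echo "removed: $removed, added: $added"
```

Saving the output in a file is not required; it simply makes it easy to look at the result again or to send it to someone else.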
It's not always this easy
Unfortunately, if too many adjacent lines have been changed,
interpretation isn't as immediately obvious; but by knowing that each
marked line has been changed in some way, you can figure it
out. For instance, in this comparison, file 3
contains the damaged contents and file 4 (identical to file 2
in the previous example) contains the correct contents. Three
lines in a row are changed, so each damaged line no longer
appears directly above its corrected line:
diff -u 3 4
--- 3 Sun Apr 21 18:57:08 1996
+++ 4 Sun Apr 21 18:56:45 1996
@@ -1,9 +1,9 @@
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
-sentit, etiam alterum nodum exitare, dummodo
-pulsationibus desinere et de no modo meditari
+modus gradibus descendendi, nonnunquam autem
+sentit, etiam alterum modum exstare, dummodo
+pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
It takes a little more work to find the added mistakes:
``nodum'' for ``modum'' and ``exitare'' for ``exstare''.
Imagine if 50 lines in a row had each had a one-character
change, though. This begins to resemble the old job of going
through the whole file, character-by-character, looking for
changes. All we've done is (potentially) shrink the amount of
comparison you have to do.
Fortunately, there are several tools for finding these kinds
of differences more easily. GNU Emacs has ``word diff''
functionality. There is also a GNU ``wdiff'' program which
helps you find these kinds of differences without using Emacs.
Let's look first at GNU Emacs. For this example, files 5 and 6
are exactly the same, respectively, as files 3 and 4 before. I
bring up emacs under X (which provides me with colored text),
and type:
M-x ediff-files RET
5 RET
6 RET
In the new window which pops up, I press the space
bar, which tells Emacs to highlight the differences. Look at
Figure 1 and see how easy it is to find each changed word.
Figure 1.
ediff-files 5 6
GNU wdiff is also very useful, especially if you aren't
running X. A pager (such as less) is all that is required--and
that is only required for large differences. The exact same set
of files (5 and 6), compared with the command wdiff -t 5
6, is shown in Figure 2.
Figure 2.
wdiff -t 5 6
If you are getting extra character sequences like
ESC[24 instead of getting underline and reverse video,
it's probably because you are using less, which by
default doesn't pass through all escape characters. Use
less -r instead, or use the more pager. Either should
work.
wdiff uses the termcap database (that's what the
-t option is for) to find out how to enable underline
and reverse video, and not all termcap entries are correct. In
some instances, I've found that the linux termcap
entry works well for other terminals, since the codes for
turning underline and reverse video on and off don't differ
very much across terminals. To use the linux termcap
entry, you can do this:
TERM=linux wdiff -t 5 6 | less -r
This will work only with
Bourne shell derivatives such as bash, not with csh or tcsh.
But since you need to
do this only to correct for a broken termcap database, this
limitation shouldn't be too much of a problem.
wdiff isn't always built with the termcap support needed to
underline and reverse video, and it's not always what you want
even if you have a working termcap database, so there's an
alternate output format that is just as easy to understand.
We'll kill two birds with one stone by also showing off wdiff's
ability to deal with re-wrapped paragraphs while showing off
its ability to work without underline and reverse video. File 8
is the same as the correct file 2, shown at the beginning of
this article, but file 7 (the corrupted one) now has much
shorter lines, which makes them even harder to compare ``by
eye'':
Ecce Eduardus Ursus scalis
nunc tump-tump-tump occipite
gradus pulsante post
Christophorum Robinum
descendens. Est quod sciat
unus et solus modus gradibus
desendendi, non nunquam autem
sentit, etiam alterum nodum
exitare, dummodo pulsationibus
desinere et de no modo
meditari possit. Deinde censet
alios modos non esse. En, nunc
ipse in imo est, vobis
ostentari paratus.
Winnie ille Pu.
wdiff is not confused by the differently-wrapped lines.
The command wdiff 7 8 produces this output:
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
modus gradibus
[-desendendi, non nunquam-]
{+descendendi, nonnunquam+} autem
sentit, etiam alterum [-nodum
exitare,-] {+modum exstare,+} dummodo
pulsationibus desinere et de [-no-] {+eo+}
modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
Remember the + and - characters? They
mean the same thing with wdiff as they mean with diff.
(Consistent user interfaces are wonderful.)
Chunks
Near the beginning of this article, I promised to explain
this line:
@@ -1,9 +1,9 @@
that describes the chunk that diff found differences
in. In each file, the chunk starts on line 1 and is 9
lines long. With this small example,
the chunk contains the whole file.
With larger files, only the lines around the changes, called
the context, are shown.
In files 9 and 10, I've inserted a lot of blank lines in the
middle of the paragraph, in order to show what multiple chunks
look like. File 9 is damaged, file 10 is correct (except for
the blank lines in the middle of the paragraph):
diff -u 9 10
--- 9 Mon Apr 22 15:46:37 1996
+++ 10 Mon Apr 22 15:46:14 1996
@@ -1,7 +1,7 @@
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
-modus gradibus desendendi, non nunquam autem
+modus gradibus descendendi, nonnunquam autem
@@ -33,7 +33,7 @@
sentit, etiam alterum modum exstare, dummodo
-pulsationibus desinere et de no modo meditari
+pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
So you see that one seven-line chunk starting at
line 1 and one seven-line chunk starting at line 33 are shown
here.
You should notice several things here:
- The header with the file names appears only once, at the
top; each chunk begins with its own line of numbers.
- Blank lines are included as part of
a chunk's context.
- Lines that are not changed and that are not
within three lines of a changed line are not included in any
chunk.
``Patches'' (or ``diffs'') are the output of the diff
program. They include all the chunks of changes between the two
files.
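To see chunks form on your own machine, a small sketch like the following may help. It assumes the seq and sed utilities are available, and the file names are hypothetical. A 40-line file of numbers stands in for the paragraph with blank lines in the middle:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# A hypothetical 40-line file: one number per line.
seq 1 40 > old.txt
# Change only lines 4 and 36, far enough apart to land in separate chunks.
sed -e '4s/.*/four/' -e '36s/.*/thirty-six/' old.txt > new.txt

# diff exits with status 1 because the files differ; that is not an error.
diff -u old.txt new.txt > changes.diff || true

# Each chunk starts with a line of numbers between @@ markers.
grep '^@@' changes.diff
hunks=$(grep -c '^@@' changes.diff)
echo "chunks: $hunks"
```

With GNU diff, I would expect the two chunk lines to read @@ -1,7 +1,7 @@ and @@ -33,7 +33,7 @@, the same shape as in the Latin example: three lines of context on either side of each changed line, and nothing at all for the unchanged lines in between.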
Other formats
This only scratches the surface of diff. For one thing, the number of
lines of unchanged context (three by default) is configurable. Instead of using
the -u option, you can use the -U lines
option to specify any reasonable number of lines of context.
You can even specify -U 0 if you don't want to use any
context at all, though that is rarely useful.
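As a rough illustration of how the amount of context changes the size of a patch, here is a sketch that compares the default with a single line of context. The files are hypothetical, and seq, sed, wc, and tr are assumed to be available:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# A hypothetical 40-line file with a single change on line 20.
seq 1 40 > old.txt
sed -e '20s/.*/twenty/' old.txt > new.txt

# Default unified diff: three lines of context on each side of the change.
diff -u old.txt new.txt > three.diff || true
# Ask for a single line of context instead.
diff -U 1 old.txt new.txt > one.diff || true

# tr strips the padding that some versions of wc add.
default_lines=$(wc -l < three.diff | tr -d ' ')
short_lines=$(wc -l < one.diff | tr -d ' ')
echo "with -u:   $default_lines lines"
echo "with -U 1: $short_lines lines"
```

Less context makes a smaller patch, but it also gives a program less to go on when it later has to find the right place in a file that has changed.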
What does the -u (or -U lines) argument
mean? It specifies the unified diff format, which is the
particular format covered here. Other formats include:
- ``context diffs'' which have the same information as
unified diffs, but are less compact and less readable
- ``ed script diffs'' or ``normal diffs'' which are in
a format that can be easily converted into a form that can be
used to cause the (nearly obsolete) editor ed to automatically
change another copy of the old file to match the new file.
This format has no context and could easily be replaced by
-U 0, except for compatibility with older software
and the POSIX standard.
You will almost never want to create context or normal
diffs, but it may be useful to recognize them from time to
time. Context diffs are marked by the use of the character
! to mark changes, and normal diffs are marked by
the use of the characters < and > to
mark changes.
Here are examples:
diff -c 1 2
*** 1 Sat Apr 20 22:11:53 1996
--- 2 Sat Apr 20 22:12:01 1996
***************
*** 1,9 ****
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
! modus gradibus desendendi, non nunquam autem
sentit, etiam alterum modum exstare, dummodo
! pulsationibus desinere et de no modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
--- 1,9 ----
Ecce Eduardus Ursus scalis nunc tump-tump-tump
occipite gradus pulsante post Christophorum
Robinum descendens. Est quod sciat unus et solus
! modus gradibus descendendi, nonnunquam autem
sentit, etiam alterum modum exstare, dummodo
! pulsationibus desinere et de eo modo meditari
possit. Deinde censet alios modos non esse. En,
nunc ipse in imo est, vobis ostentari paratus.
Winnie ille Pu.
diff 1 2
4c4
< modus gradibus desendendi, non nunquam autem
---
> modus gradibus descendendi, nonnunquam autem
6c6
< pulsationibus desinere et de no modo meditari
---
> pulsationibus desinere et de eo modo meditari
There are a few other important things to note here:
- In the header and chunk markers of a context diff, the * character is
used in place of the unified diff's - character, and
the - character is used in place of the +
character. The context diff format was designed before the
unified diff format, but the unified diff format's choice of
characters is mnemonic and therefore preferable.
- Context diffs repeat all context twice for each
chunk. This is a waste of space in files, but far more
importantly, it separates the changes too widely, making
patches less human-readable.
- Normal, old-style diffs are very compact and
use very little space. They are useful in situations where you
don't normally expect a human to read them, where saving space
makes a lot of sense, and where they will never be applied to
files which have changed. For example, RCS (covered in the May
1996 issue of LJ) uses a format almost identical to
old-style diffs to store changes between versions of files.
This saves space and time in a situation where any context at
all would be a waste of space.
Using Patches
When someone changes a file that other people have copies of
(source code, documentation, or just about any other text
file), they often send patches instead of (or in addition to)
making the entire new file available. If you have the old file
and the patches, you might wish that you could have a program
apply the patches. You might think that normal diff format,
which was made to look like input to the ed program, would be
the best way to accomplish this.
As it turns out, this is not true.
A program called patch has been written which is
specifically designed to apply patches to files (change the
files as specified in the patch). It correctly recognizes all
the formats of patches and applies them. With unified and
context diffs, patch can usually apply patches, even if lines
have been added or removed from the file, by looking for
unchanged context lines. Only if the context lines have
themselves been changed is patch likely to fail.
To apply patches with patch, you normally have a file
containing the patch (we'll call it patchfile), and then
run patch:
patch < patchfile
Patch is very verbose. If it gets confused by anything, it
stops and asks you in English (it was written by a linguist, not
a computer scientist) what you want to do. If you want to learn
more about patch, the man page is unusually readable.
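Here is a minimal sketch of the whole round trip, including a copy of the old file that has drifted a little before the patch arrives. All the file names are hypothetical, and the sketch assumes that diff, patch, seq, and sed are installed:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The author's old and new versions of a hypothetical 40-line file.
seq 1 40 > old.txt
sed -e '20s/.*/twenty/' old.txt > new.txt

# The author creates the patch (diff exits 1 because the files differ).
diff -u old.txt new.txt > patchfile || true

# The recipient's copy of the old file has two extra lines at the top.
{ printf 'extra one\nextra two\n'; cat old.txt; } > mine.txt

# Name the file to patch explicitly; the patch arrives on standard input.
patch mine.txt < patchfile
patch_status=$?

# The change should land on line 22 rather than line 20.
changed_line=$(sed -n '22p' mine.txt)
echo "patch exit status: $patch_status; line 22 is now: $changed_line"
```

Because the context lines around the change are intact, patch should find the right spot even though the line numbers in the patch no longer match, and it normally reports the offset it had to use.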
Other Related Tools
If you read the RCS article in the May issue (Take Command:
Keeping Track of Change, LJ #25, May 1996), you may have noticed
that the article talked a bit about a program called rcsdiff.
rcsdiff is really just a front end to diff. That is, it looks
for arguments that it understands (such as revision numbers and
the filename) and prepares two files representing the two
versions of the file you are examining. It then calls diff with
the remaining options. The RCS article used -u to
get the unified format without explaining what it meant, but
you can use -c to get context diffs, or use -U
lines to choose the amount of context you get in a
unified diff, or use any other diff options you like.
You may notice that rcsdiff produces more verbose output than
normal diff. From the RCS article:
rcsdiff -u -r1.3 -r1.6 foo
==============================================
RCS file: foo,v
retrieving revision 1.3
retrieving revision 1.6
diff -u -r1.3 -r1.6
--- foo 1996/02/01 00:34:15 1.3
+++ foo 1996/02/01 01:05:28 1.6
@@ -1,2 +1,6 @@
This is a test of the emergency
-RCS system. This is only a test.
+RCS version control system.
+This is only a test.
+
+I'm now adding a few lines for
+the next version.
It looks just like a normal unified diff except for the
first 5 lines.
This doesn't prevent you from sending patches to people. The
patch program is extremely good about ignoring extraneous
information. It can even ignore news or mail headers, extra
comments written in a file outside a patch, and people's
signatures following patches. Patch tells you when it is
determining whether text is part of a patch or not by saying
``Hmm...''
If you don't care how two files differ, but just want to
know whether they differ, the cmp program will tell you.
It works not only for text files, but also for binary files.
In this example, the files 5 and 6 are different; 2 and 4 are
the same:
cmp 5 6
5 6 differ: char 159, line 4
cmp 2 4
Notice that when two files are the same, cmp doesn't say
anything at all. It only tells you explicitly if the files have
been changed. For use in writing shell scripts, cmp also
returns true if the files are the same and false if they
differ, as shown by this shell session:
if cmp 5 6 ; then
echo "same"
else
echo "different"
fi
5 6 differ: char 159, line 4
different
if cmp 2 4 ; then
echo "same"
else
echo "different"
fi
same
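In a script you usually don't want the ``differ'' message at all, only the answer. The -s option makes cmp silent, so that its exit status is the only result. A small sketch, with made-up file names:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Three hypothetical files: a.txt and b.txt match, c.txt does not.
printf 'same text\n' > a.txt
printf 'same text\n' > b.txt
printf 'other text\n' > c.txt

# -s suppresses all output; only the exit status reports the result.
if cmp -s a.txt b.txt; then ab=same; else ab=different; fi
if cmp -s a.txt c.txt; then ac=same; else ac=different; fi

echo "a.txt vs b.txt: $ab"
echo "a.txt vs c.txt: $ac"
```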
There are several other programs with related functionality.
In particular, diff3 can be used to merge together two different
files that have both been edited from a common ancestor file.
That common ancestor must exist in order for diff3 to
work correctly.
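As a sketch of what such a merge looks like, assuming GNU diff3 (the -m option asks it to write out the merged file) and three hypothetical files, two people each change a different line of the same ancestor:

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The common ancestor, and two versions edited independently from it.
seq 1 10 > ancestor.txt
sed -e '2s/.*/two/' ancestor.txt > mine.txt
sed -e '9s/.*/nine/' ancestor.txt > yours.txt

# The ancestor goes in the middle. GNU diff3 exits 0 when the merge is
# clean and 1 when the two sets of changes conflict.
merge_status=0
diff3 -m mine.txt ancestor.txt yours.txt > merged.txt || merge_status=$?

echo "diff3 exit status: $merge_status"
cat merged.txt
```

The two edits here are far apart, so both should end up in merged.txt. When both people change the same lines, diff3 cannot choose for you; it marks the conflict in the output and leaves the decision to a human.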
The info pages which are shipped with diff are probably
installed on your system. If you want to learn more about diff, try the
command info diff or use info mode from within emacs
or jed.
diff, wdiff, patch, and emacs are available via ftp from
the canonical GNU ftp archive, prep.ai.mit.edu, in the
directory /pub/gnu/
Michael K. Johnson's wife Kim likes A. A. Milne and
briefly studied Latin (unlike Michael, whose experience with
Latin was limited to singing in choir), which is why she owns
Winnie Ille Pu as well as Tela Charlottae
(Charlotte's Web).