String comparison in Delphi

Have you ever wondered how utilities like Beyond Compare or DIFF are comparing files? They do it (I guess) by solving the longest common subsequence (LCS) problem.

After reading the Wikipedia article linked above, I obtained an overall view of the problem and I looked at the possible resolutions. So, I decided to implement a Delphi class to do the string comparison trick, which is the base for the text file comparison.

Let me put it as follows: given two strings to be compared, I want to highlight in blue the characters added to the first string and in red the characters removed from it. The common (unchanged) characters will keep the default color.
 
For example:

String 1 = Delphi allows both structural and object oriented programming.

String 2 = Does Delphi allow object oriented programming?

Highlighted differences:

Does Delphi allows both structural and object oriented programming.?

The Delphi class looks like this:

type
  TDiff = record
    Character: Char;
    CharStatus: Char;  //Possible values: [+, -, =]
  end;

  TStringComparer = class
  ……………
  public
    class function Compare(aString1, aString2: string): TList<TDiff>;
  end;

When you call TStringComparer.Compare, a generic list of TDiff records is created. A TDiff record contains a character and whether this character was added (CharStatus = ‘+’), removed (CharStatus = ‘-’) or unchanged (CharStatus = ‘=’) in both strings under comparison.

Let’s drop two edits (Edit1, Edit2), a rich edit (RichEdit1) and a button (Button1) on a Delphi form. To highlight the differences put the following code in the OnClick event of the button:

procedure TForm1.Button1Click(Sender: TObject);
var
  Differences: TList<TDiff>;
  Diff: TDiff;
begin
  //Yes, I know...this method could be refactored ;-)
  Differences:= TStringComparer.Compare(Edit1.Text, Edit2.Text);
  try
    RichEdit1.Clear;
    RichEdit1.SelStart:= RichEdit1.GetTextLen;
    for Diff in Differences do
      if Diff.CharStatus = '+' then
      begin
        RichEdit1.SelAttributes.Color:= clBlue;
        RichEdit1.SelText := Diff.Character;
      end
      else if Diff.CharStatus = '-' then
      begin
        RichEdit1.SelAttributes.Color:= clRed;
        RichEdit1.SelText:= Diff.Character;
      end
      else
      begin
        RichEdit1.SelAttributes.Color:= clDefault;
        RichEdit1.SelText:= Diff.Character;
      end;
  finally
    Differences.Free;
  end;
end;

It looks like in the image below:


For the full implementation read further down. Note that various optimizations could be added to the code below, but I didn’t implement them. Anyway, I hope this helps. Feedback is welcome! Feel free to find and correct bugs ;-)


unit StringComparison;

interface

uses
  Math, Generics.Collections;

type
  TDiff = record
    Character: Char;
    CharStatus: Char;  //Possible values: [+, -, =]
  end;

  TStringComparer = class
  strict private
    type TIntArray = array of array of Integer;
    class function LCSMatrix(X, Y: string): TIntArray;
    class procedure ComputeDifferences(aLCSMatrix: TIntArray;
                                       X, Y: string;
                                       i, j: Integer;
                                       aDifferences: TList<TDiff>);
  public
    class function Compare(aString1, aString2: string): TList<TDiff>;
  end;

implementation

{ TStringComparer }

class function TStringComparer.LCSMatrix(X, Y: string): TIntArray;
var
  m, n,
  i, j: Integer;
begin
  m:= Length(X);
  n:= Length(Y);

  //We need one extra column and one extra row to be filled with zeroes
  SetLength(Result, m + 1, n + 1);

  //First column filled with zeros
  for i := 0 to m do
    Result[i, 0] := 0;

  //First row filled with zeros
  for j:= 0 to n do
    Result[0, j]:= 0;

  //Storing the lengths of the longest common subsequences
  //between prefixes of X and Y
  for i:= 1 to m do
    for j:= 1 to n do
      if X[i] = Y[j] then
        Result[i, j] := Result[i-1, j-1] + 1
      else
        Result[i, j]:= Max(Result[i, j-1], Result[i-1, j]);
end;

class procedure TStringComparer.ComputeDifferences(aLCSMatrix: TIntArray;
                                                   X, Y: string;
                                                   i, j: Integer;
                                                   aDifferences: TList<TDiff>);
var
  CharDiff: TDiff;
begin
  if (i > 0) and (j > 0) and (X[i] = Y[j]) then
  begin
    ComputeDifferences(aLCSMatrix, X, Y, i-1, j-1, aDifferences);
    CharDiff.Character:= X[i];
    CharDiff.CharStatus:= '=';   //The character did not change
    aDifferences.Add(CharDiff);
  end
  else
    if (j > 0) and ((i = 0) or (aLCSMatrix[i,j-1] >= aLCSMatrix[i-1,j])) then
    begin
      ComputeDifferences(aLCSMatrix, X, Y, i, j-1, aDifferences);
      CharDiff.Character:= Y[j];
      CharDiff.CharStatus:= '+'; //The character was added
      aDifferences.Add(CharDiff);
    end
    else if (i > 0) and ((j = 0) or (aLCSMatrix[i,j-1] < aLCSMatrix[i-1,j])) then
    begin
      ComputeDifferences(aLCSMatrix, X, Y, i-1, j, aDifferences);
      CharDiff.Character:= X[i];
      CharDiff.CharStatus:= '-'; //The character was removed
      aDifferences.Add(CharDiff);
    end;
end;

//This is a factory method
class function TStringComparer.Compare(aString1, aString2: string): TList<TDiff>;
var
  Matrix: TIntArray;
begin
  Result:= TList<TDiff>.Create;
  Matrix:= LCSMatrix(aString1, aString2);
  ComputeDifferences(Matrix,
                     aString1, aString2,
                     Length(aString1), Length(aString2),
                     Result);
end;

end.

1 comment: