Natural sorting of strings

Sorts strings that contain numbers, not character-wise (lexical) but regarding the numeric value: “a1” < “a2” < “a10” < “a11” < “a20”

Characterwise
order:
Natural
order:
Picture1.jpg Picture1.jpg
Picture10.jpg Picture2.jpg
Picture2.jpg Picture5.jpg
Picture251.jpg Picture10.jpg
Picture26.jpg Picture26.jpg
Picture5.jpg Picture251.jpg

String sorting algorithms used in computers usually compare the strings character by character and decide upon the first difference whether one string comes before or after another. This works well for words and names, but not for numbers (see table at the right side). Therefore one must consider the entire number and compare by its numeric value. While numbers are basically also sorted from left to right, digit by digit, this is only true if both numbers have the same digit count. For shorter numbers you need to “imagine the leading zeros”. Characterwise sorting puts “ab” before “c”, but also “12” before “3” which seems wrong to us.

Put it simple, natural sorting produces the following order:

a < a0 < a01 < a1 < a1a < a1b < a2 < a10 < a11 < a20 < a100 < b

Scale of justice 2

Strings without digits are sorted as usual. Further features of my implementation are:

  • Negative numbers at the very beginning of a string are sorted numerically correct.
  • Explicitly specified leading zeros are ignored. In order to produce a deterministic result, they are assigned a preference that serves as difference if the strings would otherwise be equal. More leading zeros come first (analogue to the simple characterwise order). (See example)
  • Certain special characters are at first ignored just like leading zeros. They are: Space, single and double quotation marks (only in the ASCII character set, U+0027 and U+0022). The list of these characters can be modified in the special string. Numbers still end in front of these special characters: two numbers separated by a space are not combined before comparing.
  • This function’s result is deterministic, that means that strings in any arbitrary order are transformed into exactly one order. Therefore preferences are collected while comparing, which can serve as decision criterion if no other relevant differences can be found. So in equal words, upper case comes before lower case and ignored special characters eventually will be regarded. If leading zeros would be ignored, it would lead to the following “order”: … a0 < a01 = a1 < a1a …

What this function cannot do:

  • Numbers with decimals (of different length) are not sorted correctly. Instead, their decimal part is interpreted as another integer part and compared numerically. So for example “1.2” comes before “1.005” because 2 < 5. Supporting this feature would require recognising the decimal separator which is different depending on the language. But maybe I’m still going to do that.

Compatibility: .NET Version 2.0 or newer

Usage notes

Sorting data can be reduced to pair comparison of items in the sorting algorithm. So this is not a sorting algorithm (like Bubble Sort or Quicksort), but merely a comparison function. It can also be used separately, where ever two strings need to be sorted. In order to sort entire lists, you can for example use the following code:

List<string> myList = new List<string>();
myList.Add("abc");
myList.Add("def");
// …
myList.Sort(NatCompare);

The source code contains preprocessor statements (“#define …”) at the beginning of the file that can be used to configure the function processing. You can for instance deactivate the support for local special characters or negative numbers. These code sections are then no longer compiled in and don’t use any space or runtime. These statements must always be at the beginning of the file, also in front of the using and namespace specifications. Since these are only symbol definitions, they can also be defined in the project configuration (Visual Studio: Project properties; Build; Symbols for conditional compilation).

Download

NaturalSort.cs8.0 KiBSource code of the NatCompare function

Performance

Since this function does a lot more, it takes more time than a simple character-wise comparision on the same data. To test the comparison speed, I have run time measurements with a non-representative list of about 21 100 file names from my Windows directory, using the List<string>.Sort(Comparison<string>) method.

Option Compared to String.Compare Compared to NATSORT_CMP_ CHAR Compared to NATSORT_CMP_ STRING_NOCASE
Character comparison modes:
NATSORT_CMP_CHAR
Simple character comparison by code table
−33 %
NATSORT_CMP_CHAR_NOCASE
Case-insensitive character comparison
+150 % +280 %
NATSORT_CMP_STRING
Regarding local special characters
+550 % +875 %
NATSORT_CMP_STRING_NOCASE
Regarding local special characters, case-insensitive
+567 % +900 %
Additional options:
+ NATSORT_WITH_SPECIAL
Skips certain special characters
+130 % +15 %
+ NATSORT_WITH_NEGATIVE
Supports negative numbers (did not occur in my test data)
+8 % +3 %

All in all you can say that using the function for natural sorting with enabled support for local special characters, negative numbers, case sensitive mode and ignoring certain special characters leads to a pure comparison work of 6.7 times the normal sorting.

Changes

2009Feb12
Negative numbers are now sorted correctly. Before, the order depended on the string length. (Thanks to Michael Hodel for the helpful hint.)

References

  • Numeric String Sort in C# (The Code Project)
    I found some inspiration for my implementation in this article but eventually did not use its code as a template. In fact I haven’t even tried to understand it and instead started with my own thoughts right away. ;) My function is now better configurable, natively supports negative numbers and local special characters, consists of only a single function and is about as fast (according to my measurement). Somewhere down in the site’s visitor comments there’s a patch to support local special characters, which I have used in my benchmark.

Licence and terms of use

Copying and distribution of this file, with or without modification, are permitted provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty. (GNU All-Permissive licence)

Statistic data

  • Created on 2007-07-08, updated on 2009-02-12.
  • Ca. 100 lines of code, estimated development costs: 100 - 400 €