Correctly converting a character to lower/upper case

02/08/2021

Gérald Barré

.NET

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

String comparisons are harder than it seems
.NET Regex: \d is different from [0-9]
How to correctly count the number of characters of a string
Correctly converting a character to lower/upper case (this post)
How not to read a string from a UTF-8 stream
Regex with IgnoreCase option may match more characters than expected
How to remove diacritics from a string in .NET

Strings are complicated. A common mistake is using char.IsUpper or char.ToUpper incorrectly, such as when converting the first character of a string to uppercase. The naive approach, which is often wrong, looks like this:

static string FirstCharacterToUpperCaseBad(string str)
{
    if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
        return str;
    return char.ToUpperInvariant(str[0]) + str[1..];
}

This method works for many strings. For example, "abc" correctly becomes "Abc". However, the Latin alphabet is not the only writing system. Consider the Osage alphabet. The character 𐓸 should become 𐓐 when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸") returns the same string.

In .NET, a string is a sequential read-only collection of char objects. A char represents a UTF-16 code unit. UTF-16 is a character encoding that maps Unicode code points to sequences of 16-bit code units. It is a variable-length encoding, where code points are encoded using one or two 16-bit code units.

The string "𐓸" consists of two char values because the code point U+104F8 requires two UTF-16 code units (the surrogate pair 0xD801 0xDCF8), whereas "a" needs only one (0x0061). As a result, "𐓸".Length returns 2. The following screenshot shows how a and 𐓸 are encoded in UTF-16:

source: https://tools.meziantou.net/string-info
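
The same information can be printed from code. This is a minimal sketch, assuming a .NET version that supports top-level statements; CodeUnits is a hypothetical helper written for this illustration, and the Osage letter is written as an escape sequence so the example does not depend on the source file encoding:

```csharp
using System;
using System.Linq;

string latin = "a";
string osage = "\U000104F8"; // 𐓸 (U+104F8)

// Hypothetical helper: format each UTF-16 code unit (char) as four hex digits
static string CodeUnits(string s) =>
    string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

Console.WriteLine($"{latin.Length}: {CodeUnits(latin)}"); // 1: 0061
Console.WriteLine($"{osage.Length}: {CodeUnits(osage)}"); // 2: D801 DCF8
```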

Accessing "𐓸"[0] retrieves only the first UTF-16 code unit, which is just half the character. Without both units, it is impossible to determine the character's case or convert it correctly. As a result, char.ToUpperInvariant("𐓸"[0]) returns the character unchanged.
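
A short sketch of what the char APIs see when they are given only the first code unit. The variable names are illustrative:

```csharp
using System;

string osage = "\U000104F8"; // 𐓸 (U+104F8)
char first = osage[0];       // 0xD801, the high surrogate only

Console.WriteLine(char.IsHighSurrogate(first));           // True
Console.WriteLine(char.IsLetter(first));                  // False
Console.WriteLine(char.IsLower(first));                   // False
Console.WriteLine(char.ToUpperInvariant(first) == first); // True
```

A lone surrogate is not a letter, so the char-based methods have nothing to convert.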

The correct approach is to check whether the first character is part of a surrogate pair (two char values) and use both for the conversion. Rather than handling this manually with char.IsSurrogate, you can use the Rune type, which abstracts the complexity:
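
For comparison, here is a sketch of what the manual check could look like. FirstCodePointLength is a hypothetical helper that assumes a non-empty string; the last lines show that Rune.DecodeFromUtf16 reports the same length and code point:

```csharp
using System;
using System.Text;

// Hypothetical helper: number of chars used by the first code point.
// Assumes s is not empty.
static int FirstCodePointLength(string s) =>
    s.Length >= 2 && char.IsSurrogatePair(s[0], s[1]) ? 2 : 1;

string text = "\U000104F8" + "bc"; // 𐓸bc
int length = FirstCodePointLength(text);     // 2
int codePoint = char.ConvertToUtf32(text, 0); // 0x104F8

// Rune performs the same decoding in a single call
Rune.DecodeFromUtf16(text, out Rune rune, out int charsConsumed);
Console.WriteLine(length == charsConsumed); // True
Console.WriteLine(codePoint == rune.Value); // True
```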

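// Requires: using System.Buffers; (OperationStatus) and using System.Text; (Rune)
static string FirstCharacterToUpperCase(string str)
{
    if (string.IsNullOrEmpty(str))
        return str;

    // Decode the first Rune (Unicode scalar value) of the string;
    // charsConsumed is 1 for a BMP character and 2 for a surrogate pair
    var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);

    // Return the string unchanged if decoding failed (e.g. a lone surrogate)
    // or if the first rune is already uppercase
    if (result != OperationStatus.Done || Rune.IsUpper(rune))
        return str;

    // Convert the first rune to uppercase and concatenate it to the rest of the string
    return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}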

You can now test this method with various strings:

FirstCharacterToUpperCase("abc def");   // Abc def   (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻");       // 𐐗𐐱𐐻       (Deseret)
// etc. Other cased scripts outside the BMP: Old Hungarian (U+10C80), Warang Citi (U+118A0), Medefaidrin (U+16E40)

In general, when working with arbitrary text, consider using Rune instead of char.
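
As an illustration, string.EnumerateRunes lets you iterate over a string one Rune at a time. The sketch below uses a hypothetical CountLowercaseRunes helper and compares it with a char-based count:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: count lowercase letters, one Rune at a time
static int CountLowercaseRunes(string s)
{
    int count = 0;
    foreach (Rune rune in s.EnumerateRunes())
    {
        if (Rune.IsLower(rune))
            count++;
    }
    return count;
}

string text = "a" + "\U000104F8" + "B"; // a𐓸B

// The char-based count misses 𐓸 because it only sees two lone surrogates
Console.WriteLine(text.Count(c => char.IsLower(c))); // 1
Console.WriteLine(CountLowercaseRunes(text));        // 2
```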
