Capitalize the first letter of names or Title Case

The problem

I was working on an application when suddenly I stepped into a method that intrigued me.

public static string NameFormatter(string name)
{
    if (!String.IsNullOrEmpty(name))
    {
        int index = 1;
        char[] separator = new[] { ' ', '-', '_', '.', '\'' };
        name = name.Substring(0, 1).ToUpper() + name.Substring(1);
        while ((index = name.IndexOfAny(separator, index)) > 0)
        {
            index++;
            if (name.Length > index + 1)
            {
                name = name.Replace(
                                name.Substring(index, 1),
                                name.Substring(index, 1).ToUpper());
            }
        }
    }
    return name;
}


This code has no comments and has been in production for years. At least the method’s name hints at its intention: it is meant to format a string (such as a name) by replacing the first character of the string, and the first character after each separator character (' ', '-', '_', '.', '\''), with its upper-case form.

Let’s run a test:

string[] names = {
                    "john f. smith",
                    "sandra tayllor-murray",
                    "miguel d'angelo",
                    "pablo fernandez duran"
                    };

foreach (string name in names)
{
    Console.WriteLine(NameFormatter(name));
}

// output:
//     John F. Smith
//     Sandra Tayllor-Murray
//     Miguel D'Angelo
//     Pablo FernanDez Duran

It seems to work. Wait! I don’t really like how this application displays my last name (FernanDez). Leaving aside some algorithmic issues, let’s find the source of the problem. The offending line of code is:

name = name.Replace(
                name.Substring(index, 1),
                name.Substring(index, 1).ToUpper());

Let’s say name.Substring(index, 1) returns "d".
We then have name = name.Replace("d", "d".ToUpper()). String.Replace does not replace only the “d” at position index: it replaces every occurrence of “d” in the string.

Fixing the code

Let’s fix that:

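// Rebuild the string around position index so that only that character changes.
name = name.Substring(0, index)
       + name.Substring(index, 1).ToUpper()
       + name.Substring(index + 1);

// Previous version, which upper-cased every occurrence of the character:
// name = name.Replace(
//                 name.Substring(index, 1),
//                 name.Substring(index, 1).ToUpper());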

Now the output is:

// John F. Smith
// Sandra Tayllor-Murray
// Miguel D'Angelo
// Pablo Fernandez Duran
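
The “algorithmic issues” set aside earlier mostly come from rebuilding the whole string on every pass of the loop. As a rough sketch only, the same rule can be applied in a single pass over a character array. The class and method names below are made up for illustration and are not part of the original application. Two assumptions differ from the original method: this version also capitalizes a final character that directly follows a separator, and it uses invariant-culture casing.

```csharp
using System;

public static class NameFormatting
{
    // Hypothetical helper, not the method from the original application.
    public static string CapitalizeAfterSeparators(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return name;
        }

        // Same separators as the original method.
        const string separators = " -_.'";
        char[] chars = name.ToCharArray();

        // The first character of the string is always capitalized.
        bool capitalizeNext = true;

        for (int i = 0; i < chars.Length; i++)
        {
            if (capitalizeNext)
            {
                // Assumption: invariant casing; the original used ToUpper().
                chars[i] = char.ToUpperInvariant(chars[i]);
            }

            // A separator means the next character starts a new word.
            capitalizeNext = separators.IndexOf(chars[i]) >= 0;
        }

        return new string(chars);
    }
}
```

Calling NameFormatting.CapitalizeAfterSeparators("pablo fernandez duran") returns "Pablo Fernandez Duran", because each position is touched at most once and no other occurrence of a letter is affected.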

Using the TextInfo.ToTitleCase method

That’s better, but I’m still not happy with this code: it is noisy and difficult to maintain, the main intention of the algorithm is hidden in its implementation, and so on. Before trying to rewrite the method, let’s see if there is something useful in the .NET Framework. We have the TextInfo.ToTitleCase method.

Let’s make a test:

string[] names = {
                    "john f. smith",
                    "sandra tayllor-murray",
                    "miguel d'angelo",
                    "pablo fernandez duran"
                    };
// TextInfo and CultureInfo live in the System.Globalization namespace.
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
foreach (string name in names)
{
    Console.WriteLine(textInfo.ToTitleCase(name));
}

// output:
//    John F. Smith
//    Sandra Tayllor-Murray
//    Miguel D'angelo
//    Pablo Fernandez Duran

Almost there! Now Miguel wouldn’t like how the application displays his name (although I don’t actually know any Miguel D’Angelo).

I tried other cultures but didn’t see any difference for these names (I would like to know whether the culture can change the result).
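
One thing that does change the result, whatever the culture, is the casing of the input: according to the documentation, ToTitleCase leaves words that are entirely in upper case untouched because it treats them as acronyms. A minimal sketch, assuming the input may arrive in upper case, lower-cases the text first. The TitleCaseHelper class, its method, and the default culture name are hypothetical choices made for this example.

```csharp
using System;
using System.Globalization;

public static class TitleCaseHelper
{
    // Hypothetical helper: lower-case the input first so that all-caps words
    // are not treated as acronyms and skipped by ToTitleCase.
    public static string ToTitleCaseIgnoringCaps(string name, string cultureName = "en-US")
    {
        if (string.IsNullOrEmpty(name))
        {
            return name;
        }

        TextInfo textInfo = new CultureInfo(cultureName, false).TextInfo;
        return textInfo.ToTitleCase(textInfo.ToLower(name));
    }
}
```

With this helper, "JOHN F. SMITH" becomes "John F. Smith", whereas calling ToTitleCase directly on that input returns it unchanged. The D'angelo limitation is still there.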

Using Regular Expressions

What to do now? Let’s use a regex.

public static string NameFormatter(string name)
{
    if (!String.IsNullOrEmpty(name))
    {
        return Regex.Replace(
                         name,
                         @"\b[a-zA-Z]",
                         m => m.Value.ToUpper());
    }
    return name;
}
// output:
//    John F. Smith
//    Sandra Tayllor-Murray
//    Miguel D'Angelo
//    Pablo Fernandez Duran

Great! \b[a-zA-Z] matches any letter from a to z or A to Z that comes right after a word boundary (\b). See the Regex.Replace documentation for more information.

One case is not covered by this regex: characters following an underscore (‘_’). That is because the word boundary \b is defined in terms of the word-character class \w, and \w includes the underscore: it is roughly shorthand for [A-Za-z0-9_] (in .NET it also matches Unicode letters and digits unless RegexOptions.ECMAScript is used).

Let’s tune our regex to handle the _ character as a separator: (?<=\b|_)[a-zA-Z].
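
To make the difference concrete, here is a small sketch of the method using the lookbehind version of the pattern. The class name is hypothetical; the body mirrors the regex-based NameFormatter shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnderscoreNameFormatter
{
    // Same approach as the regex-based NameFormatter, with a lookbehind that
    // accepts either a word boundary or a preceding '_' before the letter.
    public static string Format(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return name;
        }

        return Regex.Replace(
            name,
            @"(?<=\b|_)[a-zA-Z]",
            m => m.Value.ToUpper());
    }
}
```

UnderscoreNameFormatter.Format("john_smith") returns "John_Smith", while the plain \b[a-zA-Z] pattern leaves the second part alone and gives "John_smith".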

Bonus

To handle names like mcfry or macdonald's we can use:
(?<=\b(?:mc|mac)?|_)[a-zA-Z](?<!'s\b)

With this pattern, the names:

  • john f. smith
  • sandra tayllor-murray
  • miguel d'angelo
  • pablo fernandez duran
  • mcfry
  • macdonald's

Will give:

  • John F. Smith
  • Sandra Tayllor-Murray
  • Miguel D'Angelo
  • Pablo Fernandez Duran
  • McFry
  • MacDonald's

The final code is:

public static string NameFormatter(string name)
{
    if (!String.IsNullOrEmpty(name))
    {
        return Regex.Replace(
                         name,
                         @"(?<=\b(?:mc|mac)?|_)[a-zA-Z](?<!'s\b)",
                         m => m.Value.ToUpper());
    }
    return name;
}
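
A quick way to check the final pattern is to run it over a few samples. The sketch below (class and method names are hypothetical) also shows a limitation worth keeping in mind: the pattern cannot tell a genuine Mc/Mac prefix from a name that merely starts with those letters, and the prefix match is case-sensitive.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameFormatterCheck
{
    // The final pattern from this article, compiled once and reused.
    private static readonly Regex FirstLetters =
        new Regex(@"(?<=\b(?:mc|mac)?|_)[a-zA-Z](?<!'s\b)");

    public static string Format(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return name;
        }

        return FirstLetters.Replace(name, m => m.Value.ToUpper());
    }

    public static void PrintSamples()
    {
        Console.WriteLine(Format("mcfry"));        // McFry
        Console.WriteLine(Format("macdonald's"));  // MacDonald's
        Console.WriteLine(Format("macy"));         // MacY  (false positive)
        Console.WriteLine(Format("Mcfry"));        // Mcfry (prefix check is case-sensitive)
    }
}
```

Whether false positives such as MacY are acceptable depends on the data; a list of exceptions would be one option, but that is outside what the pattern itself can express.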

Finally

We need to do some extra work on the regex if we want to match an extended range of characters such as letters with diacritics (accents); take a look at the character classes reference for regular expressions on MSDN. You can use this set to match accented letters as well:
[a-zA-ZÀ-ÿ].
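
Note that the range À-ÿ (U+00C0 to U+00FF) also contains the × and ÷ signs, which is harmless here because they have no upper-case form. As an alternative that is not part of the original code, the Unicode category \p{L} matches any letter in any script. The sketch below assumes that this broader match is what you want; the class name is hypothetical, and the Mc/Mac handling is left out to keep it short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeNameFormatter
{
    // \p{L} matches any Unicode letter, so accented initials are included.
    public const string Pattern = @"(?<=\b|_)\p{L}";

    public static string Format(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return name;
        }

        // ToUpper() uses the current culture, as in the rest of the article.
        return Regex.Replace(name, Pattern, m => m.Value.ToUpper());
    }
}
```

With the [a-zA-Z] class, a name such as álvarez is left untouched, because the first letter is not in the set and no word boundary precedes the second one. With \p{L}, the leading á is matched and passed to ToUpper().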

Warning

Formatting names is a delicate matter that may depend on culture and language. A name is part of a person’s identity, and people may not like an application telling them how to write theirs. So maybe the best way to format a name is to leave it as typed by the user.