Unicode and Locale

From Noah.org

Unicode and the shell

When you start Bash, your LANG environment variable has usually already been set for you. It is typically first set in /etc/profile; it used to be set in /etc/environment.

Run the `locale` command and notice that it first prints LANG and then a series of LC_* variables. The LC_* variables may or may not be set in your environment. Any that are not set take the value of LANG. If LANG is not set either, everything defaults to the C locale, unless LC_ALL is set, in which case LC_ALL overrides LANG and every other LC_* variable.
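The lookup order described above can be sketched as a small shell function. This is only an illustration, not how `locale` itself is implemented: `effective_locale` is a hypothetical helper name, and it assumes that an empty variable counts the same as an unset one.

```shell
# Hypothetical helper: print the value that wins for one locale category.
# Precedence: LC_ALL, then the category's own variable, then LANG, then C.
# Assumption: an empty value is treated the same as an unset variable.
effective_locale() {
    category=$1                      # e.g. LC_COLLATE; must be a plain variable name
    eval "specific=\${$category:-}"  # read the variable whose name is in $category
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$specific" ]; then
        printf '%s\n' "$specific"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"
    fi
}

# Show which value applies to collation in the current shell
effective_locale LC_COLLATE
```

Comparing its output with the matching line printed by `locale` is a quick way to check your understanding of which variable is in effect.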


Don't confuse 'setlocale (LC_ALL, "")' with 'LC_ALL=foo'

When a C program starts, it always starts in the C locale. It ignores the LANG and LC_* environment variables. You must explicitly load your locale from the environment.

#include <locale.h>
setlocale(LC_ALL, "");

Note that in this context LC_ALL names the category to set ("all categories"), and the empty string says "load the values from the environment". This may be a little confusing because in most contexts the LC_ALL name refers to the environment variable that overrides everything else in the environment with a single given value.

Set locale temporarily for one command-line

Many commands behave differently depending on the locale. For example, `grep` will interpret range expressions like [a-z] differently depending on the locale (POSIX character classes, such as [[:lower:]], are safer). This can cause problems with regular expressions. Generally, most unattended scripts run for system administration will prefer the C locale. You might think this problem won't concern the en_US.UTF-8 locale because it's American English just like the C locale, but they are not the same. In particular, the collation order (alphabetical sort) is different.
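As a small illustration of the point about ranges, the sketch below pins the locale to C, where [a-z] covers exactly the 26 ASCII lowercase letters, and compares it with the [[:lower:]] class. The word list and variable names are made up for the example.

```shell
# Sample input: two lowercase-initial words and one capitalized word
words=$(printf '%s\n' apple Banana cherry)

# In the C locale the range [a-z] is exactly the ASCII lowercase letters
c_range=$(printf '%s\n' "$words" | LC_ALL=C grep '^[a-z]')

# The character class states the intent directly instead of relying on a range
by_class=$(printf '%s\n' "$words" | LC_ALL=C grep '^[[:lower:]]')

printf '%s\n' "$c_range"
printf '%s\n' "$by_class"
```

Under LC_ALL=C both commands select the same lines. Running the [a-z] version without the LC_ALL=C prefix is where results may vary from system to system, which is the reason to prefer the character class.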

As with any command in the shell, you can prepend a temporary environment setting. In the example below LANG is set to C only in the environment of grep. LANG keeps its old value outside of the grep environment. This can be very useful in situations where you generally want to run under LANG=en_US.UTF-8, but certain commands will run better under the C locale (faster string matching, ASCII sorting, predictable date formats).

LANG=C grep 'Search Text' filename
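The scoping of a prepended assignment can be checked with a short sketch like the following. The variable names `before`, `inside`, and `after` are just for illustration. Note that if LC_ALL or a specific LC_* variable is already set in your environment, it still takes precedence over LANG=C; LC_ALL=C is the stronger form.

```shell
# Record LANG in the current shell (or the word "unset" if it is not set)
before=${LANG-unset}

# The prepended assignment applies only to the environment of this one command
inside=$(LANG=C sh -c 'printf %s "$LANG"')

# Back in the calling shell, LANG is unchanged
after=${LANG-unset}

printf 'before=%s inside=%s after=%s\n' "$before" "$inside" "$after"
```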

Fix the sorting order by setting LC_COLLATE=C

If you make one change to your default locale it should probably be this one.

Some shell commands such as `ls -la` use a very annoying sort order when LANG=en_US.UTF-8. You can change just the collation order without changing all the other ways the locale could be used. For example, run the following commands and notice the difference. The C collation order is what some people might remember and love from the old ASCII (ANSI_X3.4-1968) days. Basically everything is sorted by the byte value of each character, so all uppercase letters sort before all lowercase letters.

LC_COLLATE="C" ls -la ~
LC_COLLATE="en_US.UTF-8" ls -la ~
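The same difference can be seen with `sort` on a fixed list, which avoids depending on the contents of your home directory. This is a minimal sketch with made-up sample words; LC_ALL is set to the empty string for the command so that an LC_ALL inherited from the environment does not override LC_COLLATE.

```shell
# Mixed-case sample words
items=$(printf '%s\n' banana Apple cherry Banana)

# C collation compares byte values, so uppercase sorts before lowercase
c_sorted=$(printf '%s\n' "$items" | LC_ALL= LC_COLLATE=C sort)

printf '%s\n' "$c_sorted"
# Prints:
# Apple
# Banana
# banana
# cherry
```

Replacing LC_COLLATE=C with LC_COLLATE=en_US.UTF-8 (if that locale is installed) should show the dictionary-style ordering that the text describes.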

You can set just collate permanently by putting this in your ~/.bashrc: