Difference between revisions of "Unicode and Locale"

From Noah.org
Latest revision as of 23:20, 29 May 2014


Unicode and the shell

When you start Bash, your LANG environment variable is usually already set for you. It is first set in /etc/profile. It used to be set in /etc/environment.

Run the `locale` command and notice that it first prints LANG and then a series of LC_* variables. The LC_* variables may or may not be set in your environment. Any that are not set take the same value as the LANG variable. If LANG is not set either, then everything defaults to the C locale, unless LC_ALL is set, in which case LC_ALL overrides LANG and every LC_* variable.

locale
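The fallback rules above can be sketched as a small shell function. This is only an illustration of the precedence order (LC_ALL, then the category variable, then LANG, then C). The name `effective_locale` is made up for this example and is not a standard command. An empty value is treated the same as an unset one.

```shell
# Hypothetical helper: print the value one locale category would take.
# Precedence: LC_ALL, then the category variable itself, then LANG, then "C".
effective_locale() {
    category=$1
    # Indirectly read the variable named by $category (for example LC_COLLATE).
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

effective_locale LC_COLLATE
```

Comparing its output with the matching line printed by `locale` is a quick way to check your understanding of your own environment. Note that `locale` may print the name POSIX where this sketch prints C. The two names refer to the same locale.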

Don't confuse 'setlocale (LC_ALL, "")' with 'LC_ALL=foo'

When a C program starts, it always starts in the C locale. It ignores the LANG and LC_* environment variables until you explicitly load your locale from the environment.

#include <locale.h>
...
setlocale(LC_ALL, "");  /* load every locale category from the environment */

Note that in this context LC_ALL means "every locale category", and the empty string means "load the values from the environment". This may be a little confusing because, as an environment variable, LC_ALL is used to ignore and override everything else in the environment with a single given value.

Set locale temporarily for one command-line

Many commands behave differently depending on the locale. For example, `grep` will interpret range expressions like [a-z] differently depending on the locale (POSIX character classes, such as [[:lower:]], are safer). This can cause problems with regular expressions. Generally, most unattended scripts run for system administration will prefer the C locale. You might think this problem won't concern the en_US.UTF-8 locale because it's American English just like the C locale, but they are not the same. In particular, the collation order (alphabetical sort) is different.

As with any command in the shell, you can prepend a temporary environment setting. In this case LANG will be set to C for the grep environment. LANG will have its old value outside of the grep environment. This can be very useful in situations where you generally want to run under LANG=en_US.UTF-8, but certain commands will run better under the C locale (faster string matching, ASCII sorting, ISO dates).

LANG=C grep 'Search Text' filename
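The following sketch shows that the prefix only affects the one command it is attached to. The variable names (`lang_before`, `lang_inside`, `lang_after`, `matched`) are made up for this example. The last step uses LC_ALL=C rather than LANG=C, because a LANG prefix has no effect if LC_ALL or the relevant LC_* variable is already set in your environment.

```shell
# Remember the caller's LANG so we can confirm it is untouched afterwards.
lang_before=${LANG:-}

# The assignment prefix applies only to the environment of this one command.
lang_inside=$(LANG=C sh -c 'printf "%s" "$LANG"')

lang_after=${LANG:-}

# In the C locale, [a-z] matches only the lowercase ASCII letters.
# LC_ALL is used here because it wins over LANG and every LC_* variable.
matched=$(printf '%s\n' a B | LC_ALL=C grep '[a-z]')

printf 'inside the command: LANG=%s\n' "$lang_inside"
printf 'before: LANG=%s, after: LANG=%s\n' "$lang_before" "$lang_after"
printf 'lines matched by [a-z] in the C locale: %s\n' "$matched"
```

Running the same `grep` under your normal locale and comparing the matched lines is an easy way to see whether range expressions behave differently on your system.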

Fix the sorting order by setting LC_COLLATE=C

If you make one change to your default locale it should probably be this one.

Some shell commands such as `ls -la` use a very annoying sort order when LANG=en_US.UTF-8. You can change just the collation order without changing all the other ways the locale could be used. For example, run the following commands and notice the difference. The C collation order is what some people might remember and love from the old ASCII (ANSI_X3.4-1968) days. Basically everything is sorted by the byte value of each character, so all uppercase ASCII letters sort before all lowercase ones.

LC_COLLATE="C" ls -la ~
LC_COLLATE="en_US.UTF-8" ls -la ~
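The same difference can be seen without depending on the contents of your home directory. This is a minimal sketch using `sort` on four fixed lines. The variable name `sorted` is made up for this example. LC_ALL is unset inside the subshell because it would otherwise override LC_COLLATE.

```shell
# Sort four fixed lines using the C collation order.
sorted=$(
    unset LC_ALL   # LC_ALL, if set, would override LC_COLLATE
    printf '%s\n' b A a B | LC_COLLATE=C sort
)

# Uppercase letters come first because their byte values are lower.
printf '%s\n' "$sorted"
```

With LC_COLLATE=C the lines come out as A, B, a, b. Repeating the pipeline with LC_COLLATE=en_US.UTF-8 (if that locale is installed) shows the other ordering.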

You can set just the collation order permanently by putting this in your ~/.bashrc:

export LC_COLLATE=C