NAME
ctab - Locale character classification, case conversion, and
collating input file
DESCRIPTION
A locale character classification, case conversion and col-
lating input file consists of records separated by newline
characters. Each record consists of one character or colla-
tion element in the locale, where a collation element is a
sequence of two or more characters that collate as a single
unit. These files are not directly accessed by user pro-
grams: the ctab command reads them to produce binary files
loaded by the setlocale() function.
The ordering of the records determines the order of the
locale's characters. Records marked with the translate or
ignore indicator (see KEYWORDS) do not reflect this order-
ing. The ordering of characters in a locale may also be
referred to as their collation weights.
Several characters may have the same primary collation
weight but different secondary weights. In French, for
example, the plain and accented versions of a all sort to
the same primary location. If there is a tie between a plain
and an accented character, however, a secondary sort is
applied. A group of characters with the same primary
collation value is said to belong to the same equivalence
class. If a character is not part of an equivalence class,
it has identical primary and secondary collation weights.
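The two-level comparison described above can be sketched in Python. This is an illustration only, not the algorithm used by setlocale() or grep: the weight table is hypothetical, the numbering (secondary weight equals record position, primary weight equals the position of the first member of the class) is an assumption, and so is the choice to compare all primary weights of a string before any secondary weights.

```python
# Hypothetical weights, not taken from any real ctab file.
# Each character maps to (primary, secondary). Members of one
# equivalence class share a primary weight; secondary weights are unique.
WEIGHTS = {
    "a": (1, 1),
    "à": (1, 2),   # same equivalence class as "a"
    "â": (1, 3),   # same equivalence class as "a"
    "b": (4, 4),   # not in a class: primary == secondary
    "c": (5, 5),
}


def collation_key(text, weights):
    """Build a sort key: all primary weights first, then all secondary ones."""
    primary = tuple(weights[ch][0] for ch in text)
    secondary = tuple(weights[ch][1] for ch in text)
    return (primary, secondary)


words = ["âb", "ac", "ab", "àb"]
# Secondary weights only matter when the primary weights tie.
ordered = sorted(words, key=lambda w: collation_key(w, WEIGHTS))
```

With these assumed weights, the three words that differ only by accent are grouped together and ordered by secondary weight, and "ac" follows them because its second character has a higher primary weight.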
This primary and secondary collation weight information is
used by applications, such as grep, that rely on ctab
information to determine how strings are ordered.
The ctab input file describes the collating weights for an
assumed code set and a particular language. If a character
is encountered which does not appear in the ctab file
corresponding to the current locale, the character's colla-
tion weight will be based on its relative position in the
current code set.
Records in the locale ctab input files have fields separated
by a separator character. By default, this separator is a :
(colon), but the user can change this; see KEYWORDS. The
records have the following fields:
subject character
The subject character field is actually the collating
element, which may be composed of more than one character.
If the subject character is a multicharacter
collating element, the first character in the element
must also be defined as a subject character elsewhere
in the input file. If the character or collating ele-
ment is followed by the equivalence class character,
which is a ^ (circumflex) by default, it is given the
same primary collating weight as the character
represented by the preceding record. The secondary
collation weight is unique. Characters can be speci-
fied using octal escape sequences consisting of a \
(backslash) followed by one or more octal digits. Any
backslash not followed by an octal digit is an escape
character. The subject character field must be ter-
minated by a separator character even if there are no
other fields in the record.
case conversion
The case conversion field specifies the character that
is the inverse case of the character in the first
field. For example, if the first field is p, the
second field is P. If the third field, the character
classification field (see below), contains an l or L
(for lowercase), the second field specifies the upper-
case equivalent of the subject character. If the char-
acter classification field contains a u or U (for
uppercase), the case conversion field specifies the
lowercase equivalent of the subject character. Any
character with a nonempty case conversion field can
specify the corresponding uppercase or lowercase
letter. Characters classified as alphabetic do not
require a corresponding case; that is, the second field
can be empty. The second field currently is not used
for SJIS characters when Japanese Language Support is
installed.
character classification
The character classification field accepts the following
values, each of which denotes a character class:
u or U Uppercase letter
l or L Lowercase letter
a or A Alphabetic character
n or N Digits
x or X Hexadecimal digits
p or P Punctuation characters
s or S Whitespace characters
c or C Control characters
g or G Graphic
- No type
Characters can belong to more than one character class,
subject to certain rules. The difference between
graphic and printable characters is that the set of
graphic characters does not include the space charac-
ter, but the set of printable characters does include
the space character. The ASCII code set is predefined
as follows:
A through Z
     Uppercase letters
a through z
     Lowercase letters
A through Z, and a through z
     Alphabetic characters
0 through 9
     Digits
Alphabetic characters and digits
     Alphanumeric characters
0 through 9, A through F, and a through f
     Hexadecimal digits
Any character below the Space character, and the Delete character
     Control characters
Space, formfeed, newline, carriage return, horizontal tab,
and vertical tab
     Whitespace characters
Any character except the above
     Punctuation characters
Characters not defined as alphabetic are automatically
defined as punctuation.
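As an illustration of the record layout described above, the following Python sketch splits one record into its fields and decodes backslash escapes. It is not the ctab command's parser. The function name split_record is hypothetical, and the sketch assumes that an octal escape has at most three digits and that a backslash before any other character makes that character literal. It does not handle comments, option lines, indicator characters, or whitespace around fields.

```python
OCTAL_DIGITS = "01234567"


def split_record(line, sep=":"):
    """Split one record into decoded fields (hypothetical helper).

    Assumptions: octal escapes are one to three digits long, and a
    backslash followed by a non-octal character yields that character.
    """
    fields, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Collect up to three octal digits after the backslash.
            j = i + 1
            while j < len(line) and j - i <= 3 and line[j] in OCTAL_DIGITS:
                j += 1
            if j > i + 1:
                current.append(chr(int(line[i + 1:j], 8)))
                i = j
            else:
                # Escaped literal, for example an escaped separator.
                current.append(line[i + 1])
                i += 2
        elif ch == sep:
            # An unescaped separator ends the current field.
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields
```

Under these assumptions, split_record("p:P:l") yields the three fields p, P, and l, and a record whose subject character is followed only by the terminating separator yields that character plus one empty field.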
KEYWORDS
A line beginning with the word "option" serves to change one
or more of the default conditions or metacharacters built
into the collating table. The word "option" is followed by
one or more keyword/value pairs. Keywords and values are
separated by tab or space characters. The following key-
words are recognized:
comment
Uses the assigned value as the comment character. The
default value is the # (number sign). Anything on a
line that follows the comment character is ignored.
sep
Uses the assigned value as the field separator character.
The default value is a : (colon). Tabs or spaces
can surround fields or separators.
ignore
Uses the assigned value as the ignore character indica-
tor. The default value is the @ (at sign). A charac-
ter marked with the ignore indicator is ignored for
collation purposes.
repeat
Uses the assigned value as the equivalence class indi-
cator. The default value is the ^ (circumflex) charac-
ter. A character marked with the equivalence class
indicator has the same primary collation value as the
preceding character.
trans
Uses the assigned character as the translate indicator.
The default value is the | (vertical bar). A collation
element marked with the translate indicator is
translated to the collation element(s) following the
indicator. For example, to treat the German eszett (ß)
element as the two characters ss, the first field of
the line would be:
\337|ss:
The unique collation weight is used in regular expres-
sions (see grep). Characters being translated cannot
be followed by an equivalence class indicator. The subject
character cannot be contained in its own substitution
collation element(s); for example, o|oe is not allowed. The translation
mechanism completes in one pass: none of the characters
in the substitution collation elements can in turn be
the subject of further translation, so the following
example is illegal:
q|r:
x|pq:
Characters being translated have no primary collating
weight of their own, but have a unique collation
weight, which is based on the position of their record
in the input file.
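The two restrictions on translations stated above (a subject may not appear in its own substitution, and no part of a substitution may itself be translated) can be checked with a short Python sketch. The function name find_translation_errors and the dictionary representation are hypothetical; the ctab command's own checks and messages may differ.

```python
def find_translation_errors(translations):
    """Return a list of rule violations (hypothetical helper).

    translations maps each translated subject element to the
    substitution text that follows the translate indicator.
    """
    errors = []
    for subject, substitution in translations.items():
        # Rule 1: the subject may not occur in its own substitution.
        if subject in substitution:
            errors.append(f"{subject!r} appears in its own substitution")
        # Rule 2: translation is one pass, so nothing in the
        # substitution may be the subject of another translation.
        for other in translations:
            if other != subject and other in substitution:
                errors.append(
                    f"{subject!r} expands to {other!r}, which is itself translated"
                )
    return errors
```

For example, a table holding only the eszett translation to ss passes both checks, while the pair q|r and x|pq is reported once, because q occurs in the substitution for x.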
EXAMPLES
The following line is interpreted as a field containing a
backslash and a colon followed by a field separator:
\\\::
Here are the first three and the last three lines of a
sample C.ctab file:
\000:
\001:
\002:
}:
~:
\177::c
FILES
/usr/lib/nls/loc/<locale>
Binary character classification, case conversion and
collating output file for locale <locale>.
/etc/nls/loc/<locale>
Binary locale classification, case conversion and col-
lating output file. This is only used as a default
during single-user mode operation.
RELATED INFORMATION
Commands: ctab(1)
Functions: setlocale(3)
"Using Internationalization Features" in the OSF/1 User's
Guide