automatically merged
commit 9aa8ba55f0
@ -1,28 +1,31 @@
|
|||||||
This directory holds configuration files which allow MySQL to work with
|
This directory holds configuration files that enable MySQL to work with
|
||||||
different character sets. It contains:
|
different character sets. It contains:
|
||||||
|
|
||||||
*.conf
|
charset_name.xml
|
||||||
Each conf file contains four tables which describe character types,
|
Each charset_name.xml file contains information for a simple character
|
||||||
|
set. The information in the file describes character types,
|
||||||
lower- and upper-case equivalencies and sorting orders for the
|
lower- and upper-case equivalencies and sorting orders for the
|
||||||
character values in the set.
|
character values in the set.
|
||||||
|
|
||||||
Index
|
Index.xml
|
||||||
The Index file lists all of the available charset configurations.
|
The Index.xml file lists all of the available charset configurations,
|
||||||
|
including collations.
|
||||||
|
|
||||||
Each charset is paired with a number. The number is stored
|
Each collation must have a unique number. The number is stored
|
||||||
IN THE DATABASE TABLE FILES and must not be changed. Always
|
IN THE DATABASE TABLE FILES and must not be changed.
|
||||||
add new character sets to the end of the list, so that the
|
|
||||||
numbers of the other character sets will not be changed.
|
The max-id attribute of the <charsets> element must be set to
|
||||||
|
the largest collation number.
|
||||||
|
|
||||||
Compiled in or configuration file?
|
Compiled in or configuration file?
|
||||||
When should a character set be compiled in to MySQL's string library
|
When should a character set be compiled in to MySQL's string library
|
||||||
(libmystrings), and when should it be placed in a configuration
|
(libmystrings), and when should it be placed in a charset_name.xml
|
||||||
file?
|
configuration file?
|
||||||
|
|
||||||
If the character set requires the strcoll functions or is a
|
If the character set requires the strcoll functions or is a
|
||||||
multi-byte character set, it MUST be compiled in to the string
|
multi-byte character set, it MUST be compiled in to the string
|
||||||
library. If it does not require these functions, it should be
|
library. If it does not require these functions, it should be
|
||||||
placed in a configuration file.
|
placed in a charset_name.xml configuration file.
|
||||||
|
|
||||||
If the character set uses any one of the strcoll functions, it
|
If the character set uses any one of the strcoll functions, it
|
||||||
must define all of them. Likewise, if the set uses one of the
|
must define all of them. Likewise, if the set uses one of the
|
||||||
@ -30,11 +33,7 @@ Compiled in or configuration file?
|
|||||||
more information on how to add a complex character set to MySQL.
|
more information on how to add a complex character set to MySQL.
|
||||||
|
|
||||||
Syntax of configuration files
|
Syntax of configuration files
|
||||||
The syntax is very simple. Comments start with a '#' character and
|
The syntax is very simple. Words in <map> array elements are
|
||||||
proceed to the end of the line. Words are separated by arbitrary
|
separated by arbitrary amounts of whitespace. Each word must be a
|
||||||
amounts of whitespace.
|
number in hexadecimal format. The ctype array has 257 words; the
|
||||||
|
other arrays (lower, upper, etc.) take up 256 words each after that.
|
||||||
For the character set configuration files, every word must be a
|
|
||||||
number in hexadecimal format. The ctype array takes up the first
|
|
||||||
257 words; the to_lower, to_upper and sort_order arrays take up 256
|
|
||||||
words each after that.
|
|
||||||
|
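As a rough illustration of the word syntax described above, the following C sketch splits a buffer on whitespace and reads each word as a hexadecimal number. The name parse_hex_words and its signature are hypothetical and not part of MySQL. The sketch also assumes each word fits in one byte, and it does not parse the surrounding XML <map> elements.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not a MySQL function): reads up to max_words
   whitespace-separated hexadecimal words from text into out.
   Returns the number of words stored, or -1 if a word is not a valid
   hexadecimal number or does not fit in one byte. */
static int parse_hex_words(const char *text, unsigned char *out,
                           size_t max_words)
{
  size_t n = 0;
  const char *p = text;

  while (n < max_words)
  {
    char *end;
    unsigned long value;

    while (isspace((unsigned char) *p))   /* skip arbitrary whitespace */
      p++;
    if (*p == '\0')
      break;                              /* no more words */

    value = strtoul(p, &end, 16);
    if (end == p || value > 0xFF)
      return -1;                          /* not hex, or too large */
    if (*end != '\0' && !isspace((unsigned char) *end))
      return -1;                          /* junk attached to the word */

    out[n++] = (unsigned char) value;
    p = end;
  }
  return (int) n;
}
```

A caller would pass the text content of one array and a buffer of 257 bytes for ctype or 256 bytes for the other arrays, then check that the returned count matches the expected size.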
@ -3,9 +3,8 @@ CHARSET_INFO
|
|||||||
============
|
============
|
||||||
A structure containing data for charset+collation pair implementation.
|
A structure containing data for charset+collation pair implementation.
|
||||||
|
|
||||||
Virtual functions which use this data are collected
|
Virtual functions that use this data are collected into separate
|
||||||
into separate structures MY_CHARSET_HANDLER and
|
structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.
|
||||||
MY_COLLATION_HANDLER.
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct charset_info_st
|
typedef struct charset_info_st
|
||||||
@ -56,7 +55,7 @@ character set. Not really used now. Intended to optimize some
|
|||||||
parts of the code where we need to find the default collation
|
parts of the code where we need to find the default collation
|
||||||
using its non-default counterpart for the given character set.
|
using its non-default counterpart for the given character set.
|
||||||
|
|
||||||
binary_numner - ID of a charset+collation pair, which consists
|
binary_number - ID of a charset+collation pair, which consists
|
||||||
of the same character set and the binary collation of this
|
of the same character set and the binary collation of this
|
||||||
character set. Not really used now.
|
character set. Not really used now.
|
||||||
|
|
||||||
@ -65,15 +64,15 @@ Names
|
|||||||
|
|
||||||
csname - name of the character set for this charset+collation pair.
|
csname - name of the character set for this charset+collation pair.
|
||||||
name - name of the collation for this charset+collation pair.
|
name - name of the collation for this charset+collation pair.
|
||||||
comment - a text comment, dysplayed in "Description" column of
|
comment - a text comment, displayed in "Description" column of
|
||||||
SHOW CHARACTER SET output.
|
SHOW CHARACTER SET output.
|
||||||
|
|
||||||
Conversion tables
|
Conversion tables
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
ctype - pointer to array[257] of "type of characters"
|
ctype - pointer to array[257] of "type of characters"
|
||||||
bit mask for each chatacter, e.g. if a
|
bit mask for each character, e.g., whether a
|
||||||
character is a digit or a letter or a separator, etc.
|
character is a digit, letter, separator, etc.
|
||||||
|
|
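To make the bit-mask idea concrete, here is a minimal C sketch of a 257-entry ctype table that is looked up with the +1 offset this file mentions for the macros (ctype[(char)+1]). The flag names and the sk_ prefix are hypothetical; the real masks and macros live in m_ctype.h. The sketch assumes an ASCII-compatible execution character set.

```c
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
enum { SK_UPPER = 1, SK_LOWER = 2, SK_DIGIT = 4, SK_SPACE = 8 };

/* 257 entries: slot 0 is an extra leading entry, slots 1..256
   describe byte values 0..255. */
static unsigned char sk_ctype[257];

/* Fills the table for the ASCII range; all other bytes stay 0. */
static void sk_init_ctype(void)
{
  int c;

  for (c = 0; c < 257; c++)
    sk_ctype[c] = 0;
  for (c = 'A'; c <= 'Z'; c++)
    sk_ctype[c + 1] |= SK_UPPER;
  for (c = 'a'; c <= 'z'; c++)
    sk_ctype[c + 1] |= SK_LOWER;
  for (c = '0'; c <= '9'; c++)
    sk_ctype[c + 1] |= SK_DIGIT;
  sk_ctype[' ' + 1] |= SK_SPACE;
}

/* Lookups add 1 to the byte value before indexing. */
static int sk_isdigit(unsigned char c)
{
  return (sk_ctype[c + 1] & SK_DIGIT) != 0;
}

static int sk_isalpha(unsigned char c)
{
  return (sk_ctype[c + 1] & (SK_UPPER | SK_LOWER)) != 0;
}
```

One byte per character is enough here because each property is a single bit, so several properties can be tested with one table read.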
||||||
Monty 2004-10-21:
|
Monty 2004-10-21:
|
||||||
If you look at the macros, we use ctype[(char)+1].
|
If you look at the macros, we use ctype[(char)+1].
|
||||||
@ -87,17 +86,64 @@ Conversion tables
|
|||||||
to_upper - pointer to array[256] used in UCASE()
|
to_upper - pointer to array[256] used in UCASE()
|
||||||
sort_order - pointer to array[256] used for strings comparison
|
sort_order - pointer to array[256] used for strings comparison
|
||||||
|
|
||||||
|
In all Asian charsets these arrays are set up as follows:
|
||||||
|
|
||||||
|
- All bytes in the range 0x80..0xFF are marked as letters in the
|
||||||
|
ctype array.
|
||||||
|
|
||||||
|
- The to_lower and to_upper arrays map only ASCII letters.
|
||||||
|
UPPER() and LOWER() don't really work for multi-byte characters.
|
||||||
|
Most of the characters in Asian character sets are ideograms
|
||||||
|
anyway and they don't have case mapping. However, there are
|
||||||
|
still some characters from European alphabets.
|
||||||
|
For example:
|
||||||
|
_ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
|
||||||
|
_ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE
|
||||||
|
|
||||||
|
But they don't map to each other with UPPER and LOWER operations.
|
||||||
|
|
||||||
|
- The sort_order array is filled case insensitively for the
|
||||||
|
ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
|
||||||
|
range 0x80..0xFF for these collations:
|
||||||
|
|
||||||
|
cp932_japanese_ci,
|
||||||
|
euckr_korean_ci,
|
||||||
|
eucjpms_japanese_ci,
|
||||||
|
gb2312_chinese_ci,
|
||||||
|
sjis_japanese_ci,
|
||||||
|
ujis_japanese_ci.
|
||||||
|
|
||||||
|
So multi-byte characters are sorted just according to their codes.
|
||||||
|
|
||||||
|
|
||||||
|
- Two collations are still case insensitive for the ASCII characters,
|
||||||
|
but have special sorting order for multi-byte characters
|
||||||
|
(something more complex than just according to codes):
|
||||||
|
|
||||||
|
big5_chinese_ci
|
||||||
|
gbk_chinese_ci
|
||||||
|
|
||||||
|
So handlers for these collations use only the 0x00..0x7F part
|
||||||
|
of their sort_order arrays, and apply the special functions
|
||||||
|
for multi-byte characters
|
||||||
|
|
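The sort_order layout described above for collations such as ujis_japanese_ci can be sketched as follows: ASCII letters share one weight per letter, and bytes in 0x80..0xFF keep their own value. This is only an illustration. The sk_ names are hypothetical, folding to uppercase is an assumption, and the real handlers also deal with details such as trailing spaces that are ignored here.

```c
#include <stddef.h>

static unsigned char sk_sort_order[256];

/* Case-insensitive weights for ASCII letters, identity ("binary")
   weights for every other byte, including 0x80..0xFF. */
static void sk_init_sort_order(void)
{
  int c;

  for (c = 0; c < 256; c++)
    sk_sort_order[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    sk_sort_order[c] = (unsigned char) (c - 'a' + 'A');
}

/* Compares two byte strings by per-byte weight.
   Returns -1, 0 or 1. */
static int sk_compare(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen)
{
  size_t i;
  size_t n = alen < blen ? alen : blen;

  for (i = 0; i < n; i++)
  {
    int wa = sk_sort_order[a[i]];
    int wb = sk_sort_order[b[i]];
    if (wa != wb)
      return wa < wb ? -1 : 1;
  }
  if (alen == blen)
    return 0;
  return alen < blen ? -1 : 1;
}
```

With this table "abc" and "ABC" compare equal, while the two ujis sequences 0x8FAAF2 and 0x8FABF2 from the example above compare by byte value and therefore differ.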
||||||
|
In Unicode character sets we have full support for UPPER/LOWER mapping,
|
||||||
|
sorting order, and character type detection.
|
||||||
|
"utf8_general_ci" still has the "old-fashioned" arrays
|
||||||
|
like to_upper, to_lower, sort_order and ctype, but they are
|
||||||
|
not really used (maybe only in some rare legacy functions).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Unicode conversion data
|
Unicode conversion data
|
||||||
-----------------------
|
-----------------------
|
||||||
For 8bit character sets:
|
For 8-bit character sets:
|
||||||
|
|
||||||
tab_to_uni : array[256] of charset->Unicode translation
|
tab_to_uni : array[256] of charset->Unicode translation
|
||||||
tab_from_uni: a structure for Unicode->charset translation
|
tab_from_uni: a structure for Unicode->charset translation
|
||||||
|
|
||||||
Non-8 bit charsets have their own structures per charset
|
Non-8-bit charsets have their own structures per charset
|
||||||
hidden in correspondent ctype-xxx.c file and don't use
|
hidden in corresponding ctype-xxx.c file and don't use
|
||||||
tab_to_uni and tab_from_uni tables.
|
tab_to_uni and tab_from_uni tables.
|
||||||
|
|
||||||
|
|
||||||
@ -106,9 +152,9 @@ Parser maps
|
|||||||
state_map[]
|
state_map[]
|
||||||
ident_map[]
|
ident_map[]
|
||||||
|
|
||||||
These maps are to quickly identify if a character is
|
These maps are used to quickly identify whether a character is an
|
||||||
an identificator part, a digit, a special character,
|
identifier part, a digit, a special character, or a part of another
|
||||||
or a part of other SQL language lexical item.
|
SQL language lexical item.
|
||||||
|
|
||||||
Probably can be combined with ctype array in the future.
|
Probably can be combined with ctype array in the future.
|
||||||
But for some reasons these two arrays are used in the parser,
|
But for some reasons these two arrays are used in the parser,
|
||||||
@ -116,32 +162,32 @@ while a separate ctype[] array is used in the other part of the
|
|||||||
code, like fulltext, etc.
|
code, like fulltext, etc.
|
||||||
|
|
||||||
|
|
||||||
Misc fields
|
Miscellaneous fields
|
||||||
-----------
|
--------------------
|
||||||
|
|
||||||
strxfrm_multiply - how many times a sort key (i.e. a string
|
strxfrm_multiply - how many times a sort key (that is, a string
|
||||||
which can be passed into memcmp() for comparison)
|
that can be passed into memcmp() for comparison)
|
||||||
can be longer than the original string.
|
can be longer than the original string.
|
||||||
Usually it is 1. For some complex
|
Usually it is 1. For some complex
|
||||||
collations it can be bigger. For example
|
collations it can be bigger. For example,
|
||||||
in latin1_german2_ci, a sort key is up to
|
in latin1_german2_ci, a sort key is up to
|
||||||
twice longer than the original string.
|
two times longer than the original string.
|
||||||
e.g. Letter 'A' with two dots above is
|
e.g. Letter 'A' with two dots above is
|
||||||
substituted with 'AE'.
|
substituted with 'AE'.
|
||||||
mbminlen - mininum multibyte sequence length.
|
mbminlen - minimum multi-byte sequence length.
|
||||||
Now always 1 except ucs2. For ucs2
|
Now always 1 except for ucs2. For ucs2,
|
||||||
it is 2.
|
it is 2.
|
||||||
mbmaxlen - maximum multibyte sequence length.
|
mbmaxlen - maximum multi-byte sequence length.
|
||||||
1 for 8bit charsets. Can also be 2 or 3.
|
1 for 8-bit charsets. Can also be 2 or 3.
|
||||||
|
|
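The role of strxfrm_multiply described above can be illustrated with a toy sort-key builder. In this sketch the only expansion is the latin1 byte 0xC4 ('A' with two dots above) becoming "AE", so the key is never more than twice as long as the input. The function name and the constant are hypothetical, and the real latin1_german2_ci rules cover more characters than this.

```c
#include <stddef.h>

/* Hypothetical worst-case growth factor for this toy collation. */
#define SK_STRXFRM_MULTIPLY 2

/* Writes a sort key for src into dst and returns the key length.
   dst must have room for srclen * SK_STRXFRM_MULTIPLY bytes. */
static size_t sk_make_sort_key(unsigned char *dst,
                               const unsigned char *src, size_t srclen)
{
  size_t i;
  size_t n = 0;

  for (i = 0; i < srclen; i++)
  {
    if (src[i] == 0xC4)        /* latin1 'A' with two dots above */
    {
      dst[n++] = 'A';          /* one input byte becomes two key bytes */
      dst[n++] = 'E';
    }
    else
      dst[n++] = src[i];
  }
  return n;
}
```

The caller sizes the destination buffer from the multiplier rather than from the actual contents, which is why the field records the worst case.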
||||||
max_sort_char - for LIKE range
|
max_sort_char - for LIKE range
|
||||||
in case of 8bit character sets - native code
|
in case of 8-bit character sets - native code
|
||||||
of maximum character (max_str pad byte);
|
of maximum character (max_str pad byte);
|
||||||
in case of UTF8 and UCS2 - Unicode code of the maximum
|
in case of UTF8 and UCS2 - Unicode code of the maximum
|
||||||
possible character (usually U+FFFF). This code is
|
possible character (usually U+FFFF). This code is
|
||||||
converted to multibyte representation (usually 0xEFBFBF)
|
converted to multi-byte representation (usually 0xEFBFBF)
|
||||||
and then used as a pad sequence for max_str.
|
and then used as a pad sequence for max_str.
|
||||||
in case of other multibyte character sets -
|
in case of other multi-byte character sets -
|
||||||
max_str pad byte (usually 0xFF).
|
max_str pad byte (usually 0xFF).
|
||||||
|
|
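The relationship between U+FFFF and the pad sequence 0xEFBFBF mentioned above is plain UTF-8 encoding. The sketch below encodes a code point of up to U+FFFF and repeats the result to fill a buffer, roughly as a max_str pad would be built. Both function names are hypothetical, and surrogate code points are not rejected.

```c
#include <stddef.h>
#include <string.h>

/* Encodes a code point in U+0000..U+FFFF as UTF-8 (1 to 3 bytes).
   Returns the number of bytes written, or 0 if wc is out of range. */
static size_t sk_utf8_encode_bmp(unsigned long wc, unsigned char *out)
{
  if (wc < 0x80)
  {
    out[0] = (unsigned char) wc;
    return 1;
  }
  if (wc < 0x800)
  {
    out[0] = (unsigned char) (0xC0 | (wc >> 6));
    out[1] = (unsigned char) (0x80 | (wc & 0x3F));
    return 2;
  }
  if (wc <= 0xFFFF)
  {
    out[0] = (unsigned char) (0xE0 | (wc >> 12));
    out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (wc & 0x3F));
    return 3;
  }
  return 0;
}

/* Fills buf with as many whole copies of the encoded max_sort_char
   as fit. Returns the number of bytes written. */
static size_t sk_pad_max_str(unsigned char *buf, size_t buflen,
                             unsigned long max_sort_char)
{
  unsigned char seq[3];
  size_t seqlen = sk_utf8_encode_bmp(max_sort_char, seq);
  size_t used = 0;

  if (seqlen == 0)
    return 0;
  while (used + seqlen <= buflen)
  {
    memcpy(buf + used, seq, seqlen);
    used += seqlen;
  }
  return used;
}
```

Padding in whole sequences keeps the buffer well formed, which is the reason the code point is converted first instead of filling with a single byte.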
||||||
MY_CHARSET_HANDLER
|
MY_CHARSET_HANDLER
|
||||||
@ -151,10 +197,10 @@ MY_CHARSET_HANDLER is a collection of character-set
|
|||||||
related routines. Defined in m_ctype.h. It has the
|
related routines. Defined in m_ctype.h. It has the
|
||||||
following set of functions:
|
following set of functions:
|
||||||
|
|
||||||
Multibyte routines
|
Multi-byte routines
|
||||||
------------------
|
------------------
|
||||||
ismbchar() - detects if the given string is a multibyte sequence
|
ismbchar() - detects whether the given string is a multi-byte sequence
|
||||||
mbcharlen() - returns length of multibyte sequence starting with
|
mbcharlen() - returns length of multi-byte sequence starting with
|
||||||
the given character
|
the given character
|
||||||
numchars() - returns number of characters in the given string, e.g.
|
numchars() - returns number of characters in the given string, e.g.
|
||||||
in SQL function CHAR_LENGTH().
|
in SQL function CHAR_LENGTH().
|
||||||
@ -163,29 +209,29 @@ charpos() - calculates the offset of the given position in the string.
|
|||||||
INSERT()
|
INSERT()
|
||||||
|
|
||||||
well_formed_length()
|
well_formed_length()
|
||||||
- finds the length of correctly formed multybyte beginning.
|
- finds the length of correctly formed multi-byte beginning.
|
||||||
Used in INSERTs to cut a beginning of the given string
|
Used in INSERTs to cut a beginning of the given string
|
||||||
which is
|
which is
|
||||||
a) "well formed" according to the given character set.
|
a) "well formed" according to the given character set.
|
||||||
b) can fit into the given data type
|
b) can fit into the given data type
|
||||||
Terminates the string in the good position, taking into account
|
Terminates the string in the good position, taking into account
|
||||||
multibyte character boundaries.
|
multi-byte character boundaries.
|
||||||
|
|
||||||
lengthsp() - returns the length of the given string without traling spaces.
|
lengthsp() - returns the length of the given string without trailing spaces.
|
||||||
|
|
||||||
|
|
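Two of the routines listed above are easy to sketch in simplified form: lengthsp() for an 8-bit character set, and numchars() for UTF-8. These are not the MySQL implementations. The sk_ names are hypothetical, the space character is assumed to be 0x20, and the UTF-8 counter assumes well-formed input.

```c
#include <stddef.h>

/* Length of s without trailing 0x20 bytes (8-bit character sets). */
static size_t sk_lengthsp(const char *s, size_t len)
{
  while (len > 0 && s[len - 1] == ' ')
    len--;
  return len;
}

/* Number of characters in a well-formed UTF-8 string: every byte
   that is not a continuation byte (10xxxxxx) starts a character. */
static size_t sk_numchars_utf8(const char *s, size_t len)
{
  size_t i;
  size_t n = 0;

  for (i = 0; i < len; i++)
  {
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      n++;
  }
  return n;
}
```

Both functions take an explicit length, so they work on strings that contain zero bytes or are not 0-terminated.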
||||||
Unicode conversion routines
|
Unicode conversion routines
|
||||||
---------------------------
|
---------------------------
|
||||||
mb_wc - converts the left multibyte sequence into it Unicode code.
|
mb_wc - converts the left multi-byte sequence into its Unicode code.
|
||||||
wc_mb - converts the given Unicode code into multibyte sequence.
|
wc_mb - converts the given Unicode code into multi-byte sequence.
|
||||||
|
|
||||||
|
|
||||||
Case and sort convertion
|
Case and sort conversion
|
||||||
------------------------
|
------------------------
|
||||||
caseup_str - converts the given 0-terminated string into the upper case
|
caseup_str - converts the given 0-terminated string to uppercase
|
||||||
casedn_str - converts the given 0-terminated string into the lower case
|
casedn_str - converts the given 0-terminated string to lowercase
|
||||||
caseup - converts the given string into the upper case using length
|
caseup - converts the given string to uppercase using length
|
||||||
casedn - converts the given string into the lower case using length
|
casedn - converts the given string to lowercase using length
|
||||||
|
|
||||||
Number-to-string conversion routines
|
Number-to-string conversion routines
|
||||||
------------------------------------
|
------------------------------------
|
||||||
@ -193,7 +239,7 @@ snprintf()
|
|||||||
long10_to_str()
|
long10_to_str()
|
||||||
longlong10_to_str()
|
longlong10_to_str()
|
||||||
|
|
||||||
The names are pretty self-descripting.
|
The names are pretty self-describing.
|
||||||
|
|
||||||
String padding routines
|
String padding routines
|
||||||
-----------------------
|
-----------------------
|
||||||
@ -201,7 +247,7 @@ fill() - writes the given Unicode value into the given string
|
|||||||
with the given length. Used to pad the string, usually
|
with the given length. Used to pad the string, usually
|
||||||
with space character, according to the given charset.
|
with space character, according to the given charset.
|
||||||
|
|
||||||
String-to-numner conversion routines
|
String-to-number conversion routines
|
||||||
------------------------------------
|
------------------------------------
|
||||||
strntol()
|
strntol()
|
||||||
strntoul()
|
strntoul()
|
||||||
@ -209,10 +255,10 @@ strntoll()
|
|||||||
strntoull()
|
strntoull()
|
||||||
strntod()
|
strntod()
|
||||||
|
|
||||||
These functions are almost for the same thing with their
|
These functions are almost the same as their STDLIB counterparts,
|
||||||
STDLIB counterparts, but also:
|
but also:
|
||||||
- accept length instead of 0-terminator
|
- accept length instead of 0-terminator
|
||||||
- and are character set dependant
|
- are character set dependent
|
||||||
|
|
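The first difference listed above, taking a length instead of relying on a 0-terminator, can be sketched with a simplified decimal parser. The function below is hypothetical and much smaller than the real strntol(): it handles only base 10 and ASCII digits, does no overflow checking, and ignores the character set entirely.

```c
#include <stddef.h>

/* Parses an optionally signed decimal number from the first len
   bytes of s. On return *endptr (if not NULL) points just past the
   last byte consumed, or at s if no digits were found. */
static long sk_strntol10(const char *s, size_t len, const char **endptr)
{
  const char *p = s;
  const char *end = s + len;
  const char *digits;
  long value = 0;
  int negative = 0;

  while (p < end && (*p == ' ' || *p == '\t'))   /* leading blanks */
    p++;
  if (p < end && (*p == '-' || *p == '+'))
  {
    negative = (*p == '-');
    p++;
  }
  digits = p;
  while (p < end && *p >= '0' && *p <= '9')      /* never reads past end */
  {
    value = value * 10 + (*p - '0');
    p++;
  }
  if (p == digits)
    p = s;                                       /* nothing parsed */
  if (endptr)
    *endptr = p;
  return negative ? -value : value;
}
```

Because parsing stops at s + len, the function can read a number out of the middle of a larger buffer, for example the first two bytes of "1234", without copying or terminating it.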
||||||
Simple scanner routines
|
Simple scanner routines
|
||||||
-----------------------
|
-----------------------
|
||||||
@ -230,8 +276,8 @@ strnxfrm() - makes a sort key suitable for memcmp() corresponding
|
|||||||
like_range() - creates a LIKE range, for optimizer
|
like_range() - creates a LIKE range, for optimizer
|
||||||
wildcmp() - wildcard comparison, for LIKE
|
wildcmp() - wildcard comparison, for LIKE
|
||||||
strcasecmp() - 0-terminated string comparison
|
strcasecmp() - 0-terminated string comparison
|
||||||
instr() - finds the first substring appearence in the string
|
instr() - finds the first substring appearance in the string
|
||||||
hash_sort() - calculates hash value taking in account
|
hash_sort() - calculates hash value taking into account
|
||||||
the collation rules, e.g. case-insensitivity,
|
the collation rules, e.g. case-insensitivity,
|
||||||
accent sensitivity, etc.
|
accent sensitivity, etc.
|
||||||
|
|
||||||
|