Thomas Munro
2018-09-06 13:55:11 UTC
Hello FreeBSD hackers,
An occasional problem for users of PostgreSQL (and probably of other
database-like systems) is that collation definitions change, and
on-disk indexes built under the old sort order become corrupted. This
was one motivation for PostgreSQL to adopt optional support for ICU,
and to track ucol_getVersion() and detect when it changes so that the
user can be warned that dependent indexes need to be rebuilt. However,
for various reasons many users prefer to use the OS collation support,
which remains the default, and PostgreSQL supports both ways.
I'd like to be able to track collation definition versions for libc
collations too. There doesn't currently seem to be a good way to do
that. Am I missing something?
Here's the idea I had:
1. Add a new option -V to localedef(1) so that an arbitrary version
string can be stored in some spare space in the header of LC_COLLATE
files.
2. Add a new libc function: const char *querylocaleversion(int mask,
locale_t locale).
3. Modify the Perl scripts under tools/tools/locale/tools/... to
invoke localedef(1) with a version that is either set by the
maintainer in unicode.conf (e.g. "30.0.3") or perhaps extracted
directly from the CLDR data files.
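
To make the shape of steps 1 and 2 a bit more concrete, here is a
minimal sketch. It is not the attached patch: the struct layout, the
field size and all of the sketch_* names are made up for illustration,
and the real LC_COLLATE header looks different. It only shows a
fixed-size version field being written (as localedef -V might do),
read back (as querylocaleversion() might do, here returning "" when no
version is present), and compared by a consumer such as a database.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size of the spare header space used for the version. */
#define SKETCH_VERSION_LEN 12

/* Hypothetical header; NOT the real FreeBSD LC_COLLATE layout. */
struct sketch_collate_header {
	char magic[12];				/* placeholder for existing fields */
	char version[SKETCH_VERSION_LEN];	/* NUL-padded, "" if unset */
};

/*
 * Roughly what "localedef -V <version>" would do: store the string,
 * refusing anything that does not fit with its terminating NUL.
 * Returns 0 on success, -1 if the string is too long.
 */
static int
sketch_set_version(struct sketch_collate_header *h, const char *version)
{
	size_t len;

	assert(h != NULL && version != NULL);
	len = strlen(version);
	if (len >= sizeof(h->version))
		return (-1);
	memset(h->version, 0, sizeof(h->version));
	memcpy(h->version, version, len);
	return (0);
}

/*
 * Stand-in for the lookup behind querylocaleversion().  This sketch
 * reports a missing or unterminated version as "" rather than NULL;
 * which of the two is better is still an open question.
 */
static const char *
sketch_get_version(const struct sketch_collate_header *h)
{
	assert(h != NULL);
	if (memchr(h->version, '\0', sizeof(h->version)) == NULL)
		return ("");
	return (h->version);
}

/* Consumer side: rebuild indexes if the recorded version differs. */
static int
sketch_needs_rebuild(const char *recorded, const char *current)
{
	return (strcmp(recorded, current) != 0);
}
```

A database would record the string returned at index creation time and
later call something like sketch_needs_rebuild(recorded, current) to
decide whether to warn the user.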
I've attached a proof-of-concept patch which has a very rough
implementation of steps 1 and 2. It probably needs better bounds
checking, more thought about how to report the lack of a version
string ("" or NULL?), and other details. Before doing any further work
on that I thought I'd check whether people think the idea has legs, or
know of an existing way to get this information.
I also considered less invasive approaches to detect collation
changes: using a checksum (i.e. the program needs to know how to find
the LC_COLLATE files), or using the FreeBSD version on the basis that
collations should only change when the base system is upgraded (which
would generate false positives, since most upgrades don't change any
collation). I don't really like those approaches much.
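
For comparison, the checksum approach could look something like the
sketch below. The sketch_* names are hypothetical and FNV-1a is just
an arbitrary cheap hash picked for the example. The caller has to
supply the path to the LC_COLLATE file itself (on a typical install I
would expect something under /usr/share/locale/, but that is an
assumption), which is exactly the knowledge I'd rather not bake into
applications.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit offset basis: the starting value for a new hash. */
#define SKETCH_FNV_OFFSET 0xcbf29ce484222325ULL

/* Fold len bytes of buf into a running FNV-1a 64-bit hash. */
static uint64_t
sketch_fnv1a(uint64_t hash, const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		hash ^= buf[i];
		hash *= 0x100000001b3ULL;	/* FNV 64-bit prime */
	}
	return (hash);
}

/*
 * Hash the whole file at path, e.g. a locale's LC_COLLATE file.
 * Returns 0 and stores the hash in *out on success, -1 on any error.
 */
static int
sketch_checksum_file(const char *path, uint64_t *out)
{
	unsigned char buf[4096];
	uint64_t hash = SKETCH_FNV_OFFSET;
	size_t n;
	FILE *f;
	int error;

	assert(path != NULL && out != NULL);
	if ((f = fopen(path, "rb")) == NULL)
		return (-1);
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
		hash = sketch_fnv1a(hash, buf, n);
	error = ferror(f);
	fclose(f);
	if (error)
		return (-1);
	*out = hash;
	return (0);
}
```

The application would store the hash next to each index and compare it
on startup. Note that any byte-level change to the file would trip
this, even one that doesn't affect the sort order.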
I'd be grateful for any feedback, flames etc.
Thanks,
Thomas Munro