Discussion:
Tracking CLDR version in collation definitions
Thomas Munro
2018-09-06 13:55:11 UTC
Hello FreeBSD hackers,

An occasional problem for users of PostgreSQL (and probably of other
database-like systems) is that collation definitions change and
on-disk indexes, which were built under the old sort order, become
corrupted. This was one motivation for PostgreSQL to adopt optional
support for ICU, and to track ucol_getVersion() and detect when it
changes so that the user can be warned that dependent indexes need to
be rebuilt. However, for various reasons many users prefer to use the
OS collation support, which remains the default, and PostgreSQL
supports both ways.
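
To illustrate the failure mode, here is a minimal sketch in C. It is not
PostgreSQL code: the "index" is just a sorted array of strings, and the
two comparators (cmp_old, cmp_new) are hypothetical stand-ins for a
collation definition before and after an update, with strcasecmp()
playing the part of the changed ordering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* "Old" collation: plain byte order (illustrative stand-in). */
int cmp_old(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* "New" collation: case-insensitive order (illustrative stand-in). */
int cmp_new(const void *a, const void *b)
{
    return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Binary-search a sorted "index" for key using the given comparator.
 * Returns 1 if the key is found, 0 otherwise.
 */
int index_lookup(const char *key, const char **index, size_t n,
                 int (*cmp)(const void *, const void *))
{
    assert(index != NULL && cmp != NULL);
    return bsearch(&key, index, n, sizeof(index[0]), cmp) != NULL;
}
```

If the array { "Zebra", "apple", "banana", "cherry" } is sorted with
cmp_old and then searched for "Zebra" with cmp_new, the binary search
walks to the wrong end of the array and misses an entry that is
present. That is the kind of silent breakage a version check is meant
to catch.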

I'd like to be able to track collation definition versions for libc
collations too. There doesn't currently seem to be a good way to do
that. Am I missing something?

Here's the idea I had:

1. Add a new option -V to localedef(1) so that an arbitrary version
string can be stored in some spare space in the header of LC_COLLATE
files.
2. Add a new libc function: const char *querylocaleversion(int mask,
locale_t locale).
3. Modify the Perl scripts under tools/tools/locale/tools/... to
invoke localedef(1) either with a version set by the maintainer in
unicode.conf (e.g. "30.0.3"), or perhaps extracted from CLDR data files
directly.
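
On the consuming side, a database could record the version string when
an index is built and compare it later. The following is a rough sketch
under stated assumptions, not part of the patch: check_collation_version
and the enum are hypothetical names, and the "current" argument stands
for whatever the proposed querylocaleversion() would return. Since it is
still open whether a missing version is "" or NULL, both are treated as
"unknown" here.

```c
#include <stddef.h>
#include <string.h>

/* Possible outcomes of comparing a recorded version with the current one. */
enum collation_check {
    COLLATION_UNCHANGED,
    COLLATION_CHANGED,
    COLLATION_UNKNOWN   /* one side has no version information */
};

/*
 * recorded: version string saved when the index was built.
 * current:  version string reported for the locale now (hypothetically,
 *           the result of the proposed querylocaleversion()).
 * NULL or "" on either side means no version is available.
 */
enum collation_check check_collation_version(const char *recorded,
                                             const char *current)
{
    if (recorded == NULL || current == NULL ||
        recorded[0] == '\0' || current[0] == '\0')
        return COLLATION_UNKNOWN;
    return strcmp(recorded, current) == 0 ? COLLATION_UNCHANGED
                                          : COLLATION_CHANGED;
}
```

A caller would warn that dependent indexes need rebuilding on
COLLATION_CHANGED, and would have to pick a policy for
COLLATION_UNKNOWN.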

I've attached a proof-of-concept patch which has a very rough
implementation of steps 1 and 2. It probably needs better bounds
checking, more thought about how to report the lack of a version
string ("" or NULL?), and other details. Before doing any further work
on that I thought I'd check if people think the idea has legs, or know
of an existing way to get this information.

I also considered less invasive approaches to detect collation
changes: using a checksum (i.e. the program needs to know how to find
the LC_COLLATE files), or using the FreeBSD version on the basis that
collations should only change when the base system is upgraded
(which generates false positives, since most upgrades don't touch the
collations). I don't really like those approaches much.
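
For comparison, the checksum approach could look roughly like the
sketch below. It is only an illustration: FNV-1a is an arbitrary choice
of hash, the function names are hypothetical, and the caller still has
to know where the LC_COLLATE file lives, which is the main drawback.

```c
#include <stdint.h>
#include <stdio.h>

#define FNV1A_INIT  0xcbf29ce484222325ULL  /* FNV-1a 64-bit offset basis */
#define FNV1A_PRIME 0x100000001b3ULL       /* FNV-1a 64-bit prime */

/* Fold len bytes of buf into the running FNV-1a hash h. */
uint64_t fnv1a_update(uint64_t h, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= FNV1A_PRIME;
    }
    return h;
}

/*
 * Checksum the whole file at path.
 * Returns 0 and stores the hash in *out on success, -1 on any I/O error.
 */
int checksum_file(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    uint64_t h = FNV1A_INIT;
    size_t n;
    int err;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        h = fnv1a_update(h, buf, n);
    err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    *out = h;
    return 0;
}
```

A program would call checksum_file() on the collation file for the
locale in use (on FreeBSD presumably something like
/usr/share/locale/en_US.UTF-8/LC_COLLATE, but that path is an
assumption here) and compare the result with a stored value.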

I'd be grateful for any feedback, flames etc.

Thanks,

Thomas Munro
yuripv
2018-09-08 14:22:22 UTC
Hi Thomas,

I think this makes perfect sense, yes, and I'm not aware of any other way of
getting the data version information.

There are some nits in the man page changes, but that can of course be taken
care of during review.

A bigger question is backwards compatibility as you seem to be changing the
on-disk format -- I can't think of anything bad happening off the top of my
head, just wondering if you had some ideas on it.



--
Sent from: http://freebsd.1045724.x6.nabble.com/freebsd-hackers-f4034256.html
Konstantin Belousov
2018-09-08 17:28:19 UTC
Post by yuripv
Hi Thomas,
I think this makes perfect sense, yes, and I'm not aware of any other way of
getting the data version information.
There are some nits in the man page changes, but that can of course be taken
care of during review.
At least, the new symbols must not go into the FBSD_1.3 namespace, but
into the current namespace of HEAD at the time of commit. I believe
it will be FBSD_1.6 when the patch has a chance to hit the tree.

I also think that a more specific name, indicating that this is a
FreeBSD-specific function, should be used.

I cannot usefully comment on the actual collate code changes.
Post by yuripv
A bigger question is backwards compatibility as you seem to be changing the
on-disk format -- I can't think of anything bad happening off the top of my
head, just wondering if you had some ideas on it.
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
Thomas Munro
2018-09-10 00:46:58 UTC
Post by Konstantin Belousov
Post by yuripv
Hi Thomas,
I think this makes perfect sense, yes, and I'm not aware of any other way of
getting the data version information.
There are some nits in the man page changes, but that can of course be taken
care of during review.
At least, the new symbols must not go into the FBSD_1.3 namespace, but
into the current namespace of HEAD at the time of commit. I believe
it will be FBSD_1.6 when the patch has a chance to hit the tree.
Thanks, I'll bear that in mind for the next revision.
Post by Konstantin Belousov
I also think that a more specific name, indicating that this is a
FreeBSD-specific function, should be used.
Yeah, I was wondering about that. This may sound a bit too ambitious,
but my plan is to produce a working version in a real operating
system, and then make a proposal to the Austin Group for a future POSIX
revision.

Thomas Munro
2018-09-10 00:39:59 UTC
Post by yuripv
Hi Thomas,
I think this makes perfect sense, yes, and I'm not aware of any other way of
getting the data version information.
Thanks for the feedback. This is reassuring. Yeah, I've looked around
a bit and didn't find anything either.
Post by yuripv
There are some nits in the man page changes, but that can of course be taken
care of during review.
Cool -- I will do some more work on this and post a differential.
Post by yuripv
A bigger question is backwards compatibility as you seem to be changing the
on-disk format -- I can't think of anything bad happening off the top of my
head, just wondering if you had some ideas on it.
The on-disk format I propose is deliberately forwards AND backwards
compatible. That region of the file is currently full of zeroes. If
you call the new function for an older LC_COLLATE file, you just get
an empty string. If you access a file that does have the new version
string using an older libc, it ignores that region of the file.
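
The compatibility argument can be sketched as follows. The layout here
is made up for illustration and is not the real LC_COLLATE header: the
two HYPO_* sizes and read_version() are hypothetical. The point is only
that a zero-filled region reads back as an empty string, and that the
reader bounds-checks and forces termination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HYPO_MAGIC_LEN   16  /* hypothetical: bytes before the version area */
#define HYPO_VERSION_LEN 32  /* hypothetical: size of the spare region */

/*
 * Copy the version string from a file image into out and return out.
 * A zero-filled region (older file) or a truncated file yields "".
 */
const char *read_version(const unsigned char *file, size_t file_len,
                         char out[HYPO_VERSION_LEN + 1])
{
    assert(file != NULL && out != NULL);
    out[0] = '\0';
    if (file_len < HYPO_MAGIC_LEN + HYPO_VERSION_LEN)
        return out;                   /* too short: report no version */
    memcpy(out, file + HYPO_MAGIC_LEN, HYPO_VERSION_LEN);
    out[HYPO_VERSION_LEN] = '\0';     /* terminate even if the region is full */
    return out;
}
```

An older reader that never looks at those bytes is unaffected by their
contents, which is the other half of the compatibility claim.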