Discussion:
On changing rand(3) to random(3) in awk(1)
Chenguang Li
2014-08-28 06:21:10 UTC
Permalink
Below is a summary of my previous posts to freebsd-questions:

Since the original rand(3) could not provide a fair, "one-shot" randomness in awk(1),
I am writing this to suggest that we change *rand(3) to *random(3), which only requires
modifying a few lines of code[1], but will result in better random number generation.
BTW, OSX & gawk already have this. Previous discussion can be found here[2].

What do you think?

[1] My quick patch: https://gist.github.com/horus/03c3d7c17703536033c1
[2] https://lists.freebsd.org/pipermail/freebsd-questions/2014-August/260534.html

Chenguang Li
Vitaly Magerya
2014-08-28 09:51:52 UTC
Permalink
Post by Chenguang Li
Since the original rand(3) could not provide a fair, "one-shot" randomness in awk(1),
I am writing this to suggest that we change *rand(3) to *random(3), which only requires
modifying a few lines of code[1], but will result in better random number generation.
BTW, OSX & gawk already have this. Previous discussion can be found here[2].
What do you think?
I think this is a useful change; in particular, srandom(3) seems to do
a much better job of coping with sequential seeds than srand(3), which
solves a big problem with using 'srand' in our awk.
Post by Chenguang Li
Index: main.c
===================================================================
--- main.c (revision 270740)
+++ main.c (working copy)
@@ -74,7 +74,7 @@
signal(SIGFPE, fpecatch);
srand_seed = 1;
- srand(srand_seed);
+ srandom(srand_seed);
yyin = NULL;
symtab = makesymtab(NSYMTAB/NSYMTAB);
Index: run.c
===================================================================
--- run.c (revision 270740)
+++ run.c (working copy)
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
You should not use RAND_MAX with random(3), since it returns values
between 0 and 0x7fffffff (inclusive); RAND_MAX only applies to rand(3).
Post by Chenguang Li
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, 2147483647] */
+ u = (Awkfloat) random() / 2147483648;
rand random number on (0,1)
rand random number on [0,1)
break;
if (isrec(x)) /* no argument provided */
@@ -1530,7 +1530,7 @@
else
u = getfval(x);
tmp = u;
- srand((unsigned int) u);
+ srandom((unsigned int) u);
You should probably use 'unsigned long' here.
Post by Chenguang Li
u = srand_seed;
srand_seed = tmp;
break;
Otherwise, the patch looks fine.
Peter Pentchev
2014-08-28 13:15:27 UTC
Permalink
[snip]
Post by Vitaly Magerya
Post by Chenguang Li
Index: run.c
===================================================================
--- run.c (revision 270740)
+++ run.c (working copy)
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
You should not use RAND_MAX with random(3), since it returns values
between 0 and 0x7fffffff (inclusive); RAND_MAX only applies to rand(3).
Post by Chenguang Li
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, 2147483647] */
+ u = (Awkfloat) random() / 2147483648;
Hm. Since random() is documented to return "long", wouldn't it be a bit
better with #include <limits.h> and LONG_MAX here?

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@FreeBSD.org ***@storpool.com
PGP key: http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Chenguang Li
2014-08-28 13:28:48 UTC
Permalink
Post by Peter Pentchev
Post by Vitaly Magerya
Post by Peter Pentchev
[snip]
Index: run.c
===================================================================
--- run.c (revision 270740)
+++ run.c (working copy)
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
You should not use RAND_MAX with random(3), since it returns values
between 0 and 0x7fffffff (inclusive); RAND_MAX only applies to rand(3).
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, 2147483647] */
+ u = (Awkfloat) random() / 2147483648;
Hm. Since random() is documented to return "long", wouldn't it be a bit
better with #include <limits.h> and LONG_MAX here?
I am using RANDOM_MAX here to maintain the original code structure.
I've updated my patch, please check the gist. Thanks!

Chenguang Li
Peter Pentchev
2014-08-28 13:38:46 UTC
Permalink
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Post by Peter Pentchev
Post by Vitaly Magerya
Post by Peter Pentchev
[snip]
Index: run.c
===================================================================
--- run.c (revision 270740)
+++ run.c (working copy)
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
You should not use RAND_MAX with random(3), since it returns values
between 0 and 0x7fffffff (inclusive); RAND_MAX only applies to rand(3).
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, 2147483647] */
+ u = (Awkfloat) random() / 2147483648;
Hm. Since random() is documented to return "long", wouldn't it be a bit
better with #include <limits.h> and LONG_MAX here?
I am using RANDOM_MAX here to maintain the original code structure.
I've updated my patch, please check the gist. Thanks!
Right, but you've hard-coded RANDOM_MAX to 2^32-1. I don't think this
is correct; I do believe that you should use LONG_MAX for this value.
Of course, #define RANDOM_MAX LONG_MAX could be fine for this purpose
here, but the point is to use LONG_MAX instead of 2^32-1.

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@FreeBSD.org ***@storpool.com
PGP key: http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Chenguang Li
2014-08-28 13:57:24 UTC
Permalink
Post by Peter Pentchev
... ...
[omitted]
Right, but you've hard-coded RANDOM_MAX to 2^32-1. I don't think this
is correct; I do believe that you should use LONG_MAX for this value.
Of course, #define RANDOM_MAX LONG_MAX could be fine for this purpose
here, but the point is to use LONG_MAX instead of 2^32-1.
I got your point, updated again. :)

Chenguang Li
Chenguang Li
2014-08-28 14:17:55 UTC
Permalink
Post by Chenguang Li
Post by Peter Pentchev
... ...
[omitted]
Right, but you've hard-coded RANDOM_MAX to 2^32-1. I don't think this
is correct; I do believe that you should use LONG_MAX for this value.
Of course, #define RANDOM_MAX LONG_MAX could be fine for this purpose
here, but the point is to use LONG_MAX instead of 2^32-1.
I got your point, updated again. :)
Wait a minute, isn't LONG_MAX architecture-dependent? It's 9223372036854775807
on my 64-bit FreeBSD box. The range of generated random numbers is explicitly
documented. So I am afraid I should hard-code 2^32-1 in the file.

Chenguang Li
Peter Pentchev
2014-08-28 14:21:47 UTC
Permalink
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
Post by Chenguang Li
Post by Peter Pentchev
... ...
[omitted]
Right, but you've hard-coded RANDOM_MAX to 2^32-1. I don't think this
is correct; I do believe that you should use LONG_MAX for this value.
Of course, #define RANDOM_MAX LONG_MAX could be fine for this purpose
here, but the point is to use LONG_MAX instead of 2^32-1.
I got your point, updated again. :)
Wait a minute, isn't LONG_MAX architecture-dependent? It's 9223372036854775807
on my 64-bit FreeBSD box. The range of generated random numbers is explicitly
documented. So I am afraid I should hard-code 2^32-1 in the file.
Oof.

Sorry.

My bad.

I missed the *second* sentence of the random(3) manual page, didn't I.
Specifically looked it over, looked for limits, checked the bottom for
various sections that may document constants and stuff... and missed the
second sentence.

Of course you're right, it's documented as 2^32-1. Though now I wonder
whether stdlib.h shouldn't provide some kind of RANDOM_MAX constant for
this purpose, instead of programs having to do their own hardcoding.

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@FreeBSD.org ***@storpool.com
PGP key: http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Chenguang Li
2014-08-28 14:33:43 UTC
Permalink
Post by Peter Pentchev
... ...
[omitted]
Wait a minute, isn't LONG_MAX architecture-dependent? It's 9223372036854775807
on my 64-bit FreeBSD box. The range of generated random numbers is explicitly
documented. So I am afraid I should hard-code 2^32-1 in the file.
...
I missed the *second* sentence of the random(3) manual page, didn't I.
Specifically looked it over, looked for limits, checked the bottom for
various sections that may document constants and stuff... and missed the
second sentence.
That's totally fine. :)
Post by Peter Pentchev
Of course you're right, it's documented as 2^32-1. Though now I wonder
whether stdlib.h shouldn't provide some kind of RANDOM_MAX constant for
this purpose, instead of programs having to do their own hardcoding.
It would be nice to have it as a constant, as RAND_MAX for rand(3).

Chenguang Li
Lowell Gilbert
2014-08-28 15:05:10 UTC
Permalink
Post by Chenguang Li
Post by Peter Pentchev
Of course you're right, it's documented as 2^32-1. Though now I wonder
whether stdlib.h shouldn't provide some kind of RANDOM_MAX constant for
this purpose, instead of programs having to do their own hardcoding.
It would be nice to have it as a constant, as RAND_MAX for rand(3).
This comes from the POSIX specification, which specifically calls out
that value even though it specifies the type as a long int. There's no
reason we can't add it to our headers, but it won't be portable.
Vitaly Magerya
2014-08-28 16:22:38 UTC
Permalink
Post by Chenguang Li
I am using RANDOM_MAX here to maintain the original code structure.
I've updated my patch, please check the gist. Thanks!
jmp_buf env;
extern int pairstack[];
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RANDOM_MAX) / RANDOM_MAX;
break;
if (isrec(x)) /* no argument provided */
OK, so this part is wrong (and it is wrong in the original sources
too): this construction generates 0.0 twice as often as it generates
other numbers -- both when random() returns 0 and when it returns
Post by Chenguang Li
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, RANDOM_MAX] */
+ u = (Awkfloat) random() / (RANDOM_MAX + 1ul);
break;
Note that the "ul" part in "1ul" is vitally important: otherwise there
will be an overflow on systems where long is 32 bit wide. Alternatively
"1.0" can be used here instead.

(Because this overflow may not be obvious in a casual reading, I'm not
a fan of defining RANDOM_MAX separately; I personally would insert the
actual number here directly).
Chenguang Li
2014-08-29 07:12:44 UTC
Permalink
Post by Vitaly Magerya
Post by Chenguang Li
I am using RANDOM_MAX here to maintain the original code structure.
I've updated my patch, please check the gist. Thanks!
jmp_buf env;
extern int pairstack[];
@@ -1522,7 +1522,7 @@
break;
/* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ u = (Awkfloat) (random() % RANDOM_MAX) / RANDOM_MAX;
break;
if (isrec(x)) /* no argument provided */
OK, so this part is wrong (and it is wrong in the original sources
too): this construction generates 0.0 twice as often as it generates
other numbers -- both when random() returns 0 and when it returns
Yes, I've noticed there's a fairness problem, 0 comes out more often. Since
the original code is buggy too, do we need to be compatible with it?
Post by Vitaly Magerya
Post by Chenguang Li
- /* in principle, rand() returns something in 0..RAND_MAX */
- u = (Awkfloat) (rand() % RAND_MAX) / RAND_MAX;
+ /* random() returns values in [0, RANDOM_MAX] */
+ u = (Awkfloat) random() / (RANDOM_MAX + 1ul);
break;
Note that the "ul" part in "1ul" is vitally important: otherwise there
will be an overflow on systems where long is 32 bit wide. Alternatively
"1.0" can be used here instead.
(Because this overflow may not be obvious in a casual reading, I'm not
a fan of defining RANDOM_MAX separately; I personally would insert the
actual number here directly).
Thanks for pointing that out, I've modified the gist as your suggestion.

Chenguang Li
Vitaly Magerya
2014-08-29 08:39:30 UTC
Permalink
Post by Chenguang Li
Post by Vitaly Magerya
OK, so this part is wrong (and it is wrong in the original sources
too): this construction generates 0.0 twice as often as it generates
other numbers -- both when random() returns 0 and when it returns
Yes, I've noticed there's a fairness problem, 0 comes out more often. Since
the original code is buggy too, do we need to be compatible with it?
I see no reason why we would.

Interestingly enough, GNU awk has this bug as well (and I am entirely
too lazy to report it).
Post by Chenguang Li
Thanks for pointing that out, I've modified the gist as your suggestion.
Looks good to me. You might want to open a bug report with that patch,
as it's not clear if any committer is reading this thread.
Chenguang Li
2014-08-29 08:55:57 UTC
Permalink
Post by Vitaly Magerya
... ...
Yes, I've noticed there's a fairness problem, 0 comes out more often. Since
the original code is buggy too, do we need to be compatible with it?
I see no reason why we would.
Interestingly enough, GNU awk has this bug as well (and I am entirely
too lazy to report it).
Yes, I came across that line a few days ago. OSX's awk also has this bug,
judging from their code[1]:

u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
Post by Vitaly Magerya
Thanks for pointing that out, I've modified the gist as your suggestion.
Looks good to me. You might want to open a bug report with that patch,
as it's not clear if any committer is reading this thread.
I guess Peter (roam@) is one of the committers?

[1] http://www.opensource.apple.com/source/awk/awk-18/src/run.c

Chenguang Li
Peter Pentchev
2014-08-29 09:12:10 UTC
Permalink
Post by Vitaly Magerya
Post by Chenguang Li
Thanks for pointing that out, I've modified the gist as your suggestion.
Looks good to me. You might want to open a bug report with that patch,
as it's not clear if any committer is reading this thread.
Er... not really, sorry :) A couple of years ago I gave my commit bit
in for safekeeping since other things ate up all my FreeBSD time, and
ever since then I've thought about returning, but not yet. I can't
really commit your changes, sorry.

G'luck,
Peter
--
Peter Pentchev ***@ringlet.net ***@FreeBSD.org ***@storpool.com
PGP key: http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
John Baldwin
2014-09-03 20:50:35 UTC
Permalink
Post by Chenguang Li
Post by Vitaly Magerya
... ...
Yes, I've noticed there's a fairness problem, 0 comes out more often. Since
the original code is buggy too, do we need to be compatible with it?
I see no reason why we would.
Interestingly enough, GNU awk has this bug as well (and I am entirely
too lazy to report it).
Yes, I came across that line a few days ago. OSX's awk also has this bug,
u = (Awkfloat) (random() % RAND_MAX) / RAND_MAX;
Post by Vitaly Magerya
Thanks for pointing that out, I've modified the gist as your suggestion.
Looks good to me. You might want to open a bug report with that patch,
as it's not clear if any committer is reading this thread.
Can you please open a bug report with the latest patch?

Xin Li (delphij@) imported the most recent awk version. Ideally this patch
would go upstream and not just be a local patch. Perhaps Xin can help with
that?
--
John Baldwin
Chenguang Li
2014-09-04 01:45:25 UTC
Permalink
[omitted]
Can you please open a bug report with the latest patch?
would go upstream and not just be a local patch. Perhaps Xin can help with
that?
Sorry for not keeping you guys updated, actually I did open a bug report[1] a
few days ago. I also wrote Brian Kernighan an email with details and a link to
I've read the thread on rand vs random; it's interesting.
It sounds like replacing rand with random would be a good
idea regardless, and probably it's good to replace the
RAND_MAX with an explicit 2^31+1.
I don't think that it would hurt to also use RAND_MAX+1UL
with rand() as an interim measure. You're right that zero
does come up twice as often as any other value. Probably
doesn't make much difference in practice, but better to do
things right when possible.
I'll add this to my list of things to try to fix up, which
is getting longer than it should, but I've been busy on
other things and keep pushing awk over the horizon.
So I guess this would eventually be patched upstream.

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=193147

Chenguang Li

Loading...