Discussion:
Regular expression compilation fail in current
Fernando Apesteguía
2021-04-26 13:31:38 UTC
Hi there,

I'm working with this port PR
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255182

and the problem seems to boil down to a regular expression that
compiles on 12.2 but does not compile on current.

The minimum repro is this one:

#include <regex.h>
#include <stdio.h>

int
main()
{
    regex_t regexp;
    int ret = regcomp(&regexp, "\\s*", REG_EXTENDED | REG_ICASE | REG_NOSUB);
    if (ret != 0) {
        printf("regexp compilation failed: %d\n", ret);
    }

    return 0;
}

This one works in 12.2 but fails to compile the regexp in FreeBSD
14.0-CURRENT #11 main-n245984-15221c552b3c with error 5 REG_EESCAPE
`\' applied to unescapable character.

Any help is appreciated.

Thanks!
Mark Millard via freebsd-hackers
2021-04-27 03:14:32 UTC
Post by Fernando Apesteguía
Hi there,
I'm working with this port PR
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255182
and the problem seems to boil down to a regular expression that does
not compile on current but it does in 12.2.
#include <regex.h>
#include <stdio.h>
int
main()
{
regex_t regexp;
int ret = regcomp(&regexp, "\\s*", REG_EXTENDED | REG_ICASE |
REG_NOSUB);
Here is my stab at notes for this . . .

It is not all that uncommon for an error case to be
silently mishandled at first and then rejected by later
toolchains or libraries. I suspect that is
what is going on here. But the details seem to be
as follows.

Using C++11's raw string literal notation to show the
string content, "\\s*" is:

R"%(\s*)%"

In other words, the content of the string is just:

\s*

(3 characters, plus a terminating '\0').
It is this latter string content that the regcomp
2nd parameter points to and that leads to the
error report.

The "s" is not valid after the backslash for Basic
Regular Expressions or for Extended Regular Expressions.
( https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html )

REG_EESCAPE is described at:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html

as:

QUOTE
REG_EESCAPE
Trailing <backslash> character in pattern.
END QUOTE

In other words: an extra backslash not paired
with anything valid just after it --so it is
trailing whatever was before it.

If you meant the parameter received to point in
memory to:

\\s*

( 4 characters, plus a terminating '\0' after it,
a.k.a. R"%(\\s*)%" ) you likely want the C-string:

"\\\\s*"

as the argument, shown below:

regcomp(&regexp, "\\\\s*", REG_EXTENDED | REG_ICASE | REG_NOSUB)

If you meant some other character sequence in memory, I'd
have to know what it was to try to back-translate it to
C-source that would produce the correct content in the
memory pointed to.
Post by Fernando Apesteguía
if ( ret != 0) {
printf("regexp compilation failed: %d\n", ret);
}
return 0;
}
This one works in 12.2
It might not be rejected, but what does it do? And is that
conformant with:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

?
Post by Fernando Apesteguía
but fails to compile the regexp in FreeBSD
14.0-CURRENT #11 main-n245984-15221c552b3c with error 5 REG_EESCAPE
`\' applied to unescapable character.
Any help is appreciated.
Note: While I used C++11's notation as one way of
indicating string content, no C standard has the
notation to my knowledge.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Fernando Apesteguía
2021-04-27 14:05:42 UTC
Thanks for the explanation, Mark.