Printing UTF-8 characters

Discussion:

Farhan Khan

2018-02-01 06:15:34 UTC

Hi everyone,

Is there a standard way to render historically non-printable UTF-8
characters that will work across all terminals? I am trying to modify a
standard FreeBSD utility that may occasionally work with characters in
other languages. On some terminals, specifically FreeBSD running in
VirtualBox, I see question marks rather than the expected character. Is
this the proper way to display such non-printable characters, or
not?

I am not the most versed in encoding standards, so pardon any mistakes I
might have made.

Thanks,
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

Matthias Apitz

2018-02-01 07:28:31 UTC

Not sure what you mean by 'historically non-printable UTF-8'. UTF-8 is
an encoding form (one of several) that represents Unicode code points as
bytes. If you want to "print" them to paper or PDF, there are ways to
write them with PostScript and, with the correct font support, bring them
into human-readable form. If you want to "display" these UTF-8 bytes, you
need terminal software with UTF-8 support, for example the port
x11/rxvt-unicode, and fonts covering the code point ranges you want to
display.
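
To make "encoding code points as bytes" concrete, here is a minimal sketch of a UTF-8 encoder. The function name utf8_encode is made up for illustration and is not part of any FreeBSD library; the bit layout follows the UTF-8 definition, in which a code point takes 1 to 4 bytes depending on its value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a libc function): encode one Unicode code
 * point as UTF-8. Writes 1 to 4 bytes into out and returns the number
 * of bytes written, or 0 if cp is a surrogate or above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* U+0000..U+007F: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* U+0080..U+07FF: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                 /* U+0800..U+FFFF: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* U+10000..U+10FFFF: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+0041 ('A') takes one byte, U+00E9 takes two (C3 A9), U+2709 (the envelope) takes three (E2 9C 89), and U+1F4F1 (the mobile phone) takes four (F0 9F 93 B1).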

Btw: Can you display my signature line correctly? It contains a
UTF-8-encoded code point for a mobile telephone :-)

matthias

--
Matthias Apitz, ✉ ***@unixarea.de, ⌂ http://www.unixarea.de/ 📱 +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

Farhan Khan

2018-02-01 15:42:36 UTC

Sorry, that was a poorly phrased question on my part. Let me try again.
I am trying to make text align in columns in a terminal. My
understanding is that characters above 0x7E are 3 bytes in length. A
modern terminal will render that as either a single question-mark or
the character itself, making terminal column alignment easy. But how
would an older terminal display a 3-byte character? I am worried that
would render as 3 question marks and throw off column alignment. If
so, is there a proper way to perform alignment for both newer and
older terminals?

I am reading this email in Gmail, so those characters render
properly for me :)

Thanks,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

Conrad Meyer

2018-02-01 20:18:10 UTC

You've said a number of things about UTF-8 that appear to be mistaken.
Start here: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Matthias Apitz

2018-02-01 20:24:30 UTC

Post by Conrad Meyer
You've said a number of things about UTF-8 that appear to be mistaken.
...

You are top-posting, which messes things up, and you are not very clear
about who said something wrong.

matthias

--
Sent from my Ubuntu phone
http://www.unixarea.de/

Bakul Shah

2018-02-02 03:51:15 UTC

Post by Farhan Khan
Sorry, that was a poorly phrased question on my part. Let me try again.
I am trying to make text align in columns in a terminal. My
understanding is that characters above 0x7E are 3 bytes in length. A
modern terminal will render that as either a single question-mark or
the character itself, making terminal column alignment easy. But how
would an older terminal display a 3-byte character? I am worried that
would render as 3 question marks and throw off column alignment. If
so, is there a proper way to perform alignment for both newer and
older terminals?

UTF-8 uses up to 4 bytes to encode a Unicode code point,
depending on the code point's value (and so, loosely, on the script).

For what you want, you can use OpenOffice-like programs that
understand Unicode and can do complex text layout. Normal
terminal programs, which typically use monospace (fixed-width) fonts,
are simply not capable of what you want. The assumption that
one char means one rectangular cell on the screen is too
deeply woven into them. Particularly for Indic languages this
just doesn't work. You may have N Unicode code points, each of
which requires 3 bytes, that all together map to a single glyph.
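
As an illustration of the last point, the sketch below counts code points in a UTF-8 string. The function name utf8_code_points is hypothetical, and the input is assumed to be valid UTF-8. The Devanagari conjunct formed by U+0915 U+094D U+0937 occupies 9 bytes and 3 code points but is normally drawn as one glyph, so neither strlen(3) nor a code point count gives the number of screen cells.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count code points in a NUL-terminated string
 * that is assumed to be valid UTF-8. Every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point.
 * No validation is performed.
 */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For per-character cell widths, POSIX provides wcwidth(3) and wcswidth(3), which operate on wide characters in the current locale. They still work one code point at a time, so they do not by themselves solve the conjunct case described above.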

Farhan Khan

2018-06-20 01:34:09 UTC

Hi all,

To follow up on my earlier, poorly asked question from a few months
back: how do I determine whether the terminal is capable of displaying
UTF-8-encoded strings and/or Unicode in general?
The obvious answer is to check the LANG variable via getenv(3), but
what if you are using "en_US.UTF-8" vs "en_GB.UTF-8"? Should I just
check for the string "UTF-8" in the LANG variable?

My concern is printing characters above 0x7F on terminals/encodings
that are not capable of displaying them, resulting in unusual
behavior.

Thanks,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

Conrad Meyer

2018-06-20 02:46:18 UTC

You want LC_CTYPE.
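
A rough sketch of why LANG alone is not enough: under the POSIX rules, LC_ALL overrides LC_CTYPE, which in turn overrides LANG, and an empty value counts as unset. The helper name ctype_locale_from_env is invented for this example. A real program would normally let setlocale(3) apply these rules instead of reading the environment by hand.

```c
#include <stdlib.h>

/*
 * Hypothetical helper, for illustration only: return the locale name
 * that the environment selects for character handling, using the POSIX
 * precedence LC_ALL, then LC_CTYPE, then LANG. Empty values are treated
 * as unset. Returns NULL if none of the three is set.
 */
const char *ctype_locale_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *value = getenv(vars[i]);

        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}
```

So a user with LANG=en_US.UTF-8 and LC_CTYPE=en_GB.ISO8859-1 gets ISO 8859-1 character handling, which a substring search of LANG would misreport as UTF-8.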

Farhan Khan

2018-06-20 04:20:57 UTC

Post by Conrad Meyer
You want LC_CTYPE.

Thanks Conrad!

I looked up exactly how locale(1) works. Similar to what you suggested,
locale(1) essentially does this:

    setlocale(LC_ALL, "");
    charset = nl_langinfo(CODESET);

The codeset name ends up in 'charset'.
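
For anyone finding this later, a minimal sketch of that check wrapped in a function. The name locale_is_utf8 is my own, and comparing against the exact string "UTF-8" is an assumption: it matches what FreeBSD and glibc report, but other systems may spell the codeset differently.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the codeset of the user's locale
 * is UTF-8. As a side effect, this adopts LC_CTYPE from the environment
 * for the whole process.
 */
bool locale_is_utf8(void)
{
    const char *codeset;

    /* "" means: take the locale from LC_ALL / LC_CTYPE / LANG. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return false;   /* the requested locale is not installed */

    codeset = nl_langinfo(CODESET);
    /* Assumption: UTF-8 locales report exactly "UTF-8" here. */
    return strcmp(codeset, "UTF-8") == 0;
}
```

Note that this only reports what the locale claims. It cannot tell whether the terminal emulator and its fonts can actually draw a given character.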

Thanks!
