The origins of Unicode date back to 1987, but it wasn't until the late '90s that it became well known, and general adoption really picked up after the year 2000. General adoption was possible mainly thanks to UTF-8, the encoding (dating back to 1993, by the way) which provided full compatibility with the US-ASCII character set. Anyway, this is a history that most of us know, and by now it's clear to most that characters do not map to bytes anymore.

Here's a small Perl 5 example for this:

    # Perl 5
    use v5.20;
    use Encode qw/encode/;
    binmode STDOUT, ':encoding(UTF-8)';
    my $snoopy = "cit\N{U+00E9}";
    my $lucy   = "cit\N{U+0065}\N{U+0301}";
    say "Code points $snoopy: " . length $snoopy;
    say "Code points $lucy: " . length $lucy;

The output is:

    Code points cité: 4
    Code points cité: 5

Ach! What happened here, with the same string (apparently of 4 chars) having two different lengths?!? First of all, we should ditch the concept of character, which is way too vague (not to mention that in some contexts it's still a byte), and use the concepts of code point and grapheme. A code point is any "thing" that the Unicode Consortium assigned a code to, while a grapheme is the visual thing you actually see on the computer screen.

Both strings in our example have 4 graphemes. However, snoopy contains an é using latin small letter e with acute (U+00E9), while in lucy the accented e is made up of two different code points: latin small letter e (U+0065) and combining acute accent (U+0301). Since the accent is combining, it joins with the letter before it in a single grapheme.

Comparison is a problem as well, as the two strings will not be equal to each other when compared, and this might not be what you expect.

This is a non-problem in languages such as Perl 6:

    # Perl 6
    # These are like "length" in JavaScript and Perl 5
    say $snoopy.codes;
    say $lucy.codes;   # 5
    # These actually count the graphemes
    say $snoopy.chars;
    say $lucy.chars;

If you don't have Perl 6 on hand, you need to normalize strings, which means bringing them to the same Unicode form.
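As a rough illustration of the normalization step in Perl 5, here is a minimal sketch. It assumes the core Unicode::Normalize module and picks NFC as the common form (NFD would work just as well, as long as both strings get the same treatment). The variable names follow the example above; the grapheme count via `\X` is an extra added here for comparison and is not taken from the original scripts.

```perl
use v5.20;
use Unicode::Normalize qw/NFC/;   # core module, ships with Perl 5

binmode STDOUT, ':encoding(UTF-8)';

my $snoopy = "cit\N{U+00E9}";            # precomposed é
my $lucy   = "cit\N{U+0065}\N{U+0301}";  # e + combining acute accent

# Raw comparison: the code point sequences differ, so the strings differ
say 'raw equal: ', ($snoopy eq $lucy ? 'yes' : 'no');          # no

# Bring both strings to the same form (NFC here) before comparing
say 'NFC equal: ', (NFC($snoopy) eq NFC($lucy) ? 'yes' : 'no'); # yes
say 'NFC length of lucy: ', length NFC($lucy);                  # 4

# \X matches one extended grapheme cluster, so this counts graphemes
# without normalizing anything
my $graphemes = () = $lucy =~ /\X/g;
say "graphemes in lucy: $graphemes";                            # 4
```

Normalizing at the boundaries (when data comes in) rather than at every comparison is usually the less error-prone choice, but that is a design decision that depends on the application.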