User:בסג/Unicode

Dear Yaakov Shoham,

Thank you for your feedback on the Unicode Standard.

The issue you raise here is neither a "problem" with the Unicode Standard nor something that can be changed in the Unicode Standard.

There is a recurrent misunderstanding of the nature of "canonical reordering" in the context of normalization in the Unicode Standard. Normalization is *not* intended to specify either input order or any linguistically correct or preferred order of strings. Normalization is a process which is used to specify which strings are *equivalent* by certain criteria -- not to stand in as some kind of spelling order or requirement for representation order of text.

See The Unicode Standard, 5.0, p. 115:

"The canonical order of character sequences does *not* imply any kind of linguistic correctness or linguistic preference for ordering of combining marks in sequences."

What the Unicode Standard does say is that a sequence:

bet + dagesh + patah

is to be treated as canonically equivalent to a sequence:

bet + patah + dagesh

And while it is clear that <bet, dagesh, patah> is a linguistically and graphologically preferable sequence, whereas <bet, patah, dagesh> is the NFC normalized sequence, it is *not* the job of NFC normalization to specify what is the linguistically and graphologically preferable sequence. That is the continual disconnect people have on this topic.
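
As a minimal sketch of the point above, assuming Python's standard `unicodedata` module (the names `typed`, `reordered`, and `canonically_equivalent` are illustrative and not part of the original message), the two sequences can be compared after normalization:

```python
import unicodedata

# Code points for the characters discussed above.
BET = "\u05D1"     # HEBREW LETTER BET
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"   # HEBREW POINT PATAH

typed = BET + DAGESH + PATAH      # <bet, dagesh, patah>
reordered = BET + PATAH + DAGESH  # <bet, patah, dagesh>


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw code point sequences differ...
print(typed == reordered)
# ...but the strings are canonically equivalent.
print(canonically_equivalent(typed, reordered))
# Canonical reordering sorts the marks by combining class,
# which places patah before dagesh.
print(unicodedata.combining(PATAH), unicodedata.combining(DAGESH))
print(unicodedata.normalize("NFC", typed) == reordered)
```

A comparison of this kind tells you only that the strings are equivalent; it says nothing about which order is linguistically preferable.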

What the Unicode Standard does suggest, however, is that rendering systems, when rendering nikkud, points, and marks on Hebrew letters, should do so *correctly*, and the implications of canonical equivalence include the conclusion that whether you get <bet, dagesh, patah> or <bet, patah, dagesh>, your system *should* render them identically. If it doesn't, it's a bug in the rendering system and/or the font.

Incidentally, I've filed a comment on Bugzilla bug #2399 on this topic, in case that will help others discussing the issue in the context of Wikimedia.

Regards,

--Ken Whistler, Unicode, Inc.