Discussion:
Tanween variants and Unicode
Mete Kural
2005-08-19 22:08:59 UTC
Salaam Abdulhaq, Meor and all,

I wanted to ask you to refresh my memory on something we discussed earlier on this list.

As far as I remember we had decided that these tanweens are not currently supported in Unicode:

- tanween with a meem/tamweem/?
- sequential tanween/silent tanween/?

Were there any other tanweens that are currently not supported by Unicode?

Thank you,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Meor Ridzuan Meor Yahaya
2005-08-23 03:25:05 UTC
I think as far as tanween is concerned, that's about it. Of course, this
only considers its visual appearance according to the Madinah
Mushaf (not the whole set of tajweed rules).

Any new projects?
Mete Kural
2005-08-23 17:42:51 UTC
Hello Meor,

Thanks for confirming that tanween with a meem (tamweem) and sequential/silent tanween are the only tanween forms not yet supported by Unicode.
Post by Meor Ridzuan Meor Yahaya
Any new projects?
Tom is going to present to the Unicode people at next month's Unicode conference in Florida, God willing, so I wanted to make sure that the list of Madinah Mushaf Quranic features missing from Unicode is complete.

So can you think of anything not supported in Unicode other than the items in the list below?

New character codes that are needed:
------------------------------------

- A new Arabic letter hamza is needed. Unlike the current hamza U+0621, this hamza will not break joining: when placed between two joining letters it will not split them but will float above them.

New Protocols that are needed:
------------------------------

- The contextual variant of superscript alef that shifts position when preceded by a fatha needs to be clarified. There is no need for a new character code here, just an explanation that the current superscript alef does shift position when preceded by a fatha.
- Tanween ending in meem: fathatan+superscript meem will trigger the "tamweem" symbol, and so forth for kasratan+superscript meem and dammatan+superscript meem. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.
- Silent/sequential tanween: fathatan+sukuun code will trigger the silent tanween/sequential tanween glyph, and so forth for kasratan+sukuun and dammatan+sukuun. Sukuun is a good choice for a codepoint here since the noon sound of the tanween is in a way silenced. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.

New canonical equivalences (this one is not absolutely needed for the Madinah Mushaf):
----------------------
- Basic tanween canonical equivalence: fatha+fatha needs to be made canonically equivalent to fathatan, and so on for kasratan and dammatan.

Kind regards,
Mete
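
For reference, the current state of the canonical-equivalence item can be checked with Python's standard unicodedata module. This is only a sketch, and the helper name canonically_equivalent is made up for illustration. Fathatan has no canonical decomposition today, so normalization leaves both spellings alone and they do not compare equal.

```python
import unicodedata

FATHA = "\u064E"      # ARABIC FATHA
FATHATAN = "\u064B"   # ARABIC FATHATAN


def canonically_equivalent(a, b):
    """Return True if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Compare the two spellings of the basic tanween under discussion.
    print(unicodedata.name(FATHATAN))
    print(canonically_equivalent(FATHA + FATHA, FATHATAN))
```

For contrast, alef with madda above (U+0622) does have a canonical decomposition (alef + combining madda), so that pair passes the same check.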

Meor Ridzuan Meor Yahaya
2005-08-24 01:01:05 UTC
Please see my comment below.
Post by Mete Kural
- A new Arabic letter hamza is needed. This hamza will not be dis-joining like the current hamza 0621. When put between two joining letters it will not split them but float on top of them.
Yes, this is badly needed.
Post by Mete Kural
- The contextual variant of superscript alef that shifts position when preceded by a fatha needs to be clarified. There is no need for a new character code here, just an explanation that the current superscript alef does shift position when preceded by a fatha.
At first, I did not understand this issue, which M. Yousif (the
original project maintainer) stressed. He insisted on a new codepoint
for the small alef used in the Madinah Mushaf. In my opinion, there are
at least 3 problems if we don't introduce a new codepoint:
1. There is at least one occurrence of a standalone small alef in the
Mushaf. According to Unicode, this type of character is a spacing
character (that's why I encode it as space + superscript alef), and
thus has different properties from the superscript alef.
2. The small alef does not just shift position; it also occupies some
space. Of course, the rendering engine can insert a tatweel for that,
but I think it will complicate things even more (even for basic Arabic
features, many rendering engines have problems rendering them
properly).
3. We will have problems standardizing a search algorithm. In the
Madinah Mushaf, there is no superscript alef as used by other mushafs.
The small alef always represents a missing alef. The superscript alef,
on the other hand, as used in the Pakistani mushaf, always denotes a
mad. So, for searching, we can always ignore a superscript alef when
searching for a word, but for the Madinah style we need to convert the
small alef to an ordinary alef (if preceded by fatha), or substitute
the character before the small alef with alef (without fatha). The
problem is that a program like Microsoft Word will never know how the
text is written (Madinah style, Pakistani style, or some other style),
so by itself it cannot tell them apart. If we have separate codepoints
for the two, we will not have this problem. We can develop a consistent
search algorithm for all applications.
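
To make point 3 concrete, here is a minimal sketch of search folding that depends on the writing style. The function name and the style labels are assumptions made for illustration. Only the "preceded by fatha" rule is implemented for the Madinah style, and the second Madinah rule is left out because its exact form is not settled here.

```python
FATHA = "\u064E"             # ARABIC FATHA
SUPERSCRIPT_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF
ALEF = "\u0627"              # ARABIC LETTER ALEF


def fold_for_search(text, style):
    """Fold U+0670 for searching; the right rule depends on the style."""
    if style == "pakistan":
        # Superscript alef only marks a mad, so it can be ignored.
        return text.replace(SUPERSCRIPT_ALEF, "")
    if style == "madinah":
        # Small alef after fatha stands for an unwritten ordinary alef.
        return text.replace(FATHA + SUPERSCRIPT_ALEF, FATHA + ALEF)
    raise ValueError("unknown style: %r" % (style,))
```

The same encoded string folds to two different search keys, which is the difficulty: with a single codepoint, the application has to be told the style from outside the text.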

Maybe we can discuss this matter more.

There are also some other glyphs missing; the superscript waw is one
that I can think of right now.
Post by Mete Kural
- Tanween ending in meem: fathatan+superscript meem will trigger the "tamweem" symbol, and so forth for kasratan+superscript meem and dammatan+superscript meem. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.
I think this is ok, but we might encounter some implementation problems.
Post by Mete Kural
- Silent/sequential tanween: fathatan+sukuun code will trigger the silent tanween/sequential tanween glyph, and so forth for kasratan+sukuun and dammatan+sukuun. Sukuun is a good choice for a codepoint here since the noon sound of the tanween is in a way silenced. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.
I think I need to check on this. I'm not sure sukun would be the
best choice. I still think a new codepoint would be better.
Gregg Reynolds
2005-08-25 17:07:07 UTC
Post by Mete Kural
New Protocols that are needed:
------------------------------
...
- Tanween ending in meem: fathatan+superscript meem will trigger the
"tamweem" symbol, and so forth for kasratan+superscript meem and
dammatan+superscript meem. No new character code is needed, just a
protocol that explains that the combination will trigger the
corresponding glyph.
I must respectfully but vehemently object. You can't just merrily
redefine the semantics of codepoints that are already well-defined.
Fathatan means fathatan; any software that does not display it correctly
is broken, by definition. Ditto for superscript meem. If the one
follows the other, they must both be displayed.
Post by Mete Kural
- Silent/sequential tanween: fathatan+sukuun code will trigger the
silent tanween/sequential tanween glyph, and so forth for
kasratan+sukuun and dammatan+sukuun. Sukuun is a good choice for a
codepoint here since the noon sound of the tanween is in a way
silenced. No new character code is needed, just a protocol that
explains that the combination will trigger the corresponding glyph.
Same objection. What if the author *wants* a sukuun over an -atan? By
the way, what exactly is a "silent/sequential" tanween? All tanween
variants have names in Arabic that translate quite well into English;
why not use them? By my reading, there is no such thing as a "silent
tanween"; there is an assimilated tanween, but assimilation and silence
are not the same thing. "Sukun" is definitely the wrong term.

See section 1.10 of http://www.arabink.com/patacode/encoding.pdf; see
also the bottom of p. 31 / top of p. 32.

This sort of redefinition may be ok for private experimenting, but as a
proposition for standardization it's frankly a terrible idea. Even for
private experimenting it isn't a good idea; the PUA is set aside
specifically for stuff like this.
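
As a rough illustration of the PUA route for private experiments, the sketch below assigns Private Use Area codepoints to the missing marks and checks that they really are private-use characters. The specific assignments (U+E000 and up) and the names are hypothetical, chosen only for this example, and carry no meaning outside a private agreement.

```python
import unicodedata

# Hypothetical private assignments, for illustration only.
PRIVATE_MARKS = {
    "assimilated fathatan": "\uE000",
    "assimilated dammatan": "\uE001",
    "assimilated kasratan": "\uE002",
}


def is_private_use(ch):
    """True if the character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"
```

Because these codepoints are private, existing characters such as fathatan and sukuun keep their defined meaning.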
Post by Mete Kural
New canonical equivalences (this one is not absolutely needed for the
Madinah Mushaf):
----------------------
- Basic tanween canonical equivalence: fatha+fatha needs to be made
canonically equivalent to fathatan, and so on for kasratan and dammatan.
Here's the problem with this: why stop there? You can use precisely the
same argument to say that two consecutive vowels within a word should be
interpreted as one vowel + vowel lengthener. E.g. kitAb spelled kitaab.
Technically speaking, the alif in kitAb in fact denotes a lengthening
of the preceding fath, just as the second vowel in -atan denotes /n/.
Now consider kitaabaa - should the final aa be an alif or a fathatan?

Plus, what does this do for searching and sorting? A search for e.g.
fathatan won't find two consecutive fathas. So if you do this sort of
thing you'll get surprised users. OTOH, nothing says an editor can't
map two consecutive punches of the fatha key to the fathatan codepoint.
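
A small sketch of both halves of that point: a plain search for fathatan misses the two-fatha spelling, while an input layer can map the doubled keystroke to the tanween codepoint before the text is stored. The function name is made up for illustration, and a real editor would hook this into its key handling rather than post-process a string.

```python
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"

# Doubled short vowel as typed -> single tanween codepoint as stored.
DOUBLED_TO_TANWEEN = {
    FATHA * 2: FATHATAN,
    DAMMA * 2: DAMMATAN,
    KASRA * 2: KASRATAN,
}


def map_doubled_harakat(typed):
    """Replace each doubled short vowel with its tanween codepoint."""
    for doubled, tanween in DOUBLED_TO_TANWEEN.items():
        typed = typed.replace(doubled, tanween)
    return typed


if __name__ == "__main__":
    typed = "\u0628" + FATHA + FATHA      # beh followed by two fathas
    print(FATHATAN in typed)              # search on the raw keystrokes
    print(FATHATAN in map_doubled_harakat(typed))
```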

-gregg
Nadim Shaikli
2005-08-25 18:04:54 UTC
Post by Mete Kural
Tom is gonna present to the Unicode people in next months's Unicode
conference in Florida God willing so I wanted to make sure that the list
of missing Madina Mushaf Quranic features in Unicode is complete.
- A new Arabic letter hamza is needed. This hamza will not be dis-joining
like the current hamza 0621. When put between two joining letters it will
not split them but float on top of them.
Again, I strongly encourage everyone to look at the document
that M. Yousif put together a while back, which notes what is needed and
what is missing:

http://arabeyes.org/~nadim/tmp/qu_prop.pdf

It was later formalized into the following document, which was never
submitted:

http://arabeyes.org/~nadim/tmp/unicode_quran_prop.pdf

At a minimum, look at the 'qu_prop.pdf' document, which is the briefer
of the two.
Post by Mete Kural
- The contextual variant of superscript alef that shifts position when
preceded by a fatha needs to be clarified. There is no need for a new
character code here, just an explanation that the current superscript
alef does shift position when preceded by a fatha.
- Tanween ending in meem: fathatan+superscript meem will trigger the
"tamweem" symbol, and so forth for kasratan+superscript meem and
dammatan+superscript meem. No new character code is needed, just a protocol
that explains that the combination will trigger the corresponding glyph.
- Silent/sequential tanween: fathatan+sukuun code will trigger the silent
tanween/sequential tanween glyph, and so forth for kasratan+sukuun and
dammatan+sukuun. Sukuun is a good choice for a codepoint here since the noon
sound of the tanween is in a way silenced. No new character code is needed,
just a protocol that explains that the combination will trigger the
corresponding glyph.
- Basic tanween canonical equivalence: fatha+fatha needs to be made
canonically equivalent to fathatan, and so on for kasratan and dammatan.
You can do whatever substitution you like and/or even spec those out if
you so desire, but at a minimum (and humor me here) the new scripts need
their own codepoint (like for sequential fathatan, etc). We need this
so that other font technologies (now and in the future) will be able to
reference them __in a consistent manner__. We, after all, want to make
sure that the same text if explicitly written (if one opted not to
substitute for instance) would work across platforms and fonts. Let's
not be restrictive here, and let's follow the spirit of what is
there now.

BTW: Mete, do please cite only relevant parts in your replies and don't
simply include the originating email in its entirety (the archives
are there for those looking to see the entire thing anyway).

Salam.

- Nadim




Gregg Reynolds
2005-08-25 19:19:23 UTC
Nadim Shaikli wrote:
...
Post by Nadim Shaikli
You can do whatever substitution you like and/or even spec those out if
you so desire, but at a minimum (and humor me here) the new scripts need
their own codepoint (like for sequential fathatan, etc). We need this
You lost me there. What do you mean by "the new scripts"? Do you mean
"the new codepoints"?

-g
Nadim Shaikli
2005-08-25 22:07:31 UTC
Post by Gregg Reynolds
...
Post by Nadim Shaikli
You can do whatever substitution you like and/or even spec those out if
you so desire, but at a minimum (and humor me here) the new scripts need
their own codepoint (like for sequential fathatan, etc). We need this
You lost me there. What do you mean by "the new scripts"? Do you mean
"the new codepoints"?
I mean the currently missing characters/glyphs - akin to "assimilated tanween"
(thus the 'sequential fathatan' mention).

Salam.

- Nadim


Gregg Reynolds
2005-08-25 19:24:55 UTC
Post by Nadim Shaikli
You can do whatever substitution you like and/or even spec those out if
you so desire, but at a minimum (and humor me here) the new scripts need
their own codepoint (like for sequential fathatan, etc). We need this
Forgot to mention/beseech: *please* no more "sequential fathatan". I
don't know what that means. Humor me. ;) And remember, I speak and
read Arabic, so I can make a reasonable guess as to what you mean. But
such terminology will only confuse the naive, which I usually favor, but
in this case I'm willing to make an exception. :)

Y'all may have other opinions, but for me the following work:

distinct tanween (i.e. Unicode -atans)
assimilated tanween (I think this is what sequential -atan means, but
am not sure)
nuun conversion mark (small meem)

-gregg
Mete Kural
2005-08-24 17:07:20 UTC
Hello Meor,

Please find my comments below.
Post by Meor Ridzuan Meor Yahaya
Post by Mete Kural
- The contextual variant of superscript alef that shifts position when preceded by a fatha needs to be clarified. There is no need for a new character code here, just an explanation that the current superscript alef does shift position when preceded by a fatha.
At first, I did not understand this issue, which M. Yousif (the
original project maintainer) stressed. He insisted on a new codepoint
for the small alef used in the Madinah Mushaf. In my opinion, there are
at least 3 problems if we don't introduce a new codepoint:
1. There is at least one occurrence of a standalone small alef in the
Mushaf. According to Unicode, this type of character is a spacing
character (that's why I encode it as space + superscript alef), and
thus has different properties from the superscript alef.
2. The small alef does not just shift position; it also occupies some
space. Of course, the rendering engine can insert a tatweel for that,
but I think it will complicate things even more (even for basic Arabic
features, many rendering engines have problems rendering them
properly).
3. We will have problems standardizing a search algorithm. In the
Madinah Mushaf, there is no superscript alef as used by other mushafs.
The small alef always represents a missing alef. The superscript alef,
on the other hand, as used in the Pakistani mushaf, always denotes a
mad. So, for searching, we can always ignore a superscript alef when
searching for a word, but for the Madinah style we need to convert the
small alef to an ordinary alef (if preceded by fatha), or substitute
the character before the small alef with alef (without fatha). The
problem is that a program like Microsoft Word will never know how the
text is written (Madinah style, Pakistani style, or some other style),
so by itself it cannot tell them apart. If we have separate codepoints
for the two, we will not have this problem. We can develop a consistent
search algorithm for all applications.
Maybe we can discuss this matter more.
Yes, this may deserve more discussion. If you can send me some scans from the Pakistani mushaf of the verses that you think will have a problem with the superscript alef, that would be great. I do not have a copy of the Pakistani mushaf you're mentioning.

Also can you tell me which aya has the standalone small alef?
Post by Meor Ridzuan Meor Yahaya
There are also some other glyph missing: the superscript waw is one of
them that I can think of right now.
OK, I think I forgot about this one. Can you remind me which verse has this superscript waw?
Post by Meor Ridzuan Meor Yahaya
Post by Mete Kural
- Tanween ending in meem: fathatan+superscript meem will trigger the "tamweem" symbol, and so forth for kasratan+superscript meem and dammatan+superscript meem. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.
I think this is ok, but we might encounter some implementation problems.
You mean implementation problems where rendering engine providers do not support this feature? Well, once this is in the Unicode standard, you can start bugging Microsoft and others to fix their rendering engines, since at that point it would be considered a bug for them not to support it. Until they fix it, a workaround could be used.
Post by Meor Ridzuan Meor Yahaya
Post by Mete Kural
- Silent/sequential tanween: fathatan+sukuun code will trigger the silent tanween/sequential tanween glyph, and so forth for kasratan+sukuun and dammatan+sukuun. Sukuun is a good choice for a codepoint here since the noon sound of the tanween is in a way silenced. No new character code is needed, just a protocol that explains that the combination will trigger the corresponding glyph.
I think I need to check on this. I'm not sure if sukun would be the
best choice. I still think a new code point will be better.
Sukun sounds like a good choice to me. But an alternative is U+06E0 Arabic Small High Upright Rectangular Zero. This character is sometimes used in the Madinah Mushaf on top of the silent alefs at the end of words like qaaloo. This could possibly be an even better choice than sukuun. In any case, a new codepoint is really not needed: when we have sukuun and U+06E0, we don't need a new codepoint for this function.

I look forward to receiving the aya numbers and, if possible, the Pakistani scans.

Thank you,
Mete

Meor Ridzuan Meor Yahaya
2005-08-25 01:01:29 UTC
Mete,

You can view samples of a Pakistan-style mushaf at
http://www.quranpak.com/samples.htm . The company sells Quran
publishing software that preserves the calligraphy style. I have no
idea how they implement it, but I doubt it is Unicode compliant.
In any case, I think you already know the difference. The superscript
alef is used as a mad symbol (the "a" sound, 2 harakat, without the
fatha) in most other writing styles, while the Madinah Mushaf has only
the small alef, which represents an alef that is required in
pronunciation but missing in writing. The alef might be a mad or might
not, but it does not represent the "a" sound. The "a" sound always
requires a fatha.

You can find a standalone small alef at sura 2, aya 72. I'm not sure
about other places.

You can find the superscript waw at sura 17, aya 7. I think it has only
one occurrence in the Mushaf.

For both the tanween ending with meem and the sequential one, we might
have implementation problems because of the following:
For a pure TrueType font, it is almost impossible to implement (I
think) without OpenType support (the GSUB table). I think the same
applies to bitmap fonts. Suppose the rendering engine encounters the
sequence (fathatan + sukun, or fathatan + superscript meem); the
engine will know that it needs to replace it with a new glyph.
However, since the new glyph does not have a Unicode codepoint, how
will the rendering engine find the glyph? In TrueType, each glyph
has its own index number, and a Unicode code is optional. However,
there is no standard in TTF for which glyph should be assigned to which
index number. So, to my knowledge, it is almost impossible to implement
this feature using TrueType, bitmap (BDF, Windows .fon), and PostScript
fonts (not too sure about the last one, but I think it is the same),
unless someone can tell me how this can be implemented, or there are
some features of these fonts that I'm not aware of. This issue does not
arise with OpenType, since it has the GSUB table. The font designer can
easily tell the rendering engine to substitute the sequence with the
glyph he wants (without the need for a Unicode codepoint). So, please do
consider this issue.
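
The kind of substitution described for GSUB can be sketched in a few lines. This is a toy model only, not the actual OpenType table format, and the glyph names are hypothetical. It shows why the substituted glyph needs no codepoint of its own: the lookup lives in the font and maps a sequence of glyphs to another glyph.

```python
# Hypothetical glyph names; a real font would choose its own.
LIGATURES = {
    ("fathatan", "sukun"): "fathatan.assimilated",
    ("fathatan", "meem.above"): "fathatan.meem",
}


def substitute(glyphs):
    """Replace each known two-glyph sequence with its single glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2          # both input glyphs are consumed
        else:
            out.append(glyphs[i])
            i += 1
    return out
```

A font format without such a table gives the engine no agreed way to go from the sequence to the replacement glyph, which is the gap described above.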

Regards.
Mete Kural
2005-08-25 16:52:44 UTC
Hello Meor,
Post by Meor Ridzuan Meor Yahaya
You can view samples of a Pakistan-style mushaf at
http://www.quranpak.com/samples.htm . The company sells Quran
publishing software that preserves the calligraphy style. I have no
idea how they implement it, but I doubt it is Unicode compliant.
In any case, I think you already know the difference. The superscript
alef is used as a mad symbol (the "a" sound, 2 harakat, without the
fatha) in most other writing styles, while the Madinah Mushaf has only
the small alef, which represents an alef that is required in
pronunciation but missing in writing. The alef might be a mad or might
not, but it does not represent the "a" sound. The "a" sound always
requires a fatha.
I looked through the Pakistani Quran at the link. The samples seem to contain only a portion of Surat Al-Baqarah. After a quick browse, I didn't encounter any use of the small alef other than to represent an alef that is required in pronunciation but missing in writing. Can you point me to a specific PDF and verse in the samples?

It looks like this company is doing what many others, such as Harf, are doing: using their own non-standard encoding scheme. It might be partially based on Unicode, but it's surely not Unicode, since Unicode does not yet support all the features necessary for Quran printing. They've done a good job, mashallah. I would, however, be more interested in seeing scans of a traditional Pakistani mushaf than recent computer-generated output. If we take any proposal to Unicode, it should include scans from a traditional mushaf.
Post by Meor Ridzuan Meor Yahaya
You can find standalone small alef at sura 2, aya 72. I'm not sure
about other places.
You're talking about the small alef with hamza on top in faaddaara'tum. I don't see how this small alef is fundamentally different from other small alefs. When you say a standalone small alef, what do you mean? As far as I understand, all small alefs in the Madinah Mushaf could be considered "standalone", since they simply stand in for an alef that is pronounced but not written. This one happens to have a hamza over it, of course, which makes it a little different, but it still has one and the same function: standing in for an alef that is pronounced but not written. I don't see a different function of the small alef in this example.

I think Unicode made a mistake in calling this U+0670 character Arabic Letter "Superscript" Alef and then further confusing the encoder by adding a note that says "actually a vowel sign, despite the name". In English textbooks for Arabic this character is mostly referred to as "Dagger" Alef, and it is not really much of a superscript character. Superscript implies that a character is placed on a higher plane than other characters, like the "squared" in x^2, but dagger alef can be placed high or low within the word depending on the surrounding characters. In this 2:72 example, because it is preceded by a "ra", it is placed low. When it is preceded by some other letters, it is positioned high. Its function doesn't change; it is still simply a symbol that represents an alef that is pronounced but not written.
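
The naming mismatch can be seen directly in the Unicode character database. Below is a minimal sketch using Python's standard unicodedata module; the describe helper is just an illustrative name. It returns the formal name, the general category, and the canonical combining class of a character.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, combining class) for a character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    print(describe("\u0670"))  # the "superscript" (dagger) alef
    print(describe("\u0627"))  # the ordinary alef, for comparison
```

U+0670 carries "LETTER" in its name but has the nonspacing-mark category (Mn) and a nonzero combining class, whereas the ordinary alef is a letter (Lo).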
Post by Meor Ridzuan Meor Yahaya
You can find the superscript waw at sura 17, aya 7. I think it has only
one occurrence in the Mushaf.
OK, I'm trying to refresh my memory on this one. I know we discussed it, and I vaguely remember that we concluded it deserves its own codepoint, but I can't remember why. Can you remind me what the functional difference is between this small waw and the other small waws in the Madinah Mushaf (the small waws that usually come at the end of certain words)?
Post by Meor Ridzuan Meor Yahaya
For both the tanween ending with meem and the sequential one, we might
have implementation problems because of the following:
For a pure TrueType font, it is almost impossible to implement (I
think) without OpenType support (the GSUB table). I think the same
applies to bitmap fonts. Suppose the rendering engine encounters the
sequence (fathatan + sukun, or fathatan + superscript meem); the
engine will know that it needs to replace it with a new glyph.
However, since the new glyph does not have a Unicode codepoint, how
will the rendering engine find the glyph? In TrueType, each glyph
has its own index number, and a Unicode code is optional. However,
there is no standard in TTF for which glyph should be assigned to which
index number. So, to my knowledge, it is almost impossible to implement
this feature using TrueType, bitmap (BDF, Windows .fon), and PostScript
fonts (not too sure about the last one, but I think it is the same),
unless someone can tell me how this can be implemented, or there are
some features of these fonts that I'm not aware of. This issue does not
arise with OpenType, since it has the GSUB table. The font designer can
easily tell the rendering engine to substitute the sequence with the
glyph he wants (without the need for a Unicode codepoint). So, please do
consider this issue.
The Unicode Technical Committee will not accept the addition of a new codepoint just because certain legacy font technologies are not capable of rendering the glyph without one. OpenType and other similar modern font technologies can easily handle this, as you write. Besides, someone who is trying to render the Madinah Mushaf would not use primitive font technology anyway; they would use a more modern font technology such as OpenType. Otherwise the result would be of really low quality.

Just to reiterate, the UTC is very conservative about adding new codepoints. If there is an existing codepoint that will take care of the problem, then using it would be the preferred proposal.

Looking forward to hearing from you again.

Salaam,
Mete

Mete Kural
2005-08-25 17:33:20 UTC
Hello Gregg,
Post by Gregg Reynolds
Post by Mete Kural
- Tanween ending in meem: fathatan+superscript meem will trigger the
"tamweem" symbol, and so forth for kasratan+superscript meem and
dammatan+superscript meem. No new character code is needed, just a
protocol that explains that the combination will trigger the
corresponding glyph.
I must respectfully but vehemently object. You can't just merrily
redefine the semantics of codepoints that are already well-defined.
Fathatan means fathatan; any software that does not display it correctly
is broken, by definition. Ditto for superscript meem. If the one
follows the other, they must both be displayed.
Well, that is an interesting argument, but I'm wondering how practical it is. The only use case I can think of where someone would type a tanween and then a superscript meem is a document that lists the various symbols used in Arabic. If he simply wants to write these symbols next to each other, then it would be wise to put a space in between anyway, since some of them would otherwise be stacked on top of each other. So if the user puts a space between the tanween character and the superscript meem, he can display these characters next to each other. Other than this, what other use case can you think of?
Post by Gregg Reynolds
Post by Mete Kural
- Silent/sequential tanween: fathatan+sukuun code will trigger the
silent tanween/sequential tanween glyph, and so forth for
kasratan+sukuun and dammatan+sukuun. Sukuun is a good choice for a
codepoint here since the noon sound of the tanween is in a way
silenced. No new character code is needed, just a protocol that
explains that the combination will trigger the corresponding glyph.
Same objection. What if the author *wants* a sukuun over an -atan? By
the way, what exactly is a "silent/sequential" tanween? All tanween
variants have names in Arabic that translate quite well into English;
why not use them? By my reading, there is no such thing as a "silent
tanween"; there is an assimilated tanween, but assimilation and silence
are not the same thing. "Sukun" is definitely the wrong term.
See section 1.10 of http://www.arabink.com/patacode/encoding.pdf; see
also the bottom of p. 31 / top of p. 32.
Yes I mean the assimilated tanween. I used the word sequential because in this list the word sequential has been used most commonly to refer to this character so I wanted to make sure list participants understand what character I'm talking about. Thanks for the cue for using the word "assimilated". Sounds good to me.
Post by Mete Kural
Post by Mete Kural
New canonical equivalences (this one is not absolutely needed for the
Madinah Mushaf): ---------------------- - Basic tanween canonical
equivalence: fatha+fatha needs to be made canonically equivalent to
fathatan, and so on for kasratan and dammatan.
Here's the problem with this: why stop there? You can use precisely the
same argument to say that two consecutive vowels within a word should be
interpreted as one vowel + vowel lengthener. E.g. kitAb spelled kitaab.
Technically speaking, the alif in kitAb in fact denotes a lengthening
of the preceding fath, just as the second vowel in -atan denotes /n/.
Now consider kitaabaa - should the final aa be an alif or a fathatan?
Plus, what does this do for searching and sorting? A search for e.g.
fathatan won't find two consecutive fathas. So if you do this sort of
thing you'll get surprised users. OTOH, nothing says an editor can't
map two consecutive punches of the fatha key to the fathatan codepoint.
This canonical equivalence of fathatan with fatha+fatha, etc. is personally not very important for me. This is one of the things Tom wants to propose and feels strongly about. At this point I haven't comprehended the real importance for this canonical equivalence so I would suggest you direct your questions about this one to Tom.
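
For readers following the canonical-equivalence point, here is a minimal sketch in Python of the current behaviour, using only the standard unicodedata module. The carrier letter beh and the helper names are assumptions made for illustration. It checks that fatha+fatha and fathatan are distinct under every normalization form, that a plain search for fathatan therefore misses the two-fatha spelling, and that an editor-side key mapping of the kind Gregg mentions can fold the two keystrokes without any change to Unicode.

```python
import unicodedata

FATHA = "\u064E"      # ARABIC FATHA
FATHATAN = "\u064B"   # ARABIC FATHATAN
BEH = "\u0628"        # ARABIC LETTER BEH, used here only as a carrier letter

two_fathas = BEH + FATHA + FATHA
one_fathatan = BEH + FATHATAN


def equal_under_any_normalization(a, b):
    """True if any standard normalization form makes the two strings identical."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b) for f in forms)


def fold_double_fatha(typed):
    """Hypothetical editor-side mapping: two fatha keystrokes become one fathatan."""
    return typed.replace(FATHA + FATHA, FATHATAN)


not_equivalent = not equal_under_any_normalization(two_fathas, one_fathatan)
search_misses = FATHATAN not in two_fathas
folded = fold_double_fatha(two_fathas)

print(not_equivalent, search_misses, folded == one_fathatan)
```

The folding step lives entirely in the input method, so stored text only ever contains the fathatan codepoint and searching stays predictable.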

Thanks,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Gregg Reynolds
2005-08-25 17:46:40 UTC
Permalink
Hi,
Post by Mete Kural
Hello Gregg,
Post by Mete Kural
- Tanween ending
Post by Mete Kural
in meem: fathatan+superscript meem will trigger the "tamweem"
symbol, and so forth for kasratan+superscript meem and
dammatan+superscript meem. No new character code is needed, just
a protocol that explains that the combination will trigger the
corresponding glyph.
I must respectfully but vehemently object. You can't just merrily
redefine the semantics of codepoints that are already
well-defined. Fathatan means fathatan; any software that does not
display it correctly is broken, by definition. Ditto for
superscript meem. If the one follows the other, they must both be
displayed.
Well that is an interesting argument but I'm wondering what the
practicality of it is. The only use case I can think of where someone
would type a tanween and then a superscript meem would be when he is
writing a document that lists various symbols used in Arabic. If he
wants to simply write these letters next to each other, then it would
be wise for him to put a space in between anyways since some of these
symbols would be stacked on top of each other otherwise. So if the
user puts a space between the tanween character and superscript meem
he can display these characters next to each other. Other than this
what else use case can you think of?
I don't think there is a "realistic" use case, but that isn't really the
point. Or rather, the only realistic use case I can think of is a
misspelling. But that's important: the display software should reveal
the structure of the coded text accurately.
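
One way to see "the structure of the coded text" independently of any font is to dump the code points directly. The following is a small sketch using Python's standard unicodedata module; the function name describe and the sample sequence (beh, fathatan, sukun) are chosen here for illustration only.

```python
import unicodedata


def describe(text):
    """Return (code point, character name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# beh + fathatan + sukun: the sequence proposed as a trigger for the assimilated tanween
sample = "\u0628\u064B\u0652"
for code_point, name in describe(sample):
    print(code_point, name)
```

A dump like this shows a fathatan followed by a sukun as two separate marks, whatever glyph a particular font chooses to draw for the pair.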

Anyway, one can't predict how users will want to use the writing system.
Think mathematics or linguistics. It's easy to imagine superscript
meem symbolizing some concept. Tanween I don't know, it's impossible to
predict creative uses of these symbols.

For sukun I can definitely imagine somebody wanting to place it above a
tanween mark.

In any case, the fundamental issue is simply that redefining the
semantics of established characters is a non-starter. In addition, this
proposal sounds pretty close to a spelling standardization, which is
outside the scope of Unicode.

If you're planning on proposing something like this to the Unicode crowd
I suggest you bring some robust body armour, 'cause I think they're
going to either completely ignore you or start hurling tomatoes.

...
Post by Mete Kural
This canonical equivalence of fathatan with fatha+fatha, etc. is
personally not very important for me. This is one of the things Tom
wants to propose and feels strongly about. At this point I haven't
comprehended the real importance for this canonical equivalence so I
would suggest you direct your questions about this one to Tom.
Ok. Who's Tom? Do you mean Thomas Milo?

-gregg
Mete Kural
2005-08-25 18:14:09 UTC
Permalink
Hi Gregg,
Post by Gregg Reynolds
For sukun I can definitely imagine somebody wanting to place it above a
tanween mark.
Can you please exemplify?
Post by Gregg Reynolds
Post by Mete Kural
This canonical equivalence of fathatan with fatha+fatha, etc. is
personally not very important for me. This is one of the things Tom
wants to propose and feels strongly about. At this point I haven't
comprehended the real importance for this canonical equivalence so I
would suggest you direct your questions about this one to Tom.
Ok. Who's Tom? Do you mean Thomas Milo?
Yes, Thomas Milo.

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-25 18:37:52 UTC
Permalink
Hello Nadim,
Post by Nadim Shaikli
I again highly suggest/encourage that everyone look into the document
that M.Yousif had put together awhile back to note what is needed and
what is missing.
http://arabeyes.org/~nadim/tmp/qu_prop.pdf
it was later formalized into the following document that wasn't
submitted,
http://arabeyes.org/~nadim/tmp/unicode_quran_prop.pdf
I've known the original one for a long time, and I have just now looked at the new one. The small noon, small yeh and small waw are already part of Unicode, though, so I don't know why they were listed there.
Post by Nadim Shaikli
You can do whatever substitution you like and/or even spec those out if
you so desire, but at a minimum (and humor me here) the new scripts need
their own codepoint (like for sequential fathatan, etc). We need this
so that other font technologies (now and in the future) will be able to
reference them __in a consistent manner__. We, after all, want to make
sure that the same text if explicitly written (if one opted not to
substitute for instance) would work across platforms and fonts. Let's
not be restrictive here and have follow in the same spirit of what is
there now.
The argument that older font technologies are incapable of rendering the sequence correctly is not something that interests me personally. To give an example from another script family, Devanagari which is used in India, you need modern font technology (such as OpenType) to effectively render it because of the complexity of the script. Modern Qur'anic orthography is similarly complex compared to ordinary Arabic text because of the many marks that are added to the text. You won't get away from rendering this kind of orthography without modern font technology anyways. This technology is currently available on Windows, Mac, Linux, OpenBSD, you name it. Why do we need to make sure that Madinah Mushaf's Qur'anic orthography renders with legacy font technologies?
Post by Nadim Shaikli
BTW: Mete, do please cite only relevant parts in your replies and don't
simply include the originating email in its entirety (the archives
are there for those looking to see the entire thing anyway).
Will do. Thanks.

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Nadim Shaikli
2005-08-25 22:12:21 UTC
Permalink
Looks like this company [quranpak.com] is doing what many others such
as Harf, etc are doing; using their own non-standard encoding scheme.
It might be partially based on Unicode but it's surely not Unicode
since Unicode yet does not support all the features necessary for
Quran printing. They've done a good job mashallah.
This is partially why I'm saying let's give the various missing
characters/glyphs their own entries in the character code tables.

What happens now is that various vendors want to encode a character
say the assimilated tanween (I hope Gregg is happy :-) and simply
end-up randomly picking a non-used location which doesn't necessarily
equate to what another vendor is using. I'm not talking about display,
I'm simply noting that it would be best to leave it to the end-user
to pick and choose what characters/glyphs he/she would like to utilize.
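
The interchange problem described above can be sketched in Python. The two vendor tables and their code point choices are entirely hypothetical; the only real facts used are that U+E000..U+F8FF is the Private Use Area and that such characters have general category "Co".

```python
import unicodedata

# Hypothetical private assignments: each vendor picked a different unused slot.
VENDOR_A = {"assimilated_fathatan": "\uE000"}
VENDOR_B = {"assimilated_fathatan": "\uF123"}


def is_private_use(ch):
    """Private Use Area characters carry the general category 'Co'."""
    return unicodedata.category(ch) == "Co"


def convert(text, src, dst):
    """Re-map text from one vendor's private assignments to another's."""
    table = {ord(src[key]): dst[key] for key in src if key in dst}
    return text.translate(table)


BEH = "\u0628"  # carrier letter, for illustration only
text_a = BEH + VENDOR_A["assimilated_fathatan"]
text_b = BEH + VENDOR_B["assimilated_fathatan"]

print(text_a == text_b)                               # the same word, encoded differently
print(convert(text_a, VENDOR_A, VENDOR_B) == text_b)  # equal only after an explicit mapping
```

Without a shared table like the one passed to convert, neither vendor's software can interpret the other's text, which is the situation a standard assignment would avoid.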
The argument that older font technologies are incapable of rendering the
sequence correctly is not something that interests me personally. To give an
It might not interest you yet you should not impede others from being
innovative in case they want to solve this problem in a different manner.
The argument that a character should not exist due to the fact that there
are other means (notably advanced font technology) to get the job done is
not something everyone would buy into. Unicode is filled with examples
that would argue against this stance and saying "they made a mistake and
we can't correct it now due to legacy" is a cop-out. Simply put we need
to add 5-6 new characters and leave it be. At that point everyone will
be happy - the people into font technology can proceed to do what they'd
like and those using older/different methods can have a unified/standard
means to denote data.
Modern Qur'anic orthography is similarly complex compared to ordinary
Arabic text because of the many marks that are added to the text.
You won't get away from rendering this kind of orthography without
modern font technology anyways. This technology is currently available
on Windows, Mac, Linux, OpenBSD, you name it. Why do we need to make
sure that Madinah Mushaf's Qur'anic orthography renders with legacy
font technologies?
That's up to me (and all developers and users) to decide. Unicode is not
a rendering specification and it should NOT dictate how I am to proceed
to do what I'd like to do - as such, I simply need the various characters
to be given their own code-point (within 0600-06FF or FE70-FEFF) and I'd
happily disappear. You might have a very particular means to come up with
the results and you might be very justified in your current thinking, but
why exclude others from pursuing other options, either by addressing this
with older technologies and/or by pursuing future alternatives? The characters
(or glyphs - depending on how you name them) exist and they need to be
accounted for.

Salam.

- Nadim


Mete Kural
2005-08-25 22:38:31 UTC
Permalink
Hello Nadim,

I think I didn't express myself clearly. I am not proposing that we should use a <tanween+modifier> sequence for tanween with small meem and assimilated tanween just to save the hassle of proposing six extra new codepoints to Unicode (although it would truly be quite a hassle to try to propose six new codepoints). It is because using a <tanween+modifier> sequence preserves the text's graphemic integrity better and results in a cleaner encoding. A fathatan is a fathatan, regardless of whether its pronunciation changes slightly. An assimilated fathatan or a fathatan with small meem is still a fathatan; in fact it is just as much a fathatan as any other fathatan. For hundreds of years all of these fathatans were written in exactly the same way. In more recent times scribes have decided to write these two kinds of fathatans slightly differently to cue the uneducated reciter to pronounce them correctly. For that reason the logical way to encode this is the <fathatan+modifier> sequence, in order to preserve the fathatan codepoint. Using a separate codepoint will break this graphemic integrity.

In Unicode Arabic there are several instances where certain codepoints break this kind of graphemic integrity. Some of these were added because that was the way it was in legacy Arabic codeblocks that were prepared a long time ago by corporations that wanted to localize their software into Arabic in the cheapest and quickest way. Not much scholarly advice was sought. Your argument is that we can compromise graphemic integrity yet another time in order to allow legacy font technologies to render these tanween variants. My opinion is that it is better not to introduce yet another blunder into Unicode Arabic in order to support legacy technology. We have different biases. Your bias is towards legacy support, my bias is towards graphemic integrity. This analysis doesn't resolve our differences but at least we can identify them better.
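
The practical side of the graphemic-integrity argument can be sketched in Python. Assumptions, all for illustration: sukun (U+0652) and small high meem (U+06E2) act as the modifiers in the proposed sequences, two Private Use code points stand in for hypothetical separate codepoints, and beh is an arbitrary carrier letter.

```python
FATHATAN = "\u064B"         # ARABIC FATHATAN
SUKUN = "\u0652"            # ARABIC SUKUN, modifier in the proposed assimilated sequence
SMALL_HIGH_MEEM = "\u06E2"  # ARABIC SMALL HIGH MEEM ISOLATED FORM, assumed "superscript meem"
BEH = "\u0628"              # carrier letter, for illustration only

# Hypothetical stand-ins for dedicated codepoints (Private Use Area).
SEPARATE_ASSIMILATED = "\uE000"
SEPARATE_WITH_MEEM = "\uE001"

# Three words: plain fathatan, assimilated fathatan, fathatan with meem.
sequence_encoded = " ".join([
    BEH + FATHATAN,
    BEH + FATHATAN + SUKUN,
    BEH + FATHATAN + SMALL_HIGH_MEEM,
])
separately_encoded = " ".join([
    BEH + FATHATAN,
    BEH + SEPARATE_ASSIMILATED,
    BEH + SEPARATE_WITH_MEEM,
])


def count_fathatan(text):
    """Count occurrences of the plain fathatan codepoint."""
    return text.count(FATHATAN)


print(count_fathatan(sequence_encoded), count_fathatan(separately_encoded))
```

With the sequence encoding, a naive search for fathatan finds all three variants; with separate codepoints it finds only the plain one unless the search tool is taught about the new characters.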

Kind regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Meor Ridzuan Meor Yahaya
2005-08-26 03:11:58 UTC
Permalink
So, seems like my timezone is totally different from the rest ...

Just to clarify, the use of small letters in the Madinah Mushaf is
different from what is used by others. You can see this at
http://www.quranpak.com/sample1.htm . The superscript/small alef is
used to denote the "A" sound of 2 harakat, without the fatha. Take the
word "ala": in the sample it is spelled out as "ain fatha lam
superscriptalef alefmaksura", whereas in the Madinah Mushaf it is "ain
fatha lam fatha alefmaksura superscriptalef". So the superscript alef
in the Madinah Mushaf is not really a superscript alef, it is a small
alef. Likewise, in the Madinah Mushaf the superscript/small waw in sura
17, aya 7 actually has the same function as the other small waws at the
end of a word; it is just that the "missing" waw for that word occurs
in the middle of the word.
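
The two spellings of "ala" described above can be written out as code point sequences. The sketch below, in Python with the standard unicodedata module, assumes the obvious mapping from the names used in this message to Unicode characters (ain U+0639, fatha U+064E, lam U+0644, superscript alef U+0670, alef maksura U+0649). It shows that the two encodings differ as strings but share the same consonantal skeleton once nonspacing marks are removed.

```python
import unicodedata

AIN = "\u0639"               # ARABIC LETTER AIN
LAM = "\u0644"               # ARABIC LETTER LAM
ALEF_MAKSURA = "\u0649"      # ARABIC LETTER ALEF MAKSURA
FATHA = "\u064E"             # ARABIC FATHA
SUPERSCRIPT_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF (a nonspacing mark)

# Spelling in the quranpak.com sample, as described above.
sample_spelling = AIN + FATHA + LAM + SUPERSCRIPT_ALEF + ALEF_MAKSURA
# Spelling in the Madinah Mushaf, as described above.
madinah_spelling = AIN + FATHA + LAM + FATHA + ALEF_MAKSURA + SUPERSCRIPT_ALEF


def skeleton(text):
    """Drop nonspacing marks (general category Mn), keeping only the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


print(sample_spelling == madinah_spelling)                      # the encodings differ
print(skeleton(sample_spelling) == skeleton(madinah_spelling))  # the base letters agree
```

Any comparison or search across the two traditions would need a folding step of this kind, since the marks are placed on different base letters.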

Unicode support for these characters/symbols is very confusing at best.
For the small waw, they only have small waw, not superscript waw. For
the small yeh, they have both the spacing glyph and the superscript
one. And the worst is the alef: it is named superscript alef but
described as a vowel mark, so I myself have no idea what it means or
what it is supposed to represent. Take the SIL font for example: they
have the small waw as a non-spacing glyph, contrary to the Unicode
description. This is just one example of how misleading the document
really is.

So, if you ask me, the best is for Unicode to either change the
glyph/character properties as proposed by Yousif, or add a few more
codepoints for the "missing" glyphs. The second approach can probably
be adopted faster.

And about the tanween/assimilated tanween: well, Mete, I can tell you
that the only standard technology today that can support it is
OpenType. True, OpenType support is available on Microsoft platforms,
Linux, OS X, and maybe some other platforms, but I think it is not
enough. First, OpenType support on those platforms is still
"problematic". If you look at the GNOME Bugzilla, Pango did some
workarounds to make it compatible with Uniscribe, but had to deviate
from the standard document. Also, tools to produce OpenType fonts are
not widely available. The only good tool that I know of is MS VOLT.
Even so, I personally think that MS VOLT is buggy software, and I do
have proof of it. The other tool is FontLab. Try creating GSUB and
GPOS tables with FontLab, and I think you will go crazy. Maybe Gregg
knows better.

Another important thing about technology: Pocket PC 2003 does not have
full OpenType support. I'm not sure about Palm, but I doubt it does.
So, we can't display the text on those platforms. I think these
platforms are very important for displaying the Quran, since they are
the most convenient for all. (I really would like to get one
specifically for reading the Quran.)

On a side note, I've just started to understand how Visual TrueType
works (sort of). My problem was that I started with an Arabic font, but
all of the documentation/samples/terminology are very much tailored to
Latin fonts. Yesterday I decided to take the Bitstream Vera font, remove
the hints, and start hinting the font by going through the document. I
finally understand something!! However, I'm still not sure how I can
apply those concepts/methods to an Arabic font. I'm thinking of merging
the Bitstream font with my font to get a complete font. The license
seems to permit such modification, but I'm not sure about the
implications of doing that.

Regards.
Post by Mete Kural
Hello Nadim,
I think I didn't communicate myself efficiently. I am not proposing that we should use a <tanween+modifier> sequence for tanween with small meem and assimilated tanween just to save the hassle of proposing six extra new codepoints to Unicode (although it would truly be quite a hassle to try to propose six new codepoints). It is because using a <tanween+modifier> sequence preserves the text's graphemic integrity better and results in a cleaner encoding. A fathatan is a fathatan, regardless of whether its pronunciation changes slightly. An assimilated fathatan or a fathatan with small meem is still a fathatan, in fact it is just as much fathatan as any other fathatan. For hundreds of years all of these fathatans were written the same exact way. In more recent times scribes have decided to write these two kinds of fathatans slightly differently to cue the un-educated reciter to pronounce correctly. For that reason the logical way to encode this is the <fathatan+modifier> sequence in order to preserve the fathatan codepoint. Using a separate codepoint will break this graphemic integrity.
In Unicode Arabic there are several instances where certain codepoints break this kind of graphemic integrity. Some of these were added because that was the way it was in legacy Arabic codeblocks that were prepared a long time ago by corporations that wanted to localize their software into Arabic the cheapest and quickest way. Not much scholarly advice was sought. Your argument is that we can compromise from the graphemic integrity yet another time in order to allow legacy font technologies to render these tanween variants. My opinion is that it is better not to introduce yet another blunder into Unicode Arabic in order to support the legacy. We have different biases. Your bias is towards legacy support, my bias is towards graphemic integrity. This analysis doesn't resolve our differences but at least we can identify them better.
Kind regards,
Mete
Gregg Reynolds
2005-08-26 13:56:21 UTC
Permalink
Post by Meor Ridzuan Meor Yahaya
So, seems like my timezone is totally different from the rest ...
Hi,

Very busy these days, so no time till tomorrow to comment on some of
these interesting issues, but one or two practical things...
Post by Meor Ridzuan Meor Yahaya
have proofs for it. The other tool is fontlab. Try creating GSUB and
GPOS table with fontlab, and I think you will go crazy. Maybe Gregg
know better.
I think in principle it should be possible to write specs for these
tables in a higher level language, so that they can be shared among font
developers. I found I was able to make some GSUB tables the way I
wanted, but it turns out none of the OT engines out there supports the
full logic of Open Type! For example, I wrote contextual tables to
select the proper glyph for tanween, so e.g. <damma><tanween> would do
the right thing. Worked just great inside of the Fontlab environment,
but not in any of the editors I tried (notepad, word, some mac editors,
etc.) Fontlab tech support told me that only a few Adobe products
support contextual substitution in GSUB. :(
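
To make the idea of a contextual substitution concrete, here is a toy simulation in Python. It is not an OpenType engine and does not read GSUB tables; the glyph names (tanween, tanween.damma and so on) and the rule table are hypothetical, standing in for a rule of the form "a tanween glyph preceded by a damma glyph is replaced by a damma-shaped tanween glyph".

```python
# Hypothetical rules: (preceding glyph, current glyph) -> replacement glyph.
RULES = {
    ("fatha", "tanween"): "tanween.fatha",
    ("damma", "tanween"): "tanween.damma",
    ("kasra", "tanween"): "tanween.kasra",
}


def apply_contextual_substitution(glyphs, rules):
    """Replace each glyph whose (previous input glyph, glyph) pair matches a rule."""
    shaped = []
    for index, glyph in enumerate(glyphs):
        previous = glyphs[index - 1] if index > 0 else None
        shaped.append(rules.get((previous, glyph), glyph))
    return shaped


print(apply_contextual_substitution(["beh", "damma", "tanween"], RULES))
# ['beh', 'damma', 'tanween.damma']
```

Whether a given editor honours such a rule depends on its shaping engine, which is the interoperability problem described above.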

So it looks like we really need to write an OT Service provider to work
with Freetype. Anybody game? I've started looking at the Freetype
code; it's very clean and well-organized, and they have a bunch of OT
stuff that they're migrating out of FT, since it is outside of the scope
of FT.
Post by Meor Ridzuan Meor Yahaya
Another important thing about technology: Pocket PC 2003 does not
have full opentype support. I'm not sure about palm, but I doubt it
has. So, we can't display the text on those platform. I think these
platform is very important for displaying the Quran, since it is the
most convenient for all. ( I really would like to get one
specifically for reading the Quran).
Don't forget cell phones. I think it will be very common in future to
have text stuff displayed on cell phones. In the muslim world, there
will no doubt be services that allow one to download a verse or a page
or sura etc. to one's phone.
Post by Meor Ridzuan Meor Yahaya
On side note, I've just started to understand how Visual Truetype
works (sort of). My problem was I started with arabic font, but all
of the documentation/samples/terminology are very tailored to latin
font. Yesterday I decided to use Bitstream Vera font, remove the
hint, and start hinting the font by going thru the document. I
finally understand something!! However, I'm still not sure how I can
apply those concept / method to arabic font. I'm thinking of using
bitstream font and merge it with my font to get a complete font. The
license seems to permit such modification, but not sure the
implication of doing that.
Be sure to take a long look at the license. I don't know what the legal
status would be of mixing a GPL font with Vera. Might be better to use
the GPL fonts at http://www.nongnu.org/freefont/. FYI, my plan (well,
hope, anyway) is to add Arabic (and some other) glyphs to the monospaced
Courier-class font in the collection and then hint it for on-screen
viewing, so we will have a high-quality monospaced font for use in text
editors. Since all the characters now in the font have uniform stroke
widths and very simple forms, I hope it won't be too difficult for a
non-designer like me to come up with Arabic glyphs of compatible design.
We'll see.

-gregg
Meor Ridzuan Meor Yahaya
2005-08-29 00:47:15 UTC
Permalink
Post by Gregg Reynolds
Post by Meor Ridzuan Meor Yahaya
GPOS table with fontlab, and I think you will go crazy. Maybe Gregg
know better.
I think in principle it should be possible to write specs for these
tables in a higher level language, so that they can be shared among font
developers. I found I was able to make some GSUB tables the way I
wanted, but it turns out none of the OT engines out there supports the
full logic of Open Type! For example, I wrote contextual tables to
select the proper glyph for tanween, so e.g. <damma><tanween> would do
the right thing. Worked just great inside of the Fontlab environment,
but not in any of the editors I tried (notepad, word, some mac editors,
etc.) Fontlab tech support told me that only a few Adobe products
support contextual substitution in GSUB. :(
Well, I'm not sure whether this is true or not, but when I created a
contextual substitution using VOLT for a similar case, it just worked,
both under Windows and Linux. Notepad, Wordpad, Word, Gedit, etc... no
problem. I did have some other problems, but not with a typical CALT
feature. What I suspect is that Fontlab's implementation of the GSUB
table has problems. I've seen Fontlab under Windows (demo version)
reading Arabic fonts and displaying some weird GSUB table structure.
I'm not sure how the Mac version works. Maybe you can show me some of
your work and I'll try to find out what the problem is.
Post by Gregg Reynolds
So it looks like we really need to write an OT Service provider to work
with Freetype. Anybody game? I've started looking at the Freetype
code; it's very clean and well-organized, and they have a bunch of OT
stuff that they're migrating out of FT, since it is outside of the scope
of FT.
I don't think we need to create one; Pango and ICU are out there. We
just need to improve them to get it all right.


Regards.
Gregg Reynolds
2005-08-29 14:30:23 UTC
Permalink
Post by Meor Ridzuan Meor Yahaya
What I suspect is Fontlab's implementation of the GSUB table that have
problems. I've seen Fontlab under windows (Demo version) reading
arabic fonts, displaying some wierd GSUB table structure. I'm not sure
how the Mac version works. Maybe you can show me some of your work
and I'll try to find out what is the problem.
That would be great. I'm relatively new at making fonts, although I
studied the TT and OT specs quite a bit. I know that FontLab uses
Adobe's OT dev kit, so it should work properly, but who knows. I
probably won't be able to send you anything till the weekend or even
later.

Ideally, I think it would be good to package GPL fonts with textual OT
and TT hinting files. I expect to look into using the tools from
http://home.kabelfoon.nl/~slam/fonts/. In principle one should be able
to add OT tables to any font, without needing a full font/glyph editor,
if I understand things correctly.
Post by Meor Ridzuan Meor Yahaya
Post by Gregg Reynolds
So it looks like we really need to write an OT Service provider to work
with Freetype. Anybody game? I've started looking at the Freetype
code; it's very clean and well-organized, and they have a bunch of OT
stuff that they're migrating out of FT, since it is outside of the scope
of FT.
I don't think we need to create one, Pango and ICU is out there. We
just need to improve it to get it all right.
I'll have to look into those further, I guess. I suspect I'm thinking
of something much simpler, though. I understand Pango and ICU try to
provide a complete text mgmt system, whereas I'm just thinking of
something like an enhancement to Freetype - take a string of chars, and
return a string of glyphs after performing OT stuff. Maybe return some
datastructure that indicates the mapping from char to glyph. If Pango
implements something like this as a clean component then that would be
the place to start, I guess.
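
The interface described here (characters in, glyphs out, plus a mapping back to the source characters) might look roughly like the following Python sketch. Everything in it is an assumption for illustration: the ShapedGlyph type, the shape function, the uniXXXX fallback glyph names, and the single sequence-to-glyph rule. It is not how Freetype, Pango, or ICU work internally.

```python
from typing import Dict, List, NamedTuple


class ShapedGlyph(NamedTuple):
    glyph: str    # glyph name chosen for output
    cluster: int  # index of the first source character this glyph came from


def shape(text: str, sequences: Dict[str, str]) -> List[ShapedGlyph]:
    """Map characters to glyphs, replacing known (non-empty) sequences with one glyph."""
    # Try longer sequences first so the most specific rule wins.
    ordered = sorted(sequences.items(), key=lambda item: len(item[0]), reverse=True)
    shaped = []
    index = 0
    while index < len(text):
        for sequence, glyph_name in ordered:
            if text.startswith(sequence, index):
                shaped.append(ShapedGlyph(glyph_name, index))
                index += len(sequence)
                break
        else:
            # No rule matched: fall back to a per-character glyph name.
            shaped.append(ShapedGlyph(f"uni{ord(text[index]):04X}", index))
            index += 1
    return shaped


# Hypothetical rule: fathatan + sukun selects an "assimilated fathatan" glyph.
rules = {"\u064B\u0652": "fathatan.assimilated"}
result = shape("\u0628\u064B\u0652", rules)
print(result)
```

The cluster field is what lets an editor map a click on a glyph back to a position in the text, even when two characters were merged into a single glyph.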

-gregg
Gregg Reynolds
2005-08-29 19:26:51 UTC
Permalink
Post by Mete Kural
Hello Nadim,
I think I didn't communicate myself efficiently. I am not proposing
that we should use a <tanween+modifier> sequence for tanween with
small meem and assimilated tanween just to save the hassle of
proposing six extra new codepoints to Unicode (although it would
truly be quite a hassle to try to propose six new codepoints). It is
because using a <tanween+modifier> sequence preserves the text's
graphemic integrity better and results in a cleaner encoding. A
fathatan is a fathatan, regardless of whether its pronounciation
changes slightly. An assimilated fathatan or a fathatan with small
meem is still a fathatan, in fact it is just as much fathatan as any
other fathatan. For hundreds of years all of these fathatans were
written the same exact way. In more recent times scribes have decided
to write these two kinds of fathatans slightly differently to cue the
un-educated reciter to pronounce correctly. For that reason the
logical way to encode this is the <fathatan+modifier> sequence in
order to preserve the fathatan codepoint. Using a seperate codepoint
will break this graphemic integrity.
Again, I respectfully but strongly disagree. What you've described is
not graphemic integrity but morphological integrity.

"Fathatan" is not even a distinctly recognizable concept in Arabic,
anyway. The mark to which this Unicode concept refers is not merely
"two fathas", it is a single fatha, plus a tanween mark that takes the
shape of a fatha in this context.  Nor does it simply mean "fatha
munawwana"; it means *distinctly enunciated nuun* of fatha munawwana.  A
mark of assimilated tanween - horizontally tiled vowel marks - does not
merely mean fatha munawwana either, it means fatha munawwana in which
the nuun of tanween is assimilated to the following consonant. This is
no different from an accented vowel in French receiving a phonetic
modification. Similarly for the other tanween variants - they all have
distinct meanings. To put it another way, these are not morphological
but phonological indicators, just like the other signs in the written
language.

So IMHO it is quite misleading to state that e.g. "an assimilated
fathatan or a fathatan with a small meem is still a fathatan".
Phonologically and graphically this is clearly untrue (or half-true at
best); morphologically it is true insofar as the underlying /n/ of
tanween signals indefiniteness. But Unicode, correctly IMO, does not
encode morphemes.

Note, btw, that the way the scribes do it *already* encodes tanween
modification, explicitly. If they had wanted to indicate idgham by
adding a distinct <idgham> mark to the distinct-tanween mark, they
surely could have done so. But they didn't. Why don't we follow their
lead? There is no need to add a modifier mark to an explicit tanween,
e.g. <fatha><tanween><idgham> or the like. That is certainly one way to
do the design, but I don't see a good reason to favor it over e.g.
<fatha><tanween-mudgham>. The only justification I can think of is that
the former design would allow reuse of <idgham> on other consonants.
But that is easily handled with <absolute-idgham> or whatever one
decides to call it.

The problem is simply that Unicode got it wrong by encoding the -atan
codepoints as distinct units (at least for written Arabic). The
solution is to point out that, while this may make sense for some
languages that use the Arabic script, it makes no sense at all for the
Arabic language. Therefore, additional codepoints should be adopted
that allow for proper Arabic writing, namely the various <tanween> elements.
Post by Mete Kural
In Unicode Arabic there are several instances where certain
codepoints break this kind of graphemic integrity. Some of these were
I guess I may not understand just what you mean by "graphemic
integrity". FWIW, I don't believe it's accurate to say that Unicode
encodes graphemes; if you run that past the Ken Whistlers and Mark
Davis' of the world I think they will dispute it.
Post by Mete Kural
added because that was the way it was in legacy Arabic codeblocks
that were prepared a long time ago by corporations that wanted to
localize their software into Arabic the cheapest and quickest way.
Not much scholarly advice was sought. Your argument is that we can
compromise from the graphemic integrity yet another time in order to
allow legacy font technologies to render these tanween variants. My
opinion is that it is better not to introduce yet another blunder
into Unicode Arabic in order to support the legacy.
This I don't follow at all.

Respectfully,

gregg
Mete Kural
2005-08-26 16:57:20 UTC
Permalink
Hello Meor,
Post by Meor Ridzuan Meor Yahaya
So, seems like my timezone is totally different from the rest ...
Yup I think most of us are either in North America or Middle East/Europe regions.
Post by Meor Ridzuan Meor Yahaya
Just to clarify, the use of small letters in Madinah Mushaf is
different from what is used by others. You can see at
http://www.quranpak.com/sample1.htm . The Superscript/small alef is
used to denote the "A" sound of 2 harakat, without the fatha. Take the
word "ala" , in the sample in spelled out as "ain fatha lam
superscriptalef alefmaksura", whereby in Madinah mushaf, "ain fatha
lam fatha alefmaksura superscriptalef". So, the superscript alef in
Madinah Mushaf is not really a superscriptalef, it is a small alef.
In the Madinah Mushaf dagger/small alef lengthens the "A" sound of 'alaa. They have put a fatha on top of the lam which is unnecessary but that's the way the Madinah Mushaf works. They put a fatha on letters that are followed by a long "A" vowel even if it is unnecessary. In the Pakistani Mushaf they have not put the unnecessary fatha on top of the lam in 'alaa which is totally fine because that fatha is unnecessary since it is obvious that the vowel is "A" there. But the function of dagger/small alef remains the same in this word 'alaa in both the Madinah Mushaf and the Pakistani Mushaf which is to lengthen the "A" sound as if there was an extra alef there. Are there any other examples that you think there is a difference in the function of dagger/small alef among the Pakistani Mushaf and the Madinah Mushaf?
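
The two spellings of 'alaa described above can be written out as code point sequences. This is only a minimal sketch, assuming the standard Unicode characters for each letter and mark named in the discussion; the variable and function names are illustrative and not taken from any existing tool.

```python
import unicodedata

AIN, LAM, ALEF_MAKSURA = "\u0639", "\u0644", "\u0649"
FATHA, SUPERSCRIPT_ALEF = "\u064E", "\u0670"

# Spelling as described for the Pakistani sample: no fatha on the lam.
pakistani = AIN + FATHA + LAM + SUPERSCRIPT_ALEF + ALEF_MAKSURA

# Spelling as described for the Madinah Mushaf: fatha on the lam,
# superscript alef after the alef maksura.
madinah = AIN + FATHA + LAM + FATHA + ALEF_MAKSURA + SUPERSCRIPT_ALEF


def describe(text):
    """Return the Unicode character names of a string, one per code point."""
    return [unicodedata.name(ch) for ch in text]


def base_letters(text):
    """Keep only characters with combining class 0 (the base letters)."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    for label, word in (("Pakistani", pakistani), ("Madinah", madinah)):
        print(label, describe(word))
```

Both sequences share the same base letters and differ only in which marks are present and where they sit, which is the orthographic difference under discussion.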

In the following verse you see that in the Pakistani Mushaf they have used dagger/small alef instead of hamza in the word 'aamannaa. This is an orthographic difference between the Pakistani Mushaf and the Madinah Mushaf. Both the hamza and the dagger/small alef in that case lengthen the "A" sound by representing an alef that is not there but pronounced as if it is there. Similarly in al-aakhir in verse 2:8. Again Pakistani Mushaf uses dagger/small alef whereas Madinah Mushaf uses a hamza. This is an orthographic difference, but the function of the dagger/small alef remains the same between both masahif in these cases. But if you can find other cases where the function differs please let me know. I am very interested in these kinds of things.
Post by Meor Ridzuan Meor Yahaya
So, in Madinah Mushaf, the superscript/small waw in sura 17, aya 7
actually has the same function as the other small waw at the end of a word,
it is just that the "missing" waw for that word occurred in the middle
of the word.
This small waw issue will probably have to wait for a future version of Unicode. They're not even accepting new codepoints for Unicode 5.0 any more and it's also hard to get codepoints in Unicode 5.1 any more. Some things may have to wait till Unicode 6.0. I am not even sure if they will accept the new hamza codepoint in Unicode 5.0 yet. So there may have to be more proposals in the future, piece by piece.
Post by Meor Ridzuan Meor Yahaya
Unicode support for these characters/symbols is very confusing at best.
For the small waw, they only have small waw, not superscript waw. For
the small yeh, they have both, the spacing glyph and the superscript
one. And the worst is the alef, named as superscript alef, but
described as a vowel mark, so I myself have no idea what it means, or
what it is supposed to represent. Take the SIL font for example, they
have the small waw as a non spacing glyph, contrary to the Unicode
description. This is just one example of how misleading the document
really is.
I agree with you that some of the characters need to be clarified.
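
The properties being compared here can be inspected directly. The following is a rough sketch using Python's standard unicodedata module; the values it prints come from whichever Unicode version the local Python build ships, which is an assumption, and the helper name `properties` is made up for illustration.

```python
import unicodedata

# Superscript alef, small waw, small yeh, small high yeh.
CODEPOINTS = ["\u0670", "\u06E5", "\u06E6", "\u06E7"]


def properties(ch):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


if __name__ == "__main__":
    for ch in CODEPOINTS:
        name, category, ccc = properties(ch)
        print(f"U+{ord(ch):04X} {name}: category={category}, ccc={ccc}")
```

A category of Mn marks a non-spacing combining mark, while Lm marks a spacing modifier letter, which is the small waw versus small yeh asymmetry described above.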
Post by Meor Ridzuan Meor Yahaya
So, if you ask me, the best is for unicode to either change the
glyph/character property as proposed by Yousif, or add few more
codepoints for the "missing" glyph. Second approach probably can be
adopted faster.
Faster? Not really. It takes over a year sometimes to get a new codepoint added to Unicode since with each new codepoint a whole bunch of discussion may take place. Some take even longer than a year. And then you have to wait for companies to support the new version of Unicode in their software which usually takes at least another year on top of that. For instance we have been working on the new hamza codepoint for two years now since it is considered kind of controversial and we still couldn't get it added. I hope it will make it to Unicode 5.0 but we're not even sure. There is a chance it might be delayed till Unicode 5.1.
Post by Meor Ridzuan Meor Yahaya
Another important thing about technology: Pocket PC 2003 does not have
full OpenType support. I'm not sure about Palm, but I doubt it has.
So, we can't display the text on those platforms. I think these
platforms are very important for displaying the Quran, since they are the
most convenient for all. (I really would like to get one specifically
for reading the Quran).
Don't worry, by the time these new features get added to Unicode and companies start implementing the new Unicode version it will be at least several years anyways. By that time I hope OpenType support would be added to those platforms but of course font portability is another problem. For instance Bitstream recently added OpenType support to Symbian OS which runs on cellphones. And as far as I know there is some support for OpenType in Pocket PCs (OpenOffice.org says that there is some OpenType support in Pocket Word http://xml.openoffice.org/xmerge/plugins/pocketword.html). Windows Mobile 5 probably has a little better support. By the time Unicode 6.0 is out probably Windows Mobile 6 or something gets released and hopefully they will have yet better support there.

Well the thing is that it takes a while to add new features to Unicode Arabic after the introduction. As I said it has been two years since we've discussed the new hamza codepoint but the Unicode community is very conservative and it's hard to get things added. I think finally now they don't have reasonable objections to this new hamza codepoint so we will go ahead and propose it God willing. The new hamza codepoint may make it to Unicode 5.0 but any other codepoint will have to probably wait for Unicode 6.0 anyways. Unicode 6.0 is probably two years ahead of us. Add another year for companies to support Unicode 6.0. Which means that any new codepoint after this point will not be available to users as a standard for about three years. So as far as Unicode is concerned nothing is fast. For the next three years we have to figure out what "workarounds" to use for all of the currently missing Qur'anic features (except hopefully the new hamza). So just to mention that we are looking at the long term here. If a codepoint gets added it won't be available as a standard "right now" anyways.

So if you want to propose six new codepoints for tanween variants by all means go ahead. We are already struggling with just one codepoint. Besides that I don't think it is a good idea to propose these six new codepoints, I wouldn't even have the time and energy to get six new codepoints accepted by the Unicode community anyways. If the Unicode community allows these six codepoints by any chance, then with canonical equivalences they would be made equivalent to <tanween+modifier> sequence, that is if the <tanween+modifier> sequence ever makes it into Unicode.
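
Since the tanween codepoints being debated do not exist, canonical equivalence can only be illustrated with an existing pair. The sketch below uses a Latin example (a precomposed letter versus base plus combining accent) as a stand-in; the function name `canonically_equivalent` is an illustrative helper, not a library call.

```python
import unicodedata

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(precomposed == sequence)                        # raw comparison
    print(canonically_equivalent(precomposed, sequence))  # after normalization
```

If single tanween-variant codepoints were ever accepted with canonical decompositions to a <tanween+modifier> sequence, the two encodings would relate to each other in the same way as this pair.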

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Meor Ridzuan Meor Yahaya
2005-08-29 01:14:14 UTC
Permalink
Mete ,
Post by Mete Kural
I agree with you that some of the characters need to be clarified.
Post by Meor Ridzuan Meor Yahaya
So, if you ask me, the best is for unicode to either change the
glyph/character property as proposed by Yousif, or add few more
codepoints for the "missing" glyph. Second approach probably can be
adopted faster.
Faster? Not really. It takes over a year sometimes to get a new codepoint added to Unicode since with each new codepoint a whole bunch of discussion may take place. Some take even longer than a year. And then you have to wait for companies to support the new version of Unicode in their software which usually takes at least another year on top of that. For instance we have been working on the new hamza codepoint for two years now since it is considered kind of controversial and we still couldn't get it added. I hope it will make it to Unicode 5.0 but we're not even sure. There is a chance it might be delayed till Unicode 5.1.
When I say faster, I mean to implement it from a technical perspective,
not Unicode adoption. The first option that I referred to basically
has the same problem as the new Hamza: it introduces a new shaping
behaviour to the Arabic block. So far, in the Arabic block, we have
right joiner, dual joiner, mark, and standalone (I think that's all).
So to propose a new character which has a standalone and a medial
joiner form only, the rendering engine will need to create a "new" feature,
and a new character group. So, repeat this process for all rendering
engines out there, and we will have to wait for a very long time. Not to
mention the adoption by font developers.
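
The joining classes listed above can be modelled in a few lines. This is a simplified sketch, not a shaping engine: the table is a small hand-copied excerpt of joining types (an assumption that should be checked against the Unicode ArabicShaping.txt data file), and the function names are hypothetical.

```python
# Hand-copied excerpt of joining types (illustrative, not the full data).
JOINING_TYPE = {
    "\u0621": "U",  # HAMZA: non-joining ("standalone")
    "\u0627": "R",  # ALEF: right-joining (joins only to the preceding letter)
    "\u0628": "D",  # BEH: dual-joining
    "\u0644": "D",  # LAM: dual-joining
    "\u064B": "T",  # FATHATAN: transparent mark
}


def connects(prev, nxt):
    """True if prev can join forward to nxt under this simplified model."""
    return JOINING_TYPE[prev] == "D" and JOINING_TYPE[nxt] in ("D", "R")


def joined_pairs(text):
    """For each pair of adjacent letters (marks skipped), report joining."""
    letters = [ch for ch in text if JOINING_TYPE[ch] != "T"]
    return [connects(a, b) for a, b in zip(letters, letters[1:])]


if __name__ == "__main__":
    print(joined_pairs("\u0628\u0644"))        # beh + lam
    print(joined_pairs("\u0628\u0621\u0644"))  # beh + hamza + lam
```

In this model the existing hamza breaks the connection on both sides. A hamza that floats over an unbroken join would need a behaviour that none of the four classes in the table provides, which is the implementation cost being described.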
Post by Mete Kural
Post by Meor Ridzuan Meor Yahaya
Another important thing about technology: Pocket PC 2003 does not have
full OpenType support. I'm not sure about Palm, but I doubt it has.
So, we can't display the text on those platforms. I think these
platforms are very important for displaying the Quran, since they are the
most convenient for all. (I really would like to get one specifically
for reading the Quran).
Don't worry, by the time these new features get added to Unicode and companies start implementing the new Unicode version it will be at least several years anyways. By that time I hope OpenType support would be added to those platforms but of course font portability is another problem. For instance Bitstream recently added OpenType support to Symbian OS which runs on cellphones. And as far as I know there is some support for OpenType in Pocket PCs (OpenOffice.org says that there is some OpenType support in Pocket Word http://xml.openoffice.org/xmerge/plugins/pocketword.html). Windows Mobile 5 probably has a little better support. By the time Unicode 6.0 is out probably Windows Mobile 6 or something gets released and hopefully they will have yet better support there.
Well, Pocket Word does not really support OpenType tables, it just
supports displaying OpenType fonts. Initially, I didn't realize this,
but later I came to know that some people use "OpenType font" to refer
to a font that wraps a PostScript font in OpenType/TrueType style
tables, not the GSUB/GPOS tables. I'm not sure about Symbian, but Pocket
PC definitely does not have any support for GSUB/GPOS tables. Windows CE
5 will have the support, but I did not look up the spec for Arabic.

So, if your concern is that Unicode does not want to accept a new
codepoint, why not do this: proceed with your proposal, but make sure
the features that we require are highlighted and spelled out as
clearly as possible. We do not want to introduce more ambiguity. While
you are at it, you might want to suggest to them to at least change
the description for superscript alef. So, if that gets standardized, the
only thing missing is the small superscript waw. I'm not sure how to
tackle that.

At the same time, it might be a good idea for Arabeyes to work on
something on how to make it backward compatible with other
technologies. One thing that comes to my mind is to assign those
missing glyphs to the PUA, so that we can use them consistently across our
applications. Others might want to follow.
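
A shared PUA assignment could be as simple as a published table. The sketch below is purely hypothetical: both the names and the code points are placeholders chosen for illustration (six entries, assuming two variants for each of the three tanween), not an agreed Arabeyes allocation.

```python
import unicodedata

# Hypothetical assignments: names and code points are placeholders only.
PUA_TANWEEN = {
    "FATHATAN_WITH_MEEM": "\uE000",
    "DAMMATAN_WITH_MEEM": "\uE001",
    "KASRATAN_WITH_MEEM": "\uE002",
    "SEQUENTIAL_FATHATAN": "\uE003",
    "SEQUENTIAL_DAMMATAN": "\uE004",
    "SEQUENTIAL_KASRATAN": "\uE005",
}


def is_private_use(ch):
    """Private-use characters have general category 'Co'."""
    return unicodedata.category(ch) == "Co"


if __name__ == "__main__":
    for name, ch in PUA_TANWEEN.items():
        print(f"U+{ord(ch):04X} {name} private-use={is_private_use(ch)}")
```

Text encoded this way is only interoperable between applications and fonts that agree on the same table, which is why the assignment would need to be coordinated.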

Lastly, we need to start to develop search algorithm. I'm not sure how
we can develop one, based on your proposal. Maybe you have a better
idea. (This is not my area really, since I'm not an expert in Arabic).
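
One possible starting point for search, offered only as a sketch: compare texts after dropping every non-spacing mark, so that a plain tanween and a tanween followed by a modifier mark produce the same key. The function names are hypothetical, and the approach assumes the variants are encoded with combining marks; private-use code points would not be stripped by this filter.

```python
import unicodedata


def search_key(text):
    """Drop all non-spacing marks (category Mn) so only base letters remain."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def find(needle, haystack):
    """Mark-insensitive substring test."""
    return search_key(needle) in search_key(haystack)


if __name__ == "__main__":
    plain = "\u0628\u064B"          # beh + fathatan
    variant = "\u0628\u064B\u06E2"  # beh + fathatan + small high meem
    print(search_key(plain) == search_key(variant))
```

A real search would likely need finer control, for example keeping vowels but ignoring only recitation marks, so this should be read as the coarsest possible baseline.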


Regards.
Nadim Shaikli
2005-08-29 17:45:40 UTC
Permalink
Post by Mete Kural
I am not proposing that we should use a <tanween+modifier> sequence
for tanween with small meem and assimilated tanween just to save the
hassle of proposing six extra new codepoints to Unicode (although it
would truly be quite a hassle to try to propose six new codepoints).
It is because using a <tanween+modifier> sequence preserves the text's
graphemic integrity better and results in a cleaner encoding. A fathatan
is a fathatan, regardless of whether its pronunciation changes slightly.
An assimilated fathatan or a fathatan with small meem is still a fathatan,
in fact it is just as much fathatan as any other fathatan. For hundreds of
years all of these fathatans were written the same exact way. In more
recent times scribes have decided to write these two kinds of fathatans
slightly differently to cue the un-educated reciter to pronounce correctly.
For that reason the logical way to encode this is the <fathatan+modifier>
sequence in order to preserve the fathatan codepoint. Using a separate
codepoint will break this graphemic integrity.
You're free to reason it out whichever way you like, but I still think
we should see a code-point for each of these new characters. Doing that
would free everyone to do what they please - you can proceed with your
implementation method(s) and others can count on knowing which code-point
to potentially access if they opted for an alternate realization.

Something tells me I'm repeating myself - the point is not so much your
way (or mine) is better/best/more-reasonable, but the point is to keep an
open mind on how people might potentially use this and give them all the
freedoms they've been afforded. What is being asked is very much within
the domain of what Unicode has done and is doing.
Post by Mete Kural
Your argument is that we can compromise from the graphemic integrity yet
another time in order to allow legacy font technologies to render these
tanween variants. My opinion is that it is better not to introduce yet
another blunder into Unicode Arabic in order to support the legacy.
We have different biases. Your bias is towards legacy support, my bias
is towards graphemic integrity. This analysis doesn't resolve our
differences but at least we can identify them better.
Agreed, but at the end of the day we should be putting forth enabling
technologies (past, present and future) for people to use and not limit
them to what some of us think is _THE_ solution. My bias doesn't preclude
you from doing what you'd like, whereas yours does, and therein lies
a serious issue.
Post by Mete Kural
So if you want to propose six new codepoints for tanween variants by all
means go ahead. We are already struggling with just one codepoint [hamza].
Besides that I don't think it is a good idea to propose these six new
codepoints, I wouldn't even have the time and energy to get six new
codepoints accepted by the Unicode community anyways. If the Unicode
community allows these six codepoints by any chance, then with canonical
equivalences they would be made equivalent to <tanween+modifier> sequence,
that is if the <tanween+modifier> sequence ever makes it into Unicode.
To propose or not propose is not a question of difficulty and we most
certainly should not shy away from it due to that. Again we should be
going forward with a complete solution to what is there now, we have
almost all we need with the exception of a few characters that might
have been overlooked for us to have a fully specified means to encode
the Quran. You have a different way of doing things which doesn't
require much from unicode and that is all great, but shouldn't we at
least try to address the missing pieces that are in unicode now fully
irrespective of difficulty and time ? Why pull the plug on what is
there now which many people are and will continue to use for generations
to come in lieu of an alternative - let's fix/add what is needed there
and also work on alternatives and leave it be... Seems rather encompassing
to me, yet I really don't understand the outright opposition to this.
Again, don't think of this as better/worse or old/new but think of it
as a means to enable functionality for some.

Salam.

- Nadim




Mete Kural
2005-08-29 18:15:55 UTC
Permalink
Hello Meor,
Post by Meor Ridzuan Meor Yahaya
So, if your concern is that Unicode does not want to accept a new
codepoint, why not do this: proceed with your proposal, but make sure
the features that we require are highlighted and spelled out as
clearly as possible. We do not want to introduce more ambiguity. While
you are at it, you might want to suggest to them to at least change
the description for superscript alef. So, if that gets standardized, the
only thing missing is the small superscript waw. I'm not sure how to
tackle that.
My concern is not that Unicode does not want to accept new codepoints but rather that new codepoints that break graphemic integrity should not be added. God willing Tom and I will work to the best of our ability to make sure these new features are added to Unicode Arabic in the manner that is clear and understandable.
Post by Meor Ridzuan Meor Yahaya
At the same time, it might be a good idea for Arabeyes to work on
something on how to make it backward compatible with other
technologies. One thing that comes to my mind is to assign those
missing glyphs to the PUA, so that we can use them consistently across our
applications. Others might want to follow.
PUA would be the place to use for these experimental initiatives.
Post by Meor Ridzuan Meor Yahaya
Lastly, we need to start to develop search algorithm. I'm not sure how
we can develop one, based on your proposal. Maybe you have a better
idea. (This is not my area really, since I'm not an expert in Arabic).
Yes, a search algorithm is important. Unfortunately I won't have the time to contribute to search algorithm efforts for at least another year. If anyone else wants to start right now, I would appreciate the effort.

Regards,
Mete



--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-29 18:22:47 UTC
Permalink
Salaam Nadim,
Post by Nadim Shaikli
Agreed, but at the end of the day we should be putting forth enabling
technologies (past, present and future) for people to use and not limit
them to what some of us think is _THE_ solution. My bias doesn't preclude
you from doing what you'd like, whereas yours does, and therein lies
a serious issue.
Anybody is still free to propose these six tanween codepoints if they want to propose. There is nothing stopping them from doing so. If the UTC accepts such six codepoints the logical thing to do would be canonical equivalence with the tanween+modifier sequence that we are proposing. But we would discourage anyone from proposing such six new tanween codepoints, although as I said they can go ahead if they want to. In the end the final decision is made not by us but by the Unicode Technical Committee.
Post by Nadim Shaikli
To propose or not propose is not a question of difficulty and we most
certainly should not shy away from it due to that. Again we should be
going forward with a complete solution to what is there now, we have
almost all we need with the exception of a few characters that might
have been overlooked for us to have a fully specified means to encode
the Quran. You have a different way of doing things which doesn't
require much from unicode and that is all great, but shouldn't we at
least try to address the missing pieces that are in unicode now fully
irrespective of difficulty and time ? Why pull the plug on what is
there now which many people are and will continue to use for generations
to come in lieu of an alternative - let's fix/add what is needed there
and also work on alternatives and leave it be... Seems rather encompassing
to me, yet I really don't understand the outright opposition to this.
Again, don't think of this as better/worse or old/new but think of it
as a means to enable functionality for some.
We are not pulling the plug on anyone. As I say the UTC is the decision maker. Anyone can make another proposal after our proposal if they don't like the result. Then, the decision would be up to the UTC to make.

Kind regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Nadim Shaikli
2005-08-29 21:49:58 UTC
Permalink
Post by Mete Kural
Anybody is still free to propose these six tanween codepoints if they want
to propose. There is nothing stopping them from doing so.
That I knew - but the point of this entire thread (as far as I'm concerned)
is to get others to see the other end of the argument (be it right or wrong)
and to at least sympathize with what is being said. It is no mystery that
Arabeyes doesn't have any pull with unicode (if you can even call it that)
and that there are a handful of "experts" that approve such proposals.
Mr T.Milo is such an expert and he's referenced often enough to know that
when it comes to arabic his stance and decision is more than likely to
prevail. As such it would have been wonderful for you and Thomas to at
a minimum agree with what was being said for the proposal would have no
chance whatsoever without some backing from those interested in this topic
on _this_ list.
Post by Mete Kural
But we would discourage anyone from proposing such six new tanween
codepoints. Although as I said they can go ahead if they want to.
In the end the final decision is made not by us but the Unicode
Technical Committee.
Out of curiosity, who is the "we" in the 'we would discourage' above ?
Post by Mete Kural
We are not pulling the plug on anyone. As I say the UTC is the decision
maker. Anyone can make another proposal after our proposal if they don't
like the result. Then, the decision would be up to the UTC to make.
As noted without buy-in from those involved with unicode on this list we
have an extremely limited chance to get anything through to the committee.

Sorry, but I continue to be baffled regarding this topic and the rather
limiting (to many) solution sought.

Salam.

- Nadim


Mete Kural
2005-08-29 20:22:05 UTC
Permalink
Post by Gregg Reynolds
I guess I may not understand just what you mean by "graphemic
integrity". FWIW, I don't believe it's accurate to say that Unicode
encodes graphemes; if you run that past the Ken Whistlers and Mark
Davis' of the world I think they will dispute it.
This is our interpretation of Unicode's mission statement that "The Unicode Standard encodes characters". What do you suggest that Unicode encodes?

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-29 20:23:49 UTC
Permalink
Post by Mete Kural
This is our interpretation of Unicode's mission statement that "The Unicode Standard encodes characters". What do you suggest that Unicode encodes?
Sorry, I meant to say:

This is our interpretation of Unicode's mission statement that "The Unicode Standard encodes characters" in the case of Arabic.

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-29 21:19:21 UTC
Permalink
Here is Unicode Technical Standard #18 written by Mark Davis himself that gives clues to what Unicode means by "characters":

http://unicode.org/reports/tr18/
Look for:

"One or more Unicode characters may make up what the user thinks of as a character. To avoid ambiguity with the computer use of the term character, this is called a grapheme cluster. For example, "G" + acute-accent is a grapheme cluster: it is thought of as a single character by users, yet is actually represented by two Unicode characters."

Regards,
Mete
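
The grapheme cluster example quoted from UTS #18 in this message can be reproduced in a few lines. This is a very rough sketch: it attaches each combining mark to the preceding base character, which is an assumption that covers only a small part of the real Unicode segmentation rules, and `split_clusters` is a made-up helper name.

```python
import unicodedata

cluster = "G\u0301"  # LATIN CAPITAL LETTER G + COMBINING ACUTE ACCENT


def split_clusters(text):
    """Rough clustering: attach each combining mark to the preceding base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    print(len(cluster))             # number of Unicode characters
    print(len(split_clusters(cluster)))  # number of user-perceived units
```

Under the <tanween+modifier> approach, a letter followed by a tanween and a modifier mark would likewise be several Unicode characters forming one user-perceived unit.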



--
Mete Kural
Touchtone Corporation
714-755-2810
--
Gregg Reynolds
2005-08-30 02:29:38 UTC
Permalink
Post by Mete Kural
Here is Unicode Technical Standard #18 written by Mark Davis himself
"One or more Unicode characters may make up what the user thinks of
as a character. To avoid ambiguity with the computer use of the term
character, this is called a grapheme cluster. For example, "G" +
acute-accent is a grapheme cluster: it is thought of as a single
character by users, yet is actually represented by two Unicode
characters."
Regards, Mete
Hi Mete,

Just wanted to let you know that I'm pretty booked for the next few
days, but will respond. Personally, I find this topic of "what is a
character" quite fascinating, so I warn you that if I can't persuade you
I may bore you to death. ;)

-gregg
Mete Kural
2005-08-29 22:31:19 UTC
Permalink
Hello Nadim,
Post by Nadim Shaikli
Out of curiosity, who is the "we" in the 'we would discourage' above ?
I meant Tom and me. But I would not like to speak for Tom although I do know his view regarding this tanween issue and he doesn't support proposing six new codepoints for each tanween and rather supports the <tanween+modifier> pattern. This was proposed by himself not by me and I agreed that it is the appropriate way to handle the matter at hand.
Post by Nadim Shaikli
As noted without buy-in from those involved with unicode on this list we
have an extremely limited chance to get anything through to the committee.
That is why I posted about this on this list in the first place. But it seems like everybody has a different approach to handling the issue and is not willing to change their minds, except maybe Meor. As far as I understand (referenced individuals, correct me if I'm wrong):

Tom and I support the <tanween+modifier> approach.
Gregg supports the <vowel+modifier> approach (by vowel, I mean fatha/damma/kasra here).
Meor is somewhat neutral but he prefers the single codepoint for each tanween variant approach since it is easier to implement.
You and Mohammed Yousif support the single codepoint for each tanween variant approach.

Kind regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Nadim Shaikli
2005-08-29 23:23:45 UTC
Permalink
- Tom and I support the <tanween+modifier> approach.
- Gregg supports the <vowel+modifier> approach (by vowel, I mean
fatha/damma/kasra here).
- Meor is somewhat neutral but he prefers the single codepoint
for each tanween variant approach since it is easier to implement.
- Nadim and Mohammed Yousif support the single codepoint for each
tanween variant approach.
So if you really boil it down there are two approaches here,

a. With a modifier (of some kind)
b. A codepoint for each character

So why not do both - the 'b' option will give you standardized
backwards compatibility as well as functionality on restricted
or non-font based approaches while the 'a' option would result
in a more preferred standardized approach that is font technology
driven.

Seems like a plausible win-win situation to me, no ?
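
Doing both could look roughly like the sketch below: each single code point of option 'b' carries a mapping to its option 'a' sequence, so text can be converted in either direction. Everything specific here is an assumption made for illustration: the private-use code points are placeholders, and U+06E2 is used as the modifier only because it was mentioned on this list, not because any proposal fixes it.

```python
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
SMALL_HIGH_MEEM = "\u06E2"  # assumed modifier, for illustration only

# Option b: one hypothetical private-use code point per variant,
# each mapped to its option-a <tanween+modifier> sequence.
DECOMPOSITION = {
    "\uE000": FATHATAN + SMALL_HIGH_MEEM,
    "\uE001": DAMMATAN + SMALL_HIGH_MEEM,
    "\uE002": KASRATAN + SMALL_HIGH_MEEM,
}


def to_sequences(text):
    """Expand single code points into <tanween+modifier> sequences."""
    return "".join(DECOMPOSITION.get(ch, ch) for ch in text)


def to_single_codepoints(text):
    """Collapse <tanween+modifier> sequences into single code points."""
    for single, sequence in DECOMPOSITION.items():
        text = text.replace(sequence, single)
    return text


if __name__ == "__main__":
    word = "\u0628\uE000"  # beh + hypothetical fathatan-with-meem
    print(to_sequences(word) == "\u0628\u064B\u06E2")
```

A mapping like this is what a canonical equivalence would standardize; without one, every application would have to carry its own copy of the table.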

Salam.

- Nadim




Mete Kural
2005-08-29 22:37:21 UTC
Permalink
So I think the best thing would be that once this proposal goes into public review, everyone who has a different approach should comment and discuss the matter on the Unicode platform. This includes Meor, Gregg, you and Mohammed Yousif. The public review website is here:

http://www.unicode.org/review/

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Meor Ridzuan Meor Yahaya
2005-08-30 00:05:32 UTC
Permalink
Mete,
I don't have any strong preference because I'm not the expert here.
However, I do have good experience with the technology implementations
out there.

One thing to consider about your proposal. If you choose the approach
of using a tanween+modifier (as implemented by my text file), I
suggest you use something other than sukun, since this will make many, many
systems out there break. Why? One word: Microsoft. They will treat
the sequence as invalid, and thus will render the dotted circle (please refer
to http://www.microsoft.com/typography/OpenType%20Dev/arabic/shaping.mspx
). Since their implementation has the sequence built in as invalid,
this will upset a lot of people. Plus, I think they will
oppose the proposal because of that. That is the reason why I did not
choose sukun for that purpose. However, since 06DF and 06E0 are in the
same group as 06E2 (one of my chosen modifiers), it might work, but
I'm not sure what it means to have that sequence.
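
The candidate modifiers mentioned above can at least be compared by their character properties. This sketch does not reproduce any shaping engine's validation rules (the dotted-circle behaviour lives in the renderer, not in these properties); it only checks names, combining classes, and whether a <tanween, modifier> sequence keeps its order under normalization. The helper names are illustrative.

```python
import unicodedata

FATHATAN = "\u064B"
# Sukun, and the three marks referred to as 06DF, 06E0 and 06E2.
CANDIDATE_MODIFIERS = ["\u0652", "\u06DF", "\u06E0", "\u06E2"]


def mark_info(ch):
    """Return (name, canonical combining class) for a mark."""
    return unicodedata.name(ch), unicodedata.combining(ch)


def stable_after_tanween(modifier):
    """True if <fathatan, modifier> keeps its order under NFC."""
    sequence = FATHATAN + modifier
    return unicodedata.normalize("NFC", sequence) == sequence


if __name__ == "__main__":
    for ch in CANDIDATE_MODIFIERS:
        name, ccc = mark_info(ch)
        print(f"U+{ord(ch):04X} {name}: ccc={ccc}, "
              f"stable={stable_after_tanween(ch)}")
```

Because fathatan has a lower combining class than any of these marks, normalization leaves the tanween first; a modifier typed before the tanween would be reordered behind it.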

Regards.
Post by Mete Kural
http://www.unicode.org/review/
Regards,
Mete
---------- Original Message ----------------------------------
Date: Mon, 29 Aug 2005 15:31:19 -0700
Post by Mete Kural
Hello Nadim,
Post by Nadim Shaikli
Out of curiosity, who is the "we" in the 'we would discourage' above ?
I meant Tom and me. But I would not like to speak for Tom although I do know his view regarding this tanween issue and he doesn't support proposing six new codepoints for each tanween and rather supports the <tanween+modifier> pattern. This was proposed by himself not by me and I agreed that it is the appropriate way to handle the matter at hand.
Post by Nadim Shaikli
As noted without buy-in from those involved with unicode on this list we
have an extremely limited change to get anything through to the committee.
Tom and I support the <tanween+modifier> approach.
Gregg supports the <vowel+modifier> approach (by vowel, I mean fatha/damma/kasra here).
Meor is somewhat neutral but he prefers the single codepoint for each tanween variant approach since it is easier to implement.
You and Mohammed Yousif support the single codepoint for each tanween variant approach.
Kind regards,
Mete
--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-29 23:49:52 UTC
Permalink
Hello Nadim,
Post by Nadim Shaikli
So if you really boil it down there are two approaches here,
a. With a modifier (of some kind)
b. A codepoint for each character
So why not do both - the 'b' option will give you standardized
backwards compatibility as well as functionality on restricted
or non-font based approaches while the 'a' option would result
in a more preferred standardized approach that is font technology
driven.
I mentioned that if the UTC would ever accept a codepoint for each tanween variant, they could be made canonically equivalent to the modifier approach. So both approaches could simultaneously be in use. It's not my preference but if both were to be accepted by the UTC, that would be the case.
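
As an illustration of what canonical equivalence means in practice, the Python sketch below uses an existing Arabic pair, U+0622 (ALEF WITH MADDA ABOVE) and the sequence U+0627 U+0653, rather than any tanween codepoint, since no precomposed tanween variants exist in Unicode. The helper name is an assumption; the comparison simply normalizes both strings to NFD.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u0622"     # ARABIC LETTER ALEF WITH MADDA ABOVE
sequence = "\u0627\u0653"  # ARABIC LETTER ALEF + ARABIC MADDAH ABOVE

# Different code point sequences, same canonical form.
print(precomposed == sequence)
print(canonically_equivalent(precomposed, sequence))
```

If dedicated tanween-variant codepoints were ever encoded with canonical decompositions to a <tanween+modifier> sequence, the same kind of comparison would treat both spellings as the same text.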

First of all, there is no single proposal on the table right now that takes care of all the missing characters not supported by Unicode. The primary person making the current proposal is Tom; I'm just helping support his case. The current proposal Tom is taking to the UTC includes just the new hamza codepoint and the tanween variants with the modifier approach. There are still more things, such as the small waw, so there have to be multiple proposals anyway. If someone wants to write another proposal for separate codepoints for tanween variants, they can go ahead and do it. I know that Tom is not willing to do this himself; he does not think this is a proper way to handle this encoding matter. You can try to convince him if you want (his email is t.milo-/NLkJaSkS4VmR6Xm/***@public.gmane.org - he posted to this list with this email address as well).

Additionally your font vs. non-font based approach comparison is confusing. Remember that Postscript fonts are fonts as well.

Regards,
Mete



--
Mete Kural
Touchtone Corporation
714-755-2810
--
Mete Kural
2005-08-30 00:26:24 UTC
Permalink
Hello Meor,
Post by Meor Ridzuan Meor Yahaya
choose sukun for that purpose. However, since 06DF and 06E0 are in the
same group as 06E2 (one of my chosen modifiers), it might work, but
I'm not sure what it means to have that sequence.
Actually we recently decided against using sukuun since 06DF or 06E0 are better choices. 06DF and 06E0 silence consonants, whereas sukuun indicates that a consonant is vowelless. The analogy of using 06DF or 06E0 is better since symbolically you can think of these codepoints as indicating that the noon consonant at the end of the tanween is silenced (or assimilated, as Gregg would say).

So I don't think we will be proposing sukuun as the modifier, since it turns out to be a poorly fitting analogy. That should help you avoid the problem with the current MS implementation.
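
A rough Python sketch of how a <tanween+modifier> sequence with U+06DF would behave under normalization follows. The pairing of fathatan with U+06DF is the idea under discussion in this thread, not an encoding defined by Unicode, and the helper name is made up for the example. The point shown is that the combining classes (27 for fathatan, 230 for U+06DF) give the sequence a stable canonical order of tanween first, modifier second.

```python
import unicodedata

BEH = "\u0628"           # ARABIC LETTER BEH, used only as a sample base
FATHATAN = "\u064B"      # ARABIC FATHATAN, combining class 27
ROUNDED_ZERO = "\u06DF"  # ARABIC SMALL HIGH ROUNDED ZERO, combining class 230


def tanween_with_modifier(base, tanween, modifier):
    """Build base + tanween + modifier (hypothetical encoding pattern)."""
    return base + tanween + modifier


seq = tanween_with_modifier(BEH, FATHATAN, ROUNDED_ZERO)
swapped = BEH + ROUNDED_ZERO + FATHATAN  # modifier typed before the tanween

print(unicodedata.combining(FATHATAN), unicodedata.combining(ROUNDED_ZERO))
# Normalization leaves the intended order alone and reorders the swapped one.
print(unicodedata.normalize("NFC", seq) == seq)
print(unicodedata.normalize("NFC", swapped) == seq)
```

So whichever order a user types the two marks in, normalized text ends up with the tanween before the modifier, which is what a font lookup for the pair would need to match.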

You can see examples of 06DF and 06E0 here:
http://www.unicode.org/review/pr-73-arabic.jpg
(The Unicode Arabic code chart displays the wrong glyphs for 06DF and 06E0; the above are accurate. Tom made a proposal on this and God willing the code chart will be fixed for Unicode 5.0).

Regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--
Meor Ridzuan Meor Yahaya
2005-08-30 00:34:13 UTC
Permalink
But don't forget one thing, the 06DF and 06E0 are only used in the
Madinah Mushaf in combination with alef, and only alef.

Regards.
Post by Mete Kural
Hello Meor,
Post by Meor Ridzuan Meor Yahaya
choose sukun for that purpose. However, since 06DF and 06E0 are in the
same group as 06E2 (one of my chosen modifiers), it might work, but
I'm not sure what it means to have that sequence.
Actually we recently decided against using sukuun since 06DF or 06E0 are better choices. These silence consonants whereas sukuun indicates that a consonant is vowelless. The analogy of using 06DF or 06E0 is better since symbolically you can think of these codepoints indicating that the noon consonant at the end of the tanween is silenced (or assimilated as Gregg would say).
So I don't think we will be proposing the modifier to be sukuun, which turns out to be a not so fit analogy. That should help you avoid the problem with the current MS implementation.
http://www.unicode.org/review/pr-73-arabic.jpg
(The Unicode Arabic code chart displays the wrong glyphs for 06DF and 06E0, the above are accurate. Tom made a proposal on this and God willing the code chart will be fixed for Unicode 5.0).
Regards,
Mete
--
Mete Kural
Touchtone Corporation
714-755-2810
--
Gregg Reynolds
2005-08-30 02:46:30 UTC
Permalink
Post by Meor Ridzuan Meor Yahaya
But don't forget one thing, the 06DF and 06E0 are only used in the
Madinah Mushaf in combination with alef, and only alef.
I don't know if my copy is Madinah or not, but in the end pages it gives
an example of "al-Sifr al-mustatiir" on a dotless ya.

But that is irrelevant, in my opinion. I am fundamentally and adamantly
opposed to the notion that well-established codepoints should be given
dual semantics. If you want new semantics, propose a new codepoint.

-gregg
Meor Ridzuan Meor Yahaya
2005-08-30 05:22:49 UTC
Permalink
Sorry, my bad. The 06DF is defined for the "illah" characters, which are
alef, waw and yeh if I'm not mistaken. The 06E0 is strictly for alef.
Anyway, its use is very limited.

Regards.
Post by Gregg Reynolds
Post by Meor Ridzuan Meor Yahaya
But don't forget one thing, the 06DF and 06E0 are only used in the
Madinah Mushaf in combination with alef, and only alef.
I don't know if my copy is Madinah or not, but in the end pages it gives
an example of "al-Sifr al-mustatiir" on a dotless ya.
But that is irrelevant, in my opinion. I am fundamentally and adamantly
opposed to the notion that well-established codepoints should be given
dual semantics. If you want new semantics, propose a new codepoint.
-gregg
Meor Ridzuan Meor Yahaya
2005-09-14 05:36:52 UTC
Permalink
Salam to all,
Does anyone know about this code? The description says ARABIC MARK
NOON GHUNNA, then further describes it as Kashmiri and Baluchi,
nasalization in Urdu. I don't think I've seen the mark before, and it is
difficult to know exactly what it looks like from the Unicode document.

Previously, I thought of using this code to indicate the
sequential/assimilated tanween (it seems the most logical choice);
however, it does have problems when trying to implement it under
Windows.

Wassalam.
Thomas Milo
2005-09-14 14:05:23 UTC
Permalink
Hi Meor,
Post by Meor Ridzuan Meor Yahaya
Salam to all,
Does anyone know about this code? The description says ARABIC MARK
NOON GHUNNA, then further describes it as Kashmiri and Baluchi,
nasalization in Urdu. I don't think I've seen the mark before, and it is
difficult to know exactly what it looks like from the Unicode document.
U+0658 ARABIC MARK NOON GHUNNA is a supplement to the U+06BA ARABIC LETTER
NOON GHUNNA (an undotted noon for Urdu), for disambiguation in non-final
positions. It looks like the Latin BREVE mark (bottom half of a circle).
Post by Meor Ridzuan Meor Yahaya
Previously, I thought of using this code to indicate the
sequential/assimilated tanween (it seems the most logical choice);
however, it does have problems when trying to implement it under
Windows.
Why not use U+06E8 ARABIC SMALL HIGH NOON, which presently serves only one instance
in the whole of the Qur'an, or the history of Arabic literacy for that
matter?
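
For anyone who wants to check the codepoints mentioned in this message, a small Python sketch using the standard `unicodedata` module prints their formal names and general categories. The `describe` helper is just a convenience name for this example.

```python
import unicodedata


def describe(cp):
    """Return (formal Unicode name, general category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch)


# U+0658 and U+06E8 are nonspacing marks (Mn); U+06BA is a letter (Lo).
for cp in (0x0658, 0x06BA, 0x06E8):
    name, category = describe(cp)
    print(f"U+{cp:04X} {name} [{category}]")
```

The category matters for the tanween discussion: only the Mn characters can act as combining modifiers stacked on a preceding mark or letter.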

t
