World: r3wp

[Script Library] REBOL.org: Script library and Mailing list archive

PeterWood 14-Mar-2009 [762]	At the moment, I'd be worried about standardising the Library on utf-8, as the effect that multibyte characters would have during script and mail processing is not understood. It could well be that the system handles multibyte characters without a hitch, but nobody knows yet. I have started to write some scripts to try to help move to a consistent character encoding of the Library data but, due to time constraints, I have been very slow.
Anton 14-Mar-2009 [763x3]	Why worry? Just do it. :-P
	What version of rebol is being used by rebol.org ?
	Sunanda, can you publish some files with the 8-bit ascii and note what the problems are ?
Maxim 14-Mar-2009 [766x3]	sunanda, you can force the character encoding in the html page header... I've used that before and it worked for me.
	note, I don't mean the http header, but the actual <HEAD> tag.
	I had the same kind of issues on another system. nowadays, the default encoding has become UTF-8 for many/most html handlers, so if it's not specified, many new browsers and tools will incorrectly break up the character data.
Sunanda 14-Mar-2009 [769x2]	Anton, REBOL.org uses 2.5.6.4.1. The obvious bad file is the one Scott added recently: http://www.rebol.org/view-script.r?script=ascii-math.r If you view it with that URL, all looks good. If you click the [Download script] link you'll see many spurious high-ascii chars in the source. Those high ascii _are_ actually in the source. But where they came from is a mystery.
Sunanda 14-Mar-2009 [769x2]	Maxim, REBOL.org emits a header <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> Yeah, I know we aren't utf-8 -- but experiment has shown that's the most acceptable charset. Not sure what you are saying we could put in <head> -- can you be more specific?
Maxim 14-Mar-2009 [771]	there is a specific charset for western -iso, which ensures the upper 128 byte values are displayed correctly. <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
Sunanda 14-Mar-2009 [772]	Thanks......We used to have that, but it created some other problems. I'll have to try to remember what and why :-) And it does not solve the download problem (I know, I tried yesterday).
PeterWood 14-Mar-2009 [773]	I think the root of the problem is that when the Library system was first written, no account was taken of character encoding. As a result, not only is the data encoded as it was when originally submitted but the method of encoding is not even known. Whatever charset is specified in the http header is not going to be correct for all scripts and messages. Using charset=utf-8 seems to cause the fewest problems, though it will, for example, cause many ISO-8859-1 "high bit" characters to be displayed incorrectly.
Chris 14-Mar-2009 [774x3]	Do you have any stats on how many 'high bit' characters are now contained in Library content?
	Or scope? - minimal; limited; too many to be trivial...
	Re. ISO-8859-1 - the most obvious problem is the limitation - 256 chars vs. UCS-1+
Sunanda 14-Mar-2009 [777]	No actual stats. Just from feel: * Scripts -- very few * Posts on the ML -- a few dozen * AltME archive -- no idea
Gabriele 15-Mar-2009 [778]	Sunanda, I can tell you where those chars come from. if your page is set as utf-8, then the script has been uploaded by the browser as utf-8. when you view it in the browser, it shows correctly as utf-8. when you download it, it is still utf-8, but if you view it with something that believes it's latin1 (eg. the rebol 2 console on windows set as latin1), it won't show up correctly.
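The effect Gabriele describes can be reproduced in a few lines. This is only an illustrative sketch in Python rather than REBOL; the sample character is an assumption chosen because ascii-math.r is said elsewhere in the thread to define a one-half symbol.

```python
# Illustration only: the sample character and variable names are assumptions.
original = "\u00bd"  # the one-half sign

utf8_bytes = original.encode("utf-8")      # what a UTF-8 page uploads and stores
as_latin1 = utf8_bytes.decode("latin-1")   # what a Latin-1 viewer shows
as_utf8 = utf8_bytes.decode("utf-8")       # what a UTF-8-aware viewer shows

print(utf8_bytes)  # the single character is stored as two bytes
print(as_latin1)   # two characters: a spurious capital A with circumflex, then the half sign
print(as_utf8)     # one character, as the author intended
```

The bytes never change between upload, view, and download; only the decoder's assumption does, which is why the same file looks right in the browser and wrong in an editor that assumes an 8-bit code page.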
Anton 15-Mar-2009 [779x6]	Sunanda, you're right about that ascii-math.r file. When I clicked the [Download script] link, the browser (konqueror) downloaded and directly opened it with the editor (SciTE). SciTE thought it was 8-bit ascii, and showed the characters incorrectly. All I had to do was change the file encoding from 8-bit to utf-8 and the characters appeared correctly. I guess the editor had no way of determining the encoding, and incorrectly guessed 8-bit ascii.
	The view-script.r html source for the page correctly advertises the encoding as utf-8, so the browser shows it correctly.
	So I'm pretty happy with the way that script was handled by the software here.
	Except for R2 console, of course.
	R3 console seems to handle it better.
	Any other scripts you can find showing problems ?
Sunanda 16-Mar-2009 [785x2]	Thanks Gabriele -- that's a clear explanation, and has helped me work out what is going on. Anton and Gabriele -- I have tried changing the charset we emit on the download to say UTF-8. But that makes little difference. As both of you note, once the file has been saved then (without a MAC-type resource fork) there is no obvious indication of the encoding. And several editors I have tried get it wrong -- thus "revealing" the extra ASCII chars. Not sure what the solution is other than to de-UTF-8 files on download.
Sunanda 16-Mar-2009 [785x2]	Anton -- not yet run a crawl to check for other scripts with high ascii chars.
Anton 16-Mar-2009 [787x3]	Which editors? I think most editors these days allow manually changing the encoding, so developers who notice strange characters can just change it themselves. Maybe it would be helpful to add a rebol.org library script header advertising the encoding (when it is known, and when not). I don't recommend 'de-UTF-8'ing files on download - that's just going to confuse things more, especially when the file is view-script.r'd as utf-8 just beforehand.
	It seems the responsibility lies with the clients to interpret encodings properly. As we move to a unicode world, software that assumes every 8-bit file is in some ASCII encoding should drop off. But until the transition is complete, there's not much we can do about client software guessing wrong like that, except stating the encoding in the script header, in the web page that provides the download link, and by helping confused newbies.
	Are rebol.org uploaders asked to declare the encoding used?
swall 16-Mar-2009 [790]	If the offending downloaded script is executed in Rebol/Core, the extra ASCII chars are also present in the executed code. The script defines ½ to be 0.5. If "help ½" is typed into the console, the result is "Found these words: ½ decimal! 0.5". However, if the script is executed in Rebol/View, the result is "½ is a decimal of value: 0.5". It seems that View handles it correctly, while Core doesn't.
Sunanda 16-Mar-2009 [791]	Thanks guys. Other scripts with the same problem.....there are a couple. About 10% of all scripts have at least one extended ASCII char....But most of them are acceptable in LATIN-1 code page / charset (eg copyright symbol, some accented letters). It's just a very few scripts that use 1/4 and similar symbols that cause the problem. What other editors? Windows NOTEPAD is one example of a common one that gets this wrong.
swall 16-Mar-2009 [792]	Vim and Editor� display the chars incorrectly. Notepad++ shows the chars correctly.
Sunanda 16-Mar-2009 [793]	Of the various editors / word processors I have immediately to hand: -- credit.exe -- [my usual editor] shows incorrect chars, and has no option to switch to UTF-8 -- open office writer -- works fine if you take the UTF-8 option when asked -- ms word -- claims file is corrupt -- word perfect -- makes a complete mess -- R2/View's built in editor ( editor %/c/path to my local copy//ascii-math.r) -- shows incorrect chars
Anton 17-Mar-2009 [794x2]	Vim supports unicode and on my system shows the characters correctly.
Anton 17-Mar-2009 [794x2]	Ok, so there are some editors which don't support unicode, don't guess encoding correctly, or can change encoding only with difficulty. How about this suggestion: if a rebol.org script is known to be UTF-8, then an additional link should appear: [Download as ASCII] download-a-script?script-name=ascii-math.r&encoded-as=8-bit-ascii which transcodes a UTF-8 file to ASCII. Just have to get a conversion function in place for this to work.
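A conversion function of the kind Anton mentions might look roughly like the sketch below. It is written in Python for illustration, the name utf8_to_latin1 is hypothetical, and it assumes that "8-bit ASCII" means ISO-8859-1 (Latin-1), which the thread does not confirm.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to Latin-1 (hypothetical helper).

    Characters with no Latin-1 equivalent are replaced by '?',
    so the conversion is lossy for anything beyond U+00FF.
    """
    text = data.decode("utf-8")  # raises UnicodeDecodeError if input is not valid UTF-8
    return text.encode("latin-1", errors="replace")


# The one-half sign exists in Latin-1, so it survives as a single byte.
half = utf8_to_latin1(b"\xc2\xbd")
# The euro sign does not exist in Latin-1, so it is replaced.
euro = utf8_to_latin1("\u20ac".encode("utf-8"))
```

The lossy fallback is one reason the idea drew objections in the thread: a script that relies on characters outside the target code page cannot be round-tripped.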
Gabriele 17-Mar-2009 [796x2]	Sunanda: given that R2 uses the host's current code page, I think the best way would be for the user to convert the script after downloading it. On Linux or Mac, for example, UTF-8 is perfect for Core scripts as the terminal is UTF-8. On Windows or for View scripts, you'll get the host code page displayed anyway, so the user has to do the conversion. A tool to do that automatically would be nice (I have the code, it will be released soon, but you may need to wait a couple weeks more).
Gabriele 17-Mar-2009 [796x2]	All these troubles go away with R3... but I think it would be nice if R2 recognized UTF-8 and converted it on the fly; we could add a BOM at the beginning to make that easier.
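For readers unfamiliar with the BOM idea, here is a minimal sketch in Python of marking and recognising UTF-8 data with a byte order mark. The helper names are hypothetical, and nothing here reflects how R2 or rebol.org actually behave.

```python
import codecs


def add_bom(data: bytes) -> bytes:
    """Prefix data with the UTF-8 BOM unless it is already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def has_bom(data: bytes) -> bool:
    """Report whether data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


marked = add_bom("x: \u00bd".encode("utf-8"))
# The "utf-8-sig" codec strips a leading BOM while decoding.
decoded = marked.decode("utf-8-sig")
```

A reader that does not know about the BOM sees three extra bytes at the start of the file, which is the kind of side effect that makes the approach contentious.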
Chris 17-Mar-2009 [798]	http://en.wikipedia.org/wiki/Byte_Order_Mark#cite_note-0
swall 17-Mar-2009 [799x2]	Anton: you're right Vim does display the file correctly, although not by default. I guess it helps when you read the manual. :-)
swall 17-Mar-2009 [799x2]	Gabriele: Where is the host code page set? On Windows, is it set differently for View and Core? Is that why the downloaded script works as expected in View but not in Core?
Anton 17-Mar-2009 [801x2]	Yes, use of BOM has its own troubles. I don't think it's a good idea.
Anton 17-Mar-2009 [801x2]	swall, yes, strange, I can't remember configuring vim for utf-8 (I don't use it regularly), but it displayed correctly straight away for me. Must be some dark config option or something...
Sunanda 17-Mar-2009 [803]	Thanks everyone. I think our first step is to add a warning to any download for scripts that contain UTF-8 chars. So, for that I need a function: utf-8?: func [data [string!]] [ ... ] ; returns true or false [and perhaps "not sure" in ambiguous cases] I've done the easy part :-) Can anyone help with the difficult "..." part ? It is not as simple as just looking for ASCII > 127 .... some high ASCII is acceptable as part of, say, ISO 8859-1
PeterWood 17-Mar-2009 [804x2]	I have a function which finds utf-8 multi-byte character sequences in a string. Given the code ranges for multi-byte characters, it would be rare to find such a sequence accidentally.
PeterWood 17-Mar-2009 [804x2]	It's about 65 lines so rather than post it here I will email you a copy.
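Peter's function is not reproduced in this log. As a rough illustration of the same idea, the Python sketch below classifies raw bytes by relying on a strict UTF-8 decoder instead of scanning byte ranges by hand; the function name and the three labels are assumptions, not anything rebol.org uses.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other-8-bit' for raw script bytes."""
    if all(byte < 0x80 for byte in data):
        return "ascii"  # no high bytes, so the encoding question does not arise
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "other-8-bit"  # high bytes that do not form valid UTF-8
    return "utf-8"


print(classify_encoding(b"REBOL []"))                      # plain ASCII
print(classify_encoding("x: \u00bd".encode("utf-8")))      # valid multi-byte sequence
print(classify_encoding("\u00a9 2009".encode("latin-1")))  # lone high byte
```

This matches Peter's observation: Latin-1 text rarely happens to form a valid multi-byte sequence, so a successful strict decode of data containing high bytes is strong, though not conclusive, evidence of UTF-8.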
Sunanda 18-Mar-2009 [806]	Thanks.
Anton 18-Mar-2009 [807]	Ok, so things seem to be proceeding well. The rebol.org Library's support for utf-8 was actually stronger than thought, and what're being added are functions to help deal with legacy client apps which misidentify the file encoding.
PeterWood 18-Mar-2009 [808]	It's not just legacy client apps unless you consider all Rebol/View scripts as legacy apps.
Anton 18-Mar-2009 [809x2]	Yes, I do.
Anton 18-Mar-2009 [809x2]	I understand what you mean, and obviously the definition of "legacy" is a bit fuzzy.
Sunanda 18-Mar-2009 [811]	Using Peter's code (thanks again!), I've made two changes to the download-a-script link: 1. if we find UTF-8 chars in a script, we download it with the HTTP content type charset=utf-8 But that probably makes no practical difference. A downloaded script will be saved by the browser, and then opened by a text editor. The text editor is unlikely to be passed the charset setting. So: 2. Scripts with UTF-8 encoding are downloaded with a few lines of comment at their top. The comment explains the possible problem. Thanks to all for the comments and help with getting things this far.
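The two changes Sunanda describes could be sketched as follows. This is a Python illustration under stated assumptions: the notice wording, the text/plain media type, and the function name are all hypothetical, and the real rebol.org code is not shown in this thread.

```python
from typing import Tuple

# Hypothetical notice; REBOL treats lines starting with ';' as comments.
NOTICE = (
    b"; NOTE: this script contains UTF-8 encoded characters.\n"
    b"; If odd symbols appear, set your editor's encoding to UTF-8.\n"
)


def prepare_download(script: bytes) -> Tuple[str, bytes]:
    """Return (Content-Type value, body) for a script download."""
    if all(byte < 0x80 for byte in script):
        return "text/plain", script  # pure ASCII: nothing to flag
    try:
        script.decode("utf-8")
    except UnicodeDecodeError:
        return "text/plain", script  # high bytes, but not UTF-8: leave untouched
    # Valid UTF-8 with multi-byte characters: declare it and prepend the notice.
    return "text/plain; charset=utf-8", NOTICE + script


content_type, body = prepare_download("x: \u00bd".encode("utf-8"))
```

Because the notice is part of the file itself, it survives being saved to disk, which the HTTP charset does not.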