r3wp [groups: 83 posts: 189283]

World: r3wp

[Red] Red language group

Kaj
11-Oct-2011
[3538]
I left them in for a while to make the separation with the optionally 
following layout parameters clearer, but in the latest version I 
reconsidered
Dockimbel
11-Oct-2011
[3539x2]
Does anyone know where to find exhaustive lists of invalid UTF-8 encoding 
ranges?
I am calculating them by hand, so I might miss some.
Andreas
11-Oct-2011
[3541x3]
C0, C1, F5-FF must never occur in UTF-8.
80-BF are continuation bytes.
Is that what you are after?
Dockimbel
11-Oct-2011
[3544]
Yes, but I was searching for an exhaustive list of rules.
Andreas
11-Oct-2011
[3545x2]
RFC3629 has a (non-normative) ABNF, if I remember correctly.
http://tools.ietf.org/html/rfc3629#section-4
Dockimbel
11-Oct-2011
[3547x3]
Here are the parse rules I came up with so far: https://gist.github.com/1278718
I think I am missing some overlong combinations.
I am also unsure of the valid range of the 2nd byte in the four-byte 
encoding.
Andreas
11-Oct-2011
[3550]
one-byte-codepoint: charset [#"^(00)" - #"^(7F)"]
Dockimbel
11-Oct-2011
[3551]
Right, fixing that.
Andreas
11-Oct-2011
[3552x4]
tail-bytes: charset [#"^(80)" - #"^(BF)"]

two-byte-codepoint: reduce [charset [#"^(C2)" - #"^(DF)"] tail-bytes]
tail-bytes == cont-byte
three-byte-codepoint: reduce [
  #"^(E0)" charset [#"^(A0)" - #"^(BF)"] cont-byte
| charset [#"^(E1)" - #"^(EC)"] 2 cont-byte
| #"^(ED)" charset [#"^(80)" - #"^(9F)"] cont-byte
| charset [#"^(EE)" - #"^(EF)"] 2 cont-byte 
]
four-byte-codepoint: reduce [
  #"^(F0)" charset [#"^(90)" - #"^(BF)"] 2 cont-byte
| charset [#"^(F1)" - #"^(F3)"] 3 cont-byte
| #"^(F4)" charset [#"^(80)" - #"^(8F)"] 2 cont-byte
]
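
(Editor's sketch, not from the thread: the byte ranges Andreas lists match the ABNF in RFC 3629 section 4, and they can be checked with a small table-driven validator. The version below is a minimal Python illustration under that assumption; the function name is_valid_utf8 is hypothetical and is not part of Red or REBOL.)

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 per the RFC 3629 ABNF."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                      # one-byte codepoint (ASCII)
            i += 1
            continue
        # For each lead byte: allowed range of the 2nd byte, plus the number
        # of additional plain continuation bytes (80-BF) that must follow.
        if 0xC2 <= lead <= 0xDF:
            lo, hi, extra = 0x80, 0xBF, 0
        elif lead == 0xE0:                    # A0-BF excludes overlong forms
            lo, hi, extra = 0xA0, 0xBF, 1
        elif 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
            lo, hi, extra = 0x80, 0xBF, 1
        elif lead == 0xED:                    # 80-9F excludes surrogates
            lo, hi, extra = 0x80, 0x9F, 1
        elif lead == 0xF0:                    # 90-BF excludes overlong forms
            lo, hi, extra = 0x90, 0xBF, 2
        elif 0xF1 <= lead <= 0xF3:
            lo, hi, extra = 0x80, 0xBF, 2
        elif lead == 0xF4:                    # 80-8F caps at U+10FFFF
            lo, hi, extra = 0x80, 0x8F, 2
        else:
            return False                      # stray 80-BF, C0, C1, F5-FF
        end = i + 2 + extra
        if end > n:
            return False                      # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False
        if any(not 0x80 <= b <= 0xBF for b in data[i + 2:end]):
            return False
        i = end
    return True
```

(For instance, is_valid_utf8(b"\xc0\x80") and is_valid_utf8(b"\xed\xa0\x80") both return False, while the encoded euro sign b"\xe2\x82\xac" passes.)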
Dockimbel
11-Oct-2011
[3556x2]
Thanks, I see that everything I need is in http://tools.ietf.org/html/rfc3629#section-4
BrianH: what was the CureCode ticket where you've summed up the word! 
Unicode parsing rules?
BrianH
11-Oct-2011
[3558x3]
http://issue.cc/r3/1302 for the ASCII range in R3. The R3 parser 
tends to be excessively forgiving outside the ASCII range, accepting 
too much, though I haven't done the thorough test.
You might also consider looking at the source of INVALID-UTF? in 
R2, which is MIT licensed from R2/Forward.
It would still be a good idea to review the Unicode standard to determine 
which of the characters should be treated as spaces, but that would 
still be a problem for R3 because all of the delimiters it currently 
supports are one byte in UTF-8 for efficiency. If other delimiters 
are supported, R3's parser will be much slower.
Dockimbel
12-Oct-2011
[3561]
Thanks. For whitespaces, I have already taken higher Unicode codepoints 
into account (from this list: http://en.wikipedia.org/wiki/Whitespace_character).
Andreas
12-Oct-2011
[3562x2]
Completely forgot about INVALID-UTF? :)
After having a quick glance at it, at least for utf8 it's quite basic 
and does not take any of the above overlong combinations into account.
BrianH
12-Oct-2011
[3564x4]
The policy on overlong combinations was set by R3, where there isn't 
as much need to flag them. Overlong combinations are a problem in 
UTF-8 for code that works on the binary encoding directly, instead 
of translating to Unicode first. The only function in R3 that operates 
that way is TRANSCODE, so as long as it doesn't choke on overlong 
combinations there is no problem with them being allowed. It might 
be good to add a /strict option to INVALID-UTF? though to make it 
check for them.
Speaking of which, I don't think anyone has tried overlong combinations 
with TRANSCODE yet. We should look into that.
(I mean, aside from Carl possibly doing so internally.)
As long as they are interpreted exactly the same as the short encoding 
of the value, no problems.
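
(Editor's sketch, not from the thread: a small Python illustration of why overlong forms matter to code that scans the raw bytes. The helper naive_decode_2byte is hypothetical; it only masks bits, the way a lenient decoder might, and does not represent how TRANSCODE or INVALID-UTF? work.)

```python
def naive_decode_2byte(pair: bytes) -> int:
    """Decode a two-byte sequence by bit masking only, with no overlong check."""
    lead, cont = pair
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# C0 AF is an overlong two-byte encoding of U+002F ("/").
overlong_slash = bytes([0xC0, 0xAF])

# A lenient decoder maps it to the same codepoint as the one-byte form...
decoded = naive_decode_2byte(overlong_slash)

# ...but a byte-level scan for 0x2F does not see it.
seen_by_byte_scan = b"/" in overlong_slash

# Python's own strict decoder rejects the sequence outright.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = False or True
except UnicodeDecodeError:
    strict_accepts = False

print(hex(decoded), seen_by_byte_scan, strict_accepts)
```

(The mismatch between the byte scan and the decoded value is the hazard: a check done on the bytes can be bypassed if a later stage decodes leniently.)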
Andreas
12-Oct-2011
[3568]
(Let's switch to !REBOL3.)
Kaj
13-Oct-2011
[3569x3]
Implemented GTK table layouts
For example:
table [2 2  5 5
	button "X"  button "O"
	button "O"  button "X"
]
amacleod
18-Oct-2011
[3572]
Kaj, I love what you are doing. Just curious if you looked at Qt, 
it seems to be available on more platforms, phone-wise, which is a major 
plus...
Is it more difficult to implement?
Kaj
18-Oct-2011
[3573x12]
Thanks. As it happens, I looked into binding Qt last week
I never liked either GTK or Qt. The reason I'm binding one anyway 
is that we want native platform user interfaces for Red. Linux and 
BSD don't have a native interface, but if you have to appoint one, 
you have to appoint two: GTK and Qt
The reason I chose GTK is that it's written in C, which makes it 
natural to bind to Red/System. Almost all other open source GUI toolkits, 
including Qt, are written in C++, which is much more problematic 
to bind
Basically, to bind a C++ library, you have to write two bindings: 
one from C++ to C, and then one from C to your target language. This 
is because only C++ knows what C++ objects mean, and C++ claims that 
its object classes are a program's interface
So you can write a binding from Red/System to a C library purely 
in Red/System, while a C++ binding would also require writing an 
extra bridge in C++. Even after this initial hurdle, apart from the 
maintenance, a remaining problem would be that the C++ bridge needs 
a traditional development environment, so the wonderful ability 
of Red to crosscompile to anything would be negated for a large part. 
Basically the same problem that REBOL 3 extensions have
Intrepid readers will note that one of the libraries I bind, 0MQ, 
is written in C++. However, the 0MQ designers wisely decided to define 
the interface in C, so that all languages can bind to it
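
(Editor's sketch, not from the thread: the point that a C interface can be bound from the target language alone, with no compiled bridge, can be illustrated in Python with the standard ctypes module. This assumes CPython, whose runtime exports the plain C function Py_GetVersion; it is only an analogy for a Red/System binding, not Kaj's code.)

```python
import ctypes
import sys

# Py_GetVersion is an ordinary C function exported by the running CPython
# interpreter. Declaring its signature is the entire "binding": no C or C++
# glue code has to be compiled.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.argtypes = []
get_version.restype = ctypes.c_char_p

version = get_version().decode("utf-8")
print(version)
```

(A C++ class has no such directly callable symbol with a stable, language-neutral signature, which is why a C++ library needs an extra bridge layer first.)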
For generic libraries, binding tools exist, such as SWIG and SIP. 
Unfortunately, they don't solve the problem but only assist a little, 
and the result is very bloated
For a few years now, Qt and KDE have used a new tool: Smoke. It's more automated, 
so it looks like it can generate a C interface without writing C++ 
yourself. However, the cross-compilation problem still exists. Because 
the tool is so generic, the bindings it generates are also quite 
bloated and probably otherwise inefficient. In any case, it's just 
the first step for a Red binding, because I put abstraction layers 
over my bindings that are much more REBOL like
Another consideration for me is that GTK is more fragmented, but 
that also makes it more modular than Qt. From the viewpoint of Syllable, 
it makes it harder to integrate completely, but easier to integrate 
just some selected pieces, which is what I am after
While it's true that Qt is more portable than GTK, I'm not sure it's 
significant. The only phone platform I know that uses it is Meego, 
but Nokia has sidetracked that. Samsung's own phone platform in Bada, 
for example, uses GTK. There's also recent DirectFB support in GTK, 
while the Qt port to DirectFB is obsolete
So I chose GTK to support as the "native" GUI for Linux and BSD. 
It can also run on several other platforms until we have native support 
for those
I'm not planning to fragment the effort by doing a Qt binding as 
well, but I did evaluate it, and the decision could change if I were 
funded for it
amacleod
18-Oct-2011
[3585]
Interesting stuff...thanks
Endo
19-Oct-2011
[3586]
Thank you for the good explanations Kaj.
Gabriele
19-Oct-2011
[3587]
TL;DR: the creators of C++ and C++ compilers decided that the world 
was not complicated enough, so they worked hard to make it more complicated.