World: r3wp

[REBOL Syntax] Discussions about REBOL syntax

Andreas
19-Feb-2012
[296]
We are not expanding anything :) We are just describing what syntactical 
rules the REBOL email! literal syntax follows.
BrianH
19-Feb-2012
[297]
I'm a little more concerned with R3 URL syntax though, since in that 
case there are real bugs that have already affected people in real 
cases, and because hypothetically a lot of the bugs are fixable in 
mezzanine code.
Andreas
19-Feb-2012
[298]
And as the email! datatype can be used for many a purpose within 
dialects, it does not necessarily have to match RFC822 (or rather 
5322) exactly.
Steeve
19-Feb-2012
[299]
But the syntax checking can't be corrected with mezzs, right?
Andreas
19-Feb-2012
[300]
(Which would be a relatively complex problem anyway ...

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html)
BrianH
19-Feb-2012
[301x2]
Steeve: For emails, no. For urls, yes.
For url! the syntax checking is mostly done by the DECODE-URL mezzanine. 
We can't change what is recognized as a url! by REBOL, but we can 
change how the data is treated once it's recognized. There are errors 
in escape handling, for instance.
Steeve
19-Feb-2012
[303]
Corrected version, works with R2 and R3:

escape-uri: [#"%" 2 hex-digit]
email-char: complement union charset {%@:} termination-char
email-esc: [email-char | escape-uri]
email-syntax: [
	[
		#":" any [email-esc | #":" ] #"@" any [email-esc | #":" ]
		| not #"<" some email-esc #"@" any email-esc
	]
	termination
]
Andreas
19-Feb-2012
[304]
Ah, was wondering. So we can't change the syntax of url!s in R3 
either, we can only improve/bugfix url! handling.
BrianH
19-Feb-2012
[305]
You'd be surprised at how flexible the syntax of url! is in R3 :)
Andreas
19-Feb-2012
[306]
I don't think I would.
BrianH
19-Feb-2012
[307]
Fair enough. But if you can figure out exactly how MOLD handles escaping 
of urls, that would help narrow down what bugs we can fix in DECODE-URL.
Andreas
19-Feb-2012
[309]
I would be slightly surprised if it is more flexible than string 
syntax, but I somehow doubt that :)
BrianH
19-Feb-2012
[310]
Fewer escaping methods, so no. What's weird is that some kinds of 
string escaping work for the file! type.
Steeve
20-Feb-2012
[311]
It's calm here
Ladislav
20-Feb-2012
[312x2]
committed a couple of 1903-5 additions. You were right that #1905 
is ugly, Steeve.
Caught up with the code posted above.
Steeve
23-Feb-2012
[314x5]
url! syntax (both R2,R3)
I've not created specific charsets, so the rule is more verbose.

- The first char! is the same as for word! (minus "+" and "-")
- Must contain at least one ":"
- "/" is allowed only after the first ":"
- escape-uri is allowed, as in email!

url-syntax: [
	not digit not #"'" not sign word-char
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
Forgot the case when it begins with "."
I should have stuck much closer to the word syntax.
url-syntax: [
	[#"." not digit | not digit not #"'" not sign word-char]
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
hum... still wrong
url-syntax: [
	not [digit | #"'" | #"." digit | sign] word-char
	any [escape-uri | not termination-char not #":" skip]
	#":"
	any [escape-uri | #"/" | not termination-char skip]
]
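
For readers who do not use PARSE, the bullet points above can be approximated in Python. This is only a rough sketch, not a translation of the rule: `TERMINATORS` is an assumed stand-in for the thread's termination-char set (defined elsewhere in the discussion), the word-char test is reduced to the first-character exclusions listed in the bullets, and the names `looks_like_url` and `body_ok` are hypothetical.

```python
import re

# Assumption: a stand-in for the thread's termination-char set, which is
# defined elsewhere in the discussion and not reproduced here.
TERMINATORS = set(' \t\r\n()[]{}";')

# escape-uri: "%" followed by exactly two hex digits.
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def body_ok(text: str, allow_slash: bool) -> bool:
    """Check one side of the first ':' character by character."""
    i = 0
    while i < len(text):
        if ESCAPE.match(text, i):
            i += 3                      # consume a whole %HH escape
        elif text[i] in TERMINATORS:
            return False                # a terminator would end the token
        elif text[i] == "/" and not allow_slash:
            return False                # "/" is only allowed after the first ":"
        else:
            i += 1
    return True


def looks_like_url(text: str) -> bool:
    head, colon, tail = text.partition(":")
    if not colon or not head:
        return False                    # needs a first char and at least one ":"
    if head[0].isdigit() or head[0] in "'+-":
        return False                    # first char: no digit, quote, or sign
    if head[0] == "." and head[1:2].isdigit():
        return False                    # "." followed by a digit is excluded
    return body_ok(head, allow_slash=False) and body_ok(tail, allow_slash=True)


for candidate in ["http://www.rebol.com", "a/b:c", "1abc:def", "no-colon-here"]:
    print(candidate, looks_like_url(candidate))
```

The split on the first ":" mirrors the shape of the rule: one loop before the colon where "/" is rejected, and one after it where "/" is accepted.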
BrianH
23-Feb-2012
[319x2]
That's a good start! I'm really curious about whether urls and emails 
deal with chars over 127, especially in R3. As far as I know, the 
URI standards don't support them directly, but various internationalization 
extensions add recodings for these non-ASCII characters. It would 
be good to know exactly which chars are supported in the data model, 
so we can hack the code that supports that data to match.
When last I checked, R3 considers all chars over 127 to be word-chars. 
It is considered to be none of REBOL's business whether a printer 
or display would show the character, so that even includes the additional 
Unicode space and control characters beyond ASCII. R3 has a binary 
parser, you see.
Steeve
23-Feb-2012
[322]
yeah
BrianH
23-Feb-2012
[323]
Do you know if the REBOL syntax parser (LOAD and TRANSCODE) handles 
the unescaping and puts the decoded data into the url! structure, 
or if that is handled by the DECODE-URL mezzanine code? I'm hoping 
it's handled by the mezzanine, because it's broken in both R2 and 
R3 and mezzanine changes are the only kind we can make at the moment.
Maxim
23-Feb-2012
[324x2]
AFAICT it's part of the datatype... since a space will go back and 
forth when you go to/from URL! and other types like string

(in R2 at least):
>> to-url "gogo://a.com/space here"
== gogo://a.com/space here
>> to-string gogo://a.com/space here
== "gogo://a.com/space here"
or did I get you wrong?
Steeve
23-Feb-2012
[327]
Brian, can you show me what is broken? I'm a bit unsettled by your 
concern.
BrianH
23-Feb-2012
[328x3]
The escape decoding gets done too early. The decoding should not 
be done until after the URI structure has been parsed. If you do 
the escape decoding too early, characters that are escaped so that 
they won't be treated as syntax characters (like /) are treated as 
syntax characters erroneously. This is a bad problem for schemes 
like HTTP or FTP that can use usernames and passwords, because the 
passwords in particular either get corrupted or have inappropriately 
restricted character sets. IDN encoding should be put off until the 
last minute too, once we add support for Unicode to the url handlers 
of HTTP, plus any others that should support that standard.
Given that the URI structure is parsed by DECODE-URL (or the R3 equivalent), 
that means that any unescaping should be done in that function, or 
in the scheme handler itself, not in the native code that runs before 
the mezzanine code is called.
Re-escaping in MOLD is OK though. It's the input that's the problem, 
not the output.
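
The ordering problem described above is not specific to REBOL. As a minimal illustration only, here is a Python sketch that uses the standard library's `urllib.parse` in place of DECODE-URL. The FTP URL is made up: its password is `p/ss@1`, written with the "/" and "@" escaped so they are not read as syntax characters.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL: the password "p/ss@1" is escaped as p%2Fss%401.
raw = "ftp://user:p%2Fss%401@example.com/dir/file.txt"

# Correct order: parse the URI structure first, then unescape each part.
parts = urlsplit(raw)
good_password = unquote(parts.password)
good_host = parts.hostname

# Wrong order: unescape the whole string first, then parse it.
# The decoded "/" and "@" are now treated as syntax characters.
early = urlsplit(unquote(raw))
bad_password = early.password
bad_host = early.hostname

print("parse, then decode:", good_password, good_host)
print("decode, then parse:", bad_password, bad_host)
```

With the second ordering the password no longer comes back as `p/ss@1` and the host is no longer `example.com`, which is the kind of silent corruption being discussed for HTTP and FTP credentials.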
Maxim
23-Feb-2012
[331]
yep... and I've lost hours trying to get some ftp code to work because 
it had strange urls (with passwords)... which the interpreter would 
break all the time. 

At some point you are mystified by what the actual URL being sent 
to the server is.

Once you see what is going on, you can get it to work, but realizing 
that you didn't actually send the url you expected can take quite 
a long time, and so can properly fixing it once you've got a whole app 
expecting/playing with urls.
BrianH
23-Feb-2012
[332]
I've been hoping to fix that. I can load a hot-patch into R2, and 
include a patch in a host kit build in R3 or replace functions from 
%rebol.r if necessary.
Steeve
23-Feb-2012
[333x5]
OK, let me try to summarize our concerns.

The url! and email! syntax is more permissive than a valid URI. It's 
not a problem nor a design flaw.

The escape decoding should not be done at all when decoded as a part 
of an url! or email!. Right, but it will not be corrected until Carl 
does it.

DECODE-URL can be rewritten (used by schemes). The parser is too 
strict and can't deal with complex forms.
Lots of inconsistencies in the file! datatype between R2 and R3.
Escaping notation = huge mess
You can use 2 forms for file!:
in R2
- %"*"  quoted string file, with ^ escape notation allowed
- %*  form, with %ff escape notation allowed
in R3
- quoted string file works fine
- in the %* form, the % escape notation works fine but the ^ char 
messes things up in some cases without issuing an error
In the %* form, R3 should recognise the ^ char as a normal char (not 
an escaping notation) as R2 does.
So for the moment, I think it's better to reject the ^ char in the 
R3 syntax.
Maxim
23-Feb-2012
[338]
yeah, it's surely some leftover copy/paste code from the string loader, 
left in the file loader by error.
BrianH
23-Feb-2012
[339x3]
Worse than being a huge mess, R2 and R3 have different messes. R2 
MOLD fails to encode the % character properly. R3 chokes on the ^ 
character in unquoted mode, and allows both ^ and % escaping in quoted 
mode, and MOLDs the ^ character without encoding it (a problem because 
it chokes on that character). Overall the R2 MOLD problem is worse 
than all of the R3 problems put together because % is a more common 
character in filenames than ^, but both need fixing. I wish it just 
did one escaping method for files, % escaping, or did only % escaping 
for unquoted files and only ^ escaping for quoted files. % escaping 
doesn't support Unicode characters over 255, but no characters like 
that need to be escaped anyways - they can be written directly.
R2 file! syntax may have more problems that I'm not aware of though.
I guess that I just want the escaping behavior Steeve described for 
R2, but with the MOLD of %%25 fix from R3, along with % by itself 
being interpreted as and molding as %"".
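
Why an unescaped % on output matters can be shown with a small round trip. This is a Python sketch, not REBOL: `mold_naive`, `mold_escaped`, and `load` are hypothetical stand-ins for the behaviours being compared, the file name is made up, and `urllib.parse.quote`/`unquote` are used only because they implement the same %HH escaping.

```python
from urllib.parse import quote, unquote


def mold_naive(name: str) -> str:
    # Writes the name unchanged after the "%" sigil, so a literal "%"
    # in the name is left unescaped.
    return "%" + name


def mold_escaped(name: str) -> str:
    # Escapes "%" (and other unsafe characters) as %HH before writing.
    return "%" + quote(name, safe="/")


def load(literal: str) -> str:
    # Drops the leading "%" sigil, then decodes %HH escapes.
    return unquote(literal[1:])


name = "100%25.txt"  # hypothetical file name containing the literal text "%25"

naive_round_trip = load(mold_naive(name))
escaped_round_trip = load(mold_escaped(name))

print(mold_naive(name), "->", naive_round_trip)
print(mold_escaped(name), "->", escaped_round_trip)
```

Only the escaping version returns the original name. In this sketch a file name consisting of a single "%" is written as `%%25`, which matches the form mentioned above.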
Steeve
24-Feb-2012
[342x4]
file-char: complement union charset {%:@} termination-char
file-char/#"/": true	;** #"/" added
file-syntax: [
	#"%" [
		quoted-string
		| any [file-char | escape-uri] ;** fail on ^ char
	] termination
]
Alternative syntax for R2:
file-syntax: [
	#"%" [
		quoted-string
		| some [file-char | escape-uri | #"^^"]  ;** ^ valid char
	] termination
]
Missing rules...
path! refinement! date! time! 
Anything else ???
pair!
Sources
https://github.com/rebolsource/rebol-syntax