World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
2-May-2011
[5840x2]
[set var end] sets the var to none; [copy var end] sets to none in 
R2, the empty string/block in R3; [thru end] doesn't match, so it 
should just get a warning in case the rules were written to expect 
that; [opt end] is definitely legit; perhaps [any end] and [some 
end] should get warnings for R2, but keep in mind that rules like 
[any [end]] and [some [end]] are much more common, have the same 
effect, and are more difficult to detect; [into end] properly triggers 
an error in R2 and R3 because the end is not in a block, while [into 
[end]] is legit and safe.
So you want to allow COPY, SET and OPT. Warn about THRU (because 
of the bug), ANY and SOME, because of R3 compatibility. Trigger an 
error for INTO if its argument rule isn't a block or a word referring 
to a block, but nothing special if that rule is END.
Geomol
4-May-2011
[5842x2]
[any end] and [some end]

As we don't have warnings, I suggest these to produce errors. They 
can produce endless loops, and that should be pointed out in the 
docs, if they don't produce errors.
[opt end]

Yes, it's legit, but what's the point of this combination? At best, 
the programmer knows what she is doing, and the combination does 
nothing but slow the program down. At worst, the programmer 
misinterprets this combination, and since it doesn't produce an error 
or anything, it's a source of confusion. I suggest making it produce 
an error.
[into end]
Produces an error today, so fine.
[set end ...] and [copy end ...]

I wasn't thinking of [set var end], but about setting a var named 
end to something, like [set end integer!]. Problem with this is, 
that now the var, end, can be used and looks exactly like the keyword, 
end, maybe leading to confusion. But after a second thought, maybe 
this being allowed is ok.
[thru end]

Making this produce an error will solve the problem with the confusion 
around what this combination means. And in the first place, it's 
a bad way to produce a 'fail' rule (in R2; in R3 it has the value 
true, and parsing continues). It's slow compared to e.g. [end skip].
These are just suggestions to make a better PARSE. I've learnt, it's 
a good idea to not allow most combinations of keywords in R2 parse. 
Another example:

>> parse [] [opt into ['a]]      
== false
>> bparse [] [opt into ['a]]
** User Error: Invalid argument: into


The PARSE result is wrong, as I see it. My BPARSE produces an error. 
Better?
Ladislav
4-May-2011
[5844x4]
[any end] and [some end]: As we don't have warnings, I suggest these 
to produce errors.


- it is impossible to trigger errors every time an infinite loop 
is encountered
- this case has been discussed and the solution was found already
[opt end] ...I suggest to make it produce an error.

- not reasonable, the rule *is* legitimate, as you noted
What you suggest is just a bunch of exceptions in the behaviour, 
which is always bad
You should rather look up how the "infinite loop problem" when using 
ANY and SOME was solved
Geomol
4-May-2011
[5848x2]
Here: http://www.rebol.com/r3/docs/concepts/parsing-summary.html#section-11

Input position must change. And the solution was to invent a new keyword, WHILE. Hm...
I try to keep it simple.
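
The "input position must change" idea can be modelled outside PARSE. The following is a Python sketch, not REBOL and not the actual R3 implementation: a rule is assumed to be a function taking `(data, pos)` and returning the new position or `None` on failure, and the names `any_rule`, `while_rule`, `end_rule`, `literal` and the `max_steps` cap are all hypothetical. It assumes ANY stops as soon as a match consumes nothing, while a WHILE-style loop performs no such check.

```python
def any_rule(rule, data, pos):
    """ANY with a progress check: stop when a match consumes nothing."""
    while True:
        new_pos = rule(data, pos)
        if new_pos is None or new_pos == pos:
            return pos
        pos = new_pos

def while_rule(rule, data, pos, max_steps=1000):
    """WHILE-style loop: no progress check, so it can spin (capped here)."""
    for _ in range(max_steps):
        new_pos = rule(data, pos)
        if new_pos is None:
            return pos
        pos = new_pos
    raise RuntimeError("rule kept matching without end")

def end_rule(data, pos):
    """Match only at the end of input, consuming nothing."""
    return pos if pos == len(data) else None

def literal(ch):
    """Build a rule that matches the single character `ch`."""
    def rule(data, pos):
        return pos + 1 if pos < len(data) and data[pos] == ch else None
    return rule
```

In this model `any_rule(end_rule, "", 0)` returns at once, whereas `while_rule(end_rule, "", 0)` only stops because of the artificial step cap. The check costs one comparison per iteration and needs no special case for END.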
Ladislav
4-May-2011
[5850]
This is much simpler than your exception:

- actually working, your exception does not
- not slowing down parsing
Geomol
4-May-2011
[5851]
ok :)
Ladislav
4-May-2011
[5852x2]
As to the WHILE keyword: some people may never use it, being content 
with SOME and AND as they work in R3
I mean SOME and ANY
BrianH
4-May-2011
[5854]
If you're going to make a better parse, it might be good to take 
into account the efforts that have already started to improve it 
in R3. The R3 improvements need a little work in some cases, but 
the thought that went into the process is quite valuable.


[set end ...] or [copy end ...]: In R3, using any PARSE keyword (not 
just 'end) in a rule for other reasons triggers an error.
>> parse [a] [set end skip]
** Script error: PARSE - command cannot be used as variable: end

[any end] or [some end]: What Ladislav said.


[opt end]: The point of the combination is [opt [end (do something)]]. 
[opt anything] is no more useless than [opt end]. Don't exclude something 
that has no effect just for that reason. Remember, [none] has no 
effect as well, but it's still valuable for making rules more readable.
onetom
12-May-2011
[5855]
>> parse/all "/docs/rfq/" "/"
== ["" "docs" "rfq"]

shouldn't this be either
["docs" "rfq"]
or
["" "docs" "rfq" ""]
for the sake of consistency?
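
For comparison, the two consistent outcomes can be written down directly. This is a Python illustration of the expected results, not of how PARSE works internally; the function names are made up for the example.

```python
def split_keep_empty(text, sep):
    """Plain split: every separator is a boundary, empty fields included."""
    return text.split(sep)

def split_drop_empty(text, sep):
    """Same split with all empty fields discarded."""
    return [part for part in text.split(sep) if part]
```

`split_keep_empty("/docs/rfq/", "/")` gives `['', 'docs', 'rfq', '']` and `split_drop_empty("/docs/rfq/", "/")` gives `['docs', 'rfq']`. The result quoted above keeps the leading empty field but drops the trailing one, so it matches neither.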
Maxim
12-May-2011
[5856]
yes it should.  :-(
Geomol
13-May-2011
[5857]
Maxim, you asked for a function version of string parse. Was that 
because of situations like this?
Maxim
13-May-2011
[5858]
it's because I do A LOT more parsing on strings than on blocks... 
one of the reasons is that Carl won't allow us to ignore commas in 
string data, so the vast majority of data which could be read directly 
by rebol is incompatible.


this is still one of my pet peeves in rebol. trying to be pure 
sometimes just isn't useful in real life. PARSE is about managing 
external data, and I hate the fact that PARSE isn't trying to be friendly 
with the vast majority of data out there.
Geomol
13-May-2011
[5859]
Do you mean, you want to be able to parse like this?

>> parse [hello, world!] [2 word!]
Maxim
13-May-2011
[5860x2]
it's happened often, yes. less lately, since I'm dealing more with 
XML and less with raw data.
more like:

parse load/all "hello, world!" [2 word!]
Geomol
13-May-2011
[5862x3]
I have wondered sometimes what effects it would have if such commas 
were just ignored. We need commas in numbers, but maybe commas could 
just be ignored beside that.
So do you suggest, load/all "hello, world!" should return [hello 
world!] ? (Notice no comma.)
And without space, comma should maybe split the text? Like:
>> load/all "hello,world!"
== [hello world!]
Maxim
13-May-2011
[5865x2]
yes, I always thought that commas should be removed from decimals, 
and simply ignored when loaded.


in mechanical data, commas are never used for decimals, because 
apps need to load it back and all software accepts that dots are for 
decimals and commas are for separating lists. why should REBOL try 
to be different? it's just alienating itself from all the data it 
could gobble up effortlessly.
so a comma would be an exact alias for a space, when it's not within 
a string.
Geomol
13-May-2011
[5867x2]
I almost agree. Here we use comma as decimal point. A few countries 
do that. So all data with money amounts have numbers with comma 
as decimal point here.
But it should be possible to take care of those numbers with commas, 
and ignore all other commas, I think. As we don't ever write
42,
but always something like
42,00

if it's a decimal. So if 42, is seen, it can just be read as integer 
42 and ignore the comma (if using load/all for example).
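
The proposal can be sketched as a tiny tokenizer. This is a Python model of the suggested behaviour, not of REBOL's actual LOAD: the name `load_all`, the regular expression, and the mapping of words to plain strings are assumptions for illustration, and string literals (where commas would have to be kept) are not handled.

```python
import re

# A comma between digits is a decimal point; any other comma acts like a space.
TOKEN = re.compile(r"\d+,\d+|[^\s,]+")

def load_all(text):
    """Split `text` into values under the comma rule described above."""
    values = []
    for tok in TOKEN.findall(text):
        if re.fullmatch(r"\d+,\d+", tok):
            values.append(float(tok.replace(",", ".")))  # 42,00 -> 42.0
        elif tok.isdigit():
            values.append(int(tok))                      # 42,  -> 42
        else:
            values.append(tok)                           # words stay as text
    return values
```

Both `load_all("hello, world!")` and `load_all("hello,world!")` give `['hello', 'world!']`, and `load_all("42, 42,00")` gives `[42, 42.0]`. An unspaced list such as `1,2,3` stays ambiguous under this rule: the sketch reads it as `[1.2, 3]`.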
onetom
13-May-2011
[5869x3]
this is exactly the reason why CSV was a really fucked up idea. 
commas are there in sentences and multivalued fields, not just numbers.
i always use TSV.
it would make sense to settle w some CSV parser, but not as a default 
behaviour. i was already surprised that parse handles double quotes 
too...
>> parse/all {"asd qwe" zxc} none
== ["asd qwe" " zxc"]

>> parse/all {"asd qwe" zxc} " " 
== ["asd qwe" "zxc"]


it's nice, but it also means there is no plain "split-by-a-character" 
function in rebol, which is just as annoying as missing a join-by-a-character
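
The difference between the two kinds of split, and the matching join, looks like this in Python. It is shown only to make the distinction concrete; the function names are hypothetical, and Python's `csv` module stands in for a quote-aware splitter without claiming to mirror PARSE's exact rules.

```python
import csv

def split_plain(text, sep):
    """Split on every occurrence of `sep`; quotes get no special treatment."""
    return text.split(sep)

def split_quoted(text, sep):
    """CSV-style split: a double-quoted field may contain the separator."""
    return next(csv.reader([text], delimiter=sep))

def join_by(parts, sep):
    """Join strings with a separator between each pair."""
    return sep.join(parts)
```

For the input `"asd qwe" zxc` with a space separator, `split_plain` returns `['"asd', 'qwe"', 'zxc']` while `split_quoted` returns `['asd qwe', 'zxc']`.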
Tomc
14-May-2011
[5872]
Although generally happy with the default parse separators, I find it 
negligent not to permit overriding them. And, like Max finds, block 
parsing is a rarity when working with real-world data streams.
Maxim
15-May-2011
[5873x2]
parse/all string none actually is a CSV loader. it's not a split 
function. I always found this dumb, but it's the way Carl implemented 
it.
rule, when given as a string, is used to specify the CSV separator.
onetom
15-May-2011
[5875]
it should also honor line breaks within strings then
Maxim
15-May-2011
[5876]
eh, didn't know it didn't ! yeah that sucks.
Sunanda
18-Jun-2011
[5877]
Question on string and block parsing:
   http://stackoverflow.com/questions/6392533
Steeve
18-Jun-2011
[5878x2]
only the second string is checked.
Should be:
['apple some [and string! into ["a" some "b" ]]]
can't post the response
Sunanda
18-Jun-2011
[5880]
Want me to post it for you?
Steeve
18-Jun-2011
[5881]
yep ;-)
Sunanda
18-Jun-2011
[5882]
Done, thanks.
onetom
4-Aug-2011
[5883]
Parse (YC S11): A Heroku For Mobile Apps.
Great name for a startup...

http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/
Sunanda
31-Oct-2011
[5884]
Can anyone gift me an efficient R2 'parse solution for this problem 
(I am assuming 'parse will out-perform any other approach):

SET UP

I have a huge list of HTML named character entities, eg (a very short 
example):

       named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"] ;; etc
   
And I have some text that may contain some named entities, eg:

       text: "To send, press the &larr; arrow & then press &crarr;."
   
PROBLEM

I want to escape every "&" in the text, unless it is part of a named 
entity, eg (assuming a function called escape-amps):
        probe escape-amps text named-entities

         == "To send, press the &larr; arrow &amp; then press &crarr;."
  
TO MAKE IT EASY....

You can assume a different set up for the named-entities block 
if you want; eg, this may be better for you:

       named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"] ;; etc
   
Any help on this would be much appreciated!
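
The requirement can be pinned down with a short reference version. It is written in Python with a regular expression, so it is not the R2 'parse solution being asked for; it only states the expected behaviour, with the entities spelled out as `&larr;` and `&crarr;`. The name `escape_amps` mirrors the hypothetical `escape-amps` above, and the entity list is assumed to hold bare names such as "larr".

```python
import re

def escape_amps(text, named_entities):
    """Replace each '&' with '&amp;' unless it starts '&name;' for a listed name."""
    if not named_entities:
        return text.replace("&", "&amp;")
    names = "|".join(re.escape(name) for name in named_entities)
    # Negative lookahead: an '&' NOT followed by a known name and a semicolon.
    bare_amp = re.compile(r"&(?!(?:" + names + r");)")
    return bare_amp.sub("&amp;", text)
```

Called with the sample text and the six names from the first list, it leaves `&larr;` and `&crarr;` alone and turns the lone `&` into `&amp;`. An unknown entity such as `&bogus;` is escaped to `&amp;bogus;`.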
Geomol
31-Oct-2011
[5885x3]
ne: ["&larr;" | "&crarr;"]	; and the rest of the named entities
s: "To send, press the &larr; arrow & then press &crarr;."
parse s [
	any [
		to #"&" [ne | skip mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow &amp; then press &crarr;.}
It may be faster to drop the & from the entities and change the rule 
to:

any [thru #"&" [ne | mark: (insert mark "amp;")]]
That's strange. My 2nd suggestion gives a different result:

ne: ["larr;" | "crarr;"]
s: "To send, press the &larr; arrow & then press &crarr;."
parse s [
	any [
		thru #"&" [ne | mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow & amp;then press &crarr;.}

Seems like a bug, or am I just tired?
Sunanda
31-Oct-2011
[5888]
Thanks for the quick contributions, geomol.

I see a different result too -- a space between the "&" and the "amp"
Pekr
31-Oct-2011
[5889]
not fluent with html escaping, what's the aim? To replace stand-alone 
#"&" with "&amp"?