World: r3wp

[Parse] Discussion of PARSE dialect

onetom 13-May-2011 [5869x3]	this is exactly the reason why CSV was a really fucked up idea. commas are there in sentences and multivalued fields, not just numbers. i always use TSV.
	it would make sense to settle w some CSV parser, but not as a default behaviour. i was already surprised that parse handles double quotes too...
	>> parse/all {"asd qwe" zxc} none
	== ["asd qwe" " zxc"]
	>> parse/all {"asd qwe" zxc} " "
	== ["asd qwe" "zxc"]
	it's nice, but it also means there is no plain "split-by-a-character" function in rebol, which is just as annoying as missing a join-by-a-character
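A plain split of the kind onetom is asking for can be sketched with PARSE itself. This is only an illustration for R2: `split-by` is a hypothetical helper name, not a REBOL built-in, and it deliberately does no quote handling at all.

```rebol
; split-by is a hypothetical helper name, not a REBOL built-in.
split-by: func [
    "Split STR at every DELIM; quotes get no special treatment"
    str [string!] delim [char!]
    /local out part
][
    out: copy []
    parse/all str [
        any [
            copy part to delim               ; PART may be NONE for an empty field
            (append out any [part copy ""])
            skip                             ; step over the delimiter itself
        ]
        copy part to end                     ; whatever follows the last delimiter
        (append out any [part copy ""])
    ]
    out
]
```

With this sketch, `split-by {"asd qwe" zxc} #" "` should return three items, with the double quotes left in place in the first two, instead of the two items that `parse/all ... " "` returns.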
Tomc 14-May-2011 [5872]	Although generally happy with the default parse separators, I find it negligent not to permit overriding them. And like Max finds, block parsing is a rarity when working with real-world data streams.
Maxim 15-May-2011 [5873x2]	parse/all string none actually is a CSV loader. it's not a split function. I always found this dumb, but it's the way Carl implemented it.
	the rule, when given as a string, is used to specify the CSV separator.
onetom 15-May-2011 [5875]	it should also honor line breaks within strings then
Maxim 15-May-2011 [5876]	eh, didn't know it didn't ! yeah that sucks.
Sunanda 18-Jun-2011 [5877]	Question on string and block parsing: http://stackoverflow.com/questions/6392533
Steeve 18-Jun-2011 [5878x2]	only the second string is checked. Should be: ['apple some [and string! into ["a" some "b" ]]]
	can't post the response
Sunanda 18-Jun-2011 [5880]	Want me to post it for you?
Steeve 18-Jun-2011 [5881]	yep ;-)
Sunanda 18-Jun-2011 [5882]	Done, thanks.
onetom 4-Aug-2011 [5883]	Parse (YC S11): A Heroku For Mobile Apps. Great name for a startup... http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/
Sunanda 31-Oct-2011 [5884]	Can anyone gift me an efficient R2 'parse solution for this problem (I am assuming 'parse will out-perform any other approach):
	SET UP
	I have a huge list of HTML named character entities, eg (a very short example):
	    named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"] ;; etc
	And I have some text that may contain some named entities, eg:
	    text: "To send, press the &larr; arrow & then press &crarr;."
	PROBLEM
	I want to escape every "&" in the text, unless it is part of a named entity, eg (assuming a function called escape-amps):
	    probe escape-amps text entities
	    == "To send, press the &larr; arrow &amp; then press &crarr;."
	TO MAKE IT EASY....
	You can assume a different set up for the named-entities block if you want; eg, this may be better for you:
	    named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"] ;; etc
	Any help on this would be much appreciated!
Geomol 31-Oct-2011 [5885x3]	ne: ["&larr;" | "&crarr;"] ; and the rest of the named entities
	s: "To send, press the &larr; arrow & then press &crarr;."
	parse s [
	    any [
	        to #"&" [ne | skip mark: (insert mark "amp;")]
	    ]
	]
	s
	== {To send, press the &larr; arrow &amp; then press &crarr;.}
	It may be faster to drop the & from the entities and change the rule to:
	    any [thru #"&" [ne | mark: (insert mark "amp;")]]
	That's strange. My 2nd suggestion gives a different result:
	ne: ["larr;" | "crarr;"]
	s: "To send, press the &larr; arrow & then press &crarr;."
	parse s [
	    any [
	        thru #"&" [ne | mark: (insert mark "amp;")]
	    ]
	]
	s
	== {To send, press the &larr; arrow & amp;then press &crarr;.}
	Seems like a bug, or am I just tired?
Sunanda 31-Oct-2011 [5888]	Thanks for the quick contributions, geomol. I see a different result too -- a space between the "&" and the "amp"
Pekr 31-Oct-2011 [5889x2]	not fluent with html escaping, what's the aim? To replace stand-alone #"&" with "&amp"?
	also remember - without /all, parse skips over spaces. You are better off using parse/all
Ladislav 31-Oct-2011 [5891]	'I want to escape every "&" in the text, unless it is part of a named entity' - just to make sure: if the entity is not in the ENTITIES list, like e.g. &quot; and it is encountered in the given TEXT, what exactly should happen?
Sunanda 31-Oct-2011 [5892x3]	The aim --- Basically, yes, Petr.
	Ladislav -- if it is not in the list, then I'd like it escaped, please. Think of it as a whitelist of ecceptable named entities. All others are suspect :)
	ecceptable ==> acceptable
Ladislav 31-Oct-2011 [5895]	Yes, OK, I just wanted to know
Pekr 31-Oct-2011 [5896]	Geomol - your code basically works, no? Just use parse/all:
	>> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
	== false
	>> s
	== {To send, press the &larr; arrow &amp; then press &crarr;.}
Ladislav 31-Oct-2011 [5897x6]	I guess, that this should be efficient:
	alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
	escape-amps: func [
	    text [string!]
	    entities [hash!]
	    /local result pos1 pos2
	][
	    result: copy ""
	    parse/all text [
	        pos1:
	        any [
	            ; find the next amp
	            thru #"&"
	            pos2:
	            [
	                ; entity check
	                some alpha pos3: #";" (
	                    ; entity candidate
	                    unless find entities copy/part pos2 pos3 [
	                        ; not an entity
	                        insert insert tail result copy/part pos1 pos2 "amp;"
	                        pos1: pos2
	                    ]
	                )
	                | (
	                    ; not an entity
	                    insert insert tail result copy/part pos1 pos2 "amp;"
	                    pos1: pos2
	                )
	            ]
	            | (insert tail result pos1) end skip ; no amp found
	        ]
	    ]
	    result
	]
	(in place inserts are too slow)
	(= inefficient)
	Err: pos3 should be added as a local
	This is how it works:
	>> probe escape-amps text named-entities
	{To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
	== {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
	With TEXT defined:
	>> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"
Geomol 31-Oct-2011 [5903]	Pekr, yeah, probably because I left out the /all refinement. Makes sense.
Sunanda 31-Oct-2011 [5904]	Thanks Ladislav and Geomol. Both your solutions work with my test data -- that's always a good sign :) I'll do some timing tests with large entity lists ..... But I won't be able to do that for 24 hours. Other approaches still welcome!
Andreas 31-Oct-2011 [5905]	Two suggestions:
	- store your named entities as a hash! (order of magnitude speedup for FIND)
	- if you have loooong "words", restrict Ladislav's `some alpha` to the maximum length of a valid entity
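Both suggestions are small changes in R2. The following is a minimal sketch, assuming the entity names are stored without the leading "&" and trailing ";", and assuming a made-up upper bound of 8 characters per name; `entity-name` is a hypothetical rule name and the real bound should come from the longest name in the actual list.

```rebol
; Suggestion 1: keep the names in a hash! so that FIND is a hashed lookup.
named-entities: make hash! ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]

alpha: charset [#"a" - #"z" #"A" - #"Z"]

; Suggestion 2: a bounded repeat instead of SOME ALPHA.
; The 8 is an assumed maximum name length, not a figure from the discussion.
entity-name: [1 8 alpha]
```

With these definitions, `parse/all "crarr;" [entity-name #";"]` should succeed, while a run of more than 8 letters followed by ";" should fail the candidate check early instead of scanning the whole word.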
Ladislav 31-Oct-2011 [5906]	This alternative does not use the COPY call, so, it has to be faster:
	alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
	escape-amps: func [
	    text [string!]
	    entities [hash!]
	    /local result pos1 pos2 pos3
	][
	    result: copy ""
	    parse/all text [
	        pos1:
	        any [
	            ; find the next amp
	            thru #"&"
	            pos2:
	            [
	                ; entity check
	                some alpha pos3: #";" (
	                    ; entity candidate
	                    unless find entities copy/part pos2 pos3 [
	                        ; not an entity
	                        insert insert/part tail result pos1 pos2 "amp;"
	                        pos1: pos2
	                    ]
	                )
	                | (
	                    ; not an entity
	                    insert insert/part tail result pos1 pos2 "amp;"
	                    pos1: pos2
	                )
	            ]
	            | (insert tail result pos1) end skip ; no amp found
	        ]
	    ]
	    result
	]
PeterWood 1-Nov-2011 [5907x3]	Perhaps building a parse rule from the list of entities may be faster if there is a lot of text to process. This assumes the entities are provided as strings in a block.
	escape-amps: func [
	    text [string!]
	    entities [block!]
	][
	    skip-it: complement charset [#"&"]
	    entity: copy []
	    foreach ent entities [
	        insert entity compose [(ent) |]
	    ]
	    head remove back tail entity
	    parse/all text [
	        any [
	            entity | "&" pos: (insert pos "amp;" pos: skip pos 4) :pos | some skip-it
	        ]
	    ]
	    head tex t
	]
	That should read head text at the end of the function.
	Also I feel using skip could be very slow if the text contains a lot of "non-matching text". The "skip-it" technique could also be applied to Ladislav's code.
Ladislav 1-Nov-2011 [5910x3]	'The "skip-it" technique could also be applied to Ladislav's code.' - I do not think so
	(not that it cannot be applied, but, it is not efficient, in my opinion)
	Regarding the optimizations:
	- my code is optimized for the case when there are many entities (hash! search, as Andreas suggested as well). When the number of entities is small, this optimization does not help
	- my code is optimized for the case when the TEXT is large (append is much faster than in place insert), for small texts this optimization does not help
Gabriele 1-Nov-2011 [5913]	Sunanda, note that this is already available in the text encoding module: http://www.rebol.it/power-mezz/mezz/text-encoding.html
Sunanda 1-Nov-2011 [5914x3]	Wow -- thanks Gabriele. For me, your powermezz is a much overlooked gem. I fear I have, in effect, badly implemented chunks of your functionality over the past few months while I've worked on an application that takes unconstrained text and constrains it to look okay in a web page and when printed via LaTeX. I should have read the documentation first!
	I've put aside looking at the powermezz for now, and simply decided to use one of the three case-specific solutions offered here. I made some tweaks to ensure the comparisons I was making were fair (and met a previously unstated condition):
	-- each in a func
	-- each works case sensitively (as previously unstated)
	-- use the complete entity set as defined by the W3C
	-- changed Ladislav's Charset as some named entities have digits in their names
	-- moved Peter's set-up of his entity list out of the function and into one-off init code.
	It's been a fun hour of twiddling other people's code.....If you want your modified code -- please just ask. Timing results next .....
	My test data was heavily weighted towards the live conditions I expect to encounter (average text length 2000. Most texts are unlikely to have more than 1 named entity). All three scripts produced the same results -- so top marks for meeting the spec! Under my test conditions, Ladislav was fastest, followed by Geomol, followed by Peter. Other test conditions changed those rankings....So nothing is absolute. Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash! Thanks for contributing these solutions -- I've enjoyed looking at your code and marvelling at the different approaches REBOL makes possible.
Ladislav 1-Nov-2011 [5917]	'Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash!' - no problem, in R3 you can use map!
Sunanda 1-Nov-2011 [5918]	That's true, but map! is a bit awkward for just looking up an item in a list.....Map! is optimised for retrieving a value associated with a key.
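The awkwardness Sunanda mentions shows up as a dummy value attached to every key. A minimal R3 sketch, assuming string keys and using TRUE as a placeholder value that is never read; `entity-map` and `known-entity?` are hypothetical names, and how map! treats letter case for string keys is not checked here.

```rebol
; R3 only: every entity name needs a value, even though only the key matters.
entity-map: make map! ["nbsp" true "cent" true "larr" true "crarr" true]

; Hypothetical helper: membership test by looking the key up.
known-entity?: func [name [string!]] [
    not none? select entity-map name
]
```

Compared with FIND on an R2 hash!, the lookup is the same one-liner, but the list itself has to be written (or generated) as key/value pairs.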