World: r3wp
[Parse] Discussion of PARSE dialect
onetom 13-May-2011 [5869x3] | this is exactly the reason why CSV was a really fucked up idea. commas are there in sentences and multivalued fields, not just numbers. i always use TSV. |
it would make sense to settle w some CSV parser, but not as a default behaviour. i was already surprised that parse handles double quotes too... | |
    >> parse/all {"asd qwe" zxc} none
    == ["asd qwe" " zxc"]
    >> parse/all {"asd qwe" zxc} " "
    == ["asd qwe" "zxc"]
it's nice, but it also means there is no plain "split-by-a-character" function in rebol, which is just as annoying as missing a join-by-a-character | |
Tomc 14-May-2011 [5872] | Although generally happy with the default parse separators, I find it negligent not to permit overriding them. And, like Max finds, block parsing is a rarity when working with real-world data streams. |
Maxim 15-May-2011 [5873x2] | parse/all string none actually is a CSV loader. it's not a split function. I always found this dumb, but it's the way Carl implemented it. |
rule, when given as a string is used to specify the CSV separator. | |
onetom 15-May-2011 [5875] | it should also honor line breaks within strings then |
Maxim 15-May-2011 [5876] | eh, didn't know it didn't ! yeah that sucks. |
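The missing "split-by-a-character" and "join-by-a-character" functions mentioned above can be approximated with a short PARSE rule. The following is a minimal sketch, assuming REBOL 2; `split-by` and `join-by` are hypothetical helper names introduced here for illustration, not built-ins, and no quote handling is attempted.

```rebol
REBOL [Title: "Plain split and join by one character (sketch)"]

; Hypothetical helper: split STR at every SEP, treating quotes as ordinary text.
split-by: func [
    str [string!] sep [char!]
    /local out piece
][
    out: copy []
    parse/all str [
        ; each pass copies the text before the next SEP, then skips the SEP itself
        any [
            copy piece to sep skip
            (append out any [piece copy ""])   ; R2 sets PIECE to none on an empty match
        ]
        ; whatever follows the last SEP is the final piece
        copy piece to end
        (append out any [piece copy ""])
    ]
    out
]

; Hypothetical helper: join the items of a block, placing SEP between them.
join-by: func [
    items [block!] sep [char! string!]
    /local out
][
    out: copy ""
    forall items [
        append out form first items
        unless tail? next items [append out sep]
    ]
    out
]

probe split-by "a,b,,c" #","          ; == ["a" "b" "" "c"]
probe join-by ["a" "b" "" "c"] #","   ; == "a,b,,c"
```

Unlike `parse/all string " "`, this sketch gives double quotes no special meaning, so a quoted field containing the separator is split like any other text.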
Sunanda 18-Jun-2011 [5877] | Question on string and block parsing: http://stackoverflow.com/questions/6392533 |
Steeve 18-Jun-2011 [5878x2] | only the second string is checked. Should be: ['apple some [and string! into ["a" some "b" ]]] |
can't post the response | |
Sunanda 18-Jun-2011 [5880] | Want me to post it for you? |
Steeve 18-Jun-2011 [5881] | yep ;-) |
Sunanda 18-Jun-2011 [5882] | Done, thanks. |
onetom 4-Aug-2011 [5883] | Parse (YC S11): A Heroku For Mobile Apps. Great name for a startup... http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/ |
Sunanda 31-Oct-2011 [5884] | Can anyone gift me an efficient R2 'parse solution for this problem (I am assuming 'parse will out-perform any other approach):
SET UP
I have a huge list of HTML named character entities, eg (a very short example):
    named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"] ;; etc
And I have some text that may contain some named entities, eg:
    text: "To send, press the &larr; arrow & then press &crarr;."
PROBLEM
I want to escape every "&" in the text, unless it is part of a named entity, eg (assuming a function called escape-amps):
    probe escape-amps text entities
    == "To send, press the &larr; arrow &amp; then press &crarr;."
TO MAKE IT EASY....
You can assume a different set up for the named-entities block if you want; eg, this may be better for you:
    named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"] ;; etc
Any help on this would be much appreciated! |
Geomol 31-Oct-2011 [5885x3] |
    ne: ["&larr;" | "&crarr;"] ; and the rest of the named entities
    s: "To send, press the &larr; arrow & then press &crarr;."
    parse s [
        any [
            to #"&" [ne | skip mark: (insert mark "amp;")]
        ]
    ]
    s
    == {To send, press the &larr; arrow &amp; then press &crarr;.}
|
It may be faster to drop the & from the entities and change the rule to: any [thru #"&" [ne | mark: (insert mark "amp;")]] | |
That's strange. My 2nd suggestion gives a different result:
    ne: ["larr;" | "crarr;"]
    s: "To send, press the &larr; arrow & then press &crarr;."
    parse s [
        any [
            thru #"&" [ne | mark: (insert mark "amp;")]
        ]
    ]
    s
    == {To send, press the &larr; arrow & amp;then press &crarr;.}
Seems like a bug, or am I just tired? | |
Sunanda 31-Oct-2011 [5888] | Thanks for the quick contributions, geomol. I see a different result too -- a space between the "&" and the "amp" |
Pekr 31-Oct-2011 [5889x2] | not fluent with html escaping, what's the aim? To replace stand-alone #"&" with "&amp;"? |
also remember - parse does not count spaces in. You are better off using parse/all | |
Ladislav 31-Oct-2011 [5891] | 'I want to escape every "&" in the text, unless it is part of a named entity' - just to make sure: if the entity is not in the ENTITIES list, like e.g. &quot; and it is encountered in the given TEXT, what exactly should happen? |
Sunanda 31-Oct-2011 [5892x3] | The aim --- Basically, yes, Petr. |
Ladislav -- if it is not in the list, then I'd like it escaped, please. Think of it as a whitelist of ecceptable named entities. All others are suspect :) | |
ecceptable ==> acceptable | |
Ladislav 31-Oct-2011 [5895] | Yes, OK, I just wanted to know |
Pekr 31-Oct-2011 [5896] | Geomol - your code basically works, no? Just use parse/all:
    >> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
    == false
    >> s
    == {To send, press the &larr; arrow &amp; then press &crarr;.}
|
Ladislav 31-Oct-2011 [5897x6] | I guess, that this should be efficient:
    alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
    escape-amps: func [
        text [string!]
        entities [hash!]
        /local result pos1 pos2
    ][
        result: copy ""
        parse/all text [
            pos1:
            any [
                ; find the next amp
                thru #"&" pos2:
                [
                    ; entity check
                    some alpha pos3: #";" (
                        ; entity candidate
                        unless find entities copy/part pos2 pos3 [
                            ; not an entity
                            insert insert tail result copy/part pos1 pos2 "amp;"
                            pos1: pos2
                        ]
                    )
                    | (
                        ; not an entity
                        insert insert tail result copy/part pos1 pos2 "amp;"
                        pos1: pos2
                    )
                ]
                | (insert tail result pos1) end skip ; no amp found
            ]
        ]
        result
    ]
|
(in place inserts are too slow) | |
(= inefficient) | |
Err: pos3 should be added as a local | |
This is how it works:
    >> probe escape-amps text named-entities
    {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
    == {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
| |
With TEXT defined:
    >> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"
| |
Geomol 31-Oct-2011 [5903] | Pekr, yeah, probably because I left out the /all refinement. Makes sense. |
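The difference Geomol and Pekr observed can be reproduced side by side. This is a minimal sketch, assuming REBOL 2 (where plain string PARSE skips whitespace and /all turns that off); the words `without-all` and `with-all` are introduced here only for the comparison, and the expected strings in the comments are the ones reported in the thread.

```rebol
REBOL [Title: "parse versus parse/all on the same rule (R2 sketch)"]

ne: ["larr;" | "crarr;"]
rule: [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
text: "To send, press the &larr; arrow & then press &crarr;."

; plain PARSE: whitespace after the "&" is skipped before MARK is set
without-all: copy text
parse without-all rule

; PARSE/ALL: every character counts, so MARK sits right after the "&"
with-all: copy text
parse/all with-all rule

print without-all   ; reported above: ...arrow & amp;then press...
print with-all      ; reported above: ...arrow &amp; then press...
```

The rule and the input are identical in both runs; only the /all refinement changes where `mark:` lands.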
Sunanda 31-Oct-2011 [5904] | Thanks Ladislav and Geomol. Both your solutions work with my test data -- that's always a good sign :) I'll do some timing tests with large entity lists ..... But I won't be able to do that for 24 hours. Other approaches still welcome! |
Andreas 31-Oct-2011 [5905] | Two suggestions:
- store your named entities as a hash! (order of magnitude speedup for FIND)
- if you have loooong "words", restrict Ladislav's `some alpha` to the maximum length of a valid entity |
Ladislav 31-Oct-2011 [5906] | This alternative does not use the COPY call, so, it has to be faster:
    alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
    escape-amps: func [
        text [string!]
        entities [hash!]
        /local result pos1 pos2 pos3
    ][
        result: copy ""
        parse/all text [
            pos1:
            any [
                ; find the next amp
                thru #"&" pos2:
                [
                    ; entity check
                    some alpha pos3: #";" (
                        ; entity candidate
                        unless find entities copy/part pos2 pos3 [
                            ; not an entity
                            insert insert/part tail result pos1 pos2 "amp;"
                            pos1: pos2
                        ]
                    )
                    | (
                        ; not an entity
                        insert insert/part tail result pos1 pos2 "amp;"
                        pos1: pos2
                    )
                ]
                | (insert tail result pos1) end skip ; no amp found
            ]
        ]
        result
    ]
|
PeterWood 1-Nov-2011 [5907x3] | Perhaps building a parse rule from the list of entities may be faster if there is a lot of text to process: This assumes the entities are provided as strings in a block.
    escape-amps: func [
        text [string!]
        entities [block!]
    ][
        skip-it: complement charset [#"&"]
        entity: copy []
        foreach ent entities [
            insert entity compose [(ent) |]
        ]
        head remove back tail entity
        parse/all text [
            any [
                entity
                | "&" pos: (insert pos "amp;" pos: skip pos 4) :pos
                | some skip-it
            ]
        ]
        head tex t
    ]
|
That should read head text at the end of the function. | |
Also I feel using skip could be very slow if the text contains a lot of "non-matching text". The "skip-it" technique could also be applied to Ladislav's code. | |
Ladislav 1-Nov-2011 [5910x3] | 'The "skip-it" technique could also be applied to Ladislav's code.' - I do not think so |
(not that it cannot be applied, but, it is not efficient, in my opinion) | |
Regarding the optimizations:
- my code is optimized for the case when there are many entities (hash! search, as Andreas suggested as well). When the number of entities is small, this optimization does not help
- my code is optimized for the case when the TEXT is large (append is much faster than in place insert), for small texts this optimization does not help | |
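The two optimizations described here can be tried in isolation. This is a minimal sketch, assuming REBOL 2 (hash! is an R2 datatype); `entity-hash`, `src`, `pos` and `result` are illustrative names introduced here, not taken from the posted functions.

```rebol
REBOL [Title: "hash! lookup and append-style building (R2 sketch)"]

named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]

; 1. whitelist lookup: FIND on a hash! uses hashing rather than a linear scan
entity-hash: make hash! named-entities
probe found? find entity-hash "larr"   ; true
probe found? find entity-hash "susp"   ; false

; 2. build the output by appending to a fresh string, leaving the source untouched
src: "abc&def"
pos: find src "&"
result: copy ""
insert/part tail result src pos        ; appends "abc" without an intermediate COPY
insert tail result "&amp;"             ; escaped ampersand
insert tail result next pos            ; appends the remainder "def"
probe result                           ; "abc&amp;def"
```

Because the posted `escape-amps` declares `entities [hash!]`, a plain block such as `named-entities` has to be converted first, for example `escape-amps text make hash! named-entities`.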
Gabriele 1-Nov-2011 [5913] | Sunanda, note that this is already available in the text encoding module: http://www.rebol.it/power-mezz/mezz/text-encoding.html |
Sunanda 1-Nov-2011 [5914x3] | Wow -- thanks Gabriele. For me, your powermezz is a much overlooked gem. I fear I have, in effect, badly implemented chunks of your functionality over the past few months while I've worked on an application that takes unconstrained text and constrains it to look okay in a web page and when printed via LaTeX. I should have read the documentation first! |
I've put aside looking at the powermezz for now, and simply decided to use one of the three case-specific solutions offered here. I made some tweaks to ensure the comparisons I was making were fair (and met a previously unstated condition).
-- each in a func
-- each works case sensitively (as previously unstated)
-- use the complete entity set as defined by the W3C
-- changed Ladislav's Charset as some named entities have digits in their names
-- moved Peter's set-up of his entity list out of the function and into one-off init code.
It's been a fun hour of twiddling other people's code.....If you want your modified code -- please just ask. Timing results next ..... | |
My test data was heavily weighted towards the live conditions I expect to encounter (average text length 2000. Most texts are unlikely to have more than 1 named entity). All three scripts produced the same results -- so top marks for meeting the spec! Under my test conditions, Ladislav was fastest, followed by Geomol, followed by Peter. Other test conditions changed those rankings....So nothing is absolute. Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash! Thanks for contributing these solutions -- I've enjoyed looking at your code and marvelling at the different approaches REBOL makes possible. | |
Ladislav 1-Nov-2011 [5917] | 'Using a Hash! contributed a lot to Ladislav's speed -- when I tried it as a Block! it was only slightly faster than Geomol's.....What a pity R3 removes hash!' - no problem, in R3 you can use map! |
Sunanda 1-Nov-2011 [5918] | That's true, but map! is a bit awkward for just looking up an item in a list.....Map! is optimised for retrieving a value associated with a key. |