World: r3wp
[Parse] Discussion of PARSE dialect
Geomol 13-May-2011 [5864] | And without space, comma should maybe split the text? Like: >> load/all "hello,world!" == [hello world!] |
Maxim 13-May-2011 [5865x2] | yes, I always thought that commas should be removed from decimals, and simply ignored when loaded. in mechanical data, commas are never used for decimals, because apps need to load it back and all software accepts that dots are for decimals and commas for separating lists. why should REBOL try to be different? it's just alienating itself from all the data it could gobble up effortlessly. |
so a comma would be an exact alias for a space, when it's not within a string. | |
Geomol 13-May-2011 [5867x2] | I almost agree. Here we use comma as decimal point. A few countries do that. So all data with money amounts have numbers with comma as decimal point here. |
But it should be possible to take care of those numbers with commas, and ignore all other commas, I think. As we don't ever write 42, but always something like 42,00 if it's a decimal. So if 42, is seen, it can just be read as integer 42 and ignore the comma (if using load/all for example). | |
onetom 13-May-2011 [5869x3] | this is exactly the reason why CSV was a really fucked up idea. commas are there in sentences and multivalued fields, not just numbers. i always use TSV. |
it would make sense to settle w some CSV parser, but not as a default behaviour. i was already surprised that parse handles double quotes too... | |
    >> parse/all {"asd qwe" zxc} none
    == ["asd qwe" " zxc"]
    >> parse/all {"asd qwe" zxc} " "
    == ["asd qwe" "zxc"]
it's nice, but it also means there is no plain "split-by-a-character" function in rebol, which is just as annoying as missing a join-by-a-character | |
Tomc 14-May-2011 [5872] | Although generally happy with the default parse separators, I find it negligent not to permit overriding them. And, as Max finds, block parsing is a rarity when working with real-world data streams. |
Maxim 15-May-2011 [5873x2] | parse/all string none actually is a CSV loader. it's not a split function. I always found this dumb, but it's the way Carl implemented it. |
rule, when given as a string is used to specify the CSV separator. | |
onetom 15-May-2011 [5875] | it should also honor line breaks within strings then |
Maxim 15-May-2011 [5876] | eh, didn't know it didn't ! yeah that sucks. |
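A minimal sketch of the plain split-by-a-character and join-by-a-character helpers onetom finds missing, assuming REBOL 2 string parsing. The names split-by and join-by are hypothetical, not built-ins; unlike parse/all with a string rule, double quotes get no special treatment here.

```rebol
; Hypothetical helper (not a built-in): split TEXT on one character.
split-by: func [
    text [string!]
    delim [char!]
    /local result part non-delim
][
    result: copy []
    non-delim: complement charset form delim
    parse/all text [
        ; every field that is followed by a delimiter
        any [
            copy part any non-delim delim
            (append result any [part copy ""])   ; an empty field copies as NONE in R2
        ]
        ; the final field, which runs to the end of the input
        copy part any non-delim
        (append result any [part copy ""])
    ]
    result
]

; Hypothetical counterpart: join the values of a block with one character.
join-by: func [
    values [block!]
    delim [char!]
    /local result
][
    result: copy ""
    foreach value values [
        append append result form value delim
    ]
    remove back tail result   ; drop the trailing delimiter, if any
    result
]

probe split-by {"asd qwe" zxc} #" "
probe join-by ["asd" "qwe" "zxc"] #","
```

With the input from the console session above, split-by should return three items, the first two still carrying their quote characters, because nothing here treats quotes as field delimiters.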
Sunanda 18-Jun-2011 [5877] | Question on string and block parsing: http://stackoverflow.com/questions/6392533 |
Steeve 18-Jun-2011 [5878x2] | only the second string is checked. Should be: ['apple some [and string! into ["a" some "b" ]]] |
can't post the response | |
Sunanda 18-Jun-2011 [5880] | Want me to post it for you? |
Steeve 18-Jun-2011 [5881] | yep ;-) |
Sunanda 18-Jun-2011 [5882] | Done, thanks. |
onetom 4-Aug-2011 [5883] | Parse (YC S11): A Heroku For Mobile Apps. Great name for a startup... http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/ |
Sunanda 31-Oct-2011 [5884] | Can anyone gift me an efficient R2 'parse solution for this problem (I am assuming 'parse will out-perform any other approach):
SET UP
I have a huge list of HTML named character entities, eg (a very short example):
    named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"] ;; etc
And I have some text that may contain some named entities, eg:
    text: "To send, press the &larr; arrow & then press &crarr;."
PROBLEM
I want to escape every "&" in the text, unless it is part of a named entity, eg (assuming a function called escape-amps):
    probe escape-amps text entities
    == "To send, press the &larr; arrow &amp; then press &crarr;."
TO MAKE IT EASY....
You can assume a different set up for the named-entities block if you want; eg, this may be better for you:
    named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"] ;; etc
Any help on this would be much appreciated! |
Geomol 31-Oct-2011 [5885x3] |
    ne: ["&larr;" | "&crarr;"] ; and the rest of the named entities
    s: "To send, press the &larr; arrow & then press &crarr;."
    parse s [
        any [
            to #"&" [ne | skip mark: (insert mark "amp;")]
        ]
    ]
    s
    == {To send, press the &larr; arrow &amp; then press &crarr;.}
|
It may be faster to drop the & from the entities and change the rule to:
    any [thru #"&" [ne | mark: (insert mark "amp;")]]
| |
That's strange. My 2nd suggestion gives a different result:
    ne: ["larr;" | "crarr;"]
    s: "To send, press the &larr; arrow & then press &crarr;."
    parse s [
        any [
            thru #"&" [ne | mark: (insert mark "amp;")]
        ]
    ]
    s
    == {To send, press the &larr; arrow & amp;then press &crarr;.}
Seems like a bug, or am I just tired? | |
Sunanda 31-Oct-2011 [5888] | Thanks for the quick contributions, geomol. I see a different result too -- a space between the "&" and the "amp" |
Pekr 31-Oct-2011 [5889x2] | not fluent with html escaping, what's the aim? To replace stand-alone #"&" with "&amp;"? |
also remember - parse does not count spaces in. You are better in using parse/all | |
Ladislav 31-Oct-2011 [5891] | 'I want to escape every "&" in the text, unless it is part of a named entity' - just to make sure: if the entity is not in the ENTITIES list, like e.g. &quot; and it is encountered in the given TEXT, what exactly should happen? |
Sunanda 31-Oct-2011 [5892x3] | The aim --- Basically, yes, Petr. |
Ladislav -- if it is not in the list, then I'd like it escaped, please. Think of it as a whitelist of ecceptable named entities. All others are suspect :) | |
ecceptable ==> acceptable | |
Ladislav 31-Oct-2011 [5895] | Yes, OK, I just wanted to know |
Pekr 31-Oct-2011 [5896] | Geomol - your code basically works, no? Just use parse/all:
    >> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
    == false
    >> s
    == {To send, press the &larr; arrow &amp; then press &crarr;.}
|
Ladislav 31-Oct-2011 [5897x6] | I guess, that this should be efficient:
    alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
    escape-amps: func [
        text [string!]
        entities [hash!]
        /local result pos1 pos2
    ][
        result: copy ""
        parse/all text [
            pos1:
            any [
                ; find the next amp
                thru #"&" pos2:
                [
                    ; entity check
                    some alpha pos3: #";" (
                        ; entity candidate
                        unless find entities copy/part pos2 pos3 [
                            ; not an entity
                            insert insert tail result copy/part pos1 pos2 "amp;"
                            pos1: pos2
                        ]
                    )
                    | (
                        ; not an entity
                        insert insert tail result copy/part pos1 pos2 "amp;"
                        pos1: pos2
                    )
                ]
                | (insert tail result pos1) end skip ; no amp found
            ]
        ]
        result
    ]
|
(in place inserts are too slow) | |
(= inefficient) | |
Err: pos3 should be added as a local | |
This is how it works:
    >> probe escape-amps text named-entities
    {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
    == {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
| |
With TEXT defined:
    >> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"
| |
Geomol 31-Oct-2011 [5903] | Pekr, yeah, probably because I left out the /all refinement. Makes sense. |
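The whitespace behaviour Pekr points out can be checked with a tiny comparison. This is a sketch assuming REBOL 2, where string PARSE without /all skips whitespace between rule items; the sample strings are made up for illustration.

```rebol
; Without /all, the space between "a" and "b" is skipped silently.
probe parse "a b" ["a" "b"]             ; true in REBOL 2

; With /all, every character counts, so the space must be matched explicitly.
probe parse/all "a b" ["a" "b"]         ; false
probe parse/all "a b" ["a" " " "b"]     ; true
```

This is consistent with the stray space in Geomol's second result: without /all the space after the "&" was skipped before the position was marked, so "amp;" landed after it.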
Sunanda 31-Oct-2011 [5904] | Thanks Ladislav and Geomol. Both your solutions work with my test data -- that's always a good sign :) I'll do some timing tests with large entity lists ..... But I won't be able to do that for 24 hours. Other approaches still welcome! |
Andreas 31-Oct-2011 [5905] | Two suggestions:
- store your named entities as a hash! (order of magnitude speedup for FIND)
- if you have loooong "words", restrict Ladislav's `some alpha` to the maximum length of a valid entity |
Ladislav 31-Oct-2011 [5906] | This alternative does not use the COPY call, so, it has to be faster:
    alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
    escape-amps: func [
        text [string!]
        entities [hash!]
        /local result pos1 pos2 pos3
    ][
        result: copy ""
        parse/all text [
            pos1:
            any [
                ; find the next amp
                thru #"&" pos2:
                [
                    ; entity check
                    some alpha pos3: #";" (
                        ; entity candidate
                        unless find entities copy/part pos2 pos3 [
                            ; not an entity
                            insert insert/part tail result pos1 pos2 "amp;"
                            pos1: pos2
                        ]
                    )
                    | (
                        ; not an entity
                        insert insert/part tail result pos1 pos2 "amp;"
                        pos1: pos2
                    )
                ]
                | (insert tail result pos1) end skip ; no amp found
            ]
        ]
        result
    ]
|
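Andreas's two suggestions could be sketched roughly as follows, assuming REBOL 2 and the short sample list from Sunanda's set up. The words max-len and entity-name are hypothetical names introduced only for this illustration.

```rebol
; Entity names without the leading "&" and trailing ";", stored as a hash!
; so that FIND does not scan the whole list.
named-entities: make hash! ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]

; Longest name in the whitelist (6 for this short list).
max-len: 0
foreach name named-entities [max-len: max max-len length? name]

alpha: charset [#"a" - #"z" #"A" - #"Z"]

; Bounded replacement for `some alpha`: between 1 and max-len letters.
entity-name: compose [1 (max-len) alpha]

probe parse/all "larr" entity-name          ; short enough to be a candidate
probe parse/all "notanentity" entity-name   ; longer than any entity, so it fails
```

A run of letters longer than the longest whitelisted name cannot be a valid entity, so bounding the repeat lets the rule give up early instead of scanning a long word.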
PeterWood 1-Nov-2011 [5907x3] | Perhaps building a parse rule from the list of entities may be faster if there is a lot of text to process: This assumes the entities are provided as strings in a block.
    escape-amps: func [
        text [string!]
        entities [block!]
    ][
        skip-it: complement charset [#"&"]
        entity: copy []
        foreach ent entities [
            insert entity compose [(ent) |]
        ]
        head remove back tail entity
        parse/all text [
            any [
                entity |
                "&" pos: (insert pos "amp;" pos: skip pos 4) :pos |
                some skip-it
            ]
        ]
        head tex t
    ]
|
That should read head text at the end of the function. | |
Also I feel using skip could be very slow if the text contains a lot of "non-matching text". The "skip-it" technique could also be applied to Ladislav's code. | |
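To see what rule PeterWood's loop builds, the construction can be run on its own. This is a sketch assuming REBOL 2 and a two-entity list written with the leading "&" and trailing ";", as in Sunanda's alternative set up.

```rebol
entity: copy []
foreach ent ["&larr;" "&crarr;"] [
    insert entity compose [(ent) |]   ; each entity goes in front, followed by |
]
head remove back tail entity          ; drop the trailing |
probe entity                          ; ["&crarr;" | "&larr;"]
```

The alternatives come out in reverse order because each one is inserted at the head. That should not matter here: with the trailing ";" included, no entity string is a prefix of another.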
Ladislav 1-Nov-2011 [5910x3] | 'The "skip-it" technique could also be applied to Ladislav's code.' - I do not think so |
(not that it cannot be applied, but, it is not efficient, in my opinion) | |
Regarding the optimizations:
- my code is optimized for the case when there are many entities. (hash! search, as Andreas suggested as well) When the number of entities is small, this optimization does not help
- my code is optimized for the case when the TEXT is large (append is much faster than in place insert), for small texts this optimization does not help | |
Gabriele 1-Nov-2011 [5913] | Sunanda, note that this is already available in the text encoding module: http://www.rebol.it/power-mezz/mezz/text-encoding.html |