World: r3wp

[Parse] Discussion of PARSE dialect

onetom
13-May-2011
[5869x3]
this is exactly the reason why CSV was a really fucked up idea. 
commas are there in sentences and multivalued fields, not just numbers.
i always use TSV.
it would make sense to settle with some CSV parser, but not as a default 
behaviour. i was already surprised that parse handles double quotes 
too...
>> parse/all {"asd qwe" zxc} none
== ["asd qwe" " zxc"]

>> parse/all {"asd qwe" zxc} " " 
== ["asd qwe" "zxc"]


it's nice, but it also means there is no plain "split-by-a-character" 
function in rebol, which is just as annoying as the missing join-by-a-character
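(For illustration only -- not something from the thread: a plain split-by-a-character helper for R2 can be written without PARSE's quote handling. The name split-char is made up.)

split-char: func [text [string!] sep [char!] /local result mark][
	result: copy []
	while [mark: find text sep][
		; copy everything up to the separator, then step past it
		append result copy/part text mark
		text: next mark
	]
	; whatever is left after the last separator
	append result copy text
]

>> split-char "John,Smith,42" #","
== ["John" "Smith" "42"]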
Tomc
14-May-2011
[5872]
Although generally happy with the default parse separators, I find it 
negligent to not permit overriding them. And like Max finds, block 
parsing is a rarity when working with real-world data streams.
Maxim
15-May-2011
[5873x2]
parse/all string none actually is a CSV loader. It's not a split 
function. I always found this dumb, but it's the way Carl implemented 
it.
The rule, when given as a string, is used to specify the CSV separator.
onetom
15-May-2011
[5875]
it should also honor line breaks within strings then
Maxim
15-May-2011
[5876]
eh, didn't know it didn't ! yeah that sucks.
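(An illustration of the behaviour Maxim describes; the output below is what that description implies rather than a verified console transcript.)

>> parse/all {"Smith, John",42} ","
== ["Smith, John" "42"]

As onetom notes, a line break inside the quoted field is not honoured, so genuinely multi-line CSV records would still need a dedicated loader.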
Sunanda
18-Jun-2011
[5877]
Question on string and block parsing:
   http://stackoverflow.com/questions/6392533
Steeve
18-Jun-2011
[5878x2]
only the second string is checked.
Should be:
['apple some [and string! into ["a" some "b" ]]]
can't post the response
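(A sketch of how Steeve's rule works, with made-up data; it assumes R3-style PARSE, where AND looks ahead without consuming input and INTO descends into the string value.)

>> parse [apple "abb" "abbb"] ['apple some [and string! into ["a" some "b"]]]
== true

>> parse [apple "abb" "ac"] ['apple some [and string! into ["a" some "b"]]]
== false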
Sunanda
18-Jun-2011
[5880]
Want me to post it for you?
Steeve
18-Jun-2011
[5881]
yep ;-)
Sunanda
18-Jun-2011
[5882]
Done, thanks.
onetom
4-Aug-2011
[5883]
Parse (YC S11): A Heroku For Mobile Apps.
Great name for a startup...

http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/
Sunanda
31-Oct-2011
[5884]
Can anyone gift me an efficient R2 'parse solution for this problem 
(I am assuming 'parse will out-perform any other approach):

SET UP

I have a huge list of HTML named character entities, eg (a very short 
example):

       named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr" ] ;; 
       etc
   
And I have some text that may contain some named entities, eg:

       text: "To send, press the ← arrow & then press ↵."
   
PROBLEM

I want to escape every "&" in the text, unless it is part of a named 
entity, eg (assuming a function called escape-amps):
        probe escape-amps text entities

         == "To send, press the ← arrow & then press ↵."
  
TO MAKE IT EASY....

You can assume a different set-up for the named-entities block 
if you want; eg, this may be better for you:

       named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"] ;; etc
   
Any help on this would be much appreciated!
Geomol
31-Oct-2011
[5885x3]
ne: ["←" | "↵"]	; and the rest of the named entities
s: "To send, press the ← arrow & then press ↵."
parse s [
	any [
		to #"&" [ne | skip mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow &amp; then press &crarr;.}
It may be faster to drop the & from the entities and change the rule 
to:

any [thru #"&" [ne | mark: (insert mark "amp;")]]
That's strange. My 2nd suggestion gives a different result:

ne: ["larr;" | "crarr;"]
s: "To send, press the ← arrow & then press ↵."
parse s [
	any [
		thru #"&" [ne | mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow & amp;then press &crarr;.}

Seems like a bug, or am I just tired?
Sunanda
31-Oct-2011
[5888]
Thanks for the quick contributions, geomol.

I see a different result too -- a space between the "&" and the "amp"
Pekr
31-Oct-2011
[5889x2]
not fluent with html escaping, what's the aim? To replace a stand-alone 
#"&" with "&amp;"?
also remember - parse does not count spaces in. You are better off 
using parse/all
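(To illustrate Pekr's point -- as far as I recall, R2 behaves like this:)

>> parse "a b" ["a" "b"]
== true		; without /all the space between the matches is skipped
>> parse/all "a b" ["a" "b"]
== false	; with /all the space has to be matched explicitly
>> parse/all "a b" ["a" #" " "b"]
== true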
Ladislav
31-Oct-2011
[5891]
'I want to escape every "&" in the text, unless it is part of a named 
entity' - just to make sure: if the entity is not in the ENTITIES 
list, like e.g. &quot; and it is encountered in the given TEXT, what 
exactly should happen?
Sunanda
31-Oct-2011
[5892x3]
The aim --- Basically, yes, Petr.
Ladislav -- if it is not in the list, then I'd like it escaped, please.

Think of it as a whitelist of ecceptable named entities. All others 
are suspect :)
ecceptable ==> acceptable
Ladislav
31-Oct-2011
[5895]
Yes, OK, I just wanted to know
Pekr
31-Oct-2011
[5896]
Geomol - your code basically works, no? Just use parse/all:


>> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
== false
>> s
== {To send, press the &larr; arrow &amp; then press &crarr;.}
Ladislav
31-Oct-2011
[5897x6]
I guess that this should be efficient:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
	text [string!]
	entities [hash!]
	/local result pos1 pos2
][
	result: copy ""
	parse/all text [
		pos1:
		any [
			; find the next amp
			thru #"&"
			pos2:
			[
				; entity check
				some alpha pos3: #";" (
					; entity candidate
					unless find entities copy/part pos2 pos3 [
						; not an entity
						insert insert tail result copy/part pos1 pos2 "amp;"
						pos1: pos2
					]
				)
				| (
					; not an entity
					insert insert tail result copy/part pos1 pos2 "amp;"
					pos1: pos2
				)
			]
			| (insert tail result pos1) end skip ; no amp found
		]
	]
	result
]
(in place inserts are too slow)
(= inefficient)
Err: pos3 should be added as a local
This is how it works:

>> probe escape-amps text named-entities

{To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}

== {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
With TEXT defined:


>> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"
Geomol
31-Oct-2011
[5903]
Pekr, yeah, probably because I left out the /all refinement. Makes 
sense.
Sunanda
31-Oct-2011
[5904]
Thanks Ladislav and Geomol.

Both your solutions work with my test data -- that's always a good 
sign :)


I'll do some timing tests with large entity lists ..... But I won't 
be able to do that for 24 hours.

Other approaches still welcome!
Andreas
31-Oct-2011
[5905]
Two suggestions:


- store your named entities as a hash! (order of magnitude speedup 
for FIND)


- if you have loooong "words", restrict Ladislav's `some alpha` to 
the maximum length of a valid entity
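(A sketch of the first suggestion: converting the block to a hash! is a one-liner, and FIND then does a hashed lookup. Use find/case if the entity names must be matched case-sensitively.)

named-entities: make hash! ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]

>> found? find named-entities "larr"
== true
>> found? find named-entities "xyzzy"
== false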
Ladislav
31-Oct-2011
[5906]
This alternative does not use the COPY call, so it should be faster:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
	text [string!]
	entities [hash!]
	/local result pos1 pos2 pos3
][
	result: copy ""
	parse/all text [
		pos1:
		any [
			; find the next amp
			thru #"&"
			pos2:
			[
				; entity check
				some alpha pos3: #";" (
					; entity candidate
					unless find entities copy/part pos2 pos3 [
						; not an entity
						insert insert/part tail result pos1 pos2 "amp;"
						pos1: pos2
					]
				)
				| (
					; not an entity
					insert insert/part tail result pos1 pos2 "amp;"
					pos1: pos2
				)
			]
			| (insert tail result pos1) end skip ; no amp found
		]
	]
	result
]
PeterWood
1-Nov-2011
[5907x3]
Perhaps building a parse rule from the list of entities may be faster 
if there is a lot of text to process:

This assumes the entities are provided as strings in a block.

escape-amps: func [
  text [string!]
  entities [block!]
][
  skip-it: complement charset [#"&"]
  entity: copy []
  foreach ent entities [insert entity compose [(ent) |]]
  head remove back tail entity
  parse/all text [
    any [
      entity |
      "&" pos: (insert pos "amp;" pos: skip pos 4) :pos |
      some skip-it
    ]
  ]
  head text
]
That should read head text at the end of the function.
Also I feel using skip could be very slow if the text contains a 
lot of "non-matching text". The "skip-it" technique could also be 
applied to Ladislav's code.
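(For reference, this is what PeterWood's rule-building loop produces with a tiny, assumed entity list; the reversed order does not matter for the alternation.)

entities: ["&larr;" "&crarr;"]
entity: copy []
foreach ent entities [insert entity compose [(ent) |]]
remove back tail entity
probe entity	; ["&crarr;" | "&larr;"]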
Ladislav
1-Nov-2011
[5910x3]
'The "skip-it" technique could also be applied to Ladislav's code.' 
- I do not think so
(not that it cannot be applied, but, it is not efficient, in my opinion)
Regarding the optimizations:


- my code is optimized for the case when there are many entities. 
(hash! search, as Andreas suggested as well) When the number of entities 
is small, this optimization does not help

- my code is optimized for the case when the TEXT is large (append 
is much faster than in place insert), for small texts this optimization 
does not help
Gabriele
1-Nov-2011
[5913]
Sunanda, note that this is already available in the text encoding 
module: http://www.rebol.it/power-mezz/mezz/text-encoding.html
Sunanda
1-Nov-2011
[5914x3]
Wow -- thanks Gabriele. For me, your powermezz is a much overlooked 
gem.


I fear I have, in effect, badly implemented chunks of your functionality 
over the past few months while I've worked on an application that 
takes unconstrained text and constrains it to look okay in a web 
page and when printed via LaTeX.

I should have read the documentation first!
I've put aside looking at the powermezz for now, and simply decided 
to use one of the three case-specific solutions offered here.


I  made some tweaks to ensure the comparisons I was making were fair 
(and met a previously unstated condition).
 -- each in a func
 -- each works case sensitively (as previously unstated)
 -- use the complete entity set as defined by the W3C

 -- changed Ladislav's Charset as some named entities have digits 
 in their names

 -- moved Peter's set-up of his entity list out of the function and 
 into one-off init code.


It's been a fun hour of twiddling other people's code.....If you 
want your modified code -- please just ask.

Timing results next .....
My test data was heavily weighted towards the live conditions I expect 
to encounter (average text length 2000; most texts are unlikely to 
have more than one named entity).


All three scripts produced the same results -- so top marks for meeting 
the spec!


Under my test conditions, Ladislav was fastest, followed by Geomol, 
followed by Peter.


Other  test conditions changed those rankings....So nothing is absolute.


Using a Hash! contributed a lot to Ladislav's speed -- when I tried 
it as a Block! it was only slightly faster than Geomol's.....What 
a pity R3 removes hash!


Thanks for contributing these solutions -- I've enjoyed looking at 
your code and marvelling at the different approaches REBOL makes 
possible.
Ladislav
1-Nov-2011
[5917]
'Using a Hash! contributed a lot to Ladislav's speed -- when I tried 
it as a Block! it was only slightly faster than Geomol's.....What 
a pity R3 removes hash!' - no problem, in R3 you can use map!
Sunanda
1-Nov-2011
[5918]
That's true, but map! is a bit awkward for just looking up an item 
in a list.....Map! is optimised for retrieving a value associated 
with a key.
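(One way around that, assuming R3's map!/select semantics: map each entity name to true and treat a none result as "not a known entity".)

entities: make map! ["nbsp" true "cent" true "larr" true "crarr" true]

>> select entities "larr"
== true
>> select entities "xyzzy"
== none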