World: r3wp

[Parse] Discussion of PARSE dialect

Sunanda
18-Jun-2011
[5882]
Done, thanks.
onetom
4-Aug-2011
[5883]
Parse (YC S11): A Heroku For Mobile Apps.
Great name for a startup...

http://techcrunch.com/2011/08/04/yc-funded-parse-a-heroku-for-mobile-apps/
Sunanda
31-Oct-2011
[5884]
Can anyone gift me an efficient R2 'parse solution for this problem 
(I am assuming 'parse will out-perform any other approach):

SET UP

I have a huge list of HTML named character entities, eg (a very short 
example):

       named-entities: ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]  ;; etc
   
And I have some text that may contain some named entities, eg:

       text: "To send, press the ← arrow & then press ↵."
   
PROBLEM

I want to escape every "&" in the text, unless it is part of a named 
entity, eg (assuming a function called escape-amps):
        probe escape-amps text named-entities

         == "To send, press the ← arrow & then press ↵."
  
TO MAKE IT EASY....

You can assume a different set up for the named-entities block 
if you want; eg, this may be better for you:

       named-entities: ["&nbsp;" "&cent;" "&agrave;" "&larr;" "&rarr;" "&crarr;"]  ;; etc
   
Any help on this would be much appreciated!
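
For reference, a naive non-parse baseline (a minimal sketch assuming 
the first named-entities set-up above: escape everything, then restore 
the whitelisted entities) might look like:

       escape-amps-naive: func [text [string!] entities [block!] /local result][
           result: copy text
           replace/all result "&" "&amp;"   ;; escape every "&" first
           ;; then un-escape the whitelisted entities only
           foreach name entities [
               replace/all result rejoin ["&amp;" name ";"] rejoin ["&" name ";"]
           ]
           result
       ]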
Geomol
31-Oct-2011
[5885x3]
ne: ["←" | "↵"]	; and the rest of the named entities
s: "To send, press the ← arrow & then press ↵."
parse s [
	any [
		to #"&" [ne | skip mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow &amp; then press &crarr;.}
It may be faster to drop the & from the entities and change the rule 
to:

any [thru #"&" [ne | mark: (insert mark "amp;")]]
That's strange. My 2nd suggestion gives a different result:

ne: ["larr;" | "crarr;"]
s: "To send, press the ← arrow & then press ↵."
parse s [
	any [
		thru #"&" [ne | mark: (insert mark "amp;")]
	]
]
s

== {To send, press the &larr; arrow & amp;then press &crarr;.}

Seems like a bug, or am I just tired?
Sunanda
31-Oct-2011
[5888]
Thanks for the quick contributions, geomol.

I see a different result too -- a space between the "&" and the "amp"
Pekr
31-Oct-2011
[5889x2]
not fluent with html escaping, what's the aim? To replace a stand-alone 
#"&" with "&amp;"?
also remember - without /all, parse skips over spaces. You are better 
off using parse/all
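A minimal illustration of the difference, assuming an R2 console:

>> parse "the quick fox" ["the" "quick" "fox"]      ; spaces are skipped automatically
== true
>> parse/all "the quick fox" ["the" "quick" "fox"]  ; every character must be matched
== false
>> parse/all "the quick fox" ["the" " " "quick" " " "fox"]
== true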
Ladislav
31-Oct-2011
[5891]
'I want to escape every "&" in the text, unless it is part of a named 
entity' - just to make sure: if the entity is not in the ENTITIES 
list, like e.g. &quot; and it is encountered in the given TEXT, what 
exactly should happen?
Sunanda
31-Oct-2011
[5892x3]
The aim --- Basically, yes, Petr.
Ladislav -- if it is not in the list, then I'd like it escaped, please.

Think of it as a whitelist of acceptable named entities. All others 
are suspect :)
Ladislav
31-Oct-2011
[5895]
Yes, OK, I just wanted to know
Pekr
31-Oct-2011
[5896]
Geomol - your code basically works, no? Just use parse/all:


>> parse/all s [any [thru #"&" [ne | mark: (insert mark "amp;")]]]
== false
>> s
== {To send, press the &larr; arrow &amp; then press &crarr;.}
Ladislav
31-Oct-2011
[5897x6]
I guess that this should be efficient:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
	text [string!]
	entities [hash!]
	/local result pos1 pos2 pos3
][
	result: copy ""
	parse/all text [
		pos1:
		any [
			; find the next amp
			thru #"&"
			pos2:
			[
				; entity check
				some alpha pos3: #";" (
					; entity candidate
					unless find entities copy/part pos2 pos3 [
						; not an entity
						insert insert tail result copy/part pos1 pos2 "amp;"
						pos1: pos2
					]
				)
				| (
					; not an entity
					insert insert tail result copy/part pos1 pos2 "amp;"
					pos1: pos2
				)
			]
			| (insert tail result pos1) end skip ; no amp found
		]
	]
	result
]
(in place inserts are too slow)
(= inefficient)
This is how it works:

>> probe escape-amps text named-entities

{To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}

== {To send, press the &larr; arrow &amp; then press &crarr;.&amp;susp;123}
With TEXT defined:


>> text: "To send, press the &larr; arrow & then press &crarr;.&susp;123"
Geomol
31-Oct-2011
[5903]
Pekr, yeah, probably because I left out the /all refinement. Makes 
sense.
Sunanda
31-Oct-2011
[5904]
Thanks Ladislav and Geomol.

Both your solutions work with my test data -- that's always a good 
sign :)


I'll do some timing tests with large entity lists ..... But I won't 
be able to do that for 24 hours.

Other approaches still welcome!
Andreas
31-Oct-2011
[5905]
Two suggestions:


- store your named entities as a hash! (order of magnitude speedup 
for FIND)


- if you have loooong "words", restrict Ladislav's `some alpha` to 
the maximum length of a valid entity
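
A rough sketch of both suggestions (R2, using Ladislav's alpha bitset; 
max-len is an assumed longest-name length, not a known value):

entities: make hash! ["nbsp" "cent" "agrave" "larr" "rarr" "crarr"]  ; FIND on hash! avoids a linear scan

max-len: 8                                ; assumption: longest entity name in the whitelist
entity-name: compose [1 (max-len) alpha]  ; bounded replacement for [some alpha]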
Ladislav
31-Oct-2011
[5906]
This alternative does not use the COPY call, so, it has to be faster:

alpha: make bitset! [#"a" - #"z" #"A" - #"Z"]
escape-amps: func [
	text [string!]
	entities [hash!]
	/local result pos1 pos2 pos3
][
	result: copy ""
	parse/all text [
		pos1:
		any [
			; find the next amp
			thru #"&"
			pos2:
			[
				; entity check
				some alpha pos3: #";" (
					; entity candidate
					unless find entities copy/part pos2 pos3 [
						; not an entity
						insert insert/part tail result pos1 pos2 "amp;"
						pos1: pos2
					]
				)
				| (
					; not an entity
					insert insert/part tail result pos1 pos2 "amp;"
					pos1: pos2
				)
			]
			| (insert tail result pos1) end skip ; no amp found
		]
	]
	result
]
PeterWood
1-Nov-2011
[5907x3]
Perhaps building a parse rule from the list of entities may be faster 
if there is a lot of text to process:

This assumes the entities are provided as strings in a block, in their full "&name;" form (eg "&nbsp;").

escape-amps: func [
  text [string!]
  entities [block!]
  /local skip-it entity pos
][
  skip-it: complement charset [#"&"]
  entity: copy []
  foreach ent entities [insert entity compose [(ent) |]]
  head remove back tail entity
  parse/all text [
    any [
      entity |
      "&" pos: (insert pos "amp;" pos: skip pos 4) :pos |
      some skip-it
    ]
  ]
  head text
]
Also I feel using skip could be very slow if the text contains a 
lot of "non-matching text". The "skip-it" technique could also be 
applied to Ladislav's code.
Ladislav
1-Nov-2011
[5910x3]
'The "skip-it" technique could also be applied to Ladislav's code.' 
- I do not think so
(not that it cannot be applied, but, it is not efficient, in my opinion)
Regarding the optimizations:


- my code is optimized for the case when there are many entities 
(hash! search, as Andreas suggested as well); when the number of entities 
is small, this optimization does not help

- my code is optimized for the case when the TEXT is large (append 
is much faster than an in-place insert); for small texts this optimization 
does not help
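
A rough way to see the second point (a sketch, assuming R2; exact 
timings will vary):

big: head insert/dup copy "" #"x" 200000

t: now/precise
loop 100 [insert big "y"]        ; in-place insert near the head shifts the whole string
print difference now/precise t

t: now/precise
loop 100 [insert tail big "y"]   ; appending at the tail does not
print difference now/precise t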
Gabriele
1-Nov-2011
[5913]
Sunanda, note that this is already available in the text encoding 
module: http://www.rebol.it/power-mezz/mezz/text-encoding.html
Sunanda
1-Nov-2011
[5914x3]
Wow -- thanks Gabriele. For me, your powermezz is a much overlooked 
gem.


I fear I have, in effect, badly implemented chunks of your functionality 
over the past few months while I've worked on an application that 
takes unconstrained text and constrains it to look okay in a web 
page and when printed via LaTeX.

I should have read the documentation first!
I've put aside looking at the powermezz for now, and simply decided 
to use one of the three case-specific solutions offered here.


I made some tweaks to ensure the comparisons I was making were fair 
(and met a previously unstated condition).
 -- each in a func
 -- each works case sensitively (as previously unstated)
 -- use the complete entity set as defined by the W3C

 -- changed Ladislav's charset, as some named entities have digits 
 in their names (a one-line sketch of that change follows this list)

 -- moved Peter's set-up of his entity list out of the function and 
 into one-off init code.
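
A one-line sketch of the charset change (keeping Ladislav's name, so 
the rest of his code is untouched; "frac12" and "sup2" are examples 
of entity names containing digits):

    alpha: make bitset! [#"a" - #"z" #"A" - #"Z" #"0" - #"9"]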


It's been a fun hour of twiddling other people's code.....If you 
want your modified code -- please just ask.

Timing results next .....
My test data was heavily weighted towards the live conditions I expect 
to encounter (average text length 2000 characters; most texts are 
unlikely to have more than 1 named entity).


All three scripts produced the same results -- so top marks for meeting 
the spec!


Under my test conditions, Ladislav was fastest, followed by Geomol, 
followed by Peter.


Other test conditions changed those rankings....So nothing is absolute.


Using a Hash! contributed a lot to Ladislav's speed -- when I tried 
it as a Block! it was only slightly faster than Geomol's.....What 
a pity R3 removes hash!


Thanks for contributing these solutions -- I've enjoyed looking at 
your code and marvelling at the different approaches REBOL makes 
possible.
Ladislav
1-Nov-2011
[5917]
'Using a Hash! contributed a lot to Ladislav's speed -- when I tried 
it as a Block! it was only slightly faster than Geomol's.....What 
a pity R3 removes hash!'
 - no problem, in R3 you can use map!
Sunanda
1-Nov-2011
[5918]
That's true, but map! is a bit awkward for just looking up an item 
in a list.....Map! is optimised for retrieving a value associated 
with a key.
Ladislav
1-Nov-2011
[5919x4]
as follows: 

entities-map: make map! []
foreach entity entities-block [entities-map/:entity: true]
so, keys are the entities, and the value is either true (for an entity) 
or none
I think it is OK that way
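For example (a small R3 sketch using that loop):

entities-map: make map! []
foreach entity ["larr" "crarr" "nbsp"] [entities-map/:entity: true]
select entities-map "larr"   ; == true
select entities-map "susp"   ; == none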
Another solution is to use a sorted block and a binary search, which 
should be about the same speed as hash
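A minimal sketch of that idea (find-sorted? is just an illustrative 
name; note that REBOL's default string comparisons here are case-insensitive):

find-sorted?: func [
	"True if value is in the sorted block (binary search)"
	sorted [block!] value [string!]
	/local lo hi mid item
][
	lo: 1
	hi: length? sorted
	while [lo <= hi] [
		mid: to integer! (lo + hi) / 2
		item: pick sorted mid
		if item = value [return true]
		either item < value [lo: mid + 1] [hi: mid - 1]
	]
	false
]

find-sorted? sort copy ["nbsp" "cent" "agrave" "larr"] "larr"   ; == true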
Sunanda
1-Nov-2011
[5923]
Yes, it is doable with map! -- but, as I said, awkward.


Another issue (or perhaps just an unfixed bug) is the lack of case sensitivity 
with map!:
    select/case make map! ["A" true] "a"
     == true

The current work-around is to use binary rather than string data:
    select make map! reduce [to-binary "A" true]  to-binary "a"
    == none
Ladislav
1-Nov-2011
[5924x3]
yes, right, that is an issue
BTW, I think that there is a possible optimization not using the 
charset you mention
Are you still interested?
Sunanda
1-Nov-2011
[5927]
Yes please!
Ladislav
14-Nov-2011
[5928x4]
Sorry for not continuing with it, Sunanda, but when I gave it a second 
thought, it did not look like the possible speed-up would be worth 
the source code complication.
Another Parse discussion subject:


It looked to me like a good idea to be able, in one Parse pass, to 
match some strings in a case-sensitive way and other strings 
in a case-insensitive way. This is not possible using the /CASE refinement, 
since the refinement makes all comparisons case-sensitive, and if not 
used, all comparisons are case-insensitive. Wouldn't it be good to 
be able to adjust the comparison sensitivity on-the-fly during parsing?
I think that it should not be overly complicated to achieve the 
goal, e.g. by using a CASE keyword in PARSE.
(for switching to case-sensitive mode, and e.g. a NO-CASE for switching 
to case-insensitive mode)