World: r3wp

[Parse] Discussion of PARSE dialect

Henrik 18-Oct-2010 [5311]	Steeve, yes.
Steeve 18-Oct-2010 [5312]	R3 does it
AdrianS 18-Oct-2010 [5313]	Graham, try http://gskinner.com/RegExr for working out regexes. It has a really nice UI where you can hover over the components of the regex and see exactly what they do.
GrahamC 18-Oct-2010 [5314]	Thanks
Sunanda 4-Nov-2010 [5315]	Question on StackOverflow.....there must be a better answer than mine, and I'd suspect it involves PARSE (better answers usually do:) http://stackoverflow.com/questions/4093714/is-there-finer-granularity-than-load-next-for-reading-structured-data
GrahamC 4-Nov-2010 [5316x3]	Use fixed length records
	Anyone got a parse rule that strips out everything between tags in an "xml" document
	whitespace: charset [ "^/^- " ] swsp: [ any whitespace ] result: copy "" parse/all pqri-xml [ some [ copy t thru ">" (append result t) swsp to "<" ]]
Ladislav 4-Nov-2010 [5319]	Posted an answer mentioning the test framework, which does almost exactly what Fork asked
Gabriele 5-Nov-2010 [5320x3]	also, Carl's clean-script and script colorizer use parse + load/next to do the same thing. my Wetan uses the same method.
	http://www.colellachiara.com/soft/MD3/emitters/wetan.html#section-4.2
	basically, as long as you skip over [, (, ), and ] you can just use load/next. I'm also skipping over #[ because I want to preserve literal values while formatting (that is, preserve what the user typed)
Oldes 1-Dec-2010 [5323]	How to use the new INTO parse keyword? Could it be used to avoid the temp parse like in this (very simplified example)? parse "<a>123</a>" [thru "<a>" copy tmp to "</a>" (probe tmp probe parse tmp ["123"]) to end] Note that I know that in this example it's easy to use just one parse and avoid the temp.
Ladislav 1-Dec-2010 [5324x3]	INTO is neither new, nor is it meant for string parsing
	You can take advantage of using it when parsing a block and needing to parse a subblock (of any-block! type) or a substring
	(of the said block)
Oldes 1-Dec-2010 [5327]	can you give me a simple example, please?
Ladislav 1-Dec-2010 [5328x2]	>> parse [a b "123" c] [2 word! into [3 skip] word!] == true
	>> parse [a b c/d/e] [2 word! into [3 word!]] == true
Oldes 1-Dec-2010 [5330x2]	I understand now, thanks.
	it's very useful, I wonder why I've not found it earlier :)
Ladislav 1-Dec-2010 [5332]	The substring property is just a recent addition
Oldes 1-Dec-2010 [5333]	And is there any nice solution for my string parsing above? I can live with the temps, just was thinking if it could be done better.. anyway, at least I know how to use INTO:)
Ladislav 1-Dec-2010 [5334x2]	That is normally a "job" for a subrule
	It looks like you could use, e.g., the REJECT keyword
Oldes 1-Dec-2010 [5336x2]	I know, but that would require complex rules, I'm lazy parser:) Btw.. my real example looks like: some [ thru {<h2><a} thru ">" copy name to {<} copy doc to {^/ </div>} ( parse doc [ thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} ( printf [" * " 10 " - "] reduce [arg arg-desc] ) ] ] ) ]
	Never mind, I can live with the current way anyway.. I was just wondering if INTO is intended for such cases. Now I know it isn't.
Ladislav 1-Dec-2010 [5338x3]	For comparison, a similar rule can be written as follows: some [ thru {<h2><a} thru ">" copy name to {<} copy doc any [ and {^/ </div>} break | thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} (printf [" * " 10 " - "] reduce [arg arg-desc]) ] | skip ] ]
	Aha, sorry, that is not similar enough :-( To be similar, it should look as follows, I guess: some [ thru {<h2><a} thru ">" copy name to {<} copy doc any [ thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} (printf [" * " 10 " - "] reduce [arg arg-desc]) ] to {^/ </div>} ] ]
	Still no cigar, third time: some [ thru {<h2><a} thru ">" copy name to {<} copy doc [ thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} (printf [" * " 10 " - "] reduce [arg arg-desc]) ] to {^/ </div>} ] ]
Oldes 1-Dec-2010 [5341x2]	That's not correct.. there is a reason for the temp parse: it's there because thru "<h5" would skip out of the div.
	the DOC is just the temp var for the second parse.
Ladislav 1-Dec-2010 [5343]	But, in that case your "inner parse" fails, without you noticing it?
Oldes 1-Dec-2010 [5344x2]	why? it does not fail.. or maybe it fails, but I have the data from the doc div, that's all.. it's lazy parsing :)
	btw.. I need to parse the source only once so I really don't have to care about some exceptions.
Ladislav 1-Dec-2010 [5346]	I have the data - I doubt you get the data if the "inner parse" fails
Oldes 1-Dec-2010 [5347x2]	believe me I have.. :) the script is already ready.. I was just thinking if there is some special parse keyword, like INTO, so I could do it without the second parse next time, that's all. I use such lazy parsing very often.
	in your case I would need to jump at least over each tag start, not using thru "<h5". But then there would be the problem that I need to end the doc div only if it's exactly "^/ </div" (to avoid the case where there is another inner div). I know it's not safe, but I can see what I do by examining the source I want to parse first. (240kB html in my case)
Ladislav 1-Dec-2010 [5349]	Aha, that "I can see what I do by examining..." looks substantial. Nevertheless, there is still a way to do a similar thing without calling Parse again
Oldes 1-Dec-2010 [5350]	I believe you, but what's important is whether it would be easy enough to satisfy my laziness... something like the INTO for block parsing.
Ladislav 1-Dec-2010 [5351x4]	what about this, is it the rule you wanted? some [ thru {<h2><a} thru ">" copy name to {<} to {^/ </div>} doc: [ thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ here: if (lesser? index? here index? doc) thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} (printf [" * " 10 " - "] reduce [arg arg-desc]) ] ] :doc ]
	aha, I missed there should be doc-start and doc-end
	some [ thru {<h2><a} thru ">" copy name to {<} doc-start: to {^/ </div>} doc-end: :doc-start [ thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} here: if (lesser? index? here index? doc-end) copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} (printf [" * " 10 " - "] reduce [arg arg-desc]) ] ] :doc-end ]
	Nevertheless, both the variant you posted, as well as the variant I posted parse a part of the text more than once. A variant parsing the text only once can be written as well.
Steeve 1-Dec-2010 [5355]	this should work with R3: some [ thru {<h2><a} thru ">" copy name to {<} copy doc to {^/ </div>} :doc thru {<pre class="code">} copy code to {</pre} ( probe name probe code ) any [ thru {<h5>} copy arg to {<} thru {<ol><p>} copy arg-desc to {</p></ol>} ( printf [" * " 10 " - "] reduce [arg arg-desc] ) ] ] notice the :doc, which allows to switch the current input parsed
Oldes 1-Dec-2010 [5356]	Ladislav's last version is working, but it's far from easy to use for lazy parsing. I think that I will stay with my version;-)
Ladislav 1-Dec-2010 [5357]	Use what suits your needs best. Nevertheless, as far as code size, etc. are compared, they are the same (even sharing the property, that the part of code is parsed twice).
BrianH 1-Dec-2010 [5358]	Was Carl's proposed LIMIT keyword implemented yet?
Ladislav 1-Dec-2010 [5359]	Not yet, I guess.
BrianH 1-Dec-2010 [5360]	That is what he proposed to deal with this issue. I look forward to it.
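
Editorial sketch (not from the thread): the trade-off debated above can be reproduced on a tiny input. The sample string and the words TEXT, FOUND and INNER-OK? are made up for illustration, and the code assumes REBOL 2 style PARSE/ALL on strings. It follows the copy-then-reparse approach Oldes describes: the inner parse only sees the copied DOC, so THRU cannot run past the enclosing div, while the inner parse itself returns false even though the data was collected.

```rebol
; Hypothetical sample input: one div holding two <h5> items, plus one <h5> after it.
text: {<div><h5>one</h5><h5>two</h5></div><h5>outside</h5>}
found: copy []

parse/all text [
    thru "<div>" copy doc to "</div>"   ; DOC is only a temp for the second parse
    (
        ; The inner parse sees DOC alone, so THRU cannot leave the div.
        inner-ok?: parse/all doc [
            any [thru "<h5>" copy arg to "<" (append found arg)]
        ]
    )
    to end
]

probe found      ; ["one" "two"] - "outside" is never reached
probe inner-ok?  ; false - the inner rule stops before the end of DOC
```

The false result is the point Ladislav raises: the inner rule never consumes the rest of DOC, so the inner parse fails, yet the side effects in parentheses have already run. Adding TO END to the inner rule would make it return true without changing what is collected.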