World: r3wp

[Parse] Discussion of PARSE dialect

Janko 14-Feb-2009 [3549]	I intended to write a blog post .. "REBOL parse challenge" .. present this problem and ask if people can provide solutions in other languages that would be more elegant (in a similar vein to the "arc challenge") ... now that it seems an even harder nut to crack I should probably really do it .. does anyone think this would be easy to solve using a conventional language? (I think not)
Anton 14-Feb-2009 [3550]	I'm sure there are some elegant solutions in other languages too.
Janko 14-Feb-2009 [3551]	hm.. would this be nicely solvable with a regex? .. I think it would be quite a pain using regular string functions like strpos, substr etc... with the same requirements (one or more spaces/tabs/newlines, " or ', undefined order)
Anton 14-Feb-2009 [3552]	I don't know - I only learn regex when I have to .. then a short time later I forget.
Janko 14-Feb-2009 [3553]	yes, me also
Anton 14-Feb-2009 [3554]	perl could do it pretty quick, I'm sure.
Janko 14-Feb-2009 [3555x4]	a perl pro would certainly use regex (that is the initial home of it) :) ... I think parse and regex are each best for different problems, I am just not sure whether this one is better solved with one or the other
	regex I imagine sucks at structured stuff, where you have to make some sort of state machine, for example I don't think regex can parse xml well ... state machines are excellent at that but they do require more code than parse would
	I will see with the "parse challenge" .. if I wanted to be really sneaky I could ask the perl community if anyone can solve this .. and if their solution sucked more than rebol's then make the blogpost :)
	but I am not like that ;)
Anton 14-Feb-2009 [3559x2]	Yeah, I'm not really sure what that would prove. :)
Anton 14-Feb-2009 [3559x2]	What would you build a state machine with, which would generate so much code ?
Janko 14-Feb-2009 [3561]	I don't fully understand your question?
Anton 14-Feb-2009 [3562x2]	You say "state machines ... require more code". What code ? Obviously, you can build a state machine in any language, but I guess I'm wondering what ... ohh... I'm so tired after all those cheese sandwiches....
Anton 14-Feb-2009 [3562x2]	Anyway, I think I understand what you're saying. A state machine is big and clunky, expressing everything you don't want to hear about, while parse allows you to express your target more directly, cutting through anything you don't want without having to specify it.
Janko 14-Feb-2009 [3564]	I don't know the exact term for this, but I have built many parsers for things like xml, wiki text and some other custom things in various lower level languages using a simple state machine (at least that's what I called it)... To my understanding you can parse anything with something like that, including structured nested data, but it of course takes some more coding than this rebol solution... what I mean by a state machine is a loop that accepts characters or words and has a predefined number of states, with code for what to do in each state and when to switch to another state etc..
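The kind of hand-written state machine Janko describes might look roughly like the sketch below, written for REBOL 2. The input string and the names `state`, `closer`, `current` and `values` are made up for illustration and are not from the discussion; the loop only collects quoted strings and ignores everything else.

```rebol
REBOL []

text: {say "hello" and 'bye'}   ; hypothetical input
state: 'outside                 ; two states: 'outside and 'inside a quoted string
closer: none                    ; the quote character that will end the current string
current: copy ""                ; characters collected so far
values: copy []                 ; finished strings

foreach ch text [
    switch state [
        outside [
            ; a quote character opens a string; remember which one it was
            if find {"'} ch [
                closer: ch
                clear current
                state: 'inside
            ]
        ]
        inside [
            either ch = closer [
                ; the matching quote closes the string
                append values copy current
                state: 'outside
            ][
                append current ch
            ]
        ]
    ]
]

probe values   ; ["hello" "bye"]
```

Even for this tiny task the states and transitions have to be spelled out by hand, which is the extra code Janko is referring to.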
Anton 14-Feb-2009 [3565]	Right, yes. We agree.
Janko 14-Feb-2009 [3566]	ok :)
Anton 14-Feb-2009 [3567]	What is the next problem ?
Janko 14-Feb-2009 [3568]	that was the big stopper that you just solved for me.. there are no other problems for now .. just the willingness to type in all the code :) ..
Anton 14-Feb-2009 [3569x2]	I know what it could be - eg: <img src=afile.jpg> <img src="afile.jpg"> <img src='afile.jpg'>
Anton 14-Feb-2009 [3569x2]	The first one without any quotes causes a little bit of a problem (solvable).
Janko 14-Feb-2009 [3571x2]	maybe you can make OPT [ " | ' ] ?
Janko 14-Feb-2009 [3571x2]	copy to [ " | > | ' ] ?
Anton 14-Feb-2009 [3573]	You have to use a variable to store which one was used, then parse until that character is encountered again.
Janko 14-Feb-2009 [3574x2]	yes, that's how I did it
Janko 14-Feb-2009 [3574x2]	>> "content=" m: skip (m1: first m ) copy T to m1<<
Anton 14-Feb-2009 [3576]	So you did.
Janko 14-Feb-2009 [3577]	in meta tags example
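The approach Anton and Janko describe (remember which quote character opened the value, then copy up to the same character) can be sketched in REBOL 2 as follows. The input string and the names `mark`, `quote-char`, `value` and `values` are assumptions for illustration; the sketch handles only quoted values and does not attempt the unquoted case.

```rebol
REBOL []

html: {<img src="afile.jpg"> <img src='b.png'>}   ; hypothetical input
values: copy []

parse/all html [
    any [
        thru "src="
        mark: skip (quote-char: first mark)   ; remember the opening quote character
        copy value to quote-char skip         ; copy up to the same character, then step over it
        (append values value)
    ]
    to end
]

probe values   ; ["afile.jpg" "b.png"]
```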
Anton 14-Feb-2009 [3578]	But when no quotes are used, it gets tricky, eg: <img src= afile.jpg width=10>
Janko 14-Feb-2009 [3579]	what I have the biggest problem with (one that I thought was unsolvable - but I have to study your example to see why it works) is the order of things
Anton 14-Feb-2009 [3580]	Is this a surprise ?
	>> parse "abc" [some ["b" | "c" | "a"]]
	== true
Janko 14-Feb-2009 [3581x2]	hm.. I don't know right now.. you confused me.. I thought I tried everything and it just didn't work what I needed but I don't have example in my head
Janko 14-Feb-2009 [3581x2]	I will try to think of one
Anton 14-Feb-2009 [3583]	Yes, it takes a little while to become familiar with parse.
Janko 14-Feb-2009 [3584]	this does surprise me a little, but I am not sure if this was the problem or something else, because I thought I tried with some and all things
Anton 14-Feb-2009 [3585]	It means, basically: SOME: Do this 1 or more times, until fail or end is reached: [Try "b", if that fails, try "c". If that fails, try "a"] <--- Given "a" "b" "c", this rule always succeeds.
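Anton's description of SOME with ordered alternatives can be checked with a small REBOL 2 sketch that records which alternative fires on each pass. The names `fired`, `rule`, `ok1` and `ok2` are made up for illustration.

```rebol
REBOL []

fired: copy []   ; which alternative matched, in order
rule: [
    some [
        "b" (append fired "b")
        | "c" (append fired "c")
        | "a" (append fired "a")
    ]
]

ok1: parse "abc" rule   ; every character is covered by one of the alternatives
ok2: parse "abd" rule   ; "d" matches no alternative, so input is left over

probe fired       ; ["a" "b" "c" "a" "b"]
print [ok1 ok2]   ; true false
```

The alternatives are tried left to right at each position, so the order in which they are written does not have to match the order of the input.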
Janko 14-Feb-2009 [3586x2]	aha.. I think / hope I found an example of my problem (I had already settled on doing things like this in multiple passes)
Janko 14-Feb-2009 [3586x2]	(the problem is with things that repeat when I don't know in which order they will appear .. I had this problem with parsing something like simplified wiki text)
	>> a: "start1 1 end start2 2 end start1 3 end"
	== "start1 1 end start2 2 end start1 3 end"
	>> parse a [ SOME [ [ thru "start2" | thru "start1" ] copy T to "end" (print T) ] to end ]
	2
	3
	== true
	>> parse a [ SOME [ [ thru "start1" | thru "start2" ] copy T to "end" (print T) ] to end ]
	1
	3
	== true
	(so as not to give the impression that I only have problems with parse: I have used parse to solve many things that would be head-hurting any other way... these and the problem up there are just cases where I got into trouble)
Anton 14-Feb-2009 [3588x3]	Yes, multiple passes can make the code simpler.
	Ah, here it's good to use nested rules to cut down the code.
	apiece: [copy T to "end" (?? T)]
	parse a [some [thru "start2" apiece | thru "start1" apiece] to end]
Janko 14-Feb-2009 [3591x2]	This is basically not a problem, as I solve these things with multiple passes and it works more than fast enough for me that way too ... I think this problem would not exist if, in the case of [ .. | .. | .. ], parse would check all options and take the one that is the fewest characters away from the current position (the one that comes up first in the input) .. but this would most probably slow down parse and you would lose the feature that you define "priority" with [ .. | .. | .. ] now .. so maybe if there were a different | for this
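One way to get the "nearest marker wins" behaviour in a single pass, without a new operator, is to match the markers only at the current position and otherwise advance by one character. This is a sketch of that idiom for REBOL 2, not something proposed in the discussion; the names `piece` and `found` are made up for illustration.

```rebol
REBOL []

a: "start1 1 end start2 2 end start1 3 end"
found: copy []
piece: [copy t to "end" (append found t)]

parse/all a [
    any [
        "start1" piece    ; a marker is matched only if it starts right here
        | "start2" piece
        | skip            ; otherwise move on by one character
    ]
]

probe found   ; [" 1 " " 2 " " 3 "]
```

Because no alternative uses THRU, neither marker can jump over the other, so the pieces come out in document order whichever alternative is listed first. The cost is that the rule is tried at every character position.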
Janko 14-Feb-2009 [3591x2]	( I have to go to eat... will be back .. thanks a lot for before)
Anton 14-Feb-2009 [3593]	no worries - I must sleep. :)
Janko 14-Feb-2009 [3594x2]	hm.. interesting solution .. never thought of doing it this way!! this would maybe solve these problems I had
Janko 14-Feb-2009 [3594x2]	hm.. really, thanks for this example.. I took it as unsolvable, but this is a totally elegant way to solve it .. I will need to think on this a little and do some more examples to digest it :) thanks
Anton 14-Feb-2009 [3596]	Not 100% elegant yet ! But glad to help, anyway.
Oldes 14-Feb-2009 [3597]	If you need to parse complex structures, like a markup language, you should use charsets and not the 'to or 'thru commands... for example, you cannot say that a tag starts with < and ends with >, because a tag like this is valid as well: <input value="<>"> The 'to and 'thru commands are useful if you are, for example, data mining and don't care to parse the whole page structure just to get a bit of information from it.
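A minimal sketch of the charset style Oldes recommends, written for REBOL 2 and applied to his own example tag. The rule names (`not-dq`, `not-sq`, `plain`, `dq-value`, `sq-value`, `tag`) are made up for illustration, and the grammar is deliberately simplified: it only knows about quoted runs and plain characters, not real HTML attribute syntax.

```rebol
REBOL []

not-dq: complement charset {"}       ; anything except a double quote
not-sq: complement charset "'"       ; anything except a single quote
plain:  complement charset {<>"'}    ; anything that is neither a quote nor an angle bracket

dq-value: [{"} any not-dq {"}]       ; a double-quoted run may contain < and >
sq-value: ["'" any not-sq "'"]
tag: ["<" any [dq-value | sq-value | plain] ">"]

print parse/all {<input value="<>">} tag   ; true
```

A rule such as `["<" thru ">"]` would stop at the `>` inside the quoted value; here the quoted run is consumed as a unit, so only the final `>` closes the tag.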
Janko 14-Feb-2009 [3598]	Oldes, your examples have so far been too hard for me to grasp (but I am getting there :) ) ... I imagine they are more like what I described above as state machines, with which you can parse everything, even structured/nested data. I will need to study charset parsing at some point. I agree with your point otherwise, but in this particular case I think < > & " ' are not allowed in HTML (or at least XHTML) and should always be encoded (but are not always)