World: r3wp

[Parse] Discussion of PARSE dialect

Janko 14-Feb-2009 [3564]	I don't know the exact term for this, but I have built many parsers for things like XML, wiki text and some other custom formats in various lower-level languages using a simple state machine (at least that's what I called it)... To my understanding you can parse anything with something like that, including structured nested data, but of course it takes more coding than this REBOL solution... What I mean by a state machine is a loop that accepts characters or words and has a predefined number of states, with code for what to do in each state and when to switch to another state, etc.
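The kind of state machine described here can be sketched in a few lines. This is only an illustration in Python, not code from the discussion: the function name `split_tags` and the two state names are made up for the example, and it assumes the input is plain text with `<...>` tags and nothing more complicated.

```python
def split_tags(text):
    """Split text into ("text", ...) and ("tag", ...) pieces with a two-state machine."""
    state = "TEXT"   # current state: outside a tag ("TEXT") or inside one ("TAG")
    buf = []         # characters collected since the last state switch
    out = []
    for ch in text:
        if state == "TEXT":
            if ch == "<":
                # switch to TAG, flushing any text collected so far
                if buf:
                    out.append(("text", "".join(buf)))
                buf = []
                state = "TAG"
            else:
                buf.append(ch)
        else:  # state == "TAG"
            if ch == ">":
                # tag finished, switch back to TEXT
                out.append(("tag", "".join(buf)))
                buf = []
                state = "TEXT"
            else:
                buf.append(ch)
    # trailing text is kept; an unterminated tag is silently dropped
    if buf and state == "TEXT":
        out.append(("text", "".join(buf)))
    return out
```

Each branch is "what to do in this state", and each assignment to `state` is "when to switch", which is why this style takes more typing than an equivalent parse rule.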
Anton 14-Feb-2009 [3565]	Right, yes. We agree.
Janko 14-Feb-2009 [3566]	ok :)
Anton 14-Feb-2009 [3567]	What is the next problem ?
Janko 14-Feb-2009 [3568]	That was the big stopper that you just solved for me.. there are no other problems for now .. just the willingness to type in all the code :) ..
Anton 14-Feb-2009 [3569x2]	I know what it could be - eg: <img src=afile.jpg> <img src="afile.jpg"> <img src='afile.jpg'>
Anton 14-Feb-2009 [3569x2]	The first one without any quotes causes a little bit of a problem (solvable).
Janko 14-Feb-2009 [3571x2]	maybe you can make OPT [ " | ' ] ?
Janko 14-Feb-2009 [3571x2]	copy to [ " | > | ' ] ?
Anton 14-Feb-2009 [3573]	You have to use a variable to store which one was used, then parse until that character is encountered again.
Janko 14-Feb-2009 [3574x2]	yes, that's how I did it
Janko 14-Feb-2009 [3574x2]	>> "content=" m: skip (m1: first m ) copy T to m1<<
Anton 14-Feb-2009 [3576]	So you did.
Janko 14-Feb-2009 [3577]	in meta tags example
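The "remember which quote was used, then read until it appears again" idea can be sketched as follows. This is a Python illustration under simple assumptions, not the REBOL rule from the chat: `read_attr` is a hypothetical helper, it only looks at the first occurrence of `name=`, and it deliberately gives up on unquoted values, which is the tricky case.

```python
def read_attr(html, name):
    """Return the quoted value following name=, or None if it is missing or unquoted."""
    key = name + "="
    start = html.find(key)
    if start == -1:
        return None
    pos = start + len(key)
    # the character right after "=" must be one of the two quote characters
    if pos >= len(html) or html[pos] not in "\"'":
        return None  # unquoted values are not handled in this sketch
    quote = html[pos]                 # store which quote opened the value
    end = html.find(quote, pos + 1)   # scan until the same quote recurs
    if end == -1:
        return None
    return html[pos + 1:end]
```

Because the closing character is whatever the opening one was, a value quoted with `'` may contain `"` and vice versa.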
Anton 14-Feb-2009 [3578]	But when no quotes are used, it gets tricky, eg: <img src= afile.jpg width=10>
Janko 14-Feb-2009 [3579]	What I have the biggest problem with (I thought it was unsolvable, but I have to study your example to see why it works) is the order of things.
Anton 14-Feb-2009 [3580]	Is this a surprise ? >> parse "abc" [some ["b" | "c" | "a"]] == true
Janko 14-Feb-2009 [3581x2]	hm.. I don't know right now.. you confused me.. I thought I tried everything and it just didn't do what I needed, but I don't have an example in my head
Janko 14-Feb-2009 [3581x2]	I will try to think of one
Anton 14-Feb-2009 [3583]	Yes, it takes a little while to become familiar with parse.
Janko 14-Feb-2009 [3584]	This does surprise me a little, but I am not sure if this was the problem or something else, because I thought I had tried with SOME and all those things.
Anton 14-Feb-2009 [3585]	It means, basically: SOME: Do this 1 or more times, until fail or end is reached: [Try "b", if that fails, try "c". If that fails, try "a"] <--- Given "a" "b" "c", this rule always succeeds.
Janko 14-Feb-2009 [3586x2]	aha.. I think / hope I found an example of my problem ( I had already settled on doing things like this in multiple passes )
Janko 14-Feb-2009 [3586x2]	( the problem is with things that repeat when I don't know in which order they will appear .. I had this problem with parsing something like simplified wiki text )
	    >> a: "start1 1 end start2 2 end start1 3 end"
	    == "start1 1 end start2 2 end start1 3 end"
	    >> parse a [ SOME [ [ thru "start2" | thru "start1" ] copy T to "end" (print T) ] to end ]
	    2
	    3
	    == true
	    >> parse a [ SOME [ [ thru "start1" | thru "start2" ] copy T to "end" (print T) ] to end ]
	    1
	    3
	    == true
	    ( so as not to give the impression that I only have problems with parse: I have used parse to solve many things that would be a headache any other way... this and the problem up there are just cases where I got into trouble )
Anton 14-Feb-2009 [3588x3]	Yes, multiple passes can make the code simpler.
	Ah, here it's good to use nested rules to cut down the code.
	apiece: [copy T to "end" (?? T)] parse a [some [thru "start2" apiece | thru "start1" apiece] to end]
Janko 14-Feb-2009 [3591x2]	This is basically not a problem, as I solve these things with multiple passes and it works more than fast enough for me that way too ... I think this problem would not exist if, in the case of [ .. | .. | .. ], parse checked all options and took the one that is the fewest characters away from the current position (the one that matches first) .. but this would most probably slow down parse, and you would lose the feature that you define "priority" with [ .. | .. | .. ] now .. so maybe there could be a different | for this
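The difference between ordered alternatives and the "nearest match wins" behaviour wished for here can be shown side by side. The sketch below is Python, not REBOL, and only imitates the two rules from the session above: `scan` and its `nearest` flag are names invented for this example, and `str.find` stands in for `thru`.

```python
def scan(text, markers, stop="end", nearest=False):
    """Collect the text between each start marker and the next stop word.

    nearest=False imitates [thru m1 | thru m2]: the first marker in the list
    that occurs anywhere ahead wins, even if another one is closer.
    nearest=True picks whichever marker occurs closest to the current position.
    """
    pos, found = 0, []
    while True:
        hits = [(text.find(m, pos), m) for m in markers]
        hits = [(i, m) for i, m in hits if i != -1]
        if not hits:
            return found
        # ordered choice keeps list order; nearest choice takes the smallest index
        i, m = min(hits) if nearest else hits[0]
        start = i + len(m)
        j = text.find(stop, start)
        if j == -1:
            return found
        found.append(text[start:j].strip())
        pos = j


a = "start1 1 end start2 2 end start1 3 end"
print(scan(a, ["start2", "start1"]))                # ['2', '3']
print(scan(a, ["start1", "start2"]))                # ['1', '3']
print(scan(a, ["start1", "start2"], nearest=True))  # ['1', '2', '3']
```

The nearest variant has to search for every marker on every step, which is the extra cost mentioned above, and it gives up the ability to express priority through the order of the alternatives.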
Janko 14-Feb-2009 [3591x2]	( I have to go to eat... will be back .. thanks a lot for before)
Anton 14-Feb-2009 [3593]	no worries - I must sleep. :)
Janko 14-Feb-2009 [3594x2]	hm.. interesting solution .. never thought of doing it this way!! this would maybe solve these problems I had
Janko 14-Feb-2009 [3594x2]	hm.. really, thanks for this example.. I took it as unsolvable, but this is a totally elegant way to solve it .. I will need to think about this a little and do some more examples to digest it :) thanks
Anton 14-Feb-2009 [3596]	Not 100% elegant yet ! But glad to help, anyway.
Oldes 14-Feb-2009 [3597]	If you need to parse complex structures, like a markup language, you should use charsets and not the 'to or 'thru commands... for example you cannot say that a tag starts with < and ends with >, because a tag like this is valid as well: <input value="<>"> The 'to and 'thru commands are useful if you, for example, do data mining and don't care to parse the whole page structure just to get a bit of information from it.
Janko 14-Feb-2009 [3598]	Oldes, your examples have so far been too hard for me to grasp (but I am getting there :) ) ... I imagine they are more like what I described above as state machines, with which you can parse everything, even structured/nested data. I will need to study charset parsing at some point. I agree with your point otherwise, but just in this case < > & " ' are not allowed in HTML (or at least XHTML) and should always be encoded (but are not always), I think
Oldes 14-Feb-2009 [3599]	You are right.. but if you use it with a browser, it works.. the web is full of pages that don't validate :).. But I agree that it was not a good example.
amacleod 22-Feb-2009 [3600x2]	Is there a way to force parse to enclose results in {} instead of double quotes "" regardless of length?
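The `<input value="<>">` case shows why a plain "scan to the next >" is not enough. As a rough Python illustration (the helper `tag_end` is hypothetical and not from the discussion), the end of a tag can be found by walking the characters and ignoring any `>` that sits inside a quoted attribute value:

```python
def tag_end(html, start):
    """Return the index of the '>' closing the tag that opens at html[start], or -1."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start + 1, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None      # closing quote: back to normal scanning
        elif ch in "\"'":
            quote = ch            # opening quote: ignore '>' until it closes
        elif ch == ">":
            return i
    return -1


s = '<input value="<>"> some text'
print(s.find(">"))    # naive search stops inside the attribute value
print(tag_end(s, 0))  # quote-aware search stops at the real end of the tag
```

This is the character-by-character style that charset-based rules express; it assumes attribute values are either quoted or contain no `>`.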
amacleod 22-Feb-2009 [3600x2]	never mind I see my prob...
MaxV 20-Mar-2009 [3602]	Hello everybody! I have a problem. I need to extract email addresses from a big text like: bla bla [me-:-demo-:-com] bla bla ... <[you-:-example-:-org]> etc. [he-:-italy-:-it] Is it possible to obtain a text with all the addresses without the "<" and ">"?
Pekr 20-Mar-2009 [3603]	I am not sure I understand what you are up to ....
Maxim 20-Mar-2009 [3604]	do you want both emails within the <> and those without?
Geomol 20-Mar-2009 [3605]	>> str: "bla bla [me-:-demo-:-com] bla bla ... <[you-:-example-:-org]> etc. [he-:-italy-:-it]"
	    >> foreach w parse str none [if find e: to-email load w "@" [print e]]
	    [me-:-demo-:-com]
	    [you-:-example-:-org]
	    [he-:-italy-:-it]
	    or something.
Pekr 20-Mar-2009 [3606x3]	eh, nice :-)
	Here's an absolutely terrible parser - it does NOT follow the RFC; it allows any combination of alphanumeric chars, dots and dashes, then one @ char, then the same once again up to the next space char ...
	    space: #" "
	    mailchar: charset [#"0" - #"9" #"A" - #"Z" #"a" - #"z" ".-"]
	    at-char: #"@"
	    email: [ space start: some mailchar at-char some mailchar end: space (print copy/part start end) ]
	    str: "afadfa adfa asdfasdfa fd [asdfas-:-adfadf-:-adfa-adfadfsda-:-com] adfafaf a af"
	    parse/all str [any [email | skip]]
	That eliminates email addresses inside of < >, but maybe that was not the intention?
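For comparison, the same "run of allowed characters, one @, another run" idea can be written without requiring surrounding spaces, so that addresses inside < > are picked up too. This is a Python sketch and not a translation of the rule above: `MAILCHAR` and `find_emails` are names chosen for the example, the sample addresses in the usage line are invented, and like the original it makes no attempt to follow the RFC.

```python
import string

# same character class as the mailchar charset: letters, digits, dot and dash
MAILCHAR = set(string.ascii_letters + string.digits + ".-")


def find_emails(text):
    """Return every <mailchars>@<mailchars> run found in text (crude, not RFC-aware)."""
    found = []
    i, n = 0, len(text)
    while i < n:
        if text[i] not in MAILCHAR:
            i += 1
            continue
        start = i
        while i < n and text[i] in MAILCHAR:   # local part
            i += 1
        if i < n and text[i] == "@":
            j = i + 1
            while j < n and text[j] in MAILCHAR:   # domain part
                j += 1
            if j > i + 1:                          # at least one char after '@'
                found.append(text[start:j])
                i = j
                continue
        # not an address: the outer loop carries on from the current position
    return found


print(find_emails("bla bla me@demo.example bla <you@example.org> etc."))
```

Since `<`, `>` and spaces are simply not in the character class, they act as delimiters and never end up in the result. A trailing dot directly after an address would still be swallowed, which is one of the ways this stays "terrible".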
btiffin 20-Mar-2009 [3609]	It would be nice if REBOL could LOAD foreign! data. :) Hint hint wink wink. And being here in a public REBOL forum I might get in trouble for suggesting this one. $ grep -o -E '\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b' files...
Pekr 20-Mar-2009 [3610]	Brian ... your post is broken ... it contains some strange binary fragments :-)
Geomol 20-Mar-2009 [3611]	Brian, you can probably do that grep with a few CHARSET and PARSE in REBOL.
btiffin 20-Mar-2009 [3612]	And actually I think it's wrong anyway ... as it should be. Posting regex in a REBOL forum ... shame on me. ;)
MaxV 23-Mar-2009 [3613]	Thank you, I'll try Pekr's solution. I don't need the "<" and ">" characters. By the way, where can I find some good parse documentation?