World: r3wp

[Parse] Discussion of PARSE dialect

Janko
14-Feb-2009
[3564]
I don't know the exact term for this, but I have built many parsers for 
things like XML, wiki text and some other custom formats in various 
lower-level languages using a simple state machine (at least that's 
what I called it)... To my understanding you can parse anything with 
something like that, including structured nested data, but of 
course it takes more coding than this REBOL solution... What I 
mean by a state machine is a loop that accepts characters or words 
and has a predefined number of states, with code for what to do in 
each state and when to switch to another state, etc.
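A minimal sketch of the kind of hand-written state machine described here, written in REBOL for comparison with the PARSE rules in this thread. The two states (outside, inside), the input string and the variable names are illustrative assumptions, not code from the discussion.

```rebol
; Hypothetical two-state machine: collect the text found between < and >.
state: 'outside
tags: copy []
current: copy ""

foreach ch "a <b> c <i>" [
    switch state [
        outside [
            ; wait for the start of a tag
            if ch = #"<" [state: 'inside  clear current]
        ]
        inside [
            either ch = #">" [
                append tags copy current   ; tag finished, store it
                state: 'outside
            ][
                append current ch          ; still inside the tag
            ]
        ]
    ]
]
probe tags
```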
Anton
14-Feb-2009
[3565]
Right, yes. We agree.
Janko
14-Feb-2009
[3566]
ok :)
Anton
14-Feb-2009
[3567]
What is the next problem ?
Janko
14-Feb-2009
[3568]
that was the big stopper that you just solved for me.. there are 
no other problems for now .. just the willingness to type in all the 
code :) ..
Anton
14-Feb-2009
[3569x2]
I know what it could be - eg:
<img src=afile.jpg>
<img src="afile.jpg">
<img src='afile.jpg'>
The first one without any quotes causes a little bit of a problem 
(solvable).
Janko
14-Feb-2009
[3571x2]
maybe you can make OPT [ " | ' ] ?
copy to [ " | > | ' ] ?
Anton
14-Feb-2009
[3573]
You have to use a variable to store which one was used, then parse 
until that character is encountered again.
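A sketch of that idea, assuming the attribute is always quoted with either " or '. The names quote-char, mark, q and value are made up for illustration.

```rebol
; Remember which quote character opened the value, then copy up to it.
quote-char: charset {"'}

foreach html [
    {<img src="afile.jpg">}
    {<img src='afile.jpg'>}
][
    parse/all html [
        thru "src="
        mark: quote-char (q: first mark)   ; store the opening quote
        copy value to q                    ; copy until the same quote appears again
        to end
    ]
    print value
]
```

The word q is looked up when the rule reaches it, so it already holds the quote character found a moment earlier.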
Janko
14-Feb-2009
[3574x2]
yes, that's how I did it
>> "content=" m: skip (m1: first m ) copy T to m1<<
Anton
14-Feb-2009
[3576]
So you did.
Janko
14-Feb-2009
[3577]
in meta tags example
Anton
14-Feb-2009
[3578]
But when no quotes are used, it gets tricky, eg:
<img src= afile.jpg width=10>
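One possible way to cover the unquoted case as well is to try the quoted forms first and fall back to a charset that stops at a space or >. This is only a sketch: value-char and attr-value are hypothetical names, and real-world HTML has more cases than these three.

```rebol
; Characters allowed in an unquoted value (assumption: anything but space, > and quotes)
value-char: complement charset { >"'}

attr-value: [
    {"} copy value to {"} skip      ; double-quoted
    | "'" copy value to "'" skip    ; single-quoted
    | copy value some value-char    ; unquoted: stop at space or >
]

foreach html [
    {<img src="afile.jpg">}
    {<img src='afile.jpg'>}
    {<img src= afile.jpg width=10>}
][
    parse/all html [thru "src=" any " " attr-value to end]
    print value
]
```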
Janko
14-Feb-2009
[3579]
my biggest problem (which I thought was unsolvable, but 
I have to study your example to see why it works) is the order of things
Anton
14-Feb-2009
[3580]
Is this a surprise ?
>> parse "abc" [some ["b" | "c" | "a"]]
== true
Janko
14-Feb-2009
[3581x2]
hm.. I don't know right now.. you confused me.. I thought I tried 
everything and what I needed just didn't work, but I don't have 
an example in my head
I will try to think of one
Anton
14-Feb-2009
[3583]
Yes, it takes a little while to become familiar with parse.
Janko
14-Feb-2009
[3584]
this does surprise me a little, but I am not sure if this was the 
problem or something else, because I thought I had tried with some and 
all the other things
Anton
14-Feb-2009
[3585]
It means, basically:
	SOME: Do this 1 or more times, until fail or end is reached:
	[Try "b". If that fails, try "c". If that fails, try "a"]
	<--- Given "a" "b" "c", this rule always succeeds.
Janko
14-Feb-2009
[3586x2]
aha.. I think / hope I found an example of my problem (I had already 
settled on doing things like this in multiple passes)
(the problem is with things that repeat when I don't know in 
which order they will appear .. I had this problem when parsing something 
like simplified wiki text)
>> a: "start1 1 end start2 2 end start1 3 end"
== "start1 1 end start2 2 end start1 3 end"

>> parse a [ SOME [ [ thru "start2" | thru "start1" ] copy T to "end" (print T) ] to end ]
 2
 3
== true

>> parse a [ SOME [ [ thru "start1" | thru "start2" ] copy T to "end" (print T) ] to end ]
 1
 3
== true

(so as not to give the impression that I only have problems with parse: I have used 
parse to solve many things that would be a headache any other way... 
these and the problem up there are just cases where I got into trouble)
Anton
14-Feb-2009
[3588x3]
Yes, multiple passes can make the code simpler.
Ah, here it's good to use nested rules to cut down the code.
apiece: [copy T to "end" (?? T)]

parse a [some [thru "start2" apiece | thru "start1" apiece]  to end]
Janko
14-Feb-2009
[3591x2]
This is basically not a problem, as I solve these things with multiple 
passes and that works more than fast enough for me ... 
I think this problem would not exist if, in the case of [ .. | .. | .. 
], parse checked all options and took the one that is the fewest characters 
away from the current position (the one that matches first) .. but this 
would most probably slow parse down, and you would lose the feature 
of defining "priority" with [ .. | ..  | .. ] .. so maybe 
there could be a different | for this
( I have to go to eat... will be back .. thanks a lot for before)
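One way to get "whichever marker comes first" without a new kind of | is to stop using THRU for the markers and instead advance one character at a time with SKIP when nothing matches. The sketch below reuses the string from the example above; the use of /all (so that spaces are kept) is an assumption of this sketch.

```rebol
a: "start1 1 end start2 2 end start1 3 end"
apiece: [copy T to "end" (print T)]

; At each position try both markers; if neither matches here,
; move forward by one character and try again.
parse/all a [
    any [
        "start1" apiece
        | "start2" apiece
        | skip
    ]
]
```

Because neither alternative can jump ahead over the other marker, the pieces are reported in the order they appear in the input (1, 2, 3), whichever marker introduces them.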
Anton
14-Feb-2009
[3593]
no worries - I must sleep. :)
Janko
14-Feb-2009
[3594x2]
hm.. interesting solution .. never thought of doing it this way!! 
this would maybe solve these problems I had
hm.. really, thanks for this example.. I took it as unsolvable, but 
this is a totally elegant way to solve it .. I will need to think on 
this a little and do some more examples to digest it :) thanks
Anton
14-Feb-2009
[3596]
Not 100% elegant yet !  But glad to help, anyway.
Oldes
14-Feb-2009
[3597]
If you need to parse complex structures, like a markup language, 
you should use charsets and not the 'to or 'thru commands... for example, 
you cannot say that a tag starts with < and ends with >, because such 
a tag is valid as well:
<input value="<>">

The 'to and 'thru commands are useful if you are, for example, data mining 
and don't care to parse the whole page structure just to get a bit of information 
from it.
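A rough sketch of what a charset-based tag rule could look like, so that < and > inside a quoted value do not end the tag. It assumes only double-quoted values; the names not-dquote, plain and tag-rule are hypothetical.

```rebol
tag-text: {<input value="<>">}

not-dquote: complement charset {"}    ; anything except a double quote
plain: complement charset {">}        ; anything except a double quote or >

tag-rule: [
    "<"
    any [
        {"} any not-dquote {"}        ; quoted value, may contain < or >
        | some plain                  ; everything else inside the tag
    ]
    ">"
]

print parse/all tag-text [tag-rule end]
```

With a plain [thru "<" to ">"] the match would stop at the > inside the quotes instead.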
Janko
14-Feb-2009
[3598]
Oldes, your examples have so far been too hard for me to grasp (but I 
am getting there :) ) ... I imagine they are more like what I described 
above as state machines, with which you can parse everything, even 
structured/nested data. I will need to study charset parsing at some 
point. I agree with your point otherwise, but just in this case <> 
& " ' are not allowed in HTML (or at least XHTML) and should always 
be encoded (but are not always), I think
Oldes
14-Feb-2009
[3599]
You are right.. but if you use it with a browser, it works.. the web is 
full of invalid pages :).. But I agree that it was not a good 
example.
amacleod
22-Feb-2009
[3600x2]
Is there a way to force parse to enclose results in {} instead of 
double quotes "" regardless of length?
never mind I see my prob...
MaxV
20-Mar-2009
[3602]
Hello everybody!

I have a problem. I need to extract email addresses from a big text 
like

bla bla [me-:-demo-:-com] bla bla ...  <[you-:-example-:-org]>  etc. [he-:-italy-:-it]

Is it possible to obtain a text with all the addresses without 
the "<" and ">"?
Pekr
20-Mar-2009
[3603]
I am not sure I understand what you are up to ....
Maxim
20-Mar-2009
[3604]
do you want both emails within the <> and those without?
Geomol
20-Mar-2009
[3605]
>> str: "bla bla [me-:-demo-:-com] bla bla ...  <[you-:-example-:-org]>  etc. [he-:-italy-:-it]"

>> foreach w parse str none [if find e: to-email load w "@" [print e]]
[me-:-demo-:-com]
[you-:-example-:-org]
[he-:-italy-:-it]

or something.
Pekr
20-Mar-2009
[3606x3]
eh, nice :-)
Here's an absolutely terrible parser - it does NOT follow the RFC; it allows 
any combination of alphanumeric chars, dots and hyphens, then one @ char, and then the same 
again up to the next space char ...

space: #" "
mailchar: charset [#"0" - #"9" #"A" - #"Z" #"a" - #"z" ".-"]
at-char: #"@"

email: [

   space
   start:
   some mailchar
   at-char
   some mailchar
   end:
   space
   (print copy/part start end)

]


str: "afadfa adfa asdfasdfa fd [asdfas-:-adfadf-:-adfa-adfadfsda-:-com] adfafaf a af"

parse/all str [any [email | skip]]
That eliminates email addresses inside of < >, but maybe that was not 
the intention?
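If addresses inside < > should be found too, one option is to drop the requirement for surrounding spaces and copy the matched span directly. This is a sketch with the same non-RFC character set as above; the sample addresses and the names email-rule, addr and found are made up for illustration.

```rebol
str: "bla bla me@demo.example bla bla ...  <you@example.org>  etc. he@italy.example"

mailchar: charset [#"0" - #"9" #"A" - #"Z" #"a" - #"z" ".-"]
found: copy []

email-rule: [
    copy addr [some mailchar "@" some mailchar]   ; chars, one @, chars
    (append found addr)
]

; Try the email rule at every position; otherwise move on by one character.
parse/all str [any [email-rule | skip]]
probe found
```

Since "<" and ">" are not in mailchar, they are never part of a copied address. A trailing dot directly after an address would still be picked up, so this remains a rough filter rather than a validator.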
btiffin
20-Mar-2009
[3609]
It would be nice if REBOL could LOAD foreign! data.  :)  Hint hint 
wink wink.


And being here in a public REBOL forum I might get in trouble for 
suggesting this one.

$ grep -o -E '\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b' files...
Pekr
20-Mar-2009
[3610]
Brian ... your post is broken ... it contains some strange binary 
fragments :-)
Geomol
20-Mar-2009
[3611]
Brian, you can probably do that grep with a few CHARSET and PARSE 
in REBOL.
btiffin
20-Mar-2009
[3612]
And actually I think it's wrong anyway ... as it should be.  Posting 
regex in a REBOL forum ... shame on me.   ;)
MaxV
23-Mar-2009
[3613]
Thank you, I'll try Pekr's solution. I don't need the "<" and ">" characters.
However, where can I find some good parse documentation?