World: r3wp

[Parse] Discussion of PARSE dialect

Janko
14-Feb-2009
[3519]
(this is the right example .. I forgot to use thru above, so the second 
one wouldn't pass anyway... but the result is the same)

>> doc1: "start A 1 end start B 2 end"
== "start A 1 end start B 2 end"

>> parse doc1 [ thru "start" "A" copy R to "end" (print R) to end ]
 1
== true

>> parse doc1 [ thru "start" "B" copy R to "end" (print R) to end ]
== false

>> parse doc1 [ SOME [ thru "start" "B" copy R to "end" (print R) to end ] ]
== false
Anton
14-Feb-2009
[3520]
Is there anything expected between "start" and "A", for instance?
Janko
14-Feb-2009
[3521]
I know how to solve this by making it less robust (in this case relying 
on there being only one space in between), but this doesn't solve my 
problem well 

>> parse doc1 [ thru "start B" copy R to "end" (print R) to end ]
 2
== true
Anton
14-Feb-2009
[3522]
No need for that.
Janko
14-Feb-2009
[3523]
1 or more spaces (to your question)
Anton
14-Feb-2009
[3524]
parse doc1 [some [thru "start" ["A" | "B"] copy R to "end" (?? R) "end"]]
Janko
14-Feb-2009
[3525]
hm.. just a sec while I try a few things
Anton
14-Feb-2009
[3526]
PARSE without the /ALL refinement handles any amount of whitespace. 
(You will probably end up using parse/all, though. I usually do when 
parsing HTML.)
Janko
14-Feb-2009
[3527]
Your solution: I thought it wouldn't work if I reversed the order of A 
and B in the string, but it seems it does. I would need to know which 
one is A and which is B, but I think this can be solved by setting some 
word in a ( ) inside [ A | B ] ... so basically it seems to work... I 
think I can apply this approach to my concrete problem as well, which is this
kib2
14-Feb-2009
[3528]
I don't understand why not simply: 

parse/all doc1 [ thru "start B" copy number to "end" (print number) ]
Anton
14-Feb-2009
[3529x2]
You leave the pointer at the beginning of "end" in the doc1 string. Look 
at my example: I move TO "end", then I also consume "end".
... to "end" (?? R) "end"]
Janko
14-Feb-2009
[3531]
kib2: because I don't know how many spaces are between start and 
B .. and in my concrete case I need to have multiple rules.. I will 
show a concrete example
Anton
14-Feb-2009
[3532]
The second one actually consumes the "end", moving the pointer (the 
current parse index) through it.
kib2
14-Feb-2009
[3533]
Anton: Janko just said he wanted to extract the "2", so I don't care 
where the pointer is, no?
Anton
14-Feb-2009
[3534]
Mmm.. probably true, but better to be neat and tidy with rules, then 
they can be reused in slightly different ways and still work as expected.
Janko
14-Feb-2009
[3535]
kib... because in my concrete case I think I need *complex rules*, not 
just 1 string, for it to work .. it has to work on all sorts of pages 
written by anyone.. you will see once I show you the real example .. right now
Anton
14-Feb-2009
[3536]
First define whitespace:
whsp: charset " ^-^/" ; Whitespace: space, tab, newline.
kib2
14-Feb-2009
[3537]
Ok, and if we define any space like this:

space: charset " ^-"

parse/all doc1 [ thru "start" any space thru "B" copy number to "end" (print number) to end]
Janko
14-Feb-2009
[3538x2]
( I need to parse meta tags description and keywords and abstract 
if they exist -- they can come in any order, there can be one or 
multiple spaces/newlines/tabs between tag arguments, there can be 
" or '  used as argument="asdasd" )
>> doc2: {<head>
{    <title>Dragonicum.com - making the right business connections !</title>
{    <meta name="keywords" content="Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer" />
{    <meta name="description" content="New international trade portal and company directory for Asia, Europe and North America. Our priority No.1 is to create and maintain a safe, well lit business-to-business marketplace, by assisting our members in identifying new trustworthy business partners!" />
{    <link rel="stylesheet" href="style/blue_main.css" type="text/css" />}
== {<head>

<title>Dragonicum.com - making the right business connections !</title>
<meta name="keywords" content="Company Directory...


>> T: "" parse doc2 [ thru "<meta" "name=" skip "keywords" skip "content=" m: skip (m1: first m) copy T to m1 to end ] print T

Company Directory, Join Us, Advanced Search, Trade Leads, Forum, 
Trade Shows, Advertising, Translation, fair trade, trade portal, 
business to business, trade
leads, trade events, china export, china manufacturer

>> T: "" parse doc2 [ thru "<meta" "name=" skip "description" skip "content=" m: skip (m1: first m) copy T to m1 to end ] print T

>>

( as you see, because keywords come first it works for them, but it doesn't 
for description; they can be in a different order in another document 
etc)
I can't just use {<meta name="keywords" content="} as the rule because 
that would work only on pages that use exactly one space and 
" as the quote character
Anton
14-Feb-2009
[3540]
Yes, I know this problem.
Janko
14-Feb-2009
[3541]
I have been banging my head against it for half a day :) .. now at 
least I know what exactly the problem is.. why it happens.. at first 
I had no clue.. but I still have no idea how to solve it
Anton
14-Feb-2009
[3542]
I have solved a similar parse job...
Janko
14-Feb-2009
[3543x2]
maybe your solution for A | B would work.. I will try
ha, yes it works .. brilliant!
Anton
14-Feb-2009
[3545]
it does ?
Janko
14-Feb-2009
[3546x4]
yes :) thanks a lot!
>> T: K: D: "" parse doc2 [ SOME [ thru "<meta" "name=" skip [ "description" (V: 'D) | "keywords" (V: 'K) ] skip "content=" m: skip (m1: first m) copy T to m1 (set V T) ] to end ] ?? K ?? D

K: {Company Directory, Join Us, Advanced Search, Trade Leads, Forum, 
Trade Shows, Advertising, Translation, fair trade, trade portal, 
business to business, trade leads, trade events, china export, 
china manufacturer}

D: {New international trade portal and company directory for Asia, 
Europe and North America. Our priority No.1 is to create and maintain 
a safe, well lit business-to-business marketplace, by assisting our 
members in identifying new trustworthy business partners!}

== {New international trade portal and company directory for Asia, 
Europe and North America. Our priority No.1 is to create and mai...
>>
it is also not dependent on the order of things, and I still have 
to figure out why that is .. it works no matter which one comes before 
the other
I intended to make a blogpost .. "REBOL parse challenge" and present 
this problem and ask if people can provide solutions in other languages 
that would be more elegant ... (in a similar vein to the "arc challenge") 
... now that it seems an even harder nut to crack I should probably 
really do it .. does anyone think this would be easy to solve using 
a conventional language? (I think not)
Anton
14-Feb-2009
[3550]
I'm sure there are some elegant solutions in other languages too.
Janko
14-Feb-2009
[3551]
hm.. would this be nicely solvable with a regex? .. I think it would 
be quite a pain using regular string functions like strpos, substr 
etc... given the same requirements (one or more spaces/tabs/newlines, 
" or ', undefined order)
Anton
14-Feb-2009
[3552]
I don't know - I only learn regex when I have to .. then a short 
time later I forget.
Janko
14-Feb-2009
[3553]
yes, me also
Anton
14-Feb-2009
[3554]
perl could do it pretty quick, I'm sure.
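
For comparison, here is a minimal regex sketch of the same extraction in Python rather than Perl. It assumes the requirements stated above (one or more spaces/tabs/newlines between attributes, " or ' as the quote character, tags in any order). The names `META_RE` and `extract_meta` and the sample string are made up for illustration, and the pattern assumes `name` comes before `content`, as in the page shown earlier; it is not a general HTML parser.

```python
import re

# <meta name=... content=...> with flexible whitespace and either quote.
# Assumption: the name attribute precedes the content attribute.
META_RE = re.compile(
    r"""<meta\s+name\s*=\s*(["'])(?P<name>[\w.:-]+)\1\s+content\s*=\s*(["'])(?P<content>.*?)\3""",
    re.IGNORECASE | re.DOTALL,
)

def extract_meta(html, wanted=("keywords", "description", "abstract")):
    """Return {name: content} for the wanted meta tags, whatever their order."""
    found = {}
    for m in META_RE.finditer(html):
        name = m.group("name").lower()
        if name in wanted:
            found[name] = m.group("content")
    return found

# Hypothetical sample: description first, mixed quotes, tab and newline gaps.
sample = """<head>
<meta name='description'
      content="A trade portal" />
<meta  name="keywords"\tcontent='fair trade, trade leads' />
</head>"""

print(extract_meta(sample))
```

The backreferences `\1` and `\3` play the same role as `m1` in the parse rule: whichever quote opened the value must also close it.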
Janko
14-Feb-2009
[3555x4]
a perl pro would certainly use regex (that is its original home) 
:) ... I think parse and regex are each best for different problems, 
I am just not sure if this one is better solved with one or the other
regex, I imagine, sucks at structured stuff where you have to make 
some sort of state machine; for example, I don't think regex can 
parse xml well ... state machines are excellent at that, but they do 
require more code than parse would
I will see with the "parse challenge" .. if I wanted to be really 
*sneaky* I could ask whether anyone in the perl community can solve 
this .. and if their solution sucked more than rebol's, then make the 
blogpost  :)
but I am not like that ;)
Anton
14-Feb-2009
[3559x2]
Yeah, I'm not really sure what that would prove. :)
What would you build a state machine with, which would generate so 
much code ?
Janko
14-Feb-2009
[3561]
I don't fully understand your question?
Anton
14-Feb-2009
[3562x2]
You say "state machines ... require more code". What code ? Obviously, 
you can build a state machine in any language, but I guess I'm wondering 
what ... ohh... I'm so tired after all those cheese sandwiches....
Anyway, I think I understand what you're saying. A state machine 
is big and clunky, expressing everything you don't want to hear about, 
while parse allows you to express your target more directly, cutting 
through anything you don't want without having to specify it.
Janko
14-Feb-2009
[3564]
I don't know the exact term for this, but I have built many parsers for 
things like xml, wiki text and some other custom formats in various 
lower-level languages using a simple state machine (at least that's 
what I called it)... To my understanding you can parse anything with 
something like that, including structured nested data, but it of 
course takes more coding than this rebol solution... what I 
mean by a state machine is a loop that accepts characters or words 
and has a predefined number of states, with code for what to do in 
each state and when to switch to another state etc..
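
A minimal sketch of the kind of loop described above, written in Python: it reads one character at a time, keeps a current state, and decides per state what to do and when to switch. The function name `parse_attributes` and the state names are invented for this illustration. It handles only quoted attribute values and is fed the inside of a single tag, so it is far from a complete HTML parser.

```python
def parse_attributes(text):
    """Collect name="value" / name='value' pairs with a hand-written state machine."""
    SEEK_NAME, IN_NAME, SEEK_EQUALS, SEEK_VALUE, IN_VALUE = range(5)
    state = SEEK_NAME
    name, value, quote = [], [], ""
    attrs = {}
    for ch in text:
        if state == SEEK_NAME:
            if ch.isalpha():
                name = [ch]
                state = IN_NAME
        elif state == IN_NAME:
            if ch.isalnum() or ch in "-_":
                name.append(ch)
            elif ch == "=":
                state = SEEK_VALUE
            elif ch.isspace():
                state = SEEK_EQUALS
            else:
                state = SEEK_NAME       # not an attribute name; start over
        elif state == SEEK_EQUALS:
            if ch == "=":
                state = SEEK_VALUE
            elif ch.isalpha():          # previous word had no value
                name = [ch]
                state = IN_NAME
            elif not ch.isspace():
                state = SEEK_NAME
        elif state == SEEK_VALUE:
            if ch in "\"'":
                quote, value = ch, []   # remember which quote opened the value
                state = IN_VALUE
            elif not ch.isspace():
                state = SEEK_NAME       # unquoted values are not handled here
        elif state == IN_VALUE:
            if ch == quote:
                attrs["".join(name)] = "".join(value)
                state = SEEK_NAME
            else:
                value.append(ch)
    return attrs

tag = "meta  name = 'keywords'\n content=\"fair trade, Bob's leads\" /"
print(parse_attributes(tag))
```

Even this reduced version needs five states and a few dozen lines for what the parse rule expresses in one line, which is the extra coding referred to above.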
Anton
14-Feb-2009
[3565]
Right, yes. We agree.
Janko
14-Feb-2009
[3566]
ok :)
Anton
14-Feb-2009
[3567]
What is the next problem ?
Janko
14-Feb-2009
[3568]
that was the big stopper that you just solved for me.. there are 
no other problems for now .. just the willingness to type in all the 
code :) ..