World: r3wp

[Parse] Discussion of PARSE dialect

Chris
9-Feb-2009
[3490]
As far as I'm aware, Mc and Mac are interchangeable.
Graham
9-Feb-2009
[3491x2]
In legal documents? Interesting.
I'm grabbing my phone book ....
BrianH
9-Feb-2009
[3493]
My family switched away from the Scottish spelling too, back in the 
19th century when that branch came to the US.
Chris
9-Feb-2009
[3494]
Didn't say that, just usage.
BrianH
9-Feb-2009
[3495]
Each family picks one spelling and sticks with it nowadays, mostly 
because of those legal documents.
Graham
9-Feb-2009
[3496x2]
Yep, my phone book has the Macleans between the Mcleans
so the alphabetical ordering system they're using treats mc and mac 
the same
Chris
9-Feb-2009
[3498]
B: from what name?
BrianH
9-Feb-2009
[3499x2]
Phone book sorting - that's really complex :(
Halle
Chris
9-Feb-2009
[3501]
Sounds nordic...
BrianH
9-Feb-2009
[3502x2]
To Hawley, the English spelling. To reduce prejudice in the US.
It's old Celtic.
Graham
9-Feb-2009
[3504x2]
Apple MacIntosh ??
I think I'll skip Macs
Chris
9-Feb-2009
[3506]
As opposed to MacKintosh.
Steeve
9-Feb-2009
[3507]
you can't guess, you need the list of all clans :)
Janko
14-Feb-2009
[3508x4]
hi, it's me again with parse problems...  I need this concretely 
to parse out web-page meta tags, but I distilled the problem down 
to a minimal example.
doc1: "start A 1 end start B 2 end"  how can you get the value 2 
out?
It works with A because it's first, but because it enters the "parse" 
with the first "start" and then doesn't match, it doesn't test the B again.

>> parse doc1 [ "start" "A" copy R to "end" (print R) to end ]
 1
== true
>> parse doc1 [ "start" "B" copy R to "end" (print R) to end ]
== false
I thought it would recheck if I put it into something like SOME [ ] 
but it doesn't:

parse doc1 [ SOME [ "start" "B" copy R to "end" (print R) to end ] ]
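Some background on why this fails: PARSE does not search the input by itself. A rule is matched at the current position only, so when "B" fails after the first "start" there is nothing left to try, and SOME fails too because its very first iteration fails. One common idiom is to add `| skip` as a fallback, so the parser steps one character forward and retries. The following is a minimal sketch, assuming REBOL 2 string parsing without /ALL; the word `found` is introduced here purely for illustration.

```rebol
doc1: "start A 1 end start B 2 end"
found: none   ; hypothetical word; stays none if no "start B ... end" exists

parse doc1 [
    any [
        ; try the strict sequence at the current position
        "start" "B" copy R to "end" "end" (if R [found: trim R])
        ; otherwise step one character forward and try again
        | skip
    ]
]
print found
```

Because "start" must be followed directly by "B" (whitespace aside), a stray "B" elsewhere in the string is not picked up, which a rule built on `thru "B"` would do.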
kib2
14-Feb-2009
[3512]
Maybe ? parse/all doc1 [ thru "B" copy number to "end" (print number) ]
But I'm just beginning with parse, so I'm no expert.
Janko
14-Feb-2009
[3513x2]
This would work in this case, but I need to get "2" only if the sequence 
before it is exactly "start" "B" XX "end" ...  there can be "B" in other 
places of the string and it mustn't take that. (I am used to using thru 
and to as well, but I mustn't use them in this case, because they might 
just skip to some other "B".)
>> doc1: "start A 1 end xyz B 2 end" ;; in this case it must not take 2
== "start A 1 end xyz B 2 end"

>> parse doc1 [ "start" thru "B" copy R to "end" (print R) to end ] ;; but it will, that's why I can't use to\thru
 2
== true
Anton
14-Feb-2009
[3515]
some ["start" ["A" | "B"] copy R to "end" "end"]
Janko
14-Feb-2009
[3516]
ups ... my example above is wrong .. just a sec
Anton
14-Feb-2009
[3517]
no, hang on...
Janko
14-Feb-2009
[3518x2]
Anton this would  return me 1 probably ?
(this is the right example .. I forgot to use thru above so second 
wouldn't pass anyway... but result is the same)

>> doc1: "start A 1 end start B 2 end"
== "start A 1 end start B 2 end"

>> parse doc1 [ thru "start" "A" copy R to "end" (print R) to end ]
 1
== true

>> parse doc1 [ thru "start" "B" copy R to "end" (print R) to end ]
== false

>> parse doc1 [ SOME [ thru "start" "B" copy R to "end" (print R) to end ] ]
== false
Anton
14-Feb-2009
[3520]
Is there anything expected between "start" and "A", for instance 
?
Janko
14-Feb-2009
[3521]
I know how to solve this by making it less robust (in this case relying 
on there being only one space in between), but this doesn't solve my 
problem well:

>> parse doc1 [ thru "start B" copy R to "end" (print R) to end ]
 2
== true
Anton
14-Feb-2009
[3522]
No need for that.
Janko
14-Feb-2009
[3523]
1 or more spaces (to your question)
Anton
14-Feb-2009
[3524]
parse doc1 [some [thru "start" ["A" | "B"] copy R to "end" (?? R) "end"]]
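If it also matters which of the two alternatives matched, one option is to set a word inside parens in each branch. This is only a sketch built on the rule above, assuming REBOL 2 string parsing without /ALL; `which` and `results` are names made up for the example.

```rebol
doc1: "start B 2 end start A 1 end"
results: copy []   ; hypothetical accumulator of tag/value pairs

parse doc1 [
    some [
        thru "start"
        ; remember which alternative matched
        ["A" (which: 'A) | "B" (which: 'B)]
        copy R to "end" "end"
        ; TRIM drops the whitespace that COPY picks up around the value
        (if R [repend results [which trim R]])
    ]
]
probe results
```

Consuming "end" after `to "end"` leaves the parse position past the whole record, so the next iteration of SOME starts cleanly at the following "start".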
Janko
14-Feb-2009
[3525]
hm.. just a sec so I try few things
Anton
14-Feb-2009
[3526]
PARSE without the /ALL refinement handles any amount of whitespace. 
(You will probably end up using parse/all, though. I usually do when 
parsing HTML.)
Janko
14-Feb-2009
[3527]
I thought your solution wouldn't work if I reversed the order of A and 
B in the string, but it seems it does.  I would need to know which 
one matched, A or B, but I think this can be solved by setting some word 
in ( ) inside [ A | B ] ...  so basically it seems to work. I think 
I can apply this approach to my concrete problem too, which is this
kib2
14-Feb-2009
[3528]
I don't understand why not simply : 

parse/all doc1 [ thru "start B" copy number to "end" (print number) ]
Anton
14-Feb-2009
[3529x2]
You leave the pointer at beginning of "end" in the doc1 string. Look 
at my example, I move TO "end", then I also consume "end".
... to "end" (?? R) "end"]
Janko
14-Feb-2009
[3531]
kib2: because I don't know how many spaces are between start and 
B .. and in my concrete case I need to have multiple rules.. I will 
show a concrete example
Anton
14-Feb-2009
[3532]
The second one actually consumes the "end", moving the pointer (the 
current parse index) through it.
kib2
14-Feb-2009
[3533]
Anton: Janko just said he wanted to extract the "2", so I don't care 
where the pointer is, no ?
Anton
14-Feb-2009
[3534]
Mmm.. probably true, but better to be neat and tidy with rules, then 
they can be reused in slightly different ways and still work as expected.
Janko
14-Feb-2009
[3535]
kib... because in my concrete case I think I need *complex rules*, not 
just 1 string, for it to work .. it has to work on all sorts of pages 
written by anyone.. you will see once I show you the real example .. right now
Anton
14-Feb-2009
[3536]
First define whitespace:
whsp: charset " ^-^/" ; Whitespace: space, tab, newline.
kib2
14-Feb-2009
[3537]
Ok, and if we define any space like this :

space: charset " ^-"

parse/all doc1 [ thru "start" any space thru "B" copy number to "end" (print number) to end]
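A stricter variant of this kind of rule can be sketched under the assumption that only whitespace may separate "start" from "B". `thru "B"` would also jump over unrelated text such as `xyz`, which is the case that must be rejected. The charset is named `whsp` as in Anton's definition, since REBOL already predefines `space` as the space character; `extract-b` is a hypothetical helper name.

```rebol
whsp: charset " ^-^/"   ; space, tab, newline

; Hypothetical helper: returns the text between "start B" and "end", or none.
extract-b: func [doc [string!] /local R result] [
    result: none
    parse/all doc [
        any [
            ; with /ALL, whitespace has to be matched explicitly
            "start" some whsp "B" some whsp copy R to "end" "end"
            (if R [result: trim R])
            ; no match here: advance one character and retry
            | skip
        ]
    ]
    result
]

probe extract-b "start A 1 end start   B 2 end"
probe extract-b "start A 1 end xyz B 2 end"
```

The first call should yield the string "2" and the second none, because "B" there does not follow a "start".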
Janko
14-Feb-2009
[3538x2]
( I need to parse meta tags description and keywords and abstract 
if they exist -- they can come in any order, there can be one or 
multiple spaces/newlines/tabs between tag arguments, there can be 
" or '  used as argument="asdasd" )
>> doc2: {<head>
{    <title>Dragonicum.com - making the right business connections !</title>
{    <meta name="keywords" content="Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer" />
{    <meta name="description" content="New international trade portal and company directory for Asia, Europe and North America. Our priority No.1 is to create and maintain a safe, well lit business-to-business marketplace, by assisting our members in identifying new trustworthy business partners!" />
{    <link rel="stylesheet" href="style/blue_main.css" type="text/css" />}
== {<head>
<title>Dragonicum.com - making the right business connections !</title>
<meta name="keywords" content="Company Directory...

>> T: "" parse doc [ thru "<meta" "name=" skip "keywords" skip "content=" m: skip (m1: first m ) copy T to m1  to end ] print T

Company Directory, Join Us, Advanced Search, Trade Leads, Forum, 
Trade Shows, Advertising, Translation, fair trade, trade portal, 
business to business, trade
leads, trade events, china export, china manufacturer

>> T: "" parse doc [ thru "<meta" "name=" skip "description" skip "content=" m: skip (m1: first m ) copy T to m1  to end ] print T

>>

( as you see, because keywords come first it works for them, but it 
doesn't for description; they can be in a different order in other 
documents, etc. )
I can't just use {<meta name="keywords" content="} as the rule because 
that would work only on pages that use exactly one space and 
"