World: r3wp
[Parse] Discussion of PARSE dialect
Chris 9-Feb-2009 [3490] | As far as I'm aware, Mc and Mac are interchangeable. |
Graham 9-Feb-2009 [3491x2] | In legal documents? Interesting. |
I'm grabbing my phone book .... | |
BrianH 9-Feb-2009 [3493] | My family switched away from the Scottish spelling too, back in the 19th century when that branch came to the US. |
Chris 9-Feb-2009 [3494] | Didn't say that, just usage. |
BrianH 9-Feb-2009 [3495] | Each family picks one spelling and sticks with it nowadays, mostly because of those legal documents. |
Graham 9-Feb-2009 [3496x2] | Yep, my phone book has the Macleans between the Mcleans |
so the alphabetical ordering system they're using treats mc and mac the same | |
Chris 9-Feb-2009 [3498] | B: from what name? |
BrianH 9-Feb-2009 [3499x2] | Phone book sorting - that's really complex :( |
Halle | |
Chris 9-Feb-2009 [3501] | Sounds nordic... |
BrianH 9-Feb-2009 [3502x2] | To Hawley, the English spelling. To reduce prejudice in the US. |
It's old Celtic. | |
Graham 9-Feb-2009 [3504x2] | Apple MacIntosh ?? |
I think I'll skip Macs | |
Chris 9-Feb-2009 [3506] | As opposed to MacKintosh. |
Steeve 9-Feb-2009 [3507] | you can't guess, you need the list of all clans :) |
Janko 14-Feb-2009 [3508x4] | hi, it's me again with parse problems... I need this concretely to parse out web-page meta tags.. but I distilled the problem out of it to a minimal example.. |
Given doc1: "start A 1 end start B 2 end", how can you get the value 2 out? | |
It works with A because it's first, but because it enters the "parse" with it and then doesn't match, it doesn't test again for the B:
>> parse doc1 [ "start" "A" copy R to "end" (print R) to end ]
1
== true
>> parse doc1 [ "start" "B" copy R to "end" (print R) to end ]
== false | |
I thought it would recheck if I put it into something like SOME [ ], but it doesn't: parse doc1 [ SOME [ "start" "B" copy R to "end" (print R) to end ] ] | |
kib2 14-Feb-2009 [3512] | Maybe ? parse/all doc1 [ thru "B" copy number to "end" (print number) ] But I'm beginning with parse, so I'm not an expert |
Janko 14-Feb-2009 [3513x2] | This would work in this case, but I need to get "2" only if the sequence before it is exactly "start" "B" XX "end" ... there can be a "B" in other places of the string and it mustn't take that. (I am used to using thru and to as well, but I mustn't use them in this case, for this reason: it might just skip to some other "B".) |
>> doc1: "start A 1 end xyz B 2 end" ;; in this case it must not take 2
== "start A 1 end xyz B 2 end"
>> parse doc1 [ "start" thru "B" copy R to "end" (print R) to end ] ;; but it will, that's why I can't use to/thru
2
== true | |
Anton 14-Feb-2009 [3515] | some ["start" ["A" | "B"] copy R to "end" "end"] |
Janko 14-Feb-2009 [3516] | oops ... my example above is wrong .. just a sec |
Anton 14-Feb-2009 [3517] | no, hang on... |
Janko 14-Feb-2009 [3518x2] | Anton, this would probably return 1? |
(this is the right example .. I forgot to use thru above, so the second wouldn't pass anyway... but the result is the same)
>> doc1: "start A 1 end start B 2 end"
== "start A 1 end start B 2 end"
>> parse doc1 [ thru "start" "A" copy R to "end" (print R) to end ]
1
== true
>> parse doc1 [ thru "start" "B" copy R to "end" (print R) to end ]
== false
>> parse doc1 [ SOME [ thru "start" "B" copy R to "end" (print R) to end ] ]
== false | |
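A note on why the last two attempts fail: `thru "start"` stops just past the first "start", the following "B" does not match "A", and PARSE does not go back and retry `thru` from a later position, so the whole rule (and the SOME around it) fails. One common idiom is to try the complete pattern at every position and fall back to `skip`. The following is a minimal sketch, assuming REBOL 2 semantics; `whsp` and `result` are names chosen here for illustration and are not part of the original examples.

```rebol
doc1: "start A 1 end start B 2 end"
whsp: charset " ^-^/"   ; space, tab, newline (illustrative name)
result: none

parse/all doc1 [
    any [
        ; try the complete pattern at the current position...
        "start" some whsp "B" some whsp copy result to "end" "end"
        ; ...otherwise advance one character and try again
        | skip
    ]
]

; the copied text still carries the whitespace before "end", so trim it
if result [print trim result]
```

With the input "start A 1 end xyz B 2 end" the first alternative never matches, so `result` stays none, which is the behaviour asked for above.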
Anton 14-Feb-2009 [3520] | Is there anything expected between "start" and "A", for instance ? |
Janko 14-Feb-2009 [3521] | I know how to solve this by making it less robust (in this case relying on there being only one space in between), but this doesn't solve my problem well:
>> parse doc1 [ thru "start B" copy R to "end" (print R) to end ]
2
== true |
Anton 14-Feb-2009 [3522] | No need for that. |
Janko 14-Feb-2009 [3523] | 1 or more spaces (to your question) |
Anton 14-Feb-2009 [3524] | parse doc1 [some [thru "start" ["A" | "B"] copy R to "end" (?? R) "end"]] |
Janko 14-Feb-2009 [3525] | hm.. just a sec so I try few things |
Anton 14-Feb-2009 [3526] | PARSE without the /ALL refinement handles any amount of whitespace. (You will probably end up using parse/all, though. I usually do when parsing HTML.) |
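The difference the /ALL refinement makes can be seen in a small sketch, assuming REBOL 2 (the version this discussion is about). The input string here is made up for illustration.

```rebol
; Without /all, whitespace between matched strings is skipped automatically.
print parse     "start   B" ["start" "B"]           ; true in REBOL 2

; With /all, every character counts, so the spaces must be matched explicitly.
print parse/all "start   B" ["start" "B"]           ; false
print parse/all "start   B" ["start" some " " "B"]  ; true
```

The explicit form is more verbose, but it makes clear exactly where whitespace is allowed, which tends to matter when parsing HTML.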
Janko 14-Feb-2009 [3527] | I thought your solution wouldn't work if I reversed the order of A and B in the string, but it seems it does. I would need to know which one is A and which is B, but I think this can be solved by setting some word in ( ) inside [ A | B ] ... so basically it seems to work... I think I can apply this approach to my concrete problem too, which is this |
kib2 14-Feb-2009 [3528] | I don't understand why not simply : parse/all doc1 [ thru "start B" copy number to "end" (print number) ] |
Anton 14-Feb-2009 [3529x2] | You leave the pointer at beginning of "end" in the doc1 string. Look at my example, I move TO "end", then I also consume "end". |
... to "end" (?? R) "end"] | |
Janko 14-Feb-2009 [3531] | kib2: because I don't know how many spaces are between start and B .. and in my concrete case I need to have multiple rules.. I will show a concrete example |
Anton 14-Feb-2009 [3532] | The second one actually consumes the "end", moving the pointer (the current parse index) through it. |
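The difference between stopping at "end" and consuming it can be made visible by saving the parse position with a set-word. This is a small sketch assuming REBOL 2; `r` and `pos` are illustrative names, and the spaces are written into the strings because /ALL is used.

```rebol
doc1: "start A 1 end start B 2 end"

; Stop AT "end": the saved position still begins with "end".
parse/all doc1 [thru "start " "A " copy r to "end" pos: to end]
probe pos   ; "end start B 2 end"

; Move TO "end", then consume it: the saved position is just past it.
parse/all doc1 [thru "start " "A " copy r to "end" "end" pos: to end]
probe pos   ; " start B 2 end"
```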
kib2 14-Feb-2009 [3533] | Anton: Janko just said he wanted to extract the "2", so I don't care where the pointer is, no? |
Anton 14-Feb-2009 [3534] | Mmm.. probably true, but better to be neat and tidy with rules, then they can be reused in slightly different ways and still work as expected. |
Janko 14-Feb-2009 [3535] | kib... because in the concrete case I think I need *complex rules*, not just 1 string, for it to work .. it has to work on all sorts of pages written by anyone.. you will see once I show you the real example .. right now |
Anton 14-Feb-2009 [3536] | First define whitespace: whsp: charset " ^-^/" ; Whitespace: space, tab, newline. |
kib2 14-Feb-2009 [3537] | Ok, and if we define any space like this:
space: charset " ^-"
parse/all doc1 [ thru "start" any space thru "B" copy number to "end" (print number) to end ] |
Janko 14-Feb-2009 [3538x2] | (I need to parse the meta tags description, keywords and abstract if they exist. They can come in any order, there can be one or multiple spaces/newlines/tabs between tag arguments, and either " or ' can be used, as in argument="asdasd".)
>> doc2: {<head>
{ <title>Dragonicum.com - making the right business connections !</title>
{ <meta name="keywords" content="Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade S
{ hows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade even
{ ts, china export, china manufacturer" />
{ <meta name="description" content="New international trade portal and company directory for Asia, Europe
{ and North America. Our priority No.1 is to create and maintain a safe, well lit business-to-business m
{ arketplace, by assisting our members in identifying new trustworthy business partners!" />
{ <link rel="stylesheet" href="style/blue_main.css" type="text/css" />}
== {<head> <title>Dragonicum.com - making the right business connections !</title> <meta name="keywords" content="Company Directory...
>> T: "" parse doc [ thru "<meta" "name=" skip "keywords" skip "content=" m: skip (m1: first m ) copy T to m1 to end ] print T
Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer
>> T: "" parse doc [ thru "<meta" "name=" skip "description" skip "content=" m: skip (m1: first m ) copy T to m1 to end ] print T

>>
(As you see, because keywords come first it works for them, but it doesn't for description; they can be in a different order in another document, etc.) |
I can't just use {<meta name="keywords" content="} as the rule, because that would work only on pages that use exactly one space and " | |
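One possible way to meet these requirements is to try the full meta-tag pattern at each position and skip a character when it does not match, so the order of the tags stops mattering. The sketch below assumes REBOL 2 and assumes that `name` comes before `content` inside the tag; `extract-meta`, `whsp`, `quote-char` and the shortened `sample` document are hypothetical names and data made up for illustration, not anything from the original page.

```rebol
whsp:       charset " ^-^/"   ; space, tab, newline
quote-char: charset {"'}      ; either quoting style

; Return the content of the meta tag called NAME, or none if it is absent.
extract-meta: func [html [string!] name [string!] /local value q] [
    value: none
    parse/all html [
        any [
            "<meta" some whsp
            "name=" copy q quote-char name q some whsp
            ; remember the opening quote, then copy up to the matching one
            "content=" copy q quote-char copy value to q
            to end          ; found it, stop scanning
            | skip          ; no match here, move on by one character
        ]
    ]
    value
]

sample: {<head>
<meta name="keywords" content="trade portal, trade leads" />
<meta   name='description'
    content='New international trade portal' />
</head>}

print extract-meta sample "keywords"
print extract-meta sample "description"
```

Because the quote character is captured into `q` and then reused as a rule, a value quoted with " may contain ' and vice versa. Tags that put `content` before `name` would need a second alternative.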