World: r3wp
[Parse] Discussion of PARSE dialect
Anton 14-Feb-2009 [3530] | ... to "end" (?? R) "end"] |
Janko 14-Feb-2009 [3531] | kib2: because I don't know how many spaces are between start and B .. and in my concrete case I need to have multiple rules.. I will show a concrete example |
Anton 14-Feb-2009 [3532] | The second one actually consumes the "end", moving the pointer (the current parse index) through it. |
kib2 14-Feb-2009 [3533] | Anton: Janko just said he wanted to extract the "2", so I don't care where the pointer is, no ? |
Anton 14-Feb-2009 [3534] | Mmm.. probably true, but better to be neat and tidy with rules, then they can be reused in slightly different ways and still work as expected. |
Janko 14-Feb-2009 [3535] | kib... because in my concrete case I think I need *complex rules*, not just 1 string, for it to work .. it has to work on all sorts of pages written by anyone.. you will see once I show you a real example .. right now |
Anton 14-Feb-2009 [3536] | First define whitespace: whsp: charset " ^-^/" ; Whitespace: space, tab, newline. |
kib2 14-Feb-2009 [3537] | Ok, and if we define any space like this : space: charset " ^-" parse/all doc1 [ thru "start" any space thru "B" copy number to "end" (print number) to end] |
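A minimal sketch of the two ideas above combined, assuming REBOL 2 and a made-up doc1 (the real doc1 is not shown in this log). With parse/all, whitespace is not skipped automatically, so the rule has to consume it explicitly with the whsp charset. This version matches "B" directly rather than with thru, so that any whsp is what bridges the gap:

```rebol
REBOL []

whsp: charset " ^-^/"            ; space, tab, newline
doc1: "xx start   ^-B 2 end yy"  ; hypothetical input, not from the log

number: none
parse/all doc1 [
    thru "start"
    any whsp "B" any whsp        ; any amount of whitespace around "B"
    copy number to "end"         ; copies up to, but not through, "end"
    to end
]
probe trim number                ; trim drops the space left before "end"
```

Because to "end" stops in front of "end", the copied text still carries the whitespace that preceded it, hence the trim.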
Janko 14-Feb-2009 [3538x2] | (I need to parse the meta tags description, keywords and abstract if they exist. They can come in any order, there can be one or more spaces/newlines/tabs between tag arguments, and either " or ' can be used, as in argument="asdasd".)
>> doc2: {<head>
<title>Dragonicum.com - making the right business connections !</title>
<meta name="keywords" content="Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer" />
<meta name="description" content="New international trade portal and company directory for Asia, Europe and North America. Our priority No.1 is to create and maintain a safe, well lit business-to-business marketplace, by assisting our members in identifying new trustworthy business partners!" />
<link rel="stylesheet" href="style/blue_main.css" type="text/css" />}
== {<head>
<title>Dragonicum.com - making the right business connections !</title>
<meta name="keywords" content="Company Directory...
>> T: "" parse doc [ thru "<meta" "name=" skip "keywords" skip "content=" m: skip (m1: first m ) copy T to m1 to end ] print T
Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer
>> T: "" parse doc [ thru "<meta" "name=" skip "description" skip "content=" m: skip (m1: first m ) copy T to m1 to end ] print T
>>
(As you see, because keywords come first it works for them, but it doesn't for description; they can be in a different order in other documents, etc.) |
I can't just use {<meta name="keywords" content="} as a rule because that would work only on pages that use exactly one space and " | |
Anton 14-Feb-2009 [3540] | Yes, I know this problem. |
Janko 14-Feb-2009 [3541] | I have been banging my head against it for half a day :) .. now at least I know what exactly the problem is.. why it happens.. at first I had no clue.. but I still have no idea how to solve it |
Anton 14-Feb-2009 [3542] | I have solved similar parse job... |
Janko 14-Feb-2009 [3543x2] | maybe your solution for A | B would work.. I will try |
ha, yes it works .. brilliant! | |
Anton 14-Feb-2009 [3545] | it does ? |
Janko 14-Feb-2009 [3546x4] | yes :) thanks a lot! |
>> T: K: D: "" parse doc [ SOME [ thru "<meta" "name=" skip [ "description" (V: 'D) | "keywords" (V: 'K)] skip "content=" m: skip (m1: first m ) copy T to m1 (set V T) ] to end ] ?? K ?? D
K: {Company Directory, Join Us, Advanced Search, Trade Leads, Forum, Trade Shows, Advertising, Translation, fair trade, trade portal, business to business, trade leads, trade events, china export, china manufacturer}
D: {New international trade portal and company directory for Asia, Europe and North America. Our priority No.1 is to create and maintain a safe, well lit business-to-business marketplace, by assisting our members in identifying new trustworthy business partners!}
== {New international trade portal and company directory for Asia, Europe and North America. Our priority No.1 is to create and mai...
>> | |
it is also not dependent on the order of things, and I still have to figure out why that is .. it works no matter which one comes before the other | |
I intended to make a blogpost, "REBOL parse challenge", present this problem and ask if people can provide more elegant solutions in other languages (in a similar vein to the "arc challenge") ... now that it seems an even harder nut to crack I should probably really do it .. does anyone think this would be easy to solve using a conventional language? (I think not) | |
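Why the order does not matter, as far as I can tell: each pass of the SOME loop jumps to the next "<meta", the alternation [ "description" ... | "keywords" ... ] accepts whichever name is found there, and V remembers which word to set. Here is a sketch of the same idea, assuming REBOL 2. The sample doc, the whsp and quote-char charsets and the meta-rule name are made up for illustration. It uses parse/all with explicit whitespace, and any [ meta-rule | skip ] so that a meta tag with some other name is stepped over (in the SOME version such a tag makes the block fail, which stops the loop).

```rebol
REBOL []

; Hypothetical sample: mixed quotes, extra whitespace, description first,
; plus one meta tag the rule is not interested in.
doc: {<head>
<meta name="robots" content="index" />
<meta name='description'   content="A trade portal" />
<meta name="keywords"
      content='trade, export' />
</head>}

whsp: charset " ^-^/"          ; space, tab, newline
quote-char: charset {"'}       ; either kind of quote

K: D: none

meta-rule: [
    "<meta" some whsp "name="
    mark: quote-char (q1: first mark)              ; remember the opening quote
    [ "description" (V: 'D) | "keywords" (V: 'K) ] ; whichever name is here
    q1 some whsp "content="
    mark: quote-char (m1: first mark)
    copy T to m1                                   ; value up to the matching quote
    (set V T)
]

; Try the rule at every position; on failure move one character forward.
parse/all doc [ any [ meta-rule | skip ] ]

?? K
?? D
```

The alternation is what removes the dependence on order: nothing in the rule says which tag has to come first.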
Anton 14-Feb-2009 [3550] | I'm sure there are some elegant solutions in other languages too. |
Janko 14-Feb-2009 [3551] | hm.. would this be nicely solvable with a regex? .. I think it would be quite a pain using regular string functions like strpos, substr etc., given the same requirements (one or more spaces/tabs/newlines, " or ', undefined order) |
Anton 14-Feb-2009 [3552] | I don't know - I only learn regex when I have to .. then a short time later I forget. |
Janko 14-Feb-2009 [3553] | yes, me also |
Anton 14-Feb-2009 [3554] | perl could do it pretty quick, I'm sure. |
Janko 14-Feb-2009 [3555x4] | a perl pro would certainly use regex (that is the initial home of it) :) ... I think parse and regex are each best for different problems, I am just not sure if this one is better solved with one or the other |
regex I imagine sucks at structured stuff, where you have to make some sort of state machine; for example I don't think regex can parse xml well ... state machines are excellent at that but they do require more code than parse would | |
I will see with the "parse challenge" .. if I wanted to be really *sneaky* I could ask the perl community if anyone can solve this .. and if their solution sucked more than rebol's, then make the blogpost :) | |
but I am not like that ;) | |
Anton 14-Feb-2009 [3559x2] | Yeah, I'm not really sure what that would prove. :) |
What would you build a state machine with, which would generate so much code ? | |
Janko 14-Feb-2009 [3561] | I don't fully understand your question? |
Anton 14-Feb-2009 [3562x2] | You say "state machines ... require more code". What code ? Obviously, you can build a state machine in any language, but I guess I'm wondering what ... ohh... I'm so tired after all those cheese sandwiches.... |
Anyway, I think I understand what you're saying. A state machine is big and clunky, expressing everything you don't want to hear about, while parse allows you to express your target more directly, cutting through anything you don't want without having to specify it. | |
Janko 14-Feb-2009 [3564] | I don't know the exact term for this but I have built many parsers for things like xml, wiki text and some other custom things in various lower level languages using a simple state machine (at least that's what I called it)... To my understanding you can parse anything with something like that, including structured nested data, but it of course takes some more coding than this rebol solution... what I mean by a state machine is a loop that accepts characters or words and has a predefined number of states and code for what to do at each state and when to switch to another state etc.. |
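For comparison, a sketch of the kind of hand-written state machine described here: a loop over characters, a fixed set of states, and code deciding when to switch. It is written in REBOL 2 only to stay in one language. The function extract-quoted and its state names are invented for this illustration, and it does just one small job, pulling out the first quoted value from a string.

```rebol
REBOL []

; Hand-written state machine: return the first value quoted with " or '
; in TEXT, or none if no complete quoted value is found.
extract-quoted: func [text [string!] /local state delim result] [
    state: 'outside
    result: copy ""
    foreach ch text [
        switch state [
            outside [
                ; wait for an opening quote and remember which kind it is
                if find {"'} ch [delim: ch  state: 'inside]
            ]
            inside [
                ; collect characters until the same quote shows up again
                either ch = delim [state: 'done] [append result ch]
            ]
        ]
        ; once state is 'done no case matches, so the rest is ignored
    ]
    either state = 'done [result] [none]
]

probe extract-quoted {content='fair trade, "quoted" words' /}
```

This is roughly the bookkeeping that the parse fragment m: skip (m1: first m) copy T to m1 expresses in a single line.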
Anton 14-Feb-2009 [3565] | Right, yes. We agree. |
Janko 14-Feb-2009 [3566] | ok :) |
Anton 14-Feb-2009 [3567] | What is the next problem ? |
Janko 14-Feb-2009 [3568] | that was the big stopper that you just solved for me.. there are no other problems for now .. just the willingness to type in all the code :) .. |
Anton 14-Feb-2009 [3569x2] | I know what it could be - eg: <img src=afile.jpg> <img src="afile.jpg"> <img src='afile.jpg'> |
The first one without any quotes causes a little bit of a problem (solvable). | |
Janko 14-Feb-2009 [3571x2] | maybe you can make OPT [ " | ' ] ? |
copy to [ " | > | ' ] ? | |
Anton 14-Feb-2009 [3573] | You have to use a variable to store which one was used, then parse until that character is encountered again. |
Janko 14-Feb-2009 [3574x2] | yes, that's how I did it |
>> "content=" m: skip (m1: first m ) copy T to m1<< | |
Anton 14-Feb-2009 [3576] | So you did. |
Janko 14-Feb-2009 [3577] | in meta tags example |
Anton 14-Feb-2009 [3578] | But when no quotes are used, it gets tricky, eg: <img src= afile.jpg width=10> |
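One possible way to cover all of these cases, sketched for REBOL 2: try the quoted form first (remember the quote character and copy up to it), and otherwise copy a run of characters that cannot end an unquoted value. The bare-char set and the attr-value rule are assumptions made for this sketch, not rules from the discussion.

```rebol
REBOL []

whsp: charset " ^-^/"                        ; space, tab, newline
quote-char: charset {"'}                     ; either kind of quote
bare-char: complement charset { ^-^/>"'}     ; allowed in an unquoted value

value: none
attr-value: [
    any whsp
    [
        ; quoted: remember which quote opened, copy up to the same one
        mark: quote-char (q1: first mark) copy value to q1 skip
        ; unquoted: copy until whitespace, ">" or a quote
        | copy value some bare-char
    ]
]

foreach sample [
    {<img src=afile.jpg>}
    {<img src="afile.jpg">}
    {<img src='afile.jpg'>}
    {<img src= afile.jpg width=10>}
][
    value: none
    parse/all sample [thru "src=" attr-value to end]
    print mold value
]
```

A value whose opening quote is never closed makes attr-value fail, so parse returns false and value stays none.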
Janko 14-Feb-2009 [3579] | what I have the biggest problem with (one that I thought was unsolvable - but I have to study your example to see why it works) is the order of things |