
[REBOL] Re: How to properly parse HTML and XHTML Meta Tags

From: Tom:Conlin:gm:ail at: 11-Sep-2008 23:32

vonja-sbcglobal.net wrote:
> Hello Rebol Group,
>
> I'm a bit new, I have a couple of the Rebol books and have gone
> over the different tutorial a few times but I'm having trouble with
> the following code of mine.
>
> For example:
> I'm attempting to parse the meta tags but the tag can end in either
> ">" or "/>"
>
> I've tried to write the below script a different way, over 50 times,
> but to no avail. I don't know how to properly code it where it will
> check for either ending tag ">" or "/>"
>
> sample meta tag:
> <meta name="description" content="Having trouble with this below script" />
>
> The end result should look like:
> "Having trouble with this below script"
> -not-
> "Having trouble with this below script" /
>
> If I change the script from ">" to "/>" and the meta tag is
> <meta name="description" content="Having trouble with this below script">
>
> Then the script will not catch the ">" since it's looking for "/>"
>
> REBOL CODE:
> page: read http://www.rebol.com ; webpage to be parsed
> title: [] description: [] keywords: []
> parse page [ thru <title> copy title to </title>]
> parse page [ thru "<meta name=^"keywords^" content=" copy keywords to ">" ]
> print title
> print description
> print keywords
>
> Thank you in advance for your assistance.
>
> Regards,
> Von
Hi Von, welcome.

note 1: when you initialize words with empty strings or blocks, you *do*
want to copy the empty string or block (otherwise they can be the *same*
empty block or string):

    title: copy ""
    description: copy []
    keywords: copy []

note 2: when using parse for more than simple string splitting, get used
to using the /all refinement and handling white space yourself.

You could define a class of chars that are not "/>" and then copy some of
them. The downside is that you would have to check whether a "/" you ran
into was followed by ">" and, if not, concatenate and continue.

This code is untested and un-run:

    tag-end: charset "/>"
    content: complement tag-end
    ...
    parse/all page [
        ...
        thru "<meta name=^"keywords^" content="
        some [
            copy token some content
            here:                       ;;; make a pointer to where parse is
            (
                append keywords token
                all [
                    #"/" == first :here
                    #">" <> second :here
                    append keywords "/"
                    here: next :here    ;;; move parse pointer over "/"
                ]
            )
            :here                       ;;; set where parse will resume
        ]
        thru ">"
        ...
    ]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

You could detect the closing angle bracket, see if the preceding char is
a slash, and if so remove it from the copied string.

note: this runs parse once, not multiple times; it uses braces for
strings that contain double quotes; and it takes the destination for the
copied content from the meta name=<dest>, i.e. the keywords or
description block...

    parse/all page [
        thru <head>
        some [
            thru {<META NAME="} copy dest to {"} {"}
            thru {content=} copy token to ">" here: thru ">"
            (
                ;; drop only the trailing slash; trim/with would strip every "/" in token
                if #"/" = first back :here [remove back tail token]
                append get to-word dest token
            )
        ]
        <title> copy title to </title> tag!
    ]
    print title
    print description
    print keywords

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

But ultimately I would probably start with

    blk: load/markup <source>

which would return a block of string! and tag! values, and then process
the tags. If I used parse I would end with a rule like

    [{<META NAME="} ... ["/>" | ">"]]

note: this won't work with page: read <source>, because there may be a
"/>" beyond the first ">" that closes the meta tag, but with load/markup
each tag and string element is isolated.

Hope that helps.
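
For what it's worth, the load/markup idea above might be sketched roughly as follows. This is an untested sketch for REBOL 2, not a definitive implementation. The sample page string, the metas block, and the words item, name and value are made up for illustration. It also assumes the name attribute comes before content and that both use double quotes. Because each tag is isolated and the rule copies only up to the closing quote, the ">" versus "/>" ending never has to be handled.

```rebol
REBOL [Title: "Meta tag sketch (hypothetical example)"]

; Sample input standing in for: page: read http://www.rebol.com
page: {<html><head><title>Sample</title>
<meta name="description" content="Having trouble with this below script" />
<meta name="keywords" content="rebol, parse">
</head><body></body></html>}

metas: copy []    ; collects name / content pairs

foreach item load/markup page [
    if tag? item [
        name: value: none
        ; form turns the tag into a plain string, angle brackets included
        parse/all form item [
            "<meta" thru {name="} copy name to {"}
            thru {content="} copy value to {"}
            to end
        ]
        ; keep the pair only when both attributes were found
        if all [name value] [repend metas [name value]]
    ]
]

foreach [name value] metas [print [name ":" value]]
```

Tags that are not meta tags, or meta tags missing either attribute, simply fail the rule and are skipped.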