Mailing List Archive: Re: How to properly parse HTML and XHTML Meta Tags

[REBOL] Re: How to properly parse HTML and XHTML Meta Tags

From: Tom:Conlin:gm:ail at: 11-Sep-2008 23:32


vonja-sbcglobal.net wrote:
> Hello Rebol Group,
>
> I'm a bit new, I have a couple of the Rebol books and have gone
> over the different tutorial a few times but I'm having trouble with
> the following code of mine.
>
> For example:
> I'm attempting to parse the meta tags but the tag can end in either
> ">" or "/>"
>
> I've tried to write the below script a different way, over 50 times,
> but to no avail.  I don't know how to properly code it where it will
> check for either ending tag ">" or "/>"
>
> sample meta tag:
> <meta name="description" content="Having trouble with this below script" />
>
> The end result should look like:
> "Having trouble with this below script"
> -not-
> "Having trouble with this below script" /
>
> If I change the script from ">" to "/>" and the meta tag is
> <meta name="description" content="Having trouble with this below script">
>
> Then the script will not catch the ">" since it's looking for "/>"
>
> REBOL CODE:
> page: read http://www.rebol.com     ; webpage to be parsed
>     title: []   description: []   keywords: []
>     parse page [ thru <title> copy title to </title>]
>     parse page [ thru "<meta name=^"keywords^" content=" copy keywords to
> ">" ]
>     print title
>     print description
>     print keywords
>
> Thank you in advance for your assistance.
>
> Regards,
> Von
>

Hi Von welcome,

note 1: when you initialize words with empty strings or blocks
you *do* want to copy the empty string or block
(otherwise they can be the *same* empty block or string)

title: copy ""
description: copy []
keywords: copy []
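
a small sketch of what goes wrong without the copy. the function names
are made up for illustration. a literal block inside a function body is
one series that lives on between calls, so appending to it accumulates:

```rebol
REBOL [Title: "copy vs. shared literal (illustrative sketch)"]

; hypothetical names, not from the original post
collect-bad: func [x /local acc] [
    acc: []          ; the same literal block is reused on every call
    append acc x
]

collect-good: func [x /local acc] [
    acc: copy []     ; a fresh block on every call
    append acc x
]

collect-bad 1
probe collect-bad 2     ; [1 2]  the value from the first call is still there
collect-good 1
probe collect-good 2    ; [2]
```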

note 2: when using parse for more than simple string splitting get used
to using the /all refinement and handling white space yourself.
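
a quick illustration of the difference, assuming REBOL 2 (where plain
parse on a string skips white space between matches):

```rebol
REBOL []

; REBOL 2 behaviour: without /all, white space between matches is skipped
probe parse "a b" ["a" "b"]          ; true, the space is skipped for you
probe parse/all "a b" ["a" "b"]      ; false, no rule matches the space
probe parse/all "a b" ["a" " " "b"]  ; true, white space handled explicitly
```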

you could define a class of chars that are neither "/" nor ">" then copy
some of them. downside is you would have to check if a "/" you ran into
was followed by ">" and if not concatenate and continue.
this code is untested and un-run

tag-end: charset "/>"
content: complement tag-end
...
parse page [
    ...
    thru "<meta name=^"keywords^" content="
    some [
        copy token some content
        here:                       ;;; make a pointer to where parse is
        (
            append keywords token
            all [
                #"/" == first :here
                #">" <> second :here
                append keywords "/"
                here: next :here    ;;; move parse pointer over "/"
            ]
        )
        :here                       ;;; set where parse will resume
    ]
    thru ">"
    ...
]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

you could detect the closing angle and see if the preceding char is a
slash and if so remove it from the copied string.

note: this runs parse once, not multiple times,
uses braces for strings that contain double quotes,
and takes the destination for the copied content
from the meta name=<dest>, i.e. the keywords or description block...

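    parse/all page [
        thru <head>
        any [
            <title> copy title to </title>
          | {<meta name="}
            copy dest to {"} skip
            thru {content=}
            copy token to ">" here: thru ">"
            (
                ;; drop only the trailing "/" of an XHTML-style "/>"
                if #"/" = first back :here [remove back tail token]
                ;; only store into the words set up above
                if find ["description" "keywords"] dest [
                    append get to-word dest trim token
                ]
            )
          | skip
        ]
    ]
    print title
    print description
    print keywords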
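
to see the "look at the char before the closing angle" idea on its own,
here is a sketch run against both forms of the sample tag from the
question. meta-content is a hypothetical helper name, and it assumes the
content attribute comes last in the tag:

```rebol
REBOL []

; meta-content is a made-up helper for illustration, not from the original post
meta-content: func [tag-text [string!] /local token here] [
    token: none
    parse/all tag-text [
        thru {content=}
        copy token to ">" here: thru ">"
        ; if the char just before ">" is a slash, drop it from the copy
        (if all [token  #"/" = first back here] [remove back tail token])
    ]
    if token [trim token]   ; strip the space left in front of the slash
]

probe meta-content {<meta name="description" content="Having trouble with this below script" />}
probe meta-content {<meta name="description" content="Having trouble with this below script">}
```

both calls should give the same quoted content with no trailing slash.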

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

but ultimately I would probably start with

blk: load/markup <source>

which would return a block of string! and tag!
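
for example, on a small made-up snippet (the markup here is only a
stand-in for a real page):

```rebol
REBOL []

; sample markup invented for this sketch
blk: load/markup {<title>Demo</title><meta name="keywords" content="rebol, parse" />}

; show the datatype and value of each element
foreach item blk [print [type? item mold item]]
```

this block holds four elements: three tag! values and one string!
("Demo"), with the whole meta tag, trailing slash included, as a single
tag!.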

then process the tags; if I used parse I would end with
a rule like
[{<META NAME="} ...  ["/>" | ">"]]

note: this won't work with the
page: read <source>
because there may be a "/>" beyond the first ">" that closes the meta
tag but with load/markup  each tag and string element is isolated
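
a rough sketch of that last idea, under a few assumptions: the sample
markup is made up, each tag is turned back into a string with mold so
the angle brackets are part of what parse sees, and the content
attribute is assumed to come last in the tag:

```rebol
REBOL []

; sample markup standing in for the real page (invented for this sketch)
source: {<head><title>Demo</title><meta name="description" content="Having trouble" /><meta name="keywords" content="rebol, parse"></head>}

description: copy ""
keywords: copy ""

foreach item load/markup source [
    if tag? item [
        dest: token: none
        parse/all mold item [
            {<meta name="} copy dest to {"} skip
            thru {content=} start:
            some [
                ; the tag is isolated, so its ending must sit right at the end
                stop: ["/>" | ">"] end (token: trim copy/part start stop)
              | skip
            ]
        ]
        if token [
            if dest = "description" [description: token]
            if dest = "keywords" [keywords: token]
        ]
    ]
]

print description
print keywords
```

because each tag is parsed on its own, a "/>" further along in the page
can never be mistaken for the end of this one.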

hope that helps