Mailing List Archive: 49091 messages

[REBOL] parse vs. load/markup (a bit long)

From: hallvard::ystad::helpinhand::com at: 24-Jan-2002 12:57

Hello folks,

I have a script that loads a given HTML page, parses it, and triggers different actions for the different tags it comes across. For the parsing, I have a parse rule that I invoke like this (those of you who don't want to read all the code can skip to the bottom to see some figures):

---code start---
parse/all site-content [
    copy txt to "<" (do_txt)
    some html-code
    copy txt to end (do_txt)
]
---code end---

Here's the 'html-code rule:

---code start---
html-code: [
    copy tag [
        "<script" thru "</script>" |
        ;"<script" thru "</script>" (append?: false) |
        ;["<style" whitespace thru "</style>"] (append?: false) |
        copy comm [ "<!--" thru "-->" ] |
        [ "<title>" copy title [to "</title"] thru ">" ] ( do_title append?: false) |
        copy base [ "<base" thru ">"] (do_base) |
        [ "<tr" thru ">"] (do_tr) |
        [ "<pop-from-stack" thru ">"] (pop-from-stack) |
        [ "</tr" thru ">"] (do_tr_end) |
        [ "<td" thru ">"] (do_td) |
        [ "</td" thru ">"] (do_td_end) |
        [ "<table" copy hei thru ">"] (
            table: join "<table" hei
            do_table
        ) |
        copy frame-tag insert-point: [ "<frame" whitespace thru ">" ] (
            ; fetch the frame source and splice it into the input after the tag
            insert-point: next find insert-point ">"
            frame-url: ex_att frame-tag "src"
            if frame-url <> "none" [
                to-fetch: to-url HURL/frame frame-url
                HURL/address_list to-fetch
                insert HURL/hstack to-url HURL/address
                frame-content: fetch to-fetch
                either string? frame-content [
                    trim/lines frame-content
                    insert insert-point join "<source " [to-fetch ">" frame-content "<pop-from-stack>"]
                ] [
                    ; if it is a block, then it is an error
                    print form frame-content
                ]
            ]
        ) |
        copy tag_a [ "<a" whitespace copy tab_h ar_tag] ( ) |
        [ "</table>"] (do_table_end) |
        ["<" thru ">"]
    ] (do_tag append?: true) |
    copy txt to "<" (do_txt)
] ; end html-code
---code end---

And 'html-code in turn invokes this:

---code start---
ar_tag: [
    [to "<" fmark: [
        ["</a" thru ">"] |
        "</td" (insert fmark " </a> " :fmark) ["</a" thru ">"] |
        "<a" (insert fmark " </a> " :fmark) ["</a" thru ">"] |
        "<t" (insert fmark " </a> " :fmark) ["</a" thru ">"] |
        thru "<" to "<" ar_tag
    ] ]
]
---code end---

'Ar_tag is just a mechanism to make sure <a> tags are properly closed.

Then it struck me that I should use the built-in mechanism load/markup instead. So I replaced the above rule with load/markup and made a switch:

---code start---
content-block: load/markup site-content
forall content-block [
    either tag? first content-block [
        tag: first content-block
        p: parse tag none
        switch first p [
            "script" [
                ; skip everything up to the closing </script> tag
                forever [
                    if (length? content-block) = 0 [ break ]
                    if (first content-block) = </script> [ break ]
                    content-block: next content-block
                ]
            ]
            "!--" [ comm: first content-block ]
            "title" [
                title: second content-block
                do_title
                append?: false
            ]
            "base" [ base: tag do_base ]
            "tr" [ do_tr ]
            "pop-from-stack" [ pop-from-stack ]
            "/tr" [ do_tr_end ]
            "td" [ do_td ]
            "/td" [ do_td_end ]
            "table" [ table: tag do_table ]
            "frame" [
                frame-url: ex_att tag "src"
                if frame-url <> "none" [
                    to-fetch: to-url HURL/frame frame-url
                    HURL/address_list to-fetch
                    insert HURL/hstack to-url HURL/address
                    frame-content: fetch to-fetch
                    either string? frame-content [
                        trim/lines frame-content
                        insert content-block join "<source " [to-fetch ">" frame-content "<pop-from-stack>"]
                    ] [
                        ; if we receive a block, then it is an error
                        print form frame-content
                    ]
                ]
            ]
            "/table" [ do_table_end ]
            "a" [
                ; should rearrange structures here, I believe...
            ]
        ]
        do_tag
        append?: true
    ] [
        txt: first content-block
        do_txt
    ]
]
---code end---

Which of the two is faster? I made a little test:

---code start---
test-it: func [] [
    a: now/precise/time
    prin "Checking"
    loop 10 [
        prin "."
        process-html
    ]
    b: now/precise/time
    print join "^/Done in " (b - a)
]
---code end---

To get accurate results, I stored an HTML page on my hard disk, so that networking shouldn't be a factor. And here are the results:

1) The 'html-code parse rule:
>> test-it
Checking.......... Done in 0:00:02.444
>> test-it
Checking.......... Done in 0:00:02.404
>> test-it
Checking.......... Done in 0:00:02.383
>> test-it
Checking.......... Done in 0:00:02.474
>> test-it
Checking.......... Done in 0:00:02.383
>> test-it
Checking.......... Done in 0:00:02.354
>> test-it
Checking.......... Done in 0:00:02.354
>> test-it
Checking.......... Done in 0:00:02.333
>> test-it
Checking.......... Done in 0:00:02.363
>> test-it
Checking.......... Done in 0:00:02.344
>>
2) Using load/markup and a switch:
>> test-it
Checking.......... Done in 0:00:03.205
>> test-it
Checking.......... Done in 0:00:03.195
>> test-it
Checking.......... Done in 0:00:03.194
>> test-it
Checking.......... Done in 0:00:03.225
>> test-it
Checking.......... Done in 0:00:03.205
>> test-it
Checking.......... Done in 0:00:03.204
>> test-it
Checking.......... Done in 0:00:03.204
>> test-it
Checking.......... Done in 0:00:03.244
>> test-it
Checking.......... Done in 0:00:03.225
>> test-it
Checking.......... Done in 0:00:03.244
>>
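To compare the two series at a glance, the timings can be averaged. The following is only a minimal sketch: `mean-time`, `parse-times` and `markup-times` are hypothetical names introduced here for illustration, and the values are copied from the two console sessions above.

```rebol
REBOL [Title: "Mean of the test-it timings (illustrative sketch)"]

; mean-time is a hypothetical helper, not part of the original script.
mean-time: func [
    "Return the arithmetic mean of a block of time! values"
    times [block!]
    /local total
][
    total: 0:00
    foreach t times [total: total + t]
    total / length? times
]

; Timings copied from the two console sessions above.
parse-times: [
    0:00:02.444 0:00:02.404 0:00:02.383 0:00:02.474 0:00:02.383
    0:00:02.354 0:00:02.354 0:00:02.333 0:00:02.363 0:00:02.344
]
markup-times: [
    0:00:03.205 0:00:03.195 0:00:03.194 0:00:03.225 0:00:03.205
    0:00:03.204 0:00:03.204 0:00:03.244 0:00:03.225 0:00:03.244
]

print ["parse rule: " mean-time parse-times]
print ["load/markup:" mean-time markup-times]
```

Worked by hand, the means come to about 2.38 seconds for the parse rule and about 3.21 seconds for load/markup, each for ten calls to process-html.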
So, even though the load/markup version potentially triggers fewer actions (e.g. it doesn't clean up messy <a> tags), it is slower: roughly 3.2 seconds per ten runs against roughly 2.4 seconds for the parse rule, or about 35% more time. Does this surprise anyone? I'm not surprised myself, since parse is native and my rule is probably less thorough in its HTML parsing than load/markup is. This is not really a problem; I'll go back to the 'html-code parse rule. I just thought I'd share this little comparison with the list.

~H
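A note on the timing itself: subtracting two now/precise/time values goes wrong if a run happens to cross midnight, because the date part is dropped. A minimal sketch of an alternative, assuming a hypothetical helper named `time-it` (not part of the original script), keeps the full date and lets `difference` return the elapsed time:

```rebol
REBOL [Title: "Timing helper (illustrative sketch)"]

; time-it is a hypothetical helper introduced here for illustration.
time-it: func [
    "Evaluate BODY N times and return the elapsed time as a time! value"
    n [integer!]
    body [block!]
    /local start
][
    start: now/precise              ; full date and time, not just the time part
    loop n body
    difference now/precise start    ; date minus date, returned as a time!
]

; Stand-in workload so the sketch runs on its own.
print ["Done in" time-it 10 [wait 0.01]]
```

With the script from this post loaded, the stand-in workload would be replaced by `[process-html]`, e.g. `time-it 10 [process-html]`.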