using parse to remove tags from html string

[1/4] from: mgh520:yah:oo at: 19-Sep-2001 14:36

I picked this up again the other day, determined to solve it. It turns out I was really close, and debugging this has been helpful for understanding parsing, so I thought I'd post my results.

So here's a pretty simple and concise way to remove all tags from a string (such as removing all html tags):

    parse test [any [to "<" begin: thru ">" ending: (remove/part begin ending) :begin]]

The part I was missing before was the :begin at the end. What this does is still a bit hazy. Here is a quote from an example in the /core users guide (search for :mark to find it):

    "Notice the :mark word used above. It sets the input to a new position. The insert function returns the new position just past the insert of the current time. The word mark is used to set the input to that position."

So as I understand it, :begin in my example resets the parse position to begin, so that the next pass will find the first remaining instance of '<'. Without :begin, the parse resumes past that '<', never sees it, and ends up leaving the closing name tag, </Name>, in the string.

Here's a good way to display what is actually happening with this code:

    parse test [any [to "<" begin: thru ">" ending: (print ["begin..." begin] print ["ending..." ending] remove/part begin ending) :begin]]

This will show you what begin and ending are set to on each pass through the string.

The play by play:

1. to "<" -- moves to the first instance of the '<' character in the string
2. begin: -- sets begin to the series starting at that position
3. thru ">" -- moves to the first character *after* the next '>' character
4. ending: -- sets ending to the series starting at that position
5. (remove/part begin ending) -- removes from series 'begin everything up until the start of 'ending. Since begin is pointing at the actual 'test series, when we do the remove we are modifying test
6. :begin -- moves the input position back to begin (I attempted to explain this above. If anyone can better explain it, I'd be grateful).

And that's it!
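For anyone who wants to reuse the rule above, here is a minimal sketch that wraps it in a function. The name strip-tags is hypothetical (it is not a built-in). The sketch also makes two changes that are assumptions rather than part of the original post: it works on a copy so the caller's string is left alone, and it uses the REBOL 2 parse/all refinement so that whitespace in the input is not skipped.

```rebol
REBOL [Title: "strip-tags sketch"]

; strip-tags is a hypothetical helper name, not a built-in function.
strip-tags: func [
    "Return a copy of the string with every <...> span removed."
    html [string!]
    /local result begin ending
][
    result: copy html              ; work on a copy; the original post modified test in place
    parse/all result [
        any [
            to "<" begin:          ; remember where the tag starts
            thru ">" ending:       ; remember the position just past the tag
            (remove/part begin ending)
            :begin                 ; resume parsing where the removed tag used to start
        ]
    ]
    result
]

print strip-tags "<Name>Homer</Name>"
```

With the test string from this thread, this should print Homer. Dropping the :begin line reproduces the original problem, because the parse position is then left past the next '<' in the shortened string.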
I know it's not much, but I was so excited when I finally got this to work! Thanks for everyone's help, this is a great list. mike

[2/4] from: greggirwin:starband at: 20-Sep-2001 10:27

Thanks for posting that Mike! It may not seem like much but you can't judge the value of REBOL code by its volume. :) --Gregg

[3/4] from: mgh520:yaho:o at: 10-Aug-2001 12:36

Hopefully this is an easy one for someone out there. I'm trying to remove all tags from a string in the simplest way possible. The following example seems to be very close--it removes the first tag but doesn't continue.

>> test: "<Name>Homer</Name>"

== "<Name>Homer</Name>"

>> parse test [any [to "<" begin: thru ">" ending: (remove/part begin ending)] to end]

== true

>> print test

Homer</Name>

This next one prints all indexes of tags, and it works correctly, leading me to believe that the problem with the above example comes into play once you start to modify the original string.

>> test: "<Name>Homer</Name>"

== "<Name>Homer</Name>"

>> parse test [any [to "<" begin: thru ">" ending: (print index? begin print index? ending)] to end]

1
7
12
19
== true

Thanks for any suggestions you may have.

mike

p.s. Are my messages coming through with extra newlines? It looks truly awful on escribe.

[4/4] from: sterling:rebol at: 10-Aug-2001 11:26

Two choices:

1. Throw a SOME into your code and go from there...slightly modified because you don't seem to need the ANY.

    parse test [some [to "<" begin: thru ">" ending: (print [index? begin index? ending])]]

2. Use load/markup:

    >> data: load/markup test
    == [<Name> "Homer" </Name>]

Now just iterate over that block and pull out every string! type and ignore the tag! types.

Sterling
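The load/markup suggestion can be sketched roughly as follows. This is only an illustration of the iteration described above, not code from the original reply; the word text is a name picked for the example, and test is the string from earlier in the thread.

```rebol
REBOL [Title: "load/markup sketch"]

test: "<Name>Homer</Name>"
text: copy ""                      ; accumulator for the non-tag content

foreach item load/markup test [
    ; load/markup yields tag! values for markup and string! values for text;
    ; string? is false for tag! values, so the tags are skipped
    if string? item [append text item]
]

print text
```

For the test string this should print Homer. Unlike the parse approach, this leaves test unmodified and builds the result in a separate string.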