Mailing List Archive: 49091 messages

ANN: SiteCrawl

 [1/4] from: ryan::christiansen::intellisol::com at: 24-Jul-2001 11:43


Following is the function 'SiteCrawl and its dependent functions 'linkURL
and 'pageLinks. 'SiteCrawl will crawl an entire web site and gather the URL
of each page on the site into a block!.

USAGE:

    rebol-pages: copy []
    SiteCrawl http://www.rebol.com rebol-pages

I need feedback on this. Do you have a small site where you can test
'SiteCrawl for me? Thanks.

-Ryan

pageLinks: func [
    "Return a block of links from an HTML page"
    page [string!] "The code for an HTML page as a string"
    /local links
][
    links: copy []
    parse page [any [thru {<A HREF="} copy text to {"} (append links text)]]
    return links
]

linkURL: func [
    "Create a complete url for a link based on a link's relativity to the URL of the HTML page where the link appears"
    link "The url of a link parsed from an HTML page"
    page-url "The url for the HTML page where the link appears"
    /local protocol domain path-branch
][
    link: make string! link
    page-url: make string! page-url
    parse page-url [copy text thru "://" (protocol: copy text)]
    parse page-url [thru "://" copy text [to "/" | to end] (domain: copy text)]
    path-branch: parse/all page-url "/"
    either find link "://" [
        return link
    ][
        either link/1 = #"/" [
            insert link (rejoin [protocol domain])
            return link
        ][
            either (last path-branch) = domain [
                insert link (rejoin [page-url "/"])
                return link
            ][
                replace page-url (last path-branch) ""
                insert link page-url
                return link
            ]
        ]
    ]
]

SiteCrawl: func [
    "Crawl an entire web site"
    site [url!] "The url of the site to crawl"
    pages [block!] "A block where all found pages will be gathered"
    /local page links link target find-page site-domain target-domain
][
    page: read site
    site: make string! site
    links: pageLinks page
    foreach link links [
        target: linkURL link site
        either find target "mailto:" [
            ; nothing
        ][
            parse site [thru "://" copy text [to "/" | to end] (site-domain: copy text)]
            parse target [thru "://" copy text [to "/" | to end] (target-domain: copy text)]
            either site-domain == target-domain [
                find-page: select pages target
                if (find-page == none) [
                    append/only pages target
                    if error? try [SiteCrawl make url! target][links: next links]
                ]
            ][
                ; nothing
            ]
        ]
    ]
    return pages
]

Ryan C. Christiansen
Web Developer
Intellisol International
4733 Amber Valley Parkway
Fargo, ND 58104
701-235-3390 ext. 6671
FAX: 701-235-9940
http://www.intellisol.com

Global Leader in People Performance Software
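
As a rough illustration of the three cases 'linkURL distinguishes (absolute,
root-relative, and relative links), here is a hypothetical session against an
imaginary page at http://www.example.com/docs/index.html. These calls and
results are traced from the code above, not taken from the original post:

    linkURL "http://other.com/page.html" http://www.example.com/docs/index.html
    ;== "http://other.com/page.html"  (absolute link returned unchanged)

    linkURL "/news.html" http://www.example.com/docs/index.html
    ;== "http://www.example.com/news.html"  (protocol and domain prepended)

    linkURL "faq.html" http://www.example.com/docs/index.html
    ;== "http://www.example.com/docs/faq.html"  (last path segment replaced)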

 [2/4] from: ryan:christiansen:intellisol at: 24-Jul-2001 12:23


I forgot the second argument when SiteCrawl is called within itself. Here is
the updated function.

SiteCrawl: func [
    "Crawl an entire web site"
    site [url!] "The url of the site to crawl"
    pages [block!] "A block where all found pages will be gathered"
    /local page links link target find-page site-domain target-domain
][
    page: read site
    site: make string! site
    links: pageLinks page
    foreach link links [
        target: linkURL link site
        either find target "mailto:" [
            ; nothing
        ][
            parse site [thru "://" copy text [to "/" | to end] (site-domain: copy text)]
            parse target [thru "://" copy text [to "/" | to end] (target-domain: copy text)]
            either site-domain == target-domain [
                find-page: select pages target
                if (find-page == none) [
                    append/only pages target
                    if error? try [SiteCrawl make url! target pages][links: next links]
                ]
            ][
                ; nothing
            ]
        ]
    ]
    return pages
]
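
A side note on why this one-word fix is enough: REBOL blocks are passed by
reference, so handing the same 'pages block to every recursive call makes all
levels of the recursion append into one shared accumulator. A minimal sketch
of the pattern (hypothetical names, not from the original post):

    collect-countdown: func [
        "Append n, n-1, ... 1 to the given block"
        n [integer!]
        acc [block!]
    ][
        append acc n
        if n > 1 [collect-countdown n - 1 acc]
        acc
    ]

    probe collect-countdown 3 copy []
    ;== [3 2 1]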

 [3/4] from: arolls:bigpond:au at: 25-Jul-2001 4:34


> rebol-pages: copy []
> SiteCrawl http://www.rebol.com rebol-pages
>
> I need feedback on this. Do you have a small site where you can test
> 'SiteCrawl for me?
> -Ryan
My site is fairly small, you can check it out easily:

    http://users.bigpond.net.au/datababies/anton/index.html

I think not all links are written with surrounding quotes, as assumed by your
pageLinks function. Maybe it's not the official way, but IE lets this through:

    <a href=http://antonrolls.net>mysite</a>

Also, it doesn't catch a link such as this one (found in my site), where no
index.html file is specified:

    <a href="TechSupport/">Tech Support</a><br>

I think it should look for:

    TechSupport/index.htm(l)
    TechSupport/default.htm(l)

It's interesting: if you do

    trace/net on
    read http://users.bigpond.net.au/datababies/anton/TechSupport
    trace/net off

you can see it tries first to find the file "TechSupport", then it tries to
get the directory "TechSupport/".

In your SiteCrawl function, where it is written:

    if error? try [...][
        links: next links
    ]

it seems as if you are relying on the error to occur. An error occurs for all
of the links in my site. And why do you write

    links: next links

? Surely the next link will come along in the next iteration of the foreach
loop. I suggest just doing nothing: [].

Regards,
Anton.
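
For what it's worth, a rough sketch of a pageLinks variant that accepts both
quoted and unquoted HREF values. This is a hypothetical alternative, not
Ryan's posted code, and an unquoted value followed by further attributes
would still be captured with those attributes attached:

    pageLinks: func [
        "Return a block of links from an HTML page (quoted or unquoted HREFs)"
        page [string!] "The code for an HTML page as a string"
        /local links text
    ][
        links: copy []
        parse page [
            any [
                thru {href=} [
                    {"} copy text to {"}  ; quoted value: <a href="...">
                    | copy text to ">"    ; unquoted value: <a href=...>
                ]
                (append links text)
            ]
        ]
        links
    ]

Since string parsing in REBOL is case-insensitive by default, the same rule
matches both href= and HREF=.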

 [4/4] from: arolls:bigpond:au at: 25-Jul-2001 4:38


Just the addition of one word: 'pages? It works better, but there is still a
problem if you try my site again.

Anton.
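
A guess at one remaining problem, offered only as a sketch since the thread
does not say what Anton ran into: 'select on a flat block returns the value
*after* a match, so when the matching URL is the last element of 'pages,
select returns none and that page gets crawled again. A membership test with
'find avoids this. A hypothetical replacement for the duplicate check inside
SiteCrawl, also folding in Anton's suggestion to do nothing on error:

    if not find pages target [
        append/only pages target
        if error? try [SiteCrawl make url! target pages][]  ; do nothing on error
    ]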