Mailing List Archive: 49091 messages
 

[REBOL] Re: Download a whole website

From: oliva:david:seznam:cz at: 2-Aug-2002 14:54

Hello Abdel,

Thursday, July 25, 2002, 10:36:13 PM, you wrote:

AB> I know how to download a web page and save it to the disk, that is if I know
AB> the name of the page. But what if I want to download a whole website and save it
AB> to my hard drive?

I had a reb-bot for travelling the net and searching for images, but it is really old (born in 2000), did not save pages, and had some bugs inside, so I decided to make a new generation. Here is what I have now (excuse the function 'uprav-url; it's from the old bot and needs to be improved (and translated) as well).

What does it do? It simply parses the HTML and sorts the URLs it finds into blocks: images, stylesheets, and linked scripts. I will add applets and embedded objects as well.

There are two things that I want to discuss:

1. How to save the page from a URL like http://localhost/ :( It may be index.html, default.html, or whatever is specified on the server side. :(

2. How to encode the file names of dynamic documents such as:
http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz

And here is the script:

<code>
rebol [
    title: "Site downloader"
    purpose: {To download pages from some url with all content}
    author: {Oldes}
    email: [oliva--david--seznam--cz]
    comment: {This is not finished version... now it just parses the
        page and returns sorted types of urls. Need to make saving the
        content and recursion for traveling from one page to another}
    version: 0.0.1
]

page-url: to url! ask "start URL: "
;page: read/binary page-url
page-markup: load/markup page-url ;to-string page

purl: decode-url page-url
if none? purl/path [purl/path: "/"]
purl/port-id: either none? purl/port-id [""] [purl/port-id: join ":" purl/port-id]
base-href: rejoin [http:// purl/host purl/port-id purl/path]

images:      make block! 50
links:       make block! 500
scripts:     make block! 10
stylesheets: make block! 10

tag-rules: [
    "img" copy x thru {src=} copy url [to { } | to end] y: to end (
        tag-name: "img"
        url: uprav-url url
        if all [none? find images url not none? url] [insert images url]
    )
    | "link" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "link"
        url: uprav-url url
        if all [none? find stylesheets url not none? url] [insert stylesheets url]
    )
    | "script" copy x thru {src=} copy url [to { } | to end] y: to end (
        tag-name: "script"
        url: uprav-url url
        if all [none? find scripts url not none? url] [insert scripts url]
    )
    | "BASE" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "base"
        base-href: uprav-url url
        ;print rejoin ["new base-href: " base-href]
    )
    | "EMBED" copy x thru {src=} copy url [to { } | to end] y: to end (
        tag-name: "EMBED"
    )
    | "a" copy x thru {href=} copy url [to { } | to end] y: to end (
        tag-name: "a"
        url: uprav-url url
        if all [none? find links url not none? url] [insert links url]
    )
]

uprav-url: func [path [string!] /local u q w new-url] [
    path: trim/with path {"}
    if find path "javascript:" [return none]
    if find path "mailto:" [return path]
    either found? find path "://" [
        return path
    ] [
        either path/1 = #"/" [
            parse base-href [copy w thru "://" copy q [to "/" | to end]]
            return new-url: rejoin [w q path]
        ] [
            site: tail parse (to-string skip base-href 7) "/"
            path: parse path "/"
            foreach p path [
                either p = ".." [
                    if error? try [remove back site] [print "Bad relative link"]
                ] [
                    if p <> "." [append site p]
                ]
            ]
            new-url: make string! ""
            foreach p head site [append new-url join p "/"]
            new-url: head clear back tail new-url
            replace/all new-url "//" "/"
            insert head new-url "http://"
            return head new-url
        ]
    ]
]

parse/all page-markup [
    some [
        set tag tag! (
            if parse/all tag tag-rules [
                ;print reform [x url y tag-name]
            ]
        )
        | any-type!
    ]
]

probe stylesheets
probe images
probe scripts
probe links
</code>
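Both open questions above boil down to mapping a URL onto a local file name. One possible approach (a sketch, not from the original post, shown here in Python rather than REBOL): give directory-style URLs a default local name such as index.html, and percent-encode the query string of dynamic documents so that characters like "?", "&", and "=" never reach the filesystem. The function name `url_to_local_path` and the `default_name` parameter are illustrative choices, not anything defined in the script above.

```python
from urllib.parse import urlsplit, quote

def url_to_local_path(url, default_name="index.html"):
    """Map a URL to a filesystem-safe relative path.

    1. A URL whose path ends in "/" (e.g. http://localhost/) gets a
       default file name, since the real server-side name is unknown.
    2. A query string (dynamic documents) is percent-encoded and folded
       into the file name, so unsafe characters never reach the disk.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += default_name
    name = path.lstrip("/")
    if parts.query:
        # quote() with safe="" encodes every reserved character,
        # including "?", "&", "=" and "/".
        name += "_" + quote(parts.query, safe="")
    return f"{parts.hostname}/{name}"

print(url_to_local_path("http://localhost/"))
# -> localhost/index.html
print(url_to_local_path("http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz"))
# -> 127.0.0.1/cgi-bin/getboard.r_boardID%3Ddefault%26lang%3Dcz
```

The encoding is reversible (the original query string can be recovered with `urllib.parse.unquote`), which matters if the mirrored pages should later be rewritten to point back at the live site.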