[REBOL] Re: Download a whole website
From: oliva:david:seznam:cz at: 2-Aug-2002 14:54
Hello Abdel,
Thursday, July 25, 2002, 10:36:13 PM, you wrote:
AB> I know how to download a web page and save it to the disk, that if I know
AB> the name of the page. But if I want to download a whole website and save it
AB> to my hard drive?
I had a reb-bot for travelling the net and searching for images, but
it is really old (born in 2000), did not save pages, and had some bugs
inside, so I decided to make a new generation. Here is what I have now
(excuse the function 'uprav-url, Czech for "fix-url" - it's from the
old bot and needs to be improved, and translated, as well).
What does it do? It simply parses the HTML and sorts the found URLs
into blocks: images, stylesheets and linked scripts - I will add
applets and embedded objects as well...
There are two things I want to discuss:
1.
How to save a page from a URL like http://localhost/ :(
the file may be index.html, default.html, or whatever is configured on
the server side.
2.
How to encode the file names of dynamic documents such as:
http://127.0.0.1:85/cgi-bin/getboard.r?boardID=default&lang=cz
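For 2., one idea: map every character that is illegal in a file name to
a safe substitute, so the query string survives as part of the name; and
for 1., when the URL ends in a slash, fall back to a fixed name like
index.html. A rough, untested sketch ('url-to-filename and its
substitution table are my own invention, not part of the script below):
<code>
url-to-filename: func [url [string!] /local name][
    name: copy url
    replace name "http://" ""
    ;; a URL ending in "/" carries no file name - invent one
    if #"/" = last name [append name "index.html"]
    ;; substitute characters that file systems do not like
    foreach [bad good] ["/" "_" "?" "%3F" "&" "%26" "=" "%3D" ":" "%3A"] [
        replace/all name bad good
    ]
    name
]
</code>
The percent escapes keep the name more or less reversible; mapping
everything to "_" would be shorter, but then two different URLs could
collide in one file name.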
and here is the script:
<code>
rebol [
title: "Site downloader"
purpose: {To download pages from some url with all content}
author: {Oldes}
email: [oliva--david--seznam--cz]
comment: {This is not a finished version... for now it just parses the page and returns the sorted
types of URLs. I still need to add saving of the content and recursion for travelling from one
page to another}
version: 0.0.1
]
page-url: to url! ask "start URL: "
;page: read/binary page-url
page-markup: load/markup page-url ;to-string page
purl: decode-url page-url
if none? purl/path [purl/path: "/"]
purl/port-id: either none? purl/port-id [""][join ":" purl/port-id]
base-href: rejoin [http:// purl/host purl/port-id purl/path]
images: make block! 50
links: make block! 500
scripts: make block! 10
stylesheets: make block! 10
;; rules applied to the text of each tag: grab the URL after
;; src= or href= and file it into the matching block
tag-rules: [
"img" copy x thru {src=} copy url [to { } | to end ] y: to end (
tag-name: "img"
url: uprav-url url
if all [none? find images url not none? url] [insert images url]
)
| "link" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "link"
url: uprav-url url
if all [none? find stylesheets url not none? url] [insert stylesheets url ]
)
| "script" copy x thru {src=} copy url [to { } | to end ] y: to end (
tag-name: "script"
url: uprav-url url
if all [none? find scripts url not none? url] [insert scripts url ]
)
| "BASE" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "base"
base-href: uprav-url url
;print rejoin ["new base-href: " base-href]
)
| "EMBED" copy x thru {src=} copy url [to { } | to end ] y: to end (tag-name: "EMBED")
| "a" copy x thru {href=} copy url [to { } | to end ] y: to end (
tag-name: "a"
url: uprav-url url
if all [none? find links url not none? url] [insert links url ]
)
]
uprav-url: func [path [string!] /local q w site newurl][
path: trim/with path {"}
if find path "javascript:" [return none]
if find path "mailto:" [return path]
either found? find path "://" [
return path
] [
either path/1 = #"/" [
parse base-href [copy w thru "://" copy q [to "/" | to end]]
return newurl: rejoin [w q path]
] [
site: tail parse (to-string skip base-href 7) "/"
path: parse path "/"
foreach p path [
either p = ".." [
if error? try [remove back site] [print "Bad relative link"]
] [
if p <> "." [append site p]
]
]
newurl: make string! ""
foreach p head site [append newurl join p "/"]
newurl: head clear back tail newurl
replace/all newurl "//" "/"
insert head newurl "http://"
return head newurl
]
]
]
;; walk the block returned by load/markup and try the rules on each tag
parse/all page-markup [
some [
set tag tag! (
if parse/all tag tag-rules [
; print reform [x url y tag-name]
]
)
| skip
]
]
probe stylesheets
probe images
probe scripts
probe links
</code>
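And for the missing recursion I imagine a simple breadth-first crawl on
top of it. Again only an untested sketch - 'save-page and
'collect-links are hypothetical place-holders for the saving and
parsing parts that the script above does not have yet:
<code>
visited: make block! 1000
queue: reduce [page-url]
while [not empty? queue] [
    url: first queue
    remove queue
    if not find visited url [
        append visited url
        save-page url                    ;; to do: write the page to disk
        foreach link collect-links url [ ;; to do: parse page, return its links
            if not find queue link [append queue link]
        ]
    ]
]
</code>
It would also need a test that each link stays on the same site,
otherwise the bot wanders off over the whole net.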