Rebol indexer
[1/4] from: hallvard:ystad:helpinhand at: 23-Aug-2003 17:56
Hi list,
Some of you may remember I made a little search engine to register web pages with the
word "rebol" in them some time back. Ingo suggested I also register rebol scripts. I've
had some spare time the last few days, and the result is here:
http://folk.uio.no/hallvary/rix/
This still is a robot that registers all pages with the word «rebol» in them, but there
are some changes (in the robot and in the search interface):
* The robot obeys robots.txt (agent id: RixBot)
* html and script files available on the web are registered alike
* users may choose to search in html files only, script files only, or both (based on
the MIME type in the HTTP Content-Type header)
I noticed that rebol scripts are treated differently on different servers (surprise?
not really). Some call them text/plain, others text/x-rebol-application. I'll look into
that (and other things) later, if my wife goes on another week-end trip with the kids
... (possible enhancements: searching of different parts of rebol header etc.)
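The Content-Type variation described above could be handled with a small classifier. This is a sketch in Python rather than REBOL, and not Hallvard's actual code; the list of script MIME types and the fallback on the `.r` extension for `text/plain` responses are my assumptions.

```python
# MIME types that some servers use for rebol scripts (assumed list;
# text/x-rebol-application is the one mentioned in the post).
REBOL_SCRIPT_TYPES = {
    "text/x-rebol-application",
    "text/x-rebol",
    "application/x-rebol",
}

def classify(content_type, url=""):
    """Return 'script', 'html', or 'other' for an indexed resource."""
    # Drop parameters such as "; charset=..." and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    if mime in REBOL_SCRIPT_TYPES:
        return "script"
    # Many servers serve .r scripts as text/plain, so fall back to
    # the URL extension in that case.
    if mime == "text/plain" and url.lower().endswith(".r"):
        return "script"
    if mime == "text/html":
        return "html"
    return "other"
```

With this, a search restricted to "script files only" would just filter on the stored classification.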
As for the name, you can see that I chose to call it Rix. I like it, it's short and easy
to pronounce, and I don't think there are a million other applications with that name
out there already.
Comments are welcome.
Hallvard
[2/4] from: andreas:bolka:gmx at: 25-Aug-2003 21:14
Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
> http://folk.uio.no/hallvary/rix/
looks nice :)
> * The robot obeys robots.txt (agent id: RixBot)
would you like to factor that code out, so that future bot writers
could reuse that bit :) ?
--
Best regards,
Andreas
[3/4] from: hallvard:ystad:helpinhand at: 25-Aug-2003 23:31
Dixit Andreas Bolka (21.14 25.08.2003):
>Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
>> http://folk.uio.no/hallvary/rix/
>looks nice :)
Thanks.
>> * The robot obeys robots.txt (agent id: RixBot)
>would you like to factor that code out, so that future bot writers
>could reuse that bit :) ?
It's on my rebsite, Diddeley-do, available from /view desktop. Also (also? It's the very
same file!) on my website: http://folk.uio.no/hallvary/rebol/server.r.
I peeked a bit at the ht://dig package, and constructed this as an object. It has
a function to check whether a url is permitted or not:
forbidden? "/some/path/with/file.html"
Forbidden paths are stored in a hash!
I reconstruct such objects from a mysql database every now and then. It would probably
consume less memory to have the forbidden? function as a global word, and keep nothing
but the object's hash! of forbidden paths. Some guru might have a qualified opinion
on this? I'm nothing but a script kiddie myself.
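The "global function plus nothing but the set of paths" idea can be sketched like this, here in Python rather than REBOL (the closure stands in for the global word, and a plain set stands in for the hash!; the names are mine). Since robots.txt Disallow entries are path prefixes, the check is a prefix match rather than an exact lookup:

```python
def make_forbidden(disallowed_prefixes):
    """Build a forbidden? check that closes over only the path data,
    analogous to keeping just the hash! and one global word."""
    prefixes = tuple(disallowed_prefixes)
    def forbidden(path):
        # A path is forbidden if any stored Disallow prefix matches it.
        return any(path.startswith(p) for p in prefixes)
    return forbidden

# One such checker per (host, agent), rebuilt from the database:
forbidden = make_forbidden(["/cgi-bin/", "/private/"])
```

Usage: `forbidden("/cgi-bin/file.html")` is true, `forbidden("/public/index.html")` is false. The memory point in the post holds here too: only the path collection survives; no per-object function copies are kept.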
The robots.txt standard is unclear about one thing. Is this allowed:
user-agent: someAgent
user-agent: someOtherAgent
user-agent: someOtherAgentsAunt
disallow: /
Or must it be like this:
user-agent: someAgent
disallow: /
user-agent: someOtherAgent
disallow: /
user-agent: someOtherAgentsAunt
disallow: /
I believe there really _should_ be one disallow: line for each user-agent, but I've seen
examples of the first approach (http://www.rebol.org/robots.txt, for instance), so the
script also accepts those.
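A parser that accepts both layouts treats consecutive user-agent lines as one group sharing the disallow lines that follow. A sketch (in Python, not the script on the rebsite; function and variable names are my assumptions):

```python
def parse_robots(text):
    """Parse robots.txt into {agent: [disallowed prefixes]},
    accepting both the stacked and the repeated user-agent forms."""
    groups = {}       # agent (lowercased) -> list of disallowed prefixes
    agents = []       # user-agents of the group currently being built
    in_rules = False  # True once this group has seen a disallow line
    for line in text.splitlines():
        line = line.split("#")[0].strip()   # strip comments and blanks
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:                    # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow" and agents:
            in_rules = True
            if value:                       # empty Disallow allows all
                for a in agents:
                    groups[a].append(value)
    return groups
```

Both of the example files above then yield the same result: every listed agent ends up with `disallow: /`.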
Comments are welcome.
Regards,
Hallvard
[4/4] from: SunandaDH::aol::com at: 25-Aug-2003 23:19
Hallvard:
> The robots.txt standard is unclear about one thing. Is this allowed:
> user-agent: someAgent
<<quoted lines omitted: 6>>
> user-agent: someOtherAgent
> disallow: /
Both formats are acceptable according to the robots standard.
The first is good as it reduces bloat -- the file is half the size.
The second is good as some robots are badly written and won't handle the
first format.
If your code can handle it either way, then you have a good robot,
Sunanda.