Rebol indexer
[1/4] from: hallvard:ystad:helpinhand at: 23-Aug-2003 17:56
Hi list,
Some of you may remember I made a little search engine to register web pages with the
word "rebol" in them some time back. Ingo suggested I also register rebol scripts. I've
had some spare time the last few days, and the result is here:
http://folk.uio.no/hallvary/rix/
This still is a robot that registers all pages with the word «rebol» in them, but there
are some changes (in the robot and in the search interface):
* The robot obeys robots.txt (agent id: RixBot)
* html and script files available on the web are registered alike
* users may choose to search in html files only, script files only, or both (based on
the MIME type in the HTTP Content-Type header)
I noticed that rebol scripts are treated differently on different servers (surprise?
not really). Some call them text/plain, others text/x-rebol-application. I'll look into
that (and other things) later, if my wife goes on another week-end trip with the kids
... (possible enhancements: searching of different parts of rebol header etc.)
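The Content-Type variation described above could be handled with a small classifier. This is a sketch in Python rather than REBOL, and not Hallvard's actual code; the list of script MIME types and the fallback on the `.r` extension for `text/plain` responses are my assumptions.

```python
# MIME types that some servers use for rebol scripts (assumed list;
# text/x-rebol-application is the one mentioned in the post).
REBOL_SCRIPT_TYPES = {
    "text/x-rebol-application",
    "text/x-rebol",
    "application/x-rebol",
}

def classify(content_type, url=""):
    """Return 'script', 'html', or 'other' for an indexed resource."""
    # Drop parameters such as "; charset=..." and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    if mime in REBOL_SCRIPT_TYPES:
        return "script"
    # Many servers serve .r scripts as text/plain, so fall back to
    # the URL extension in that case.
    if mime == "text/plain" and url.lower().endswith(".r"):
        return "script"
    if mime == "text/html":
        return "html"
    return "other"
```

With this, a search restricted to "script files only" would just filter on the stored classification.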
As for the name, you can see that I chose to call it Rix. I like it, it's short and easy
to pronounce, and I don't think there are a million other applications with that name
out there already.
Comments are welcome.
Hallvard
[2/4] from: andreas:bolka:gmx at: 25-Aug-2003 21:14
Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
> http://folk.uio.no/hallvary/rix/
looks nice :)
> * The robot obeys robots.txt (agent id: RixBot)
would you like to factor that code out, so that future bot writers
could reuse that bit :) ?
--
Best regards,
Andreas
[3/4] from: hallvard:ystad:helpinhand at: 25-Aug-2003 23:31
Dixit Andreas Bolka (21.14 25.08.2003):
>Saturday, August 23, 2003, 5:56:50 PM, Hallvard wrote:
>> http://folk.uio.no/hallvary/rix/
>looks nice :)
Thanks.
>> * The robot obeys robots.txt (agent id: RixBot)
>would you like to factor that code out, so that future bot writers
>could reuse that bit :) ?
It's on my rebsite, Diddeley-do, available from /view desktop. Also (also? It's the very
same file!) on my website: http://folk.uio.no/hallvary/rebol/server.r.
I peeked a bit at the ht://dig package, and constructed this as an object. It has
a function to check whether a url is permitted or not:
forbidden? "/some/path/with/file.html"
Forbidden paths are stored in a hash!
I reconstruct such objects from a mysql database every now and then. It would probably
consume less memory to have the forbidden? function as a global word, and keep nothing
but the object's hash! of forbidden paths. Some guru might have a qualified opinion
on this? I'm nothing but a script kiddie myself.
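The "global function plus nothing but the set of paths" idea can be sketched like this, here in Python rather than REBOL (the closure stands in for the global word, and a plain set stands in for the hash!; the names are mine). Since robots.txt Disallow entries are path prefixes, the check is a prefix match rather than an exact lookup:

```python
def make_forbidden(disallowed_prefixes):
    """Build a forbidden? check that closes over only the path data,
    analogous to keeping just the hash! and one global word."""
    prefixes = tuple(disallowed_prefixes)
    def forbidden(path):
        # A path is forbidden if any stored Disallow prefix matches it.
        return any(path.startswith(p) for p in prefixes)
    return forbidden

# One such checker per (host, agent), rebuilt from the database:
forbidden = make_forbidden(["/cgi-bin/", "/private/"])
```

Usage: `forbidden("/cgi-bin/file.html")` is true, `forbidden("/public/index.html")` is false. The memory point in the post holds here too: only the path collection survives; no per-object function copies are kept.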
The robots.txt standard is unclear about one thing. Is this allowed:
user-agent: someAgent
user-agent: someOtherAgent
user-agent: someOtherAgentsAunt
disallow: /
Or must it be like this:
user-agent: someAgent
disallow: /
user-agent: someOtherAgent
disallow: /
user-agent: someOtherAgentsAunt
disallow: /
I believe there really _should_ be one disallow: line for each user-agent, but I've seen
examples of the first approach (http://www.rebol.org/robots.txt, for instance), so the
script also accepts those.
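A parser that accepts both layouts treats consecutive user-agent lines as one group sharing the disallow lines that follow. A sketch (in Python, not the script on the rebsite; function and variable names are my assumptions):

```python
def parse_robots(text):
    """Parse robots.txt into {agent: [disallowed prefixes]},
    accepting both the stacked and the repeated user-agent forms."""
    groups = {}       # agent (lowercased) -> list of disallowed prefixes
    agents = []       # user-agents of the group currently being built
    in_rules = False  # True once this group has seen a disallow line
    for line in text.splitlines():
        line = line.split("#")[0].strip()   # strip comments and blanks
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:                    # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow" and agents:
            in_rules = True
            if value:                       # empty Disallow allows all
                for a in agents:
                    groups[a].append(value)
    return groups
```

Both of the example files above then yield the same result: every listed agent ends up with `disallow: /`.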
Comments are welcome.
Regards,
Hallvard
[4/4] from: SunandaDH::aol::com at: 25-Aug-2003 23:19
Hallvard:
> The robots.txt standard is unclear about one thing. Is this allowed:
> user-agent: someAgent
<<quoted lines omitted: 6>>
> user-agent: someOtherAgent
> disallow: /
Both formats are acceptable according to the robots standard.
The first is good as it reduces bloat -- the file is half the size.
The second is good as some robots are badly written and won't handle the
first format.
If your code can handle it either way, then you have a good robot,
Sunanda.