Websnake 1.23 Problems

Besides unfathomable file counts, Websnake has other problems that make it unusable. Here are a few.

HTML mangling

Websnake often mangles the HTML when it tries to resolve links after downloading. I just tried to grab a page and saw it happen again. Here's a line from the original page:

A test page for the WebCapture project. It has links to these pages:

It became this (two non-printing characters removed):

A test page for the WebCapture 8< project.8<ù It has links to these pages:

I'm surprised that it screwed up simple text. There were no anchors or HTML entities in the original.

Websnake also becomes confused when trying to resolve links to anchors inside HTML pages, and resolves them incorrectly.

Incomplete pages

If you give Websnake the URL of a page and set the level of links traversed to zero, Websnake will download only the HTML file. If there are images or other non-HTML content on the page, they aren't downloaded. You could set the level limit to 1, but then you'll probably end up downloading pages you don't want. And you won't get their images.

The programming solution would be to check the link type. If it's an HREF target, it leads to another page, so add one to the current link level. Otherwise, it's page content, and should be downloaded no matter what the level limit is. Not too hard, but the program's authors treated all references identically.
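Here is a minimal sketch of that idea in Python. It is only an illustration of the approach, not Websnake's code. The tag lists and the names ReferenceClassifier and plan_downloads are my own assumptions, and the list of content-bearing tags is far from complete.

```python
from html.parser import HTMLParser

# Assumed, incomplete lists: (tag, attribute) pairs that embed page content
# versus pairs that link to another page.
EMBEDDED = {("img", "src"), ("script", "src"), ("embed", "src"), ("body", "background")}
NAVIGATION = {("a", "href"), ("area", "href")}


class ReferenceClassifier(HTMLParser):
    """Sorts the references on a page into page content and links."""

    def __init__(self):
        super().__init__()
        self.embedded = []
        self.navigation = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        for name, value in attrs:
            if value is None:
                continue
            if (tag, name) in EMBEDDED:
                self.embedded.append(value)
            elif (tag, name) in NAVIGATION:
                self.navigation.append(value)


def plan_downloads(html, depth, max_depth):
    """Return (url, depth) pairs to fetch for a page found at `depth`."""
    parser = ReferenceClassifier()
    parser.feed(html)
    # Page content stays at the page's own depth, so it is always fetched.
    plan = [(url, depth) for url in parser.embedded]
    # Links cost one level, and are dropped once the limit is exceeded.
    if depth + 1 <= max_depth:
        plan += [(url, depth + 1) for url in parser.navigation]
    return plan
```

With a level limit of zero, a page containing one image and one link would yield only the image. With a limit of 1, the linked page is added at depth 1, and its own images would be picked up when that page is processed in turn.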

Getting a specific page and its subpages

One thing that's good about Websnake is that you can restrict your sessions. You can either follow all links (i.e., no restrictions), stay within the top domain, stay within the current server, or only follow links below the starting URL. This would be a really useful feature if it were coded intelligently.
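To make the four restrictions concrete, here is a rough Python sketch of how such a check might look. It is a guess at the idea, not a description of Websnake's internals. The mode names, the function names, and the www.example.com URLs are made up for illustration. Two simplifying assumptions are marked in the comments: the top domain is taken to be the last two labels of the host name, and a final path segment containing a dot is taken to be a file name.

```python
from urllib.parse import urlsplit


def top_domain(host):
    # Assumption: the top domain is the last two labels of the host name.
    return ".".join(host.lower().split(".")[-2:])


def base_directory(url):
    """Directory part of the starting URL, always ending in '/'."""
    path = urlsplit(url).path or "/"
    last = path.rsplit("/", 1)[-1]
    # Assumption: a last segment containing a dot is a file name, so drop it.
    if "." in last:
        path = path[: len(path) - len(last)]
    if not path.endswith("/"):
        path += "/"
    return path


def in_scope(start_url, candidate_url, mode):
    """Decide whether candidate_url may be followed under the given mode."""
    start_host = urlsplit(start_url).hostname or ""
    cand = urlsplit(candidate_url)
    cand_host = cand.hostname or ""
    if mode == "all":
        return True
    if mode == "domain":
        return top_domain(start_host) == top_domain(cand_host)
    if mode == "server":
        return start_host == cand_host
    if mode == "below":
        return start_host == cand_host and cand.path.startswith(base_directory(start_url))
    raise ValueError("unknown mode: " + mode)
```

The point of base_directory is that a starting URL which names a file still defines a subtree: starting from http://www.example.com/plays/othello.html, anything under /plays/ counts as "below", so the crawl doesn't have to stop at the first page.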

For example, say you want a particular HTML file and everything it links to in subdirectories. In essence, you want to grab a subtree of the site. Be prepared to fail. If you provide the desired file name in the URL, Websnake will download that file and stop. It won't follow links if you specify a file name. But if you don't, it can lead to worse problems.

This came up in real life. There's a site with all of Shakespeare's plays. It has the directory structure below. All the plays are in one directory, and a set of glossary pages is in a subdirectory. Each play links to multiple glossary pages.

[directory tree]

If you try to download one play and the glossary pages it links to, you can't. Say you want "Othello". If you give Websnake the URL /plays/othello.html, Websnake will download that one page and then stop. To get the linked glossary pages, you have to start with the URL /plays. Websnake will grab the default page (index.html in this case), then whatever is linked from the default page, in this case all the plays.

This can take quite a while. I know; I tried. Then I remembered that Websnake doesn't correctly fix links to anchors inside a page (page.html#anchor). So even if it had downloaded the correct pages, the links to the glossary pages would have been wrong anyway.
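For what it's worth, fixing links to anchors doesn't have to be hard. The Python sketch below shows one way to do it, assuming the downloader keeps a table mapping each downloaded URL to the local file it was saved as. The function name local_link, the saved table, and the example URLs and paths are all hypothetical. The idea is to split off the #anchor, resolve the rest of the link as usual, and then put the anchor back.

```python
import posixpath
from urllib.parse import urldefrag, urljoin


def local_link(page_url, href, saved):
    """Rewrite href (found on page_url) to point at the local copy.

    `saved` maps absolute URLs, without fragments, to local file paths.
    Links to pages that were not downloaded are returned unchanged.
    """
    # Resolve the link against the page, then split off any "#anchor".
    absolute, fragment = urldefrag(urljoin(page_url, href))
    if absolute not in saved or page_url not in saved:
        return href
    # Express the target relative to the directory of the referring file.
    start = posixpath.dirname(saved[page_url]) or "."
    relative = posixpath.relpath(saved[absolute], start)
    # Reattach the anchor so the link still lands on the right spot.
    return relative + ("#" + fragment if fragment else "")


# Hypothetical table of downloaded pages and where they were saved.
saved = {
    "http://www.example.com/plays/othello.html": "site/plays/othello.html",
    "http://www.example.com/plays/glossary/a.html": "site/plays/glossary/a.html",
}
print(local_link("http://www.example.com/plays/othello.html",
                 "glossary/a.html#abhor", saved))
```

A link to an anchor on the same page (just "#top", say) goes through the same path and comes out as the page's own file name plus the anchor.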

Even worse is if the page you want isn't linked from the default page. Then there is no way for Websnake to grab the correct pages. This is a serious problem. When trying to get a particular page, you get either way too much or nothing at all.


Last updated 3 June 2000
http://www.rdrop.com/~half/General/ProductReviews/WebsnakeProblems.html
All contents ©1999-2002 Mark L. Irons