How to comprehensively spider a site
June 5, 2008 9:42 AM

I have been charged with doing a full audit of my company's "web portal solution". This involves going through hundreds of pages and developing an incredibly detailed sitemap showing how all the pages link to one another. Please help me do this efficiently and accurately - I want to impress.

I will add that this "web portal solution" is indeed online, but it is password protected, so I have not been able to find a web service that can automate the task. The ideal solution would create a document with a tree-type structure, or maybe a flowchart layout, detailing which child URLs branch off of which parent URLs.

It gets tricky because there are several external links which do not need to be followed, and several links are just ASP pages that differ only in their query string (i.e. .../menupage.asp?pageid=21, .../menupage.asp?pageid=22 etc...). Does this complicate things?

Is there a Firefox add-on that can track where I click and then create a logical, visual output of where I visited? Basically I need something that looks at all the links on a page, follows those links to the sub pages, then repeats the process until all links in the domain have been followed.

Any ideas?
posted by yoyoceramic to technology (4 comments total)

There's lots of dedicated software that does this. I haven't done it myself, but a quick search found tools like PowerMapper, which supports password-protected sites. (No endorsement of PowerMapper - that's just the first one I found.)

You could also write a program (or hire a programmer) to spider the site for you, but if you can buy a tool to do what you want for well under $100, it's going to be much cheaper to go that way.
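
If you do go the write-a-program route, the core loop is short. Here is a minimal sketch in Python (standard library only) of the crawl the question describes: read a page's links, follow the ones on the same host, repeat until nothing new turns up. The names `crawl`, `LinkParser` and `fetch` are made up for illustration. `fetch` is whatever function you supply to download a page, and for a password-protected portal it would have to send your login session along, which this sketch does not attempt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns {page_url: [urls that page links to]}.

    `fetch` is a hypothetical callable: it takes a URL and returns HTML text.
    """
    site = urlparse(start_url).netloc
    link_map = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        links = []
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments, but keep the query
            # string so menupage.asp?pageid=21 and ?pageid=22 stay distinct.
            target = urldefrag(urljoin(url, href)).url
            if target not in links:
                links.append(target)
            # External links are recorded in the map but never followed.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
        link_map[url] = links
    return link_map
```

The returned dictionary is the raw "who links to whom" data; turning it into a tree or flowchart is a separate step. The `seen` set is what stops the crawl from looping forever when pages link back and forth.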
posted by pocams at 10:15 AM on June 5, 2008


I always used to use MS Visio for generating website maps automatically. You can download a fully functional demo of Visio from Microsoft that will last 30 days or so, for free.

Here's a walkthrough of how it works.

When you configure it to do the mapping, you can enter a username/password into the setup to allow it to get into your portal. You can also tell it how many layers off your site you want it to map (0, 1, 2, etc.). For example, if your site is "www.mysite.com" and you tell it to go only 1 layer off the site, it will only map pages directly linked to from "mysite.com". It will not spider pages/sites beyond that.
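
To make the "layers" idea concrete, here is a rough sketch in Python of a depth limit applied to a link map. It is not how Visio does it internally, and it counts depth as clicks from a start page, which may differ from how Visio counts layers. The function name `outline` and the shape of `link_map` (a dictionary from each page to the list of pages it links to) are assumptions made for the example.

```python
from collections import deque


def outline(link_map, start, max_depth):
    """Return indented lines for pages within max_depth clicks of start."""
    depth = {start: 0}
    children = {start: []}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # at the limit: list the page, but do not expand it
        for target in link_map.get(url, []):
            if target not in depth:
                # First time this page is reached: hang it under this parent.
                depth[target] = depth[url] + 1
                children[target] = []
                children[url].append(target)
                queue.append(target)

    lines = []

    def walk(url):
        lines.append("  " * depth[url] + url)
        for child in children[url]:
            walk(child)

    walk(start)
    return lines


# Made-up example site: the home page links to /a and /b, and so on.
example = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": ["/d"]}
print("\n".join(outline(example, "/", 2)))
```

Each page appears once, under the first parent that reaches it, so a page linked from several places (like /c above) is not repeated. With a limit of 2, /d is left out because it is three clicks from the start.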

I'm not affiliated with Microsoft, btw.
posted by xotis at 10:21 AM on June 5, 2008


Perhaps linkchecker can help.
posted by PueExMachina at 5:53 PM on June 5, 2008


Or possibly Xenu's Link Sleuth? It's free, so no harm trying it out, anyway.
posted by kristi at 7:48 PM on June 5, 2008

