How to get a list of pages and assets for a website?
December 10, 2008 6:52 AM

How can I create a list of URLs for all "pages" and assets on a website?

I want to crawl a website and extract the URL of every page (including query string) and asset encountered into a text file, CSV, or Excel sheet that I can manipulate.

I use WinHTTrack for my general page-crawling needs, but it doesn't quite give me what I want. Can anyone direct me to software that already does this, or do I have to write it myself?
posted by rocketpup to Computers & Internet (2 answers total) 4 users marked this as a favorite
 
Best answer: If you can obtain a copy of wget, it will take care of the page crawling; use the --no-verbose option to cut down on the logging clutter.
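
For example, something along these lines (a sketch, not tested here; the exact flags and log format vary a bit between wget versions, and example.com/crawl.log are just placeholders):

    # crawl the whole site without saving files, logging every URL touched
    wget --spider --recursive --level=inf --no-parent \
         --page-requisites --no-verbose \
         --output-file=crawl.log http://www.example.com/

--page-requisites makes it request assets (images, CSS, scripts) as well as pages, and --spider skips actually keeping any of the downloads.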
posted by mkb at 7:08 AM on December 10, 2008


Response by poster: Thanks, mkb. That helps.

Your suggestion led me to find this PHP app:

http://code.google.com/p/url-batch

which gives you a leg up in composing the wget command and extracting the URLs from the log output.
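
For anyone doing that part by hand instead, here is roughly the kind of extraction it automates, assuming the --no-verbose log above was saved as crawl.log (file names here are placeholders, and the log format may differ slightly across wget versions):

    # pull every URL out of the log, de-duplicated, into a plain text list
    grep -Eo 'https?://[^ "]+' crawl.log | sort -u > urls.txt

    # optional: split each URL at the first "?" into url,querystring CSV columns
    awk -F'?' '{ print $1 "," $2 }' urls.txt > urls.csv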
posted by rocketpup at 8:16 AM on December 10, 2008


This thread is closed to new comments.