How to get a list of pages and assets for a website?
December 10, 2008 6:52 AM
How can I create a list of URLs for all "pages" and assets on a website?
I want to crawl a website and extract the URL of every page (including the query string) and asset encountered into a text file, CSV, or Excel sheet that I can manipulate.
I use WinHTTrack for my general page-crawling needs, but it doesn't quite give me what I want. Can anyone direct me to software that already does this, or do I have to write it myself?
wget will do this: run it over the site and every URL it touches ends up in its log output.
posted by mkb at 7:08 AM on December 10, 2008
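(For anyone finding this later: a minimal sketch of that wget recipe, assuming a Unix-ish shell with wget installed. The URL and the log filename crawl.log are stand-ins, not anything from the thread.)

    # Crawl the whole site without keeping any files: -r recurses,
    # -l inf lifts the depth limit, -p also fetches page assets
    # (images, CSS, scripts), --no-parent keeps the crawl from
    # wandering above the start URL, --delete-after discards each
    # download once fetched, and -o writes the full transcript to
    # crawl.log. (http://www.example.com/ is a stand-in URL.)
    wget -r -l inf -p --no-parent --delete-after -o crawl.log http://www.example.com/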
Response by poster: Thanks, mkb. That helps.
Your suggestion led me to find this PHP app:
http://code.google.com/p/url-batch
which gives you a leg up in composing the wget command and extracting the URLs from the log output.
posted by rocketpup at 8:16 AM on December 10, 2008
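(The URL-extraction half that the app automates can be approximated in one line of shell — a sketch, where crawl.log is the hypothetical log file from the wget command above.)

    # Pull every URL wget mentions out of the log, de-duplicate,
    # and write one URL per line -- a single-column CSV.
    grep -Eo 'https?://[^ ]+' crawl.log | sort -u > urls.csv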
This thread is closed to new comments.