How do I extract the URLs from a web page?
July 30, 2008 5:47 PM Subscribe
What's the fastest and simplest way of extracting the URLs from an HTML file?
Input: Any HTML page.
Output: A .txt file with a list of all the URLs on the page.
Is there freeware that does this?
How about a macro of some kind?
A script would work, but I don't know any of the scripting languages that run on PCs. I wouldn't mind learning one, but only as a last resort.
This tool (URL Extractor 1.0) seems to be free and to do what you need. I haven't tested it, though.
posted by McSly at 6:24 PM on July 30, 2008
Best answer: I found a bookmarklet a while back that does this. Here's the code:
javascript:var a='';for(var ln=0;ln<document.links.length;ln++){var lk=document.links[ln];a+=ln+': <a href=\''+lk+'\' title=\''+lk.text+'\'>'+lk+'</a><br>\n';}w=window.open('','Links','scrollbars,resizable,width=400,height=600');w.document.write(a)
Just copy that into your address bar while on the page you want links for. Press enter and it will bring up a pop-up window with all the links listed and clickable.
You can also create a bookmark in your browser with that code as the URL. In the future just click the bookmark and it will perform the link-gathering operation automagically.
posted by Rhaomi at 6:40 PM on July 30, 2008 [2 favorites]
Best answer: Open in Firefox, right click on page, View Page Info, click Links tab, right click, Select All, copy, and paste into a text file.
The only downside is that it apparently doesn't include any line breaks (I pasted into Notepad), but you can probably do some kind of search and replace on "http://" to add a line break before each URL.
posted by EndsOfInvention at 6:41 PM on July 30, 2008 [2 favorites]
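If doing that search and replace by hand in Notepad gets tedious, it can be scripted. Here is a minimal sketch in Ruby (which comes up again later in the thread); the file names pasted-links.txt and urls.txt are just placeholders:

# Sketch: insert a line break before every "http://" in the pasted text.
# "pasted-links.txt" and "urls.txt" are placeholder file names.
text = File.read('pasted-links.txt')
File.open('urls.txt', 'w') { |f| f.write(text.gsub('http://', "\nhttp://").lstrip) }

This only helps for absolute http:// links, of course; https or relative links would need the same treatment.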
The URL regex that I've had some success with is: [Hh][Rr][Ee][Ff]='?"?([^'"<>]+)
posted by a robot made out of meat at 7:01 PM on July 30, 2008
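If you do end up scripting it, here is a rough Ruby sketch of that regex in use, writing one match per line to a text file. The names page.html and urls.txt are placeholders, and this inherits all the usual regex-versus-markup caveats discussed below:

# Sketch: run the href regex above over a saved page and write one URL per line.
# "page.html" and "urls.txt" are placeholder file names.
html = File.read('page.html')
urls = html.scan(/[Hh][Rr][Ee][Ff]='?"?([^'"<>]+)/).flatten
File.open('urls.txt', 'w') { |f| f.puts urls }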
It's important to understand the difference between parsers that understand markup and those that don't. Consider a document like <a href="http://a.com">http://b.com</a> and decide whether you want A) only http://a.com (the link target) or only http://b.com (the link text), or B) both.
A) If you only want one or the other, then you'll need a markup-based parser, either for HTML or XHTML. Use JavaScript and a browser, or Beautiful Soup, or HTML Tidy and XML tools such as XPath.
B) If you want both, use regexes.
There are some scenarios that regexes won't handle, particularly invalid and malformed HTML. If a fragment looks like <a href="http://a.com/list with space">click me</a>, then most regexes won't have the sense to understand the attribute value boundaries or to encode the URL as browsers do. While initially more difficult and bulky, a markup-based parser (A) will be more reliable.
posted by holloway at 8:33 PM on July 30, 2008
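For the markup-aware route (option A) in Ruby, one possible sketch uses the Nokogiri gem — it isn't mentioned in the thread itself, so treat it as an assumption alongside the Beautiful Soup and HTML Tidy suggestions above; page.html and urls.txt are placeholder names. It pulls only the href targets (the http://a.com case):

# Sketch of option A: a markup-aware parse instead of a regex.
# Assumes the Nokogiri gem is installed (gem install nokogiri).
# "page.html" and "urls.txt" are placeholder file names.
require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML(File.read('page.html'))
urls = doc.css('a[href]').map { |a| a['href'] }
File.open('urls.txt', 'w') { |f| f.puts urls }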
If you need the links on the page to "work", then cutting and pasting the source isn't going to be the solution, because you'll end up with links that look like
../about.html
or similar, which are relative to the current page.
It would help to know what you want this for. That might influence which tool would work best for what you need.
posted by Deathalicious at 11:19 PM on July 30, 2008
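If relative links like that are the issue, they can be resolved against the address of the page they came from before being written out. A small sketch using Ruby's standard URI library; the base URL below is a made-up example:

# Sketch: resolve a relative link against the page it was found on.
# The base URL is a made-up example.
require 'uri'

base = 'http://www.example.com/section/page.html'
puts URI.join(base, '../about.html')   # => http://www.example.com/about.html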
If you have lynx,
lynx -dump -listonly http://site.com
gives you a nice list without having to reinvent a fragile regex for the umpty millionth time. There's also an option so that images would show up as links too...
posted by [@I][:+:][@I] at 7:45 AM on July 31, 2008 [1 favorite]
Response by poster: Wow, lots of tech goodness here, thanks everyone. Thx to chrisamiller for the pointer to Ruby, Rhaomi for introducing me to bookmarklets (that is cool), and EndsOfInvention for the hidden goodness of Firefox.
posted by storybored at 11:08 AM on July 31, 2008
[@I][:+:][@I]'s answer is my favourite... nice and robust.
posted by holloway at 5:22 PM on July 31, 2008
Step 1: One-Click Install of Ruby. Easy enough.
Step 2: Type this into a command prompt:
ruby -ne 'if $_ =~ /href=\"([^\"]+)\"/;puts $1;end' < YourHTMLfile.html
(The regex could be tweaked, but it does a reasonable enough job by pulling out everything inside an href attribute.)
posted by chrisamiller at 6:11 PM on July 30, 2008
This thread is closed to new comments.