Strip me some mp3 urls.
May 10, 2007 5:32 AM

In Linux, what's the simplest way to take a podcast RSS file as input, and output a file containing just the URLs of all the MP3s enclosed in the RSS file? Difficulty: the method must use only Bash + GNU tools, Python or Perl, with no non-standard add-on libraries or dependencies required.
posted by Jimbob to Computers & Internet (22 answers total) 4 users marked this as a favorite
 
http://www.wellho.net/resources/ex.php4?item=p668/medireport.pl
posted by ReiToei at 5:48 AM on May 10, 2007


... or one of the tutorials on this page (probably better): http://www.xml.com/pub/a/2001/04/18/perlxmlqstart1.html
posted by ReiToei at 5:51 AM on May 10, 2007


wget http://foo.com/podcast.xml | grep "http.*[Mm][Pp]3" > url_file.txt


Could use curl rather than wget too, I suppose. And you could use sed and just strip out exactly what you need...
posted by unixrat at 6:13 AM on May 10, 2007


Just tested it:
curl -s http://foo.com/podcast.xml | grep -oi "http.*mp3" | sort | uniq > url_file.txt
posted by unixrat at 6:21 AM on May 10, 2007 [1 favorite]


curl -s http://podcastfile | awk -F\" '{for(i=1;i<NF;i++){if($i ~ /mp3$/) print $i}}'
posted by [@I][:+:][@I] at 6:25 AM on May 10, 2007


i dunno why mefi ate the code, but: the loop condition should be i<NF
posted by [@I][:+:][@I] at 6:28 AM on May 10, 2007


ok i give up
posted by [@I][:+:][@I] at 6:28 AM on May 10, 2007


[swearing redacted]
posted by [@I][:+:][@I] at 6:31 AM on May 10, 2007


curl http://foo.com/podcast.xml | grep -i "http.*mp3" > url_file.txt

This version of unixrat's one-liner may be easier to understand, but it still won't work with a URL that doesn't end in mp3 (which is in no way guaranteed). I substituted curl because wget doesn't write to standard output by default. You could also use `wget -O -`, depending on which one is installed on your system.

I don't think you can safely assume that an XML parser is available for your Python or Perl installation, so I would go with a Perl script that looks for all instances of m@<[^/>]*enclosure@ and sucks out the innards. I could write one, but not in this little box :)
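A minimal sketch of that regex-only idea, written in Python 3 rather than Perl (an assumption, since no Perl script was posted). The helper name enclosure_urls and the sample feed are hypothetical. It is not an XML parser: a literal ">" inside an attribute value would cut the tag short, and entities such as &amp; are left undecoded.

```python
import re

# Hypothetical helper: pull url="..." values out of <enclosure ...> tags
# using regular expressions only (no XML parser involved).
# [^>]* also matches newlines, so a tag may span several lines.
ENCLOSURE_TAG = re.compile(r"<enclosure\b[^>]*>", re.IGNORECASE)
URL_ATTR = re.compile(r"""\burl\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def enclosure_urls(rss_text):
    """Return the url attribute of every <enclosure> tag, in document order."""
    urls = []
    for tag in ENCLOSURE_TAG.findall(rss_text):
        match = URL_ATTR.search(tag)
        if match:
            urls.append(match.group(2))
    return urls


if __name__ == "__main__":
    # Made-up feed fragment; the second URL deliberately does not end in mp3.
    sample = """<rss><channel>
      <item><enclosure url="http://example.com/a.mp3"
                       length="123" type="audio/mpeg" /></item>
      <item><enclosure type="audio/mpeg" url="http://example.com/download?id=2"/></item>
    </channel></rss>"""
    for url in enclosure_urls(sample):
        print(url)
```

Because it keys on the enclosure tag rather than on the file extension, it also picks up URLs that don't end in mp3.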
posted by mkb at 6:31 AM on May 10, 2007


Hey, Jimbob, do you have an example RSS feed for us to work with? One that the tool will be used with?
posted by unixrat at 6:53 AM on May 10, 2007


I tried this on exactly one podcast (They Might Be Giants, http://www.tmbg.com/_media/_pod/podcast.xml), and I don't actually know much about RSS/podcasting, so take it with a grain of salt. mp3s.py
posted by jepler at 7:01 AM on May 10, 2007


This is exactly the type of task XPath is good at solving. If your system has LibXML2 then it probably also has the python bindings for the same. This will give you a more stable, reliable solution than grepping the feed for URLs.
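As a rough illustration of the XPath-style approach, here is a sketch that swaps in xml.etree.ElementTree from the Python 3 standard library instead of the LibXML2 bindings. ElementTree only supports a limited XPath subset, which happens to be enough here. The function name mp3_enclosure_urls is hypothetical, and the feed is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET


def mp3_enclosure_urls(rss_text):
    """Return url attributes of <enclosure type="audio/mpeg"> elements."""
    root = ET.fromstring(rss_text)
    # Limited XPath: any descendant <enclosure> whose type attribute matches.
    nodes = root.findall(".//enclosure[@type='audio/mpeg']")
    # Skip enclosures that have no url attribute at all.
    return [node.get("url") for node in nodes if node.get("url") is not None]
```

Since a real parser is doing the work, entities such as &amp; in the URL come back already decoded.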
posted by sbutler at 7:05 AM on May 10, 2007


If your system has LibXML2 then it probably also has the python bindings for the same.
Nope, it probably doesn’t. Python libraries are not any more standardly installed than CPAN modules.

$ python
Python 2.3.5 (#2, Oct 16 2006, 19:19:48)
[GCC 3.3.5 (Debian 1:3.3.5-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import libxml2, sys
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ImportError: No module named libxml2
>>>
$ locate libxml2
/usr/lib/libxml2.so.2
/usr/lib/libxml2.so.2.6.16
$

posted by Aidan Kehoe at 7:14 AM on May 10, 2007


Well that bites. I'm too used to Gentoo.
posted by sbutler at 7:29 AM on May 10, 2007


Try pasting this into bash:

feed="http://www.tmbg.com/_media/_pod/podcast.xml"
wget -qO- "$feed" | sed -n '
\:<enclosure:b glom
b
:glom
\:/>:b glommed
N
b glom
:glommed
\:type="audio/mpeg":!b
\:url=":{s:.*url="::
s:".*::
p
}
'

It's by no means an XML parser, but it should be reasonably robust for what you want. It will spit out the guts of the url="guts" attribute in any <enclosure /> tag that has both a url= attribute and a type="audio/mpeg" attribute. It doesn't care what order the attributes of the <enclosure /> tag occur in, or what other attributes it might have, and it doesn't require the whole tag to be on one line.
posted by flabdablet at 8:43 AM on May 10, 2007


Attributes containing "/>" will break it, though.
posted by flabdablet at 8:48 AM on May 10, 2007


One line less sed:

feed="http://www.tmbg.com/_media/_pod/podcast.xml"
wget -qO- "$feed" | sed -n '
\:<enclosure:b glom
b
:glom
\:/>:b glommed
N
b glom
:glommed
\:type="audio/mpeg":!b
\:url=":!b
s:.*url="::
s:".*::
p
'

posted by flabdablet at 8:56 AM on May 10, 2007


One less again:

feed="http://www.tmbg.com/_media/_pod/podcast.xml"
wget -qO- "$feed" | sed -n '
\:<enclosure:!b
:glom
\:/>:b glommed
N
b glom
:glommed
\:type="audio/mpeg":!b
\:url=":!b
s:.*url="::
s:".*::
p
'

FFS, me, go to bed.
posted by flabdablet at 9:14 AM on May 10, 2007


libxml2 might not be a standard Python library, but I am pretty sure that xml.dom.minidom is included in all distributions. In that case, this is pretty simple and easy to read:

import urllib2
import xml.dom.minidom
doc = xml.dom.minidom.parse(urllib2.urlopen("http://www.tmbg.com/_media/_pod/podcast.xml"))
for node in doc.getElementsByTagName("enclosure"):
    print node.getAttribute("url")
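The snippet above is Python 2 (urllib2, print statement). A hedged Python 3 sketch of the same minidom idea follows. The function name minidom_enclosure_urls is hypothetical, and an in-memory sample stands in for the downloaded feed so that it runs without network access.

```python
import io
import xml.dom.minidom


def minidom_enclosure_urls(source):
    """Return the url attribute of each <enclosure> in a file-like RSS source."""
    doc = xml.dom.minidom.parse(source)
    return [node.getAttribute("url") for node in doc.getElementsByTagName("enclosure")]


if __name__ == "__main__":
    # Made-up feed; with network access, the object returned by
    # urllib.request.urlopen(feed_url) could be passed in instead.
    sample = io.BytesIO(
        b'<rss><channel><item>'
        b'<enclosure url="http://example.com/a.mp3" type="audio/mpeg"/>'
        b'</item></channel></rss>'
    )
    for url in minidom_enclosure_urls(sample):
        print(url)
```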
posted by mmascolino at 10:53 AM on May 10, 2007


that sucks...preview had it right...the line starting with print is supposed to be indented
posted by mmascolino at 10:54 AM on May 10, 2007


Sorry I haven't been around, but I'll test out all these methods and see how they all measure up, thanks everyone.
posted by Jimbob at 2:09 PM on May 10, 2007


Don't forget to deal with entities if you're not using a real parser. At the very least, sed -e 's/&amp;/\&/g'.
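For the Python route, a small sketch of the same cleanup using xml.sax.saxutils.unescape from the standard library. The helper name decode_entities is hypothetical. This covers only the five predefined XML entities; numeric character references such as &#38; are not handled.

```python
from xml.sax.saxutils import unescape


def decode_entities(raw_url):
    """Decode &amp;, &lt; and &gt;, plus the two quote entities passed explicitly."""
    # unescape() handles &amp;, &lt; and &gt; on its own; the quote
    # entities have to be supplied through the optional mapping.
    return unescape(raw_url, {"&quot;": '"', "&apos;": "'"})


if __name__ == "__main__":
    print(decode_entities("http://example.com/get?id=1&amp;fmt=mp3"))
```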
posted by Freaky at 4:35 PM on May 10, 2007

