Blocking robot downloads of a publicly available mp3 on a website
November 17, 2016 1:34 PM

I "run" the website for a friend who is a musician. I say "run" because I don't know what I'm doing really. Anyways, I have some mp3s of his works on his website. He likes this and it is important to him, but a vast majority of the traffic I end up paying for is various scraping robots and the like that I imagine just download every mp3 they can find in the world just because. Is there some smart way to block this while still letting the website work?
posted by cmm to Computers & Internet (9 answers total) 3 users marked this as a favorite
 
I am not much better versed in web design than you are, I suspect, but is there some way to introduce a simple captcha into the download process? That would thwart the vast majority of bots.
posted by Shepherd at 1:43 PM on November 17, 2016


Best answer: Does your site have a robots.txt file? This won't prevent unscrupulous scrapers from burning up your bandwidth, since compliance is entirely up to the implementer of the bot, but more ethical scrapers (including the big boys like search engines) will not fetch the paths you disallow in your robots.txt file, sparing you from paying for bandwidth for big files that aren't really indexable anyway. It might be worth adding one just to see if it makes a significant difference in the traffic you see.
posted by firechicago at 1:56 PM on November 17, 2016 [1 favorite]


I think the best way would be to restrict access to the contents of a folder to all but whitelisted domains. In another life (i.e., 15 years ago) I think I did that by editing the .htaccess file, but that depends on what kind of web server the site is running on.
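If it's Apache with mod_rewrite, a rough sketch of that idea might look like the lines below, dropped into the mp3 folder's .htaccess ('example.com' is just a placeholder for the real domain):

# Sketch only: refuse mp3 requests whose referrer is some other site,
# while letting the site's own pages (and blank referrers) through.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.mp3$ - [F]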
posted by lmfsilva at 2:05 PM on November 17, 2016


Best answer: Would it be possible to host these mp3s somewhere else, like SoundCloud, or as the audio track for a few youtube videos? That way you wouldn't be on the hook for costs related to robots downloading the files, as you wouldn't be hosting them in the first place. As a side benefit, if your friend is a musician, doing this may result in more exposure generally anyway.
posted by Aleyn at 2:24 PM on November 17, 2016 [7 favorites]


Best answer: So, first put all your MP3s in a separate folder from all the other web content (something like 'music-directory/song1.mp3' ... 'music-directory/song1000.mp3').

firechicago's advice is good. Put a file called 'robots.txt' in the root directory of the website. Put this in the robots.txt:
User-agent: *
Disallow: /music-directory/
lmfsilva's advice is good, but I would blacklist rather than whitelist. In the 'music-directory' folder create a file called '.htaccess'. Go to this website and copy and paste the code into your .htaccess file.
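As a much smaller illustration of the blacklist idea (assuming Apache with mod_rewrite; the user-agent names and the IP prefix below are made-up examples, not the ruleset from the link):

# Tiny blacklist sketch for the mp3 folder's .htaccess.
RewriteEngine On
# Refuse requests from a couple of troublesome crawler user agents (names are examples).
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F]
# Refuse requests from a specific IP range seen hammering the logs (example range).
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule .* - [F]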

These two things should be totally transparent to the end user, but test them thoroughly. There are more intrusive ways of dealing with this kind of problem, like CAPTCHAs, passwords, emailing links to the MP3s, etc. I wouldn't do any of that, because it will reduce the number of people downloading the MP3s. One way to increase the number of listens, though, is to take Aleyn's advice and just use SoundCloud, since end users will be able to just hit a play button and not bother with downloading the file.
posted by gregr at 2:46 PM on November 17, 2016 [7 favorites]


Best answer: If you can do mod_rewrite-like stuff, configure the web server to refuse requests for the MP3s unless they carry a referrer from the page where the download link lives. Most bots (not all) just fetch the file directly without passing a referrer, so this will eliminate most of the troublesome downloaders. Or, check for the existence of a cookie that you set from a page earlier in the site navigation, to prove that the downloader actually visited the site.
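A sketch of that referrer check, again assuming Apache with mod_rewrite and 'example.com' as a placeholder for the real domain (the cookie variant is shown commented out below it, with a made-up cookie name):

# Sketch only: refuse mp3 requests that don't carry a referrer from the site's own pages.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.mp3$ - [F]

# Cookie variant: a page on the site sets a cookie (here called "visited"),
# and mp3 requests arriving without it get refused.
# RewriteCond %{HTTP_COOKIE} !visited=1
# RewriteRule \.mp3$ - [F]

Note that this version also blocks downloads with a blank referrer, which catches more bots but will also trip up the occasional privacy-conscious browser that strips referrers.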

CAPTCHA is also a good suggestion; either way, the key is to verify that the MP3 downloader visited some other part of the site and didn't just go straight to the MP3 URL.

But you're never going to get rid of all of it, because on the internet there's no foolproof way to distinguish a bot from a human from HTTP traffic alone. If a browser can pass the information back and forth, a bot can imitate that seamlessly. Best to limit it as much as possible, knowing it won't completely go away. You might also look at a content delivery network (CDN), which may offer cheaper bandwidth for serving the MP3s.
posted by AzraelBrown at 6:28 AM on November 18, 2016


Response by poster: Thanks everyone!

Yeah, I know I can't block everything, but my little rinky-dink site isn't important enough to get special attention. So if I can just quash the majority of it with some minor checks, I will be ecstatic.

From all this, it sounds like I will first see if I can use SoundCloud to get these files off my server entirely, and then I don't really care who downloads them.

If that fails:
- Add the 6G Firewall rules to my .htaccess and add IPs to the block list at the end if I see heavy hitters in the logs
- Move the music files to their own directory and use robots.txt to keep them out of the indexes of crawlers that play by the rules
- Validate that a referrer is passed to get to the music files (though I do not know how to do this so I'll have to figure it out)

I appreciate and understand the captcha recommendations, but I don't really want to have something artificial like that in the user experience of the site.
posted by cmm at 7:01 AM on November 18, 2016


Best answer: I think it might be easier for you to try the robots.txt solution first, rather than moving everything over to SoundCloud. It's easy and will take you like 30* minutes to set up, so just try it.

Bonus points for testing it after you set it up.
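A quick way to test it is to fetch the file the way a crawler would and eyeball the output (with the real domain in place of example.com):

curl -s http://example.com/robots.txt

If that prints the User-agent and Disallow lines you added, well-behaved crawlers will see the same thing.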

* 5 minutes for a pro, budgeting extra time for newbie
posted by intermod at 1:16 PM on November 18, 2016


Response by poster: Thanks everyone for the help. I ended up moving all the large files to a subdirectory, adding that directory to robots.txt, and requiring a referrer to be set before serving those files. I also put the .htaccess "firewall" from gregr's reply on the site, because it seemed like a good idea and would let me blacklist easily if I still see specific things getting through.

It really didn't take long, and so far it has been working OK. I'm monitoring the 403 (forbidden) errors for a bit to see if I'm losing real traffic, but I imagine I won't have the patience to do that much longer, and whatever will be will be.
posted by cmm at 2:07 PM on November 28, 2016 [1 favorite]


This thread is closed to new comments.