Searching for a pearl of Perl wisdom
October 6, 2005 8:45 AM

Can someone help me figure out how to search and replace case-sensitive filenames in HTML using Perl?

I have a large number of HTML files that include both links and JavaScript tags. I am porting these files over to Linux, where case-sensitivity becomes an issue. Using Perl, I'd like to search through the files, find all instances of src attributes or onClick handlers, and replace the referenced filenames with all lower-case equivalents. The script needs to be aware of the possibility of escape sequences. Additionally, the filename could be surrounded by either single or double quotes.

I'm a newbie to Perl, but I have searched both CPAN and Google for examples. I am aware of the numerous HTML:: modules in Perl, but am not entirely sure how to implement them for this case. My regular expression skills are still rough around the edges. Any pointers to useful code/examples or discussion of an approach will be greatly appreciated.
posted by nightengine to Computers & Internet (10 answers total)
 
In general what you want is something like

s/src=(["'][^"']+)/src=\L$1/g

but I don't know how you'd find filenames in the onClick events, where they could be embedded in arbitrary JavaScript code.
posted by nicwolff at 9:10 AM on October 6, 2005
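
To see what the substitution above does, it can be run on a couple of sample lines. This is only a sketch: the lowercase_src helper and the sample markup are made up for illustration, and it assumes the attribute is written exactly as src= followed by a quoted value.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case whatever follows src= up to the closing quote.
sub lowercase_src {
    my ($line) = @_;
    # \L lower-cases the rest of the replacement text, i.e. the captured
    # opening quote plus the filename.
    $line =~ s/src=(["'][^"']+)/src=\L$1/g;
    return $line;
}

print lowercase_src(q{<img src="Images/Logo.GIF" alt="Logo">}), "\n";
print lowercase_src(q{<script src='JS/Menu.js'></script>}), "\n";
```

Only the quoted value after src= changes; other attributes such as alt keep their case.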


How large is a large number, and how unusual are these filenames? For sufficiently unique stuff (ThisIsMyBigHonkingFile.jpg) you could get away with a straight search & replace.

You might consider downloading and installing (if you're on Windows) UltraEdit and playing with your regexes in its search and replace function where you can watch it work and confirm or refuse the alteration. I use it all the time for big collections of files, in or not in subdirectories. Just be sure to turn off the annoying default preference of "UltraEdit regex syntax."

It would also be useful to run your regex as a straight search in one of those editors and see how many false positives you get.

jEdit has similar support, I believe, if you're on a Mac.
posted by phearlez at 9:49 AM on October 6, 2005


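Actually, that's not a bad idea, phearlez. Here's how you might code it in Perl:

#!/usr/bin/perl -w
use strict;
use File::Find;

# Collect the lower-cased name of everything under the current directory.
my @files;
find( sub { push @files, lc }, '.' );

find( sub {
    return unless /\.html$/;
    open FILE, '+<', $_ or die "cannot open file ($!)";
    my @lines = <FILE>;
    # Rewind and empty the file before writing the edited lines back.
    seek FILE, 0, 0;
    truncate FILE, 0;
    foreach (@lines) {
        foreach my $file (@files) {
            # Replace any-case mentions of a name with its lower-case form.
            s/\Q$file\E/$file/ig;
        }
        print FILE;
    }
    close FILE;
}, '.' );
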
I haven't tested this, so you might want to only run it on a copy of your data.
posted by sbutler at 10:29 AM on October 6, 2005


Oops... that one line is supposed to be "my @lines = <FILE>;"
posted by sbutler at 10:35 AM on October 6, 2005


ermm... and it should be /ig on the end of the regular expression.
posted by sbutler at 10:38 AM on October 6, 2005


and you need a "truncate FILE, 0;" line before the loops. Like I said... preliminary code. :)
posted by sbutler at 10:44 AM on October 6, 2005


nicwolff's regexp is close but missing something:

s/src=(["'][^"']+["'])/src=\L$1/ig

(the last quote and the aforementioned "i")

The wordy translation is: search for the string {src=} followed by a double quote or a single quote followed by as many characters (but at least one) that are NOT single or double quotes, followed by the closing double or single quote.

Replace that with the string {src=} followed by the \Lowercased results of the first paren ($1). Do this search while ignoring case and for all instances on the line.

sbutler has an interesting take on the "replace filenames with their actual name as they are cased" but if all your stuff is lowercased already (or uppercased, as you like it) then this is superoverkill. It also assumes that the filenames you are referencing are in the same directory and that you aren't running a tree that looks like /content pointing to JavaScript in /js, but I jabber.

One of Perl's fly-est tricks is the magical one-liner. It's kind of a pain in the ass to do with this one example because of all the quotes flying about, but you could save that -one- line (or multiple regexps) into a file (say changer.pl) and say

perl -p -i.bak changer.pl *html

This says "run changer.pl through perl, assume we're looping through the input line by line, and -print each line after the script has worked on it. Do this printing -in place (back into the same file) and furthermore, save the original to {filename}.bak. Use everything ending in html as your arguments."

I would agree with sbutler that it's probably a good idea to work on a practice file or two until you get the code down pat.
posted by Ogre Lawless at 11:29 AM on October 6, 2005
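
For anyone curious what -p and -i.bak arrange behind the scenes, here is a rough hand-written equivalent. It is a sketch only: the lowercase_src_in_files name is made up, and it assumes the substitution quoted above is the only thing changer.pl contains. As with the other suggestions, try it on copies first.

```perl
use strict;
use warnings;

# Hypothetical helper doing by hand what `perl -p -i.bak` sets up:
# edit each named file in place and keep the original as FILE.bak.
sub lowercase_src_in_files {
    my @paths = @_;
    return unless @paths;      # with no files, <> would read STDIN instead
    local $^I   = '.bak';      # same effect as the -i.bak switch
    local @ARGV = @paths;      # the files <> will walk through
    while (<>) {               # the loop that -p wraps around the script
        s/src=(["'][^"']+["'])/src=\L$1/ig;
        print;                 # written back into the file being edited
    }
    return;
}

lowercase_src_in_files(@ARGV);
```

Saved as a script and run with the HTML files as arguments, it should leave a .bak copy beside each file it rewrites.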


Huh? Why replace the closing quote with itself? Just leave it alone. And you only need the /i if you've got mixed "src" and "SRC" which would be silly. No point adding /o either: it only matters when the pattern interpolates variables, and this one doesn't.
posted by nicwolff at 12:36 PM on October 6, 2005


If you're only replacing filenames, can you narrow down the things which need replacing to strings ending with .gif, .jpg and .htm?

That way all you need to do is find

"\w+\.(htm|jpg|gif)"

and

'\w+\.(htm|jpg|gif)'
posted by AmbroseChapel at 1:56 PM on October 6, 2005
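
The extension idea can be folded into a single substitution that also reaches filenames inside onClick code. This is a sketch under some assumptions that go beyond the patterns above: the lowercase_quoted_files name is made up, the character class is widened to allow paths with /, . and -, .html is accepted alongside .htm, and an optional backslash before each quote is allowed so escaped quotes inside JavaScript strings still match.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case any quoted string that looks like a
# filename ending in .htm, .html, .jpg or .gif.
sub lowercase_quoted_files {
    my ($line) = @_;
    # $1 = opening quote, optionally escaped (\' or \")
    # $2 = the path itself; \1 demands the same closing quote
    $line =~ s{(\\?["'])([\w./-]+\.(?:html?|jpg|gif))\1}{$1\L$2\E$1}gi;
    return $line;
}

print lowercase_quoted_files(
    q{<a href="#" onClick="openWin('Pages/Help.HTM'); return false;">}), "\n";
print lowercase_quoted_files(
    q{document.write('<img src=\'Pics/Cat.JPG\'>');}), "\n";
```

Anything that is not a quoted name with one of those extensions is left alone, so it is worth running it as a plain search first to check for false positives.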


Alternatively, you could install mod_speling and not worry about editing anything.
posted by Rhomboid at 5:31 PM on October 6, 2005

