Searching for a pearl of Perl wisdom
October 6, 2005 8:45 AM
Can someone help me figure out how to search and replace case-sensitive filenames in HTML using Perl?
I have a large number of HTML files that include both links and JavaScript tags. I am porting these files over to Linux, where case sensitivity becomes an issue. Using Perl, I'd like to search through the files, find all instances of src attributes or onClick events, and replace the referenced filenames with all-lowercase equivalents. The script needs to be aware of the possibility of escape sequences. Additionally, the filename could be surrounded by either single or double quotes.
I'm a newbie to Perl, but I have searched both CPAN and Google for examples. I am aware of the numerous HTML:: modules in Perl, but am not entirely sure how to implement them for this case. My regular expression skills are still rough around the edges. Any pointers to useful code/examples or discussion of an approach will be greatly appreciated.
How large is a large number, and how unusual are these filenames? For sufficiently unique stuff (ThisIsMyBigHonkingFile.jpg) you could get away with a straight search & replace.
You might consider downloading and installing (if you're on Windows) UltraEdit and playing with your regexes in its search and replace function, where you can watch it work and confirm or refuse each alteration. I use it all the time for big collections of files, whether or not they sit in subdirectories. Just be sure to turn off the annoying default preference of "UltraEdit regex syntax."
It would also be useful to run your regex as a plain search in one of those editors first and see how many false positives you get.
jEdit has similar support, I believe, if you're on a Mac.
posted by phearlez at 9:49 AM on October 6, 2005
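The same kind of dry run can be done in Perl before anything is changed. This is only a minimal sketch, assuming a simple src="..." pattern; the helper name mixed_case_src_values is made up for illustration, and the script only prints candidates, it never writes to a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every quoted src value in $text that contains
# an upper-case letter, i.e. what a lowercasing pass would actually change.
sub mixed_case_src_values {
    my ($text) = @_;
    my @hits;
    while ($text =~ /src=(["'])([^"']+)\1/gi) {
        my $value = $2;    # copy first: the next match would clobber $2
        push @hits, $value if $value =~ /[A-Z]/;
    }
    return @hits;
}

# Report candidates as file:line: value for each file named on the command line.
if (@ARGV) {
    while (my $line = <>) {
        printf "%s:%d: %s\n", $ARGV, $., $_ for mixed_case_src_values($line);
    }
    continue {
        close ARGV if eof;    # reset the line counter $. between files
    }
}
```

Saved under a name of your choosing (say dryrun.pl) and run as "perl dryrun.pl *.html", it lists what would be touched so you can eyeball the false positives first.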
Actually, that's not a bad idea phearlez. Here's how you might code it in perl:
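#!/usr/bin/perl -w
use strict;
use File::Find;

# Collect the lower-cased name of everything under the current directory.
my @files;
find( sub { push @files, lc }, '.' );

find( sub {
    return unless /\.html$/;
    # "or" rather than "||": "||" binds to $_ and the die would never fire.
    open FILE, '+<', $_ or die "cannot open file ($!)";
    my @lines = <FILE>;    # this line was eaten when posted; see the follow-up
    seek FILE, 0, 0;       # rewind before rewriting
    truncate FILE, 0;      # per the later follow-up
    foreach (@lines) {
        foreach my $file (@files) {
            s/\Q$file\E/$file/ig;    # /ig per the follow-up
        }
        print FILE;        # write back to the file, not to STDOUT
    }
    close FILE;
}, '.' );

I haven't tested this, so you might want to only run it on a copy of your data.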
posted by sbutler at 10:29 AM on October 6, 2005
Oops... that one line is supposed to be "my @lines = <FILE>;"
posted by sbutler at 10:35 AM on October 6, 2005
ermm... and it should be /ig on the end of the regular expression.
posted by sbutler at 10:38 AM on October 6, 2005
and you need a "truncate FILE, 0;" line before the loops. Like I said... preliminary code. :)
posted by sbutler at 10:44 AM on October 6, 2005
nicwolff's regexp is close but missing something:
s/src=(["'][^"']+["'])/src=\L$1/ig
(the last quote and the aforementioned "i")
The wordy translation is: search for the string {src=} followed by a double quote or a single quote followed by as many characters (but at least one) that are NOT single or double quotes, followed by the closing double or single quote.
Replace that with the string {src=} followed by the \Lowercased results of the first paren ($1). Do this search while ignoring case and for all instances on the line.
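To see that substitution at work on a single line, here is a minimal sketch. The wrapper name lowercase_src and the sample HTML are made up for illustration; the regex is the one quoted above.

```perl
use strict;
use warnings;

# Apply the substitution to one line of HTML.
# \L lower-cases everything captured in $1, the quotes included.
sub lowercase_src {
    my ($line) = @_;
    $line =~ s/src=(["'][^"']+["'])/src=\L$1/ig;
    return $line;
}

print lowercase_src(qq{<IMG SRC="Images/Logo.GIF"> <script src='Menu.JS'></script>\n});
# prints: <IMG src="images/logo.gif"> <script src='menu.js'></script>
```

Note that the attribute name itself comes out as lower-case src, because the replacement text spells it that way literally.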
sbutler has an interesting take on the "replace filenames with their actual name as they are cased" but if all your stuff is lowercased already (or uppercased, as you like it) then this is superoverkill. It also assumes that the filenames you are referencing are in the same directory and that you aren't running a tree that looks like /content pointing to JavaScript in /js, but I jabber.
One of Perl's fly-est tricks is the magical one-liner. It's kind of a pain in the ass to do with this one example because of all the quotes flying about, but you could save that -one- line (or multiple regexps) into a file (say changer.pl) and say:
perl -p -i.bak changer.pl *html
This says "run perl over changer.pl and assume that we're looping through input and we want to -print after each series of operations. Do this printing -in-place (within the same file) and furthermore, save the original to {filename}.bak. Use everything ending in html as your arguments."
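Spelled out as an ordinary script, those switches amount to roughly the following. It is a sketch rather than a drop-in tool: the sub name change_in_place is hypothetical, $^I is the variable Perl sets for -i, and changer.pl itself would contain nothing but the s/// line.

```perl
use strict;
use warnings;

# Roughly what "perl -p -i.bak changer.pl FILES" does, written out by hand.
sub change_in_place {
    my @files = @_;
    return unless @files;                     # with no files, <> would read STDIN
    local ($^I, @ARGV) = ('.bak', @files);    # -i.bak: edit in place, keep backups
    while (<>) {                              # -p: loop over every input line...
        s/src=(["'][^"']+["'])/src=\L$1/ig;   # the body of changer.pl
        print;                                # ...and print it back into the file
    }
    return;
}

change_in_place(@ARGV);
```

Each file named on the command line is rewritten, and the untouched original stays next to it with .bak appended.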
I would agree with sbutler that it's probably a good idea to work on a practice file or two until you get the code down pat.
posted by Ogre Lawless at 11:29 AM on October 6, 2005
Huh? Why replace the closing quote with itself? Just leave it alone. And you only need the /i if you've got mixed "src" and "SRC" which would be silly. Probably ought to precompile it with /o though.
posted by nicwolff at 12:36 PM on October 6, 2005
If you're only replacing filenames, can you narrow down the things which need replacing to strings ending with .gif, .jpg and .htm?
That way all you need to do is find
"\w+\.(htm|jpg|gif)"
and
'\w+\.(htm|jpg|gif)'
posted by AmbroseChapel at 1:56 PM on October 6, 2005
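Those two patterns can be folded into a single substitution by capturing the opening quote and requiring the same one to close. This is a sketch only: the sub name and sample line are invented, and the pattern keeps the limits of the ones above (\w+ will not match names containing "/" or "-", and "htm" will not match a name ending in ".html").

```perl
use strict;
use warnings;

# Lower-case quoted names ending in one of the listed extensions.
# (["']) remembers the opening quote; \1 insists on the same quote at the end.
sub lowercase_by_extension {
    my ($line) = @_;
    $line =~ s/(["'])(\w+\.(?:htm|jpg|gif))\1/$1\L$2\E$1/gi;
    return $line;
}

print lowercase_by_extension(qq{<a href="Page2.HTM" onClick="swap('Over.GIF')">\n});
# prints: <a href="page2.htm" onClick="swap('over.gif')">
```

The /i is what lets the upper-case extensions match; anything quoted that does not end in a listed extension is left alone.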
Alternatively, you could install mod_speling and not worry about editing anything.
posted by Rhomboid at 5:31 PM on October 6, 2005
s/src=(["'][^"']+)/src=\L$1/g
but I don't know how you'd find filenames in the onClick events where they could be embedded in arbitrary Javascript code.
posted by nicwolff at 9:10 AM on October 6, 2005
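One partial answer for the onClick case is to look only inside onClick values and lower-case quoted strings that end in a known file extension. Treat this as a rough sketch under stated assumptions: both sub names and the extension list are invented for illustration, only double-quoted onClick attributes are handled, and filenames that the JavaScript builds by concatenation will be missed.

```perl
use strict;
use warnings;

# Lower-case quoted file names inside a chunk of JavaScript. The optional
# backslash lets \'Name.GIF\' (escaped quotes) match as well as 'Name.GIF'.
# The extension list is an assumption; adjust it to the site.
sub lowercase_js_filenames {
    my ($js) = @_;
    $js =~ s/(\\?["'])([\w.\/-]+\.(?:html?|jpe?g|gif|js))\1/$1\L$2\E$1/gi;
    return $js;
}

# Apply that only inside double-quoted onClick attribute values.
sub lowercase_onclick {
    my ($html) = @_;
    $html =~ s{(\bonclick\s*=\s*")([^"]*)(")}{
        my ($open, $body, $close) = ($1, $2, $3);   # copy before the nested s///
        $open . lowercase_js_filenames($body) . $close;
    }gie;
    return $html;
}

print lowercase_onclick(qq{<a href="#" onClick="window.open('Pages/Help.HTML')">Help</a>\n});
# prints: <a href="#" onClick="window.open('pages/help.html')">Help</a>
```

Because the rewrite is limited to strings that look like filenames, the rest of the handler code keeps its case.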