Is writing a bad C++ program better than writing it in Perl?
March 12, 2008 2:13 PM   Subscribe

Does compiled vs. interpreted code make a difference if the structure of the algorithm is the same (ie, ugly and slow)? Should I take the extra effort of writing a small program in C++ or just do it in Perl...?

I need to write a script that will search for a pattern of characters (let's say 5 letters, followed by a dash, followed by 3 letters -- or something like that) in EVERY file (of certain types) on the computer.

Whether I write it in Perl or in C++, it would look the same, I think.

I would create a list of all the applicable files on the computer and loop through them one by one. Then, for each file, I would get each line and search it for my pattern. O(n^2), right?

This seems to be pretty easy to do in Perl, which I have used a lot, but more of a headache to getting it working in C++. However, the only reason I am leaning towards C++ is because this script has to run on at least 100 something that is faster would be better.

But if the algorithm is fundamentally the same for both, does it make a difference? Should I just do it in Perl? What about things like memory management and the computer resources being all used up making the machine unusable while the program/script is running? (i.e., these other things besides the actual running time of the program?)

Thank you for your insight into this.
posted by omair to Computers & Internet (17 answers total) 3 users marked this as a favorite
This seems like a good candidate for Boyer–Moore string matching (also here).
posted by RichardP at 2:18 PM on March 12, 2008

Running the same algorithm, compiled code will be faster than interpreted code. The interpreted code has the overhead of interpretation to slow it down; the compiled code doesn't. It's just that simple. If the code doesn't have to modify itself (which is vastly easier to do with interpreted code), which I can't see why it would, you will get much better performance from the C++.
posted by cerebus19 at 2:19 PM on March 12, 2008

Perl has excellent string matching and your task sounds entirely I/O bound. Do it perl.
posted by chairface at 2:22 PM on March 12, 2008

For the algorithm you describe I don't think it'll be a big difference, because you'll spend most of your time waiting for the disk anyway.

However: this is exactly what the program grep is for, and it'll probably be faster than whatever you write. (It's had decades of tuning.)
posted by equalpants at 2:23 PM on March 12, 2008

Are you using a Unix (including Mac) system? If so, don't do it in either. There are specialized tools to do this. For example this will search every file on your system for that pattern.

grep -r "[A-za-z]{5}-[A-za-z]{3}" /

Want to limit the filenames? Use find and xargs:

find / -name '*.txt' -print0 | xargs -0 grep "[A-za-z]{5}-[A-za-z]{3}"

O(n^2), right?

The naive search algorithm is O(nm). But efficient search algorithms are as fast as O(n). So this is another reason not to do it yourself—the algorithms in grep or Perl will actually do it more efficiently than the algorithm you would use.

If you aren't using Unix, you'll need to install software anyway, so you might as well consider installing these tools instead of Perl.
posted by grouse at 2:24 PM on March 12, 2008 [3 favorites]

Sure, C++ is faster than perl. But is your perl script too slow? Don't bother rewriting it if it's gonna be a pain until you have demonstrated that you actually need the program to be faster.
posted by demiurge at 2:29 PM on March 12, 2008

There won't be a speed difference, and the likelyhood of bugs will be way lower in the perl version.

The reason for no speed difference is exactly what equalpants said, the amount of time waiting for the disk to spin to the location of the file and stream it back to you is massive compared with a trivial text algorithm. Just use regexes in perl, and be done with it.
posted by cschneid at 2:45 PM on March 12, 2008

Compiled code is only significantly faster if your interpreted language doesn't have carefully optimized libraries to take care of the heavy lifting. There are any number of mistakes you might make in the C++ implementation that could cause it to be miles slower than even a naive Perl version. For example, I can remember a LZW compressor I wrote (for fun) that had an I/O call for each byte read or written. I'd assumed that disk caching would mean that this wouldn't really matter but the performance was terrible.

I agree with grouse though; it seems like you don't need to write a new tool to do this.
posted by tomcooke at 2:47 PM on March 12, 2008

Thank you, all.
posted by omair at 2:54 PM on March 12, 2008

Although everyone has already told you this, premature optimization is bad, and modern computers are insanely fast.

Here's how you should approach a problem like this:

You can write the code in perl, oh say an hour (and that's being generous). Perl does text processing pretty damn zippilly. So, chances are perl will be fine for what you want to do. You can write the code in C++ but it will take maybe 2 days to write, will likely have bugs (or stupid slowdowns) and will be much harder to maintain. If the perl program is too slow you wasted an hour. If the perl program is fast enough you saved 2+ days of stupidity. So write the perl program. Run a test. It's worth spending the hour to see if the simple solution is good enough.

As grouse showed, knowing unix tools is even better because this is exactly what makes unix so awesome. Find + grep + xargs is an amazing set of tools that every power user should learn and know.
posted by aspo at 3:42 PM on March 12, 2008

To answer the general question:

C or C++ can often give you a factor of 10-100 speed up over an interpreted language (and sometimes higher), but it's more important to get your algorithm right. If you're doing something that takes high polynomial time (or greater) and a low polynomial algorithm exists, then your time is much better spent fixing the algorithm.
posted by qxntpqbbbqxl at 3:43 PM on March 12, 2008

The Golden Rules of Optimization:

1. Don't.
2. (Experts only) Not yet.

On most real world programs there will be no real difference in execution speed between compiled and interpreted programs. (Interpreted systems do most of their heavy lifting in code that's carefully tuned and optimized by real experts.)

To write good fast code, think carefully before designing your algorithm, and use the language that you're most comfortable in. (If that's Perl (ugh!), use it.) In your specific case of searching for regexps in lots of files, do what everyone has already suggested: use grep or find+xargs.

I write all my code in a very-high-level-language (Unicon), except for stuff that has to do a lot of bit manipulation, or low-level control of serial/parallel ports etc., in which case I do it in C. (Of course at my day job I use the language the boss-man says to.)
posted by phliar at 4:47 PM on March 12, 2008

I was curious about the actual speed difference so I wrote a C++ program and a perl script to do this and unleashed them each on 10 copies of Ulysses, in which they find a handful of lines matching the expression.

C++ turned out to be 10 times slower. Apparently either I'm doing something stupid with Boost::Regex or it's just not optimized for performance like the perl implementation is. If I take out the matching and leave in all the file-opening and line-reading C++ performs 2-3 times faster than Perl, but the regular expression speed kills it. So, if you did need to optimize your perl script it'd have to be on more of a C level, with a character loop or a finite state machine or something. Or link the perl regex library, I guess.
posted by moift at 4:56 PM on March 12, 2008 [2 favorites]

First, Perl was built for exactly this task, it's very good at text searching.

Second, most of the performance hit of an interpreted language is in the startup times. Once it's running within the interpreter it should be very quick. Compiled languages will be faster at startup, but once running will not be faster by default.

Trivia note - you can compile Perl, but then it's not as platform independent.

And as you've mentioned, you're better at Perl than C, which is another vote for Perl. Much less chance of bugs and having to redo the whole process after discovering weird results.
posted by jpeacock at 5:46 PM on March 12, 2008

I agree with the consensus here. Perl is pretty much designed (and heavily optimized) for exactly this sort of task, you know it well, there are oodles of online help resources, and your task is I/O bound anyways; use Perl. For many tasks Perl will be as fast or faster than naive C implementations anyways.

As a side note, a common idiom in higher-level scripting languages these days is to write out your script in the high-level language, and if it runs too slowly to profile it, locate tight inner loops, and rewrite just those portions in C. This has varying degrees of practicality depending on the C API of your scripting language, architectures you need to support, etc. Perl has XS, which always gives me a headache when I try to read about it, but is extremely common on CPAN so it must be usable.

For your task, I doubt you'd be able to do better in a Perl XS than Perl's current incredibly good string-processing routines (which are already implemented in C anyways), but you might want to keep a hybrid approach in mind for future endeavors.

(As another side note, zomg, somebody's actively working on Icon!)
posted by whir at 6:18 PM on March 12, 2008

It is my understanding that Perl is a compiled language. It's just that it gets compiled each time you run a program written in it. As long as your program has a relatively long running time to amortize the startup cost (and it sounds like this is the case), you should stick with Perl. Perl is compiled to a non-native bytecode form, so generally speaking you may not get the same performance as with C++.
posted by speedo at 10:42 AM on March 13, 2008

Perl was once an interpreted language, but now it's JITed. If you want you can find a Perl compiler. Even though C is a "compiled language", it is fairly easy to find a C interpreter. Python is interpreted, or it's compiled, or it's transformed to Java byte code and then JITed by a Java virtual machine.

The point I am trying to make is that evaluation strategies are a property of the implementation, not of the language. Can we please stop using the phrase "X is a compiled language"?

C and C++ are almost always the wrong choice. C and C++ are tools, and there are very few tasks nowadays that fit this tool.

There are 14 programming languages that are within 1.5 times the raw performance of C at the great programming language shootout: D, Pascal, Oberon, OCaml, Haskell, Clean, SML, Eiffel, BASIC, Lisp, Scala, Java, Nice, C#. Coding in any of these languages is much easier than C or C++, simply since nothing is harder than them except assembly. For most problem domains, when coding in these languages you easily compensates for the slowdown by stepping up to an algorithms that you would not dare code in C. C and C++ make it so difficult to code fancy algorithms, you're likely to settle on something simpler, destroying any edge the optimizing compiler might have given you.
posted by gmarceau at 1:03 PM on March 13, 2008

« Older How is cotton waxed? What's t...   |  I'm looking for interesting or... Newer »
This thread is closed to new comments.