

Dance, text files, dance
April 12, 2007 11:34 PM

Which programming language can I learn relatively quickly for simple interpretation and manipulation of text files on my Windows box?

For now I want to write something that lets me batch search and replace, batch copy and paste, and find out which files contain certain words/strings.

The ideal language is cross-platform/non-proprietary. And feel free to suggest good starter resources (online when possible, but books are good too) with your suggestion, or anything else that seems relevant.

Petty arguments over the merits of x language vs y language also welcome, as long as they're relevant.
posted by poweredbybeard to Computers & Internet (28 answers total) 10 users marked this as a favorite
 
perl
posted by pompomtom at 11:46 PM on April 12, 2007


ruby
posted by bkudria at 11:48 PM on April 12, 2007


Perl is good at some things, and manipulation of text files is most definitely one of those things.
posted by cmonkey at 11:49 PM on April 12, 2007


Learn fast? I fancy Python is the best fit. If you've done any programming before, you should find it fairly easy.

An alternative would be to grab the demo version of RealBasic, which will let you get a simple app up very quickly. You may find its syntax and approach more to your liking.
posted by outlier at 11:51 PM on April 12, 2007


I second Python.
posted by dcbarker at 11:57 PM on April 12, 2007


Reasoning/justification also allowed, as opposed to randomly-shouted proper names
posted by poweredbybeard at 12:00 AM on April 13, 2007 [1 favorite]


Well, it's hard to argue with the power of the simple bash shell, combined with sed, awk, grep, and vi. These tools are complex and hard to learn, but are the backbone of Unix's ability to process text files so well.

The way I usually get these is with Cygwin, but you can also use Microsoft's Services for Unix. I haven't messed with SFU yet. I've downloaded it but I've never installed it. I believe it's faster than Cygwin, as it's a new kernel 'personality'. For whatever reason, Cygwin can be a little slow in heavy file I/O situations. For normal text processing, it's just fine, but it gets a little sluggish if you're dealing with 100MB+ files. The SFU tools are, I believe, BSD-based, so they'll be less advanced, but will probably run faster.

Both are free. Cygwin doesn't hook itself into the system as deep, so it's easy to install/uninstall. I have some vague idea that SFU can't be removed once installed. So I'd probably try Cygwin first, and if it's not fast enough, then look into SFU.

You can also, with either tool, explore perl, which has been called 'the duct tape of the Internet'. It's highly optimized around text processing, and has many, many powerful features in that regard.

Sed, awk, and grep are specialized tools for particular applications, where perl is a general-purpose language that can duplicate any of their features. Because it's not specialized, it'll generally take longer to, say, do a search function in perl than in grep, but you can combine all the features at once into a single program, and then add a bunch more functionality on top. If you want APPLICATIONS, perl is probably better... if you want UTILITIES, batch scripts around the command-line tools will often be faster.

Another option is Python, which is less oriented around just text manipulation. It's more of a general-purpose language. You can still get to regular expressions and the like, but they're add-ons, rather than being built into the core language itself. (regular expressions, for instance, are done through the 're' library.)
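To make the 're' point concrete, here is a minimal sketch of a regex search and replace in Python. The function name replace_word and the sample text are made up for illustration; only the re module calls come from the standard library.

```python
import re

def replace_word(text, old, new):
    """Replace whole-word occurrences of `old` with `new` in `text`."""
    # re.escape makes `old` match literally; \b limits matches to whole words.
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A function replacement inserts `new` as-is, without treating
    # backslashes in it as regex escapes.
    return pattern.sub(lambda match: new, text)

print(replace_word("foo food foo.", "foo", "bar"))  # prints: bar food bar.
```

So the regex support isn't in the core syntax the way it is in perl, but it's one import away.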

Fundamentally, Unix-type stuff absolutely dominates this area. I've never run into anything Windows-native that even comes close to the power of these tools.
posted by Malor at 12:01 AM on April 13, 2007


Oh and as a possible bonus, what can be used most easily in a web environment as well? Priority is local text file management though. Anyway, thanks so far. Carry on.
posted by poweredbybeard at 12:02 AM on April 13, 2007


Perl is really good for work like this, if all you need to do is write a few quick-'n-dirty scripts to hack around with text files. You can get a free Perl distribution for Windows here.

While you're at it you might want to take a look at this regular expressions tutorial if you're not familiar with them already -- regardless of the language you're using, they're the backbone of search and replace.
posted by thisjax at 12:03 AM on April 13, 2007


For text manipulation, most of the work you'll be doing will be with regular expressions. Many languages have their own incompatible regular expression language, but one of the most popular is the Perl-Compatible Regular Expression language, or PCRE. Perl (obviously), PHP, Javascript, Ruby, and tons of other languages use PCRE-style syntax.

The language wrapped around the PCRE is pretty irrelevant. In this case, I would choose anything other than Perl - its behaviour can be somewhat mystifying at times.

But learn about regular expressions.
posted by meowzilla at 12:04 AM on April 13, 2007


Python!
posted by magikker at 12:06 AM on April 13, 2007


You can use any of these tools in Unixy-type web work (i.e., with Apache). Perl or Python are probably best, overall, as there has been a lot of integration work done there (mod_python and mod_perl). There's a good chance that most sites you visit are running at least a little perl in them somewhere.

Beware PHP in web work. It's a simple language that works something like ASP, but it's hard to use it securely, and the language authors focus very little on that area. You can put together a web app in just a few hours with PHP, but learning to write one SECURELY will take weeks or months. I'd just suggest avoiding it unless you can devote a lot of time to really understanding it.

meowzilla is right about regular expressions being really critical to text processing. Perl syntax is as good as any other; my first real work with regular expressions was in that language. And, heh, I never knew what PCRE meant until just now.

Personally, I wouldn't say that Perl is all that mystifying if you're writing the code yourself. Once you've chosen a style you like, it's pretty predictable. The problem is that there's eighty zillion different variants of syntax that can be used, so understanding what someone ELSE has done can be difficult.

I haven't used perl for a long time, but a simple example (the variable name is just a placeholder):

if ($something_is_true) { die }
die if $something_is_true;


These statements are equivalent. Many perl expressions can have weird things tacked on the end or in the middle, sort of like afterthoughts. One of the design goals is to make it something like natural languages, which are maddeningly complex, and perl tends to be similar. The core motto of perl is "There's more than one way to do it". That means reading someone else's way -- which is perfectly valid, mind -- can be very hard if you've settled on a different style. I've seen perl code that might as well have been written in Martian. :)

Python is much more structured and easy to read, but I don't think it does PCRE.

Ruby has the reputation of being stunningly elegant, but also the rep of being extremely slow. None of these languages are speed demons, being interpreted, but Ruby is apparently much slower than the other two.
posted by Malor at 12:22 AM on April 13, 2007


If you're going to use perl for batch search and replace and the like, you'll save yourself some time and typing by learning the commandline options (run "perldoc perlrun" to see them):

$ perl -pi.save -e 's/foo/bar/g' *

This will switch "foo" to "bar" in all files in the current directory, saving the original with the ".save" extension. sed has similar options, too.
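For comparison, a rough Python sketch of the same in-place edit, using the standard fileinput module. The function name replace_in_files and its parameters are my own invention for illustration, not anything from the thread; treat it as a starting point rather than a drop-in replacement for the one-liner.

```python
import fileinput
import re

def replace_in_files(paths, pattern, replacement, backup_ext=".save"):
    """Rewrite each file in place, keeping the original as <name><backup_ext>."""
    paths = list(paths)
    if not paths:
        # fileinput falls back to reading stdin when given no files,
        # so bail out early instead.
        return
    regex = re.compile(pattern)
    # With inplace=True, whatever is printed inside the loop is written
    # back into the file currently being read.
    with fileinput.input(files=paths, inplace=True, backup=backup_ext) as lines:
        for line in lines:
            print(regex.sub(replacement, line), end="")
```

Calling replace_in_files(["notes.txt"], r"foo", "bar") would leave the edited text in notes.txt and the untouched original in notes.txt.save (the file name here is hypothetical).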

As far as telling you which files contain certain strings, the standard unix utility "grep" is designed to do that. To print just the names of files containing your string:

$ grep -l "whatever" *

I imagine grep is included with any suite of unix tools you'd find for Windows, but I haven't used any myself. If not, here's a perl oneliner that does the same thing:

$ perl -nle 'if (/whatever/) { print "$ARGV"; close ARGV }' *
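And if you'd rather not depend on grep or perl being installed, the same "which files contain this string" check can be sketched in a few lines of Python. The name files_containing is made up for the example, and this does a plain substring match rather than a regex match.

```python
import os

def files_containing(paths, needle):
    """Return the paths whose text contains `needle`, in the order given."""
    matches = []
    for path in paths:
        if not os.path.isfile(path):
            continue  # skip directories and anything else that isn't a file
        # errors="ignore" drops undecodable bytes so a stray binary file
        # doesn't abort the scan.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            # any() stops reading at the first matching line, much like grep -l.
            if any(needle in line for line in handle):
                matches.append(path)
    return matches
```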

You're either a perl person or you aren't, it seems. I am, so I recommend checking out Learning Perl (with the llama on the cover) to get a feel for the language. Another perl mantra: Make easy things easy and hard things possible.
posted by hutta at 1:18 AM on April 13, 2007 [2 favorites]


Get Regular Expressions under your belt and then you can implement your project in any language you like. Perl is the natural language for regexes, but the deliberate obfuscation and in-jokes can prove annoying.
posted by oh pollo! at 1:34 AM on April 13, 2007


"Ruby is apparently much slower than the other two."

Depends how you define "much"; probably not enough to matter in most cases -- if it does, you probably need something faster than any of them (or a better algorithm).

Ruby's pretty nice for text manipulation; the regexp support's very good, with Perlish and more OO styles depending on taste and needs.

Search wise, we've got ferret, which is a port of the Lucene search engine to Ruby/C.
posted by Freaky at 2:05 AM on April 13, 2007


Perl's great if you know it, good if you understand it (and can look up what you need), & frustrating as hell if you're starting out. Do yourself a favor and use sed/awk/etc. They're predictable, at least. If those aren't adequate, use python. Perl is great to learn, because honestly I haven't found anything more fun to write short scripts in, but there's a certain level of understanding needed before it's truly useful.
posted by devilsbrigade at 2:08 AM on April 13, 2007


Python.

Depending on what you're doing, using the plain string methods might be an easier start, but if you need regexes then the Python REGEX HOW-TO is a great resource and Kodos is a great tool to help (it lets you fiddle around on screen with the regex until you've got it right, and will actually write the Python for you).

I'm just guessing but I suspect the Python module glob is going to be useful to you.
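As a rough idea of what glob buys you, here is a small sketch that collects every file with a given extension under a folder, subfolders included. The function name find_text_files and the ".txt" default are assumptions for the example, and the recursive "**" pattern needs a reasonably current Python 3.

```python
import glob
import os

def find_text_files(root, extension=".txt"):
    """Return every file under `root` (recursively) whose name ends in `extension`."""
    # "**" with recursive=True matches zero or more nested directories.
    pattern = os.path.join(root, "**", "*" + extension)
    return sorted(glob.glob(pattern, recursive=True))
```

The list it returns is what you would then loop over to do the actual searching or replacing.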

Dive Into Python is a great general resource on Python Programming and the Python Tutorial also contains lots of good basic stuff.
posted by southof40 at 2:28 AM on April 13, 2007


Perl used to be good for this, but since Ruby is really an easier, better structured Perl, then Ruby.
posted by wackybrit at 2:47 AM on April 13, 2007


I've learned perl and almost never use it. Perl produces some of the least readable code you will ever see. Things I wrote myself baffle me when I go back to them a few months later. Things I didn't write might as well be Mandarin.

If you're not a programmer I strongly suggest using a newer language and a nice IDE with easy to use visual debugging and clear error messages. The hardcore command line Linux geeks will scoff at this but it makes things much clearer if you can see what is going on.

Eclipse is a fantastic IDE platform and has simple plugin support for Ruby and Python and others. It has a bit of a learning curve (the built in file handling is less than intuitive) but the payoff is that you get familiar with a cross platform, cross language IDE while enjoying the convenience of visual debugging.


Sed, awk and grep will do what you need, but they each have similar but slightly different syntax that will make your life hell while learning them. Single quotes versus double quotes, wildcard substitution, etc. will all be problematic as they differ by shell and platform. Not to mention that the tutorials will sometimes be specific to a shell you may or may not be using. I've had lots of problems with Cygwin on Windows XP and pretty much abandoned it, but I do sometimes still use sed.


BTW Java would be just about the worst option for file operations. Way more complicated than it needs to be.
posted by srboisvert at 2:53 AM on April 13, 2007


I'd leave cygwin alone--unnecessary extra layer. You could find precompiled versions of grep, sed, awk for Windows/DOS, and just use those. Grep finds a pattern in a text file, sed will do search/replace. Awk adds in the ability to do a little math on the actions and print out reports.

O'Reilly's "sed & awk" book starts with a section on regular expressions in general.

Or save yourself the search time and just use perl.
posted by gimonca at 4:37 AM on April 13, 2007


Perl has rocked at this task for a long time, works well with text filters in UNIX command line pipelines, and there's a huge body of reference for how to quickly write all sorts of text-munging.

Ruby actually has the same kinds of command-line switches and can do pretty much the same things, but I haven't seen the Ruby community call much attention to this yet.

Python makes no attempt to compete in the text filter in UNIX pipeline space -- it's not suitable for it. If you want that option, you don't want Python.

For short text-manipulation scripts not in pipelines, I think you'll find that all 3 are actually more similar than not. (That would be a stupid thing to say about them over a larger problem domain. I am not saying that stupid thing; I'm saying this only for this very limited domain.) The advantages and disadvantages that can be trotted out for the languages are not likely to outweigh your personal taste in determining how much you enjoy or are annoyed by the language for these problems.

Given my druthers, I'd use Ruby.
posted by Zed_Lopez at 5:20 AM on April 13, 2007


If you're just doing something ad hoc I would consider using Boxer, a shareware text editing program which has perl regular expressions and a macro language as well as a pretty good "find text in file" function which also accepts perl regexes. You might not need to program at all.
posted by shothotbot at 5:45 AM on April 13, 2007


I'll just helpfully point out that GNU awk, grep and sed all handle at least extended regex and probably PCRE. Honestly I rarely ever need much beyond the extended syntax, and I'm finding that I'm using regexes less and less.

Anyway, my opinions:

When I used Windows XP, Cygwin was a mandatory install. Now I might kick the tires on the new Microsoft shell before installing it.

The grep family is my workhorse for searching through text files, and I use it hourly when I'm doing html development and weekly when I'm doing other forms of development. It's trivial to write a grep using perl, but the greps are ancient, and have been highly optimized for cpu and memory performance.

I used to use perl -pi -e 's/foo/bar/g' for searching, but then I discovered that gnu sed has an "in-place" operator. So now I usually use gnused -i.bak -e 's/foo/bar/g'. A long time ago, I benchmarked gnused at around 3 times faster than perl for extracting ip numbers from a huge log file.

(Perl may also be highly optimized, but grep and sed have the advantage of being one-trick ponies that don't have to care about variables and namespaces.)

The next level is something like bash as a "glue" language to join processes together and to automate actions. I seem to do a lot of work at the terminal, even with more tasty OS X methods of doing things. Although, I find myself using Automator more and more. This is why I think the new Microsoft shell might be worth kicking the tires on if it has hooks into some application interfaces.

For anything more complex than can be done with a single loop, I tend to go for lisp or python. A typical case involves reading data from file A, relating that data to file B, and doing something with the results. I learned lisp because I was doing too much boilerplate code for this process, and its macro features cut that down quite a bit. I'm starting a big project with python, primarily because there were better libraries in that direction.

I have an irrational dislike of perl. It's not a bad language, I just can't stand the mashup of sed/awk/c in a single language. I don't know Ruby, and have yet to see the ideal application case that justifies learning Ruby. I kick the tires on Haskell once a year, but again, have not found that ideal case for spending more time on it.

There are two other kinds of tool that I think are really necessary for batch text processing, though I don't have strong feelings about a particular program.

First, you absolutely, positively, gotta have a revision control system. Well, ok. You could make a manual backup before your work session. But trust me, when you do batch text processing using these tools, you will make mistakes and there is no undo. With a revision control system like Subversion or CVS, you can do all your work in a nice safe copy. svn revert has saved my ass more than once.

Secondly, you want a good diff that shows you the differences between text files. Run your batch edit tool with an option to save a backup file, then diff the results to look for unanticipated side-effects. Command-line diff is ok, but there are two-pane viewers that can be better.
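If you don't have a diff tool handy, Python's standard difflib module can produce the same kind of report. A minimal sketch, assuming you kept a backup copy next to the edited file; the function name show_changes is made up for the example.

```python
import difflib

def show_changes(original_path, edited_path):
    """Return a unified diff (as one string) between two text files."""
    with open(original_path, encoding="utf-8") as handle:
        before = handle.readlines()
    with open(edited_path, encoding="utf-8") as handle:
        after = handle.readlines()
    diff = difflib.unified_diff(
        before, after, fromfile=original_path, tofile=edited_path
    )
    # Identical files produce no diff lines, so the result is an empty string.
    return "".join(diff)
```

Printing show_changes("page.html.bak", "page.html") after a batch edit (hypothetical file names) shows removed lines prefixed with "-" and added lines with "+".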
posted by KirkJobSluder at 6:23 AM on April 13, 2007


I used to use perl -pi -e 's/foo/bar/g' for searching,

Whoops, that should be editing; the s/foo/bar/g operator replaces all cases of "foo" with "bar."

I'd say your best option is to download all of the above and pick the tools that are the most fun for you to use. Meanwhile, in terms of resources, I suggest finding a copy of Friedl's Mastering Regular Expressions. It's worth reading at least once for understanding how to optimize your expressions for match accuracy and performance.
posted by KirkJobSluder at 6:43 AM on April 13, 2007


I'd go with Python or Ruby if I were you and coming at this with a blank slate. I too have that dislike for reading perl that was mentioned above. Python and Ruby read far more clearly. The one upside for you with Perl is that people have been doing file manipulation of all sorts with Perl for years and years and years so there are lots of cut and paste examples out there for you to reuse.

Let me also offer another tactic that might be helpful for you. You might take a look at the text editor jEdit. It has an extremely powerful Find/Replace function that can search through directory trees and use regular expressions and all sorts of stuff. For some problems you might be able to get your work done a lot faster than by writing a custom program.
posted by mmascolino at 7:03 AM on April 13, 2007


nthing those that say the language is somewhat irrelevant as long as you get to grips with regular expressions, which is where the heavy lifting will be done. If you're the learn-from-a-book-type, I found O'Reilly's Mastering Regular Expressions as good a place to start as any.

If you know anything of any modern programming language, just start there. They all support regexps of some kind and there are a lot more similarities than differences between implementations.

If you've little knowledge of programming, then pick any of the modern well-supported interpreted scripting languages. Python and Ruby are the obvious candidates. If you're a complete beginner, I'd avoid Perl, for reasons already expressed above. Personally, I'm a Python fan. If you're learning programming it keeps you in good habits and is quick to get results. How to Think Like a Computer Scientist is a nice little book that's surprisingly accessible and will get you up to speed with the Python language pretty quickly. Python.org has pointers to more resources than most folks will ever need.

As for Python vs. Ruby, you'll get split opinions because both are great languages for what they do and people who learn one tend not to learn the other because both will do what they need.
posted by normy at 7:09 AM on April 13, 2007


A couple of basic questions about learning a language:

How often will you be programming in it? Every day? Once every 3 months when you need to tweak something? Something in between?

Will you always be the only maintainer of your output?

If your answers are every day and only you, then Perl is probably the answer -- a useful tool in your programmer's armory. Otherwise make it easy on yourself and go with Ruby, with your code carefully written to be easy to understand later -- very important if you ever might want to tweak anything.

If this is really just learning enough to get by on the one task, and you have some sort of programming experience, then the BASIC suggestion from outlier is probably tops -- you won't get any street cred, but it is probably the least work if you have the right reference book in your hand (not one from a different flavour of BASIC).

And if it is only this one task, once only, then go with the text editor solution. Programming is not an easy thing to learn.
posted by Idcoytco at 10:09 AM on April 13, 2007


Thanks for the answers.

I narrowed it down to Perl vs Python. I decided to go with Perl, because of what many of you said regarding its strength with text files, and because I just sorta think whitespace in code should mean what I want it to mean.

And after just dipping a toe in, I'm already finding myself wondering why I ever did anything with PHP.
posted by poweredbybeard at 8:53 PM on April 13, 2007

