Help Me Learn AWK
March 11, 2013 7:42 PM

I'm good at Grep (aka pattern matching, aka regular expressions). Can you help me figure out the best way to learn AWK?

I'm a grep-a-holic. I've been using it (via BBEdit) to manipulate text (not code) for years. And I've been told that AWK would be a fun way of extending my grep skills. Luckily, it's built into the UNIX layer of my Mac.

I have no other programming experience, aside from some HTML and applescript, plus an overall comfort with command line.

Can you recommend a reference for someone in my position? Either something authoritative and accessible, or else interesting and catchy. Or, even better, are there any resources out there specifically for people coming to AWK via grep? Probably not, but it'd be great.

For extra points, what do I learn after mastering AWK?
posted by Quisp Lover to Computers & Internet (27 answers total) 14 users marked this as a favorite
 
Response by poster: PS - by "interesting and catchy", I mean a resource intended for liberal arts majors. If there's nothing like that, I'll settle for a big dry resource if it's accessible and authoritative.
posted by Quisp Lover at 7:43 PM on March 11, 2013


I always used the O'Reilly book on Sed and Awk. Link is here.
posted by jadepearl at 7:44 PM on March 11, 2013 [1 favorite]


As far as what to learn next: perl may or may not stand for "Practical Extraction and Report Language" but I've always liked it the best for pure text processing tasks.
posted by XMLicious at 7:53 PM on March 11, 2013 [1 favorite]


The O'Reilly books on regexps (example) are also worth a look.

Perl is the obvious next step: one of the reasons for its creation and adoption was to address the disparities between the various incarnations of sed, awk and grep on different *nix systems. (And Larry Wall's limitations with awk's arcane syntax.)

Broadly speaking, I think of awk/grep/sed primarily as sysadmin tools for ad hoc stuff; perl is way better for programmatic pattern munging.
posted by holgate at 8:03 PM on March 11, 2013


Response by poster: Thanks. Re: perl, I'm guessing that between Grep and Awk, I might have all the text processing power I'd need. Would probably prefer to program web-ish stuff if I got into "real" programming. So would PHP make more sense? Or is that a tough jump from AWK?
posted by Quisp Lover at 8:03 PM on March 11, 2013


I'm with XMLicious. I wrote a number of reports using awk ... in the 80s ... around the time Perl was being developed. Learning just a tiny subset of Perl renders that skill superfluous while offering a lot more room to grow in. In the past 20 years, I don't think I've written anything with awk more complex than "awk -F: '{print $2}'". I use sed mostly out of habit--again a tiny subset of Perl obviates it. The Perl you want instead of sed or awk is stuff like this: "perl -ne 's/bad/good/g; print;'". Perl even has BEGIN and END blocks like awk.
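
For anyone new to these tools, here's a rough sketch of the awk side of that, run on made-up sample data. The sample lines and variable names are purely illustrative assumptions, and the sed line is just the usual sed spelling of the same substitution as the Perl one-liner:

```shell
# Made-up sample input: colon-separated fields, one record per line.
sample='alice:x:501
bob:x:502'

# Split each line on ":" and print the second field.
second_fields=$(printf '%s\n' "$sample" | awk -F: '{ print $2 }')

# BEGIN runs before any input is read; END runs after the last line.
# Here they print a header and a running count of the lines seen.
report=$(printf '%s\n' "$sample" |
  awk -F: 'BEGIN { print "users:" } { print $1; n++ } END { print n " total" }')

# The sed form of the substitution: replace every "bad" with "good".
fixed=$(printf 'bad dog, bad cat\n' | sed 's/bad/good/g')

printf '%s\n' "$second_fields" "$report" "$fixed"
```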

Any language can use regular expressions for text processing, though, and the extra keywords you might have to type in other languages are not a big deal. But PHP might not be the best language for writing command line tools. You'll find very little help on that topic for PHP, and you can wind up having to do weird stuff like "exec('stty cbreak')" for every script, because many pre-packaged PHPs aren't compiled properly for writing Unix command line filters.

Python and Perl are both good choices.
posted by Monsieur Caution at 8:15 PM on March 11, 2013 [3 favorites]


You're heading into language wars territory there, though I think Python is probably a better place to look if you want a coherent, versatile language where you can use non-web experience in web environments with relatively straightforward deployment.
posted by holgate at 8:20 PM on March 11, 2013


Nobody knows how to use sed or awk. That's why Larry Wall invented perl. Learning awk is like learning to use ed, which nobody can use either, which is why there's vi (which, really, hardly anyone can use itself).
posted by tylerkaraszewski at 8:33 PM on March 11, 2013 [5 favorites]


M. Caution above has it. I very much advocate going directly to a useful text-processing subset of Perl, Python, Ruby, etc. Perl in particular offers a friendlier, more robust superset of the features of sed and awk, along with the single most featureful dialect of regular expressions. I don't write much Perl these days, but I use it routinely for little filters and text transformations within my editor. It certainly is no more effort to learn that subset of the language than awk would be.

tomc.txt is good for conveying the flavor of what this particular usage would be like.

I write PHP for a living, and would advise against going down that particular path. If you're not careful, you can wind up writing PHP for a living.
posted by brennen at 8:42 PM on March 11, 2013 [2 favorites]


Learning just a tiny subset of Perl renders [awk] superfluous while offering a lot more room to grow in.

Yep. More or less true.

Depending on your liberal art, Perl is also the beneficiary of a lot of time put into writing various libraries for doing analysis or silly things.

Scripting-siblings Python and Ruby are arguably "nicer" or "cleaner" languages by one standard or another, and they also have liberal arts practitioners who've made some contributions. PHP doesn't shine on that front, but because of its preconfigured ubiquity on cheap web hosting, it's probably the avenue of least resistance when it comes to doing dynamic web stuff.

Still, Perl was once the go-to language for web scripting, even for people who hadn't done much coding at the time, and it remains a serviceable option, and has kept a small community of deeply thoughtful developers, cross-discipline workers, and borderline-to-full-on polymaths. Even though I don't use it as much as I used to, I check in to see what they're up to every so often. Sometimes it's interesting and catchy.

If you want to learn something that's a little more theoretical, challenging, and a bit off the beaten path for text processing (but arguably interesting and catchy), poke into Prolog and Definite Clause Grammars.
posted by weston at 8:55 PM on March 11, 2013


Chiming in to say perl is a happy medium between sh/awk and python/ruby. It's available and stable on just about every unix system, so you can get meaningful things done without worrying about dependencies.

Recently I wanted to eliminate duplicate lines in a file, keeping only the last occurrence of a line, and the awk solution started to approach three lines of code when I realized I could just do
tail -r | perl -ne 'print unless $seen{$_}++' | tail -r
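
For comparison, here is a sketch of one way a pure awk version might look. It is not the three-line attempt mentioned above (that isn't shown), just one possible approach, and the sample input is made up. It holds the whole file in memory. It may also be handy where tail -r isn't available, since -r is a BSD/macOS flag that GNU tail lacks:

```shell
# Made-up sample: "a" and "b" each appear twice.
input='a
b
a
c
b'

# Remember every line, plus the last line number at which each text was seen,
# then print only the lines that sit at their own last position.
deduped=$(printf '%s\n' "$input" | awk '
  { line[NR] = $0; last[$0] = NR }
  END { for (i = 1; i <= NR; i++) if (last[line[i]] == i) print line[i] }
')

printf '%s\n' "$deduped"
```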
posted by Phssthpok at 9:10 PM on March 11, 2013


Best answer: I spent the better part of the nineties doing things in sed and awk, using the O'Reilly loris book linked above. I wouldn't say I really know either: I'd just go through the book and pick out what I needed at the time. I was mostly using them as commands in shell scripts to process instrument files. They were means to an end rather than things worth learning in and of themselves. They were mostly pains in the ass, but worth it for little things not worth firing up the compiler for.

However, when python got good enough, I dropped s/awk like a hot potato. By contrast, I *knew* python in a few weeks.
posted by bonehead at 9:19 PM on March 11, 2013


On the other hand, at least you're not asking about TCL.
posted by bonehead at 9:22 PM on March 11, 2013 [3 favorites]


Even though I don't use it as much as I used to, I check in to see what they're up to every so often. Sometimes it's interesting and catchy.

On that point (and others here) John Siracusa's discussion of high-level languages (and their discontents) is worth a listen.
posted by holgate at 10:25 PM on March 11, 2013 [2 favorites]


Best answer: I think a lot of people have a major misunderstanding about AWK. It's honestly one of my favorite languages because anything you try to write in it is either easy or impossible, and if it's impossible that's a sign to Find Another Language. O'Reilly's Sed & Awk is a fine resource.

If you are writing command-line utilities, regardless of language, you should use docopt.

If you want to turn your simple command line program into a website, you should use something modeled after Sinatra (the ruby project) for the server, while Bootstrap makes a fine HTML base. They both allow you to have something useful with minimal fuss. (You can just google "[some language] sinatra" and something useful should come up.)

If you're working with Natural Language (non-code text), you probably want to use Python. No other language has anything as good as NLTK - it's not perfect, but it has most functions you could want baked-in, from simple part-of-speech tagging to full sentence parsing, real lemmatizing (can figure out "were" is a form of "be"), and even Wordnet support.

I wrote a small project that highlights words with the year they were coined using Python, Bottle, and NLTK recently, and it came to something between twenty and thirty lines of Python. Though I'm a software engineer I was technically an English major, and fooling around with text like this is still one of my favorite things to do.

Monsieur Caution: You do know about cut, right?
posted by 23 at 1:11 AM on March 12, 2013 [2 favorites]


Best answer: Go to the source, The AWK programming language by A, W and K.

Other than historical charm I don't think AWK will teach you anything new, and despite being in the CS community in Cambridge (UK) in the 70s and 80s I never knew anyone who had written an actual program in AWK. One liners, maybe, but no programs. EDIT: on reading your question again, don't bother. If you have little programming experience then AWK is not the place to start.
posted by epo at 2:59 AM on March 12, 2013


Phssthpok: "Recently I wanted to eliminate duplicate lines in a file, keeping only the last occurrence of a line,"

sort -u filename
posted by namewithoutwords at 5:09 AM on March 12, 2013


Best answer: Yes to Python.

Perl and Python are the only two languages I know relatively fluently these days, and of the two Python was both easier to learn and MUCH easier to maintain after-the-fact. Perl is great if you Know What You're Doing and are anal enough to focus on future readability, but with Python I can go back a year later and stuff I wrote sloppily is easy to understand. I still like Perl, but more for cultural rather than practical reasons. :)

Both languages are perfectly suitable for web programming and are insanely extensible and the communities are large enough that most small solutions are just a matter of futzing with someone else's code. The extensibility is great if you're using it to just get quick-and-dirty jobs done. As a bridge between different applications (ie, getting output from program X into program Y) the ability to just plug in some magic that allows you to read ArcGIS shapefiles or CAD files or Word documents or JPEGs or whatever is critical.

I'm not a programmer, but if you know a good scripting language you'll find ways to use it. Any time you're dealing with data or text in large quantities, being able to dip down a layer from Access or Excel or Word and manipulate things programmatically is a superpower.
posted by pjaust at 6:37 AM on March 12, 2013


Response by poster: Well, that was tangly...but interesting. Thanks everyone.

That final posting by pjaust made me take another view. My AppleScript skills are pretty rudimentary. And I only know enough about shell scripts to steal them from others and embed them in AppleScripts.

So since I'm a Mac devotee, I'm thinking maybe I should extend both of those. And THEN learn Python.
posted by Quisp Lover at 7:10 AM on March 12, 2013


I believe that AWK scripts will run under Perl. I have no training in AWK, yet one quarter of my thesis is AWK scripts. I have a hard time starting one from scratch (but it's easy to modify an already existing one) and can't do any one liners...but here's some stuff I did do:
Wrote a script to make a series (sometimes 1000s) of input files for a program (Gaussian) where a parameter was systematically varied, the filename indicated the value of the parameter. Another script made the batch file to run the input file. And another extracted the values we wanted from the extremely long output files and made an output file we could plot.
Wrote a script to take the published coordinates of the atoms in a virus protein coat for one protein, and then use the symmetry of the virus capsid to calculate the position of all the atoms in the capsid. And then calculate the nearest neighbor distance between the amino acids in the capsid (for the purpose of choosing what amino acid to mutate for labeling).
I've even done real math (the convolution integral) in AWK. That said...learn Perl. I finally did when I wanted to grab text off the serial port.
posted by 445supermag at 7:33 AM on March 12, 2013


Best answer: Do you know sed?

The difference between grep/sed/awk and perl/python is that you typically use grep/sed/awk for one liners from the command line, whereas perl/python are used to run whole programs you write separately in an editor. It's a bigger commitment to sit down and write a program; I often jump through serious hoops of grep | awk | grep | sort | uniq -c | sed to avoid busting out a program editor. (My characterization isn't strictly true; awk is a full programming language although you'd be insane to write big programs in it in 2013. And perl -pi -e is a great way to do one liners on the command line in perl. But in general, my characterization is truthy.)
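
For someone arriving from grep, a small sketch may make the jump clearer: an awk program is a list of pattern { action } pairs, and a bare /regex/ pattern with no action simply prints the matching lines, much as grep does. The sample data and variable names below are invented for illustration:

```shell
# Made-up sample: status, device, value.
lines='ok disk 12
err disk 97
err net 3'

# A bare pattern with no action prints matching lines, like grep.
grep_out=$(printf '%s\n' "$lines" | grep 'err')
awk_out=$(printf '%s\n' "$lines" | awk '/err/')

# Adding an action lets you pick fields out of the matching lines.
devices=$(printf '%s\n' "$lines" | awk '/err/ { print $2 }')

# Patterns can also test fields directly, including numeric comparisons.
big_errors=$(printf '%s\n' "$lines" | awk '$1 == "err" && $3 > 50 { print $2 }')

printf '%s\n' "$awk_out" "$devices" "$big_errors"
```

The last line is the sort of thing that would otherwise need grep | awk | grep chained together.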

If jumping into programming sounds like fun, I recommend Python. You can start by replacing some of your fancier grep hacks with little Python programs using the re module and a loop over the lines in a file. Then you can ease into more Python, using dictionaries to store data for instance.

But if that sounds like too much commitment, there's probably a lot more you can learn in grep. In particular, are you comfortable using grep -o and parenthesis matching to do complex stuff? It's powerful and not many people know it. There are also advanced tricks you can do with grep -C to work on multiline output. And if you add a little sed to your command line repertoire, you can do a lot more.
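
A rough sketch of those two flags, on an invented log excerpt. This is only one reading of "parenthesis matching" (grouping alternatives with -E), and the sample lines and variable names are assumptions:

```shell
# Made-up sample: method, path, status.
log='GET /index.html 200
GET /missing 404
POST /form 200
GET /about 200'

# -o prints only the matched text instead of the whole line; -E turns on
# extended regexps so that parentheses group the alternatives.
pages=$(printf '%s\n' "$log" | grep -oE '/(index|form)[a-z.]*')

# -C 1 prints one line of context on each side of every match.
context=$(printf '%s\n' "$log" | grep -C 1 '404')

printf '%s\n' "$pages" "$context"
```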
posted by Nelson at 8:32 AM on March 12, 2013 [1 favorite]


The NLTK example is why you should learn Python for this, but also because Perl is ugly and harder to read than Python!
posted by oceanjesse at 9:05 AM on March 12, 2013


Response by poster: Nelson, thanks for the grep intrigue. Can you point me to any coherent site/document/book for further enticement?
posted by Quisp Lover at 9:47 AM on March 12, 2013


Best answer: The O'Reilly Frog
posted by bonehead at 9:50 AM on March 12, 2013


Response by poster: Ooh. Cool! thanks, bonehead!

I'm figuring I'll need to abandon BBEdit's grep implementation, and do this more serious stuff via command line in Mac OS. But so be it.
posted by Quisp Lover at 9:57 AM on March 12, 2013


Any heavy user of the command-line would do well to know one of Perl, Ruby, or sed & awk. You shouldn't learn Python for command-line filters, because it's not built for them, lacking the command-line switches that, say, Perl and Ruby have.

I had occasion recently to want to count requests by user in a web log.
perl -anF'\s+' -e '$h{$F[2]}++; END { print "$_ $h{$_}\n" for sort keys %h }'
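
Since the thread started with AWK, here is a sketch of roughly the same count in awk. The log layout (user name in the third whitespace-separated field) is an assumption inferred from $F[2] in the one-liner above, and the sample lines are invented. awk makes no promise about the order of for (u in h), hence the trailing sort:

```shell
# Made-up sample: host, dash, user, request.
log='10.0.0.1 - alice "GET /"
10.0.0.2 - bob "GET /a"
10.0.0.1 - alice "GET /b"'

# Count lines per third field, then print each user with its count.
counts=$(printf '%s\n' "$log" |
  awk '{ h[$3]++ } END { for (u in h) print u, h[u] }' | sort)

printf '%s\n' "$counts"
```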
I'll always love Perl, but except maybe for library availability, Ruby's a better language. I still reach for Perl for command-line filters because I never learned Ruby's command-line switches.
posted by Zed at 3:31 PM on March 12, 2013


Does Python as a shell work well on Macs? I have spent the last couple of years refining my command of bash and general shell scripting in Linux but in retrospect I kind of wish I'd instead learned Python and tried to do everything with .py command line scripts, which I'd considered but decided against based on a few particular feature deficits listed in that Wikipedia comparison.

It's kind of cool to feel like I'm initiated into the ancient UNIX dark arts but shell scripting is really a stupidly kludgey, slow, and quirky language to be using in the 21st century. Even many things that are pretty well-trod paths by now are difficult to use and poorly designed compared to working in more modern programming environments.
posted by XMLicious at 3:58 PM on March 12, 2013

