Is this possible with sed only?
September 14, 2011 2:23 PM

Shell scripting gurus! I've got a bash script which works...but I suspect there's a better way to do what I'm trying to do. Am I wrong?

I have a number of LaTeX documents that I'm converting to HTML using HeVeA. The basic conversion to HTML is fine but a few things need changing and cleaning up. So I'm writing a bash script to do it in a nice consistent way. I've got the whole thing working fine so far but there's one part that seems kind of ugly to me. I'm wondering if anyone with more shell experience might suggest a nicer way to do it.

So the problem:
The documents have footnotes and HeVeA puts them at the very end of the page with local links from the text to the footnotes and vice-versa. What I want to do (well, have done) is to copy the footnotes to the <a> tag's 'title' property, so the footnote is displayed when you hover over the link instead of having to navigate to the bottom of the page and back up. The following code accomplishes this, but I feel it's a little kludgy.

### read the input file into the variable 'document' ###
document=$(cat $1)

### count the number of footnotes in the document and cycle through them in a loop ###
num_footnotes=$(grep -c '"dd-thefootnotes"' $1)
for nth_footnote in $(seq 1 $num_footnotes)
do
### use grep to find the line containing the nth footnote, pipe that through sed to cut out the footnote itself ###
the_footnote=$(grep '"note'$nth_footnote'"' $1 | sed 's/.*dd-thefootnotes">\(.*\)/\1/')

### escape any '&' characters in the footnote and strip HTML tags ###
corrected_footnote=$(echo "$the_footnote" | sed 's|\&|\\\&|g; s|<\/.*>||g; s|<.*>||g')

### pipe the document through sed, finding and replacing the <A> tag for the nth footnote, return the substitution back to the 'document' variable ###
document=$(echo "$document" | sed 's|<A NAME="text'$nth_footnote'" HREF="#note'$nth_footnote'">|<A HREF="" TITLE="'"$corrected_footnote"'">|')
done

### write the new/corrected document to a temporary file ###
echo "$document" > temp.html

It's probably not obvious from the code, but the superscript in the main text is a link with: NAME="text1" HREF="#note1"
and the footnotes have links with the reverse: NAME="note1" HREF="#text1"
And the footnotes are also inside a <DD> tag with CLASS="dd-thefootnotes" which is how I find the things. Obviously the number changes depending on the specific footnote in question.

So is there a better, nicer, more concise way to do this? In particular I'm wondering if there's a way to do this with sed only. I suspect the answer to that is, "no," but if there's one thing I'm not, it's a sed expert.

The above script works and is fine for my purposes but I figure there's always more to learn so who's got suggestions?
posted by Mister_Sleight_of_Hand to Computers & Internet (14 answers total) 1 user marked this as a favorite
 
I'd suggest doing this in a language like Perl or Python instead. Bash was not designed for this kind of task.
posted by grouse at 2:38 PM on September 14, 2011 [1 favorite]


Seconding grouse, though plaudits for getting it to work in bash in the first place. I'd do this in perl; I don't think it would take you long to figure out a way to do the same, even if you're new to the language. Apologies if that's not the kind of better way to do it you were looking for.
posted by ManyLeggedCreature at 2:50 PM on September 14, 2011


We would be remiss if we failed to warn you of the dangers of parsing HTML with regular expressions.

I appreciate the moxie of your bash hack, but any improvements to this code will likely have diminishing returns in terms of maintainability and extensibility. I'd agree with the other answers that suggest using an HTML parsing library with a language like Perl or Python (BeautifulSoup, perhaps?).
posted by aparrish at 3:01 PM on September 14, 2011 [3 favorites]


Seems like a job for XSLT, though it's probably too much to learn for an isolated task.
posted by stebulus at 3:11 PM on September 14, 2011


Best answer: Answering the question as asked, sticking to bash and sed and grep:

The ugliest thing in your code is that you keep the intermediate states of the document in a bash variable. There's also the inefficiency that you do one pass through the document per footnote. Better would be something like this (in schematic form):
grep "pattern which picks out footnote lines" in.html |
sed -e "sed command to extract just the footnote number and text" \
    -e "sed command to escape ampersands etc" \
    -e "sed command to generate a sed command! see below" > foo.sed
sed -f foo.sed in.html >out.html
That's two passes through the document, no matter how many footnotes there are, and the document is always processed in pipes. The idea is to use the data to generate a program that will perform the desired transformation, then run the generated program.

The third sed command should turn
1 text of footnote 1
into
s|<A NAME="text1" HREF="#note1">|<A HREF="" TITLE="text of footnote 1">|
It looks like you know enough sed to write this, so I will say no more about it.

(A couple minor things: (1) The initial grep can probably be moved into sed, since sed has regular expressions. (2) Instead of my chain of -e arguments, perhaps put those commands into a separate file and invoke it using -f (this avoids having to get special characters past the shell, for example). (3) I'm not sure why you're quoting & as \&; shouldn't it be quoted as &amp;? (4) My suggested program will have problems if a footnote happens to contain the pipe character. This is the kind of issue that motivates using a proper parser, as others have suggested.)
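
A minimal sketch of the scheme above, filled in under some assumptions: each footnote sits on a single line that contains both NAME="noteN" and the dd-thefootnotes class (as the grep and sed in the question imply), and the sample in.html is made up for illustration rather than real HeVeA output. It also escapes the pipe character and double quotes in the footnote text, which the schematic version does not.

```shell
# Sketch only: the sample page is an assumed approximation of HeVeA output.
workdir=$(mktemp -d)

cat > "$workdir/in.html" <<'EOF'
<P>First claim<A NAME="text1" HREF="#note1"><SUP>1</SUP></A> and second<A NAME="text2" HREF="#note2"><SUP>2</SUP></A>.</P>
<DL>
<DT CLASS="dt-thefootnotes"><A NAME="note1" HREF="#text1">1</A></DT><DD CLASS="dd-thefootnotes">Salt &amp; pepper
<DT CLASS="dt-thefootnotes"><A NAME="note2" HREF="#text2">2</A></DT><DD CLASS="dd-thefootnotes">See <EM>a|b</EM> and "c"
</DL>
EOF

# Pass 1: turn every footnote line into "N text", then into one s command.
#   1st sed: keep only footnote lines, reduced to "number text"
#   -e 1:    strip HTML tags from the footnote text
#   -e 2:    escape \, & and | so they are literal in the generated replacement
#   -e 3:    turn " into &quot; (written as \&quot; so the generated s command
#            emits a literal ampersand)
#   -e 4:    wrap the result in the s|search|replacement| command
sed -n 's/.*NAME="note\([0-9][0-9]*\)".*dd-thefootnotes">\(.*\)/\1 \2/p' "$workdir/in.html" |
sed -e 's/<[^>]*>//g' \
    -e 's/[\\&|]/\\&/g' \
    -e 's/"/\\\&quot;/g' \
    -e 's/^\([0-9][0-9]*\) \(.*\)$/s|<A NAME="text\1" HREF="#note\1">|<A HREF="" TITLE="\2">|/' \
    > "$workdir/foo.sed"

# Pass 2: run the generated program over the document.
sed -f "$workdir/foo.sed" "$workdir/in.html" > "$workdir/out.html"

cat "$workdir/foo.sed"
```

Because the search pattern includes the closing double quote after the number, the command for footnote 1 cannot touch footnote 10.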

I have to leave right away, so the above may have more errors than usual.
posted by stebulus at 3:42 PM on September 14, 2011 [2 favorites]


Best answer: This part:

grep '"note'$nth_footnote'"'

is problematic for two reasons. First, I suspect that some of the quotes/double quotes got mangled when you posted it to AskMe. Second, if you're grepping for 'note0', 'note1', 'note2', etc., then 'note1' is going to catch notes 1, 10, 11, 12, 13, 14, 15... and so on.
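
To see how much the surrounding double quotes matter, here is a small sketch. The anchor lines are made up for illustration; only the NAME/HREF pattern comes from the question.

```shell
# Made-up anchor lines standing in for the footnote section of a page.
notes=$(mktemp)
printf '%s\n' \
    '<A NAME="note1" HREF="#text1">1</A>' \
    '<A NAME="note10" HREF="#text10">10</A>' \
    '<A NAME="note11" HREF="#text11">11</A>' > "$notes"

nth_footnote=1

# Without literal double quotes the pattern is a bare prefix,
# so it also matches the note10 and note11 lines.
loose=$(grep -c "note$nth_footnote" "$notes")

# With a literal double quote on each side only NAME="note1" matches.
strict=$(grep -c "\"note$nth_footnote\"" "$notes")

echo "loose=$loose strict=$strict"   # prints: loose=3 strict=1
```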

I would protect any $ expression in double quotes, especially file names that may contain spaces.

I don't like the way you're using $document. It looks to me like you could accomplish the same thing with input redirection.

If you've got input that is well enough behaved, then there's no reason you shouldn't be able to do this in bash if you'd rather not learn anything new and you want to get this done fast. You can't expect whatever script you develop to work on inputs that don't follow the exact pattern you've developed it for. If you're going to use a script like this, you need to add error checking to make sure that you're catching exactly one match each time you grep or sed.
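
One way such a check might look, as a rough sketch. The helper name require_one_match and the sample lines are hypothetical, and grep -c counts matching lines rather than individual matches.

```shell
# Hypothetical helper: succeed only if exactly one line of the file matches.
require_one_match() {
    pattern=$1
    file=$2
    count=$(grep -c -- "$pattern" "$file")
    if [ "$count" -ne 1 ]; then
        echo "expected exactly 1 line matching $pattern in $file, found $count" >&2
        return 1
    fi
}

# Made-up sample input with two footnote anchors.
sample=$(mktemp)
printf '%s\n' \
    '<A NAME="note1" HREF="#text1">1</A>' \
    '<A NAME="note2" HREF="#text2">2</A>' > "$sample"

require_one_match '"note1"' "$sample" && echo "note1: ok"
require_one_match '"note9"' "$sample" || echo "note9: not exactly one match"
```

Calling it before each extraction lets the script stop at the first footnote that does not fit the expected pattern instead of silently producing a wrong title.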

If it were me, I'd do this in elisp since I live in emacs anyway.
posted by rdr at 3:43 PM on September 14, 2011


Seems like a job for XSLT

Did I miss a memo and LaTeX is now an XML format?

posted by yerfatma at 6:39 PM on September 14, 2011


Did I miss a memo and LaTeX is now an XML format?

Mister_Sleight_of_Hand's script processes HTML pages that are the result of running a previous script over a LaTeX document.
posted by grouse at 6:45 PM on September 14, 2011


Another matter, kind of unrelated to your question: The script as posted here takes away the href, so that the footnote appears only in the title attribute. I think that makes it hard to access on some machines (tablets and phones don't have mice to mouseover with), so if I were you, I'd put the footnotes in the title attribute but leave the href alone, so that people can follow the link if they need to.
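
A sketch of what that variant of the substitution could look like for one footnote. The footnote text is a placeholder, and the & in the replacement stands for whatever the search pattern matched, so NAME and HREF survive unchanged.

```shell
# Add a TITLE attribute while leaving NAME and HREF in place.
fixed=$(printf '%s\n' '<A NAME="text1" HREF="#note1"><SUP>1</SUP></A>' |
    sed 's|<A NAME="text1" HREF="#note1"|& TITLE="text of footnote 1"|')

echo "$fixed"
# prints: <A NAME="text1" HREF="#note1" TITLE="text of footnote 1"><SUP>1</SUP></A>
```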
posted by stebulus at 8:46 PM on September 14, 2011


Response by poster: Thanks guys!
This in particular:
sed command to generate a sed command!
is something I would have never thought of. But it works great! At least on the vastly simpler test text I tried it with.

I probably should have mentioned that all the documents were written by me with the same template (documentclass, packages, macros, etc.) and I can trust HeVeA to follow its own rules for outputting to HTML, so I'm reasonably certain that all the documents will behave with the script.

tablets and phones don't have mice to mouseover with
Good point. Since I own neither a tablet nor a smartphone, this simply didn't occur to me, but you're absolutely right; I'll leave the href alone.
posted by Mister_Sleight_of_Hand at 2:52 AM on September 15, 2011


Post a sample document somewhere and I'll show you the tidiest bash script I can write to do what you want.
posted by flabdablet at 8:07 AM on September 15, 2011


Seriously, though, is there truly no way of telling HeVeA more precisely what you want it to generate rather than applying post-processing hacks to its present output?
posted by flabdablet at 8:09 AM on September 15, 2011


One trick you need to know when you're writing sed commands to generate sed commands is that any text you bury inside the search-text part of a sed s/search-text/replacement-text/ command, or in the /address/ prefix of any command, is actually going to be treated as a regular expression. If regex special characters like . or [ or sed delimiters like / could possibly occur inside any of that text, you need to escape them all or your generated sed script will break.

Here's a script I wrote that uses this kind of technique:
#!/bin/bash

# Allow the specified YouTube video ID to be accessed
# by proxy users for whom YouTube is otherwise blocked.

video_id=$1

# YouTube video ID is in base64 URL format.
# Convert to vanilla base64, then to hex.

base64_id="${video_id//_//}"
base64_id="${base64_id//-/+}"==
base64_id="${base64_id:0:$((${#video_id} + 3 & ~3))}"
base64 -d <<<$base64_id >/dev/null || exit
hex_id=$(base64 -d <<<$base64_id | xxd -p)

# Build the URL regex lines we want to insert
# into /etc/squid/squid_login_urls.

video_re='^http://www\.youtube\.com/watch\?v='$video_id
cache_re='^http://[^/]+\.youtube\.com/videoplayback\?.+&id='$hex_id

# We want sed to treat those as text, not use them
# as regexes; escape all the special regex characters.

metachars='[].*[{,}^$\()&/]'
video_re_escaped="$(echo "$video_re" | sed -re 's/('"$metachars"')/\\\1/g')"
cache_re_escaped="$(echo "$cache_re" | sed -re 's/('"$metachars"')/\\\1/g')"

# Create a sed script to append the new URL regexes
# and delete any existing occurrences.

script='
$a\
'"$video_re_escaped"'\
'"$cache_re_escaped"'
/^'"$video_re_escaped"'$/d
/^'"$cache_re_escaped"'$/d
'

# Run it against the list of URLs that Squid will
# provide its own authentication for; the 'squid' user
# is in the upstream group with the most permissive
# available filtering.

sudo sed -ire "$script" /etc/squid/squid_login_urls
sudo service squid reload
Saved as /usr/local/bin/yt-allow, this lets me enter a command like yt-allow nHlJODYBLKs to append the two lines
^http://www\.youtube\.com/watch\?v=nHlJODYBLKs
^http://[^/]+\.youtube\.com/videoplayback\?.+&id=9c79493836012cab
to my /etc/squid/squid_login_urls file (or move them to the end if they're already there).

Note that /etc/squid/squid_login_urls is a list of regexes to match URLs that Squid is supposed to treat specially, and that being URLs they're all loaded with / characters that sed uses as delimiters - and yet I want sed always to handle them as chunks of uninterpreted text. Lots of escaping needs to be done, and doing it by hand would be tedious, error-prone and inscrutable. So I use sed for that job too (see the three lines beginning at metachars='[].*[{,}^$\()&/]'). Running the script using bash -x allows me to look at the generated strings and I can assure you I would not have had a hope of getting them right by hand.
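
The same idea in a smaller, stand-alone form, as a sketch rather than a drop-in replacement for the script above. The helper name escape_for_search is hypothetical, and it only covers plain sed (basic regular expressions) with / as the delimiter.

```shell
# Hypothetical helper: backslash-escape the characters that are special in a
# basic regular expression, plus the / delimiter used in the s command below.
escape_for_search() {
    printf '%s\n' "$1" | sed 's|[[\.*^$/]|\\&|g'
}

literal='a.b/c[1]*'
escaped=$(escape_for_search "$literal")

# Only the first line contains the literal text; in the second line the x
# would match an unescaped dot but does not match the escaped one.
result=$(printf '%s\n' 'a.b/c[1]*' 'axb/c[1]*' | sed "s/$escaped/HIT/")

printf '%s\n' "$escaped"   # prints: a\.b\/c\[1]\*
printf '%s\n' "$result"    # prints HIT, then axb/c[1]*
```

Without the escaping step, the / inside the literal text would end the search pattern early and sed would reject the command.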

It's also worth paying attention to the way this script uses ' and " quotes. I wanted to use ' quotes for most things, because quite a lot of the strings I'm defining contain \ escapes (often doubled) and using " would have meant needing to double them again. So you'll quite often see a construct like '"$variable"' appear in the middle of an otherwise single-quoted string. This looks like some bizarre new kind of quoting, but all it's really doing is closing the current single quotes, opening double quotes for just long enough to cause a clean expansion of $variable, then closing those and re-opening single quotes. Yes, it's kind of ugly - but the fact that it's even possible is one of the things that makes bash such a pleasure compared to Windows cmd.
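
A tiny illustration of that splice, using made-up values, next to the same string written entirely in double quotes:

```shell
variable='two words'

# Single quotes keep the backslashes literal; the double-quoted middle
# section expands $variable without any word splitting.
spliced='s/\\\1/'"$variable"'/g'

# Written entirely in double quotes, every backslash has to be doubled.
doubled="s/\\\\\\1/${variable}/g"

printf '%s\n' "$spliced"   # prints: s/\\\1/two words/g
printf '%s\n' "$doubled"   # prints the same string
```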
posted by flabdablet at 9:04 AM on September 15, 2011


And having looked over that again: I've just given myself another useless use of cat award and replaced a few lines with this:
metachars='[].*[{,}^$\()&/]'
video_re_escaped="$(sed -re 's/('"$metachars"')/\\\1/g' <<<"$video_re")"
cache_re_escaped="$(sed -re 's/('"$metachars"')/\\\1/g' <<<"$cache_re")"
Kind of galling to find myself having used the handy <<< "here string" construct in the base64 decoding step but missing it there. It is a bashism though.
posted by flabdablet at 10:02 AM on September 15, 2011

