How do I script removal of the head and tail of a file up to and after given comments?
January 7, 2006 1:44 PM

Given BBEdit (or sed) and a file of HTML, how can I script the deletion of all text up to a given comment tag, skip over deleting the part of the file I want to keep, then resume deletion of contents after the next occurrence of a given comment tag?

I edit a site that calls for me to periodically repurpose content from other sites in our company.

I've written an Automator workflow that handles grabbing the printer-friendly version of a given article I'm repurposing, runs it through TextSoap to strip smartquotes and other stuff I hate, replaces some absolute links to be appropriate to the article's new home, slaps on a line attributing the original source for the story, and loads it into BBEdit.

At this point, it's a pretty simple operation to manually cut out the gunk I don't want, which is everything in the page source before a comment tag that reads "content_start" and after a comment tag that reads "content_stop", but I'd really like to automate this part, too, for the sheer pleasure of having an end-to-end workflow.

I just don't have the scripting chops to describe "delete all the lines up to this comment and delete all the lines after that comment."

It seems that doing this in AppleScript using BBEdit's scripting dictionary, or with a line or two of sed, would be equally adequate.
posted by mph to Computers & Internet (7 answers total) 1 user marked this as a favorite
 
BBEdit's grep engine has tokens for the beginning and end of a document (\A and \Z respectively). Those should help you build a query you can automate.
posted by jjg at 2:18 PM on January 7, 2006
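
For illustration, the anchored-pattern idea can be sketched outside BBEdit as well. The Python below is only a sketch under assumptions: it assumes the markers are HTML comments written as <!-- content_start --> and <!-- content_stop -->, and the function name strip_outside is made up for this example. Python's re module also uses \A and \Z for the start and end of the text, though its handling of a trailing newline may differ from BBEdit's.

```python
import re

# Assumed marker format: HTML comments such as <!-- content_start -->.
START = r"<!--\s*content_start\s*-->"
STOP = r"<!--\s*content_stop\s*-->"


def strip_outside(html):
    """Delete everything up to the start marker and from the stop marker on."""
    # \A anchors at the very beginning; DOTALL lets "." match newlines too.
    html = re.sub(r"\A.*?" + START, "", html, count=1, flags=re.DOTALL)
    # \Z anchors at the very end of the text.
    html = re.sub(STOP + r".*\Z", "", html, count=1, flags=re.DOTALL)
    return html


page = (
    "<html><body>nav\n"
    "<!-- content_start --><p>Keep me.</p><!-- content_stop -->\n"
    "footer</body></html>\n"
)
print(strip_outside(page))  # <p>Keep me.</p>
```

If either marker is missing, the corresponding substitution simply does not match and that end of the text is left alone.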


I know it's not Sed, but where Sed goes, perl shall follow. Here's a simple script that'll do what you want:


#!/usr/bin/perl
my $middle = 0 ;
while(){
my $line = $_ ;
if($line =~ /content_stop/){ $middle = 0 ; }
if($middle){ print $line ; }
if($line =~ /content_start/){ $middle = 1 ; }
}

run it like this:

perl script.pl < inputfile.txt > outputfile.txt

It even runs when only two lines long:
#!/usr/bin/perl
my $middle = 0 ; while(){ my $line = $_ ; if($line =~ /content_stop/){ $middle = 0 ; } if($middle){ print $line ; } if($line =~ /content_start/){ $middle = 1 ; } }

:)

posted by roue at 2:33 PM on January 7, 2006


Perl is smarter than that! It's a one-liner:

perl -ne 'print if /content_start/ .. /content_stop/' input.txt > output.txt
posted by nicwolff at 3:42 PM on January 7, 2006


Try
sed -n '/tag_start/, /tag_end/ { p; }'

posted by rycee at 3:43 PM on January 7, 2006
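
The line-range idea behind the answers above can be sketched in Python as a small state machine. This is an illustration, not a translation of any one answer: the function name lines_between and the include_markers flag are invented here. With include_markers=True it behaves roughly like the Perl range operator and the sed address range, which print the marker lines too; with the default it behaves roughly like roue's loop, which drops them.

```python
def lines_between(lines, start="content_start", stop="content_stop",
                  include_markers=False):
    """Yield the lines that fall between a start-marker line and a stop-marker line."""
    inside = False
    for line in lines:
        if not inside:
            if start in line:
                inside = True          # the wanted region begins after this line
                if include_markers:
                    yield line
            continue
        if stop in line:
            inside = False             # the wanted region ended on the previous line
            if include_markers:
                yield line
            continue
        yield line


sample = [
    "header\n",
    "<!-- content_start -->\n",
    "<p>one</p>\n",
    "<p>two</p>\n",
    "<!-- content_stop -->\n",
    "footer\n",
]
print("".join(lines_between(sample)), end="")

# Markers sharing a line with content: whole lines are dropped with them.
inline = [
    "junk<!-- content_start --><p>one</p>\n",
    "<p>two</p>\n",
    "<p>three</p><!-- content_stop -->junk\n",
]
print(list(lines_between(inline)))  # ['<p>two</p>\n']
```

The second example shows the limit of working line by line: anything that shares a line with a marker is either lost or kept along with the junk next to it.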


(If you go for roue's solution, note that while() should actually be while(<>) -- looks like the angle brackets were stripped out. But nicwolff's solution is, of course, much shorter and more the 'Perl way' to do it...)
posted by littleme at 4:57 PM on January 7, 2006


Best answer: The above solutions should work well if your start/stop tags don't span multiple lines, and if you can live with line-level granularity. If not, you can do something like this:
perl -e 'local $/; $_ = <>; s@^.*?content_start@@s; s@content_stop.*?$@@s; print' < file
posted by Rhomboid at 6:55 PM on January 7, 2006
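
For comparison, here is a rough Python counterpart of the read-the-whole-file approach. It is a sketch under assumptions: the function name keep_between is hypothetical, and the default marker strings assume the comments are written exactly as <!-- content_start --> and <!-- content_stop -->. Cutting on the full comment text, rather than on the bare marker word, avoids leaving the comment delimiters behind in the output.

```python
def keep_between(text, start="<!-- content_start -->", stop="<!-- content_stop -->"):
    """Return the text between the first start marker and the next stop marker."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return text            # no start marker: leave the text untouched
    # If the stop marker is missing, partition returns all of rest as the body.
    body, _, _ = rest.partition(stop)
    return body


page = (
    "junk\n"
    "<div><!-- content_start --><p>one</p>\n"
    "<p>two</p><!-- content_stop --></div>\n"
    "junk\n"
)
print(keep_between(page))

# Cutting on the bare words leaves the rest of each comment tag behind.
print(repr(keep_between(page, "content_start", "content_stop")))
```

Because the whole text is handled as one string, it does not matter whether the markers sit on their own lines or run right up against the content.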


Response by poster: Thanks for all the answers!

Several worked well, and Rhomboid's saved me the part where I tossed in a TextSoap rule to isolate the comments onto their own lines, since the CMS inserting the content runs the start/stop comments right up against the inserted text with no breaks, causing some of the suggestions to chop off the first/last lines of the items I tested on.
posted by mph at 9:19 PM on January 7, 2006

