Stripping timecode from SRT file
May 30, 2011 11:05 AM

I would like to delete certain lines in a text document. It's an SRT file, which is basically a TXT file that contains subtitles and in & out points. I'd like to delete all the timecode numbers, and leave only the text. Example inside...

The format is as follows:

1
00:02:46,250 --> 00:02:47,625
Yahoo!

2
00:03:04,768 --> 00:03:07,353
-Congratulations.
-Thank you.


My desired end result is this:

Yahoo!

-Congratulations.
-Thank you.


Thanks in advance!
posted by Silky Slim to Computers & Internet (9 answers total) 2 users marked this as a favorite
 
Response by poster: Extra points for recognizing the film by those first two lines (you crazy geniuses you)
posted by Silky Slim at 11:06 AM on May 30, 2011


The easiest way to do this will probably involve a small amount of scripting, so it would be helpful to know what operating system you use/have access to.
posted by Salvor Hardin at 11:21 AM on May 30, 2011


Response by poster: Mac OS X Snow Leopard, UNIX under the hood of course
posted by Silky Slim at 11:28 AM on May 30, 2011


Response by poster: I guess I forgot to mention this file contains several thousand subtitles :^)
posted by Silky Slim at 11:29 AM on May 30, 2011


The easiest way to do this is with the UNIX command grep or a text editor with grep functionality.

TextWrangler is a text editor that I've used for tasks like this. Go to Search -> Find and check "Use grep". Search for the string ".* --> .*" (without quotes) and click Find All to find all the timecodes. You can then replace them all with nothing.

A search for the string "^[0-9]+$" (without quotes) will find the sequence numbers between subtitles, which you can replace the same way. Anchoring the pattern to the whole line keeps it from also matching digits inside the dialog.
posted by rancidchickn at 11:40 AM on May 30, 2011 [1 favorite]
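
For anyone who would rather script those two find-and-replace passes than run them in an editor, here is a minimal Python sketch of the same idea. The function name strip_srt_regex and the sample text are made up for illustration, and unlike the editor replacement it also removes the line break after each match, so no empty lines are left where the numbers and timecodes were.

```python
import re

# Two whole-line patterns, applied like two find/replace passes.
# MULTILINE makes ^ and $ match at every line boundary.
TIMECODE = re.compile(r"^.* --> .*$\n?", re.MULTILINE)
INDEX = re.compile(r"^[0-9]+$\n?", re.MULTILINE)


def strip_srt_regex(text):
    """Hypothetical helper: drop timecode lines, then bare number lines."""
    text = TIMECODE.sub("", text)
    return INDEX.sub("", text)


sample = (
    "1\n"
    "00:02:46,250 --> 00:02:47,625\n"
    "Yahoo!\n"
    "\n"
    "2\n"
    "00:03:04,768 --> 00:03:07,353\n"
    "-Congratulations.\n"
    "-Thank you.\n"
)
print(strip_srt_regex(sample), end="")
```

This assumes Unix line endings. A subtitle that consists of nothing but digits would be removed along with the sequence numbers, so check the output if the dialog contains lines like that.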


Another option........
If you have access to a spreadsheet, you can do this. It is a little roundabout, but gets the job done without scripting.

- Paste the whole TXT file into column B of Excel or something like that.
- In column A, create a numbered list running from 1 down to the last row of the text.
- Sort using Column B. All the offending parts like the timecode and numbers will group together.
- Delete those offending lines completely (including the number list cells in Col A). You will be able to delete mass amounts at one time, so it is rather fast.
- Re-sort both Columns using Col A as the sort key.

What you will be left with is the subtitle text in its original order, with some spacing where there were empty lines. You can then copy the results of Col B out to a text file.
posted by lampshade at 11:53 AM on May 30, 2011 [1 favorite]


Old school awk will do the job:

awk '/-->/{for(i=1;i<d;i++){print a[i]};delete a;d=0;next}{a[++d]=$0}END{for(i=1;i<=d;i++)print a[i]}' filename.srt

which you can redirect to a file by appending > output.txt. Some .srt files contain HTML-style formatting tags (italics, etc.), which you can strip out with other UNIXy tools.
posted by holgate at 12:07 PM on May 30, 2011
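
A rough Python sketch of the same buffering approach, in case awk is hard to read: lines are collected until a timecode shows up, and the last collected line (the sequence number) is thrown away. The tag-stripping step is an addition, and its regex is a naive assumption that only handles simple tags such as <i> or <font color="...">. The names strip_srt_buffered and TAG are made up for this example.

```python
import re

# Naive tag matcher (assumption): "<" or "</", a letter, then anything up to ">".
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_srt_buffered(lines):
    """Hypothetical helper: return subtitle text lines without numbers or timecodes."""
    out = []
    buffered = []
    for line in lines:
        line = line.rstrip("\r\n")
        if "-->" in line:
            # The last buffered line is the sequence number; keep the rest.
            out.extend(buffered[:-1])
            buffered = []
        else:
            buffered.append(line)
    out.extend(buffered)  # flush the final subtitle, in order
    return [TAG.sub("", text) for text in out]


sample = [
    "1",
    "00:02:46,250 --> 00:02:47,625",
    "<i>Yahoo!</i>",
    "",
    "2",
    "00:03:04,768 --> 00:03:07,353",
    "-Congratulations.",
    "-Thank you.",
]
print("\n".join(strip_srt_buffered(sample)))
```

To run it on a real file, something like strip_srt_buffered(open("filename.srt")) should work, since iterating over an open file yields its lines; the file's text encoding is something you would need to check yourself.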


egrep -v '[0-9]+($|:)' < file.srt > nonumbers.txt

where file.srt is the existing file, and nonumbers.txt is the name of the output file

this will filter out lines like in your example, but conservatively, so a hypothetical line of dialog like "1 little monkey jumping on the bed" will not be filtered out
posted by idiopath at 12:44 PM on May 30, 2011


Posted too soon: improved regexp (even less likely to filter out lines you wanted)
'^[0-9]+($|:)'

also take out the -v to *only* show the lines to be rejected, if you are worried about accidental filtering
posted by idiopath at 12:47 PM on May 30, 2011
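
Roughly the same filter as a Python sketch, using the anchored pattern. The name filter_srt and its show_rejected flag are illustrative only; the flag plays the role of dropping -v so you can preview what would be thrown away. One added assumption: the sketch strips a trailing carriage return from each line, because with Windows line endings the "$" in the egrep pattern would not match the bare number lines.

```python
import re

# Same anchored pattern as the egrep example above.
NUMERIC = re.compile(r"^[0-9]+($|:)")


def filter_srt(lines, show_rejected=False):
    """Hypothetical helper: return kept lines, or the rejected ones if asked."""
    result = []
    for line in lines:
        line = line.rstrip("\r\n")  # assumption: tolerate CRLF line endings
        rejected = NUMERIC.search(line) is not None
        if rejected == show_rejected:
            result.append(line)
    return result


sample = [
    "1\r\n",
    "00:02:46,250 --> 00:02:47,625\r\n",
    "Yahoo!\r\n",
    "\r\n",
    "1 little monkey jumping on the bed\n",
]
print(filter_srt(sample))
print(filter_srt(sample, show_rejected=True))
```

A dialog line that starts with digits followed by a colon, or that is nothing but digits, would still be rejected, so previewing with show_rejected=True first is a reasonable precaution.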

