Parsing movie/TV scripts
January 5, 2015 6:25 PM
I have several scripts for movies and TV shows which appear to be in a standard tab-delimited format. I would like to get each character's lines alone into a new separate text document (likely via a computer script).
Here's an example short snippet of the type of file I am working with. From this example, I'd like to run some process against BEVERLY (line 8 - or any other character, really), which creates a text file which contains:
"Useful"? How and what, Commander?
Would this be available in emerald green?
(continuing) I'm sure, Commander, there are reasons for a first officer to want to demonstrate his energy and alertness to a new captain. But since my duty and interests are outside the command structure...
etc.
I am somewhat comfortable at the command line and use basic utilities like grep and cut regularly, but creating something like what I want is just a little out of my range. I'm also comfortable with GUI tools like Notepad++. Super bonus points will be awarded if someone could walk me through the exact syntax of how it is I need to accomplish this task, so I can get some understanding of how to do it in the future given different parameters. Giving me the regex is good, but telling me how to implement it properly is better, if that makes sense.
Again, the end goal is to create a text file that is a list of a given character's lines. Thanks!
I like a challenge. This works with your sample, at least (you can pipe input to the script or specify files on the command line):
#!/usr/bin/env python2
from fileinput import input
from re import match

wanted_character = "BEVERLY"

dialogue = False
out = ""
for line in input():
    # match a dialogue header (5 tabs)
    result = match(r"^\t{5}(\S.*?)( *)$", line)
    if result:
        character = result.group(1)
        dialogue = (character == wanted_character)
        if not dialogue and len(out) > 0:
            # if speaker changed, print what we've got and start over
            print out
            out = ""
    elif dialogue:
        # match a spoken line (3 tabs)
        result = match(r"^\t{3}(\S.*?)([ \t]*?)$", line)
        if result:
            # append this line to the dialogue, with a space
            if len(out) > 0:
                out += " "
            out += result.group(1)

# just in case the input ends in the middle of dialogue
if len(out) > 0:
    print out
posted by neckro23 at 8:33 PM on January 5, 2015
Oh, in case you don't know what to do with that:
- Save it as crusher.py or whatever (make sure the #! is the first line)
- chmod a+x crusher.py
- ./crusher.py [input file] > [output file]
posted by neckro23 at 8:39 PM on January 5, 2015
Eek, I assumed you're on Mac for some reason. If you're on Windows it should still work, but you'll have to install Python 2 and run python crusher.py instead.
posted by neckro23 at 8:41 PM on January 5, 2015
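Since the question asked how the regexes work and not just what they are, here is a small sketch that runs the two patterns from the script above against a few lines. It is written for Python 3, the sample lines are invented for illustration (only the tab counts matter), and the classify helper is a hypothetical name, not part of the posted script.

```python
import re

# Patterns copied from the script above.
# HEADER: exactly 5 leading tabs, then the speaker name, then optional trailing spaces.
HEADER = re.compile(r"^\t{5}(\S.*?)( *)$")
# SPOKEN: exactly 3 leading tabs, then the text, then optional trailing spaces/tabs.
SPOKEN = re.compile(r"^\t{3}(\S.*?)([ \t]*?)$")


def classify(line):
    """Hypothetical helper: return ("header", name), ("spoken", text) or ("other", None)."""
    m = HEADER.match(line)
    if m:
        return ("header", m.group(1))
    m = SPOKEN.match(line)
    if m:
        return ("spoken", m.group(1))
    return ("other", None)


# Invented sample lines; the real files may differ.
samples = [
    "\t\t\t\t\tBEVERLY  \n",                 # 5 tabs plus trailing spaces
    "\t\t\tWould this be available in\n",    # 3 tabs
    "\t\t\t\tFour tabs, so neither\n",       # 4 tabs
    "\tOne tab, so neither\n",               # 1 tab
]
for s in samples:
    print(classify(s))
```

The \S right after the tab count is what makes the count exact: a line with a fourth tab fails the 3-tab pattern because a tab is whitespace and cannot match \S. The lazy (\S.*?) group stops as early as it can, which is why trailing spaces land in the second group and not in the captured name.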
Best answer: Couldn't help myself. Here's an improved version.
- less redundant, more sensible logic
- can show more than one character's quotes
- does a substring match on character names (so "RIKER" matches "RIKER'S VOICE" etc)
- optionally print character name in front of the quote (with prefix_speaker variable)
#!/usr/bin/env python2
from fileinput import input
from re import match

characters = ["BEVERLY", "WESLEY", "DATA", "GEORDI", "PICARD", "RIKER"]
prefix_speaker = True

out = speaker = ""
for line in input():
    # parse all lines beginning with at least one tab
    result = match(r"^(\t+)(\S.*?)\s*$", line)
    if not result:
        continue
    tabs = len(result.group(1))
    text = result.group(2)
    if tabs == 5:
        # dialogue header
        if speaker != text:
            # speaker changed, print what we've got and start over
            if len(out) > 0:
                if prefix_speaker:
                    print "%s: %s" % (speaker, out)
                else:
                    print out
                out = ""
            speaker = text
    elif tabs == 3 and any(c in speaker for c in characters):
        # spoken line
        # append this line to the dialogue, with a space
        if len(out) > 0:
            out += " "
        out += text
    else:
        # ignore all other lines
        pass

# just in case the input ends in the middle of dialogue
# (honor prefix_speaker here too, same as in the loop)
if len(out) > 0:
    if prefix_speaker:
        print "%s: %s" % (speaker, out)
    else:
        print out
posted by neckro23 at 7:57 AM on January 6, 2015 [1 favorite]
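The script above uses Python 2's print statement, so it will not run unmodified under Python 3. What follows is only a sketch of how the same logic might look in Python 3, wrapped in a function so it can be reused. The name extract_quotes, its parameters, and every sample line except Beverly's are assumptions made for illustration, not something from the thread.

```python
import re

# Same pattern as the script above: leading tabs, then text, trailing whitespace dropped.
LINE = re.compile(r"^(\t+)(\S.*?)\s*$")


def extract_quotes(lines, characters, prefix_speaker=True):
    """Hypothetical helper: collect each wanted character's speeches as single strings."""
    quotes = []
    out = speaker = ""

    def flush():
        # Store the speech collected so far (if any) and start over.
        nonlocal out
        if out:
            quotes.append("%s: %s" % (speaker, out) if prefix_speaker else out)
            out = ""

    for line in lines:
        result = LINE.match(line)
        if not result:
            continue
        tabs = len(result.group(1))
        text = result.group(2)
        if tabs == 5:
            # dialogue header
            if speaker != text:
                flush()
                speaker = text
        elif tabs == 3 and any(c in speaker for c in characters):
            # spoken line: join wrapped lines with a single space
            out += (" " if out else "") + text
    # in case the input ends in the middle of dialogue
    flush()
    return quotes


# Invented sample in the assumed layout (5 tabs = speaker, 3 tabs = dialogue).
sample = [
    "\t\t\t\t\tRIKER\n",
    "\t\t\tIt might be useful,\n",
    "\t\t\tDoctor.\n",
    "\n",
    "\t\t\t\t\tBEVERLY\n",
    "\t\t\t\"Useful\"? How and what,\n",
    "\t\t\tCommander?\n",
    "\tShe turns away.\n",
    "\t\t\t\t\tPICARD'S VOICE\n",
    "\t\t\tEngage.\n",
]
for quote in extract_quotes(sample, ["BEVERLY"]):
    print(quote)
```

To use it on a real file, something like extract_quotes(open("script.txt"), ["BEVERLY"]) should work, since an open file yields lines with their newline characters the same way fileinput does; the filename here is a placeholder.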
Response by poster: Yes, thank you so much neckro23! I altered it just a bit to change python2 to python (I am on a Mac). This appears to give me the results I want, so this is wonderful.
Better yet, I can even understand what most of these lines of code do! I really appreciate the comments in the code.
posted by antonymous at 12:24 PM on January 6, 2015
2.) Replace all newline characters (\n) with a unique item (like SPORK), so the text will all be on one line.
3.) Replace all BEVERLY with \nBEVERLY (a newline). Do the same for other characters.
You can now import this into Excel and use the sort function, or look for one of the text-sorting plugins for Notepad++:
http://milospjanic.blogspot.com/2011/05/sorting-lines-in-notepad.html
Once you have finished sorting them, you can unwrap the lines by replacing SPORK with a newline (\n).
posted by nickggully at 8:19 PM on January 5, 2015
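The placeholder steps visible in this answer can also be sketched in a few lines of Python 3, for anyone who would rather not do the replacements by hand. This is an illustration under assumptions, not nickggully's exact procedure: the function name and sample text are invented, and it sorts by speaker name only (a stable sort) so that each character's speeches stay in script order, whereas a plain alphabetical sort in a spreadsheet would reorder them.

```python
SENTINEL = "SPORK"  # placeholder from the answer; it must not occur in the script text


def group_by_speaker(text, names):
    """Hypothetical helper: split text at each name, group chunks by name, unwrap."""
    # Step 2: put everything on one line by replacing newlines with the placeholder.
    one_line = text.replace("\n", SENTINEL)
    # Step 3: start a new line in front of every character name.
    for name in names:
        one_line = one_line.replace(name, "\n" + name)
    chunks = [c for c in one_line.split("\n") if c]

    def leading_name(chunk):
        # Sort key: the name the chunk starts with, or "" for any preamble.
        return next((n for n in names if chunk.startswith(n)), "")

    # Sort by speaker; list.sort is stable, so each speaker's chunks keep their order.
    chunks.sort(key=leading_name)
    # Finally, unwrap by turning the placeholder back into newlines.
    return [c.replace(SENTINEL, "\n") for c in chunks]


# Invented sample text.
text = "BEVERLY\nHello,\nCommander.\nRIKER\nDoctor.\nBEVERLY\nYes?\n"
for chunk in group_by_speaker(text, ["BEVERLY", "RIKER"]):
    print(chunk)
```

One caveat with this approach: it splits wherever a name appears, including inside stage directions or dialogue, because it ignores the tab structure that the Python scripts in this thread rely on.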