Parsing movie/TV scripts
January 5, 2015 6:25 PM

I have several scripts for movies and TV shows which appear to be in a standard tab-delimited format. I would like to get each character's lines alone into a new separate text document (likely via a computer script).

Here's a short example snippet of the type of file I am working with. From this example, I'd like to run some process against BEVERLY (line 8, or any other character, really) that creates a text file containing:

"Useful"? How and what, Commander?
Would this be available in emerald green?
(continuing) I'm sure, Commander, there are reasons for a first officer to want to demonstrate his energy and alertness to a new captain. But since my duty and interests are outside the command structure...
etc.

I am somewhat comfortable at the command line and use basic utilities like grep and cut regularly, but creating something like what I want is just a little out of my range. I'm also comfortable with GUI tools like Notepad++. Super bonus points will be awarded if someone could walk me through the exact syntax I need to accomplish this task, so I can understand how to do it in the future with different parameters. Giving me the regex is good, but telling me how to implement it properly is better, if that makes sense.

Again, the end goal is to create a text file that is a list of a given character's lines. Thanks!
posted by antonymous to Computers & Internet (6 answers total) 2 users marked this as a favorite
 
1.) Import the file into something like Notepad++
2.) Replace all newline characters (\n) with a unique item (like SPORK), so the text will all be on one line.
3.) Replace every BEVERLY with \nBEVERLY (that is, put a newline in front of the name). Do the same for the other characters.

You can now import this into Excel and use the sort function, or look for one of the text-sorting plugins for Notepad++.

http://milospjanic.blogspot.com/2011/05/sorting-lines-in-notepad.html

Once you finish sorting them, you can unwrap the lines by replacing SPORK with a newline (\n).
posted by nickggully at 8:19 PM on January 5, 2015
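
For anyone who would rather script those steps than click through them, here is a minimal Python 3 sketch of the same sentinel idea. The function name chunks_for and the sample text are made up for illustration, and the tab layout is assumed rather than copied from the real file. Instead of sorting, it simply keeps the chunks that start with the wanted name.

```python
import re

SENTINEL = "SPORK"  # assumption: a token that never occurs in the script


def chunks_for(script_text, names, wanted):
    """Return the chunks of script_text that start with the name `wanted`."""
    # Step 2: collapse the whole script onto one line.
    flat = script_text.replace("\n", SENTINEL)
    # Step 3: start a new line in front of every character name.
    pattern = "|".join(re.escape(name) for name in names)
    flat = re.sub("(" + pattern + ")", r"\n\1", flat)
    # Keep only the wanted character's chunks (this stands in for the
    # sorting step), then turn the sentinel back into real newlines.
    return [chunk.replace(SENTINEL, "\n")
            for chunk in flat.split("\n")
            if chunk.startswith(wanted)]


# Hypothetical sample, not taken from the real script file.
sample = "\t\t\t\t\tBEVERLY\n\t\t\tHello there.\n\t\t\t\t\tRIKER\n\t\t\tDoctor.\n"
for chunk in chunks_for(sample, ["BEVERLY", "RIKER"], "BEVERLY"):
    print(chunk)
```

One caveat with this approach: a name that appears inside a stage direction or inside someone else's dialogue also starts a new chunk, so the results may need a manual cleanup pass.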


I like a challenge. This works with your sample, at least (you can pipe input to the script or specify files on the command line):
#!/usr/bin/env python2

from fileinput import input
from re import match

wanted_character = "BEVERLY"

dialogue = False
out = ""
for line in input():
    # match a dialogue header (5 tabs)
    result = match(r"^\t{5}(\S.*?)( *)$", line)
    if result:
        character = result.group(1)
        dialogue = (character == wanted_character)
        if not dialogue and len(out) > 0:
            # if speaker changed, print what we've got and start over
            print out
            out = ""
    elif dialogue:
        # match a spoken line (3 tabs)
        result = match(r"^\t{3}(\S.*?)([ \t]*?)$", line)
        if result:
            # append this line to the dialogue, with a space
            if len(out) > 0:
                out += " "
            out += result.group(1)

# just in case the input ends in the middle of dialogue
if len(out) > 0:
    print out

posted by neckro23 at 8:33 PM on January 5, 2015
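
Since the question asked for a walk-through of the regex, here is a small Python 3 sketch that applies the two patterns from the script above to individual lines. The classify helper and the sample lines are hypothetical; only the patterns come from the script. Reading the header pattern left to right: ^ anchors at the start of the line, \t{5} requires exactly five tabs, (\S.*?) captures the text starting at the first non-whitespace character, and ( *)$ swallows any trailing spaces before the end of the line.

```python
import re

# The two patterns from the script above, unchanged.
HEADER = r"^\t{5}(\S.*?)( *)$"        # 5 tabs, then the speaker's name
SPOKEN = r"^\t{3}(\S.*?)([ \t]*?)$"   # 3 tabs, then a line of dialogue


def classify(line):
    """Return ("header", name), ("spoken", text) or ("other", None)."""
    result = re.match(HEADER, line)
    if result:
        return ("header", result.group(1))
    result = re.match(SPOKEN, line)
    if result:
        return ("spoken", result.group(1))
    return ("other", None)


# Hypothetical sample lines; the tab counts are assumed from the comments
# in the script above, not checked against the real file.
print(classify("\t\t\t\t\tBEVERLY  \n"))
print(classify("\t\t\tWould this be available in emerald green?\n"))
print(classify("\tShe turns to the counter.\n"))
```

Because \S has to match right after the counted tabs, a line with four tabs matches neither pattern, which is how the script tells headers, dialogue, and everything else apart.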


Oh, in case you don't know what to do with that:
  • Save it as crusher.py or whatever (make sure the #! is the first line)
  • chmod a+x crusher.py
  • ./crusher.py [input file] > [output file]

posted by neckro23 at 8:39 PM on January 5, 2015


Eek, I assumed you're on Mac for some reason. If you're on Windows it should still work, but you'll have to install Python 2 and run python crusher.py [input file] > [output file] instead.
posted by neckro23 at 8:41 PM on January 5, 2015


Best answer: Couldn't help myself. Here's an improved version.

- less redundant, more sensible logic
- can show more than one character's quotes
- does a substring match on character names (so "RIKER" matches "RIKER'S VOICE", etc.)
- optionally prints the character name in front of the quote (controlled by the prefix_speaker variable)

#!/usr/bin/env python2

from fileinput import input
from re import match

characters = ["BEVERLY", "WESLEY", "DATA", "GEORDI", "PICARD", "RIKER"]
prefix_speaker = True

out = speaker = ""
for line in input():
    # parse all lines beginning with at least one tab
    result = match(r"^(\t+)(\S.*?)\s*$", line)
    if not result:
        continue
    tabs = len(result.group(1))
    text = result.group(2)
    if tabs == 5:
        # dialogue header
        if speaker != text:
            # speaker changed, print what we've got and start over
            if len(out) > 0:
                if prefix_speaker:
                    print "%s: %s" % (speaker, out)
                else:
                    print out
            out = ""
            speaker = text
    elif tabs == 3 and any(c in speaker for c in characters):
        # spoken line
        # append this line to the dialogue, with a space
        if len(out) > 0:
            out += " "
        out += text
    else:
        # ignore all other lines
        pass

# just in case the input ends in the middle of dialogue
if len(out) > 0:
    if prefix_speaker:
        print "%s: %s" % (speaker, out)
    else:
        print out

posted by neckro23 at 7:57 AM on January 6, 2015 [1 favorite]
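
The original question asked for each character's lines in a separate text file, so here is a rough Python 3 sketch of the same parsing logic that collects speeches per speaker and can write one file per speaker. The names collect_quotes and write_quotes, the .txt file naming, and the sample lines are all assumptions for illustration, and the 5-tab/3-tab layout is taken from the script above rather than verified against other scripts.

```python
import os
import re
from collections import defaultdict

LINE = re.compile(r"^(\t+)(\S.*?)\s*$")  # same pattern as the script above


def collect_quotes(lines, characters):
    """Map each matching speaker name to a list of their speeches."""
    quotes = defaultdict(list)
    speaker = ""
    new_speech = False
    for line in lines:
        result = LINE.match(line)
        if not result:
            continue
        tabs = len(result.group(1))
        text = result.group(2)
        if tabs == 5:
            # dialogue header: a change of speaker starts a new speech
            if text != speaker:
                speaker = text
                new_speech = True
        elif tabs == 3 and any(c in speaker for c in characters):
            if new_speech:
                quotes[speaker].append(text)
                new_speech = False
            else:
                # continuation of the current speech, joined with a space
                quotes[speaker][-1] += " " + text
    return dict(quotes)


def write_quotes(quotes, directory="."):
    """Write one text file per speaker (hypothetical naming: NAME.txt)."""
    for speaker, speeches in quotes.items():
        safe = re.sub(r"[^A-Za-z0-9]+", "_", speaker)
        path = os.path.join(directory, safe + ".txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(speeches) + "\n")


# Made-up sample lines in the assumed layout, not the real script file.
sample = [
    "\t\t\t\t\tBEVERLY\n",
    "\t\t\t\"Useful\"? How and what,\n",
    "\t\t\tCommander?\n",
    "\t\t\t\t\tRIKER\n",
    "\t\t\tJust an idea.\n",
]
print(collect_quotes(sample, ["BEVERLY"]))
```

To run it on a real file, passing an open file object (or fileinput.input()) as lines should work, followed by a call to write_quotes on the result.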


Response by poster: Yes, thank you so much neckro23! I altered it just a bit to change python2 to python (I am on a Mac). This appears to give me the results I want, so this is wonderful.

Better yet, I can even understand what most of these lines of code do! I really appreciate the comments in the code.
posted by antonymous at 12:24 PM on January 6, 2015

