How to easily get multi-line passages from a text file in Bash?
July 23, 2021 10:48 PM

One of my common Bash scripting patterns is like this: shuf < playingcards.txt | head -n 3 to "draw" 3 random playing cards from a text file with all the card names on separate lines. Is there a similarly simple way to pick three random multi-line passages from a text file? I am OK adding limited special formatting, like opening and closing a multi-line passage with lines of squigglies or something. Thanks!
posted by circular to Computers & Internet (18 answers total)
I just realized I should probably add a desired format for the text file:

### Item 1
- Details
- Details
- More Details

### Item 2
- Details
- Details
- More Details


Or...
- Item 1
    - Details
    - Details
    - More Details
- Item 2
    - Details
    - Details
    - More Details

posted by circular at 11:15 PM on July 23, 2021


I would do this in Python (using the first format, i.e. passages separated by an extra newline).
#!/usr/bin/python
import random

with open("foo.txt") as f:
    text = f.read().strip()

passages = text.split("\n\n")

print("\n\n".join(random.sample(passages, 3)))
If some of the passages have empty lines in them, then you'll need to add a unique string between each passage and change the argument to text.split to that string.
posted by caek at 11:36 PM on July 23, 2021


four ideas:

1. shuf's man page documents the flag -z, --zero-terminated ("end lines with 0 byte, not newline"). so that suggests creating an input file where you separate each passage with a 0 byte. that way lies madness: i suspect text editors will hate a file like that.

2. you could use an ordinary newline as a delimiter between passages, and use a different character to encode newlines within a passage. E.g. if you didn't need to use semicolons as semicolons within the text of a passage, you could use ; within a passage to mean newline. Then you could use your current bash construction to sample 3 random passages, then postprocess to replace the ;s with genuine newlines. E.g.
shuf < playingcards.txt | head -n 3 | sed -E "s/;/\n/g"
3. store each passage as a separate text file, then use shuf to pick three random file names, then print them

e.g. assuming a dir "passages" containing many files a.txt, b.txt, c.txt with a single passage inside each
find passages/ -type f -iname "*.txt" -exec shuf -n 3 -e {} + | xargs cat
4. throw python at it, as per caek's suggestion.

python variation: if you wish to descend into madness and be periodically frustrated by YAML syntax, encode your passages as a single structured YAML document and then use Python's PyYAML library to parse it

### example contents of file "passages.yaml":
one:
  - hello world
  - this is an example passage
two:
  - second example passage
three:
  - nobody expects
  - the
  - third passage
four:
  - now we're just
  - showing
  - off
### example contents of python3 script "blabber.py":
import yaml # if you get an import error, run python3 -m pip install pyyaml
import sys
import random

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        passages_by_id = yaml.safe_load(f)

    ids = list(passages_by_id)
    chosen_ids = random.sample(ids, 3)
    for passage_id in chosen_ids:
        for line in passages_by_id[passage_id]:
            print(line)
        print() # newline between passages
### usage example:
python3 blabber.py passages.yaml

posted by are-coral-made at 11:45 PM on July 23, 2021


This works (sort of) for the first format:
gawk 'BEGIN{srand(); RS="\n\n"} {h[rand()]=$0} END {n=0; for (i in h) {print h[i]; n++; if (n==3) exit;}}' file
Setting RS inside the main block instead of in BEGIN makes it picky about blanks in the first entry, because the first record has already been read with the default newline separator by then. Not all awks support multichar RS definitions.

How random does this need to be? None of these I'd trust beyond my own amusement.

are-coral-made's YAML solution is perfect if you want to start out on a process of self-loathing.
posted by scruss at 12:04 AM on July 24, 2021


Ah! If shuf supports -z, as are-coral-made notes, then there is no need to maintain a NUL-separated file. You can keep your empty-line separator and use sed or perl to replace the empty lines with NULs, which makes this a reasonably legible one-liner:

perl -p -e 's/^$/\x0/' foo.txt | shuf -z -n 3

(I chose perl rather than sed because macOS sed is ... problematic, but if you're on Linux or have GNU sed then sed works fine too.)
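If the trailing NULs in shuf's output bother your terminal, a final tr pass strips them back out. A sketch:

perl -p -e 's/^$/\x0/' foo.txt | shuf -z -n 3 | tr -d '\0'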
posted by caek at 12:06 AM on July 24, 2021


If you don't mind separating the multiline sequences with %, and only need one at a time (or don't mind if your n random entries contain a duplicate), the 40+ year old Unix "utility" fortune does exactly what you want.
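A sketch of that route, assuming your %-separated passages live in passages.txt (fortune wants the index file that strfile builds):

strfile passages.txt    # writes passages.txt.dat, an index of the %-separated entries
fortune ./passages.txt  # prints one random passage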
posted by aubilenon at 12:35 AM on July 24, 2021


Lots of GNU tools support the -z option to allow records to be separated by NUL instead of newline. So if the input format looks like
### Item 1
- Details
- Details
- More Details

### Item 2
- Details
- Details
- More Details
...
and items are strictly defined to extend from a line beginning with ### to either EOF or just before the next line beginning with ###, the first step I'd go with is translating input in that format to a stream of NUL-separated items. Easily done with sed:
sed '/^###/s/^/\x0/' /path/to/input
This gets most of the way there but because what we have is item headers and what we want is item separators, we also get a spurious empty item right before the first inserted NUL delimiter. Delete that with tail:
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2
Now let shuf do its thing (note also the use of shuf's -n option, which has the same effect as piping its output through head):
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2 | shuf -zn3
Finally, strip the NULs back out again:
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2 | shuf -zn3 | tr -d '\0'
That's about as terse as I can make it.
posted by flabdablet at 12:51 AM on July 24, 2021



$ cat foo.txt
one
two
three

four
five

six
seven
eight

nine

ten
$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt 
four
five

$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt 
six
seven
eight

$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt 
nine

$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt 
one
two
three

You could also look up the fortune and strfile commands, which use a '%'-separated file format that is rather common.

A random fortune
%
Yet another
random
fortune
%
etc fortune
goes here
%

Here's the Raku Text::Fortune module; I have opinions on picking random chunks from files. :)

posted by zengargoyle at 12:52 AM on July 24, 2021


If you're willing to use YAML (and your second example is valid YAML), then look for a YAML descendant of jq, such as yq. Parsing structured text properly is tricky in bash itself, so you'll be searching for a helper of some variety. Awk is almost certainly capable of it as well, if your separator is \n\n.

I would probably count the sections with one round of said tool, pick a random number in the range, and use the second round to extract it.
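A sketch of that two-pass approach using awk's paragraph mode (RS="" splits records on blank lines; the file name passages.txt is assumed):

count=$(awk 'BEGIN{RS=""} END{print NR}' passages.txt)   # pass 1: count the sections
pick=$(( RANDOM % count + 1 ))                           # random section number in 1..count
awk -v n="$pick" 'BEGIN{RS=""} NR==n' passages.txt       # pass 2: print just that section

(bash's $RANDOM tops out at 32767 and the modulo is slightly biased, but for amusement-grade randomness that's fine.)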
posted by How much is that froggie in the window at 1:04 AM on July 24, 2021


Stepwise refinement of pipelines involving NUL-separated records is made easier by using cat -v, which will show all the NULs in its input stream as ^@, as the last component of the pipeline until you know it works right.
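For example:

$ printf 'one\0two\0' | cat -v
one^@two^@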
posted by flabdablet at 1:12 AM on July 24, 2021


Oh, the beauty of the Perl solution above is that it's the base case of reservoir sampling: each new record replaces the kept one with probability 1/$., which by induction leaves every record equally likely to be the one that survives.

It has nice properties. Probably my favorite bit of code.

$ echo -ne "one\ntwo\0three\nfour\nfive\0six\nseven\0" > bar.txt
$ perl -0 -ne '$x=$_ if rand(1)<1/$.; END{print "$x\n"}' < bar.txt 
six
seven
$ perl -0 -ne '$x=$_ if rand(1)<1/$.; END{print "$x\n"}' < bar.txt 
three
four
five
$ 
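Extending the base case from one record to three gives classic Algorithm R; a sketch in the same paragraph-mode style:

$ perl -000 -ne 'if ($. <= 3) { $r[$.-1] = $_ } else { $i = int(rand($.)); $r[$i] = $_ if $i < 3 } END { print @r }' < foo.txt

(each record after the third replaces a random reservoir slot with probability 3/$., which keeps every record equally likely to survive)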

posted by zengargoyle at 1:29 AM on July 24, 2021


Pretty sure GNU shuf also uses reservoir sampling internally. The shuf | head -n $COUNT idiom represents a really common use case, and it seems highly likely to me that shuf's own -n $COUNT option was added to cover that use case directly, motivated by the substantial efficiency win of not having to shuffle the whole input.
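Which means the original pattern from the question can be shortened to

shuf -n 3 playingcards.txt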
posted by flabdablet at 3:54 AM on July 24, 2021


If the text is to be in % separated paragraph format, developing a suitable pipeline for that is fairly straightforward too. Let's make some sample text:
stephen@jellynail:/tmp$ cat <<eof >text
apple
banana
%
catalog
dormant
%
eagle
fruit
goose
%
hat
%
icicle
eof
First thing the pipeline will need to do is convert every % separator line into a single NUL without a trailing newline. We do that by skipping the % line and then inserting a NUL at the front of the next one, using cat -v to check that it's working:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | cat -v
apple
banana
^@catalog
dormant
^@eagle
fruit
goose
^@hat
^@icicle
Feed that through shuf a few times to make sure it's doing what we want:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | cat -v
catalog
dormant
^@icicle
^@eagle
fruit
goose
^@stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | cat -v
icicle
^@eagle
fruit
goose
^@apple
banana
^@stephen@jellynail:/tmp$
Since shuf -z seems to be treating NUL as an output record terminator rather than a separator, there will always be a newline followed by a NUL at the end of its output that we don't need. Drop that, then convert all remaining NULs back to % plus newline paragraph separators:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | sed '$d;s/\x0/%\n/g' | cat -v
icicle
%
apple
banana
%
eagle
fruit
goose
Dropping the cat -v debug stage, the final pipeline then becomes
sed -n '/^%$/{n;s/^/\x0/};p' /path/to/input | shuf -zn3 | sed '$d;s/\x0/%\n/g'

posted by flabdablet at 5:04 AM on July 24, 2021


Alternatively, perl -p includes the trailing newline in the $_ line buffer variable, so it's tidier to use than sed for the first pipeline step:
</path/to/input perl -pe 's/^%\n/\0/' | shuf -zn3 | sed '$d;s/\x0/%\n/g'
If you're stuck with a version of sed that doesn't understand the \xNN syntax for non-printing characters, you might want to use perl for the restoring conversion as well. Probably easiest to split the last-line deletion out into its own pipeline stage in that instance:
</path/to/input perl -pe 's/^%\n/\0/' | shuf -zn3 | head -n-1 | perl -pe 's/\0/%\n/g'

posted by flabdablet at 5:35 AM on July 24, 2021


This general technique - a cooking pass that augments complicated-to-parse separators with single non-printing characters, followed by operations that use those simple characters to delimit item boundaries and/or identify item types, followed by an un-cooking pass that strips the control characters out again - is one worth bearing in mind.
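As a sketch, the %-separator pipeline above refactors neatly into named stages of exactly that shape:

cook()   { perl -pe 's/^%\n/\0/'; }              # cooking pass: % separator lines become NULs
sample() { shuf -zn3; }                          # operate on the NUL-delimited records
uncook() { head -n-1 | perl -pe 's/\0/%\n/g'; }  # un-cooking pass: restore the % separators
cook </path/to/input | sample | uncook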

You can use it to get quite close to processing HTML robustly with regexps, for example, in cases where using a proper parser properly would invite worse failure modes than not using one at all. Not that I would ever advocate doing such a thing. Oh dearie me no.
posted by flabdablet at 5:57 AM on July 24, 2021


It doesn't fit your desired format, but the simplest thing that comes to mind: I'd reach for "\n" to encode newlines, and add an "echo -e" to decode them.
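A sketch of that, with each passage kept on a single line and its internal newlines written as literal \n sequences (encoded.txt is a made-up name):

printf '%s\n' 'one\ntwo\nthree' 'four\nfive' > encoded.txt  # %s leaves the \n sequences literal
echo -e "$(shuf -n 1 encoded.txt)"                          # -e expands them back into real newlines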
posted by Pronoiac at 10:11 AM on July 24, 2021


Couldn't have asked for a cooler answer thread. Thanks everybody! I have already learned a lot from reading the responses, and there is a ton left to learn based on various references you've left as well.
posted by circular at 11:28 AM on July 24, 2021


aside:

> This general technique - a cooking pass that augments complicated-to-parse separators with single non-printing characters, followed by operations that use those simple characters to delimit item boundaries and/or identify item types, followed by an un-cooking pass

great point.

slightly more abstractly: let C denote cook, C^-1 denote inverse-cook, and S denote random sampling; we're doing something akin to C passages | S | C^-1 as a left-to-right shell pipeline, or C^-1(S(C(passages))) as right-to-left algebraic function composition.

The J programming language has operators for this algebraic pattern: the concept of applying a function on something "under" some other invertible transformation.

So we could call this "random-sampling passages under cooking". Maybe we need the form of cooking that operates on collections of items rather than single ones: "random-sampling passages under batch cooking".

(and imagine a world where we can splash J operators throughout our shell pipelines)
posted by are-coral-made at 2:12 PM on July 27, 2021


This thread is closed to new comments.