How to easily get multi-line passages from a text file in Bash?
July 23, 2021 10:48 PM
One of my common Bash scripting patterns is like this:
shuf < playingcards.txt | head -n 3
to "draw" 3 random playing cards from a text file with all the card names on separate lines. Is there a similarly simple way to pick three random multi-line passages from a text file? I am OK adding limited special formatting, like opening and closing a multi-line passage with lines of squigglies or something. Thanks!I would do this in Python (using the first format, i.e. passages separated by an extra newline).
#!/usr/bin/python
import random

with open("foo.txt") as f:
    text = f.read().strip()

passages = text.split("\n\n")
print("\n\n".join(random.sample(passages, 3)))

If some of the passages have empty lines in them then you'll need to add a unique string between each passage and change the argument to text.split to that string.
posted by caek at 11:36 PM on July 23, 2021
four ideas:
1. shuf's man page documents the flag "-z, --zero-terminated: end lines with 0 byte, not newline". so that suggests creating an input file where you separate each passage with a 0 byte (sketch at the end of this comment). that way lies madness, i suspect text editors will hate a file like that.
2. you could use an ordinary newline as a delimiter between passages, and use a different character to encode newlines within a passage. E.g. if you didn't need to use semicolons as semicolons within the text of a passage, you could use ; within a passage to mean newline. Then you could use your current bash construction to sample 3 random passages, then postprocess to replace the ;s with genuine newlines. E.g.
shuf < playingcards.txt | head -n 3 | sed -E "s/;/\n/g"

3. store each passage as a separate text file, then use shuf to pick three random file names, then print them
e.g. assuming a dir "passages" containing many files a.txt, b.txt, c.txt with a single passage inside each
find passages/ -type f -iname "*.txt" -exec shuf -n 3 -e {} + | xargs cat

4. throw python at it, as per caek's suggestion.
python variation: if you wish to descend into madness and be periodically frustrated by YAML syntax, encode your passages as a single structured YAML document and then use Python's PyYAML parsing library to parse it
### example contents of file "passages.yaml":
one:
- hello world
- this is an example passage
two:
- second example passage
three:
- nobody expects
- the
- third passage
four:
- now we're just
- showing
- off

### example contents of python3 script "blabber.py":
import yaml  # if you get an import error, run python3 -m pip install pyyaml
import sys
import random

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        passages_by_id = yaml.safe_load(f)
    ids = list(passages_by_id)
    chosen_ids = random.sample(ids, 3)
    for passage_id in chosen_ids:
        for line in passages_by_id[passage_id]:
            print(line)
        print()  # newline between passages

### usage example:
python3 blabber.py passages.yaml
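(a quick sketch of idea 1, assuming GNU shuf: printf writes the 0 bytes, and tr is one crude way to turn them back into newlines afterwards)

printf 'one\ntwo\0three\0four\nfive\nsix\0seven\0' > passages.bin
shuf -z -n 3 passages.bin | tr '\0' '\n'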
posted by are-coral-made at 11:45 PM on July 23, 2021
This works (sort of) for the first format:
gawk 'BEGIN{srand();} {RS="\n\n"; h[rand()]=$0} END {n=0; for (i in h) {print h[i]; n++; if (n==3) exit;}}' file

Seems to be picky about blanks in the first entry, because RS is only assigned after the first record has already been read with the default separator. Not all awks support multichar RS definitions.
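(A variation that assigns RS in the BEGIN block avoids the first-entry quirk, and paragraph mode, i.e. RS set to the empty string, works even in awks without multichar RS support. The for-in ordering still isn't truly random, so the amusement-only caveat applies:)

awk 'BEGIN{srand(); RS=""} {h[rand()]=$0} END{for (i in h) {print h[i]; print ""; if (++n==3) exit}}' file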
How random does this need to be? None of these I'd trust beyond my own amusement.
are-coral-made's YAML solution is perfect if you want to start out on a process of self-loathing.
posted by scruss at 12:04 AM on July 24, 2021
Ah! If shuf supports -z as are-coral-made notes then there is no need to maintain a null-separated file. You can keep your empty line separator and use sed or perl to replace the empty lines with nulls, which makes this a reasonably legible one-liner:
perl -p -e 's/^$/\x0/' foo.txt | shuf -z -n 3
(I chose perl rather than sed because macOS sed is ... problematic, but if you're on Linux or have GNU sed then sed works fine too.)
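(With GNU sed, that would look something like this; the \x0 escape is a GNU extension:)

sed 's/^$/\x0/' foo.txt | shuf -z -n 3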
posted by caek at 12:06 AM on July 24, 2021
If you don't mind separating the multiline sequences with %, and only need one at a time (or don't mind if your n random entries contain a duplicate), the 40+ year old unix "utility", fortune, does exactly what you want.
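(Roughly, and with made-up filenames: run strfile once over the %-separated file to build the index fortune wants, then each fortune invocation prints one random passage:)

strfile passages.txt       # creates the passages.txt.dat index
fortune ./passages.txt     # prints one random %-separated passage
fortune ./passages.txt     # run again for another (duplicates possible)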
posted by aubilenon at 12:35 AM on July 24, 2021
Lots of Gnu tools support the -z option to allow records to be separated by NUL instead of newline. So if the input format looks like
### Item 1
- Details
- Details
- More Details
### Item 2
- Details
- Details
- More Details

...and items are strictly defined to extend from a line beginning with ### to either EOF or just before the next line beginning with ###, the first step I'd go with is translating input in that format to a stream of NUL-separated items. Easily done with sed:
sed '/^###/s/^/\x0/' /path/to/input

This gets most of the way there but because what we have is item headers and what we want is item separators, we also get a spurious empty item right before the first inserted NUL delimiter. Delete that with tail:
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2

Now let shuf do its thing (note also the use of shuf's -n option, which has the same effect as piping its output through head):
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2 | shuf -zn3

Finally, strip the NULs back out again:
sed '/^###/s/^/\x0/' /path/to/input | tail -zn+2 | shuf -zn3 | tr -d '\0'

That's about as terse as I can make it.
posted by flabdablet at 12:51 AM on July 24, 2021
$ cat foo.txt
one
two
three

four
five

six
seven
eight

nine

ten
$
$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt
four
five
$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt
six
seven
eight
$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt
nine
$ perl -000 -ne '$x=$_ if rand(1)<1/$.; END{print $x}' < foo.txt
one
two
three

You could also look up the fortune and strfile commands to make a '%'-separated file, a format that is rather common:
A random fortune
%
Yet another random fortune
%
etc fortune goes here
%
Here's the Raku Text::Fortune module; I have opinions on picking random chunks from files. :)
posted by zengargoyle at 12:52 AM on July 24, 2021
If you're willing to use yaml (and your second example is valid yaml) then look for a yaml descendant of jq, such as this yq. Parsing structured text properly is tricky in bash itself, so you'll be searching for a helper of some variety. Awk is almost certainly capable of it as well, if your separator is \n\n.
I would probably count the sections with one round of said tool, pick a random number in the range, and use the second round to extract it.
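(A sketch of that count-then-extract approach using awk's paragraph mode rather than yq, assuming blank-line separators and bash's $RANDOM, which is fine for amusement but not much more:)

n=$(awk 'BEGIN{RS=""} END{print NR}' foo.txt)            # count sections
i=$(( RANDOM % n + 1 ))                                  # pick one at random
awk -v i="$i" 'BEGIN{RS=""; ORS="\n\n"} NR==i' foo.txt   # extract it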
posted by How much is that froggie in the window at 1:04 AM on July 24, 2021
Stepwise refinement of pipelines involving NUL-separated records is made easier by tacking cat -v onto the end of the pipeline until you know it works right; it shows all the NULs in its input stream as ^@.
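(For instance:)

printf 'one\0two\0' | cat -v    # prints: one^@two^@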
posted by flabdablet at 1:12 AM on July 24, 2021
Oh, the beauty of the Perl solution above is that it's a base case of Reservoir sampling.
It has nice properties. Probably my favorite bit of code.
$ echo -ne "one\ntwo\0three\nfour\nfive\0six\nseven\0" > bar.txt
$ perl -0 -ne '$x=$_ if rand(1)<1/$.; END{print "$x\n"}' < bar.txt
six
seven
$ perl -0 -ne '$x=$_ if rand(1)<1/$.; END{print "$x\n"}' < bar.txt
three
four
five
$
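(The same n=1 reservoir in plain bash, sketched over lines rather than paragraphs just to show the shape of the algorithm: keep the i-th record with probability 1/i. $RANDOM has a little modulo bias:)

n=0
while IFS= read -r line; do
  n=$((n + 1))
  # replace the current pick with probability 1/n
  if (( RANDOM % n == 0 )); then chosen=$line; fi
done < foo.txt
printf '%s\n' "$chosen"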
posted by zengargoyle at 1:29 AM on July 24, 2021
Pretty sure Gnu shuf also uses reservoir sampling internally. The shuf | head -n $COUNT idiom represents a really common use case, and it seems highly likely to me that extending shuf to implement that use case itself via shuf -n $COUNT would have been motivated by the opportunity for a substantial efficiency win from doing so.
posted by flabdablet at 3:54 AM on July 24, 2021
If the text is to be in % separated paragraph format, developing a suitable pipeline for that is fairly straightforward too. Let's make some sample text:
stephen@jellynail:/tmp$ cat <<eof >text
apple
banana
%
catalog
dormant
%
eagle
fruit
goose
%
hat
%
icicle
eof

First thing the pipeline will need to do is convert every % separator line into a single NUL without a trailing newline. We do that by skipping the % line and then inserting a NUL at the front of the next one, using cat -v to check that it's working:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | cat -v
apple
banana
^@catalog
dormant
^@eagle
fruit
goose
^@hat
^@icicle

Feed that through shuf a few times to make sure it's doing what we want:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | cat -v
catalog
dormant
^@icicle
^@eagle
fruit
goose
^@stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | cat -v
icicle
^@eagle
fruit
goose
^@apple
banana
^@stephen@jellynail:/tmp$

Since shuf -z seems to be treating NUL as an output record terminator rather than a separator, there will always be a newline followed by a NUL at the end of its output that we don't need. Drop that, then convert all remaining NULs back to % plus newline paragraph separators:
stephen@jellynail:/tmp$ sed -n '/^%$/{n;s/^/\x0/};p' text | shuf -zn3 | sed '$d;s/\x0/%\n/g' | cat -v
icicle
%
apple
banana
%
eagle
fruit
goose

Dropping the cat -v debug stage, the final pipeline then becomes
sed -n '/^%$/{n;s/^/\x0/};p' /path/to/input | shuf -zn3 | sed '$d;s/\x0/%\n/g'
posted by flabdablet at 5:04 AM on July 24, 2021
Alternatively, perl -p includes the trailing newline in the $_ line buffer variable, so it's tidier to use than sed for the first pipeline step:
</path/to/input perl -pe 's/^%\n/\0/' | shuf -zn3 | sed '$d;s/\x0/%\n/g'

If you're stuck with a version of sed that doesn't understand the \xNN syntax for non-printing characters, you might want to use perl for the restoring conversion as well. Probably easiest to split the last-line deletion out into its own pipeline stage in that instance:
</path/to/input perl -pe 's/^%\n/\0/' | shuf -zn3 | head -n-1 | perl -pe 's/\0/%\n/g'
posted by flabdablet at 5:35 AM on July 24, 2021
This general technique - a cooking pass that augments complicated-to-parse separators with single non-printing characters, followed by operations that use those simple characters to delimit item boundaries and/or identify item types, followed by an un-cooking pass that strips the control characters out again - is one worth bearing in mind.
You can use it to get quite close to processing HTML robustly with regexps, for example, when using a proper parser properly is going to invite worse failure modes than not. Not that I would ever advocate doing such a thing. Oh dearie me no.
posted by flabdablet at 5:57 AM on July 24, 2021
It doesn't fit your desired format, but the simplest thing that comes to mind: I'd reach for "\n" to encode newlines, and add an "echo -e" to decode them.
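(A sketch, assuming passages.txt holds one passage per line with literal \n sequences standing in for the real newlines:)

shuf -n 3 passages.txt | while IFS= read -r passage; do
  echo -e "$passage"   # -e expands the \n escapes back into newlines
  echo                 # blank line between passages
done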
posted by Pronoiac at 10:11 AM on July 24, 2021
Couldn't have asked for a cooler answer thread. Thanks everybody! I have already learned a lot from reading the responses, and there is a ton left to learn based on various references you've left as well.
posted by circular at 11:28 AM on July 24, 2021
aside:
> This general technique - a cooking pass that augments complicated-to-parse separators with single non-printing characters, followed by operations that use those simple characters to delimit item boundaries and/or identify item types, followed by an un-cooking pass
great point.
slightly more abstractly: let C denote cook, C^-1 denote inverse-cook, and S denote randomly sample; we're doing something akin to C passages | S | C^-1 as a left-to-right shell pipeline, or C^-1(S(C(passages))) as right-to-left algebraic function composition.
The J programming language has operators for this algebraic pattern: the concept of applying a function on something "under" some other invertible transformation.
So we could call this "random-sampling passages under cooking". Maybe we need the form of cooking that operates on collections of items, not one. "random-sampling passages under batch cooking".
(and imagine a world where we can splash J operators throughout our shell pipelines)
posted by are-coral-made at 2:12 PM on July 27, 2021
This thread is closed to new comments.