Pray, how is Brittles? PAD OSC
July 19, 2022 11:21 AM   Subscribe

I am writing a script that combines excerpts from various .txt files that were obtained via Project Gutenberg into a compilation .txt file. There's some kind of a problem... sometimes I see a lot of what I think are Unicode control characters in an excerpt, like PAD, OSC, SGCI, etc. Is there a quick way, or a Linux command-line tool, to filter these somehow? I'm using ABS for scripting.

The script gets the files kinda randomly, like: booklist = `find "$srcdir" -type f -iname "*.txt" | shuf | head -n 10`

Then I use a .lines() function to turn that into an array, and I iterate over it, picking a random place from which to start an excerpt of some length. It's written to STDOUT (I think that's what it's called? Straight to terminal when I run it there) and my cron job is like `myscript.abs > excerpts.txt`.

So I'm wondering if I should insert a command somewhere to filter the data through (my imaginary example: sudo apt install asciifier).

Any tips on this unicode stuff would be appreciated! The control codes are displayed as inverted-color icon-blocks in my editor (Geany) and it's annoying to read with them in there. Not all of the excerpts have these, and I know some of the PG .txt files are ASCII to start with.
posted by circular to Computers & Internet (26 answers total)
Best answer: If you can provide links to some of the text files concerned, I'll have a squiz and see what I think the least-effort way to clean them up is going to involve.
posted by flabdablet at 12:25 PM on July 19, 2022 [2 favorites]

Best answer: seconding flabdablet

You're using a slightly unusual workflow, and it might be something in there that's causing the problem.
posted by scruss at 12:26 PM on July 19, 2022

Best answer: As a complete aside, shuf -n 10 is functionally equivalent to shuf | head -n 10 and always does less work because it's not required to shuffle an entire list.
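A quick way to convince yourself the two forms agree (seq here is just a stand-in for the find output):

```shell
# Both pick 3 random lines from 100, but shuf -n only ever
# keeps a 3-line reservoir in memory instead of holding and
# shuffling all 100 lines before head discards most of them.
seq 100 | shuf | head -n 3
seq 100 | shuf -n 3
```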
posted by flabdablet at 12:29 PM on July 19, 2022 [1 favorite]

Response by poster: Here are some examples:

Art of War - Text file from this page:

"Hence a wise general makes a point of foraging on the enemy. One
cartload of the enemy’[PAD][SGCI]s provisions is equivalent to twenty of one’[PAD][SGCI]s"

This also happens with The Great Gatsby's txt file: Link

No problems so far with:

A Christmas Carol

The Sky Detectives

> As a complete aside, shuf -n 10 is functionally equivalent to shuf | head -n 10 and always does less work because it's not required to shuffle an entire list.

What...! Nice to read that if true. But it definitely shuffles all the books, not just the first 10 found by find, is that right?
posted by circular at 12:55 PM on July 19, 2022

Best answer: It's your workflow. Here's the paragraph, pasted from The Art of War:
15. Hence a wise general makes a point of foraging on the enemy. One cartload of the enemy’s provisions is equivalent to twenty of one’s own, and likewise a single picul of his provender is equivalent to twenty from one’s own store.
The "enemy’[PAD][SGCI]s" bit includes a ’ (U+2019, "RIGHT SINGLE QUOTATION MARK") character. Either set Geany (which is kind of more of a programmer's IDE than a text editor) to use UTF-8, or use an editor that does. I'm particularly fond of micro.
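You can see exactly where those PAD and SGCI blocks come from by dumping the bytes of the curly apostrophe (xxd ships with vim):

```shell
# U+2019 is three bytes in UTF-8: e2 80 99. Misread as
# Latin-1, 0xe2 displays as "â" while 0x80 and 0x99 are the
# C1 control characters PAD and SGCI -- the inverted blocks.
printf '’' | xxd -p
# e28099
```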
posted by scruss at 1:03 PM on July 19, 2022 [1 favorite]

Response by poster: > Geany to use UTF-8

First--thank you.

Using Document > Set Encoding > Unicode > UTF-8 in Geany doesn't change anything. Not sure why. So I tried some more editors on my system:

SciTE: Shows little code boxes with numbers in them, instead of PAD, etc.

Graviton: Shows two red middle-dots instead of code boxes. (?)

VSCodium: Shows the correct graphical characters instead of e.g. SGCI, though parts of the text like "â" are still there. I wonder if this "â" is just an artifact of the txt conversion process that was never really worried about?

Micro: Damn, this really does a nice job! Wow. There are no "â" characters, either.

Edit: Vim in xfce4-terminal: (Alice's Adventures in Wonderland) "â<8><9>When _Iâ<8><9>m_ a Duchess,â<8><9> she said to herself, (not in a very hopeful..." lol

(The bad news is, I have almost 100 snippets configured with Geany, I have a bunch of system-level keyboard shortcuts assigned to interoperate with it, and I know it like the back of my hand...)

Edit: BTW, if I add some markdown formatting things like little gt-brackets for blockquotes, should I expect Pandoc to be able to convert this stuff to e.g. HTML in my own style (rather than PG HTML)? Seems reasonable but I thought I'd ask if there was any reason why not, since some editors clearly aren't up to those "special" characters.
posted by circular at 1:15 PM on July 19, 2022

Response by poster: OK, one more important update. I found this in Geany:

File > Reload As > Unicode > UTF-8

And poof, no more code icons anywhere, symbols seem OK, and in fact I can't reproduce the problem anymore, even if I regenerate 10 excerpts randomly 10 more times from 400+ books. And even if I close and re-open the file.

So I wonder if the first time I opened the file, Geany didn't do anything about those weird codes, but then my fiddling with Unicode settings for this file became some kind of permanent fix for Geany's interpretation of the file.

Weird! But I'm more OK with this particular weirdness.
posted by circular at 1:40 PM on July 19, 2022 [2 favorites]

Best answer: If you're seeing things like "I â™¡ UTF-8" instead of "I ♡ UTF-8" in places in your workflow, check your LANG environment variable. If it doesn't include UTF-8 (mine is en_CA.UTF-8), then you probably want to make sure it does. On Debian-like systems, that's done using dpkg-reconfigure locales
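One quick way to sanity-check that from a script (just a sketch; the glob patterns cover the common spellings of the locale suffix):

```shell
# Warn if the active locale doesn't appear to be UTF-8-capable
case "$LANG" in
  *UTF-8*|*utf8*) echo "UTF-8 locale: $LANG" ;;
  *)              echo "WARNING: non-UTF-8 locale: $LANG" ;;
esac
```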
posted by scruss at 2:25 PM on July 19, 2022

Response by poster: > check your LANG environment variable

Good one, thank you. It looks like en_US.UTF-8 here.

Oh and my Geany problem with those weird codes is back! xD The tools and reloads, they do nothing! This started when the editor offered to reload the file (it changes based on a cron job) but didn't do anything when I clicked "Reload", so I closed manually, and opened again via File > Open. Boom, weird codes came back, nothing works to remove them.

I am thinking I might reach out to the developers and see what they say. TBH I don't really understand what I'm doing with the Reload As feature anyway.

In the meantime it's a good opportunity to learn more about Micro, I guess.
posted by circular at 2:38 PM on July 19, 2022

Response by poster: I didn't anticipate jumping in here with comments so often...sigh. Anyway, I looked through the Geany docs and it turns out it accepts an encoding directive at the top of the file.

In anticipation of using Pandoc, I've wrapped it in some HTML-style comments I looked up, included the requisite space on either side of the directive, and parked this in the first three lines:

[HTML comment open, with three dashes, stripped by MeFi]
[HTML comment close, with two dashes, stripped by MeFi]

For now this seems to work, fingers crossed.

Edit: I realized why it won't reload the file when it's been changed! The status bar says: "The file (excerpts.txt) is not valid UTF-8." Wonderful. No idea why...
posted by circular at 2:54 PM on July 19, 2022

Best answer: Plain text is anything but and it may well be that Geany has an option set somewhere for which encoding to assume by default when opening a file. Explicitly set per-application preferences quite often get priority over the hints set by environment variables.

it definitely shuffles all the books, not just the first 10 found by find, is that right?

The way pipelines work is that all the commands in the pipeline run as separate processes in parallel, each one consuming input as it becomes available, until it's either terminated by the shell process that launched the pipeline or sees an end-of-file arrive on its input stream and (usually) terminates itself.

The pipes themselves have a certain amount of buffering built in (on very early Unices this was one memory-management page of 4096 bytes; on modern Linux the default is 64 KiB) and when any of the processes in a pipeline fills the buffer on its output pipe, it stalls until the downstream process has consumed some of that to make more room and then picks up where it left off.

The job of shuf is to emit a randomly permuted reordering of all the lines of its input file, so when used in a pipeline the only way it will terminate, if not forcibly killed, is after reading its input all the way to an EOF. And it won't even begin to produce output until it has read all that input, because the very last available input line has to be exactly as much a candidate for being the very first output line as any other.

So when the shell starts the pipeline find "$srcdir" -type f -iname "*.txt" | shuf | head -n 10 it launches three processes (find, shuf and head) of which shuf and head immediately stall, waiting for input from the pipes attached to their respective standard inputs.

As find finds things, it writes them to its output pipe. That's also shuf's input pipe, so find and shuf will run in parallel for a while until find has found everything it's been asked to look for, at which point it will close its end of its output pipe and terminate itself.

Meanwhile, shuf has been collecting all those input lines into a big internal buffer of some kind. Once it sees EOF on its input, it will start selecting lines from that buffer at random, writing them to its standard output and deleting them from the buffer as it goes. If nothing stops it, it would keep on doing this until it had emptied its internal buffer, at which point it would close its own output pipe and terminate itself.

As soon as shuf begins to write lines to its output pipe, head will wake up and begin to consume them. After writing ten of them to its own output, head will close its input pipe and terminate. When that happens, its parent process (the shell that launched the pipeline in the first place) will see that the rightmost pipeline component has terminated, and forcibly terminate any of the remaining pipeline processes that haven't already terminated themselves. In this particular case, it might or might not end up forcibly terminating shuf because shuf might or might not have been able to stuff the entirety of what it was ever going to emit into its output pipe's buffer before head terminated (on a really old Unix with teeny tiny pipe buffers, there's less chance of that).

Using find "$srcdir" -type f -iname "*.txt" | shuf -n 10 instead doesn't change the relationship between find and shuf in any way so yes, shuf will still get the whole list to work with. The difference is that instead of continuing to emit randomly selected lines until its output pipe fills and/or the invoking shell kills it because a downstream partner is all done, shuf itself will quit after emitting ten output lines.
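That early-termination behaviour is easy to observe with a pipeline whose upstream member would never stop on its own:

```shell
# yes emits "y" forever; head quits after 5 lines, its input
# pipe closes, and yes is killed by SIGPIPE on a later write.
yes | head -n 5 | wc -l
# 5
```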
posted by flabdablet at 2:55 PM on July 19, 2022 [1 favorite]

Best answer: geany_encoding=UTF-8

Ugh. Glad it works for you, but seriously, ugh. This is 2022. Everything should just be UTF-8 or gtfo by now. Sigh.

[HTML comment open, with three dashes, stripped by MeFi]

The MeFi text entry facility always treats a left angle bracket as if it were the lead-in for an HTML tag, and it completely strips anything that looks tag-like but isn't one of the restricted set of tags it allows.

So any < that needs to appear as-is in a MeFi post or comment has to be entered using the HTML entity representation &lt; instead. Similarly, any & needs to be entered as &amp; (don't forget the trailing semicolons on these or they won't work).

If you process your comment with the Preview button before posting it, you'll see that Metafilter also replaces every > with &gt; even though there's no HTML-based need to do so.

Plain text really isn't. Not any more. I mean it never really was, but this modern world provides more ways for it not to be than we've ever seen before.
posted by flabdablet at 3:08 PM on July 19, 2022 [1 favorite]

Best answer: The status bar says: "The file (excerpts.txt) is not valid UTF-8." Wonderful. No idea why...

This can happen when text encoded as UTF-8 is processed by tools that work with it as a stream of bytes rather than a stream of characters. Sometimes a UTF-8-encoded character ends up with some of its bytes removed, and UTF-8 being a format with a certain amount of redundancy inbuilt to allow for detection of invalid sequences, the decoder will pick that up.

Another possible cause is assembly of what is putatively a "plain text" file from multiple sources that are not all UTF-8 encoded. There are certain byte values that are perfectly legitimate inside Windows-1252 encoded text, for example, that are illegal in UTF-8.
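You can demonstrate that with iconv used as a pure validator (a UTF-8-to-UTF-8 conversion fails at the first illegal byte). The octal escapes \223 and \224 below are 0x93/0x94, which are curly double quotes in Windows-1252 but can only be continuation bytes in UTF-8, never stand alone:

```shell
# A Windows-1252 curly-quoted string fails UTF-8 validation
printf 'He said \223hi\224\n' |
  iconv --from-code=utf-8 --to-code=utf-8 >/dev/null 2>&1 ||
  echo "not valid UTF-8"
# not valid UTF-8
```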
posted by flabdablet at 4:22 PM on July 19, 2022 [1 favorite]

Best answer: If it turns out that your issue is rooted in the cut-and-paste origins of your excerpts file, you might want to make sure the sources you're pulling content from are indeed UTF-8 encoded before you do so.

In shell script you could use enc=$(file --brief --mime-encoding "$source") to set variable enc to a reasonable guess at the encoding used in a source text, then iconv --from-code="$enc" --to-code=utf-8 "$source" to output a UTF-8 encoded version of it. There's probably some tidy way to express the same workflow in ABS as well.
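Put together as a shell loop, that might look like the following sketch. The directory names are made up, the sample file is only there to make the demo self-contained, and file's encoding guess is exactly that, a guess:

```shell
# Sketch: re-encode every .txt under $srcdir into UTF-8
# copies under $outdir (both directory names are assumptions).
srcdir=books
outdir=books-utf8
mkdir -p "$srcdir" "$outdir"
printf 'caf\351\n' > "$srcdir/old.txt"   # 0xE9: Latin-1/Windows-1252 "é"
find "$srcdir" -type f -iname '*.txt' | while IFS= read -r f; do
  enc=$(file --brief --mime-encoding "$f")
  iconv --from-code="$enc" --to-code=utf-8 "$f" > "$outdir/$(basename "$f")"
done
```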
posted by flabdablet at 4:40 PM on July 19, 2022 [1 favorite]

Best answer: And of course if you have reliable information about a particular source's encoding, you should stick that into enc explicitly instead of letting file guess it for you. iconv --list will give you a list of the encoding names it understands, of which I believe the MIME encoding names that file can emit to be a subset.
posted by flabdablet at 4:52 PM on July 19, 2022 [1 favorite]

Response by poster: That's really fascinating, flabdablet. I had a realization/hunch so I did a quick for-loop in bash to check the mime-encoding on the PG text files. Every single one came up as UTF-8.

But, when I first built the script, I decided to let the find command be a little bit leaky and so $srcdir is like /books/various/ instead of /books/various/pg-files-to-play-with/.

As a result it was picking up and pasting in txt files from pre-2002, stuff I used to cart around on my HP Jornada and read in uBooks for example.

So I grabbed a copy of the file and renamed it "specimen.txt" when the editor complained. Mousepad also complained--not valid UTF-8.

Some interesting old txt files in there. "file --brief --mime-encoding suspected-issue-text-file.txt" yielded: "binary".

lol. So I think if I can update my filter to just these PG texts, well, maybe that should do it. And I want to play with iconv too.
posted by circular at 5:07 PM on July 19, 2022

Best answer: txt files from pre-2002, stuff I used to cart around on my HP Jornada

are almost certainly Windows-1252 encoded. Try comparing the output of less identifies-as-binary.txt with that of iconv --from-code=windows-1252 --to-code=utf-8 identifies-as-binary.txt | less and see what you get.
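A self-contained demo of that conversion, using bytes that can't be mistaken for UTF-8 (octal \223/\224 are 0x93/0x94, Windows-1252 curly double quotes):

```shell
# Invalid as UTF-8, but iconv maps them to the right characters
printf '\223Hello\224\n' | iconv --from-code=windows-1252 --to-code=utf-8
# “Hello”
```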

If file yields nonsense results, just assuming windows-1252 is probably a reasonable fallback provided the file is actually some form of text.

Another slightly less likely possibility is that the file is a Windows "Unicode" text file. These are actually encoded as UCS-2, which is a 16 bits per character standard; open one of those with less and you'll get warned that it might be a binary file, and if you go ahead and open it anyway you'll see every other byte show up as ^@. Again, iconv will make clean UTF-8 out of those with no problem.
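A round-trip sketch of that case, with UTF-16LE standing in for the Windows "Unicode" format (real files from Windows usually also carry a 2-byte byte-order mark up front, which this demo omits):

```shell
# Make a 16-bit little-endian "Unicode" file, then recover it
printf 'hi\n' | iconv --from-code=utf-8 --to-code=utf-16le > win-unicode.txt
xxd -p win-unicode.txt       # 680069000a00: every other byte is NUL
iconv --from-code=utf-16le --to-code=utf-8 win-unicode.txt
# hi
```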
posted by flabdablet at 5:22 PM on July 19, 2022 [1 favorite]

Best answer: I would suspect that `shuf -n X` uses Reservoir sampling - Wikipedia. So while it does have to see each input line in order to give it a fair chance of being an output line... it only ever has to keep 'X' lines in memory and just shuffles those for output when the input is done.

Totally one of my favorite algorithms.
posted by zengargoyle at 9:06 PM on July 19, 2022 [2 favorites]

Best answer: no need to suspect, zengargoyle: it does - shuf.c\src - coreutils.git - GNU coreutils - line 170
posted by scruss at 7:41 AM on July 20, 2022

Response by poster: Wow I learned a lot here and really appreciate all the help, tips, and additional links & insights. Definitely got it all taken care of now. Thanks everybody.
posted by circular at 10:51 AM on July 20, 2022 [1 favorite]

I'm curious to know what encoding the file that file identified as binary is actually using.
posted by flabdablet at 11:32 AM on July 20, 2022

Response by poster: > I'm curious to know what encoding the file that file identified as binary is actually using.

It's interesting, there's no difference in less results before vs. after converting to utf-8. However less and w3m both show a ^Z at the end of the file, and Mousepad shows a U+001A character there, a little box with 00 / 1A in it.
posted by circular at 7:24 PM on July 20, 2022

Substitute character - Wikipedia. Yep, it's a Ctrl-Z.

I would guess Mousepad is either just using the 'control character font' to display the character and that font is designed with the unicode codepoint displayed in full, or it's actually working internally with utf-16/ucs-2 and just reading utf-8 in, editing in ucs-2, and then writing back out utf-8. I think a lot of editors do something like that to make the size-of-character-in-memory be the same for all characters to make it easier to work with vs having in memory variable number of bytes per character.

You could probably google up a 'remove control characters from utf-8 files' and find a `sed` one-liner or something. But you'd have to be careful to not strip out all of them... tab, cr, lf, ff, etc. There are probably tables somewhere of the non-printable/visual control characters.
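One hedged possibility along those lines, using tr with octal ranges rather than sed (the keep-list here is tab, LF and CR; adjust to taste). Since it only touches bytes below 0x80, it's safe to run on valid UTF-8 text, whose multibyte sequences never contain those bytes:

```shell
# Delete C0 controls and DEL, keeping tab (011), LF (012), CR (015)
printf 'clean\001text\032\n' | tr -d '\000-\010\013\014\016-\037\177'
# cleantext
```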

Futzing with old files and old encodings is a total PITA and just takes a bit of practice. I usually look at them with `xxd` (comes with vim IIRC).
$ xxd guitar.txt 
00000000: 4720 2020 3a20 2047 3220 2042 3220 2044  G   :  G2  B2  D
00000010: 330a 4320 2020 3a20 2043 3320 2045 3320  3.C   :  C3  E3 
00000020: 2047 340a 4420 2020 3a20 2044 3320 2046   G4.D   :  D3  F
00000030: 2334 2041 330a 4620 2020 3a20 2046 3320  #4 A3.F   :  F3 
00000040: 2041 3320 2043 340a 4120 2020 3a20 2041   A3  C4.A   :  A
00000050: 3320 2043 2334 2045 340a 4520 2020 3a20  3  C#4 E4.E   : 
00000060: 2045 3220 2047 2333 2042 330a 456d 2020   E2  G#3 B3.Em  
00000070: 3a20 2045 3220 2047 3320 2042 330a       :  E2  G3  B3.
You eventually learn to tell all of the unicode apart by looking at the hex digits.
posted by zengargoyle at 10:37 PM on July 20, 2022

A note on Mousepad, which has caused all sorts of issues for users of Raspberry Pi OS: its "I can't recognize this encoding" dialogue usually highlights the first available encoding, not one that's anywhere near correct. But most users don't know what to select, so they stick with whatever Mousepad highlights. Trying to unpick a file that Mousepad had decided was ISO 8859-11 (Latin/Thai) was no fun.
posted by scruss at 7:28 AM on July 21, 2022

Ctrl-Z as the very last byte in a text file is a blast from the past! The file in question probably originated on a CP/M system.

CP/M's disk filesystem didn't include file lengths as metadata; the closest you could get was counting how many 128-byte disk sectors a file had been allocated. To let application programs work out how much of a text file's last sector was actually in use, CP/M adopted the convention of using ctrl-Z as an end-of-file marker.
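If one of those old files turns up, the marker is easy to spot and strip (a sketch; cpm.txt is a made-up name, and tr removes every ctrl-Z, not just a trailing one, which is fine for text):

```shell
printf 'THE END\032' > cpm.txt       # trailing 0x1A, CP/M-style
tail -c 1 cpm.txt | xxd -p           # 1a
tr -d '\032' < cpm.txt > clean.txt   # drop the EOF marker
```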

That convention was partly inherited by MS-DOS, and that inheritance persists in Windows to this day. If you open a CMD window and enter copy con test.txt then the console window will let you enter as much text as you want; press ctrl-Z when you're done and you'll find everything you entered inside test.txt. The very same command did the very same thing on CP/M, except that CP/M would include the terminating ctrl-Z at the end of the copied text while DOS and Windows don't.
posted by flabdablet at 5:24 PM on July 21, 2022 [1 favorite]

Sorry, that was inaccurate. CP/M had no inbuilt COPY command; the CP/M command that still works in Windows is TYPE.

TYPE CON: works the same in both systems, echoing console input back to the display one line at a time.

The CP/M equivalent of COPY CON: TEST.TXT would have been PIP TEST.TXT=CON: and PIP was not built into the command interpreter but was a transient command that needed to be loaded from disk.
posted by flabdablet at 11:29 AM on July 22, 2022 [1 favorite]
