

What is this non-alphanumeric character?
October 12, 2006 2:20 PM

This 'bad' character is showing up as a box, and seems to invoke a new line. How can I find out what it really is?

Parsing my tab-delimited file is impossible with these darn special characters that keep showing up, and causing records to break over multiple lines!

Excel treats them as new lines, and my text editor treats each one as a new line plus a character that displays as a box (i.e., it's not in the current character set).

I need to strip this character out using PHP - how do I find out what it is really, and how do I then find the regexp equivalent? It's not a \r or a \n.

On preview, it got stripped when I tried to paste it in here.
posted by CaptApollo to Computers & Internet (9 answers total)
 
Vertical tab? (\v usually)
posted by pharm at 2:24 PM on October 12, 2006


Are you c/ping from Word? You should never ever do that.

Lots of characters do exactly that. The best way to find out what it is is to look at the original file or source of the data.
posted by shownomercy at 2:27 PM on October 12, 2006


Good guess, pharm, but no cigar.

shownomercy - gracious no! I am a webmaster, so that issue is the bane of my existence as it is (what part of 'type directly into the CMS/Dreamweaver/whatever' don't people understand??)!

This file is an exported database (from software called Office Tracker), and what seems to be happening is that it is concatenating several fields together into one tab-delimited field, using this weird character as glue.

There must be some way to determine what it is without just wildly guessing, right? Translate it to its character code somehow?
posted by CaptApollo at 2:36 PM on October 12, 2006


You can iterate through a string using a for loop and use ord($string[$i]) to view the ASCII value of each character. (I tried to place the code here, but it was all screwed up.)
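
A rough sketch of that loop, written in Python rather than PHP (Python's ord() and byte values play the role of PHP's ord($string[$i])). The function name find_odd_bytes and the sample record are made up for illustration, and 0x1D is an arbitrary stand-in for the mystery character, not a guess at what Office Tracker actually emits.

```python
def find_odd_bytes(data: bytes):
    """Return (offset, byte value) for every byte that is not tab, CR, LF, or printable ASCII."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if value not in allowed_controls and not (0x20 <= value <= 0x7E)
    ]


# Hypothetical record; for a real export, read the file with open(path, "rb").read().
sample = b"name\tnotes\nAnn\tfirst\x1dsecond\n"

for offset, value in find_odd_bytes(sample):
    print(f"offset {offset}: decimal {value}, hex 0x{value:02X}")
```

Reading the file as raw bytes matters here: if the character is part of a multi-byte encoding, each byte is reported separately instead of being hidden behind a replacement glyph.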

But it's probably not an ascii character. Check out the functions here, which were written by a fellow mefite (scottreynen).

See also: http://ask.metafilter.com/mefi/38886
posted by miniape at 2:44 PM on October 12, 2006


Ah, by "you" I think I meant, "Have you uninstalled MSWord at your place of work yet??"

I haven't been successful regexing this *at all* but I'm sure it's possible .. hopefully someone with greater emotional fortitude stops by. Tab-delimited files seem to have issues with spacing depending on what program you use, like an nbsp is the same as a tab, and a space is sometimes the same as a tab .. what I have the most problems with though is punctuation in Word, hence my unhelpful comment =)
posted by shownomercy at 2:51 PM on October 12, 2006


In Windows? Get WinVi. Open the file. Go to the Options menu and select "hex edit mode".

This will show the file simultaneously as hex on the left and a narrow column of text (possibly gibberish, since your file may be at least part binary) on the right. Weird non-viewable or non-ASCII characters will appear as little dots amongst the text on the right. Highlight a dot and its corresponding hex value will be highlighted on the left.

Compare against the ASCII table.
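
If no hex editor is handy, the same two-column view can be approximated in a few lines. This is only a sketch in Python, not how WinVi works internally; hex_dump and the sample bytes are hypothetical, with 0x1D again standing in for the unknown character.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex column, and a text column with dots for non-printables."""
    pad = width * 3 - 1  # width of a full hex column, so short last lines still align
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{start:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


# Hypothetical data; use open(path, "rb").read() for the real export.
print(hex_dump(b"Ann\tfirst\x1dsecond\n"))
```

Each dot in the right-hand column lines up with a two-digit hex value on the left, which is the number to look up in the ASCII table.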
posted by speedo at 2:52 PM on October 12, 2006


on preview--speedo beat me re: hex editor. AXE has a free trial.

another thought is to define what is acceptable and use regexp to filter everything else out.
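
A sketch of that whitelist idea, in Python for illustration. It assumes the only legitimate characters in the export are tab, CR, LF, and printable ASCII; that assumption is mine, not something stated about the file, and the names NOT_ALLOWED and keep_allowed are invented here.

```python
import re

# Assumption: anything outside tab, CR, LF, and printable ASCII is unwanted.
NOT_ALLOWED = re.compile(r"[^\t\r\n\x20-\x7E]")


def keep_allowed(text: str, replacement: str = " ") -> str:
    """Replace every character outside the whitelist with `replacement`."""
    return NOT_ALLOWED.sub(replacement, text)


print(repr(keep_allowed("Ann\tfirst\x1dsecond\n")))
```

The character class itself should also work with PHP's preg_replace, though that has not been tested against this file. Note that a whitelist this strict also removes legitimate accented or non-Latin characters, so widen it if the data contains any.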
posted by mdpc98 at 3:02 PM on October 12, 2006


Use a hex editor to find out what character it is, then look it up in an ASCII table and cross-reference that against the PHP regex docs to find out how to spell it.

If you're on unix, then you can just use "hexdump -C" to give you a nice dump of the file in hex.

(I see others have already made this suggestion. Go to it!)
posted by pharm at 12:30 AM on October 13, 2006


For finding it, as others mentioned, hexdump/vim/emacs on *nix. On Windows: In Textpad, hovering the mouse over that character shows the hex equivalent. Regex equivalent - \x{<hex code>}
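
Once the hex code is known, turning it into a regex escape and stripping it might look like the sketch below. It is in Python, whose re module spells the escape \xhh (or \uXXXX for larger code points) rather than the \x{...} form above; pattern_for, strip_code_point, and the 0x1D value are all placeholders for illustration.

```python
import re


def pattern_for(code_point: int) -> str:
    """Build a regex escape for one code point, e.g. 0x1D -> the four characters \\x1d."""
    if code_point <= 0xFF:
        return f"\\x{code_point:02x}"
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04x}"
    return f"\\U{code_point:08x}"


def strip_code_point(text: str, code_point: int, replacement: str = "") -> str:
    """Replace every occurrence of the given code point with `replacement`."""
    return re.sub(pattern_for(code_point), replacement, text)


# 0x1D is a placeholder; substitute whatever value the hex view reports.
print(repr(strip_code_point("Ann\tfirst\x1dsecond\n", 0x1D, " ")))
```

Replacing with a space instead of an empty string keeps the glued-together fields from running into each other.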
posted by swapspace at 10:08 AM on October 13, 2006

