How do I tell if a file is a real corrupt PDF or a fake broken file?
May 13, 2014 11:17 AM

I make my students submit their assignments (for college writing classes) by uploading them as PDFs to our course management system. Sometimes students submit broken or un-openable files. My question is, how do I tell the difference between a real PDF file that became corrupted somewhere along the line and a fake corrupted PDF file created by a student who didn't do the assignment and wants an excuse?

I have heard of students changing the file extension on an MP3 to PDF or DOCX and submitting that, then claiming that a virus ate it when it turns out gibberish.

I'm generally pretty flexible about working with students who appear to be having technical problems, and I give students the benefit of the doubt, but I'd like to be better at telling when someone is bullshitting me.

I am generally a computer and internet literate person but I am an English teacher and have limited technical knowledge of how files actually work.
posted by Tesseractive to Computers & Internet (28 answers total) 9 users marked this as a favorite
 
"PDFs uploaded through our course management system is the preferred way of submitting assignments. If you are experiencing trouble converting to PDF or with the upload system, you can email the document file directly to me at poops@blerg.edu."

Honestly any student these days should be technically literate enough to attach a file to an email. If they can't comply with that very, very basic request (barring something catastrophic like "a latent chunk of Skylab literally just fell through my roof and onto my hard drive, here is a picture as proof") then they're probably full of crap.
posted by phunniemee at 11:23 AM on May 13, 2014 [2 favorites]


Response by poster: I'm not really looking for a syllabus policy change, because I'm pretty happy with what I have. I'm genuinely curious about how I would analyze a file to see whether it's been tampered with or straight up broken.
posted by Tesseractive at 11:27 AM on May 13, 2014 [3 favorites]


Best answer: PDFs start with these four characters: %PDF
In hex, those are 25 50 44 46. Any hex editor will show you that, or you can use a text editor (like TextEdit, which is included on the Mac) to see it. Word will also open a PDF if you select "All files" and then import as Unicode. It will be near-gibberish but look like:

%PDF-1.5
%‚„œ”

229 0 obj
<>>
endobj


[etc.]
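
The same check can be scripted instead of eyeballed. This is only a minimal Python sketch, assuming the submission has been downloaded to your computer; the function name starts_with_pdf_magic is made up for illustration and is not part of any tool.

```python
def starts_with_pdf_magic(path):
    """Return True if the file's first four bytes are %PDF (hex 25 50 44 46)."""
    # Open in binary mode so the bytes are compared exactly as stored.
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"


if __name__ == "__main__":
    import sys

    # Usage: python check_header.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        if starts_with_pdf_magic(name):
            print(f"{name}: starts with %PDF")
        else:
            print(f"{name}: does NOT start with %PDF")
```

A file that fails this check was probably never a PDF; a file that passes can still be damaged further in.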
posted by wnissen at 11:27 AM on May 13, 2014 [19 favorites]


Files don't usually just "become corrupted", though many things are possible. Moreover, the systems I have used let students re-download the files they have submitted to check that they're valid.

You can try opening the "corrupted" files in a text editor and seeing what you see. Real PDFs will start with %PDF-1.5 or similar, and then a lot of gibberish because PDFs are not plain text files. (i.e. what wnissen said)
posted by katrielalex at 11:27 AM on May 13, 2014 [1 favorite]


I think wnissen answered your question. But I'll just reiterate what katrielalex said: I file pdfs with courts' electronic filing systems on a regular basis, where filing deadlines can mean the difference between a case being allowed to go forward and a case being dismissed, and I'm unaware of anything ever becoming "corrupted" in the process, as your students seem to claim is happening. As katrielalex also suggested, courts' e-filing systems always allow you to download/open the file you uploaded to confirm that you uploaded the right document, which I always do. If there's a way to allow students to do the same, the "corrupted" excuse becomes much less convincing. As does any claim that they accidentally submitted the wrong document, etc.

I'd be interested in hearing, if you do what wnissen suggests, whether it turns out that any of the "corrupted" documents were actually pdf files - though I suppose you can probably find "corrupted" pdf files online for this purpose too.
posted by traveltheworld at 11:34 AM on May 13, 2014 [2 favorites]


Yeah, the PDF should have the PDF header info when you open in a text editor.

Please bear in mind that your student could've submitted a "real" corrupt PDF file of an instruction manual (or something) rather than renaming an MP3.
posted by griphus at 11:35 AM on May 13, 2014 [3 favorites]


I would guess the most common legitimate "corruption" for an uploaded file is that it doesn't finish uploading, but the system thinks it does. For those cases, the "%PDF" header is a good test.
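
An upload that stops partway would typically keep the header but lose the end of the file. As a rough complement to the header test, the sketch below looks for the %%EOF marker that a complete PDF normally carries near its end. That marker is my assumption from the PDF format, not something the course management system reports, and the function name and the 1024-byte window are illustrative choices.

```python
import os


def missing_eof_marker(path, tail_bytes=1024):
    """Return True if no %%EOF marker appears in the last tail_bytes of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Jump to the final stretch of the file (or the start, if it is short).
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return b"%%EOF" not in tail
```

A file with the header but no %%EOF was probably cut off somewhere. It cannot say whether that happened by accident or on purpose, and it will not notice damage in the middle of a file whose ending is intact.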
posted by smackfu at 11:35 AM on May 13, 2014 [1 favorite]


(Just as a test, I opened a PDF file in notepad, deleted a big chunk of lines, leaving the header alone, and saved it. Acrobat indicates the file is corrupt, but it still passes the header test.)
posted by griphus at 11:37 AM on May 13, 2014 [3 favorites]


Yes, wnissen has it: the first four bytes (the "magic") of a PDF are '%PDF'. Anything else is not a PDF.
posted by scruss at 11:43 AM on May 13, 2014


This will depend a bit on what they're submitting, but if these are plain text papers you're talking about, file size should give you a pretty solid hint. I have a PDF of a writing project that's about 6500 words with a little formatting in it, and it's just under 300KB in file size. An mp3 is typically more like 5MB.
posted by jacquilynne at 11:45 AM on May 13, 2014


Response by poster: This is helpful. The file I'm looking at right now does not have the PDF "magic" at the beginning when opened in a text editor, which is a bad sign.

I have had one well-documented instance of a student who used a Mac having extreme difficulty getting her PDFs to upload to the course management site (a version of Sakai) in a format that would let them be opened. I don't know what happened--I sat with her and we tried it from multiple browsers and the files always broke. So I let her email things to me.

It's lore among instructors that our course management system is unreliable and "breaks" things all the time, but I wonder how much of that is really true.
posted by Tesseractive at 11:47 AM on May 13, 2014 [1 favorite]


I would probably also switch the file type over to a few other likely ones -- doc/docx/mp3/mp4/avi, absolutely not an executable -- and see if those opened up properly.

If you can find another student in that same situation, I would try to check a copy of the broken pdf against the test here, to see if that actually did pass the test since you know they were real pdf files that really failed.
posted by jeather at 11:50 AM on May 13, 2014 [1 favorite]


If you're using a Mac, you can open Terminal and type "file [filename]" (and instead of typing the filename, you can drag and drop it from the Finder). If it's a recognizable file type, it will tell you.
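
If the Terminal isn't an option, the idea behind "file" can be imitated in a few lines of Python: compare the first bytes against a table of known signatures. This is a much-reduced sketch, nowhere near as thorough as the real command; the short signature list and the name guess_type are my own picks for illustration.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"PK\x03\x04", "ZIP container (.docx, .xlsx, .xps, or a plain .zip)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "legacy Office file (.doc, .xls, .ppt)"),
    (b"{\\rtf", "Rich Text Format document"),
    (b"ID3", "MP3 audio with an ID3v2 tag"),
]


def guess_type(path):
    """Guess a file's type from its first bytes, ignoring the extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unrecognized"
```

For an MP3 renamed to .pdf this would report the audio signature, assuming the MP3 carries an ID3v2 tag; untagged MP3s start differently and would come back as unrecognized.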
posted by one more dead town's last parade at 11:53 AM on May 13, 2014 [3 favorites]


Response by poster: Hmm. Opening a file from the student I mentioned who was definitely not faking a computer problem, it doesn't have the PDF header in it either, but it does have something that says "[Content_Types].xml" after a bunch of gibberish. But that file attached without a file extension--the current one I'm dealing with shows up as a .pdf file.
posted by Tesseractive at 11:59 AM on May 13, 2014


I use Sakai as both a student and a faculty member. Sakai isn't wonderful and the UI is painful; however, I've never had a similar problem. I've uploaded many, many PDFs over the years and have never experienced what you are describing. I've never had a student say that they've had that problem.

The student is responsible for the upload. They can always click on the uploaded file and verify that it opens correctly in their web browser. I don't know why you'd have any part of that. If they let you know - in advance of the deadline - that upload isn't working you could negotiate a different file format.

There's a technical glitch somewhere. As much as I dislike Sakai, I don't think this is Sakai's fault.
posted by 26.2 at 12:01 PM on May 13, 2014 [1 favorite]


The "[Content_Types].xml" bit tells us that the document was likely generated via Microsoft Office. More here: http://office.microsoft.com/en-us/office-open-xml-i-exploring-the-office-open-xml-formats-RZ010243529.aspx?section=16

Is it possible that you have a student who made a PDF, then embedded it in Word, and then sent the docx Word file to you?
posted by Mo Nickels at 12:07 PM on May 13, 2014


Microsoft Office also has its own read-only format, .xps. I just saved an XPS document and it contains the "[Content_Types].xml" in the content. The XPS file type option is also directly below the option for PDF in the Save As window, so perhaps it's just the student accidentally selecting the wrong file format...
posted by comradechu at 12:17 PM on May 13, 2014


Your student with the "[Content_Types].xml" may have saved a .docx and just renamed the file to .pdf and uploaded that, without actually converting the file to PDF.
posted by Blue Jello Elf at 1:22 PM on May 13, 2014 [1 favorite]


As last parade indicated above, there are command-line utilities which will identify a file type based on header info. These are a bit technical, I'm afraid, because they do require use of the command prompt, which can be a bit intimidating to first-time users.

On MacOS (or Linux/Unix, which I assume you're not using), there's the "file" command. This comes pre-installed and last parade's description above of how to use it is pretty complete. You might look at a simple command-line tutorial just to get some idea of what the Terminal actually does, though.

On Windows, you'd have to install a utility of your own. The "Win32 Console Toolbox" on this page includes a command-line utility called "FileType" which should work pretty much the same way as the MacOS/Linux file command. A brief tutorial on how to use the DOS command line at all is here.
posted by jackbishop at 1:28 PM on May 13, 2014


If you're not using a mac (or linux), for windows there's a port of the file command. (on preview, as jackbishop mentioned) It will be able to distinguish between a corrupted pdf and an mp3.

Regarding the "[Content_Types].xml" in the file: a few of the MS Office xlsx documents that I have on my desktop are actually zip files (as reported by "file", and they correctly decompress via zip). I assume this is true of other newer Office save files. Open the zip contents, and one of the files is [Content_Types].xml - but within the archive you might be able to navigate to an xml which might be semi-readable.
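
To peek inside without unzipping by hand, something like the Python sketch below lists what the container holds and makes a guess from the folder names. The folder prefixes (word/, xl/, ppt/) are what Office normally writes, but treat the labels as hints; the function name is hypothetical and the XPS guess in particular is just a fallback, not a positive identification.

```python
import zipfile


def describe_office_package(path):
    """Guess what kind of ZIP-based document a file is, whatever its extension."""
    if not zipfile.is_zipfile(path):
        return "not a ZIP container"
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
    if "[Content_Types].xml" not in names:
        return "plain ZIP archive"
    # Office puts each application's parts in a characteristic top-level folder.
    if any(n.startswith("word/") for n in names):
        return "Word document (.docx)"
    if any(n.startswith("xl/") for n in names):
        return "Excel workbook (.xlsx)"
    if any(n.startswith("ppt/") for n in names):
        return "PowerPoint presentation (.pptx)"
    return "other package with [Content_Types].xml (possibly XPS)"
```

If a "broken PDF" comes back as a Word document, renaming it to .docx should let it open normally.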
posted by nobeagle at 1:36 PM on May 13, 2014


I think your student doesn't understand the concept of converting to PDF, and is simply renaming her Word file to a pdf extension.

There might be a few students in this situation?
posted by dave99 at 3:37 PM on May 13, 2014 [2 favorites]


One thing that might be helpful is to let students know at the beginning of the semester about the number of claims that you have gotten regarding corrupted files, and that you are aware that some students (but not yours!) still try and pull this off. Then just let them know that you find this interesting and somewhat humorous in light of how statistically unlikely it is that this number of students actually have this problem and how easy it is to tell if this is actually the case. My guess is that it wouldn't eliminate all instances of deception, but perhaps more than a few.
posted by SpacemanStix at 4:19 PM on May 13, 2014 [2 favorites]


Having worked with students a lot, I'd hazard most of the problems are incompetence - i.e. inability/unwillingness to learn to properly save as/convert to a PDF - rather than malicious intent. Having a crisis alternative option to submit files in other formats (such as office native docx via email etc) with the same deadline would help weed out the slackers from the technically illiterate.

You could have an incentive for them to figure out how to submit PDFs properly to your CMS by having a small scoring penalty for work submitted via the crisis route after the first couple. Having similar for missing the deadlines repeatedly - regardless of reason - would also do wonders to sharpen up the slackers trying to get themselves more time.

It is their responsibility to get work to you in the right format and on-time; a lesson they will need to learn before heading out into the world of work, whether they end up as writers or something else...
posted by ArkhanJG at 4:45 PM on May 13, 2014


One thing one of my instructors did that I liked was to introduce a 10% penalty per day for late assignments and then apply that regardless of the reason for being late. This removes the all-or-nothing penalty for being late that incentivizes submitting corrupt PDFs, and it is very easy to apply. Obviously it won't work for all classes but it was seen as fair in that particular class.

PS: This was way before course management programs. Assignments were always due at 11:59PM on the due date and the instructor had a mailbox on his house that students could use to hand in assignments right up until the deadline. I made the trip up to his house on more than one occasion right at the deadline though I never stuck around to see if he actually checked the box at midnight. Later on when the campus got full-time security we could get the guard to time stamp our assignments if we were working late.
posted by Mitheral at 7:21 PM on May 13, 2014


When I was a TA, once in a while a student would try to "convert" a file by just changing the extension. So you might try changing the extension to .docx and .doc and see if it's actually a Word document. Renamed zip files were also popular.
posted by qxntpqbbbqxl at 9:39 PM on May 13, 2014


If you're on Windows, TrIDNet should tell you what sort of file it is if you'd rather not use the command line.

You might also want to throw it at Virus Total, as much to see whether it's a unique file or not as to detect whether it's a virus.
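
Virus Total can be searched by a file's hash, so you can look a file up without uploading a student's work, and the same hash tells you whether two students handed in byte-for-byte identical "corrupt" files. A small Python sketch for computing one; the function name and chunk size are arbitrary choices of mine.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hash of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same hash are, for practical purposes, the same file regardless of what they are named.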
posted by Busy Old Fool at 7:03 AM on May 14, 2014


There is no litmus test that would distinguish deliberate corruption from a bug or mistake. For example if I wanted to take advantage of your leniency I would just truncate files deliberately. Please ignore the technical angle and make it the student's responsibility to upload valid documents. Incentivise them by treating a bad file the same as a non-submission, and they'll figure out how to double check so you won't have to (26.2's answer explains how, just don't make it your problem).
posted by Tobu at 7:40 AM on May 14, 2014


The XPS file type option is also directly below the option for PDF in the Save As window, so perhaps it's just the student accidentally selecting the wrong file format...

Further to this, I believe that XPS is the default printer driver for those who have Office installed and who have not otherwise configured a printer to work with their PC - lots of people in other words. I know that I have accidentally created an XPS document when intending to create a PDF. So it would be a plausible error.

In terms of telling whether a submitted "corrupted pdf" file with the incorrect extension is actually an .xps file, you could follow the suggestion mentioned here. Here is the full xps file specification if you wanted to get into that.
posted by rongorongo at 8:17 AM on May 14, 2014

