Most widespread file format
July 14, 2005 11:52 AM

What file format exists in the greatest numbers? HTML? TXT? MP3? JPEG? DLL?
posted by timnyc to Computers & Internet (16 answers total)
 
I'm going to go out on a limb and say text files. HTML, XML, almost all Unix configuration files, and many others are really just text files.

If, however, you are asking which extension is most common... Well, that question is slanted, simply because the practice of ending a file name with a dot and a three-letter extension is not universal across operating systems.

Just some thoughts.
posted by cm at 12:23 PM on July 14, 2005


Yeah, can you be more specific about what you're looking for here? What sort of temporal bounds are you suggesting? Is a generic binary temporary file that lives for a minute or an hour or a month worth counting? Are we talking about unique files or every instance of a file, even repeats? If I were to create 5000 copies of the same one-second mp3 file with randomly generated names, would that be 5000 more mp3s in your model?

You could make an estimate based on the files installed by default in a set of common operating systems multiplied by the approximate installed base for each OS, as a start.
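
As a rough sketch of that arithmetic: the figures below are PLACEHOLDERS, not real statistics, and `os_a` / `os_b` are made-up names. Each line stands for one operating system, with an assumed count of files of some extension in a default install, followed by an assumed number of installed machines.

```shell
# PLACEHOLDER numbers only, not real data:
# columns are <os name> <files per default install> <installed machines>.
estimate=$(printf '%s\n' 'os_a 1000 10' 'os_b 500 4' |
  awk '{ total += $2 * $3 } END { print total }')

# Sum over all systems of (files per install) x (installed base).
echo "estimated files: $estimate"
```

Swapping in real per-OS counts and installed-base figures is the hard part; the sum itself is trivial.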

What is a text file? A file that contains intelligible text in English (or some other natural language)? A file that can be displayed correctly in 7-bit ASCII?

Narrow this down for us.
posted by cortex at 12:35 PM on July 14, 2005


It'd have to be txt. Even if you consider html, xml, etc. files non-txt (because they're interpreted by something else, or for some other contrived reason), txt would still be by far the most common. First of all it can be read and written by any computer system that doesn't rely on punch cards or wires for programming. From edlin on DOS to ed on a Unix system or TextEdit on a Macintosh, everything has a text editor. Second of all, email and Usenet were the two big things that helped the internet take off prior to the web (email was the first killer application). That's all about sending bits of text from one user to one or more other users. I've personally got gigabytes of email stored away.
posted by substrate at 12:39 PM on July 14, 2005


If we're talking about sheer numbers of not-necessarily-unique files, it's got to be html. Every computer in the world has a cache with at least a few MB of small html files. (If Outlook and Outlook Express stored email as txt, then it would probably be number one.)
posted by teg at 12:48 PM on July 14, 2005


After careful analysis of 5 random hard drives in my possession, I have the following numbers:


Hard Drive 1:
.dll - 10401 files
.dat - 5675 files

Hard Drive 2:
.wav - 26669 files
.jpg - 8986 files

Hard Drive 3:
.avi - 371 files
(none) - 94 files

Hard Drive 4:
.mp3 - 17824 files
.jpg - 973 files

Hard Drive 5:
.txt - 2245 files
.jpg - 1900 files


Conclusion: None.
posted by Jairus at 12:55 PM on July 14, 2005


(Hard drive sizes are 80GB, 60GB, 120GB, 200GB and 80GB respectively.)
posted by Jairus at 12:56 PM on July 14, 2005


Response by poster: Good answers, all. I was thinking of file-extension popularity (sorry, not file type... thanks for helping me see the diff; the extension indicates what it's used for) at a snapshot moment in time, now, across all networked and non-networked machines, repeats OK.

This is in the context of the 10-year anniversary of the MP3. I wonder how many of those are out there... I'd love to see file extensions graphed over time to see the rise of the MP3...
posted by timnyc at 1:11 PM on July 14, 2005


on my computer, mail messages are the most common type of file. i have over 30,000 spam messages alone (no, i'm not sure why i bother keeping them either, but they came in useful the other day for training a bayesian spam filter).
posted by andrew cooke at 1:28 PM on July 14, 2005


371 avi files ... hmmmmmmm
posted by crewshell at 2:12 PM on July 14, 2005


"First of all it can be read and written by any computer system that doesn't rely on punch cards or wires for programming."

Unless your text file is EBCDIC and you're on a modern computer. Or vice versa. Fear and loathing!
posted by grouse at 2:37 PM on July 14, 2005


This depends entirely on where you look -
e.g. on Google
Results 1 - 20 of about 248,000 for mp3 filetype:mp3
Results 1 - 20 of about 7,490,000 for txt filetype:txt
Results 1 - 20 of about 10,700,000 for doc filetype:doc
Results 1 - 20 of about 278,000,000 for htm filetype:htm
Results 1 - 20 of about 856,000,000 for html filetype:html
posted by Lanark at 4:09 PM on July 14, 2005


On my Mac, this command:
find / -fstype local -type f 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed -Ee 's/^.*\/\.?//' -e 's/.*(\.[^.]*)/\1/' -e 's/^[^.]*$/NONE/' | sort | uniq -c | sort +0nr
Produces:
7810 .html
4214 .nib
1802 NONE
1676 .tiff
1615 .cache
1205 .strings
987 .h
983 .rtf
941 .pm
826 .plist
633 .gif
633 .tif
510 .png
502 .i
491 .cpp
430 .o
396 .jpg
323 .icns
312 .css
255 .c
200 .webbookmark
196 .order
155 .scpt
129 .txt
114 .m
112 .r
88 .pbxbtree
82 .hpp
54 .helpindex
53 .pbxproj
42 .dot
37 .rgb
36 .dependency
36 .log
35 .26l
35 .psd
32 .264
31 .xmlfragment
29 .pbxuser
28 .cpf
27 .cp
25 .pod
24 .bundle
24 .rsrc
23 .applescript
22 .scriptsuite
21 .js
21 .scriptterminology
21 .utxt
20 .bom
20 .bs
20 .gz
19 .dep
18 .jsv
18 .pdf
16 .pl
16 .xml
15 .rprt
14 .aiff
14 .al
14 .sh
13 .cfg
13 .xib
12 .hmap
11 .header
11 .pbxsymbols
10 .collection
10 .perspective
--snip--
posted by ryanrs at 5:02 PM on July 14, 2005


DLL or JPEG. A billion Windows boxes and people who like pornography. Think about it.
posted by angry modem at 5:27 PM on July 14, 2005


Thanks to ryanrs for that command. I wanted to do that as soon as I read the main post but didn't know where to start.

A question, though: what is the "2 > /dev/null" supposed to be doing? I was getting errors that '2' is not a find option so I took out that bit and it worked fine.

So the results on my NetBSD machine surprised me:

102341 NONE
15033 .h
12104 .png
-- etc --

"None" is reasonable, and I suspect the abundance of .h files is due to an untidy pkgsrc tree, but now I have to figure out why the heck I have so many pngs.
posted by Lirp at 10:52 PM on July 14, 2005


The redirection should have no spaces: "2>/dev/null". It says errors should be discarded instead of printed. If omitted, find will complain when it encounters a directory it can't read (due to permissions, etc).
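
A small way to see the effect, as a sketch: the directory is a throwaway made with mktemp, and `no-such-dir` is a deliberately missing path so that find has something to complain about.

```shell
# Throwaway directory so the demo does not touch real files.
tmp=$(mktemp -d)
touch "$tmp/a.txt"

# Errors kept: 2>&1 folds find's complaint into the output so it gets counted.
with_errors=$(find "$tmp" "$tmp/no-such-dir" -type f 2>&1 | wc -l | tr -d ' ')

# Errors discarded: 2>/dev/null throws the complaint away, leaving only the path.
without_errors=$(find "$tmp" "$tmp/no-such-dir" -type f 2>/dev/null | wc -l | tr -d ' ')

echo "lines with errors kept: $with_errors"
echo "lines with errors discarded: $without_errors"

rm -rf "$tmp"
```

With the space ("2 >/dev/null"), the shell passes a bare 2 to find as an argument and redirects normal output instead, which is why find objected.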

Your .pngs are probably from html documentation.

Here's a breakdown of the entire command:

find /
Starting from /, recursively search every directory.

-fstype local
Restrict search to local filesystems (i.e., those not on a remote server).
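
One portability caveat, as far as I know: the `local` pseudo-type is a BSD/Mac OS X find feature, and GNU find (most Linux systems) expects a real filesystem type name after -fstype. A rough substitute, assuming you only care about the filesystem the starting directory lives on, is -xdev, which stops find from crossing mount points. The sketch below runs it on a throwaway directory rather than on /.

```shell
# Throwaway tree standing in for / in this sketch.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
touch "$tmp/sub/file.txt"

# -xdev: do not descend into directories on other filesystems (mount points).
found=$(find "$tmp" -xdev -type f 2>/dev/null)
echo "$found"

rm -rf "$tmp"
```

This is not an exact replacement: -xdev also skips other local disks mounted below the starting point.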

-type f
Print the path of every normal file. Ignore directories, symbolic links, etc.
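
To see what -type f filters out, here is a minimal sketch on a throwaway directory (the file names are made up for the example):

```shell
tmp=$(mktemp -d)
touch "$tmp/regular.txt"                    # a normal file
mkdir "$tmp/a-directory"                    # a directory
ln -s "$tmp/regular.txt" "$tmp/a-symlink"   # a symbolic link

# Everything find visits, including the starting directory itself.
all_count=$(find "$tmp" | wc -l | tr -d ' ')
# Only normal files; the directory and the symlink are skipped.
file_count=$(find "$tmp" -type f | wc -l | tr -d ' ')

echo "all entries: $all_count, normal files: $file_count"
rm -rf "$tmp"
```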

2>/dev/null
Don't print error messages.

tr '[:upper:]' '[:lower:]'
Convert the paths to lowercase.
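
For instance (the path is a made-up example), this is what lets .JPG and .jpg end up in the same bucket later on:

```shell
# A mixed-case path, as find might print it.
lowered=$(printf '%s\n' '/Users/Example/Photo.JPG' | tr '[:upper:]' '[:lower:]')
echo "$lowered"
```

The directory names get lowercased too, but that does no harm since sed throws them away in the next step.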

sed -E
Use Extended regular expressions for the following -e commands.

-e 's/^.*\/\.?//'
For each path, discard the leading directories, leaving only the bare filename. If the filename starts with dot ('.'), then discard the dot.

-e 's/.*(\.[^.]*)/\1/'
If the filename contains one or more dots, then discard everything left of the rightmost dot. Keep the last dot and everything to the right of it. This is the extension.

-e 's/^[^.]*$/NONE/'
Otherwise, if the filename does not have any dots, then discard the entire filename and replace it with the word NONE.
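
The three substitutions can also be run one at a time to watch each step, as a sketch. The sample paths are illustrative and already lowercased, the way tr would leave them.

```shell
# Illustrative sample paths (already lowercased).
paths='/users/ryanrs/foo.bar.baz
/users/ryanrs/.profile
/users/ryanrs/goat'

# Step 1: strip the leading directories and one leading dot, if any.
step1=$(printf '%s\n' "$paths" | sed -E -e 's/^.*\/\.?//')
# Step 2: keep only the rightmost dot and whatever follows it.
step2=$(printf '%s\n' "$step1" | sed -E -e 's/.*(\.[^.]*)/\1/')
# Step 3: anything that still has no dot becomes NONE.
step3=$(printf '%s\n' "$step2" | sed -E -e 's/^[^.]*$/NONE/')

printf 'after step 1:\n%s\n\nafter step 2:\n%s\n\nafter step 3:\n%s\n' \
  "$step1" "$step2" "$step3"
```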

sort
Sort the list of extensions. This gathers up all occurrences of a particular extension into a solid block. The result is a block of .pngs, a block of NONEs, a block of .htmls, etc.

uniq -c
For each block, count the number of occurrences in the block. Then print the count followed by the extension.
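
The sort has to come first because uniq only collapses lines that are next to each other. A minimal sketch with a made-up list of extensions:

```shell
# Made-up extensions in arbitrary order, one per line.
counted=$(printf '%s\n' .png NONE .html .png .html .png | sort | uniq -c)
echo "$counted"
```

Each output line is a count followed by the extension; the order of the lines at this stage depends on your locale's sort rules.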

sort +0nr
Sort the output of uniq. Field 0 is the count; sort it numerically and in reverse order (highest count at the top). Field 1 is the extension; if two lines have the same count, then order them alphabetically by extension. We don't explicitly list field 1 since it uses the default options.
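
Note that `+0nr` is the old, obsolescent way of naming a sort key, and newer versions of sort may refuse it (GNU sort, for one, is likely to treat it as a file name). A sketch of the same idea with the POSIX -k option, assuming you want the ordering described here: count descending, then extension ascending. The input lines are made up to look like uniq -c output.

```shell
# Made-up "count extension" lines, including a tie on the count 20.
sorted=$(printf '%s\n' '  20 .gz' ' 129 .txt' '  20 .bom' '7810 .html' |
  sort -k1,1nr -k2,2)
echo "$sorted"
```

Here -k1,1nr sorts on the first field only, numerically and reversed, and -k2,2 breaks ties on the second field. Note that -k counts fields from 1, not 0.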

Some examples:
/Users/ryanrs/stack.c -> .c
/Users/ryanrs/foo.bar.baz -> .baz
/Users/ryanrs/.profile -> NONE
/Users/ryanrs/goat -> NONE
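
If you want to check these examples yourself without scanning the whole disk, here is a sketch that recreates the four file names in a throwaway directory and runs the pipeline on it. Two things differ from the original command, both for portability: -fstype local is left out, and the final sort uses -k instead of +0nr.

```shell
# Throwaway tree containing the four example file names.
tmp=$(mktemp -d)
touch "$tmp/stack.c" "$tmp/foo.bar.baz" "$tmp/.profile" "$tmp/goat"

# find -> lowercase -> extract extension -> count -> order by count.
result=$(find "$tmp" -type f 2>/dev/null |
  tr '[:upper:]' '[:lower:]' |
  sed -E -e 's/^.*\/\.?//' -e 's/.*(\.[^.]*)/\1/' -e 's/^[^.]*$/NONE/' |
  sort | uniq -c | sort -k1,1nr -k2,2)
echo "$result"

rm -rf "$tmp"
```

NONE should come out on top with a count of 2, since both .profile and goat land there.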
posted by ryanrs at 12:37 AM on July 15, 2005


Output from a more heavily used Mac:
25695 .html
20244 .emlx
15895 .nib
13955 .h
4934 .strings
4725 NONE
2740 .c
2600 .tiff
2506 .plist
2446 .gif
2377 .jpg
1676 .cache
1622 .css
1219 .r
1075 .rtf
994 .o
941 .pm
926 .scpt
922 .tif
843 .png
754 .m
545 .xib
501 .i
424 .log
361 .icns
334 .helpindex
319 .cpp
259 .htm
258 .java
256 .rsrc
204 .order
200 .webbookmark
178 .py
168 .pbxuser
167 .pbxbtree
165 .2
164 .pct
156 .emlxpart
128 .xml
122 .dependency
105 .pbxproj
93 .s
92 .js
81 .pdf
72 .mov
64 .sit
--snip--
posted by ryanrs at 12:41 AM on July 15, 2005


This thread is closed to new comments.