How does virus scanning work?
December 31, 2005 3:13 AM   Subscribe

How does virus scanning work?

There are thousands of viruses/virii to look for, and yet the filenames cycle through instantaneously. My initial theory was that the scanner queries for the last time the file was modified. If after the previous scan, it actually checks for signatures, otherwise it skips through. But this doesn't explain why the checking still seems fast when done from a new A/V scanner. Even factoring all the viruses that are slight variants, there still must be a couple hundred unique types.
posted by Gyan to Computers & Internet (6 answers total)
 


There are in fact tens of thousands of variants, not just hundreds.

The scanner can narrow down the search by only looking in executable files and other known file types (such as Word .doc files that can contain macros.) In general it would be pointless to search things like .txt files because you can't execute a text file, therefore it cannot contain a virus.

Another thing to note is that while there may be tens of thousands of signatures to scan for, the algorithms for matching a given set of bytes against a list of signatures are very efficient. Pattern matching is a basic area of study of computer science (that has existed for decades) and there are many fast algorithms for doing this, and in practice the bottleneck is just reading the data from disk, not actually the scanning part.

But any scanner that just used straight pattern matching against a list of signatures would be obsolete because viruses have contained self-modifying code ever since the very earliest scanners existed. So they also have to use heuristics and other "fuzzy matching" techniques that can identify a particular sort of sequence of instructions.
posted by Rhomboid at 4:14 AM on December 31, 2005


assuming you scan all files (and i think they do), you have a simple search problem - you need to find if certain "substrings" appear in the "text" as you read through the file.

there are standard techniques for making this as efficient as possible. at it's simplest, instead of searching for "abc" in one scane, and "abd" in another, you search for "a" and, if you find a match, check if that's followed by "b" and, if so, whether that is followed by "c or d".

more generally, you "compile" the substrings to a state machine that you then use to process the file data character by character.

more generally still, you probably use regular expressions rather than exact substrings. these also compile to state machines, in a very similar way. as you can probably guess, there's a lot of work on compiling regexp state machiens to make them as efficient as possible. you can exploit not just the information you have about the regexps, but also the expected occurence of words/patterns in the target files - it wouldn't surprise me if they had a variety of state machines, each tailored to different file types, or even an adaptive process.

anyway, for more on search algorithms see something like cormen et al.

disclaimer - i know nothing about virus scanning, this is just from knowledge of standard algorithms.
posted by andrew cooke at 5:01 AM on December 31, 2005


i didn't say, but you can often get better performance than O(n) for these scans. it sounds crazy, but sometimes you have enough information from the data you have examined to know that you can jump some number of characters ahead (remember that matching is exceptional, so if you jump ahead and find a very suspicious "tail" you can then backtrack - the forward scan with failures is what you try to opimize).
posted by andrew cooke at 5:04 AM on December 31, 2005


oops. stupid mistake. it's still O(n), but it's got a constant factor less than 1, if you see what i mean - you don't have to look at every byte in the file to guarantee that it is clean (assuming virus signatures are significantly longer than 1 byte in length)
posted by andrew cooke at 5:11 AM on December 31, 2005


Google Boyer-Moore
posted by orthogonality at 7:03 AM on December 31, 2005


« Older does menthol from cigarettes crystalize in your...   |   Wristwatch with thermometer Newer »
This thread is closed to new comments.