Help a Python newbie out!
March 6, 2012 3:15 AM

Programming noobfilter: I have a Python script that does some things to a proxy log file: it can count and rank the top 20 urls accessed, users and source IPs. Three further problems and code inside:

1. I want it to count and rank the top source IP segments (255.255.0.0) too, preferably in the same script, and output it in the same format. Is there a way to go about that? (split the IP addresses by period...?)

2. I tried for a few days to print the output into a file, but "print >>file, '%d %s' % (count, url)" gives me syntax errors. f.write() similarly doesn't work. How do you write the output into a text file or better yet, send it as an email? (The server this will run on has Windows, and I was thinking of using Task Manager to run this at a regular time each day.)

3. The proxy log file is uploaded automatically at 5:00am every day and has the date (e.g. 20120229) in its name. How can I set this script to get the current date and scan the correct file?

Any help is greatly appreciated!

--------- (disclaimer: code from stackoverflow)

from collections import defaultdict
from operator import itemgetter
import heapq
 
access = defaultdict(int)
user = defaultdict(int)
sourceip = defaultdict(int)
 
with open("C:/log.log") as f:
    for line in f:
        parts = line.split()  # split at whitespace
        if len(parts) >= 16:  # the highest index used below is parts[15]
            access[parts[11] + parts[13]] += 1  # grabs host url and path and combines them
            user[parts[15]] += 1  # grabs usernames
            sourceip[parts[4]] += 1  # grabs computer's IP
 
# top k entries
k = 20
for url, count in heapq.nlargest(k, access.iteritems(), key=itemgetter(1)):
    print "%d %s" % (count, url)
 
print "\n"
 
# top k users
for name, count in heapq.nlargest(k, user.iteritems(), key=itemgetter(1)):
    print "%d %s" % (count, name)
 
print "\n"
   
# top k sourceips
for ip, count in heapq.nlargest(k, sourceip.iteritems(), key=itemgetter(1)):
    print "%d %s" % (count, ip)
 
print "\n"
posted by monocot to Computers & Internet (10 answers total)
 
If you ask this as a question on stackoverflow it'll be much easier to format it well, gather good responses, etc.
posted by katrielalex at 3:33 AM on March 6, 2012


Best answer: 1. I want it to count and rank the top source IP segments (255.255.0.0) too, preferably in the same script, and output it in the same format. Is there a way to go about that? (split the IP addresses by period...?)

To count things, use a `collections.Counter`. I don't really understand what you mean to count, though.
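
A minimal sketch of what `collections.Counter` gives you (it needs Python 2.7 or later). The sample IPs are made up for illustration; in the real script they would come from the log. `most_common(k)` does the count-and-rank step in one call, and the `print(...)` form used here runs on both Python 2.7 and 3.

```python
from collections import Counter

# Made-up sample data standing in for the source-IP column of the log.
sample_ips = [
    "10.0.1.5", "10.160.240.7", "10.0.1.5",
    "10.0.2.9", "10.0.1.5", "10.0.2.9",
]

counts = Counter(sample_ips)

# most_common(k) returns (item, count) pairs, highest count first.
top = counts.most_common(2)
for ip, count in top:
    print("%d %s" % (count, ip))
```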

2. I tried for a few days to print the output into a file, but "print >>file, '%d %s' % (count, url)" gives me syntax errors. f.write() similarly doesn't work. How do you write the output into a text file or better yet, send it as an email? (The server this will run on has Windows, and I was thinking of using Task Manager to run this at a regular time each day.)

with open(..., "w") as f:
    f.write("{0} {1}\n".format(count, url))
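
Put together with the ranking loop from the question, writing the report could look roughly like this. It is only a sketch: `write_report` and the `report.txt` name are made up for the example, and the counts are sample data rather than anything from a real log.

```python
import heapq
import os
import tempfile
from operator import itemgetter


def write_report(counts, path, k=20):
    """Write the k largest entries of a dict of counts to path, one per line."""
    with open(path, "w") as f:
        for key, count in heapq.nlargest(k, counts.items(), key=itemgetter(1)):
            # write() does not add a newline, so add one per line
            f.write("{0} {1}\n".format(count, key))


# Sample data; in the real script this would be the `access` dict.
sample = {"example.com/a": 7, "example.com/b": 3, "example.org/": 12}
report_path = os.path.join(tempfile.gettempdir(), "report.txt")
write_report(sample, report_path, k=2)

# Read the file back to check what was written.
with open(report_path) as f:
    report_text = f.read()
print(report_text)
```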


3. The proxy log file is uploaded automatically at 5:00am every day and has the date (e.g. 20120229) in its name. How can I set this script to get the current date and scan the correct file?

import datetime
datetime.datetime.now().strftime("%Y%m%d")
# gives e.g. '20120306'
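
To turn that date string into a filename, something like the following might do. The directory, prefix and suffix are placeholders, since the question only says the date appears somewhere in the name, and `log_path_for` is a made-up helper.

```python
import datetime


def log_path_for(day, directory="C:/logs", prefix="proxy_", suffix=".log"):
    """Build a dated log path such as C:/logs/proxy_20120229.log.

    directory, prefix and suffix are placeholders: the real naming
    scheme was not given in the question.
    """
    return "%s/%s%s%s" % (directory, prefix, day.strftime("%Y%m%d"), suffix)


today = datetime.date.today()
print(log_path_for(today))

# If the script runs before the 5:00am upload, yesterday's file is the newest.
yesterday = today - datetime.timedelta(days=1)
print(log_path_for(yesterday))
```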

posted by katrielalex at 3:37 AM on March 6, 2012


Best answer: Is there a way to go about that? (split the IP addresses by period...?)

Why can't you call split a second time on parts[4]? You could also use a regex to do all of the parsing in one go, but you know what they say about regexes and having two problems.
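
A rough sketch of the second split, assuming the field is a plain dotted IPv4 string. `segment_of` is a made-up helper, and the sample IPs are invented; in the script the argument would be `parts[4]`.

```python
from collections import defaultdict


def segment_of(ip, octets=2):
    """Return the first `octets` parts of a dotted IPv4 string, e.g. '10.160'."""
    return ".".join(ip.split(".")[:octets])


segments = defaultdict(int)
for ip in ["10.0.1.5", "10.160.240.7", "10.0.2.9"]:  # made-up sample IPs
    segments[segment_of(ip)] += 1

print(dict(segments))
```

The resulting dict can go through the same `heapq.nlargest` loop as the other three counters.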
posted by Dr Dracator at 3:47 AM on March 6, 2012


Use socket.inet_aton(ip_string) to convert the IP address to packed binary, unpack that to an integer (for example with struct.unpack), then do a bitwise AND with 255.255.0.0 converted the same way.
P.S. Given classless addressing and IP portability, I'm curious how that helps you.
posted by Rubbstone at 3:48 AM on March 6, 2012


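import socket
import struct
# inet_aton() returns a packed 4-byte string, so unpack it to an integer before masking
mask = struct.unpack("!I", socket.inet_aton("255.255.0.0"))[0]
classA = defaultdict(int)
...
if len(parts) >= 16:
    access[parts[11] + parts[13]] += 1 # grabs host url and path and combines them
    user[parts[15]] += 1 # grabs usernames
    sourceip[parts[4]] += 1 # grabs computer's IP
    addr = struct.unpack("!I", socket.inet_aton(parts[4]))[0]
    classA[struct.pack("!I", addr & mask)] += 1 # counts the /16 network (255.255.0.0), key stored packed for inet_ntoa()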
posted by Rubbstone at 3:59 AM on March 6, 2012


Best answer: # top k /16 networks (the classA dict)
for net, count in heapq.nlargest(k, classA.iteritems(), key=itemgetter(1)):
    print "%d %s" % (count, socket.inet_ntoa(net))

Note that these solutions assume IPv4. For IPv6 or dual-stack support you would need another mask, and the program would need to detect IPv6 addresses.
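
As a self-contained sketch of the masking idea: `network_of` is a made-up helper name, and returning None for anything `inet_aton()` rejects (IPv6 strings included) is just one way to handle non-IPv4 fields.

```python
import socket
import struct

# 255.255.0.0 as a 32-bit integer in network byte order
MASK = struct.unpack("!I", socket.inet_aton("255.255.0.0"))[0]


def network_of(ip, mask=MASK):
    """Return the masked IPv4 address as a dotted string, or None if
    inet_aton() cannot parse ip (for example an IPv6 address)."""
    try:
        packed = socket.inet_aton(ip)
    except socket.error:
        return None
    addr = struct.unpack("!I", packed)[0]
    return socket.inet_ntoa(struct.pack("!I", addr & mask))


print(network_of("10.160.240.7"))
print(network_of("::1"))
```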
posted by Rubbstone at 4:07 AM on March 6, 2012


Response by poster: Hm, I might have fudged the explanation of the IP segment thingy a bit. The IP segments are allocated by DHCP in our building, depending on the PC's location or type.
For example, our servers only use 10.0.x.x, and PC-workstations-on-16F all have IPs of 10.160.240.x. Knowing the aggregate traffic of a certain segment will help me/us figure out which areas are consuming the most.
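
With segments like these, one option is to count by a named prefix instead of a fixed mask. This is only a sketch: the two prefixes come from the examples above, while the labels, the `label_for` helper and the sample IPs are made up.

```python
from collections import defaultdict

# Prefixes taken from the examples above; the labels are made up.
# Longer prefixes are listed first so the most specific match wins.
SEGMENTS = [
    ("10.160.240.", "workstations-16F"),
    ("10.0.", "servers"),
]


def label_for(ip):
    """Return the label of the first prefix that ip starts with, else 'other'."""
    for prefix, label in SEGMENTS:
        if ip.startswith(prefix):
            return label
    return "other"


traffic = defaultdict(int)
for ip in ["10.0.3.4", "10.160.240.12", "10.160.240.40", "192.168.1.1"]:  # sample IPs
    traffic[label_for(ip)] += 1

# Print the segments ranked by request count, highest first.
for label, count in sorted(traffic.items(), key=lambda item: item[1], reverse=True):
    print("%d %s" % (count, label))
```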
posted by monocot at 4:19 AM on March 6, 2012


Response by poster: Thanks for the answers so far! I'll try them asap tomorrow :)
posted by monocot at 4:20 AM on March 6, 2012


Best answer: pastebin is your friend. If you're posting code put it onto a site like pastebin and then post the link.

Also, please post an example of the actual file you're trying to parse. I still don't know what the specification of the input is. Also, I don't understand the full specification of what the log filenames are; you say there's a date inside there, but what else?

As a noob I know you can get stuck into a problem and then post a mind dump of everything you've tried in frustration. You need to learn to structure your posts and how to ask questions properly. I probably would have put this as:
I have a text file full of the following types of lines. [post example lines of the input file].

Moreover, these text files have the following filenames. [post an 'ls' or 'dir' of your directory showing what the filenames look like].

I would like to transform a given text file into the following output as another text file. [post example lines of your output file].

As a complete example, given the following input file I would expect the following output file. [post your complete example].

As I've started in Python it'd be really helpful if I got an answer in Python! Thanks.

Responding to your other questions:

2. I tried for a few days to print the output into a file, but "print >>file, '%d %s' % (count, url)" gives me syntax errors. f.write() similarly doesn't work. How do you write the output into a text file or better yet, send it as an email? (The server this will run on has Windows, and I was thinking of using Task Manager to run this at a regular time each day.)

def write_string_to_file(input_string, filename):
    with open(filename, 'w') as f:
        f.write(input_string + '\n')

def write_strings_to_file(input_strings, filename):
    with open(filename, 'w') as f:
        for line in input_strings:
            f.write(line + '\n')


Note the gotcha that file.writelines() does not add line breaks to your strings, so I like to be explicit and add them myself.

3. The proxy log file is uploaded automatically at 5:00am every day and has the date (e.g. 20120229) in its name. How can I set this script to get the current date and scan the correct file?

Use glob, get a list of files, then apply regular expressions to them and parse the dates out of the filenames.
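
A sketch of that approach, assuming the logs end in `.log` and carry an eight-digit YYYYMMDD date somewhere in the name. Both assumptions and the `newest_log` helper are illustrative, since the actual filenames weren't posted.

```python
import glob
import os
import re

DATE_RE = re.compile(r"(\d{8})")  # eight digits, e.g. 20120229


def newest_log(directory, pattern="*.log"):
    """Return the path whose name holds the latest YYYYMMDD date, or None."""
    dated = []
    for path in glob.glob(os.path.join(directory, pattern)):
        match = DATE_RE.search(os.path.basename(path))
        if match:
            # YYYYMMDD strings sort chronologically as plain text
            dated.append((match.group(1), path))
    return max(dated)[1] if dated else None


print(newest_log("."))
```

Files without a date in the name are skipped, so stray logs in the same folder don't get picked up by accident.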

Hope that helps! If you post a full example on Stack Overflow I'd be surprised if an answer took longer than 10 minutes to reach you.
posted by asymptotic at 4:41 AM on March 6, 2012


Best answer: A lot of your work has already been done for you. Take a look at a Python library designed specifically to read apache log files. apachelog, for example. You iterate line by line over your log, and it turns each line into a convenient dictionary. No regular expression or string manipulation required.

Finding and using other people's code is one of the great things about Python, and, for beginners, one of the best lessons to learn. Communities build up over popular programming languages; in fact, some people judge languages by the quality of those communities.
posted by jgfoot at 5:26 AM on March 6, 2012

