C#: capture data, display to user, save to disk, upload to database: BEST approach
January 17, 2009 10:45 PM

I have a program that captures live data from a piece of hardware on a set interval. The data is returned as XML. There are several things I would like to do with this data (in order):

- display it to the user
- save it to disk
- eventually, upload it to a database

My current approach is to take the XML and parse it into a hashtable so I can display the correct values to the user.
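(In C# that hashtable would likely be a `Hashtable` or `Dictionary`; here is the same idea sketched in Python for brevity, with made-up element and attribute names since the real schema isn't shown:)

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout; the real element names will differ.
SAMPLE = "<reading><temp unit='C'>21.5</temp><humidity unit='%'>40</humidity></reading>"

def parse_reading(xml_text):
    """Parse one XML response into a flat mapping of tag -> (value, unit)."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text, child.get("unit")) for child in root}

values = parse_reading(SAMPLE)
```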

Next, I want to save the XML to a file on disk. For each data capture session I am planning on creating a unique XML file and I will dump all the data into it.

Finally, I would like to reparse the XML and upload it to a MySQL database. The data cannot be immediately uploaded to the database.

This seems like a really inefficient method of solving this problem, and I would love some advice.

Is it a waste of hard disk space to save the data as XML?

Is it THAT inefficient to have to reparse the XML in order to write it to a database?

Thank you!

To clarify: a typical XML response will be ~1 KB, and responses are captured at a rate of about one every 15-60 seconds.

I think I do want to store the XML as XML on the disk because the data is very valuable and a pain to reproduce (if it is even possible). Thank you!
posted by jdlugo to Computers & Internet (8 answers total)
Best answer: I wouldn't worry about any inefficiencies. 1KB every 15sec is trivial for any sort of modern PC, barring some sort of crazy pathological XML constructs.

As far as storing the data on disk, only you can decide if the disk space is worth it. My late-night, back-of-envelope, bourbon-influenced calculations suggest that uncompressed it will only take up ~5 MiB per day. You can save even more with compression: XML compresses very well, and if it's just for archival there's not much reason not to. But as long as you're sure you're capturing the data accurately, there should be no difference between the original XML and what you extract from it.
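(The back-of-envelope arithmetic checks out, assuming a steady 1 KiB response every 15 seconds, the worst case the poster gave:)

```python
READING_BYTES = 1024   # ~1 KiB per response
INTERVAL_S = 15        # worst case: one response every 15 s

readings_per_day = 24 * 3600 // INTERVAL_S      # 5760 readings/day
bytes_per_day = readings_per_day * READING_BYTES
print(bytes_per_day / 2**20)                    # 5.625 MiB/day uncompressed
```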

Which leads us to the next point. Make sure you use a proper parser and not some regex or other hack to import the XML. And be careful of MySQL. Regular expressions and MySQL are both prone to working great and fast when everything is exactly right, but you can get silent failures and loss of data with the slightest glitch. Write some good tests so you can test your parsing and importing modularly.
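(A concrete illustration of why a real parser beats a regex: malformed input fails loudly instead of silently producing junk. A minimal Python sketch; in C# the equivalent would be `XmlDocument`/`XDocument` plus a unit-test framework:)

```python
import xml.etree.ElementTree as ET

def extract_value(xml_text, tag):
    # A real parser raises ParseError on malformed input, so garbage
    # never quietly makes its way into the database.
    root = ET.fromstring(xml_text)
    el = root.find(tag)
    return None if el is None else el.text

# Good input parses; truncated input raises instead of returning junk.
assert extract_value("<r><temp>21.5</temp></r>", "temp") == "21.5"
try:
    extract_value("<r><temp>21.5</r>", "temp")   # mismatched closing tag
    raised = False
except ET.ParseError:
    raised = True
```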

Finally, if you're talking about the order of things as priorities, or the order of implementation, that's fine. But as you move forward you might want to rearrange the code so that, for example, there is a dæmon which captures the data, archives it, parses it, and throws it into the database. Meanwhile your GUI can consume this data and display it to the user. Then you can be UI independent and not worry about some UI thread locking thing screwing up your data capture.
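(One way to sketch that decoupling, in Python as a stand-in: the capture side only enqueues, and the UI drains a thread-safe queue at its own pace, so a stalled display can never interrupt collection:)

```python
import queue
import threading

def capture_loop(read_sample, out_q, n_samples):
    # The capture thread only captures and enqueues; it never touches
    # the UI, so UI thread locking can't screw up data capture.
    for _ in range(n_samples):
        out_q.put(read_sample())

samples = queue.Queue()
t = threading.Thread(target=capture_loop,
                     args=(lambda: "<reading/>", samples, 3))
t.start()
t.join()

# The UI side drains the queue whenever it gets around to it.
collected = [samples.get() for _ in range(3)]
```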

Good luck!
posted by vsync at 10:56 PM on January 17, 2009

Can you tell us what the format of the data is? It might save some space to use CSV (comma-separated values) format for the data. If it's data that's already similar to a database table (i.e. each record is just a list of fields), CSV will work well.

And on top of that, if it's in CSV format, there are lots of tools to import it directly into databases or even Excel spreadsheets. You wouldn't need to store each response in a separate file; rather, you'd just append data each time you get new stuff (open the file in append mode, write your data, and close it; once every 15 seconds, this should work fine).
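(The append-per-capture pattern is only a few lines; a Python sketch with hypothetical field names:)

```python
import csv
import os
import tempfile

def append_record(path, row):
    # Open in append mode, write one record, close: cheap enough at
    # one capture every 15-60 seconds.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

path = os.path.join(tempfile.mkdtemp(), "captures.csv")
append_record(path, ["2009-01-17T22:45:00", "21.5", "C"])
append_record(path, ["2009-01-17T22:45:15", "21.6", "C"])

with open(path, newline="") as f:
    rows = list(csv.reader(f))
```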
posted by delmoi at 11:15 PM on January 17, 2009

Just as a thought, unless it's all CDATA sections, XML zips up very nicely. Lots of compressible whitespace there. Lots of repeated tags.
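(A quick demonstration of just how well repetitive markup compresses, using Python's gzip at default settings; the exact ratio depends on the real data:)

```python
import gzip

# 1000 near-identical readings: the tags repeat, so gzip eats them alive.
sample = b"<reading><temp unit='C'>21.5</temp></reading>\n" * 1000
packed = gzip.compress(sample)
print(len(packed) / len(sample))   # far below 1 for data like this
```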
posted by adipocere at 11:18 PM on January 17, 2009

Response by poster: The returned XML may or may not include certain result sets. Additionally, the XML contains unit attributes which could change depending on the hardware, and these need to be properly maintained.

My question might be better put as: in what order is it best to accomplish my end goals?

For instance: get the XML, write the XML locally, then parse the XML (from the file or from the original response) for display.

Thank you!
posted by jdlugo at 11:24 PM on January 17, 2009

Best answer: If that were my project, the first thing I'd work on is a really simple daemon whose sole job is to save the data to the disk drive in the exact format it comes out of the data logger, with no modifications at all except some form of time stamp per entry if there was not already one included in the data itself. If what was coming out of the logger had a superficial resemblance to XML, I would not attempt to top and tail it with the necessary single element to make my whole file a proper XML file. I'd just start with an empty file whose name I could optionally pass as a command line parameter, and keep on appending everything that comes out of the logger. I'd do a file flush after each write to make sure I didn't lose any logger data. I would want my data logger collector daemon to be absolutely simple, absolutely correct and absolutely reliable, and I'd want a simple concatenation of a suitable selection of the resulting files to result in exactly the same data set as leaving the daemon running for long enough to collect all that data in one session.

Once I had that up and running, I'd stick it on a production system and start collecting data before I even started coding up the UI, database and possibly intermediate-format file storage stuff.

My thinking is that (a) UI stuff always takes longer to get right than you think it will, because the user side of the interface is so often poor and (b) there will usually be some nasty surprise waiting for me in the data logger's output - some odd quirk like a nonstandard time format or a comma where I expected a dot or some damn thing, and having a reasonably large slab of real data to hand would help shake that stuff out.

I would not worry about efficiency at all until profiling on an actual production system showed me if and where it was required.

First get it right. Then make it quick. So speaks the voice of bitter experience.

And yeah, if you're going to parse XML, use a proper XML parser. Don't fsck about with regular expressions, or your code will annoy everybody who has to work with it in future.
posted by flabdablet at 12:48 AM on January 18, 2009

Oh, yeah: when testing my importer, I would write a little filter that applied a selectable degree of pseudo-random corruption (with a selectable initial key for the PRNG) to the files that came off the logger before feeding them into the parser, and I would make sure that (a) no degree of corruption would make it or the database lock up and fall over, (b) detectable but uncorrectable corruption was detected and reported in some way that the end user would be able to make sense of, and (c) correctable corruption was corrected and the fact of that correction logged (normally silently, but with optional alerts).
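(Such a corruption filter is a handful of lines; a Python sketch with a seeded PRNG so any failure it provokes is exactly reproducible:)

```python
import random

def corrupt(data: bytes, rate: float, seed: int) -> bytes:
    # Replace roughly `rate` of the bytes with random values,
    # reproducibly: the same seed always inflicts the same damage.
    rng = random.Random(seed)
    out = bytearray(data)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.randrange(256)
    return bytes(out)

original = b"<reading><temp>21.5</temp></reading>" * 50
damaged = corrupt(original, 0.05, seed=42)
```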

In no case would I ever discard the original files that came off the logger. Compress and archive yes, discard never never never.
posted by flabdablet at 12:54 AM on January 18, 2009

Just a thought, but there's probably no reason why you couldn't just write the unparsed contents of each XML file into a database table, along with the date/time, the original filename, and a flag to say whether the data has been parsed. That way you can parse the data whenever you like (if it were me, I'd run a PHP script from cron once a day to parse any new stuff), and you can pretty much reconstruct the original XML files if you ever need them. This removes the need to back up the XML files and the database separately, and means there's less of a need to worry about managing large numbers of files.
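(That table needs only a handful of columns. A sketch using SQLite as a stand-in for MySQL, with hypothetical table and column names:)

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE raw_capture (
        id          INTEGER PRIMARY KEY,
        captured_at TEXT,
        filename    TEXT,
        xml         TEXT,
        parsed      INTEGER NOT NULL DEFAULT 0
    )""")

def store_raw(xml_text, filename):
    db.execute(
        "INSERT INTO raw_capture (captured_at, filename, xml) VALUES (?, ?, ?)",
        (time.strftime("%Y-%m-%d %H:%M:%S"), filename, xml_text))

def unparsed():
    # The once-a-day cron job would select these, parse them, then flag them.
    return db.execute(
        "SELECT id, xml FROM raw_capture WHERE parsed = 0").fetchall()

store_raw("<reading><temp>21.5</temp></reading>", "capture-001.xml")
pending = unparsed()
db.execute("UPDATE raw_capture SET parsed = 1 WHERE id = ?", (pending[0][0],))
```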
posted by le morte de bea arthur at 2:18 AM on January 18, 2009

Best answer: 2nding the multi-daemon approach.

We did exactly this project for a client, except that our data 1) is emailed to us, 2) comes in eight different formats (two of which are XML and one of which we call WTFML), and 3) the UI is SharePoint and the application stack is Microsoft. But the concepts are the same.

We use a single Windows service (daemon) that reads the mail and parses it. How many services you use is your call, based on reliability/management/scalability concerns. Our data is just sitting on the Exchange server, so all our service has to do is stay ahead of the flow of data on average. If the data weren't cached on the Exchange server, we might choose differently, as we wouldn't want the parsing duties to make the service too busy to receive data.

Absolutely write what you get to the db so when the data source gives you garbage you know what happened. That's how we knew when the sensor was broken. That's also how we knew that the system was choking on spam.

Then parse your data and write it out into a real, normalized database table. (A second daemon? Your call.)

Then figure out what the UI needs. In our case, the MSSQL server uses triggers to denormalize the data for fast consumption on the client. (We dupe a bunch of data so that we don't have to do timezone calculations in real time when the user requests the data. We just use multiple rows and filter. So one row for Unit A in GMT, one row for Unit A in CST, one row for Unit A in EST, etc. Filtering is faster than doing the addMinutes(-600) or whatever live in T-SQL.)
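(The row-per-timezone trick in miniature; a Python sketch with hypothetical fixed offsets, ignoring DST for clarity:)

```python
from datetime import datetime, timedelta

# One precomputed row per timezone, so queries just filter on the tz
# column instead of converting timestamps live.
OFFSETS_MIN = {"GMT": 0, "CST": -360, "EST": -300}

def denormalize(unit, utc_ts):
    return [(unit, tz, utc_ts + timedelta(minutes=off))
            for tz, off in OFFSETS_MIN.items()]

rows = denormalize("Unit A", datetime(2009, 1, 18, 12, 0))
```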

Feel free to email me offline if you need anything.
posted by brian60640 at 8:48 AM on January 18, 2009
