Help a noob out.
March 27, 2012 6:59 AM   Subscribe

Programming challenge: pull XML data with an HTTP GET, process and/or lightly transform it, and output a csv file. Please help me figure this out in Java or tell me what other language to use.

I'm trying to pull XML data down from a REST database, process it, and then spit out csv's. Using Java I've gotten as far as pulling the XML down from the server into a Document object, which I figured out how to print out as text. Now I need to process some of that data; simple transformations like taking a field from "AA-BBB-CCCC" into three separate fields "AA", "BBB", and "CCCC". Another task is changing all the underscores in a certain column into hyphens. After that I want to write the output to a csv file. If it would be easier to convert the XML to CSV and then process the CSV in Java, that's fine with me as well.

When I've been googling and stackoverflowing Java and XML, I find stuff like this which seems super heavyweight and complicated.

Is there a simple way to do these kinds of transforms in Java, and write out to CSV? If MeFites could point me to any resources or tutorials on this, I would be very grateful.

Also, if Java is not the best place to be trying to do this, I am open to other languages (Python? Groovy?) as long as learning them will be fairly quick.

Other info that may be helpful: these are small files, less than 100 rows. I have a small about of experience with SQL and Java but am not a real programmer. I had to redownload Eclipse and google "how to add jar file" in order to get this far.
posted by Aizkolari to Computers & Internet (19 answers total) 6 users marked this as a favorite
Any modern language/ecosystem can do this. The theory heavy way of doing this would be for you to construct a XSLT stylesheet that you would use to transform the xml data into the csv (whilst doing your data splitting and character modifications). With that said, if you are struggling with the Java, I can't say that XSLT will be any easier especially since XSLT is a very different mindset.

As for Java, take a look at JDOM that is covered in the 14th chapter of the book you referenced. It is by far the most Java-like of the XML APIs and I think you will have the most success starting there.
posted by mmascolino at 7:13 AM on March 27, 2012

This would be trivial in Perl with the REST::Client, XML::Simple, and Text::CSV_XS modules. Perl was made for text transformations on data.
posted by nicwolff at 7:17 AM on March 27, 2012 [1 favorite]

You're in good shape so far.

You've got two basic philosophical choices here.

1 - You can walk through your xml data, pick out the elements you want to do stuff with, and write the CSV rows to a StringBuilder. This might be easier to wrap your head around. For small files, a DOM parser might be your best bet.

2 - Or, you can use XSLT to change the XML document directly to the CSV that you want. It's a bit more abstract, and the XSLT language is a beast unto itself. But, it does give you a terse and powerful way to do what you're looking for, with less lines of total code.

I kind of wish I knew what the full ask is, so I can say one way or the other which is the best way to go.

Anyway, since you say you already have a document in memory, this is a good place to start for option 1:

Something like Document.getElementsByTagName(), along with a loop, might just get you the rest of the way.

Good luck!
posted by Citrus at 7:18 AM on March 27, 2012 [1 favorite]

Response by poster: Citrus, if there is more information that I could provide that would be helpful, let me know.

Thanks for all the answers so far.
posted by Aizkolari at 7:21 AM on March 27, 2012

Putting in a plug for the Groovy programming language, which has taken this non-programmer further than I thought I could ever go. The syntax is amazingly intuitive: even if you don't know exactly what you're doing, a lot of times you can enter some parens or some other character that seems to make sense and lo and behold, it does what you want it to. XML processing with Groovy, in particular, is pretty pain-free.

There is a great mailing list (user at groovy dot codehaus dot org) on which I frequently post "newb, please help" type questions and I always get at least one helpful answer.

If you really want to learn programming, I've heard that it's best to start with Java or C/C#/C++ and go on to the "syntactic sugar" languages like Groovy, but if your job doesn't require you to be a full-fledged programmer, a language like Groovy can help you immensely.

Python is also great, but be prepared to wrestle with getting packages if you are behind an NTLM (Windows) proxy at work and there aren't any Python programmers at your job to help you out. I haven't found that to be the case with Groovy.
posted by Currer Belfry at 7:35 AM on March 27, 2012 [1 favorite]

Quick Perl example on a random XML feed, showing all the features you mention needing:
#!/usr/bin/perl -l

use strict;

use LWP::Simple;
use XML::Simple;
use Text::CSV;

my $csv = Text::CSV->new;

my $xml = XMLin( get "" );
for my $item ( @{ $xml->{channel}->{item} } ) {
  $item->{title} =~ s/_/-/g;
  my @parts = split '/', $item->{link};
  $csv->combine( $item->{title}, @parts );
  print $csv->string; 

posted by nicwolff at 7:43 AM on March 27, 2012 [4 favorites]

Response by poster: nicwolff, that looks super helpful, although I admit I probably only understand 3% of that. Would that code work with either ActiveState or Strawberry Perl? I'm probably going to install one or the other to give it a shot.
posted by Aizkolari at 8:06 AM on March 27, 2012

I think if you've started down the Java road, Citrus' Option 1 is the way to go, but FYI,
Ruby can do this pretty elegantly with the HTTParty gem.

In the linked example, the gem does the request and converts the XML payload into a Hash that you are free to iterate over, or whatever. I find this a lot easier than dealing with DOM objects. Here's a naive example of CSVifying the example URI from that blog post:
response = Representative.get('')
# {"result"=>{"n"=>"1", "rep"=>{"name"=>"Joe Donnelly", ...}}}  

headers = []
values = []
response['result']['rep'].each do |key, value|
  headers << key
  values << value

puts headers.join(',')
# state,office,phone,name,district,link
puts values.join(',')    
# IN,1218 Longworth,(202) 225-3915,Joe Donnelly,2,

posted by substars at 8:09 AM on March 27, 2012

Best answer: Aizkolari, I just tested nickwolff's Perl on the latest Strawberry Perl ( on WinXP from a fresh Virtual Machine install. A couple of steps and changes and it works fine.

Download/Install Strawberry Perl (I used the .msi installer)

Open a cmd shell and you need to install the Text::CSV module (can't believe this wasn't included, the other 'harder' modules like XML::Simple were.) Type 'cpan Text::CSV' to install the missing module.

Then since Strawberry Perl is the latest 5.14.2 (and I dislike using -l), change 'use strict;' to 'use 5.014;' and change the 'print' to 'say'. (The '-l' does automatic line-ending stuff, should strip the EOL on input and add one with print on output, but since the script fetches data using LWP::Simple and 'say' is print with a EOL... the '-l' is just baggage. The whole #! line may be baggage on Windows AFAIK.)


use 5.014;

use LWP::Simple;
use XML::Simple;
use Text::CSV;

my $csv = Text::CSV->new;

# perform HTTP GET request and parse the XML content
# into a Perl data structure.
my $xml = XMLin( get "" );

for my $item ( @{ $xml->{channel}->{item} } ) {  # each item in array of items
  $item->{title} =~ s/_/-/g;  # replace '_' with '-' globally in the title
  my @parts = split '/', $item->{link};  # split the url into pieces on '/'
  $csv->combine( $item->{title}, @parts );  # make a proper CSV line with quotes if needed.
  say $csv->string; 

Create and run the script.
C:\....\Downloads> perl
"Subtle Patterns | Free textures for your next web project",http:,,
" - Discover Yourself!",http:,,,help,quicktour,chrome
"Dart : Structured web programming",http:,,
"Kern Type, the kerning game",http:,,
"Wordle - Beautiful Word Clouds",http:,,
"COLOURlovers :: Color Trends + Palettes",http:,,
"TED: Ideas worth spreading",http:,,
"Khan Academy",http:,,,http:,,
"YouTube - Broadcast Yourself.",http:,,

You can use 'perldoc XML::Simple' to read the documentation for that and other modules. A handy module to install is 'Data::Dump' which gives you a nice 'dd' command to dump data in a readable format.
use Data::Dump;
dd $xml;
dd $item->{title};

posted by zengargoyle at 10:16 AM on March 27, 2012 [2 favorites]

Best answer: Thanks for the updating and commenting zengargoyle! Aizkolari, is the Perl script clear to you now?

Whatever solution you use, please don't try to make valid CSV "manually" by joining with commas as in substars' Ruby example – use some kind of CSV library. Otherwise a comma or newline in the data will break your script, and then you'll try surrounding all the values with double quotes, and then a double quote in the data will break your script...
posted by nicwolff at 10:25 AM on March 27, 2012 [1 favorite]

Response by poster: You guys are awesome. I am going to try this this afternoon and see where I get.
posted by Aizkolari at 10:32 AM on March 27, 2012

Best answer: Yeah, +1 to nicwolff and mea culpa; was just trying not to confuse things with a bunch of libraries. Ruby has FasterCSV, which is analogous to Text::CSV in Perl.
posted by substars at 1:16 PM on March 27, 2012 [1 favorite]

Response by poster: OK, so I was able to read the rss feed in the example above into a file but am now having trouble with the HTTP get to my server. Specifically, the headers. You guys have been super help and I will keep googling to try to figure this out but if anyone is still reading and wants to help, here goes:

The error I get is:
c:\Perl Project>perl
Unrecognised option: KEY-HEADER-NAME at line 23

How do I correctly add the headers to this request so it can pull the XML in from the servers?

My code is below. The KEY-HEADER-NAME and the KEY, along with the target server URL are all correct, tested with


use 5.014;

use LWP::Simple;
use XML::Simple;
use Text::CSV;
use HTTP::Headers;

my $csv = Text::CSV->new;

my @headers = (
'ACCEPT' => 'application/xml',

# perform HTTP GET request and parse the XML content
# into a Perl data structure.
my $xml = XMLin( get "", @headers);

# Desination CSV
open (FH, ">>Output1.csv") or die "$!";

for my $item ( @{ $xml->{channel}->{item} } ) { # each item in array of items
$item->{title} =~ s/_/-/g; # replace '_' with '-' globally in the title
my @parts = split '/', $item->{link}; # split the url into pieces on '/'
$csv->combine( $item->{title}, @parts ); # make a proper CSV line with quotes if needed.
say $csv->string;
print FH $csv->string . "\n"; # write line to csv

posted by Aizkolari at 4:31 PM on March 27, 2012

Response by poster: If it is at all helpful, I was also trying to use the below code to generate a request with headers but I got an error saying I needed a URI at the XMLin line instead of $r which is what I had in there.

my $r = HTTP::Request->new(GET => '"');
$r->header( 'ACCEPT' => 'application/xml');
$r->header( 'KEY-HEADER-NAME' => 'KEY');

posted by Aizkolari at 4:37 PM on March 27, 2012

Best answer: See: LWP::UserAgent

The non-Simple version of the doohickey.

use LWP::UserAgent;
my $ua  = LWP::UserAgent->new;
my $res = $ua->get(
    "ACCEPT"            => "application/xml",
    "KEY-HEADER-NAME"   => "foo",
if ( $res->is_success ) {
    say $res->decoded_content;
else {
    die $res->status_line;
perldoc will tell you what you need to know, LWP::UserAgent is the full web-client thing with user agent strings, cookie-jars, headers, SSL certs and the like. Here is the stuff the above sends to a server:

# nc -l 80
GET / HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Accept: application/xml
Host: localhost
User-Agent: libwww-perl/6.04

Nitpick time...

# 2-arg open is frowned upon (FH is a global here)
open FH, '>>Output.csv' or die "$!";
# -> 3 arg open with local vars
open my $outfile, '>>', 'Output.csv' or die "$!";

print $outfile $csv->string, "\n";   # you can use , vs . here, print takes argument list.
close $outfile or die "$!";  # if you're not checking for errors like full filesysem...
# Perl will close the file automatically at exit.

# optional: replace checking with 'autodie' pragma.
use autodie;
open my $fh, '>', 'Outfile.csv';
# dies with "$!" on error automagically...

# you may need to turn off ssl host verification for local certs
$ua->ssl_opts( verify_hostname => 0 );  # see perldoc
Looks like you found the lower level objects used: HTTP::Request, HTTP::Response...
posted by zengargoyle at 9:53 PM on March 27, 2012 [1 favorite]

Best answer: Oh, you might want to play with the 'lwp-request' program that's installed with Strawberry / LWP::UserAgent. See: perldoc lwp-request

...> lwp-request -m GET -H "ACCEPT: application/xml" ... https://...
It's a simple CLI wrapper around LWP::UserAgent so you can play with flags and see results before/while hacking on your code. And your previous try with HTTP::Request would go something like:

$res = $ua->request($r);
$xml = XMLin( $res->content );

posted by zengargoyle at 10:08 PM on March 27, 2012 [1 favorite]

Response by poster: This continues to be fantastic. I'm now getting the XML in from my server, and when I run say $res->decoded_content; it spits the XML out all over the command line.

What is not working all of a sudden is the XML in. I currently have my code below. I've commented out some of the replace stuff to try to make it as simple as possible but it is not printing the csv out to the command line like it was doing before. Also, before I commented out the $outfile stuff it would just create an empty csv. I tried both res->content and res->decoded_content on the XMLin line and neither worked.

# perform HTTP GET request and parse the XML content
# into a Perl data structure.
my $xml = XMLin( $res->decoded_content );

# Desination CSV
#open my $outfile, '>>', 'Output1.csv' or die "$!";

for my $item ( @{ $xml->{channel}->{item} } ) { # each item in array of items
#$item->{title} =~ s/_/-/g; # replace '_' with '-' globally in the title
#my @parts = split '/', $item->{link}; # split the url into pieces on '/'
#$csv->combine( $item->{title}, @parts ); # make a proper CSV line with quotes if needed.
say $csv->string;
#print $outfile $csv->string . "\n"; # write line to csv

posted by Aizkolari at 9:11 AM on March 28, 2012

Response by poster: Nevermind; I can see that I need to modify the for loop there to look for the right content in my xml.
posted by Aizkolari at 10:11 AM on March 28, 2012

I'll add that it may be helpful to know about the MetaCPAN site. Pretty much every Perl module somehow comes from the CPAN and you'll find decently formatted documentation:

And you can look at some of the files that aren't installed with the module but are part of the distibution as a whole. Since Perl is generally awesome when it comes to testing, you can get a bunch of example usage by browsing some of the tests:

And the while you can use 'perldoc' in a terminal, often the HTML version on MetaCPAN is a much easier read with all the linking and syntax-highlighting. The core Perl documentation is also available:

The first 3 tutorials may be of help in digging out your data from $xml.

perlreftut Perl references short introduction
perldsc Perl data structures intro
perllol Perl data structures: arrays of arrays

If you get lost along the way memail me. From the original description, what you're trying to do isn't that hard, but the evil of XML and deep data structures may not be the easiest first Perl hackery.
posted by zengargoyle at 12:07 PM on March 28, 2012

« Older Make history interesting!   |   Blank-librarians Newer »
This thread is closed to new comments.