Differing behavior from browser and java client program?
June 12, 2007 11:18 AM
I'm writing a little test program just to try out network programming in Java. However, some pages I GET return the correct page, some return the standard 404 page (even when I can access the page from my browser), and some just hang without returning at all (again, for a page I can look at with my browser). Can someone more knowledgeable about network programming than me explain this?
Basically, I'm confused about why certain pages produce certain behavior in the program when I try to connect to them: all I'm doing is creating/opening a socket to a website, getting index.html, and printing the contents to stdout. Should be simple, right? For some pages, it is: the HTML outputs to the console like it should. However, for other pages, attempting to read index.html causes my program to just hang until the connection resets, or I get a 404 back instead of the page (the HTML of the 404 page is output to the console). I'm pretty confused about both of these behaviors, since I can access the domains and index.htmls just fine through a browser. Why is this happening?
HTTP 1.1 is a big ugly spec. It's not easy to get right. You might try poking "GET foo HTTP/1.0" down the wire to pretend you're dumber than you are.
If you want to try your programming, make your own server to talk to. Or code something easy, like Jabber -- make a Google Talk client.
posted by cmiller at 12:14 PM on June 12, 2007
If you want to make sure your HTTP stuff is right, you can telnet into the server and copy and paste (or type, really) into the window.
E.g., in Windows from the command line, type:
telnet www.google.com 80 [screen blanks out as you connect]
GET / HTTP/1.1 [hit enter twice]
I think from there, if you wait, it'll send you a 200 OK, and then a page.
I'm doing this from memory, so it might not be exactly right, but if you can make sure your HTTP stuff is correct, you can move on to looking at your Java code.
posted by !Jim at 12:22 PM on June 12, 2007
Just to expand on what bhance said, make sure you send a 'Host:' header; it's required in HTTP 1.1. Also, the server can refuse to send a reply (or send a 404 or whatever) if it thinks you're not using a browser it supports (it shouldn't, of course, but it might), so you might want to add a 'User-Agent:' header which spoofs some modern browser.
Also, I don't know if you're already using it, but I recommend using Jakarta Commons' HttpClient package for this.
posted by jlub at 12:41 PM on June 12, 2007
I'll make a guess without knowing more about how you are making the requests and which sites you are making requests to.
With the advent of dynamic HTML there's been a need to determine which browser the user is making a request from. I suspect that your headers (via Java's HttpURLConnection?) are showing something that the server doesn't like. You can put anything you want in your headers; check the API (or search Google for "java set header"). Since most of the mainstream browsers' headers are recognized (User-Agent, especially), it's not surprising that you can view the pages with a browser.
However, the other server wouldn't just wait around and expect more input. Request/Response is just that. There isn't much time to start shoving stuff down the line, especially with a GET.
However, it sounds like you're trying to learn the Java HTTP API. It's kind of old and, in my opinion, hard to use. You could try using the Jakarta Commons HttpClient package:
http://jakarta.apache.org/commons/httpclient/
which might be a little nicer. I don't know. Haven't had to use it yet.
Also, that telnet trick is extremely useful. We still use it all the time to see if our firewall ports are open for HTTP, EJB connections, etc.
posted by kookywon at 12:42 PM on June 12, 2007
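For reference, a minimal sketch of setting a header with Java's HttpURLConnection, as suggested above. The class name `HeaderDemo`, the helper `prepare`, the example URL, and the User-Agent string are all made up for illustration; the timeouts are an addition so a non-responding server cannot block the program forever.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class HeaderDemo {

    // Prepares (but does not yet send) a GET request with a custom User-Agent.
    // Nothing goes over the network until the response is actually requested.
    static HttpURLConnection prepare(String url, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", userAgent);
        // Assumed values: give up after 5 seconds instead of hanging.
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical URL and User-Agent string, for illustration only.
        HttpURLConnection conn = prepare(
                "http://example.com/",
                "Mozilla/5.0 (compatible; ExampleClient/1.0)");
        System.out.println("User-Agent to be sent: "
                + conn.getRequestProperty("User-Agent"));
        // To perform the request, call conn.getResponseCode() and then
        // read conn.getInputStream().
    }
}
```

HttpURLConnection fills in the Host header and the line endings itself, so with this API only the optional headers need attention.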
Like everyone else, I suspect that the request is somehow wrong.
Most likely things to be wrong with it: missing carriage return or line feed chars (make sure your request ends with "\r\n\r\n"; some, but not all, of the many web servers you'll find will accept "\n\n"), or a missing "Host: " header, which is important to some servers.
In my experience, "GET / HTTP/1.1\r\nHost: hostname\r\n\r\n" always works.
posted by sfenders at 2:01 PM on June 12, 2007
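The request format described above might look like this over a raw socket. This is only a sketch: the class name `SimpleHttpGet` and its helper methods are hypothetical, it assumes plain HTTP on port 80 (no HTTPS), and the `Connection: close` header and the 10-second read timeout are additions so that the read loop ends when the server finishes or stops responding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SimpleHttpGet {

    // Request line, Host header, then a blank line; every line ends in CRLF.
    // "Connection: close" (an addition) asks the server to close the socket
    // after the response, so reading until end-of-stream terminates.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Sends the request and returns the raw response (status line, headers, body).
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(10_000); // assumed limit: fail instead of hanging
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No host given: just show the bytes that would be sent.
            System.out.print(buildRequest("example.com", "/"));
            return;
        }
        System.out.println(fetch(args[0], 80, "/"));
    }
}
```

Run it with a host name as the only argument to fetch "/" from that host; with no arguments it prints the request text without touching the network.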
"attempting to read index.html"
Of course, it could just be that you're assuming that "/index.html" is always valid. It isn't. Use "/" instead.
posted by sfenders at 2:03 PM on June 12, 2007
I'd also note that "/index.html" is not necessarily going to exist on all (or even most) sites these days, which could account for some of your 404s. You want to be sending "/" as the URI.
There's also the problem of virtual hosting, where hundreds of distinct sites might share a common IP address. You'll want to send a
Host: www.whomever.com
header along with your request. See here for more info.
(On preview: what they said.)
posted by whir at 2:13 PM on June 12, 2007
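To illustrate why the Host header matters under virtual hosting, here is a toy sketch of how a server sharing one IP address among several sites might choose a response. Everything in it is hypothetical (the class `VirtualHostDemo`, the site table, the response strings); it ignores ports in the Host value, and real servers are far more involved and may serve a default site instead of rejecting the request.

```java
import java.util.Locale;
import java.util.Map;

public class VirtualHostDemo {

    // Hypothetical sites that all live on the same IP address.
    static final Map<String, String> SITES = Map.of(
            "www.whomever.com", "whomever home page",
            "www.example.org", "example.org home page");

    // Returns the value of the Host header in the raw request, or null if absent.
    static String hostOf(String request) {
        for (String line : request.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Host")) {
                return line.substring(colon + 1).trim().toLowerCase(Locale.ROOT);
            }
        }
        return null;
    }

    // Chooses a response based only on the Host header.
    static String respond(String request) {
        String host = hostOf(request);
        if (host == null) {
            return "400 Bad Request"; // HTTP/1.1 requests must carry a Host header
        }
        String page = SITES.get(host);
        return page != null ? "200 OK: " + page : "404 Not Found";
    }

    public static void main(String[] args) {
        System.out.println(respond("GET / HTTP/1.1\r\nHost: www.whomever.com\r\n\r\n"));
        System.out.println(respond("GET / HTTP/1.1\r\n\r\n"));
    }
}
```

The same request line produces different results depending on the Host header, which is one way a hand-written request can get an error while a browser gets the page.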
It was the request form. Thanks so much for all the info about HTTP 1.1! All the network programming I've done was using 1.0, so I didn't really know what I was doing :)
posted by version control at 2:54 PM on June 12, 2007
Install Wireshark, and use it to compare what actually goes over the wire on a successful browser request with what goes over it on one of your failed ones. Then, keep tweaking your own requests until they look sufficiently like the browser ones to get the job done.
Trying to learn network anything without being able to see what's actually going on is just way, way too hard.
posted by flabdablet at 3:59 PM on June 12, 2007
"just hang until the connection resets" could just be the webserver waiting for proper header syntax...
posted by bhance at 11:22 AM on June 12, 2007
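A small local experiment can reproduce that kind of hang without involving a real web server. In this sketch a listener on the loopback interface accepts connections but never replies, as a stand-in for a server still waiting for the end of the headers; the client sends a request with no terminating blank line and uses a read timeout so the wait is bounded. The class name `HangDemo`, the method `readTimesOut`, and the timeout values are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class HangDemo {

    // Sends the given request text to a local listener that never answers.
    // Returns true if the read gave up after timeoutMillis.
    static boolean readTimesOut(String request, int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            // Without this, read() below would block indefinitely.
            client.setSoTimeout(timeoutMillis);
            client.getOutputStream().write(request.getBytes(StandardCharsets.US_ASCII));
            client.getOutputStream().flush();
            try {
                client.getInputStream().read();
                return false; // data or end-of-stream arrived
            } catch (SocketTimeoutException e) {
                return true;  // nothing came back within the limit
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Note the missing blank line: the request is never "finished".
        boolean timedOut = readTimesOut("GET / HTTP/1.1\r\n", 500);
        System.out.println(timedOut ? "read timed out" : "got a response");
    }
}
```

Setting a read timeout on the real client socket turns a silent hang into an exception, which at least makes it obvious that the server is still waiting for something.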