How to test for external API availability on a CI server?
April 8, 2023 8:05 PM
I have a unique predicament. Our vendor provides a GraphQL endpoint (production only; no test environment is available). We're working on a NextJS app, and our developers can work against mocked data. Assume our code can't cause the issue of deleted users still showing up, etc., because we can test this outside of code. On our CI server I want to test scenarios we know have been edge-case issues and kill a build before it goes out; this will help with finger-pointing. The process to test is pretty complex, so a shell script would probably be a pain. Has anyone dealt with this?
The conditions are long, but I have HTTP methods to test some really unique edge cases. I used users above, but maybe something more realistic is this (a code sketch follows the list):
1. Authenticate against the API and retrieve the JSON for, say, product data for a known edge case.
2. Authenticate in another manner to actually do any updates (one authentication does reads, the other does updates; it is strange).
3. Make the update to the API: delete, change, or whatever our multitude of test cases calls for.
4. Reauthenticate via the other method and check the JSON against step 1.
5. Then tell the API to "release the product".
6. Authenticate against the release endpoint and make sure whatever change was made is there.
7. If there's a problem: send a notification, reverse any changes made, "kill the endpoint" (I can tell it to regenerate), tell it to regenerate, and stop the build.
8. Otherwise, proceed with the build.
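A minimal sketch of steps 1–4 and 7 as a Deno test, in case the shape helps: every URL, query, credential name, and the `EDGE-1` SKU below are hypothetical stand-ins, since the vendor's real schema isn't shown here.

```ts
// vendor_check.ts — sketch of the read/update/verify/revert flow.
// All endpoints, queries, and env var names are assumptions.

const API = Deno.env.get("VENDOR_API_URL") ?? "https://vendor.example.com/graphql";

async function gql(token: string, query: string) {
  const res = await fetch(API, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
    body: JSON.stringify({ query }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return await res.json();
}

Deno.test("vendor API round-trips a known edge-case product", async () => {
  const readToken = Deno.env.get("READ_TOKEN")!;   // read-only credential
  const writeToken = Deno.env.get("WRITE_TOKEN")!; // separate write credential

  // 1. Read the product as it stands (hypothetical query shape).
  const before = await gql(readToken, `{ product(sku: "EDGE-1") { name } }`);

  // 2–3. Make a known-safe update with the write credential.
  await gql(writeToken, `mutation { updateProduct(sku: "EDGE-1", name: "probe") { sku } }`);

  // 4. Re-read with the read credential and compare against step 1.
  const after = await gql(readToken, `{ product(sku: "EDGE-1") { name } }`);
  if (after.data.product.name !== "probe") {
    // 7. On mismatch: revert, then fail the build via a failing test.
    const oldName = JSON.stringify(before.data.product.name);
    await gql(writeToken, `mutation { updateProduct(sku: "EDGE-1", name: ${oldName}) { sku } }`);
    throw new Error("vendor API returned stale data for EDGE-1");
  }
  // 5–6. The "release the product" and release-endpoint verification steps
  // would follow the same pattern against their own endpoints, omitted here.
});
```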
These endpoints are all not on us, so we're really testing the vendor's API. The client also accesses the API non-programmatically through a client interface not hosted by us. It's pretty clear it's not on us, and the vendor won't give us logs or anything. Since our code *does* manipulate the endpoint programmatically for other things (say, updating a product's parent SKU), there could be an edge case where we're impacting things we aren't touching, but we can't reliably replicate that in unit tests against the mock data. All we can do is compile the really unique edge cases we can sometimes replicate by going through the UI on real data: say it is working, push out the code, say it is not working. This will cause downtime if the error appears, but the client is okay with that. Their API is a complete black box and provides no logs.
To me this is completely a vendor problem, but it is now our problem, as the vendor says no other clients have experienced intermittent bad-data issues. Since the client is heavily invested in this vendor, switching is not an option. Our tests against the APIs always come back with an "ok" response and no errors. What it appears to be, basically, is that the end user makes a change outside of our application and, under very certain conditions that aren't 100% reproducible, breaks things even though we're adhering to their API. Anyway, this gives executives a report that it worked after our build. I've got a collection of pretty complex shell/curl scripts that check all of the above, e.g., whether there are invisible characters in the commands we're sending and the responses coming back that might be triggering this, but I need to get this going.
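For the invisible-character check specifically, one way to automate what the shell/curl scripts do by hand is to scan the raw response body for control and zero-width code points. A sketch, assuming the ranges below are a reasonable starting set rather than an exhaustive one:

```ts
// invisible_chars.ts — flag control and zero-width characters in a raw
// response body, the kind of thing curl + od would surface manually.

const SUSPECT = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u200B-\u200D\uFEFF]/g;

export function findInvisibleChars(body: string): { index: number; codePoint: string }[] {
  const hits: { index: number; codePoint: string }[] = [];
  for (const m of body.matchAll(SUSPECT)) {
    hits.push({
      index: m.index!,
      // Report as U+XXXX so the offending byte is easy to cite to the vendor.
      codePoint: "U+" + m[0].codePointAt(0)!.toString(16).padStart(4, "0").toUpperCase(),
    });
  }
  return hits;
}
```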
So with all that, what's my best bet for integrating this into some sort of test so that we don't get blamed for breaking the API? I'm asking specifically about the technical side: just pick a unit testing framework for some language, have it run, and have it return a bad status code telling them the API is broken and the build won't run?
I don't care about the language/testing framework. I was looking at something simpler and outside the large NextJS application, and just thought a Deno test (https://deno.land/manual@v1.32.3/basics/testing) that e-mails out high-level results and a more detailed log might work? I've never had CI tests that weren't testing my own application and that were fairly destructive. Also, and this is a bit complicated: the APIs are actually GraphQL, so testing them using the same client seems reasonable, but the application is a bit of a black box where, on build, it takes our code and executes the request somehow from React code that is SSG, with no actual queries built into it. They're trying to move to a "codeless approach" where you don't even need actual code to do what we're doing but do it all in the browser: you basically give them a React component without the direct call, attach it to their product, and it somehow executes the query to correctly fill the props. I can still make the same assumed API call, and I found curl exposes things like invisible characters in responses, and I can give the vendor exact headers on failures, where GraphQL clients tend to clean up requests/responses. I don't know, this is a weird one.
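On the "bad status code kills the build" mechanism: `deno test` already exits nonzero when any test fails, so a minimal gate can be as small as this (the `VENDOR_API_URL` env var is an assumption):

```ts
// api_gate_test.ts — run with `deno test --allow-net --allow-env api_gate_test.ts`.
// If the assertion fails, `deno test` exits nonzero and the CI step fails,
// which is all the "stop the build" mechanism needs.

import { assertEquals } from "https://deno.land/std@0.182.0/testing/asserts.ts";

Deno.test("vendor GraphQL endpoint answers a trivial query", async () => {
  const res = await fetch(Deno.env.get("VENDOR_API_URL")!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: "{ __typename }" }),
  });
  assertEquals(res.ok, true, `vendor endpoint returned HTTP ${res.status}`);
  await res.body?.cancel(); // avoid leaking the response body in Deno tests
});
```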
Sounds like a pretty deep test of your vendor's product, not your own code. You probably don't want to gate your own releases on the vendor's operational issues, which may pop up at any time regardless of the state of your work. Check whether your CI platform allows for scheduled tasks instead of event-driven ones (e.g. GitHub's on.schedule syntax). Then connect those to some appropriate alerting or monitoring like OpsGenie so you are notified when your vendor's API is failing.
posted by migurski at 8:58 AM on April 9, 2023 [4 favorites]
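A minimal sketch of this scheduled-check idea: a standalone Deno script that a cron-style CI job (e.g. GitHub Actions `on: schedule:`) runs periodically, exiting nonzero so the platform's alerting can fire. The URL and query are placeholders:

```ts
// scheduled_check.ts — meant to be run by a scheduled CI job rather than
// on every build, so vendor outages alert you without blocking releases.

const res = await fetch(Deno.env.get("VENDOR_API_URL")!, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query: "{ __typename }" }),
}).catch((e) => e as Error);

if (res instanceof Error || !res.ok) {
  console.error("vendor API check failed:", res instanceof Error ? res.message : res.status);
  Deno.exit(1); // nonzero exit lets the scheduler's alerting (OpsGenie etc.) fire
}
await res.body?.cancel();
console.log("vendor API check passed");
```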
Response by poster: I myself was debating CI vs. a cron job; I'm going to try to get out of this altogether.
posted by geoff. at 10:01 AM on April 9, 2023
Mod note: One deleted. "People on metafilter can't answer this because of their political views" is certainly a spicy take, but no.
posted by taz (staff) at 10:32 AM on April 9, 2023 [2 favorites]
This scenario of data losing integrity under certain edge cases is a hard one to control or manage. The term of art is "ACID compliance".
As I understand it, you want a test that demonstrates the ACID compliance problem in a way that convinces the vendor that the responsibility for the data integrity failure lies with them.
I don't think CI needs to have anything to do with it until the problem is solved. Either you have an automated test that reproduces the problem at least some of the time, or you don't. Until you do, it's much, much harder to zero in on the problem. Once the problem is solved, a guard against it recurring could perhaps become part of CI.
Reading between the lines, I'm assuming your approach to demonstrating the problem is manual, in a browser. Also that you have mainly a front-end skill set. Because the solution to this kind of issue is a shell script, not a CI runner.
If I was faced with this, I'd extract the HTTP requests from your current test into a shell script that sends those requests and asserts that the responses are as expected. Put a bit of rate limiting in to be friendly. Then loop it 10,000 times. And then just run that script manually, locally ideally, but from the server if you must. The stats will tell you more about where the problem lies: it sounds like you're expecting flakiness rather than 100% passing or failing. 100% passing means the problem is elsewhere. 100% failing means you've nailed the problem, or the test is broken.
I'll bet my bottom dollar that constructing this test will expose you to a solution within your control. Perhaps you might see that the shell script works 100% so there is a problem with the front-end code. Or that requests work when made and responded-to synchronously. Perhaps it uncovers an unreliable library. Whatever transpires, you can build the equivalent into your app. Or else it genuinely is only within the vendor's control, in which case most of the solutions your end are political.
posted by cogat at 3:50 PM on April 9, 2023 [1 favorite]
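A sketch of this looped probe, written in TypeScript/Deno rather than shell for consistency with the rest of the thread; the request, the "as expected" assertion, and the 500 ms delay are all placeholders to adapt:

```ts
// flake_probe.ts — send the same request repeatedly, count pass/fail,
// and sleep between attempts to stay friendly to the vendor.

const API = Deno.env.get("VENDOR_API_URL")!;
let pass = 0, fail = 0;

for (let i = 0; i < 10_000; i++) {
  try {
    const res = await fetch(API, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: "{ __typename }" }),
    });
    const body = await res.json();
    // "As expected" is whatever invariant you're probing; this placeholder
    // only checks for a well-formed GraphQL response.
    res.ok && body.data ? pass++ : fail++;
  } catch {
    fail++;
  }
  await new Promise((r) => setTimeout(r, 500)); // rate limiting
}
console.log(`pass=${pass} fail=${fail} (${((fail / 10_000) * 100).toFixed(2)}% flaky)`);
```

The pass/fail split at the end is the useful output: 100% passing points away from the API, 100% failing points at it (or at the probe), and anything in between is the flakiness worth reporting to the vendor.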
Response by poster: Quite the opposite: it is not a browser test, and I don't have a front-end skill set. But I agree with the rest.
posted by geoff. at 6:40 PM on April 9, 2023
Response by poster: The vendor fixed the issue on their end :)
posted by geoff. at 9:22 AM on April 10, 2023
Glad it worked out! I was going to suggest attempting to reproduce and giving them conclusive logs from your end. Sometimes you really have to do the work to get someone else to fix their bug.
posted by Horselover Fat at 3:11 PM on April 10, 2023
This thread is closed to new comments.