How do dynamic pages for things like Facebook/Twitter work?
January 7, 2020 4:07 PM

Assume you were building an app that created a unique view of data depending on the visitor, but one that stayed consistent: on refresh, as long as I was a visitor identified as "male over 30" I'd see the same data, though that variable could change. Facebook and Twitter don't seem to treat their data that way; the page changes on every reload. I'm looking for a way to architect something similar that doesn't change unless the visitor themselves changes. More explanation inside.

I'm not asking on StackExchange as it would get marked as too broad, but the answers here have actually been really helpful in getting me on the right path. Facebook, Google (a search, say), and Netflix all provide content that depends on the visitor context, which means they aren't using a CDN to cache the entire page as static HTML. I'm not looking for a universal solution. With Facebook, I logged in, did nothing, and with the cache disabled I saw 98.6 MB transferred without scrolling and an amazing 2,103 requests after 20 minutes. A Google search, which I assume is dynamic even for the same search word, returns 1.3 MB. I'm guessing Facebook's request count and size are due to lazily loading the infinite scroll after the initial load plus chat polling requests; I'm ignoring that for a second since I don't have chat or anything.

To keep this less broad I'm going to use Gap as an example. Let's say for now it is 1995 and I have a static site, but now I have the ability, based on user navigation in the site, to say "hey, this guy is a male over 30, let's show him this block of HTML every time until we determine for some reason he's not a male over 30." Facebook's news feed somewhat does this, but it refreshes every time so it isn't consistent in delivering content, and Twitter is the same. I'm assuming Google, if I searched for "elephant", would also be dynamic.

What I'm looking for is something simple like this. Coming from a monolithic architecture I couldn't wrap my head around how service discovery worked until I realized that the application needs to be aware that it is running in a container orchestration environment, and that something (in this case Ambassador/Envoy proxy) does the auth and has predefined routes for service discovery. The microservices are truly loosely coupled, but Ambassador ends up being the orchestration layer that defines what needs to be done.

Are there any other examples of building out a web page based on an algorithm that doesn't rely on a single server building everything out? I don't care about the algorithm itself; I'm looking for something that explains how this happens in an ideal world, ignoring that Facebook/Twitter/Netflix have highly complex domain-specific problems. Basically:

1. I'm your entry point server. I have 100 highly compressed, predefined HTML templates that everything will be filled into (think news feed items or blocks). I'll send you these first.
2. Somehow the blocks need to be put in a certain order based on the visitor, and this order needs to be repeatable: it doesn't change if the visitor demographic doesn't change (a rough sketch of this follows the list).
3. Needs to handle large load.
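To make point 2 concrete, here's a rough sketch of what I mean by a repeatable order. Everything here (the block shape, the weighting scheme, the segment names) is made up purely for illustration; the point is that the order is derived only from the visitor's segment, so the same segment always gets the same page.

```typescript
// Rough sketch only: block shapes and the weighting scheme are invented.
// The order depends purely on the visitor's segment, so the same segment
// always produces the same page and the result stays cacheable per segment.
interface Block {
  id: string;                              // which predefined HTML template this is
  weightBySegment: Record<string, number>; // e.g. { "male-over-30": 10, "default": 1 }
}

function orderBlocksForSegment(blocks: Block[], segment: string): Block[] {
  return [...blocks].sort((a, b) => {
    const wa = a.weightBySegment[segment] ?? a.weightBySegment["default"] ?? 0;
    const wb = b.weightBySegment[segment] ?? b.weightBySegment["default"] ?? 0;
    // Tie-break on id so the order is stable across requests and servers.
    return wb - wa || a.id.localeCompare(b.id);
  });
}

const allBlocks: Block[] = [
  { id: "mens-jackets", weightBySegment: { "male-over-30": 10, default: 1 } },
  { id: "new-arrivals", weightBySegment: { default: 5 } },
  { id: "sale-banner",  weightBySegment: { default: 3 } },
];

// Same visitor segment in, same block order out, every time.
console.log(orderBlocksForSegment(allBlocks, "male-over-30").map((b) => b.id));
```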

This feels like a solved problem: some combination of client-side rendering via JS templates and data requests. Basically, if you were to build a modern CMS that allows content to change and didn't rely on hitting one endpoint, are there any simple or non-simple examples out there of how to do this with a modern, loosely coupled microservice architecture? All the CMSes I'm aware of run into a bottleneck if they personalize data somewhere in the pipeline. Ignore the CMS aspect of this (that someone can change the layout of a page on the backend without a build from source). What has been bugging me is that I haven't been able to figure out a way to do it, assuming it were built from scratch. It feels like someone has solved this problem, but all open source CMSes handle it in a naive way (every request hits the server, requiring sticky sessions, etc.) and the answer is just more and more servers handling all the logic.
posted by geoff. to Computers & Internet (12 answers total)
 
I think you might be overthinking this. It sounds like you want an HTML page which makes client-side JS calls to retrieve data from a database on the server. The data in the database can remain consistent for each user.
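Something like this, as a very rough sketch (the /api/feed endpoint and its response shape are invented for illustration):

```typescript
// Very rough sketch: a static HTML shell plus a client-side call for per-user
// data. The /api/feed endpoint and FeedItem shape are hypothetical.
interface FeedItem {
  title: string;
  body: string;
}

async function loadFeed(): Promise<void> {
  // The session cookie identifies the visitor; the server returns the same data
  // for that visitor until their profile/segment changes.
  const res = await fetch("/api/feed", { credentials: "include" });
  if (!res.ok) throw new Error(`feed request failed: ${res.status}`);

  const items: FeedItem[] = await res.json();
  const container = document.getElementById("feed"); // assumes a <div id="feed"> in the shell
  if (container) {
    container.innerHTML = items
      .map((i) => `<article><h2>${i.title}</h2><p>${i.body}</p></article>`)
      .join("");
  }
}

loadFeed();
```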

Does that not meet your needs in some way?
posted by mekily at 4:38 PM on January 7, 2020 [1 favorite]


You can also generate the HTML server-side using PHP, but that’s pretty old-school at this point.
posted by mekily at 5:10 PM on January 7, 2020


I'm not entirely clear what problems you're trying to solve. In general these days, the assumed better solution (unless you're operating at massive scale) is just to build dynamic content for each user because people expect personalized stuff and CPUs are relatively cheap compared to nightmares of complexity when highly complicated systems go belly up.

That said, what I think would work for what I think you're describing is to have a relatively lightweight page that handles determining what the user's demographics are and then uses JavaScript to load content from a caching server like Varnish.

So you land on the lightweight page, it determines that you're a male between 30 and 40, and asks Varnish for the page for males between 30 and 40 that's been built within an hour (or whatever). If that cached content already exists, it's served up by Varnish; if it doesn't, Varnish forwards the request onto the app server that builds that content, and stores the result into the cache on the way back out.
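As a sketch of that flow (segment names, URLs, and cache times are purely illustrative): the key trick is that the URL the browser fetches encodes only the segment, never the user, so Varnish can cache one body per segment.

```typescript
// Sketch of the pattern described above; segment names and endpoints are made up.
// The lightweight page works out the visitor's segment, then fetches a URL that
// contains only the segment, so the HTTP cache can serve one body to everyone
// who matches.
function visitorSegment(): string {
  // However you determine this: a cookie, a profile API call, etc.
  return "male-30-40";
}

async function loadSegmentContent(): Promise<void> {
  const segment = visitorSegment();
  // The cache key is effectively the URL; the app server behind Varnish would set
  // something like Cache-Control: s-maxage=3600 so each segment is rebuilt hourly.
  const res = await fetch(`/content/segment/${segment}.html`);
  const container = document.getElementById("main"); // assumes a <div id="main"> shell
  if (container) container.innerHTML = await res.text();
}

loadSegmentContent();
```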

If you mean that each user has temporarily static content, you can do that with caching servers too, though if you have many users, it will get expensive as far as memory use.
posted by Candleman at 8:04 PM on January 7, 2020 [2 favorites]


I’m also not sure exactly what the question is. I’ve been building websites with user-specific data for, well, eons. You log in, you get an experience with data that’s unique to you. There are several ways to do this, but it generally involves caching the parts that are not user-specific (templates, assets, etc.) in any multitude of ways, then serving (via an API or other server-side application) the user-specific parts from the application server, and combining the results.

If the user is getting one of a finite number of sets of data (say, data for the over-40 crowd living in Austin), the API generally just tells you what that set is and then the app retrieves the specific set (and that data itself is probably cached somewhere). If the data is very specific to a user (let’s say their recently viewed Netflix titles), then the API could return the IDs of the titles and the app retrieves each of those titles individually (again, highly cacheable data: Game of Thrones data is the same for everyone it’s served to). Then your modern client-side apps (in JavaScript in modern web browsers) are built to request and handle the various combinations of data possible from the API requests. Even the page layout can be done this way: what blocks are displayed and in what order could be an API call, and the app knows how to handle the result.
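As a rough sketch of that flow (the endpoints and response shapes are invented): one small per-user call says which blocks to show and in what order, and each block's payload is generic and cacheable by id.

```typescript
// Rough sketch of the flow above; /api/layout and /api/block are hypothetical.
// One small per-user call returns *which* blocks to render and in what order;
// each block's data is keyed by a shared id, so it's highly cacheable.
interface LayoutResponse {
  blocks: { blockId: string; dataId: string }[]; // e.g. { blockId: "recently-viewed", dataId: "title-1234" }
}

async function renderPage(): Promise<string> {
  const layoutRes = await fetch("/api/layout", { credentials: "include" });
  const layout: LayoutResponse = await layoutRes.json();

  // "Game of Thrones" data is the same for everyone, so these per-id requests
  // can be served straight from a cache or CDN.
  const fragments = await Promise.all(
    layout.blocks.map(async (b) => {
      const data = await (await fetch(`/api/block/${b.blockId}/${b.dataId}`)).json();
      return `<section data-block="${b.blockId}">${data.html}</section>`;
    })
  );
  return fragments.join("\n");
}
```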

For a site of any size at all, these APIs are not single-server: there are most likely multiple servers running the same code, plus additional servers (load balancers of some sort) sitting on top of them acting as traffic cops to spread the incoming requests between all the other servers. I’m simplifying it, but if these servers are the bottleneck, add more servers. Generally you scale out (more servers), not up (more hardware per machine), though obviously there are use cases for both. But that really is the answer. AWS is massive for a reason.

I had a much longer answer typed up, but I’m really not sure what part you’re trying to understand better. If I had to summarize, though, it’s really “how do I build a dynamic web app at scale,” and from that Google rabbit hole you could probably have more answers and reading than you want :)
posted by cgg at 9:00 PM on January 7, 2020 [4 favorites]


Response by poster: First, thanks for validating my concept of a microservice. I was given a "microservice" that was highly coupled with other "microservices", which in turn were coupled with other microservices. I would not consider that a microservice, just moving functionality out of process. I don't even know if there's a name for that sort of architecture.

In any case, I started using modern devops practices where I could, and it solved so many problems that I began to slip into "maybe devops can solve this!" thinking. Let me summarize my problem set better:

1. I've been working with CMS/e-commerce frameworks, and they're fairly well defined in their functionality and technically simple. You more or less have to fit the Gartner "magic quadrant" checklist of functionality to be competitive. Companies like the Gap (again, I've never worked with them, so I can't say; I'm just using them as an example) generally choose a CMS/e-commerce framework from a large vendor like Adobe or Microsoft, then choose an agency that implements the particular design and functionality on top of the CMS framework.
2. At its core the CMS serves up chunks of HTML, really just strings, and builds out a page. These are chosen based on rules, the simple ones being obvious, like a route. But really what it comes down to is that a business user comes in and sets up pages, at a basic level by assembling blocks of preset HTML templates.
3. A business user can build out logic on these blocks: "if this then show that, if this then show that." Not truly an Excel-type language, but a simple rules engine. A developer somewhere defines the conditional logic, e.g. "if (visitor is over 18) then show this HTML block." At a certain point, no matter what the architecture or design or how efficient you are, if you give a business user the ability to create giant logic statements without knowing the performance implications, even the most efficient code will eventually cause performance concerns, say 100 if-statements on each HTML block with 12 HTML blocks on the page. But that by itself is rarely the problem.
4. A lot of businesses still want to have 100% control over their data, or the belief that they do, so analytics on everything is captured in house, where I would argue it would be better handled by a service like Treasure Data (or really anything other than trying to capture analytics in house), but that's out of scope for this question. Similarly, e-commerce has scaling issues that are a solved problem, as in the example I linked to.

So really, I was overthinking this a bit. My only complication, and I think I've narrowed it down, is that the performance issues come from the fact that all the CMSes I see at the top of the Gartner Magic Quadrant handle nearly all of this logic on the delivery server.

As a sort of POC for myself I wanted to see what this would look like, since most if not all open source CMSes lack what big enterprises want (besides support); they correctly put a lot of the weight of creating the site on the developer. Whether or not that actually happens is almost irrelevant: if I were to demo a CMS that made it look like you could create a website by magic without needing an expensive developer, you'd go for it if you were a CMO.

But thanks. Working on closed source projects, you begin to think about solving things in an inverted manner, or in crazy ways around the fact that you can't change how the core application works but still get the blame for it being slow.

In any case, I think the rules engine, much like an Excel cell that contains a ton of logic, is the real issue, and abstracting each rule into its own microservice that scales horizontally would be the answer. It would be hard, and probably not worth it, to "compile" business-user-generated if/then statements into a sort of dynamic microservice. Instead I think it would be sufficient to make the condition itself a microservice (determining whether a visitor is a certain age), since I've never seen a CMS implement something complex like regex or anything more than boolean logic or >=-type comparisons. The problem really comes when there are a lot of them on the page.
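As a sketch of what I mean by making the condition itself a microservice (the endpoint, payload, and threshold here are all invented): one tiny, stateless HTTP service that answers a single boolean question, so it can be scaled out horizontally and its answers cached.

```typescript
// Sketch only: one stateless service per condition, answering one boolean
// question about a visitor. A real service would add validation and auth.
import { createServer } from "node:http";

const MIN_AGE = 18; // the one rule this service knows about

const server = createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    // Expects e.g. {"age": 34} in the request body.
    const visitor = JSON.parse(body || "{}");
    const result = { match: typeof visitor.age === "number" && visitor.age >= MIN_AGE };
    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify(result));
  });
});

server.listen(8080); // the CMS rule engine calls this instead of evaluating the rule itself
```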

Again, it is a unique situation. I don't know of any other application that lets non-technical users write logic that goes out into a high-traffic environment, but I could be wrong. At least with Excel, if you write an inefficient statement it crashes on your own machine. Writing a "bad statement" that works perfectly fine when you're viewing it in isolation is a different thing.
posted by geoff. at 12:12 PM on January 8, 2020


Microservice has a relatively set definition, in that you provide data via a structure that is set and relatively open. From the perspective you are talking about, the other way to define what is returned by a website is via server-side CGI: the old way, with each of the variables defined and visible, separated by '&', as in this example straight from The Gap's website.

https://www.gap.com/browse/product.do?pid=529344012&rrec=true&mlink=5001,1,HP_gaphome1_rr_1&clink=1

where the following variables are set and returned by a database or other data structure.
pid=529344012
rrec=true
mlink=5001,1,HP_gaphome1_rr_1
clink=1

This data is not open, it's not defined for anyone else to use, it doesn't return common data structures for The Gap's partners, hence it is not a microservice.

A common mantra is that the interface must be defined and that the client of the data should not be guessing what the data is, which is why few CMSes use complex logic like regex to determine how data should be structured. But they may use regex and nibble swapping and all sorts of proprietary logic within the space of a single well-defined data element. A good example is the mlink above: those commas represent different variables within a single data structure.
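Purely to illustrate that point, parsing the URL with the standard URL API makes the individual variables, and the packed mlink field, visible:

```typescript
// Illustration only: the query string carries the variables, and mlink packs
// several sub-values into one comma-separated field.
const url = new URL(
  "https://www.gap.com/browse/product.do?pid=529344012&rrec=true&mlink=5001,1,HP_gaphome1_rr_1&clink=1"
);

const pid = url.searchParams.get("pid");   // "529344012"
const rrec = url.searchParams.get("rrec"); // "true"
const mlinkParts = (url.searchParams.get("mlink") ?? "").split(","); // ["5001", "1", "HP_gaphome1_rr_1"]

console.log(pid, rrec, mlinkParts);
```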


Websites do have rules engines for marketing, which is why they don't generally require separate designs for 31-year-olds vs. 30-year-olds for marketing purposes.
posted by The_Vegetables at 12:58 PM on January 8, 2020


Response by poster: Oh, I didn't even go to the Gap site; I was just assuming they were a high-traffic site without an unlimited budget a la Netflix, so that was the example. Their SEO is pretty bad too, but again I don't know if or how much that really matters when you're the Gap, or if Google just takes that into account at this point and doesn't really need clean URLs. I do get what you're saying about microservices.

In my mind a microservice may be dependent on a database service, but everything is at L7 at that point, so if it is properly written the db connection itself is not a concern of the service; it is just being provided JSON or some data structure like you talk about.
posted by geoff. at 1:52 PM on January 8, 2020


This is a vague enough question that there are a myriad of answers.

I think, generally, few companies are caching complete html segments. They're using templating and then either populating those templates server-side via high performance engines or simply feeding that data in JSON format to JavaScript frameworks.

As far as coming up with consistent data to feed these templates, it's a matter of a larger pipeline that queries relevant data to come up with results consistent with rules. We're talking about everything from sharded databases, to caching layers on top of databases, to auto-seeded caches, to fully queryable and replayable streaming data options like Apache Kafka (my current proof-of-concept field) that might in turn feed a consumer cache. Companies have done all of these things, and continue to do so.
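As a rough sketch of the "stream feeding a consumer cache" idea, using kafkajs as an example client (the broker address, topic name, and the in-memory "cache" are all made up; in practice the cache would be something like Redis or Varnish):

```typescript
// Sketch only: a Kafka consumer keeps a cache of rendered fragments warm, so
// page requests never rebuild them on the fly. Names are illustrative.
import { Kafka } from "kafkajs";

const cache = new Map<string, string>(); // stand-in for a real cache layer

async function run(): Promise<void> {
  const kafka = new Kafka({ clientId: "feed-builder", brokers: ["localhost:9092"] });
  const consumer = kafka.consumer({ groupId: "feed-builder-group" });

  await consumer.connect();
  await consumer.subscribe({ topic: "content-updates", fromBeginning: false });

  // Each content-change event refreshes the cached fragment for that key.
  await consumer.run({
    eachMessage: async ({ message }) => {
      const key = message.key?.toString() ?? "unknown";
      cache.set(key, message.value?.toString() ?? "");
    },
  });
}

run().catch(console.error);
```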

I think introducing Kubernetes and API gateways like Ambassador, or related products that help implement a service mesh, is complicating your imagined solution. Those things might be useful at scale, but it's easy to conflate the deployment infrastructure (what proxies, gateways, deployment strategies, etc. to use) with the application architecture (how granular services are, how they're implemented, whether they count as "microservices").

There are a lot of companies with decent technical blogs out there that explain how they've tackled software engineering problems. LinkedIn has a lot of good examples, and you'll also find other companies have the same. Spotify comes to mind, as I've run into their blog when researching messaging options. It's also worth checking to see what companies have open sourced -- most of them have at least a few components out there, and that can give you a glimpse into their implementation.
posted by mikeh at 2:12 PM on January 8, 2020


Oh, the other thing that comes to mind:

Most organizations delivering large amounts of customized data do not have a single data store. This is a modern headache, and I've attended or listened to a number of talks on how to keep integrity. Source systems don't just write to a database -- they might dump metadata into something like elasticsearch for search results, a record store (database or otherwise) for the full record, and maybe even data storage that does algorithmic analysis via MapReduce in order to feed back into other systems. Once you have a larger system infrastructure, the problem is no longer software -- it's data governance, data schema alignment, and making sure there's a shared understanding of what systems are upstream/downstream when it comes to specific pieces of data.
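A sketch of that fan-out, with all three "stores" reduced to stub interfaces (in a real system they would be a database driver, an Elasticsearch client, a message producer, and so on; everything named here is illustrative):

```typescript
// Sketch only: one write path fanning a record out to several stores. Keeping
// these consistent is the data-governance problem, not a coding problem.
interface Product {
  id: string;
  title: string;
  price: number;
}

interface RecordStore { save(p: Product): Promise<void>; }
interface SearchIndex { index(id: string, doc: object): Promise<void>; }
interface EventStream { publish(topic: string, event: object): Promise<void>; }

async function writeProduct(
  p: Product,
  db: RecordStore,
  search: SearchIndex,
  events: EventStream
): Promise<void> {
  await db.save(p);                                       // system of record first
  await search.index(p.id, { title: p.title });           // metadata for search results
  await events.publish("product-changed", { id: p.id });  // downstream consumers (analytics, caches)
}
```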
posted by mikeh at 2:17 PM on January 8, 2020


The really short version is that 90% of what you're talking about isn't writing *specific* frontend rules, it's about allowing for generalized rules. The entire "show this message to 30 year olds" thing isn't implemented as a one-off, nor is it onerous for developers. That's a metadata problem. Just give the people who put products in the ability to define filters based on a key, type, and a value. So create an "age range" rule (key), based on an integer range (type), with a value of 30-35 or whatever. Then when your service filters products, it applies all the rules. It could be as simple on a small site as auto-generating SQL.
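Something like this, purely as a sketch (the rule shape and attribute names are invented): the rule is metadata an editor attaches to content, and one generic evaluator, or a generated SQL WHERE clause, applies every rule.

```typescript
// Sketch of the key/type/value idea: "show this to 30-35 year olds" is metadata,
// and a single generic matcher applies all the rules. Shapes are illustrative.
interface Visitor {
  attributes: Record<string, number | string>; // e.g. { age: 32, gender: "male" }
}

type Rule =
  | { key: string; type: "int_range"; min: number; max: number }
  | { key: string; type: "equals"; value: string };

function matches(visitor: Visitor, rules: Rule[]): boolean {
  return rules.every((r) => {
    const v = visitor.attributes[r.key];
    if (r.type === "int_range") return typeof v === "number" && v >= r.min && v <= r.max;
    return v === r.value;
  });
}

// Editor-defined metadata on a content block, not developer-written code:
const blockRules: Rule[] = [
  { key: "age", type: "int_range", min: 30, max: 35 },
  { key: "gender", type: "equals", value: "male" },
];

console.log(matches({ attributes: { age: 32, gender: "male" } }, blockRules)); // true
// On a small site the same rules could instead be compiled into a SQL WHERE clause.
```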

I believe this was what The_Vegetables was getting at -- products do have rules engines, but I think you're implying a lot of this is being put onto developers instead of the people entering metadata, which is wrong. They should be able to add content with good metadata, and define pages based off that metadata. If they're not, then it's back to them.

Then you're back to the performance issue, and you can more easily tackle that by going back to caching, infrastructure, etc. 90% of software engineering is discovering what you actually want to do, not coming up with an infrastructure to tackle every problem broadly. Then, when you find another issue (speed, efficiency, etc) you tackle that as a new issue.
posted by mikeh at 2:25 PM on January 8, 2020


Response by poster: So create an "age range" rule (key), based on an integer range (type), with the (value) of 30 -35 or whatever. Then when your service filters products, it applies all the rules. It could be as simple on a small site as auto-generating SQL.

Right, and I think I misunderstood this a bit, as currently that's how the system (which, again, I didn't develop) implements it, so I kind of had blinders on. Right now what I'm working with says "I have no knowledge of the content metadata," when it should be the reverse: the content/products/whatever should be created and tagged appropriately.

Of course, that means you have to know the conversion rate of the product up front, and if you're analyzing what people who really are male, over 30, and making between $50k and $70k actually buy, you can't tag the product or piece of content in advance. Or you can, I guess, if you have an algorithm or machine learning figure out the non-intuitive things and tag the content once a day or whatever on a cron job. In any case, again, the system I'm currently working with doesn't have a way of tagging items, so it is doing the inverse.

Most organizations delivering large amounts of customized data do not have a single data store. This is a modern headache, and I've attended or listened to a number of talks on how to keep integrity. Source systems don't just write to a database -- they might dump metadata into something like elasticsearch for search results, a record store (database or otherwise) for the full record, and maybe even data storage that does algorithmic analysis via MapReduce in order to feed back into other systems. Once you have a larger system infrastructure, the problem is no longer software -- it's data governance, data schema alignment, and making sure there's a shared understanding of what systems are upstream/downstream when it comes to specific pieces of data.

This is another problem I'm dealing with. Currently the system I work with has basically one SQL database as the source of record, and to solve this problem it keeps adding caching layers to deal with various issues, which brings on other problems. You can use an ESB/event sourcing, but at some point there's going to be a mismatch. I can cache a product image or description, but if the price isn't real time, there are legal issues and it's obviously bad business practice to charge something based on an outdated cache. I'm going to go out on a limb and say that Amazon has weighed caching against the loss taken when a price changes and has some sort of policy in place to take the "loss" because it's cheaper than scaling out, but that's just a wild guess. I do know there's some sort of ESB in order processing, as I've had a valid credit card process and say "Order Complete" only to get an email later saying there was an error processing my credit card.

Again, I felt a bunch of "code smells" in the system I'm working with, and I kept thinking I was solving problems that shouldn't be problems. This is exactly what I needed to hear.
posted by geoff. at 4:03 PM on January 8, 2020


Response by poster: Ah yes, the content is being tagged. When someone goes in and adds a bunch of marketing rules, it is essentially tagging the content, only tagging it virtually at run time. This is kind of a cheap way of going about it: enter content/products quickly and then later try to figure out which visitors they fit. It also demos really well; it just isn't at all performant. Once you take expensive lookups out of the equation, this really is very simple. But going back to the multiple data source problem: if you're doing a demo in front of a client and you say "tag this as male over 30 who likes the color blue" and then go "update the Solr indexes, wait for the Redis cache to populate, etc.," it looks very unintuitive. But that's a non-technical problem. You could easily have a "no cache" environment so they can see the updates quickly (assuming there's no load), and then set expectations that a geo-replicated data set is innately going to have latency in updating across data stores. Wow, I was in the wrong mindset. Well, I hope that explains why I was having a hard time even asking the question.
posted by geoff. at 4:16 PM on January 8, 2020

