Startup Engineering

How to synchronise a distributed pubsub system

Episode Summary

Building a fully distributed system is really hard, but a few compromises can go a long way. Hear how Paddy, CTO and Co-Founder of Ably, would whiteboard out complex problems and, where needed, centralise small pieces in a single region. It might seem simple or small, but trade-offs like this can go a long way and still meet the requirement of a peer-to-peer region relationship where any region can fail and come back online at any time. What trade-offs are hidden in your architecture that would simplify your stack?

Episode Notes

Find out more about Ably and Paddy, and be sure to check out their GitHub profiles.

 

Episode Transcription

Rob: Welcome back to Startup Engineering, where we go behind the scenes with engineering teams at startups. We hear from the engineers, CTOs and founders on how they architected everything from simple solutions through to complex distributed architectures and everything in between. Experts share their experiences, lessons learned and best practices. In this episode we have Paddy Byers, the CTO of Ably. He has spent over 20 years in tech, was previously the CTO of Tao Group, which created a virtual OS and the first Java VM for mobile, is an active open source contributor to Node.js, and holds a degree in mathematics and a Ph.D. in theoretical computer science from the University of Cambridge. Paddy, can you start off by telling us about Ably, what problem it solves and for whom?

Paddy: Ably provides a pub/sub messaging service for messaging between publishers and subscribers. The kinds of applications that use Ably broadly fall into two categories. There are fan-out applications, and these are situations where somebody publishes a message and you want that message to reach, in real time, a very large number of subscribers. The kinds of applications those would be are anything involving live data: live sports scores, live odds, transport and logistics, train times, lots of applications like that. The second category are those applications where, if you like, it's the reverse situation. You have a very large number of channels, even though each channel might have a small number of subscribers. The kinds of applications that fall into that category are chat conversations, notifications, anything that has interactivity on an item-by-item basis. The problem we're solving for them is to provide that functionality with scale and reliability. The functionality we offer is actually, on the face of it, very simple. What is hard is doing that on a global scale, at a level of reliability, offering proximity and reach to all of those companies' subscribers. Customers use Ably when they want to focus their energy on creating their own business value, and they want service providers to provide the utilities on which they are going to build.
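
To make the two usage patterns concrete, here is a minimal sketch assuming the documented ably-js client API; the channel names and API key are placeholders. One channel fans out to many subscribers, while chat-style traffic spreads across many small channels.

```typescript
// Illustrative sketch only; channel names and the API key are placeholders.
import * as Ably from 'ably';

const realtime = new Ably.Realtime({ key: 'YOUR_ABLY_API_KEY' });

// Fan-out: one channel, one publisher, very many subscribers (live scores, odds, train times).
const scores = realtime.channels.get('match:12345:score');
scores.subscribe('update', (msg) => console.log('score update:', msg.data));

// The reverse shape: very many channels, each with a handful of participants (chat, notifications).
const chat = realtime.channels.get('chat:alice-bob');
chat.subscribe('message', (msg) => console.log('chat:', msg.data));
chat.publish('message', { text: 'Hi Bob!' });
```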

Rob: It sounds like you're looking at two very different use cases or markets for the product. And you described these as broadly being many different messages across a single channel, or many channels with a small number of messages. What are the big differences from a technology implementation point of view? And I guess what I'm saying is, why would you tackle both of them rather than just focus on one?

Paddy: Because the reality is that those are two ends of a spectrum, and a general solution is able to meet the needs of applications at both ends of the spectrum and also things in between. It's not uncommon, for example, for there to be both types of channels within a single app. We have customers where the majority of their traffic is a large number of channels with individual conversations, but they also have a requirement that certain channels exist where they can broadcast to everybody. So we serve a range of applications with a common solution.

Rob: So after identifying the commonality in the solution, how did you go about architecting and designing your service?

Paddy: We went into it a little bit naively, thinking this is what you do: if you're going to build a cloud service, then it has to be distributed and it has to be scalable. We started to sketch it out. We were writing some components, and every time we got to the point of, OK, so we understand how this piece is going to work now, how do we make it scale? If we wanted to replicate this, what would that look like? Or if this component is going to exist in multiple regions, then what would that look like? And we had lots of ideas that we got as far as drawing on a whiteboard, but never as far as turning into code, because every time we drew something on the whiteboard, we couldn't convince ourselves it would work.

Rob: That sounds like a massive timesaver. Have you got any examples of where you started using the whiteboard and then realized that the original approach you had in your mind was wrong? And you saved a lot of time that way.

Paddy: Take the case where you have participants on a channel in multiple regions: how is a message processed so that a message published in one region reaches everybody in every other region? If you look at that on its face, it's a problem where all of the regions could operate entirely peer-to-peer. If there's a message published in region A, it has to go to each of the other regions where there are subscribers for that channel, but it doesn't have to pay attention to any other message that might happen to be published in any other region. So everything could be entirely peer-to-peer, just using the Internet to route messages from one place to another. The problem with that is there is no central place that all of the messages for a single channel transit through. And what that means is there isn't a single ordering of all of the messages in certain channels. We would have a whiteboard with this diagram on it and say, OK, if that's the case, how are we going to persist these messages? If every region operates entirely independently, how are we going to know that once you have accepted a message for publication, you can actually provide the guarantee you're supposed to be providing, which is that from that point all onward processing is guaranteed? How do you make sure you have persisted that message in a way that's survivable, so that no matter what failures occur from that point forward, you can always perform the onward processing? We found it hard to stick with that picture of everything being fully peer-to-peer and still be able to convince ourselves that we had the right guarantees for survivability of the data. So what we imagined was a single place that all these messages need to go through, and that is the place where we can ensure that we have the messages persisted to a sufficient degree that we can survive errors that occur after that point.

Rob: Not having a global entry point and having clients connect to independent regions removes that sort of central point, and it means that the clients can subscribe to the closest region. So this design choice, though, was a pretty big one. Did you make this decision to give clients the lowest latency and highest reliability?

Paddy: Exactly, yeah. And one of the significant moments we had in the design was when we eventually came down and decided we have to abandon the idea that there's a global ordering. We have to abandon the idea that any of the regions have a master-slave relationship. We just have to bite the bullet and take on board the complexity of everything working peer-to-peer. So if a channel exists and you already have a publisher and a subscriber in two different regions, then at any time in the lifetime of that channel, a subscriber can pop up in a third region and the channel will be activated in that region. There has to be a well-defined understanding of exactly the position of the new subscription to the channel, and everything has to work peer-to-peer without there being any global coordination. It was a significant fork in the road when we made that decision and said, well, we have to work on that assumption, otherwise we just can't get the survivability and the latency performance that we've been hoping to achieve. At that point, we decided we have to take on board a lot of this additional complexity, but it really is the only way to solve the problem properly.

Rob: One of the things that strikes me about the implementation that you have is that because you have these different regions that operate independently, it means that one consumer could subscribe to a region, say us-west-1, and start receiving messages, while another subscriber could subscribe to a different region, say us-east-1, and receive a different set of messages, perhaps differing in order or starting point. Is that something that can happen with your implementation?

Paddy: Yeah, exactly. So that is the consequence of making everything peer-to-peer. It means different observers on the channel will see messages in a different order, because each message from a publisher, wherever they happen to be, goes directly to every subscriber. And if you have two publishers in two different regions that are both distant from you, then the messages from them could arrive at different observers in a different order.

Rob: Yeah. So this would also happen when I subscribe to a queue, because the most common use case I can think of is give me all the messages from now. But "now" could be different depending on the region.

Paddy: Some parts of that problem are relatively straightforward to answer and some parts of the puzzle became a lot harder. Let's say a channel is already active, and somebody in another region that was previously not active subscribes to that channel. You have to determine a moment in time at which that new region is deemed to have been subscribed to the channel. And what that means is that the time at which that channel becomes activated has to be resolved to a particular time in the time series, or the clock, for each of the other regions that are active. And that new region's subscribers are eligible to get all of the messages from that point in time from all of those other regions. Part of the challenge is figuring out, well, what is a collection of timestamps that correspond to the time that the new subscription is active, but that also make sense together? So if you do subscribe to all of these other regions from that set of timestamps, how do you know that the set of messages you get as a result is a coherent set of messages? That means you start to ask questions like, do I have to support a causal ordering? A causal ordering addresses the problem of, let's say, Alice sends a message which asks a question, and Bob, in reply to that question, sends a message back. Now, Bob's message was caused by Alice's message, and if you were to ever see those two messages out of sequence, it just wouldn't make sense. So if you saw Bob's reply and then later saw Alice's message, that wouldn't make sense. Or if you saw Alice's message and subsequent messages from Alice, but without having seen Bob's reply to her first question, that would also not make sense. What you have to do when you create the set of timestamps that you're subscribing to for each of the regions is ensure that the set of timestamps is in itself a coherent set, so that the set of messages you eventually receive makes sense as a collection.
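
One standard way to reason about the causal ordering Paddy describes is with vector clocks, one counter per region. This is not how Ably implements it; it is just an illustrative sketch of the property that Bob's reply must never be delivered before Alice's question.

```typescript
// Illustrative only: vector clocks as one classic model of causal ordering.
type VectorClock = Record<string, number>; // region -> counter

// true if `a` happened-before `b`: a <= b in every region, strictly less in at least one
function happenedBefore(a: VectorClock, b: VectorClock): boolean {
  const regions = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyLess = false;
  for (const r of regions) {
    const av = a[r] ?? 0;
    const bv = b[r] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

// A delivery order is causally valid if no message is delivered before one that caused it.
// E.g. Bob's reply carries a clock that happened-after Alice's question, so delivering
// the reply first would fail this check.
function causallyValid(delivered: { clock: VectorClock }[]): boolean {
  for (let i = 0; i < delivered.length; i++) {
    for (let j = i + 1; j < delivered.length; j++) {
      if (happenedBefore(delivered[j].clock, delivered[i].clock)) return false;
    }
  }
  return true;
}
```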

Rob: How's that possible in a fully distributed system?

Paddy: This, in fact, was one place where we said, OK, what we're going to do in this case is, at the point of attachment, there will be one specific region in the system for this channel which determines an appropriate position to attach at for each of the other regions. That was the only way we could figure out at the time to determine this array of attachment points and still give you a causally valid result. That was, if you like, a concession we made in our first design. We said, OK, we're not going to have a single region that controls everything. Everything's going to be peer-to-peer; any region could fail independently and come back at any given time. But for this one specific operation, we're going to have to rely on a single region at a given point being the source of truth, because otherwise you could get these causally inconsistent results.
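
A rough sketch of the concession Paddy describes, with hypothetical names: for an attach, one designated region takes a single snapshot of every region's stream position, so all attaching subscribers start from one mutually consistent set of positions rather than each region choosing its own "now".

```typescript
// Hypothetical sketch; the interface and names are invented for illustration.
type RegionId = string;
type AttachPositions = Record<RegionId, number>; // region -> position in that region's message stream

interface RegionDirectory {
  // current head position of this channel's stream in each active region
  headPositions(channel: string): Promise<AttachPositions>;
}

// Runs only in the single coordinating region chosen for this channel.
// Taking one snapshot here, rather than letting every region pick its own
// moment, is what keeps the resulting set of attach points causally coherent.
async function coordinateAttach(
  directory: RegionDirectory,
  channel: string
): Promise<AttachPositions> {
  const positions = await directory.headPositions(channel);
  return { ...positions }; // hand the same frozen snapshot to the attaching subscriber
}
```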

Rob: Multi-region is a solution to a problem, but often the problem is not so big that you need multi-region, especially when you already have multiple availability zones. And I've seen quite a few startups wanting to use this solution and then realize the problem is not actually significant enough to justify it. For your customers, how important is it to have a multi-region solution?

Paddy: It is important for two reasons. The first is simply performance and latency. The truth is, many applications involve subscribers that, most of the time, only subscribe in a single region. And even when it's multiple regions, they tend to be regionally local. We have many customers based in the US whose customers might subscribe to us-east or us-west, but they certainly are not subscribing from Japan or from Europe. What those customers want is to know that their messages go to their subscribers by the most direct route possible. And therefore, for latency and performance reasons, we want publishers and subscribers in the same region to be able to talk to each other without any other region becoming involved, so that their traffic spends the least amount of time going over the public Internet. The second need is for resilience, and it happens rarely. But it does happen that a region will become unavailable, either because there's widespread network disruption in the region or because there are problems in one or more data centers. And you can always attempt to architect things to get around that. So you have availability zones; you can minimize the likelihood that disruption in the region causes the whole region to be out. But at the end of the day, there will be situations where a whole region is unavailable.

Rob: I assume when you say that a region is becoming unavailable, you're not necessarily talking about the underlying cloud infrastructure. Is it some of the services that are deployed, or perhaps things not working as well as you would expect?

Paddy: Yes, and this is one of the practical difficulties with providing this kind of fault tolerance, because, simplistically, everyone imagines, let's say the region you're connected to goes down. Everyone naively assumes what that would mean is services in that region are no longer responsive. You can detect that failure. You can have Route 53 or some other health- or proximity-based DNS resolution, which means that people connect to another region. And how hard can it be? If the region is down, it's down; you just connect somewhere else. The reality is things never fail that way. What happens is things fail gradually: error rates will rise because there might be a problem with connectivity, there might be a problem with some instances, there might be a problem with EBS or some other service that your service depends on. And so what happens when there is disruption in the region is you get unpredictable performance, you get elevated error rates. It's very hard to make a clear-cut decision to say this region is no longer functional, or no longer functioning to the degree I need it to be in order that it can provide a service. Even just doing that, being able to respond well to these kinds of incidents, is itself a hard thing to do. It's another instance where customers rely on the service we provide, and they do that because we are able to provide continuity in the presence of these regional failures or regional outages.

Rob: So I think multi-region architectures are hard, and they have unintended consequences. One of the architectures I looked at was neat: if there was a problem in one region, they knew to fail over to another. The problem is, if the second region couldn't scale up fast enough, or there was some other type of underlying problem that was causing the failover, it would then fail over to a third one. At that stage, you have a cascading failure you just can't recover from. So how, in your system, do you avoid these sorts of cascading failures?

Paddy: What we have is a two-stage process for handling those kinds of things. First, we put responsibility on the client: if it's getting error responses from a given endpoint, then the client will go and use a different region instead. The first thing that happens is there would just be error responses, and until those error responses reach a certain threshold, we won't even know there is a problem. The Ably client library is configured with the primary endpoint, but also with fallback endpoints. So in the event of error responses, the client will go and use a fallback endpoint. Those fallback endpoints use a different CDN, a different top-level domain, a different DNS provider. We do everything we can think of to make sure there can't be a common cause of failure between the primary endpoint and some other fallback. And if, having failed at the primary endpoint, they then succeed with a fallback endpoint, they will continue to use that fallback for a period of time and then re-attempt the primary endpoint some time later, maybe 10 minutes later. So the first level of protection we have is that the clients themselves should go elsewhere. The next level is that if there is a significant level of errors in a region, that will trigger an incident, and we will look at that incident, and it will get escalated so that, if necessary, an engineer will get up in the middle of the night and respond to it. There are certain kinds of error conditions that we're happy to allow the system to recover from by itself: individual instance failures, even instance failures up to a certain fraction of the instances, we expect to be handled automatically. An incident will get raised, but automatic resolution will happen, and provided that resolution resolves the problem, the incident will be closed. There are always those occasions, though, where you do have to, as you say, just migrate traffic to another region. And for us, that is always a human decision. We never allow the system itself to decide unilaterally that this particular region is out of action, "I'm going to reroute the traffic". The client behaviour, that is, the clients' ability to go and use alternative endpoints, gives us a time window in which we can appraise the situation and make a decision: are we going to disable this region entirely? And we have crisis tooling to do that.
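
The client-side fallback behaviour Paddy outlines could look roughly like the sketch below. This is not the Ably client library's actual code; the error threshold and retry period are assumptions based on his description ("a certain threshold", "maybe 10 minutes later").

```typescript
// Not the actual Ably client library; an illustrative sketch of the behaviour described.
class EndpointSelector {
  private activeFallback: string | null = null;
  private fallbackUntil = 0;
  private consecutiveErrors = 0;

  constructor(
    private primary: string,
    private fallbacks: string[],            // hosts on a different CDN / TLD / DNS provider
    private errorThreshold = 3,             // assumed value, "a certain threshold"
    private fallbackPeriodMs = 10 * 60_000  // "maybe 10 minutes later"
  ) {}

  current(): string {
    if (this.activeFallback && Date.now() < this.fallbackUntil) {
      return this.activeFallback;           // keep using the fallback for a while
    }
    this.activeFallback = null;             // time to re-attempt the primary endpoint
    return this.primary;
  }

  reportError(): void {
    this.consecutiveErrors += 1;
    if (this.consecutiveErrors >= this.errorThreshold && this.fallbacks.length > 0) {
      // pick a fallback at random and stick with it for the fallback period
      this.activeFallback = this.fallbacks[Math.floor(Math.random() * this.fallbacks.length)];
      this.fallbackUntil = Date.now() + this.fallbackPeriodMs;
      this.consecutiveErrors = 0;
    }
  }

  reportSuccess(): void {
    this.consecutiveErrors = 0;
  }
}
```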

Rob: Let's go into this in more detail. Why have a human, and not a machine or an algorithm or a set of metrics, automatically do this? And how did you balance this against, say, waking someone up in the middle of the night, where that person needs a bit of time to get situated, understand the problem and then make the correct decision?

Paddy: The military spend their time figuring out how to fight the last war and not the next war, and it's the same problem we could have. We could design systems that react to errors and figure out the right time to transition, but all we'd be doing is figuring out a better response to the last incident we had, not correctly figuring out how to respond to the next incident. So that's the biggest concern: knowing that the machine will make the right decision. And there's always the risk, as you said, that if the machine is able to make a decision in one region, it can lead to a cascade. Would you have a situation where the machine is able to terminate one region, but not more than that? If the machine is going to terminate a region, how does it decide where the traffic gets redirected? The way we designed it, in the beginning, I would rather get woken up and have the opportunity to make that decision than allow the machine to do it, because I'm just not confident we know enough ahead of time what the right response might be. Now, in the fullness of time we might see enough of these incidents that we are confident to let it react autonomously in more situations. But the truth is, these are very rare. The number of times there has been a problem widespread enough in a region to cause us to redirect traffic away from it is maybe once every six months or something like that. It's not something where it's easy to prototype what your response will be, then test it, and then be confident it will work. For those very few incidents we would rather, for now at least, have a human in the loop.

Rob: And I assume, even though there's a human there doing the actions, they're not necessarily on their own. They've got all the past tribal knowledge and experience that your operations team has, and metrics, dashboards and scripts all ready to run, rather than having to figure out exactly how to do things each new time.

Paddy: One of the growing pains of a company is that you come at everything in the beginning with a theoretical understanding, and over time you learn the reality. And so over time we have dealt with quite a number of incidents of various sorts, and we do have a good instinct in the team for how to react to this. And through that, we've built a playbook.

Rob: Can you talk us through your playbook of what it is and what's inside of it?

Paddy: In analyzing our responses to earlier incidents, we looked at the root causes, and you go and fix those things, but you also look at: how did I respond? What could I have done to respond more quickly? For those serious incidents where we did end up having to redirect traffic, the biggest single factor was the time taken to make that human decision. It's a significant decision because it is likely to cause some disruption of some sort, so the quickest resolution to any problem will always be that it sorts itself out. In reviewing past incidents we realized that one of the factors that determined the time to recover was the human decision. Now our playbook says, at the onset of any disruption whatsoever, the very first thing you do is decide where you're going to redirect the traffic if you do redirect it. Then you start provisioning capacity in that destination region, and that happens straight away. Within a couple of minutes you will have the capacity there if you do make the decision to redirect traffic, and then we have crisis tools to enact those changes. In the case that we decide we're going to direct traffic away from us-west-1 and direct it to us-east, we have crisis tools to do that. It's a DNS change, together with some action to explicitly and progressively shed the existing traffic, and it happens in a way that means you can invoke it with a single command. If there is one of these very severe incidents, the playbook says: decide where the traffic is going to go, scale up that region, and when it's ready to accept the traffic, I type in one command and the traffic can go.
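
As an illustration of the kind of single-command crisis tool Paddy describes, here is a hypothetical sketch: a DNS change to stop new traffic arriving, followed by progressive shedding of existing connections. The dns and cluster interfaces are invented for the example, not Ably's actual tooling.

```typescript
// Hypothetical crisis-tool sketch; the interfaces below are invented for illustration.
interface Dns {
  pointRegion(from: string, to: string): Promise<void>; // e.g. weighted DNS change
}
interface Cluster {
  shedConnections(region: string, fraction: number): Promise<void>; // force clients to reconnect elsewhere
}

async function evacuateRegion(dns: Dns, cluster: Cluster, from: string, to: string): Promise<void> {
  // 1. Stop new traffic arriving in the failing region.
  await dns.pointRegion(from, to);
  // 2. Progressively shed existing connections so they reconnect elsewhere,
  //    rather than dropping everyone at once.
  for (const fraction of [0.1, 0.25, 0.5, 1.0]) {
    await cluster.shedConnections(from, fraction);
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // pause between waves
  }
}

// Usage, assuming capacity in the destination has already been provisioned:
// await evacuateRegion(dns, cluster, 'us-west-1', 'us-east-1');
```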

Rob: When you're responding to an incident, I always wonder, what is it like when you're going through it? Are you calm, able to take these logical steps to diagnose and figure out the problem? Or is it pretty full-on, with alarms going off, phones ringing, people not necessarily panicking but under a high amount of stress? How does this look for your team, and what's even correct?

Paddy: From experience you come to realize you can only react effectively if you're thinking clearly. And so, yes, there are alarms going off all over the place, and to a degree you have to mute them. You can keep an eye on alarms going past, but you can't have a lot of noise in the environment. So the response is: look at the metrics and do things that you've done before. We don't explicitly drill a lot of the responses, but we do know how to react from experience. It is calm, and because it's impossible to diagnose in real time, in a way that takes a certain amount of pressure off. You're not fretting, trying to figure out what's wrong. All you're doing is saying, OK, I know how to respond to this: I can get to a stable situation and then I can start investigating. When things are going wrong, you might normally have a very low level of errors and then suddenly have logs filling up with a million errors a minute, literally, and you can't possibly look at all of that. So all you have to do is have a way of classifying the information you're seeing, decide the best response, use responses that are practiced, and then give yourself a window in which you can investigate and figure out in more detail what's going on. I think we have got to a point where, and it isn't something that happens very often, but on those occasions when it does, it is a relatively calm affair.

Rob: I really like that guidance, by the way; I think it's amazing. So this has been super interesting, learning about how you deal with incidents. And we really started there because we were learning about how you build resilient architectures in the first place. Can we take a step back to that? Can you describe for us what your architecture looks like, and what tools and services you're using to build this resilient architecture?

Paddy: Yes, we are hosted one hundred percent on AWS. We have a main production cluster, which is in six AWS regions: two in the US, two in Europe, two in Southeast Asia. In each of those regions we are present in multiple data centers, that is, multiple availability zones. And that is the main production cluster. We also have a number of dedicated clusters. These are functionally identical, logically separate instances of the service which are running for specific customers. Each of those clusters, in each region, is effectively made up of two principal groups of instances. These are EC2 instances, in two principal scaling groups. One of the groups contains the endpoints that you reach after you've come through the load balancer, and that group will scale according to the load imposed by individual connections and individual requests. The second scaling group is channels. All of the interactions that come through the front end that are operations related to the same channel will go to a specific instance, which is the location of that channel in that region. These are auto-scaling groups: they scale autonomously based on load parameters that we monitor, and the system is elastic. They will scale up and down according to load, and it will literally scale by an order of magnitude every day. And that doesn't involve any manual intervention.

Rob: Are there any patterns in how it scales? For example, does it follow the sun, or working days or working hours? Or is it outside of what you can realistically predict?

Paddy: We have some applications which are very much tied to work hours or school hours, for example in a given US time zone. One customer in particular has a very large classroom application, and all of their connections will begin within the first few minutes of the workday and end at the end of the workday. In a small number of cases, we will explicitly schedule scaling ahead of time. If you want to auto-scale, and this is what we do almost exclusively, then there is only a certain rate at which you can scale. If your load goes above a threshold, that triggers a scaling event: we go and request new capacity from AWS in the region in question, and it takes a few minutes for that capacity to arrive. There is also a certain rate at which the cluster can accommodate new capacity. Now, if you need to scale more quickly than that, there are two ways to do it. One is you have a higher margin of capacity, which means your scaling thresholds are lower, so you get the signal to begin scaling at an earlier stage. And then in some cases, you literally have to schedule scaling ahead of time. We have some kinds of applications, like HQ Trivia-type applications, where you will get hundreds or thousands of connections within a few seconds of one another. For those, you simply cannot scale reactively. So we have scheduled events for some customers for specific occasions, but for the most part it's all just load-driven auto scaling.
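
The headroom trade-off Paddy mentions can be expressed very simply. In this sketch the numbers and names are invented: a larger headroom fraction lowers the threshold, so the scale-up signal fires earlier to compensate for the minutes it takes new capacity to arrive.

```typescript
// Illustrative only; capacity figures and headroom fraction are made up.
function shouldScaleUp(
  currentLoad: number,          // e.g. total connections or requests per second
  capacityPerInstance: number,  // load one instance can comfortably handle
  instances: number,
  headroomFraction = 0.3        // more headroom = lower threshold = earlier scaling signal
): boolean {
  const totalCapacity = instances * capacityPerInstance;
  const threshold = totalCapacity * (1 - headroomFraction);
  return currentLoad >= threshold;
}

// e.g. shouldScaleUp(7_500, 1_000, 10) === true, since 7_500 >= 7_000
```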

Rob: When I think about the data in your architecture, it seems like a really difficult problem to solve, at least in my head. The reason for this is when I think about all the requirements that you have around the distributed nature, especially the latency, because you're going multi-region, plus the reliability and the performance. So what do you use, and how do you think about how the data should be synced?

Paddy: We have two layers for persistence within the system. Just to explain the way the service works: if you're a publisher or subscriber, you will attach to a channel, and that activates the channel. A message will get published, and whoever is subscribing at that time will get sent the message. Everything is ephemeral. Channels will be active only for the period of time that there are either publishers or subscribers, and the messages transiting the system, for the most part, are also ephemeral. The publisher sends a message, it is delivered to a subscriber, and then it's gone. We do have functionality that allows a subscriber to recover messages that they missed if their connection drops. If you're attached to a channel, you're subscribing, you're receiving messages, and your connection drops for some reason, then so long as you reconnect within a certain period of time, which is two minutes by default, you get to recover the messages that occurred while you weren't there. But that two-minute window, for most messages, is the maximum amount of time the message needs to remain in the system. There is a persistence layer that deals with persisting channels for that two-minute window. There is also a feature you can enable on a channel which enables the data for that channel to be persisted. What that means is that at any later time you can come back to the history API and retrieve the messages from that channel, specifying some time bounds for example. For that longer-term persistence we have a different persistence layer. Dealing with the longer-term one first: that's a Cassandra database. It exists in three regions, and Cassandra was chosen because it's very good at handling very high write rates. We stream data into Cassandra for persisted channels, and Cassandra takes care of sharding that among its various instances and also replicating it locally within a region for fault tolerance, and across regions for fault tolerance. Cassandra is very good and takes away that problem of how to persist messages for long-term durability. For the short-term, ephemeral storage of messages, what we need to do is store them for two minutes and access them in various ways, but in particular, if a connection is resuming, you want to go back to a certain point in time, which is the point at which they were last connected, and retrieve messages from that point. That layer is a fully distributed, ephemeral store which we built when we first built the system. The primitive at each instance where things get persisted is Redis, but the whole thing taken together is a primitive datastore that really looks very much like Dynamo. This was one of the first significant components we built. We hadn't even read the Dynamo paper at the time, but when we came back and read it, we realized how similar the thing we had built was to that architecture. So it uses a gossip-based protocol for global cluster state discovery, it uses consistent hashing for placing individual elements within that global cluster, and it has protocols for understanding where data needs to go if roles move from one place to another. In the case that you're scaling the system and you double the number of core processes, the location of individual channels has to move from one instance to another, and that means the location of the persisted data for those channels has to move from one place to another.
So yeah, that is the biggest, or the single most complex, thing that we built as part of building the system. The thing that saved us in doing that is that, at the end of the day, the data is ephemeral. We do have to copy data from one place to another, but the amount you have to copy is limited because you're only having to keep track of this two minutes' worth of data. It means you have a sustainable, tractable problem in terms of the amount of data that you need to keep track of, as well as move around.
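
The consistent hashing Paddy mentions is a textbook technique, sketched below for channel placement. This is not Ably's implementation: channels and nodes are hashed onto a ring, each channel lives on the next node clockwise from its hash point, and adding or removing a node only moves a small fraction of channels, which keeps the two minutes of ephemeral data that must be copied tractable.

```typescript
// Textbook consistent hash ring, sketched for channel placement; not Ably's code.
import { createHash } from 'crypto';

class HashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], virtualNodes = 64) {
    // Virtual nodes smooth out the distribution of channels across instances.
    for (const node of nodes) {
      for (let i = 0; i < virtualNodes; i++) {
        this.ring.push({ point: this.hash(`${node}#${i}`), node });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  private hash(key: string): number {
    // First 32 bits of an MD5 digest, interpreted as an unsigned integer.
    return parseInt(createHash('md5').update(key).digest('hex').slice(0, 8), 16);
  }

  // The channel lives on the first node clockwise from its hash point;
  // adding or removing a node only relocates the channels nearest to it.
  nodeFor(channel: string): string {
    const h = this.hash(channel);
    for (const entry of this.ring) {
      if (entry.point >= h) return entry.node;
    }
    return this.ring[0].node; // wrap around the ring
  }
}

// const ring = new HashRing(['core-1', 'core-2', 'core-3']);
// ring.nodeFor('chat:alice-bob'); // -> the core instance that owns this channel
```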

Rob: Building a fully distributed system is hard, it's really hard, but a few compromises can go a long way. I like how Paddy would whiteboard out complex problems and, where needed, centralize small pieces into a single region. The example he used was attaching a subscriber at the correct place in a queue, and having this handled in one location made it significantly easier to build. It might seem small, but trade-offs like this can go a long way, and the design still meets the key requirement of having a peer-to-peer region relationship where any region can fail and come back online at any time. It's worth looking at your designs to find the most difficult parts and see if there are similar trade-offs available to you. Responding to incidents automatically, with tooling and with confidence, takes a lot of effort to build. Putting a person at the center of the decision-making process really helped speed up creating a playbook and body of knowledge, and is a great starting point. Over time, some of the fixes will be root-cause fixes, or you'll be creating tooling that can reduce the burden on the person that's responding to the incident. Let's get back to Paddy to hear about what he's learned while building Ably and some of his best practices. When you spoke about whiteboarding out problems, how many iterations did you have, or how long did you spend doing this? And is there a point where doing more doesn't help and you really just need to get started?

Paddy: We would meet typically once a week and spend a day with the whiteboard, and I think that process went on for several months. It was a lot of sessions: we'd spend some time at the whiteboard, go away and write some code, and come back and discuss it again. I think there wasn't really a point where we said maybe this is bigger than we are able to do. It's one of those things where the more you do, the more remains to be done, because as you take a step forward, you can see the stuff that follows in much greater detail. And so even though it turned out there was a monster amount of work to get the whole thing over the line, we didn't know that; there wasn't a stage where we properly saw the magnitude of what we were trying to do.

Rob: This is a huge amount of time spent in front of a whiteboard. On reflection, does it feel like it was time well spent?

Paddy: One of my mottos is that often, when people construct a short-term solution to a problem, my reaction would be: how come we don't have time to do it properly, but we do have time to do it twice? That is what I wanted to avoid. I thought, if we're going to build this thing, then I'm not going to do it twice; we're going to do it properly. And I think that mentality is a double-edged sword. You can find that it gives you the discipline that you need to figure things out properly, but sometimes it means you're trying to address things because you think they're problems when in fact they're not. It is something you have to watch very carefully. But that really was the philosophy; that's how we approached it. It was just one of those things where you only learn about the magnitude of what you have to do as you go along.

Rob: Whiteboarding is great, and it really helps you figure out some of the future problems that you might have before you put considerable effort into coding. But sometimes you just want to get started. So what kind of guidance do you have on when you should really be thinking about planning it and scoping it out on a whiteboard, versus putting down the pen, opening up the laptop and starting to build?

Paddy: My experience is that you only really understand the problem you're dealing with in detail when you start writing code, because for some reason your brain looks at detail in a different way. Whiteboarding is great when you're confident the details are things you can deal with. Writing code means you have no choice: you have to deal with everything, you can't defer anything. And the advice I would give is: as soon as you think you know how it's going to work, that's when you have to write code, because it's almost certainly the case that you're wrong. There are lots of things you actually haven't figured out yet, and you're never going to surface those questions until you start writing code.

Rob: A big thanks to Paddy and everyone at Ably for sharing how they think about and architect distributed systems, and for a look into their thought process for design and dealing with failures. If you're excited about building the next big thing, or you want to learn from experts that have been there and done that, subscribe to Startup Engineering wherever you get your podcasts. And until next time, keep on building.