Startup Engineering

Zego - Retrofitting an insurance accounting ledger

Episode Summary

Startups need to move quickly, but doing this in a regulated industry is difficult. Zego is a London based insurance technology startup that provides short term insurance for the gig economy workers. Stuart, a co-founder and staff engineer, explains how Zego balances regulatory requirements and fast iteration.

Episode Notes

Stuart has been a coder since he was 13 and professionally for the last 15 years. He has worked in various London startups since 2011. In 2016 Stuart eventually bit the bullet and co-founded Zego, a micro-insurance startup for the gig economy.

Zego is a global insurtech business providing flexible commercial insurance for businesses and professionals.

Resources:

Zego Blog
AWS Activate founder tier providers $1,000 in AWS Credits,access to experts and the resources needed to build, test & deploy. Activate your startup today.
How to break a monolith application into microservices using ECS and Docker.

Episode Transcription

Rob: Welcome back to startup engineering. A podcast that goes behind the scenes at startups I am Rob De Feo, Startup Advocate at AWS. Together we'll hear from the engineers CEOs, founders that built the technology and products that some of the world's leading startups from launch through to achieving mass scale and all the bumps in between. Experts share their experiences, lessons learned and best practices. In this episode, our guest Stu, a co-founder and staff engineer at Zego takes us behind the scenes of how they built and retrofitted a ledger. Stu can you describe the problem that Zego is solving, for the people that have not yet used it?

Stu: Yes, I'd love to. Zego is building a brand new insurance company from scratch. The way insurance works is not fit for the modern world. People are changing the way they work, people changing the way they build companies. The gig economy, the new mobility ability economy, the sharing economy. All of these things have emerged and really grown in the last five years, and traditional insurance companies are hampered by their systems. Hampered by legacy systems and legacy processes that mean that they can't keep up and provide the kind of insurance products that people need to be able to do their job. Zego is all about empowering people to go out and live their life and work the way they want to work, without being held back by insurance.

Rob: That's amazing. So you were really quickly able to build this new product for this new market. Can you talk about some of the technology challenges that you faced when doing this?

Stu: I think probably a couple of the big ones were the fact that we needed to build a whole bunch of software from scratch. Traditional policy management systems, for example, are used to dealing with policies that last a year, maybe they last for a month at the shortest time. We wanted to write policies that lasted 12 minutes. When we spoke to some initial underwriters to try and get them onboard with this product, they said it costs us £8 every time we write a policy to our database. For whatever reasons, all of their legacy systems and all of the licensing that they had, they said “we can't possibly sell a policy for £0.65”. We did have to write our own policy management system from scratch. Insurance is complicated. It's a legal minefield. You need to be very, very careful about what you're doing and making sure that people are covered because at the end of the day, it's potentially someone's livelihood. It's very, very important to get it right the first time. On top of the massive amount of data that we started getting from these work providers, we were finding out who was on shift and working patterns. Dealing with all of that data, getting their data into a system, that we are able to handle customers working for multiple work providers at different times, at the same time and overlapping shifts. When it came to the finance side of things, it also became very, very complex.

Rob: It sounds like the approach you took, you didn't necessarily expect all these challenges that you had. Especially when you're moving quickly in the beginning. What are the challenges that you're having to face building the next step in your technology?

Stu: Probably one of the most interesting projects that we've been building recently is actually our accounting ledger. There's a special type of accounting for insurance brokers called Insurance Broker Accounting. Yeah, they didn't get too creative. Essentially, we hold at any time an uncomfortable amount of other people's money. Whether that's our customers money, underwriters money, or the tax man’s money. If you look in our bank accounts, that's not all ours. We can't go spend it on Friday breakfast and software engineering salaries. We needed to know exactly how much of that belonged to each one of our 40,000 customers. How much of that money belongs to insurers and how much of that money we had to give to the taxman. That there itself is very complex because it's a rabbit hole.

We sell in five different countries around Europe, not a single one of them deals with tax the same way. We thought it was gonna be easy because we started in the UK and the UK taxes are the simplest, 12% flat tax. The second country was Ireland, and we know there it's a little bit more complex. It's a flat tax plus a levy that you have to pay. Then we went to Spain, there's a flat tax plus a levy plus another levy, that depends on what kind of vehicle you're insuring. “Okay, cool, we can get there.” We went to France, there are five different flat taxes depending on how much of the policy covers a certain type of risk. It just got more and more complex as we did it.

Most companies end up solving it, for accounting purposes is to throw an army of accountants at it. In traditional insurance companies and traditional finance companies a massive amount of their staff ends up being a very large finance department.

Rob: You're trying to build a technology first company above all else?

Stu: Exactly. Our CEO Sten. He's tasked us with beating Aviva, who have 120,000 staff and doing it with less than 2,000. That's going to require an awful lot of work, on the technology side to make sure that we can build a super efficient finance department.

Rob: What did this look like before you have this new solution? What was the process you had to go through to rebuild it?

Stu: You don't hear about our first version. We just kind of kept track of how much money people have put into their account. Then overfunded our client money account to make sure that we would always have enough. We didn't have that many customers, so we did throw a few accountants at it to find out exactly how much money we had. Then we would pay our insurers out of our savings account instead. We always had enough money to pay everybody. But as we grew bigger and took on more and more customers, more and more varied products, more, more different underwriters, we needed to start actually doing this properly.

Rob: How important is it to be precise with this type of system?

Stu: When it comes down to, certainly things like tax and client money, things have to reconcile to the penny. If something is out by one penny, then that is enough for us to stop what we're doing and go in and try and find out why it's out by one penny.

Rob: It’s interesting you're going down to the accuracy of the nearest penny. Is there a specific reason why you're doing that?

Stu: Probably regulations, is the biggest one. In a regulated industry we get audited. We have been audited in three years, I think three times. We fully expect to be audited on a regular basis, especially because we are doing something new. We are building a whole bunch of new technology and building things that the regulators haven't seen and building new products that the regulators haven't seen. They want to make sure that we are doing things fair and that we're treating people fairly, that is also one of our main goals. Our big three company things are that we want to be fair, we want to be simple and we want to be flexible. That means that if you give someone money, you expect to get that money back and to the penny. E.g. if you lent me £10 and I give you back £9.99 when it's only a penny, it's fine. You'd still feel a little bit like “well, hang on, yeah, it might be fine, but it's still my penny”. Combined with the fact that, we want to make sure that if a user asks for their money back, which they it's their money, they can have it. They get exactly what they're entitled to. Certainly the tax man he cares about a penny. It was very much a case of making sure that this stuff worked. That was one of the big problems. We've essentially been retrofitting this ledger onto an accounting system that has been running for two and 1/2 years.

Rob: Was this an off the shelf solution, or was it something you built from the ground up?

Stu: No, this is a new ledger that we're building on to replace the old charges and credit system that we built back in 2017/16. What that meant was, it was much more accurate. It’s much more specific about exactly what different charges. Who they were for? What account there was supposed to go into? Things like that.

Rob: How are you able to get to that level of detail and accuracy?

Stu: A lot of work! Mainly because we write our own software from scratch, we were able to integrate it really well into the platform that we're building. We didn't have to integrate with an off the shelf policy management system and we didn't have to integrate with an off the shelf user management system. It was very easy for us to dig into the bits of the system that required integrating directly into that accountancy system and just put it in there. It's all our software, it's all our code. Where it became tricky was when we dealt with third parties. We don’t handle payments ourselves. For example, we use Stripe for payments, we GoCardless for bank transfers, and we use financing companies for doing premium financing. Those all have varying levels of technical integration. Stripe is one of the gold standards for how to do really great developer experience and really great integration. We can get a lot of information from them which allows us to automate things. Like when someone does a dispute on a charge on their card, it doesn’t have to go through a manual process. That whole system can now be automated. Some of our other partners are less technically efficient and it still requires little bits of manual processes and getting those integrations in place.

Rob: You have the requirement to be able to reconcile these thousands of policies in a complete way. What's the most important thing that you need to be aware of? Or the most important piece of information that you're capturing to do this?

Stu: Most of it is about exposure. So exposing as much of the underlying data as we can, exposing as much of the underlying business events that caused a transaction to happen. Money doesn't just move for no reason. If you can understand why a transaction happened, then even if it's something that system can't exactly automate you can still know why this transaction happened. We know what the parties involved are. We know that money needs to approve from these different places. From this account to our insure account and from one of our customer accounts through to our account. Or maybe they're used up some of the promotional credit and we have to move some money from our marketing account and marking budget into their insure account. Knowing exactly what's supposed to take place, which the system can do. Sometimes to know exactly how much is supposed to move, that requires one of our very smart accountants.

Rob: You’ve given us a good idea about why the ledger is so important and what it does. Can you explain to people a little bit about the architecture? And some of the key pieces of technology used to build this?

Stu: Our main tech stack is built on Django and Python. We were currently decomposing our initial monolith. Which I think is a phase that startups go through. It's quite an exciting one. We went with Python across the stack and across the entire business we use Python. Data scientists write in Python. We taught all of the business intelligence team Python. All of our application stack is written in Python. It's all hosted on AWS. We went with AWS quite early because we knew that we needed to be cloud based and we knew we needed to scale. I didn't want to be managing Postgres databases at two o'clock in the morning, So we are on Postgres RDS. We use all sorts of things now, Lambda, SQS for a lot of the events and job systems that we do.

Rob: What is your team mostly spending their time working on now.

Stu: The work we do and going at the moment is to start pulling a bunch of these things apart. Really untangling the web of a monolith. Then we can move into a service based world. Our systems engineers are very excited because they'll get to use all sorts of fun new tools like Istio, and Kubernetes. We are spiking a lot of that at the moment. We are probably going to go down the GRPC route for our services.

Rob: You're using GRPC to have that quick internal communication with your services.

Stu: Yeah, exactly. We really like the way that you can enforce that hard contract and really code your API contract into the GRPC layer itself and then build out stubbs. Which will mean that you when, it's not even if. When we decide to use languages other than Python. They can also interact really, really well.

Rob: You spoke a lot about having a monolith. When you build this new ledger, is that something that you built outside the monolith?

Stu: It’s currently still inside. When we are building at the moment, we are building in a way that is going to allow us to pull stuff out of the monolith very easily. Whilst we still spike exactly how we're going to do monitoring, tracing, logging authorization, security, and all of those bits around service. We're building new applications and new things into the monolith. We are making sure that they stay very self contained. They still talk to other parts of the monolith via a fairly well defined contract. It's just that the contract happens to be in the running in the same process and on the same machine rather than over a network call.

Rob: Without architectural boundaries, are you having to be more strict in code reviews to ensure that people aren't breaking this?

Stu: It takes a lot of developer strictness. Everyone is on the same page. Everyone knows that we would rather spend a little bit of extra time doing this properly to save us a lot of time later trying to untangle it. Especially for something as complex as this ledger.

Rob: Switching back to the ledger, when you were building this, I guess there were some unexpected events or unexpected problems that you came across. Can you talk about some of the edge cases you have to deal with?

Stu: I guess one of the interesting bits is we didn't have it from the start. By putting in two and 1/2 years in, we needed to go back and reconcile to the beginning of time. Which is very difficult because of some data quality issues that were around from late 2016 early 2017, when we didn't really know what we were doing. Meant that as soon as you started to try and reason about exactly “why this particular bit of money moved” or “why this transaction happened” made it very difficult. We've actually got a new transaction type in the ledger where a transaction is “in suspense”. In suspense basically means no one's really quite worked out why this transaction happened. We can tell what happened. We've always been able to know that some money moved between here and there and who had moved to. Understanding why that happened, sometimes it's not there. We had to come up with a version that we could say to people “hey, we're either still looking into why this one happened” or sometimes just writing off and saying “yeah we know that we gave this person some money at some point in the past”. As long as we are above board with it and so long as most of the time it's us giving other people money, nobody cares too much.

Rob: That's a really interesting approach and solution to the way that you solve this. Was that because that you give a lot of flexibility to support staff? Or was it because of the early versions of the MVP and software built or something else?

Stu: A lot of it came down to flexibility we had early on. We've always been very customer centric and we've always had a really great customer service team. Who very early on were given quite open tools to be able to do whatever they needed to do it. If we had customers who needed to be given a refund for some reason, they could just go in and give them a refund. Then going back two and 1/2 years later and saying to somebody “this time that you transferred this money from our account to this customer's account, why did you do that?” and they are like “this is two and 1/2 years ago, I don't remember”. Which is a pretty valid excuse for someone who speaks to like hundreds of customers a day. It was a lot of, not human error, but human processes that happened very early on that we didn't have a record of.

The other one was technical. Early versions of the software, like the quotation engine, pricing engine and things like that which didn't round decimal places as deep as they do now. There would be rounding errors. That's where quite a lot of the one penny errors come from. The early versions of the software might round-up or round-down. When you're dealing with taxes in percentages on low numbers, you can quite easily break down a premium that should be £1.00, When you add up all of the bits of that break down, it gets to £0.99 instead. It's very easy to see those bits and to understand “hey, I can see that this doesn't reconcile” but to understand exactly who that £0.01 belongs to becomes quite tricky. You then have to go through, have we underpaid our tax man or our insurer? Have we made sure that the customer has paid the right amount and not too much for their insurance? Those are the bits that when we went back in, retrofitted all of the legacy data that we had into this new system, we found quite a lot of those.

Rob: At the moment you built the ledger, you think it's finished. Can you talk us through what happens when you turn on and you're about to put two and 1/2 years worth of data through it. What are the mechanics of this? How does that work?

Stu: The way it's built now it's all on an event based system. An event happens that causes a transaction and we write those transactions or it causes a future transaction and we write a pending transaction. That is really easy and volumes are fairly steady. When it came time to do that for all of the backfill transactions. It was one big job. We had something in the order of 500,000 transactions. Each transaction has like 7/8 different entries in the transaction. And so we're talking like 6/6.5 million rows that we needed to calculate and write into the database. We batched it up and we did it as a job and fully expected that job to take about six hours. I kicked it off one night and said, “cool, let's let's come back tomorrow morning and see what this job looks like”. We come back the next morning in the office and we look at the logs and it's about 20% of the way through. Okay, this six hours is probably going to be more like a week. We actually had to stop the job because what we hadn't built into it was the ability for the job to restart and not have to restart from the beginning of time.

Rob: You built it in such a way that if you got the 1st 10,000 transactions correct, but then there was a problem, you wouldn't have to start again?

Stu: If the job was to be killed, and this is one of the one of the considerations you need to take into account when running long running jobs. On cloud services if you don't batch them up into lots of little jobs, a server can get killed underneath you. It's one of the things that you take into account when you're building an application. Most requests are relatively short lived and can be retried. If you have a job, and that one single job is taking hours and hours at a time. The chances of needing to restart that job grow exponentially. We stopped the job and then we broke it up. Instead of one long running job, it would be a couple of million very, very short running jobs. Which is great because it meant we could stick them on a queue and watch the queue go down. If for whatever reason a server died, got taken away, or somebody was deploying and they needed to restart it wouldn't hold that up.

Rob: Now you have the ability to retry things. Then what happens when you run subsequent jobs? Were there other things that you learnt? What did you experiment with?

Stu: Most of the iterations we went through once we had done the backfill were finding the bugs. Finding not only the bits where the bugs were caused by legacy data. Those were just a manual process, a very long manual process. Involving the engineers and the accounting team to actually work out what should have happened then manually updating transactions and entries. Then it was the ones that were continuing to happen. The ones where the rounding errors we're coming from deep within parts of the system that were two years old. The option there is either to paper over those cracks in the accounting ledger or dig into the systems that people hadn't touched for a year and fix them. We went with the second option. If we are going to spend the time, let's spend the time fixing these deep in the issues. The tax engine, quotation engine and all of our pricing factors. Instead of going to 2 or 3 decimal places, some of them now go to 10 decimal places. What that means is that the whole system is more accurate.

Rob: Each time you fixed the bug or you've improved the system, you've made it correct. But you've made it correct in the past. How did you manage that? How do you fix these things?

Stu: We go back and we fix the past.

Rob: That means you're running it over all the old data and getting new results for it.

Stu: Yeah, we ran and we ran it over all of the old data. Everytime we find one of these bugs, we know how much of the old data is not precise. We can work out which transactions are in suspense or which transactions require a fix. You can look at that and you can say “ok cool if we make this fixed, not only is it going to fix this but going forward, but this is going to fix, you know, 10% off all of the errors in our historical transactions”. When I say we changed the past, most of the time, changing the past is not going to the database records and changing the previous database records. A ledger is supposed to be appended only. It means creating new transactions that will actually fix from an accounting sense the previous ones. If you've overcharged someone by £0.01 two years ago, you give them a £0.01 refund today.

Rob: This is an important characteristic about Ledger. You can't go back in the past and make changes. You need to add another transaction to make an alteration. Now the ledger is working, you’re been firing transactions at it, it's all updated. It's more accurate. Does it just work? Or is there more that needs to be done?

Stu: There's a lot more to do. It’s running for probably about 80% of all products at the moment. When we started we had about 500,000 transactions including backfill. We have now just over 1,000,000 transactions which means 9,000,000 million entries. That doesn't include any of the B2B transactions, they are still handled manually by the accounting team. We sell products to consumers, delivery riders and Uber drivers. But we also sell larger group fleet policies to people starting kick scooter companies or people who have a fleet 100 vans who don't need insurance on all 100 vans all year round because half the year their vans are sitting in a garage and half year it's Christmas time and they're all out on the road delivering.

They are also really beneficial for our flexible insurance policies. What we do is we currently manually handle all of those transactions. They are lower in volume, but often a little bit more complex because of the usage data for large volumes of fleets even if it's under one transaction. The next step is to get the rest of those in so that we can completely automate all of the manual processes and the grudge work that our accounting team has to do.

Rob: With the ledger in place now, and all the data has been updated and even more accuracy. Are you able to run different types of reports and use different tooling on top of it?

Stu: We have quite a strong data engineering team. What they do is pull data not only from this ledger for analytics, but also from all of our policy management systems all the way through to all of our website and app analytics and put it into a data warehouse. Make that available for the business intelligence team. We use a tool called Looker.

Rob: With Looker being the tool that your business analysts are using. Where's the data? What tool is used to get the information out?

Stu: We have RedShift for our data warehouse, and that pulls all of the other data stores. We have a number of RDS databases, we’ve got a DynamoDB database, then it also pulls from external sources. It pulls everything out of not only our ledger, but also our bank account details, things like Xero that we use for expenses, Google Analytics, and from all of these other places into this one source. That aggregates all of our different data sources in one and then makes that available so that you combine analytics across multiple different data sources.

Rob: Solving the problem of accounting for every penny when insurance premiums are measured in minutes rather than years. Then retrofitting years worth of data and transactions against it was a difficult challenge. Using engineering resources rather than an army of accountants creates a scalable solution. Zego takes an iterative approach to building software. It has well defined contracts in their monolith which allows them to move quickly and create code boundaries that could be broken out into micro services in the future. Engineering a ledger to be 100% correct when running over old data is an impractical approach. Creating checkpointing allowed Zego to iteratively build the ledger and fix bugs without having to start from zero each time. Let's get back to Stu to hear about the learnings, best princes and advice he has to offer.

Going through this process you have learned a lot. If you were able to start again today knowing everything that you know now, what would you do differently?

Stu: One of the big ones is making sure that we understand where the deficiencies in our legacy data occur. There are a number of other projects on-going at the moment to make sure that the application that we built when we were six months old and had no idea what we were doing or where we're gonna be in a few years is actually scaleable to the existing systems. It's all well and good to come up with a brand new system and a brand new set of data models and go “yeah, this is so much better”. If you have no way of migrating your existing data into those models, then you're gonna be in for a world of hurt. I mean, that's what we found in this. There have been a lot of late nights and a lot of manual work gone into ensuring that all of that past data was cleaned up. Some of that we could have done beforehand, it would be faster. You spend a lot of computing time calculating things that are wrong, and then fixing a bug and having to spend completely re-calculating those things to make them right. Some of that data could have been picked up beforehand, and some of it couldn't, some of it we didn't know that it was wrong until we had done this work. Looking into some of the other projects that we have upcoming around user management and CRM. It's really about cleaning up some of that legacy data before we start embarking on the project. Then we are very much in the process of decomposing our monolith into services. When we build stuff, we are still needing to build on to our existing platform, our existing monolith and building that in a way that we know it's going to be easy to pull it apart in six months time is something that goes into every single bit that we do now. It's one of the main considerations.

Rob: You were very intentional about the way that you started out with a monolith and it's worked really well for you to get up and running really quickly. Now you have to pull it apart and that's a significant engineering effort. Is there something that you would have done differently?

Stu: Yes and no. I definitely still would have started with a monolith. I think that going into an industry that you don't have years and years and years of domain knowledge about, you don't know where those boundaries are going to be drawn. We changed our business model three times in the first six months. We changed our pricing system. We changed so many things very, very early on. If I had been trying to draw very hard boundaries right at the beginning, I would have spent all of our time redrawing boundaries instead of just allowing everything to grow. That's been one of the things that allowed us to succeed and grow quickly. I probably would have started the work that we're doing now to really start drawing those boundaries even internally within the monolith a little bit earlier. We started about halfway through last year. New modules that were added with essentially a contract layer on top of them and anything that needed to use those new modules use it via this contract. I think we probably could have started that maybe 6 to 12 months earlier.

Rob: There's a lot of discussion around building microservices or a monolith first. If you're building microservices from the beginning, defining the process boundaries is really, really difficult. Is their methodology to be able to define the process boundaries upfront?

Stu: Exactly, if anyone ever says that they fully understand where all of their boundaries for their services should lie on day one, then either they are an absolute genius or their little bit delusional. I certainly would have been delusional if I tried to tell anyone that I knew enough about insurance and enough about a product that was brand new in an industry that was fast changing in the middle of 2016, because I didn't.

Rob: Thank you Stu for sharing your best practices, experiences and lessons learned. If you're excited about building the next big thing or you want to learn from the engineers that have been there and done that, subscribe to startup engineering wherever you get your podcasts.

Remember to check out the show notes for useful resources related to this episode.

Until the next time, keep on building.