Startup Engineering

Deepset - Machine learning research to enterprise ready services

Episode Summary

Using the latest from machine learning research in enterprise products is hard. Research projects are built to advance research goals. It's not easy to convert papers, code and scripts into products. They are difficult to maintain and scale. Malte Pietsch, a co-founder of deepset, explains their approach to scaling research into production-ready, enterprise-scale applications.

Episode Notes

Our guest Malte Pietsch is a co-founder of deepset, where he builds NLP solutions for enterprise clients such as Siemens, Airbus and Springer Nature. He holds an M.Sc. with honors from TU Munich and conducted research at Carnegie Mellon University.

He is an active open-source contributor, creator of the NLP frameworks FARM & haystack and published the German BERT model. He is particularly interested in transfer learning and its application to question answering / semantic search.


Episode Transcription

Rob: Welcome back to Startup Engineering, a podcast that goes behind the scenes at startups. I'm Rob De Feo, Startup Advocate at AWS. Together we'll hear from the engineers, CTOs and founders that build the technology and products at some of the world's leading startups. From launch through to achieving mass scale and everything in between, experts share their experiences, lessons learned and best practices. In this episode, our guest, Malte, a co-founder of deepset, takes us behind the scenes of how they take the latest from machine learning research and use it in enterprise-scale products. Malte, can you describe the problem that you're solving for anyone that's not yet had a chance to use deepset?

Malte: We are a startup from Berlin, founded almost two years ago, and we are working on deep learning based natural language processing, mostly focusing on transfer learning these days.

Rob: With your focus on transfer learning, what's the big problem you're solving with NLP?

Malte: The biggest problem that used to be out there is the gap between research and actual industry: in industry you usually don't have enough training data. Transfer learning is one way of solving that, because you can take pre-trained models and apply them in industry using less data. Our company is right now focused a lot on improving enterprise search engines with the help of NLP. We use transfer learning, but for the sake of improving search results.

Rob: Can you describe why transfer learning is so important for you and also for the industry at large?

Malte: Yeah. So back in the day you mostly had one NLP problem, and you would look around and ask, "What kind of model architecture helps here?" You were collecting training data for this particular use case. Take question answering, for example, where the input is a natural language question and as output you want to get the answer from some kind of text. There you would basically have to collect a lot of examples, have lots of people annotating this kind of data and then train your model. Nowadays, what you can do with transfer learning is basically train a model just on raw text data without any annotations, the kind of text data that everybody has lying around. Then you use these models on some downstream tasks, and there you don't need as much training data anymore. Transfer learning allows building your machine learning models with less training data while giving you better performance. The third effect that we see is that your development cycle is more streamlined. You can reuse the models across different tasks.

Rob: So transfer learning is an important development. You're able to take the same model and use it across different domains, which really reduces the amount of data annotation needed.

Malte: Yeah, exactly.

Rob: This is one of the big developments in machine learning. So how are you able to take this research and put it into production, and what products do you use it for?

Malte: We are currently focusing a lot on the search engine problem. We basically started by looking at research and what is out there. More or less two years ago there was a really big jump in performance when you look at research papers, and that was basically the start, when we also said, "Okay, let's focus on this, let's get this into production at enterprise level." That was basically our journey: taking this research code and bringing it to a stage where we can really use it at scale in enterprises.

Rob: There are continuous jumps in research and there's a lot of publicly available information. But transferring that into a product is a difficult task. So how do you go about doing that, and what are some of the problems you found along the way?

Malte: So, first of all, I think the speed of research is crazy. These days in NLP there are new papers published every week and new state-of-the-art models. Just keeping track of all these models is really challenging if your main job is not reading research papers. Then, once you think you've settled on, let's say, a model or what you want to do, I think there are a couple of challenges when you want to bring it into production. One of them is basically pure scale. Usually enterprises have way more data. Speed matters, you have to have results in real time. Sometimes the tasks are also a bit different. The tasks that you see in the research papers are not always exactly the tasks that you need to solve in the real world. In the area of NLP, we have another problem, which is domain data. In most companies that I know, people speak English, but not the plain English that is used, for example, in Wikipedia. There's certain terminology, say, in the legal sector or aerospace, which makes the NLP task a bit different.

Rob: And these are the areas where research has been developing very quickly.

Malte: Yes. You can also look at the domain data, where you can use transfer learning as well, but I would say it is mostly the jumps that we saw in model performance and accuracy. This really enabled a lot of new opportunities for businesses, and it became really interesting to use these models.

Rob: When I've looked at research papers or the code samples attached to them, they tend to be focused on the task at hand, as in developing the research. Can you talk about how you've taken that and then brought it to the next step in your product development?

Malte: The key problem with many research papers is that the code is really meant for one purpose, and that is basically training the model for this research task. One problem that comes with it is that this code is usually a couple of scripts. It's not really meant for production. I think there's a big temptation to just take these scripts and use them in your proof of concept. What we found, with a lot of customers in industry, is that there's a risk that you end up in a proof of concept trap: you do your POC, but then you can't really bring it into production. Even if you can bring it to production, you risk long-term technical debt, where you are building a lot of silos. There's a risk that this causes frustration and disappointment about machine learning if you just take this research code and go forward with it.

Rob: What's the process you take when you have an idea and then you create a POC? How do you bring that into production?

Malte: I think it's a very common approach for machine learning projects that you start with a proof of concept, where you say, "Okay, let's be very fast, agile, try a few things out." This is, I think, totally fair, but I saw a recent number, I think only 16 percent of POCs really make it into production in the end. This is what some people call the POC trap. You start a lot of POCs in a company, but only very few of them make it into production and really create value for the business. There are a couple of reasons, of course, but one that we saw a lot is that you have a model that actually works in the POC stage. People are happy at first and say, "Wow, OK, that kind of works." But then you start asking, "OK, how can we bring this into production?" Then the problems occur and people say, "Okay, that's not scalable," or "We have to completely rewrite our code, this would take half a year or a year," or "This doesn't even scale at all." And then a lot of projects just die.

Rob: So you're acutely aware of the POC trap. But being aware is just one thing. I see a lot of people that are often aware of something yet still fall into that same trap. How do you avoid it?

Malte: One way of dealing with it is to force yourself, even in the POC stage, to write maintainable, modular code. Not just taking the scripts that are published together with the papers, but really having a clean software engineering approach. I think there are a lot of best practices established in software engineering that also apply to machine learning projects: making your code modular, having tests, putting parameters into a config that you can easily share and store. In the long run this allows reproducible experiments, because in the POC stage you really do a lot of different experiments, and you need some system, a way of organizing things, so that in the end you can easily select the useful experiments and move them into production.
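
The practice of keeping parameters in a shareable config and making experiments reproducible can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not deepset's actual setup; the field names, the default values and the `run_experiment` stand-in are all hypothetical.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """All tunable parameters in one place (field names are hypothetical)."""
    model_name: str = "bert-base-cased"
    learning_rate: float = 2e-5
    epochs: int = 2
    seed: int = 42


def config_to_json(config: ExperimentConfig) -> str:
    """Serialize the config so it can be stored and shared with a run."""
    return json.dumps(asdict(config), sort_keys=True)


def config_from_json(text: str) -> ExperimentConfig:
    """Rebuild exactly the same config from its stored form."""
    return ExperimentConfig(**json.loads(text))


def run_experiment(config: ExperimentConfig) -> dict:
    """Stand-in for a training run: returns a made-up metric.

    The only point is that all randomness comes from the seed in the
    config, so the same config always produces the same result.
    """
    rng = random.Random(config.seed)
    fake_score = round(rng.uniform(0.7, 0.9), 4)
    return {"config": asdict(config), "score": fake_score}


if __name__ == "__main__":
    stored = config_to_json(ExperimentConfig(learning_rate=3e-5))
    print(run_experiment(config_from_json(stored)))
```

Because the stored JSON fully determines the run, any experiment from the POC stage can be repeated later from its config alone.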

Rob: Can you go into detail about one of the examples where you've done that?

Malte: Yeah. One was actually in the area of transfer learning. We built a framework, an open source framework called FARM. It started with Google publishing code together with the BERT model. We had a look at this code and said, "OK, let's take this and let's use it in the industry." One step of that was really making it modular. That meant that we have certain components in the pipelines for pre-processing, in this case usually a tokenizer, which separates your text into individual tokens, something like words. Then there are processors that can consume different inputs, from files, from API requests and so on. And there is also a way of sticking these models together, which is not always done like this, but we find it very useful to also separate the models into smaller pieces.

Rob: You're going a step further than just creating modular code. Can you describe what you mean by separating the models?

Malte: In the area of transfer learning, you usually have a language model that you train just on pure text, and that's something like BERT. To actually use these language models you need something called a prediction head. That's a few layers in your neural network that you stick on top. These layers can then deal with, let's say, a text classification or question answering task. You now have two options. Either you treat the whole thing as one model, so you have, let's say, one BERT model for question answering. Or you really treat it as two components: you have a language model as the core, and then you have a prediction head as a second object on top. If you do that, you gain some flexibility, because you can exchange the language model part quite easily when new architectures come out. And that's what we see on a weekly basis now, with many different architectures. You can also experiment in a nice way, where you stick multiple prediction heads on top of the model.
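
The split into a language model core and prediction heads on top can be sketched in plain Python. This is only a structural illustration under stated assumptions: the class names, the toy encoders and the toy heads are hypothetical and are not FARM's API, and a real implementation would use neural network layers rather than simple functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class LanguageModel:
    """Core component: turns text into a vector representation."""
    name: str
    encode: Callable[[str], Vector]


@dataclass
class PredictionHead:
    """Task-specific part that sits on top of the language model output."""
    task: str
    predict: Callable[[Vector], object]


@dataclass
class ComposedModel:
    """One language model plus any number of prediction heads."""
    language_model: LanguageModel
    heads: List[PredictionHead] = field(default_factory=list)

    def forward(self, text: str) -> Dict[str, object]:
        vector = self.language_model.encode(text)  # encode once, reuse per head
        return {head.task: head.predict(vector) for head in self.heads}


def encoder_v1(text: str) -> Vector:
    # Toy "language model": character count and word count as features.
    return [float(len(text)), float(len(text.split()))]


def encoder_v2(text: str) -> Vector:
    # A "newer architecture" with the same output shape: ignores outer whitespace.
    stripped = text.strip()
    return [float(len(stripped)), float(len(stripped.split()))]


length_head = PredictionHead("length_class", lambda v: "long" if v[0] > 20 else "short")
word_head = PredictionHead("word_count", lambda v: int(v[1]))

model = ComposedModel(LanguageModel("toy-v1", encoder_v1), [length_head, word_head])
result = model.forward("What is transfer learning?")

# Swapping the core leaves the heads and the calling code untouched.
model.language_model = LanguageModel("toy-v2", encoder_v2)
result_after_swap = model.forward("What is transfer learning?")
```

The design choice is that the heads depend only on the shape of the vector, not on which language model produced it, which is what makes the swap a one-line change.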

Rob: What you're describing is creating modular code, not only to help you bring it into production, but also so you can continue experimenting. What's the most important factor when you're building modular code?

Malte: I would say it's both. You want this flexibility during experimentation, but later, when you move to production, you also want to be future proof. Let's say there is a new model architecture coming out: you don't want to end up rewriting your whole codebase and testing everything again. You really just want to switch the language model, for example. So I see it as both experimenting and operating in the long run.

Rob: How long does it normally take you to take code that you've seen in a research project and make that into production-ready code?

Malte: I would say it depends a lot on the task. In our case, for transfer learning, we were really ambitious and said, "Okay, we want to build a framework around it and have it open source, so it's really applicable to many different tasks." The first version took us maybe around two months, and we published it in July last year. We are working on it constantly now to improve it, to extend it. Of course, it's a never-ending story. But I would say we got the first model into production with this kind of framework after two months, more or less.

Rob: To build the way that you're describing, there's a bigger burden on the initial effort, but that's going to pay off over time. How do you balance the competing goals in a startup to build software that can scale but also to move quickly?

Malte: Yeah, I think that's the art: making this trade-off and realizing that there is a trade-off. I think in my early days, when I worked for another startup, I was really tempted to just take the code and get it out as fast as possible. But then you learn that you really waste a lot of time afterwards on debugging, testing and maintaining your codebase. I don't have a number here, I guess.

Rob: Do you have a specific example of when you didn't do this and it actually gave you really big problems, and that really changed the way that you thought about how you would build?

Malte: Yeah, I think debugging in general. I once had a model which was failing in production, and it was failing very silently. It was not throwing error messages, of course, but the performance was just degrading. In the end, we figured out that this was a deeply nested bug, deep down in a script. We had updated a few other components of the pipeline, but didn't think of this script and its configuration. That ended up costing me a lot of debugging time to trace this bug.

Rob: It would be great to hear why you not only make the code more maintainable, but actually adapt the models. Can you talk a little bit about the reasoning behind this?

Malte: There are a lot of, let's say, standard research tasks that research has focused on, and there's a good reason for it: if they have a standard task, let's say with a leaderboard and maybe a standard dataset they work on, it's easy for them to compare different models. This makes a lot of sense. But these tasks do not always translate or transfer to the real world or to companies. One example is question answering, where you have a dataset and a task out there in which you get a question and a very small passage from Wikipedia, and you need to find the answer in this small passage. The dataset is called SQuAD, and it's really, let's say, the most popular dataset out there. Everybody talks about it, and big companies compete to get the most prominent ranks on the leaderboard. But in the real world, I would argue, you will rarely have a case where you need to find an answer within a very small passage of, say, 100 or 200 words. In the real world, you usually have large collections of documents, thousands of documents lying around on SharePoint or somewhere on a file storage system, and you want to find the information in there. This is what I mean: there is a gap between research tasks and real-world tasks, and you need to find a way to transfer the results achieved in research to your real-world task. Let's maybe walk through this case a bit. It means mainly two things. The first one is really scaling from this task of having passages in which you need to find the answer to large document bases. You could say, "OK, maybe it's just a matter of model speed, so let's optimize the hell out of the model," and there are a lot of best practices to do that. But for this particular problem that doesn't work. Even with all the best practices out there, you could never gain enough speed for it to scale to thousands of documents.
What you need to do there is become a bit more creative, utilizing what is out there and stitching things together. What we did in this scenario was create a pipeline of two models. We have one model that is very slow, for example a BERT model. It's very powerful but slow, and that's what people use in research. We now put another model in front of it, called a retriever, which is very fast but only a heuristic model. This first model can basically identify, from the thousands of documents, the 20 or 50 most promising ones. These then get fed to our BERT model, and with that we get really quite good accuracy and speed at the same time.
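
The two-stage pipeline can be illustrated with a small, self-contained sketch. The assumptions are loud ones: the fast first stage here is a simple term-overlap count rather than a real search engine's ranking, and the `read` function is only a cheap stand-in for a slow neural reader such as BERT. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, documents: List[str], top_k: int) -> List[str]:
    """Fast heuristic stage: rank documents by how often query terms occur."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        scored.append((sum(counts[term] for term in query_terms), doc))
    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


def read(query: str, document: str) -> Tuple[float, str]:
    """Stand-in for the slow, accurate stage (a neural reader in practice).

    Returns (score, sentence) for the sentence sharing most query terms.
    """
    query_terms = set(tokenize(query))
    best = (0.0, "")
    if not query_terms:
        return best
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        overlap = len(query_terms & set(tokenize(sentence))) / len(query_terms)
        if overlap > best[0]:
            best = (overlap, sentence)
    return best


def answer(query: str, documents: List[str], top_k: int = 2) -> str:
    """Run the expensive reader only on the retriever's top_k candidates."""
    candidates = retrieve(query, documents, top_k)
    results = [read(query, doc) for doc in candidates]
    return max(results, default=(0.0, ""))[1]


docs = [
    "The A320 is a narrow-body airliner. Its first flight was in 1987.",
    "Elasticsearch is a search engine. It scales across many documents.",
    "BERT is a language model. It was published by Google in 2018.",
]
question = "When was BERT published by Google?"
print(answer(question, docs))
```

The reader's cost grows with `top_k` rather than with the size of the whole collection, which is where the speed-up in such a design comes from.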

Rob: That's a really cool approach. You have one model that is looking for candidate examples. Then there's another model that takes this much smaller list and gives a more detailed answer. What sort of improvements did you see with this new implementation?

Malte: From impossible to a few seconds. Yeah, I really think if you talk about thousands of documents and just applying the model out of the box, the BERT model would, I think, take days to process them, even on quite powerful infrastructure with GPUs. Now we're down to one or two seconds, depending a bit on how accurate your results should be. That's more or less the order of magnitude we're talking about.

Rob: That's a really big jump. So what tools and architecture do you use to build and train these models?

Malte: We basically use PyTorch as a deep learning framework. When we train our models, there are a couple of steps involved. The first one is training models from scratch, if it's really needed. This is really a heavy workload, so we rely on large GPU clusters. We currently use our FARM framework for that purpose, which is open source. It's now quite tightly integrated with SageMaker, so you can train on large GPU instances. Most of the time we use 4 to 8, sometimes 16, NVIDIA V100 GPUs for the training step. But this is really only needed in a few scenarios. More common, for the QA model, is to take a pre-trained model that may already be out there and fine-tune it for the question answering task. The setup or architecture behind it is pretty similar, also done with PyTorch on GPUs. You usually don't need that many, and it takes, I think, maybe an hour or two on an instance with four V100s, a p3.8xlarge. That's basically for training the models, and then we need to get them ready for inference. There we integrate them into the pipeline that I mentioned, with a fast heuristic model in front and our newly trained model after it. We usually have quite a tight integration with Elasticsearch there, which is very good at getting these fast heuristic results and scales nicely even across millions of documents.

Rob: Cost can be an important factor when training models. Do you do anything to manage the costs?

Malte: Oh yeah, of course, as a startup that's something we also have to keep an eye on. Training these large language models especially is quite expensive. We worked a lot on the integration with SageMaker for one reason in particular, and that is saving costs by using spot instances. You can now use spot instances, and basically the model starts training when instances are available. At some point it might stop. We save all the checkpoints, store all the states of the optimizer and so on, and then resume training once another instance is available again. That helped us reduce costs by around 70 percent in the last runs that we measured. That was definitely an interesting feature of SageMaker for us.

Rob: To achieve this, you need a way to stop and resume training of the models. Is this something that was native to the model, or is it part of the process you build in when you're productionizing? Or is there another technique you use?

Malte: Yeah, that's something we had to build in. It's actually not super difficult, but we learned a few things there about what you need to save: the state of every object that you have in your training pipeline. For us that meant saving the current model, the weights of the neural network, and that is pretty easy to do in PyTorch. What was more tricky was actually saving the state of the optimizer. In our case we use learning rate schedules, so the learning rate changes over time depending on the progress of training. This is really something you also need to save and load again when you resume training, and it was actually tricky to figure out which states you need to save there in PyTorch. We wanted full reproducibility: we always compare, say, one run without spot training and another run using spot training, and they should line up completely. In the end we figured out that there are a lot of seeds to set, not only the regular ones for PyTorch, NumPy and the random library, but also other random number generators, and only then can you really have full reproducibility.
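
The idea of saving every piece of state so that an interrupted run lines up exactly with an uninterrupted one can be shown with a toy training loop. This sketch deliberately uses only the Python standard library instead of PyTorch: the "model" is a single number, the "optimizer state" is a momentum value, and the random number generator is Python's own. In PyTorch the analogous pieces would be the `state_dict()` of the model, optimizer and learning rate scheduler plus the random number generator states. The function and key names are hypothetical.

```python
import pickle
import random
from typing import Optional, Tuple


def train(total_steps: int,
          stop_after: Optional[int] = None,
          checkpoint: Optional[bytes] = None) -> Tuple[float, Optional[bytes]]:
    """Toy training loop that can be interrupted and resumed.

    Returns (weight, checkpoint). The checkpoint is None when training
    ran to completion, otherwise it holds everything needed to resume.
    """
    if checkpoint is None:
        rng = random.Random(42)
        state = {"step": 0, "weight": 0.0, "momentum": 0.0}
    else:
        saved = pickle.loads(checkpoint)
        state = saved["state"]
        rng = random.Random()
        rng.setstate(saved["rng_state"])  # restore the generator, not just the seed

    while state["step"] < total_steps:
        if stop_after is not None and state["step"] == stop_after:
            # Simulated spot interruption: persist model, optimizer and RNG state.
            blob = pickle.dumps({"state": state, "rng_state": rng.getstate()})
            return state["weight"], blob
        # Linear learning rate decay, derived from the saved step counter.
        lr = 0.1 * (1 - state["step"] / total_steps)
        noisy_gradient = (state["weight"] - 1.0) + rng.gauss(0, 0.1)
        state["momentum"] = 0.9 * state["momentum"] + noisy_gradient
        state["weight"] -= lr * state["momentum"]
        state["step"] += 1
    return state["weight"], None


full_run, _ = train(100)
_, saved_checkpoint = train(100, stop_after=37)
resumed_run, _ = train(100, checkpoint=saved_checkpoint)
print(full_run == resumed_run)
```

Leaving any one of the three pieces out of the checkpoint (step counter, momentum or generator state) would make the resumed run drift away from the uninterrupted one, which mirrors the comparison between spot and non-spot runs described above.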

Rob: You mentioned FARM previously. Can you describe in a little more detail what it is and what role it plays in your training?

Malte: Yeah, it's basically a framework for transfer learning. It can take one of these pre-trained models that are published out there, the most popular one being BERT, and apply it to your own problem, for example classifying documents or the question answering task that I was talking about. We built this framework in a modular way, because we believe that's how you can maintain the code in the long run, and with a lot of support for experiments and for tracking these experiments with other open source frameworks, e.g. MLflow. We built it because we needed it, we found it very useful in our work and then decided to open source it. I think it has everything you need for a fast POC while avoiding some technical debt. You already have modular, maintainable code, so when you transition to production it's quite easy to keep it up to date.

Rob: When you're scaling up into production, what are the key problems or acceptance criteria that you're working towards?

Malte: I think the most common problem people face is optimizing model speed, and you'll find a lot of blog articles on how to do that. Just to name one thing that was useful for us in the past: automatic mixed precision training. The idea is to not use the full precision of float32 for all your model weights, because for some parameters less precision is enough. Automatic mixed precision (AMP) training is a smart way of figuring out which parameters need full precision and which are fine with less. In some deployments this got us speed improvements of about 30 to 40 percent, which is quite interesting for machine learning and also saves costs on GPUs.
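
Why some values tolerate reduced precision and others do not can be seen with a few lines of standard-library Python. This is not an AMP implementation and says nothing about how a framework chooses precisions; it only round-trips numbers through the 16-bit and 32-bit floating point formats to show the storage saving and the precision lost. The example values are arbitrary.

```python
import struct


def to_half(value: float) -> float:
    """Round-trip a number through IEEE 754 half precision (16 bits)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_single(value: float) -> float:
    """Round-trip a number through IEEE 754 single precision (32 bits)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


weights = [0.1, -1.5, 3.14159, 0.0003]

# Half precision needs half the bytes of single precision.
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))

for w in weights:
    # Typical weights survive with a small relative error.
    print(w, to_single(w), to_half(w))

# Very small values, however, can vanish entirely in half precision,
# which is why not every number in training can be stored that way.
tiny = 1e-8
print(tiny, to_single(tiny), to_half(tiny))
```

Halving the bytes per value is one source of the speed and cost savings, while the vanishing tiny value hints at why part of the computation still needs full precision.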

Rob: How can I find out more about the open source projects that you're working on?

Malte: You can find them on our website. For us, what's most important is our open source projects. There's FARM, which is a framework for transfer learning. What we currently focus on a lot is the second framework, called haystack, as in finding the needle in the haystack. If you want to do search or question answering at scale, that might be worth looking at. That's where we implemented a few of the techniques that we just discussed, integrating it with Elasticsearch and having these two models in one pipeline.

Rob: Research projects and products have significantly different implementations that reflect their different goals. Research projects are designed to continuously test and validate theories and improve on each other's results, whereas products are built to solve valuable customer problems. Taking the latest from research and using it in products can help to solve new problems and improve existing performance, but converting or scaling research for use on real-world problems is an involved process. The good news is that it follows many of the engineering principles that exist in software engineering today. Making code modular allows for reuse and improves maintainability, and it has similar benefits in machine learning software development. Another key consideration for startups building machine learning products is cost. Using spot instances in SageMaker can help reduce costs significantly; in deepset's case they saved 70 percent on their training costs. Let's get back to Malte to hear about his learnings, best practices and advice on how to stay up to date in the fast-moving world of machine learning. You've now gone through this process multiple times and also built open source projects. If you were to start from zero today, what would you do differently?

Malte: Yeah, I would definitely pay more attention to good software engineering practices, even in POCs. This is something I learned along the way. And also to model monitoring once you deploy models. We had a couple of situations where it was difficult to find out whether the model was actually failing or not, and some good dashboarding and good monitoring helps a lot there. Then, when it comes to open source, that's something that's really important to us and in our DNA, and I would engage in open source development even earlier. It took me quite a while to become a contributor to other projects and even longer to get our own projects out there. But it's really super rewarding and you learn a lot of things. As a startup, it's super helpful to get early user feedback, to get other contributors on board and also to get some visibility. For us it was very helpful to publish the German BERT model very early on. We got a lot of traction just because of this model, a lot of applications, a lot of talent coming in. That was really key for us. I can only encourage everybody to either engage in existing open source projects or consider open sourcing your own products.

Rob: And if you were to give a single piece of advice to an engineer that wants to build a machine learning product, what advice would you give them?

Malte: If you're working on the NLP side, so working with text, definitely transfer learning. I think there's no way around it these days. It's still worth comparing those models to simpler, more traditional ones as a benchmark. But from my experience, you usually do better with transfer learning and transformers. Secondly, think about your long-term strategy and don't just implement something hacky, but really build it in a way that can last, that you can monitor and that you can maintain.

Rob: Machine learning is a fast-moving world with a lot of new developments. What are the key resources that you use to keep up to date?

Malte: I'm a big fan of research papers. I usually read them on the weekend, sometimes during breakfast, just to keep a bit updated. There are a couple of great newsletters in this area. Sebastian Ruder, for example, has a very good NLP newsletter. There are a lot of great resources for online courses, Andrew Ng and so on. I think conferences are great for keeping updated on modelling and engineering practices, but more importantly for exchanging with fellows, to discuss what they are doing, what went wrong and what their learnings are.

Rob: Thank you, Malte, for sharing how you take machine learning research into production, your lessons learned and best practices. If you're excited about building the next big thing or you want to learn from the engineers that have been there and done that, subscribe to Startup Engineering wherever you get your podcasts. Remember to check out the show notes for useful resources related to this episode. And until next time, keep on building.