52 Weeks of Cloud
Noah Gift
Enterprise MLOps Interview - Simon Stiebellehner
56 minutes Posted Sep 23, 2022 at 6:52 pm.
Noah: Hey, three, two, one, there we go, we're live. All right, so welcome, Simon, to Enterprise MLOps Interviews. The goal of these interviews is to get people exposed to real professionals who are doing work in MLOps. It's such a cutting-edge field that I think a lot of people are very curious about it: what is it? How do you do it? I'm very honored to have Simon here. Do you want to introduce yourself and maybe talk a little bit about your background?

Simon: Sure. Thanks again for inviting me. My name is Simon Stiebellehner. I'm originally from Austria, but currently working in the Netherlands, in Amsterdam, at Transaction Monitoring Netherlands (TMNL). Here I'm the lead MLOps engineer. What are we doing at TMNL? We're a data processing company, owned by the five large banks of the Netherlands, and our purpose is kind of what the name says: we run anti-money-laundering models on pseudonymized business transactions we get from these five banks, to detect unusual patterns on that transaction graph that might indicate money laundering. That's, in a nutshell, what we do. So as you can imagine, we are really focused on
building models, and obviously MLOps is a big component there, because that is really the core of what you do; you want to do it efficiently and effectively as well. In my role as lead MLOps engineer, I'm on the one hand the lead engineer of the MLOps platform team. That's a centralized team that builds out a lot of the infrastructure that's needed to do modeling effectively and efficiently. But I'm also the craft lead for the machine learning engineering craft: in our case, the machine learning engineers, the people working within the model development teams, cross-functional teams actually building these models. That's what I'm currently doing. During the evenings and weekends I'm also a lecturer at the University of Applied Sciences in Vienna, where I teach data mining and data warehousing to master's students, essentially.

Before TMNL, I was at bold.com, which is the largest e-commerce retailer in the Netherlands. I always tend to call it the Amazon of the Netherlands, or of the Benelux, actually; it's still the biggest e-commerce retailer in the Netherlands, even ahead of Amazon. There I was an expert machine learning engineer, doing somewhat comparable work, still a bit more focused on the actual
modeling part; now it's really more on the infrastructure end. And before that, I spent some time in consulting, leading a data science team. That's actually where I come from: originally, the data science end. There I started drifting towards MLOps, because we started building out a deployment and serving platform that, as a consulting company, would make it easier for us to deploy models for our clients, to serve those models, and also to monitor them. That made me drift further and further down the engineering lane, all the way to MLOps.

Noah: Great, that's a great background. I'm curious about the data science to MLOps journey; I think that would be a great discussion to dig into a little bit. My background is originally more on the software engineering side. When I was in the Bay Area, I was an individual contributor, and then ran companies at one point and ran multiple teams. And then, as the data science field exploded, I hired multiple data science teams and worked with them. But what was interesting is that I found the original approach to data science, from my perspective, was lacking in that there weren't really deliverables. And I think when you look
at a software engineering team, it's very clear there are deliverables: you have a mobile app and it has to get better each week, right? Whereas, what are you doing? So I would love to hear your story about how you went from doing more pure data science to, it sounds like, MLOps now.

Simon: Yeah. So back then in consulting, at least in Austria, data science and everything around it was still kind of in its infancy; this was 2016 and so on. It was still really, really new to many organizations, at least in Austria; the US might have been some years ahead. But back then it was still relatively fresh. So in consulting, what we very often struggled with was that, on the modeling end, problems could be solved, but easy deployment and keeping these models in production at the client side was always a bit more of a challenge. And so naturally I started thinking about and focusing more on the actual bigger problem that I saw, which was not so much building the models; it was really more: how can we streamline things? How can we keep things operating? How can we make that move easier from a prototype, from a PoC, to a productionized model? Also, how can we keep it there and
maintain it there? So personally, I saw that this problem was coming up, and it really fascinated me, so I started jumping on that exciting problem. That's how it went for me. And back then we also recognized it as a potential product in our case, so we started building out that deployment, serving, and monitoring platform. And then, naturally, I fell into that rabbit hole, and I never wanted to get out of it again.

Noah: So for the system that you built initially, what was your stack? What were some of the things you were using?

Simon: Yeah. The full backend was written in Java. From a user perspective, the contract that we had, our goal, was to build a drag-and-drop platform for models. So basically the contract was: you package your model as an MLflow model, and then you drag and drop it into a web UI. It's going to be wrapped in containers, it's going to be deployed, and there will be a monitoring layer in front of it, based on whatever dataset you trained it on. You would automatically calculate different
metrics, different distributional metrics around the variables that you are using. And we were layering this approach so that eventually, for every incoming request, you would have a nice dashboard where you could monitor all that stuff. So stack-wise it was actually MLflow, specifically MLflow Models, a lot. Then it was Java in the backend, and Python; there was a lot of Python, especially a PySpark component as well. It's been quite a while, but there was also quite some part written in Scala, because one component of this platform was a bit of an AutoML approach (that died over time), and that was based on PySpark and vanilla Spark written in Scala, so we could facilitate the AutoML part. And then later on we added the easy deployment and serving part. So it was a lot of custom-built stuff. Back then, right, there wasn't that much MLOps tooling out there yet, so you needed to build a lot of that custom. It was largely custom built.

Noah: Yeah, MLflow is an interesting concept, because they provide this package structure so that you at least have some idea of what is going to be sent into the model, and like
there's a format for the model. And I think that part of MLflow seems to be a pretty good idea: you're creating a standard where, in the case of scikit-learn or something, you don't necessarily want to just throw a pickled model somewhere and say, okay, let's go.

Simon: Yeah, that was also our thinking back then. We thought a lot about what could become the standard for how you package models. And back then, MLflow was one of the few tools that already existed, and of course there was Databricks behind it. So we made a bet on that and said: all right, let's follow that packaging standard and make it the contract for how you, as a data scientist, would need to package your model up and submit it to the platform.

Noah: Yeah, it's interesting; this reminds me of one of the issues that's happening right now with cloud computing. In the cloud, AWS has dominated for a long time; they have about 40% market share globally, I think. Azure is now gaining, and they have some pretty good traction, and GCP has been down for a bit, maybe in the 10% range or something like that. But what's
interesting is that it seems like none of the cloud providers have necessarily been leading the way on things like packaging models, right? They have their own proprietary systems, which have been developed and are continuing to be developed: Vertex AI in the case of Google, SageMaker in the case of Amazon. But what's interesting is, let's just take SageMaker, for example: there isn't really an industry-wide standard of model packaging that SageMaker uses; they have their own proprietary stuff that builds in, and Vertex AI has its own proprietary stuff. So I think it is interesting to see what's going to happen, because I think your original hypothesis was: let's pick something that looks like it's got some traction and isn't tied directly to a cloud provider, because Databricks can work on anything. It seems like that in particular is one of the more sticky problems right now with MLOps: who's the leader? Who's developing the right kind of standard for tooling? And maybe that leads into you talking a little bit about what you're doing currently. Do you have any thoughts about the current tooling and what
you're doing at your current company, and what's going on with that?

Simon: Absolutely. At my current organization, Transaction Monitoring Netherlands, we are fully on AWS; we're really almost cloud-native AWS. And that also means everything we do on the modeling side really revolves around SageMaker. So for us, specifically as the MLOps team, we are building the platform around SageMaker capabilities. And on that end, at least company-internally, we have a contract for how you must actually deploy models. There is only one way, what we call the golden path: the streamlined, highly automated path that is supported by the platform. This is the only way you can actually deploy models, and in our case that is actually a SageMaker pipeline object. In our company we're doing large-scale batch processing; we're not doing anything real-time at present. We are doing post-transaction monitoring. So that means you need to submit DAGs, essentially, right? This is what we use for training, and this is what we also deploy eventually. And this is our internal contract: in your model repository, you've got to have one place where there is a function with a specific name, and that function must return a SageMaker pipeline
object. So this is our internal contract, actually.

Noah: Yeah, that's interesting. I know many people who are using SageMaker in production, and it does seem like where it has some advantages is that AWS generally does a pretty good job at building solutions. And if you just look at the history of their services, the odds are pretty high that they'll keep getting better, keep improving things. And it seems like what I'm hearing from people, and it sounds like maybe from your organization as well, is that potentially the SDK for SageMaker is really the win, versus some of the UX tools they have and the interfaces for Canvas and Studio. Is that what's happening?

Simon: Yeah. So what we try to do is always think about our users. Who are our users? What capabilities and skills do they have? What freedom should they have, and what abilities should they have, to develop models? In our case, we don't really have use cases for stuff like Canvas, because our users are fairly mature teams that know how to do, on the one hand, the data science stuff, of course, but also the engineering stuff. So in our case, things like Canvas don't really play much of a role, because
obviously, due to the high abstraction layer of more graphical user interfaces, drag-and-drop tooling, you are also limited in what you can do, or what you can do easily. So in our case, it really is the strength of the flexibility that the SageMaker SDK gives you, and in general the SDK around most AWS services. But it also comes with challenges, of course. You give a lot of freedom, but you're also creating a certain ask, certain requirements, for your model development teams. Which is also why we've been working on abstracting further away from the SDK. Our objective is actually that you should not be forced to interact with the raw SDK when you use SageMaker anymore; instead you have a thin layer of abstraction on top of what you are doing. That's something we are moving towards more and more as well. Because, yeah, it gives you flexibility, but flexibility comes at a cost, often at the cost of speed, specifically when it comes to the 90% default stuff that you want to do.

Noah: And one of the things that I have as a complaint against SageMaker is that it only uses virtual machines, and it does seem like a strange strategy in some sense. Like, for example, I guess if you're doing batch only,
it doesn't matter as much. And I think batch is a good strategy, actually: get your batch-based predictions very, very strong, and in that case the virtual machines are a little bit less of a complaint. But in the case of the endpoints with SageMaker, the fact that you have to spin up these really expensive virtual machines and let them run 24/7 to do online prediction: is that something that your organization evaluated and decided not to use? Or what are your thoughts behind that?

Simon: Yeah, in our case, doing real-time or near-real-time inference is currently not really relevant, for the simple reason that, when you think a bit more about the money laundering, or anti-money-laundering, space: every individual bank must do anti-money-laundering, and they have armies of people doing that. But on the other hand, the time it actually takes from one of their AML systems detecting something unusual, which then goes into a review process, until it eventually hits the governmental institution that takes care of the cases that have been validated at least twice as indeed looking very unusual: this takes a while; this can take quite some time. Which is also why it doesn't really matter whether you ship your prediction within a second or whether it takes you a week or two weeks. It doesn't really matter; hence, for us, that problem of thinking about real-time inference has so far not been there. But yeah, indeed, for other use cases, and for private projects too, we've been considering SageMaker endpoints for a while. And it's exactly what you said: the fact that you need to have a very beefy machine running all the time, specifically when you have heavy GPU loads, and you're actually paying for that machine running 24/7 although you have quite fluctuating load. Yeah, that definitely becomes quite a consideration in what you go for.

Noah: Yeah, and I actually have been talking to AWS about that, because one of the issues that I have is that the AWS platform really pushes serverless, and then my question for AWS is: so why aren't you using it? I mean, if you're pushing serverless for everything, why is SageMaker not serverless? And so maybe they're going to do that, I don't know; I don't have any inside information. But it is interesting to hear you had some similar concerns. I know that there are two questions here: one is someone asked what you do for data versioning, and a second one is how do you do event-based MLOps? So maybe, kind of following up.

Simon: Yeah, what do we do for data versioning? On the one hand, we're running a data lakehouse. The data we get from the financial institutions, from the banks, runs through a massive data pipeline, also on AWS; we're using Glue and Step Functions for that, actually. And then eventually it ends up, modeled to some extent, sanitized, and quality-checked, in our data lakehouse. There we're actually using Hudi on top of S3, and this is also what we use for versioning, which we use for time travel and all these things. So that is Hudi on top of S3. Then our model pipelines plug in there and spit out predictions, alerts, what we call alerts eventually. That is something we version based on unique IDs. So with processing IDs we track pretty much everything: every line of code that touched the data is related to a specific row in our data. So we can track back exactly, for every single row in our predictions and in our alerts, what pipeline ran on it, which jobs were in that pipeline, which code exactly was running in each job, and which intermediate results were produced. So we're basically adding lineage information to everything we output along that line, so we can track everything back using
a few tools we've built.

Noah: So the tool you mentioned, I'm not familiar with it. What is it called again? Hudi?

Simon: Hudi.

Noah: Oh, what is it? Maybe you can describe it.

Simon: Yeah, Hudi is essentially quite similar to other tools such as... Databricks, how is it called?

Noah: Delta Lake, maybe?

Simon: Yes, exactly. It's basically equivalent to Delta Lake; it's just that back then, when we looked into what we were going to use, Delta Lake was not open-sourced yet. Databricks open-sourced it a while ago. So we went for Hudi. It is essentially a layer on top of, in our case, S3, that allows you to more easily keep track of the actions you are performing on your data. So it's very similar to Delta Lake, just a solution that was open source already before.

Noah: Yeah, I didn't know anything about that, so now I do; thanks for letting me know. I'll have to look into that. The other interesting stack-related question is: what are your thoughts about... I think there are two areas that are interesting and emerging. Oh, actually
there are multiple; maybe I'll just bring them all up, and we'll do them one by one. So these are some emerging areas that I'm seeing. One is the concept of event-driven architecture versus maybe a static architecture. And obviously you're using Step Functions, so you're a fan of event-driven architecture. Maybe we'll start with that one: what are your thoughts on going more event-driven in your organization?

Simon: Yeah. In our case, essentially everything works event-driven, right? Since we're on AWS, we're using EventBridge, or CloudWatch Events; I think it's called EventBridge everywhere now. This is how we trigger pretty much everything in our stack. This is how we trigger our data pipelines when data comes in. This is how we trigger different Lambdas that parse certain information from your logs and store it in different databases. This is also how, at some point back in the past, we triggered new deployments when new models were approved in the model registry. So basically everything we've been doing is fully event-driven.

Noah: Yeah. So I think this is a key thing you bring up here. I've talked to many people who don't use AWS, who are, you know, alternatively
experts at other technologies.[23:03.000 --> 23:06.000] And one of the things that I've heard some people say is like, oh,[23:06.000 --> 23:13.000] well, AWS isn't as fast as X or Y, like Lambda isn't as fast as X or Y, or,[23:13.000 --> 23:17.000] you know, Kubernetes. But the point you bring up is exactly the[23:17.000 --> 23:24.000] way I think about AWS: the true advantage of the AWS platform is the[23:24.000 --> 23:29.000] tight integration with the services, and you can design event[23:29.000 --> 23:31.000] driven workflows.[23:31.000 --> 23:33.000] Would you say that's...? Absolutely.[23:33.000 --> 23:34.000] Yeah.[23:34.000 --> 23:35.000] Yeah.[23:35.000 --> 23:39.000] I think designing event-driven workflows on AWS is incredibly easy to do.[23:39.000 --> 23:40.000] Yeah.[23:40.000 --> 23:43.000] And it also comes incredibly naturally, and that's extremely powerful.[23:43.000 --> 23:44.000] Right.[23:44.000 --> 23:49.000] And simply by having an easy way to trigger Lambdas event-driven,[23:49.000 --> 23:52.000] you can pretty much do everything and glue[23:52.000 --> 23:54.000] everything together that you want.[23:54.000 --> 23:56.000] I think that gives you tremendous flexibility.[23:56.000 --> 23:57.000] Yeah.[23:57.000 --> 24:00.000] So I think there's two things that come to mind now.[24:00.000 --> 24:07.000] One is that if you are developing an MLOps platform, you[24:07.000 --> 24:09.000] can't ignore Lambda.[24:09.000 --> 24:12.000] Because I've had some people tell me, oh, well, we can do this and[24:12.000 --> 24:13.000] this and this better.[24:13.000 --> 24:17.000] It's like, yeah, but if you're going to be on AWS, you have to understand[24:17.000 --> 24:18.000] why people use Lambda.[24:18.000 --> 24:19.000] It isn't speed.[24:19.000 --> 24:24.000] It's the ease of developing very rich solutions.[24:24.000 --> 24:25.000] Right.[24:25.000 --> 24:26.000] 
Absolutely.[24:26.000 --> 24:28.000] And it's the glue between what you are building, eventually.[24:28.000 --> 24:33.000] And the thoughts in your mind almost turn straight into Lambdas.[24:33.000 --> 24:36.000] You know, like you can be thinking and building code so quickly.[24:36.000 --> 24:37.000] Absolutely.[24:37.000 --> 24:41.000] Everything turns into: which event do I need to listen to, and then I trigger[24:41.000 --> 24:43.000] a Lambda and that Lambda does this and that.[24:43.000 --> 24:44.000] Yeah.[24:44.000 --> 24:48.000] And the other part about Lambda that's pretty awesome is that it[24:48.000 --> 24:52.000] hooks into services that have practically infinite scale.[24:52.000 --> 24:56.000] Like SQS: you can't break SQS.[24:56.000 --> 24:59.000] There's nothing you can do to ever take SQS down.[24:59.000 --> 25:02.000] It handles unlimited requests in and unlimited requests out.[25:02.000 --> 25:04.000] How many systems are like that?[25:04.000 --> 25:05.000] Yeah.[25:05.000 --> 25:06.000] Yeah, absolutely.[25:06.000 --> 25:07.000] Yeah.[25:07.000 --> 25:12.000] So then a follow-up would be that maybe data scientists[25:12.000 --> 25:17.000] should learn Lambda and Step Functions in order to get into[25:17.000 --> 25:18.000] MLOps.[25:18.000 --> 25:21.000] I think that's a yes.[25:21.000 --> 25:25.000] If you want to put a foot into MLOps and you are on AWS,[25:25.000 --> 25:31.000] then I think there is no way around learning these fundamentals.[25:31.000 --> 25:32.000] Right.[25:32.000 --> 25:35.000] There's no way around learning things like: what is a Lambda?[25:35.000 --> 25:39.000] How do I create a Lambda via Terraform, or whatever tool you're[25:39.000 --> 25:40.000] using there?[25:40.000 --> 25:42.000] And how do I hook it up to an event?[25:42.000 --> 25:47.000] And how do I use the AWS SDK to interact with different[25:47.000 --> 25:48.000] 
services?[25:48.000 --> 25:49.000] So, right.[25:49.000 --> 25:53.000] I think if you want to take a step into MLOps coming more from[25:53.000 --> 25:57.000] the data science side, it's extremely important to familiarize yourself[25:57.000 --> 26:01.000] with, at least the fundamentals of, how do you architect[26:01.000 --> 26:03.000] basic solutions on AWS?[26:03.000 --> 26:05.000] How do you glue services together?[26:05.000 --> 26:07.000] How do you make them speak to each other?[26:07.000 --> 26:09.000] So yeah, I think that's quite fundamental.[26:09.000 --> 26:14.000] Ideally, I think that's what the platform should take away from you[26:14.000 --> 26:16.000] as a pure data scientist.[26:16.000 --> 26:19.000] You should not necessarily have to deal with that stuff.[26:19.000 --> 26:23.000] But if you're interested, if you want to make that move more towards MLOps,[26:23.000 --> 26:27.000] I think learning about infrastructure, and specifically in the context of AWS[26:27.000 --> 26:31.000] about the services and how to use them, is really fundamental.[26:31.000 --> 26:32.000] Yeah, it's good.[26:32.000 --> 26:33.000] Because this is automation, eventually.[26:33.000 --> 26:37.000] And if you want to automate your complex processes,[26:37.000 --> 26:39.000] then you need to learn that stuff.[26:39.000 --> 26:41.000] How else are you going to do it?[26:41.000 --> 26:42.000] Yeah, I agree.[26:42.000 --> 26:46.000] I mean, that's really what Lambda and Step Functions are: they're[26:46.000 --> 26:47.000] automation tools.[26:47.000 --> 26:49.000] So that's probably the better way to describe it.[26:49.000 --> 26:52.000] That's a very good point you bring up.[26:52.000 --> 26:57.000] Another technology that I think is an emerging technology is the[26:57.000 --> 27:05.000] managed file system.[26:58.000 --> 27:05.000] And the reason why I think it's interesting is that, so 20-plus 
years[27:05.000 --> 27:11.000] ago, I was using file systems in the university setting when I was at[27:11.000 --> 27:14.000] Caltech, and then also in the film industry.[27:14.000 --> 27:22.000] So film has been using managed file servers with parallel processing[27:22.000 --> 27:24.000] farms for a long time.[27:24.000 --> 27:27.000] I don't know how many people know this, but in the film industry,[27:27.000 --> 27:32.000] the architecture, even from like 2000, was: there's a very[27:32.000 --> 27:38.000] expensive file server, and then there's, let's say, 40,000 machines or 40,000[27:38.000 --> 27:39.000] cores.[27:39.000 --> 27:40.000] And that's it.[27:40.000 --> 27:41.000] That's the architecture.[27:41.000 --> 27:46.000] And now what's interesting is I see with data science and machine learning[27:46.000 --> 27:52.000] operations that that could potentially happen in the future:[27:52.000 --> 27:57.000] actually a managed NFS mount point, with maybe Kubernetes or something like[27:57.000 --> 27:58.000] that.[27:58.000 --> 28:01.000] Do you see any of that on the horizon?[28:01.000 --> 28:04.000] Oh, that's a good question.[28:04.000 --> 28:08.000] I think for what we're currently doing, that's probably a[28:08.000 --> 28:10.000] bit further away.[28:10.000 --> 28:15.000] But in principle, I could very well imagine that. In our use case, not[28:15.000 --> 28:17.000] quite.[28:17.000 --> 28:20.000] But in principle, definitely.[28:20.000 --> 28:26.000] And then maybe a third emerging thing I'm seeing is what's going[28:26.000 --> 28:29.000] on with OpenAI and Hugging Face.[28:29.000 --> 28:34.000] And that has the potential, maybe, to change the game a little bit,[28:34.000 --> 28:38.000] especially with Hugging Face, I think, although both of them. I mean,[28:38.000 --> 28:43.000] in the case of pre-trained models, here's a[28:43.000 --> 28:48.000] perfect example: an 
organization may have, you know, maybe they're[28:48.000 --> 28:53.000] using AWS even for this, they're transcribing videos and they're going[28:53.000 --> 28:56.000] to do something with them. Maybe they're going to detect, I don't know,[28:56.000 --> 29:02.000] like, you know, if you recorded customers in your... I'm just brainstorming,[29:02.000 --> 29:05.000] I'm not saying your company did this, but I'm just creating a hypothetical[29:05.000 --> 29:09.000] situation: they recorded, you know, customers talking, and then they[29:09.000 --> 29:12.000] transcribe it to text and then run some kind of, you know,[29:12.000 --> 29:15.000] criminal detection feature or something like that.[29:15.000 --> 29:19.000] Like, they could build their own models, or they could download the thing[29:19.000 --> 29:23.000] that was released two days ago or a day ago from OpenAI that transcribes[29:23.000 --> 29:29.000] things, you know, and then feed that transcribed text into[29:29.000 --> 29:34.000] some other Hugging Face model that summarizes it, and then you could[29:34.000 --> 29:38.000] feed that into a system. So what are your[29:38.000 --> 29:42.000] thoughts around some of these pre-trained models, and are you[29:42.000 --> 29:48.000] thinking, in terms of your stack, of trying to look into doing fine-tuning?[29:48.000 --> 29:53.000] Yeah, so I think pre-trained models, and especially the way that Hugging Face,[29:53.000 --> 29:57.000] I think, really revolutionized the space in terms of really kind of[29:57.000 --> 30:02.000] platformizing the entire business, or the entire market, around[30:02.000 --> 30:07.000] pre-trained models. 
I think that is really quite incredible, and I think[30:07.000 --> 30:10.000] really, for the ecosystem, a change in how things are done.[30:10.000 --> 30:16.000] And I believe that, looking at the costs of training large models,[30:16.000 --> 30:19.000] and looking at the fact that many organizations are not able to do it[30:19.000 --> 30:23.000] because of massive costs or because of lack of data,[30:23.000 --> 30:29.000] this makes it very clear how important[30:29.000 --> 30:33.000] such platforms are, how important sharing of pre-trained models actually is.[30:33.000 --> 30:37.000] I believe we are really only at the beginning of that.[30:37.000 --> 30:42.000] And I think we're going to see... nowadays you see it mostly when it[30:42.000 --> 30:47.000] comes to fairly generalized data formats: images, potentially videos, text,[30:47.000 --> 30:52.000] speech, these things. But I believe that we're going to see more marketplace[30:52.000 --> 30:57.000] approaches when it comes to pre-trained models in a lot more industries[30:57.000 --> 31:01.000] and in a lot more use cases where data is to some degree[31:01.000 --> 31:05.000] standardized. Also when you think about banking,[31:05.000 --> 31:10.000] for example, right? When you think about transactions: to some extent,[31:10.000 --> 31:14.000] transaction data always looks the same, kind of, at least at[31:14.000 --> 31:17.000] every bank. Of course you might need to do some mapping here and there,[31:17.000 --> 31:22.000] but there is a lot of power in it. Because, simply, also thinking[31:22.000 --> 31:28.000] about sharing data: that is always a difficult thing, especially in Europe.[31:28.000 --> 31:32.000] Sharing data between organizations is incredibly difficult legally.[31:32.000 --> 31:36.000] It's difficult. 
Sharing models is a different thing, right?[31:36.000 --> 31:40.000] Basically, similar to the concept of federated learning: sharing models[31:40.000 --> 31:44.000] is significantly easier legally than actually sharing data,[31:44.000 --> 31:48.000] and then applying these models, fine-tuning them, and so on.[31:48.000 --> 31:52.000] Yeah, I mean, I could just imagine. I really don't know much about[31:52.000 --> 31:56.000] banking transactions, but I would imagine there could be several[31:56.000 --> 32:01.000] kinds of transactions that are very normal. And then there's some[32:01.000 --> 32:06.000] transactions like, if every single second[32:06.000 --> 32:11.000] you're transferring a lot of money, and it happens just[32:11.000 --> 32:14.000] very quickly, it's like, wait, why are you doing this? Why are you transferring money[32:14.000 --> 32:20.000] constantly? What's going on? Or a huge sum of money only[32:20.000 --> 32:24.000] involves three different points in the network, over and over again,[32:24.000 --> 32:29.000] just these three points, constantly... And so once you've developed[32:29.000 --> 32:33.000] a model that does anomaly detection, then[32:33.000 --> 32:37.000] yeah, why would you need to develop another one? I mean, somebody already did it.[32:37.000 --> 32:41.000] Exactly. Yes, absolutely, absolutely. And that's[32:41.000 --> 32:45.000] definitely... That's encoded knowledge, encoded information in terms of the model,[32:45.000 --> 32:49.000] which is not personally... Well, it abstracts away from[32:49.000 --> 32:53.000] personally identifiable data. And that's really the power. That is something[32:53.000 --> 32:57.000] that, yeah, as I've said before, you can share significantly easier and you can[32:57.000 --> 33:03.000] apply to your use cases. 
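The toy pattern sketched above, large sums cycling repeatedly between the same few accounts, can be illustrated with a simple rule-based check. The thresholds, field layout, and account names are invented for illustration; real AML models are far more sophisticated than this.

```python
from collections import Counter

def flag_circular_flow(transactions, min_repeats=3, min_amount=10_000):
    """Flag account pairs that repeatedly move large sums between each other.

    `transactions` is a list of (sender, receiver, amount) tuples.
    Returns the set of (sender, receiver) pairs seen at least `min_repeats`
    times with amounts of at least `min_amount`.
    """
    counts = Counter(
        (s, r) for s, r, amount in transactions if amount >= min_amount
    )
    return {pair for pair, n in counts.items() if n >= min_repeats}


# Money loops A -> B -> C -> A three times; one small everyday payment.
txs = [
    ("A", "B", 50_000), ("B", "C", 49_000), ("C", "A", 48_000),
    ("A", "B", 50_000), ("B", "C", 49_000), ("C", "A", 48_000),
    ("A", "B", 50_000), ("B", "C", 49_000), ("C", "A", 48_000),
    ("D", "E", 120),  # ignored: below the amount threshold
]
print(flag_circular_flow(txs))
```

The point of the sketch is the one made in the conversation: the rule encodes knowledge about suspicious structure, not any personally identifiable data, which is why a model capturing such patterns is much easier to share than the transactions themselves.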
Kind of related to this, in[33:03.000 --> 33:09.000] terms of upcoming technologies, is, I think, dealing more with graphs.[33:09.000 --> 33:13.000] And so is that something, stack-wise, that your[33:13.000 --> 33:19.000] company has investigated or is resourced to do? Yeah, so when you think about[33:19.000 --> 33:23.000] transactions, bank transactions, right? And bank customers.[33:23.000 --> 33:27.000] So in our case, again, we only have pseudonymized[33:27.000 --> 33:31.000] transaction data, so actually we cannot see anything, right? We cannot see names, we cannot see[33:31.000 --> 33:35.000] IBANs or whatever. We really can't see much. But[33:35.000 --> 33:39.000] you can look at transactions moving between[33:39.000 --> 33:43.000] different entities, between different accounts. You can look at that[33:43.000 --> 33:47.000] as a network, as a graph. And that's also what we very frequently do.[33:47.000 --> 33:51.000] You have your nodes in your network; these are your accounts,[33:51.000 --> 33:55.000] or even the persons. And the actual edges between them,[33:55.000 --> 33:59.000] that's what your transactions are. So you have this[33:59.000 --> 34:03.000] massive graph, actually, that we as TMNL, as Transaction Monitoring Netherlands,[34:03.000 --> 34:07.000] are sitting on. We're actually sitting on a massive transaction graph.[34:07.000 --> 34:11.000] So yeah, absolutely. For us, doing analysis on top of[34:11.000 --> 34:15.000] that graph, building models on top of that graph, is a quite important[34:15.000 --> 34:19.000] thing. And, like, I taught a class[34:19.000 --> 34:23.000] a few years ago at Berkeley where we had to[34:23.000 --> 34:27.000] cover graph databases a little bit. And I[34:27.000 --> 34:31.000] really didn't know that much about graph databases, although I did use one actually[34:31.000 --> 34:35.000] at one company I was at. 
But one of the things I learned in teaching that[34:35.000 --> 34:39.000] class was about the descriptive statistics[34:39.000 --> 34:43.000] of a graph network. And it[34:43.000 --> 34:47.000] is actually pretty interesting, because I think most of the time everyone talks about[34:47.000 --> 34:51.000] median and max, min, and standard deviation and everything.[34:51.000 --> 34:55.000] But then with a graph, there's things like centrality,[34:55.000 --> 34:59.000] and I forget all the terms off the top of my head, but you can see[34:59.000 --> 35:03.000] if there's a node in the network that[35:03.000 --> 35:07.000] everybody's interacting with. Absolutely. You can identify communities[35:07.000 --> 35:11.000] of people moving around a lot of money all the time, for example.[35:11.000 --> 35:15.000] You can derive different metric features, eventually,[35:15.000 --> 35:19.000] by doing computations on your graph and then plugging in some model.[35:19.000 --> 35:23.000] Often it's feature engineering. You're computing the centrality scores[35:23.000 --> 35:27.000] across your graph for your different entities. And then[35:27.000 --> 35:31.000] you're building your features, actually. And then you're plugging in some[35:31.000 --> 35:35.000] model in the end. If you do classic machine learning, so to say;[35:35.000 --> 35:39.000] if you do graph deep learning, of course, that's a bit different.[35:39.000 --> 35:43.000] So basically, for people that are analyzing[35:43.000 --> 35:47.000] essentially networks of people,[35:47.000 --> 35:51.000] a graph database would be step one:[35:51.000 --> 35:55.000] generate the features, which could be centrality.[35:55.000 --> 35:59.000] There's a score, and then you go and train[35:59.000 --> 36:03.000] the model based on that descriptive statistic.[36:03.000 --> 36:07.000] Exactly. 
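The two-step recipe described here (compute a centrality score per node, then use it as a model feature) can be illustrated in plain Python with degree centrality, the simplest such score. In practice you would use GraphFrames/GraphX or a graph library rather than this hand-rolled version; the account names are invented.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality per node: degree / (n - 1) for an undirected graph.

    `edges` is an iterable of (u, v) pairs.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Account "hub" transacts with every other account: a high-centrality node.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
scores = degree_centrality(edges)
print(scores)  # "hub" scores 1.0; these scores then become model features
```

The resulting per-account scores are exactly the kind of engineered feature Simon describes plugging into a classic downstream model.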
So one way you could think about it: whether[36:07.000 --> 36:11.000] you need a graph database or not always depends on your specific use case[36:11.000 --> 36:15.000] and what database. We're actually also running[36:15.000 --> 36:19.000] that using Spark. You have GraphFrames, you have[36:19.000 --> 36:23.000] GraphX, actually. So really, stuff in Spark built for[36:23.000 --> 36:27.000] doing analysis on graphs.[36:27.000 --> 36:31.000] And then what you usually do is exactly what you said. You are trying[36:31.000 --> 36:35.000] to build features based on that graph,[36:35.000 --> 36:39.000] based on the attributes of the nodes and the attributes on the edges and so on.[36:39.000 --> 36:43.000] And so I guess, in terms of graph databases right[36:43.000 --> 36:47.000] now, it sounds like maybe the three[36:47.000 --> 36:51.000] main players are: there's Neo4j, which[36:51.000 --> 36:55.000] has been around for a long time. There's, I guess, Spark.[36:55.000 --> 36:59.000] And then there's also, I forgot what the one is called for AWS,[36:59.000 --> 37:03.000] is it... Neptune, that's Neptune.[37:03.000 --> 37:07.000] Have you played with all three of those, and did you[37:07.000 --> 37:11.000] like Neptune? Neptune was something we tried; Spark, of course, we're actually currently[37:11.000 --> 37:15.000] using for exactly that. Also because it allows us to[37:15.000 --> 37:19.000] keep our stack fairly homogeneous. 
We did[37:19.000 --> 37:23.000] also a PoC on Neptune a while ago already,[37:23.000 --> 37:27.000] and, well, with Neptune you essentially have two ways[37:27.000 --> 37:31.000] to query it: either using Gremlin or SPARQL.[37:31.000 --> 37:35.000] So that means the people, your data scientists,[37:35.000 --> 37:39.000] need to get familiar with that, which is already one bit of a hurdle,[37:39.000 --> 37:43.000] because usually data scientists are not familiar with either.[37:43.000 --> 37:47.000] But also what we found with Neptune[37:47.000 --> 37:51.000] is that it's not necessarily built[37:51.000 --> 37:55.000] as an analytics graph database. It's not necessarily made for[37:55.000 --> 37:59.000] that. And then, at least[37:59.000 --> 38:03.000] for us, it has sometimes become quite complicated to handle different performance considerations[38:03.000 --> 38:07.000] when you actually do fairly complex queries across that graph.[38:07.000 --> 38:11.000] Yeah, so you're bringing up a point which[38:11.000 --> 38:15.000] happens a lot in my experience with[38:15.000 --> 38:19.000] technology, which is that sometimes[38:19.000 --> 38:23.000] the purity of the solution becomes the problem,[38:23.000 --> 38:27.000] where even though Spark isn't necessarily[38:27.000 --> 38:31.000] designed to be a graph database system, the fact is[38:31.000 --> 38:35.000] people in your company are already using it. 
So[38:35.000 --> 38:39.000] if you just turn on that feature, now you can use it, and it's not like[38:39.000 --> 38:43.000] this huge technical undertaking and retraining effort.[38:43.000 --> 38:47.000] So even if it's not as good, if it works, then that's probably[38:47.000 --> 38:51.000] the solution your company will use. Versus, I agree with you, a lot of times...[38:51.000 --> 38:55.000] a solution like Neo4j is a pretty good example:[38:55.000 --> 38:59.000] it's an interesting product, but[38:59.000 --> 39:03.000] you already have all these other products. Like, do you really want to introduce yet[39:03.000 --> 39:07.000] another product into your stack? Yeah, because eventually[39:07.000 --> 39:11.000] it all comes with an overhead, of course, introducing it. That is one thing:[39:11.000 --> 39:15.000] it requires someone to maintain it, even if it's a[39:15.000 --> 39:19.000] managed service. Somebody needs to actually own it and look after it,[39:19.000 --> 39:23.000] and then, as you said, you need to retrain people to also use it effectively.[39:23.000 --> 39:27.000] So it comes at significant cost, and that is really[39:27.000 --> 39:31.000] something that I believe should be quite critically[39:31.000 --> 39:35.000] assessed. What is really the gain you have? How far can you go with[39:35.000 --> 39:39.000] your current tooling? And then eventually make[39:39.000 --> 39:43.000] that decision. 
At least personally, I'm really[39:43.000 --> 39:47.000] not a fan of thinking tooling-first.[39:47.000 --> 39:51.000] I really believe in looking at your organization, looking at the people,[39:51.000 --> 39:55.000] what skills are there, looking at how effectively[39:55.000 --> 39:59.000] these people are actually performing certain activities and processes,[39:59.000 --> 40:03.000] and then carefully thinking about what really makes sense.[40:03.000 --> 40:07.000] Because introducing tooling is one thing, but people need to[40:07.000 --> 40:11.000] adopt and use the tooling, and eventually it should really speed them up and improve[40:11.000 --> 40:15.000] how they develop. Yeah, I think[40:15.000 --> 40:19.000] that's great advice, and it's hard to appreciate how good that advice is,[40:19.000 --> 40:23.000] because it takes the experience of getting burned[40:23.000 --> 40:27.000] introducing new technology. I've[40:27.000 --> 40:31.000] had experiences before where[40:31.000 --> 40:35.000] one of the mistakes I've made was putting too many different technologies in an organization,[40:35.000 --> 40:39.000] and the problem is, once you get enough complexity,[40:39.000 --> 40:43.000] it can really explode. And then[40:43.000 --> 40:47.000] this is the part that really gets scary:[40:47.000 --> 40:51.000] let's take Spark, for example. How hard is it to hire somebody that knows Spark? Pretty easy.[40:51.000 --> 40:55.000] How hard is it going to be to hire somebody that knows[40:55.000 --> 40:59.000] Spark, and then hire another person that knows the Gremlin query[40:59.000 --> 41:03.000] language for Neptune, then hire another person that knows Kubernetes,[41:03.000 --> 41:07.000] then hire another... After a while, if you have so many different kinds of tools,[41:07.000 --> 41:11.000] you have to hire so many different kinds of people that all[41:11.000 --> 41:15.000] productivity comes to a stop. 
So it's the hiring as well.[41:15.000 --> 41:19.000] Absolutely. I mean, it's virtually impossible[41:19.000 --> 41:23.000] to find someone who is really well versed with Gremlin, for example.[41:23.000 --> 41:27.000] It's incredibly hard. And I think tech hiring is hard[41:27.000 --> 41:31.000] by itself already,[41:31.000 --> 41:35.000] so you really need to think about: what can I hire for as well?[41:35.000 --> 41:39.000] What expertise can I realistically build up?[41:39.000 --> 41:43.000] So that's why I think with AWS,[41:43.000 --> 41:47.000] even with some of the limitations of the ML platform,[41:47.000 --> 41:51.000] the advantage of using AWS is that[41:51.000 --> 41:55.000] you have a huge audience of people to hire from. And then the same thing with[41:55.000 --> 41:59.000] Spark: there's a lot of things I don't like about Spark, but a lot of people[41:59.000 --> 42:03.000] use Spark. And so if you use AWS and you use Spark,[42:03.000 --> 42:07.000] let's say those two, which you are, then you're going to have a much easier time[42:07.000 --> 42:11.000] hiring people, you're going to have a much easier time training people,[42:11.000 --> 42:15.000] there's tons of documentation about it. So I think[42:15.000 --> 42:19.000] you're very wise that you're thinking that way, but a lot of people don't think about that.[42:19.000 --> 42:23.000] They're like, oh, I've got to use the latest, greatest stuff and this and this and this,[42:23.000 --> 42:27.000] and then their company starts to get into trouble because they can't hire[42:27.000 --> 42:31.000] people, they can't maintain systems, and then productivity starts[42:31.000 --> 42:35.000] to decrease. Also, something[42:35.000 --> 42:39.000] not to ignore is the cognitive load you put on a team[42:39.000 --> 42:43.000] that needs to manage a broad range of very different[42:43.000 --> 42:47.000] tools or services. 
It also puts incredible[42:47.000 --> 42:51.000] cognitive load on that team, and you suddenly also need an incredible breadth[42:51.000 --> 42:55.000] of expertise in that team, and that means you're also going[42:55.000 --> 42:59.000] to create single points of failure if you don't really[42:59.000 --> 43:03.000] scale up your team.[43:03.000 --> 43:07.000] It's something to really... I think when you go for[43:07.000 --> 43:11.000] new tooling, you should really look at it from a holistic perspective,[43:11.000 --> 43:15.000] not only "this is the latest and greatest."[43:15.000 --> 43:19.000] In terms of Europe versus[43:19.000 --> 43:23.000] the US, have you spent much time in the US at all?[43:23.000 --> 43:27.000] Not at all, actually. Flying to the US Monday, but no, not at all.[43:27.000 --> 43:31.000] That also would be kind of an interesting[43:31.000 --> 43:35.000] comparison, in that the culture of the United States[43:35.000 --> 43:39.000] is really this culture of,[43:39.000 --> 43:43.000] I would say, more like survival of the fittest, where you work[43:43.000 --> 43:47.000] seven days a week and you're constantly... like, you don't go on vacation[43:47.000 --> 43:51.000] and you're proud of it. And I think it's not[43:51.000 --> 43:55.000] a good culture. I'm not saying that's a good thing; I think it's a bad[43:55.000 --> 43:59.000] thing. And a lot of times the critique people have[43:59.000 --> 44:03.000] about Europe is like, oh, people take vacation all the time and all this,[44:03.000 --> 44:07.000] and as someone who has spent time in both, I would say:[44:07.000 --> 44:11.000] yes, that's a better approach. 
A better approach is that people[44:11.000 --> 44:15.000] should feel relaxed, because[44:15.000 --> 44:19.000] especially with the kind of work you do in MLOps,[44:19.000 --> 44:23.000] you need people to feel comfortable and happy.[44:23.000 --> 44:27.000] And more the question[44:27.000 --> 44:31.000] I was going for is:[44:31.000 --> 44:35.000] I wonder if there is a more productive culture[44:35.000 --> 44:39.000] for MLOps in Europe[44:39.000 --> 44:43.000] versus the US in terms of maintaining[44:43.000 --> 44:47.000] systems and building software, where the US,[44:47.000 --> 44:51.000] what it's really been good at, I guess, is kind of coming up with new[44:51.000 --> 44:55.000] ideas, and there's lots of new services that get generated, but[44:55.000 --> 44:59.000] the quality and longevity[44:59.000 --> 45:03.000] is not necessarily the same. Where I could see,[45:03.000 --> 45:07.000] in the stuff we just talked about, which is, if you're trying to build a team[45:07.000 --> 45:11.000] where there's low turnover[45:11.000 --> 45:15.000] and you have very high quality output,[45:15.000 --> 45:19.000] it seems like maybe organizations[45:19.000 --> 45:23.000] could learn from the European approach to building[45:23.000 --> 45:27.000] and maintaining systems for MLOps.[45:27.000 --> 45:31.000] I think there's definitely some truth in it, especially when you look at the median[45:31.000 --> 45:35.000] tenure of a tech person in an organization.[45:35.000 --> 45:39.000] I think that is actually still significantly lower in the US.[45:39.000 --> 45:43.000] I'm not sure, I think in the Bay Area it's somewhere around one year or two, something like that,[45:43.000 --> 45:47.000] compared to Europe. I believe...[45:47.000 --> 45:51.000] still fairly low. 
Here, of course, in tech, people also like to switch companies more often,[45:51.000 --> 45:55.000] but I would say the average is still more around[45:55.000 --> 45:59.000] two years, something around that, staying with the same company,[45:59.000 --> 46:03.000] also in tech, which I think is a bit longer[46:03.000 --> 46:07.000] than you would typically have in the US.[46:07.000 --> 46:11.000] I think from my perspective, where I've also built up most of the[46:11.000 --> 46:15.000] current team, I think it's[46:15.000 --> 46:19.000] super important to hire good people,[46:19.000 --> 46:23.000] and people that fit the team, fit the company culture-wise,[46:23.000 --> 46:27.000] but also[46:27.000 --> 46:31.000] let them not be in a sprint all the time.[46:31.000 --> 46:35.000] It's about having a sustainable way of working, in my opinion,[46:35.000 --> 46:39.000] and that sustainable way means you should definitely take your vacation.[46:39.000 --> 46:43.000] And I think usually in Europe we have quite generous,[46:43.000 --> 46:47.000] even by law, vacation. I mean, in the Netherlands, by law you get 20 days a year,[46:47.000 --> 46:51.000] but most companies give you 25, many IT companies[46:51.000 --> 46:55.000] 30 per year, so that's quite nice.[46:55.000 --> 46:59.000] And people do take that. So culture-wise, it's really... everyone[46:59.000 --> 47:03.000] likes to take vacations, whether that's C-level or whether that's an engineer on a team,[47:03.000 --> 47:07.000] and in many companies that's also really encouraged,[47:07.000 --> 47:11.000] to have a healthy work-life balance.[47:11.000 --> 47:15.000] And of course it's not only about vacations, but also growth opportunities:[47:15.000 --> 47:19.000] letting people explore, develop themselves,[47:19.000 --> 47:23.000] and not always pushing for max performance.[47:23.000 --> 47:27.000] So really, at least, I always see it like a partnership:[47:27.000 --> 47:31.000] the organization wants to get something from an[47:31.000 --> 
47:35.000] employee, but the employee should also be encouraged and developed[47:35.000 --> 47:39.000] in that organization. And I think that is something that in many parts of[47:39.000 --> 47:43.000] Europe there is big awareness of.[47:43.000 --> 47:47.000] So my hypothesis is that[47:47.000 --> 47:51.000] it's possible that Europe becomes[47:51.000 --> 47:55.000] the new hub of technology.[47:55.000 --> 47:59.000] And I'll tell you why. Here's my hypothesis: the reason why is that,[47:59.000 --> 48:03.000] in terms of machine learning operations,[48:03.000 --> 48:07.000] I've already talked to multiple people who know the[48:07.000 --> 48:11.000] data around it, like big companies, and they've told me that[48:11.000 --> 48:15.000] it's going to be close to impossible to hire people soon,[48:15.000 --> 48:19.000] because essentially there's too many job openings[48:19.000 --> 48:23.000] and there's not enough people that know machine learning, machine learning operations, cloud computing.[48:23.000 --> 48:27.000] And so the American culture, unfortunately, I think,[48:27.000 --> 48:31.000] is so cutthroat that they don't encourage[48:31.000 --> 48:35.000] people to be loyal to their company.[48:35.000 --> 48:39.000] And in addition to that, because there is no universal healthcare system[48:39.000 --> 48:43.000] in the US,[48:43.000 --> 48:47.000] it's kind of a prisoner's dilemma, where nobody[48:47.000 --> 48:51.000] trusts each other, and so they're constantly optimizing.[48:51.000 --> 48:55.000] But in the case of machine learning, it's a different[48:55.000 --> 48:59.000] industry, where you do really need to have[48:59.000 --> 49:03.000] some longevity for employees, because the systems are very complex[49:03.000 --> 49:07.000] systems to develop. And so the culture of Europe,[49:07.000 --> 49:11.000] which is much more friendly to the worker, I think[49:11.000 --> 49:15.000] could lead to Europe having[49:15.000 --> 49:19.000] a better outcome for machine learning 
operations[49:19.000 --> 49:23.000] so that's one part of it and then the second part of it is the other thing the US has[49:23.000 --> 49:27.000] has done that I think Europe[49:27.000 --> 49:31.000] has done that if I compare Europe versus the US in terms of[49:31.000 --> 49:35.000] data privacy that I think the US has dropped the ball[49:35.000 --> 49:39.000] and they haven't done a good job at it but Europe has actually[49:39.000 --> 49:43.000] done much much better at holding tech companies accountable[49:43.000 --> 49:47.000] and I think if you asked[49:47.000 --> 49:51.000] well informed people if they would like some of the[49:51.000 --> 49:55.000] practices of the United States tech companies to change I think most[49:55.000 --> 49:59.000] well informed people would say we don't want you to recommend[49:59.000 --> 50:03.000] bad data like extremist video content[50:03.000 --> 50:07.000] I mean there's people that are extremists that love it[50:07.000 --> 50:11.000] or we don't want you to sell our personal information without our consent[50:11.000 --> 50:15.000] so it could also lead to a better[50:15.000 --> 50:19.000] outcome for the people[50:19.000 --> 50:23.000] that are using machine learning and AI in Europe[50:23.000 --> 50:27.000] so I actually suspect and this is my hypothesis[50:27.000 --> 50:31.000] who knows if I'm true or not is that I think Europe could be[50:31.000 --> 50:35.000] the leader from let's say 2022 to[50:35.000 --> 50:39.000] 2040 in AI and ML because of[50:39.000 --> 50:43.000] the culture but I don't know that's just one hypothesis I have[50:43.000 --> 50:47.000] yeah I think around the what you mentioned before[50:47.000 --> 50:51.000] around the fact that perhaps Turnover is in tech companies here in Europe[50:51.000 --> 50:55.000] is less I think that definitely helps you build systems that survive the test of time as well[50:55.000 --> 50:59.000] right I mean everyone had the case when a key engineer[50:59.000 --> 
51:03.000] off boards from a team leaves the company and then you need to[51:03.000 --> 51:07.000] hire another person right it's long times of not being super productive[51:07.000 --> 51:11.000] long time not being super effective so you continuously[51:11.000 --> 51:15.000] lose track that you need[51:15.000 --> 51:19.000] so I think you could be right there that in the[51:19.000 --> 51:23.000] longer run when systems really need to be matured and developed over[51:23.000 --> 51:27.000] longer time Europe might have an edge there[51:27.000 --> 51:31.000] might be a bit better suited to do that[51:31.000 --> 51:37.000] the salaries are still higher in the US and also I think many US companies are starting to enter more[51:37.000 --> 51:41.000] from a people perspective even remote work and everything they're starting to also[51:41.000 --> 51:45.000] poach more and more engineers from Europe because[51:45.000 --> 51:49.000] of course vacation and everything and having a healthy work life balance[51:49.000 --> 51:53.000] is one thing but for many people if you[51:53.000 --> 51:57.000] give you a 50% higher paycheck that's also a strong argument[51:57.000 --> 52:01.000] so it's difficult actually to also for Europe to[52:01.000 --> 52:05.000] keep the engineers here that as well[52:05.000 --> 52:09.000] no I will say this though if you work remote from[52:09.000 --> 52:13.000] Europe that's a very different scenario than living[52:13.000 --> 52:17.000] in the US because you'll see when[52:17.000 --> 52:21.000] unfortunately the United States since about 1980[52:21.000 --> 52:25.000] has declined and[52:25.000 --> 52:29.000] the data around the US is pretty dire[52:29.000 --> 52:33.000] actually the life expectancy is one of the[52:33.000 --> 52:37.000] lowest in the world for a G20 country[52:37.000 --> 52:41.000] so then if you walk through the major[52:41.000 --> 52:45.000] cities of the US there's just poverty[52:45.000 --> 52:49.000] everywhere like people are 
living in very low[52:49.000 --> 52:53.000] quality conditions where every time I go to Europe[52:53.000 --> 52:57.000] I go to Munich, I go to London, I go to wherever[52:57.000 --> 53:01.000] that basically the cities are beautiful[53:01.000 --> 53:05.000] and well maintained so I think if the cases that if a US company[53:05.000 --> 53:09.000] let a European live in Europe and work[53:09.000 --> 53:13.000] remote yeah that could work out because the European[53:13.000 --> 53:17.000] citizen has an EU citizen has amazing[53:17.000 --> 53:21.000] healthcare they have the[53:21.000 --> 53:25.000] safety net their cities aren't basically[53:25.000 --> 53:29.000] highly unequal but I think it's the[53:29.000 --> 53:33.000] location of the US in its current form[53:33.000 --> 53:37.000] I personally wouldn't recommend[53:37.000 --> 53:41.000] someone from Europe moving to the US because[53:41.000 --> 53:45.000] unfortunately I think it's a[53:45.000 --> 53:49.000] great place to live just to be totally honest[53:49.000 --> 53:53.000] if you're already in Europe and on the flip side I think that[53:53.000 --> 53:57.000] there's a lot of Americans actually who are very interested in[53:57.000 --> 54:01.000] universal healthcare in particular is not even[54:01.000 --> 54:05.000] possible in the US because of the politics in the US[54:05.000 --> 54:09.000] and a lot of medical bankruptcies occur[54:09.000 --> 54:13.000] and so from a start up perspective as well[54:13.000 --> 54:17.000] this is something that people don't talk about in America it's like yeah we're all about[54:17.000 --> 54:21.000] startups well think about how many more people would be able to[54:21.000 --> 54:25.000] create a company if you didn't have to worry about going bankrupt[54:25.000 --> 54:29.000] if you broke your arm or you have some kind of[54:29.000 --> 54:33.000] sickness or whatever so[54:33.000 --> 54:37.000] I think it's an interesting trade off[54:37.000 --> 54:41.000] situation and I 
would say that the sweet spot might be[54:41.000 --> 54:45.000] you work for an American company and get the higher salary but you still live in Europe[54:45.000 --> 54:49.000] that would be the dream scenario I think that's why many people are actually doing it[54:49.000 --> 54:53.000] I think especially since covid started you can really see it[54:53.000 --> 54:57.000] before that it wasn't really a thing working for a US company[54:57.000 --> 55:01.000] who really sits in the US and you're full remote but I think now since 2, 2 and a half years[55:01.000 --> 55:05.000] it's really becoming reality actually[55:05.000 --> 55:09.000] interesting yeah well[55:09.000 --> 55:13.000] hearing a lot of your ideas around[55:13.000 --> 55:17.000] startups and what you're doing and[55:17.000 --> 55:21.000] also about how you're a SageMaker[55:21.000 --> 55:25.000] is there any place that someone can get a hold of you[55:25.000 --> 55:29.000] if they listen to this on the Orelia platform or[55:29.000 --> 55:33.000] think content that you're developing yourself or any other information you want to share[55:33.000 --> 55:37.000] yeah definitely so I think best place to reach out to me and I'm always[55:37.000 --> 55:41.000] happy to receive a few messages and have a good chat or a virtual coffee[55:41.000 --> 55:45.000] is via LinkedIn my name is here that's how you can find me on LinkedIn[55:45.000 --> 55:49.000] I'm also at conferences here and there well in Europe mostly[55:49.000 --> 55:53.000] typically when there is an MLOps conference you're probably going to see me there[55:53.000 --> 55:57.000] in one way or another that is something as well[55:57.000 --> 56:01.000] cool yeah well I'm glad we had a chance to talk[56:01.000 --> 56:05.000] you taught me a few things that I'm definitely going to follow up on[56:05.000 --> 56:09.000] and I really appreciate it and hopefully we can talk again soon[56:09.000 --> 56:13.000] thanks a lot for the chat okay all right
Show notes
If you enjoyed this video, here are additional resources to look at:

- Coursera + Duke Specialization: Building Cloud Computing Solutions at Scale: https://www.coursera.org/specializations/building-cloud-computing-solutions-at-scale
- Python, Bash, and SQL Essentials for Data Engineering Specialization: https://www.coursera.org/specializations/python-bash-sql-data-engineering-duke
- AWS Certified Solutions Architect - Professional (SAP-C01) Cert Prep: 1 Design for Organizational Complexity: https://www.linkedin.com/learning/aws-certified-solutions-architect-professional-sap-c01-cert-prep-1-design-for-organizational-complexity/design-for-organizational-complexity?autoplay=true
- O'Reilly Book: Practical MLOps: https://www.amazon.com/Practical-MLOps-Operationalizing-Machine-Learning/dp/1098103017
- O'Reilly Book: Python for DevOps: https://www.amazon.com/gp/product/B082P97LDW/
- O'Reilly Book: Developing on AWS with C#: A Comprehensive Guide on Using C# to Build Solutions on the AWS Platform: https://www.amazon.com/Developing-AWS-Comprehensive-Solutions-Platform/dp/1492095877
- Pragmatic AI: An Introduction to Cloud-based Machine Learning: https://www.amazon.com/gp/product/B07FB8F8QP/
- Pragmatic AI Labs Book: Python Command-Line Tools: https://www.amazon.com/gp/product/B0855FSFYZ
- Pragmatic AI Labs Book: Cloud Computing for Data Analysis: https://www.amazon.com/gp/product/B0992BN7W8
- Pragmatic AI Book: Minimal Python: https://www.amazon.com/gp/product/B0855NSRR7
- Pragmatic AI Book: Testing in Python: https://www.amazon.com/gp/product/B0855NSRR7
- Subscribe to Pragmatic AI Labs YouTube Channel: https://www.youtube.com/channel/UCNDfiL0D1LUeKWAkRE1xO5Q
- Subscribe to 52 Weeks of AWS Podcast: https://52-weeks-of-cloud.simplecast.com
- View content on noahgift.com: https://noahgift.com/
- View content on Pragmatic AI Labs Website: https://paiml.com/

Transcript:

[00:00] Noah: Hey, three, two, one, there we go, we're live. All right, so welcome, Simon, to Enterprise MLOps Interviews. The goal of these interviews is to get people exposed to real professionals who are doing work in MLOps. It's such a cutting-edge field that I think a lot of people are very curious about: what is it? How do you do it? I'm very honored to have Simon here. Do you want to introduce yourself and maybe talk a little bit about your background?

[00:32] Simon: Sure. Yeah, thanks again for inviting me. My name is Simon Stiebellehner. I am originally from Austria but currently working in the Netherlands, in Amsterdam, at Transaction Monitoring Netherlands (TMNL), where I am the lead MLOps engineer. What are we doing at TMNL? We are a data processing company, owned by the five large banks of the Netherlands, and our purpose is kind of what the name says: we run anti-money-laundering models on pseudonymized transactions of businesses that we get from these five banks, to detect unusual patterns on that transaction graph that might indicate money laundering. That's, in a nutshell, what we do. So as you can imagine, we are really focused on building models, and obviously MLOps is a big component there, because that is really the core of what you do, and you want to do it efficiently and effectively. In my role as lead MLOps engineer I am, on the one hand, the lead engineer of the actual MLOps platform team. This is a centralized team that builds out a lot of the infrastructure that's needed to do modeling effectively and efficiently. But I am also the craft lead for the machine learning engineering craft: these are, in our case, the machine learning engineers, the people working within the model development teams, in cross-functional teams, actually building these models. That's what I'm currently doing. During the evenings and weekends I'm also a lecturer at the University of Applied Sciences in Vienna, where I teach data mining and data warehousing to master's students. Before TMNL I was at bol.com, which is the largest e-commerce retailer in the Netherlands; I always tend to say the Amazon of the Netherlands, or of the Benelux, actually. It is still the biggest e-commerce retailer in the Netherlands, even ahead of Amazon. There I was an expert machine learning engineer, doing somewhat comparable work, though a bit more focused on the actual modeling part; now it's really more on the infrastructure end. And before that, I spent some time in consulting, leading a data science team. That's actually where I come from; I originally come from the data science end. There I started drifting towards MLOps, because we started building out a deployment and serving platform that would, as a consulting company, make it easier for us to deploy models for our clients, to serve these models, and also to monitor them. That made me drift further and further down the engineering lane, all the way to MLOps.

[03:17] Noah: Great, yeah, that's a great background. I'm curious about the data science to MLOps journey; I think that would be a great discussion to dig into a little bit. My background is originally more on the software engineering side. When I was in the Bay Area I was an individual contributor, then ran companies at one point, and ran multiple teams. And then, as the data science field exploded, I hired multiple data science teams and worked with them. But what was interesting is that I found the original approach of data science, from my perspective, was lacking in that there weren't really deliverables. When you look at a software engineering team, it's very clear there are deliverables: you have a mobile app and it has to get better each week, right? Whereas, what are you doing? So I would love to hear your story about how you went from doing more pure data science to, it sounds like, MLOps.

[04:27] Simon: Yeah, yeah, actually. Back then in consulting, at least in Austria, data science and everything around it was still kind of in its infancy, back in 2016 and so on. It was still really, really new to many organizations, at least in Austria, which might be some years behind the US. So in consulting, what we very often struggled with was that problems could be solved on the modeling end, but easy deployment, and keeping these models in production at the client side, was always more of a challenge. So naturally I started thinking about and focusing more on the actual bigger problem that I saw, which was not so much building the models, but really: how can we streamline things? How can we keep things operating? How can we make the move from a prototype, from a PoC, to a productionized model easier? And how can we keep it there and maintain it there? I saw that this problem was coming up, and it really fascinated me, so I started jumping on that exciting problem. That's how it went for me. Back then we also recognized it as a potential product in our case, so we started building out that deployment, serving, and monitoring platform. And that was it for me: I fell into that rabbit hole and never wanted to get out of it again.

[06:05] Noah: So the system that you built initially, what was your stack? What were some of the things you were using?

[06:13] Simon: Yeah, so when we talk about the stack: the full backend was written in Java. From a user perspective, our goal was to build a drag-and-drop platform for models. So the contract we had was basically: you package your model as an MLflow model, and then you drag and drop it into a web UI. It's going to be wrapped in containers, it's going to be deployed, and there will be a monitoring layer in front of it. Based on whatever dataset you trained it on, it would automatically calculate different metrics, different distributional metrics around the variables you are using.
And so we were layering this approach so that eventually, for every incoming request, you would have a nice dashboard and you could monitor all that stuff. So stack-wise it was actually MLflow, specifically MLflow Models, a lot; Java in the backend; and a lot of Python, especially a PySpark component. It's been quite a while, but quite some part was also written in Scala, because one component of this platform was a bit of an AutoML approach, though that died over time. That was also based on PySpark, and on vanilla Spark written in Scala, so we could facilitate the AutoML part. Later on we added the easy deployment and serving part. So it was a lot of custom-built stuff; it was largely custom built. Back then there wasn't that much MLOps tooling out there yet, so you had to build a lot of it custom.

[08:05] Noah: Yeah, the MLflow concept is an interesting one, because they provide this package structure so that at least you have some idea of what is going to be sent into the model, and there's a format for the model. That part of MLflow seems to be a pretty good idea: you're creating a standard where, if you're using scikit-learn or something, you don't necessarily want to just throw a pickled model somewhere and say, okay, let's go.

[08:42] Simon: Yeah, that was also our thinking back then. We thought a lot about what could become the standard for how you package models. Back then MLflow was one of the few tools that already existed, and of course there was Databricks behind it. So we made a bet on it and said: all right, let's follow that packaging standard and make it the contract for how you, as a data scientist, would need to package your model up and submit it to the platform.

[09:13] Noah: Yeah, it's interesting, because this reminds me of one of the issues happening right now with cloud computing. In the cloud, AWS has dominated for a long time, with I think 40% market share globally. Azure is now gaining, with some pretty good traction, and GCP has been down for a bit, maybe in the 10% range or something like that. But what's interesting is that it seems like none of the cloud providers has necessarily been leading the way on things like packaging models, right? They have their own proprietary systems, which have been developed and continue to be developed: Vertex AI in the case of Google, SageMaker in the case of Amazon. But take SageMaker, for example: there isn't really an industry-wide standard of model packaging that SageMaker uses; they have their own proprietary stuff that builds in, and Vertex AI has its own proprietary stuff. So I think it is interesting to see what's going to happen, because your original hypothesis was: let's pick something that looks like it has some traction and isn't tied directly to a cloud provider, because Databricks can work on anything. That seems like one of the more sticky problems right now with MLOps: who's the leader? Who's developing the right kind of standard for tooling? And maybe that leads into you talking a little bit about what you're doing currently. Do you have any thoughts about the current tooling and what you're doing at your current company?

[11:20] Simon: Absolutely.
So at my current organization, Transaction Monitoring Netherlands, we are fully on AWS, really almost cloud-native AWS. That also means everything we do on the modeling side revolves around SageMaker. So for us, specifically for us as the MLOps team, we are building the platform around SageMaker capabilities. And on that end, at least company-internal, we have a contract for how you must deploy models. There is only one way, what we call the golden path: the streamlined, highly automated path that is supported by the platform. This is the only way you can actually deploy models, and in our case that is actually a SageMaker pipeline object. In our company we're doing large-scale batch processing; we're not doing anything real-time at present, we are doing post-transaction monitoring. That means you essentially need to submit DAGs, right? This is what we use for training, and this is what we eventually deploy. And this is our internal contract: in your model repository there must be one place, a function with a specific name, and that function must return a SageMaker pipeline object.

[12:44] Noah: Yeah, that's interesting.
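The "golden path" contract Simon describes, where every model repository must expose a function with a well-known name that returns a SageMaker pipeline object, could be enforced platform-side roughly like this. The function name, the `Pipeline` stand-in class, and the module layout are all assumptions for illustration, not TMNL's actual code:

```python
import types


class Pipeline:
    """Stand-in for sagemaker.workflow.pipeline.Pipeline (assumption: the
    real platform would import and check against the actual SDK class)."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = steps


REQUIRED_FUNCTION = "get_pipeline"  # the "specific name" in the contract


def load_golden_path_pipeline(module):
    """Enforce the internal contract: the model repo must expose a function
    with a well-known name that returns a pipeline object."""
    factory = getattr(module, REQUIRED_FUNCTION, None)
    if not callable(factory):
        raise TypeError(f"model repo must define a callable {REQUIRED_FUNCTION}()")
    pipeline = factory()
    if not isinstance(pipeline, Pipeline):
        raise TypeError("contract violation: get_pipeline() must return a Pipeline")
    return pipeline


# A model team's repository module, simulated in-process for the sketch.
model_repo = types.ModuleType("aml_model")
model_repo.get_pipeline = lambda: Pipeline(
    "aml-batch", steps=["preprocess", "train", "score"]
)

p = load_golden_path_pipeline(model_repo)
```

The appeal of this kind of contract is that CI and the deployment platform can validate any model repository the same way, without knowing anything about the model inside.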
I mean, I know many people who are using SageMaker in production, and it does seem like where it has some advantages is that AWS generally does a pretty good job at building solutions. If you just look at the history of their services, the odds are pretty high that they'll keep getting better, keep improving things. And it seems like what I'm hearing from people, and maybe from your organization as well, is that the SDK for SageMaker is really the win, versus some of the UX tools they have, the interfaces for Canvas and Studio. Is that what's happening?

[13:36] Simon: Yeah, so what we try to do is always think about our users. Who are our users? What capabilities and skills do they have? What freedom should they have, and what abilities should they have to develop models? In our case, we don't really have use cases for things like Canvas, because our users are fairly mature teams that know how to do, on the one hand, the data science work, of course, but also the engineering work. So things like Canvas don't really play much of a role, because with the high abstraction layer of graphical, drag-and-drop tooling you are also limited in what you can do, or what you can do easily. So in our case, it really is the strength of the flexibility that the SageMaker SDK gives you, and in general the SDKs around most AWS services. But it also comes with challenges, of course. You give a lot of freedom, but you're also creating certain requirements for your model development teams, which is also why we've been working on abstracting further away from the SDK. Our objective is actually that you should not be forced to interact with the raw SDK when you use SageMaker anymore; instead you have a thin layer of abstraction on top of what you are doing. That's something we are moving towards more and more, because flexibility comes at a cost, often at the cost of speed, specifically for the 90% default stuff that you want to do.

[15:20] Noah: One of the complaints I have against SageMaker is that it only uses virtual machines, and that does seem like a strange strategy in some sense. If you're doing batch only, it doesn't matter as much, and I think it's actually a good strategy to get your batch-based predictions very, very strong; in that case the virtual machines are a bit less of a complaint. But in the case of SageMaker endpoints, the fact that you have to spin up these really expensive virtual machines and let them run 24/7 to do online prediction: is that something your organization evaluated and decided not to use? What are your thoughts on that?

[16:15] Simon: Yeah, in our case, doing real-time or near-real-time inference is currently not really relevant, for the simple reason that, when you think a bit more about the money laundering, or anti-money-laundering, space: every individual bank must do anti-money laundering, and they have armies of people doing that. But on the other hand, the time it actually takes from one of their AML systems detecting something unusual, through a review process, until it eventually hits the governmental institution that takes care of the cases that have been at least twice validated as indeed looking very unusual: that takes a while, it can take quite some time. Which is also why it doesn't really matter whether you ship your prediction within a second or whether it takes you a week or two weeks.
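The cost concern behind this exchange is easy to make concrete: an always-on endpoint bills for every hour of the month, while batch jobs bill only while they run. A back-of-the-envelope sketch with purely hypothetical rates (these are not real AWS prices):

```python
def monthly_cost_always_on(hourly_rate):
    """Cost of an endpoint instance running 24/7 for a 30-day month."""
    return hourly_rate * 24 * 30


def monthly_cost_batch(hourly_rate, runs_per_month, hours_per_run):
    """Cost of spinning instances up only for scheduled batch jobs."""
    return hourly_rate * runs_per_month * hours_per_run


# Hypothetical numbers for illustration only.
rate = 4.0  # assumed $/hour for a beefy GPU instance
always_on = monthly_cost_always_on(rate)                               # 2880.0
batch = monthly_cost_batch(rate, runs_per_month=30, hours_per_run=2)   # 240.0
```

With these made-up numbers, the always-on endpoint costs twelve times the nightly batch runs, which is why fluctuating load makes 24/7 instances hard to justify for batch-friendly workloads like post-transaction monitoring.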
--> 17:15.000] It doesn't really matter, hence for us,[17:15.000 --> 17:19.000] that problem, so far, of thinking about real time inference[17:19.000 --> 17:21.000] has not been there.[17:21.000 --> 17:25.000] But yeah, indeed, for other use cases,[17:25.000 --> 17:27.000] for also private projects,[17:27.000 --> 17:29.000] we've also been considering SageMaker Endpoints[17:29.000 --> 17:31.000] for a while, but exactly what you said,[17:31.000 --> 17:33.000] the fact that you need to have a very beefy machine[17:33.000 --> 17:35.000] running all the time,[17:35.000 --> 17:39.000] specifically when you have heavy GPU loads, right,[17:39.000 --> 17:43.000] and you're actually paying for that machine running 24/7,[17:43.000 --> 17:46.000] although you do have quite fluctuating load.[17:46.000 --> 17:49.000] Yeah, then that definitely becomes quite a consideration[17:49.000 --> 17:51.000] of what you go for.[17:51.000 --> 17:58.000] Yeah, and I actually have been talking to AWS about that,[17:58.000 --> 18:02.000] because one of the issues that I have is that[18:02.000 --> 18:07.000] the AWS platform really pushes serverless,[18:07.000 --> 18:10.000] and then my question for AWS is,[18:10.000 --> 18:13.000] so why aren't you using it?[18:13.000 --> 18:16.000] I mean, if you're pushing serverless for everything,[18:16.000 --> 18:19.000] why is SageMaker not serverless?[18:19.000 --> 18:21.000] And so maybe they're going to do that, I don't know.[18:21.000 --> 18:23.000] I don't have any inside information,[18:23.000 --> 18:29.000] but it is interesting to hear you had some similar concerns.[18:29.000 --> 18:32.000] I know that there are two questions here.[18:32.000 --> 18:37.000] One is someone asked about what do you do for data versioning,[18:37.000 --> 18:41.000] and a second one is how do you do event based MLOps?[18:41.000 --> 18:43.000] So maybe kind of following up.[18:43.000 --> 18:46.000] Yeah, what do we do for data versioning?[18:46.000 --> 18:51.000] On the
one hand, we're running a data lakehouse,[18:51.000 --> 18:54.000] where the data we get from the financial institutions,[18:54.000 --> 18:57.000] from the banks, runs through a massive data pipeline,[18:57.000 --> 19:01.000] also on AWS, we're using Glue and Step Functions actually for that,[19:01.000 --> 19:03.000] and then eventually it ends up modeled to some extent,[19:03.000 --> 19:06.000] sanitized, quality checked, in our data lakehouse,[19:06.000 --> 19:10.000] and there we're actually using Hudi on top of S3.[19:10.000 --> 19:13.000] And this is also what we use for versioning,[19:13.000 --> 19:16.000] which we use for time travel and all these things.[19:16.000 --> 19:19.000] So that is Hudi on top of S3,[19:19.000 --> 19:21.000] where then pipelines,[19:21.000 --> 19:24.000] so actually our model pipelines, plug in there[19:24.000 --> 19:27.000] and spit out predictions, alerts,[19:27.000 --> 19:29.000] what we call alerts eventually.[19:29.000 --> 19:33.000] That is something that we version based on unique IDs.[19:33.000 --> 19:36.000] So processing IDs, we track pretty much everything,[19:36.000 --> 19:39.000] every line of code that touched[19:39.000 --> 19:43.000] or is related to a specific row in our data.[19:43.000 --> 19:46.000] So we can exactly track back for every single row[19:46.000 --> 19:48.000] in our predictions and in our alerts,[19:48.000 --> 19:50.000] what pipeline ran on it,[19:50.000 --> 19:52.000] which jobs were in that pipeline,[19:52.000 --> 19:56.000] which code exactly was running in each job,[19:56.000 --> 19:58.000] which intermediate results were produced.[19:58.000 --> 20:01.000] So we're basically adding lineage information[20:01.000 --> 20:03.000] to everything we output along that line,[20:03.000 --> 20:05.000] so we can track everything back[20:05.000 --> 20:09.000] using a few tools we've built.[20:09.000 --> 20:12.000] So the tool you mentioned,[20:12.000 --> 20:13.000] I'm not familiar with it.[20:13.000 --> 
20:14.000] What is it called again?[20:14.000 --> 20:15.000] It's called Hudi?[20:15.000 --> 20:16.000] Hudi.[20:16.000 --> 20:17.000] Hudi.[20:17.000 --> 20:18.000] Oh, what is it?[20:18.000 --> 20:19.000] Maybe you can describe it.[20:19.000 --> 20:22.000] Yeah, Hudi is essentially,[20:22.000 --> 20:29.000] it's quite similar to other tools such as[20:29.000 --> 20:31.000] Databricks, how is it called?[20:31.000 --> 20:32.000] Databricks?[20:32.000 --> 20:33.000] Delta Lake maybe?[20:33.000 --> 20:34.000] Yes, exactly.[20:34.000 --> 20:35.000] Exactly.[20:35.000 --> 20:38.000] It's basically, it's equivalent to Delta Lake,[20:38.000 --> 20:40.000] just back then when we looked into[20:40.000 --> 20:42.000] what we were going to use,[20:42.000 --> 20:44.000] Delta Lake was not open sourced yet.[20:44.000 --> 20:46.000] Databricks open sourced it a while ago.[20:46.000 --> 20:47.000] We went for Hudi.[20:47.000 --> 20:50.000] It essentially, it is a layer on top of,[20:50.000 --> 20:53.000] in our case, S3 that allows you[20:53.000 --> 20:58.000] to more easily keep track of what you,[20:58.000 --> 21:03.000] of the actions you are performing on your data.[21:03.000 --> 21:08.000] So it's essentially very similar to Delta Lake,[21:08.000 --> 21:13.000] just, already before, an open sourced solution.[21:13.000 --> 21:15.000] Yeah, that's, I didn't know anything about that.[21:15.000 --> 21:16.000] So now I do.[21:16.000 --> 21:19.000] So thanks for letting me know.[21:19.000 --> 21:21.000] I'll have to look into that.[21:21.000 --> 21:27.000] The other, I guess, interesting stack related question is,[21:27.000 --> 21:29.000] what are your thoughts about,[21:29.000 --> 21:32.000] I think there's two areas that I think[21:32.000 --> 21:34.000] are interesting and that are emerging.[21:34.000 --> 21:36.000] Oh, actually there's, there's multiple.[21:36.000 --> 21:37.000] Maybe I'll just bring them all up.[21:37.000 --> 21:39.000] So we'll do one by one.[21:39.000
--> 21:42.000] So these are some emerging areas that I'm, that I'm seeing.[21:42.000 --> 21:49.000] So one is the concept of event driven, you know,[21:49.000 --> 21:54.000] architecture versus, versus maybe like a static architecture.[21:54.000 --> 21:57.000] And so I think obviously you're using Step Functions.[21:57.000 --> 22:00.000] So you're a fan of, of event driven architecture.[22:00.000 --> 22:04.000] Maybe we start, we'll start with that one is what are your,[22:04.000 --> 22:08.000] what are your thoughts on going more event driven in your organization?[22:08.000 --> 22:09.000] Yeah.[22:09.000 --> 22:13.000] In, in, in our case, essentially everything works event driven.[22:13.000 --> 22:14.000] Right.[22:14.000 --> 22:19.000] So since we're on AWS, we're using EventBridge, or CloudWatch Events.[22:19.000 --> 22:21.000] I think now it's called EventBridge everywhere.[22:21.000 --> 22:22.000] Right.[22:22.000 --> 22:24.000] This is how we trigger pretty much everything in our stack.[22:24.000 --> 22:27.000] This is how we trigger our data pipelines when data comes in.[22:27.000 --> 22:32.000] This is how we trigger different, different Lambdas that parse[22:32.000 --> 22:35.000] certain information from our logs and store it in different databases.[22:35.000 --> 22:40.000] This is how we also, how we, at some point back in the past,[22:40.000 --> 22:44.000] how we also triggered new deployments when new models were approved in[22:44.000 --> 22:46.000] the model registry.[22:46.000 --> 22:50.000] So basically everything we've been doing is, is fully event driven.[22:50.000 --> 22:51.000] Yeah.[22:51.000 --> 22:56.000] So, so I think this is a key thing you bring up here is that I've,[22:56.000 --> 23:00.000] I've talked to many people who don't use AWS, who are, you know,[23:00.000 --> 23:03.000] experts at alternative technologies.[23:03.000 --> 23:06.000] And one of the things that I've heard some people say is like, oh,[23:06.000 --> 23:13.000] well, 
AWS isn't as fast as X or Y, like Lambda isn't as fast as X or Y or,[23:13.000 --> 23:17.000] you know, Kubernetes or, but, but the point you bring up is exactly the[23:17.000 --> 23:24.000] way I think about AWS, is that the true advantage of the AWS platform is the,[23:24.000 --> 23:29.000] is the tight integration with the services and you can design event[23:29.000 --> 23:31.000] driven workflows.[23:31.000 --> 23:33.000] Would you say that's, that's... Absolutely.[23:33.000 --> 23:34.000] Yeah.[23:34.000 --> 23:35.000] Yeah.[23:35.000 --> 23:39.000] I think designing event driven workflows on AWS is incredibly easy to do.[23:39.000 --> 23:40.000] Yeah.[23:40.000 --> 23:43.000] And it also comes incredibly natural and that's extremely powerful.[23:43.000 --> 23:44.000] Right.[23:44.000 --> 23:49.000] And simply by, by having an easy way to trigger Lambdas event driven,[23:49.000 --> 23:52.000] you can pretty much, right, pretty much do everything and glue[23:52.000 --> 23:54.000] everything together that you want.[23:54.000 --> 23:56.000] I think that gives you a tremendous flexibility.[23:56.000 --> 23:57.000] Yeah.[23:57.000 --> 24:00.000] So, so I think there's two things that come to mind now.[24:00.000 --> 24:07.000] One is that, that if you are developing an MLOps platform, you[24:07.000 --> 24:09.000] can't ignore Lambda.[24:09.000 --> 24:12.000] So I, because I've had some people tell me, oh, well, we can do this and[24:12.000 --> 24:13.000] this and this better.[24:13.000 --> 24:17.000] It's like, yeah, but if you're going to be on AWS, you have to understand[24:17.000 --> 24:18.000] why people use Lambda.[24:18.000 --> 24:19.000] It isn't speed.[24:19.000 --> 24:24.000] It's, it's the ease of, ease of developing very rich solutions.[24:24.000 --> 24:25.000] Right.[24:25.000 --> 24:26.000] Absolutely.[24:26.000 --> 24:28.000] And then the glue between, between what you are building eventually.[24:28.000 --> 24:33.000] And you can even almost, the 
thoughts in your mind turn into Lambdas.[24:33.000 --> 24:36.000] You know, like you can be thinking and building code so quickly.[24:36.000 --> 24:37.000] Absolutely.[24:37.000 --> 24:41.000] Everything turns into which event do I need to listen to, and then I trigger[24:41.000 --> 24:43.000] a Lambda and that Lambda does this and that.[24:43.000 --> 24:44.000] Yeah.[24:44.000 --> 24:48.000] And the other part about Lambda that's pretty, pretty awesome is that it[24:48.000 --> 24:52.000] hooks into services that have infinite scale.[24:52.000 --> 24:56.000] Like so SQS, like you can't break SQS.[24:56.000 --> 24:59.000] Like there's nothing you can do to ever take SQS down.[24:59.000 --> 25:02.000] It handles unlimited requests in and unlimited requests out.[25:02.000 --> 25:04.000] How many systems are like that?[25:04.000 --> 25:05.000] Yeah.[25:05.000 --> 25:06.000] Yeah, absolutely.[25:06.000 --> 25:07.000] Yeah.[25:07.000 --> 25:12.000] So then this kind of a followup would be that, that maybe data scientists[25:12.000 --> 25:17.000] should learn Lambda and Step Functions in order to, to get to[25:17.000 --> 25:18.000] MLOps.[25:18.000 --> 25:21.000] I think that's a yes.[25:21.000 --> 25:25.000] If you want to, if you want to put a foot into MLOps and you are on AWS,[25:25.000 --> 25:31.000] then I think there is no way around learning these fundamentals.[25:31.000 --> 25:32.000] Right.[25:32.000 --> 25:35.000] There's no way around learning things like what is a Lambda?[25:35.000 --> 25:39.000] How do I, how do I create a Lambda via Terraform or whatever tool you're[25:39.000 --> 25:40.000] using there?[25:40.000 --> 25:42.000] And how do I hook it up to an event?[25:42.000 --> 25:47.000] And how do I, how do I use the AWS SDK to interact with different[25:47.000 --> 25:48.000] services?[25:48.000 --> 25:49.000] So, right.[25:49.000 --> 25:53.000] I think if you want to take a step into MLOps coming more from[25:53.000 --> 25:57.000] the data 
science side, it's extremely important to familiarize yourself[25:57.000 --> 26:01.000] with, at least the fundamentals of, how do you architect[26:01.000 --> 26:03.000] basic solutions on AWS?[26:03.000 --> 26:05.000] How do you glue services together?[26:05.000 --> 26:07.000] How do you make them speak to each other?[26:07.000 --> 26:09.000] So yeah, I think that's quite fundamental.[26:09.000 --> 26:14.000] Ideally, ideally, I think that's what the platform should take away from you[26:14.000 --> 26:16.000] as a, as a pure data scientist.[26:16.000 --> 26:19.000] You should not necessarily have to deal with that stuff.[26:19.000 --> 26:23.000] But if you're interested, if you want to make that move more towards MLOps,[26:23.000 --> 26:27.000] I think learning about infrastructure, and specifically in the context of AWS[26:27.000 --> 26:31.000] about the services and how to use them, is really fundamental.[26:31.000 --> 26:32.000] Yeah, it's good.[26:32.000 --> 26:33.000] Because this is automation eventually.[26:33.000 --> 26:37.000] And if you want to automate, if you want to automate your complex processes,[26:37.000 --> 26:39.000] then you need to learn that stuff.[26:39.000 --> 26:41.000] How else are you going to do it?[26:41.000 --> 26:42.000] Yeah, I agree.[26:42.000 --> 26:46.000] I mean, that's really what, what Lambda and Step Functions are: they're[26:46.000 --> 26:47.000] automation tools.[26:47.000 --> 26:49.000] So that's probably the better way to describe it.[26:49.000 --> 26:52.000] That's a very good point you bring up.[26:52.000 --> 26:57.000] Another technology that I think is an emerging technology is the[26:57.000 --> 26:58.000] managed file system.[26:58.000 --> 27:05.000] And the reason why I think it's interesting is that, so 20 plus years[27:05.000 --> 27:11.000] ago, I was using file systems in the university setting when I was at[27:11.000 --> 27:14.000] Caltech and then also in the film industry.[27:14.000 --> 
27:22.000] So film has been using managed file servers with parallel processing[27:22.000 --> 27:24.000] farms for a long time.[27:24.000 --> 27:27.000] I don't know how many people know this, but in the film industry,[27:27.000 --> 27:32.000] the, the architecture, even from like 2000, was there's a very[27:32.000 --> 27:38.000] expensive file server and then there's let's say 40,000 machines or 40,000[27:38.000 --> 27:39.000] cores.[27:39.000 --> 27:40.000] And that's, that's it.[27:40.000 --> 27:41.000] That's the architecture.[27:41.000 --> 27:46.000] And now what's interesting is I see with data science and machine learning[27:46.000 --> 27:52.000] operations that what could potentially happen in the future is[27:52.000 --> 27:57.000] actually a managed NFS mount point with maybe Kubernetes or something like[27:57.000 --> 27:58.000] that.[27:58.000 --> 28:01.000] Do you see any of that on the horizon?[28:01.000 --> 28:04.000] Oh, that's a good question.[28:04.000 --> 28:08.000] I think for our, for what we're currently doing, that's probably a[28:08.000 --> 28:10.000] bit further away.[28:10.000 --> 28:15.000] But in principle, I could very well imagine that. In our use case, not,[28:15.000 --> 28:17.000] not quite.[28:17.000 --> 28:20.000] But in principle, definitely.[28:20.000 --> 28:26.000] And then maybe a third, a third emerging thing I'm seeing is what's going[28:26.000 --> 28:29.000] on with OpenAI and Hugging Face.[28:29.000 --> 28:34.000] And that has the potential, maybe, to change the game a little bit,[28:34.000 --> 28:38.000] especially with Hugging Face, I think, although both of them, I mean,[28:38.000 --> 28:43.000] there is that, you know, in the case of pre-trained models, here's a[28:43.000 --> 28:48.000] perfect example is that an organization may have, you know, maybe they're[28:48.000 --> 28:53.000] using AWS even for this, they're transcribing videos and they're going[28:53.000 --> 28:56.000] to do something with 
them, maybe they're going to detect, I don't know,[28:56.000 --> 29:02.000] like, you know, if you recorded customers in your, I'm just brainstorming,[29:02.000 --> 29:05.000] I'm not saying your company did this, but I'm just creating a hypothetical[29:05.000 --> 29:09.000] situation that they recorded, you know, customers talking and then they,[29:09.000 --> 29:12.000] they transcribe it to text and then run some kind of a, you know,[29:12.000 --> 29:15.000] criminal detection feature or something like that.[29:15.000 --> 29:19.000] Like they could build their own models or they could download the thing[29:19.000 --> 29:23.000] that was released two days ago or a day ago from OpenAI that transcribes[29:23.000 --> 29:29.000] things, you know, and then, and then run that transcribed text through[29:29.000 --> 29:34.000] some other Hugging Face model that summarizes it, and then you could[29:34.000 --> 29:38.000] feed that into a system. So what are your[29:38.000 --> 29:42.000] thoughts around some of these pre-trained models, and are you[29:42.000 --> 29:48.000] thinking, in terms of your stack, of trying to look into doing fine tuning?[29:48.000 --> 29:53.000] Yeah, so I think pre-trained models, and especially the way that Hugging Face,[29:53.000 --> 29:57.000] I think, really revolutionized the space in terms of really kind of[29:57.000 --> 30:02.000] platformizing the entire business around, or the entire market around,[30:02.000 --> 30:07.000] pre-trained models. 
I think that is really quite incredible and really,[30:07.000 --> 30:10.000] for the ecosystem, a change in how to do things.[30:10.000 --> 30:16.000] And I believe that looking at the costs of training large models,[30:16.000 --> 30:19.000] and looking at the fact that many organizations are not able to do it[30:19.000 --> 30:23.000] because of massive costs or because of lack of data,[30:23.000 --> 30:29.000] I think this makes it very clear how important[30:29.000 --> 30:33.000] such platforms are, how important sharing of pre-trained models actually is.[30:33.000 --> 30:37.000] I believe we are only quite at the beginning actually of that.[30:37.000 --> 30:42.000] And I think we're going to see that... nowadays you see it mostly when it[30:42.000 --> 30:47.000] comes to fairly generalized data formats: images, potentially videos, text,[30:47.000 --> 30:52.000] speech, these things. But I believe that we're going to see more marketplace[30:52.000 --> 30:57.000] approaches when it comes to pre-trained models in a lot more industries[30:57.000 --> 31:01.000] and in a lot more, in a lot more use cases where data is to some degree[31:01.000 --> 31:05.000] standardized. Also when you think about, when you think about banking,[31:05.000 --> 31:10.000] for example, right? When you think about transactions, to some extent[31:10.000 --> 31:14.000] transaction data always looks the same, kind of, at least at[31:14.000 --> 31:17.000] every bank. Of course you might need to do some mapping here and there,[31:17.000 --> 31:22.000] but also there is a lot of power in it. Because simply also, thinking[31:22.000 --> 31:28.000] about sharing data is always a difficult thing, especially in Europe.[31:28.000 --> 31:32.000] Sharing data between organizations is incredibly difficult legally.[31:32.000 --> 31:36.000] It's difficult. 
Sharing models is a different thing, right?[31:36.000 --> 31:40.000] Basically, similar to the concept of federated learning, sharing models[31:40.000 --> 31:44.000] is significantly easier legally than actually sharing data.[31:44.000 --> 31:48.000] And then applying these models, fine tuning them and so on.[31:48.000 --> 31:52.000] Yeah, I mean, I could just imagine. I really don't know much about[31:52.000 --> 31:56.000] banking transactions, but I would imagine there could be several[31:56.000 --> 32:01.000] kinds of transactions that are very normal. And then there's some[32:01.000 --> 32:06.000] transactions, like if, every single second,[32:06.000 --> 32:11.000] you're transferring a lot of money. And it happens just[32:11.000 --> 32:14.000] very quickly. It's like, wait, why are you doing this? Why are you transferring money[32:14.000 --> 32:20.000] constantly? What's going on? Or the huge sum of money only[32:20.000 --> 32:24.000] involves three different points in the network. Over and over again,[32:24.000 --> 32:29.000] just these three points are constantly... And so once you've developed[32:29.000 --> 32:33.000] a model that does anomaly detection, then[32:33.000 --> 32:37.000] yeah, why would you need to develop another one? I mean, somebody already did it.[32:37.000 --> 32:41.000] Exactly. Yes, absolutely, absolutely. And that's[32:41.000 --> 32:45.000] definitely... That's encoded knowledge, encoded information in terms of the model,[32:45.000 --> 32:49.000] which is not personally... well, which abstracts away from[32:49.000 --> 32:53.000] personally identifiable data. And that's really the power. That is something[32:53.000 --> 32:57.000] that, yeah, as I've said before, you can share significantly easier and you can[32:57.000 --> 33:03.000] apply to your use cases. 
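The patterns described here, rapid repeated transfers or money cycling between a few accounts, can be sketched as simple rule-based flags before any trained model is involved. This is a minimal pure-Python illustration only; the function name, thresholds, and data shape are hypothetical, not TMNL's actual detection logic:

```python
from collections import defaultdict

def flag_rapid_senders(transactions, max_per_window=10, window_seconds=60):
    """Flag accounts that send more than `max_per_window` transfers
    inside any window of `window_seconds`.

    `transactions` is a list of (timestamp, sender, receiver, amount)
    tuples with timestamps in seconds. Thresholds are illustrative.
    """
    by_sender = defaultdict(list)
    for ts, sender, _receiver, _amount in transactions:
        by_sender[sender].append(ts)

    flagged = set()
    for sender, times in by_sender.items():
        times.sort()
        start = 0
        # Slide a window over the sorted timestamps and count transfers.
        for end in range(len(times)):
            while times[end] - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_per_window:
                flagged.add(sender)
                break
    return flagged

# Example: account "A" fires 12 transfers within one minute; "B" sends
# two transfers an hour apart and stays under the threshold.
txns = [(i, "A", "X", 500.0) for i in range(12)]
txns += [(0, "B", "Y", 100.0), (3600, "B", "Z", 100.0)]
print(flag_rapid_senders(txns))  # → {'A'}
```

A real system would of course replace these hand-set thresholds with the learned models discussed above; the point is only that the "unusual pattern" intuition maps directly onto features over timestamped transfers.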
Kind of related to this, in[33:03.000 --> 33:09.000] terms of upcoming technologies, is, I think, dealing more with graphs.[33:09.000 --> 33:13.000] And so is that something, stack-wise, that your[33:13.000 --> 33:19.000] company's investigated or researched? Yeah, so when you think about[33:19.000 --> 33:23.000] transactions, bank transactions, right? And bank customers.[33:23.000 --> 33:27.000] So in our case, again, we only have pseudonymized[33:27.000 --> 33:31.000] transaction data, so actually we cannot see anything, right? We cannot see names, we cannot see[33:31.000 --> 33:35.000] IBANs or whatever. We really can't see much. But[33:35.000 --> 33:39.000] you can look at transactions moving between[33:39.000 --> 33:43.000] different entities, between different accounts. You can look at that[33:43.000 --> 33:47.000] as a network, as a graph. And that's also what we very frequently do.[33:47.000 --> 33:51.000] You have your nodes in your network, these are your accounts[33:51.000 --> 33:55.000] or your persons, even. And the actual edges between them,[33:55.000 --> 33:59.000] that's what your transactions are. So you have this[33:59.000 --> 34:03.000] massive graph, actually, that also we as TMNL, as Transaction Monitoring Netherlands,[34:03.000 --> 34:07.000] are sitting on. We're actually sitting on a massive transaction graph.[34:07.000 --> 34:11.000] So yeah, absolutely. For us, doing analysis on top of[34:11.000 --> 34:15.000] that graph, building models on top of that graph is a quite important[34:15.000 --> 34:19.000] thing. And like, I taught a class[34:19.000 --> 34:23.000] a few years ago at Berkeley where we had to[34:23.000 --> 34:27.000] cover graph databases a little bit. And I[34:27.000 --> 34:31.000] really didn't know that much about graph databases, although I did use one actually[34:31.000 --> 34:35.000] at one company I was at. 
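The representation described here, accounts as nodes and transactions as directed edges, can be sketched in a few lines of plain Python. Names and the tuple shape are hypothetical, just to make the nodes-and-edges idea concrete:

```python
from collections import defaultdict

def build_transaction_graph(transactions):
    """Build a directed multigraph as a nested adjacency map:
    sender -> receiver -> list of transfer amounts.

    `transactions` is a list of (sender, receiver, amount) tuples using
    pseudonymized account IDs, mirroring the anonymized setting above.
    """
    graph = defaultdict(lambda: defaultdict(list))
    for sender, receiver, amount in transactions:
        graph[sender][receiver].append(amount)
    return graph

# Three pseudonymized transfers; two of them repeat the same edge.
txns = [
    ("acct1", "acct2", 900.0),
    ("acct2", "acct3", 850.0),
    ("acct1", "acct2", 910.0),
]
g = build_transaction_graph(txns)
print(len(g["acct1"]["acct2"]))  # → 2
```

At TMNL's scale this lives in a distributed system rather than an in-memory dict, but the node/edge model is the same.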
But one of the things I learned in teaching that[34:35.000 --> 34:39.000] class was about the descriptive statistics[34:39.000 --> 34:43.000] of a graph network. And it[34:43.000 --> 34:47.000] is actually pretty interesting, because I think most of the time everyone talks about[34:47.000 --> 34:51.000] median and max, min and standard deviation and everything.[34:51.000 --> 34:55.000] But then with a graph, there's things like centrality,[34:55.000 --> 34:59.000] and I forget all the terms off the top of my head, but you can see[34:59.000 --> 35:03.000] if there's a node in the network that's[35:03.000 --> 35:07.000] everybody's interacting with. Absolutely. You can identify communities[35:07.000 --> 35:11.000] of people moving around a lot of money all the time. For example,[35:11.000 --> 35:15.000] you can derive different metrics, features eventually,[35:15.000 --> 35:19.000] doing computations on your graph and then plugging in some model.[35:19.000 --> 35:23.000] Often it's feature engineering. You're computing betweenness centrality scores[35:23.000 --> 35:27.000] across your graph, for your different entities. And then[35:27.000 --> 35:31.000] you're building your features actually. And then you're plugging in some[35:31.000 --> 35:35.000] model in the end. If you do classic machine learning, so to say.[35:35.000 --> 35:39.000] If you do graph deep learning, of course that's a bit different.[35:39.000 --> 35:43.000] So basically, for people that are analyzing[35:43.000 --> 35:47.000] essentially networks of people,[35:47.000 --> 35:51.000] with a graph database step one is to[35:51.000 --> 35:55.000] generate the features, which could be centrality.[35:55.000 --> 35:59.000] That's a score, and then you go and train[35:59.000 --> 36:03.000] the model based on that descriptive statistic.[36:03.000 --> 36:07.000] Exactly. 
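The feature-engineering step just described, compute a centrality score per node and feed it to a downstream model, can be illustrated with plain degree centrality; betweenness and the other measures mentioned follow the same pattern. A self-contained sketch with hypothetical node names:

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality for an undirected edge list.

    Each node's score is its degree divided by (n - 1), where n is the
    number of nodes -- the classic definition. Returns {node: score}.
    """
    degree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        degree[u] += 1
        degree[v] += 1
    n = len(nodes)
    if n < 2:
        return {node: 0.0 for node in nodes}
    return {node: degree[node] / (n - 1) for node in nodes}

# A hub node "h" that every other account transacts with scores 1.0 --
# the "node everybody's interacting with" from the conversation.
edges = [("h", "a"), ("h", "b"), ("h", "c")]
scores = degree_centrality(edges)
print(scores["h"])  # → 1.0
```

These per-node scores then become one column in a feature table that a classic model trains on, which is the workflow Simon describes.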
So one way you could think about it is,[36:07.000 --> 36:11.000] whether we need a graph database or not always depends on your specific use case[36:11.000 --> 36:15.000] and what database. We're actually also running[36:15.000 --> 36:19.000] that using Spark. You have GraphFrames, you have[36:19.000 --> 36:23.000] GraphX actually. So really stuff in Spark built for[36:23.000 --> 36:27.000] doing analysis on graphs.[36:27.000 --> 36:31.000] And then what you usually do is exactly what you said. You are trying[36:31.000 --> 36:35.000] to build features based on that graph,[36:35.000 --> 36:39.000] based on the attributes of the nodes and the attributes on the edges and so on.[36:39.000 --> 36:43.000] And so I guess in terms of graph databases right[36:43.000 --> 36:47.000] now, it sounds like maybe the three[36:47.000 --> 36:51.000] main players are: there's Neo4j, which[36:51.000 --> 36:55.000] has been around for a long time. There's, I guess, Spark,[36:55.000 --> 36:59.000] and then there's also, I forgot what the one is called for AWS,[36:59.000 --> 37:03.000] is it? Neptune, that's Neptune.[37:03.000 --> 37:07.000] Have you played with all three of those, and did you[37:07.000 --> 37:11.000] like Neptune? Neptune was something we... Spark, of course, we're actually currently[37:11.000 --> 37:15.000] using for exactly that. Also because it allows us[37:15.000 --> 37:19.000] to keep our stack fairly homogeneous. 
We did[37:19.000 --> 37:23.000] also a PoC on Neptune a while ago already,[37:23.000 --> 37:27.000] and well, with Neptune you definitely have essentially two ways[37:27.000 --> 37:31.000] to query it: either using Gremlin or SPARQL.[37:31.000 --> 37:35.000] So that means the people, your data scientists,[37:35.000 --> 37:39.000] need to get familiar with that, which then is already one bit of a hurdle,[37:39.000 --> 37:43.000] because usually data scientists are not familiar with either.[37:43.000 --> 37:47.000] But also what we found with Neptune[37:47.000 --> 37:51.000] is that it's not necessarily built[37:51.000 --> 37:55.000] as an analytics graph database. It's not necessarily made for[37:55.000 --> 37:59.000] that. And then it sometimes, at least[37:59.000 --> 38:03.000] for us, has become quite complicated to handle different performance considerations[38:03.000 --> 38:07.000] when you actually do fairly complex queries across that graph.[38:07.000 --> 38:11.000] Yeah, so you're bringing up a point which[38:11.000 --> 38:15.000] happens a lot in my experience with[38:15.000 --> 38:19.000] technology, which is that sometimes[38:19.000 --> 38:23.000] the purity of the solution becomes the problem,[38:23.000 --> 38:27.000] where even though Spark isn't necessarily[38:27.000 --> 38:31.000] designed to be a graph database system, the fact is[38:31.000 --> 38:35.000] people in your company are already using it. 
So[38:35.000 --> 38:39.000] if you just turn on that feature, now you can use it and it's not like[38:39.000 --> 38:43.000] this huge technical undertaking and retraining effort.[38:43.000 --> 38:47.000] So even if it's not as good, if it works, then that's probably[38:47.000 --> 38:51.000] the solution your company will use. Versus, I agree with you, a lot of times[38:51.000 --> 38:55.000] even if a solution... like Neo4j is a pretty good example:[38:55.000 --> 38:59.000] it's an interesting product, but[38:59.000 --> 39:03.000] you already have all these other products, like do you really want to introduce yet[39:03.000 --> 39:07.000] another product into your stack? Yeah, because eventually[39:07.000 --> 39:11.000] it all comes with an overhead of course, introducing it. That is one thing.[39:11.000 --> 39:15.000] It requires someone to maintain it, even if it's a[39:15.000 --> 39:19.000] managed service. Somebody needs to actually own it and look after it,[39:19.000 --> 39:23.000] and then, as you said, you need to retrain people to also use it effectively.[39:23.000 --> 39:27.000] So it comes at significant cost, and that is really[39:27.000 --> 39:31.000] something that I believe should be quite critically[39:31.000 --> 39:35.000] assessed. What is really the gain you have? How far can you go with[39:35.000 --> 39:39.000] your current tooling, and then eventually make[39:39.000 --> 39:43.000] that decision. 
At least personally I'm really[39:43.000 --> 39:47.000] not a fan of thinking tooling first,[39:47.000 --> 39:51.000] but I really believe in looking at your organization, looking at the people,[39:51.000 --> 39:55.000] what skills are there, looking at how effectively[39:55.000 --> 39:59.000] these people are actually performing certain activities and processes,[39:59.000 --> 40:03.000] and then carefully thinking about what really makes sense,[40:03.000 --> 40:07.000] because it's one thing, but people need to[40:07.000 --> 40:11.000] adopt and use the tooling, and eventually it should really speed them up and improve[40:11.000 --> 40:15.000] how they develop. Yeah, I think[40:15.000 --> 40:19.000] that's great advice, and it's hard to understand how good of advice it is,[40:19.000 --> 40:23.000] because it takes experience getting burned[40:23.000 --> 40:27.000] introducing new technology. I've[40:27.000 --> 40:31.000] had experiences before where[40:31.000 --> 40:35.000] one of the mistakes I've made was putting too many different technologies in an organization,[40:35.000 --> 40:39.000] and the problem is once you get enough complexity[40:39.000 --> 40:43.000] it can really explode, and then[40:43.000 --> 40:47.000] this is the part that really gets scary:[40:47.000 --> 40:51.000] let's take Spark for example. How hard is it to hire somebody that knows Spark? Pretty easy.[40:51.000 --> 40:55.000] How hard is it going to be to hire somebody that knows[40:55.000 --> 40:59.000] Spark, and then hire another person that knows the Gremlin query[40:59.000 --> 41:03.000] language for Neptune, then hire another person that knows Kubernetes,[41:03.000 --> 41:07.000] then hire another... after a while, if you have so many different kinds of tools,[41:07.000 --> 41:11.000] you have to hire so many different kinds of people that all[41:11.000 --> 41:15.000] productivity goes to a stop. 
So it's the hiring as well.[41:15.000 --> 41:19.000] Absolutely, I mean it's virtually impossible[41:19.000 --> 41:23.000] to find someone who is really well versed with Gremlin, for example.[41:23.000 --> 41:27.000] It's incredibly hard, and I think tech hiring is hard[41:27.000 --> 41:31.000] by itself already,[41:31.000 --> 41:35.000] so you really need to think about what can I hire for as well,[41:35.000 --> 41:39.000] what expertise can I realistically build up?[41:39.000 --> 41:43.000] So that's why I think AWS,[41:43.000 --> 41:47.000] even with some of the limitations about the ML platform,[41:47.000 --> 41:51.000] the advantage of using AWS is that[41:51.000 --> 41:55.000] you have a huge audience of people to hire from, and then the same thing with[41:55.000 --> 41:59.000] Spark: there's a lot of things I don't like about Spark, but a lot of people[41:59.000 --> 42:03.000] use Spark, and so if you use AWS and you use Spark,[42:03.000 --> 42:07.000] let's say those two, which you are, then you're going to have a much easier time[42:07.000 --> 42:11.000] hiring people, you're going to have a much easier time training people,[42:11.000 --> 42:15.000] there's tons of documentation about it. So I think it's[42:15.000 --> 42:19.000] very wise that you're thinking that way, but a lot of people don't think about that.[42:19.000 --> 42:23.000] They're like, oh, I've got to use the latest, greatest stuff and this and this and this,[42:23.000 --> 42:27.000] and then their company starts to get into trouble because they can't hire[42:27.000 --> 42:31.000] people, they can't maintain systems, and then productivity starts[42:31.000 --> 42:35.000] to decrease. Also something[42:35.000 --> 42:39.000] not to ignore is the cognitive load you put on a team[42:39.000 --> 42:43.000] that needs to manage a broad range of very different[42:43.000 --> 42:47.000] tools or services. 
It also puts incredible[42:47.000 --> 42:51.000] cognitive load on that team, and you suddenly also need an incredible breadth[42:51.000 --> 42:55.000] of expertise in that team, and that means you're also going[42:55.000 --> 42:59.000] to create single points of failure if you don't really[42:59.000 --> 43:03.000] scale up your team.[43:03.000 --> 43:07.000] It's something to really consider: I think when you go for[43:07.000 --> 43:11.000] new tooling you should really look at it from a holistic perspective,[43:11.000 --> 43:15.000] not only about this is the latest and greatest.[43:15.000 --> 43:19.000] In terms of Europe versus the[43:19.000 --> 43:23.000] US, have you spent much time in the US at all?[43:23.000 --> 43:27.000] Not at all actually, flying to the US Monday, but no, not at all.[43:27.000 --> 43:31.000] That also would be kind of an interesting[43:31.000 --> 43:35.000] comparison, in that the culture of the United States[43:35.000 --> 43:39.000] is really this culture of,[43:39.000 --> 43:43.000] I would say, more like survival of the fittest, where you work[43:43.000 --> 43:47.000] seven days a week and you constantly, like, don't go on vacation[43:47.000 --> 43:51.000] and you're proud of it, and I think it's not[43:51.000 --> 43:55.000] a good culture. I'm not saying that's a good thing, I think it's a bad[43:55.000 --> 43:59.000] thing, and a lot of times the critique people have[43:59.000 --> 44:03.000] about Europe is like, well, people take vacation all the time and all this,[44:03.000 --> 44:07.000] and as someone who has spent time in both I would say,[44:07.000 --> 44:11.000] yes, that's a better approach.
A better approach is that people[44:11.000 --> 44:15.000] should feel relaxed, because[44:15.000 --> 44:19.000] especially with the kind of work you do in MLOps,[44:19.000 --> 44:23.000] you need people to feel comfortable and happy.[44:23.000 --> 44:27.000] And the question[44:27.000 --> 44:31.000] I was getting at is,[44:31.000 --> 44:35.000] I wonder if there is a more productive culture[44:35.000 --> 44:39.000] for MLOps in Europe[44:39.000 --> 44:43.000] versus the US in terms of maintaining[44:43.000 --> 44:47.000] systems and building software, where the US,[44:47.000 --> 44:51.000] what it's really been good at, I guess, is kind of coming up with new[44:51.000 --> 44:55.000] ideas, and there's lots of new services that get generated, but[44:55.000 --> 44:59.000] the quality and longevity[44:59.000 --> 45:03.000] is not necessarily the same. Where I could see it,[45:03.000 --> 45:07.000] in the stuff we just talked about, is if you're trying to build a team[45:07.000 --> 45:11.000] where there's low turnover and[45:11.000 --> 45:15.000] you have very high quality output,[45:15.000 --> 45:19.000] it seems like maybe organizations[45:19.000 --> 45:23.000] could learn from the European approach to building[45:23.000 --> 45:27.000] and maintaining systems for MLOps.[45:27.000 --> 45:31.000] I think there's definitely some truth in it, especially when you look at the median[45:31.000 --> 45:35.000] tenure of a tech person in an organization.[45:35.000 --> 45:39.000] I think that is actually still significantly lower in the US;[45:39.000 --> 45:43.000] I'm not sure, I think in the Bay Area it's somewhere around a year or so, something like that,[45:43.000 --> 45:47.000] compared to Europe, which I believe is[45:47.000 --> 45:51.000] still fairly low.
Here of course in tech people also like to switch companies more often,[45:51.000 --> 45:55.000] but I would say the average is still more around[45:55.000 --> 45:59.000] two years, something around that, staying with the same company,[45:59.000 --> 46:03.000] also in tech, which I think is a bit longer[46:03.000 --> 46:07.000] than you would typically have in the US.[46:07.000 --> 46:11.000] I think from my perspective, where I've also built up most of the[46:11.000 --> 46:15.000] current team, I think it's[46:15.000 --> 46:19.000] super important to hire good people,[46:19.000 --> 46:23.000] people that fit the team and fit the company culture-wise,[46:23.000 --> 46:27.000] but also give them,[46:27.000 --> 46:31.000] let them not be in a sprint all the time.[46:31.000 --> 46:35.000] It's about having a sustainable way of working, in my opinion,[46:35.000 --> 46:39.000] and that sustainable way means you should definitely take your vacation,[46:39.000 --> 46:43.000] and I think usually in Europe we have quite generous,[46:43.000 --> 46:47.000] even by law, vacation. I mean, in the Netherlands by law you get 20 days a year,[46:47.000 --> 46:51.000] but most companies give you 25, many IT companies[46:51.000 --> 46:55.000] 30 per year, so that's quite nice,[46:55.000 --> 46:59.000] and people do take that. So culture-wise it's really that everyone[46:59.000 --> 47:03.000] likes to take vacations, whether that's C-level or whether that's an engineer on a team,[47:03.000 --> 47:07.000] and in many companies that's also really encouraged,[47:07.000 --> 47:11.000] to have a healthy work-life balance.[47:11.000 --> 47:15.000] And of course it's not only about vacations, but also growth opportunities,[47:15.000 --> 47:19.000] letting people explore and develop themselves[47:19.000 --> 47:23.000] and not always pushing for max performance.[47:23.000 --> 47:27.000] So really, at least I always see it like a partnership:[47:27.000 --> 47:31.000] the organization wants to get something from an[47:31.000 -->
47:35.000] employee, but the employee should also be encouraged and developed[47:35.000 --> 47:39.000] in that organization, and I think that is something that, in many parts of[47:39.000 --> 47:43.000] Europe, there is big awareness for.[47:43.000 --> 47:47.000] So my hypothesis is that[47:47.000 --> 47:51.000] it's possible that Europe becomes[47:51.000 --> 47:55.000] the new hub of technology,[47:55.000 --> 47:59.000] and I'll tell you why. Here's my hypothesis: the reason why is that,[47:59.000 --> 48:03.000] in terms of machine learning operations,[48:03.000 --> 48:07.000] I've already talked to multiple people who know the[48:07.000 --> 48:11.000] data around it, like big companies, and they've told me that[48:11.000 --> 48:15.000] it's going to be close to impossible to hire people soon,[48:15.000 --> 48:19.000] because essentially there are too many job openings[48:19.000 --> 48:23.000] and there aren't enough people that know machine learning, machine learning operations, cloud computing.[48:23.000 --> 48:27.000] And so the American culture, unfortunately, I think[48:27.000 --> 48:31.000] is so cutthroat that they don't encourage[48:31.000 --> 48:35.000] people to be loyal to their company,[48:35.000 --> 48:39.000] and in addition to that, because there is no universal healthcare system[48:39.000 --> 48:43.000] in the US,[48:43.000 --> 48:47.000] it's kind of a prisoner's dilemma where nobody[48:47.000 --> 48:51.000] sees each other and so they're constantly optimizing.[48:51.000 --> 48:55.000] But in the case of machine learning it's a different[48:55.000 --> 48:59.000] industry, where you do really need to have[48:59.000 --> 49:03.000] some longevity for employees because the systems are very complex[49:03.000 --> 49:07.000] to develop, and so the culture of Europe,[49:07.000 --> 49:11.000] which is much more friendly to the worker, I think[49:11.000 --> 49:15.000] could lead to Europe having[49:15.000 --> 49:19.000] a better outcome for machine learning
operations.[49:19.000 --> 49:23.000] So that's one part of it, and then the second part of it is what the US[49:23.000 --> 49:27.000] has done versus what Europe[49:27.000 --> 49:31.000] has done: if I compare Europe versus the US in terms of[49:31.000 --> 49:35.000] data privacy, I think the US has dropped the ball[49:35.000 --> 49:39.000] and they haven't done a good job at it, but Europe has actually[49:39.000 --> 49:43.000] done much, much better at holding tech companies accountable.[49:43.000 --> 49:47.000] And I think if you asked[49:47.000 --> 49:51.000] well-informed people if they would like some of the[49:51.000 --> 49:55.000] practices of the United States tech companies to change, I think most[49:55.000 --> 49:59.000] well-informed people would say, we don't want you to recommend[49:59.000 --> 50:03.000] bad data like extremist video content,[50:03.000 --> 50:07.000] I mean, there are people that are extremists that love it,[50:07.000 --> 50:11.000] or we don't want you to sell our personal information without our consent.[50:11.000 --> 50:15.000] So it could also lead to a better[50:15.000 --> 50:19.000] outcome for the people[50:19.000 --> 50:23.000] that are using machine learning and AI in Europe.[50:23.000 --> 50:27.000] So I actually suspect, and this is my hypothesis,[50:27.000 --> 50:31.000] who knows if I'm right or not, is that I think Europe could be[50:31.000 --> 50:35.000] the leader from, let's say, 2022 to[50:35.000 --> 50:39.000] 2040 in AI and ML because of[50:39.000 --> 50:43.000] the culture. But I don't know, that's just one hypothesis I have.[50:43.000 --> 50:47.000] Yeah, I think around what you mentioned before,[50:47.000 --> 50:51.000] around the fact that perhaps turnover in tech companies here in Europe[50:51.000 --> 50:55.000] is lower, I think that definitely helps you build systems that survive the test of time as well,[50:55.000 --> 50:59.000] right? I mean, everyone has had the case when a key engineer[50:59.000 -->
51:03.000] off-boards from a team, leaves the company, and then you need to[51:03.000 --> 51:07.000] hire another person, right? It's long periods of not being super productive,[51:07.000 --> 51:11.000] long periods of not being super effective, so you continuously[51:11.000 --> 51:15.000] lose the traction that you need.[51:15.000 --> 51:19.000] So I think you could be right there, that in the[51:19.000 --> 51:23.000] longer run, when systems really need to be matured and developed over a[51:23.000 --> 51:27.000] longer time, Europe might have an edge there,[51:27.000 --> 51:31.000] might be a bit better suited to do that.[51:31.000 --> 51:37.000] The salaries are still higher in the US, though, and also I think many US companies are starting to enter more,[51:37.000 --> 51:41.000] from a people perspective, with remote work and everything; they're starting to also[51:41.000 --> 51:45.000] poach more and more engineers from Europe, because[51:45.000 --> 51:49.000] of course vacation and everything and having a healthy work-life balance[51:49.000 --> 51:53.000] is one thing, but for many people, if someone[51:53.000 --> 51:57.000] gives you a 50% higher paycheck, that's also a strong argument.[51:57.000 --> 52:01.000] So it's actually difficult for Europe to[52:01.000 --> 52:05.000] keep the engineers here as well.[52:05.000 --> 52:09.000] No, I will say this though: if you work remote from[52:09.000 --> 52:13.000] Europe, that's a very different scenario than living[52:13.000 --> 52:17.000] in the US, because you'll see that,[52:17.000 --> 52:21.000] unfortunately, the United States since about 1980[52:21.000 --> 52:25.000] has declined, and[52:25.000 --> 52:29.000] the data around the US is pretty dire.[52:29.000 --> 52:33.000] Actually, the life expectancy is one of the[52:33.000 --> 52:37.000] lowest in the world for a G20 country,[52:37.000 --> 52:41.000] so then if you walk through the major[52:41.000 --> 52:45.000] cities of the US there's just poverty[52:45.000 --> 52:49.000] everywhere, like people are
living in very low[52:49.000 --> 52:53.000] quality conditions, where every time I go to Europe,[52:53.000 --> 52:57.000] I go to Munich, I go to London, I go to wherever,[52:57.000 --> 53:01.000] basically the cities are beautiful[53:01.000 --> 53:05.000] and well maintained. So I think in the case that a US company[53:05.000 --> 53:09.000] lets a European live in Europe and work[53:09.000 --> 53:13.000] remote, yeah, that could work out, because the European,[53:13.000 --> 53:17.000] an EU citizen, has amazing[53:17.000 --> 53:21.000] healthcare, they have the[53:21.000 --> 53:25.000] safety net, their cities aren't[53:25.000 --> 53:29.000] as highly unequal. But I think, with the[53:29.000 --> 53:33.000] US in its current form,[53:33.000 --> 53:37.000] I personally wouldn't recommend[53:37.000 --> 53:41.000] someone from Europe moving to the US, because,[53:41.000 --> 53:45.000] just to be totally honest, Europe is already a[53:45.000 --> 53:49.000] great place to live[53:49.000 --> 53:53.000] if you're already in Europe. And on the flip side, I think[53:53.000 --> 53:57.000] there are a lot of Americans actually who are very interested in[53:57.000 --> 54:01.000] universal healthcare in particular, which is not even[54:01.000 --> 54:05.000] possible in the US because of the politics in the US,[54:05.000 --> 54:09.000] and a lot of medical bankruptcies occur.[54:09.000 --> 54:13.000] And so from a startup perspective as well,[54:13.000 --> 54:17.000] this is something that people don't talk about in America. It's like, yeah, we're all about[54:17.000 --> 54:21.000] startups; well, think about how many more people would be able to[54:21.000 --> 54:25.000] create a company if you didn't have to worry about going bankrupt[54:25.000 --> 54:29.000] if you broke your arm or you have some kind of[54:29.000 --> 54:33.000] sickness or whatever. So[54:33.000 --> 54:37.000] I think it's an interesting trade-off[54:37.000 --> 54:41.000] situation, and I
would say that the sweet spot might be[54:41.000 --> 54:45.000] you work for an American company and get the higher salary, but you still live in Europe.[54:45.000 --> 54:49.000] That would be the dream scenario. I think that's why many people are actually doing it.[54:49.000 --> 54:53.000] I think especially since Covid started you can really see it;[54:53.000 --> 54:57.000] before that it wasn't really a thing, working for a US company[54:57.000 --> 55:01.000] that really sits in the US while you're fully remote, but I think now, since two, two and a half years,[55:01.000 --> 55:05.000] it's really becoming reality actually.[55:05.000 --> 55:09.000] Interesting, yeah. Well,[55:09.000 --> 55:13.000] hearing a lot of your ideas around[55:13.000 --> 55:17.000] startups and what you're doing, and[55:17.000 --> 55:21.000] also about how you use SageMaker,[55:21.000 --> 55:25.000] is there any place that someone can get a hold of you[55:25.000 --> 55:29.000] if they listen to this on the O'Reilly platform, or[55:29.000 --> 55:33.000] find content that you're developing yourself, or any other information you want to share?[55:33.000 --> 55:37.000] Yeah, definitely. So I think the best place to reach out to me, and I'm always[55:37.000 --> 55:41.000] happy to receive a few messages and have a good chat or a virtual coffee,[55:41.000 --> 55:45.000] is via LinkedIn. My name is here; that's how you can find me on LinkedIn.[55:45.000 --> 55:49.000] I'm also at conferences here and there, well, in Europe mostly;[55:49.000 --> 55:53.000] typically when there is an MLOps conference you're probably going to see me there[55:53.000 --> 55:57.000] in one way or another. That is something as well.[55:57.000 --> 56:01.000] Cool, yeah. Well, I'm glad we had a chance to talk;[56:01.000 --> 56:05.000] you taught me a few things that I'm definitely going to follow up on,[56:05.000 --> 56:09.000] and I really appreciate it, and hopefully we can talk again soon.[56:09.000 --> 56:13.000] Thanks a lot for the chat. Okay, all right.