Breaking down the different stages of data maturity with Jake Peterson, Head of Data at Vanta
Welcome to The Canvas Podcast, where we bring together business and data leaders to talk about how to make data easier for everyone. Today, I'm super excited to have a friend and former coworker from Flexport, Jake Peterson, on the show. He is a former data science manager at Facebook, a former analytics engineering manager at Flexport, and currently the Head of Data at Vanta.
Let's start with an overview of your background.
Sure. I've been working in this data science and analytics game for probably 16 years now.
I started at a company called Axiom doing predictive analytics. So statistical programming, marketing consultation, and marketing analytics for all kinds of clients in the direct marketing and email marketing space. It's really where I got my chops. And then, from there, I've had a pretty kind of weird career.
I ran analytics at a few startups, and back then, in 2009, a lot of the open-source packages for analytics didn't exist. For most of the tooling, you would either buy something very basic or build pretty much all of it yourself.
And getting data out of your other SaaS and cloud systems was super difficult. Most of them didn't have APIs. If we could have had the data, we could have done all these analytics, but we had access to none of it. If you don't have raw data flowing through your systems, there's no analytics to do and no operational improvements to make. It's a real challenge.
I went to Facebook after that and got in on the big data thing. I was an early data science hire at Facebook in 2011. I don't know if that counts as early. Maybe now it does, given how big the company has gotten, but when I joined, it was already over a thousand people.
I did a lot of different things there. I built the search data science team and the platform data science team. Then I had a brief stint in venture capital doing some data science, and then Flexport. In general, my career has spanned pretty much everything in the analytics and data science space, but it's largely been about taking raw data and repackaging it into something that changes the business and drives it forward.
What made you get started with data in the first place? What drew you to it?
I studied simulation statistics in college, but I wasn't really all that into it. And my first job outta college was actually like a production engineering role, like producing emails and producing websites and stuff like that.
I spent a good part of my days writing HTML, and it was a nice job, but it really wasn't for me. A lot of the work that I actually got excited about was working with data. I spent a lot of time working with data for clients.
Our analytics team was run by a guy named Todd King, and he tapped me on the shoulder and said, "Hey, I think you'd actually like doing this analytical work, and I think you'd be good at it. You should join me and do this." And he was totally right. It was a massive improvement for me in terms of quality of life, the challenge of the problems, and the work we were doing.
I happened to just fall into it and then really fell in love with it. Ironically, the thing that really got me interested in working with data was programming in SAS, the statistical analysis software tool.
I have kind of a bad relationship with SAS. There are parts of it that are really abusive and horrible, like the macro language, but there are parts of it that are incredible and amazingly powerful. The DATA step stuff is really intuitive and really powerful.
It's probably the best data transformation language. It has this DATA step, which is super useful. And I started to get a feel for the things you could do once you got your hands on SAS and some raw data. At Axiom, we started writing recommendation engines in SAS and doing cluster analysis in SAS.
And we were getting good results for our clients, so it was a really awesome transition for me. I started to see the light in analytics there: if the conditions are right, if the tools are there, if you have data coming through, and you have clients or partners who can use the data, you can generate a ton of impact. Sometimes it works almost like magic.
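For readers who haven't used SAS: a DATA step reads an input data set one row at a time, applies per-row logic, and writes an output data set. Here's a rough Python sketch of that row-wise pattern; the field names and the derived `total` column are made up for illustration.

```python
# Row-wise, DATA-step-style transform: read each record, derive new
# fields, filter, and emit the result. SAS's DATA step makes this
# pattern first-class; this is just an analogue with invented fields.

def data_step(rows):
    """Yield transformed rows one at a time, like a SAS DATA step."""
    for row in rows:
        out = dict(row)                           # carry input fields forward
        out["total"] = row["price"] * row["qty"]  # derived column
        if out["total"] <= 0:                     # like a subsetting IF: drop rows
            continue
        yield out

orders = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 5.0, "qty": 0},  # dropped: total is 0
]
result = list(data_step(orders))  # one surviving row, with total == 30.0
```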
You've been at some pretty operationally intensive companies. Flexport, obviously, and now you're the head of data at Vanta. What draws you to those problems, and what type of challenges does that present when it comes to data?
Yeah, I don't know if I'm drawn to them but I think there's something deep inside me that looks at situations that are just like gnarly and then says, oh, I should do that.
It'll be painful, and I probably have a bad decision function inside, but I think part of the fun of working with operational data is that you actually make your partner operations teams' jobs easier. It's tremendously rewarding when you build something for them that they really use and that makes their lives a lot better.
And now I'm at Vanta, which is a security monitoring and compliance platform. Traditionally, the way you proved your security, and that you manage consumer data safely and with the proper respect, was to go to an accounting firm and get a compliance audit done.
So this is where SOC 2 Type 1 and Type 2 and PCI all come from. The process of doing that is extremely painful. These are standards designed by accounting firms, so they're not necessarily set up for modern cloud infrastructure and cloud software. There's tons of room for automation, and Vanta delivers the software to do just that.
You connect your cloud systems to Vanta, and it provides a single pane of glass to see your security posture and how you're doing as a startup. It can turn the VP of engineering into a virtual CSO without a ton of work involved. And that was really attractive to me.
Security is one of the areas I'm trying to go deeper on in my career. I've worked with a bunch of different security teams in the past, and I felt like their tools were just not up to snuff for executing their jobs, so I've developed a lot of sympathy for them.
I love the current gig here for that. And as far as generating impact goes, being on the data analytics team means you can really influence the product. Why wouldn't we want to bake analytics best practices into a security monitoring product? Why wouldn't we want to take the learnings we have internally from running our own Vanta and running our own data practice and push them out into the product? So to me, that's a great gig.
Business teams often lack understanding of what data teams do on a day-to-day basis. What are some of the best practices that you try to bring to a company in how to work with the business team?
I think it comes back to your whole strategy of how you're going to build your function within the company. So I think the most vanilla framing for this that's out on the internet, and you can Google all over the place, is called data maturity.
And I actually have data maturity goals at Vanta. I built a framework around technology, process, and culture, and I make sure all of those loops land in the company with a data lens on them. Because once you run the function, you realize, oh man, I actually need all of these things to work.
I can build the most amazing system, but if nobody logs in, nothing is going to happen. If I don't train them how to use it, nothing is going to happen. And if leaders don't reward people for coming up with cool analytical solutions to problems, then the system is not going to get better, it's not going to improve, and nothing is going to land.
So when you're thinking about working with the business, you have to start with that foundation, right? Step one is to assess where everything is and build your foundational assessment. And I think using a data maturity framework works really well for this.
You have to ask where the business and the teams within it are at. What are all their core constituencies? Then figure out where you're at, figure out where you want to be, and staff appropriately along the way. Because there are different levels of where you want to be.
To achieve that, you need a partnership between the data team and the business teams. As an example, I think you want to think about staffing levels in terms of how much investment you're going to make in analytical data professionals, and the range is somewhere between 2% and 13% of total staff.
When your data team is 2% of the staff of a company, once you hit a couple of hundred people, that's the bare minimum for running a data team. If your data warehouse people, analysts, and data scientists add up to 2% of staff, you're barely keeping the lights on, right?
At that stage, you can expect that your data warehouse and your data stack are not running super reliably, with errors and outages happening regularly. You can expect opaque measurement where nobody knows how it works, all that stuff. So that's the bare-basement staffing level. At 13%, that's what you want when you want to lead the industry. That's the highest staffing level I've ever heard of, and I think that was Stitch Fix's ratio. Their strategy was effectively burning the boats.
Stitch Fix said, we don't make a sale unless it goes through the recommendation engine, which is a serious commitment to the staffing of your data analytics. It's basically, hey, we're building the entire company on this premise. Okay, so now you've got your range of staffing levels for how you interact with the business and what you can do.
Now, within that, you've got to decide: what are the click stops of what I can do? Do I want to barely keep the lights on? Or actually, I'd like to do more than keep the lights on; I'd like to generate company-level projects. Do I want my partner teams to have embedded analysts, or do I want to drive my partner teams through analytics?
And then you've got to match your staffing levels to those choices, and you have different models for engaging depending on what staffing level you want. A lot of companies start at barely keeping the lights on, and then they go, okay, actually, we're leaving a lot on the table.
They say, "Let's go to project-level embedding." They embed at the project level: they develop a project intake process to partner with business teams and figure things out, and those teams act as a product- or project-focused data team. And then from there, they go, okay, look, we're actually still leaving value on the table.
Then they say, "Let's go to direct embeds, a direct analyst or data scientist." One person per team, for every team we've got, staffed up that way. Then you're starting to get into the 6% to 10% of staff level, but your process is really easy, right? Everybody's got a guy, and that guy's on your team.
The last stage is, "Okay, we're going to run the whole business through this group." I can only think of Stitch Fix as a company at this stage; I can't think of anyone else. Even Facebook, which has tremendous data science and analytical staffing, doesn't staff to that level.
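The staffing tiers Jake describes reduce to simple arithmetic on headcount. Here's a small sketch; the tier labels and the exact cutoffs between his 2%, 6%-10%, and 13% ranges are an illustrative reading of the conversation, not a formal model.

```python
# Rough staffing tiers for a data org, per the ranges discussed:
# ~2% of headcount keeps the lights on, ~6-10% supports direct
# embeds on every team, ~13% runs the whole business through data
# (the Stitch Fix ratio). Labels are paraphrased, not official.
TIERS = [
    (0.02, "keep the lights on"),
    (0.06, "project-level embedding"),
    (0.10, "direct embeds on every team"),
    (0.13, "run the business through data"),
]

def data_team_size(total_staff, ratio):
    """Headcount implied by a staffing ratio, rounded to whole people."""
    return round(total_staff * ratio)

# A 200-person company at the bare-minimum 2% ratio -> 4 data people.
print(data_team_size(200, 0.02))
```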
But what does stage zero look like? And what are some signals that founders and leaders should be looking at to say, okay, we should probably think about hiring a data team?
Yeah, it's a good question. In general, it's that you don't feel confident in your numbers, right? You're not sure that people are using your tools correctly. Your goal setting, goal achievement, and project definitions are almost all narrative-based, and you feel like the narratives are not necessarily hitting the mark. It's like, this group is telling me one story, that group is telling me another story, but no one's providing a number, and I can't tell what's going to happen.
And I think that's usually the point where you say, all right, we need someone, or a team, to make this easy. Because personally, I really like the story you're trying to tell as a business. The business, your products, and the projects you're developing are super important. It's probably the most important thing.
But when those stories start conflicting with each other is usually when building a data analytics team makes sense. Because the quantitative side of those decisions is something that, once you agree on your assumptions, everyone can definitively agree on. It becomes a really powerful coordination tool.
So I think that's roughly the state of things, and people generally have a sense of it. And to be fair, having an analyst is a bigger-company thing, right? It's not really a five-person startup thing. You shouldn't be hiring a data engineer when you've got five people.
That's just too early. It starts to make sense when you've got systems all across the company, because even little startups have tons of SaaS; we've got something like ten SaaS tools per person. Once all of those start generating data and you want to pull it together and generate cross-departmental insights, okay, now it makes sense that we need a team focused on this infrastructure.
It's really critical for goal setting. Because when you're an early company, you still have a lot of creativity left that you have to execute on. There's no success function sitting around, unless you're a hedge fund, that you can just throw a ranking engineer at and be like, oh, sweet.
There's a lot of storytelling in the market, and you have to go do storytelling internally. You go make things work. And then having that fixed point of real facts on the ground is super unifying; it coordinates everyone towards the same goal, heading in the same direction, where ideally you can distribute it.
That's the dream: in your data system, you actually do have this success or ranking function, and I can distribute it, and people can go hack on it. Then someone will independently come up with some amazing result, and we'll be off to the races.
From a tooling perspective, what's the order of operations that you think about when you come into a company as you did with Vanta? What is the order of operations of problems that you look to solve first?
Top-down. First and foremost, I think you want to set the company metrics, unless there are other burning projects sitting around that need to be codified. So build the company metrics first. In general, that helps pull the infrastructure out of it.
Because you've got a concrete thing you're delivering, something people can use, and then the infrastructure and technology decisions get made after that, because you'll have a sense of what kind of beast you're dealing with. Especially as it relates to the size and diversity of the data and the data sets you've got, right?
If you're an internet-of-things company with five sensors generating gajillions of data points but only four or five tables in the data warehouse, you're going to do a very different set of things.
Then there's Flexport, which has a real diversity of data sets and sources and a very real diversity of transformation needs and team needs. You'd pick very different stacks at those two companies. So you start with the company metrics, top-down, based on what you want to do, then look at the underlying systems feeding them based on data size, data diversity, and the number of systems you're going to optimize.
For Vanta, security is pretty deterministic, right? So we don't have much demand for traditional data science outside a few places. There are definitely some opportunities there that we're working on, but it's not like we're going to rank-order your policies. No, you need those set in stone.
So that changes the tools we use, too, because we have a deterministic business; it's more important for these things to be deterministic than fuzzy, probabilistic, and ranked. It's very different from Facebook, where we had huge data flowing through, and most of the decisions would be fine if we were off by a few percent. Even if we lost a few percent of logging rows, we'd be like, eh, whatever, no big deal, we've got another million. That's just not the case in other businesses.
Now that you've seen a few of these stacks built out in systems, what problems do you think that the modern data stack has solved really well? And what problems do you think it has created, or what problems do you think remain for it to solve?
It's a game-changer. If I'd had it in 2009, it would've been incredible; we would've been able to do so much more than we could then. It's just taken a lot of evolution to get to this state, where we have a structured model like all of the other cloud tools.
Now we can provide data through data sharing, and we'll provide it through standardized APIs. The EL side works tremendously well, I think. Transform is still not all that great yet, and there's a lot of traditional data warehousing and engineering that has not been modernized.
I think Maxime Beauchemin from Preset talked about this on a podcast a while ago: that the data engineering practice of dimensional modeling, of transforming data into metric shape, is still not up to snuff with where the modern data stack ought to be.
So the things that are working really well: EL works really well, and cloud warehouses like Snowflake, Starburst, and BigQuery work really well. If you have a small data warehouse, they're great, and they just scale up to whatever size you need, which is great.
So you just pick a commodity partner, and then you don't have to make another decision for a really long time, which is great. I think the visualization tools are mixed: some are great, some not so great. And then transform, native data engineering, still has a long way to go.
I think a great vision of the world would be: why can't I define my transforms in a language and then not have to worry about all these connections, even though there's not that much work in them currently? I should be able to model out from my production objects how my source application is defined and how the reports and data get generated. I still think there's a lot of opportunity there, and I'm surprised no one's gone and built it. I'm surprised there's no modern Ruby on Rails for SAS.
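One hypothetical shape for that "define my transforms in a language" idea: declare each model as a named SQL body with explicit dependencies, and let a tiny runner work out the build order, roughly the pattern tools like dbt follow. All the model and table names below are invented for illustration.

```python
# Minimal declarative transform layer: each model is a name, a SQL
# body, and the models it depends on. A topological sort over the
# dependency graph gives the order to build them in.
from graphlib import TopologicalSorter

MODELS = {
    "stg_orders":  {"sql": "select * from raw.orders", "deps": []},
    "stg_users":   {"sql": "select * from raw.users",  "deps": []},
    "fct_revenue": {
        "sql": "select user_id, sum(amount) from stg_orders group by 1",
        "deps": ["stg_orders", "stg_users"],
    },
}

def build_order(models):
    """Return model names in dependency order (upstream models first)."""
    ts = TopologicalSorter({name: m["deps"] for name, m in models.items()})
    return list(ts.static_order())

order = build_order(MODELS)  # staging models come before fct_revenue
```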
Most of the techniques are still pretty old. We haven't really updated Kimball's dimensional and fact modeling for presenting data for analytics. All the new things in data science seem to be largely in huge data and deep learning, versus the mass-market problems.
I think the Tidyverse stuff, what Hadley has been doing, is really impressive. But even that is still the last piece of a seven-link data value chain, right? Why hasn't anyone figured out how to capture this in two links of a chain? Maybe it can't be done; it might just be structural. But to me, instinctually, it seems like there should be some vertical integration of the data value chain.