How to build better data platforms with Jessica Larson, Author, and Data at Pinterest
Welcome to The Canvas Podcast, where we bring business and data leaders together to discuss making data easier for everyone. I am super excited to have a friend and former coworker at Flexport, Jessica Larson, who has been a data engineer and analytics engineer at top companies like Eaze, Flexport, and now Pinterest, and who has just literally written the book on Snowflake access control.
Jessica, tell us about yourself.
Thanks for having me on. Yeah, so I started my career on the analytics side as an analytics engineer. I liked some of the analyses I was doing and got into it. When I started at Flexport, I was a data analyst. I realized that that wasn't necessarily my favorite thing to do, and that was when I switched to the data engineering side.
And that's kind of what I've been doing since then. So I was really focused on some system stuff at Flexport. After that, when I was at Eaze, I was doing a lot of pipelining, really sexy, fun, real-time stuff. We had a million use cases for analytics data there.
And then at Pinterest, I'm actually a platform engineer, so I'm not doing any pipelining. I'm actually just building tools for the data engineers doing the pipelining, and doing a lot of work with stakeholders all over to ensure that everybody has the things they need to interact with Snowflake.
And then, yeah, I just wrote the book. So, that came out in March this year.
And how'd you get your start in data?
Yeah. So, I studied cognitive science and did a minor in computer science at Cal. I was interested in the [00:02:00] intersection of the human brain and computers; in college, it was more on the neuroscience side. And I worked in a lab where we did some really cool stuff. But after I graduated, I found myself in this weird situation because I was too technical for many of the roles that I was interested in. I wasn't getting those roles because I was being told, hey, you're not gonna be happy in this role.
It's not technical enough. This is more of a writing thing, whatever. While at the same time, I was applying for software roles and being told the exact opposite: you're not technical enough for this. Which is the most frustrating thing in the whole entire world, especially cuz I was literally not even one semester away from having a double major.
But here we are. So data kind of ended up being that thing, right? Because with analytics, as long as you have a pretty good understanding of [00:03:00] experimental design, some stats, and the ability to learn SQL, or have already learned SQL, it's pretty easy to get into, and it was pretty easy for me to get into with my skill set.
And then I just really liked it. It was really exciting. It's really fast-moving. It's really hard to get bored with data, which is important for me, cuz I get bored really easily. So, yeah, I found my home in data.
What inspired the change from the analyst role and thinking about analytics to then going more towards the platform side, and what made you dive deeper down the stack?
So, I think the immediate answer to that is novelty and boredom. But when I was on the analytics side, when I was an analyst, I really struggled because I felt [00:04:00] like, in my day-to-day, I wasn't actually getting any closer to where I wanted to be. And I felt like I was kind of diverging from my career path.
I was really bored with SQL. It's not a functionally complete language. You can only spend so much time in SQL before you start going stir-crazy. So I moved over to the data engineering side. And then, yeah, I've kind of just been all over the place, cuz I do my best work when I'm working on things I've never done before.
And I need that novelty. I need something new. All of that learning is what keeps me engaged and working on getting these projects across the finish line. When I'm working on something that I've done a bunch of times before, it's just really emotionally difficult for me to do.
Tell us about this book you wrote. What inspired it? What's the response been so far?
Yeah, so basically, the genesis for this is that I spoke at the Snowflake conference, not this past one but the year before, on this new feature, secondary roles, that kind of just fixed a lot of things for us from an access control standpoint.
And so my publisher, Jonathan, my now publisher, reached out to me and asked me if I had any interest in writing a book on Snowflake. [00:06:00] And I kind of joked with friends: the only way that you can answer a question like that is, yes, I'm gonna do that. Right?
You have to take these types of opportunities when they come across. And he had a few suggestions, cuz there was kind of already a basic Snowflake book. There was an advanced Snowflake book. And so he was trying to find something that was a little bit more niche and in-depth on a particular subtopic of Snowflake.
And at Pinterest, we use Snowflake for our most sensitive data: our HR data, some sales stuff, some financial reporting, et cetera. And so I was just constantly, day in and day out, working on security-related things. And so I thought, okay, yeah, let's do access control.
Let's dive deep into that. Especially because I think there's a huge need right now with GDPR and CCPA, and hopefully a whole lot more to come that protect people who are in the United States but not in California. [00:07:00] Right.
When you think about role-based access control and you think about compliance and security at the companies that you've been at, what are the most common ways in which not doing this well can manifest as business or customer pain?
Well, it can fundamentally be a data quality issue, right? Because if you're not locking it down, people can be mutating prod data. Right. Sorry, my cat. You can end up with some dev data showing up in your production instance, and that can mess up any of the models that you have.
Primarily, I feel like when I see this in practice, the actual [00:08:00] cause is having access control in some tools but not in all tools. So maybe you have very strict access control in Snowflake, but then when you connect it to Tableau, you use a service account.
And so then everybody in Tableau has exactly the same access. And then you end up with your business users going in and putting in all of these additional controls in Tableau, but how are you making sure that those all match up, right? That's a pretty big one. And then also when you look at engineering, right, again, it's all controlled here, but then you're using some service accounts where people have access to all these roles that they shouldn't have access to.
And then, maybe in their dev app or in some bastion or something, they're able to actually change production data, usually not intentionally. Right. Yeah.
What are some overlooked parts of the stack in which poor RBAC can create problems?
A big one is also that downstream transformation piece. So when you're creating the data model, whatever, is this person who's creating this table downstream in this tool only able to do that with the data that they're allowed to transform? Right. That is something I see not controlled all the time.
And it's hard to find the right tool that actually allows you to enforce that access control at that step.
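The downstream check described above, whether someone building a table is only using data they are allowed to transform, can be sketched in a few lines. This is a minimal illustration, not how any particular tool enforces it: the `User` class, the table names, and both helper functions are hypothetical, and a real system would read grants from the warehouse rather than from an in-memory set.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user with the set of tables they are granted read access to."""
    name: str
    readable_tables: set[str] = field(default_factory=set)


def unauthorized_sources(user: User, source_tables: list[str]) -> list[str]:
    """Return the source tables the user may not read, sorted for stable output."""
    return sorted(set(source_tables) - user.readable_tables)


def can_build_model(user: User, source_tables: list[str]) -> bool:
    """A downstream model is allowed only if every one of its sources is readable."""
    return not unauthorized_sources(user, source_tables)


# Made-up grants and model sources, for illustration only.
analyst = User("analyst", {"sales.orders", "sales.customers"})
model_sources = ["sales.orders", "hr.salaries"]

print(can_build_model(analyst, model_sources))       # False
print(unauthorized_sources(analyst, model_sources))  # ['hr.salaries']
```

The point of the sketch is that the check runs at the transformation step, against the person's own grants, rather than against a shared service account that can read everything.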
So you've been at some pretty crazy high-growth companies, and you've seen what makes great data teams. I'm curious to get your perspective on what separates them from the pack.
So a big thing I look for is how engaged the data team is with the entire rest of the company. Right. Or rather, how much the entire company is engaged with the analytics products. Right. I think one thing that I saw at Flexport, that you probably also saw, was that data really powered everything. We had a huge thirst for all of the assets that we were creating. One thing that was super cool at [00:11:00] Flexport was that everybody was always reaching out to us, like, we want more, we want more, we want more.
And I think we had a really good reputation internally. There was a lot of trust. And again, it really felt like the rest of the company cared about us and was like, you're valuable. We want all of the things that you can offer us. Right. Now, I would also say that there were some negatives to that, which was that, in a lot of ways, we were creating products that I think were actually outside of the scope of analytics.
Right. We had a lot of workflow tooling and things that were kind of bridging the product, right, bridging the actual tech behind the company. And I think that that's also pretty common to see, especially in some of these fast-growing companies where you basically have your analytics team working as prototypers.
And so we kind of had a similar thing at Eaze. One of the [00:12:00] things I love to talk about, because I just think it was a really interesting project to work on: in 2019, I believe it was called vape gate. Right. All of a sudden, we realized that vapes are actually not very good for your lungs.
I think not terribly surprising, but it was very surprising to a lot of people. And so then overnight we saw these crazy changes. So, for example, in the city of Santa Cruz, basically with 24 hours' notice, they banned the sale of all vapes in the city.
And that is way too quick of a turnaround to actually do something on the production engineering side. Right? How could you possibly modify your product for that? Right. And the way that we had it, one depot serviced a pretty large area, and that's what our menu would be based on.
And so we could either turn off vapes for that entire area, but unfortunately, that was our most profitable item, [00:13:00] our most popular item, or we could find some way to make it so that people within that city couldn't get vapes. And so we ended up solving it on the analytics side.
We were already catching these backend events of somebody checking out their cart. And we were doing stuff with addresses already to make sure that they weren't in schools or government zones, areas we're not legally allowed to deliver to. We basically added a polygon, forked our process, and said, okay, if it lands in this polygon of the city of Santa Cruz and there's a vape in the cart, then we need to fire it off to customer service to figure out what the customer wants to do with the order.
Right. So there's a lot of just trying to be scrappy, being able to do things very, very, very quickly. So you see that a lot, which is, I guess, maybe not good or bad, right? Because the whole point of an analytics team is to really scale your org quickly.
So [00:14:00] maybe it's not so bad. I didn't mind it because I thought these were really exciting, fun projects to work on. And I frankly love to do things like this. Again, it's the novelty. So I kind of look for that. But if you're really trying to do just very standard analytics, then you probably don't want that, right?
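The Santa Cruz rule described above (delivery address inside a city polygon, plus a vape in the cart, routes the order to customer service) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: the order shape, the `needs_customer_service` helper, and the rectangle standing in for the city boundary are all made up, and the point-in-polygon test is a standard ray-casting check rather than whatever geospatial tooling was really used.

```python
def point_in_polygon(lon: float, lat: float, polygon: list[tuple[float, float]]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def needs_customer_service(order: dict, restricted_area: list[tuple[float, float]]) -> bool:
    """Flag checkout events that contain a vape and deliver inside the restricted area."""
    has_vape = any(item["category"] == "vape" for item in order["items"])
    return has_vape and point_in_polygon(order["lon"], order["lat"], restricted_area)


# Made-up rectangle as (lon, lat) pairs; NOT the real city boundary.
restricted_area = [(-122.08, 36.95), (-121.98, 36.95), (-121.98, 37.01), (-122.08, 37.01)]

order_inside = {"lon": -122.03, "lat": 36.97, "items": [{"category": "vape"}]}
order_outside = {"lon": -121.90, "lat": 36.97, "items": [{"category": "vape"}]}

print(needs_customer_service(order_inside, restricted_area))   # True
print(needs_customer_service(order_outside, restricted_area))  # False
```

Because the rule lives in the event-processing step rather than in the product's menu logic, a new restricted area is just another polygon, which is what made the 24-hour turnaround feasible.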
What are some of the ways that you measure engagement?
I think you can kind of start to approximate it by looking at how many people are spending time on your BI tool. Right. One metric that actually works pretty nicely, that we would use if we needed to decommission something, is: if something breaks, how long until somebody complains, right?
If it breaks and nobody complains, that's a problem, right? That means people aren't using it. And so, if you find a dashboard that's been broken for a month or something, you can probably just get rid of the dashboard if people aren't using it. The big one you see is when people have OKRs that [00:16:00] are based on metrics that come from your analytics warehouse or whatever. That's a big one.
That's less of that bottom-up that you would see with using dashboards and more of a top-down. Right. And so I think if you're in a situation where you don't have engagement with analytics, this is something where leadership can kind of come in and say, hey, I need all of you guys to be more data-driven.
So your OKRs need to have a dashboard. Every team has a dashboard for their OKRs, or something that kind of forces everybody back into the data. Yeah, I'm trying to think. I haven't worked with product analytics too much, but I imagine that you can probably see some of this in the product as well.
If you come out with some new feature and everybody hates it, and it just stays up there for a long time, are you really listening to the data there, right?
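The "broken and nobody complained" heuristic from this answer can be sketched as a small filter. The record fields (`broken_since`, `complaints`), the function name, and the 30-day threshold are assumptions for illustration; the threshold loosely mirrors the "broken for a month or something" remark and is not a rule from any specific BI tool.

```python
from datetime import date


def decommission_candidates(dashboards: list[dict], today: date, max_silent_days: int = 30) -> list[str]:
    """Return names of dashboards that have been broken for a while with zero complaints."""
    candidates = []
    for dash in dashboards:
        broken_since = dash["broken_since"]
        if broken_since is None:
            continue  # Healthy dashboards are never candidates.
        silent_days = (today - broken_since).days
        if dash["complaints"] == 0 and silent_days >= max_silent_days:
            candidates.append(dash["name"])
    return candidates


# Made-up inventory for illustration.
dashboards = [
    {"name": "legacy_funnel", "broken_since": date(2024, 1, 15), "complaints": 0},
    {"name": "weekly_revenue", "broken_since": date(2024, 1, 1), "complaints": 3},
    {"name": "ops_latency", "broken_since": date(2024, 2, 25), "complaints": 0},
    {"name": "okr_summary", "broken_since": None, "complaints": 0},
]

print(decommission_candidates(dashboards, today=date(2024, 3, 1)))  # ['legacy_funnel']
```

A dashboard that breaks and draws complaints quickly is a sign of engagement; one that stays broken in silence is a reasonable first candidate for removal.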
Confusing dashboards are a problem, right? Sometimes we're given so many tools, and then we start to make things so complicated. [00:18:00] When you look at a chart on a dashboard, you should immediately be able to understand what it is.
It can be frustrating to look at some of these overcomplicated graphs where there are too many things going on. There are bars, and there are lines, and there are dots, and there are different colored bars and all of this. And it can be really useful if you are using it to dive down and really look at certain particular things.
But overall, it's usually just sensory overload. And it's really hard for people to be engaged with something that's stressful. Right. I think another thing that you can kind of do to really work on that engagement is just increasing the data literacy of your company.
And I think that in general, we do not do [00:19:00] a great job of training people on how to look at data, how to understand data, how to read it.
Especially when we start to look at things that are probabilistic outcomes, right? This is this particular thing when you look at it and cross it with this. How do we reason about that versus something that isn't relativized?
Right. So I think it's increasing people's understanding of data, and people's understanding of how you control for things. A big one I see is people not having enough controls in their experiments, or in the way that they look at numbers, where you're assuming that this is the factor, but it could be any number of other factors, because so much of the time we're just looking at correlations.
And so it's making sure that we're drawing the right conclusions from what we're seeing.
From a process perspective, how do you think data and business teams should work together? What's the right level of coordination that you've seen work?
Yeah, I would say I don't think I've ever seen a business team and a data team work together early enough. It seems like the engagement between business and analytics is always just a little bit too late. Before the business starts to make a decision, or even before they outline the possible things that they're going to decide, the choices that they have, analytics should be involved to help inform even narrowing it down to those choices.
It really should just be early and often. I [00:21:00] think that you just can't separate analytics from business, and one thing we've seen a lot in the past few years is this push for analytics teams to be more business-driven and more business-focused. But I also think that we need to help our business users kind of understand the nuts and bolts of what we're doing.
Right. It's great for them to be able to understand that this is how this chart works, and this is how we can reason with this. But I'm a huge proponent of having those conversations. I've found it just immediately pays dividends to make sure that, as much [00:22:00] as possible, you are on the same page and you're doing this knowledge sharing.
I've had a few stakeholders who would not call themselves technical people who have seen some data issue and been able to go, hey, by the way, I just noticed this; I think this might be the problem. Right. And that's extremely valuable because they know the data.
They're the ones who are looking at it all the time. I look at it every once in a while, but I have so much other data that I'm thinking about. And so, for them to be able to be like, I think I identified the problem? What a huge win for everybody, right?
We're still working hard as an industry to get data teams to have a seat at the table at the beginning of the project rather than later. Any thoughts here?
When I did do some product analytics, I would be brought in late, and they would ask me, Hey, I have this question that we're trying to figure out. How are people using this new feature, or are people using this new feature? Are they clicking on this?
But in a lot of these situations, there wasn't any tracking for it. And I had to have that conversation of, I can maybe try to approximate it in this way, but maybe it's gonna be a terrible [00:24:00] approximation. Or even situations where I'm like, there's nothing I can do for you.
I don't have the data. It's just not there. Nothing was logged. I can't synthesize data. I can't just make it appear out of nowhere. If you guys didn't start collecting this way before you made the change, then it's totally meaningless, right? Yeah. Especially when you wanna see if it changes people's behaviors, and you don't track it before you change it.
What problems do you think the modern data stack has solved? And maybe what problems remain with the modern data stack, and how do we think about fixing those things?
It's definitely been interesting, because I feel like it was maybe only six months ago or something when somebody, all of a sudden, was like, the modern data stack. And now every single conversation is the modern data stack, the modern data stack. And I had a moment a while back where I was like, what is the modern data stack? This is what I do.
This is my job, but what are they talking about? Somebody just came up with this buzzword. Yeah. Oh, exactly. I'm like, wow, this is embarrassing. But no, I think there are some huge problems that the modern data stack is solving. And I don't think it's dead, but also maybe don't quote me on that, cuz maybe it is, I don't know.
But one of the things that we talk a lot about in data is that personnel is just the hardest part. Right. All of these technical challenges are difficult. The organizational challenges are difficult, but none of those are insurmountable. What is, is not being able to find enough people, right?
And not being able to find the right people for your organization. And probably anybody listening to this is aware of [00:27:00] how crazy hard it is to hire the right people for your data team. And so I think that's one of the big problems that the modern data stack is solving.
There are just, frankly, not enough data engineers. There are not enough data scientists. There are not enough data analysts. There are not enough people who wanna do data modeling, right? There's this huge need to make the people that we have way more efficient and also to offload the less interesting work, right?
The repetitive stuff, the stuff that's just not a huge value add. And that's what we see in the modern data stack. Right. For example, Fivetran basically replaces a few data engineers, which is great because we can't hire them. We want to, but we can't hire them. And it takes so much time to interview candidates, and then you start the search, and they start in eight months.
Great. You have one that's gonna start in eight months, but you need to hire [00:28:00] six, right? So I think that's where you see so much value in Fivetran. You plug it in, and it takes five or 10 minutes to add any new data source. It just works.
So, yeah, I think a lot of it is really just being able to do as much as humanly possible, knowing that you just don't have enough people. And that's just what it is.
What are the biggest problems you think the modern data stack hasn't solved or maybe has created for companies?
So this is one thing I'm highly critical of: every time we're doing [00:29:00] things in a user interface instead of doing things in code, we're making it very hard to migrate away. And so we save the time that we would need in order to hire an entire team to do all of these things.
But if we decide tomorrow that we wanna move to a different tool, we basically have to start from scratch. Right. There may be some ways that you can kind of go look at your logs and extract some things using APIs and kind of parse it and kind of be really smart about it.
But yeah, that's one reason I'm not a huge proponent of things like Tableau Prep. For example, I think we should have as many things as code as possible. Everything is code. Configuration is code. Infrastructure is code, security is code, and everything is code. And now we're kind of getting away from that with those no-code, low-code solutions.