<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Writing on Vinoo Ganesh</title>
    <link>https://vinoo.io/writing/</link>
    <description>Recent content in Writing on Vinoo Ganesh</description>
    <image>
      <title>Vinoo Ganesh</title>
      <url>https://vinoo.io/img/vinoo.jpg</url>
      <link>https://vinoo.io/img/vinoo.jpg</link>
    </image>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="https://vinoo.io/writing/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Definitive Guide to Forward Deployed Engineering</title>
      <link>https://vinoo.io/writing/2026-02-05-forward-deployed-engineering/</link>
      <pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2026-02-05-forward-deployed-engineering/</guid>
      <description>A comprehensive guide to Forward Deployed Engineering - the role, the skills, the career path, and how Palantir&amp;#39;s Project Frontline scaled FDE culture across the industry.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://nextplayso.substack.com/p/the-definitive-guide-to-forward-deployed">Next Play</a></em></p>
<p>If you work in tech, you should learn about Forward Deployed Engineering (FDE). It&rsquo;s quickly become one of the most sought-after roles at nearly every fast-growing AI company&hellip;and at the same time, it&rsquo;s the type of role very few people and companies seem to <em>actually</em> understand.</p>
<p>I designed Project Frontline at Palantir, a program that sent over 250 engineers into live customer deployments. Those engineers are now at OpenAI, xAI, Anduril, and dozens of leading companies.</p>
<h2 id="the-bottom-line">The Bottom Line</h2>
<p>The Forward Deployed Engineer (FDE) has become the most valuable credential in tech, and most people don&rsquo;t understand why.</p>
<p>If you look at job postings at OpenAI, xAI, Anthropic, Helsing, Anduril, Scale, or Palantir, you&rsquo;ll notice they&rsquo;re not just looking for software engineers. They&rsquo;re looking for engineers who can deploy, who can sit with customers, who can make software work in the real world.</p>
<p>Here&rsquo;s what you need to know:</p>
<p>An FDE is a software engineer who owns customer outcomes. Not customer relationships, not customer satisfaction scores, but outcomes: the actual results the customer is trying to achieve with the software. This distinction sounds subtle, but it changes everything. A sales engineer&rsquo;s job is to help close the deal. A solutions architect designs how the product fits the customer&rsquo;s environment. An FDE&rsquo;s job is to make sure the customer actually wins.</p>
<p>The critical insight is that you can&rsquo;t easily hire your way to a true FDE organization. The skill set is too rare, too specific, too dependent on experiences most engineers never have. Traditional engineering jobs don&rsquo;t build this skill set because they&rsquo;re structured to insulate engineers from customer reality. You write code, you ship it, you move to the next ticket. You never feel the weight of a customer depending on your work, never see what happens when your elegant system meets messy real-world data, never have to look someone in the eye and explain why the thing they needed isn&rsquo;t working.</p>
<p>So how do you hire forward deployed engineers? You don&rsquo;t. You grow them.
This was Palantir&rsquo;s critical insight, and it took us years to fully understand it. The only way to build an FDE organization is to take talented people and put them through a crucible that transforms how they think about building software.</p>
<p>Why this matters now: Three forces are making the FDE model the default rather than the exception.</p>
<p>First, AI is making deployment harder, not easier. AI systems require fine-tuning, prompt engineering, evaluation, and ongoing adjustment. You can&rsquo;t throw an LLM over the wall and expect customers to figure it out. OpenAI, Anthropic, xAI, and every serious AI company is building FDE-style functions, even if they don&rsquo;t call them that. There&rsquo;s a related point here about what it means to be an engineer at all: if AI can do much of what junior engineers used to do, the engineers who remain valuable are the ones who can do what AI can&rsquo;t, which is sit with a customer, understand their real problem, and own the outcome.</p>
<p>Second, defense tech is exploding. Anduril, Palantir, Shield AI, and dozens of startups are building software for military and government customers who can&rsquo;t tolerate deployment failure. These customers operate in austere environments with unique constraints, and you can&rsquo;t support them with a helpdesk. You need engineers in the field.</p>
<p>Third, enterprise buyers are getting smarter. After decades of shelfware, enterprises are starting to demand implementation guarantees. They want to see engineers on site and know that someone will own the deployment, not just the sale. The vendors who can deliver this will win.</p>
<p>Five years from now, I expect every serious enterprise software company to have an FDE function. If you&rsquo;re thinking about where to invest your career, this is the trend to bet on.</p>
<h2 id="why-this-matters-now">Why This Matters Now</h2>
<p>Three forces are making the FDE model increasingly standard:</p>
<p><strong>First, AI is making deployment harder.</strong> AI systems require fine-tuning, prompt engineering, evaluation, and ongoing adjustment. You cannot simply deploy an LLM and expect customers to figure it out. Companies like OpenAI, Anthropic, and xAI are building FDE-style functions, even if they use different terminology.</p>
<p><strong>Second, defense tech is exploding.</strong> Companies like Anduril and Palantir build software for military and government customers who cannot tolerate deployment failure. These customers operate in austere environments with unique constraints that require engineers in the field, not helpdesk support.</p>
<p><strong>Third, enterprise buyers are getting smarter.</strong> After decades of expensive software sitting unused, enterprises are demanding implementation guarantees and want to see engineers on site who own deployments, not just sales.</p>
<h2 id="the-grey-track-jacket">The Grey Track Jacket</h2>
<p><img alt="My Grey Track Jacket" height='Ұ' loading="lazy" src="/writing/2026-02-05-forward-deployed-engineering/grey-track-jacket_hu_f738f0516d238b25.webp" width='̠'></p>
<p>At Palantir, everyone got a black track jacket. Standard issue. But there was another jacket, a grey one, that you couldn&rsquo;t buy or request.</p>
<p>The mythology was that your black jacket had faded through years of experience, that you&rsquo;d worn it through enough deployments and enough hard field work that the black had burned away and become grey. Leadership awarded them when you had proven yourself in the field.</p>
<p>Brian Schimpf, now CEO of Anduril, awarded these jackets when he was one of Palantir&rsquo;s senior engineering leaders. When I got mine, I didn&rsquo;t feel pride or accomplishment. I felt the weight of what it represented: the late nights, the systems that failed, the customers who&rsquo;d depended on me when things went wrong. The jacket wasn&rsquo;t a trophy. It was an acknowledgment that you&rsquo;d been through something, and that something had changed you.</p>
<p>When engineers ask me how to become an FDE, they want skills to develop, certifications to get, keywords for their resume. But that&rsquo;s not what this is about. The jacket was just the symbol. The transformation is what matters.</p>
<h2 id="my-transformation">My Transformation</h2>
<p><em>If you want concrete recommendations on how to become an FDE, skip to &ldquo;The Playbook&rdquo; at the end. But if you want to understand why this transformation matters and what it actually feels like, read on.</em></p>
<p>I joined Palantir on the product development side in Palo Alto. I fixed Jira tickets, wrote code, built tests, did code reviews, architected systems. I attended sprint planning and hit my deadlines. By every internal metric, I was doing good work.</p>
<p>But once I rotated into the field, what I saw broke how I understood my own job.</p>
<p>We had a product called Phoenix. It worked beautifully for the use case it was designed for. Clean architecture, solid code, good test coverage. We were proud of it.</p>
<p>Then I watched it fail.</p>
<p>The system relied on a data integration that seemed straightforward: when new data comes in, we generate a keyspace in Apache Cassandra. We&rsquo;d designed it to create keyspaces in 10-minute increments for easy automated deletion. We thought it was a clever trick.</p>
<p>Then bad data came in. Well, specifically, no data came in. The date column was empty. Our system interpreted this as the epoch: January 1, 1970. So it did exactly what it was designed to do, and generated every keyspace between January 1, 1970 and October 7, 2013.</p>
<p>That&rsquo;s 2.3 million keyspaces. Cassandra requires roughly 5MB per column family. You&rsquo;d need 14 terabytes of RAM just to start up.</p>
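<p>A quick back-of-the-envelope check on those numbers (a rough sketch for illustration, not anything we actually ran):</p>
<pre><code class="language-python">from datetime import datetime

# Count the 10-minute buckets between the epoch and the day the bad data arrived.
start = datetime(1970, 1, 1)
end = datetime(2013, 10, 7)
ten_minute_buckets = int((end - start).total_seconds() // 600)
print(f"{ten_minute_buckets:,} keyspaces")  # ~2,300,000

# At roughly 5MB per column family, even a single column family per keyspace
# is already north of 11 TB of RAM, before any other overhead.
print(f"{ten_minute_buckets * 5 / 1_000_000:.1f} TB")
</code></pre>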
<p>The system OOMed on startup. As was commonly said at Palantir back then, &ldquo;bad times on boats.&rdquo;</p>
<p>I sat there watching this happen, and I realized something that should have been obvious: we had never seen the data this system would actually process. We had never been in the room when garbage data showed up at 2am. We had never had to look a customer in the eye and explain why the system they depended on was suddenly unresponsive.</p>
<p>This wasn&rsquo;t a one-time thing. I saw it again on other projects. We would build something elegant and technically sophisticated, ship it to the field, and watch it collide with reality in ways we never anticipated. The pattern was always the same: we built without understanding, optimized for the wrong things, and solved problems that didn&rsquo;t exist while ignoring problems that did. The gap between what we were building and what users actually needed wasn&rsquo;t a small gap. It was a chasm.</p>
<p>After that first experience, I kept going back to the field: NYC, Singapore, Afghanistan, Israel, UAE, Toronto, SF, Mumbai, and government deployments I can&rsquo;t name. I spent hours in SCIFs working through data integrations across multiple networks. My laptop once overheated and nearly melted sitting in a car in the Middle East.</p>
<p>Each field experience taught me something new about the distance between building software and deploying it, about what happens when your code meets reality, about the difference between solving a problem in theory and solving it for a real human being who needs it to work right now.</p>
<p>I alternated between software engineering and forward deployed work for years, going back and forth between building and deploying. Each time I returned to product development, I built differently. I asked different questions and anticipated different failures. I understood, in a way I couldn&rsquo;t have before, what the software was actually for.
That oscillation changed me. And I became convinced it could change others.</p>
<h2 id="why-this-model-works">Why This Model Works</h2>
<h3 id="the-business-case">The Business Case</h3>
<p>The business case is simple: software that actually gets adopted is worth infinitely more than software that sits on a shelf.</p>
<p>Most enterprise software fails to deliver value not because it doesn&rsquo;t work technically, but because it never gets properly deployed, configured, adopted, and integrated into customer workflows. The industry&rsquo;s dirty secret is that a huge percentage of enterprise software purchases end up as shelfware. Customers buy it, try to implement it, hit friction, and give up.</p>
<p>Companies with an FDE model don&rsquo;t have this problem. Their software gets used because there&rsquo;s an engineer on site making sure it gets used. Problems get solved in real time, configurations get tuned to actual workflows, and edge cases get handled before they become blockers.</p>
<p>This creates a flywheel: deployed software generates feedback, feedback improves the product, better products deploy more easily, easier deployment means more customers, and more customers means more feedback.</p>
<p>Palantir generates over $1.1 billion in annual revenue from just their top 20 customers. That&rsquo;s not because the software is marginally better than competitors. It&rsquo;s because Palantir engineers own customer outcomes in a way that traditional enterprise software vendors simply don&rsquo;t.</p>
<p>The other reason companies want FDEs is that they&rsquo;re the best source of product insight. Engineers who work directly with customers see things that product managers reading survey data never see. They understand the actual workflows, the real constraints, the true pain points. The best product ideas at Palantir came from FDEs who&rsquo;d spent enough time in the field to understand what needed to be built.</p>
<p>One deployment I worked on had over fifty data pipelines running daily, and monitoring them was a nightmare. A few engineers who&rsquo;d experienced this pain firsthand built a pipeline monitoring tool that got deployed across multiple sites. They didn&rsquo;t build it because a PM told them to. They built it because they&rsquo;d felt the problem themselves.</p>
<p>That&rsquo;s the FDE advantage: engineers who build from experience, not from specs.</p>
<h3 id="a-correction-not-a-trend">A Correction, Not a Trend</h3>
<p>For a long time, the software industry optimized for building at scale while pushing deployment complexity onto customers and divorcing engineering from the business problems it was supposed to solve. The prevailing model was simple: throw it over the wall, let the customer figure it out, and hire systems integrators to clean up the mess.</p>
<p>This worked when software was simple and customers were patient, but neither of those things is true anymore.</p>
<p>The companies winning right now are the ones who own the full loop of building and deploying. They don&rsquo;t hand off to SI firms or blame customers for implementation failures. They send engineers into the field who own outcomes, not just outputs.</p>
<p>Palantir figured this out twenty years ago and got criticized for it. The criticism was relentless: &ldquo;It&rsquo;s not scalable.&rdquo; &ldquo;It&rsquo;s just consultants.&rdquo; &ldquo;It&rsquo;s not a real software company.&rdquo; Now everyone is trying to copy the model.</p>
<p>The problem is, you can&rsquo;t hire your way to an FDE organization.</p>
<h3 id="building-frontline-the-program-that-scaled-it">Building Frontline: The Program That Scaled It</h3>
<p>To really understand FDE, you need to understand Frontline.</p>
<p>At the time, Palantir had two organizations that didn&rsquo;t understand each other. Product Development (PD) built the software, while Business Development (BD), which included the forward deployed engineers, put it in front of customers. The two groups were culturally and operationally separate, and that separation was causing real damage.</p>
<p>PD thought BD lacked technical rigor. BD thought PD lacked urgency. Deploying the product was painful, sometimes taking days or weeks. Engineers in the field drowned in problems that engineers in headquarters had never seen, while engineers in headquarters built features that solved the wrong problems because they&rsquo;d never felt the pain of the people using their software.</p>
<p>The best way to understand someone is to walk a mile in their shoes. We decided to make that literal.</p>
<p>Project Frontline was born from this insight. Software engineers would spend time in the field doing a forward deployed rotation before formally beginning their product development work. They would see how the software actually got used, feel the pain of deployment, and understand viscerally the gap between what we built and what customers needed. Then they would come back and build differently.</p>
<p>I had the opportunity to lead Frontline alongside Palantir&rsquo;s senior leadership: Matt Steckman, John Garrod, Bill Ward, Lynne Lu, and Randy Schults. Shyam Sankar, now Palantir&rsquo;s CTO, was directly involved, sitting in on our update meetings as we figured out what was working and what wasn&rsquo;t. This wasn&rsquo;t a side project. It was a strategic priority.</p>
<p>We learned quickly that throwing engineers into the field wasn&rsquo;t enough. They needed support and structure. Every Frontliner got a mentor who had been through the experience, who could guide them through the disorientation of going from building software to deploying it. They got leads who checked in frequently, not just to track progress but to surface problems early. We created feedback loops throughout the rotation so that when someone was struggling, we caught it early, and when something wasn&rsquo;t working, we heard about it fast.</p>
<p>We made mistakes along the way. The six month rotation timeline was supposed to be a guideline, but resourcing pressures sometimes stretched rotations to a year, and engineers felt stuck. There was friction between existing FDEs and the rotators, along with cultural clashes and different working styles.</p>
<p>We also discovered something crucial: often, the rotations weren&rsquo;t connected to the products engineers would eventually build. Someone would spend six months deploying software they would never touch again. They learned valuable things, but the connection was abstract. So we started aligning rotations with future product work. If someone was going to build pipeline monitoring tools, we sent them to deployments where pipeline monitoring was a problem.</p>
<p>They became users before they became builders.</p>
<p>This was transformative. When engineers experienced the pain of problems they would later fix, something clicked. They built with emotion and passion, with a deep understanding of what users actually needed. You could feel it in the products. The software wasn&rsquo;t just technically good; it was designed by people who cared, who had been there, who understood.</p>
<p>Over time, the program scaled. More than 250 engineers rotated through Frontline, and it stopped being an experiment and became the way things worked.</p>
<p>The proof is in what happened to the people who went through it. Frontline alumni are now at OpenAI, xAI, Helsing, Anduril, Hex, and dozens of other companies building important things. Bill Ward, who co-led Frontline, started Northslope Technologies, which offers FDE as a service. Jesse Rickard started Fourth Age, which does something similar. The model we built is now being replicated across the industry.</p>
<h2 id="what-makes-someone-great-at-this">What Makes Someone Great at This</h2>
<p>I&rsquo;ve watched hundreds of engineers go through the FDE crucible. Some became exceptional while others washed out. Here&rsquo;s what separates the great ones.</p>
<p><em>Great FDEs are relentlessly curious about the customer&rsquo;s world.</em> Good FDEs solve the problems they&rsquo;re given, but great FDEs understand the customer&rsquo;s business well enough to surface problems the customer had given up on or hadn&rsquo;t yet articulated. In other words, as I said above, they become users before they become builders.  I once spent a week at a customer site and noticed that every morning, an analyst spent 45 minutes manually downloading data from three different systems and combining them in Excel before she could start her actual work. She&rsquo;d been doing this for two years and had never thought to ask if it could be automated. She&rsquo;d just accepted it as part of her job. A good FDE would have built what she asked for. I built a pipeline that did her morning routine automatically and had it waiting in her inbox when she arrived. That&rsquo;s the difference.</p>
<p><em>Great FDEs calibrate their engineering to the situation.</em> They know when to build a robust system and when to write a script that just works. Here&rsquo;s a failure story: early in my FDE career, a customer asked for a way to deduplicate records. I spent two weeks building an elegant, configurable deduplication engine with fuzzy matching, confidence scores, and a review interface. What they actually needed was a SQL query that ran once to clean up a specific data import. They never used the engine again. I&rsquo;d overengineered by 10x. The opposite failure is equally painful: I once hacked together a &ldquo;temporary&rdquo; data deletion script that ended up running in production for eighteen months (the script was called vinoo.groovy, a name many of my old Palantir teammates still use for me).</p>
<p><em>Great FDEs communicate across audiences.</em> They can explain complex systems to executives who don&rsquo;t care about the details and dive deep with technical counterparts who need to understand exactly how something works. In one meeting, I had to explain why a migration would take three months instead of three weeks. To the CTO, I said: &ldquo;The data has 15 years of accumulated edge cases that will break downstream reports if we don&rsquo;t handle them carefully. Rushing it risks the quarterly board reporting.&rdquo; To the engineering lead, I said: &ldquo;There are 847 columns across 12 tables with undocumented interdependencies, and the existing ETL has implicit type coercions that we need to preserve or we&rsquo;ll get silent data corruption.&rdquo; Same problem, different language.</p>
<p><em>Great FDEs stay calm when things break.</em> And things will break. I remember a deployment where the customer&rsquo;s CEO was demoing our product to their board when it crashed. My phone rang. Instead of panicking, I talked the customer&rsquo;s IT person through restarting the service (which I knew would work because I&rsquo;d diagnosed the root cause earlier that week), and the demo was back up in four minutes. The CEO never knew there was an engineer on the phone. How you handle a crisis matters as much as whether you fix it.</p>
<p><em>Great FDEs own outcomes without having authority.</em> This is the hardest skill. Once, I needed a product team to prioritize a bug fix that was blocking a major customer. I had no authority over that team. What didn&rsquo;t work: &ldquo;Can you please prioritize this bug?&rdquo; What did work: &ldquo;This customer represents $4M in annual revenue and their renewal is in six weeks. They&rsquo;ve told their exec sponsor that this bug is their top complaint. If we don&rsquo;t fix it, we risk losing the renewal, and I can get you a call with the customer to hear it directly if that would help.&rdquo; I got the fix in two days.</p>
<p><em>Great FDEs know when to push back.</em> They don&rsquo;t just do whatever customers ask. Once, a customer demanded we build a custom feature that would have taken three months of engineering time. Instead of saying yes or no, I spent two days understanding why they wanted it. It turned out they were trying to solve a problem that our existing product could handle with a configuration change. I showed them how to do it, and they were happier with that solution than they would have been with the custom feature. Pushing back isn&rsquo;t about saying no. It&rsquo;s about understanding the real need and finding the right solution.</p>
<p><img alt="The Palantir Singapore FDE team debugging an outage (h/t Satej Soman for the photo)" height='ɘ' loading="lazy" src="/writing/2026-02-05-forward-deployed-engineering/singapore-team_hu_ffaee4867e426ddb.webp" width='̠'></p>
<h2 id="why-you-might-not-want-this">Why You Might Not Want This</h2>
<p>I&rsquo;ve spent this essay explaining why FDE is valuable. Now let me tell you why it might not be right for you.</p>
<p>If you need deep focus time, this isn&rsquo;t your path. FDE work is interrupt-driven, with customers having emergencies and deployments hitting unexpected problems. Your calendar will be chaotic, fragmented, constantly shifting.</p>
<p>If ambiguity paralyzes you, stay away. FDEs operate with incomplete information constantly. Requirements are unclear, priorities shift, and you have to make decisions without knowing if they&rsquo;re right.</p>
<p>If you want to go deep on a technical specialty, look elsewhere. FDEs are generalists by necessity. You&rsquo;ll learn a lot about a lot of things, but you won&rsquo;t become a world expert in any of them. If your goal is to be the best in the world at distributed systems or machine learning, pure product engineering is a better path.</p>
<p>If you take customer frustration personally, you&rsquo;ll burn out. Customers get frustrated and blame you for things that aren&rsquo;t your fault.</p>
<p>If you can&rsquo;t set boundaries, you&rsquo;ll work yourself to death. FDE work expands to fill all available time. I&rsquo;ve seen talented people burn out because they couldn&rsquo;t say no.</p>
<p>These aren&rsquo;t weaknesses; they&rsquo;re just characteristics that make FDE a bad fit. There are plenty of great engineering roles for people who have them.</p>
<h2 id="the-playbook">The Playbook</h2>
<h3 id="manufacturing-your-own-field-exposure">Manufacturing Your Own Field Exposure</h3>
<p>Most of you reading this don&rsquo;t work at a company with a Frontline program. Nobody is going to hand you this experience, so you have to create it yourself. Here&rsquo;s exactly how to do it.</p>
<p><em>Week 1-2</em>: Get on customer calls. Email your PM or customer success lead today with this exact message: &ldquo;I want to understand our customers better so I can build more robust features. Can I sit in on 2-3 customer calls over the next few weeks? I&rsquo;ll be on mute, just observing.&rdquo; Almost no one will say no to this. On the calls, don&rsquo;t take notes on feature requests. Take notes on these things: What does this person&rsquo;s day look like? What did they do before this call? What will they do after? What do they seem frustrated by that they&rsquo;re not explicitly complaining about? What workarounds are they using?</p>
<p><em>Week 3-4</em>: Shadow support or implementation. Find whoever handles customer escalations or implementations at your company. Ask to shadow them for three days. If you work remotely, ask to be added to their Slack channels and get on video calls with them when issues come in. Watch how problems get diagnosed. Notice the gap between what customers say is wrong and what&rsquo;s actually wrong. Pay attention to which parts of your product cause the most confusion.</p>
<p><em>Month 2</em>: Take an escalation. When a customer issue comes in, volunteer to own it end-to-end. Not just fix the bug, but talk to the customer, understand the impact, communicate the timeline, deploy the fix, and confirm it&rsquo;s resolved. This is the full loop. It will be uncomfortable. You will learn more from this one experience than from months of normal development work.</p>
<p><em>Month 3</em>: Visit a customer site (or do a deep-dive remote session). If your company has customers you can visit, ask to join a site visit. If not, ask if you can do a 2-hour screen-share session with a customer where you watch them use the product for their actual work, not a demo, not a training session, their actual job. While you&rsquo;re watching, identify one papercut you can fix in under a day. Then fix it and ship it before the week is over. Tell them you did it. This builds trust and goodwill like nothing else.</p>
<p><em>Month 4-6</em>: Build your feedback loop. By now you should have enough context to start seeing patterns. Create a simple document titled &ldquo;What I&rsquo;ve Learned From Customers&rdquo; and update it weekly. Share it with your product team. Include specific quotes, specific workflows, specific pain points. This document will make you invaluable. It will also prepare you for FDE interviews, where you&rsquo;ll need to demonstrate customer empathy.</p>
<h3 id="developing-the-six-traits">Developing the Six Traits</h3>
<p><em>Curiosity</em> is really about empathy. It develops when you force yourself out of the codebase and into the customer&rsquo;s environment. The next time you&rsquo;re on a customer call or a site visit, don&rsquo;t just listen for feature requests. Try to understand the customer&rsquo;s day: what are they doing before they open your software, what are they doing after, and what are the three things keeping them up at night that have nothing to do with your product?</p>
<p><em>Calibration</em> comes from getting it wrong in both directions. You need both scars. Deliberately vary your approach: on your next project, try solving it with the simplest possible thing, and on the one after, try building it properly. Pay attention to which approach the situation actually called for.</p>
<p><em>Communication</em> is a muscle you have to exercise. Think of it like the game of telephone: every time a message passes through another person, it degrades. FDEs short-circuit that chain. Practice writing one-page summaries of technical projects for non-technical stakeholders. If you can&rsquo;t explain what you built and why it matters in one page without jargon, you don&rsquo;t understand it well enough.</p>
<p><em>Staying calm</em> is about preparation and repetition, not personality. Put yourself in high-pressure situations voluntarily: volunteer for the on-call rotation, take the escalation, join the war room when production is down. By the tenth time, you&rsquo;ll have a playbook in your head.</p>
<p><em>Owning outcomes without authority</em> develops when you put yourself in situations where you need something from someone who doesn&rsquo;t report to you. Learn to make requests that explain the why, not just the what.</p>
<p><em>Pushing back effectively</em> runs on credibility. Think of credibility as points you can spend or lose. Early on, pick your battles carefully. The key is to always push back with an alternative, not just a no.</p>
<h3 id="what-you-should-actually-know-technically">What You Should Actually Know Technically</h3>
<p>FDEs are generalists, but that doesn&rsquo;t mean you can be shallow on everything. Here&rsquo;s the specific technical knowledge you need.</p>
<p><em>The XY Problem</em>. Customers will ask you for Y when what they actually need is a solution to X. Practice by asking &ldquo;what are you trying to accomplish?&rdquo; before jumping to solutions.</p>
<p><em>SQL</em>. JOINs, GROUP BY with HAVING, window functions, CTEs, subqueries. Target: write a query that finds &ldquo;the second-highest value per category&rdquo; without Googling. Data is objective and data analysis is your lifeblood.</p>
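<p>As a concrete target, here is one way to answer that exercise, sketched in Python against an in-memory SQLite database (the <code>sales</code> table and its columns are made up, and window functions need SQLite 3.25 or newer):</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (category TEXT, value REAL);
    INSERT INTO sales VALUES
        ('a', 10), ('a', 30), ('a', 20),
        ('b', 5),  ('b', 7);
""")

# Second-highest value per category, using a window function.
query = """
SELECT category, value
FROM (
    SELECT category,
           value,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY value DESC) AS rnk
    FROM sales
)
WHERE rnk = 2;
"""
for row in conn.execute(query):
    print(row)  # ('a', 20.0) and ('b', 5.0)
</code></pre>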
<p><em>Command line and Linux</em>. Navigating the filesystem, reading logs (grep, awk, tail -f), understanding processes (ps, top, kill), basic networking (curl, netstat, ping, traceroute). Know what &ldquo;out of inodes&rdquo; means. Resource: OverTheWire Bandit.</p>
<p><em>Containers</em>. What Docker does, difference between container and VM, where logs go. What Kubernetes does at a high level. Resource: Docker&rsquo;s getting started guide.</p>
<p><em>Networking</em>. DNS, HTTP status codes, TLS/SSL basics, debugging when you can ping but can&rsquo;t connect. Resource: Julia Evans&rsquo; zines.</p>
<p><em>Python scripting</em>. Read a CSV, transform data, write it somewhere. Call an API, handle pagination. Connect to a database. Be fast. Resource: Automate the Boring Stuff with Python.</p>
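<p>The bar is being able to knock out something like the following in a few minutes. This is a sketch under assumptions: the endpoint, the <code>next</code> cursor field, and the column names are all hypothetical.</p>
<pre><code class="language-python">import csv
import json
from urllib.request import urlopen

BASE_URL = "https://api.example.com/records"  # hypothetical endpoint

def fetch_all(url):
    """Follow a cursor-style 'next' link until the API stops returning one."""
    while url:
        with urlopen(url) as resp:
            page = json.load(resp)
        yield from page["items"]
        url = page.get("next")  # None on the last page

def write_report(rows, path):
    """Write the interesting fields out as a CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "amount"])
        writer.writeheader()
        for row in rows:
            writer.writerow({"id": row["id"], "amount": row["amount"]})

# write_report(fetch_all(BASE_URL), "report.csv")
</code></pre>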
<p><em>Data quality</em>. What happens with timezone mismatches, encoding issues, unexpected nulls. The Cassandra story I told? That&rsquo;s a data quality issue.</p>
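<p>A small defensive-parsing sketch in that spirit, assuming ISO-8601 timestamps (the point is the explicit handling of empty values and naive timezones, not the specific format):</p>
<pre><code class="language-python">from datetime import datetime, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def parse_event_time(raw: Optional[str]) -> Optional[datetime]:
    """Refuse to silently turn a missing date into January 1, 1970."""
    if raw is None or not raw.strip():
        return None  # surface the gap instead of inventing the epoch
    ts = datetime.fromisoformat(raw)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # make the timezone assumption explicit
    if ts == EPOCH:
        raise ValueError("timestamp is exactly the epoch: almost certainly a default, not real data")
    return ts
</code></pre>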
<h3 id="getting-hired-what-interviewers-actually-ask">Getting Hired: What Interviewers Actually Ask</h3>
<p>FDE interviews are heavy on behavioral questions because the job is about judgment, not just technical skill.</p>
<p><em>&ldquo;Tell me about a time you had to work with a difficult customer.&rdquo;</em> Bad: &ldquo;The customer was being unreasonable, but I stayed calm and eventually they came around.&rdquo; Good: &ldquo;The customer was frustrated because our product had lost their data. They were angry, and honestly, they had a right to be. I started by acknowledging that we&rsquo;d screwed up, not making excuses. Then I explained exactly what had happened, what we were doing to recover the data, and what we were doing to make sure it never happened again. I also gave them my cell phone number and told them to call me directly if anything else went wrong. It took two weeks, but by the end they renewed their contract.&rdquo;</p>
<p><em>&ldquo;Tell me about a time you disagreed with a customer&rsquo;s request.&rdquo;</em> Bad: &ldquo;The customer wanted something that didn&rsquo;t make sense, so I explained why they were wrong.&rdquo; Good: &ldquo;The customer wanted us to build a custom dashboard that would have taken a month of engineering time. Instead of saying no, I asked them to walk me through exactly how they&rsquo;d use it. It turned out they needed to answer a specific question for a weekly meeting. I showed them how to get that answer with our existing product plus a 10-minute SQL query. They were actually happier with that solution because they got it that day instead of waiting a month.&rdquo;</p>
<p><em>&ldquo;Tell me about a time something broke in production.&rdquo;</em> Bad: &ldquo;Our service went down, and I fixed it.&rdquo; Good: &ldquo;I was on-call when our main API started returning 500 errors. First, I checked monitoring to understand scope: 30% of requests, not all. That told me it probably wasn&rsquo;t new code. Error logs showed database connection timeouts. Database was at 100% CPU. Slow query log showed a full table scan on 50 million rows from a feature that launched that morning. I disabled the feature, fixed the API, then worked with the team to add an index. Whole thing took 45 minutes. Next day we implemented query review for large tables.&rdquo;</p>
<p>Technical questions focus on debugging and practical coding:</p>
<ul>
<li>&ldquo;How would you debug a slow API endpoint?&rdquo; (Walk through: monitoring, logs, profiling, database queries, external dependencies)</li>
<li>&ldquo;Write a script that processes this data.&rdquo; (Messy data problem. Handle edge cases. Python.)</li>
<li>&ldquo;Explain X to me like I&rsquo;m not technical.&rdquo; (DNS, database indexing, rate limiting. No jargon.)</li>
</ul>
<h2 id="your-first-90-days">Your First 90 Days</h2>
<p><em>Days 1-30: Learn</em>. Absorb context. Meet everyone. Read all documentation. Shadow every call. Build a glossary. Write down three things you learned each day. Don&rsquo;t try to add value yet. The most common failure mode is engineers who come in hot with opinions before they understand the environment.</p>
<p><em>Days 31-60</em>: Find your first win. One small thing you can own completely. A script that automates manual work. A bug fix that&rsquo;s been annoying customers. A dashboard replacing a spreadsheet. Something with a clear before and after. Ship it. Tell people.</p>
<p><em>Days 61-90</em>: Start forming opinions. Now you have context and credibility. Engage harder problems. Propose solutions. Be humble—three months in, you&rsquo;re still a beginner. But you&rsquo;re a beginner with context.</p>
<p>Traps to avoid:</p>
<p><em>Overengineering</em>. I once saw a new FDE spend three weeks building a &ldquo;robust, scalable&rdquo; solution for a problem that needed a bash script. The customer cared about getting the answer by Friday.</p>
<p><em>Retreating into code</em>. When a meeting gets tense, your instinct is to hide behind your laptop. Don&rsquo;t. Stay present.</p>
<p><em>Overpromising</em>. Underpromise and overdeliver. &ldquo;Done by Wednesday&rdquo; when you finish Tuesday beats &ldquo;Done by Monday&rdquo; when you finish Wednesday.</p>
<p><em>Solving the wrong problem</em>. &ldquo;What are you trying to accomplish?&rdquo; is the most important question in an FDE&rsquo;s toolkit.</p>
<h2 id="the-career-path">The Career Path</h2>
<p><em>Year 1: Learn</em>. Absorbing context, building relationships, developing judgment. Handling individual customer problems. Success: customers trust you, team relies on you for insights, you&rsquo;ve shipped things that made a real difference.</p>
<p><em>Year 2-3: Lead</em>. Owning larger customer relationships or leading small teams. Making prioritization decisions. Mentoring newer FDEs. Success: turned around a struggling customer relationship, shaped the product roadmap, people come to you when they don&rsquo;t know what to do.</p>
<p><em>Year 4+: Multiply</em>. Going deep (domain expert) or going broad (leading an FDE org, product leadership, starting a company). Many FDEs become founders because they have both technical skills and customer understanding. Others become product leaders. Some stay as senior FDEs handling the most complex deployments.</p>
<p>Frontline alumni have become founders of AI companies, heads of product at public companies, and engineering leaders at defense contractors. The FDE skill set is rare and valuable in almost any technical leadership role.</p>
<h2 id="resources">Resources</h2>
<p><em>Companies hiring FDEs</em>: Palantir, Anduril, OpenAI, Anthropic, Scale AI, Helsing, Hex, Databricks, Weights &amp; Biases</p>
<p><em>FDE-adjacent</em>: FDE as a service companies, defense tech startups, vertical SaaS with complex deployments, AI companies with enterprise customers</p>
<p>What to read:</p>
<ul>
<li>Palantir&rsquo;s S-1 filing (deployment methodology section)</li>
<li>Shyam Sankar&rsquo;s blog - <a href="https://www.shyamsankar.com/">https://www.shyamsankar.com/</a></li>
<li>&ldquo;The Mom Test&rdquo; by Rob Fitzpatrick - <a href="https://www.momtestbook.com/">https://www.momtestbook.com/</a></li>
<li>Julia Evans&rsquo; blog and zines - <a href="https://jvns.ca/">https://jvns.ca/</a></li>
</ul>
<h2 id="the-path-forward">The Path Forward</h2>
<p>The grey track jacket was just a symbol. What it represented was something deeper: the transformation that happens when you stop building software in a vacuum and start building it for real people, with real problems, in the real world.</p>
<p>That transformation is available to you whether or not you work at Palantir. It requires putting yourself in the field, feeling the pain of deployment, and understanding what customers actually need. It requires going through something that changes you.</p>
<p>The field is where the real education happens. Everything else is just preparation.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Context Is The Easy Part</title>
      <link>https://vinoo.io/writing/2026-01-30-context-is-the-easy-part/</link>
      <pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2026-01-30-context-is-the-easy-part/</guid>
      <description>Context engineering isn&amp;#39;t a context problem. It&amp;#39;s an engineering problem - and the data world already solved it once.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://kepler.ai/blog-posts/context-is-the-easy-part">Kepler</a></em></p>
<p>Everyone&rsquo;s talking about context engineering right now, but most of the conversation is focused on the wrong thing. Read the blog posts, the guides, the thought leadership. They&rsquo;re all asking the same questions: What should I include in the context window? How do I manage tokens efficiently? How do I curate what the model sees?</p>
<p>These are valid questions. They&rsquo;re also the easy part.</p>
<p>The hard part isn&rsquo;t deciding what context to include. It&rsquo;s building systems that deliver that context reliably, with provenance, at scale, every single time. That&rsquo;s not a context problem. That&rsquo;s an engineering problem. And engineering means something specific.</p>
<p>I spent fifteen years building data infrastructure at Palantir and Citadel, working on defense and intelligence systems where wrong answers weren&rsquo;t acceptable. Here&rsquo;s what I learned: knowing what data you need is trivial. Everyone knows what data they need. The hard part is building systems that deliver it validated, traceable, reproducible, and on demand.</p>
<p>The data world spent two decades learning this lesson. You don&rsquo;t solve data problems by choosing the right data. You solve them by building the infrastructure that makes data trustworthy. Context is the same problem wearing different clothes.</p>
<p>Watch how people approach context engineering today. They&rsquo;re selecting documents. Tuning retrieval parameters. Adjusting chunk sizes. Experimenting with what to include and what to leave out. Managing memory hierarchies. Optimizing token usage. This is prompt engineering with more inputs. It&rsquo;s not engineering.</p>
<p>Engineering means provenance: where did this context come from, and can I trace every piece back to its source? If the output is wrong, can I figure out which input caused it? Engineering means versioning: the context I retrieved yesterday and the context I retrieve today, are they the same, and if not, why? Can I reproduce last week&rsquo;s answer? Engineering means validation: how do I know the context is accurate before it reaches the model, what checks exist, and what happens when something fails? Engineering means determinism: same query, same context, every time, not mostly the same, but exactly the same. Engineering means observability: when context retrieval breaks, how do I know, how do I debug it, and how do I fix it without guessing?</p>
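<p>To make that concrete, here is a minimal sketch of what a single piece of &ldquo;engineered&rdquo; context might carry with it. The field names are illustrative, not a reference to any particular framework:</p>
<pre><code class="language-python">from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

@dataclass(frozen=True)
class ContextChunk:
    text: str
    source_uri: str         # provenance: where this chunk came from
    source_version: str     # versioning: which revision of the source produced it
    retrieved_at: datetime  # observability: when it was retrieved
    content_hash: str = field(init=False, default="")

    def __post_init__(self):
        # Determinism and auditability: the same text always hashes the same,
        # so yesterday's answer can be compared against today's inputs.
        object.__setattr__(
            self, "content_hash",
            hashlib.sha256(self.text.encode("utf-8")).hexdigest(),
        )

chunk = ContextChunk(
    text="Q3 revenue was $12.4M.",
    source_uri="s3://reports/q3-summary.pdf",
    source_version="rev-42",
    retrieved_at=datetime.now(timezone.utc),
)
</code></pre>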
<p>Most context engineering today has none of this. It&rsquo;s artisanal, manual, and fragile. It works in demos and falls apart in production.</p>
<p>The failure mode is predictable. Someone builds a system that retrieves context and generates answers. It works well enough in testing, so they ship it. Then the context changes. A document gets updated. A source goes stale. A retrieval pipeline silently returns different results. The model keeps generating answers with the same confidence, but now the answers are wrong. Nobody notices until a customer does.</p>
<p>This is the same failure mode the data world lived through. Dashboards that showed numbers nobody could trace. Reports that changed when you refreshed the page. Metrics that meant different things to different teams. We solved it by building infrastructure: pipelines with lineage, transformations with tests, data contracts, observability. The unsexy plumbing that makes data trustworthy. Context needs the same infrastructure, and almost nobody is building it.</p>
<p>The current generation of AI tools is fragile in ways that aren&rsquo;t obvious. They work until they don&rsquo;t. They&rsquo;re right until they&rsquo;re wrong. And when they break, there&rsquo;s no way to diagnose why. The teams treating context engineering as an optimization problem will keep hitting this wall. They&rsquo;ll tune retrieval, adjust chunk sizes, swap embedding models, and still get inconsistent results they can&rsquo;t explain. The teams treating it as an engineering problem will build systems that are auditable, reproducible, and debuggable. They&rsquo;ll know where their context comes from. They&rsquo;ll know when it changes. They&rsquo;ll be able to defend their outputs. One approach scales. The other doesn&rsquo;t.</p>
<p>The term context engineering is new. The problem isn&rsquo;t. It&rsquo;s the same problem the data world faced: how do you build systems that deliver the right information, reliably, at scale? The answer was never &ldquo;choose the right data.&rdquo; The answer was always &ldquo;build the infrastructure that makes data trustworthy.&rdquo;</p>
<p>Context is data by another name. The lessons are the same. The question is whether we&rsquo;re going to spend another decade relearning them.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Databricks Delta Live Tables 101</title>
      <link>https://vinoo.io/writing/2024-03-08-databricks-delta-live-tables/</link>
      <pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2024-03-08-databricks-delta-live-tables/</guid>
      <description>A comprehensive guide to understanding Databricks Delta Live Tables and their role in modern data engineering workflows</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://synccomputing.com/databricks-delta-live-tables-101/">Sync Computing</a></em></p>
<p>Databricks&rsquo; DLT offering marks a substantial improvement in the data engineering lifecycle and workflow. By offering a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic, end-to-end data engineering experience from inside its own product, with better support for raw data workflows, live batching, and a host of other benefits detailed below.</p>
<p>Since its release in 2022, Databricks&rsquo; Delta Live Tables have quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies including Shell and H&amp;R Block.</p>
<h2 id="what-are-delta-live-tables">What Are Delta Live Tables?</h2>
<p>Delta Live Tables, or DLT, is a declarative ETL framework that dramatically simplifies the development of both batch and streaming pipelines. Concretely though, DLT is just another way of authoring and managing pipelines in Databricks. Tables are created using the <code>@dlt.table()</code> annotation on top of functions (which return queries defining the table) in notebooks.</p>
<p>Delta Live Tables are built on Databricks&rsquo; foundational technologies, Delta Lake and the Delta file format, and operate in conjunction with both. However, whereas those two focus on the more &ldquo;stagnant&rdquo; portions of the data process, DLT focuses on the <em>transformation</em> piece. Specifically, the DLT framework allows data engineers to describe <em>how</em> data should be transformed between tables in the DAG.</p>
<p>The magic of DLT is most apparent with datasets that involve both streaming and batch processing. In the past, users had to be keenly aware of the &ldquo;velocity&rdquo; of the data being transformed (batch vs. streaming) and design pipelines around it; DLT allows users to push this problem to the system itself. Users write declarative transformations and let the system figure out how to handle the streaming or batch components.</p>
<h2 id="how-are-delta-live-tables-delta-tables-and-delta-lake-related">How are Delta Live Tables, Delta Tables, and Delta Lake related?</h2>
<p>The word &ldquo;Delta&rdquo; appears a lot in the Databricks ecosystem, and to understand why, it&rsquo;s important to look back at history. In 2019, Databricks publicly announced the Delta Lake, a foundational element for storing data (tables) into the Databricks Lakehouse. Delta Lake popularized the idea of a <em>Table Format</em> on top of files, with the goal of bringing reliability to data lakes.</p>
<p>Tables that live inside of this Delta Lake are written using the Delta Table format and, as such, are called Delta Tables. Delta Live Tables focus on the &ldquo;live&rdquo; part of data flow between Delta tables – usually called the &ldquo;transformation&rdquo; step in the ETL paradigm. Delta Live Tables (DLTs) offer declarative pipeline development and visualization.</p>
<h2 id="breaking-down-the-components-of-delta-live-tables">Breaking Down The Components of Delta Live Tables</h2>
<p>There are two main types of datasets you can create within a Delta Live Tables pipeline:</p>
<h3 id="tables">Tables</h3>
<p>Tables in DLT are materialized views that are stored in the lakehouse. They represent the physical datasets that will be persisted and can be queried directly. These tables are created using the <code>@dlt.table()</code> decorator and contain the actual transformed data.</p>
<h3 id="views">Views</h3>
<p>Views in DLT are temporary datasets that exist only during the pipeline execution. They&rsquo;re useful for intermediate transformations and don&rsquo;t consume storage since they&rsquo;re computed on-demand. Views are created using the <code>@dlt.view()</code> decorator.</p>
<p>You can declare your datasets in DLT using either SQL or Python. These declarations can then trigger an update to calculate results for each dataset in the pipeline.</p>
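<p>A minimal sketch of the Python authoring pattern (the table and column names are made up, and this assumes the code runs inside a Databricks DLT pipeline, where the <code>dlt</code> module and the <code>spark</code> session are provided):</p>
<pre><code class="language-python">import dlt
from pyspark.sql import functions as F

@dlt.view()
def raw_orders():
    # Temporary dataset: recomputed during each pipeline update, never persisted.
    return spark.read.format("json").load("/data/raw/orders")

@dlt.table()
def clean_orders():
    # Materialized dataset persisted in the lakehouse.
    return dlt.read("raw_orders").where(F.col("amount") > 0)
</code></pre>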
<h2 id="when-to-use-views-or-materialized-views-in-delta-live-tables">When to Use Views or Materialized Views in Delta Live Tables</h2>
<p>The choice of View or Materialized View primarily depends on your use case. The biggest difference between the two is that Views are <strong>computed at query time</strong>, whereas Materialized Views are <strong>precomputed.</strong> Views also have the added benefit that they don&rsquo;t actually require any additional storage, as they are computed on the fly.</p>
<p>The general rule of thumb when choosing between the two has to do with the performance requirements and downstream access patterns of the table in question. When performance is critical, having to compute a view on the fly may be an unnecessary slowdown, in which case, Materialized Views may be preferred. The same is true when there are multiple downstream consumers of a particular View.</p>
<p>However, there are multiple situations where users just need a quick view, computed in memory, to reference a particular state of a transformed table. Rather than materializing this table, creating a View is more straightforward and efficient.</p>
<h2 id="what-are-the-advantages-of-delta-live-tables">What Are the Advantages of Delta Live Tables?</h2>
<p>There are many benefits to using Delta Live Tables:</p>
<h3 id="unified-streamingbatch-experience">Unified Streaming/Batch Experience</h3>
<p>By removing the need for data engineers to build distinct streaming/batch data pipelines, DLT simplifies one of the most difficult pain points of working with data, thereby offering a truly unified experience.</p>
<h3 id="opinionated-pipeline-management">Opinionated Pipeline Management</h3>
<p>The modern data stack is filled with orchestration players, observability players, data quality players, and many others. DLT offers an opinionated way to orchestrate and assert data quality.</p>
<h3 id="performance-optimization">Performance Optimization</h3>
<p>DLTs offer the full advantages of Delta Tables, which are designed to handle large volumes of data and support fast querying. Their vectorized query execution allows them to process data in batches rather than one row at a time.</p>
<h3 id="built-in-quality-assertions">Built-in Quality Assertions</h3>
<p>Delta Live Tables provide data quality features, such as data cleansing and data deduplication, out of the box. Users can specify rules to remove duplicates or cleanse data as data is ingested, ensuring data accuracy.</p>
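<p>In Python, these rules are attached as expectation decorators. A short sketch, continuing the hypothetical pipeline above:</p>
<pre><code class="language-python">import dlt

@dlt.table()
@dlt.expect("non_negative_amount", "amount >= 0")            # log violations, keep the rows
@dlt.expect_or_drop("has_order_id", "order_id IS NOT NULL")  # drop violating rows
def validated_orders():
    return dlt.read("clean_orders")
</code></pre>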
<h3 id="acid-transactions">ACID Transactions</h3>
<p>Because DLTs use the Delta format, they support ACID transactions (Atomicity, Consistency, Isolation, and Durability), which have become the standard for data quality and exactness.</p>
<h3 id="pipeline-visibility">Pipeline Visibility</h3>
<p>DLT provides a Directed Acyclic Graph of your data pipeline workloads, giving you a clear, visually compelling way to both see and introspect your pipeline at various points.</p>
<h2 id="change-data-capture-cdc-in-delta-live-tables">Change Data Capture (CDC) in Delta Live Tables</h2>
<p>One of the large benefits of Delta Live Tables is the ability to use Change Data Capture while streaming data. Change Data Capture refers to the tracking of all changes in a data source so they can be captured across all destination systems.</p>
<p>With Delta Live Tables, data engineers can easily implement CDC with the Apply Changes API (either with Python or SQL). The capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse.</p>
<p>Delta Live Tables support Slowly Changing Dimensions (SCD) both type 1 and type 2. This is important because SCD type 2 retains a full history of values, which means you can retain a history of records in your data lakehouse.</p>
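<p>A rough sketch of what this looks like in Python; the table names and the <code>updated_at</code> ordering column are illustrative:</p>
<pre><code class="language-python">import dlt
from pyspark.sql.functions import col

# Target table that the change feed keeps up to date.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customer_updates",      # a streaming source defined elsewhere in the pipeline
    keys=["customer_id"],           # how to identify the same logical record
    sequence_by=col("updated_at"),  # ordering column used to resolve out-of-order changes
    stored_as_scd_type=2,           # keep full history (SCD type 2); use 1 to overwrite in place
)
</code></pre>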
<h2 id="what-is-the-cost-of-delta-live-tables">What is the Cost of Delta Live Tables?</h2>
<p>The cost of Delta Live Tables depends on the compute the pipeline itself uses. On AWS, DLT compute can range from $0.20/DBU for DLT Core Compute Photon up to $0.36/DBU for DLT Advanced Compute. However, these prices can be up to twice as high when applying expectations and CDC.</p>
<p>From an efficiency perspective, DLT results in a reduction in total cost of ownership. Automatic orchestration tests by Databricks have shown total compute time to be reduced by as much as half with Delta Live Tables – ingesting up to 1 billion records for under $1.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Delta Live Tables represent a significant advancement in data engineering workflows, offering a unified approach to batch and streaming data processing. By providing built-in data quality checks, automatic orchestration, and comprehensive pipeline visibility, DLT simplifies many of the traditional pain points in data pipeline development.</p>
<p>While there are cost considerations to keep in mind, the efficiency gains and reduced operational overhead often justify the investment, especially for organizations dealing with complex data transformation workflows.</p>
<hr>
<p><em>This post was originally published on <a href="https://synccomputing.com/databricks-delta-live-tables-101/">Sync Computing&rsquo;s blog</a> on March 8, 2024.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Rethinking Serverless: The Price of Convenience</title>
      <link>https://vinoo.io/writing/2024-02-09-rethinking-serverless/</link>
      <pubDate>Fri, 09 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2024-02-09-rethinking-serverless/</guid>
      <description>Examining the hidden costs and challenges of serverless functions in modern enterprise environments</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://synccomputing.com/rethinking-serverless-the-price-of-convenience/">Sync Computing</a></em></p>
<p>Serverless functions have had their 15 minutes of fame (and runtime).</p>
<p>As is the case with many concepts in technology, the term Serverless is abusively vague. As such, discussing the idea of &ldquo;serverless&rdquo; usually invokes one of two feelings in developers. Either it&rsquo;s thought of as the catalyst for a potentially incredible future, finally freeing developers from having to worry about resources or scaling concerns, or it&rsquo;s thought of as the harbinger of yet another &ldquo;we don&rsquo;t need DevOps anymore&rdquo; trend.</p>
<p>The root cause of this confusion is that the catch-all term &ldquo;Serverless&rdquo; actually comprises two large operating models: functions and jobs. At Sync, we&rsquo;re intimately familiar with optimizing jobs, so when our customers gave us feedback to focus a portion of our attention on serverless functions, we were more than intrigued.</p>
<p>The hypothesis was simple: could we extend our expertise and background in optimizing Databricks large-scale batch compute workloads to optimizing many smaller batch compute workloads?</p>
<h2 id="serverless-functions-how-we-got-here">Serverless Functions: How We Got Here</h2>
<p>One of the most painful parts of the developer workflow is &ldquo;real world deployment.&rdquo; Deploying code that was written locally to the right environment, and getting it to work the same way, was extraordinarily painful. Library issues, scaling issues, infrastructure management issues, provisioning issues, resource selection issues, and a number of other issues plagued developers. The cloud just didn&rsquo;t mimic the ease and simplicity of local developer environments.</p>
<p>Then Serverless functions emerged. All of a sudden, developers could write and deploy code in a function with the same level of simplicity as writing it locally. They never had to worry about spinning up an EC2 instance or figuring out the material differences between AMI and Ubuntu. They didn&rsquo;t have to fiddle with Dockerfiles or even do scale testing. They wrote the exact same Python or NodeJS code that they wrote locally in a Cloud IDE and it just worked. It seemed perfect.</p>
<p>Soon, mission-critical pieces of infrastructure were supported by Python functions only a few dozen lines long, deployed in the cloud. Enter: Serverless frameworks. All of a sudden, it became even easier to adopt and deploy serverless functions. Enterprises took to these functions like hotcakes, many deploying hundreds or even thousands of them.</p>
<h2 id="industry-adoption-and-trends">Industry Adoption and Trends</h2>
<p>In 2022, IBM published a blog post titled &ldquo;The Future Is Serverless,&rdquo; citing the &ldquo;energy-efficient and cost-efficient&rdquo; nature of serverless applications as a primary reason the future will be serverless. They make the – valid – case that reserving cloud capacity is challenging and that consumers of cloud serverless functions are better served by allowing technologies such as Knative to streamline the &ldquo;serverless-ification&rdquo; process.</p>
<p>In 2023, Datadog released their annual &ldquo;State of Serverless&rdquo; post, showing the continued adoption of Serverless technologies across all 3 major cloud vendors. The leader of the pack is AWS Lambda, which has traditionally been the entry point for developers to deploy their Serverless workloads.</p>
<p>Interestingly, 40%+ of Lambda invocations happen in NodeJS – which is not traditionally thought of as a distributed computing framework, nor is it generally used for large scale orchestration of compute tasks. But it seems to be dominating the Lambda serverless world.</p>
<h2 id="what-serverless-actually-solved">What Serverless Actually Solved</h2>
<p>Before diving into the problems, let&rsquo;s acknowledge where serverless functions truly excel:</p>
<h3 id="uptime-guarantees">Uptime Guarantees</h3>
<p>One of the most critical, but also most frustrating, pieces of the developer lifecycle is uptime requirements. Many developers hear the term &ldquo;five-nines&rdquo; and shudder. Building applications with specific uptime guarantees is not only challenging, it&rsquo;s also time-intensive. When large scale systems are made up of small, discrete pieces of computation, the problem becomes all the more complex.</p>
<p>Lambda SLAs guarantee a fairly reasonable amount of uptime, right out of the box. This can save the otherwise substantial developer effort of scoping, building, and testing highly available systems.</p>
<h3 id="concurrency--auto-scaling">Concurrency + Auto Scaling</h3>
<p>Introspecting a large scale system isn&rsquo;t easy. In an environment where requests can burst unexpectedly, creating and designing systems that scale based on spot user demand is also difficult.</p>
<p>One of the most powerful aspects of a serverless or hosted model is the demand-based auto-scaling offered by the infrastructure. These effects are compounded when the functions themselves are stateless. This effectively frees developers from the <em>operational</em> concerns of autoscaling.</p>
<h2 id="the-problems-with-serverless-functions">The Problems with Serverless Functions</h2>
<p>Despite these benefits, our analysis revealed four critical problems that are undermining the serverless promise:</p>
<h2 id="problem-1-developer-bandwidth">Problem 1: Developer Bandwidth</h2>
<p>In a typical Serverless Function deployment, the initial choice of configuration tends to be the perpetual choice of configuration.</p>
<p>Wait, &ldquo;initial choice of configuration&rdquo;? It turns out, yes, users still need to manually pick a configuration for each serverless function they deploy. It&rsquo;s actually a bit ironic – despite the promise of truly zero-management jobs, users are still required to intelligently select a resource configuration.</p>
<p>If an engineer deploys and accidentally overspecs a serverless function initially, it&rsquo;s fairly unlikely that they will ever revisit the function to optimize it. This happens for several reasons:</p>
<ol>
<li>
<p><strong>Time</strong> – Most engineers don&rsquo;t have the time to go back and ensure that functions they have written weeks, months, or even years ago are operating under the ideal resources.</p>
</li>
<li>
<p><strong>Incentives</strong> – Engineers are not incentivized to pick the optimal resource configuration for their jobs. They&rsquo;d rather the job be guaranteed to work, even if it spends a bit more of their company&rsquo;s compute budget.</p>
</li>
<li>
<p><strong>Employee Churn</strong> – Enterprises have inherent entropy and employees are oftentimes transient. When other engineers inherit previous work, they are significantly more incentivized to just ensure it works, rather than ensure that it works optimally.</p>
</li>
</ol>
<h2 id="problem-2-serverless-still-requires-tuning">Problem 2: Serverless Still Requires Tuning</h2>
<p>Lambda is predicated on a simple principle – the resource requirements for workloads that take less than 15 minutes to run can be pretty easily approximated. Lambda makes it easy for developers to set-and-forget, offering only one knob for them to worry about.</p>
<p>That knob is memory. Using Lambda, you can configure the memory allocated to a Lambda function to any value between 128 MB and 10,240 MB. Lambda then automatically decides how much vCPU to allocate based on that memory setting.</p>
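<p>As a concrete illustration, here&rsquo;s a minimal sketch of what turning that knob looks like with boto3 – the function name and the new memory value are placeholders, not a recommendation:</p>
<pre><code class="language-python">import boto3

lambda_client = boto3.client("lambda")

# Placeholder function name -- substitute one of your own deployments.
FUNCTION_NAME = "nightly-report-generator"

# Read the current configuration to see what memory was originally chosen.
current = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
print("Current memory setting (MB):", current["MemorySize"])

# The single knob: memory, in MB, anywhere from 128 to 10,240.
# vCPU is allocated proportionally by Lambda; there is no separate CPU setting.
lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    MemorySize=512,
)
</code></pre>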
<p>This sounds great in theory. &ldquo;I only have to pick one lever and all of a sudden, I get everything else figured out for me?&rdquo; If that were the end of the story, this would be a much shorter post.</p>
<p>Instead, life is all about tradeoffs – generally correlated tradeoffs. In this case, it&rsquo;s cost and performance. As an engineer, it&rsquo;s easy to pick the largest memory setting available just to ensure the Lambda function works, regardless of what its actual resource requirements are. Once it works, why would anyone ever touch it again?</p>
<p>Well, it turns out that picking large values that frequently bear no relation to the resources actually required isn&rsquo;t the most cost-effective approach. So much so that an AWS Solutions Engineer built and open-sourced a tool to help users find the correct memory levels for their Lambda functions. The tool uses AWS Step Functions to walk users down to the minimum necessary level. It&rsquo;s been so popular that it has 5K stars on GitHub and 18.8K deployments.</p>
<p>Clearly, the one-knob-rules-all solution isn&rsquo;t working.</p>
<h2 id="problem-3-serverless-is-hard-to-introspect">Problem 3: Serverless Is Hard to Introspect</h2>
<p>The scale and growth testing that plagued engineers for decades before the rise of Serverless was unfortunately not in vain. Understanding how users will interact with an application, in terms of number of requests or compute load, gives engineers a powerful understanding of what to expect when things go live.</p>
<p>In the Serverless Function architecture, engineers don&rsquo;t think about these considerations and instead push the burden onto the infrastructure itself. As long as the infrastructure works, it&rsquo;s unlikely that an already oversubscribed engineer will spend time digging into the performance or cost characteristics of a Serverless function.</p>
<p>Absent home-rolled solutions, there are few tools that allow for the detailed observability of a single serverless function. Furthermore, there are usually hundreds if not thousands of serverless functions deployed. Observability across a fleet of functions is nearly impossible.</p>
<p>The primary mechanism for per-function observability is AWS CloudWatch. CloudWatch logs events for each Lambda invocation and stores a few metrics. The major problem is that just collecting this information in CloudWatch has been observed to cost more than Lambda itself. There are full articles, posts, and best practices dedicated to just managing the costs associated with Lambda CloudWatch logs.</p>
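<p>To make the introspection gap concrete, here&rsquo;s a rough sketch of pulling one function&rsquo;s metrics out of CloudWatch with boto3. The function name is a placeholder; now imagine repeating this across hundreds or thousands of functions:</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder function name -- a real fleet audit would loop over every function.
FUNCTION_NAME = "nightly-report-generator"

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

# Average duration per hour over the last day, for a single function.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "ms")
</code></pre>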
<h2 id="problem-4-no-auto-optimization">Problem 4: No Auto-Optimization</h2>
<p>The year 2023 brought about a material shift in the mentality of &ldquo;compute&rdquo; consumers. Enterprises that were previously focused on growth at all costs shifted their focus to efficiency. Vendors in the broader cloud, Snowflake, and Databricks ecosystems popped up at an increasing rate. Most had a simple goal – provide high-level visibility into workloads.</p>
<p>They provided interactive charts and diagrams to show ongoing cost changes… but they didn&rsquo;t provide the fundamental &ldquo;healing&rdquo; mechanisms. It would be like going to the doctor and having them diagnose a problem, but offer no treatment.</p>
<p>Consistent with their focus on efficiency, enterprises had a few options. Larger ones deployed full teams to focus on this effort. Smaller ones that didn&rsquo;t have the budget or manpower turned to observability tools… nearly all of which fell short, as they missed the fundamental optimization component.</p>
<p>Providing detailed visibility across a few large-scale jobs is considered table stakes for many observability providers, but providing that same level of visibility across many small-scale jobs – in a way that is efficient and easy to act on – somehow hasn&rsquo;t become standard.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We&rsquo;re in a fairly unique period as an industry. Job visibility, tuning, introspection, and optimization have reemerged as key pieces of the modern tech stack. But most focus on the whales, when they should be focusing on the barracudas.</p>
<p>Serverless functions promised to eliminate the complexity of infrastructure management, but they&rsquo;ve simply shifted that complexity elsewhere. While they excel in certain areas like uptime guarantees and auto-scaling, the hidden costs of poor optimization, lack of visibility, and ongoing manual tuning requirements suggest that the serverless revolution still has room for improvement.</p>
<p>The future of serverless likely lies not in the elimination of optimization concerns, but in the automation of those concerns – turning the promise of &ldquo;set-and-forget&rdquo; into reality through intelligent, automated resource management and optimization.</p>
<hr>
<p><em>This post was originally published on <a href="https://synccomputing.com/rethinking-serverless-the-price-of-convenience/">Sync Computing&rsquo;s blog</a> on February 9, 2024.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Efficiently Guide to Snowflake (Top Down)</title>
      <link>https://vinoo.io/writing/2023-02-02-efficiently-guide-to-snowflake/</link>
      <pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-02-02-efficiently-guide-to-snowflake/</guid>
      <description>4 changes you can make right now to run Snowflake more Efficiently.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/snowflake-top-down">Efficiently (Substack)</a></em></p>
<p>The majority of my career has been focused on making data systems more efficient — whether that means performance, scalability, or cost. This series aims to democratize knowledge about how to Efficiently operationalize data.</p>
<p><img alt="Snowflake Guide" height='̠' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-cover_hu_6f6c9ab3061f635f.webp" width='̠'></p>
<h2 id="tldr">TLDR</h2>
<p>4 changes you can make <em>right now</em> to run Snowflake more Efficiently:</p>
<ol>
<li>File a Snowflake support ticket and request access to the <code>GET_QUERY_STATS</code> function</li>
<li><code>ALTER WAREHOUSE &lt;warehouseName&gt; SET AUTO_SUSPEND = 60;</code></li>
<li>For multi-cluster warehouses:
<ul>
<li><code>ALTER WAREHOUSE &lt;warehouseName&gt; SET MIN_CLUSTER_COUNT = 1;</code></li>
<li><code>ALTER WAREHOUSE &lt;warehouseName&gt; SET SCALING_POLICY = ECONOMY;</code></li>
</ul>
</li>
<li><code>ALTER WAREHOUSE &lt;warehouseName&gt; SET STATEMENT_TIMEOUT_IN_SECONDS=36000</code></li>
</ol>
<h2 id="snowflake--driving">Snowflake + Driving</h2>
<p>Snowflake optimization resembles efficient driving. There are four parallel constraints:</p>
<ul>
<li><strong>Car efficiency</strong> = warehouse size selection</li>
<li><strong>Driver skill</strong> = query authorship ability</li>
<li><strong>Road congestion</strong> = warehouse saturation</li>
<li><strong>Route optimization</strong> = query construction, schema design, and partitioning</li>
</ul>
<p><img alt="Snowflake optimization framework" height='ó' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-diagram_hu_fedf72f65065630f.webp" width='̠'></p>
<h2 id="top-down-vs-bottom-up">Top Down vs. Bottom Up</h2>
<p>This framework separates optimization into two approaches: <strong>Top Down</strong> (optimizing the environment/infrastructure without affecting users) and <strong>Bottom Up</strong> (optimizing operations — the &ldquo;driver&rdquo;).</p>
<p>This post focuses on the Top Down approach.</p>
<h2 id="insight-1-get-data">Insight #1: Get Data</h2>
<p>Solving problems requires data. Two critical functions provide the visibility needed for optimization:</p>
<ul>
<li><code>GET_QUERY_STATS</code></li>
<li><code>GET_QUERY_OPERATOR_STATS</code> (preview feature, available on all accounts)</li>
</ul>
<p><code>GET_QUERY_STATS</code> is not available by default; you&rsquo;ll need to file a support ticket with Snowflake to request access to it.</p>
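<p>Once you do have access, here&rsquo;s a rough sketch of what pulling operator-level statistics looks like from Python. The connection parameters, database, and table are placeholders – the post itself only assumes you can run SQL against your account:</p>
<pre><code class="language-python">import snowflake.connector

# Placeholder credentials -- use your own account, user, and authentication method.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# Run (or identify) a query, then ask for operator-level statistics about it.
cur.execute("SELECT COUNT(*) FROM my_database.my_schema.my_table")
cur.execute("SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))")

for row in cur.fetchall():
    print(row)

conn.close()
</code></pre>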
<h2 id="insight-2-turn-your-car-off">Insight #2: Turn Your Car Off</h2>
<p>Your car (warehouse) should only be on when you need it. Minimize the time that it is on and doing nothing.</p>
<p>Warehouses consume credits continuously while active, even during idle periods. The AUTO_SUSPEND parameter automatically stops idle warehouses after a specified duration.</p>
<p><img alt="Warehouse consumption history" height='Ǭ' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-consumption_hu_5acf710e67eebd89.webp" width='̠'></p>
<p>There are four warehouse states to understand:</p>
<ul>
<li><strong>Suspended</strong> — Off, no charges</li>
<li><strong>Running/Idle (Will Suspend)</strong> — On with no queries executing, still within the auto-suspend window. Charging unnecessarily.</li>
<li><strong>Running/Idle (Won&rsquo;t Suspend)</strong> — On with no queries executing, but another query will arrive before the auto-suspend timer fires, so the warehouse never suspends. This idle time requires investigation.</li>
<li><strong>Running/Active</strong> — Queries are executing. This is the desired state.</li>
</ul>
<p><img alt="Warehouse states timeline (simple)" height='ɭ' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-timeline-simple_hu_8fed9c59c0ef5990.webp" width='̠'></p>
<p>The goal: minimize the Red &ldquo;No Queries Running&rdquo; while looking for ways to also minimize the Pink &ldquo;No Queries Running.&rdquo;</p>
<p><img alt="Warehouse states timeline (complex)" height='ɪ' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-timeline-complex_hu_80808b5947062ec9.webp" width='̠'></p>
<p>Snowflake&rsquo;s minimum auto-suspend is 30 seconds, but charges are incurred for full-minute increments regardless. Setting to 60 seconds is the optimal minimum:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-0-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-0-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#ff79c6">ALTER</span> WAREHOUSE <span style="color:#ff79c6">&lt;</span>warehouseName<span style="color:#ff79c6">&gt;</span> <span style="color:#ff79c6">SET</span> AUTO_SUSPEND <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">60</span>;
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="insight-3-car-engines--cylinders">Insight #3: Car Engines + Cylinders</h2>
<p>Clusters function like engine cylinders — more clusters mean more processing power but higher consumption.</p>
<p><img alt="Engine analogy" loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-engine.gif"></p>
<p>Snowflake bills at <code># Warehouse * # Clusters</code>. More clusters enable parallel processing but increase costs proportionally.</p>
<p><img alt="Warehouse states diagram" height='ɽ' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-warehouse-states_hu_ab136c1e3624d930.webp" width='̠'></p>
<p>Two configuration changes to make right now:</p>
<p>Start with a single cluster minimum:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-1-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-1-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#ff79c6">ALTER</span> WAREHOUSE <span style="color:#ff79c6">&lt;</span>warehouseName<span style="color:#ff79c6">&gt;</span> <span style="color:#ff79c6">SET</span> MIN_CLUSTER_COUNT <span style="color:#ff79c6">=</span> <span style="color:#bd93f9">1</span>;
</span></span></code></pre></td></tr></table>
</div>
</div><p>Use economy scaling policy (scales only when strictly necessary):</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-2-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-2-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#ff79c6">ALTER</span> WAREHOUSE <span style="color:#ff79c6">&lt;</span>warehouseName<span style="color:#ff79c6">&gt;</span> <span style="color:#ff79c6">SET</span> SCALING_POLICY <span style="color:#ff79c6">=</span> ECONOMY;
</span></span></code></pre></td></tr></table>
</div>
</div><p>The economy policy prioritizes cost efficiency over performance, scaling up only when queries are being queued.</p>
<h2 id="insight-4-restrict-trip-distance">Insight #4: Restrict Trip Distance</h2>
<p>Like Federal Motor Carrier Safety Administration regulations limiting how long drivers can be on the road, queries should have maximum execution times.</p>
<p><img alt="FMCSA regulations" height='Ĺ' loading="lazy" src="/writing/2023-02-02-efficiently-guide-to-snowflake/snowflake-fmcsa_hu_46240b5e24f416c8.webp" width='̠'></p>
<p>Snowflake&rsquo;s <code>STATEMENT_TIMEOUT_IN_SECONDS</code> defaults to 172,800 seconds — that&rsquo;s 2 days. This permits excessively long, potentially runaway queries to consume credits for up to two full days before being cancelled.</p>
<p>Set a reasonable timeout of 36,000 seconds (10 hours):</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-3-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-3-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#ff79c6">ALTER</span> WAREHOUSE <span style="color:#ff79c6">&lt;</span>warehouseName<span style="color:#ff79c6">&gt;</span> <span style="color:#ff79c6">SET</span> STATEMENT_TIMEOUT_IN_SECONDS<span style="color:#ff79c6">=</span><span style="color:#bd93f9">36000</span>
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="conclusion">Conclusion</h2>
<p>There is a prevalent idea floating around that Snowflake is expensive. That can be true, but as is the case in most of these systems, it really comes down to how effectively and Efficiently you use Snowflake.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Hands-On: Predicate Pushdown</title>
      <link>https://vinoo.io/writing/2023-01-28-hands-on-predicate-pushdown/</link>
      <pubDate>Sat, 28 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-01-28-hands-on-predicate-pushdown/</guid>
      <description>A practical demonstration of how query optimizers leverage Parquet metadata to skip unnecessary data reads.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/hands-on-predicate-pushdown">Efficiently (Substack)</a></em></p>
<p>We&rsquo;ve spoken a lot about on-disk and distributed storage, as well as blocks. All of this theory is great, but let&rsquo;s see how it plays out in practice.</p>
<p>In this post, I&rsquo;m going to:</p>
<ol>
<li>Read a CSV dataset into Spark</li>
<li>Write the dataset into 5 Parquet files (treating each file as a block)</li>
<li>Introspect metadata existing on the files</li>
<li>Run queries demonstrating predicate pushdown power</li>
</ol>
<p><img alt="Predicate Pushdown" height='ǻ' loading="lazy" src="/writing/2023-01-28-hands-on-predicate-pushdown/predicate-pushdown-cover_hu_159dd2e821d6e856.webp" width='̠'></p>
<h2 id="hands-on-setup">Hands-On: Setup</h2>
<p>The tutorial uses an airports dataset. Download it via:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-0-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-0-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>wget https://raw.githubusercontent.com/curran/data/gh-pages/vegaExamples/airports.csv -O dataset.csv
</span></span></code></pre></td></tr></table>
</div>
</div><p>The CSV contains columns: <code>iata</code>, <code>name</code>, <code>city</code>, <code>state</code>, <code>country</code>, <code>latitude</code>, <code>longitude</code>.</p>
<h3 id="loading-data-into-spark">Loading Data into Spark</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-1-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-1-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> dataset <span style="color:#ff79c6">=</span> spark<span style="color:#ff79c6">.</span>read<span style="color:#ff79c6">.</span>option<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;header&#34;</span><span style="color:#ff79c6">,</span><span style="color:#f1fa8c">&#34;true&#34;</span><span style="color:#ff79c6">).</span>option<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;inferSchema&#34;</span><span style="color:#ff79c6">,</span><span style="color:#f1fa8c">&#34;true&#34;</span><span style="color:#ff79c6">).</span>csv<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;dataset.csv&#34;</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The inferred schema shows latitude and longitude as double types.</p>
<h3 id="writing-parquet-files">Writing Parquet Files</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-2-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-2-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span>dataset<span style="color:#ff79c6">.</span>repartition<span style="color:#ff79c6">(</span><span style="color:#bd93f9">5</span><span style="color:#ff79c6">).</span>write<span style="color:#ff79c6">.</span>parquet<span style="color:#ff79c6">(</span><span style="color:#f1fa8c">&#34;/root/parquet_dataset&#34;</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>This creates 5 Parquet files plus a <code>_SUCCESS</code> flag file.</p>
<h3 id="inspecting-parquet-metadata">Inspecting Parquet Metadata</h3>
<p>Install the inspection tools:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-3-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-3-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-3-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-3-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip3 install parquet-tools
</span></span><span style="display:flex;"><span>pip3 install parquet-metadata
</span></span></code></pre></td></tr></table>
</div>
</div><p>View file contents:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-4-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-4-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>parquet-tools show part-00000-53b27d15-b049-41db-a8aa-fa3033763836-c000.snappy.parquet
</span></span></code></pre></td></tr></table>
</div>
</div><h2 id="hands-on-query-plans">Hands-On: Query Plans</h2>
<h3 id="simple-filter-query">Simple Filter Query</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-5-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-5-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-5-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-5-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> simpleFilter <span style="color:#ff79c6">=</span> dataset<span style="color:#ff79c6">.</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&gt;</span> <span style="color:#bd93f9">30</span><span style="color:#ff79c6">)</span>
</span></span><span style="display:flex;"><span>simpleFilter<span style="color:#ff79c6">.</span>show<span style="color:#ff79c6">()</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>The result shows all rows where latitude exceeds 30.</p>
<p>Examining the query plan reveals three stages: the parsed logical plan, the analyzed logical plan, and the optimized logical plan. The optimized logical plan has added a null check alongside our predicate.</p>
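<p>If you want to print these plans yourself, Spark&rsquo;s extended explain shows all of them (plus the physical plan). Here&rsquo;s the same flow as a self-contained PySpark sketch – the Scala equivalent is simply <code>simpleFilter.explain(true)</code>:</p>
<pre><code class="language-python">from pyspark.sql import SparkSession

# A PySpark equivalent of the Scala snippets above.
spark = SparkSession.builder.appName("predicate-pushdown-demo").getOrCreate()

dataset = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("dataset.csv")
)

simple_filter = dataset.filter(dataset.latitude &gt; 30)

# Prints the parsed, analyzed, and optimized logical plans, plus the physical plan.
simple_filter.explain(True)
</code></pre>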
<h3 id="complex-filter-query">Complex Filter Query</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-6-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-6-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-scala" data-lang="scala"><span style="display:flex;"><span><span style="color:#ff79c6">val</span> complexFilter <span style="color:#ff79c6">=</span> dataset<span style="color:#ff79c6">.</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&gt;</span> <span style="color:#bd93f9">30</span><span style="color:#ff79c6">).</span>filter<span style="color:#ff79c6">(</span>$<span style="color:#f1fa8c">&#34;latitude&#34;</span> <span style="color:#ff79c6">&lt;</span> <span style="color:#bd93f9">40</span><span style="color:#ff79c6">)</span>
</span></span></code></pre></td></tr></table>
</div>
</div><p>As you can see, the plan has combined both of our predicates into one step as part of the query process (meaning that what would previously take two passes over the data now only requires one).</p>
<p>The optimized plan consolidates the filters: <code>Filter ((isnotnull(latitude#21) AND (latitude#21 &gt; 30.0)) AND (latitude#21 &lt; 40.0))</code></p>
<h2 id="hands-on-querying-with-parquet">Hands-On: Querying with Parquet</h2>
<h3 id="row-group-metadata-analysis">Row Group Metadata Analysis</h3>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-7-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-7-1">1</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>parquet-metadata /root/parquet_dataset/part-00000-...
</span></span></code></pre></td></tr></table>
</div>
</div><p>Critical metadata fields include:</p>
<ul>
<li><code>stats:min</code> — smallest value in the column</li>
<li><code>stats:max</code> — largest value in the column</li>
</ul>
<p>Example statistics from one file:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr><td style="vertical-align:top;padding:0;margin:0;border:0;">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-8-1"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-8-1">1</a>
</span><span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f" id="hl-8-2"><a style="outline:none;text-decoration:none;color:inherit" href="#hl-8-2">2</a>
</span></code></pre></td>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#282a36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>row_group 0 latitude stats:min 14.1743075
</span></span><span style="display:flex;"><span>row_group 0 latitude stats:max 70.46727611
</span></span></code></pre></td></tr></table>
</div>
</div><h3 id="predicate-pushdown-mechanism">Predicate Pushdown Mechanism</h3>
<p>For a second file with <code>stats:min=44.4430157</code> and <code>stats:max=74.46727611</code>, a query filtering for latitude between 30 and 40 would exclude this entire file — because we know from the metadata that no values in this file fall within our filter range.</p>
<p>In practice, this is called <strong>predicate pushdown</strong>. The requirements of the predicate (the query) have been pushed down, allowing the optimizer to look at the metadata on the row groups themselves and decide which row groups need to be read and which can be skipped entirely.</p>
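<p>If you&rsquo;d rather read those row group statistics programmatically than eyeball the CLI output, here&rsquo;s a small sketch using pyarrow (an extra dependency not used elsewhere in this post; the file name is the example one from above and will differ in your run):</p>
<pre><code class="language-python">import pyarrow.parquet as pq

# Path to one of the five files written above (the exact name will differ).
path = "/root/parquet_dataset/part-00000-53b27d15-b049-41db-a8aa-fa3033763836-c000.snappy.parquet"

pf = pq.ParquetFile(path)
latitude_index = pf.schema_arrow.get_field_index("latitude")

# The same min/max statistics the optimizer consults when deciding what to skip.
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(latitude_index).statistics
    print(f"row_group {rg} latitude min={stats.min} max={stats.max}")
</code></pre>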
<h2 id="conclusion">Conclusion</h2>
<p>There is a lot of magic that goes into our ability to query data quickly and <em>Efficiently</em>. Query optimizers do a lot for us — and understanding how they work under the hood helps us write better queries and design better data layouts.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Distributed Data and Blocks</title>
      <link>https://vinoo.io/writing/2023-01-24-distributed-data-and-blocks/</link>
      <pubDate>Tue, 24 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-01-24-distributed-data-and-blocks/</guid>
      <description>Tuning the &amp;#39;chunks&amp;#39; of data that live on distributed file systems.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/distributed-data-and-blocks">Efficiently (Substack)</a></em></p>
<p>This is a continuation of a previous blog post about efficient data partitioning.</p>
<p>In the previous post, I discussed how data layout on disk impacts analytics performance. This post focuses on tactical implementation using open source technologies.</p>
<p>Topics I&rsquo;ll cover:</p>
<ol>
<li>HDFS</li>
<li>Blocks + Block Size</li>
<li>Block sizes + tradeoffs</li>
</ol>
<h2 id="background">Background</h2>
<p>Data organization on disk dramatically affects analytics performance. I previously explored row-oriented, columnar, and hybrid storage models — now let&rsquo;s connect these concepts to modern data infrastructure.</p>
<h2 id="hadoop">Hadoop</h2>
<p>Hadoop provides an ecosystem enabling:</p>
<ol>
<li>Distributed data storage (HDFS — Hadoop Distributed File System)</li>
<li>Data querying (MapReduce)</li>
<li>Compute resource management (YARN)</li>
<li>Additional common utilities and an object store (Ozone)</li>
</ol>
<p>This discussion focuses on HDFS as the most relevant storage-side component for understanding data partitioning.</p>
<h2 id="hdfs">HDFS</h2>
<p>Let&rsquo;s say you have an invaluable file that contains the names, addresses, and phone numbers of employees at your company.</p>
<p><img alt="Single server with employee data" height='ã' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-single-server_hu_4a9225ad49d791aa.webp" width='̠'></p>
<p>That file lives on a single server. What happens if the server goes down?</p>
<p><img alt="Server failure" height='Ĺ' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-server-failure_hu_f71ae30f2aa85d0f.webp" width='̠'></p>
<p>Nobody can access the file. That&rsquo;s a problem. So, you decide to store a copy of the file on two servers.</p>
<p><img alt="File replicated across two servers" height='ē' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-replication_hu_235c64de13b35da7.webp" width='̠'></p>
<p>Great — now if one server goes down, the file is still available. But what happens when someone updates the file? You now have two copies that need to stay in sync.</p>
<p><img alt="Synchronization problem" height='Ĺ' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-sync-problem_hu_55ec8d0dd02b271e.webp" width='̠'></p>
<p>This orchestration complexity is the central problem that HDFS solves.</p>
<h2 id="blocks">Blocks</h2>
<p>HDFS stores data in blocks. These are the same Blocks from the last post. They are indivisible segments of data.</p>
<p>A block represents the minimum amount of data readable in a single operation — any read requires reading at least one complete block.</p>
<p>Files are divided into blocks during storage.</p>
<p><img alt="File divided into blocks" height='ä' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-blocks_hu_a081ba7ca8158108.webp" width='̠'></p>
<p>These blocks are then distributed across multiple cluster nodes.</p>
<p><img alt="Blocks distributed across cluster nodes" height='ɘ' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-blocks-distributed_hu_d0d8485513512921.webp" width='̠'></p>
<p>And then replicated for fault tolerance.</p>
<p><img alt="Blocks with replication" height='Ʌ' loading="lazy" src="/writing/2023-01-24-distributed-data-and-blocks/hdfs-blocks-replicated_hu_c87cf08546966b45.webp" width='̠'></p>
<p><strong>Architecture Note:</strong> These diagrams intentionally simplify HDFS architecture, omitting NameNode/DataNode separation, clients, and other components. The focus remains on partitioning concepts.</p>
<h2 id="block-size">Block Size</h2>
<p>The default Hadoop block size is 128MB (previously 64MB). While this may initially seem massive when working with small local files, the size makes sense once you are storing files that run into the hundreds of gigabytes.</p>
<p>The block size is configurable on a per-client basis.</p>
<h3 id="tradeoffs">Tradeoffs</h3>
<p>Block size selection determines the number of chunks and corresponding I/O operations for reading and writing.</p>
<p><strong>Larger blocks:</strong></p>
<ul>
<li>Fewer blocks created</li>
<li>Fewer I/O operations needed</li>
<li>Increased memory requirements during processing</li>
</ul>
<p><strong>Smaller blocks:</strong></p>
<ul>
<li>More blocks created</li>
<li>More I/O operations required</li>
<li>Lower memory consumption per block</li>
<li>Benefits for small files and random access patterns</li>
<li>Increases NameNode metadata overhead</li>
<li>Creates scalability risks</li>
</ul>
<p>The <a href="https://blog.cloudera.com/the-small-files-problem/">small files problem</a> is a well-documented concern in distributed systems — too many small files create operational challenges.</p>
<p>Block size also impacts HDFS fault tolerance, though that falls outside this article&rsquo;s scope.</p>
<h2 id="tuning-for-efficiency">Tuning for Efficiency</h2>
<p>Reducing I/O through partition pruning provides substantial performance gains. Optimal block size selection yields strong results.</p>
<p>However, no universal solution exists. Selection depends on dataset characteristics, usage patterns, data type, use case, and additional factors.</p>
<p><strong>High-level recommendations:</strong></p>
<p><strong>Files under a few hundred megabytes:</strong> Use smaller block sizes (64MB or 128MB) to minimize wasted block space.</p>
<p><strong>Files several gigabytes or larger:</strong> Use larger block sizes (256MB or 512MB) to minimize generated blocks and unnecessary I/O.</p>
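<p>To make these numbers concrete, here&rsquo;s a quick back-of-the-envelope sketch of how block size changes the block count for a given file – pure arithmetic, no HDFS required:</p>
<pre><code class="language-python">import math

def block_count(file_size_gib: float, block_size_mib: int) -&gt; int:
    """Number of HDFS blocks needed to store a file of the given size."""
    file_size_mib = file_size_gib * 1024
    return math.ceil(file_size_mib / block_size_mib)

# A 10 GiB file under different block sizes.
for block_size in (64, 128, 256, 512):
    print(f"{block_size} MiB blocks: {block_count(10, block_size)} blocks")
# 64 MiB: 160 blocks, 128 MiB: 80, 256 MiB: 40, 512 MiB: 20.
</code></pre>
<p>Every one of those blocks is an entry in the NameNode&rsquo;s metadata and a unit of I/O, which is why the count matters.</p>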
<p>These are just guidelines and selecting your optimal block size will likely require some testing on your side.</p>
]]></content:encoded>
    </item>
    <item>
      <title>On-Disk Storage Methods (w/ visualizations)</title>
      <link>https://vinoo.io/writing/2023-01-14-on-disk-storage-methods/</link>
      <pubDate>Sat, 14 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://vinoo.io/writing/2023-01-14-on-disk-storage-methods/</guid>
      <description>The way you write data can affect your performance. Exploring row-wise, columnar, and hybrid storage methods with visualizations.</description>
      <content:encoded><![CDATA[<p><em>Originally published on <a href="https://vinooganesh.substack.com/p/on-disk-storage-methods">Efficiently (Substack)</a></em></p>
<p>A few years ago, I gave a talk at <a href="https://www.databricks.com/session_na20/the-apache-spark-file-format-ecosystem">Spark Summit 2020</a> about file formats, covering Avro, ORC, and Parquet. I received numerous questions about the topic and responded to them point-to-point, which left the knowledge confined to those forums alone.</p>
<p>That isn&rsquo;t helpful for most people. This post aims to fix that.</p>
<p>In this series, I&rsquo;ll outline the primitives of this topic and then explore the hands-on details.</p>
<h2 id="problem">Problem</h2>
<p>In the efficiency space, minimizing &ldquo;work&rdquo; is key. Whether work requires compute, network, or storage, &ldquo;the goal of efficient data usage is to get the most accurate answer in the fastest and cheapest way possible.&rdquo;</p>
<p>File Formats help data practitioners store their data in ways that minimize work. When you think of a file format, you may think of extensions like .xlsx, .pdf, .pptx. Similarly, technologies like Parquet, Avro, and ORC serve this purpose.</p>
<h2 id="background--example-data">Background / Example Data</h2>
<p>A partition is a logical segment of data. In the big data world, this usually means a piece of a larger dataset. For our purposes, I&rsquo;m going to use an example dataset below.</p>
<p>This dataset has 3 columns (Column A, Column B, and Column C) and 4 rows (Row 0, Row 1, Row 2, and Row 3).</p>
<p><img alt="Example data table" height='Ĺ' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-example-table_hu_8baefa3a812c6cc7.webp" width='̠'></p>
<p>This table should look familiar — something you&rsquo;ve seen in Excel, Pandas, etc. Let&rsquo;s take this example further and split the individual elements into their own logical &ldquo;pieces.&rdquo;</p>
<p><img alt="Cell reference notation" height='ō' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-cell-reference_hu_2231e9ea91e14a12.webp" width='̠'></p>
<p>We can refer to each &ldquo;cell&rdquo; by its &ldquo;&lt;column&gt;&lt;row&gt;.&rdquo; For example, the second row in Column B is called B1.</p>
<h2 id="storage">Storage</h2>
<h3 id="background">Background</h3>
<p>Data is stored on hard disks in what is called a <strong>block.</strong> A block is the minimum amount of data read during any read operation.</p>
<p>Blocks function like a suitcase. When checking a bag on a trip, you pay the same price regardless of how full or empty your suitcase is. It&rsquo;s optimal to fill your suitcase with as many relevant objects as possible, in as easy a way to find as possible.</p>
<p>Extending this analogy: packing unnecessary stuff isn&rsquo;t great. Bringing too many suitcases (unless strictly necessary) also isn&rsquo;t great. Inside the suitcase, you want to &ldquo;group&rdquo; similar things together — each pair of socks should be next to each other in the same suitcase, rather than split across different ones.</p>
<p>In hard drives, these insights apply. Reading unnecessary data is expensive. Reading fragmented data is expensive. Random seeks are expensive as well.</p>
<p>Our goal is to lay data out in a manner optimized for our workflows.</p>
<h3 id="row-wise-storage">Row-wise Storage</h3>
<p>In database land, the common way to store data used to be row-wise. It&rsquo;s pretty easy to understand why. Most people think about datasets as a list of rows.</p>
<p>Taking our dataset above, let&rsquo;s store this in a row-wise method.</p>
<p><img alt="Row-wise storage diagram" height='ś' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/row-wise-storage_hu_9b6d37bdbdf0da2.webp" width='̠'></p>
<p>I have taken each row in order and packed as many rows as I can into a block before moving on to the next block.</p>
<p>This method works great when the goal is to read the data sequentially. All that&rsquo;s required is a simple linear scan of the block in order. It doesn&rsquo;t work as well if, for example, you want to only look at Column C. In that case, you&rsquo;re required to read all of the block (i.e., read all of the data) and filter down to Column C.</p>
<p>This is <strong>row-wise</strong> storage methodology.</p>
<h3 id="columnar-column-wise-storage">Columnar (Column-wise) Storage</h3>
<p>Column-wise storage takes the opposite approach and orients around columns.</p>
<p><img alt="Columnar storage diagram" height='ŗ' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/columnar-storage_hu_1bce7707ed5e2a6b.webp" width='̠'></p>
<p>As you can see, we first take the entire column, pack it into a block, and then move onto the next column.</p>
<p>This method works great when the data is read in a columnar way (i.e., one column at a time). It doesn&rsquo;t work well if, for example, you want to reconstruct Row 0. In that situation, you&rsquo;d need to read all of the data and filter down to the elements that make up Row 0.</p>
<p>Now, we&rsquo;re in a dilemma — one approach seems to favor a row-oriented workflow, one approach seems to favor a column-oriented workflow. Luckily for us (and Goldilocks), there&rsquo;s a middle ground.</p>
<p><img alt="Goldilocks principle" height='ȕ' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/goldilocks_hu_70cd99c3e3eb2552.webp" width='̠'></p>
<h3 id="hybrid-storage">Hybrid Storage</h3>
<p>A hybrid storage model gives us the best of both worlds. First, we group a fixed number of rows together and then further group them by column. These segments are called &ldquo;Row Groups&rdquo; (at least in Parquet terminology).</p>
<p>In this example, we first selected two rows — Row 0 and Row 1. We then grouped those rows by column, and inserted them into our first Row Group.</p>
<p><img alt="Logical row groups" height='Ŕ' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-row-groups-logical_hu_28c4af6518c6d5c9.webp" width='̠'></p>
<p>I called these logical Row Groups because this is more of how we should be thinking about them, rather than how they may necessarily end up on disk.</p>
<p><img alt="Row groups on disk" height='Ī' loading="lazy" src="/writing/2023-01-14-on-disk-storage-methods/storage-row-groups-physical_hu_b50ebd16ae2331a6.webp" width='̠'></p>
<p>This representation of data is actually immensely powerful. It allows us to optimize our workflows for both row-oriented and column-oriented operations.</p>
<p>Let&rsquo;s talk about how this works.</p>
<p>In the case of a row-oriented workflow, let&rsquo;s say you want to recreate Row 2. To do this, you would simply need to look at Block 1 and Block 2. If you were operating in a Columnar storage model, you would need to look at Block 1, Block 2, and Block 3. You&rsquo;ve saved a whole Block!</p>
<p>In the case of a column-oriented workflow, let&rsquo;s say you want to recreate Column B. In this case, you would simply need to look at Block 1 and Block 2. If you were operating in a Row-wise storage model, you would need to look at Block 1, Block 2, and Block 3. You&rsquo;ve once again saved a whole Block!</p>
<p>Our examples used very small data; you can imagine how this extrapolates with larger datasets.</p>
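<p>If you want to see row groups in action, here&rsquo;s a small sketch that writes our toy table into a Parquet file with two rows per row group. The choice of pyarrow is mine – the concept applies to any hybrid format:</p>
<pre><code class="language-python">import pyarrow as pa
import pyarrow.parquet as pq

# The example table: 3 columns (A, B, C) and 4 rows (0 through 3).
table = pa.table(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
    }
)

# Hybrid layout: group two rows at a time, then store them column by column.
pq.write_table(table, "example.parquet", row_group_size=2)

metadata = pq.ParquetFile("example.parquet").metadata
print(metadata.num_row_groups)          # 2 row groups
print(metadata.row_group(0).num_rows)   # 2 rows in the first group
</code></pre>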
<h2 id="data-workflows">Data Workflows</h2>
<p>Throughout this post, I&rsquo;ve referred to my data workflows as &ldquo;row oriented&rdquo; or &ldquo;column oriented.&rdquo; Luckily for us, the big data community has come up with some terminology that should help bring these two workflows to life.</p>
<h3 id="oltp">OLTP</h3>
<p>Online Transaction Processing (OLTP) workloads generally involve a large number of short queries/transactions. These tend to be more focused on processing than analytics, and as such involve more data updates and deletions. Roughly — we can consider OLTP workflows as &ldquo;row oriented&rdquo; workflows.</p>
<h3 id="olap">OLAP</h3>
<p>Online Analytical Processing (OLAP) workloads are more analysis than processing focused. As such, there tends to be more analytical complexity per query and fewer CRUD transactions. Roughly — we can consider OLAP workflows as &ldquo;column oriented&rdquo; workflows.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Using data efficiently relies on using all levels of the &ldquo;data stack&rdquo; (storage, network, compute) efficiently. Reducing the amount of unnecessary data read during a query process can have compounding effects on the speed and efficiency of your analytics process.</p>
<p>In subsequent parts of this series, I&rsquo;ll be digging more into the details of how everything we have covered thus far can be applied in analytics workloads.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
