Latest Posts (15 found)
Jefferson Heard 1 week ago

The best worst hack that saved our bacon

No-one really likes engineering war stories, but this one's relevant because there's a moral to it. I've talked before about defining technical debt as technical decisions that provide immediate value, but with long-term negative impact if they aren't cleaned up. Sometimes introducing technical debt is necessary and you do it consciously to avoid a disaster. As long as you provide yourself enough room to clean it up, it's just part of the regular course of business when millions of people count on your software to get through their days.

Twelve years of calendar appointments on our platform, and the data model was starting to show some wear and tear. Specifically, our occurrence table was created with a plain integer primary key, and we were approaching two billion occurrences on the calendar. More precisely, the primary key was rapidly approaching 2,147,483,647 – the magic number that is the maximum value for a signed 32-bit integer.

We had actually known about this for some time, and we had done most of the work to fix it already. Our backend code was upgraded to bigints and the actual column itself had a migration set to upgrade it to a big integer. The plan had been in the works for a month and a half, and we almost ran with it. But then, roughly a week before we were going to deploy it (and maybe only a month before the keys ran out), someone, maybe me, I don't recall, noticed that these integer keys were exposed in one of our public APIs.

You can count on one thing in SaaS software: if you provide an integration API to your customers or vendors and it exposes an attribute, that attribute is crucial to someone, somewhere. And in our case the people using the integrations often had to rely on their university's IT department to do the integration itself. Those backlogs are counted in months, and so we couldn't deploy something that would potentially break customer integrations.

What to do? Well, Postgres integer primary keys are signed. So there's this WHOLE other half of the 32-bit word that you're not using if you're just auto-incrementing keys. My simple (read: stupid) solution, which absolutely worked, was to set the sequence on that primary key to -2,147,483,648 and let it continue to auto-increment, taking up the other half of that integer space.

It was so dumb that I think we met like three times together with SRE to say things like, "Is it really this simple? Is this really likely to work? Are we really doing something this dumb?" The conclusion was yes, and that it would buy us up to 3 years of time to migrate, but we would do it within 6-8 months so all IT departments could make alternative arrangements for their API integrations.

The long term solution was the BigInt, yes, but it was also to expose all keys as opaque handles rather than integers, to avoid dictionary attacks and so that we could use any type we needed to on the backend without API users having to account for it. It was also to work through the Customer Success team and make sure no-one counted on the integer-ness (integrality?) of the keys, or better, that no-one was using the occurrence IDs at all.

In the end we had a smooth transition because of quick thinking and willingness to apply a baldfaced hack to our production (and staging) database. We had a fixed timeline we all acknowledged where the tech debt had to be addressed, and we'd firmly scoped out the negative consequences of not addressing it.
It wasn't hard, but it meant that no matter who was in charge or what team changes were made, the cleanup would get done in time and correctly. It was the right thing to do. A few customers had been counting on those IDs and we were able to advise their IT departments about how to change their code and to show them what the new API response would look like long before they actually were forced to use it. In the meantime, everything just worked. Do I advise that you use negative primary keys to save room on your database? No. Was it the right choice of technical debt for the time? Absolutely.
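For the curious, the entire hack boiled down to one statement against the database. A minimal sketch, assuming psycopg2 and a sequence named occurrence_id_seq; the real table, sequence, and connection details were of course different:

```python
# Minimal sketch of the fix. The connection string and sequence name here are
# illustrative, not the real production values.
import psycopg2

conn = psycopg2.connect("dbname=calendar")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # A signed 32-bit integer runs from -2,147,483,648 to 2,147,483,647.
    # Auto-increment had nearly exhausted the positive half, so restart the
    # sequence at the bottom of the negative half and keep counting up.
    cur.execute("ALTER SEQUENCE occurrence_id_seq RESTART WITH -2147483648")
```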

Jefferson Heard 2 months ago

Tinkering with hobby projects

My dad taught me to read by teaching me to code. I was 4 years old, and we'd do Dr. Seuss and TI-99/4A BASIC. I will always code, no matter how much of an "executive" I am at work. I learn new things by coding them, even if the thing I'm learning has nothing to do with code. It's a tool I use for understanding something I'm interested in.

These days I'm diving into woodworking, specifically furniture making. I'll post some pictures in this article, but I want to talk about my newest hobby project. I'm not sure it'll ever see the light of day outside of my own personal use. And that's okay. I think a lot of folks think they have to put it up on GitHub, promote it, try to make a gig out of it, or at least use it as an example in their job interviews. I think that mindset is always ends-oriented instead of journey-oriented. A hobby has to be about the journey, not the destination, because the point of a hobby is to enjoy doing it.

When I was working on the coffee table I made a month ago or the bookshelf I just completed, every step of the journey was interesting, and everything was an opportunity to learn something new. If I were focused on the result, I wouldn't have enjoyed it so much, and it's far easier to get frustrated if you're not in the moment, especially with something like woodworking. Johnathan Katz-Moses says, "Woodworking is about fixing mistakes, not not making them."

So when I write a hobby project, I write for myself. I write to understand the thing that I'm doing, and often I don't "finish" the project. It's not because I get distracted, but because the point of the code was to understand something else better. In this case it's woodworking. First, a couple of table pictures:

I will probably end up using Blender and SketchUp for my woodworking, because I'd rather spend more time in the shop than on my computer (although there's plenty of time waiting for finishes and glue to dry for me to tinker on code and write blog posts for you all). But the reasons I wanted to write some new code for modeling my woodworking are:

- I like to code.
- I loved POV-Ray as a kid.
- I wanted to write something where I could come out with a "cut list" and an algorithm for making a piece.

As a kid, I got a shareware catalog, and I'd use my allowance to buy games and tools. My most-used shareware program was POV-Ray, and I kind of want something like that for reasons I'll get into. With my Packard Bell 386, and the patience to start a render before bed and check it when I got back from school the next day, I could make POV-Ray do some really impressive things. When we got our first Pentium, I really went nuts with it. The great thing about POV-Ray was CSG, or constructive solid geometry, and the scene description language. You modeled in 3-D by writing a program, which suits me well. But also, CSG.

I think CSG is going to be perfect for modeling woodworking. The basic idea is that you use set-theory functions like intersection, difference, and union to build up geometries (meshes in our case). So if I want a compound miter cut through a board, that's a rotation and translation of a plane, and a difference between a piece of stock and that plane, with everything opposite its normal vector considered "inside" the plane. If I want to make a dado, that's a square extruded along the length of the dado cut. If I want to make a complicated router pattern like I would with a CNC, I can load an SVG into my program, extrude it, and then apply the difference to the surface of a board. And so on.

Basically, the reason this works so well for woodworking is that I have to express a piece as a series of steps, and these steps are physically based. I can use CSG operations to model actual tools like a table saw, router, compound miter saw, and drill press.
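To make the CSG idea concrete, here's a rough sketch of a single compound miter cut. The actual project is TypeScript with three.js and three-bvh-csg; this is just the same idea expressed with Python's trimesh library, and every name and dimension in it is made up:

```python
# A sketch of one compound miter cut as CSG. Illustrative only: the real
# project uses three.js + three-bvh-csg, and these dimensions are invented.
import numpy as np
import trimesh
from trimesh.transformations import rotation_matrix, translation_matrix

# A piece of stock: a 24" length of 1x4 (3.5" x 0.75" actual), units in inches.
stock = trimesh.creation.box(extents=[24.0, 3.5, 0.75])

# The "waste" side of the cut: a large box whose near face lies on the cut
# plane. Rotate it to a 30-degree miter and a 15-degree bevel, then slide it
# so the cut lands about 8" along the board (the box is 30" wide, so its
# center sits 15" past the cut plane).
waste = trimesh.creation.box(extents=[30.0, 30.0, 30.0])
waste.apply_transform(rotation_matrix(np.radians(30), [0, 0, 1]))   # miter
waste.apply_transform(rotation_matrix(np.radians(15), [0, 1, 0]))   # bevel
waste.apply_transform(translation_matrix([8.0 + 15.0, 0.0, 0.0]))

# The cut itself is a boolean difference between the stock and the waste
# volume (trimesh needs a boolean backend such as manifold3d installed).
cut_board = stock.difference(waste)
print(cut_board.volume, "cubic inches of board left")
```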
With a program like Blender or SketchUp, I can model something unbuildable, or so impractical that it won't actually hold up once it's put together. With CSG I can "play" the piece being made, step by step, and make sure that I can make the cuts and the joins, and that they'll be strong, effectively "debugging" the piece like using a step-by-step debugger. I can also take the same set of steps and write them out as a set of plans, complete with diagrams of what each piece would look like after each step.

I'm going to go back to Logo and make this a bit like "turtle math." My turtle will be where I'm making my cut or adding my stock, and I will move it each time before adding the next piece. This is basically just a way to store translation and rotation on the project so I don't have to pass those parameters into every single geometry operation, and also a way to put a control for that on the screen to be manipulated with the mouse or keyboard controls. This is only my current thinking and I may abandon it if I think it's making things more complicated for me.

I won't belabor point #1 above. I think we know I love to code. But what I will do quickly is talk about the tools I'm using. I usually use Python, but this is one case where I'm going to use Typescript. Why? Because the graphics libraries for JS/TS are so much better and more mature, and because it's far easier to build a passable UI when you have a browser to back you. The core libraries that I'll be using in my project are:

- three.js
- three-bvh-csg
- three-mesh-bvh

Three.js is pretty well known, so I won't go into that except to say that it has the most robust toolset for the work I'm intending to do. BVH stands for "bounding volume hierarchy," which is a spatial index of objects that you can query with raycasting and object intersection. It's used by three-bvh-csg for performance. I'm planning to use it as well to help me establish reference faces on workpieces.

When you measure for woodworking, rulers are not to be trusted. Two different rulers from two manufacturers will provide subtly different measurements. So when you do woodworking, you typically use the workpiece as a component of your measurements. A reference face, from the standpoint of the program I'm writing, is the face of an object that I want to measure from, with its surface normal negated. Translations and rotations will all be relative to this negated surface normal (it's negated so the vector is pointing into the piece instead of away from it). My reference faces will be sourced from the piece. They'll be a face on the object, a face on the bounding box, or a face comprised of the average surface normal and chord through a collection of faces (like when measuring from a curved piece).
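In code, a reference face is really just a point on the piece plus a negated normal to measure along. A tiny numpy sketch of the concept (the real project will derive these from actual mesh faces via three-mesh-bvh, and all the numbers here are made up):

```python
# Measuring from a reference face: a point on the face and its outward normal,
# negated so it points INTO the workpiece rather than away from it.
import numpy as np

face_point = np.array([0.0, 0.0, 0.75])   # top face of a 3/4" thick board
face_normal = np.array([0.0, 0.0, 1.0])   # outward surface normal

reference_axis = -face_normal              # negated: points into the piece


def offset_from_reference(depth: float) -> np.ndarray:
    """A point `depth` units into the workpiece, measured from the face."""
    return face_point + depth * reference_axis


# A 1/4"-deep dado bottom, measured from the top face rather than from a ruler.
print(offset_from_reference(0.25))         # -> [0.   0.   0.5 ]
```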
I've only just started. I've spent maybe 4 or 5 hours on it, relearning 3D programming and getting familiar with three.js and the CSG library. I don't think it's impressive at all, but I do think it's important in a post like this to show that everything starts small. It's okay to be bad at something on your way to becoming good, and even the most seasoned programmer is a novice in some ways. Sure, I can write a SaaS ERP system, a calendar system, a chat system or a CMS, but the last time I wrote any graphics code was 2012 or so, and that was 2D stuff, so I'm dusting off forgotten skills. Right now there's not even a GitHub repository. I'm not sure there ever will be.

It's really just a project for me, useful and fun as long as it's teaching me stuff about woodworking, and maybe eventually genuinely useful in putting together projects. And that's okay. Not everything is meant to be a showcase of one's amazing skills or a way to win the Geek Lottery (phrase TM my wife).

Jefferson Heard 2 months ago

How I design backend services

First off, this is gonna be a long article. Strap in... I'm old, I get it. For the majority of my career, MVC was the way to design web applications. But I read a book about Evolutionary System Design and went to an O'Reilly conference on system architecture in 2018 and saw Martin Fowler talk on the "Microservice Design Canvas," and the two things together completely changed my thinking on how to design systems. I've really fused these ideas to form my own theory about backend systems design. My goal is to create systems that are:

- Cost effective to develop.
- Easy for new developers to drop into.
- Performant enough to be cost effective to scale.
- Easy to break up into microservices or other components to support horizontal scaling as required.
- Easily "composable," i.e. big things don't require new code so much as aggregates of small things.
- Easy to monitor, react to problems, and to debug.
- Easy to add nontrivial things to on a time budget without creating new technical debt, or at least keeping it local to the thing added.

The microservice design canvas has you design commands, queries, publications, and subscriptions, then data. I go a little further, because I try to take deployment, scaling aspects, and security into account at design time. Also, I don't take a "microservices first" approach. In my experience the amount of work and money it takes to instrument and monitor microservices really just doesn't pay off for young companies with few or zero customers (where practically every system starts). I make it so it's easy to refactor pieces into microservices, but there's no requirement for the system to start there. There's also no requirement that you start with streaming or event sourcing solutions. This model works for traditional web APIs and for all the more advanced stuff that you end up needing when you finally do scale to hundreds of thousands to tens of millions of active daily users.

All code examples in this post will be in Python, but truthfully this could work in anything from Java to Go to Typescript to LISP. My personal preferred core stack is:

- CDK or Terraform/Helm
- AuthZed for permissions

This is the order you want to do the design in. MVC runs deep in development culture, and most developers will try to start any backend project in the ORM. This is a mistake, as I've pointed out before. If you do that, then you will end up adding elements that serve no purpose in the final version of the service, and you'll end up creating data structures that aren't conducive to the efficient working of your interface. Start with the interface layer. Then hit the data layer. The data layer gives you your options for auth implementation, so then do that. All those tell you enough about your environment and infrastructure requirements that you can design those. And it's not that you won't or can't skip around a bit, but in general these things will resolve themselves fully in this order.

Typical examples of service design start with a TODO list, but I'm going to start with a simple group calendar, because it has more moving parts to illustrate with than the trivial TODO list, and you will be able to see how much this model simplifies a service that's naturally a little more complex.

This will be short, but I want to set the stage for the decisions I'm making and how I'm breaking it down. I'm treating this like a small service. User management is expected to happen elsewhere. This service will be called (indirectly) by a UI and directly by other services that serve up the UI and broker scheduling information with other parts of the company's SaaS offering. It is an MVP, so it's not expected to handle calendar slots or anything "Calendly"-like. It's just a barebones store of appointment data. Since user management is separate, the appointments can be owned by users, groups, or rooms or anything, but we don't care because that's all handled (presumably) by the user, group, and resource service.
What do my customers care about for V1?

- Creating, editing, and deleting appointments.
- Recurrence.
- Being able to see whether a new appointment will cause a conflict in their own or other attendees' calendars, and what those conflicts are.
- When scheduling multiple people, they want to see the conflict-free openings in the calendar before having to click through weeks of people's calendars to find a time.
- Being able to view their calendar in various ways, up to a year at a time.
- Having all their calendar items together. All apps in the company should dump appointment data into this if they are creating it, and I should be able to load Outlook and Google calendars as well.
- Average calendar load time for a single person's calendar should be 2 seconds or less on good internet.
- Average calendar load time for a month of appointments for up to 50 people should be 30 seconds or less.
- To be able to set 0 or more configurable reminders by email, SMS, or push.

What does my customer success team care about for V1?

- If an appointment disappears or moves in the calendar, they want to know how that happened - what app moved it and who, so that if a customer thinks it's a bug they can prove it one way or another and take the appropriate action, and we should be able to restore the old appointment.

What do my product people care about for V1?

- Engagement: how many calendar views across which people per day, and any appointment metadata that would be appropriate as classifiers distinguishing between the type of appointments people are adding or how many recurring vs. individual, etc. And how many appointments added / removed per day by people (as opposed to external services like Google and Outlook).

What are the (non-universal) engineering requirements to make this system flow well?

- There are other apps in my company with calendars. I need to make sure that appointments managed by those apps stay managed by those apps if there's metadata that my service cannot account for directly.

So basically my requirements are:

- timezone-aware timespans, including recurrence rules
- links and attachments
- appointment metadata such as originator, attendees, title and description
- some way to refer to an external owner of the appointment, for appts coming from services like Outlook and Google and for internal owners
- understanding which app made it and whether it has to be managed by that app (such as if the app is storing other metadata on the record that would be corrupted if a different app edited it)
- an audit trail of some sort to serve both the CS and Product use-cases. Restoring appointments can be manual by CS or engineering, but only if the audit trail exists with sufficient data on it.

There are also a few affordances that make my job easier:

- On average, a person will not have more than 4-5 appointments a day, 5 days a week. That totals out to 1,300 appts per person per year that we have to save, load, or destroy. It's likely I can do this much bulk writing in a single transaction and stay within the time limits we've imposed on ourselves.
- Across all our services, there are really no more than 150,000 daily active users, and of those, we can estimate that they're only changing their calendars a couple of times a day at most. That means that the write traffic, at least at first, will be fairly low. I can likely get to our MVP without pushing the writes to an async worker, although it's likely something we're going to want as our app grows.
- Traffic will be bursty, but within limits. When new users are added or when they sign up to sync their calendar from Outlook or Google, we're going to see a spike in traffic from baseline. This can likely be handled by combining FastAPI BackgroundTasks (less clunky than a queueing system and async workers for infrequent bursty traffic) with the Kubernetes autoscaler, at least at first.

First, a link to the post where I go into detail about Commands:

Just to sum it up, your commands and queries should be governed by base classes that provide the "guts" of their mechanics, and serve as suitable points to enhance those gut mechanics. The main difference between Commands and Queries is that the Query classes should only have access to a read-only connection into the data. But if, for example, you want an audit trail of every command or query issued against the system, or if you want a total ordering or the ability to switch to event sourcing, the base classes are where you do it.
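Here's a minimal sketch of what those base classes can look like, assuming FastAPI-style async code and pydantic. The class names and the ctx/audit-log plumbing are mine for illustration, not something you can pip install:

```python
# Minimal sketch of Command/Query base classes. Every command, query, and
# publication is an Event we can log, order, or replay.
from datetime import datetime, timezone
from uuid import UUID, uuid4
from pydantic import BaseModel, Field


class Event(BaseModel):
    event_id: UUID = Field(default_factory=uuid4)
    issued_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    issued_by: str  # user or service principal


class Command(Event):
    """Base class for mutations. Subclasses implement _apply()."""

    async def execute(self, ctx) -> None:
        # Cross-cutting concerns live here: the audit trail, total ordering,
        # or handing the event to a stream instead of executing it inline.
        await ctx.audit_log.record(self)
        await self._apply(ctx)

    async def _apply(self, ctx) -> None:
        raise NotImplementedError


class Query(Event):
    """Base class for reads. Only ever sees a read-only connection."""

    async def fetch(self, ctx):
        await ctx.audit_log.record(self)
        return await self._run(ctx.read_only_db)

    async def _run(self, read_only_db):
        raise NotImplementedError
```

A concrete command like SaveAppointments then only implements _apply(), and the audit trail, ordering, or event-sourcing switch lives in one place.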
I typically write bulk mutations first, because there are just so many cases where people want to import data, and bulk mutations typically have more efficient implementation options than fanning out atomic ones. For our calendar, here is a list of commands:

- SetOpenHours (rrule) - Allows someone to set the time of day where someone else is allowed to schedule meetings including them. Like work hours.
- SaveAppointments - Bulk creation and update of appointment data.
- DeleteAppointments - Bulk deletion of appointment data. Cannot delete from external services or app-management-required appointments.

We want a small set of queries we can optimize our data structures for, but that will suffice to construct any kind of calendar for the UI, and to aid users in creating conflict-free appointments.

- CheckForConflicts (timespan, attendees, limit, cursor) - Return a list of people that have conflicts for a given timespan.
- FindOpenings (timespan, attendees, limit, cursor) - Return a list of timespans within the input timespan where the attendees have no conflicts.
- GetAppointments (timespan, attendees, limit, cursor) - Return a list of appointments, sorted by attendee, and then by timespan.
- SearchAppointments (search_criteria, sort_criteria, timespan, limit, cursor) - Return a list of appointments that meet a certain set of search criteria, which can be one or more of: Title, Description, Is External, Is Owned By (app).

Our publications:

- AppointmentCreated (id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was created.
- AppointmentModified (id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was moved.
- AppointmentDeleted (id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was deleted.
- AppointmentReminderSentReceived (id, owner, attendees, scheduled_time, actual_time) - lets subscribed services know when a reminder was sent out and received by the given attendees.
- AppointmentReminderSentFailed (id, owner, attendees, scheduled_time, code, reason) - lets subscribed services know when a reminder was sent but failed to be received.

Our subscriptions:

- ExternalCalendarsBroker - Trust me, use a 3rd party service for this. I'm going to suggest either Cronofy or Nylas's APIs for this. But be aware of and put in extra time to design for external system failures and hangups. Your users will judge discrepancies between their Google and Outlook calendars vs. yours harshly, and you want to be able to explain those differences when they happen and have something to push support on at the service you do choose when there are issues.

Our schedules:

- SendReminders - Sends reminders out to attendees. Will run once a minute to check if there are reminders to send and then send them. This may actually be broken up into a couple of scheduled tasks, depending on how many reminders you're sending out per minute, and how long they take to process. There is a fair amount of subtlety in sending reminders when people are allowed to change appointments up to the last minute. You're going to want to define a "fitness function" for how often reminders can fail, and how often reminders for deleted appointments can be sent out, and use that to determine how fiddly you want to be here.
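The next section settles on Redis pub-sub with FastStream, so as a preview, here is a minimal sketch of what one publication and one subscription could look like with that choice. Channel names and the event shape are illustrative:

```python
# Sketch of one publication and one subscription with FastStream's Redis
# broker. The channel names and the event model are made up for the example.
from datetime import datetime
from pydantic import BaseModel
from faststream import FastStream
from faststream.redis import RedisBroker

broker = RedisBroker("redis://localhost:6379")
app = FastStream(broker)


class AppointmentCreated(BaseModel):
    id: str
    owner: str
    attendees: list[str]
    starts_at: datetime
    ends_at: datetime


async def publish_created(event: AppointmentCreated) -> None:
    # Called from the SaveAppointments command after a successful write,
    # while the FastStream app (and its broker connection) is running.
    await broker.publish(event, channel="calendar.appointment.created")


@broker.subscriber(channel="calendar.appointment.created")
async def notify_other_apps(event: AppointmentCreated) -> None:
    # A subscriber (in this service or another one) reacts to the publication.
    ...
```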
Now this gets interesting. Because we've already designed our commands, there's a lot more to our data model than we first thought, as well as a lot fewer data attributes in the core model. We want as few attributes as possible. Online database migrations are pretty rock-solid these days, so accidentally getting into production without something that you want for a future release won't be a problem.

Also, it's clear from above that we don't just want Postgres for our data model. We're searching appointments, so we want OpenSearch, and for simplicity's sake I'm assuming we're using Redis for pub-sub, and FastStream, which is a FastAPI-like framework for streaming applications. You could use Kafka, SQS/SNS, or RabbitMQ, depending on your scale or your dev and SRE teams' proficiency.

Now that we know what people want from our service, we can define our data and infrastructure, and points-of-scale in our code around it. We're going to use Postgres to store our appointment information. I'm not going to go into deep detail here about the fields, since those can be derived from the Command classes and the UI requirements, but as someone who has designed more than one calendar in my lifetime, I have some notes:

- Indexing Timespans - Postgres has a range type for intervals of time, and that range can be indexed with a GIST index. This gives us an optimal way to search for appointments with overlapping intervals and covering single points in time.
- Indexing Attendees - Likewise an array type with a GIN index will give us the ability to search for all appointments that include a given set of attendees. We may need a CTE to deal with set math, but it will still be highly optimized and relatively straightforward.
- Timezones - Store the timezone as an ISO code (not an offset) on a separate field and store the actual interval in UTC. If you don't store all your time info in the same timezone then you can't effectively index the timespan of an appointment. Okay, you can, but your indexing trigger starts to look complicated and you're saving nothing in terms of complexity in the timespan itself. Why use a code and not an offset? Because if someone moves the appointment and it crosses a DST boundary when they do so, you won't know to change the offset without the UI's help, making the interaction between your service and others need more documentation and making it more error prone.
- Recurrence Info - God save you and your users, read the RRULE standard closely, and in particular read the implementation notes for the most popular Javascript library and whatever your backend implementation language is. Better yet, this is one of those rare places where I'd advise you to roll your own if you have time, rather than use the widely accepted standard, because the standard is SO loose, and because the different implementations out there often skirt different sets of details in it. But if you use RRULE, one big non-obvious detail you need to trust me on: store the RRULE itself in the local time and timezone that the user used to create the recurring appointment. If you don't, day-of-week calculations will be thrown off depending on how far away from the UTC timezone someone is, and how close to midnight their appointment starts. It's not that you can't correct for it, but one way lies 2400 lines of madness and bugs and the other way lies a different but far simpler type of madness.
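Taken together, those notes sketch out a table definition along these lines. This is an illustrative SQLAlchemy sketch with made-up table and column names, not a finished schema:

```python
# Illustrative model reflecting the notes above: a GIST-indexed timespan range,
# a GIN-indexed attendee array, a timezone code rather than an offset, and the
# original RRULE text stored in the user's local timezone.
from sqlalchemy import BigInteger, Boolean, Column, Index, String, Text
from sqlalchemy.dialects.postgresql import ARRAY, TSTZRANGE
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Appointment(Base):
    __tablename__ = "appointment"

    id = Column(BigInteger, primary_key=True)        # bigint from day one
    title = Column(Text, nullable=False)
    description = Column(Text)
    timespan = Column(TSTZRANGE, nullable=False)      # stored in UTC
    timezone_code = Column(String, nullable=False)    # e.g. "America/Chicago"
    attendees = Column(ARRAY(String), nullable=False, default=list)
    owning_application = Column(String)                # which app manages it
    is_external = Column(Boolean, default=False)
    external_owner = Column(String)                    # Outlook/Google owner
    rrule = Column(Text)                               # in the user's local tz
    is_private = Column(Boolean, default=False)

    __table_args__ = (
        Index("ix_appointment_timespan", timespan, postgresql_using="gist"),
        Index("ix_appointment_attendees", attendees, postgresql_using="gin"),
    )
```

The rrule column holds the rule exactly as the user entered it, in their local timezone; expansion into concrete occurrences happens in UTC afterwards.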
Now, let's talk data models. Recurrences will be stored separately from appointments, and each appointment within the recurrence will be an actual appointment record within the database. We'll add a new scheduled task, UpdateOccurrences, which will run once a month and calculate occurrences out for 2 years (an implementation note to tell Product and CS about). The same code should be used when saving an appointment so that occurrences are saved that far out on initial appointment creation. We'll want to set a field on our Recurrence model to say how far occurrences have been calculated. That way, if someone modifies or deletes an occurrence from the set, we won't accidentally re-create it when we call UpdateOccurrences.

Along with the Postgres record, we're going to want to index title, description, attendees, is external, and owning application within OpenSearch. I won't bore you with the details of indexing these correctly because the requirements change a lot depending on the locales you operate in and the tokenizers you choose. Also, you'll most likely end up needing to query the service that expands attendee IDs into names, or code the search function to call it to reify full-text searches to possible attendee IDs. This latter idea may be a little better for MVP, since it won't require you to set up an async task and broker to save appointments.

How about that audit trail? Well, you'll notice conveniently that if you're using FastAPI and a pattern like my Command class, all commands are Events, and publications are also events. We can dump those events into Redshift or BigQuery and boom, our audit trail is now real. We can use the event history to cover the CS case of recreating changes in the event a bug or a person screws up someone's calendar. We can use the same audit trail to figure out how many appointments were created by whom for engagement metrics. And we can use the audit trail along with any logging we do in DataDog to measure our service against our performance and reliability metrics. The other great thing about everything being an Event is that we can adapt our commands to a streaming modality easily once we get to the point where we have to scale out. Dump the command event into Kafka and consume it with FastStream.

We're using an external service for all our user info, and we're proxying this service through our app-facing and user APIs, so presumably authentication is handled for us. Same with the security environment. All that is probably handled upstream. The only extra requirement we really have here is authorization. We want to allow our customers to make private appointments. That's easy, we can add that as a simple flag. But we probably also need to keep track of who can see whose calendar. I personally love AuthZed's (I swear they don't sponsor me, I just think it's a great tool) open source SpiceDB permission service. It's an implementation of Google's Zanzibar, which handles all the permissions storage and checks throughout Google, so you know it can handle your use case. So I'm going to suggest these new permissions in AuthZed without going into further implementation details:

- CanViewCalendarDetails - Whether or not the caller can view a given attendee's calendar (group, person, whatever – the perm service can handle this).
- CanModifyCalendar - Whether or not the caller can modify a given attendee's calendar.
- CanViewCalendarFreeBusy - Whether or not the caller can view free/busy info for a given attendee, even if they can't view the full calendar.

In each command I will make sure that the calendar can be modified, and in each query I will make sure that the response will be viewable to the requesting resource or person, and whether it should just be empty appointments with busy labels.
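For the check itself, here is a sketch of what a CanViewCalendarDetails call might look like, assuming the authzed Python client's v1 API; the object types, IDs, and permission name are illustrative and would have to match whatever SpiceDB schema you actually write:

```python
# Sketch of a per-query permission check against SpiceDB, assuming the
# authzed Python client. "calendar", "user", and the permission name are
# placeholders for whatever the real schema defines.
from authzed.api.v1 import (
    CheckPermissionRequest,
    CheckPermissionResponse,
    Client,
    ObjectReference,
    SubjectReference,
)
from grpcutil import bearer_token_credentials

client = Client("grpc.authzed.com:443", bearer_token_credentials("t_your_token"))


def can_view_calendar_details(caller_id: str, attendee_id: str) -> bool:
    resp = client.CheckPermission(CheckPermissionRequest(
        resource=ObjectReference(object_type="calendar", object_id=attendee_id),
        permission="can_view_calendar_details",
        subject=SubjectReference(
            object=ObjectReference(object_type="user", object_id=caller_id)
        ),
    ))
    return resp.permissionship == CheckPermissionResponse.PERMISSIONSHIP_HAS_PERMISSION
```

A query that fails this check can then fall back to CanViewCalendarFreeBusy and return busy labels instead of appointment details.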
Maybe this belongs above in Data and Infrastructure, but I want to treat this section separately. I prefer to use Datadog personally, but it can be quite expensive. However you create your dashboard and whatever you log data into, you want to measure, at the very least:

- P50 and P95 time to execute a save or delete appointment command. Alert if above 75-90% of the threshold determined by the users, and schedule time to get it down if alerts are becoming more consistent.
- P50 and P95 times for each query. There are lots of ways to increase the performance: sharding OpenSearch indexes, Postgres partitioning, tuning queries, storing the appointment data in multiple redundant tables with different indexing schemes tuned to each specific query, and caching the results of calls to other services or whole calendar responses.
- Failure rates - you want to alert on spikes in these. 404s, 401s, and 403s are generally problems downstream from you. They could indicate a broken UI or a misconfigured service. 400s could be a failure of coordination between you and your consumers, where you've released a breaking change without telling someone. 500 is definitely on you. 502 and 503 mean your services aren't scaling to meet demand. Track spikes and hourly failure rates over time. If the failure rates are increasing, then you should schedule maintenance time on it before the rates spill over your threshold.

The key to good stewardship of engineering is proactivity. If you catch a failure before your customers do, you're doing well.

Jefferson Heard 3 months ago

So you bought a tech company, now what?

So you found an acquisition target, put them through tech diligence, and closed the deal yesterday. 🎉 CONGRATULATIONS 🎉!!! Now what? Are you closing the old business down, integrating it, or letting it run as-is? What should you do with the diligence report now that it's done? There are a lot of new questions if this is your first rodeo, and even if it isn't, every company acquisition you preside over as a technology leader is in a different state and has simple and not-so-simple answers to these questions.

The first questions are really for the C-Suite, and you've probably already answered them by the time the deal closes:

- How is the acquired business going to be run (or closed down) post-acquisition?
- What was the market opportunity or risk that made acquiring the company attractive?

The first question tells you what playbook you're going to be running: run it as-is, integrate it, or close it down. The second one tells you what collapses out of your product roadmap. I'll cover the "shut it down" option in the next post on this subject. First let's talk about the case where the target continues under your company's banner.

As soon as your acquisition announcement comes out, its customers are going to start asking:

- What is this going to do to the product?
- If I'm using both the target product and the company's product, is integration going to make my experience better or worse or the same?

Your customers are going to ask:

- Is this acquisition going to distract them from improving their core products?
- Am I going to see any benefit from this acquisition?

Depending on the user and their overall change-tolerance, there's a lot of anxiety hiding behind these questions, and you want to quell that immediately and drum up excitement in both user bases for the merger. Marketing will help with this immediately, but Product Development's follow-through will ultimately be the determining factor in both products' success and whether the two together are greater than the sum of their parts.

The due diligence report is going to expose the glaring technical debt, and you will need to tackle the urgent work from that in time to avoid disaster, but except in extreme cases of tech debt or the relatively rare case of "nothing's changing about the product but the owner," the most important thing to do is start showing momentum on integration to those customers. If you don't, then users will start to think about an exit strategy for the product. They'll assume that the marketing hype is just that, and that the new parent company isn't committed to it. In particular, if the product roadmap is thrown into disarray by frontloading everything the diligence brought up, most of the improvements you're making will be invisible to users. Their perception of your commitment to product won't match the work you're putting in. Cynicism is hard to recover from. If it's set in by the time you make a major new release, it won't have the impact you want it to have and you'll do twice as much work releasing features that half the number of people will use.

Don't try to be creative. I know, hard right? You're reading this as a Product Manager or CTO or Engineer, and you want to lunge for the big vision you see for integration. Do that, but make it splash a year to 18 months from the acquisition announcement. The first things you ought to do are the things that will touch everyone and that will be the most obvious to them:

- Eliminate duplicate data entry.
- Eliminate redundant logins.
- Propagate profile information.
- Unify permission settings.
- Eliminate bounces to a home-screen or multiple browser tabs.
- Eliminate jarring switches between branding.
- If your target was struggling, make some much-requested improvements that got backburnered by their lack of resources.
- Make your technical SMEs available to consult across teams.
- If you have a site-reliability or security team, get them intro-ed to the target's team early and have them help the new team clean up any tech debt that's affecting reliability or creating security risk.

Integration is as much a psychological exercise as a technical one. Some of these things may seem unimpressive and like they're really low-effort, but you're shaving complexity out of your user's day. For the people who have always used both products, they're going from double data entry to single. They effortlessly switch between applications. They don't have to remember two logins or go through four factors of authentication before they're fully able to do the job they opened your apps for to begin with.
In a very real sense, this is the technical debt that the diligence report will never tell you about. It's created immediately upon merger. You had one system and a simple workflow for getting users in, keeping them engaged, and getting their work done, and now you've gone to two. Everything that's duplicated between the two systems that's intended to work the same way for the same people is tech debt, and it's highly impactful tech debt because it's either putting a burden on your support team or your users to manage it. A little momentum goes a long way. Once you've established your commitment to the product you can start to tackle the more invisible tech debt. Keep the momentum up, but don't let the stuff in the diligence report fester.

How do you take two engineering cultures and make them into one? Honestly, it's more art than science. I can tell you what I did, but you might do it better or do something entirely different that works just as well. No two human relationships in the world are alike, and the same goes for relationships between groups of people.

You're acquiring some number of people who are going blind into your company and engineering processes. This is your chance to review habits and practices on both sides and disrupt the unproductive ones. If you disrupt them in the right way, and make sure that everyone gets credit for the improvements, you'll naturally start to build bonds between the teams. You'll have new experts. Maybe you grab someone who knows Python better than anyone else on your team. Maybe you'll end up with a crack data scientist or AI guru that wasn't fully utilized on the other product. Maybe you'll be able to bolster your strongest engineer with a backup when they get a question or run into a bug that they're normally the last resort for.

All I'm really saying here is "identify opportunities and capitalize on them quickly." The longer that you go without actively trying to federate the teams, the more they'll have ingrained the habit of not talking to each other. You just want the teams to see each other as value-adds. The rest of the cross-collaboration will happen as a result of that naturally.

What I would say don't do is force changes rather than foster them. A lot of people are tempted to do things like "put a seasoned engineer from the parent company's team onto the new team," which is likely to garner suspicion rather than foster collaboration. The same thing goes with "we use Spotify Squads here so you're all now squads." Without some experience with the parent company's engineering culture you're telling them to make a change they don't have context for, and it's going to feel forced and take twice as long to achieve.

The other thing not to do, in my experience, is hold off on staffing changes or cuts if you know you intend to do them. You've just shocked the whole target team by acquiring their company. If you run things for six months and then cut people, it's going to feel like a judgement against the team and their product as a whole. And it's going to be two hits to morale, cohesion, and productivity instead of one. If your diligence report or revenue plan identify cuts that need to be made, make them. Work with the target company's leadership to make sure they're the right cuts, but don't "wait and see," or you'll simply hurt the people you want to keep.
And of course there's always team-building exercises and on-sites (if you're a fully remote org). These may feel wishy-washy and hard to justify, but it's typically important to at least get a few key members together in person whom you think are likely to be bridge-builders as the two companies build a shared future.

All in all, the job of post-acquisition is a balance of using the shock to your users and your product development teams to your best advantage, and mitigating the harmful effects of the shock. Rather than breaking ground on the grand vision for what the companies will achieve together, the first few months post-acquisition are all about building trust by showing momentum and commitment.

Jefferson Heard 4 months ago

Interruptions and Garden Paths

When we talk about accessibility in software we usually talk about screen readers, tab-ordering, alt texts, colors, contrast, and the WCAG standard. But consider this: according to Forbes Health in 2023, more than 8.7 million adults in the U.S. have been diagnosed with ADHD. The rate is now over 10%, which means that except in specific niches you likely have far more users with this or another executive function disability than any other. I myself have ADHD (diagnosed by a professional). I use software for 8-10 hours in my average day. I can tell you that the software I use has a great impact on my productivity, and that's the case for others with my condition.

If you're designing an application that is meant to be used productively, you should be designing for ADHD. The good news is that designing for ADHD improves your software for everyone. It will improve your NPS and CSAT directly, because what it's about is driving task-engagement and maintaining clarity of purpose.

The Ecology of Great Tech
No spam. Unsubscribe anytime.

See what I did there? Honestly, if I were to click on that Subscribe button above and it navigated me off the page to congratulate me for signing up, there's a good chance I wouldn't make it to this part of the article. I'm not always that distractible, but the longer I've been working, and the more things that have gone onto or come off my mental pile of "things to write down or take care of," the more likely it is that I look up at the back button of my browser and my eye tracks across a tab I needed to do something on.

Sometimes your software has to interrupt someone. Sometimes they type invalid data into a field. Sometimes the calendar needs to remind them that they have a call in 5 minutes. Sometimes the system experiences service interruptions. The point is that if you're going to fully interrupt the user, the reason has to be critical. And I'm going to define what I mean by critical: if your user were to continue operating your software without dealing with the interruption, it would cause them to lose data or greatly degrade their experience.

That's not to say you can't have visual cues on the screen for things like "oh, hey, try this new feature!" There are degrees of interruption. They will still degrade my experience, and I'd love a checkbox in your app's Settings that lets me turn non-essential notifications off or simply dial down the noise a bit. But I understand the need to engage in in-product Product Marketing. Just don't make it a modal.

Degrees of interruptfulness in notifications:

1. Modal dialog. Pops up on the screen and I can't do anything else until I clear it. Use these only for critical notifications.
2. Ever-growing stack of notifications in a default-on feed. This just feels like a to-do list I have no control over, and the longer it gets the more anxious I get. Ask me how I have my Macbook's notification center configured...
3. Brief notifications that clear themselves after a few seconds without forcing me to look away from my task if my attention is at a critical moment.
4. Blissful quiet.

The longer you can keep the application in degrees 3 or higher, the more productive someone is, and ironically the more likely they are to pay close attention to the content of items in degree 1. The more modals there are, the more likely I'm going to click "OK" without reading it when I am trying not to lose focus.

Honestly, the biggest cause of these kinds of problems in my experience lies in design or workflow afterthoughts. And I actually mean "afterthought" literally. The problem is introduced when a feature is basically in release testing or UAT and someone catches an edge case, or realizes something was missing from the design, and the easiest thing to do is to pop up a banner or a dialog to cover it. This is the point in feature development when the team is feeling the crunch, and reworking the design to incorporate the item correctly is going to cost a lot. It's tech debt, really, but once it's in production that way it's unlikely to get taken care of.
Overall, my advice here is to think about your user's attention span and what you're asking of it. In workflows and tasks that require concentration (even filling out a form), the less you pull them away from that task, the better.

The more subtle problems than interruptions in your app are garden paths. Poorly designed components or screens. Take a list of people. If I'm looking for a specific person in a list, and I can't put it in alphabetical order, then I have to scan it line by line. Now if I'm familiar with these people then the chance that I remember something about one of them as I read their name is high. That reminds me of an email I meant to send or something I promised them I'd do, and it might feel compelling enough that I switch tabs and leave your app, then come back and have no idea what I was doing.

Workflows that drag you all over the place. Ahhhh, MyChart, how I loathe thee. Let me count the ways. I click an email to check lab results on the web client. That takes me to the login page. I have a password manager, so first I unlock my password manager, then it enters the username and password for me, then MyChart sends me a code, which I have to unlock my phone to find (dangerous, so much Reddit to catch up on!), then I enter the code, and it asks for my birthdate and to confirm contact details that haven't changed in 5 years. It reminds me forcibly, in a modal I have to close, that I haven't uploaded my insurance card to the portal (everyone who needs it already has it), and then it forgets the email click and I'm belched out onto the homepage where there are a dozen notifications I have been ignoring, but I suddenly read hoping that one of them is the link for the lab results so I don't have to go find the email I clicked again. It's not there, of course (but the 7 year old notification for my old doctor who changed practices before I moved is), so I have to open my email again and try to find it. But one of my dear readers has written me and made a great point, and I realize I want to go get it into a draft post before I forget it. At which point it will be most of a day before I remember to do it. And instead of stressing myself out all over again, I will literally make a 15 minute appointment for myself on Friday to remind me to do it again because I just can't deal with another garden path today.

The most galling part of this "workflow" (can "flow" really be used in any word to describe this?) is that no-one wrote it. Instead it's an unholy confluence of lazy design and coding choices combined with real regulatory requirements that results in someone like me having to set aside specific time where I force myself to navigate all the way through it without distraction. Each individual feature that I run into on my way is in service of some other purpose – plus one bug where one of the many notifications causes the forward destination to be lost to the browser app.

That's an extreme (if real) example, but there are other garden paths, such as ones where the app knows what you need to do but forces you to stop what you're doing and go through the motions to satisfy it:

- "To enable editing this field, please navigate to Settings > Labs > Layout, and check 'inline editing of subtitle.'"
- Or where I have to copy structured data from one part of the app into the one I'm working on, even though the link between the two should be obvious.
- Or where, to avoid a conflict, I have to check another screen than the one I'm on, even though again it's obvious to an algorithm that the screens would be in conflict.

I really think AI agents may be a great way to generalize solving these garden path problems. If an agent can understand your intent, it can make a non-intrusive suggestion to make the change or pull the content in from the other part of the app that's obviously relevant.
Providing an avenue to train an agent on common clickstreams and on places where data is entered redundantly in an app may make it possible to write a generic agent that solves such cruft as a service. Even without them, though, these problems are avoidable. Engineering and Design should work together on design and workflow, and practices should be put in place to prevent people from glossing over these problems until they're too late in the development process to design well.

I have actually seen inattention to this design sink NPS, feature acceptance, and renewal numbers. It's not that people are clamoring for ADHD-accessible design, but that lack of attention to the details that matter to people like me slows everyone down and makes your application more full of friction. When I saw it happen, we were trying to get a major feature out before a hard deadline that there was no way to avoid. Somehow, someone had missed ever specifying that lists needed a sort order to them, and so all the list components for the entire feature, whether they included names, vendors, dates, whatever, were all in database insert order and couldn't be changed. Because it wasn't part of a standard and the team was new and not practiced, no-one caught the mistakes until they compounded in front of angry customers who had already been told they had to wait to do a big yearly thing with our product until after we'd made the change-over. The result was that the new interface took around 30% longer to use and more clicks than the old one did. And although it would crush people like me, even people who trend more neurotypically in their attention span were frustrated as hell, because it seemed like we just didn't care about their time or their usability.

I got a bit longwinded today, so let me recap:

- For any given productivity application, more than 10% of your userbase has ADHD or another executive dysfunction that can make everyday usage more difficult.
- Designing for ADHD accessibility increases CSAT and NPS across the board because it takes everyone fewer clicks to do what they came to do.
- Failing to include ADHD accessibility in your designs risks CSAT and NPS if standards around workflow consistency and clarity are otherwise too lax.

To design for ADHD accessibility:

- Minimize artificial stopping points.
- Design (or train AI agents to help your users through) workflows that carry them from one task to the next towards their goals.
- Train Engineers and Designers to work together on workflow and design, and not gloss over the kinds of details that end in unwanted dialogs and notifications or arbitrary garden paths in your products.

Jefferson Heard 4 months ago

Build or Buy. Grind or Automate.

One thing I like to ask in diligence is how people in engineering and product judge the value of third-party software purchases. Sometimes these are buy vs. build decisions, but in others an engineer sees an opportunity to save X hours of time per engineer per month for $Y thousand per year for the department. It will pay for itself many times over, and the question is, "Is your team too scared to ask to buy the tool?" Although this exists in every company to some extent, it can be especially dangerous for companies that are facing a cash crunch or have just come out of one. It takes conscious reflection to unlearn the panic-avoidance of saying "no" to purchases based on price-tag alone, and to move towards also considering the benefits that might come from such purchases.

Engineers, and engineering managers who come from engineering backgrounds, by and large have no training in, nor innate understanding of, how to calculate the return on investment for a major tooling purchase. Engineers are just people. They will default to the patterns they're familiar with. Absent conscientious coaching by the business side of the house, the framework they fall back on is often the same one they'd use at home:

- Cheap enough to buy without asking my partner/spouse.
- As expensive as my last big hobby purchase or my laptop.
- As expensive as my car - I'm not even going to bring this up unless we've been talking about how broken down my car is for months, and it's getting more expensive to have it repaired every month than it would be to pick up a new car payment.

In any case, without another frame of reference to work off of, anything over about $1,000 is a "big purchase" and anything over $35,000 is a "major purchase," and the bigger that price tag is, the more people discount the benefits to match their anxiety about asking for that much money. Because of this, there's a good chance you have engineers right now spending time and energy on problems that have already been more than adequately solved by someone else.

I'm not suggesting engineers and engineering managers become spendthrifts, but let's look at our hypothetical $10,000/yr tool. First, a principle: if a tool saves engineering time, the saved time should be spent towards useful work. If it saves time but creates more work, then it has to balance out somehow or all you're doing is transferring cost between centers. But let's assume, to continue our example, that our $10k tool saves 8hrs net per engineer per month, and that we're using a standard American work-week of 40 hours:

A fully burdened mid-level software developer is, let's say, $165,000/year or about $80/hr. To pay for itself, therefore, the $10,000 tool needs to save 125 hours per year total for all users. At 8hrs a month that's 96 hrs/year per person. That means that if the tool costs $10,000 for 2 engineers it's a net positive. If it costs $10,000 for 22 engineers you've saved an entire FTE worth of time. In other words, the productivity gain should be like having a whole extra person on your engineering team.
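Here's the same arithmetic as a small script, using the post's assumptions; plug in your own numbers, since none of these figures are universal constants:

```python
# Back-of-the-envelope ROI for a tooling purchase, using the assumptions above.
FULLY_BURDENED_SALARY = 165_000      # USD per year, mid-level developer
WORK_HOURS_PER_YEAR = 2_080          # 40 hours/week * 52 weeks
HOURLY_COST = FULLY_BURDENED_SALARY / WORK_HOURS_PER_YEAR   # ~ $80/hr

TOOL_COST = 10_000                   # USD per year
HOURS_SAVED_PER_ENGINEER_PER_MONTH = 8

break_even_hours = TOOL_COST / HOURLY_COST                  # ~125 hours/year
hours_saved_per_engineer = HOURS_SAVED_PER_ENGINEER_PER_MONTH * 12   # 96 hrs/yr

for team_size in (2, 22):
    saved_hours = hours_saved_per_engineer * team_size
    saved_dollars = saved_hours * HOURLY_COST
    verdict = "positive" if saved_hours > break_even_hours else "negative"
    print(f"{team_size} engineers: {saved_hours} hrs/yr saved "
          f"(~${saved_dollars:,.0f}) vs ${TOOL_COST:,} cost; net {verdict}")
```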
It won't really work out quite that neatly. Dividing hours up over that many people leaves a lot of room for messiness, but the principle is still sound. If you can find the $10,000 anywhere, spend it. More importantly, teach your engineers how to evaluate the ROI on a purchase and make it embarrassment- and penalty-free to ask about a purchase. If you have 22 engineers on your team, that's 22 people who spend a few minutes out of every hour of the day browsing the web being exposed to new ideas, tools, and frameworks. Even passive research on improving engineering will yield benefits as long as people don't feel like they're painting a target on their back for making a proposal.

The other side, of course, is that engineering managers and directors need to be diligent about measuring the benefit of purchases or agreements. If they buy the $10,000 tool to save 8 hours per engineer per month and they're not measuring that, even indirectly, then your engineering subscription bill is going to look like a deluxe cable TV package from the early 2000's.

The cut-and-dry calculus above is great if you have it. I remember when we went from using a homegrown and hand-maintained Jenkins setup to Gitlab. Yes, Gitlab costs money, but the engineering time that saved went directly into scaling and hardening our product, which was worth more to the company than the money we "saved" by rolling our own CI/CD system and keeping it from burning down every time our system went up a tick in complexity.

If a purchase – even a yearly subscription – is a one-time improvement instead of an ongoing one, then ask yourself whether it opens up long term gains. Take something like a charting/dashboarding library with a yearly subscription. If you saved the time of implementing a more complex open-source one, or (god forbid) rolling your own, and then you used that either to get to market earlier than a competitor or to make your MVP feel more premium on launch date, then it was probably worth it. Even if it costs as much as you expect to make in the first 18 months with the new product, you can get out there and start making revenue and eating free market-share instead of paying more to get market share from the competition. Then after launch you can pay the cost of moving to the cheaper solution.

The only good reasons in a software company to roll your own anything vs purchasing it off the shelf are:

- It's core to your product experience.
- An adequate off the shelf solution doesn't exist.
- You can do it better, and doing it better matters to the success of the business.
- Factoring in all the costs of implementation (that means the cost of the headcount that will be dedicated to it and the cost of not dedicating that headcount towards something that's potentially more profitable), you can do it cheaper or finish it faster.
- The company selling the product is shaky or new, and you'd introduce operational or product risk if that company fails.
- You literally cannot find the cash on hand.

I might be missing one, but you can see the shape of the reasoning here. The point of all this is that if you're reading this post from the business side of the house, you may not realize that none of this is obvious to the tech side. If you're on the tech side of the house and it is obvious, consider that it might not be to other individual contributors or managers. Understanding how to account for ROI is important in all places in the business, and it's an art that people don't just "pick up" from being in the position to make decisions.

Jefferson Heard 4 months ago

An unbroken chain of little waterfalls

The Buffalo River, nestled in the Boston Mountains of Arkansas, is one of the best floats in the country. It's so iconic that it's been protected as a National River, the first of its kind. Clear water, tinting to green-blue in the depths. White limestone rocks and pebbles studded with the fossil remnants of a shallow Paleozoic sea. Gentle rapids that cascade your boat downriver, each one a little waterfall so smooth your canoe never feels out of your control. A float from the put-in at Tyler Bend to the Gilbert General Store takes about 4 hours if you're looking to enjoy yourself along the way.

But this isn't a post about floating down a river. It's a post about Agile, Waterfall, and the challenge of estimating time and complexity as a Product and Engineering leader. But it's also a little bit about rivers, journeys, and Flow.

There are countless methods software and product teams use to estimate how long it will take them to ship a feature or complete a project, precisely because all of them are so bad. I suppose Point Poker sells those silly card decks, so it makes someone money. But Fibonacci points, T-shirt sizes, point poker, time estimates, and all the other idiosyncratic things people resort to under pressure to perform better at estimation than the last late shipment are, well... pointless.

If you ship consistently on time, when has your point-poker (or whatever) exercise ever told you something you didn't already guess intuitively? If you ship consistently late or early, and you go over your estimations every time, when do you get anything but "reasonable excuses that we couldn't have known better" for why the estimate was so far off?

There's a book my old colleague and mentor James Coffos at Teamworks gave me when I was trying to ship our re-engineered product stack on time: Actionable Agile Metrics for Predictability: An Introduction, by Daniel S. Vacanti. Despite its longwinded and dull title, it's probably the best, shortest book I've ever read on figuring out how to recognize speedups and slowdowns, how to estimate correctly, and how to intervene when the float down the river from concept to feature snags on the shoreline.

First off, humans are tremendously bad at estimating time and complexity, and there's no evidence they can get much better when the approach is "narrative guessing," i.e. reading a description of a feature or ticket and giving an estimate based on your understanding of what's written. It's far better to estimate completion of a task based on past performance.

Start with a well-written epic of fixed scope. Break it out into tickets with engineers and allow those tickets to be broken into subtasks. (I'll tell you in a minute how to do all that.) Then, at the end of each sprint, measure how many tickets and subtasks were completed, how many changes were made to the scope of the epic (the end-goal, requirements, and user stories), and how many new tickets and subtasks were added.

By "epic of fixed scope" I don't mean that the tickets are static. They can be added to and designs can be reworked, but the outcome should remain steady. Over time you're going to build a picture of what a good project looks like and what a troubled one looks like. From these measurements you want to understand how fast on average your team moves through tickets vs. the amount of "scope creep" and "unexpected complexity" they discover per sprint. You won't believe me until you measure it for a while, but regardless of how they estimate it, your teams are going to move through tickets at roughly the same rate every month.
The canoe-crushing boulders on your project are not velocity, but scope change, creep, and undiscovered complexity. There's some wiggle room on these rules, but: If this doesn't happen then the feature definition is incomplete or the devs lack clarity on how to build what's being asked. Further attempts to develop against it without refinement will result in wasted work. Schedule a retro and re-evaluate the scope and tasks before going forward again. If this sounds like Waterfall to you, understand that you cannot deliver reliably if you don't know what you're building . I'm not saying that you build a whole new product with waterfall process. I am saying that the most agile way to develop is to navigate a chain of tiny waterfalls that become clearer as they're approached. Rapids, if you will. An epic at a time. This of course puts a hard outline around how an epic can be constructed. It has to describe a precisely known modification. It covers maybe six weeks, not six months of work. It also can't be a title and a bunch of stories which themselves are title-only. It has to have written, fixed(-ish) requirements and end-goals when it's accepted as dev-ready. If you're building something bigger, it's more of an "initiative" and covers multiple well-understood modifications to achieve a larger goal in an agile way. A well written epic distinguishes between the Product function and the Engineering and Design functions. Typically, I think of product concerns as user stories, acceptance requirements, stretch goals, milestones, and releases. Organize what's being built in the most declarative (as opposed to procedural) terms possible. You want to give your engineers and designers freedom to come up with a solution that fits the experience of using the product, and you can't do that if you're telling them how to do it. I'm going to go with an example. I don't want to rehash how to do basic product research but your decently written epic is going to focus on a class of users and a problem they want to solve. This isn't the only way to write an epic, but it mirrors how entrepreneurs think. There are need-to-haves and nice-to-haves, and follow-ons that they deem natural, but aren't included in the ACs. Each bulleted "user story" or AC is written as an outcome that a class of user needs to achieve using your software. Designers and Engineers should talk with the person guiding the product definition (could be a PM, could be the CEO, could be the Founding Engineer) and clarify the requirements and fill in gaps before going off and creating designs and engineering artifacts. For example, missing in the above but probably necessary is " PF Coaches need to be able to cancel or reschedule an appointment individually or for a block of time / days." and "PF Coaches need to be notified if a conflict arises with a client from an incoming appointment on their calendar." A good designer or engineer will catch that requirement as missing and add it. Designers will take these ACs and turn them into full fledged user stories with screen or dialog designs. Engineers will say "oh yeah, we already have a recurrence dialog, so don't design a new one" and debate with the designers on how to get into the scheduler in a natural way. Then they come back with the person guiding the product definition and go over their approach. That approach should be complete in the sense that it covers the project from breaking ground through delivery. 
It's not just the first sprint's worth of tasks but an outline of how the project should go. Sure, more will get added along the way, but Design and Engineering should know how they're getting there. Also, If the product person takes the design to a customer call or two, the lead designer and engineer on the team should be present on that call, because they're the ones that need to get better over time at intuiting how a customer wants to use their software. Once everyone agrees that the solution is workable, it's tasked out: And so on. If you're thinking "This looks like a lot of work before engineers start building my product," then one of two things is true: I cannot tell you the number of times I've seen "seat-of-our-pants Agile" result in dead features or weeks or months of being stuck at "80% done" on a project. If the person doing the product definition is incentivized to commit to the work they're pushing towards engineering and product, and they're held accountable for late-changes, that person is going to get better at their job quickly. If the engineering and design functions are given creative control over a solution, then they're ultimately held accountable for coming up with a good solution for customers, making them more market and customer focused. When the above is practiced well, it encourages Flow in your team. Product is incentivized correctly to keep the process moving. Design and Engineering are given the ability to work to their strengths and become better at understanding your market and customers. Here are some "anti-patterns" I've seen that cause projects to drag out. The One Liner: This makes the design and engineering functions guess at what the product is. The worst part of a one-liner is that it's usually deceptively straightforward. It implies that the product person thinks that the result is obvious enough that engineering and design should just know what to build. Realistically they're going to think of about 40% of the actual ACs. The "stories" that get added will be a mishmash of engineering tasks, design artifacts, and guesses about what the user wants. The result will be that the PM goes back and forth with customers with the half-finished feature and saying "we actually need this change," to the building team. If it was a 6-week project done right, it's now a 12-week project that creates tension and bad blood between engineers, designers, product, and customers. Product Negotiates Against Itself: If the product manager is not also the engineer and the designer, then they do not know what constitutes complexity for those functions. If they don't understand this limitation, the temptation is to "offer" engineering and design "concessions" that actually fail to reduce complexity at all and in many cases increase it and make for a worse customer experience at the same time. For our above example, these kinds of ACs get added by product management before the feature ever reaches an engineer or designer: From the product manager's perspective, they've reduced the number of days and the time window that need to be considered, made it so you don't have to handle anonymous scheduling, and you've reduced the number of channels a message has to go through. 
From Design and Engineering's point of view, these are requirements, not concessions and well, now: Instead of cutting scope, the product manager just doubled it and created a series of follow-on tickets that will get "mopped up" after the initial release when the coaches complain that they can't do things like schedule evening appointments, and when clients demand a "lite" experience that doesn't require downloading the app. Prescriptive instead of Descriptive: Here we have no user-stories. We have a prescription by Product of what they want built. It might come in the form of a slide deck or Figma they sketched designs on. It might come in the form aberrant "user stories" that are actually written as ersatz engineering and design tasks. But however the work is presented, the product manager has done Engineering and Design's jobs for them albeit badly, setting up a chain of painful and slow negotiations where Design and Engineering suggest changes without understanding the underlying desires of the user they're building for. Now your builders are building for the product manager rather than the customer . The end-product will inevitably be incomplete and buggy, because the builders are playing Telephone with product management and customers. Focus your teams on laying out the course for an epic in the first sprint of dev work. A particular story may be unworkable or a design might need to be completely re-thought. Ideally that would've been found before the canoe was in the water, but the next best time is the present. You want to reduce the number of items that are reworked or added each day as quickly as possible, because hitting that fixed-scope is going to be what accelerates you to the finish line. For the product manager, an MVP contains a minimum complete set of user-stories that improve a user's experience over their baseline. "Cutting scope" at the product level is an illusion based on misunderstanding the product definition vs its implementation . Engineers and designers will outline their approach and may ask for concessions that reduce the implementation time, but the actual scope of work remains the same. Not every team is going to go through tickets at the same speed. Not every team is going to dial in the work to be done at the same speed or add engineering tasks at the same rate. Each major feature in your products is a different level of maturity and complexity, approaching a different problem for a different set of users. They're going to have different profiles and that's fine . The goal with your measurements is to establish a baseline per team or per product , not to measure the delta between your team and an ideal or compare them to each other. If a team starts doing suddenly worse than it has before on fixing the scope or it starts adding engineering subtasks at a late stage in the game, you have a way to say "This isn't normal, what's wrong?" and improve. For my money, the best unit of work to measure is the epic, but there are things that don't fit in epics. Drop-everything bugs come in. Vendors you integrate with change their products. Tooling and library dependencies age. Marketing updates branding. Sales is pushing into a new market and needs a new language pack. These tasks make up a sort of base-cost of operations and maintenance. You can categorize them differently and measure bug rates and so forth, but in the end what you want to know is how much of an average month or sprint or quarter is taken up O&M work vs. Developmental work. 
I've divided this up various ways over the years, but I've found it really doesn't matter how you divide it up. Over-measuring doesn't clear up the signal. If there is an uptick in how much time is being spent on this by a team or on a particular product, dig in and figure out why and whether there's an opportunity to bring it down to or below baseline. It could be tech debt. It could be that a vendor has become unreliable. It could be a communications or employee engagement breakdown. Once you have baselines you can make a prediction for new work. For our example above: We just scoped out giving personal fitness coaches a way to schedule with their clients, and we have 32 tasks. That means that at the end of the first sprint we should have 51 total tasks and at the we should expect about 65-70 tasks in the epic by the time it's completed and shipped. That's about 6 weeks worth of work to get it to UAT using the lower numbers, and accounting for the O&M overhead. You can use that in your estimates to others, or you can build in some wiggle room but keep in mind that projects usually take as long as you give them. You can even use statistical methods to build that wiggle room, and the more data you have, the better those estimates will be. I know that's a lot, so I want to summarize the main points and give you a Start-Stop-Continue. Predicting completion based on past performance. Organize work into epics. Epics are populated with user stories that describe a good user outcome. Then engineering and design work with the people defining the product (including customers where appropriate) to determine an approach and an initial set of development tasks that goes from breaking ground to shipped product. Once it's agreed upon, it goes into development and while the development tasks may change, the outcome should not. Once you have a baseline it's easy to provide completion estimates to people outside the team. It's also easy to figure out if a project is on or off track, to proactively communicate that, including how far off track it is. And you have the information to dig in with the right people on the team if that happens. Stop doing point-poker and all forms of "narrative guessing" in creating development estimates. Stop letting development start on an idea that's not ready. Stop writing user-stories that are actually development tasks. Stop product management from "people pleasing" and pre-emptively negotiating against itself with the development team. Continue to provide completion estimates, but better ones. The number of tickets and subtasks completed. The number of changes to the scope of a story or epic e.g. the end-goal, requirements, and user-stories. The number of new tickets and subtasks added. Changes to the end-goal or core requirements of a story or epic should be done by the time devs accept it and begin work. From there, tickets and subtasks should stop being added within a sprint or two. You're a tiny shop where everyone knows what they need to build and this is too heavy-handed a practice. This article can wait until you're bigger. Consider this up-front cost vs. the cost of wasting engineering and design days and weeks needing to redo work based on unforeseen changes. We can't use our existing calendar components because of the restricted schedule. The sign-up flow has to account for a new user sign up happening during the scheduling workflow. Redirects and navigation need to be updated in the mobile and web apps. 
The user's experience is made significantly worse because they have to complete an irrelevant task in the course of doing the thing they came to do. Our message provider Twilio already provides SMS, push, and email as a single notification package so now we write code into the system that allows us to ignore the user's notification preferences and only send SMS. Now the user is irritated because every other message from your app comes according to their preference but this one, new, "buggy" feature they used. The calendar team clears 15-25 tasks per week The number of implementation tasks will grow from the initial dev-ready set by an average of 60% in the first sprint, 20% in the second sprint, and then fall off. The calendar team spends 20% of its task clearance on O&M. Changes to the outcome (user stories or overall goal) made after development starts. The rate of development and design tasks added over time. The rate development tasks are cleared overall. The average number of "overhead" tasks within that.
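If you want to see how the baseline rates above turn into a forecast, here's a small Python sketch of that arithmetic. The 60% and 20% growth rates and the 15-25 tasks-per-week clearance are the illustrative calendar-team numbers from the post; the 10% tail growth after sprint two is my own assumption, chosen only so the example lands in the 65-70 task range described.

```python
# Project an epic's final scope and calendar time from baseline team rates.
# Growth and clearance defaults come from the illustrative calendar-team example;
# the 10% "tail" growth after sprint two is an assumption, not a measured rate.

def forecast_epic(initial_tasks: int,
                  sprint1_growth: float = 0.60,
                  sprint2_growth: float = 0.20,
                  tail_growth: float = 0.10,
                  tasks_cleared_per_week: float = 15.0,
                  onm_share: float = 0.20) -> dict:
    after_sprint1 = initial_tasks * (1 + sprint1_growth)
    after_sprint2 = after_sprint1 * (1 + sprint2_growth)
    final_scope = after_sprint2 * (1 + tail_growth)         # discovery falls off after sprint two
    epic_rate = tasks_cleared_per_week * (1 - onm_share)    # clearance left over after O&M work
    return {
        "after_first_sprint": round(after_sprint1),
        "projected_total_tasks": round(final_scope),
        "weeks_to_uat": round(final_scope / epic_rate, 1),
    }


if __name__ == "__main__":
    # 32 dev-ready tasks: about 51 after the first sprint, about 68 total, about 5.6 weeks
    # at the lower clearance number; call it six weeks with normal messiness.
    print(forecast_epic(32))
```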

Jefferson Heard 5 months ago

Product is the what. Engineering and Design are the how.

There's a problem I see more and more in companies, where Product is thrust into the role of "supervising" Engineering. That is not actually Product's demesne, and it will cause friction in the end. Engineering is accountable for development, testing, deployment, and reliability. Product is accountable for product research, direction, and definition.

That supervisory relationship leads to a common practice that is a big issue: leashing all engineering artifacts to the product definition. It means that technical concerns are subsumed by the roadmap as Product sees it. Tech debt will go unaddressed or will be created without limit, and purely technical problems like infrastructure will have to be laboriously explained in non-technical terms before they can be prioritized by Product, leading to a slowdown.

What's my evidence for this? Look at Microsoft Office as a model for the pure Product-led ecosystem. For years under Ballmer, customer sentiment around Microsoft's products suffered from low customer engagement and a feeling that they were "stagnating." Yes, you're right, Microsoft is one of the largest and most successful companies out there, and it remained so even with flagging sentiment. But during that time Google Docs, Dropbox, and Box carved out chunks of their dominance that should have been impossible. Microsoft's product teams without a doubt released more features into each of their products and shipped more bugfixes, probably by an order of magnitude, than at any time in the company's history, and far more than the upstarts.

This culminated in Sharepoint. Sharepoint was the ultimate in file sharing systems, with configurability and control unmatched by its competitors even today. But it was beaten so handily by less-capable upstarts that Microsoft had to pare it back and release OneDrive. But which one do people use? And why? GSuite, because GSuite feels like a system, and at that time Office felt like a collection of unrelated and confusing features. By the numbers, Microsoft was far more innovative, and they could prove it to you in an argument if you bought into the idea that more features = more innovation. But popular sentiment was that Google was more innovative by far, and Google ended up creating a beachhead on Microsoft's shore that should have been impossible.

An old colleague, facing the problem of "Product running the show" at his workplace, talked about forcing Product to make room for technical concerns. The problem is in the word "force." That's one way to do it, but every time you have to force folks to address technical concerns, making room for them spends social capital.

The other way to approach it is to say, "These are the engineering standards we hold ourselves to," bring that to the table, and have Product agree to them. Then you can label or tag (or whatever) the tickets that are aimed at keeping the product you're building in line with those standards of excellence. If you want to put them in the product definition as well, because that's where initiatives live, then great, do that. If you do it that way, the negotiation has been done up front, and it's one expenditure of social capital aimed at a productive collaboration everyone can agree on. The other choice is negotiating each time for time and resources against what Product sees as furthering the product in the eyes of consumers.

Look at it this way.
Every product you build has infrastructure requirements, and they almost never have direct input from Product because the average product manager knows zero about infrastructure and security. If Product ran your infrastructure project as its initiative, you'd debate the merits of terraform and kubernetes vs. just running on something "quick and dirty" because the quick and dirty will get product in front of customers and you can always work on reliability as a "fast follow."  Forget spending the time and energy selecting a queueing technology. You'll just do the queue in postgres because everyone knows it and we can get it out there faster than SQS. You have to have standards and goals in engineering that are just as "top level" as the product roadmap, because if you don't, everything is an MVP discussion and you throw necessary things out that will bite everyone in the end because you didn't feel like you had the capital to negotiate and "this one thing" didn't seem that bad to compromise on. So my thesis here is that if people in Engineering and Design come up with standards for their projects and hold these at the same level as Product initiatives, and then product, design, and engineering collaborate on a shared vision of development, we will drive innovation far further than Product alone will and we will do it with lower cost and fewer R&D resources. The ideal here is not just a "definition of Done" but a "definition of well done. " The right feature, inserted in the right place, and designed for the people who are meant to use it. By agreeing to these standards up front, you put Engineering and Designers on an even playing field with Product. Giving engineering and design the mandate to make everything you build a part of your users' natural workflow, you can develop fewer features and still gain ground faster, and you can do it with less friction.  Engineering and Design functions should systematize the Product definition. If you build feature after feature and rely solely on customer education or product marketing for uptake of those features, the majority of those features will go unused by most people. Users don't get exposed to product marketing every day. They are exposed to your product every day. Users don't click on marketing emails every time they arrive. They do move through their primary workflows in your product every time they open it. Careful engineers and designers, given reign to do so, can take feature requirements in a product definition and fit them into those workflows so that they improve them as a matter of course . Every feature that has a thesis about its use built into the natural workflow of the system will benefit your product in terms of NPS, stickiness, and customer sentiment. Cross-sells will be easier because asking people to learn a new product means they only have to tell their department to learn the bits that are specific to that product’s function, not re-learn how to send a message, add a user, create a calendar appointment, create a form, or design a report or a chart.  It takes longer to launch a product that complies with SRE’s standards for reliability and security. It takes longer to build a product that fits your design system. And it will take longer to launch products that comply with an engineering system. But if every feature lands with customers, you can build fewer features and still give your customers joy. Cohesiveness strengthens your brand. Word can do more than Docs. Excel can do more than Sheets. 
Sharepoint can do more than Dropbox, Box, and Google Drive combined. Outlook can do more than GMail. Active Directory can do more than GSuite admin.

Jefferson Heard 5 months ago

Modeling data well

I talk often about how modeling data should be done later in the process. If you start a new project with an ORM model or a database schema, then you're putting the cart before the horse, and you're going to model data badly. But okay, you've accepted that, or at least that it's my opinion. In fact, my steps are to model commands, queries, pubs, subs, and schedules, and the environment around the system first, and only then to model data. But when it finally does become time to model data, how do you model it?

You've gotten a start modeling everything else, so you've probably already spent some time thinking about this. That in fact is the point of putting data modeling this far back in the process. You need to decide on databases, whether you're adopting a streaming or other async strategy for processing mutations, whether the same tables (or databases) can be used for mutation and query, what indexes to use, and so forth. You can't do that without the context in which your data lives.

When sitting down to model data, you want to know things like how many users, agents, and other autonomous processes will be hitting your data through the commands and queries you've laid out; how long a query can take before a user notices lag; how important transactional consistency is; and whether there are special search, reporting, auditing, permission, or collaborative-editing requirements.

You have finite time to spend modeling data. Spend it where there's going to be traffic and tough-to-meet requirements. When you have to steal time, steal it from infrequently used cases in your app, service, or microservice. Which queries are the ones that are most visible to the user? Which mutations are the ones that touch the most and are the most sensitive to consistency or timing? Focus on those and branch out. And yes, you need to model schema, attributes, and document fragments, but all of that is at your fingertips if you've already modeled commands, queries, subs, pubs, and schedules.

There's a temptation to over-optimize for scale at the data modeling stage. Sure, you want 25M daily-active-users and 100 million queries an hour, but is that where you're really starting? (I'll grant that in some situations it is, but I implore you to ask yourself whether you're the exception.) And yes, you want all your queries to return as quickly as possible, but there's a point of diminishing returns, and there are queries that are only executed 100 times a day.

Put in a data migration system first thing, always. You can hand-control it or automate it, but you have to have one. Except in extreme cases, it needs to effect migrations while the system is online. And while not every migration needs to be reversible, the migration system ought to have the concept. I like Alembic because of its flexibility, but having one is more important than having a specific one.

This is also a great place to use Architecture Decision Records (ADRs) to talk about what kinds of scale you hope for or expect, what strategies you'd employ when you're approaching that scale, and what the process for migrating up a level ought to be. You don't have to get too detailed yet; "concepts of a plan" suffice at this stage. The point is that you've thought about the possibility that at some point you might have to go, for example, from a Postgres-based queue to RabbitMQ to Kafka, and you don't do anything now that prevents you from making that transition smoothly when the time comes.

I'll note that these data modeling strategies are valid for SaaS companies and service architectures. There are other modes of data modeling out there with different concerns (in particular, there are times where my position about foreign keys is not valid). But in more than a decade of SaaS work, these have served me well.

I don't use sequential integer keys. I use uuid4 or uuid7 keys for the most part. The latter is sortable, with the sort key working out to a timestamp equivalent.
There's a hard-won security lesson that predicated this: if your keys are obviously sequential, then technically-minded users interested in automating something involving your API will assume they can index through the list. The amount of CPU cycles or storage you save by fitting your keys in a 32-bit word is basically never worth the tradeoff. Go ahead and fight that holy war with me, and I'll point out the dollar and change per day you're saving on a really absurdly high-traffic table vs. the cost of mitigating just one easily avoidable data breach. I rarely use foreign key constraints. Github famously does not use foreign keys ever, anywhere. I don't entirely avoid them, but they've made me bring down a service and introduce downtime we didn't have to run a migration. They're fine most of the time, but there are relatively common cases where you have to alter the schema or update records that cause the constraint to lock things that you don't want locked. Instead I typically opt to put resilience in my business logic around cascading data and handling broken relationships. I use ARRAY and JSONB columns a lot, as well as GIN and GIST indexes. Your code to access and maintain data will never be less complex than the underlying data model. Because of this I avoid over-normalizing database models, period. If something's not going to be accessed as a top level object, and it doesn't push the constraints of either of those types, I will pick column storage knowing that if scale changes the calculus I can run a database migration to get to where I need to be. I never use a relational database table as a queue. This is one of those "never do anything that prevents you from scaling later," things. Having your queue be part of your database lets you do things you can't do in any other queuing system. You can access potentially all old queue records. You can join to other tables. And in the rush of crunch time and the brain fog of friday peer reviews, letting a junior coder get away with abusing the queue because it's "just a table" is easy to do and hard to fix. I just use redis or SQS to start and expand to something more robust if I get to the point that it's necessary. I add checkpointing and auditability in relatively early. They aren't typically the first things I add, but checkpointing and event-level (e.g. mutation or API-call level) auditing let me replay mutations on top of a checkpoint for forensic purposes and let me fix inconsistencies. I'm not concerned with the redundancy of writing the same data to two or more places. That is to say that if I'm accessing OpenSearch to provide a query I don't have a problem writing the whole record as a document in OpenSearch and returning that without first checking Postgres to make sure that it's still there. Obviously there are exceptions where consistency matters, but it's worth thinking about whether you have one of those situations. It's often worth creating tables or data stores that serve up data that's more compatible with the query at hand than the "base" data structure the truth is stored in. The benefits that you get with maintaining redundant data are: I do my best to abide by the principle " Choose Boring Technology ". For plain old data, I use Postgres with SQLAlchemy. They both give me tons of scaling options. Once you move to RDS or Aurora you can scale up pretty much as far as you'd ever need to. And while SQLAlchemy contains an ORM it is not an ORM. 
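Here's a minimal SQLAlchemy 2.0 sketch of a few of these habits together: a uuid4 primary key instead of a sequential integer, a reference column stored without a foreign-key constraint, and JSONB/ARRAY columns backed by GIN indexes instead of a cluster of normalized side tables. The table and column names are hypothetical, not taken from any real schema.

```python
import uuid

from sqlalchemy import Index, String
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Appointment(Base):
    __tablename__ = "appointment"  # hypothetical table, for illustration only

    # uuid4 primary key: not guessable or enumerable the way a sequential integer is
    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    # related calendar stored without a foreign-key constraint; business logic is
    # responsible for cascades and for tolerating broken relationships
    calendar_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), index=True)
    # document-ish details live in a JSONB column rather than more normalized tables
    details: Mapped[dict] = mapped_column(JSONB, default=dict)
    tags: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)


# GIN indexes keep containment queries on the JSONB and ARRAY columns fast
Index("ix_appointment_details", Appointment.details, postgresql_using="gin")
Index("ix_appointment_tags", Appointment.tags, postgresql_using="gin")
```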
If I want composability or absolutely crazy queries, I get that without having to drop to SQL strings or templates. For document-structured data , I stay in SQL, but I use a JSONB column and various indexing strategies instead of over-normalized relational hierarchies. I typically validate document structured data with a JSONSchema or something like Pydantic that produces one. For providing full-text indexed search, I use OpenSearch. I should say that I have most of a PhD in text search and I still choose OpenSearch 99.9% of the time for this. It simply follows so many best practices and you can do so much to change how indexes are built that the number of cases where OpenSearch or ElasticSearch aren't the right tool are vanishingly small. I can safely say that unless you already have a ton of tooling around Postgres full-text search that you'll create a better user-experience faster and with fewer resources with OS/ES. For online migrations, I use Alembic. It will autogenerate changes off your SQLAlchemy ORM models, but you can also just tell it to create you a bare migration and you can do anything in that you can do in Python, like moving the contents of a Redis pubsub to a new RabbitMQ instance or loading all your profiles into AWS Cognito. And then you can keep using the autogenerated migrations after that. For regular gridded data-science, ML, or AI training data, I use Parquet. Yes there are other formats, but parquet is the most compatible by far. I can use DuckDB, Polars, Pandas, or R. I can move it and all the resultant processes into Snowflake or Databricks when I get to that point of scale. I can hand it to people who work in other languages or systems and they can pretty universally do something with it. It partitions well. It's ridiculously fast. For queueing, caching, cross-process data storage, and other set membership or hash-lookup cases, I use Redis. Again, it's the most flexible and broadly compatible thing out there. I almost suggested RabbitMQ for queuing but Redis does an appreciable job until you get to the scale where you want a streaming system like Kafka. For permissions I use AuthZed's open source SpiceDB. This is the most "non-boring technology" choice on this list, but after using it in a couple of projects I can't imagine going back. It's an open implementation of Google Zanzibar, which controls access to files and folders in Google Drive among other things. While I recognize that I don't provide much in the way of concrete examples in this post, I still think there's a lot of general advice that's worth giving around data modeling, and I'd love to answer specific questions from people about some of my advice or what technology to use for a certain use case ( use the contact form on my website ). TL;DR, though I leave you with three things: Model commands. Model queries. Model pubs, subs, and schedules. Model the environment / outside connections. ... Then model data. How many users, agents, and other autonomous processes are there going to be hitting your data through the commands and queries you've laid out? How long can a query take before a user notices the lag? How important is transactional consistency? Are there special requirements for queries like text search that suggest alternative databases to support them? Are there special reporting or auditing requirements for various models? Should changes on certain data be reversible? Do any of your commands and queries revolve around collaborative editing? 
Are there row-level (or document-level or whatever) permissions involved? Avoiding stovepipes made up of constraints and single-use indexes that slow down your transactions in the ground truth tables. The ability to populate read-only tables in an async or lazy fashion. The ability to use strategy-specific data stores like OpenSearch for accelerating queries while using battle tested transactional consistency from a Postgres or similar. It creates clear and complete permission models; It can handle ridiculous query frequencies before you need to scale up Migrating away from it or up to a cloud-based AuthZed solution in the event that you need to is pretty straightforward. Zanzibar is as battle-tested as it gets in the commercial space (classified govt systems are a separate class). Your code to access and maintain data will never be less complex than the underlying data model. Behavior (both machine and human) begets schema. Optimize for the ability to evolve to scale, rather than to scale you don't have.
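And since "put in a data migration system first thing" is the rule I break least, here's a hand-written (not autogenerated) Alembic migration sketch that adds a denormalized column and backfills it in batches while the system stays online. The revision identifiers, table, and column names are all illustrative; this is a sketch of the pattern, not a drop-in file.

```python
"""Add a denormalized search_text column and backfill it online, in batches."""
import sqlalchemy as sa
from alembic import op

# Revision identifiers, used by Alembic. These are placeholders for illustration.
revision = "20240101_add_search_text"
down_revision = "20231201_previous"
branch_labels = None
depends_on = None

BATCH = 5_000


def upgrade():
    # Nullable column first, so the ALTER doesn't rewrite or lock the whole table.
    op.add_column("profile", sa.Column("search_text", sa.Text(), nullable=True))

    conn = op.get_bind()
    while True:
        # Backfill a batch at a time so the system stays online during the migration.
        result = conn.execute(
            sa.text(
                """
                UPDATE profile
                   SET search_text = lower(preferred_name || ' ' || given_name)
                 WHERE id IN (
                       SELECT id FROM profile
                        WHERE search_text IS NULL
                        LIMIT :batch
                       )
                """
            ),
            {"batch": BATCH},
        )
        if result.rowcount == 0:
            break


def downgrade():
    # Reversible, as the migration system ought to allow.
    op.drop_column("profile", "search_text")
```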

Jefferson Heard 5 months ago

Example metrics and fitness functions for Evolutionary Architecture

In my previous article, Your SaaS's most important trait is Evolvability, I talk about the need to define fitness functions that ladder up to core company metrics like NPS, CSAT, GRR, and COGS. Just today I had a great followup, where a connection on LinkedIn asked me for specifics for an early-stage SaaS. I think it'd be valuable to follow up that post with some examples from that conversation.

Pick a metric first that's important to the company at large. For early-stage SaaS, I'd say that's NPS. It's easy to collect, low touch, and Promoters are the people who will help you clinch renewals and propagate your SaaS to their colleagues at other organizations. The more promotable your software is, the less work your sales and renewals folks will have to do to move their pipeline. Promoters are people who think your software is a joy to use, and that everyone should be using it over whatever they're using today.

At an early stage, whatever your software is, you have one or two killer features that really drive engagement and dominate a user's experience of your product. You're asking yourself, "What metrics do I have control over that make the experience Promotion Worthy?" For example:

- If your killer feature is messaging, how long does it take for messages and read-receipts to arrive? How long until someone notices lag? How fast is fast enough that improvements aren't noticed?
- If your killer feature is delivering support through AI, how many times does a user redirect the AI agent for a single question? How complex an inquiry can your AI handle before that's too great? How long does it take for a response to come back?
- If your killer feature is a calendar, how long does it take for someone to build an appointment, how long does it take to sync to their other calendars, and how close to "on-time" are reminders being delivered?
- If your killer feature is your financial charting, how up to date are the charts, and how long does it take for a dashboard to load and update?

The point is to make it concrete and measurable. Once you can measure it, you want to know two things: What's the minimum acceptable bound? What's the point of diminishing returns? Build it now. Measure continuously. Find the trend. Build that into your Site Reliability practice. Push your engineering team to understand what levers they have to control that function and to know how quickly they can adapt if it starts trending negative.

As your software and company grow, you'll accumulate functions like this for measuring the fitness of your software for common use-cases. It won't be "one key metric" but one or two metrics for each persona.

Pivots happen. M&As happen. Product requirements shift as the horizon gets closer. For the kinds of changes you learn to expect as an executive, how well does your tech team adapt to change? As a top software architect or VP of Engineering, these are the kinds of things you measure to see if the team is healthy and if the software is healthy under it:

- Do they get thrown into crunch-time in the last 30 days of every project?
- Does software ship with loose ends and fast-follows that impinge on the next project's start time?
- Does technical debt accumulate and affect customer experience, support burden, or COGS?

Change is life. Change is necessary for growth. In a healthy, growing company, change is constant. But change introduces stress. Your software architecture's ability to absorb this stress and adapt to new circumstances faster than your competition, without creating longer-term problems, is the ultimate measure of its quality.
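As a sketch of what "find the trend" can look like in practice, here's a tiny least-squares projection that estimates how much lead time you have before a measured fitness function crosses its acceptable bound. It assumes higher numbers are worse (latency, error counts); the function and the example data are illustrative, not a prescribed SRE tool.

```python
def days_until_breach(samples: list[tuple[int, float]], threshold: float) -> float | None:
    """Fit a straight line to (day, measurement) pairs and estimate how many days
    remain until the trend crosses `threshold`. Returns None if the trend is flat
    or improving (assuming higher values are worse)."""
    n = len(samples)
    xs = [day for day, _ in samples]
    ys = [value for _, value in samples]
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    denom = sum((x - x_mean) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - x_mean) * (y - y_mean) for x, y in samples) / denom
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None
    breach_day = (threshold - intercept) / slope
    return max(0.0, breach_day - xs[-1])


if __name__ == "__main__":
    # e.g. a daily p95 sync latency measurement creeping upward toward a 60-second budget
    history = [(day, 20 + 0.5 * day) for day in range(30)]
    print(days_until_breach(history, threshold=60.0))  # about 51 days of lead time left
```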

Jefferson Heard 5 months ago

Avoiding Technology Tarpits: Ontology and Taxonomy

Avoid, avoid, avoid starting a project by modeling your data in an ORM. Why? Because the temptation in data modeling is to model the "thing" perfectly instead of prioritizing your model by utility within your domain. I have seen countless projects die a slow, unsatisfying death because their attempt to capture everything devolved into never-ending Doneness Creep – a subspecies of feature creep where you convince yourself that your product is never ready enough to release. Usually this is accompanied by a Design By Committee flaw.

Stated simply, the taxonomy tarpit is a category system that never quite fits the things you put into it, and the ontology tarpit is a model whose attributes and validation never stop growing. It's difficult to provide examples of these, because the products that devolve into them are almost always DOA. You've run into them in your career, though, if you've been in the industry long enough.

Take a person. A "profile" might be the standard attributes offered up in Cognito. But what is a person, really? What attributes provide a complete enough picture that you'd never need more? Without the discriminator of the domain model you can get lost in any manner of rabbit hole. You could boldly assume that standard attributes are enough when they aren't. You could have useless attributes that you add "just in case" they're ever needed. You can spend days figuring out how to store every variant of phone numbers and email addresses and how to validate them when validation isn't actually required.

Or you could not do that. Instead you could ask what queries use profile data and what attributes those queries need, how data is expected to get into the system and be modified, which attributes drive system or human behavior and therefore need validation, and how many people and use-cases you expect to pile onto the product as soon as you release it. Now you have questions that tell you what data is necessary, what you can leave out, and what can wait till later.

"Domain Modeling" has taken on a whole meaning of its own, and there are plenty of people who will sell you expensive tools and frameworks to do it. These are only needed insofar as they're helpful, though. What you really need to do is understand what's being asked of your software, and here some simple exercises and market research are pretty sufficient.

The first question asks, "when people or other systems get data out of this system, what do they need and how do they expect to access it?" For people, this is often some form of decision support or compliance activity. They typically need to be able to get a high-level listing for the purpose of browsing what data is available. For processes, you often have more "fixed" than exploratory queries, and you're typically optimizing for system load, throughput, and latency.

Think about the people or UIs accessing these queries. Depending on the number of items, they will need pagination and different sort methods. Also depending on the number of items, they'll need search. That is, if you're searching people by name, you need to be able to account for misspellings and therefore want a trigram index or a phonetic attribute for names. Are you fully utilizing unicode, or do you need directional writing support out of the gate (lucky you)? How will the search be parameterized?

Finally, performance. A query that powers a once-a-day process running in the background can take an hour. Who cares, as long as you can time it and you can constrain the resources it takes up on the system? A query that powers a UI needs to return in 90ms or so, and if it fails for some reason, that reason needs to be clear, since a simple retry won't likely do and you need to give the user some useful action. Understanding what queries are going to be executed, by whom, and with what expectations of responsiveness tells you what you need to store and how you need to store it.
Continuing our example: we go figure out who our likely first users are who need significant amounts of profile data. Let's say this is a system for managing the people for our regional coffee roaster and shop chain. We have: Those requirements tell me a lot more about how profiles need to be structured in the system and which attributes are essential than my initial stab at "Amazon has standard attributes and is a pretty big company, so they're probably right." Is the data coming from another system? Is it being uploaded in bulk via CSV, JSON, Excel? Who can modify it? You'll figure out validation and such in the next question, but for this one you really want again to model utility patterns and performance. Maybe there is a bulk upload interface, but the CSV files aren't going to have people's coffee preferences or buying habits in them. So you know that you have to support bulk import, but some attributes will be blank. Now you know that something that you might have required can't be. And again, there's the question of performance. If a person is updating their profile, the update can take a little longer than the load, but not much. If the guts of your system make it so that certain updates take a bit, those are going to have to be asynchronous. Reiterating from question 1 the processes our system drives and UI it affords in our fictional coffee chain, we have: Given the above, we can say pretty confidently that our system needs, as attributes: This is where you get to the M of the MVP. An MVP isn't simply a half-finished product that serves a need just enough better than the solution people are already using to make them convert. It often is that, but it doesn't have to be. If you're really clear about applying the razor of necessity to all parts of your product, you can use the saved time to put the effort in to make your Minimum Viable Product a joy to use. Given the scale we expect at the beginning – 25,000 profiles – and the use cases, we don't expect there to be a lot of traffic up front. Maybe a few hundred daily active users. At that scale the only index that is actually necessary is the trigram index because calculating string distance or phonetic matches is expensive work to waste. It could even be in-memory if you don't want to use OpenSearch for some reason and you really just enjoy the pain of writing search from scratch for a few dozen searches a day, but it does need to be there. We know we're starting with 25,000 customers, but we'd like to eventually scale to a million plus. It might be a few years though. Also, within a few months of release we want to be able to keep people's favorite drinks so they can place an order from their Apple Watch when they walk into one of our shops. It's good to know what scale you're going to get to, and it's good to have some other use-cases that you know are coming down the pipeline. But you can't support all use-cases or scale out of the gate. If you try to capture all of them, you're right back at the tarpits we're talking about. Instead, you have to build your system to evolve . You'll need to be able to add attributes down the road, add more indexes, validation, or specialized data stores, and recognize that attributes are sometimes complex in the sense that they comprise two or more linked fields. Taxonomy and Ontology traps are some of the most common ways for projects to devolve and fail. You will eventually run into one trying to form if you work long enough. 
Asking the right questions up front, rather than starting by modeling the objects and their attributes is the right way to approach domain modeling to keep this from happening. Model external interactions (commands and queries), behavior, initial scale and roadmap, and design your system to be evolvable along those lines, and you will avoid the temptation to figure out everything before you can get started. The taxonomy tarpit is when you have a classifier or category system, and you keep finding things that don't fit neatly into the category, or where the hierarchy is never quite sufficient to navigate your collection. The ontology tarpit is when you continuously add to the attributes and validation of your model regardless of those things necessity. Often there are use-cases for these, but there's no effective discriminator between "necessary" and "nice to have" and "can wait till later" What queries use profile data, and what attributes do they need to function performantly and reliably? (queries) How is data expected to get into the system and be modified? What are the mutations and commands that drive your system's behavior? (commands) Which attributes drive system or human behavior and therefore need validation on entry? (behavior) How many people and different use-cases do I expect will pile onto this product as soon as I release it? (roadmap and scale) On the order of 25,000 profiles that one person might have access to. We hope that number will grow to the millions. All our profiles and users are Canadian, so we have to support French characters and phonetic search in addition to English. A process that runs weekly to send out newsletters to email addresses. A process that runs monthly to send out mailers to people. A process that runs at will to send a pound of a person's preferred bean, grind, and roast to the address they gave. A UI where I can search profiles by territory, interest group (coffee, tea, accessories), and by name, and as an admin view and modify data. A login interface for anyone in that list of profiles where they can see and correct their data as well as sign up for a coffee or tea subscription. A process that runs weekly to send out newsletters to email addresses. A process that runs monthly to send out mailers to people. A process that runs at will to send a pound of a person's preferred bean, grind, and roast to the address they gave. A UI where I can search profiles by territory, interest group (coffee, tea, accessories), and by name, and as an admin view and modify data. A login interface for anyone in that list of profiles where they can see and correct their data as well as sign up for a coffee or tea subscription. Preferred and given name Phonetic searchable indexed attributes for the above (likely a trigram index). Market region (e.g. Toronto, Windsor, Hamilton, Ottowa, Montreal) Opt-in or opt-out attributes to let me know whether to check subscription tables. Interest tags (coffee, tea, etc) Physical address (validated) Billing address (validated, but maybe by my payment processor) Email address (validated)
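To show how those answers can translate into an actual model, here's a hedged Pydantic sketch of the profile with only the attributes the questions above justified. The enum values, field names, and opt-in booleans are illustrative; phonetic and trigram search would live in the datastore's indexes (for example Postgres pg_trgm or OpenSearch), not in this model, and EmailStr needs the optional email-validator dependency.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, EmailStr, Field


class MarketRegion(str, Enum):
    TORONTO = "Toronto"
    WINDSOR = "Windsor"
    HAMILTON = "Hamilton"
    OTTAWA = "Ottawa"
    MONTREAL = "Montreal"


class Interest(str, Enum):
    COFFEE = "coffee"
    TEA = "tea"
    ACCESSORIES = "accessories"


class Profile(BaseModel):
    """Only the attributes the queries, commands, and behaviors above call for."""

    preferred_name: str
    given_name: str
    market_region: MarketRegion
    interests: list[Interest] = Field(default_factory=list)

    # Validated where behavior depends on them; optional because bulk CSV imports
    # won't carry preferences or every contact detail.
    email: Optional[EmailStr] = None
    physical_address: Optional[str] = None   # for shipping the pound of beans
    billing_address: Optional[str] = None    # may be validated by the payment processor instead

    # Opt-ins tell the newsletter and mailer processes whether to check subscription tables.
    newsletter_opt_in: bool = False
    mailer_opt_in: bool = False
```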

Jefferson Heard 5 months ago

Your SaaS's most important trait is Evolvability

In the world of commercial SaaS, your technology is always on a trajectory to become generic. Competition catches up. Broader trends change the way software is meant to look, feel, and be used. The longer your product stays static, the less it stands out. What this means for your technology organization is that good software architecture is not some fixed set of principles from a textbook, but the practice of building adaptive strategies into your software, your infrastructure, and your team.

There was an O'Reilly book, Building Evolutionary Architectures. Its basic idea was hugely influential on me, but the text of the book missed the mark because it ended up deep-diving on side topics rather than being an in-depth discussion of the core idea. The essence of building a system that can adapt to market changes, company growth, and competition is to identify fitness functions that each describe an attribute that can be optimized to keep your product healthy. If you measure how well your software is doing, and how it and its development organization are responding to changes, you will always have a picture of whether your architecture is healthy or ailing. You'll also always have a picture of whether your software is, well, not done, because it never is, but done for now.

According to Wikipedia, Net Promoter Score, or NPS, is "a market research metric that is based on a single survey question asking respondents to rate the likelihood that they would recommend a company, product, or a service to a friend or colleague." Most companies use it in some form to determine how well their products are doing in the market. Sometimes you even get NPS for individual features or facets of your product offering.

There are other functions as well that your company might care about: ARR, net retention, gross retention, churn, out-of-cycle churn, etc. These can all be characterized as derivatives of your core software architecture fitness functions – not in the mathematical sense, but in the stock market sense. They're affected by what you want to optimize. They're also trailing metrics, and to optimize your engineering and software architecture you want leading metrics. If all this seems abstract and you don't know how your software architecture affects net retention, don't worry, I'll get there. What you want to be able to do, both to communicate your department's effectiveness to others in and out of engineering and to improve, is infer a reasonable connection from functions you define to these iceberg metrics. The most important corollary to "measure everything" is "measure it right."

Calendars describe a scarce, conflict-prone resource: a person's or facility's time. Because it's a conflict-prone resource, and because most people aren't going to use your calendar as their only calendar, you decide you want a way to do a two-way sync with outside calendars. That way they can schedule without checking multiple calendars by eye for conflicts. This is a great feature to use as an example because in the ideal case a user won't even notice it. It's not just not flashy; its ideal state is invisible. So the only thing you can really tune is a customer's perception of its reliability.

As a user, when I create a calendar event on either calendar, I expect to see it show up on both. That's the core purpose of a sync.
I can tell you easily with a yes-no whether I'm achieving that, but how do I know if I'm achieving it well, and how do I know what potential improvements are impactful? For that, let's brainstorm some attributes that I can build fitness functions around. If these also look like feature requirements, there's a good reason for that. You're coming up with the measurements for core attributes or features of the system that determine its fitness for purpose. And by derivative you affect the end user's likely satisfaction and likelihood to tell others, "Yeah this calendar's the one you need." Finding the hard numbers and friction points to hit above requires product research. This is why market-engaged product engineers are so powerful in your organization relative to "pure techie" engineers or contractors. This is engineering-focused product research. It requires knowing enough of the guts of the system you're building and maintaining and enough about who's using the product and why to ask the right questions. You can't just assume you know the number. You'll either frustrate customers or you'll over-optimize and waste time and money. So you should have your engineers work directly with customers or market research data to come up with the right answers. For the purposes of continuing our illustration, here are some made-up answers to these questions. Yours would vary by market: Now we have numbers. They're affected by the number of users who opt in to calendar sync, and the rate of change, outage, and bugs of the external systems we connect to, and the rate of uncaught bugs we introduce. And they all ultimately feed into NPS, CSAT, and our ability to renew and retain users. Over time you learn things like "every 10,000 users we add to the system means another cycle of optimizing queries and queues." and "our logs and alerts are becoming hard to monitor and we need to change strategies in about a quarter" and "every time we change this aspect of the system it takes us a month to get it right." You learn which are the things about your systems you have to modify the most, and if your engineers are doing their job well, your software architecture bends to make those things more modifiable and to give you longer lead times to anticipate change . Taking engineering spend and lead time into account on a new product, your queuing system for your calendar sync probably started in Postgres or MongoDB. It works well to start with, but as you hit 100,000 calendars or so you're starting to see that the queues aren't keeping up. You need to change queueing systems or find some way to scale the data store. Now if you've defined your fitness functions and used that to inform your monitoring, then you have an idea of how long you have before you'll drop out of spec on that 60 second appointment sync time. The longer the lead time, the better, more long term a decision you can make architecturally. If you don't define your functions and wait until you have days left, then you're probably doubling the size of your database instance and hoping for the best. Some engineer hacks in a way to use the read-replica you weren't using and buys another few months of time before you drop out of spec again and you're never sure when that will be. Whereas if you had two months of lead time you could have switched to SQS or Kafka and been good for a year or two before you needed to strategize about scale again. The same goes for changes to Outlook or Google Calendar. 
The same goes for changes to Outlook or Google Calendar. They introduce some change and an API parameter you're using is going away. If you haven't worked to make that interface with the outside world easy to modify, then you have a pair of engineers working back-to-back 60-hour weeks to change the implementation and QA it, and they still release that bug that adds duplicate appointments to the calendar, fixing a crisis and causing another, compounding the hacks that now exist in the system and make it harder to modify.

Operating under crisis leads to tech debt. Your system will become a stovepipe of individual hacks the longer you operate without a clear idea of how the system is scaling to environmental changes and user growth. And eventually those hacks will compound to the point that you can't modify the system, support tickets take forever, and customers crush your retention numbers in a fit of rage.

Lastly, having your fitness functions defined means you know when you're done for now. And you also know what you don't have to do. Your calendar sync service only ever does one thing. Sure, there's a remote chance that a new calendar product will come up and you'll need to add sync for that, but really the likelihood is that your customers are using one or another major calendar system. Therefore you don't have to spend time planning for undue growth of the codebase. You can derive tests and QA for Google and Outlook and you can ignore Apple and other lesser calendar products. You can move engineers onto other projects until it's clear you're a month or so out from one of your core metrics going out of spec. And you can listen to a junior engineer's excited ramblings about making a deep change to the system that will "make everything better!" And when you do, you'll be able to tell yourself whether that's really likely or even necessary and then redirect that engineer to something more useful with the standard that all of engineering has already agreed to.

The point is this: understanding the ecosystem your software operates in, its state, and how it typically changes leads to understanding your software's fitness for purpose. Putting concrete numbers, classifiers, and functions around that understanding allows you to set standards for engineering to aspire to, and shapes the evolution of the software around the changes to that ecosystem. By aligning architecture with environmental change rather than general software trends your engineers want to adopt or assumptions they make about a market and users, you have clear start and stop points for modifying a system and points of likely change you can make more flexible as you build them. Evolutionary architecture is a powerful way of thinking about building software writ large, and gives you a sound set of principles to lead a department with, whether it's 5 people or 150.
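If it helps to see the idea in code, here's a minimal sketch of turning the made-up calendar-sync targets above into something a monitoring job could evaluate. The metric names, the numbers, and the simple linear "drift" forecast are all hypothetical; the point is only that each fitness function carries both a spec and a rough estimate of how much lead time you have before you fall out of it.

```python
# A minimal sketch of encoding made-up calendar-sync fitness targets as code.
# All names, numbers, and the drift model are hypothetical; real values would come
# from your own product research and monitoring stack.
from dataclasses import dataclass

@dataclass
class FitnessTarget:
    name: str
    threshold: float      # the level the system must stay within
    current: float        # latest measured value
    weekly_drift: float   # how fast the measurement is trending toward the threshold

    def in_spec(self) -> bool:
        return self.current <= self.threshold

    def weeks_of_lead_time(self) -> float:
        """Rough estimate of how long until we drop out of spec at the current drift."""
        if self.weekly_drift <= 0:
            return float("inf")  # flat or improving: no forecastable breach
        return max(0.0, (self.threshold - self.current) / self.weekly_drift)

targets = [
    # "Sync a single appointment in under 60 seconds" (p95), drifting up as calendars grow.
    FitnessTarget("appointment_sync_p95_seconds", threshold=60, current=41, weekly_drift=1.5),
    # "No more than 5 duplicate appointments per user per month."
    FitnessTarget("duplicate_appts_per_user_month", threshold=5, current=1.2, weekly_drift=0.05),
    # "Sync service should stay under 5% of COGS."
    FitnessTarget("sync_cost_pct_of_cogs", threshold=5.0, current=3.8, weekly_drift=0.02),
]

for t in targets:
    print(f"{t.name}: in spec={t.in_spec()}, ~{t.weeks_of_lead_time():.0f} weeks of lead time")
```

The code isn't the point; the habit is. Each target pairs a number your market research produced with a forecast your monitoring can produce, and together they tell you when a system is "done for now" and when it's time to plan the next change.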

Jefferson Heard 5 months ago

A little bit about permaculture.

So by now you know I'm a software guy. But what makes me an ecosystem guy? Back in 2020 the pandemic closed down our physical office and made it possible to make my partner's and my long-time dream of living in the mountains a reality. Whether this particular place will be my forever home, or whether we eventually move to a different mountain, there's one thing that I will always practice, and that is permaculture.

I don't know if you garden. I try. I am terrible at it. My tomatoes are ravaged by bugs. My peppers are ripped out of the ground whole by groundhogs. My corn is choked with jimsonweed. Gardening requires reliable constant input, and as a software executive who travels, who sits in weeklong meetings, who gets sick from sitting in airports next to someone with a hacking cough, I simply can't do that. But I love growing things. So I discovered permaculture.

At first I struggled to understand how you did "agriculture" that way. My family were farmers for generations and that's part of what I was moving to the mountains for – to connect to my roots – but "farming" to me meant livestock or vegetables or both. It turns out that's not the only way.

I started with mushrooms. When I was growing up, my dad had a colleague that grew shiitakes, so I knew at least what to look for online. I got a ton of help from my local mushroom club as well, and attended a few talks by a local mushroom farmer. Then (like I do) I dove in. My friend Trevor and I got out on one cold day in March and felled three stringy tulip poplars that were being crowded out by stronger trees. We chopped them into four foot segments, and we waited.

There's a lot of waiting in mushrooms. Against the frenetic pace of helming a venture-funded, growth-oriented tech organization, the waiting is a welcome thing, let me tell you. The first thing you have to wait for is for the tree to die and for the weather to warm up a bit. You want to fell the tree while it's dormant, but you want to inoculate it with mushrooms when the weather is just starting to turn warm. I bought some oyster and wine-cap spawn from Field & Forest and a couple of tools and got to it.

I guess the other thing that is needed in abundance with growing mushrooms is trust. There's a process. You follow it. If you follow it, the majority of the time you'll get mushrooms. Eventually. But there's a long waiting period where you have no idea if you'll get anything. Months for oysters and wine-caps. A year or more for shiitakes. Longer for some of the harder to grow exotic varieties. They will fruit when they fruit, or they will fail silently. The zen of mushroom farming is that you make peace with it. You spend a day or two inoculating. You stack the logs. You've put your work in the hands of the workers, the hyphae that will colonize the log and produce gourmet mushrooms. And you trust them to do their job for six months while you move on to the next thing.

That teaches you a lot. That productivity happens with or without status updates. That life happens at its own pace. It also teaches you how to use your sudden, unexpected flush of 10 pounds of gourmet mushrooms before they turn into a stinky brown mess.

Okay, but why is that permaculture? By itself, it's really not. You can destroy a lot of wood with mushrooms, but to be permaculture you have to give back. I started a wine-cap bed under the big cherry tree. They're huge brick-red mushrooms that quickly digest wood chips and turn them to mulch.
They're not as amazingly tasty as the oysters, but they're still excellent. And what I was amazed by was how quickly the cherry tree responded to the change. The year we moved in it was buggy and the cherries were tiny and few. The year after the wine-cap bed fruited, most of the leaves were whole and the cherries were much larger. These were fresh chips. Ordinary mulch wouldn't have done that. But the mix did. And so I started to read up on permaculture and how to get things to work together.

Now I grow fruit tree guilds, with berries, and I've started hazelnuts, I grow more mushrooms, and I tap trees in the winter. I work with the ecosystem that's determined to be there with or without me, and I've found that working with it is so much more satisfying than the toil. I'm still not great at it. I'm only five years in really. And I've probably screwed up as much as I've gotten right or perhaps considerably more. I can't do as much as someone who does it full time, but if I head to the office for a week-long planning session, the mushrooms abide. The berries grow and ripen. The hazelnuts and walnuts swell.

I do take these lessons from permaculture into software. I delve into a market segment and I see what people are doing. What the ecosystem looks like now. I assess how to create software that makes that ecosystem work better. I think closely about the connections between people and between systems. The parameters and functions that define the relationships. How do you strengthen them? How do you become essential to the niche? What are the survival and thriving qualities that software and products need to operate in that "biome?" And then I try to work with people to foster that.

I really do think that living this close in tune with the ecology of where I am makes me better at my job. Yes, because it is a relaxing and interesting outlet for all my non-software energy, but also just because really it's not so different.

Jefferson Heard 5 months ago

What I talk about when I talk about Technical Debt.

Communicating technical debt to people other than engineers is essential to getting work on that debt prioritized and valued alongside bugs and product roadmap work, and it's not easy at all. One key quality distinguishing a good engineering leader from a great one is the ability to bring engineers and non-engineers to agreement about technical debt and its priority. In this article, I'll talk about how we've done that at Teamworks.

For a long time we struggled with differing definitions of technical debt by different parts of the company and a lack of ability to communicate the urgency of tackling it. We eventually arrived at a definition of our own. It doesn't attempt to provide a taxonomy of technical debt. It doesn't establish a framework to determine priority (I'll talk about how we do it a bit later). But it does establish a hard line between what is and what is not technical debt and gets everyone on the same page.

Ward Cunningham first used the debt metaphor to talk about code problems that weren't exactly bugs but rather things that made the code harder to understand and modify. It's a good metaphor. It describes these problems as having a cost associated with them. It provides for the idea that there's a principal and an interest rate, even if it doesn't define how you arrive at those things.

Venture-backed startup companies generally share the characteristic that they spend money faster than they can make it to expand into new markets. This deficit spending is a conscious choice and makes long-term sense. Most importantly it tends to pervade every decision made about how to allocate capital in a company. That includes how that company allocates technical capital. A venture-backed software company has to build software faster than it can refine it. Getting into new markets and getting to market fit faster than competitors require lean experimentation alongside a codebase that's also serving a well-developed base of paying customers who count on an agreed-upon service. This leads to conscious adoption of technical debt in the service of growing the company. In an investment-backed company, you need to strategically embrace technical debt. Just remember to understand it, document it, and budget for paying it down before it buries you.

To get their renewal, someone promised an important customer who was already on the fence about renewing that tracking wearable data would be available by April 21. To make that happen, you had to cut some corners. A new section of config has to be done for each user for the feature to work. Without a detailed update to the profile screen that walks someone through connecting their watch to the app, one of your engineers has to do it in SQL and by making API calls with the provider.

The cost of the customer doing it themselves is $0 and throughput is basically instantaneous, but it'd require those screens to be built – including validation and failsafes and OAuth handshakes. The cost of someone in customer support doing it with a quick-and-dirty screen is, oh let's say $15/user for their time. But also the time that it takes for CS to do a rollout for a customer can't be allowed to be a drag on CSAT, so there may be additional opportunity costs if they're updating profiles en masse and letting other work pile up. You also have to consider the throughput time.
Depending on their queuing strategy and time guarantees, the time from when a user realizes they need the new feature until they can use it goes from minutes to a day or so. The ongoing cost of an engineer doing this in SQL and API calls is:

- CS writes the support ticket and puts it on the engineering queue to be prioritized
- An engineer stops work on features (possibly even the screen that cures this tech debt) and modifies the SQL template for a specific ticket, costing them a few minutes to an hour (cheap) and a context switch (expensive)
- Another engineer spends their peer review time making sure the SQL is correct
- Someone with permissions to execute SQL against the production database and API calls against the production vendor account runs it.

So now the user opens a support ticket to get on the new feature. Support sees it and forwards it along. Engineering uses their SDLC process to accomplish it, or their 2nd-tier technical support process if you have a 2nd-tier support team. Counting all the handoffs, the cost is now in the hundreds of dollars per ticket if not pushing four figures. The throughput is now a day or more depending on how disciplined your engineering team is about customer-facing problems, and it has impacts to your team's ability to push new features. The tickets interrupt people. Engineers can't be as adept about making context switches as other functions in your organization, so you'll lose more than the few minutes of active work it takes people to service the ticket.

This is technical debt, and these are its associated costs. Again though, this is not a matter of bad vs. good. The above impact scenario has its place. I've done it, but when I did I knew full well what I was doing. The point is that when you take on tech debt, you're aware of its scope, you document the impacts, and you communicate the need to clean it up. Sometimes the most debtful scenario is fine long term because in actuality it amounts to a few tickets per quarter and it's not worth diverting the team to write a fully hardened, well designed screen that puts the setup in the user's hands.

In his article on Technical Debt, Martin Fowler gives his own characterization of the term. It's a good one for software development in a vacuum, but it's not all that useful in a company setting. The challenge with tech debt in a company is getting it on the docket when there are features to develop and territory to capture. In the enterprise, technical debt has impact well beyond engineering concerns. It includes:

- The cost of providing adequate customer support.
- Cost of providing performant and reliable software.
- Cost of continued scaling of the customer base.
- Cost of ensuring regulatory compliance and security.
- Throughput of individual support requests and their impact on customer relationships and retention.
- Cost of hiring engineers who can make system modifications reliably in bounded time.
- Ability to execute on high- to medium-priority items in a product roadmap in a sane amount of time.
- The impact of customer frustration from user "toil" and confusion necessitated by engineering around existing behavior, i.e. "You have to have this permission and go to that strangely named screen, and then do your task in 8 click-and-waits because that's the only way we could build it."

The impact of technical debt is the sum of these costs that are themselves the result of solvable technical issues, shortcuts to market (like mechanical turking), and costly adaptations (hacks). Too much debt can drag on a company's KPIs across the board. A pragmatic approach to planning and accounting for technical debt, on the other hand, allows you to achieve things in a timeframe you couldn't otherwise.

All of these costs, importantly, are quantifiable. You can calculate the increased cost of customer support. You can calculate the cost of churn due to low customer satisfaction. You can calculate the cost of your R&D department having to re-engineer and bootstrap a project to work around the problems from poorly maintained production software. And since you can quantify that cost, you can communicate it to the CEO, COO, and CFO.

The important part of communicating to the non-technical parts of the organization is quantification. If you can make a spreadsheet or a graph of it and relate it to ARR, you can relay debt in terms that are meaningful to everyone who isn't an engineer. I say cost and not other metrics because cost is always meaningful. There's no way to make it a vanity metric. No-one cares that you improved garbage collection times by 50%. Everyone cares that you can eliminate $125,000 a quarter from cloud costs with one month of work.

In the above scenario, the cost of missing your engineering deadline is losing the renewal to churn. To the company that's, say, $850,000 in ARR.
It's also the cost of missing quarterly earnings numbers, dropping Gross Retention, and so on. So as long as the cost to engineering and CS, etc. is less than that number, taking on the technical debt and maintaining it without fixing it is worthwhile. When you tell engineers why they're not just delaying the feature by another sprint or two, this is what you tell them. When you tell folks handling renewals why this feature is being prioritized over others, making their negotiations harder, this is the calculus you give them. There's a number. The math works out. Yes, there are impacts to this, like engineers slowing down on feature development to handle support tickets, and like CS being forced to be really precise and check their work on a new screen that makes them handle the rollout of hundreds of users person by person. But in a controlled timeline, it works. And when you tell everyone that you're delaying something else by a couple of sprints to get the screen in, even after 95% of existing users have been migrated onto it, you point out the ongoing cost to everyone of running the above processes to add all the users every time sales signs a new customer. And thus your technical debt gets cleaned up.

In my experience, the principal of the tech debt is the cost of what it will take to provide a solution that eliminates the negative impacts. The interest is the toil and drag across the organization involved in using and maintaining the debt-financed solution, and the growth of that toil over time as other code and company process has to work around or incorporate the tech debt in order to get its job done.

Take for example a function that lets you create and update a form, but there's nothing self-service to delete it; imagine that deleting and cascading was more complicated, so you needed longer to consider how to do it right. Not having the deletion is the principal. The interest on that principal consists of the time and cost of every ticket your DBA had to take care of manually in SQL in order to delete a customer's form. It's the number of times you had to restore a backup or otherwise fix the database when the DBA made a mistake. It's also the cost to the company's reputation of all the times that it took longer than its customers felt was reasonable to delete that form, or they were impacted by mistakes.

When prioritizing tech debt in the backlog, it helps to describe the consequences to be mitigated instead of the solutions you plan to use to mitigate them. Imagine a system where you're going from a single-instance cache to a highly available, scalable cache. Your description of the work could be "Switch web caching from our managed Redis to a managed ElastiCache." That communicates nothing about the why, and in a backlog of 1000 tickets the title tells the product owner nothing about how to prioritize it vs. everything else. A better ticket would be titled, "Cache misses are causing users to complain about slow performance at peak times."

There is an important distinction between doing work that anticipates change vs. work in reaction to it. Work that is done as a reaction to change is often paying down principal or interest on technical debt. Work that's done in anticipation of change expands our overall technical capital. To illustrate the difference, consider a mobile/web shared code project. At some point in the past we began using React on web as well as React Native to build our application.
In the beginning, there was very little shared code, but we anticipated that much of the code between web and mobile would be shared in the future. If we had taken that foreknowledge and applied it then to solving the problem of "how to share code between web and mobile", prioritized, and scheduled that work, that work would not have been "technical debt." Why? We realized our code repo was inadequate to future needs. Why is that not technical debt? The answer to that is also the answer to the question "Was there realized impact or did we get ahead of it?" It's the difference between being forced to react vs. having the advantage of the situation. If we had done it then, we could have done it without also being impacted by the negative consequences that came with waiting too long to address it. Instead we didn't plan the work ahead, and we had to build a shared-code solution while also experiencing development drag from engineers manually keeping shared code in sync. Prioritize proactive work by thinking about the technical debt you take on if it's ignored.

This bucket of work is for experimentation and work that has the potential for positive disruption. It's a bucket for work where the engineering department can be the force for innovation. Think "labs." If, when you crafted your story, you thought "Things are pretty good, but I think they could be way better," then you have an engineering priority.

Prerequisite and requisite work is the work that should be done before building or revising a feature, or should be done in order to make the feature complete. This is often the work needed to make new development conform to engineering standards of quality, testability, and performance. Examples of this include:

- Providing self-service admin functionality.
- Refactoring code somewhere else in the stack that is common to the new / revised feature, so that it can be shared between the old and new.
- Bulk uploads.
- Settings screens.

Sometimes prerequisite work can be skipped and a feature can still be shipped, but it will be more costly to maintain and modify than a fully complete project. This causes technical debt to accrue, and thus the priority of the work can be based on that impact.

The difference between a bug and an item of technical debt is obvious most of the time. The distinction is blurry when the bug doesn't affect the correctness of output, only some aspect of importance to engineering or operations. In some cases, the distinction may be down to urgency or whether treating it as technical debt can bring a single item into the context of a wider cleanup push. The key concepts are Urgency and Impact.

Another key activity for grooming technical debt, however, is contextualization. This is the planning activity of organizing technical debt into well-scoped refactoring plans, epics, and the collapsing of closely related stories. This makes it so that we can tackle more technical debt than we could grabbing a few stories off the stack. Teams should groom technical debt carefully and where possible create proactive solutions like refactors vs. playing "whack-a-mole" with issues that haven't changed since they were initially reported.

Too often, "technical debt" is a meaningless phrase in a company setting. Making it meaningful is about showing the wider impact technical debt has on the organization. Everyone is impacted by technical debt, and so everyone has to collaborate on fixing it, whether that's writing code, adjusting timelines, smoothing over bumps with customers, or yielding budget dollars to help with the paydown. To achieve that kind of collaboration, though, you have to become a great communicator of technical debt to technical people and non-technical people alike.
By characterizing the debt in terms of cost, the pay-down in terms of impact, and making conscientious choices about the tech debt you take on, you will manage your company's technical debt balance effectively and not let it compound until it stalls growth.
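If it helps, here's the kind of arithmetic I mean, written as a tiny script instead of a spreadsheet. Every figure is hypothetical, borrowed from the made-up wearable-rollout scenario above, and the breakdown is deliberately crude; the value is that the trade-off becomes a number a CFO can argue with rather than an engineering judgment call.

```python
# A back-of-the-envelope model of the example above. Every number is made up
# (taken from the illustrative figures in this post); the names are mine, not a
# standard. The point is the comparison, not the precision.

ARR_AT_RISK = 850_000          # the renewal saved by shipping on time
CS_COST_PER_USER = 15          # quick-and-dirty screen, CS does the setup
ENG_COST_PER_TICKET = 600      # SQL + API calls + review + context switching
TICKETS_PER_QUARTER = 40       # how often users need the manual path
BUILD_COST = 35_000            # hypothetical cost of the self-service screen

def quarterly_interest(tickets: int) -> int:
    """The recurring 'interest' on the debt: toil across CS and engineering."""
    return tickets * (ENG_COST_PER_TICKET + CS_COST_PER_USER)

interest = quarterly_interest(TICKETS_PER_QUARTER)
print(f"Interest per quarter: ${interest:,}")
print(f"Quarters until the fix pays for itself: {BUILD_COST / interest:.1f}")
print(f"Debt was worth taking on this year: {interest * 4 < ARR_AT_RISK}")
```

With numbers like these, "clean it up in two quarters" stops being an engineering plea and becomes the obviously cheaper option on a shared spreadsheet.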

Jefferson Heard 5 months ago

So You Wanna Buy a Tech Company

I've run the tech side of the M&A playbook now I think 10 times. I want to talk to fellow tech executives who are looking at acquiring a company about tech diligence and what it's for.

In 2021 we bought a 30-acre hobby farm. Our house was built in 1947, and it was maintained largely by DIYers for the last quarter century. When we had our home inspection done it wasn't to decide whether or not to buy the house, but to lay out for us clearly what work would likely need to be done, and to help us prioritize it. The well needed to be replaced. The basement needed to be shored up in one place. The heating system memorably involved a heat pump, an electric emergency coil, a heating-oil switchover, and a wood-burning furnace. But ultimately, we bought the house. The inspection was important, but it didn't carry much weight with us in terms of whether to buy. It just helped us understand what the purchase implied about where our money was going to go in improvements.

As the technical executive running diligence – whether you do it yourself or you contract it out to a company that specializes in it – what you're ultimately doing is the same thing. You need to know how to get the new acquisition incorporated cleanly and quickly and at what cost.

So then, if not "whether to buy," what do you want out of tech diligence?

- An assessment of the technical strengths, debt, and risk.
- An orderly plan for incorporating the team and the technology post-acquisition.
- A clear understanding of the likely costs to budgets and timelines of the above.

That's it. It's not about whether you're going to buy that house that comes with the 30 acres, mature walnut grove, and spring-fed pond. Diligence is about strategy. And honestly? I don't think most people really understand that. It's about being able to hit the ground running as soon as the deal closes.

In the series of blog posts starting with So You Wanna Buy a Tech Company, I will develop deep-dives on the different aspects of technical diligence and what the CTO's, the parent's, and the target tech team's roles are in them. But first I want to spend the rest of this post talking about how to get your money's worth out of tech diligence. I'm going to talk about what kinds of questions to ask, and how to build those three bullet points above. If you're interested in a deeper conversation about tech diligence, reach out to me. I love this stuff, and I'd be more than happy to talk about running your play with my network of experts or one-on-one helping you build any of the plans I talk about here afterwards.

Okay, first things first. There is an occasion where you tell your fellow executives to run screaming. It's possible for a company to be so underwater technically that they're teetering on the precipice of disaster and inviting that company into your organization will drag you down with them. But it's unlikely. Most of what can go wrong after acquisition boils down to poor planning or poorly set expectations.

Regarding expectations, the second goal of technical diligence is to introduce the target team to the culture, to the way engineering leadership thinks about building, and to how they lead. Before I continue, it's of utmost importance that you're open and nonjudgemental throughout your interactions with the tech and product teams. And it's important that you do interact, even if you've hired an outside firm for the diligence work. They need to get a feel for your company, and you need to see how their team is reacting to the prospect of working with you.

With that out of the way, let's talk about how you make a plan. Something that almost never gets covered in tech diligence is unwanted overlap or redundancy.
When you buy a company and don't kill their product, there will be some announcement from marketing that "Company X is excited to join the Y family of products!" This implies integration to customers, and they will expect it. What will that integration take to achieve?

There's a big emotional component to this. It's less important that the guts of your product are seamlessly integrated than that your customers get the experience of integration in a timely fashion. Determine what constitutes the experience of integration, and develop a plan for it. Use your tech diligence people to ask questions like:

- How is auth handled? Do they support SSO?
- How are profiles stored and updated?
- Are there "conflict-prone" resources of the user that the product keeps track of that our product also does, like calendar appointments?

In general, scarce or conflict-prone resources are things about the customer's experience that can come into conflict between your product and your acquisition target's product. The important things to make customers feel like the acquisition was of value to them are:

- Avoiding double entry
- Avoiding conflicts between systems
- Giving them a clear choice of which product to open.

The quicker you reduce redundancy and confusion, the faster you get to added value. You should also be careful when you take features away from the target product and direct people into your own. It should feel natural and not be something someone has to remember. Also, they won't want to lose history from the product with the lost feature.

You should talk to their product team about their vision for integration. How they're thinking about this will tell you a lot about how they're going to work with the rest of your team. Then talk to the tech diligence team about that vision and brainstorm questions to ask to figure out feasibility, horizon, and cost.

Every piece of software has tech debt. If the tech team tells you theirs doesn't, that tells you a lot about them. Ask them to define what tech debt means to them, as well. I have another post which talks about tech debt in depth, and their own definition ought to align with it on some level. The more their view of debt aligns with aesthetics and trends, and the less impactful it is on business, the more direct leadership they will need. Mainly your job here is to get the tech team to talk about their debt, give you their gut impressions of impact and priority, and to get them comfortable with the idea that you're not judging them for doing what was necessary to get their business to where it is.

The important thing here isn't so much that there might be "gotchas" like AGPL licenses or production data on Dropbox, but that you find them and have a plan to do something about them. At a high level you need to understand:

- How and where their software is deployed.
- How resources with an ongoing cost are divided between customers.
- The basic ongoing cost profile per customer, commonly called Cost of Goods Sold or COGS.
- How customers receive updates, and whether there are special customers with leverage over the update process.
- Whether there are looming deadlines that pose a significant risk to the software.
- How the tech team develops software. That is, their SDLC.

There are clearly other details you need, and I'll go over this topic in depth in another post, but at a high level this is what I consider the most important take-away.

Most of the time you'll be acquiring a company that's significantly smaller than you. They're at an earlier stage in their journey, and as such they will have security risks you've already mitigated. You want to ask yourself and your tech diligence folks, "If we incorporate their technology, will it materially impact our security posture or certifications?" You don't need them to be perfect, but you do need to know how far off they are from your own standards. A very non-exhaustive list of good things to understand is:

- Do their code repositories contain private keys or passwords?
- How is tenancy handled?
- Do they keep dependencies up to date? Do they review CVEs regularly on the tech they use?
- Where all does PII end up? Ask the hard questions here, because people do do things like put PII in application logs, even though they shouldn't.
- Are they externally or self-certified for SOC2?
- If they deal with specially classified data (e.g. health or education or data about EU citizens) what level of compliance do they maintain? They need to have a non hand-wavy answer about FERPA, HIPAA, GDPR, etc.

A lot of acquisitions will have a "bring things up to standards" period immediately post-merge where they cannot release new features or provide any evidence to customers of integration with your products. You want to know how long this period will be so you can inform customers' and your colleagues' expectations.
To get at that, what you're really after is:

- What are the remedies needed to get them up to our standards?
- Is their tech team capable of implementing them, or do we need to pull resources?
- What is the priority order to mitigate critical vulnerabilities to our preexisting business if we incorporate them?
- What is the likely cost of that to timelines and budgets?

When you acquire the company, you acquire the team. Its trust issues in its own leadership, in your leadership, its overall sentiment and culture. The size, strength, independence, and makeup of the team gives you your strategy for incorporating them. Diligence here is about determining that and about first impressions. You're not (likely) deciding up front who you're going to keep and who you're going to lose. That puts the cart before the horse. You want an incorporation strategy first and foremost.

And here alone, more than answering questions you have, it's more important to establish trust. They're going through a big change, and you and your fellow executives are the architects of that change. They need a sense that they're participating in that plan and that it's not being made for them. They need to get a feel for what's likely to happen on day 1, 30, 60, 90, and forward. Does their product have a trajectory or is it likely to be sunset? You don't have to tell them, but they're going to walk away with a theory anyway, so it's best if you establish trust and rapport early.

This is the most important part, but it's also the part I have the least advice for. It's not because I don't have an opinion but because the questions you have to ask are so culture- and team-dependent. You should work with your diligence folks to get the facts about the team and use your impressions to create questions that then get answered. Questions like:

- Are there people with ridiculously high "bus factor"?
- Who are the people on your team who are most natural partners for the leadership post-merger? E.g. are they reporting to you, or is there a director that you think is a great fit to work with them?
- Is the team excited about this opportunity or worried? No really, what do you think the balance is there?
- Are there detractors who hate your company or the whole idea that their product might not "win"?
- Does the target tech team trust their own leaders?

But it's all going to be highly dependent on whether you're acquiring a couple of founders or a 50-person tech team, and what the state of their culture is. Your overall goals, though, don't change:

- You need a plan to pair your leadership with theirs.
- You need a way to gauge whether the merge is progressing well down the road.
- You need a set of strategies to employ if it needs to be brought on track.

This doesn't go half into the detail I'd like it to, and it's still probably too long. In future posts on this subject, I'm going to go into detail about each of the subjects and talk about what I look for, what questions I've asked in the past, and what my experiences have been, good and bad, in incorporating companies into the fold. I'm not going to give horror stories. There's plenty of clickbait on the internet for that if you want it. But I can talk about what mistakes I've made and what I think I'd do again.

I love the process of technical diligence. I love meeting new teams, talking about cool technology, and figuring out how to make the best marriage out of two companies on a course for greatness. If you'd like to talk to me in depth about this, send me a note and we'll set something up!
