As you can read from my profile, aside from my regular job as an Enterprise Architect, in my free time I’m a software developer, and I love to create new things and solve problems: as much as possible, I try to follow the Lean Startup approach to validate my ideas and create a minimum viable product (MVP), eventually trying to scale it into a business.
A couple of years ago, using the knowledge acquired during my time as an architect in a telecommunications company, I decided to create a messaging platform offering a new way to send and receive A2P (Application-to-Person) messages, unifying different types of channels (email, SMS, push notifications, Telegram, WhatsApp, etc.) through a single API: a platform named Deveel OCM (Omni-Channel Messaging).
I will not go into detail here about the idea, the overall architecture, or its whole implementation, also because I believe that most of the project’s ideas, architecture and code were sound: I will rather share with you the story of how it failed, so that some might use it as a cautionary tale, or as a way to learn from my mistakes (a sort of public post-mortem analysis).
Some of you will recognize some of your own stories in this one, but others (maybe the majority) will read it and think “I would never make such rookie mistakes”: that is what I thought myself when reading similar stories posted by others, given my training as a developer, project manager and architect. I eventually made those mistakes anyway, once I mixed all the roles together and lost my objectivity.
The Easy Part: A Working Proof-of-Concept
In fact, to rapidly validate the business idea, I diligently followed the good lean practices that I had observed in previous cases, and that I preach myself to developers and business owners. Starting an innovation process of my own, I developed a monolithic application, aggregating several sub-domains and entities (e.g. channels, terminals, webhooks, messages, etc.) into one single service, cutting off architectural and code complexity and speeding up development. In roughly two months I had the first version of a cloud-native messaging system, a notification service to deliver webhook callbacks to subscribers (for incoming messages or event notifications), and a set of background services and infrastructure to run the majority of the features of the system (e.g. creating channels, creating terminals, sending messages to providers, receiving messages, retries, notifications, etc.).
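To give an idea of the shape of that monolith, here is a minimal sketch of how a single unified API can dispatch messages across channel types. The actual platform was built on .NET; this is Python for brevity, and all names (the `Message` fields, the channel registry) are hypothetical illustrations, not the real codebase:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Message:
    channel_type: str   # e.g. "sms", "email", "push"
    terminal: str       # the sender identity on the channel (number, address, ...)
    recipient: str
    body: str

# Registry mapping a channel type to its provider-specific sender.
# In the monolith, all of these lived inside the same service.
senders: Dict[str, Callable[[Message], str]] = {}

def channel(channel_type: str):
    """Register a sender function for a channel type."""
    def register(fn):
        senders[channel_type] = fn
        return fn
    return register

@channel("sms")
def send_sms(msg: Message) -> str:
    # A real implementation would call the SMS provider's API here.
    return f"sms:{msg.recipient}:queued"

@channel("email")
def send_email(msg: Message) -> str:
    # A real implementation would hand the message to an SMTP/email provider.
    return f"email:{msg.recipient}:queued"

def send(msg: Message) -> str:
    """Single entry point of the unified API: route by channel type."""
    try:
        return senders[msg.channel_type](msg)
    except KeyError:
        raise ValueError(f"unsupported channel: {msg.channel_type}")
```

The point of the single service is exactly this co-location: adding a channel is one more registration, with no new repository, pipeline or deployment.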
Although I didn’t spend much time on a full design of the architecture, an agile project management process was in place (using Kanban boards linked to code), and I did some domain analysis, defining entities, events and services, sensing the future complexity of the system and the opportunity to scale the platform, while remaining confident that the architecture was good enough to start with a demo.
The Follow-Up: A Working MVP
Having achieved a working set of minimal features, I proceeded to demo the system to some friends and colleagues, gaining some attention and a young business partner: this made me decide to go further and create a Minimum Viable Product (MVP), to test the idea with real users and possibly start a business behind it.
To enable such a scenario, I had to fill the gap of some missing services and components, such as an administration portal (for users to manage their resources), a registration service, authentication and authorization, and a way to track the usage of the platform and provide users with insights about their messaging (analytics).
This also meant scaling up the technical capabilities of some of the existing components, and starting to think about the infrastructure and the deployment of the platform.
Fortunately, after reaching out to Microsoft, we were granted access to the Startup Program, receiving a Sponsored Subscription to the Azure cloud platform and other valuable perks. At the time I felt that was a great help and support, because I could focus on the development without worrying about the infrastructure and its costs (after a couple of very expensive invoices for the usage of the platform during the development phase).
However, as much as I appreciated and felt relieved by Microsoft’s support at that time, I realize now that it might also have been an issue for me: I stopped feeling the urgency of raising funds to pay for the infrastructure, which led me to relax and start focusing on redesigning the architecture towards an ideal target state of the technology, rather than on business development.
The (Premature) Re-Architecture
Being an architect after all, I could not resist the temptation of applying proper design principles and patterns, and thus I started breaking the monolith into several microservices, creating a complex and distributed architecture. That also meant re-designing the CI/CD pipelines to support a distributed deployment model: not a small detail, given that each domain ended up having its own services, infrastructure and codebase.
Such an architecture did indeed implement best practices, suited for large organizations with multiple teams collaborating on the same system that need to scale up their agile development.
Anyway, I am not a large organization, and this approach failed me.
In fact, every fix, every change, every new feature required me to work across several repositories, restoring dependencies and updating several configuration files: the time spent implementing, testing and validating changes multiplied by a factor of nine, and I was not able to keep up with the pace of the business idea I wanted to implement.
Design and architecture are among the most important activities in the system development process: considering aspects such as security, compliance (with regulations), scalability, business continuity and DevSecOps is crucial for an established organization that produces software service solutions, but it has to come at the right time in the lifecycle, with the right scale of effort in the right phase. I can easily say I failed to evaluate at which stage I was (as a one-man startup), starting a process of re-engineering that was premature and not needed.
Filling the Gaps
As mentioned above, the initial Proof-of-Concept of the system was intentionally and knowingly missing some key components, which were nevertheless required to make it usable by real users:
- Multi-Tenancy: given the Software-as-a-Service (SaaS) nature of the idea, the system lacked the capability to isolate resources, users and applications for specific tenants (organizations or individuals who would use the platform)
- Security: the system lacked the security features to authenticate and authorize users and client applications
- Analytics: the platform was missing a way to track the usage of the platform entities (e.g. channels, terminals, message statuses, campaigns, etc.), and to provide insights to the users about their usage
- Registration and Provisioning: the platform lacked a service or process to onboard tenants and users, and to provision an initial set of credentials and resources to use the platform
- Administration Portal: potential users of the platform lacked a graphical interface to manage their resources, configure the platform, create new applications and users for their tenants, review the usage of the platform, etc.
Understandably, the above enablers are not in fact part of the core messaging domain, but rather sit alongside it as supporting capabilities and services. As such, I wanted to avoid reinventing the wheel, trying instead to find vendors (that I could afford!) delivering such functions as Software-as-a-Service (SaaS) themselves (without the complexity of installing and managing the software parts), so that I could focus on the core domain.
The Multi-Tenancy Issue
Although multi-tenancy is a well-known model and architecture, it’s not easy to implement, and it actually posed the greatest challenge to the initiative:
- Vendors providing multi-tenant authentication services are not affordable for a startup; this also includes the usage of Azure Active Directory (which had limitations in the Sponsored Subscription provided to us by Microsoft)
- Resources must all be segregated per tenant, isolating databases, logs and analytics
- Access control must be enforced at a new level of granularity, since users and applications of a tenant must be able to access only their own resources, and not those of other users or tenants
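To illustrate the last two points, here is a minimal sketch of tenant-scoped data access (Python for brevity; the real platform enforced this in .NET at the data layer, and the names here are hypothetical): every read is filtered by the tenant identifier first, so a resource belonging to another tenant simply does not exist from the caller’s point of view.

```python
from typing import Dict, List

class TenantScopedRepository:
    """A repository whose every query is constrained to one tenant."""

    def __init__(self, rows: List[Dict[str, str]]):
        # Each row carries the tenant that owns it; in a real system this
        # would be a database with a tenant_id column (or one DB per tenant).
        self._rows = rows

    def list_for(self, tenant_id: str) -> List[Dict[str, str]]:
        """Return only the rows owned by the given tenant."""
        return [r for r in self._rows if r["tenant_id"] == tenant_id]

    def get(self, tenant_id: str, resource_id: str) -> Dict[str, str]:
        """Look up a resource *within* the tenant's slice of the data.

        A cross-tenant id is indistinguishable from a missing one: answering
        "not found" rather than "forbidden" avoids leaking that the resource
        exists at all.
        """
        for row in self.list_for(tenant_id):
            if row["id"] == resource_id:
                return row
        raise LookupError("resource not found in this tenant")
```

The key design choice is that no query path exists that is not parameterized by the tenant: segregation is structural, not an afterthought in each handler.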
Not wishing to handle databases of secrets and credentials, and to avoid any compliance issues, I initially discarded the usage of Duende’s IdentityServer: it would have been a great fit for the system at a later stage of maturity, but it also meant an additional service to handle on my own. I instead started looking for a SaaS solution for the authentication and authorization of users and applications, delaying the decision to eventually switch to IdentityServer to a later moment.
Playing it safe, I resolved the issue of missing authentication by using an instance of Auth0, integrating the service through their Management API, which allowed me to attach some extra metadata to the user profiles, to logically group them in the context of tenants. This stole some time, since I could not use the plain out-of-the-box registration process provided by Auth0 itself, given that I needed users to carry the tenant identity from the start of their lifecycle within the platform.
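The trick, in essence, was to stamp the tenant identifier into the server-controlled `app_metadata` bag at signup, so every user carries its tenant from the first moment. A sketch of the payload (field names as per Auth0’s Management API user-creation endpoint, to the best of my recollection; the connection name is an assumption):

```python
def build_signup_payload(email: str, password: str, tenant_id: str) -> dict:
    """Build the body for the user-creation call to the identity provider.

    app_metadata is a metadata bag the end user cannot edit themselves, so
    storing the tenant id there makes the tenant identity a trusted part of
    the profile that tokens and authorization rules can rely on.
    """
    return {
        "connection": "Username-Password-Authentication",  # assumed connection name
        "email": email,
        "password": password,
        "app_metadata": {"tenant_id": tenant_id},
    }
```

This is why the registration had to be proxied through a custom service: the out-of-the-box signup form has no notion of which tenant the new user belongs to.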
Furthermore, Auth0 provides a very limited number of applications in its free tier (although even the Enterprise tier would not have offered enough): because of the SaaS nature of the system, the main consumers of the messaging functions were intended to be applications accessing the APIs, and users should have had the possibility to create one or many client applications per tenant, which meant an unpredictable number of applications to authenticate.
This limitation forced me to implement a custom solution to handle the creation of applications and manage their credentials in a secure way, splitting the application authentication process from the user authentication one.
To address the segregation of resources like channels, terminals and messages, I ended up relying on some off-the-shelf .NET libraries provided by the community (for example, the Finbuckle.MultiTenant library), especially for the discovery of tenants and the resolution of their resources, although I had to cover most of the complexity of the multi-tenant architecture myself.
Note to the Readers - Implementing a multi-tenant strategy requires more than just a library: it requires a design of the architecture and of the data model that is not easy to get right, and that demands a lot of time and a lot of testing (to ensure resource segregation and availability).
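For instance, the tenant-discovery part (which libraries like Finbuckle.MultiTenant handle through pluggable strategies) boils down to resolving a tenant identifier from each incoming request. A Python sketch with two hypothetical strategies, subdomain first and an explicit header as a fallback for API clients:

```python
from typing import Dict, Optional, Set

def resolve_tenant(host: str, headers: Dict[str, str], known: Set[str]) -> Optional[str]:
    """Resolve the tenant of a request: subdomain first, header as fallback."""
    # Strategy 1: subdomain ("acme.ocm.example.com" -> "acme")
    subdomain = host.split(".")[0]
    if subdomain in known:
        return subdomain
    # Strategy 2: an explicit header, convenient for machine-to-machine calls
    tenant_id = headers.get("X-Tenant-Id")
    return tenant_id if tenant_id in known else None
```

The library gives you the strategy plumbing; validating the resolved tenant, wiring it into every data access, and deciding what happens when resolution fails is the part you still own.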
The Analytics Service
Providing a messaging platform without the ability for users to review the performance of the resources they are using is not very valuable, and thus I decided to implement an Analytics Service to address this concern.
The implementation of such a service was another big challenge, since it required a large amount of data to be collected, stored and processed: I had to study the best way to execute these processes and provide insights to the users along several dimensions, with the possibility to customize the reports. Although the number of performance indicators for analysis might scale to a large amount, I wanted to start small, considering only the statuses of messages (e.g. queued, sent, delivered, failed, etc.), also to avoid too many integration pipelines.
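That first, minimal cut of the analytics was essentially a counting exercise. A sketch (with a hypothetical event shape) of how message status events can be aggregated per day and status, which is all the first reports needed:

```python
from collections import Counter
from typing import Iterable, Mapping

def status_counts(events: Iterable[Mapping[str, str]]) -> Counter:
    """Aggregate message status events into counts keyed by (day, status)."""
    counts: Counter = Counter()
    for event in events:
        counts[(event["day"], event["status"])] += 1
    return counts

def delivery_rate(counts: Counter, day: str) -> float:
    """Share of delivered messages over the terminal statuses of a day."""
    delivered = counts[(day, "delivered")]
    failed = counts[(day, "failed")]
    total = delivered + failed
    return delivered / total if total else 0.0
```

A search engine fits this model naturally because such aggregations are computed on-the-fly at query time; a warehouse-based design pre-computes them through pipelines instead, which is exactly the migration described below.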
For the first version of this service I relied heavily on an instance of Elasticsearch-as-a-Service by Elastic, but the costs rapidly scaled up to a point that was unbearable for me, and I had to find a cheaper solution, leveraging the resources available within the Azure Subscription I had received as part of the Microsoft Startup Program. Although services such as Azure Synapse were better suited than Elasticsearch to provide analytics, going from a model based on a search engine (which evaluates aggregated data on-the-fly) and a simple ingestion, to a model based on staging, pipelines, data lakes and data warehouses to produce specific types of analysis, required a lot of time and a steep learning curve to migrate the service.
Although I ultimately succeeded, these tasks distracted me from the development of the MVP, and from the development of the platform itself, before any customer was even acquired, sacrificing the implementation of new channels to focus on this accessory service.
Registration and Provisioning Orchestration
Authentication, Authorization and Resources are not synonyms: they consist of different processes, implement different models, and require dedicated components.
The onboarding of a new user into the platform requires a set of coordinated movements between a few services, especially in a distributed architecture like the one I ended up implementing:
- A unique (root) tenant has to be created to handle users, applications and resources (and potentially child tenants in the directory)
- An administration (root) user must be created on the service that handles the authentication, within the context of the newly created tenant
- A default set of resources (e.g. databases, tables, file storages, etc.) must be created on each of the services that master a certain type of information, for the newly created tenant
- The administration user must be granted access (through roles and direct grants) to the default set of resources and information
- The identity of the user must be validated through a confirmation process (e-mail, SMS, etc.) and activated
You might rightly argue that large parts of the above process could have been provided out-of-the-box by the Identity Management service, but, as described above, given the requirement of a multi-tenant model and its lack of support in Auth0, I had to implement some patches starting from the registration phase, forcing me to proxy the process through a custom service.
Furthermore, the coordination of the creation of resources, including failures and retries, required a central service to handle the process, and to act as a single point of coordination (and, admittedly, of failure) for the process itself.
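The shape of that central service is essentially a saga: run each provisioning step in order, retry it a few times, and compensate the already-completed steps in reverse when one ultimately fails. A minimal sketch (Python for brevity; the step names and structure are hypothetical):

```python
from typing import Callable, List, Tuple

# Each step is (name, do, undo): "do" provisions, "undo" compensates.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_onboarding(steps: List[Step], retries: int = 3) -> List[str]:
    """Run provisioning steps in order, retrying each; roll back on failure.

    If a step keeps failing after its retries, the undo actions of the steps
    that already completed are executed in reverse order, and the original
    error is re-raised so the caller can report the failed onboarding.
    """
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, do, undo in steps:
        for attempt in range(retries):
            try:
                do()
                completed.append((name, undo))
                break
            except Exception:
                if attempt == retries - 1:
                    # Compensate completed steps in reverse order.
                    for _, compensate in reversed(completed):
                        compensate()
                    raise
    return [name for name, _ in completed]
```

In the real platform each "do" was a call to a different service (identity provider, databases, storage), which is precisely why a central orchestrator, rather than ad-hoc calls scattered across services, was needed.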
The Administration Portal
The last missing piece of the puzzle was the Administration Portal, enabling users to manage their tenants, applications and resources, review the analytics, etc.
Despite the UI technology I chose being easy to use, it still took me some time to learn it, and several cycles to implement even the small changes required to the UI and the UX, and to make the portal usable.
Remember: no matter how easy a technology is, the development of a graphical interface is one of the most time-consuming tasks, and requires many iterations and a lot of testing to ensure that the user experience is good enough.
The Experimentation Trap
As I mentioned above, being granted a Startup Program subscription by Microsoft relieved me of the urgency of finding revenues to support the infrastructure, and this brought me to a situation where I took a lot of liberty in experimenting with new (at least for me) technologies and methodologies: it definitely benefited my personal knowledge, but it also distracted me from the development of the platform itself, and from the acquisition of customers.
When I decided to break up the monolithic architecture of the first service, I had to decide which technology to use to implement the new set of modules emerging from that choice, going through a long phase of Proofs-of-Concept and Proofs-of-Technology.
Given the trends of the industry, I decided to try Kubernetes and Docker, implementing the services as containers, but I soon realized that the learning curve was too steep, and the complexity of the technology too high, to implement it in a reasonable amount of time and maintain it in the long term.
Wanting to keep the idea of running containers through an orchestrator, I then tried to benefit from Microsoft’s experimental Project Tye, which offered a simple way to configure the cluster infrastructure and deploy the services: that worked for a while.
Since the overall set of components (databases, monitors, networks, secrets, API gateway, etc.) resided in Azure, and the scope of the cluster was to host only the services, I was immediately faced with the challenge of accessing external services from within the cluster (e.g. Azure Key Vault, Azure SQL Server, CosmosDB, etc.) and making the services reachable from the outside (e.g. Azure API Management, Azure Application Gateway, etc.): for this I tried to use the functions provided by the Dapr runtime, but without much success.
After a few attempts and iterations, leveraging the power of Infrastructure-as-Code (IaC), I ended up with a set of Terraform modules able to provision the services as Azure Apps and Azure Functions and to deploy them, delegating the complexity of resource balancing to Azure, while leaving open the possibility of resuming an orchestrator in the future. This choice brought benefits in speed, costs and simplicity of deployment, although it came after a few iterations, and after a lot of time spent experimenting with new technologies.
Experimenting With Code
Beyond the various changes and new implementations described above, the major mistake that I can identify was the choice of experimenting with new methodologies for writing the source code of the services, leading to a lot of time spent implementing more than one code framework to reduce the amount of code to write and to increase the speed of development.
As much as such a practice is beneficial in the long term, it can also be a trap in the short term, since the .NET runtimes release new features quite quickly: that meant I spent a considerable amount of time filling the gaps of missing features in the framework, just to use the new features of the runtimes.
One of the most disappointing examples of this situation was the implementation of a set of features to reduce the time spent logging operations and events, optimizing the performance of such calls, only to then receive those same provisions from version 7.0 of .NET.
The End of the Journey: The Shutdown
After a year of development, and after a year spent trying to find customers or partners (without success), I decided to shut down the platform and stop the development, to avoid ending up in the situation of having to pay for the infrastructure out of my own pocket, without any revenue in sight.
Some results were achieved, but not enough to justify the continuation of the project and paying the costs of the infrastructure myself, which without the support of the Startup Program would not have been even remotely affordable.
Along the way I learnt a lot about several aspects of the messaging domain, the underlying models and technologies, the challenges of implementing a Software-as-a-Service platform (as a developer), and the challenges of running a startup, although the failure of the project was disappointing nonetheless.
I don’t exclude the possibility of rebooting it later on, reusing the good parts (software libraries, applications, services) and the lessons learnt along the way, but for the moment I am just focusing on absorbing the various technical, architectural and business lessons, and on applying them to my current job.
The Lessons Learnt
They say that if you don’t learn from your mistakes, you are doomed to repeat them; I don’t want to repeat the same mistakes again, leaving room instead for a bunch of new mistakes to make.
As much as possible, I would like those of you who have read this far to benefit from the lessons I have learnt along this year:
- Avoid Scaling the Architecture Prematurely - The decision to make the system too complex (through microservice patterns) ahead of time was a mistake that led to a lot of time spent on architecture design, infrastructure management, dependency resolution and other secondary concerns, rather than on the business logic and the business value
- Best Practices Can Be Misleading (Sometimes) - Refactoring the existing components to implement the missing best practices and code patterns was a mistake that stole a lot of valuable time: the code was good enough, secure enough and performant enough to be productive (even after the re-architecture of the platform), and the changes were not required; furthermore, such best practices are opinionated and debated by the development community, making them not an absolute truth
- Not Focusing on the Business Value - The decision to focus on the technical aspects of the platform, rather than on its business aspects, was a mistake, distracting me from the delivery of the features and components that would have enabled the acquisition of customers
- Don’t Work Alone - One of the worst mistakes was the decision to work alone on the implementation, without any technical partner in the development: someone with whom I could have shared the workload, and received and provided a different perspective. Although one might be expert enough to cover several aspects of a system (e.g. infrastructure, analytics, messaging, etc.), the time to allocate to each of the items to implement and manage adds up and eats the available capacity of a single person