A Real-Life Experience of Excellent Teamwork in Incident Handling

How Project A helped kfzteile24 launch its new e-commerce platform and solve a critical issue

By Stephan Schulze

One of our ventures, kfzteile24, a Berlin-based multichannel retailer of car parts and accessories for hobbyists and professionals alike, is currently getting support from our Software Engineering team. Side by side with kfzteile24’s IT team, our experts work on the company’s product catalog. In our experience, a collaborative culture is crucial not only in day-to-day work but especially in situations where issues must be handled efficiently and effectively.

Accessibility and the allocation of specific roles are essential for a smoothly functioning cross-departmental team

For months, developers, Ops engineers, and product managers from Project A and the kfzteile24 IT team worked on a new e-commerce platform. Finally, we went live with the last puzzle piece of the new platform: the catalog. Thanks to the microservices architecture, the rest of the platform was already running in production, but we had no production experience with the product catalog yet. Nevertheless, the platform ran without any disruption for two weeks. Then, on a sunny Saturday in August, my phone rang. It was a colleague, the Tech Lead from kfzteile24.

“We have issues with the website!”

“We have issues with the site,” he said. The worst-case scenario had occurred: the site went down when someone tried to access the product catalog. But even on the weekend, we could rely on our team, on both sides. Within 10 minutes, a group of up to seven people from the Tech and DevOps teams of kfzteile24 and Project A was online. They quickly organized themselves and started working out ways to solve the issue.

After a few minutes, we discovered that the site was down because a Redis server had run out of disk space. We discussed the possible options and agreed on one: increasing the disk size and then redeploying the service. It seemed a trivial solution, since we just needed to adjust a Helm chart. We assumed that Redis would come back up, but it did not work as expected. A circular dependency caused the application to try to access the server before the disk size had been increased.

We had a running system back within 40 minutes.

In the end, we quickly set up a managed Redis instance on AWS, followed by a manual configuration hack in the Kubernetes cluster.

Communicate, participate, take over responsibility, distribute roles, get everyone on the same page, and find solutions as a team

Of course, there was some clean-up work after solving the issue. For example, we had to persist the manual changes as code in the infrastructure and hold a post-mortem meeting to share what we had learned about the causes of the incident and to discuss how to prevent similar situations in the future. But we also took some valuable lessons from it:

When the conditions shift, an aligned and flexible team that smoothly adapts to the situation is vital for the success of a project.

It was a great experience to see how team members from different departments and companies communicated openly. Everyone participated in finding a solution, people took responsibility, and, in the end, we solved the issue as a team. Besides a collaborative workplace culture, knowledge sharing is key to creating an environment of success: once we had addressed the problems, we shared what we had learned.