Approaching an Unknown Code Base

I recently got an email asking me how I approach an unknown code base. Here’s an answer.

It’s a common situation we face when we join a new project. We rarely do green field development and start from scratch. Instead, we are presented with an existing code base and need to find our way through it. For large code bases, this may be quite a big task. Where should we begin to look? Which parts of the code base are the most important?

Find someone to answer your questions

No matter how clear the code and how good the documentation, it is always beneficial to have someone who has worked on the same code base. They may or may not be still in the team or organization, but it’s always good to know someone to ask if everything else fails.

That does not mean that asking them should always be your last resort. If they’re still on the team, it’s usually a good idea to work closely with them for some time. Do some pair programming or discuss with them how you approach typical tasks.

In some cases, there is only one person who knows about the code you are going to work on, and that person may even have left the company. It may be hard to get hold of them, and they may not have time to help you much. Still, if what they left behind is an unintelligible mess of code with no documentation or other clues, it’s their fault and they have the obligation (moral, professional, sometimes even legal) to make up for it.

Get to know the problem domain

This one seems to be overlooked fairly often. In order to understand unknown code, you have to understand what the software does – or is supposed to do. In order to do that, you have to understand the problem the software solves.

You obviously don’t need to become a complete domain expert up front, but some basic knowledge will go a long way toward understanding what is happening in your code base. Playing around a bit with the software can give us a feeling for what it’s about.

Study the documentation

True, the cliche says developers don’t like to write documentation. Still, usually, someone did the work and has written down something.

User documentation may give you additional information about the problem domain, but there also should be some architecture docs giving you the big picture about how the software is structured. Ideally, someone has written a specific onboarding document for new team members.

Besides the obvious documents, there are other artifacts that can give us more insight into the software. For example, issue trackers can be a great source of information, e.g. if we can filter for features that have been implemented in the past.

Tests

A good suite of automated tests will show how parts of the code and the whole program are supposed to be used. System tests should cover typical use cases, and integration tests can show which components there are and how to use them. Unit tests should document how classes and functions work.

Writing tests can also be a good way to get to know the code base. If you think you know how some piece of code behaves but don’t see it documented anywhere, consider writing an automated test that verifies your assumption. That way you not only gain insight into the behavior, you also learn how to call the functionality. As a bonus, you have just contributed to the stability of the software.
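As a minimal sketch of such an assumption-checking test, in Python: `legacy_format_price` is a hypothetical stand-in for a function found in the code base, and the expected values are illustrative assumptions, not taken from any real project.

```python
# Hypothetical stand-in for a function discovered in the unknown code base.
# In a real project you would import it instead of defining it here.
def legacy_format_price(cents):
    euros, rest = divmod(abs(cents), 100)
    sign = "-" if cents < 0 else ""
    return f"{sign}{euros}.{rest:02d} EUR"


# Each test writes down one assumption about the behavior. If an assumption
# is wrong, the failing test tells us so, and we have learned something.
def test_formats_whole_and_fractional_part():
    assert legacy_format_price(1999) == "19.99 EUR"


def test_pads_single_digit_cents():
    assert legacy_format_price(5) == "0.05 EUR"


def test_negative_amounts_keep_their_sign():
    assert legacy_format_price(-250) == "-2.50 EUR"
```

A test runner such as pytest would pick up the `test_` functions by name; they can also simply be called directly.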

The unknown code base itself

With what we know from the documentation, our domain knowledge, and larger scale tests, we can have a look at the directories and source files. The former should reflect the software architecture, while some of the files should have class names that are related to domain concepts. That way, we should be able to spot the most important source files and packages.

Debugging can also help: Debug a manual run of the application or one of the system tests. Function names can give you a sense of their importance, so you know which calls to skip and where to dig deeper. Debugging a typical use case will usually bring you to the source files that are most important. Take your time to look around to get to know what happens where.

Other tools

We can use statistical profilers to get a feeling for which parts of the code base are most important, e.g. via flame graphs. Profile a few typical use cases from the set of system tests to get a good sample.
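Statistical profilers and flame graphs usually need external tools. As a rough stand-in, the following Python sketch uses the standard library's `cProfile`, which is a deterministic profiler rather than a statistical one, to rank functions by cumulative time. The helper `profile_hot_functions` and the workload functions are hypothetical names made up for illustration.

```python
import cProfile
import pstats


def profile_hot_functions(use_case, top=5):
    """Run `use_case` under cProfile; return (name, cumulative seconds) pairs."""
    profiler = cProfile.Profile()
    profiler.runcall(use_case)
    # get_stats_profile() requires Python 3.9 or later. Its mapping is keyed
    # by bare function name, so equally named functions would collide.
    profiles = pstats.Stats(profiler).get_stats_profile().func_profiles
    ranked = sorted(profiles.items(), key=lambda item: item[1].cumtime, reverse=True)
    return [(name, profile.cumtime) for name, profile in ranked[:top]]


# Hypothetical workload standing in for a typical use case.
def slow_step():
    return sum(i * i for i in range(200_000))


def fast_step():
    return 42


def typical_use_case():
    slow_step()
    fast_step()
```

Calling `profile_hot_functions(typical_use_case)` should list `slow_step` well ahead of `fast_step`, which is the kind of hint that tells us where to start reading.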

In addition, looking at the source code history can show us where the most work has been done. Those usually are the most interesting parts of the software for new developers, since we are likely to have to change them frequently as well. We can find those hot spots by treating our code like a crime scene.
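One simple way to approximate such hot spots is to count how often each file appears in the version history. The sketch below assumes a git repository and a `git` executable on the path; the function names `count_changes` and `git_hot_spots` are made up for this example.

```python
import subprocess
from collections import Counter


def count_changes(log_output):
    """Count how often each file path appears in the log output."""
    paths = (line.strip() for line in log_output.splitlines())
    return Counter(path for path in paths if path)


def git_hot_spots(repo_path=".", top=10):
    """Return the `top` most frequently changed files of a git repository."""
    # `--name-only` lists the files touched by each commit; the empty
    # `--pretty=format:` suppresses the commit headers themselves.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return count_changes(result.stdout).most_common(top)


# Example usage inside a repository:
#     for path, changes in git_hot_spots():
#         print(f"{changes:5d}  {path}")
```

Raw change counts are only a first approximation: they ignore file size and renames, so treat the result as a list of places to look at, not as a verdict.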

Conclusion

Dealing with unknown code can be difficult, and it is often hard to know where to start, but there are many different things we can do to get to know it better.

9 Comments


  1. One thing I find useful is to try and identify the core components of the software, sketch these up on a big whiteboard and then identify the relationships between components. Iterate on the whiteboard as you learn more. Drill down as necessary. Doing this in pairs if possible can be an advantage.

    1. Hi Jukka, thanks for your input.
      I’d say that depends on the size of the code base. The main function may only start a bootloader or an application framework, trigger tons of uninteresting configuration routines etc. before somewhere deep down the actual application gets started. I have seen projects where nobody has touched or even seen the main function for years because all the interesting stuff happens several function call layers away from it, maybe even in another thread. For smaller projects, starting at main can be a good point though.

  2. Another technique I find useful is to refactor with reckless abandon (i.e. scratch refactoring with no intention of keeping the changes). The very process of remodeling the code helps build an understanding of it. After some time you may feel confident enough to do real refactoring work.

    1. I like that. Refactor the code so it represents your understanding. Ask colleagues (if available) if that’s right or what I missed.

  3. A good tool is Visual Architect. You can import the code base and it generates a UML graph of the whole thing. It’s then easy to spot the spaghetti at first sight.

    1. That’s a possibility if the code base is not too big, yes. I once tried that with a code base of 2 million lines of code and about 4000 classes/files. Not a good idea 😉

  4. These are all good tips. Additionally, I find that running static analysis tools over the code and investigating the warnings and errors that they report can give you a good idea about the sort of problems you’re in for. You want to find out how the previous authors thought, and what kinds of bad patterns they were prone to.
