Introduction
Defect Analysis is a classic analysis technique to get insights into how buggy your system might be. One variant of Defect Analysis is to use an issue tracker (i.e., the system where you manage your issues) to compare the found and fixed bugs over time. With this analysis, you can find out if you
- have a spike in reported bugs (where you should investigate deeper to identify the primary sources of the bugs)
- are doing well fixing bugs (and therefore can be confident to release a stable version of your software)
- are keeping bugs at bay (to allow the implementation of new features in a more risk-free environment)
The main goal of this article is to implement this kind of analysis.
In this blog post / notebook, I’ll demonstrate a Defect Analysis technique using the issue tracker YouTrack and the de facto standard Data Science tool pandas – the Python Data Analysis Library. We want to analyze the relatively new software product Code With Me from JetBrains – a plugin that enables you to share your locally running project from your IDE with anyone. The issues for Code With Me are managed in YouTrack, hosted by JetBrains, and open to everyone. This allows us to use this issue tracker as our playground.
The second goal of this article is to write the analysis code as clearly as possible. I want to achieve this by using pure pandas functionality to get a readable analysis of the defects in a software system. In the analysis part, I’ll show the first five entries of the output in each step so that you can follow more easily. In fact, this article is an export of a Jupyter notebook. You can start, execute and even experiment with the notebook by yourself. To do this, please click here!
That said, be aware that your issue tracker might offer Defect Analysis out of the box (YouTrack does this AFAIK). Click through the reporting features of your system before you reinvent the wheel! Nevertheless, I’ll show you how you can do it on your own. You could reuse many steps from this analysis for other purposes like tracking the progress of work, too.
Please note: I’m not affiliated with JetBrains. JetBrains simply has an open issue management system where I can get some data in an easy way. You can do this kind of analysis with any modern issue tracker as well.
How to get the data
First, we have to retrieve the data we need from the issue tracker system. We can do this in various ways (direct database access, scraping the website, or using an API provided by the system). How this works depends on your tool.
In our case, we’re using YouTrack’s REST API to download some data. To do this, we create a URL to get the data we need. I’ve dug deep into the YouTrack REST API documentation and found what we need to export only bugs, together with their creation and resolved timestamps. We also set an upper limit of the latest 1000 entries so that we don’t overload the publicly available issue tracker.
'https://youtrack.jetbrains.com/api/issues? |
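As a rough sketch of how such a request URL can be assembled: the parameter names `fields`, `query`, and `$top` belong to YouTrack's REST API, while the helper name `build_issues_url` and the search query `Type: Bug` are hypothetical placeholders for illustration, not the values used for this article.

```python
from urllib.parse import urlencode

BASE_URL = "https://youtrack.jetbrains.com/api/issues"


def build_issues_url(query, fields=("created", "resolved"), top=1000):
    """Assemble a request URL for the issues endpoint (hypothetical helper)."""
    params = {
        "query": query,              # search query, e.g. to restrict to bugs
        "fields": ",".join(fields),  # only return the attributes we need
        "$top": top,                 # upper limit of returned entries
    }
    # keep "$" and "," readable instead of percent-encoding them
    return f"{BASE_URL}?{urlencode(params, safe='$,')}"


# "Type: Bug" is an assumed example query, not the one used in the article
url = build_issues_url("Type: Bug")
print(url)
```

Building the URL from a dictionary keeps each parameter visible and makes it easy to adjust the limit or the requested fields.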
In this analysis, we’re using the built-in read_json method of pandas. The method executes a request to YouTrack. YouTrack responds with data in the JSON format. The read_json method parses this data and creates a DataFrame where each returned field is put into a separate column.
| | resolved | created | $type |
---|---|---|---|
0 | NaN | 1612469976486 | Issue |
1 | NaN | 1612517192097 | Issue |
2 | NaN | 1612458988395 | Issue |
3 | NaN | 1612469718798 | Issue |
4 | 1.611839e+12 | 1611611056898 | Issue |
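A minimal sketch of the loading step, not necessarily the notebook's exact code. To keep it runnable offline, it parses a small stand-in JSON payload built from the five rows shown above instead of calling YouTrack; the exact `resolved` value is an assumption reconstructed from the timestamp that appears later in the article.

```python
from io import StringIO

import pandas as pd

# Stand-in for the JSON that YouTrack returns (values taken from the table above)
sample_json = """[
  {"resolved": null,          "created": 1612469976486, "$type": "Issue"},
  {"resolved": null,          "created": 1612517192097, "$type": "Issue"},
  {"resolved": null,          "created": 1612458988395, "$type": "Issue"},
  {"resolved": null,          "created": 1612469718798, "$type": "Issue"},
  {"resolved": 1611838527912, "created": 1611611056898, "$type": "Issue"}
]"""

# read_json also accepts a URL string instead of a buffer
df = pd.read_json(StringIO(sample_json))
print(df.head())
```

Each key of the JSON objects becomes a column, and missing `resolved` values show up as NaN.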
With this DataFrame (stored in the variable df), we can dig into a first analysis of the development of opened bugs (the column created: the time a bug was created) and closed bugs (the column resolved: the time a bugfix was confirmed). First, we throw away the $type column because we don’t need this information for our analysis.
| | resolved | created |
---|---|---|
0 | NaN | 1612469976486 |
1 | NaN | 1612517192097 |
2 | NaN | 1612458988395 |
3 | NaN | 1612469718798 |
4 | 1.611839e+12 | 1611611056898 |
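Dropping the column could look like the following sketch. The two sample rows are an illustrative subset of the data shown above.

```python
import pandas as pd

# Two illustrative rows in the shape returned by the previous step
df = pd.DataFrame({
    "resolved": [float("nan"), 1611838527912],
    "created": [1612469976486, 1611611056898],
    "$type": ["Issue", "Issue"],
})

# Remove the column we don't need; drop() returns a new DataFrame
df = df.drop(columns=["$type"])
print(df)
```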
Analysis
OK, let’s take a look at the development of the reported and fixed bugs. One important question we would like to answer is whether we are getting behind with bug fixing. For this, we calculate the creation and resolution of bugs over time. That means that we first have to convert the data in the columns created and resolved into a format we can do time calculations with. For this, we’re using the to_datetime method to convert the numbers in these columns into the Timestamp data type. Because the retrieved data stores timestamps as milliseconds since the epoch (1970-01-01 00:00:00), we have to provide this information via the additional parameter unit.
| | resolved | created |
---|---|---|
0 | NaT | 2021-02-04 20:19:36.486 |
1 | NaT | 2021-02-05 09:26:32.097 |
2 | NaT | 2021-02-04 17:16:28.395 |
3 | NaT | 2021-02-04 20:15:18.798 |
4 | 2021-01-28 12:55:27.912 | 2021-01-25 21:44:16.898 |
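The conversion can be sketched like this, assuming the millisecond values from the earlier tables; converting column by column is one option among several.

```python
import pandas as pd

# Two illustrative rows with epoch timestamps in milliseconds
df = pd.DataFrame({
    "resolved": [float("nan"), 1611838527912],
    "created": [1612469976486, 1611611056898],
})

# unit="ms" tells pandas that the numbers are milliseconds since the epoch;
# missing values (NaN) become NaT ("not a time")
for column in ["created", "resolved"]:
    df[column] = pd.to_datetime(df[column], unit="ms")

print(df)
```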
Next, we need a model that shows us the progress of the created and fixed bugs over time. For this, we have to list all events on the timeline in one column while keeping the information about whether an event was a creation or a resolution.
We can use the stack() method on the DataFrame to reorganize the multidimensional data into one continuous list of data. In other words, we convert from wide to long format and get a flat list of the creation and resolution events of all bugs.
0  created    2021-02-04 20:19:36.486
1  created    2021-02-05 09:26:32.097
2  created    2021-02-04 17:16:28.395
3  created    2021-02-04 20:15:18.798
4  resolved   2021-01-28 12:55:27.912
dtype: datetime64[ns]
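A small sketch of the wide-to-long step on two sample rows. The explicit `dropna()` is an addition of this sketch: depending on the pandas version, `stack()` either drops missing values itself or keeps them, so dropping them explicitly gives the same result in both cases.

```python
import pandas as pd

# One unresolved and one resolved bug
df = pd.DataFrame({
    "resolved": pd.to_datetime([None, "2021-01-28 12:55:27.912"]),
    "created": pd.to_datetime(["2021-02-04 20:19:36.486", "2021-01-25 21:44:16.898"]),
})

# Wide to long: one entry per (row number, column name) pair;
# unresolved bugs have no "resolved" event, so we drop those entries
events = df.stack().dropna()
print(events)
```

The unresolved bug contributes only a `created` entry, while the resolved bug contributes both a `resolved` and a `created` entry.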
This gives us a Series with a MultiIndex (i.e., the first two “columns”: the index number and the bug’s status) and the timestamp information (as datetime64 data type) as its values. So the bug status is part of the index, not a normal value. Because we want to work with the status later on, we move this index level (level 1, the inner one) into a regular column with reset_index(). The resulting DataFrame names the former index level level_1 and the timestamp column 0.
| | level_1 | 0 |
---|---|---|
0 | created | 2021-02-04 20:19:36.486 |
1 | created | 2021-02-05 09:26:32.097 |
2 | created | 2021-02-04 17:16:28.395 |
3 | created | 2021-02-04 20:15:18.798 |
4 | resolved | 2021-01-28 12:55:27.912 |
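One way to do this, sketched on a hand-built Series with the same shape as the stacked data. Passing `level=1` is an assumption that matches the output above, where the row number stays in the index.

```python
import pandas as pd

# Stacked events: (row number, status) -> timestamp
index = pd.MultiIndex.from_tuples([(3, "created"), (4, "resolved"), (4, "created")])
events = pd.Series(
    [
        pd.Timestamp("2021-02-04 20:15:18.798"),
        pd.Timestamp("2021-01-28 12:55:27.912"),
        pd.Timestamp("2021-01-25 21:44:16.898"),
    ],
    index=index,
)

# Move only the inner index level (the status) into a regular column;
# the unnamed level becomes "level_1", the unnamed values become column 0
events_df = events.reset_index(level=1)
print(events_df)
```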
Next, we need a value that we can count for each created or resolved event. We can do this by adding a new column that we fill with the number 1 for each entry. This allows us to calculate some metrics based on this number later (even if it might make no sense to you for now).
| | level_1 | 0 | count |
---|---|---|---|
0 | created | 2021-02-04 20:19:36.486 | 1 |
1 | created | 2021-02-05 09:26:32.097 | 1 |
2 | created | 2021-02-04 17:16:28.395 | 1 |
3 | created | 2021-02-04 20:15:18.798 | 1 |
4 | resolved | 2021-01-28 12:55:27.912 | 1 |
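Adding the counter column is a one-liner. This sketch uses two sample events with the column names from the table above.

```python
import pandas as pd

events = pd.DataFrame({
    "level_1": ["created", "resolved"],
    0: pd.to_datetime(["2021-02-04 20:15:18.798", "2021-01-28 12:55:27.912"]),
})

# Assigning a scalar broadcasts it to every row
events["count"] = 1
print(events)
```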
OK, now comes the tricky part. We have to reorder our table to get a side-by-side comparison of opened and closed bugs over time. We use the great pivot_table method on the DataFrame to achieve this. We set the column with the time information (the oddly named column 0) as our index. This gives us a time-based index and allows time series calculations. pivot_table also sorts the time information in ascending order. As columns, we use the values of level_1 (= status of a bug entry) to get a side-by-side view of created and resolved bugs. Last, we use our new number column count to mark the activities for each bug entry at each given point in time. As a small cosmetic correction, we rename the column heading from level_1 to defects and name the index timestamp.
defects | created | resolved |
---|---|---|
timestamp | ||
2020-06-01 16:46:48.516 | 1 | 0 |
2020-08-04 15:45:22.220 | 1 | 0 |
2020-08-27 10:35:40.693 | 1 | 0 |
2020-08-27 12:17:03.483 | 0 | 1 |
2020-08-27 14:29:18.302 | 1 | 0 |
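The pivot step could be sketched as follows, using the five events from the table above in shuffled order. Using `aggfunc="sum"` and `fill_value=0` is a choice of this sketch (so that simultaneous events add up and gaps become 0), not necessarily what the notebook does.

```python
import pandas as pd

events = pd.DataFrame({
    "level_1": ["resolved", "created", "created", "created", "created"],
    0: pd.to_datetime([
        "2020-08-27 12:17:03.483",
        "2020-06-01 16:46:48.516",
        "2020-08-27 10:35:40.693",
        "2020-08-04 15:45:22.220",
        "2020-08-27 14:29:18.302",
    ]),
    "count": 1,
})

timeline = (
    events
    # timestamps become the (sorted) index, statuses become the columns
    .pivot_table(index=0, columns="level_1", values="count",
                 aggfunc="sum", fill_value=0)
    # cosmetic: give the index and the column header readable names
    .rename_axis(index="timestamp", columns="defects")
)
print(timeline)
```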
Next, we want to create a trend of the created and fixed bugs. In other words, we need to track the accumulated number of created and fixed bugs for each of our timed entries. We can do this by using the cumsum() method on our DataFrame.
defects | created | resolved |
---|---|---|
timestamp | ||
2020-06-01 16:46:48.516 | 1 | 0 |
2020-08-04 15:45:22.220 | 2 | 0 |
2020-08-27 10:35:40.693 | 3 | 0 |
2020-08-27 12:17:03.483 | 3 | 1 |
2020-08-27 14:29:18.302 | 4 | 1 |
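Here is the accumulation as a small sketch, rebuilt from the five pivoted rows shown earlier so that the result can be compared with the table above.

```python
import pandas as pd

timeline = pd.DataFrame(
    {"created": [1, 1, 1, 0, 1], "resolved": [0, 0, 0, 1, 0]},
    index=pd.to_datetime([
        "2020-06-01 16:46:48.516",
        "2020-08-04 15:45:22.220",
        "2020-08-27 10:35:40.693",
        "2020-08-27 12:17:03.483",
        "2020-08-27 14:29:18.302",
    ]),
)

# Running total per column, following the time-sorted index
trend = timeline.cumsum()
print(trend)
```

Because the index is sorted by time, each row now holds the number of bugs created and resolved up to that moment.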
Last, we visualize this DataFrame to get a feeling for how good (or bad) we are at managing our bugs. We’re using the standard plot method of the DataFrame, which creates a line chart of our two columns: the number of created and resolved bugs, plotted over time.
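A minimal plotting sketch, assuming matplotlib is installed (pandas uses it as its default plotting backend). The sample values are the accumulated numbers from the table above, and the axis label is an addition of this sketch.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook

import pandas as pd

trend = pd.DataFrame(
    {"created": [1, 2, 3, 3, 4], "resolved": [0, 0, 0, 1, 1]},
    index=pd.to_datetime([
        "2020-06-01 16:46:48.516",
        "2020-08-04 15:45:22.220",
        "2020-08-27 10:35:40.693",
        "2020-08-27 12:17:03.483",
        "2020-08-27 14:29:18.302",
    ]),
)

# One line per column, with the time-based index on the x-axis
ax = trend.plot()
ax.set_ylabel("number of bugs")
```

The vertical gap between the two lines is the number of bugs that are still open at a given point in time.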
From this visualization, we can see that the team might be behind regarding fixing bugs. We also see that at the beginning of December 2020, there was a steep increase in bug fixing. It looks like someone cleaned up the issue tracker right before the official release, doesn’t it? :-)
Conclusion
This blog post guided you through a variant of Defect Analysis using pandas and data from the issue tracker YouTrack. The analysis gives you some hints on how you could approach analyzing data from your own issue tracker system (or even similar kinds of data).
I hope you could also see that this kind of analysis isn’t magic. Using appropriate standard tools like pandas helps you get answers to your specific questions in a straightforward way.
Many thanks to my colleagues Michael Krämer and Stefan Lauer for their comments on an earlier version of the article.
Header image from Ulrike Leone / Pixabay