ARTS (Azure Reliability Tagging System) is a hierarchical taxonomy of root cause tags used to label production incidents of Azure services. Labelling postmortems with these tags makes it possible to aggregate root cause categories and identify problem areas, trends, patterns, and risks that may lead to future incidents.
The ARTS taxonomy is comprehensive, covering the many different factors that contributed to thousands of production incidents in Azure, yet compact, containing only the root causes observed in real incidents. The taxonomy has been developed over the last few years and is still growing (slowly) as new root causes are discovered.
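As a rough illustration of how labelled postmortems can be aggregated over a hierarchical taxonomy, the sketch below rolls tag counts up to every ancestor category. The tag names, the "/" path separator, and the incident identifiers are hypothetical and chosen only for illustration; they are not actual ARTS tags or the format ARTS uses.

```python
from collections import Counter

# Hypothetical labelled postmortems: incident id -> list of tag paths.
# Tag names and the "/" separator are illustrative assumptions, not real ARTS tags.
LABELLED_POSTMORTEMS = {
    "incident-001": ["Code/Bug/NullDereference", "Detection/MissingMonitor"],
    "incident-002": ["Code/Bug/RaceCondition"],
    "incident-003": ["Config/BadValue", "Detection/MissingMonitor"],
    "incident-004": ["Code/Change/UntestedRollout"],
}


def ancestors(tag_path, sep="/"):
    """Return the tag itself plus every ancestor category, top level first."""
    parts = tag_path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def rollup_counts(labelled):
    """Count, for each category at every level, how many incidents touch it."""
    counts = Counter()
    for tags in labelled.values():
        # Use a set so an incident counts once per category,
        # even when several of its tags share an ancestor.
        touched = set()
        for tag in tags:
            touched.update(ancestors(tag))
        counts.update(touched)
    return counts


counts = rollup_counts(LABELLED_POSTMORTEMS)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Counting at every level of the hierarchy lets a reader compare broad problem areas first and then drill down into the specific root causes beneath them.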
Description of the ARTS Taxonomy:
One can manually label a postmortem with relevant ARTS tags. However, such manual labelling does not scale well and can be inconsistent and error-prone (see paper).
To address this, we developed automated ML-based techniques, called AutoARTS, that analyze the content of a postmortem in order to (1) recommend ARTS tags relevant to the postmortem, and (2) extract a small snippet of text (from the long postmortem) that captures the root causes explaining the recommended tags. The work is published at USENIX ATC'23. Email us at autoarts-rca-taxonomy at outlook dot com if you are interested in the details.