Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications in the Future Internet. The key to handling this complexity is to perform tasks such as network optimization and failure recovery with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and Artificial Intelligence (AI) techniques. However, ML/AI can only be as good as the data it is trained on. Datasets provided to the community as benchmarks for research purposes, which have a significant impact on research findings and directions, are assumed to be of good quality by default, but often they are not.

In the context of autonomous networks, assuring the quality of the data gathered by the network telemetry framework (RFC 9232) is a major need. Autonomous networks require an autonomous data generation mechanism, which should ensure that high-quality data is produced. As a result, data quality assessment must be embedded in the automatic generation of data, which calls for new approaches to assess data quality and to produce data that maximizes quality (data-quality-by-design). As of today, data is produced in a way that is not necessarily optimal for the application it is collected for. For example, the data used to estimate a traffic matrix (e.g., to optimize routing) can be gathered with different technologies, but these differ in their tradeoff between accuracy and resource consumption, and the choice can affect the result of the optimization. As another example, a dataset for anomaly detection and diagnosis/troubleshooting often needs to be labelled, and inaccuracies in this labelling can dramatically impact the performance of AI for anomaly detection.

