Automated curation of in-vivo recorded time series

Project leader:
Project members:
Start date: 1 January 2024
End date: 31 December 2026

Abstract

High-quality data is a fundamental requirement for training AI models effectively. While vast amounts of text and image data are readily available online, the growing complexity of AI systems demands well-curated datasets. Large models that integrate diverse data sources, such as voice and image data for robotics, require extensive effort and resources for data collection and processing. Recent findings on model scaling indicate that the best performance is reached by increasing model size significantly, data volume moderately, and the number of training cycles only slightly. However, data suitability, meaning a balanced distribution and reliable annotation, remains a major bottleneck.

Smaller models can achieve results comparable to those of large-scale systems if they are trained on high-quality data. As a result, AI developers increasingly prioritize data quality over sheer quantity. Yet manual curation is costly and labor-intensive. Additionally, emerging AI regulations, such as the European Union’s AI Act, emphasize the importance of data quality for ensuring compliance and mitigating liability risks.

Our project addresses these challenges by integrating domain-specific and contextual knowledge into the data curation process, improving training efficiency while reducing costs. We focus on automated methods that enhance low-quality data so that it becomes suitable for AI training. Our use case is a handwriting recognition system that uses pattern recognition to convert inertial sensor data from an electronic pen into computer-readable characters. Building on prior expertise in this field and an existing robust dataset, the project aims to optimize AI model training through more efficient and scalable data curation methods.

Partners

https://stabilodigital.com/digipen/