
Why is Data Evaluation Essential in the Age of Artificial Intelligence?

In the Context of AI Model Training:


  1. Why is data quality critical in model training?

The capabilities of AI models heavily depend on the quality of the data used for training. High-quality data can significantly enhance training efficiency, model performance, and model portability. In contrast, low-quality data not only wastes computational resources but also potentially harms the overall performance of the model. Therefore, focusing on data quality is crucial for achieving optimal training results.

  2. How does high-quality data improve training efficiency and model performance?

High-quality data improves training efficiency by reducing redundancy and misleading data, lowering computational costs, and enhancing model performance in both common and uncommon queries. Additionally, using high-quality data can result in smaller, more portable models that are equally performant and cost-effective to serve.

  3. How can high-quality data be effectively evaluated and selected at large scale?

Identifying and selecting high-quality data when dealing with petabytes of unlabeled data is a complex and costly process. Our data evaluation service addresses this challenge through automated optimization processes, including identifying the most informative data, determining redundancy levels, rebalancing datasets, targeted data augmentation, and efficient data ordering and batching. This approach not only ensures data security but also significantly enhances the efficiency and effectiveness of model training.
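Of the optimization steps above, dataset rebalancing is the simplest to illustrate concretely. The following is a minimal sketch, not the production service; the `rebalance` helper, class labels, and counts are all invented for illustration. It oversamples minority classes until every class matches the largest:

```python
import random
from collections import defaultdict

def rebalance(samples, labels, seed=0):
    """Oversample minority classes until each matches the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        # keep all originals, then draw the shortfall with replacement
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out

balanced = rebalance(["a1", "a2", "a3", "b1"], ["A", "A", "A", "B"])
# every class now has 3 examples
```

In practice the same pattern is applied per label (or per cluster for unlabeled data), and undersampling of dominant classes is an equally valid choice when data volume is the bottleneck.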



In the Context of Data Transactions:


  1. How do we match trading parties?

Data is personalized; a particular dataset may be highly valuable to one organization but not to another. Therefore, it is crucial to match the most suitable buyers and sellers to maximize the overall value of data transactions.

  2. How do we evaluate data value in a privacy-protected scenario?

Given the nature of data, its value can significantly diminish if leaked. Thus, it is essential to perform data value evaluation in a "data does not leave the premises" scenario to ensure privacy and maintain its value.



In the Context of Federated Learning:


Federated learning aims to train models collaboratively while protecting data privacy, with multiple participants contributing their data. The heterogeneity of participant data distribution impacts model convergence. In some trading scenarios, buyers seek a model trained by multiple participants, necessitating fair distribution of benefits based on each participant's contribution.

  1. How do we evaluate and select data without sharing any data samples?

While current methods can evaluate participant contributions to a reasonable degree, federated learning still lacks a mature method for comprehensively understanding the contribution of each participant's data. This limitation hampers transparency and weakens the persuasiveness of the evaluation. Developing such methods would help servers select the most valuable data for model training.

  2. How do we predict and evaluate participant contributions without model training?

Previous methods either trained federated models and measured validation performance, or used generative models to learn the data distribution. To reduce computational cost and dependence on validation data, the key challenge is to develop evaluation methods that do not require training the federated model at all. This capability is crucial in federated learning environments where model training is complex and expensive.

  3. How do we evaluate data contributors in large-scale environments?

As the number of participants increases, the complexity of the Shapley value (SV) method grows exponentially. A lower-complexity evaluation method that does not require data access is therefore vital for large-scale federated learning environments, especially those with more than 100 participants.
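To see why the exact Shapley value scales exponentially: it averages a participant's marginal contribution over every subset of the remaining participants, so the number of utility evaluations doubles with each participant added. A toy sketch follows; the additive `utility` function and participant names are invented stand-ins for real model-performance scores:

```python
from itertools import combinations
from math import comb

def exact_shapley(players, utility):
    """Exact Shapley values via full subset enumeration: O(2^n) utility calls."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                # standard Shapley weight |S|!(n-1-|S|)!/n! = 1/(n * C(n-1, |S|))
                weight = 1.0 / (n * comb(n - 1, k))
                total += weight * (utility(set(subset) | {p}) - utility(set(subset)))
        values[p] = total
    return values

# Toy additive utility: each participant adds a fixed amount of "performance".
contrib = {"A": 3.0, "B": 1.0, "C": 2.0}
sv = exact_shapley(list(contrib), lambda coalition: sum(contrib[p] for p in coalition))
# For an additive utility the Shapley value recovers each contribution exactly.
```

With 100 participants this enumeration would require on the order of 2^99 utility evaluations, which is why approximations with linear cost matter at scale.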


Technical Methods


  1. Using Wasserstein Distance to Measure Data Diversity

Whether two datasets are "similar" is determined by the difference between their distributions, and similar data tends to yield better models in federated learning. We therefore employ robust optimal transport methods to calculate the Wasserstein distance between distributions.
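For intuition, consider the simplest special case: in one dimension, the Wasserstein-1 distance between two equal-size samples reduces to the mean absolute difference of the sorted samples. This is a simplified sketch; real, high-dimensional datasets require full optimal-transport solvers, and the client samples below are invented:

```python
def w1_empirical(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    With equal sample counts the optimal transport plan simply matches
    the i-th smallest point of xs to the i-th smallest point of ys.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two clients with similar data versus one with a shifted distribution:
client_a = [0.1, 0.4, 0.9, 1.2]
client_b = [0.2, 0.5, 0.8, 1.1]
client_c = [5.1, 5.4, 5.9, 6.2]
assert w1_empirical(client_a, client_b) < w1_empirical(client_a, client_c)
```

The smaller the distance, the more "similar" the two datasets are in the sense used above.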

  2. Using the Triangle Inequality and Interpolation to Continuously Optimize Distance Estimation

Computing the Wasserstein distance directly requires data from both parties, which is infeasible when the data cannot be shared. To address this, we leverage the triangle inequality: the distance between two points can be estimated via a "third point" on the line segment between them. Here, that third point is an interpolating measure between the two datasets.

  • We generate one "global" interpolation and two "local" interpolations.

  • Only the global interpolation can be shared, while the local interpolations remain private.

  • When the data distributions of both parties and the three interpolations lie on the same line (i.e., on the Wasserstein geodesic), the triangle inequality holds with equality and the distance computation is complete.

By combining these techniques, we can effectively and securely measure data similarity and optimize model training in federated learning environments.
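In one dimension the scheme can be checked directly: the midpoint of the Wasserstein geodesic between two equal-size samples is obtained by averaging their sorted values, and for a point on the geodesic the triangle inequality holds with equality. This is a toy sketch; the party data is invented, and a single shared interpolation stands in for the global/local construction described above:

```python
def w1(xs, ys):
    """W1 distance between equal-size 1-D samples (sorted matching)."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def midpoint(xs, ys):
    """Midpoint of the 1-D W1 geodesic: average of sorted samples."""
    return [(a + b) / 2 for a, b in zip(sorted(xs), sorted(ys))]

party_a = [0.0, 1.0, 2.0]
party_b = [4.0, 5.0, 9.0]
g = midpoint(party_a, party_b)           # the shared "global" interpolation

direct = w1(party_a, party_b)            # needs both parties' raw data
via_g = w1(party_a, g) + w1(g, party_b)  # each party only needs g
assert abs(direct - via_g) < 1e-9        # equality: g lies on the geodesic
```

Each party computes its distance to the shared interpolation locally, so the raw samples never have to be exchanged.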


  3. Using Sensitivity Analysis in Linear Programming to Further Filter Noisy Data Points


Explanation: This value is interpreted as the contribution of a specific data point to the distance, since its sign determines the direction of change. If the value is positive (negative), transferring more probability mass to this data point increases (decreases) the distance between the local distribution and the interpolated measure, thereby increasing (decreasing) the distance between the local distribution and the target distribution.
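A finite-difference version illustrates the idea behind these sensitivities. The actual method reads them from the dual variables of the optimal-transport linear program; here, as a sketch, we simply perturb the probability mass of one point and observe the sign of the change in a 1-D weighted Wasserstein distance. The data is invented, with 10.0 playing the role of a noisy point:

```python
def weighted_w1(xs, ws, ys, vs):
    """W1 between two weighted 1-D discrete distributions via CDF difference."""
    ws = [w / sum(ws) for w in ws]
    vs = [v / sum(vs) for v in vs]
    pts = sorted(set(xs) | set(ys))
    fx = fy = total = 0.0
    prev = None
    for p in pts:
        if prev is not None:
            total += abs(fx - fy) * (p - prev)  # integrate |F_x - F_y|
        fx += sum(w for x, w in zip(xs, ws) if x == p)
        fy += sum(v for y, v in zip(ys, vs) if y == p)
        prev = p
    return total

def mass_sensitivity(i, xs, ws, ys, vs, eps=1e-6):
    """Change in distance when point i receives slightly more probability mass."""
    ws2 = list(ws)
    ws2[i] += eps  # weights are renormalized inside weighted_w1
    return (weighted_w1(xs, ws2, ys, vs) - weighted_w1(xs, ws, ys, vs)) / eps

local = [0.0, 1.0, 2.0, 10.0]   # 10.0 is a suspected noisy point
target = [0.0, 1.0, 2.0]
w = [1.0] * 4
assert mass_sensitivity(3, local, w, target, [1.0] * 3) > 0  # noise: raises distance
assert mass_sensitivity(0, local, w, target, [1.0] * 3) < 0  # inlier: lowers distance
```

Points whose sensitivity is positive pull the local distribution away from the target, which is exactly the signal used to flag and filter noisy data.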


Technical Effectiveness


1) Our evaluation method is currently the fastest in industry and academia (its complexity grows linearly with the number of participants being evaluated), making it suitable for large-scale scenarios.


2) Our evaluation method can accurately assess the contribution of each participant in federated-learning data-fitting scenarios.


3) Our evaluation method achieves a 100% recognition rate for noisy data.


4) Data privacy is protected: the original data is never exposed, and the only information made visible is effectively noise.


5) Our data evaluation tool can be applied to all types of data. Use cases include image classification models, traditional tabular-data classification models, language-model pre-training and inference, multimodal (image + text) inference, graph-structured data classification, and recommendation systems.

 
 
 


