Data privacy comes at a cost. Security techniques that protect sensitive user information, such as client addresses, from attackers who might try to extract it from AI models often make those models less accurate.
MIT researchers recently introduced a framework, based on a new privacy metric called PAC Privacy, that can preserve an AI model’s performance while keeping sensitive data, such as medical images or financial records, safe from attackers. Now, they have taken this work a step further by making the technique more computationally efficient, improving the tradeoff between accuracy and privacy, and creating a formal template that can be used to privatize virtually any algorithm without needing access to its inner workings.
The team applied their new version of PAC Privacy to privatize several classic algorithms for data analysis and machine-learning tasks.
They also showed that more “stable” algorithms are easier to privatize with their method. A stable algorithm’s predictions remain consistent even when its training data are slightly modified. Greater stability helps an algorithm make more accurate predictions on previously unseen data.
The researchers say the improved efficiency of the new PAC Privacy variant, along with the four-step template for implementing it, will make the technique easier to deploy in real-world settings.
“We tend to think of robustness and privacy as unrelated to, or perhaps even in conflict with, building a high-performance algorithm. First we make a working algorithm, then we make it robust, and then we make it private. We’ve shown that this is not always the right framing. If you make your algorithm perform better in a variety of settings, you can essentially get privacy for free,” says Mayuri Sridhar, an MIT graduate student and lead author of a paper on this privacy framework.
She is joined on the paper by Hanshen Xiao PhD ’24, who will begin as an assistant professor at Purdue University in the fall, and senior author Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering at MIT. The research will be presented at the IEEE Symposium on Security and Privacy.
Estimating noise
To protect sensitive data that were used to train an AI model, engineers often add noise, or generic randomness, to the model, making it harder for an adversary to deduce the original training data. But this noise reduces the model’s accuracy, so the less noise one needs to add, the better.
PAC Privacy automatically determines the minimal amount of noise needed to ensure an algorithm achieves a desired level of privacy.
The original PAC Privacy algorithm runs a user’s AI model many times on different samples from a dataset. It measures the variance, as well as the correlations, among these many outputs and uses this information to estimate how much noise must be added to protect the data.
The new variant of PAC Privacy works the same way, but it does not need to represent the entire matrix of correlations across the outputs; it only needs the output variances.
“Because the thing you are estimating is much smaller than the entire covariance matrix, you can do it much faster,” Sridhar explains. This speedup makes it possible to scale the technique up to much larger datasets.
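The article does not include code, but the estimation step can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration in Python with NumPy, not the authors’ implementation: it reruns a vector-valued algorithm on random subsamples of a dataset and compares estimating the full covariance matrix (the original approach) with keeping only the per-coordinate variances (the faster variant). The function name, subsampling scheme, and trial count are assumptions made for the example, and the real noise calibration in PAC Privacy is more involved.

    import numpy as np

    def estimate_output_spread(algorithm, dataset, n_trials=100, seed=0):
        # Hypothetical sketch: rerun `algorithm` on random subsamples of
        # `dataset` and measure how its vector-valued output fluctuates.
        rng = np.random.default_rng(seed)
        outputs = []
        for _ in range(n_trials):
            # Resample half of the rows each trial (one simple choice).
            idx = rng.choice(len(dataset), size=len(dataset) // 2, replace=False)
            outputs.append(np.asarray(algorithm(dataset[idx])))
        outputs = np.stack(outputs)                  # shape: (n_trials, d)

        # Original PAC Privacy: estimate the full d-by-d covariance matrix.
        full_covariance = np.cov(outputs, rowvar=False)

        # Faster variant: keep only the d per-coordinate output variances.
        per_coordinate_variance = outputs.var(axis=0)
        return full_covariance, per_coordinate_variance

    # Toy usage: the "algorithm" is just the column-wise mean of a dataset.
    data = np.random.default_rng(1).normal(size=(500, 3))
    cov, variances = estimate_output_spread(lambda d: d.mean(axis=0), data)

Estimating d variances instead of a d-by-d covariance matrix is what allows the new variant to run far fewer trials and scale to larger outputs and datasets.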
Adding noise can hurt the utility of the results, so it is important to keep utility loss to a minimum. Because of its computational cost, the original PAC Privacy algorithm was limited to adding isotropic noise, which is spread uniformly in all directions. The new variant estimates anisotropic noise, which is tailored to the specific characteristics of the training data, so a user can add less noise overall to reach the same level of privacy, boosting the accuracy of the privatized algorithm.
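To make the isotropic-versus-anisotropic distinction concrete, here is a small, hypothetical sketch in the same spirit as the one above. It assumes the per-coordinate spread of the output has already been estimated and simply scales Gaussian noise accordingly; the actual calibration in PAC Privacy is more sophisticated than this.

    import numpy as np

    def privatize_output(output, per_coordinate_std, anisotropic=True, seed=None):
        # Illustrative only: add Gaussian noise to one algorithm output.
        rng = np.random.default_rng(seed)
        std = np.asarray(per_coordinate_std, dtype=float)
        if anisotropic:
            # Noise shaped to the output: larger in coordinates that already
            # vary a lot, smaller in coordinates that are stable.
            scale = std
        else:
            # Isotropic noise uses a single scale everywhere, so it must be
            # sized for the most variable coordinate and over-noises the rest.
            scale = np.full_like(std, std.max())
        return np.asarray(output, dtype=float) + rng.normal(0.0, scale)

Because the anisotropic version concentrates noise only where the output already varies, the total noise added, and hence the accuracy lost, can be smaller for the same privacy target.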
Privacy and stability
As she studied PAC Privacy, Sridhar hypothesized that more stable algorithms would be easier to privatize with this technique. She used the more efficient variant of PAC Privacy to test this idea on several classical algorithms.
Algorithms that are more stable have less variance in their outputs when their training data are slightly modified. PAC Privacy breaks a dataset into chunks, runs the algorithm on each chunk of data, and measures the variance among the outputs. The greater the variance, the more noise must be added to privatize the algorithm.
Employing stability techniques to decrease the variance in an algorithm’s outputs would also reduce the amount of noise needed to privatize it, she explains.
“In the best cases, we can get these win-win scenarios,” she adds.
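A toy comparison, not taken from the paper, shows why stability pays off: a stable statistic such as the mean barely moves when the data are resampled, while a sensitive one such as the maximum moves a great deal, and more movement means more noise is needed to privatize it.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=1_000)

    def output_spread(statistic, data, n_trials=200):
        # Rerun the statistic on random halves of the data and measure how
        # much its output moves around; more movement means more noise.
        outs = [statistic(rng.choice(data, size=len(data) // 2, replace=False))
                for _ in range(n_trials)]
        return float(np.std(outs))

    print("spread of mean:", output_spread(np.mean, data))   # small
    print("spread of max :", output_spread(np.max, data))    # much larger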
The team showed that the privacy guarantees remained strong regardless of the algorithm they tested, and that the new variant of PAC Privacy required an order of magnitude fewer trials to estimate the noise. They also tested the method in attack simulations, demonstrating that its privacy guarantees could withstand state-of-the-art attacks.
“We want to explore how algorithms could be co-designed with PAC Privacy, so they are more stable, secure, and robust from the outset,” Devadas says. The researchers also want to test their technique with more complex algorithms and further explore the privacy-utility tradeoff.
“The question now is: When do these win-win situations happen, and how can we make them happen more often?” Sridhar says.
“I think the key advantage PAC Privacy has in this setting over other privacy definitions is that it is a black box: you don’t need to manually analyze each individual query to privatize the results. It can be done completely automatically. We are actively building a PAC-enabled database by extending existing SQL engines to support practical, automated, and efficient private data analytics,” says Xiangyao Yu, an assistant professor in the computer sciences department at the University of Wisconsin at Madison, who was not involved with this research.
This study is partially funded by Cisco Systems, Capital One, the U.S. Department of Defense, and a MathWorks Fellowship.