Privacy is one of those nebulous ideas that everyone loves. Delivering it, though, is a job that’s full of nuance and tradeoffs. Turn the dial too far to one side and the databases are useless. Turn it too far in the other direction and everyone is upset about your plan to install camera arrays in their shower to automatically reorder soap.
The good news is that there is a dial to turn. In the early days, everyone assumed that there was just a switch. One position delivered all of the wonderful magic of email, online ordering, and smartphones. The other position was the cash-only world of living off the grid in a cabin wearing an aluminum foil hat.
Privacy-enhancing technologies let you control how much privacy to support but limit that control to preserve functionality. They mix encryption functions with clever algorithms to build databases that can answer some questions correctly, but only for the right people.
In my book, Translucent Databases, I explored building a babysitter scheduling service that could let parents book babysitters without storing personal information in the central database. The parents and babysitters could get the correct answer from the database, but any attacker or insider with root privileges would get only scrambled noise.
The field has grown dramatically over the years and there are now a number of approaches and strategies that do a good job of protecting many facets of our personal lives. They store just enough information for businesses to deliver products while avoiding some of the obvious dangers that can appear if hackers or insiders gain access.
The approaches all have their limits. They will defend against the most general attacks, but some start to crumble if the attackers are better equipped or the attacks are more targeted. Often the amount of protection is proportional to the amount of computational power required for the encryption calculations. Basic protections may not add noticeable extra load to the system, but providing perfect security may be out of reach for even the cloud companies.
But these limits shouldn’t stop us from adding the basic protections. The perfectly secure approach may not be out there, but adding some of these simpler solutions can protect everyone against some of the worst attacks that can be enabled by the new cloud services.
Here are nine strategies for balancing privacy with functionality.
Use the features
The cloud providers understand that customers are nervous about security, and they've slowly added features that make it easier to lock up your data. Amazon, for instance, offers more than two dozen products that help add security. AWS Firewall Manager helps make sure the firewalls let in only the right packets. Amazon Macie scans your data for sensitive information that's too exposed. Google Cloud and Microsoft Azure have their own collections of security tools. Understanding all of these products may take a team, but it's the best place to start securing your cloud work.
Watch the secrets
Securing the passwords, encryption keys, and authentication parameters is hard enough when we're just locking down our desktops. It's much trickier with cloud machines, especially when they're managed by a team. A variety of tools are designed to help. You've still got to be careful with source code management, but the tools will help juggle the secrets so they can be added to the cloud machines safely. Tools like HashiCorp's Vault, Doppler's Enclave, AWS's Key Management Service, and Okta's API management tools are just some of the options that simplify the process. All still require some care, but they are better than writing down passwords in a little notebook and locking it in someone's office.
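Whatever tool you pick, the application usually ends up reading its secrets at run time instead of carrying them in source code. Here is a minimal sketch, assuming the secrets manager injects values into environment variables at deploy time. That delivery mechanism is an assumption, and `get_secret` and `DB_PASSWORD` are hypothetical names used only for illustration.

```python
import os


def get_secret(name: str) -> str:
    """Fetch a secret from the process environment, failing loudly if absent.

    Assumption: a secrets manager has placed the value in an environment
    variable before the application started.
    """
    value = os.environ.get(name)
    if not value:
        # Refuse to start with a blank credential instead of limping along.
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

A caller would write something like `password = get_secret("DB_PASSWORD")`, so the value never appears in the repository and a missing secret stops the program at startup.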
Consider dedicated hardware
It's hard to know how paranoid to be about sharing computer hardware with others. It may seem far-fetched that an attacker could finagle a way onto the same physical machine and then exploit an exotic hardware attack like Rowhammer, but some data might be worth the hard work. The cloud companies offer dedicated hardware just for occasions like this. If your computing load is fairly constant, it may even make economic sense to use local servers in your own building. Some embrace the cloud company's hybrid tools and others want to set up their own machines. In any case, taking complete control of a computer is more expensive than sharing, but it rules out many attacks.
Hashing
One of the simplest solutions is to use a one-way function to hide personal information. These mathematical functions are designed to be easy to compute but practically impossible to reverse. If you replace someone's name with f(name), someone browsing the database will see only the random-looking noise that comes out of the one-way function.
This data may be inscrutable to casual browsers, but it can still be useful. If you want to search for Bob's records, you can compute f(Bob) and use this scrambled value in your query.
This approach is secure against casual browsers who may find an interesting row in a database and try to unscramble the value of f(name). It won't stop targeted browsing by attackers who know they are looking for Bob, because they can compute f(Bob) themselves. More sophisticated approaches can add more layers of protection.
The most common one-way functions may be the Secure Hash Algorithm family, or SHA, a collection of functions approved by the US National Institute of Standards and Technology. There are several versions, and weaknesses have been found in the earlier ones such as SHA-1, so make sure you use a newer one such as SHA-2 or SHA-3.
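To make the f(name) idea concrete, here is a minimal sketch using SHA-256 from Python's standard library and an in-memory SQLite table. The `bookings` table, the `name_hash` column, and the lowercase normalization step are assumptions made for illustration, not details from any real system.

```python
import hashlib
import sqlite3


def f(name: str) -> str:
    """One-way function: SHA-256 of the normalized name, hex encoded."""
    normalized = name.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway table that stores f(name) instead of the name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (name_hash TEXT, slot TEXT)")
    rows = [(f("Bob"), "Friday 7pm"), (f("Alice"), "Saturday 6pm")]
    conn.executemany("INSERT INTO bookings VALUES (?, ?)", rows)
    return conn


def find_slots(conn: sqlite3.Connection, name: str) -> list:
    """Look up a person's rows by hashing the name before querying."""
    cur = conn.execute(
        "SELECT slot FROM bookings WHERE name_hash = ?", (f(name),)
    )
    return [row[0] for row in cur.fetchall()]
```

Calling `find_slots(build_demo_db(), "Bob")` finds Bob's row even though the table never holds the string "Bob". Anyone who can guess a name can run the same hash, so one possible extra layer is a keyed hash such as the standard library's `hmac` module with a secret key kept outside the database.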
Pure encryption
Good encryption functions are built into many layers of the operating system and file system. Activating them is a good way to add some basic security against low-level attackers and people who might gain physical access to your device. If you’re storing data on your laptop, keeping it encrypted saves some of the worry if you lose the machine.
Regular encryption functions, though, are not one-way. There’s a way to unscramble the data. Choosing regular encryption is often unavoidable because you’re planning on using the data, but it leaves another pathway for the attackers. If you can apply the right key to unscramble the data, they can find a copy of that key and deploy it too. Make sure you read the section above about guarding secrets.
Fake data
While some complain about “fake news” corrupting the world, fake data has the potential to protect us. Instead of opening up the real data set to partners or insiders who need to use it for projects like AI training or planning, some developers are creating fake versions of the data that have many of the same statistical properties.
RTI, for instance, created a fake version of the US Census complete with more than 110 million households holding more than 300 million people. There’s no personal information of real Americans but the 300 million fake people are more or less in the same parts of the country and their personal details are pretty close to the real information. Researchers predicting the path of infectious diseases were able to study the US without access to real personal data.
An AI company, Hazy, is delivering a Python-based tool that will run inside secure data centers and produce synthetic versions of your data that you can share more freely.
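The core idea can be sketched in a few lines: measure some statistics of the real column, then draw new values from a distribution with the same statistics. This is a deliberately simple illustration that fits a single normal distribution to one column. It is not how RTI or Hazy build their data sets, and the function name and the 0 to 110 age bounds are assumptions.

```python
import random
import statistics


def synthesize_ages(real_ages, count, seed=None):
    """Draw synthetic ages from a normal distribution fitted to real ones."""
    rng = random.Random(seed)
    mu = statistics.mean(real_ages)
    sigma = statistics.stdev(real_ages)
    fake = []
    for _ in range(count):
        age = round(rng.gauss(mu, sigma))
        # Clamp to a plausible human range (assumed bounds, not from data).
        fake.append(min(max(age, 0), 110))
    return fake
```

The synthetic column has roughly the same mean and spread as the original, but no value corresponds to a real person. A production tool would also need to preserve relationships between columns, which this sketch ignores.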
Differential privacy
The term describes a general approach to adding just enough noise to the data to protect the private information in the data set while still leaving enough information to be useful. Adding or subtracting a few years to everyone's age at random, for instance, will hide the exact birth years of the people, but the average will barely change because the random shifts largely cancel out across a large group.
The approach is most useful for larger statistical work that studies groups in aggregate. The individual entries may be corrupted by noise, but the overall results are still accurate.
Microsoft has started sharing White Noise, an open source tool built with Rust and Python, for adding a finely tuned amount of noise to your SQL queries.
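One common way to make the noise-adding idea precise is the Laplace mechanism from differential privacy, which adds noise to the answer of an aggregate query instead of to each row. The sketch below uses only the Python standard library and does not use the White Noise API. The function names, the age bounds, and the assumption that the number of records is public are all illustrative choices.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Return the mean of `values` plus Laplace noise calibrated to epsilon.

    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    # Clamp so that one record can only move the mean by a bounded amount.
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / len(clamped)
    # Changing one record shifts the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

With a thousand records, ages bounded between 0 and 100, and `epsilon=1.0`, the noise scale is 0.1 years, so the reported average stays close to the true one while any single person's contribution is masked.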
Homomorphic encryption
Most encryption algorithms scramble the data so completely that no one can make any sense of the results without the proper key. Homomorphic approaches use a more sophisticated framework so that many basic arithmetic operations can be done on the encrypted data without the key. You can add or multiply without knowing the underlying information itself.
The simplest schemes are practical but limited. Chapter 14 of Translucent Databases describes simple accounting tools that can, for instance, support addition but not multiplication. More complete solutions can compute more arbitrary functions, but only at a much higher computational cost.
IBM is now sharing an open source toolkit for embedding homomorphic encryption in iOS and macOS applications with the promise that versions for Linux and Android will be coming soon. The tools are preliminary, but they offer the ability to explore calculations as complicated as training a machine learning model without access to the unencrypted data.
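To show what "addition but not multiplication" can look like, here is a toy sketch of the Paillier cryptosystem, a well-known additively homomorphic scheme. It is not the scheme from Translucent Databases and not IBM's toolkit. The function names are illustrative, the code requires Python 3.8 or later for modular inverses via `pow`, and the primes used in the example are far too small for real security.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two primes (demo sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x-1)//n.
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m: int) -> int:
    """Encrypt 0 <= m < n with fresh randomness r coprime to n."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c: int) -> int:
    """Recover the plaintext using the private values lam and mu."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub, c1: int, c2: int) -> int:
    """Multiply ciphertexts, which adds the hidden plaintexts modulo n."""
    n, _ = pub
    return (c1 * c2) % (n * n)
```

With keys from `keygen(104729, 1299709)`, a server holding only the public key can combine `encrypt(pub, 1200)` and `encrypt(pub, 345)` with `add_encrypted`, and the key holder will decrypt the result to 1545. The server never sees either amount, and the scheme offers no comparable way to multiply two hidden values.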
Keep nothing
Programmers may be packrats who keep data around in case it can be useful for debugging later. One of the simplest solutions is to design your algorithms to be as stateless and log-free as possible. Once the debugging is done, quit filling up the disk drives with lots of information. Just return the results and stop.
Keeping as little information as possible has dangers. It’s harder to detect abuse or fix errors. But on the flip side, you don’t need to worry about attackers gaining access to this digital flotsam and jetsam. They can’t attack anyone’s personal data if it doesn’t exist.
Copyright © 2020 IDG Communications, Inc.