If you handle data, chances are you have run into this situation: after creating a masterpiece visualization or an innovative report, you need to share it outside your organization, say for:
- presenting at a conference
- obtaining vendor technical assistance
- or just to brag!
And then you realize: oh no, my underlying data set is confidential. I can’t share customers’ PII, and I can’t risk letting this performance data fall into the wrong hands…
That situation is actually more common than I believed until recently, when I found out there is a whole industry addressing the need, complete with its own Gartner Magic Quadrant! That industry is called Data Masking, and it offers data obfuscation solutions aplenty, which most likely cost absurd amounts to license and take months to deploy.
What can you do without tons of resources? What if you have only one hour and need something quick and dirty, yet clean enough to not get you into trouble?
You could go the old Superstore route: rebuild your model on the sample data set that ships with Tableau, provided the retail data fits your use case well enough and you can stretch that hour.
Another approach favored by Tableau users is good old Excel fudging: manually overwriting each cell. I would not recommend it, though, because it is unreliable, tedious, not refreshable, and it breaks the underlying congruence of your data set. If you don’t keep a consistent masked value for each customer across transactions, you lose the per-customer trends your visualization was meant to surface; the short sketch below illustrates why.
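To see why congruence matters, here is a minimal Python sketch (the customer IDs and amounts are invented for illustration). A deterministic hash maps the same customer to the same token every time, so per-customer aggregation still works after masking, whereas ad-hoc overwrites scatter one customer’s transactions across many fake names:

```python
import hashlib

# Hypothetical transactions: (customer, amount)
transactions = [
    ("alice@example.com", 120.0),
    ("bob@example.com", 75.5),
    ("alice@example.com", 40.0),
]

def mask(customer_id: str) -> str:
    # Deterministic: the same input always yields the same token,
    # so "alice" remains a single customer after masking.
    return hashlib.md5(customer_id.encode("utf-8")).hexdigest()

totals = {}
for customer, amount in transactions:
    token = mask(customer)
    totals[token] = totals.get(token, 0.0) + amount

print(totals)  # two masked customers, per-customer totals intact
```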
If you have Alteryx, there is a better way that is easy, safe, quick, and… clean!
- Bring your data set into Alteryx using one of the many available connectors, such as the SFDC connector or a flat-file input
- Use a formula tool to mask the original data fields:
- For Key Figures / Indicators, use RAND(), found under the Math functions, multiplied by the current field
- For Master Data / Attributes, use MD5_UNICODE(), found under the String functions, applied to the source field. The MD5 cryptographic hash should be hard enough to reverse to protect your string values, while preserving the congruence of your attributes
- For Geo data, you can substitute a random location using ST_RandomPoint(object), found under the Spatial functions, applied to your spatial object
- Note that if you need to mask several fields of the same type, you can leverage the Multi-Field Formula Tool, found under Preparation. It will save you manual effort and make the workflow much easier to maintain. The syntax will be RAND()*[_CurrentField_] or MD5_UNICODE([_CurrentField_]). See the Python sketch after this list for the equivalent logic
- Add a Select Tool to your workflow to remove the original unmasked fields, then plug in an Output tool to write to a database, to a QVX file for Qlik, or, if you work with Tableau, to generate a TDE file free of sensitive data, or to load directly to a Tableau Server with The Information Lab upload macro
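If you want to prototype or sanity-check the masking logic outside Alteryx, here is a minimal Python sketch of the same three transformations. It is an approximation under stated assumptions, not the Alteryx implementation: the column names are hypothetical, random.random() stands in for RAND(), and the location is jittered within a simple bounding box rather than drawn from a true spatial object as ST_RandomPoint() would do:

```python
import hashlib
import random

# Hypothetical rows: (customer, city, lat, lon, revenue)
rows = [
    ("Acme Corp", "Paris", 48.85, 2.35, 10500.0),
    ("Globex", "London", 51.51, -0.13, 8200.0),
    ("Acme Corp", "Paris", 48.85, 2.35, 990.0),
]

def mask_measure(value: float) -> float:
    # Equivalent of RAND() * [_CurrentField_]: scale by a random factor
    # so magnitudes change but the column stays plausible.
    return random.random() * value

def mask_attribute(value: str) -> str:
    # Equivalent of MD5_UNICODE([_CurrentField_]): deterministic hash,
    # so the same customer always maps to the same token (congruence).
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def mask_location(lat: float, lon: float) -> tuple[float, float]:
    # Rough stand-in for ST_RandomPoint(object): jitter the point within
    # a small bounding box instead of a real spatial polygon.
    return lat + random.uniform(-0.5, 0.5), lon + random.uniform(-0.5, 0.5)

masked = [
    (mask_attribute(cust), mask_attribute(city),
     *mask_location(lat, lon), mask_measure(rev))
    for cust, city, lat, lon, rev in rows
]

for row in masked:
    print(row)
```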
That’s all there is to it! You can now mask data continuously, or on demand, with a simple click to kick off the workflow.
Note also that if you don’t have any data to start with, an awesome free tool can let you generate random data.
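If you prefer to roll your own instead, here is a minimal Python sketch that fabricates a small random data set from scratch (the file name, columns, and value ranges are all invented for the example):

```python
import csv
import random
import string

# Invented reference values for a fake sales data set.
CUSTOMERS = ["Cust_" + "".join(random.choices(string.ascii_uppercase, k=5))
             for _ in range(10)]
REGIONS = ["North", "South", "East", "West"]

with open("fake_sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer", "region", "units", "revenue"])
    for _ in range(100):
        units = random.randint(1, 50)
        writer.writerow([
            random.choice(CUSTOMERS),
            random.choice(REGIONS),
            units,
            round(units * random.uniform(5.0, 20.0), 2),
        ])
```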
May I add a suggestion for masking through encryption, which is safe, cheap, reversible, and uses strong encryption: http://www.shesekdatadynamics.com/ExcelAnonymizer.html
Disclaimer: I work for ShesekDataDynamics and am personally involved in developing this add-in.
You can certainly pitch your solution. As I wrote in the post, there is a whole industry catering to those masking needs. Nevertheless, my post offers a FREE and quick solution for Alteryx users…
MD5 is cracked so easily (https://hashkiller.co.uk/md5-decrypter.aspx), it may not be wise to use it for hashing personal info…
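A fair point: unsalted MD5 of low-entropy values like names or emails can often be reversed with lookup tables. A common mitigation is to add a secret salt and switch to SHA-256, sketched below in Python under the assumption that you can keep the salt out of the shared output (the salt string here is a placeholder, not a real secret):

```python
import hashlib

SECRET_SALT = "replace-with-a-long-random-secret"  # placeholder, keep private

def salted_mask(value: str) -> str:
    # Salting defeats precomputed hash lookup tables while staying
    # deterministic, so congruence across rows is preserved.
    return hashlib.sha256((SECRET_SALT + value).encode("utf-8")).hexdigest()

print(salted_mask("alice@example.com"))
```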
Great blog, your string and number functions serve my use case perfectly. Many thanks,
Tim
Data masking effectively addresses five key issues surrounding data: data breach, data loss, data hijacking, insecure data interfaces, and unauthorized data use by insiders in an organization.
It reduces the risks involved in using data in a cloud environment.