Why, For Now, I’ve Stopped Worrying About My Data In The Cloud

If You’re A Regular Joe Tech User We’re All Hopelessly Beholden To The Random Decisions Of Big Tech. On Consumer SaaS And Its (Underdiscussed) Backup Problem

Those who follow me online may have gleaned, at some point, that I’m something of a backup anorak (translation for Americans: nerd).

A typically vaguely bleak looking data center. An integral part of the world of big tech. Photo by Manuel Geissinger from Pexels

I was kindly introduced as such by the guy who’s basically the world’s foremost expert on backups (no, I kid you not, there really is such a guy).

His name is W. Curtis Preston. He lives somewhere in California, I believe. He’s literally written the book on backups (one forthcoming). And he even had me on his excellent podcast, Restore It All.

If you’d like to hear Curtis, Prasanna Malaiyandi, and yours truly discussing why backing up consumer SaaS products is so difficult, then check out our conversation from last year.

My other backup claims-to-fame?

I’ve created my own backup documentation and watched in amazement as it’s been forked on GitHub by somebody who cared enough about backups to find the repository.

Other geeks have gotten a good laugh out of what I described (then) as my best Ubuntu backup strategy to date (the link is to an episode of the Linux Game Cast). I’ve been tweeted by Ubuntu who affirmed that backups are, indeed, rather vital. Vendors have shipped me backup appliances to test out. It takes time and effort to achieve this kind of notoriety in the backup world. Trust me.

And you know what? It all came to nothing. At least on a personal level (my interest in backups has led to doing some work with vendors). Or at least to an awkward pause in my thoughts on backup.

Because if you’re worried about protecting your data in the cloud — and I still say that you should be — then your current options for adequately protecting your data are pretty darn limited and there’s no point pretending otherwise. You have, to choose from, a few mostly lousy options.

Because having only bad options is arguably better than having none at all (arguably; this doesn’t apply to beer, or so some argue), let me walk you through those anyway.

But first: is this actually a problem? Yes. Perhaps not one that threatens the planet with immediate extinction. But one which could nevertheless threaten to destroy your data through various means.

And why should you care? Here’s why. But first, let me handle the most common objection.

But … Isn’t The Cloud Backup!?

Nope.

Committing data to the cloud does not mean that you have backed it up.

It just means that you’ve taken data from a source that you own (your computer, your phone) and moved it to somebody else’s computer.

To be technical, it increasingly means that you’ve connected to some piece of software using a web browser and have used something like a keyboard and mouse to create data directly in the cloud.

It’s not written to your computer at all. Its original source is in a filesystem and database that you don’t control. Increasingly, that’s how we consumers are creating data. Zero local touchpoints. And not a thought given to how we can retain a copy for ourselves. A small point that matters lots (thanks for pointing this out, during our episode, go to Curtis Preston).

Equally, that simplistic representation that backup enthusiasts love to trot out (about the cloud being somebody else’s computer) is kind of inaccurate, at least as it’s commonly understood. Because it makes the situation seem way more relatable than it actually is.

The “computer” in this picture might actually be a huge bank of computers operating in a data center.

In fact, unless you’re dealing with a total minnow, it probably is. The era of on premises data storage is nearing its end. That new flashy scheduling software you signed up for probably builds on infrastructure that they rent from one of the major providers on today’s market: AWS, Azure (Microsoft), or Google Cloud Platform.

Even if you use a ton of different SaaS tools, there’s a high chance — nay a probability — that almost all the data you create in the cloud is being centralized in data storage operated by one of those companies.

Perhaps one or two of those startups is CTO’d by some cranky dude who insists on keeping a server in the basement. Bad news: he’ll probably be retired soon. A rep from one of the big cloud providers will eventually get to him and they’ll migrate. Welcome back to the cloud. Data from a bunch of other tools you use is probably on some other computer here.

Those data centers are controlled by some of the strictest security you can imagine. Increasingly, they’re even linked together by a monolithic worldwide network of underwater cables that actually constitute the majority of the internet (the technical term is the “backbone”).

Private concerns are (increasingly) laying trans-oceanic cable to move your data more efficiently between their main data centers and possibly also smaller data centers that are used to serve data more efficiently to your consumers (AWS CloudFront). Remember when famous dudes like Marconi were messing about with getting the first messages across the ocean? Now it’s the tech giants doing it. Less glamorous. But the interests are becoming far more concentrated.

Therefore, whether you’ve thought about this or not, if you use the cloud at any kind of scale, the bits and bytes that constitute your data are actually probably spread out all over the world (literally). They’re duplicated, replicated, and virtualized many times over.

Therefore, if you want a copy of your data, then I hate to break it to you, but you probably can’t rock up to one of Google’s data centers with a plug-in SSD and a smile. (You can use something called Google Takeout, although it’s kinda rubbish, says I: it limits how often you can download your own data and provides no means for automating the extraction process.) “Where’s the mainframe?” you may rail in protest. There is none. It’s in the cloud, you see.

Oh, and that computer?

There may be (nay, there almost certainly is) some load balancing going on so that your data might be pulled from different sources depending on what the current network conditions are like.

So your data isn’t really on any specific computer but rather sharded between a bunch of them and where you or anybody else gets it from is actually determined by (yet another) server.

Which is why even in a netherland world you couldn’t fish it out onto your drive. But ultimately — even if temporarily — it will exist on some piece of hardware that provides storage. At the end of the day, data has to be stored somewhere for it to exist. Even if there’s virtualization and a whole other bunch of complicated stuff taking place with it.

An NAS: the Synology DS 920+. Think of it like a data center that lives in your home. Photo: Author.

And the “person” operating that data center?

Larry from Google probably isn’t syncing your data onto his laptop at Google HQ and the laptop doesn’t have a nice bit of labelling on it that says “customer 102”. Rather, it might be a gigantic tech corporation worth billions of dollars. Provisioning that infrastructure is the shared concern of one team and not any one person. Which oddly equally means, of course, that nobody’s really responsible for all your “stuff.”

As A Consumer On The Cloud, You’re A Tiny Minnow In An Ocean Full Of Bigger Fish

The good news is that these big cloud companies who are increasingly taking over the world of tech know that if they randomly lose your data, then consumers are going to be annoyed and start talking about it on places like Twitter which in turn will drive away other users. They’ll probably start leaving. In droves.

And so they don’t want Daniel from Medium to write an angry post on Medium about how Google totally screwed up and lost all his wedding photos without any means of recovery thereby leaving him high and dry and wedding photo-less. Even though it’s way easier for that to happen than you might think.

More churn. Less revenue.

So Google institutes systems that will be good enough to prevent 99% of users from doing something like that through their own actions. Safeguards, essentially.

So that’s well and good.

But to pretend that major tech companies think about these things in any more empathetic terms would, I contend, be unrealistic.

Ultimately, they’re financial animals beholden to the interest of their shareholders and they don’t really give a crap about your wedding photos even if they mean the world to you. Even if the photographer got locked out of his cloud too and there’s literally nowhere else that they exist. You’re one of tens of millions. You’re not even a business much less an MSP. Or so I contend.

So what they do is get really good at ensuring that that probably won’t happen through a fault of their own (or a combination of faults).

They create elaborate systems for ensuring redundancy. And then make sure that you’re responsible for backup in a section of the fine print that they know you’ll probably never read. Alternatively you may get some basic backup functionalities. But they’re squarely in control of them.

Even the acronyms that represent the documentation you probably forget about and which may have spelled all this out manage to sound snooze-inducingly boring.

EULA. TOS. Who has time for any of this? Click next and get the stupid thing up and running already. Who cares where my data is going! It’s in the cloud! You don’t actually mean to suggest that a company as big as [major tech provider] isn’t backing up my stuff, do you?!

If you’d like to know more about the thing that folks commonly mistake for backup (redundancy), then look up the various types of RAID.

You can even build your own miniature data center by buying a device known as an NAS and buying a few TBs worth of storage from your local tech store. You’ll have redundancy in operation right beneath your dishwasher. How cool is that?

Your Google Drive probably actually looks something like this in real life. Although it may not actually be on a physical computer like this for very long. And even if it were, it would be on a virtualized layer of it. Confused yet? Photo by panumas nikhomkhai from Pexels

Oddly, by setting up your NAS, you may now actually have a better backup system in place than the one that you think you’re getting from your billion dollar cloud but actually aren’t. There’s one major difference, at least. You own it (at least the hardware). And with ownership comes control.

But what’s backup anyway?

Backup is typically a point-in-time copy of your data.

It’s created through one of several means (incremental, full, differential). But that’s already more detail than you need to know about backup. (If you want all the details, listen to the Restore It All Podcast. They have a few on how Hollywood does backup. Seriously, it’s interesting stuff).

Backup is designed (among many other uses) so that if you royally screw up by deleting your files and then emptying them from the bin, there’s still some way of retrieving those wedding photos.

Or if your whole filesystem gets locked down by ransomware.

Scoop up a clean copy and restore from there. At least that’s the idea. (True backup vs. backup-ish is another fine point to be debated).

But if redundancy is just a live copy of your data so that it’s always there, then all those safeguards go right out the window. And that’s why redundancy isn’t considered backup even if the two are often mistaken.

If data source A is a useless chunk of data that’s been encrypted by some malicious ransomware and it’s replicated immediately to data source B then data source B isn’t any better. You’ve just got two useless chunks of data on two different storage media rather than one. You’ve duplicated bad data. Good news for whoever sold you the disks.
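
If it helps to see that difference spelled out, here is a minimal Python sketch. Everything in it is invented for illustration: plain dictionaries stand in for storage, and the names `replicate`, `take_snapshot`, and `wedding.jpg` are hypothetical. It is not how any particular provider works, just the general idea of a live mirror versus a point-in-time copy.

```python
import copy

# Hypothetical stand-in for storage: a dictionary mapping file names to contents.
primary = {"wedding.jpg": "happy couple"}

def replicate(source: dict) -> dict:
    """Redundancy: return a live copy of whatever the source holds right now."""
    return copy.deepcopy(source)

# Backup: point-in-time copies that are kept, not overwritten.
snapshots = []

def take_snapshot(source: dict) -> None:
    """Store a frozen copy of the source as it looks at this moment."""
    snapshots.append(copy.deepcopy(source))

mirror = replicate(primary)
take_snapshot(primary)  # Monday's backup

# Tuesday: ransomware scrambles the primary...
primary["wedding.jpg"] = "x9$#encrypted!!"
# ...and replication faithfully copies the damage across.
mirror = replicate(primary)

# The mirror is now as useless as the primary. The snapshot is not.
restored = copy.deepcopy(snapshots[0])
```

After this runs, `mirror` holds the scrambled file while `restored` still holds Monday's clean one.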

You know what else you need for any backup system that’s not under your control to be worth its salt?

You need a data restore plan.

Somebody prepared to actually operate the restore part of the backup operation (common backup refrain: all backups need to be periodically tested for restorability).

And you need some kind of a service level agreement (SLA) that sets out data restorability objectives. These define both the maximum period of time allowed to get systems back in operation and the longest stretch of time, counted back from the failure, for which data can be lost.
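
Those two objectives usually go by the names recovery time objective (RTO) and recovery point objective (RPO). As a rough sketch of how they could be checked after an incident, assuming hypothetical helper names (`meets_rpo`, `meets_rto`) and made-up timestamps:

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data created since the last backup (and now lost) fits within the RPO."""
    return failure - last_backup <= rpo

def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """True if the system was back in operation within the RTO."""
    return restored - failure <= rto

# Illustrative timestamps only, not taken from any real incident.
last_backup = datetime(2021, 3, 1, 2, 0)   # nightly backup at 02:00
failure = datetime(2021, 3, 1, 20, 0)      # things break at 20:00
restored = datetime(2021, 3, 2, 1, 0)      # back up and running at 01:00 next day

rpo_ok = meets_rpo(last_backup, failure, rpo=timedelta(hours=24))  # 18 hours lost
rto_ok = meets_rto(failure, restored, rto=timedelta(hours=4))      # 5 hours down
```

With these numbers the 24-hour RPO is met (18 hours of data lost) but the 4-hour RTO is missed (5 hours of downtime).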

The next time you feel like prodding around some terms of service (TOS) agreements which you signed but have totally forgotten about then see what they say about backup.

Or just study for an AWS certification like I was doing last year.

When you do, you’ll learn quickly about the shared responsibility model.

AWS will give you infrastructure to work with. But it doesn’t take responsibility if you prodigiously screw up thereby destroying all your data contained therein (rm -rf / anyone?). You take responsibility for things in the cloud. They keep the cloud itself running. Mishaps happen more often than you might think. And it’s more problematic when users aren’t aware of what they need to be taking care of.

Most cloud providers will ensure redundancy and uptime.

They’ll make sure that your data is accessible.

But take a closer peek under the hood and you’ll see that most also conveniently shift the responsibility for backup off their shoulders and back onto yours (wait … are you the backup!?).

They assume that you do your own backups.

You do, right? Didn’t think so.

But you COULD. Just follow my instructions.

Option 1: Take Your Own SaaS Backups Even If That Means Writing Data Request Emails Every Few Weeks

A typical backup approach as implemented by a major online tech provider, Quora. Screenshot: publication date. Photo: Author.

Software as a Service may sound like something you heard about in the news in reference to a massive IPO but it’s actually something you probably use every day of your life.

Gmail?

Google Drive?

That cloud hosting thingamajig where you upload your invoices so that you don’t get an even more horrible bill from the taxman?

Yes, all the little tools we rely upon like that. And increasingly so.

But wait. Stop and think about it for a second.

Software you use. That’s running in the cloud. So … not on your computer.

You don’t need a program to use it like in the old days when you had to fiddle around with CDs.

That must mean that somebody else is provisioning the “stack” needed to make it accessible: the software, the operating system it runs on and finally the hardware (storage and computing) needed to make it run. Somebody else’s computer. Sound familiar?

It should. This is SaaS. And when you stop to think about how much SaaS you might already be relying on for essential things (functions that, if we were a business, we would surely be earmarking as “mission critical”), you might begin to understand the extent of the vulnerability we all face when we entrust all our data to cloud providers.

But what can go wrong, you might ask?

So the redundancy is probably going to protect you in an everyday sense of the word. But if you really run into the odd periodic disaster then you’re not going to be saved.

But what if you needed an actual backup? Like when:

  • Your primary data system gets corrupted by ransomware and you don’t have money to pay the ransom.
  • Your website gets infected by malware and you discover that backups weren’t included in the shared hosting plan you signed up for.
  • You accidentally overrode all the mechanisms designed to make sure that you keep your own data like the trash bin on Google Drive etc and then wonder what’s the “second backup” built into the system (“THERE IS NONE!?!?!”). You may be really surprised to learn that if you did that then there may be absolutely zero way back to your data. Doesn’t Google have a backup system? Probably. Are they going to initiate a custom restore just for you? Probably not. Sorry. You’re one guy. Of tens of millions. Welcome to the bottom of the tech totem pole.

You may also return from vacation to find that you’ve been locked out of your own Google account for a trivial reason (you logged in from different IPs and had to change password because the cloud service didn’t like this pattern of activity which it deemed suspicious; but you forgot to update the password on your phone and thus your email client was repeatedly checking for email using an outdated password. And now your cloud provider thinks somebody’s trying to hack into your data but you don’t actually know who they are and they won’t tell you…). Think that doesn’t happen? It just did to me.

So what you could do is take your own cloud backups.

We need to cover every piece of data that we commit to the cloud and don’t keep in some primary storage system that we’re already backing up.

Think that sounds easy? It’s not because:

a) If you’re like most consumers, you’re probably backing up zero data on any system whatsoever and have been proceeding this way since you were born.

b) When you consider the number of cloud hosted data systems you use on a daily basis and entrust your data to, your head is probably going to start swimming.

Medium.com? The writing you’re creating here lives in the cloud until you scoop it out.

Post photos on Twitter? Tweet?

Use Facebook?

Comment on connections’ posts on LinkedIn?

Data, data, and more data, dear readers.

You’re probably creating some during many hours of the day even if all you’re doing is leaving comments on cat videos on YouTube.

But if there’s no backup being provided — and you don’t want to arbitrarily lose any of it — then what are you going to do exactly? How will you be able to construct a backup archive of those comments on cat videos if you get locked out of your Google account (which is your sole means of accessing YouTube)?

The answer to the above is what led me to try to map out the backup approaches of the most common cloud providers.

The Mediums, LinkedIns, GitHubs, and Reddits of this world.

You can peruse that documentation here. It may already be outdated.

But to save you some clicking, here’s what I found out:

  • There are some cloud services that don’t let you back up your data at all. Yes, really.
  • Others liberate only a portion of the data and leave the rest locked up in places like content delivery networks (CDNs) that they use to serve images more quickly.
  • Some provide automated backup processes. Others manual ones. And others have totally manual ones in which you need to write to the company to request a backup archive. And if it weren’t for things like GDPR, there’s a good chance this last group of companies wouldn’t be providing any backup functionality at all.

Despite these limitations, you could try your best. Here’s where the backup geeks and anoraks of the world do their thing. They use CLIs, VPS or dedicated hosting plans, and do other things that most people can’t. All in order to try to extract their own data from the various places it winds up on the internet.

But to start out small you could:

  • Create a backup calendar. Just a regular digital calendar but give it a separate name.
  • Create recurrent tasks for your backup operations so that you don’t forget. Or you could set up a backup day.
  • Create a checklist to make sure that you don’t miss out on any services. Now copy and paste the checklist into your backup calendar. Today’s the day you get to write to Quora and Reddit to pull out your latest answers and posts.
  • Log in to each platform you use every 3 months and create or request a backup. Package everything up into a zip archive. Don’t forget to include your web hosting, if you have any. And all the bits and pieces it might contain such as filesystems, MySQL databases, etc.
  • To fulfill the 3–2–1 aspect of backup best practice, you’re also going to need to mirror this archive, every few months, to the cloud or some other offsite location. If you have an NAS running at home, then you can speed up this process a little bit by using something like Hyper Backup to just replicate it onto another storage medium.
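
The last two steps above (zip everything up, then mirror the archive somewhere else) can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a recommended tool: the function name `archive_and_mirror` and the folder layout are hypothetical, the exports are assumed to have already been downloaded by hand into one folder, and the "offsite" folder is assumed to be something like a mounted NAS share or a synced cloud folder.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_and_mirror(exports_dir: Path, local_dir: Path, offsite_dir: Path) -> Path:
    """Zip the downloaded exports, copy the archive to a second location, verify the copy.

    local_dir must live outside exports_dir, otherwise the archive would try to
    include itself.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    offsite_dir.mkdir(parents=True, exist_ok=True)

    # make_archive appends ".zip" to the base name and returns the archive's path.
    base_name = local_dir / f"saas-backup-{date.today().isoformat()}"
    archive = Path(shutil.make_archive(str(base_name), "zip", root_dir=exports_dir))

    # copy2 returns the destination path it was given.
    mirrored = Path(shutil.copy2(archive, offsite_dir / archive.name))

    # Compare checksums so a truncated or corrupted copy is caught right away.
    if sha256_of(archive) != sha256_of(mirrored):
        raise RuntimeError(f"Checksum mismatch for {mirrored}")
    return mirrored
```

Calling it with three folders of your choosing returns the path of the mirrored archive. It covers only two of the copies in 3–2–1, and it does nothing about the tedious part, which is getting the exports out of each service in the first place.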

Tired already? Your backup day fell out during your vacation and you’re now lying on a beach in the Caribbean sipping beer? You get where I’m coming from.

Option 2: Use More Complicated And Inferior Tech To Run Your Digital Life And Quietly Begrudge Those Blessed Souls Who Aren’t Woke To The Vulnerability Of Their Data

If you fully embrace the backup lifestyle, then you can start doing things like storing offsite backups in your car — or in friends’ houses. Photo: author.

Ignorance is bliss. But if you’ve made it this far, then the bad news is that it’s already too late for you. You’ve become one of us. The data fiends. You know about the problem. It’s too late to turn back. (Want to find out more? Check out /r/datahoarder for a start).

After my recent temporary lockout from Google, I became acquainted with the online community known as /r/degoogle.

This community has put together some amazing resources on how to work around the services provided by tech giants like Google. And there are other subreddits too dedicated to helping users achieve similar purposes (/r/deamazon). A whole network of minds busily trying to figure out ways to reduce their dependency upon these tech giants.

The Unexpected Problems When You Try To De-Google

The rationale behind most of these communities is pure-spirited and seems to be roughly the same.

These major tech companies have gotten too big. They’re too well-funded. Their services are too good and we’ll all become dependent upon them. Hopelessly so. Because they don’t really care about us. They’re under our skins. But we need to be our own dealers. Of technology, that is.

My initial euphoria wore off, however, after I realized that I had been there, gotten the t-shirt, and thrown it away a few years ago.

I once ran my own web server, you see. I used an old laptop and repurposed it for this job by installing Ubuntu Server. Set up port forwarding so that it could be accessed from beyond my local network (LAN). The works.

I decided that I couldn’t continue having my data locked up in Google.

And so I signed up for one of the various open source platforms that provides server side scripts that aim to provide something like the familiar panoply of services Google provides and puts them … wherever you want them to go. Like that laptop I mentioned.

Around the same time, I eradicated Android from my phone and installed a custom operating system. Downloaded apps from a third party marketplace instead of the official app store.

And the more I got into this, the more the problems began to stack up.

“Dude, you just pulled out my email server and I’m pretty sure the guy I just emailed was about to download my resume!”

“What!? That plug next to the fridge with a beer label stuck to it!? The one running into that beat up looking netbook you bought on a flash sale?”

“Yes. That’s got like my entire life on it. And my website!”

These are credible words that have probably been uttered among roommates hosting their own tech on-premises.

Home internet connections, you see, aren’t typically intended to be used to run amateur data centers from. You can get better lines, for sure — look for a symmetrical connection — but that’s probably going to come at a cost.

Data centers have elaborate systems for failover. Both of internet connectivity and of power. You may wish to provision both. So add a beefy UPS or a small generator to your shopping list.

The upload speeds on many consumer-grade connections are heavily throttled. Which means that downloading anything larger than a couple of megabytes from your self-administered web server is likely going to be a frustrating experience.
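
To put rough numbers on that, here is a back-of-the-envelope Python sketch. The function name `transfer_seconds` and the speeds used are illustrative assumptions, not measurements of any real connection, and protocol overhead is ignored.

```python
def transfer_seconds(size_megabytes: float, upload_mbps: float) -> float:
    """Rough time for someone to download a file from a home server.

    The bottleneck is the home line's upload speed. File sizes are in
    megabytes (MB) and line speeds in megabits per second (Mbps), so the
    size is multiplied by 8 to convert bytes to bits.
    """
    return size_megabytes * 8 / upload_mbps

# Illustrative figures only: a 500 MB file over two hypothetical lines.
slow = transfer_seconds(size_megabytes=500, upload_mbps=2)    # constrained home uplink
fast = transfer_seconds(size_megabytes=500, upload_mbps=100)  # symmetrical line
```

With these assumed figures, `slow` works out to 2,000 seconds (over half an hour) and `fast` to 40 seconds.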

The vast majority of home internet users don’t set up port forwarding or run web servers either, you see, so doing so is a pretty good way to flag yourself as an aberrant subscriber. So your ISP is liable to assume that you’re doing something shady.

And the tech?

Let’s just say this. I’ve been using Linux day to day for more than 10 years. You don’t need to tell me about how buggy tech can become when there’s no money going into the ecosystem. It’s worse.

You end up using worse technology that’s ten times more complicated to set up while your roommate accidentally knocks out your web server just as your best job prospect is about to download your resume and your ISP starts wondering if you’re operating something illicit on the dark web. Not fun.

Option 3: Accept Data Insecurity As A Lamentable Fact Of Consumer Life In Today’s Cloud

A representation of technology infrastructure previously operated by the author. Photo: author.

Why does consumer SaaS protection suck so badly in comparison to what’s available in the enterprise space, you may be wondering?

A lot has to do with the fact that most consumers don’t think that there is a problem to begin with. The cloud = backup lie has proven to be a difficult old trope to do away with, you see.

And trying to get people who believe everything is dandy to pay for a solution they don’t think they need is probably not the best marketing proposition you can shoot for. I should probably know that.

If nobody’s going to pay (or hardly anybody) then nobody has a particular incentive to engineer a solution.

The next obstacle on the road:

The fact that most cloud / SaaS tools don’t want consumers to be able to easily migrate their data between providers.

Data portability isn’t in their interests. They’d prefer that you stick around as a long term customer. So you’ll have them to work with too. Unenthusiastic technology partners who lack any incentive to help you do your job on the one side and an uninterested market on the other. Anybody down for the job?

I once believed that I had a great business idea at my fingertips. It involved essentially this: finding a way to integrate the hundreds of SaaS tools consumers use into one database that could be piped into Google Drive / AWS S3 / wherever else folks like to keep their storage. Then backed up for safekeeping.

I spoke to a couple of folks in the industry who gave me a runthrough of basically this. Sadly, I scratched it off my list, even though I still think it would be an epic tool and one I would happily pay for myself (there are some cloud to cloud backup tools that cover consumer SaaS, but none that covers the assortment of tools that I use).

The strange synthesis of all my thinking outlined above:

I still think that data protection is vital, even for consumers. That’s why I spelled out the problem that many people are unaware that they face. Or (because it’s not a problem until it’s a problem) let’s call it a vulnerability instead.

In fact, with GoogleGate still etched fresh in my memory, I believe more than ever that anybody that commits crucial data to the cloud needs to find a way to be able to adequately back it up.

I think it’s a great pity that — relative to the way organizations are helped to protect their data — this market is so bereft of solutions.

As I learned, it’s apparently that way for a reason, and that reason is commercial incentive.

Unfortunately, at some point, life got busy.

As you may have noticed if you follow this Medium channel, I’ve been getting into video lately.

I now need to back up GBs and TBs of data rather than the humble PDFs and photos I backed up when I mostly “just” wrote.

So let me tell you a few secrets.

But only if you don’t tell my backup buddies.

I haven’t managed to reliably offsite all the originals of my videos yet. They’re kinda heavy and my internet is slow.

I should really run a Clonezilla backup on my desktop too. Because I’ve added a few programs to it this week and probably haven’t done a thorough backup on it in a while.

At some point my desire to be creative took over my desire to protect my data.

And I realized that as one guy with an internet connection there’s only so much that I can do to work around a technology industry that — in large part — simply doesn’t care much about this problem.

In consumer SaaS land, we’re mostly still flying blind.

Letting big tech hold our data hostage.

Leaving it to the unincentivized (open source) community to develop hard-to-backup alternatives that don’t really stand a chance of being replacements for Google considering the massive financial inequalities they face. It’s a David vs. Goliath battle that nobody’s going to win.

I still care about backup.

I’d love for all the data I create on the cloud every day to be at my digital fingertips — on my NAS — by the same evening.

But there’s no way for me to achieve that without dedicating hours per day to the process. And I just don’t have time for that.

For now, I’m choosing the digital path of least resistance. I wish there were better options. But unless I’m missing them, there aren’t.

What can you do?

Consider becoming a consumer advocate for backup. No, really.

Scrutinize your favorite SaaS providers’ backup options — or lack thereof.

If you find them lacking, or absent entirely, then let them know about it.

Look into other options and, if you find them, tell the provider that you’re leaving or have left because you don’t want them to be the sole custodian of your data.

The more people that do this, the quicker companies will feel compelled to offer proper solutions.

We may live in a world in which ownership of our own data stops being a right the moment we entrust it to somebody else and their computers.

But we also live in a world which provides few acceptable options for those of us who want to maximize our potential online. We need a LinkedIn account to do business (we just can’t back it up automatically). Etc, etc.

Take a deep breath but keep up the energy.

It’s our data.

The digital imprint of our lives.

And I believe it’s worth fighting for.

Daytime: tech-focused MarCom. Night-time: somewhat regular musings here. Or the other way round. Likes: Linux, tech, beer. https://www.danielrosehill.com