Beware: the web gives posted documents a life of their own

by Alan Zisman (c) 2009. First published in Business in Vancouver, August 4-10, 2009, issue #1032

High Tech Office column

Don’t let this happen to you.

I got an e-mail recently from a friend who is working for a local environmental non-profit. (He’s asked me to keep his name and the organization he works for out of this column.) His organization was choosing a delegation for a conference in Europe – a committee was considering about 100 applications, submitted as Microsoft Word-formatted documents.

To facilitate access to all the applications, the organization created a web page with links to the applications. After selecting the delegation, they took the web page and the applications offline. Or so they thought.

My friend had gotten an e-mail from an unsuccessful applicant; she had used Google to try to find out who had been selected. Much to her surprise, her search hits included a link to her own application form – complete with name, address, phone number, e-mail and other identifying information.

The page, which had a numeric IP address rather than a standard domain name, had a note at the top stating: “This is the html version of the file [filename deleted]. Google automatically generates html versions of documents as we crawl the web.” The numeric web address appeared to be owned by Google.

Who knew Google was not only indexing the web but also converting any documents it stumbled across into standard web page format, and posting them to another location without the knowledge or consent of the document creators? Or that they remained online regardless of the fate of the original document?

(There are probably copyright issues here, but that’s for the lawyers.)

In her e-mail, the unsuccessful applicant said she was horrified that this information was made public – along with that of everyone else who had applied. I would agree.

The organization had clearly made a mistake. By posting a web page that linked to the application forms and didn’t require any sort of login, they had made all the application forms public, even though they didn’t publicly advertise that page. And once the information is “out there,” anything can happen.

But once something’s got into Google’s system, how do you get it out? There’s no obvious way to talk to a “real person” at Google.

I checked in with Chris Goward of Vancouver’s WiderFunnel Marketing, a company that works closely with Google in helping clients optimize their websites for more effective results. He pointed us to an online form at www.google.com/webmasters/tools/removals, noting that it can be used to remove a webpage from Google’s list. According to Goward, the form is effective, though it can take a few days before the page is removed.

He pointed out that anything posted on the web may sooner or later show up in a Google search list unless it’s behind a firewall (the usual practice for corporate networks), in a password-protected area or covered by a robots.txt restriction. (Robots.txt is a standard file used to request that search engine indexers ignore specified files or folders. Note that word “request”; compliance with robots.txt restrictions is on the honour system.)
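
For readers who maintain a website, here is a minimal Python sketch of how a well-behaved crawler might consult a robots.txt file before fetching anything. The /applications/ folder, the example.org address and the is_allowed helper are hypothetical, made up for illustration; only the urllib.robotparser module from Python’s standard library is real. Treat it as an illustration of the honour system at work, not as a description of how Google’s crawler is built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of /applications/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /applications/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler runs this check itself; nothing on the server enforces it.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.org/applications/form42.doc"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.org/index.html"))
```

Under these assumed rules, the first check should report that the application form is off limits and the second that the home page is fair game. A crawler that never runs such a check will fetch the file anyway, which is why a login is the safer choice for personal information.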

Employees at the non-profit got to work, filing removal forms for each of the pages that Google had created for their 100-odd applicants. As far as I can tell, it worked – Googling an applicant’s name no longer brings up the application form in the search results.

Lots of people are pretty sloppy with personal information online. If you post your own information, that’s one thing. But we all need to be held accountable if we post other people’s information – employees, customers, friends – online. As we’ve seen, good intentions aren’t enough – anything posted online is liable to show up in a Google search – and may be given a life of its own beyond your intended use. •