
Quasi-Private Resources :: Copyrighteous

Public Resource republishes many court documents. Although these documents are all part of the public record and PR will not take them down because someone finds their publication uncomfortable, PR will evaluate and honor some requests to remove documents from search engine results. Public Resource does so using a robots.txt file, part of the “robots exclusion protocol” that websites use to, among other things, tell search engines’ web-crawling “robots” which pages they do not want indexed and included in search results. Originally, these files were mostly used to keep robots from abusing server resources by walking through infinite lists of automatically generated pages, or to block search engines from indexing user-contributed content that might include spam.
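To see how the protocol works in practice, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The site and paths are hypothetical, invented for illustration; a well-behaved crawler consults the file and skips anything listed under `Disallow`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt like the one Public Resource might serve.
robots_txt = """\
User-agent: *
Disallow: /sealed/
Disallow: /private-case-12345/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching it.
print(rp.can_fetch("*", "http://example.org/sealed/doc.pdf"))    # False: skip it
print(rp.can_fetch("*", "http://example.org/opinions/doc.pdf"))  # True: fine to index
```

Note that nothing here is enforced: the file is only a request, and compliance is entirely up to the crawler.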

The result for Public Resource, however, is that PR is now publishing, in the form of its robots.txt, a list of all of the cases that people have successfully requested to be made less visible!

In Public Resource’s case, this is the result of a careful decision; PR makes the arrangement clear on its website. The robots.txt home page also explains the situation, saying, “the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use,” and “don’t try to use /robots.txt to hide information.”

That said, I’ve looked at a bunch of robots.txt files on websites I have visited recently and, sadly, I’ve found many sites that use robots.txt as a form of weak security. This is very dangerous.

Some poorly designed robots simply ignore the robots.txt file. But one can also imagine an evil search engine that uses a web-crawler that does the opposite of what it’s told and only indexes these “hidden” pages. This evil crawler might look for particular keywords or use existing search engine data to check for incoming links in order to construct a list of pages whose existence is only made public through a file meant to keep people away.
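The uncomfortable part of this scenario is how little work it takes. A hostile crawler does not even need the robots-parsing library: a few lines of string handling turn the file into a target list. This is a purely illustrative sketch with made-up paths, not a tool:

```python
def harvest_disallowed(robots_txt: str) -> list[str]:
    """Return every path a robots.txt asks crawlers to stay away from."""
    targets = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()       # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:                               # an empty Disallow means "allow all"
                targets.append(path)
    return targets

# Hypothetical robots.txt; in the evil-crawler scenario, these
# "keep out" entries become the crawl frontier instead.
robots_txt = """\
User-agent: *
Disallow: /sealed/
Disallow: /requests/removed-case-99/
Disallow:
"""
print(harvest_disallowed(robots_txt))
# → ['/sealed/', '/requests/removed-case-99/']
```

The asymmetry is the point: the same file that politely steers away cooperative robots hands uncooperative ones a map.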

Check your own robots.txt and ask yourself what it might reveal. By advertising the existence and locations of your secrets, the act of “hiding” might make your data even less private.



