    Google Wants to Establish an Official Standard for Using Robots.txt

    Google has proposed an official internet standard for the rules included in robots.txt files.
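
    For context, the rules in a robots.txt file are plain-text crawl directives grouped by user agent. The snippet below is purely illustrative (the paths and the "BadBot" name are placeholders), but it shows the kind of rules the REP governs:

        # Illustrative example only
        User-agent: *
        Disallow: /search
        Allow: /search/about

        User-agent: BadBot
        Disallow: /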

    Those rules, outlined in the Robots Exclusion Protocol (REP), have been an unofficial standard for the past 25 years.

    While the REP has been adopted by search engines, it's still not official, which means it's open to interpretation by developers. Further, it has never been updated to cover today's use cases.

    Google Webmasters (@googlewmc) tweeted:

    It’s been 25 years, and the Robots Exclusion Protocol never became an official standard. While it was adopted by all major search engines, it didn’t cover everything: does a 500 HTTP status code mean that the crawler can crawl anything or nothing? 😕

    As Google says, this creates a challenge for website owners because the ambiguously written, de facto standard makes it difficult to write the rules correctly.

    To eliminate this challenge, Google has documented how the REP is used on the modern web and submitted it to the Internet Engineering Task Force (IETF) for review.

     

    Google explains what is included in the draft:

    “The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users.”

    The draft does not change any of the rules established in 1994; it simply updates them for the modern web.

    Some of the updated rules, illustrated in the sketch after this list, include:

    • Any URI-based transfer protocol can use robots.txt; it is no longer limited to HTTP and can also be used for FTP or CoAP.
    • Developers must parse at least the first 500 kibibytes of a robots.txt file.
    • A new maximum caching time of 24 hours (or the value of a cache directive, if one is available), which gives website owners the flexibility to update their robots.txt whenever they want.
    • When a robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
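
    As a rough sketch only (not Google's implementation), the Python snippet below shows how a crawler might combine these rules: it reads no more than the first 500 KiB of the file, caches the parsed result for up to 24 hours, and stops crawling a host while its server is returning 5xx failures. The rule matching relies on Python's standard urllib.robotparser; the cache structure, timeout, and error handling are assumptions for illustration.

        # Hedged sketch: apply the draft's fetch, size, cache, and failure rules.
        import time
        import urllib.error
        import urllib.request
        from urllib.robotparser import RobotFileParser

        MAX_BYTES = 500 * 1024    # parse only the first 500 KiB (the draft's minimum to support)
        CACHE_TTL = 24 * 60 * 60  # cache robots.txt for at most 24 hours

        _cache = {}               # robots.txt URL -> (fetched_at, parser or None)

        def can_fetch(robots_url, user_agent, page_url):
            """Return True if page_url may be crawled according to robots_url."""
            now = time.time()
            cached = _cache.get(robots_url)
            if cached and now - cached[0] < CACHE_TTL:
                parser = cached[1]
            else:
                parser = RobotFileParser()
                try:
                    with urllib.request.urlopen(robots_url, timeout=10) as resp:
                        body = resp.read(MAX_BYTES)  # ignore anything past the size limit
                    parser.parse(body.decode("utf-8", errors="replace").splitlines())
                except urllib.error.HTTPError as err:
                    if err.code >= 500:
                        parser = None        # server failure: assume nothing may be crawled
                    else:
                        parser.parse([])     # 4xx (e.g. 404): no robots.txt, no restrictions
                except urllib.error.URLError:
                    parser = None            # unreachable: stay conservative
                _cache[robots_url] = (now, parser)
            if parser is None:
                return False
            return parser.can_fetch(user_agent, page_url)

        # Usage (hypothetical host and crawler name):
        # can_fetch("https://example.com/robots.txt", "MyCrawler", "https://example.com/page")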

    Google is fully open to feedback on the proposed draft and says it’s committed to getting it right.
