Topics: How Google protects its index from potential spam sites. The
controversial "Information retrieval based on historical data" - United
States Patent Application 20050071741 - March 31, 2005
"Information retrieval based on historical data"
- United States Patent Application 20050071741
Preliminary statements:
1. We know, and you know, that when a
big company (whose copyrighted, unique technology is studied, analyzed
and debated) files a patent, it is mainly to mislead competitors rather
than to reveal its secrets.
You don't need to place a copyright on something that
nobody can really understand, copy or steal: something like the Google
algorithm, for example (with the exception of PageRank, which has been
patented - see the PageRank section).
Therefore: don't take this patent (made
public in March 2005) as the greatest Google revelation ever.
2. This patent seems to describe a
stereotyped, predictable, standard spamming activity, and then the way
Google's robots are set up to prevent it.
Many parts of this patent reveal misleading intentions rather than
true technical details. It all seems realistic and logical. Turn
on your brain, read it twice and you'll notice a strange smell of hot
air all around...
3. What is described in the patent
looks like a "method" rather than a new add-on to Google's algorithm. As
the patent's title implies, it's about a method to retrieve
information and filter it on a historical basis. The name "Google" NEVER
appears in the patent's contents. Never.
Since this is a Google ranking strategy guide, we
thought writing about this patent was our duty. Whether or not this
patent really describes something that is actually implemented in
Google's algo (we believe it's not implemented), you
can use this patent to refine your website's features (e.g.
by adding fresh content on a regular basis) and strategies (e.g.
by looking for qualified inbound links and forgetting link farms
forever).
4. On the Net you can find articles
about this patent that are more accurate and detailed than the one
you're reading now, especially on websites whose owners are convinced
that document 20050071741 is actually a revealed part of the algorithm.
If you want to read other opinions about it (and we expect you do), try
searching with Google.
What follows is a summary of the most interesting
contents we have extracted from it.
Quote from patent: "A system
identifies a document and obtains one or more types of history data
associated with the document. The system may generate a score for the
document based, at least in part, on the one or more types of history
data."
And here's how Google (oops! the authors of the
patented document) may score a document:
a. Inbound Links
The quality, number and text of links are taken into account;
that's nothing new. What is new is the historical data used in the
calculation: how often are the links to your page updated? What is the
anchor text of those links? A bunch of links with the same anchor text
may indicate spam activity. Many links with different anchor text may
indicate fresh and interesting content. Historical data also includes
the time your website takes to acquire inbound links: too many inbound
links in too short a period of time may indicate spam activity.
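To make the two link signals above concrete, here is a minimal sketch of how anchor text repetition and link bursts could be measured. It is our own illustration, not code from the patent: the function names, the 30-day window and all thresholds are assumptions picked for the example.

```python
from collections import Counter
from datetime import date


def anchor_text_concentration(anchors):
    """Share of inbound links that use the single most common anchor text."""
    if not anchors:
        return 0.0
    counts = Counter(a.strip().lower() for a in anchors)
    return counts.most_common(1)[0][1] / len(anchors)


def max_links_in_window(first_seen_dates, window_days=30):
    """Largest number of links first seen within any span of window_days."""
    dates = sorted(first_seen_dates)
    best = 0
    start = 0
    for end in range(len(dates)):
        # Shrink the window from the left until it spans fewer than window_days.
        while (dates[end] - dates[start]).days >= window_days:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_like_link_spam(links, max_concentration=0.8, max_burst=50, min_links=10):
    """links: list of (anchor_text, first_seen_date). Thresholds are hypothetical."""
    anchors = [anchor for anchor, _ in links]
    dates = [seen for _, seen in links]
    same_anchor = (len(links) >= min_links
                   and anchor_text_concentration(anchors) > max_concentration)
    burst = max_links_in_window(dates) > max_burst
    return same_anchor or burst
```

With these made-up thresholds, twenty links that all say "cheap pills" would be flagged, while a dozen links with different anchor text collected over a year would not.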
b. Contents
Relevant and fresh content is preferred. Content scoring is also
related to users' behaviour (see point f. below), because even stale
content may still be useful (e.g. a biography).
Significant changes in content are taken into account. The date a page
is updated is recorded.
A website's content should be updated and expanded on a regular basis,
like "organic growth". Too many pages added in a short period
may be an indication of spam (the same goes for links, see point a.).
c. Domain Names
It seems spammers register domains for just one year, often providing
false admin and contact details. Domains registered for more than one
year may get a higher score.
The patent does not refer to any historical data related to domain age:
it seems to focus on domain expiration rather than domain creation.
Therefore, even if your domain is 5 years old, which would suggest
you're serious about it, you could still be penalized because you renew
it yearly.
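The oddity described above is easier to see in a small sketch. This is only our reading of the patent, not a known implementation: the function names and the 365-day threshold are assumptions. Note that the creation date is deliberately not an input, because the check looks only at how far ahead the domain is paid for.

```python
from datetime import date


def remaining_registration_days(expires_on, today):
    """Days left until the domain registration expires."""
    return (expires_on - today).days


def long_term_registration(expires_on, today, threshold_days=365):
    """True when the domain is paid for more than threshold_days ahead (hypothetical threshold)."""
    return remaining_registration_days(expires_on, today) > threshold_days
```

Under this sketch a domain created in 2000 and renewed one year at a time looks exactly like a brand-new one-year registration, while a new domain paid for five years ahead passes the check.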
d. CTR (click-through rate)
CTR is monitored and recorded to find out which kind of content is
preferred in a certain period of time (seasonal, fashion or trend
related). For example, a ski website receives more clicks in winter,
and so on.
e. Traffic
Traffic factors, such as how much time visitors spend on your page, are
taken into account.
f. Users' behaviour
Duration of visits, bookmarks.
Quote: "Information relating to how often the document is selected
when the document is included in a set of search results". This
means that there's a historical variable that tracks the ratio between
the number of times your page appears in the results and the number
of times users click on it.
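The ratio mentioned in points d. and f. can be sketched in a few lines. Again, this is an illustration under our own assumptions: the patent does not say how the data is stored, so the log format (period, impressions, clicks) and the function names are hypothetical.

```python
def click_through_rate(impressions, clicks):
    """Clicks divided by the number of times the page appeared in results."""
    if impressions <= 0:
        return 0.0
    return clicks / impressions


def ctr_by_period(log):
    """log: iterable of (period, impressions, clicks); returns {period: CTR}."""
    totals = {}
    for period, impressions, clicks in log:
        shown, clicked = totals.get(period, (0, 0))
        totals[period] = (shown + impressions, clicked + clicks)
    return {period: click_through_rate(shown, clicked)
            for period, (shown, clicked) in totals.items()}
```

Grouping by period is what would let a seasonal site, such as the ski website above, be compared winter against winter instead of winter against summer.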
In short, this patent assigns a higher
score to websites that are developed gradually, in a natural way, with
day-by-day marketing and regular growth of content. CTR and
user behaviour monitoring are applied to make sure you really work
on your content to make it interesting, fresh and useful.
--------------------------------------------------------------------------------
Google's actual Spam indicators
While information retrieval based on historical data is (or may
be) a great method to keep spammers out, there are some
indicators Google actually uses to catch spam sites.
1. Multiple domains with the same contents.
It is not uncommon to see many similar websites rank high for certain
keywords and then disappear from the index after a certain period of
time (usually one month).
2. Google's ABUSE service. Based on
users' input. Users can report a website they suspect of spamming the
search engine to Google's abuse service. The website will be
investigated (by a human) and, if it's caught in some spamming
activity, it is banned.
3. Keyword Stuffing. Pages that
repeat the same keyword over and over usually rank high for a
short period of time and then disappear from the index. Keyword
stuffing is in fact considered the most annoying spam practice.
4. Don't forget the so-called Sandbox effect (read
about it in the Google Sandbox effect section).