Ranking - Das IICM

Group Assignment
Information Search and Retrieval
Graz University of Technology
WS 2012
Ranking
Ranking Algorithms and Search Engine Optimization
Group 10
Paul Kapellari
Technische Universität Graz, paul.kapellari@student.tugraz.at
Daniel Krenn
Technische Universität Graz, d.krenn@student.tugraz.at
Georg Kothmeier
Technische Universität Graz, georg.kothmeier@student.tugraz.at
Supervisor
Univ.-Doz. Dr.techn. Christian GÜTL
Institute for Information Systems and Computer Media (IICM),
Graz University of Technology, Austria
cguetl@iicm.edu and cguetl@acm.org
Index
ABSTRACT/ZUSAMMENFASSUNG
1 INTRODUCTION
1.1 Problem description
1.2 Motivation
1.3 Structure
2 RANKING ALGORITHMS AND STRATEGIES
2.1 In Degree
2.2 Page Rank
2.3 HITS
2.4 SALSA
2.5 Summary about link based algorithms
2.6 Non link based approaches
3 WEB SEARCH ENGINES / WEB SEARCH SERVICES
3.1 Google
3.2 Alexa
3.3 DMOZ
4 SEO
4.1 Techniques and Methods
4.2 Problems
5 NEW TRENDS
6 CONCLUSION
7 APPENDIX
7.1 References
7.2 List of figures
Abstract
With a vast amount of information available in computer systems, it becomes more and more difficult for
users to find and retrieve information. Information needs to be analyzed and catalogued in order to
make it accessible. When searching for specific documents, results need to be ranked and ordered properly
to offer users the highest possible quality of information.
This paper discusses the process of "Ranking", more specifically "Ranking Algorithms and Search Engine
Optimization (SEO)". After a short overview, it presents different ranking algorithms and strategies for this
purpose and gives an insight into web search engines and services. Furthermore, some methods for search
engine optimization and their problems are discussed. Finally, new trends in this field are introduced.
Summary (Zusammenfassung)
Because of the enormous amount of information made available in computer systems, searching for and
finding information has become increasingly difficult. To make information accessible, it has to be
analyzed, catalogued and sorted by relevance in order to guarantee the user a high quality of results.
This paper deals with the topic of "Ranking", more precisely with "Ranking Algorithms and Search Engine
Optimization". After a short introduction to the subject, different algorithms and the way they work are
explained, and an overview of various web search engines and web search services is given. Furthermore,
possibilities for search engine optimization and the problems associated with them are described. Finally,
future trends in the topics mentioned above are discussed.
1 Introduction
1.1 Problem description
Nowadays a vast amount of information is available to everyone. In particular, the amount of accessible
digital data has grown enormously during the last years, and many people speak of an information flood. At
the same time, the need for information keeps growing. The reasons for seeking information differ widely:
people do research for educational reasons, for their jobs, and for personal interests (e.g. news or
hobbies). In this huge heap of information, the biggest challenge is to find the right resource, the one
that provides the information you need. This is where ranking tries to help the user. If there are just a
few search results, any user can easily distinguish important resources from unimportant ones. This may be
true for some small local databases, but on the web there are billions of resources and nobody can rank all
of the documents manually. A good ranking algorithm is therefore essential for every search engine. There
are many different approaches (discussed in section 2), but all of them try to accomplish the same goal:
the most relevant resource should be displayed first and the least relevant one at the end of the list.
Some approaches are more successful than others. Because the web is very diverse and HTML is far from being
semantic, there are many different attempts to find good resources and rank them in descending order of
relevance. Right now (2012) Google seems to have the best ranking strategy. This is why this paper
discusses many ideas, algorithms and approaches from the Google universe.
So on the one side there are the search engine providers, which want to rank as well as possible, and on
the other side there are webmasters who want to get their websites into the top places of every search.
This gave rise to another discipline called Search Engine Optimization (SEO). Especially for people who run
their business through their websites, it is indispensable to appear at the first positions in search
results. This is why they started to optimize their internet presence for search engines as well. SEO has
become an industry of its own: almost every marketing agency offers SEO, plenty of websites explain SEO
techniques, and conferences are held on the topic, also in Austria (see http://www.seokomm.at/; the event
took place in Salzburg on 23.11.2012). Chapter 4 deals with SEO in detail and gives some examples.
Because there are always people who try to benefit more than others, SEO also causes a lot of problems.
These people want to boost their rank by using unfair and dishonest methods, such as doorway pages and link
farms, to name just a few. These kinds of problems are explained in detail in chapter 4.2. As a result
there is a constant competition between search engines and spammers, which is why ranking algorithms and
strategies evolve over time. The new trend seems to be more personalized ranked results. Some applications
do this so well that the user doesn’t notice that he gets ranked results; he doesn’t even know that he is
searching (see Google Now on Android phones). There is a lot going on in this field. Crowd ranking,
folksonomies and social media are also becoming more important for ranking. These new trends are presented
in chapter 5.
1.2 Motivation
Since the internet began to grow and many people gained access to it, searching and ranking have become
more and more important. The large turnover of search services like Google and Yahoo shows that this is
still a big topic in 2012. Small web developers hardly seem able to compete with the big companies in
searching and ranking, so why this paper? The big companies do their own research. The target audience of
this paper is not Google or Yahoo; it is you, the interested web developer. Only if you understand the
concepts behind ranking can you optimize your product and get it placed where you want it to be. Ranking
and SEO are closely related, which is why this paper discusses both.
1.3 Structure
As mentioned before, this paper has two main parts, ranking and SEO. It starts with a brief overview of
ranking, followed by the more technical part, in which algorithms and approaches are discussed. The SEO
part is more practical and therefore needs the theoretical background provided in the preceding chapters.
Chapter 5 focuses on new trends to give the reader an outlook on possible new developments.
2 Ranking algorithms and strategies
To clarify what ranking is, we start with a short explanation. Imagine you are searching for one document
among many others. You will find a set of documents which could be relevant. Now this set should be
ordered, ideally so that the first document is the most relevant one. This is what ranking tries to do:
sorting search results in a meaningful way to help users find what they are searching for. There are plenty
of different ranking algorithms and strategies, and it is hard to tell which one is the best. It depends
strongly on the context, the use case and the system to which the ranking algorithm is applied. The more
specialized a system is, the more advanced the methods that can be used. For specific information systems
one can use complex machine learning algorithms like Bayesian networks or neural networks. The more diverse
the information in a system becomes, the more general the approaches have to be. For all of these systems
it is essential to perform well in ranking. If the user doesn’t find what he is searching for, he will
consider the system useless. For this reason the ranking strategy is one of the main success factors of
every information system. To illustrate how important ranking is, think of Google or Yahoo: without an
outstanding ranking they could close their whole business. Because this paper deals with ranking in the
World Wide Web, its main focus is on link based algorithms. These types of algorithms seem to perform very
well in the context of the web and form the base of many search strategies.
2.1 In Degree
All link based algorithms look at the web as a graph. Documents are seen as vertices and links as edges. A
user who browses through the net is doing a random walk on this graph. The idea for an algorithm based on
the in degree of a vertex is inspired by citations in the academic world: a document A seems to be
important if many other documents cite A. For scientific papers this seems to be true, and ranking by the
in degree of a vertex would be enough. But the web is very different from scientific citations. This was
also pointed out in the paper “The PageRank Citation Ranking: Bringing Order to the Web” (Brin, Page,
Motwani, & Winograd, 1999). The authors note that the web is more diverse, especially in terms of quality
and content. There is no quality assurance; everybody is able to publish content. The type of content also
varies widely, from texts about someone's hobbies to news to highly scientific material. This is why they
started to think about how to improve the idea of in degree, and they came up with PageRank. More recent
papers show that PageRank can be approximated by the in degree (see (Upstill, Craswell, & Hawking, 2003) or
(Fortunato, Boguná, Flammini, & Menezer, 2008)), which saves a lot of computation time, but Google still
relies on PageRank and is very successful. So this strategy still seems to be very good.
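The in degree idea described above can be illustrated with a minimal Python sketch. The link structure and the page names A to D are hypothetical toy data, and ignoring self-links and duplicate links is an assumption of this sketch, not something the cited papers prescribe.

```python
from collections import Counter


def in_degree_ranking(links):
    """Rank pages by the number of distinct pages linking to them.

    `links` maps each page to the list of pages it links to.
    """
    in_degree = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links (assumption)
                in_degree[target] += 1
    # Sort by in degree (descending), then by name for a stable order.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy web with four pages.
links = {
    "A": ["C"],
    "B": ["C", "A"],
    "C": ["A"],
    "D": ["C"],
}
ranking = in_degree_ranking(links)
print(ranking)
```

Page C is cited by three pages and therefore ranks first. The sketch also shows the weakness discussed in the text: every incoming link counts the same, regardless of the quality of the linking page.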
2.2 Page Rank
The first idea to improve on the in degree was to find a measure of how important an incoming link is. A
link from a web page with a high reputation should be worth more than a link from a very poor web page. To
model this, two types of links were introduced: “back links” and “forward links”.
Figure 1: Forward links and back links
Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
The figure above shows a simple link structure with 3 web pages. A and B each have one “forward link” and
no “back links”, and C has 2 “back links” and no “forward link”. For PageRank a page is considered
important if it has many “back links”. “Back links” from such important pages are in turn very good for
your own PageRank. To illustrate how this works in detail see the figure below:
Figure 2: Snapshot of page rank calculation,
Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
The first page reaches a PageRank of 100 through all its “back links”. This page has two “forward links”.
The PageRank that a page propagates over each of its “forward links” is simply its own PageRank divided by
the number of its “forward links”, here 50 per link. The second page receives its PageRank from page one
and page three (the one with PageRank 9, which contributes 3). The sum over its “back links” is
50 + 3 = 53, so its PageRank is 53. To calculate the PageRank you need the following recursive formula:
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
(Brin, Page, Motwani, & Winograd, 1999)
Here $B_u$ is the set of pages that link to $u$, $N_v$ is the number of “forward links” of page $v$, and
$c$ is a normalization factor.
This formula has two big problems: dangling links and infinite loops.
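The propagation step described for Figure 2 can be reproduced with a few lines of Python. This is only a sketch of one step of the simplified formula with c = 1. The page names are hypothetical, and it is an assumption of this example that the page with PageRank 9 has three “forward links”.

```python
def propagate(ranks, forward_links):
    """One propagation step: each page splits its rank evenly over its forward links."""
    received = {page: 0.0 for page in ranks}
    for page, targets in forward_links.items():
        if not targets:
            continue                      # nothing to pass on
        share = ranks[page] / len(targets)
        for target in targets:
            received[target] += share
    return received


# Hypothetical names; ranks 100 and 9 are taken from the description of Figure 2.
ranks = {"P1": 100.0, "P2": 0.0, "P3": 9.0, "X": 0.0, "Y": 0.0, "Z": 0.0}
forward_links = {
    "P1": ["P2", "X"],        # two forward links, 50 each
    "P3": ["P2", "Y", "Z"],   # three forward links (assumed), 3 each
}
received = propagate(ranks, forward_links)
print(received["P2"])
```

The second page receives 50 from the first page and 3 from the third page, which gives the value of 53 mentioned in the text.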
2.2.1 Random surfer
A tricky problem is the possibility of an infinite loop. This happens if some pages link only to each
other and this circle has only one “back link” from outside. The pages in this circle would propagate
their PageRank among themselves over and over again without passing any of it on. This circle produces a
totally wrong PageRank which doesn’t reflect the real value. The next figure shows how such an infinite
loop happens:
Figure 3: Infinite loop
Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
To solve this problem the model of a random surfer was introduced. This model fits a real world scenario
better and also solves the mathematical problem. The random surfer can visit any page at random, simply by
typing a new URL into the browser's address bar. It is therefore very unlikely that a random surfer gets
lost in an infinite cycle: once he recognizes that he is in a loop, he will jump to another page. So the
summation formula from above has to be redefined:
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
(Brin, Page, Motwani, & Winograd, 1999)
The vector E describes the likelihood that a user jumps to a given page. Different distributions can be
chosen, which lead to different results. (Brin, Page, Motwani, & Winograd, 1999) suggest using a
uniform distribution over all web pages and adding a term α to adjust the weight of E.
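The random surfer model can be sketched as an iterative computation. The following Python code is a minimal illustration, not Google's implementation. It assumes a uniform jump distribution, the commonly used weight α = 0.85 for following links, and it spreads the rank of pages without forward links uniformly over all pages. The toy graph is hypothetical and contains exactly the loop described above: B and C link only to each other, and A links into the loop.

```python
def pagerank(links, alpha=0.85, tol=1e-8, max_iter=200):
    """Iterate the random surfer model until the ranks stop changing.

    `links` maps each page to the list of pages it links to.
    Returns the rank dictionary and the number of iterations used.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        # Rank of pages without forward links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = alpha * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:                  # ranks are stable, stop iterating
            break
    return rank, iteration


# Hypothetical toy graph: B and C form a loop, A links into it.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
rank, iterations = pagerank(links)
print(rank, iterations)
```

Without the jump term the loop would collect all the rank. With it, page A keeps the share (1 - α)/n = 0.05 and the ranks add up to 1.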
2.2.2 Dangling links
Dangling links are links which point to pages that don’t have outgoing links. The problem is that it is not
known where to distribute the PageRank of these pages. It could be that these pages really don’t link to
others, or that the links are not visible because the sample of the web is too small. The latter is the
more likely case, because nobody has a full representation of the World Wide Web. To solve this problem
(Brin, Page, Motwani, & Winograd, 1999) suggest removing dangling links during the computation and adding
them back after the process has converged. They note that this changes the results slightly, but without a
big effect.
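The removal step can be sketched as follows. The code repeatedly removes pages without forward links, because removing one page can leave another page without forward links. Repeating until no such page is left is an assumption of this sketch, and the toy graph is hypothetical.

```python
def prune_dangling(links):
    """Repeatedly remove pages that have no forward links.

    Returns the pruned link structure and the removed pages in removal order.
    """
    graph = {page: set(targets) for page, targets in links.items()}
    for targets in links.values():
        for target in targets:
            graph.setdefault(target, set())   # pages that only appear as targets
    removed = []
    while True:
        dangling = sorted(p for p, targets in graph.items() if not targets)
        if not dangling:
            break
        removed.extend(dangling)
        for page in dangling:
            del graph[page]
        for targets in graph.values():
            targets.difference_update(dangling)  # drop links to removed pages
    return graph, removed


# Hypothetical toy graph: D has no forward links, C links only to D.
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
pruned, removed = prune_dangling(links)
print(pruned, removed)
```

After the ranks of the remaining pages have been computed, the removed pages would be added back in reverse order of removal.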
2.2.3 Convergence
After solving the two main problems you can apply the algorithm to real data. The algorithm runs
iteratively and can be stopped when the difference between the previous iteration and the current iteration
is smaller than a predefined very small value. The results of (Brin, Page, Motwani, & Winograd, 1999) show
that the algorithm converges very well and is also usable for big data sets.
Figure 4: Convergence rate for half size and full size link database
Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
As you can see, the algorithm has already converged after 52 iterations. The difference between the half
size link database and the full size link database isn’t that big either. The scaling factor is roughly
linear in log(n).
2.2.4 Google Matrix
To compute the algorithm efficiently it can be implemented as a matrix multiplication. For this purpose the
Google Matrix was introduced. With the knowledge of the previous sections this matrix is really easy to
understand.
Figure 5: Description of the Google Matrix
Source: Slide from the Lecture Webscience and Webtechnologie at TU Graz presented by Markus Strohmaier
The matrix H is simply the transition matrix with the probabilities of reaching one page over a link from
another. In the end, the PageRank algorithm can be described as the repeated multiplication of the current
rank vector with the Google Matrix until the vector no longer changes.
2.2.5 Current development
PageRank is still the heart of Google’s ranking tactic (Moskwa, 2011), but it seems to be getting more and
more competition. It is hard to tell how Google really ranks, because this is one of its biggest secrets,
but in addition to PageRank other factors apparently play a role in ranking. The Social Web, Google+ and
many other services have changed the WWW dramatically, so it is plausible that more factors are considered.
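The matrix formulation can be illustrated with a small sketch in plain Python. It follows the notation commonly used in the literature, G = αS + (1 - α)(1/n)ee^T, where S is H with the rows of pages without forward links replaced by a uniform row. The value α = 0.85, the uniform jump and the toy graph are assumptions of this example. The slide referenced in Figure 5 may use a different notation.

```python
def google_matrix(links, pages, alpha=0.85):
    """Build the dense Google Matrix as a list of rows."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = []
    for page in pages:
        targets = links.get(page, [])
        if targets:
            row = [0.0] * n
            for target in targets:
                row[index[target]] += 1.0 / len(targets)   # row of H
        else:
            row = [1.0 / n] * n                            # uniform row for dangling pages
        matrix.append([alpha * h + (1.0 - alpha) / n for h in row])
    return matrix


def power_step(pi, matrix):
    """Multiply the row vector pi with the matrix."""
    n = len(pi)
    return [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical toy graph with three pages.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
G = google_matrix(links, pages)

pi = [1.0 / len(pages)] * len(pages)
for _ in range(200):
    pi = power_step(pi, G)
print(dict(zip(pages, pi)))
```

Every row of G sums to 1, so the rank vector stays a probability distribution. A dense matrix is only practical for toy examples; real implementations exploit the sparsity of H.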
2.3 HITS
The HITS (Hypertext Induced Topic Search) algorithm emerged at about the same time as PageRank; HITS was
even a little earlier. It is still an important approach even though it wasn’t as successful as PageRank.
The idea behind HITS is similar to PageRank, but there are two big differences. First, a search is
executed, and the set of documents found is used to calculate the ranking. In contrast to PageRank, you do
not consider the whole web graph but only a certain sub graph. The second major difference is that there
are two ranking scores, the hub rank and the authority rank. Kleinberg defines a hub as a page which links
to many others, and an authority as a page with many incoming links. In addition, some pages are called
universally popular. These pages have many incoming links and almost no outgoing links.
Figure 6: shows hubs, authorities and universal populars
Source: Information Search and Retrieval at TU Graz, presented by Christian Gütl
The hub and authority values are propagated by every page to the next one, as in PageRank. This also leads
to a recursive formula, which can be calculated iteratively. One big challenge is to find the right sub
graph. The ideal sub graph would be small and consist of very good hubs and authorities. But this is not
always the case, so Kleinberg suggests looking at a bigger set than the one returned by the search. To this
end, pages which link into the initial set and pages which are linked from the initial set are also
considered relevant.
Figure 7: extending the root set to the base set
Source: (Authoritative Sources in a Hyperlinked Environment, 1999)
The figure from Kleinberg’s paper illustrates the idea behind the extension. The initial set is called the
root set and the expanded set is the base set. The base set is often also called the neighborhood graph.
The HITS algorithm has many pros and cons (see (Langville, 2012), chapter 11.5). One advantage is that it
provides two ranking scores. Because hub and authority values are distinguished, a user can decide whether
he wants to do a broader search or a more specific search. For specific searches he will prefer authority
pages, and for broad searches hub pages have advantages. The small subset for which the ranking has to be
computed can also be an advantage, but it is a weakness as well: a smaller subset is not very resistant
against spam. Finding the sub set is critical, too. How do you find the right sub graph? Building the
neighborhood graph often leads to off topic pages being included. Many researchers have written papers on
how to overcome these weaknesses, so HITS is also usable in real world scenarios. Monika Henzinger and
Krishna Bharat (see (Improved Algorithms for Topic Distillation in a Hyperlinked, 1999)) proposed a
solution to deal with spamming. There are also many versions of HITS which make the algorithm query
independent, so that it simply uses the whole set of pages and not only a sub set. Longzhuang Li, Yi Shang,
and Wei Zhang presented “Improvement of HITS-based Algorithms on Web Documents” at the 11th International
World Wide Web Conference. So there is a lot of research going on about HITS, and it is still used
(www.ask.com) even though it has a very strong competitor in PageRank.
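The two steps of HITS described above, extending the root set to the base set and iterating the hub and authority scores, can be sketched in Python. This is a simplified illustration under several assumptions: the link structure and page names are hypothetical, the number of pages added per root page is not limited, and the scores are normalized with the Euclidean norm after each step.

```python
import math


def base_set(root, links):
    """Extend the root set by pages it links to and pages linking into it."""
    base = set(root)
    for source, targets in links.items():
        if source in root:
            base.update(targets)              # pages linked from the root set
        if any(target in root for target in targets):
            base.add(source)                  # pages linking into the root set
    return base


def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, pages, iterations=50):
    """Iterate hub and authority scores on the sub graph induced by `pages`."""
    edges = [(s, t) for s, ts in links.items() for t in ts
             if s in pages and t in pages and s != t]
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        auth = {page: 0.0 for page in pages}
        for source, target in edges:
            auth[target] += hub[source]       # good hubs point to good authorities
        auth = normalize(auth)
        hub = {page: 0.0 for page in pages}
        for source, target in edges:
            hub[source] += auth[target]       # good authorities are linked by good hubs
        hub = normalize(hub)
    return hub, auth


# Hypothetical link structure; X and Y are unrelated to the query.
links = {
    "H1": ["A1", "A2"],
    "H2": ["A1", "A2"],
    "H3": ["A1"],
    "A1": [],
    "A2": [],
    "X": ["Y"],
}
base = base_set({"H1", "A1"}, links)
hub, auth = hits(links, base)
print(sorted(base), max(auth, key=auth.get))
```

In this toy example A1 becomes the best authority, and H1 and H2 are better hubs than H3 because they link to both authorities.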
2.4 SALSA
SALSA was introduced in 2000 by Ronny Lempel and Shlomo Moran, so it was invented after PageRank and HITS.
One idea was to combine the features of both. SALSA also distinguishes between hubs and authorities, but
the scores are calculated by a stochastic process in the form of a Markov chain. This is what it has in
common with PageRank. Like HITS, SALSA is query dependent and forms a neighborhood graph. This graph is
then transformed into a bipartite graph, with the hubs on one side and the authorities on the other. On
this graph SALSA performs the random walk. Because of its stochastic nature SALSA doesn’t suffer from the
problem HITS has of pulling in many off topic pages from the root set. It is also more robust against
spamming than HITS, although PageRank is still better in this category. Another advantage is that the
computation time is lower because only a sub graph is used. And like with HITS, the user can choose between
authority and hub results. The biggest drawback is again the query dependence. This should always be
considered if someone wants to use a query dependent ranking algorithm. It is possible to fix this problem
in the same way as for HITS: just calculate the scores for the whole graph. A very detailed description of
SALSA can be found in (Google's Pagerank and Beyond: The Science of Search Engine Rankings, 2012),
chapter 12.
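The random walk on the bipartite graph can be sketched for the authority side as follows. Starting from an authority, the walk first goes backward over a random incoming link to a hub and then forward over a random outgoing link of that hub to an authority. The code is a minimal illustration that assumes a connected toy graph with hypothetical page names; it is not the full algorithm from the paper, which also handles several connected components.

```python
def salsa_authorities(edges, iterations=100):
    """Authority scores from a walk that alternates backward and forward steps.

    `edges` is a list of (hub, authority) pairs of the bipartite graph.
    """
    in_links, out_links = {}, {}
    for hub, authority in edges:
        in_links.setdefault(authority, []).append(hub)
        out_links.setdefault(hub, []).append(authority)
    authorities = sorted(in_links)
    score = {a: 1.0 / len(authorities) for a in authorities}
    for _ in range(iterations):
        new_score = {a: 0.0 for a in authorities}
        for a in authorities:
            p_back = score[a] / len(in_links[a])           # step back to a hub
            for hub in in_links[a]:
                for a2 in out_links[hub]:                  # step forward to an authority
                    new_score[a2] += p_back / len(out_links[hub])
        score = new_score
    return score


# Hypothetical bipartite graph: three hubs, two authorities.
edges = [("H1", "A1"), ("H1", "A2"), ("H2", "A1"), ("H2", "A2"), ("H3", "A1")]
score = salsa_authorities(edges)
print(score)
```

On a connected graph the resulting authority scores are proportional to the in degree of the authorities: A1 has three of the five links and A2 has two.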
2.5 Summary about link based algorithms
Link based algorithms suit the structure of the web very well. Because the web is very heterogeneous, it is
hard to apply more specific approaches. Link based algorithms are also relatively easy to understand and
fast to compute, and if designed intelligently they scale very well. One thing is really different between
PageRank and the original HITS. PageRank computes a ranking over all pages, so you have a global ranking
which can be used to rank the search results. PageRank is therefore not query dependent. HITS, in contrast,
computes the ranking anew for every query. This is why HITS only has a local ranking. The remaining ideas
are very similar. Readers who want to go deeper into these topics are referred to the book (Google's
Pagerank and Beyond: The Science of Search Engine Rankings). It contains a lot of examples and background
information, as well as calculation examples that develop the algorithms step by step.
2.6 Non link based approaches
The most common techniques used by search engines are link based approaches influenced by other factors.
These other factors aren’t very clear, because every search engine keeps its real strategies a big secret.
But besides the link based strategies there are some other approaches too, which are briefly introduced
here.
2.6.1 Rank aggregation
Because the search results of different search engines differ considerably, a new idea called Rank
Aggregation came up. A research study by Dogpile.com in collaboration with Queensland University of
Technology and Pennsylvania State University (see (Different Engines, Different Results, 2007)) quantifies
the differences. The following table has the details:
Figure 8: unique results of a search engine in the top results
Source: (Different Engines, Different Results, 2007)
As you can see, the majority of the top search results differ from search engine to search engine. The
rank aggregation approach tries to use these rankings and combine them into a new one. Meta search engines
use this concept. There is also a lot of research going on in this area, because it is not totally clear
how to combine the different results in the best way.
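One simple way to combine several rankings is a Borda count, in which every result earns points according to its position in each list. The following Python sketch uses this technique only as an illustration; the source does not say which method meta search engines actually use, and the result lists are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine several ranked result lists with a Borda count."""
    candidates = {url for ranking in rankings for url in ranking}
    n = len(candidates)
    scores = {url: 0 for url in candidates}
    for ranking in rankings:
        for position, url in enumerate(ranking):
            scores[url] += n - position   # the top position earns the most points
        # results missing from a list earn no points from that list
    # Highest score first, ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical top results of three search engines for the same query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]
engine_c = ["b.example", "a.example", "e.example"]
combined = borda_aggregate([engine_a, engine_b, engine_c])
print(combined)
```

Results that appear near the top of several lists end up first, while results found by only one engine move to the end of the combined list.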
2.6.2 Traffic Rank
Another approach is to rank web pages by their traffic. The simple and efficient idea is that the page
with the most traffic is the most important. For instance, Alexa computes a ranking based on the traffic
which occurs on a page. This ranking can be viewed under the following URL: http://www.alexa.com/topsites.
There are different lists, summarized for countries, categories etc. It is infeasible to measure the exact
traffic of a web page, therefore Alexa developed its toolbar. This toolbar sends information about the
surfing behavior to Alexa. From this data Alexa computes an estimate of how high the traffic is. You can
read more about Alexa in chapter 3.2.
2.6.3 Summary non link based approaches
Most of these techniques are very young, and there are several more than the two presented here. It is
hard to say what will be the next big hit. Will link based ranking always be the best method, or will new
strategies come up which supersede PageRank, HITS and co? Chapter 5 tries to answer these kinds of
questions.
3 Web search engines / Web search services
The following chapter gives an overview of one of the biggest web search engines, Google, and its measures
to stay on top. Furthermore, other web search services like Alexa and DMOZ are briefly introduced.
3.1 Google
Today Google is the most powerful and most used web search engine in the world. Google answers more than
one billion questions from people around the globe in 181 countries and 146 languages. (www.google.com,
2012) It records the most traffic of all search sites and even has its own dictionary entry (Grappone &
Couzin, 2011). But why is Google so powerful?
Figure 9: Basic information about Google
Source: (Grappone & Couzin, 2011)
Dana Blankenhorn from www.smartplanet.com said that the Google story isn’t about media or marketing, or
young engineers. It’s all about reducing the cost of doing business online, as with the big online store
Amazon (Blankenhorn, 2009). But this cannot be the only reason.
Besides web search, Google also offers different services like email, maps, a calendar, online document
sharing, video, images and many more, which helps to keep Google on everyone's lips (Grappone & Couzin,
2011). But Google doesn’t concentrate only on such services; it has also worked continuously on new search
functions to make sure that users easily find the requested information and stay on its site. The
following list shows some of the newer search functions from Google:
• Instant Search
• Flight Search
• Handwrite Search for devices with touch screen
• Search by image
• Voice search
• Knowledge graph
• Related Search Previews
Another reason why Google became so powerful is its ranking. Nobody in the public knows exactly how it
works, but it is based on the Google PageRank, which is explained in chapter 2.2, and several other
algorithms (Singhal, 2009). Amit Singhal wrote in the Google blog that there are three philosophies behind
the Google ranking:
1. Best locally relevant results served globally.
2. Keep it simple.
3. No manual intervention.
So we do not know a lot about the ranking, but Google said that there are 200 factors which are definitely
important. The following list shows some of them:
• Domain – age, top level domain, sub or root domain, domain history, keyword in domain
• Server – geographical location, availability
• Architecture – URL and HTML structure, external CSS / JS, valid HTML code, cookies
• Content – language, amount of information, uniqueness, timeliness, spelling
• Website – age, number of pages, XML sitemap, on page trust, style
• Keywords – in alt tags, in title, at the beginning of continuous text, in URL
• Outgoing links – number per domain / site
• Backlink profile – relevance and quality of linked websites
• Users – location, quantity
3.2 Alexa
Brewster Kahle and Bruce Gilliat founded Alexa in April 1996 and named it after the Library of Alexandria.
The company crawls all publicly available websites to create a series of snapshots of the web. The data
collected over time is used to create features and services like:
• Site Info – traffic ranks, search analytics, demographics
• Related Links – sites similar or relevant to the one the user is currently viewing
Alexa gathers approximately 1.6 terabytes of web content per day. For each snapshot of the web it collects
4.5 billion pages from over 16 million sites.
Figure 10: Traffic rank of www.google.at
Source: (www.Alexa.com)
But Alexa doesn’t only crawl the Internet; it also gathers web usage information. To get this information
it developed a toolbar for nearly every major browser. Every user who has this toolbar installed sends
information about the web, how it is used, and what is important and what isn’t, to the community, where
it is processed for the services Alexa offers. (www.Alexa.com)
3.3 DMOZ
The DMOZ Open Directory Project is the largest human edited directory of the Web. It is developed and
maintained by a global community of volunteer editors. DMOZ has nearly 5.2 million sites and 97,000
editors for over 1 million categories. (dmoz - open directory project)
Figure 11: Homepage www.dmoz.org
Source: (dmoz - open directory project)
4 SEO
Why SEO? This chapter gives some insights into Search Engine Optimization (SEO) and clarifies what exactly
SEO is. When speaking about SEO, or rather about the goal one wants to achieve by performing SEO, people
probably just think of improving a page's rank on search engines such as Google. But in fact, the term
Search Engine Optimization describes an entire set of activities which may be performed to increase the
number of visitors who find a website through a particular search engine. These activities include
techniques and methods that are applied not only to the HTML code of a website, but also to its text, that
is, the website's content itself.
Before thinking about how to optimize a website for several search engines, one should answer the question
"Which function does the website serve?" or rather "Does the website serve a function at all?" It is not
uncommon that companies build a website just for the sake of having one. But even in this case a site
serves some function, such as an online store or product portfolio, a personal blog, some kind of news
service, company information, or any other function one may think of. Becoming aware of a website's
functions is the first step towards optimizing it, so that one is able to answer the second important
question "What do I want a visitor to do on my website?" Now the Search Engine Optimization may begin.
(Grappone & Couzin, 2011)
Before coming to the techniques and methods of SEO, this article covers the basics of search engines.
Basically, the results of a query are divided into two types, the so called "organic listings" and "paid
search adwords". Because no SEO at all is necessary to list a website within paid search adwords, this
article concentrates on the first type of results.
Figure 12: Comparison between the search engines Google and Yahoo showing results for the query
"Skifahren". Results originating from paid search adwords are shaded in color or listed separately.
Source: (www.google.com, 2012) (www.yahoo.de)
4.1 Techniques and Methods
Probably the most important fact about SEO is that the text in a website's content plays the biggest role
of all. Because web search is mostly text based, search engines always consider how much text there is on a
webpage, even if users search for multimedia content. How the text is formatted and what it says are
crucial for the result. This simple fact hasn't changed since the beginning of search engines on the web.
Every webpage contains text that is visible to the user and text that is invisible. Invisible are, for
example, alt-tags of images or title-tags of hyperlinks, and not to forget the meta-tags of a webpage.
Search engines consider all these parts for the ranking of results, as will be explained soon.
Robots and Spiders. Agents, more specifically the so called robots or spiders, continuously search the
text of websites. To help those agents with their search, and thereby optimize the website for search
engines, there is the chance to communicate with the robots by using the tags just mentioned to include
invisible text in the page. But tags which mark up visible text are important to these robots as well.
They let them know which parts of the text are more important than others, as you can see in Figure 13.
Figure 13: The first organic search result for "skifahren in der steiermark" on Google on the left and the
corresponding website www.steiermark.at on the right
Source: (www.google.com, 2012)
Google prints the text snippets which led to the result in bold. In Figure 13 one can easily see that all
parts of the search query show up in the web address, in the HTML title-tag and in a piece of text which is
marked by a strong-tag. The lesson is clear: to gain the best possible result for specific keywords on a
specific page, they should appear in both the title-tag and the text on the page. Those two elements need
to work together.
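The check described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names `TitleAndTextParser` and `keyword_coverage` are hypothetical, matching is done by plain substring search, and nothing here reflects how Google actually evaluates a page.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and all remaining text of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.body_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text inside <title> goes to the title, everything else to the body.
        if self._in_title:
            self.title += data
        else:
            self.body_text += " " + data


def keyword_coverage(html, query):
    """For each query word, report whether it occurs in the title and in the text."""
    parser = TitleAndTextParser()
    parser.feed(html)
    title = parser.title.lower()
    body = parser.body_text.lower()
    return {
        word: {"title": word in title, "text": word in body}
        for word in query.lower().split()
    }
```

A keyword that is reported as missing from either the title or the text is a candidate for revising the page.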
Myth Meta Tags
A typical webpage contains so-called meta tags, which belong to the invisible text of a webpage. This
paragraph sheds some light on how important the meta description tag and the meta keywords tag are for the
ranking in search results. The meta description tag contains information which describes the website and
can be displayed in the search results right beneath the link to a specific page.
<meta name="description" content="Aflenz Outdoorpark - der steirische Bewegungspark Aflenz Bürgeralm - das Naturschneeparadies">
<meta name="description" content="Eines der größten Skigebiete Österreichs im Herzen der Region Schladming-Dachstein und Austragungsort der Alpinen Ski WM 2013.">
Figure 14: Google search results showing the meta description right below the link to the website on the left
and the corresponding HTML meta description tags of these pages on the right. Words occurring in both the
search query and the meta description tag are printed in bold in the result.
Figure 14 shows how strongly the meta description tag influences the search result and the information
displayed to the user. On the other hand, there is also a so-called meta keywords tag in the HTML head. It
gives the website owner the opportunity to simply enter a list of keywords without showing them in the
visible text. But compared with the meta description tag, the keywords "carry little or no weight in search
engine rankings." (Grappone & Couzin, 2011) This can easily be tested by entering all the keywords of a
particular website into a search query. If these words do not occur in the rest of the website's text, the
page will probably not be displayed near the top of the search results at all.
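To inspect which description and keywords a page declares, the meta tags can be read out with Python's standard HTML parser. The following is a minimal sketch; `MetaTagParser` and `extract_meta` are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the name/content pairs of all <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Only keep tags that carry both a name and a content attribute.
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict such as {'description': ..., 'keywords': ...}."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta
```

The extracted keyword list can then be entered into a search engine by hand, as described above, to see whether the page shows up.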
4.1.1 Google specifics
This chapter deals with Google-specific optimizations. Some new optimizations are shown here, but Google
also suggests using those that have already been explained.
Keywords in URLs
Besides using unique title and description tags on every page, Google suggests improving URLs in order to
gain a better ranking. Many content management systems use URLs with the id number of the corresponding
article instead of words. This is not only hard for the user to work with, but also bad for the ranking, as
the URL of a document is an important part of the search result. The best practice for optimizing a URL is
to use several words in it which are relevant to both the site's content and its structure. It can also help
if the structure of the menu or of the directory is reflected in the URL. An example could be:
“http://www.domain.com/stories/2012/keyword1-keyword2-...-.html”
Figure 15: Websites with "search engine friendly" URLs gain better search results.
Source: (www.google.com, 2012)
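How a content management system could derive such a URL from an article title is sketched below. The functions `slugify` and `friendly_url` are hypothetical helpers for this illustration, and the path layout simply follows the example URL given above. Note that the ASCII conversion is lossy: umlauts lose their dots and characters such as "ß" are dropped.

```python
import re
import unicodedata


def slugify(title):
    """Turn an article title into lowercase words joined by hyphens."""
    # Decompose accented characters, then drop everything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words)


def friendly_url(base, section, year, title):
    """Build a URL that reflects the site structure and the article keywords."""
    return f"{base}/{section}/{year}/{slugify(title)}.html"
```

For instance, `friendly_url("http://www.domain.com", "stories", 2012, "Keyword1 Keyword2")` yields a URL of the recommended form instead of one that only contains an article id.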
A single URL for each Webpage
The fact that Google puts great value on the text in URLs could lead to the assumption that the same page
should be linked under a bunch of different URLs in order to gain the best search result. In truth, this
leads to "Duplicate Content", which is strongly discouraged as a search engine optimization technique and
explicitly unwanted by Google (more on that in the next chapter). What can be done instead, if it is
absolutely necessary to have multiple URLs for the same target, is to set up a 301 redirect from the
additional URLs to the one target page. This way, Google knows for a fact that there is only a single page
containing this specific content.
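The redirect logic can be sketched independently of any particular web server. In the following illustration the table `CANONICAL`, its example paths and the function `resolve` are assumptions made for this sketch; a real site would configure the same mapping in its web server or framework.

```python
# Hypothetical mapping from alternative paths to the single canonical path.
CANONICAL = {
    "/ski": "/stories/2012/skifahren-steiermark.html",
    "/index.php?id=42": "/stories/2012/skifahren-steiermark.html",
}


def resolve(path):
    """Return (status code, response headers) for a requested path."""
    target = CANONICAL.get(path)
    if target is not None:
        # 301 "Moved Permanently" tells crawlers that only the target counts.
        return 301, {"Location": target}
    # The path is already canonical (or unknown to the table): serve it directly.
    return 200, {}
```

All alternative addresses thus answer with a permanent redirect, and only the canonical address delivers content.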
Navigation on a website
The navigation, particularly the structure of the website, is very important not only to visitors but also
to search engines. Navigation should be planned so that every page is accessible by following hyperlinks
from the home page. There should be as few different ways as possible to reach a specific page, in order to
make navigation clearer to the user. Besides that, Google also suggests offering breadcrumb navigation.
Sitemaps
There should be two sitemaps, one for the user and one for the search engine. A sitemap for users presents
every accessible page in a simple overview. But analyzing this user-specific sitemap is not always optimal
for search engines, so Google suggests creating an XML-based sitemap which is meant to be at the search
engines' disposal only.
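A minimal XML sitemap can be generated with Python's standard library. The sketch below assumes the widely used sitemaps.org format with one `url`/`loc` entry per page; `build_sitemap` is a hypothetical helper, and optional fields such as the last modification date are left out.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return an XML sitemap string listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The resulting string would typically be saved as a file such as sitemap.xml in the root directory of the website.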
4.2 Problems
SEO can become harmful if people try to push their rankings. Trying to achieve undeserved ranks is called
spamming. If search engines discover that a site is spamming, even if not on purpose, the site's rank may
be downgraded or the site may even be banned.
4.2.1 Problems for search engines
Cloaking
The name cloaking describes a technique where a website directs robots to so-called doorway pages instead
of showing them the pages human visitors get to see. This way, pages can show significantly different
content to search engine robots. This keeps the search engine from indexing the site correctly and from
giving users accurate results for their search queries.
Duplicate Content
If the designer of a website produces duplicate content, in other words the same content twice or even
several times, the search engine is no longer able to distinguish these pages, which renders the search
result useless.
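A website owner can look for exact duplicates on his own site by comparing fingerprints of the page texts. The following sketch is an assumption made for illustration only: it detects texts that are identical apart from case and whitespace, whereas search engines presumably use more tolerant near-duplicate detection. The names `fingerprint` and `find_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose page text has the same fingerprint.

    `pages` maps a URL to the text of the page found there.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Every returned group is a set of URLs that should be merged or redirected to a single address.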
Keyword Stuffing
Adding important keywords to a page over and over again, not in a meaningful way but just to mention the
same words as often as possible, is penalized by search engines and can cause a bad ranking for the
particular page.
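One simple way to spot possible keyword stuffing is to measure the share of a keyword among all words of a page. The sketch below is only an illustration; `keyword_density` is a hypothetical helper, and which value a search engine would regard as too high is not stated in the sources used here.

```python
import re
from collections import Counter


def keyword_density(text, keyword):
    """Return the fraction of words in `text` that equal `keyword`."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)
```

A value close to 1 means that the text consists of little else than the keyword, which is a strong hint that it was not written for human readers.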
Invisible Text
If text has the same color as the background it is on, so that it can be read only by a search engine and
not by a human user, the page ranking is no longer accurate. This causes the same bad effects as keyword
stuffing or cloaking.
5 New trends
Over the last years the amount of information in the World Wide Web has increased strikingly. Innovations
like social networks and smartphones are among the main reasons for this growth. But the more information
is available, the more difficult searching becomes, and so some projects have arisen which use the new
technology to improve search.
One project is “PeerSpective”, a social network based web search. A group of scientists from the Max
Planck Institute for Software Systems in Germany tried to integrate information from a social network into
web search, because they think that social network links can be important for increasing the quality of
search results (Mislove, Gummadi, & Druschel). They built their own network of ten people who shared the
content they downloaded and viewed with one another. When a search was started, the query was sent to every
user in the network and to Google. Every user had his own proxy, which executed the query on the locally
indexed pages and sent the results back to the sender. These results were displayed right next to the
Google results, as shown in Figure 16.
Figure 16: Result of PeerSpective
Source: (Mislove, Gummadi, & Druschel)
To rank the results from the network, they took the ranking score of a Lucene text search engine,
multiplied it by the Google PageRank and added up the scores from all users who had viewed the result. With
this technique the search takes advantage of both the hyperlinks of the Web and the social links of the
network. (Mislove, Gummadi, & Druschel)
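The scoring rule described above can be written down as a small sketch. It follows only the description given here (Lucene score times PageRank, summed over all users who viewed the page); the function names, the data layout and any example values are assumptions for this illustration and are not taken from the PeerSpective implementation.

```python
def peerspective_score(pagerank, lucene_scores_by_user):
    """Sum Lucene score times PageRank over all users who viewed the page."""
    return sum(score * pagerank for score in lucene_scores_by_user.values())


def rank_results(results):
    """Order URLs by descending combined score.

    `results` maps a URL to a pair (pagerank, {user: lucene_score}).
    """
    scored = {
        url: peerspective_score(pagerank, scores)
        for url, (pagerank, scores) in results.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

A page that was viewed by several users in the network collects one summand per user and therefore moves up in the list, which is how the social links enter the ranking.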
Social media sites have by now gained an enormous influence on rankings. Postings on e.g. Facebook,
Twitter or Google+ correlate strongly with the search results for the websites these posts mention. Social
media is therefore very helpful for gaining good search results. This has to do with establishing backlinks
which lead users from the social platform back to a website. In the case of Facebook, if users "like" or
"comment" on such a posting or even share it with other users, the backlinks take care of the rest.
Speaking of links, it is nowadays considered useful to have links which contain not just keywords but also
stopwords, in order to obtain a natural-language hyperlink. Search engines recognize and appreciate this
kind of user friendliness.
But one important point must not be ignored: having the keyword in the domain still puts the website on
rank one, regardless of all other factors. Besides links, there is another important point regarding a
website's content: having too many advertisements on a website can decrease the ranking, as it is
considered spamming.
Figure 17: Spearman's rank correlation describes how the named factors influence search results.
Source: http://www.searchmetrics.com/de/services/whitepaper/seo-ranking-faktoren-deutschland/
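For readers unfamiliar with the measure named in Figure 17, Spearman's rank correlation can be computed as sketched below. The sketch uses the textbook formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid only when no value occurs twice; the function names are hypothetical, and how the cited study computed its values is not described here.

```python
def ranks(values):
    """Return the rank (1 = smallest) of every value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient for two tie-free samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    # d is the difference between the two ranks of each observation.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A value of 1 means that both rankings agree completely, a value of -1 that they are exactly reversed, and values near 0 that there is no monotonic relationship between the factor and the position in the search results.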
As mentioned before, keywords in h1- and title-tags are important for reaching good search results. But in
the case of Google it is interesting to note that the keywords from the search query do not always appear
completely within the titles of the first search results; for those, the content matters most. Regardless
of these facts and those about social media, having a website registered on Google+ obviously leads to
better search results, at least for the moment. As one can see, a list of related results from Google+
appears right below the first few search results.
Another project called “Geooreka” tries to improve search results with geographical information. Geooreka
is a web search engine integrated with a Geographic Information System (GIS) database, which allows
searching for web documents that refer to an area visually configured by the user by means of a map
(Buscaldi & Rosso, 2009). The architecture of Geooreka is quite simple, as can be seen in Figure 18. The
user selects an area and adds a search theme. Then all toponyms which are relevant for the chosen zoom
level of the map are extracted. After that, the web counts and common information are used to find the
optimal results. To speed up the whole process, the web counts are calculated from the Google Web1T and the
search for the theme and toponym combinations is processed by Yahoo!. (Buscaldi & Rosso, 2009)
Figure 18: Architecture of Geooreka
Source: (Buscaldi & Rosso, 2009)
Another trend in web searching is to show the answer directly instead of the site where it can be found.
Such engines are called answer engines, and the most popular one is “Wolfram Alpha”, which was developed
by Wolfram Research in 2009. The users of Wolfram Alpha submit their question via a text field. The engine
then computes answers and relevant visualizations based on a big internal database, on external data
sources like Facebook, CrunchBase and Best Buy, and on the functional principle of cellular automata. The
only disadvantage is that the user has to conform to special rules for the input so that the engine can
handle the query. (Wolfram Alpha - Wikipedia, the free encyclopedia, 2012)
Figure 19: Homepage www.wolframalpha.com
Source: (Wolfram Alpha)
6 Conclusion
The diversity of the web and the vast amount of information demand good search services. Besides
searching, the ranking of results is becoming very important. The simple approaches of link based ranking
strategies still form the base of many search services. But it is clear that these factors are not the only
ones. What else is considered for ranking is still one of the biggest secrets of Google and Co. Many studies
suggest that social signals are becoming more and more important, and “location based result ranking” is
also very trendy. Because of the large number of smartphones, this trend will surely increase in the next
years. Ranking strategies permanently evolve and develop to fit current requirements, and SEO practitioners
have to do the same: they always need to adapt to new algorithms and approaches. The graphic by gShiftLabs
shows an estimation of what is important for a web presence to be found well.
Figure 20: Hierarchy of Web Presence Optimization
Source: http://searchenginewatch.com/article/2228256/10-SEO-Truths-of-2012-for-Agencies-In-House-Teams
As you can see, gShift defines a pyramid for successful SEO in which the old and well-known topics still
form the base. On this base you can build up the more advanced techniques and react to new trends. This
seems reasonable, because ranking strategies do not change totally. Best practices and good experience with
different ranking factors will always be a part of ranking strategies. So you do not need to change your
SEO strategy totally only because there is some new hype out there on the web.
It will be interesting to see what the future of search and ranking will be. There will definitely be more
crowd-based factors and more location-aware factors. As applications like Google Now, Apple’s Siri and
Wolfram Alpha show, more and more semantic search tools will emerge. The future is hard to predict, because
many people are thinking about creative ideas for ranking search results. Because of the many possibilities
there are for defining a rank, no one can say what will be the next big thing.
7 Appendix
7.1 References
Blankenhorn, D. (2009, 10 9). Smartplanet. Retrieved 12 9, 2012, from
http://www.smartplanet.com/blog/thinking-tech/what-makes-google-powerful/1749
Brin, S., Page, L., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to
the Web. Stanford InfoLab .
Buscaldi, & Rosso. (2009). Geooreka: Enhancing Web Searches with Geographical Information. Valencia,
Spain: Universidad Politécnica.
dmoz - open directory project. (n.d.). Retrieved 12 9, 2012, from http://www.dmoz.org/
Fortunato, S., Boguñá, M., Flammini, A., & Menczer, F. (2008). Approximating PageRank from In-Degree.
Springer-Verlag Berlin Heidelberg.
Grappone, J., & Couzin, G. (2011). Search Engine Optimization (SEO): An Hour a Day. John Wiley &
Sons.
Henzinger, M., & Bharat, K. (1999). Improved Algorithms for Topic Distillation in a Hyperlinked.
Kleinberg, J. M. (1999). Authoritative Sources in a Hyperlinked Environment.
Langville, A. N. (2012). Google's Pagerank and Beyond: The Science of Search Engine Rankings. Princeton
University Press.
Mislove, A., Gummadi, K. P., & Druschel, P. (n.d.). Exploiting Social Networks for Internet Search. Max
Planck Institute for Software Systems & Rice University .
Moskwa, S. (2011). Beyond PageRank: Graduating to actionable metrics. From Google Webmaster Central.
Singhal, A. (2009, 6 9). GoogleBlog. Retrieved 12 9, 2012, from
http://googleblog.blogspot.co.at/2008/07/introduction-to-google-ranking.html
University, D. i. (2007). Different Engines, Different Results.
Upstill, T., Craswell, N., & Hawking, D. (2003). Predicting Fame and Fortune: PageRank or Indegree?
Department of Computer Science, CSIT Building, ANU Canberra .
Wolfram Alpha. (n.d.). Retrieved 12 9, 2012, from http://www.wolframalpha.com
Wolfram Alpha - Wikipedia, the free encyclopedia. (2012, 12 05). Retrieved 12 08, 2012, from Wikipedia,
the free encyclopedia: http://en.wikipedia.org/wiki/Wolfram_Alpha
www.Alexa.com. (n.d.). Retrieved 12 9, 2012, from http://www.alexa.com/company/technology
www.google.com. (2012). Retrieved 12 9, 2012, from
http://www.google.com/competition/howgooglesearchworks.html
www.yahoo.de. (n.d.). Retrieved 12 9, 2012, from http://www.yahoo.de
7.2 List of figures
Figure 1: Forward links and back links. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
Figure 2: Snapshot of page rank calculation
Figure 3: Infinite loop
Figure 4: Convergence rate for half size and full size link database
Figure 5: Description of the Google Matrix
Figure 6: Hubs, authorities and universal populars
Figure 7: Extending the root set to the base set
Figure 8: Unique results of a search engine in the top results
Figure 9: Basic information about Google
Figure 10: Traffic rank of www.google.at
Figure 11: Homepage www.dmoz.org
Figure 12: Comparison between the search engines Google and Yahoo showing results to the query "Skifahren". Results originating from paid search (AdWords) are shaded in color or listed separately.
Figure 13: The first organic Google search result for "skifahren in der steiermark" on the left and the corresponding website www.steiermark.at on the right
Figure 14: Google search results showing the meta description right below the link to the website on the left and the corresponding HTML meta description tags of these pages on the right. Words occurring in both the search query and the meta description tag are printed in bold in the result.
Figure 15: Websites with "search engine friendly" URLs gain better search results.
Figure 16: Result of PeerSpective
Figure 17: Spearman's rank correlation describes how the named factors influence search results.
Figure 18: Architecture of Geooreka
Figure 19: Homepage www.wolframalpha.com
Figure 20: Hierarchy of Web Presence Optimization