Cybermetrics

Theory and practice
Isidro F. Aguillo
Version 2 (Nov’11)
isidro.aguillo@cchs.csic.es
Presentation: Isidro F. Aguillo

Current position
Head, Cybermetrics Lab
Spanish National Research Council (CSIC)

Background
MSc. Biology (Univ. Complutense, Madrid)
MID (Univ. Carlos III, Madrid)
DEA (Univ. Granada)
Doctor Honoris Causa (Univ. Indonesia)
Research topics & other working activities





Rankings Portal: webometrics.info
Research projects: QEAVIS (e-humanities), MAVIR
(multilingual Web), CARTO (R&D cartography),
ICYTnet (Virtual Libraries)
EU funded projects: ACUMEN (indicators portfolio
for individuals), OpenAIRE (EU central repository),
WISER (cybermetrics), EICSTES (R&D web
indicators), PEKING (knowledge management),
IMPACT-INFO2000 (information society)
Founder and editor of the e-journal “Cybermetrics”
300 seminars and conferences in over 100
universities from all over the World
2
Agenda

I. Descriptive Cybermetrics
Methods and tools
Web indicators

II. Applied Webometrics
Positioning in search engines
Optimising web contents

III. Usagemetrics
Log files and visits analysis
Popularity
3
MODULE 1
Descriptive Cybermetrics
Web Analysis
See also: Usability
Accessibility
Web Metrics
Definition

Cybermetrics is the discipline dedicated to the
quantitative description of the contents and
communication processes that take place
in cyberspace


Cyberspace is the set of contents accessible in electronic
format. Because the Internet is universally accessible, the
term is used as a synonym for the Internet of contents:
basically, but not exclusively, the webspace
Since cyber-scientometrics is the most developed sub-field,
for practical reasons it is referred to by the more
general term Cybermetrics or the more specific term
Webometrics
5
Quantitative disciplines
informetrics
bibliometrics
scientometrics
Cyberscientometrics
webometrics
cybermetrics
Adapted from Björneborn
6
Relationships
Scientific policy
Research management
Scientific documentation
Libraries
Services for
Research in
Economy
Sociology of science
applied
Librarianship and
Documentation
History of science
Scientometrics
basic
Informetrics
Life sciences
Webometrics
Mathematics/Physics
Other sciences/Humanities
www.ulb.ac.be/unica/docs/Sch-com-2004-pres-Glanzel.ppt
7
Advantages of the quantitative approach

Web presence reflects the activities of an institution or
individual more fully than traditional publications on paper
In academia, professors, researchers and students put
unpublished material, first drafts, preliminary versions of
papers, course materials, presentation slides or databases
on the Web

The Web reaches a greater audience than other
traditional scientific communication media
Scientific journals have a restricted distribution

The hypertext nature of the Web makes it possible
to discover hidden patterns between
different institutional sites
Academic sites link to other sites with a marked economic,
industrial, cultural, political or social character
8
New application areas

Webometrics

Topology of hypertextual networks
Social networks
PageRank, HITS

Comparative analysis of search engines



Cyberscientometrics





Studies of electronic mails and forums
“Big Science” & Grid
Cybergeography and cyberdemography
New units: institutional Web sites
New indicators


Visibility
Popularity
9
Cybergeography, cyberdemography

Data and sources






Internet Geography Project
www.zooknic.com
Cybergeography
www.cybergeography.org
Clickz Surveys
www.clickz.com/stats
Blog
www.internetworldstats.com/blog.htm
Demography and Geography of the Internet
www.sociosite.org/demography.php
www.sociosite.net/topics/webgeography.php
Internet Demographics Directory
internet-demographics.netfirms.com
10
Cyberdemography (I)
www.internetworldstats.com/stats.htm
11
Cyberdemography (II)
12
Cyberdemography (III)
www.internetworldstats.com/stats7.htm
13
Size of Internet: Infrastructures

Hosts
Lottor (World) www.isc.org/ds
RIPE (Europe) www.ripe.net/info/stats/hostcount/
Asia Web Watch www.ciolek.com/Asia-Web-Watch/main-page.html

Servers
Netcraft www.netcraft.com

Domains
World www.norid.no/domenenavnbaser/domreg.html
Domain worldwide www.domainworldwide.com
www.verisign.com/Resources/Naming_Services_Resources/Domain_Name_Industry_Brief/
Germany (and others) www.denic.de/en/domains/statistiken
Studies (outdated) www.zooknic.com

14
Internet evolution (Lottor)
15
Lottor
http://ftp.isc.org/www/survey/reports/2011/01/bynum.txt
16
Web servers
http://news.netcraft.com/archives/web_server_survey.html
17
Web contents


Webspace

Spireproject

Present day
Deposits



10.000 millions (10/02)
spireproject.com/art13.htm
40+40.000 millions
Archive
Google Cache
www.archive.org
www.google.com
Traffic

80% of browser sessions on the Web involve the use of
a search engine or a directory. Yahoo and, especially, Google
are the most important intermediaries
18
Wayback Machine
19
The problem with the gTLD

gTLD





First ones: .com, .org, .net, .int (.eu.int)
New ones: .biz, .info, .name, .aero, .coop, .museum, .eu, .cat
De facto: .cx, .tv, .cc
Special cases: .edu
Experiments

Google/Bing/Exalead




Filter operator “site:” Problems with some ccTLD
Domains and countries
International domains (gTLD)
IP translators




IP Locator 1.41
AW IP Locator 1.8
IP Address Locator
Ip2location
www.atelierweb.com/iploc
www.geobytes.com/IpLocator.htm?GetLocation
www.ip2location.com/free.asp
20
Google: Languages and countries
21
Mentions
22
Academic Webspace

Sites

Institutional domains




OCLC Web Characterization (1998-2002)
http://www.oclc.org/research/projects/archive/wcp/
Sites and institutional sites
Netcraft October 2011
• 500 million web sites
• Active (50%) * (5-10 institutional sites per site) ~ 2,000 million
institutional sites
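The estimate above is plain arithmetic and can be checked with a short sketch. The figures are the ones quoted on the slide; the variable names are illustrative only.

```python
# Figures quoted on the slide (Netcraft, October 2011).
total_sites = 500_000_000          # web sites counted by Netcraft
active_sites = total_sites // 2    # about 50% are considered active

# The slide assumes 5 to 10 institutional sites per active site.
low = active_sites * 5
high = active_sites * 10

print(f"Between {low:,} and {high:,} institutional sites")
```

The slide's rounded figure of about 2,000 million falls inside this range.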
Academic webspace

Academic subdomains

Not every country
23
Academic subdomains
ac.ae
ac.at
ac.bd
ac.be
ac.bw
ac.by
ac.ci
ac.cn
ac.cr
ac.cy
ac.fj
ac.gg
ac.gs
ac.id
ac.il
ac.im
ac.in
ac.ir
ac.je
ac.jp
ac.ke
ac.kr
ac.lk
ac.lv
ac.ma
ac.mu
ac.mz
ac.nz
ac.pa
ac.pg
ac.pl
ac.ru
ac.rw
ac.se
ac.sg
ac.sz
ac.th
ac.tz
ac.ug
ac.uk
ac.uz
ac.vn
ac.yu
ac.za
ac.zm
ac.zw
acad.bg
edu.al
edu.am
edu.ar
edu.au
edu.az
edu.ba
edu.bb
edu.bh
edu.bm
edu.bn
edu.bo
edu.br
edu.bs
edu.bt
edu.by
edu.bz
edu.ck
edu.cn
edu.co
edu.cu
edu.dm
edu.do
edu.dz
edu.ec
edu.ee
edu.eg
edu.gd
edu.ge
edu.gh
edu.gr
edu.gs
edu.gt
edu.gu
edu.hk
edu.hn
edu.hu
edu.jm
edu.jo
edu.kg
edu.kh
edu.kn
edu.kw
edu.ky
edu.kz
edu.lb
edu.lc
edu.li
edu.lv
edu.mk
edu.mm
edu.mn
edu.mo
edu.mp
edu.mt
edu.mx
edu.my
edu.na
edu.nf
edu.ng
edu.ni
edu.np
edu.om
edu.pa
edu.pe
edu.ph
edu.pk
edu.pl
edu.pr
edu.pt
edu.py
edu.qa
edu.ru
edu.sa
edu.sg
edu.sh
edu.st
edu.sv
edu.to
edu.tr
edu.tt
edu.tw
edu.ua
edu.uy
edu.ve
edu.vg
edu.vn
edu.ws
edu.ye
edu.yu
edu.za
edu.zm
24
Academic databases

Public Web
Google Scholar scholar.google.com
Publish or Perish www.harzing.com/pop.htm
Citations Gadget code.google.com/p/citations-gadget/
MS Academic Search academic.research.microsoft.com
Scirus www.scirus.com
CiteSeerX citeseerx.ist.psu.edu
Citebase www.citebase.org
Paracite paracite.eprints.org
DBLP dblp.uni-trier.de
ScienceDirect www.sciencedirect.com
(US) Science Gov www.science.gov
In-extenso www.in-extenso.org
25
Context
Public Web
Private Web
Invisible Internet
Databases
Visible Web
Repositories
Electronic
journals
26
Google Scholar
27
Scholar (II)
Papers in university
domains
(January ’07)
28
Scholar: Publish or Perish
29
Google Scholar Citations (testing)
30
Microsoft Academic Search
31
MAS Author entry
32
MAS Institution entry
33
MAS Comparing institutions
34
CiteSeerX
35
Rich files and media files

Rich files

Definition and types



Size



Filter operators: filetype (Google, Live, Exalead)
Media files
Definition and types


Adobe Acrobat (pdf) and Postscript (ps)
MS Office: Word (doc, rtf), Excel (xls), Powerpoint (ppt)
FilExt
www.filext.com
Localization in search engines



Terms
Filter operators
Autonomous databases
36
Google (filetype)
37
Bing (filetype)
38
Images in search engines
39
Languages on the Net

Sources and studies

Users according to language



Global Reach
global-reach.biz/globstats/index.php3
Composition of the webspace
Experiments with search engines





Google
Yahoo!
Bing (ex-Live) Search
Ask (Teoma)
Copernic
40
Users according to language
http://www.glreach.com/globstats/index.php3
41
Languages on the Net
Languages used to access Google
www.google.com/press/zeitgeist.html
42
Languages (Google)
Language: <lr> value (code)
Arabic: lang_ar
Chinese (S): lang_zh-CN
Chinese (T): lang_zh-TW
Czech: lang_cs
Danish: lang_da
Dutch: lang_nl
English: lang_en
Estonian: lang_et
Finnish: lang_fi
French: lang_fr
German: lang_de
Greek: lang_el
Hebrew: lang_iw
Hungarian: lang_hu
Icelandic: lang_is
Italian: lang_it
Japanese: lang_ja
Korean: lang_ko
Latvian: lang_lv
Lithuanian: lang_lt
Norwegian: lang_no
Portuguese: lang_pt
Polish: lang_pl
Romanian: lang_ro
Russian: lang_ru
Spanish: lang_es
Swedish: lang_sv
Turkish: lang_tr
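As an illustration of how the `lr` codes above are used, the following sketch builds a search URL restricted to one language. The helper name `google_query_url` and the example query are hypothetical, and whether the search engine still honours `lr` in this form is an assumption.

```python
from urllib.parse import urlencode

def google_query_url(terms, lang_code):
    """Build a search URL with the lr language restriction (illustrative helper)."""
    params = {"q": terms, "lr": lang_code}
    return "https://www.google.com/search?" + urlencode(params)

# Example: pages in Spanish within one domain.
url = google_query_url("site:csic.es", "lang_es")
print(url)
```

Looping over the codes in the list gives one query per language, which is the basis of the language composition experiments mentioned earlier.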
43
Countries (Google)
Andorra: AD
United Arab Emirates: AE
Afghanistan: AF
Antigua and Barbuda: AG
Anguilla: AI
Albania: AL
Armenia: AM
Netherlands Antilles: AN
Angola: AO
Antarctica: AQ
Argentina: AR
American Samoa: AS
Austria: AT
Australia: AU
Aruba: AW
Azerbaijan: AZ
Bosnia and Herzegovina: BA
Barbados: BB
Bangladesh: BD
Belgium: BE
Burkina Faso: BF
Bulgaria: BG
Bahrain: BH
Burundi: BI
Benin: BJ
Bermuda: BM
Brunei Darussalam: BN
Bolivia: BO
Brazil: BR
Bahamas: BS
Bhutan: BT
Bouvet Island: BV
Botswana: BW
Belarus: BY
Belize: BZ
Canada: CA
Cocos (Keeling) Islands: CC
Congo, DR: CD
Central African Republic: CF
Congo: CG
Switzerland: CH
Cote d'Ivoire: CI
Cook Islands: CK
Chile: CL
Cameroon: CM
China: CN
Colombia: CO
Costa Rica: CR
Cuba: CU
Cape Verde: CV
Christmas Island: CX
Cyprus: CY
Czech Republic: CZ
Germany: DE
Djibouti: DJ
Denmark: DK
Dominica: DM
Dominican Republic: DO
Algeria: DZ
Ecuador: EC
Estonia: EE
Egypt: EG
Western Sahara: EH
Eritrea: ER
Spain: ES
Ethiopia: ET
European Union: EU
Finland: FI
Fiji: FJ
Falkland Islands (Malvinas): FK
Micronesia, FS: FM
Faroe Islands: FO
France: FR
France, Metropolitan: FX
Gabon: GA
United Kingdom: UK
Grenada: GD
Georgia: GE
French Guiana: GF
Ghana: GH
Gibraltar: GI
Greenland: GL
Gambia: GM
Guinea: GN
Guadeloupe: GP
Equatorial Guinea: GQ
Greece: GR
South Georgia/South Sandwich I.: GS
Guatemala: GT
Guam: GU
Guinea-Bissau: GW
Guyana: GY
Hong Kong: HK
Heard and McDonald Islands: HM
Honduras: HN
Croatia (Hrvatska): HR
Haiti: HT
Hungary: HU
Indonesia: ID
Ireland: IE
Israel: IL
India: IN
British Indian Ocean Terr.: IO
Iraq: IQ
Iran: IR
Iceland: IS
Italy: IT
Jamaica: JM
Jordan: JO
Japan: JP
Kenya: KE
Kyrgyzstan: KG
Cambodia: KH
Kiribati: KI
Comoros: KM
Saint Kitts and Nevis: KN
Korea, DPR: KP
Korea, Republic of: KR
Kuwait: KW
Cayman Islands: KY
Kazakhstan: KZ
Lao PDR: LA
Lebanon: LB
Saint Lucia: LC
Liechtenstein: LI
Sri Lanka: LK
Liberia: LR
Lesotho: LS
Lithuania: LT
Luxembourg: LU
Latvia: LV
Libya: LY
Morocco: MA
Monaco: MC
Moldova: MD
Madagascar: MG
Marshall Islands: MH
Macedonia, FYR: MK
Mali: ML
Myanmar: MM
Mongolia: MN
Macau: MO
Northern Mariana Islands: MP
Martinique: MQ
Mauritania: MR
Montserrat: MS
Malta: MT
Mauritius: MU
Maldives: MV
Malawi: MW
44
Countries II (Google)
Mexico: MX
Malaysia: MY
Mozambique: MZ
Namibia: NA
New Caledonia: NC
Niger: NE
Norfolk Island: NF
Nigeria: NG
Nicaragua: NI
Netherlands: NL
Norway: NO
Nepal: NP
Nauru: NR
Niue: NU
New Zealand: NZ
Oman: OM
Panama: PA
Peru: PE
French Polynesia: PF
Papua New Guinea: PG
Philippines: PH
Pakistan: PK
Poland: PL
St. Pierre and Miquelon: PM
Pitcairn: PN
Puerto Rico: PR
Palestine: PS
Portugal: PT
Palau: PW
Paraguay: PY
Qatar: QA
Reunion: RE
Romania: RO
Russian Federation: RU
Rwanda: RW
Saudi Arabia: SA
Solomon Islands: SB
Seychelles: SC
Sudan: SD
Sweden: SE
Singapore: SG
St. Helena: SH
Slovenia: SI
Svalbard and Jan Mayen Is.: SJ
Slovakia (Slovak Republic): SK
Sierra Leone: SL
San Marino: SM
Senegal: SN
Somalia: SO
Suriname: SR
Sao Tome and Principe: ST
El Salvador: SV
Syria: SY
Swaziland: SZ
Turks and Caicos Islands: TC
Chad: TD
French Southern Territories: TF
Togo: TG
Thailand: TH
Tajikistan: TJ
Tokelau: TK
Turkmenistan: TM
Tunisia: TN
Tonga: TO
East Timor: TP
Turkey: TR
Trinidad and Tobago: TT
Tuvalu: TV
Taiwan: TW
Tanzania: TZ
Ukraine: UA
Uganda: UG
United States Minor Outlying I.: UM
United States: US
Uruguay: UY
Uzbekistan: UZ
Holy See (Vatican City State): VA
Saint Vincent and the Grenadines: VC
Venezuela: VE
Virgin Islands (British): VG
Virgin Islands (U.S.): VI
Vietnam: VN
Vanuatu: VU
Wallis and Futuna Islands: WF
Samoa: WS
Yemen: YE
Mayotte: YT
Yugoslavia: YU
South Africa: ZA
Zambia: ZM
45
Lists of universities
Braintrack
www.braintrack.com
Universities Worldwide
univ.cc
Galilei
www.galilei.com.ar
Webometrics Catalogue
www.webometrics.info/university_by_country_select.asp
HEIR
siu.no/heir
General Education Online
www.findaschool.org
International Colleges and Universities
www.4icu.org
Portal Tecnociencia
www.tecnociencia.es
Universia
www.universia.es
Canadian Universities
www.uwaterloo.ca/canu
U.S. Universities by State
www.utexas.edu/world/univ/state
Top American Research Universities
thecenter.ufl.edu
UK Higher Education Map
www.scit.wlv.ac.uk/ukinfo/uk.map.html
Times World Universities Rankings
www.thes.co.uk/worldrankings
German University Ranking
www.university-ranking.org
Academic Ranking of World Universities
ed.sjtu.edu.cn/ranking.htm
All Universities around the World
www.bulter.nl/universities
Ranking of China Universities
rank2005.netbig.com
Alphabetical Index of Japanese Universities
camp.ff.tku.ac.jp/TOOL-BOX/JapanUNIV
46
Personal agents (I)

Website extractors
AaronWebVacuum 2.9 www.surfwarelabs.com
JOC WebSpider 5.7 www.jocsoft.com
Teleport Pro 1.64 www.tenmax.com
Leech 4.3 www.aeria.com
WebCopier 5.4 www.maximumsoft.com
BlackWidow 6.28 www.softbytelabs.com
MemoWeb 4.0 www.goto.fr
Offline Commander 2.1 www.zylox.com
WebReaper 10 www.webreaper.net
Offline Explorer Pro 5.9 www.metaproducts.com
Website Extractor 10.0 www.asona.org
WebWhacker 5.0 www.bluesquirrel.com
WebZip 7.1 www.spidersoft.com
Website2PDF 1.0 www.spidersoft.com
Medusa 1.2 www.candego.com
47
Personal agents (II)

Link checkers
Alert LinkRunner 6.01 www.alertbookmarks.com/lr
HTML Link Validator 4.47 www.lithopssoft.com
HTML Validator Professional 11 www.htmlvalidator.com
Link Checker Pro 3.3 www.link-checker-pro.com
LinkScan Workstation 12.1 www.elsop.com
Web Link Validator 5.5 www.relsoftware.com/wlv
Xenu's Link Sleuth 1.3 home.snafu.de/tilman/xenulink.html
48
Personal agents (III)

HTML extractors


WebData Extractor 6.0 www.webextractor.com
Experiments


Site extraction with the offline browser Teleport Pro
Mapping of the extracted site with Xenu


Direct mapping of the site with Xenu


Link checking
Link checking
Size of the site according to the search engines

Google, Yahoo, Exalead, Ask, Gigablast
49
WebDataExtractor
50
Website extraction, checking and mapping
51
Cybermetrics of search engines

Search engines: Characteristics and
problems

8 “different” big search engines









Google
Yahoo Search (now Bing supplied)
Bing (ex-Live) Search
Ask (ex-Teoma)
Exalead
Wisenut
Gigablast
Alexa
Studies about search engines
Search Engine Showdown
searchengineshowdown.com
Search Engine Watch
searchenginewatch.com
52
Only seven (+one)?
2003
Database
Site
GOOGLE
NETSCAPE
YAHOO
ALTAVISTA
ALLTHEWEB
LYCOS
IWON
HOTBOT
MSN SEARCH
TEOMA
ASK JEEVES
ALEXA
GOOGLE
ALTAVISTA
FAST
GOOGLE
INKTOMI
2004-2005
Database
Site
GOOGLE
NETSCAPE GOOGLE
YAHOO
ALTAVISTA YAHOO
ALLTHEWEB
LYCOS
TEOMA
IWON
GOOGLE
WISENUT
WISENUT
MSN SEARCHMSN SEARCH
TEOMA
TEOMA
ASK JEEVES
ALEXA
GOOGLE/MSN SEARCH
A9
EXALEAD
EXALEAD
WISENUT
WISENUT
GIGABLAST
GIGABLAST
GIGABLAST GIGABLAST
TEOMA
GOOGLE
2006-2007
Database
Site
GOOGLE
NETSCAPE
YAHOO
ALTAVISTA
ALLTHEWEB
LYCOS
IWON
HOTBOT
LIVE
GOOGLE
YAHOO
ASK
LIVE
ASK
ASK
ALEXA
A9
EXALEAD
WISENUT
GIGABLAST
HEREUARE
ALEXA
LIVE
EXALEAD
WISENUT
GIGABLAST
53
Cybermetrics of search engines
GOOGLE
BING (LIVE)
EXALEAD
ASK
GIGABLAST
site:xx
site:xx
site:xx
site:xx
site:xx
site:aa.xx
site:aa.xx
site:aa.xx
site:aa.xx
site:aa.xx
site:aa.xx/bb
site:aa.xx/bb
site:aa.xx/bb
NO
inurl:xx
NO
NO
inurl:xx
url:xx
inurl:xx
inurl:xx
link:aa.xx/b.htm
NO
link:www.aa.xx
(NO)
(NO)
NO
NO
link:aaa.xx
NO
NO
File type
filetype:yy
filetype:yy
filetype:yy
filetype:yy
filetype:yy
Language
Advanced
Advanced
Advanced
Advanced
NO
Country
Advanced
(Advanced)
Advanced
Advanced
NO
TLD
Domain
Directory
Word in url
Link
Link domain
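The filter operators compared above can be combined into query strings. The sketch below assembles such queries for the rich file types discussed earlier; the helper `build_query` and the example domain are illustrative, and operator support varies by engine, as the comparison shows.

```python
def build_query(terms="", domain=None, filetype=None):
    """Assemble a query string from optional site: and filetype: filters."""
    parts = []
    if terms:
        parts.append(terms)
    if domain:
        parts.append(f"site:{domain}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

# One query per rich file format for a single institutional domain.
rich_types = ("pdf", "ps", "doc", "ppt")
queries = [build_query(domain="csic.es", filetype=ft) for ft in rich_types]
for query in queries:
    print(query)
```

The number of results each query returns is what the size and rich-file indicators are built from.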
54
URL-mention
55
Outlinks
56
Quality, visibility and impact

Quantitative evaluation of institutional
websites

The Google model


ToolBar installation (toolbar.google.com)
Page Rank



Logarithmic scale
rankwhere.com/google-page-rank.php
www.rustybrick.com/pagerank-prediction.php
Components: visibility + weight
Visibility




Types of links: inlinks, outlinks, self-links, back-links
Calculation using search engines
Web impact (WebIF)
Link quality: Link inspectors
57
Google Toolbar
58
RankWhere
59
PageRank Prediction
60
urltrends
61
Nutch
62
Popularity

Number of visits
It's difficult to obtain for comparative studies

Relative position
Popularity according to Alexa www.alexa.com
• Only domains
• World Wide coverage
• Some “absolute” values
• Temporal evolution
• Geographic biases (>> Asia)
Snapshot snapshot.compete.com
• Only USA!!!
Ranking.com www.ranking.com
Traffic Estimate www.trafficestimate.com
Popularity according to Netcraft toolbar.netcraft.com/site_report
• Institutional sites and variants
• More restricted coverage
Not comparable
63
Alexa
64
Limits of Alexa
65
Inequalities in Alexa
Position: % of visits
Top 3: 23
Top 500: 45
Number 10: 5
Number 100: 0.1
Number 1,000: 0.06%
Number 10,000: 0.02%
66
Snapshot
67
Ranking.com
68
Netcraft
69
Working with links

Visibility
Inlinks (incoming links)
• Yahoo Site Explorer
• Exalead: link: -site:
Outlinks (outgoing links) = Luminosity
• Link inspectors

Web impact
Definition of WebIF
• Calculation = Visibility/size

Quality
Link checkers
70
Basic terminology


Nodes in the diagram: A, B, C, D, E, F, G, H

B has an outlink to C: ~ reference
B has an inlink from A: ~ citation
B has a selflink: ~ self-citation
E and F are reciprocally linked
A is transitively linked with H via B-D
A has a transversal link to G: short cut

co-links

C and D are co-linked from B,
i.e. shared inlinks: co-citation
B and E are co-linking to D,
i.e. shared outlinks: bibliographic coupling
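The terminology above can be made concrete with a small link graph. This is a minimal sketch: the edge list is reconstructed from the statements on the slide (the original diagram may contain further links), and the variable names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Directed links (source, target) taken from the statements on the slide.
edges = [("A", "B"), ("A", "G"), ("B", "B"), ("B", "C"), ("B", "D"),
         ("D", "H"), ("E", "F"), ("F", "E"), ("E", "D")]

outlinks = defaultdict(set)   # node -> nodes it links to
inlinks = defaultdict(set)    # node -> nodes linking to it
selflinks = set()
for source, target in edges:
    if source == target:
        selflinks.add(source)          # self-link, ~ self-citation
        continue
    outlinks[source].add(target)
    inlinks[target].add(source)

# Co-linked pairs: two targets sharing a source (~ co-citation).
co_linked = {pair for targets in outlinks.values()
             for pair in combinations(sorted(targets), 2)}
# Co-linking pairs: two sources sharing a target (~ bibliographic coupling).
co_linking = {pair for sources in inlinks.values()
              for pair in combinations(sorted(sources), 2)}
# Reciprocal links: both directions present.
reciprocal = {tuple(sorted((s, t))) for s, t in edges
              if s != t and s in outlinks.get(t, set())}

print(sorted(co_linked), sorted(co_linking), sorted(reciprocal), selflinks)
```

Because the edge list keeps only the links named on the slide, pairs such as B and G (both linked from A) also appear as co-linked.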
71
Cyberscientometrics

Development of R&D indicators in the Web

Units


Models
Indicators

Small World
www.db.dk/lb/2002smallworld.pps

CiteSeerX
CiteBase
Google Scholar
Arxiv
Scirus
DBLP
citeseerx.ist.psu.edu
citebase.eprints.org/cgi-bin/search
scholar.google.com
arxiv.org
www.scirus.com
dblp.uni-trier.de



Institutional site
Co-sitation, social networks and theory of the “small
world”
Bibliometrics of e-journals and deposits of
documents





72
Web indicators
Scientometrics
Input
Output
R&D
Indicators
Bibliometrics
Patentometrics
Web
Indicators
Webometrics
Cybermetrics
Information Society
Indicators
73
Building Indicators

Experiments

Codification




Institutional
Subject (UNESCO)
Geographic (NUTS)
Indicators calculation

Visibility (sitations)






Visibility of the rich files
Visibility of articles in repositories
Visibility of electronic journals
Impact (WebIF)
Diversity
Co-citation
74
Composite indicators

Web Impact factor (WebIF)


Visibility (sitations)/ Size (No. of pages)
Webometrics (Academic) Rank

Size


No. of Webpages
No. of files


Rich files:
pdf, ppt, doc, ps

No. of papers
Google Scholar
Other bibliographic
databases
Visibility


Incoming external links
Mentions
Popularity
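The Web Impact Factor defined above is a simple ratio. The sketch below computes it; the function name and the figures in the example are hypothetical and serve only to show the calculation.

```python
def web_impact_factor(external_inlinks, pages):
    """WebIF = visibility (sitations) / size (number of pages)."""
    if pages <= 0:
        raise ValueError("size must be a positive number of pages")
    return external_inlinks / pages

# Hypothetical site: 12,000 external inlinks pointing to 48,000 pages.
webif = web_impact_factor(12_000, 48_000)
print(webif)
```

Because both counts come from search engine results, the ratio is only as stable as those counts.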
75
Webometrics Ranking
www.webometrics.info
76
Size (number of pages)
77
Direct crawling
78
Other rankings
http://vcmike.blogspot.com/2006/01/ranking-colleges-using-google-and-oss.html
79
Other rankings: G-factor
http://www.universitymetrics.com/g-factor
80
Related (I)
81
Related (II)
82
MODULE 2
Applied Cybermetrics
Search Engine Optimization (SEO)
Web Positioning
Applied Cybermetrics

The aim is not only to publish on the Web, but to get
visibility
Getting a great number of visits (real audience close to the
potential one)
Receiving external links
Being present in directories and portals

A search engine is used in 80% of web sessions
Web positioning is the key to increasing visibility

Quality influences the chances of getting a good
positioning, but also...
The volume of information
The hypertext structure
The contents annotation
84
Positioning

Presence measurements
Directory indexing
Actual indexed pages by a search engine/Total pages

Visibility measurements
Page Rank
Prominence by terms

Measurements of access and usage
Popularity
• Absolute: Number of visits
• Relative: Alexa Ranking
Usage
• Number of downloaded files
• Average time per visit
• More frequent reference terms
85
PageRank Google
86
Problems

Design is irrelevant, or even counterproductive
Few indexable contents on main page
Flash animations or Java applets that hinder the robots’
navigation

Invisible Internet
Databases and dynamic web pages cannot be indexed by
search engines

Link quality
External and internal links need continuous maintenance
and updating

Rich files
Document files are handy for distributing value-added
information
• Formats pdf, ppt, doc, ps
87
Tools
Webmasters World tools.webmastersworld.org
SEO Encyclopedia www.seopedia.info
Webmasters Tools tools.devshed.com
SEO Online www.seoonline.info
PageStrength www.seomoz.org/tools/page-strength.php
Data Centers Tool www.seocritique.com/datacentertool
SEO Tools www.seochat.com/seo-tools
SEO Web Directory www.seowebdirectory.com/SEO_Tools
SEO Company www.seocompany.ca/tool/seo-tools.html
SEO ToolSet www.webconfs.com
88
89
90
Criteria (Google)

Hypertext structure





Number of times that the search terms appear
Relative position of the search terms






Title and URL
Metadata
Headings
ALT tags and external anchors
Updating periodicity


Maturity: Depth of the institutional sites
Visibility: PageRank
Neighborhood: External and internal links
Freshness (new contents)
Popularity: Page visits
Local aspects (geographic, languages)
91
Criteria (Google)
92
Presence of terms in the URL


Very relevant
Preferably in the domain or subdomain
http://better.good.xx/aceptable
Recommended no longer than 30 characters
The order is important

Whole words, not truncated
http://lib.univ.edu
http://library.university.edu (YES)

Independent terms/phrases (dash/underscore)
Universidad-Complutense = +Universidad +Complutense
Universidad_Complutense = “Universidad Complutense”
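The dash/underscore rule and the length recommendation above can be sketched as follows. The functions are illustrative: they encode the slide's claims about how search engines read URL segments, and do not describe any engine's actual implementation.

```python
def url_terms(segment):
    """Split a URL segment as described on the slide: a dash separates
    independent terms, an underscore joins words into a single phrase."""
    return [part.replace("_", " ") for part in segment.split("-") if part]

def within_recommended_length(host, max_len=30):
    """Check the slide's recommendation of at most 30 characters."""
    return len(host) <= max_len

print(url_terms("Universidad-Complutense"))   # two independent terms
print(url_terms("Universidad_Complutense"))   # one phrase
print(within_recommended_length("library.university.edu"))
```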
93
Agapea
94
Presence of terms in Title


Very relevant
Tag contents <TITLE>!!!






Keywords, not a title
The position is important: first words carefully selected
Long phrase, without stop words (~60 characters)
Don't repeat terms, bilingual option
Institutional identification, geographic localization
The contents of the <Hn> tags are also considered


The <H1> heading gives the title obtained
Move generic words (“Hello”, “Welcome”, “Page of”) to lower
levels: <H2> or <H3>
95
Terms in Title
96
Metatags


They are not so important

Description
Up to 250 characters
Reusable tag for versions in other languages
The position is important: choose the first words wisely
Don’t repeat words

Keywords
Up to 20 terms
Terms SHOULD also appear in the text
Reusable tag for versions in other languages
The position is important: choose the first words wisely
Don’t repeat words

Description pre-cataloging
Use other tags: Dublin Core model (15 repeatable)
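The limits above can be enforced when the tags are generated. The following is a minimal sketch: the function `meta_tags` and the example values are hypothetical, and the 250-character and 20-term limits are the ones recommended on the slide, not a requirement of HTML.

```python
from html import escape

def meta_tags(description, keywords, max_desc=250, max_terms=20):
    """Return description and keywords META tags within the slide's limits."""
    if len(description) > max_desc:
        raise ValueError("description longer than the recommended limit")
    # Drop repeated terms (case-insensitive) while keeping the original order.
    unique = list(dict.fromkeys(k.strip().lower() for k in keywords))
    if len(unique) > max_terms:
        raise ValueError("too many keywords")
    return [
        f'<meta name="description" content="{escape(description, quote=True)}">',
        f'<meta name="keywords" content="{escape(", ".join(unique), quote=True)}">',
    ]

tags = meta_tags("Cybermetrics Lab, CSIC",
                 ["webometrics", "rankings", "Webometrics"])
print("\n".join(tags))
```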
97
Generating META tags
Meta Builder 2
vancouver-webpages.com/META/mk-metas.html
Meta Tags Generator www.meta-tags.us
MetaTags Generator
tools.webmastersworld.org/MetatagsGenerator.php
Meta Tag Generator
www.invision-graphics.com/meta-tag-generator.html
Meta Tag Generator
www.submitcorner.com/Tools/Meta
DC-Dot
www.ukoln.ac.uk/metadata/dcdot/
98
Key words in text

Selecting correctly
Study synonymy, variants, similar terms in other languages
Analyze usage in search engines

Density
Total: Up to 25%
Individual: Up to 5%

Position
Heading tags <Hn>
First paragraphs
Font modifying tags
Bold <B><strong>; Italic <I>; Font size
Promote the proximity of terms (where appropriate)
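Keyword density, as used above, is the share of the words in a text taken up by the keywords. The sketch below computes individual and total density and flags keywords above the 5% individual limit; the tokenisation rule and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def keyword_density(text, keywords):
    """Return (density per keyword, total density) as fractions of all words."""
    words = re.findall(r"\w+", text.lower())
    total_words = len(words)
    if total_words == 0:
        return {k: 0.0 for k in keywords}, 0.0
    counts = Counter(words)
    individual = {k: counts[k.lower()] / total_words for k in keywords}
    return individual, sum(individual.values())

text = "library catalogue of the university library"
individual, total = keyword_density(text, ["library", "university"])
over_limit = [k for k, share in individual.items() if share > 0.05]
print(individual, total, over_limit)
```

In such a short sample every keyword exceeds the limits; on a real page the same calculation runs over the full visible text.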
99
More about keywords

Alternative text ALT





Very important
Used to give meaning to images, graphs and banners
Specific treatment similar to title
Up to 250 characters
Anchor terms in the links



Use keywords
The pages that link to ours are very important
Also relevant for the internal navigational links
100
Google-bombing
101
Google Trends
102
Google Timeline & Map
103
Links to external pages

Link density


Average of links/page (incl. internal) ~ 20
Structuring resource lists in hierarchical directories


Each category, one or more pages
Target pages

Linking to good pages







Main page (whenever appropriate)
Pages with high PR
Updated pages
Local>.edu>.org>.info>.com
Check frequently that links are still active
Avoid links to link farms
Select carefully the text on the link (avoid “here”, “page”)
104
Characteristics of the institutional sites

Domain

Own





Subdomain: Inherit PR from site root
Don’t change domain!!!
Medium-sized and big institutional sites


Preferably large
Updating

Frequently



Avoid acronyms, provide content
Local, .org, .info, .name versus .com
Increase number of pages (maintain new/old rate )
Promote inlinks
Promote visits

Keep statistics
105
Characteristics of the pages

Size

Small or medium-sized <100 k




Medium or big-sized
Updating


Frequent, but not that much
Change contents, no address


But 40-50 k can be a great volume of text
Structure correctly the groups of pages through consecutive links
(back-next)
Reduce to a minimum the restructuring
Versions

In different pages


In other languages
In other formats (pdf, doc, ps, ppt, ...)
106
Barriers for robots

Links hidden, incomplete or without meaning

Graphs and way-in banners without link in text mode



Javascripts in navigational menus





With hidden links
With relative, incomplete links (without URL Base declaration)
Frames (but NOT always!!)
Orphan pages
Avoid re-direction and alias



Especially Flash files
The presence of ALT text is also important
Refresh tags
Institutional farms (site.es; site.com; site.org)
Dynamic pages

Reduce the length and complexity of the URLs: give them a
meaning
107
Robot-friendly

File robots.txt
Don’t overuse “noindex”

Map of the site (html and xml)

Navigational internal links
Only those that are necessary

Sign-in in referrals
At the search engines (not very important, it only speeds up
indexing)
In directories (in Yahoo it increases visibility)
In supersites (trick: Wikipedia)

Fight against invisibility
Static pages
Support submenus
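Whether a robots.txt file blocks more than intended can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch: the rules and the `www.example.org` URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everything is open except /private/.
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed = parser.can_fetch("*", "http://www.example.org/index.html")
blocked = parser.can_fetch("*", "http://www.example.org/private/data.html")
print(allowed, blocked)
```

Running the check over a site's most important URLs shows quickly whether an overly broad rule hides them from the robots.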
108
“Visible” Internet
109
Hacking strategies (to avoid)



Invisible texts
Pixel links
Link farms




Duplicate texts
Cloaking


Link buying
Visits buying
Different pages for the search engine than for the user
Hacking mirrors
110
Tools: Words’ Density
Site Content Analyzer 2.2.15
www.sitecontentanalyzer.com
Good Keywords 2.0
www.goodkeywords.com
Keyword Density
www.keyworddensity.com
Keyw. Dens. & Prominence 1.2
www.ranks.nl/tools/spider.html
Keyword Density Analyzer
tool.motoricerca.info/keyword-density.phtml
KDAnalyzer Version 2.0
www.webjectives.com/keyword.htm
Google Adwords
adwords.google.com/select/KeywordSandbox
Keyword Density Analyzer 1.3
www.searchengineworld.com/cgi-bin/kwda.cgi
Keyword Investigator
www.keywordster.com/keyword-investigator.htm
GRKda
www.grsoftware.net/search_engines/software/grkda.html
111
Keyword Density & Prominence
112
Tools: Position
Accurate Monitor 2.5 www.cleverstat.com
Advanced Web Ranking 4.7 www.advancedwebranking.com
AgentWebRanking Pro 2.6 www.agentwebranking.com
IBP 9 www.axandra.com
Dynamic Web Ranking 7.0 www.dynamicwebrank.com
Link Popularity Analysis 2.0 www.link-popularity-analysis.com
Link Popularity Check 3.0 www.checkyourlinkpopularity.com
Link Survey 1.5 www.antssoft.com
RankSpy 1.3 www.searchutilities.com/rankspy
Trellian SEO Toolkit www.trellian.com/seotoolkit
Web CEO 6.0 www.webceo.com
113
WebPosition
114
Advanced Web Ranking
115
Quality: Duplicates, broken links
116
Evolution and persistence


Volatility
Persistence




Changes in web pages are usually minor or cosmetic
The frequency of change varies according to the domain
The magnitude of the change depends largely on the size
Large pages change more, and more frequently
research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf
117
Generating Contents

Personal pages (also research groups or departments)



Institutional Repositories

Papers, books and book chapters, dissertations, …

Multimedia repositories
Portal of journals


Access to full texts files (academic publications)
Local institutional journals
Super-sites

Added value directories of (web) resources
118
Added value
119
Personal pages

Current situation

Few scholars with their own personal webpage, most of them
with a limited amount of contents


Bad positioning practices, especially regarding the URL
Personal Branding

Increased Impact (global audiences)

Efficient Networking (peers and non-peers)

Complements your formal scholarly communication

Reflects the diversity of your activities (and of yourself)

Not only reactive but also proactive

It is easy, fast and cheap
120
A model
Institutional Logo & Banner
Name of the group, department or faculty
Index
• Papers
• Conferences
• Books
• Teaching
• Projects
• Popular Science
• Prizes
• Hobbies
• Press notes
• Blog / Web 2.0
• Statistics
• CV (pdf)
Photo
Contact info
http://johnclements.net/home
General comments and presentation
News, relevant new info
Next conferences
Links
Updated 5-July-2012
thebook.virtualknowledgestudio.nl/author/paul-wouters
121
MODULE 3
Usage metrics
Tracking and Analyzing Visits
Web Usage Mining

Definitions
Data mining: Knowledge extraction from databases
Web Mining: Gathering and analysis of the visit patterns of a Web
site
It is not about searching or retrieving information about that site

Objectives: Aspects to explore
Joining
Classification and clustering
Transversal patterns
Sequential patterns
Similarities

Analysis of website visits
Log files: Definition and structure
Software for log analysis
Practices with WebTrends Analysis Suite (www.netiq.com)
123
Taxonomy of the Web Mining
Web Mining
Mining of Web
contents
Mining based
on agents
 Search engines
 Metasearchers
 Personal agents
Mining of the Web use
Database mining
 Identification
 Description
 Analysis tools
 Invisible Internet
124
Log files (logbook)

File that automatically records all data about the visits that
a web site receives

IP address of the visitor
Visited URLs
Time of visit
Time dedicated to the visit
URL from which the visit came
Type of request
Type of response
Size of response (bytes)
Browser used
etc…
Apache web log
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
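A record like those in the Apache log above can be split into its fields with a regular expression. This is a minimal sketch for the combined log format; the field names are illustrative and the pattern makes no attempt to cover every variant of the format.

```python
import re

# One record: IP, identity, user, [time], "request", status, size,
# "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log record as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
          '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
          '"http://www.csua.berkeley.edu/~sophal/whole.html" '
          '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"')
entry = parse_line(sample)
print(entry["ip"], entry["request"], entry["status"], entry["size"])
```

The fields map onto the list of recorded data: visitor IP, visited URL, time, referring URL, type of request and response, size, and browser.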
125
Utilities

Questions to answer









How has the information been used?
How frequently?
Which pages are the most and the least popular (visited)?
Through which pages do visitors enter? Through which do they exit?
Where do they spend the most time?
How much time do they spend?
Which paths do visitors follow most often?
Who are the visitors? Where do they come from?
How did they arrive?
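Several of these questions can be answered once hits are grouped into visits. The sketch below is one possible approach, not the method of any particular package. It assumes hits have already been parsed into (IP, time, URL) tuples, that one IP address is one visitor, and that 30 minutes of inactivity ends a visit. The sample data and the SESSION_GAP value are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed hits: (visitor IP, time in seconds, URL).
hits = [
    ("10.0.0.1", 100, "/index.html"),
    ("10.0.0.1", 160, "/papers.html"),
    ("10.0.0.1", 400, "/contact.html"),
    ("10.0.0.2", 120, "/papers.html"),
    ("10.0.0.2", 150, "/index.html"),
    ("10.0.0.1", 5000, "/index.html"),  # same IP, new visit after a long gap
]

SESSION_GAP = 1800  # assumed inactivity threshold (30 minutes)

def sessions(hits, gap=SESSION_GAP):
    """Group hits into visits: same IP, consecutive hits at most `gap` apart."""
    by_ip = defaultdict(list)
    for ip, t, url in sorted(hits, key=lambda h: (h[0], h[1])):
        visits = by_ip[ip]
        if visits and t - visits[-1][-1][0] <= gap:
            visits[-1].append((t, url))   # continue the current visit
        else:
            visits.append([(t, url)])     # start a new visit
    return [visit for visits in by_ip.values() for visit in visits]

visits = sessions(hits)
popularity = Counter(url for _, _, url in hits)            # most/least visited
entries = Counter(visit[0][1] for visit in visits)         # entry pages
exits = Counter(visit[-1][1] for visit in visits)          # exit pages
durations = [visit[-1][0] - visit[0][0] for visit in visits]  # time per visit
paths = Counter(tuple(url for _, url in visit) for visit in visits)

print("Visits:", len(visits))
print("Most popular:", popularity.most_common(1))
print("Entry pages:", entries.most_common())
print("Exit pages:", exits.most_common())
print("Most followed path:", paths.most_common(1))
```

Note that the time spent on the last page of a visit cannot be measured from the log, since no later request marks when the visitor left. Identifying visitors by IP address is also only an approximation, because proxies and shared connections merge several people into one address.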
126
Visits trackers
Google Analytics      www.google.com/analytics
Yahoo Web Analytics   web.analytics.yahoo.com
StatCounter           www.statcounter.com
ActiveMeter           www.activemeter.com
123Statmore           www.123stat.com
Counter Central       www.countercentral.com
Digits Web Counter    www.digits.com
Free Hit Counter      www.ritecounter.com
GoStats               www.gostats.com
MyWebStats            www.mywebstats.org
OneStat Free          www.onestatfree.com
OneStat               www.onestat.com
Opentracker           www.opentracker.net
ShinyStat             www.shinystat.com
TDstats               www.tdstats.com
TheCounter            www.thecounter.com
WebSTAT               www.webstat.com
What Counter          www.whatcounter.com
127
Google Analytics
128
Google Analytics (II)
129
Google Analytics (III)
130
StatCounter
131
Log file analysis software
10-Strike Log-Analyzer 1.53   www.10-strike.com
123LogAnalyzer 3.3            www.123loganalyzer.com
Log2Stats 1.5                 www.bitstrike.com
AdvancedLogAnalyzer 2.1       www.abacre.com/ala/index.htm
Alterwind Log Analyzer 4.0    www.alterwind.com
Analog 6.0                    www.analog.cx
Analyse Spider 3.01           www.analysespider.com
Deep Log Analyzer 4.0         www.deep-software.com
eWebLogAnalyzer 2.3           www.esoftys.com
FastStats Analyzer 4.1        www.mach5.com/products/analyzer
Nihuo Web Log Analyzer 4.07   www.nihuo.com
SawMill 8.5                   www.sawmill.net
SmarterStats 6.5              www.smartertools.com
Surfstats 2011                www.surfstats.com
WebLogStorming 2.6            www.datalandsoftware.com/weblog
WebLogExpert 7.4              www.weblogexpert.com
WebTrends Analytics 10        www.webtrends.com
132
10-Strike Log Analyzer
133
123-Log Analyzer
134
SawMill
135
Exercises

Experiments


Funnel Web 5.0
Practices with log files






Total and disaggregated visits
Most popular pages and directories
Downloaded files
Points of entry and exit
Visitor demographics
Entry referrals (origin, browser and search engine words used)
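For the referrals exercise, the origin and the search words can often be recovered from the referrer URL recorded in the log. The sketch below uses only the Python standard library. The function name classify_referrer and the list of query parameter names are assumptions for illustration; which parameter carries the search words depends on the search engine, and the example URLs are invented.

```python
from urllib.parse import urlsplit, parse_qs

# Query parameters assumed to carry search words ("q" and "p" are common).
QUERY_KEYS = ("q", "p", "query")

def classify_referrer(referrer):
    """Return (kind, origin host, search words) for a referrer field."""
    if not referrer or referrer == "-":
        # "-" in the log means no referrer: typed URL, bookmark, or robot.
        return ("direct", None, None)
    parts = urlsplit(referrer)
    params = parse_qs(parts.query)  # decodes "+" and %xx escapes
    for key in QUERY_KEYS:
        if key in params:
            return ("search", parts.netloc, params[key][0])
    return ("link", parts.netloc, None)

# Hypothetical referrer values of the three kinds.
examples = [
    "-",
    "http://www.google.com/search?q=cybermetrics+lab",
    "http://www.csua.berkeley.edu/~tahir/",
]
for referrer in examples:
    print(classify_referrer(referrer))
```

This heuristic will misclassify an ordinary page whose URL happens to contain one of the listed parameters, so in practice the host would usually also be checked against a list of known search engines.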
136
Configuring Funnel Web
137
Results
138
Referrals
139
Bibliography/Webliography













General Bibliography/Webliography: www.cindoc.csic.es/cybermetrics/links03.html
Björneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1): 65-82. http://www.db.dk/lb/2001webometrics.pdf
van Raan, A. F. J. (2001). Bibliometrics and internet: Some observations and expectations. Scientometrics, 50(1): 59-63.
Bar-Ilan, J. (2001). Data collection methods on the Web for informetric purposes: A review and analysis. Scientometrics, 50(1): 7-32.
Björneborn, L. (2004). Small-world link structures across an academic web space: a library and information science approach. PhD dissertation. Royal School of Library and Information Science. xxxvi, 399 p. ISBN 87-7415-276-9. http://www.db.dk/lb/phd/phd-thesis.pdf
Jepsen, E.T.; Seiden, P.; Ingwersen, P.; Björneborn, L. & Borlund, P. (2005). Characteristics of scientific web publications: preliminary data gathering and analysis. Journal of the American Society for Information Science and Technology. Special Issue on Webometrics.
Björneborn, L. & Ingwersen, P. (2005). Towards a basic framework for webometrics. Journal of the American Society for Information Science and Technology. Special Issue on Webometrics.
Thelwall, M.; Vaughan, L. & Björneborn, L. (2005). Webometrics. Annual Review of Information Science and Technology, 39.
Ingwersen, P. & Björneborn, L. (2004). Methodological issues of webometric studies. In: Glänzel, W. et al. (eds.). Quantitative Science and Technology Research. Kluwer Academic Publishers.
The Statistical Cybermetrics Research Group. Wolverhampton University. http://cybermetrics.wlv.ac.uk
Alonso Berrocal, J.L.; Figuerola, C.G. & Zazo, A.F. (2004). Cibermetría: nuevas técnicas de estudio aplicables al Web. Ediciones Trea, Gijón. 207 pp.
Faba Perez, C.; Guerrero Bote, V. P. & Moya Anegón, F. (2004). Fundamentos y técnicas cibermétricas: modelos cuantitativos de análisis. Junta de Extremadura, Mérida. Serie Sociedad de la Información, no. 18. 216 pp.
Prime, C.; Bassecoulard, E. & Zitt, M. (2002). Co-citations and co-sitations: A cautionary view on an analogy. Scientometrics, 54(2): 291-308.
140