Google Scholar: the pros and the cons
The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at www.emeraldinsight.com/1468-4527.htm

SAVVY SEARCHING
Peter Jacso
University of Hawaii, Manoa, Hawaii, USA

Abstract
Purpose - To identify the pros and the cons of Google Scholar.
Design/methodology/approach - Chronicles the recent history of the Google Scholar search engine from its inception in November 2004 and critiques it with regard to its merits and demerits.
Findings - Feels that there are massive content omissions presently but that, with future changes in its structure, Google Scholar will become an excellent free tool for scholarly information discovery and retrieval.
Originality/value - Presents a useful analysis for potential users of the Google Scholar site.
Keywords Reference services, Internet, Search engines
Paper type General review

The launch of Google Scholar - not surprisingly - drew much attention and praise from both the popular and the professional media, although not necessarily for the right reasons. Google makes it very easy and free to find scholarly information about any topic - an important service for those who do not have access to the most appropriate fee-based indexing/abstracting databases which traditionally have helped in information discovery. Google Scholar goes beyond information discovery by leading qualifying users at subscribing libraries to the primary digital documents, and all users to the millions of open access (free) primary documents offered through mega-databases of preprint and reprint servers, as well as to the full text digital collections of several government agencies and research organizations. Google also deserves credit for introducing - albeit a bit belatedly - advanced options to refine the search process.
On the negative side, the most important problem is that the crawlers of Google Scholar have not indexed millions of articles, even though they were let into the digital archives of most of the largest academic publishers and preprint servers and repositories. The stunning gaps give a false impression of the scholarly coverage of topics and lead to the omission of highly relevant articles by those who need more than just a few pertinent research documents. The rather enigmatic presentation of the results befuddles many users, and the lack of any sort options frustrates savvy searchers.

Online Information Review, Vol. 29 No. 2, 2005, pp. 208-214, Emerald Group Publishing Limited

Pros
Google deserves credit for taking the first step in providing a tool for discovering scholarly information, even though access may be limited to very short bibliographic citations of articles and conference papers in a sizable segment of Google Scholar. Still, many of the records show informative snippets from the full text to any user, and many also offer the abstracts of the articles to anyone. This free service alone is roughly equivalent to what several of the traditional online indexing/abstracting databases have been providing for a hefty fee. The big plus is that Google searches the indexes created from the full text or part of the full text of the primary documents (even if it shows only a snippet of it), not merely the bibliographic records, abstracts and the subject terms (if assigned by the author or the publisher to the articles).
The crawlers of Google Scholar were let into the huge databases of the largest and most well-known scholarly publishers and university presses (such as IEEE, ACM, Macmillan, Wiley, University of Chicago); their digital hosts/facilitators (such as Highwire Press, MetaPress, Ingenta); societies and other scholarly organizations and agencies (such as the American Physical Society, National Institutes of Health, NOAA); and preprint/reprint servers (such as arXiv.org, Astrophysics Data System, RePEc, and CiteBase). Patrons of libraries which subscribe to the digital archives of publishers are the greatest beneficiaries of Google Scholar, as with a single search they are led to the digital full text versions of the articles and their supplements. This is particularly valuable for those libraries which have no federated search engines and expect the patrons to repeat their searches by hopping from one publisher's archive to the other, finding the query form and resubmitting the same query - something most patrons are not inclined to do. In Google Scholar multiple database search is the default approach, unless the user specifies the publisher by using its URL in the site parameter, such as < tsunami site:ieee.org >. Once again, any user can see the bibliographic records, the abstract (if available with the paper), and/or the snippets of the context of the full text matching the query, and may order the document, often at a much lower price than charged by some of the document delivery services.

The initial launch of Google Scholar in mid-November 2004 did not offer an advanced search template. In the review of the initial beta version (Jacso, 2004), I voiced my disappointment that Google treats the highly structured records of scholarly articles the same way as the billions of unstructured web pages, even though the former have unambiguously and consistently tagged metadata identifying the title, the author, the journal name, the publication year and many other fields.
Laudably, a month later the advanced template was introduced with good tips for providing additional search criteria to refine the searches (see Figure 1). Although some of them (such as the publication year) are not totally reliable, it is a good move by Google. So is the calculation and display of the "cited by" score, whose credence, however, should be established by disclosing the sources covered, and by vastly improving their currently unsystematic, unpredictable and disturbingly fragmentary coverage.

The coverage of Google is impressively broad and includes the most important scholarly publishers' archives, with the notable exception of that of Elsevier, the largest publisher. It is another question that the coverage of many archives is extremely shallow, which leads me to the cons.

Cons
The underlying problem with Google Scholar is that Google is as secretive about its coverage as the North Korean government about the famine in the country. There is no information about the publishers whose archive Google is allowed to search, let alone about the specific journals and the host sites covered by Google Scholar.

Figure 1. Field-specific indexes for search criteria (screenshot of the Advanced Scholar Search template)

Another "feature" undisclosed by Google, but reported and well-illustrated among others by Gary Price specifically for Google Scholar (Price, 2004), is the fact that it limits the indexing of the collected files to the first 100-120 KB of the text (depending on the file type). The majority of scholarly feature articles are close to or even exceed 1 MB in size. If the first occurrence of the search term falls beyond Google's limit, the item will not be found.
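The effect of such an indexing limit can be illustrated with a minimal sketch. It assumes a toy indexer that simply collects the words in the first N bytes of a document; the function name `index_terms`, the sample document and the choice of 100 KB as the cutoff are illustrative assumptions, not a description of how Google's crawler actually works.

```python
def index_terms(text: str, limit_bytes: int) -> set[str]:
    """Return the lowercase words found in the first limit_bytes bytes of text."""
    head = text.encode("utf-8")[:limit_bytes].decode("utf-8", errors="ignore")
    return set(head.lower().split())


# Hypothetical document of roughly 250 KB whose only mention of
# "tsunami" comes at the very end, well past the cutoff.
filler = "word " * 50_000  # 250,000 bytes of padding
document = filler + "tsunami warning systems"

LIMIT = 100 * 1024  # assumed cutoff: lower bound of the reported 100-120 KB range

truncated_index = index_terms(document, LIMIT)
full_index = index_terms(document, len(document.encode("utf-8")))

print("tsunami" in truncated_index)  # False: the term lies beyond the cutoff
print("tsunami" in full_index)       # True: a full-text index would find it
```

A query for "tsunami" against the truncated index misses the document entirely, even though the term is present in the full text.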
In fairness, AskJeeves has a similar limit, MSN is slightly more generous with a 150 KB limit, and Yahoo! stops indexing at about 0.5 MB (Sullivan, 2004).

My comparative search results in mid-November 2004 and on 1 January 2005 suggest that the content of Google Scholar has not been updated since its launch. This is not a major problem yet, but over time the staleness will become more prominent. Google has not yet disclosed how often it will be updated.

In this column I focus on the gaps still found through test searches in the first days of 2005 in the coverage of the most prominent archives covered by Google Scholar. The tests were done by limiting the subject search to specific site names which host the archives, such as < site:nature.com > for the Nature Publishing Group, < site:sciencemag.org > for Science magazine and < site:adsabs.harvard.edu > for the Astrophysics Data System, which is maintained at Harvard University, among many other mirror sites. The site limiters are not always obvious, but from scanning search results the diligent user can figure them out. In some other cases a slightly different version of the domain name may yield somewhat different results, so one should proceed carefully (Jacso, 2004). My special polysearch engine, which I used for the test, is available for anyone at www2.hawaii.edu/~jacso/scholarly/side-by-side2.htm

The test results in Google Scholar were deeply disappointing in light of what the native search engines of the sites retrieve for the same query.
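The site-limited test queries described above can be assembled mechanically. The following is a minimal sketch that only builds query strings of the form used in this column; `build_query` is a hypothetical helper, and nothing here sends a request to Google Scholar or to any publisher's site.

```python
from typing import Optional


def build_query(terms: str, site: Optional[str] = None,
                exact_phrase: bool = False) -> str:
    """Build a search string, optionally as an exact phrase limited to one site."""
    query = f'"{terms}"' if exact_phrase else terms
    if site:
        query = f"{query} site:{site}"
    return query


# Site limiters named in the text for three of the archives tested.
SITES = {
    "Nature Publishing Group": "nature.com",
    "Science": "sciencemag.org",
    "Astrophysics Data System": "adsabs.harvard.edu",
}

for name, domain in SITES.items():
    print(f"{name}: {build_query('tsunami warning', site=domain, exact_phrase=True)}")
```

Running the same generated string against each archive keeps the test queries consistent, so that differences in hit counts reflect coverage rather than variations in how the query was typed.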
Through Google Scholar, the search for "tsunami" in the title field, limited to the site of the Nature Publishing Group (NPG), yields a single record, while the native search engine finds eight items, all of them relevant. Searching for the words "tsunami warning" in the full text using the two alternative tools shows an even bigger discrepancy: 17 items retrieved by the native search engine and only one by Google Scholar (see Figure 2). For verification, the 16 additional hits were searched using the "allintitle" option in Google Scholar without any site limitations to see if the records may be available through other sources.

Figure 2. The hit ratio between the native search engine and Google Scholar is 8:1 (screenshots: Google Scholar shows 1 result from nature.com for intitle:tsunami; the native search engine shows 8 documents matching the title query)
One was found in ADS, two in the archive of the National Institutes of Health. For one article there was one minimalist record found, labeled as [CITATION]. This follow-up known-item search still resulted in an abysmal hit rate in Google Scholar (see Figure 3).

Figure 3. The ratio is far worse for full-text searching (screenshots of the native search engine results for the full-text query "tsunami warning" and of the Google Scholar results)

Apparently Google did not fully index the current eight-year collection, let alone the archive of Nature (which includes all the issues from 1987 to 1996) and the other 64 journals of NPG, all of which are hosted on the nature.com site. The native search engine at the NPG site finds nearly 87,000 records for items published in Nature alone between January 1987 and December 2004; Google Scholar finds only 13,700 records from the entire nature.com site.
Using Google Scholar to search for the exact phrase "tsunami warning" in Nature retrieves one hit for a 1993 article from the ADS database of Harvard and none from Nature's archive (see Figure 4). The native search engine finds seven matching records. The ADS record provides a link to the Nature archive, but the otherwise excellent ADS collection is not a substitute for Nature's 18 years of digital coverage at its home site.

In addition, the test searches revealed that Google Scholar also has puny coverage of ADS itself. The native search engine of the ADS database finds 32 records with the exact phrase "tsunami warning" in the abstract, while Google Scholar retrieves a mere nine records for the same query from the ADS database. It is quite telling about the shallowness of coverage that Google finds only 268,600 of the more than 4.1 million indexing/abstracting records in ADS.

A similar pattern is found when searching the ten-year archive of Science magazine (October 1995-December 2004) by the native search engine (nearly 40,000 records) versus Google Scholar with the site:sciencemag.org parameter (11,800 records). For the exact phrase "tsunami warning" Google Scholar retrieved a single record with this site limiter parameter, while the native search engine found three articles. Although Google Scholar does not always reproduce the same number of hits even when the search is repeated within one-hour intervals, these hit figures did show up consistently (see Figure 5). This was not the case when searching the site of the Proceedings of the National Academy of Sciences (PNAS), where the search retrieved 12,900 hits in late November, but 300 fewer records on 1 January 2005. I can only speculate that Google dynamically assigns the server to answer the query and the query-servers may not mirror their content exactly.

The above three periodicals are among the most cited and most respected scholarly journals in their respective fields.
If Google Scholar finds only 10-30 per cent of the records which are available through the sophisticated yet intuitive native search engines, users will remain unaware of many potentially important articles.

Figure 4. A record for a Nature article retrieved only from the ADS database

Figure 5. Search results from ADS through its native search engine and Google Scholar
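The coverage ratios implied by the record counts reported in this column can be checked with a few lines of arithmetic. This is a minimal sketch: the counts are the rounded figures quoted in the text ("nearly 87,000", "more than 4.1 million", "nearly 40,000"), so the percentages are approximate, and the Nature figure compares Google Scholar's count for the entire nature.com site against the native count for Nature alone, which makes it an upper bound for that journal. The name `coverage_percent` is illustrative.

```python
# (records found by Google Scholar, records found by the native search engine)
COUNTS = {
    "Nature (nature.com)": (13_700, 87_000),
    "ADS": (268_600, 4_100_000),
    "Science (sciencemag.org)": (11_800, 40_000),
}


def coverage_percent(found: int, available: int) -> float:
    """Share of the natively available records that Google Scholar retrieves."""
    return 100.0 * found / available


for source, (found, available) in COUNTS.items():
    print(f"{source}: {coverage_percent(found, available):.1f}%")
```

With the rounded counts above, Nature and Science fall within the 10-30 per cent range, while the ADS ratio comes out even lower.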
These days, when scientists, administrators, politicians and financial experts need to find comprehensive and high quality scholarly information about the state of the art in tsunami warning systems to implement a feasible solution for the devastated Indian Ocean region, many will turn to Google Scholar, only to discover a fragment of the scholarly literature. They will also miss out on many scholarly papers which are open access (sometimes after an embargo of 3-12 months), as is the case with the poorly-covered papers published in the top-cited PNAS.

The fact that Google Scholar is in beta version is not a good excuse for the massive content omissions. Google has used the beta "shield" for some of its services for over two years after their launch. Hopefully, Google Scholar will come out of beta in a much shorter time, disclose the sources covered and fill the gaps to provide an excellent free tool for scholarly information discovery and retrieval.

References
Jacso, P. (2004), "Peter's digital ready reference shelf - Google Scholar", (web-only document), available at: http://GoogleScholar.notlong.com
Price, G. (2004), "Google Scholar documentation and large PDF files", (web-only document), available at: http://blog.searchenginewatch.com/blog/041201-105511
Sullivan, D. (2004), "Search engine size war erupts", (web-only document), available at: http://blog.searchenginewatch.com/blog/041111-084221