22 Aug 2012

Web-Scraper for Google Scholar Updated!

I have updated the Google Scholar web-scraper function from GScholarScraper_2 to GScholarScraper_3 (and GScholarScraper_3.1), as the old version was outdated due to changes in the Google Scholar HTML code. The new script is leaner and faster. It returns a data frame or, optionally, a CSV file with the titles, authors, publications & links. Feel free to report bugs, etc.



Update 11-07-2013: bug fixes due to Google Scholar code changes - https://github.com/gimoya/theBioBucket-Archives/blob/master/R/Functions/GScholarScraper_3.2.R. Note that lately Google blocks your IP at about the 1000th search result (cumulative), so it's not much fun if you want to do extensive bibliometrics.
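
Given the limit at roughly 1000 results, one option is to cap how many result pages are requested in the first place. The following is only a minimal sketch, not part of GScholarScraper: the function names `capped_starts` and `build_urls` are hypothetical, and it assumes that results are paged with a `start` offset and that `num` sets the page size (only `num` appears in the URLs used here; `start` and the page size of 100 are assumptions).

```r
# Sketch: build the result offsets for a paged search, capped at a maximum.
# ASSUMPTIONS: `per_page` results per request and a `start` offset parameter;
# neither is taken from GScholarScraper itself.
capped_starts <- function(total_hits, per_page = 100, max_results = 1000) {
  n <- min(total_hits, max_results)   # never ask for more than the cap
  if (n <= 0) return(numeric(0))      # nothing to fetch
  seq(0, n - 1, by = per_page)        # zero-based offset of each page
}

# Hypothetical helper that pastes the offsets into search URLs
build_urls <- function(query, starts, per_page = 100) {
  paste0("http://scholar.google.com/scholar?q=", query,
         "&num=", per_page, "&start=", starts)
}

# A search reporting 280,000 hits is cut down to the first 1000 results
starts <- capped_starts(total_hits = 280000)
urls <- build_urls("allintitle:pantanal", starts)
```

With the defaults this yields ten URLs covering offsets 0 to 900. Whether staying under the cap actually avoids a block is not guaranteed, since the limit appears to be cumulative across searches.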

12 comments :

  1. Got an error message:
    Erro em htmlParse(url) :
    error in creating parser for http://scholar.google.com/scholar?q=allintitle:pantanal&num=1&as_sdt=1&as_vis=1

    I could not solve the problem.

    Anyway, it's an interesting function :)

    Ah, I use Tinn-R, Windows 7 and R 2.15.1, in case that helps you figure out the problem ^^.

    Replies
    1. Sorry, I can't reproduce the error. Since you only search for one word in the titles, you could use "intitle:pantanal"; however, it also works for me with "allintitle:pantanal".

    2. Well, I was trying to do something like this, to produce a figure showing how, for example, some theory gained citations over time.

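      # One query string per year; note the "&" separating as_ylo and as_yhi
      input <- paste("metapopulation&as_ylo=", 1980:2012, "&as_yhi=", 1980:2012, sep = "")

      anos <- 1980:2012

      resultados <- rep(NA, length(anos))

      # Count the publications returned for each year
      for (i in seq_along(anos)) {
        resultados[i] <- length(GScholar_Scraper(input[i], write = FALSE)$PUBLICATION)
      }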

      It makes many searches, one per year. It works sometimes, then stops working and starts giving the error I mentioned before.

    3. Please see the follow-up posting http://thebiobucket.blogspot.co.at/2012/08/toy-example-with-gscholarscraper31.html - maybe this will help!

      However, there is an issue with Google blocking automated searches, which arises for search strings giving more than 1000 results. And occasionally your IP seems to be blocked altogether. I'm afraid there is no quick solution for this (changing your IP, resetting the modem, etc. fixes the problem, though not very elegantly).
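
      For a per-year loop like the one above, a failed request can at least be recorded as NA instead of aborting the whole run. This is only a sketch under assumptions: `count_per_year`, `fetch_fun` and `pause` are hypothetical names, `fetch_fun` stands in for a scraper call that returns a data frame with a PUBLICATION column, and pausing between requests is not known to prevent a block.

      ```r
      # Sketch: one search per year, with a pause between requests and NA
      # recorded when a request fails.
      # ASSUMPTION: `fetch_fun` is a stand-in for the real scraper call;
      # `pause` is an arbitrary wait in seconds.
      count_per_year <- function(years, fetch_fun, pause = 5) {
        counts <- rep(NA_integer_, length(years))
        for (i in seq_along(years)) {
          counts[i] <- tryCatch(
            length(fetch_fun(years[i])$PUBLICATION),
            error = function(e) NA_integer_   # e.g. the htmlParse error
          )
          if (i < length(years)) Sys.sleep(pause)  # wait before the next request
        }
        names(counts) <- years
        counts
      }

      # Offline stand-in for the scraper: fails for 1981, returns 3 rows otherwise
      fake_fetch <- function(year) {
        if (year == 1981) stop("error in creating parser")
        data.frame(PUBLICATION = rep("x", 3))
      }

      res <- count_per_year(1980:1982, fake_fetch, pause = 0)
      ```

      Years that failed stay NA in the result, so they can be retried later instead of rerunning every search.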

  2. Thank you for this. I previously spent many hours working out how to scrape data from Google Scholar. Sadly, once I got a working program, I found Google Scholar locked me out after I had retrieved around 100 records. Correspondence with them got me nowhere: they basically accuse you of unethical behavior if you try to automate searches. I can't understand their logic and they don't explain it.
    It's very disappointing for those of us who want to do serious research using bibliometrics.
    I haven't tried your program but assume it would hit the same snag?

    Replies
    1. With my function, which uses htmlParse(url) from the XML package, it works for search strings that give fewer than 1000 hits. Beyond that it seems to be blocked.

  3. Really cool application. Could you please provide a brief example of how to produce a word cloud with the data frame returned by GScholar_Scraper_3.1?

    I tried following the example shown in GScholar_Scraper_2 but keep getting word clouds of publication years, and removing numerics leaves an empty data frame. I'm missing something simple in corpus <- Corpus(DataframeSource(df[, 1:2])) but can't see what.



    Thanks again

    Replies
    1. Check the follow-up (http://thebiobucket.blogspot.com/2012/08/follow-up-making-word-cloud-for-search.html).

  4. Hi, is this still live? I read elsewhere on your site that Google had changed their code. Thanks so much

    Replies
    1. Please try version 3.1 and report if there are any issues!

  5. I really appreciate your efforts here. However, by 100 hits do you mean 1000 returned results? If so, this code has very limited use. I searched for "authenticity" as my keyword and 280,000+ results were returned. Obviously, the code didn't work. At the least, you could add an argument that lets the user limit the fetched results to the first 1000.
