Scraping Web Data with Excel [Macros]

Posted in software by Christopher R. Wirz on Wed Feb 03 2016

Not all web-scraping has to be challenging (such as a massive set of python scripts). Some of it can be done just by going to a URL with a GET request. (Some of the best APIs respond to the GET HTTP verb)

The challenge is comparing the data on all the pages at once. It would be easier if all the information was in something like Excel - and it can be! Most pages use HTML, which is easily understood using the Document Object Model (DOM). For these cases that don't need security, it is very easy to load data using excel macros.

The first step is to save your file as an XLSM (macro enabled). Then open the visual basic editor (alt+f11) and create a new module. Inside the module, you can use the following code:


Option Explicit ' forces you to declare all the variables

Function GetValueFromURL(url As String, id As String) As Variant
    ' to refer to the running copy of Internet Explorer
    Dim ie As InternetExplorer

    ' to refer to the HTML document returned
    Dim html As HTMLDocument

    ' open Internet Explorer in memory, and go to website
    Set ie = New InternetExplorer

    ' turn off visibility so the IE window does not open
    ie.Visible = False
    ie.navigate url

    ' Wait until IE is done loading page
    Do While ie.READYSTATE <> READYSTATE_COMPLETE
        Application.StatusBar = "Trying to go to " & url & " and get the value of " & id & " ..."
        DoEvents
    Loop

    ' get the HTML document returned
    Set html = ie.document

    ' close down IE
    Set ie = Nothing

    ' and reset status bar
    Application.StatusBar = ""

    ' In VBA, set the result equal to the method name to return the value
    GetValueFromURL = html.getElementById(id).innerText

End Function

Nice! Now I can do something like get the date of every blog post, and make a histogram to see when I was busiest. Well, that's just one example.

So, the above example will work with HTML, but what about loading things like JSON that are NOT HTML documents. We don't want to parse anything, just get a string


Function GetWebText(url As String) As String
    Dim objHttp As Object
    Set objHttp = CreateObject("MSXML2.ServerXMLHTTP")

    ' use the HTTP GET verb (POST is recommended to send data)
    ' when calling the url
    ' asynchronously = False
    objHttp.Open "GET", url, False

    ' since this is GET, we don't send data to the server
    objHttp.send ""

    GetWebText = objHttp.responseText

End Function

This method is a little simplified, but you don't have access to the DOM. However, changing the GET verb to POST, one can send data to a server using this macro.

Looking for a job?