So you want to start web scraping, eh? You have come to the perfect place, where I will try to start from the very easiest of basics. The language of choice for this blog, at least for now, will be Java. Java was chosen because it is a highly versatile language that runs in a great many environments. Some may argue that Python or R is the best for data analysis, and they might be right in some aspects, but this blog is going to stick to the basics by using a very well documented and popular programming language.
Step One:
First you are going to need to download the Java JDK (Java Development Kit). This installs Java on your computer along with the compiler and tools that the IDE (Integrated Development Environment) needs in order to build and run your code.
You may download the JDK from the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Download and install the program, but don’t worry about running anything associated with it.
Step Two:
Now we are going to download the IDE so that we have a program that will actually allow us to type up the code and run it. The IDE that we are going to be using for Java is IntelliJ IDEA Community, as it is very well put together and also free. You can download IntelliJ from the following link:
https://www.jetbrains.com/idea/download/
Download and install this program as well, then run it once the installation finishes.
Step Three:
Now we actually have all of the tools available to start programming Java. Upon running the program you are going to want to select New Project. Then select the Java tab, and in the top right should be a button labelled “new…”. Select that, then select JDK and choose the folder where you installed the Java JDK in step one. Once that’s done, your screen should look like this:
Select next in the bottom right, click next again, then give a name to your project. I am calling mine “FirstWebScrap”. Then click finish. Congratulations! You have now most likely created your very first programming project.
Step Four:
Now we are going to start programming, but first you need to create a file that we can type into. For that, select the “FirstWebScrap” project folder in the top left, and then select the “src” (source) folder. The file tree should look like this:
Then you will want to right click on the source folder and select new > Java Class. You are then able to name your new class (I named mine “WebScrap”):
Click “Ok” to create it. It should look something like this:
You have now made your first Java class. This is where we will be typing the code.
Step Five:
This is where a lot of the coding comes in, and I will try to keep this focused on the web scraping portion. If you have no experience with Java or with programming in general, I will simply say that this first piece of code is what allows the program to run, and it is necessary to start.
To start, press enter to create a new line between the curly braces of our WebScrap class. Then type the letter “p” and press “ctrl+j”; this will bring up an autocomplete menu. We will want to select the bottom option, “psvm”. This will create the main method, the opening code within which we will type our web scraping code. Your screen should now look like this:
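In case the screenshot is hard to read, here is a minimal sketch of roughly what the file should contain once the “psvm” shortcut has expanded. It assumes the class is named WebScrap; the comments are additions for explanation and will not appear in your editor.

```java
public class WebScrap {
    // "psvm" expands to this main method; Java starts running the program here.
    public static void main(String[] args) {
        // The web scraping code from the following steps goes between these braces.
    }
}
```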
Step Six:
We are now able to start typing out our code that scrapes data from the web. However, there is still one slight thing left to do to make our lives easier. There is a library called jsoup, which has already programmed a huge list of methods (essentially mini-programs that each do a task) that we can use to make our coding easier. To add this library to our program we press the keys “ctrl+shift+alt+s” at the same time. Then navigate to “libraries”, select the green plus sign, and select “from maven”. It should pull up this window, where you are able to type in the maven code to get the library. The code for the most current jsoup version as of this writing is: “org.jsoup:jsoup:1.8.2”.
Download the library and select “ok”, select “ok” again to add it to our project. You may now select “ok” in the bottom right to exit out of the window.
Step Seven:
Now with our library from jsoup we are able to start coding our web scraping program. First we are going to type “Document doc;”, which will create a document (webpage) with the variable name of “doc”. However, the word Document will be in red, as Java does not yet recognize that type. Place your cursor over the word to see the following box:
Press “alt+enter” and select the following option from the jsoup library:
It will then add a reference to the part of the jsoup library where the “Document” type is defined. Next, we will want to define a String variable to store the data we receive from web scraping. For this demo, we are going to be scraping the most current ask price for an ounce of gold. Thus, we will type “String askPrice;” on the next line.
Step Eight:
This next step is a big one. Here we are importing two more library references, as noted in the top of the picture below, and we are adding a “try/catch” block, which will attempt the code within the block and, upon failure, will not break our program. Within the try block is our first code with relevance to web scraping: this line of code tells jsoup to retrieve the website and store it in our Document labeled “doc”. Once we have this “doc”, we are able to manipulate it. Follow the code shown below:
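As a rough sketch of what the program looks like at this point: the two extra imports are most likely org.jsoup.Jsoup and java.io.IOException, since those are what the connect-and-get call needs. Printing the stack trace in the catch block is an assumption; it is what IntelliJ usually generates, but you may handle the failure differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class WebScrap {
    public static void main(String[] args) {
        Document doc;     // will hold the parsed web page
        String askPrice;  // will hold the scraped price in a later step

        try {
            // Ask jsoup to download the page and parse it into a Document.
            doc = Jsoup.connect("http://www.kitco.com/market/").get();
        } catch (IOException e) {
            // The request failed (no connection, bad response, and so on).
            // Report the problem instead of letting the program crash.
            e.printStackTrace();
        }
    }
}
```

At this stage the program fetches the page but does nothing with it yet, so a successful run prints nothing.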
Step Nine:
Now it’s time to actually tell the program what we want off of the website. For this we need to use a browser such as Mozilla Firefox or Chrome, where we are able to inspect an HTML element. So what we wish to do (I am using Mozilla Firefox, but Chrome will be very similar) is go to the website we will be scraping from, which is the following:
http://www.kitco.com/market/
Upon reaching this website, right click on the current gold asking price (select the number) to see the following box.
Select “inspect element” and your screen should appear like this with the element you selected highlighted. I clicked on the element to expand the box to show that within the element is the number we want.
Now right click on the <td> brackets that hold our number and select “Copy Unique Selector”. This copies a unique identifier that we can tell our program to grab.
Step Ten:
Now we will type up the code that will grab the data from the unique selector. The code shown below will take the text of the element found by that selector and store it as a string under the variable name of askPrice.
Then within the quotation marks of the select command, paste the unique selector from step nine. It should look like this now:
The unique identifier is quite long.
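To see how select behaves without depending on the live website, here is a small offline sketch. The class name SelectorDemo, the helper textAt, the id “gold-ask”, and the price in the sample HTML are all made up for illustration; the real selector is whatever your browser copied in step nine.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical helper: parses an HTML string and returns the combined text
    // of every element matching the CSS selector ("" when nothing matches).
    static String textAt(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssSelector).text();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the real page.
        String html = "<table><tr><td id=\"gold-ask\">1234.50</td></tr></table>";

        String askPrice = textAt(html, "td#gold-ask");
        System.out.println(askPrice); // prints 1234.50
    }
}
```

In the real program the same select(...).text() call is applied to the “doc” fetched from the website, with your copied selector between the quotation marks.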
Step Eleven:
Now that we have the value stored within a string, we want to be able to see what that value is. For that we tell the system to print out what the value is. That is done in code by doing the following:
Notice how we changed the “String askPrice;” to “String askPrice = “”;”. That is done as a safety measure so that askPrice always holds a value, even if our web scrape does not work. Java will also refuse to compile code that reads a local variable which might never have been assigned.
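Putting the steps together, the finished program should look roughly like the sketch below. The selector string “td#paste-your-selector-here” is only a placeholder and will not match anything on the page; replace it with the unique selector you copied in step nine. Placing the print statement after the try/catch block is an assumption about the layout.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class WebScrap {
    public static void main(String[] args) {
        Document doc;
        String askPrice = ""; // start with an empty value in case the scrape fails

        try {
            // Download and parse the page.
            doc = Jsoup.connect("http://www.kitco.com/market/").get();

            // Placeholder selector: swap in the one copied from your browser.
            askPrice = doc.select("td#paste-your-selector-here").text();
        } catch (IOException e) {
            // The page could not be retrieved; askPrice stays empty.
            e.printStackTrace();
        }

        // Show whatever we managed to scrape.
        System.out.println(askPrice);
    }
}
```

With the placeholder left in, the program prints an empty line, which is a handy sign that the selector still needs to be replaced.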
Step Twelve:
You have completed the program! Time to run it and see if it works. To run it, right click anywhere in the editing window where there isn’t any text and select “run WebScrap.main()”.
It will then compile and show you the results of the program in the bottom portion of the IDE. Success! It worked.
And that concludes our very short and very simple web scraping tutorial. It obviously can get much more complex than this, but this is a short, easy program that gets a fast result and kick-starts your path towards web scraping. Thank you for following along, and I hope it all worked!