As a wise man once said: and now for something completely different. This post is going to be just that.

I received an Amazon Kindle for Christmas and absolutely love it. It has replaced hauling around a bunch of physical books and made my life a lot easier when it comes to managing highlights and notes about the topics in the books (I read a lot of educational texts).

Previously I carried a physical notebook and transcribed it into electronic notes by hand. I use Evernote for all my note-taking. With the Kindle, all your highlights and notes are stored on the device itself and synchronised to Amazon. Amazon has a single page that shows all your notes. You can then copy & paste from that page to whatever note-taking program you’re using.

So getting the Kindle was a massive improvement in my workflow for my notes about books. But it was still an excruciating task to manually copy & paste the notes from the Amazon page to Evernote.

Not Automagically Enough

I started digging into that web page to see if I could somehow automate it. As it turned out, the page uses a dynamic scroller which only loads more content when you reach the bottom of the page (see Infinite Scrolling). Just retrieving the HTML contents of the page with a server-side script wasn’t going to work, because that would only return the first batch of notes. Also, Amazon deploys a lot of countermeasures on their website to stop bots from reading content.

Enter PhantomJS

There are a few JavaScript tools that can simulate a browser. This basically means that you can run a browser session against a website from a server-side script (i.e. in the background, automatically). PhantomJS is one of them: a headless WebKit browser that you script in JavaScript (it is a standalone program rather than a Node.js library, although Node.js wrappers for it exist). I found a proper example of how to use it, so there we go.

Browser Workflow

To get the highlights and notes from the Amazon website, this workflow needs to be followed:

  1. Go to https://kindle.amazon.com/
  2. Click the Sign In button
  3. Log in with your Amazon username & password
  4. Browse to https://kindle.amazon.com/your_highlights
  5. Scroll all the way down until the infinite scroller stops
  6. Save the entire HTML output and parse it to single out individual highlights and notes

After getting this into the PhantomJS script, I had the output of all the highlights & notes, ready to be parsed.
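
Step 5 is the only non-obvious part of that workflow, so here is a minimal sketch of the idea. It is written in Python rather than as a PhantomJS script, and it is not the code from my script: the `Page` interface, its three method names, and the default timings are all hypothetical stand-ins for whatever your browser driver offers. The loop keeps scrolling until the page height stops growing for a few rounds in a row, then returns the HTML.

```python
import time
from typing import Callable, Protocol


class Page(Protocol):
    """Hypothetical minimal browser-page interface (not a real library API)."""

    def scroll_to_bottom(self) -> None: ...
    def content_height(self) -> int: ...
    def html(self) -> str: ...


def scroll_until_stable(
    page: Page,
    max_rounds: int = 200,
    settle_rounds: int = 3,
    wait: Callable[[], None] = lambda: time.sleep(1.0),
) -> str:
    """Scroll until the page height stops growing, then return the HTML.

    max_rounds caps the total number of scrolls so a misbehaving page
    cannot keep the loop running forever. settle_rounds is how many
    consecutive scrolls without growth count as "the scroller stopped".
    """
    last_height = page.content_height()
    unchanged = 0
    for _ in range(max_rounds):
        page.scroll_to_bottom()
        wait()  # give the scroller time to fetch the next batch
        height = page.content_height()
        if height == last_height:
            unchanged += 1
            if unchanged >= settle_rounds:
                break
        else:
            unchanged = 0
            last_height = height
    return page.html()
```

Requiring a few unchanged rounds instead of one guards against a slow response being mistaken for the end of the list. The right wait time and round count depend on the page and your connection, so treat the defaults as guesses.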

Parsing the HTML

Now that I had the entire HTML from the highlights page, I could parse it into individual records and insert those into a database. Considering I’m lazy and all kinds of good people have put out libraries for such things, I used PHP Simple HTML DOM Parser. After that it was a cakewalk to get the individual records and synchronise them to a database.
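
To give an idea of what the parse-and-sync step involves, here is a rough sketch using only the Python standard library instead of PHP Simple HTML DOM Parser. It is an illustration, not the code from my script. The CSS class names `highlight` and `noteContent`, the `notes` table, and its columns are assumptions made up for the example; inspect the real page to find the actual markup.

```python
import sqlite3
from html.parser import HTMLParser


class HighlightParser(HTMLParser):
    """Collect the text of <span> elements selected by CSS class.

    The class names below are assumptions for illustration only.
    """

    KINDS = {"highlight": "highlight", "noteContent": "note"}

    def __init__(self):
        super().__init__()
        self.records = []   # list of (kind, text) tuples
        self._kind = None   # kind of the span being read, if any
        self._depth = 0     # nesting depth of <span> inside that span
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._kind is not None:
            self._depth += 1  # nested span inside a record
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.KINDS:
                self._kind = self.KINDS[cls]
                self._depth = 1
                self._chunks = []
                break

    def handle_data(self, data):
        if self._kind is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._kind is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.records.append((self._kind, text))
            self._kind = None


def sync_records(conn, records):
    """Insert records, skipping ones already stored; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "kind TEXT NOT NULL, text TEXT NOT NULL, UNIQUE (kind, text))"
    )
    before = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO notes (kind, text) VALUES (?, ?)", records
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
    return after - before
```

The `UNIQUE` constraint together with `INSERT OR IGNORE` is what makes this a synchronisation rather than a plain import: running it again on the same page adds only highlights and notes that are not in the database yet.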

Sounds Useful?

As with most things that might be useful for someone else, I put this script on GitHub. You can find the direct link below. Have fun!

gitHub-download-button NSX


