As a wise man said: now for something completely different. This post is going to be just that.
I received an Amazon Kindle for Christmas and absolutely love it. It has replaced hauling around a bunch of physical books and made my life a lot easier when it comes to managing highlights and notes about the topics in the books (I read a lot of educational texts).
Previously I carried a physical notebook and transcribed it into electronic notes by hand. I use Evernote for all my note-taking purposes. With the Kindle, all your highlights and notes are stored on the device itself and synchronised to Amazon. There’s a single page there which shows all your notes. You can then copy & paste from that page to whatever note-taking program you’re using.
So getting the Kindle was a massive improvement in my workflow for my notes about books. But it was still an excruciating task to manually copy & paste the notes from the Amazon page to Evernote.
Not Automagically Enough
I started digging into that web page to see if I could somehow automate it. As it turned out, the page uses a dynamic scroller which loads more content when you reach the bottom (see Infinite Scrolling). Just retrieving the HTML contents of the page with a server-side script wasn’t going to work, because a plain HTTP request only returns the first batch of highlights. Also, Amazon deploys a lot of countermeasures on their website to stop bots from reading content.
Enter PhantomJS
There are a few JavaScript tools that can simulate a browser. This basically means that a script can run a browser session against a website in the background, automatically, without a visible window. PhantomJS is one of those tools (strictly speaking a standalone headless browser that you script in JavaScript, rather than a Node.js library), and I found a proper example of how to use it, so there we go.
Browser Workflow
To get the highlights and notes from the Amazon website, this workflow needs to be followed:
- Go to https://kindle.amazon.com/
- Click the Sign In button
- Login with Amazon username & password
- Browse to https://kindle.amazon.com/your_highlights
- Scroll all the way down until the infinite scroller stops
- Save the entire HTML output and parse it to single out individual highlights and notes
After getting this into the PhantomJS script, I had the output of all the highlights & notes, ready to be parsed.
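The "scroll until the infinite scroller stops" step is the least obvious part of that workflow, so here is a minimal sketch of the idea. It is not the script from this post: the `page` object with `getHeight()` and `scrollToBottom()`, the `wait` helper, and the option names are all hypothetical, chosen so the loop does not depend on any particular automation API. It assumes the page gets taller whenever a new batch of highlights loads.

```javascript
// Generic "scroll until nothing new loads" loop in ES5 callback style.
//
// Assumed (hypothetical) interface, not a real library API:
//   page.getHeight()      -> current document height in pixels
//   page.scrollToBottom() -> scrolls down so the scroller loads more content
//   wait(ms, next)        -> calls next() after roughly ms milliseconds
function scrollUntilStable(page, wait, options, done) {
  var maxRounds = options.maxRounds || 50; // safety cap against endless pages
  var delayMs = options.delayMs || 1000;   // time given to each batch to load
  var rounds = 0;

  function step(previousHeight) {
    page.scrollToBottom();
    rounds += 1;
    wait(delayMs, function () {
      var height = page.getHeight();
      // Stop when the page did not grow, or when the cap is reached.
      if (height === previousHeight || rounds >= maxRounds) {
        done({ rounds: rounds, height: height });
      } else {
        step(height);
      }
    });
  }

  step(page.getHeight());
}
```

In a real script, `wait` could be as simple as `function (ms, next) { setTimeout(next, ms); }`, and the two `page` methods would wrap whatever the tool offers for running code inside the page (in PhantomJS, that would typically be `page.evaluate`). Note that a slow response can look like "no growth", so a delay that is too short may stop the loop early.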
Parsing the HTML
Now that I had the entire HTML from the highlights page, I could parse it into individual records and insert those into a database. Considering I’m lazy and all kinds of good people have put out libraries for such things, I used PHP Simple HTML DOM Parser. After that it was a cakewalk to get the individual records and synchronise them to a database.
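To give an idea of what "parsing into individual records" looks like, here is a rough sketch. It is written in JavaScript against a DOM-like object rather than with PHP Simple HTML DOM Parser, so it illustrates the approach and is not the code from the actual script. The selectors (`.highlightRow`, `.highlight`, `.noteContent`) are hypothetical placeholders; the real class names have to be looked up in the saved HTML.

```javascript
// Turns a DOM-like `root` (anything offering querySelectorAll) into records.
// The selectors below are HYPOTHETICAL and must be replaced with the class
// names actually used on the highlights page.
var SELECTORS = {
  entry: '.highlightRow', // assumed wrapper around one highlight
  text: '.highlight',     // assumed element holding the highlighted text
  note: '.noteContent'    // assumed element holding the optional note
};

function textOf(element) {
  // Collapse whitespace so records compare cleanly when synchronising.
  return element ? element.textContent.replace(/\s+/g, ' ').trim() : '';
}

function extractRecords(root, selectors) {
  var entries = root.querySelectorAll(selectors.entry);
  var records = [];
  for (var i = 0; i < entries.length; i++) {
    var highlight = textOf(entries[i].querySelector(selectors.text));
    if (!highlight) { continue; } // skip rows without highlighted text
    records.push({
      highlight: highlight,
      note: textOf(entries[i].querySelector(selectors.note))
    });
  }
  return records;
}
```

Each record is a plain object with a `highlight` and a (possibly empty) `note`, which maps naturally onto one database row per highlight.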
Sounds Useful?
As with most things that might be useful for someone else, I put this script on GitHub. You can find the direct link below. Have fun!