Monday, September 19, 2016

Starting a Database

The first step in my project was to build a database of topics and the review videos for each one. I started with one of the best homework review sites out there, so you probably won't be surprised that I chose Khan Academy.

There is a big gap between the content on Khan Academy and a JavaScript program inside a Chrome extension, so I need to bridge that gap with a database that a JavaScript program can easily parse (that is, read and extract usable data from).

Luckily, Khan Academy already has an organized database on its website (khanacademy.org/library), depicted below. Each subject leads to a page of topics, each topic contains subtopics, and each subtopic contains the individual videos.

The next step is to create a list of every video in the Khan Academy library, organized by topic. Easy! Just start by individually copying the link to each of the 4,293 math videos...

Just kidding! That would be tedious beyond imagination. Instead, I wrote a computer program to put all of the information into a file. I had many options for accomplishing this, but I decided to use the programming language I am most familiar with, Java (not the same as JavaScript), in a Java editor called Eclipse. Now skip this next sentence if you don't want to get any more technical: I used Apache HttpClient to connect to the webpages in the Khan Academy library, along with an HTML parser to extract data from each page, such as the URL of a video or the title of a topic.
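
For readers who did not skip that sentence, the extraction step can be sketched roughly as follows. This is an illustration only, not the program described above: it uses a regular expression from the Java standard library in place of a real HTML parser (which would be more robust), it leaves out the download step entirely, and the class name `LinkExtractor`, the method `extractLinks`, and the sample HTML are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls (URL, link text) pairs out of a page's HTML.
final class LinkExtractor {

    // Simplified pattern: only handles double-quoted href values and
    // link text with no nested tags. A real HTML parser copes with far more.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns one String[] { url, title } per link found in the HTML.
    static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(new String[] { matcher.group(1), matcher.group(2).trim() });
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the real pages use their own element names.
        String html = "<ul>"
                + "<li><a class=\"topic\" href=\"/sample/topic-one\">Topic One</a></li>"
                + "<li><a class=\"topic\" href=\"/sample/topic-two\"> Topic Two </a></li>"
                + "</ul>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[1] + " -> " + link[0]);
        }
    }
}
```

In a full program, the `html` string would come from the HTTP client's response for each library page instead of a hard-coded sample.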

That process takes a while, simply because the program has to connect to the webpage for each of the 384 topics in the Khan Academy library. On top of that, I have to work out all the problems with the program before it runs correctly (nothing ever goes right the first time with a computer program).

The end result is four XML files, one for each major subject on Khan Academy (math, science, humanities, and economics). XML is a fancy-sounding but widely supported data format that most programming languages can read easily. Here's what that looks like:
(this is a tiny fraction of the math topics XML file)
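
As a rough idea of how a Java program could write a file like this, here is a minimal sketch using the XML tools built into Java. Everything specific in it is an assumption: the element names `subject`, `topic`, and `video`, the attribute names, the class `TopicXmlWriter`, and the example.com URL are invented for illustration and are not taken from the real files.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical writer: turns topic -> (video title -> URL) into an XML string.
final class TopicXmlWriter {

    static String toXml(String subject, Map<String, Map<String, String>> topics) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // Root element for the whole subject (assumed layout).
            Element root = doc.createElement("subject");
            root.setAttribute("name", subject);
            doc.appendChild(root);

            for (Map.Entry<String, Map<String, String>> topic : topics.entrySet()) {
                Element topicEl = doc.createElement("topic");
                topicEl.setAttribute("title", topic.getKey());
                root.appendChild(topicEl);

                // One <video> element per video in this topic.
                for (Map.Entry<String, String> video : topic.getValue().entrySet()) {
                    Element videoEl = doc.createElement("video");
                    videoEl.setAttribute("title", video.getKey());
                    videoEl.setAttribute("url", video.getValue());
                    topicEl.appendChild(videoEl);
                }
            }

            // Serialize the in-memory document to indented text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build the XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> videos = new LinkedHashMap<>();
        videos.put("Sample video", "https://example.com/sample-video"); // placeholder URL
        Map<String, Map<String, String>> topics = new LinkedHashMap<>();
        topics.put("Sample topic", videos);
        System.out.println(toXml("math", topics));
    }
}
```

Because XML nests naturally, the subject, topic, and video levels of the library map directly onto parent and child elements, which is what makes the file easy to walk through later from JavaScript.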

One downside of the approach I chose for this part of the project is that the Khan Academy website is constantly changing. I first made this list a few months ago, and Khan Academy has recently added more content. That should have been easy to pick up, but they also changed the naming scheme of the HTML elements that define the topics and their respective URLs. Long story short: I had to rewrite half of the program. It would be crazy to think that won't happen again in the near future, but there's nothing I can do about that. Still, there are many different directions this project can take me in, and this is only one small part of it.
