How To: Converting PDF to Word and HTML

I started this little project because I have a client who needs to get his 24-page PDF online. The problem is that a 24-page PDF with all the bells and whistles ends up being over 5 MB in size. This causes issues for people on sub-cable internet connections, as the loading time becomes horrendous. To solve the problem, I am going to offer the PDF as an optional download and have all the links point to the HTML-converted version of whichever page the visitor wants to see. This does cause problems when something is updated in the PDF: the HTML is not dynamic or bound to the PDF, so an update has to be made in both places. The only way around that is to make the HTML the originating source and have the ‘download as pdf’ link call a server-side script that packages the HTML as a PDF. That, however, is too much for what this client needs, and the issues with updating will have to be taken in stride.

Sites need to be able to interact in one single, universal space.

-Tim Berners-Lee

Tools Needed

  1. RTF or DOC reader (I prefer LibreOffice) that can convert to HTML
  2. A program designed to convert PDF to DOC format (I used Able2Doc, licensed)

Unfortunately, in my case, the PDF contained a large number of tables that ended up as images after conversion. Because of this, I had to handle things a little differently, which I will explain later.

First things first, let’s convert to HTML

With the software I used, Able2Doc, you load the PDF and simply convert the file to DOC format. Note that not many converters will go straight from PDF to DOC or RTF. Once you have converted the PDF to DOC or RTF, you can open that file in Microsoft Office or Open Office. Both can open these files and then export them as HTML.

Microsoft Office way of doing things

Office is really simple. With the document open, go to File->Save As->Other

PDF to HTML Save as

Then change the type to an HTML document, put in the name, and you're done!

PDF to HTML Save as HTML

Open Office’s way of doing things

In Open Office, it is actually easier! Just have your document open, go to File->Save As, and select HTML from the drop-down list. There is no extra step as there is in Word.

PDF to HTML OO

When things get messy…

You have to start getting creative. I know, it stinks when things just don’t go your way. I mentioned earlier that my specific issue could not be settled by this process alone, because images in the PDF were making up the tables and the text did not stay inside the image/tables when converted to HTML. I ended up having to go with a slightly altered reality, but the end result for the user is nearly the same.

The idea that I had was to split the PDF into images. This was actually really easy to do. I swapped over to Linux for this part (Ubuntu Gutsy). The PDF reader program has the ability to output your PDFs as JPGs. This came in very handy: I simply output the PDF I was using, 28 pages of it, as JPGs and then used JavaScript to make a nice little setup for checking out the pictures.

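// Parses a query string such as "?p=3&x=1" into key=value pairs.
function PageQuery(q) {
  // Drop the leading "?"; an empty or "?"-only string means no query.
  if (q.length > 1) {
    this.q = q.substring(1, q.length);
  } else {
    this.q = null;
  }
  this.keyValuePairs = [];
  // Test this.q, not q: a bare "?" leaves this.q null and must not be split.
  if (this.q) {
    var pairs = this.q.split("&");
    for (var i = 0; i < pairs.length; i++) {
      this.keyValuePairs[i] = pairs[i];
    }
  }
  this.getKeyValuePairs = function() { return this.keyValuePairs; };
  // Returns the value stored under key s, or false if the key is absent.
  this.getValue = function(s) {
    for (var j = 0; j < this.keyValuePairs.length; j++) {
      if (this.keyValuePairs[j].split("=")[0] == s) {
        return this.keyValuePairs[j].split("=")[1];
      }
    }
    return false;
  };
  // Returns an array of all parameter names.
  this.getParameters = function() {
    var a = new Array(this.getLength());
    for (var j = 0; j < this.keyValuePairs.length; j++) {
      a[j] = this.keyValuePairs[j].split("=")[0];
    }
    return a;
  };
  this.getLength = function() { return this.keyValuePairs.length; };
}

// Looks up key in the current page URL.
// A missing key comes back as the string "false".
function queryString(key) {
  var page = new PageQuery(window.location.search);
  // decodeURIComponent replaces the deprecated unescape()
  return decodeURIComponent(page.getValue(key));
}

// Returns the value for key, defaulting to page '1' when the key is missing.
function displayItem(key) {
  var value = queryString(key);
  if (value == 'false') {
    return '1';
  }
  return value;
}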

What else?

With this code in place, it is easy to see what it is doing. It parses the query string of the page URL and looks up the value of whichever variable name is passed in. This gives JavaScript the equivalent of the GET variables that PHP reads from the URL.
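For comparison, here is a minimal sketch of the same lookup using the URLSearchParams class that current browsers provide. This is not the code I used; the function name getPageNumber and the pageCount argument are made up for illustration, and the parameter name p is taken from the displayItem('p') call used on this page.

```javascript
// Sketch: read a page number from a query string such as "?p=7".
// getPageNumber and pageCount are hypothetical names, not part of the original code.
function getPageNumber(search, pageCount) {
  const params = new URLSearchParams(search); // a leading "?" is ignored
  const raw = params.get('p');                // null when "p" is absent
  const page = Number.parseInt(raw, 10);      // NaN for null or non-numeric input
  if (Number.isNaN(page) || page < 1 || page > pageCount) {
    return 1; // fall back to the first page
  }
  return page;
}

console.log(getPageNumber('?p=7', 28));
console.log(getPageNumber('', 28));
```

Unlike the hand-rolled parser, this version also rejects page numbers outside the catalog, so a bad link falls back to the first page instead of pointing at an image that does not exist.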

Next comes the code that changes the image based on the value of a select box, so the complete picture comes into view.

// Sets the initial image: page 1 when value is empty,
// otherwise the page named by the "p" query variable.
function loadPage(value) {
  if (value == "") {
    document.getElementById('mainimage').src = "img/ProductCatalog/Page1.jpg";
  } else {
    document.getElementById('mainimage').src = "img/ProductCatalog/Page" + displayItem('p') + ".jpg";
  }
}

// Shows the image whose path is the value of the selected option.
function changeImage() {
  var $list = document.getElementById('list');
  document.getElementById('mainimage').src = $list.options[$list.selectedIndex].value;
}

// Steps back one page, wrapping from the first page to the last.
function prevImage() {
  var $list = document.getElementById('list');
  if ($list.selectedIndex == 0) {
    $list.selectedIndex = $list.options.length - 1;
  } else {
    $list.selectedIndex--;
  }
  changeImage();
}

// Steps forward one page, wrapping from the last page to the first.
function nextImage() {
  var $list = document.getElementById('list');
  if ($list.selectedIndex == $list.options.length - 1) {
    $list.selectedIndex = 0;
  } else {
    $list.selectedIndex++;
  }
  changeImage();
}

That was pretty much all the JavaScript I needed, and I could handle everything else in HTML, which made things much easier.
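The markup itself is not reproduced here. As a rough sketch only, assuming a select element with the id list and an img element with the id mainimage (the ids the functions above look up), the option list for the 28 pages could be generated rather than typed by hand. The names buildOptions and wrapIndex are hypothetical and not part of the original page.

```javascript
// Sketch: build the <option> markup for the page selector.
// The path pattern matches the one used in loadPage().
function buildOptions(pageCount) {
  const options = [];
  for (let page = 1; page <= pageCount; page++) {
    const src = 'img/ProductCatalog/Page' + page + '.jpg';
    options.push('<option value="' + src + '">Page ' + page + '</option>');
  }
  return options.join('\n');
}

// Hypothetical helper: move step positions through a list of the given
// length with wrap-around, the same index logic as prevImage()/nextImage().
function wrapIndex(index, step, length) {
  return (index + step + length) % length;
}

console.log(buildOptions(3));
```

The generated string would go inside the select element, with its onchange handler set to changeImage() and two buttons calling prevImage() and nextImage().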