How To: Converting PDF to Word and HTML

Sites need to be able to interact in one single, universal space.

-Tim Berners-Lee

I started this little project because I have a client who needs to get his 24-page PDF online. The problem is that a 24-page PDF with all the bells and whistles ends up being over 5 MB in size. This causes issues for people on connections slower than cable, as the loading time becomes horrendous. So to solve the problem, I am going to offer the PDF as an optional download and have all the links point to the converted HTML (HyperText Markup Language: what web pages are written in) page for whichever page the visitor clicks on. This does, however, cause problems if something is updated in the PDF: the HTML is not dynamic or bound to the PDF, so an update will have to occur in both places. The only way around that is to make the HTML the originating source and have the ‘download as PDF’ link call a server-side script that packages the HTML as a PDF. That, however, is too much for what this client needs, and the issues with updating will have to be taken in stride.

Tools Needed:
An RTF or DOC reader that can convert to HTML (I prefer OpenOffice 2.2)
A program designed to convert PDF to DOC format (I used Able2Doc, licensed)

Unfortunately, in my case, the PDF contained a large number of tables that came out as images after conversion. Because of this, I had to handle things a little differently, which I will explain later.


First things first, let’s convert to HTML

With Able2Doc, the software I used, you load up the PDF and simply convert the file to DOC format. Note that not many converters will go straight from PDF to DOC or RTF. Once you have converted the PDF to DOC or RTF, you can open that file in Microsoft Office or OpenOffice. Both can open these files and then export them as HTML.

Microsoft Office’s way of doing things

Office is really simple. With your document open, go to File->Save As->Other

After that, change the file type to an HTML document, put in the name, and you’re done!

OpenOffice’s way of doing things
In OpenOffice, it is actually easier! Just have your document open, go to File->Save As, and select HTML from the drop-down list. There is no extra step as there is in Word.


When things get messy…
You have to start getting creative. I know, it stinks when things just don’t go your way. I mentioned earlier that my specific issue could not be settled by this process alone, because images made up the tables in the PDF and the text did not stay inside those image tables when converted to HTML. I ended up having to go with a slightly altered approach, but the end result for the user is nearly the same.

The idea I had was to split the PDF into images. This was actually really easy to do. I swapped over to Linux for this part (Ubuntu Gutsy). The PDF reader program can output the pages of a PDF as JPGs. This came in very handy: I simply output the PDF I was using, all 28 pages of it, as JPGs and then used JavaScript to make a nice little setup for viewing the pictures.

JavaScript code inside the <body> tags:
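<script>
// Parses a query string such as "?p=3" into a list of "key=value" pairs.
function PageQuery(q) {
  if (q.length > 1) this.q = q.substring(1, q.length);
  else this.q = null;
  this.keyValuePairs = new Array();
  // Check this.q rather than q, so a bare "?" does not call split on null.
  if (this.q) {
    var pairs = this.q.split("&");
    for (var i = 0; i < pairs.length; i++) {
      this.keyValuePairs[i] = pairs[i];
    }
  }
  this.getKeyValuePairs = function() { return this.keyValuePairs; }
  // Returns the value stored for key s, or false when the key is absent.
  this.getValue = function(s) {
    for (var j = 0; j < this.keyValuePairs.length; j++) {
      if (this.keyValuePairs[j].split("=")[0] == s)
        return this.keyValuePairs[j].split("=")[1];
    }
    return false;
  }
  // Returns the names of all parameters found in the query string.
  this.getParameters = function() {
    var a = new Array(this.getLength());
    for (var j = 0; j < this.keyValuePairs.length; j++) {
      a[j] = this.keyValuePairs[j].split("=")[0];
    }
    return a;
  }
  this.getLength = function() { return this.keyValuePairs.length; }
}
// Looks up key in the current page's query string.
// A missing key comes back as the string 'false'.
function queryString(key) {
  var page = new PageQuery(window.location.search);
  return unescape(page.getValue(key));
}
// Returns the value for key, defaulting to '1' when it is missing.
function displayItem(key) {
  if (queryString(key) == 'false') {
    return '1';
  } else {
    return queryString(key);
  }
}
</script>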

What else?

Once this code is in place, you can see what it is trying to do. It parses the query string of the page URL and looks up the value of whatever variable name you pass in. This gives your JavaScript an equivalent of PHP’s GET variables.
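To make the idea concrete, here is a minimal sketch of the same lookup written as two small functions that take the query string as an argument instead of reading window.location.search, so they can be tried outside a browser. The names parseQuery and pageFromQuery are my own for illustration and are not part of the script above.

```javascript
// Hypothetical helper: turn a query string such as "?p=3&zoom=2"
// into a plain object of decoded keys and values.
function parseQuery(search) {
  var result = {};
  // Drop the leading "?" if there is one.
  var query = search.charAt(0) === "?" ? search.substring(1) : search;
  if (query === "") {
    return result;
  }
  var pairs = query.split("&");
  for (var i = 0; i < pairs.length; i++) {
    var parts = pairs[i].split("=");
    var key = decodeURIComponent(parts[0]);
    // A key without "=" gets an empty string as its value.
    var value = parts.length > 1
      ? decodeURIComponent(parts.slice(1).join("="))
      : "";
    result[key] = value;
  }
  return result;
}

// Hypothetical helper: fall back to page "1" when no "p" parameter
// is given, which mirrors what displayItem('p') does.
function pageFromQuery(search) {
  var params = parseQuery(search);
  return Object.prototype.hasOwnProperty.call(params, "p") ? params.p : "1";
}

console.log(pageFromQuery("?p=3")); // prints 3
console.log(pageFromQuery(""));     // prints 1
```

In the browser you would pass window.location.search as the argument. The sketch uses decodeURIComponent where the original uses unescape, which is my substitution rather than something the original script does.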

Now you need the code that ties the image to a select box, so the complete picture comes into view.

<script>
// Show page 1 when value is empty; otherwise show the page named by
// the "p" query parameter.
function loadPage(value) {
  if (value == "") {
    document.getElementById('mainimage').src = "img/ProductCatalog/Page1.jpg";
  } else {
    document.getElementById('mainimage').src = "img/ProductCatalog/Page" + displayItem('p') + ".jpg";
  }
}
// Point the main image at the file selected in the list.
function changeImage() {
  var list = document.getElementById('list');
  document.getElementById('mainimage').src = list.options[list.selectedIndex].value;
}
// Step back one page, wrapping from the first page to the last.
function prevImage() {
  var list = document.getElementById('list');
  if (list.selectedIndex == 0) {
    list.selectedIndex = list.options.length - 1;
  } else {
    list.selectedIndex--;
  }
  changeImage();
}
// Step forward one page, wrapping from the last page to the first.
function nextImage() {
  var list = document.getElementById('list');
  if (list.selectedIndex == list.options.length - 1) {
    list.selectedIndex = 0;
  } else {
    list.selectedIndex++;
  }
  changeImage();
}
</script>
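The functions above expect a select box with the id 'list' whose option values are image paths, but the post does not show that markup. As a rough sketch only, the options could be generated from the page count rather than typed out by hand. The helper names buildPageOptions, wrapIndex and fillSelect are hypothetical, and the count of 28 and the img/ProductCatalog/ path are taken from the text and code above.

```javascript
// Hypothetical helper: describe one option per page image.
function buildPageOptions(pageCount, basePath) {
  var options = [];
  for (var n = 1; n <= pageCount; n++) {
    options.push({ value: basePath + "Page" + n + ".jpg", label: "Page " + n });
  }
  return options;
}

// Hypothetical helper: move by step (-1 or +1) through a list of
// length items, wrapping at both ends like prevImage and nextImage.
function wrapIndex(current, step, length) {
  return (current + step + length) % length;
}

// Browser only: add the options to an existing select element.
function fillSelect(select, options) {
  for (var i = 0; i < options.length; i++) {
    select.add(new Option(options[i].label, options[i].value));
  }
}

// Assumes the page contains a select element with id 'list'.
if (typeof document !== "undefined") {
  var list = document.getElementById("list");
  if (list) {
    fillSelect(list, buildPageOptions(28, "img/ProductCatalog/"));
  }
}
```

With the options filled in this way, the value of each option is exactly what changeImage() assigns to the image source, and wrapIndex shows the same first-to-last and last-to-first wrap in a single expression.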

That was pretty much all the JavaScript I needed, and I could handle everything else in HTML, which made things much easier. With all that said and done, if you have any questions, just ask.
