Extract text from PDF, DOC, HTML, CHM, and RTF files

Posted on January 19, 2008 at 7:31 am

Have a document in PDF format that you would like to convert to a text document? Or maybe an HTML or CHM (Windows Help) file that you need to convert to plain text? Why might this be useful? Most PDF documents are not editable, and selecting the text manually can be a tedious process.

You can use Text Mining Tool to automatically extract the text from a PDF file so that you can use it freely in any program. Or, if you cannot open a PDF file because you do not have a PDF viewer installed, you can use this tool to extract the text and read the document.

Text Mining Tool is completely free and does not even require installation: simply unzip it and run the program.


Click the Open button and choose the file that you want to convert to text. Click OK, and the large window below the buttons will eventually fill with all of the text extracted from the document.


Click Save to save the extracted text to your computer. You can also click Clipboard to copy the mined text to the Windows clipboard.

For convenience, the following hotkeys can be used to perform the operations:

  • Open - F3 or O.
  • Save - F2 or S.
  • Clipboard - F5 or C.
  • Exit - F10 or Escape.

You can also use the minetext console tool to create a batch script for extracting text from multiple files. This can be useful if you have a directory with a large number of files that need to have text extracted.

The included console tool minetext has the following syntax:

minetext <input file>

minetext <input file> <output file>

where:

  <input file>  - any file with one of the following extensions:
                  pdf, doc, rtf, chm, htm, html
  <output file> - file to which the text mined from the input file is written
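As a rough illustration of the batch idea, here is a minimal Python sketch that builds one `minetext <input file> <output file>` command for each supported file in a directory and then runs them. It assumes minetext is on your PATH (or that you pass its full path as `exe`). The function names `build_commands` and `run_batch`, the choice to name each output file by appending `.txt` to the input name, and the assumption that minetext returns a non-zero exit code on failure are all mine, not something documented by the tool.

```python
import subprocess
from pathlib import Path

# Extensions listed in the minetext syntax description above.
SUPPORTED = {".pdf", ".doc", ".rtf", ".chm", ".htm", ".html"}


def build_commands(input_dir, output_dir, exe="minetext"):
    """Return one [exe, input file, output file] command per supported file."""
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # Append ".txt" to the full name so report.pdf and report.doc
            # do not overwrite each other's output (naming is our choice).
            out = Path(output_dir) / (path.name + ".txt")
            commands.append([exe, str(path), str(out)])
    return commands


def run_batch(input_dir, output_dir, exe="minetext"):
    """Run minetext once per supported file found in input_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(input_dir, output_dir, exe):
        result = subprocess.run(cmd)
        # Assumption: a non-zero exit code signals a failed extraction.
        if result.returncode != 0:
            print(f"minetext may have failed for {cmd[1]}")
```

Calling `run_batch("docs", "docs_text")` would then process every PDF, DOC, RTF, CHM, and HTML file in the hypothetical `docs` folder and write the results into `docs_text`. Files with other extensions are skipped.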

If you’re a web designer, this program is a handy way to grab the text from a Word document without all of the extra Microsoft Office styling code that usually comes along with it.

This is a very simple program that is easy to use. It has one basic purpose, and it does that job well. Enjoy!





8 Responses to “Extract text from PDF, DOC, HTML, CHM, and RTF files”

  1. Gregg Decker said:

    I am very impressed with your software suggestions. I find most of them useful. I look forward daily to my emails from you.
    I have one suggestion, and that is to make it easier for users to download the software via one easy-to-find link. There have been a few times when I gave up looking for the link and then forgot all about the software that I could have found useful.

    Keep up the good work.

    Gregg Decker


  2. Noemi B said:

    Totally agree with Gregg D. I love your finds and find a lot of them very useful, but there have been a few where you just cannot find the download link and I’ve had to google the app to find it.

    Otherwise, keep up the good work!!


  3. loftninja said:

    Great info. By chance, does anyone know how to then translate the text into another language and press it back into the original PDF layout?


  4. akishore said:

    Hi loftninja,

    Once you have extracted the text from the PDF document, you can translate it using Google Translate. Note that the translations are not going to be very good. Then you can convert it back to PDF by printing to PDF using a program like CutePDF.


  5. rtt_hu said:

    Maybe you could give a clear web link, so the software can be shared more conveniently. Thank you!


  6. Ahamad said:

    In case you want to programmatically extract text from a PDF document, here is how to achieve that (using ColdFusion):
    http://cf-examples.net/index.c.....t-From-PDF


    Pingbacks

  1. Recap of last month’s best posts on Online Tech Tips Says:

    [...] Extract text from PDF, DOC, HTML, CHM, and RTF files. Have a document in PDF format that you would like to convert to a text document? Or maybe an HTML or CHM (Windows Help File) that you need to convert into simply plain text? [...]

  2. How to convert a PDF file to Word, Excel or JPG format Says:

    [...] it! By the way, if you are interested in how to extract the text from a PDF document or how to convert Word files to PDF, etc, check out the [...]
