Extract text from PDF, DOC, HTML, CHM, and RTF files
Posted on January 19, 2008 at 7:31 am
Have a document in PDF format that you would like to convert to a text document? Or maybe an HTML or CHM (Windows Help) file that you need to convert to plain text? Why might this be useful, you ask? Most PDF documents are not editable, and selecting the text manually can be a tedious process.
You can use Text Mining Tool to automatically extract the text from a PDF file so that you can use it freely in any program. Or, if you cannot open a PDF file because you do not have a PDF viewer installed, you can use this tool to extract the text and read the document.
Text Mining Tool is completely free and does not even require installation. Simply unzip it and run the program.
Click the Open button and choose the file that you want to convert to text. Click OK, and the large window below the buttons will fill with the text extracted from the document once the extraction finishes.
Click Save to save the extracted text to your computer. You can also click Clipboard to copy the mined text to the Windows clipboard.
For convenience, the following hotkeys can be used to perform the operations:
- Open - F3 or O.
- Save - F2 or S.
- Clipboard - F5 or C.
- Exit - F10 or Escape.
You can also use the minetext console tool to create a batch script for extracting text from multiple files. This can be useful if you have a directory with a large number of files that need to have text extracted.
The included console tool minetext has the following syntax:
minetext <input file>
minetext <input file> <output file>
where:
<input file> - any file with one of the following extensions:
pdf, doc, rtf, chm, htm, html
<output file> - file you want to write text mined from input file
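As a rough illustration of the batch idea, here is a minimal Python sketch that builds one minetext call per supported file in a directory, using only the two-argument syntax shown above. The function names, the `.txt` output naming scheme, and the assumption that minetext can be found on your PATH are all mine, not part of the tool.

```python
import subprocess
from pathlib import Path

# Extensions the article lists as valid minetext input.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".rtf", ".chm", ".htm", ".html"}


def build_commands(directory, executable="minetext"):
    """Return one [executable, input file, output file] list per supported file."""
    commands = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            # Hypothetical naming scheme: report.pdf -> report.pdf.txt, so that
            # report.pdf and report.doc do not overwrite each other's output.
            output = path.with_name(path.name + ".txt")
            commands.append([executable, str(path), str(output)])
    return commands


def extract_all(directory, executable="minetext"):
    """Run minetext once per supported file (assumes it is on the PATH)."""
    for command in build_commands(directory, executable):
        subprocess.run(command, check=False)
```

Calling `extract_all(r"C:\docs")` would then run `minetext <input file> <output file>` for each matching file in that folder. If minetext is not on your PATH, pass the full path to the program as `executable`.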
If you’re a web designer, this program can be very useful for grabbing the text from a Word document without all of the extra Microsoft Office styling code that comes with it.
This is a simple program that is very easy to use! It has one basic purpose, and it does that job well. Enjoy!
8 Responses to “Extract text from PDF, DOC, HTML, CHM, and RTF files”
I am very impressed with your software suggestions. I find most of them useful. I look forward daily to my emails from you.
I have one suggestion, and that is to make it easier for users to download the software via one easy-to-find link. There have been a few times when I gave up looking for the link and then forgot all about software that I could have found useful.
Keep up the good work.
Gregg Decker
Totally agree with Gregg D. I love your finds and find a lot of them very useful, but there have been a few where you just cannot find the download link and I’ve had to google the app to find it.
Otherwise, keep up the good work!!
Great info. By chance, does anyone know how to then translate the text into another language, then press it back into the original PDF layout?
Hi loftninja,
Once you have extracted the text from the PDF document, you can translate it using Google Translate. Note that the translations are not going to be very good. Then you can convert it back to PDF by printing to PDF using a program like CutePDF.
Maybe you can give a web link clearly, so the software can be shared more conveniently. Thank you!
In case you want to programmatically extract text from a PDF document, here is how to achieve that (using ColdFusion):
http://cf-examples.net/index.c.....t-From-PDF