How to extract text from DOCX or ODT files using PHP
Are you searching for a method to extract text from DOCX or ODT files using PHP? Well in this article I will show you how to do so. This technique can be used to create a web crawler and index document files based upon their content i.e. this can be used to create a document repository. The technique here doesn't involve any third party plugins or softwares. It will work in PHP 5.2+ and the only requirement is php_zip.dll for Windows or --enable-zip parameter for Linux. Actually the DOCX and ODT files are archive files whose extension has been changed from .zip to .docx or .odt. Hence we need a ZIP library for PHP in order to extract the data from them.
You can verify this fact yourself. Just try to open any docx or odt file with a ZIP utility. Check out the screenshot below -
The text data is present in word/document.xml for DOCX and in Content.xml for ODT file. In order to extract the text all we need to do is that get the contents of word/document.xml (for docx file) or content.xml (for odt file) and then display its content after filtering out XML tags present in it.
Create a new PHP file and name it as extract.php and add the following code it -
<?php
/*Name of the document file*/
$document = 'attractive_prices.docx';
/**Function to extract text*/
function extracttext($filename) {
//Check for extension
$ext = end(explode('.', $filename));
//if its docx file
if($ext == 'docx')
$dataFile = "word/document.xml";
//else it must be odt file
else
$dataFile = "content.xml";
//Create a new ZIP archive object
$zip = new ZipArchive;
// Open the archive file
if (true === $zip->open($filename)) {
// If successful, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// Index found! Now read it to a string
$text = $zip->getFromIndex($index);
// Load XML from a string
// Ignore errors and warnings
$xml = DOMDocument::loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Remove XML formatting tags and return the text
return strip_tags($xml->saveXML());
}
//Close the archive file
$zip->close();
}
// In case of failure return a message
return "File not found";
}
echo extracttext($document);
?>Comments in the code snippet should easily help you to understand it.



Recent comments
13 weeks 1 day ago
14 weeks 2 days ago
22 weeks 5 days ago
46 weeks 22 hours ago
1 year 11 weeks ago
2 years 9 weeks ago
2 years 11 weeks ago
2 years 11 weeks ago
2 years 12 weeks ago
2 years 14 weeks ago