How to extract text from DOCX or ODT files using PHP

Are you looking for a way to extract text from DOCX or ODT files using PHP? In this article I will show you how. The technique can be used to build a web crawler that indexes document files by their content, for example to create a searchable document repository. It doesn't involve any third-party plugins or software. It works in PHP 5.2+, and the only requirement is the ZIP extension: php_zip.dll on Windows, or PHP compiled with the --enable-zip option on Linux. DOCX and ODT files are really ZIP archives that use a .docx or .odt extension instead of .zip, so we need a ZIP library for PHP in order to read the data inside them.

You can verify this fact yourself: just open any DOCX or ODT file with a ZIP utility. The screenshot below shows the result:

DOCX file structure
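
The same check can also be done from PHP. The snippet below is a minimal sketch, not part of extract.php: it opens the document with ZipArchive and collects the names of the entries inside it. The function name listArchiveEntries is made up for this illustration.

```php
<?php
// Hypothetical helper: returns the names of all entries inside a ZIP-based document.
function listArchiveEntries($filename)
{
    $zip = new ZipArchive;
    $names = array();

    // open() returns true on success and an integer error code on failure
    if ($zip->open($filename) !== true) {
        return $names;
    }

    // numFiles holds the number of entries; getNameIndex() returns the name at an index
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }

    $zip->close();
    return $names;
}

// Example usage with the sample file used in this article:
// print_r(listArchiveEntries('attractive_prices.docx'));
```

For a DOCX file the list should include word/document.xml, and for an ODT file it should include content.xml.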

The text data is stored in word/document.xml for a DOCX file and in content.xml for an ODT file. To extract the text, all we need to do is read the contents of word/document.xml (for a DOCX file) or content.xml (for an ODT file) and display them after filtering out the XML tags.

Create a new PHP file, name it extract.php and add the following code to it:

<?php

/* Name of the document file */
$document = 'attractive_prices.docx';

/** Function to extract text from a DOCX or ODT file */
function extracttext($filename) {
    // Get the file extension in lower case
    // (end(explode(...)) raises a notice because end() expects a variable)
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // If it's a DOCX file
    if ($ext == 'docx') {
        $dataFile = "word/document.xml";
    // Otherwise it must be an ODT file
    } else {
        $dataFile = "content.xml";
    }

    // Create a new ZIP archive object
    $zip = new ZipArchive;

    // Open the archive file
    if (true === $zip->open($filename)) {
        $text = false;
        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it into a string
            $text = $zip->getFromIndex($index);
        }
        // Close the archive file before returning anything
        $zip->close();

        if ($text !== false) {
            // loadXML() must be called on an instance, not statically.
            // Suppress parser errors and warnings; entity substitution
            // (LIBXML_NOENT) is left off because the input may be untrusted.
            $xml = new DOMDocument();
            if ($xml->loadXML($text, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                // Remove XML formatting tags and return the text
                return strip_tags($xml->saveXML());
            }
        }
    }

    // In case of failure return a message
    return "File not found";
}

echo extracttext($document);
?>

The comments in the code snippet should help you understand it.
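
One thing to keep in mind is that strip_tags() removes the paragraph tags as well, so the text of consecutive paragraphs runs together. If you want one paragraph per line, a possible refinement is sketched below. It is only an illustration under two assumptions: the XML uses the usual "w" prefix (DOCX) and "text" prefix (ODT) for paragraph and heading elements, and the function name xmlToLines is made up for this example.

```php
<?php
// Hypothetical helper: converts document XML into plain text, one paragraph per line.
// Assumes the conventional "w" (DOCX) and "text" (ODT) namespace prefixes.
function xmlToLines($xml)
{
    // Mark the end of every paragraph or heading with a newline
    $closingTags = array('</w:p>', '</text:p>', '</text:h>');
    foreach ($closingTags as $tag) {
        $xml = str_replace($tag, $tag . "\n", $xml);
    }

    // Remove the markup, then trim surrounding whitespace
    return trim(strip_tags($xml));
}

// Example usage on a small DOCX-style fragment:
// echo xmlToLines('<w:p><w:r><w:t>First</w:t></w:r></w:p><w:p><w:r><w:t>Second</w:t></w:r></w:p>');
```

To use it with extract.php, you could pass the XML string to this helper in place of the strip_tags() call. Empty self-closing paragraphs are not turned into blank lines by this sketch.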
