Office toolkit

Convert DOCX to PDF, clean text & scrub metadata in your browser

These tools help you prepare office documents for upload: a lightweight DOCX to PDF converter that focuses on text, a cleaner for messy pasted content, a DOCX metadata scrubber for privacy, and a structural size analyser for DOCX, PPTX and XLSX files. Everything runs locally in your browser.

DOCX to PDF (text-based conversion)

Extracts raw text from a DOCX file and lays it out in a PDF for simple submissions. Designed for straightforward documents such as cover letters, short reports and basic forms.

This converter focuses on text. Complex layouts, tables and images are not preserved.

For rich formatting, use a desktop office suite’s “Save as PDF” feature. This tool is meant for quick, simple documents where the text itself matters most.

How this DOCX to PDF converter works behind the scenes

A DOCX file is effectively a compressed folder that contains a collection of XML documents describing the structure and styling of your text. The main body lives in word/document.xml. Instead of attempting to reproduce every formatting nuance, this converter reads that XML, strips the markup and reconstructs a clean, linear text representation. That text is then placed into a PDF page using pdf-lib, with simple line wrapping so the content remains readable.

The workflow happens entirely in the browser. When you choose a DOCX file, JavaScript reads it as an ArrayBuffer and hands it over to JSZip, which opens the ZIP container in memory. The script locates word/document.xml, decodes it as a UTF-8 string and applies straightforward pattern replacements: paragraph tags become newlines, run tags are discarded, and the textual content between them is retained. This distilled text becomes the input for the PDF stage.
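The pattern replacements described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the function name extractTextFromDocumentXml is hypothetical, and the entity-decoding step is an assumption added so that characters such as & come out readable.

```javascript
// Minimal sketch: reduce the XML of word/document.xml to plain text.
// The function name and the entity table are illustrative assumptions.
function extractTextFromDocumentXml(xml) {
  const entities = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&apos;": "'",
    "&amp;": "&",
  };
  return xml
    .replace(/<\/w:p>/g, "\n") // end of a paragraph becomes a newline
    .replace(/<[^>]+>/g, "") // every other tag (runs, properties) is dropped
    .replace(/&(lt|gt|quot|apos|amp);/g, (match) => entities[match]) // assumed step
    .trim();
}

const sampleXml =
  "<w:document><w:body>" +
  "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>" +
  "<w:p><w:r><w:t>A &amp; B</w:t></w:r></w:p>" +
  "</w:body></w:document>";
const sampleText = extractTextFromDocumentXml(sampleXml);
```

In this sketch the two runs of the first paragraph merge into one line, and the second paragraph starts on a new line. A regex pass like this is deliberately crude; it ignores tabs, manual line breaks and anything stored outside the main body.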

What this converter is and what it is not

The tool is designed for situations where you quickly need a PDF that contains the words of your document and you are not concerned about exact visual fidelity. Think short cover letters, simple legal notices or administrative forms where the main requirement is “PDF only” rather than “keep the original page layout exactly”. For those use cases, replicating the typographic grid of the original DOCX is less critical than speed, privacy and ease of use.

It does not attempt to reproduce complex features such as multi-column layouts, embedded tables, equations or floating images. Those elements either appear as part of the linear text or are omitted if they are encoded in ways the text extractor ignores. If you rely on precise formatting for invoices, slide decks or designed brochures, you should export from a full word processor that knows the entire styling model.

Benefits of a pure browser-based approach

Because your files are never uploaded to a server, you retain full control over them. Confidential drafts do not leave your machine; the browser simply opens the DOCX container, reads the XML and composes a PDF in memory. When the process finishes, the resulting file is downloaded directly to your device. This is useful in environments where external uploads are restricted or where you want to avoid leaving traces of sensitive documents on third-party services.

Being lightweight is another advantage. The converter uses only JSZip and pdf-lib, both loaded from CDNs, which keeps page size reasonable compared to more ambitious solutions that emulate a full word processor in JavaScript. This aligns better with performance expectations on mobile devices and low-bandwidth connections.

How the PDF layout is created

Once the raw text is available, the tool creates a PDF page with a standard A4-like size. It then performs basic line wrapping in JavaScript, splitting the text into lines of roughly fixed character count and inserting manual line breaks where necessary. Lines are placed starting at the top margin and stepping downward by a constant line height; because PDF coordinates are measured from the bottom-left corner of the page, each line receives a smaller y value than the one before it. If the text runs past the bottom margin, additional pages are created and the process continues.
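A rough sketch of this wrapping and placement logic is shown below. The function names, the 842-point page height, the 50-point margin and the 14-point line height are all assumptions chosen for illustration; the tool's real values may differ.

```javascript
// Break text into lines of at most maxChars characters, splitting at spaces.
// A single word longer than maxChars is left on its own (overlong) line.
function wrapText(text, maxChars) {
  const lines = [];
  for (const paragraph of text.split("\n")) {
    const words = paragraph.split(/\s+/).filter(Boolean);
    let current = "";
    for (const word of words) {
      if (current && (current + " " + word).length > maxChars) {
        lines.push(current);
        current = word;
      } else {
        current = current ? current + " " + word : word;
      }
    }
    lines.push(current); // an empty paragraph yields an empty line
  }
  return lines;
}

// Give every line a page index and a y coordinate. PDF coordinates start at
// the bottom of the page, so the first line sits at pageHeight - margin and
// each following line is lineHeight lower.
function layoutLines(lines, { pageHeight = 842, margin = 50, lineHeight = 14 } = {}) {
  const placed = [];
  let page = 0;
  let y = pageHeight - margin;
  for (const text of lines) {
    if (y < margin) {
      page += 1; // ran past the bottom margin: continue on a new page
      y = pageHeight - margin;
    }
    placed.push({ page, y, text });
    y -= lineHeight;
  }
  return placed;
}
```

Each placed entry would then map to one pdf-lib page.drawText(text, { x, y, size }) call on the page with the matching index. Counting characters rather than measuring glyph widths is the simplification that keeps this fast, at the cost of slightly uneven right edges.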

This approach does not match the exact wrapping decisions of your original word processor, but it ensures a stable and legible result that can be opened in any PDF viewer. For most short documents, the difference between layouts is acceptable, and in return you gain a fast, install-free conversion path.

Practical tips for use

If you know in advance that a document will go through this converter, keep the structure simple: paragraphs, headings and basic lists. Avoid complex multi-column arrangements and heavy graphical elements. After downloading the PDF, open it in your usual viewer and scan through the pages. If critical information is missing or a particular layout detail is required, fall back to a desktop export for that specific document.

Text cleaner for office content

Paste text from DOCX or PDF, remove broken line breaks, normalise spacing, and straighten punctuation before reusing the cleaned text in forms, email or new documents.

Use this tool whenever pasted content carries odd breaks from PDFs or older DOCX files.

Why pasted office text often looks broken

Copying and pasting text from DOCX files, PDFs or email clients often drags along invisible formatting. Hard line breaks may appear in the middle of sentences, multiple spaces accumulate at word boundaries, and typographic quotation marks become inconsistent. When that text is reused in forms or new documents, the result can look unprofessional and harder to read than the original.

The root cause is that many applications treat visual layout and logical text structure as separate concerns. A PDF might insert a line break at the end of every visual line, even if the sentence continues on the next one. A DOCX file may carry styles and control codes that are meaningful inside the word processor but turn into stray spaces and symbols when pasted elsewhere. The cleaner addresses this by applying a set of predictable transformations that make the text behave like paragraphs again.

How the cleaner operates on your text

When you click “Clean text”, the script takes the content of the input box and processes it step by step. If line trimming is enabled, it removes leading and trailing whitespace from each line. The line-joining option then examines break patterns and decides when to merge neighbouring lines with a space so paragraphs become continuous sequences of words instead of jagged fragments. Empty lines are preserved to indicate paragraph boundaries.
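The trimming and joining steps might look something like the sketch below. It uses the simplest possible joining rule (every run of non-empty lines becomes one paragraph), which is an assumption; the tool's own break-pattern checks may be more selective. The function and option names are hypothetical.

```javascript
// Sketch of the trim and join options. Blank lines are kept as paragraph
// boundaries; neighbouring non-empty lines are merged with a single space.
function trimAndJoinLines(text, { trim = true, join = true } = {}) {
  let lines = text.split(/\r?\n/);
  if (trim) {
    lines = lines.map((line) => line.trim());
  }
  if (!join) {
    return lines.join("\n");
  }
  const output = [];
  let buffer = [];
  const flush = () => {
    if (buffer.length > 0) {
      output.push(buffer.join(" "));
      buffer = [];
    }
  };
  for (const line of lines) {
    if (line.trim() === "") {
      flush();
      output.push(""); // preserve the empty line between paragraphs
    } else {
      buffer.push(line);
    }
  }
  flush();
  return output.join("\n");
}

const pasted = "  First line\nof a paragraph.  \n\nSecond\nparagraph.";
const joined = trimAndJoinLines(pasted);
```

Here the two jagged fragments of each paragraph are merged, while the blank line separating them survives.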

The cleaner also collapses repeated spaces into a single one. This helps when text has been justified using non-breaking space tricks or when copying from terminal windows and draft layouts. Straightening quotes replaces curly quotation marks and mixed apostrophe styles with plain ASCII quotes, which simplifies further processing in environments that do not handle typographic characters consistently.
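As an illustration, the two transformations could be written as a pair of small functions. The names are hypothetical, and converting non-breaking spaces (U+00A0) to ordinary spaces before collapsing is an assumption about how such runs are handled.

```javascript
// Collapse runs of spaces into one. Non-breaking spaces are first turned
// into ordinary spaces (assumed behaviour) so mixed runs collapse as well.
function collapseSpaces(text) {
  return text.replace(/\u00A0/g, " ").replace(/ {2,}/g, " ");
}

// Replace typographic single and double quotes with their ASCII equivalents.
function straightenQuotes(text) {
  return text
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"');
}

const tidy = straightenQuotes(collapseSpaces("\u201CIt\u2019s   fine\u201D"));
```

Keeping the two steps as separate functions mirrors the idea of independent options: either can be applied without the other.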

Balancing automation and control

Not every document benefits from the same level of clean-up. Poetry, code snippets and carefully aligned tabular text rely on exact line breaks and spacing, so joining lines indiscriminately would damage meaning. For that reason, each transformation is exposed as a separate checkbox. You can enable only the operations that make sense for your content, preview the result in the output area, and tweak settings as needed.

A practical habit is to keep the original pasted text untouched in the input area and experiment with different combinations of options. Once the output looks right, use that cleaned version for submission or further editing. If you discover a problem later, you can always go back to the original and run the cleaner again with different switches.

Using the downloaded .txt file

The “Download .txt” button produces a plain text file encoded in UTF-8. You can attach this file to emails, open it in any text editor, or import it into another application that accepts text uploads. Because it carries no hidden formatting, the risk of layout surprises is much lower than when pasting from rich text sources directly.

This kind of neutral representation is also useful when preparing content for systems that have strict size limits or character-count constraints. Surplus control characters and redundant spaces are removed, leaving only the parts that actually contribute to the message.

DOCX metadata remover (privacy cleaner)

Scrubs author names, revision timestamps and related fields from DOCX files by rewriting metadata XML inside the document package, all in your browser.

The tool targets common metadata fields in docProps/core.xml and docProps/app.xml.

Always keep a backup of the original file. This tool cannot restore metadata once it has been removed.

Why DOCX metadata matters for privacy

Word processing files often carry more information than the visible text. They may embed the full name of the original author, the organisation name, creation and modification timestamps, template references and revision identifiers. When you share documents externally, this metadata can reveal details about your system, workflow or identity that you did not intend to disclose.

DOCX uses the Open Packaging Conventions (OPC), where metadata lives in separate XML parts under the docProps folder: core.xml stores fields such as dc:creator, dcterms:created and dcterms:modified, while app.xml records items like total editing time, number of pages and application name. Removing or neutralising these fields before sharing reduces the amount of context exposed to recipients.

How this cleaner modifies metadata

When you select a DOCX file, the script opens it as a ZIP archive using JSZip. It looks for the metadata parts under docProps/core.xml and docProps/app.xml. If they exist, the XML content is loaded as text. Instead of parsing the full document model, the cleaner applies targeted replacements: it removes the content between tags such as <dc:creator>…</dc:creator> and <cp:lastModifiedBy>…</cp:lastModifiedBy>, either leaving empty tags or collapsing them entirely.

The updated XML is written back into the ZIP structure, and JSZip generates a fresh DOCX package. From the perspective of Word or compatible editors, the document still opens normally because the structural layout is intact; only specific metadata values have been cleared. The resulting file is downloaded automatically to your device under the chosen output name.
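The targeted replacement step can be approximated with a short sketch that works on the XML text of each metadata part. The field lists and the clearFields helper are hypothetical and only illustrate the two behaviours mentioned above, leaving an empty tag or removing the element entirely; the tool's own list of targets may differ.

```javascript
// Hypothetical target lists for docProps/core.xml and docProps/app.xml.
const CORE_FIELDS = [
  "dc:creator",
  "cp:lastModifiedBy",
  "dcterms:created",
  "dcterms:modified",
  "cp:revision",
];
const APP_FIELDS = ["TotalTime"];

// Clear the text content of each listed element. Tag names are assumed to be
// plain XML names, so they are used in the pattern without escaping.
function clearFields(xml, tagNames, { removeElement = false } = {}) {
  let result = xml;
  for (const tag of tagNames) {
    // Group 1 is the opening tag (attributes such as xsi:type included),
    // group 2 is the closing tag; the value in between is discarded.
    const pattern = new RegExp(`(<${tag}(?:\\s[^>]*)?>)[^<]*(</${tag}>)`, "g");
    result = result.replace(pattern, removeElement ? "" : "$1$2");
  }
  return result;
}

const coreSample =
  "<cp:coreProperties>" +
  "<dc:title>Report</dc:title>" +
  "<dc:creator>Jane Doe</dc:creator>" +
  "<cp:lastModifiedBy>Jane Doe</cp:lastModifiedBy>" +
  '<dcterms:created xsi:type="dcterms:W3CDTF">2024-01-01T10:00:00Z</dcterms:created>' +
  "</cp:coreProperties>";
const coreCleaned = clearFields(coreSample, CORE_FIELDS);
```

In this example the title is left alone while the author, last editor and creation time are emptied. Whether a particular editor accepts an empty timestamp element is worth checking on a sample file; if it does not, removing the element with removeElement: true is the alternative.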

What the cleaner targets by default

The focus is on fields that commonly carry personal or system-specific information:

  • Author and last modified by names.
  • Creation and modification timestamps.
  • Revision counts and editing duration where present.

Other structural properties, such as page count and word count, may remain because they are often needed for document management and are less sensitive than direct identity markers. If you require stronger anonymisation, it is still wise to review the resulting file’s properties in Word or LibreOffice to confirm that the cleaned document matches your policy.

Limits and good practices

This cleaner operates at the level of DOCX package metadata; it does not scan the document body for names or email addresses written in the text. If the body contains personal or organisational information, that content remains. For full anonymisation, you should combine metadata scrubbing with careful reading of the actual text to remove or redact identifiers where appropriate.

From a workflow perspective, keep the original DOCX in a secure location and treat the cleaned version as the copy intended for sharing. That way, internal revision history and context remain available to your team if needed, while external recipients only see the information you have explicitly decided to expose.

Office file size analyser (DOCX/PPTX/XLSX)

Inspects the internal ZIP structure of DOCX, PPTX or XLSX files and highlights which embedded parts consume the most space, such as images or media folders.

The analyser lists entries ordered by size so you can see which images or embedded items dominate the file.

Understanding why some Office files become unexpectedly large

DOCX, PPTX and XLSX files are ZIP archives that bundle XML markup with images, charts, embedded documents and style definitions. When a slide deck or report becomes much heavier than expected, the culprit is often not the text but a collection of high-resolution images, unused layouts or leftover media that remains inside the package. Without inspecting that structure, it is difficult to know where to focus optimisation efforts.

The size analyser reads the selected file as a binary buffer and passes it to JSZip, which exposes each entry’s name and uncompressed size. The tool then sorts these entries in descending order and prints a report highlighting the largest components. Typical patterns include multiple copies of the same image, audio or video files embedded in slides, and design resources that are never visible in the final document.
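The sorting step is simple enough to sketch on its own. The snippet assumes the entries have already been collected into plain { name, size } objects, with size in bytes; that shape, the function name and the added share field are illustrative assumptions, and how the sizes are read from the archive is left out.

```javascript
// Rank archive entries by uncompressed size, largest first, and attach each
// entry's share of the total so dominant items stand out.
function rankBySize(entries, topN = 10) {
  const total = entries.reduce((sum, entry) => sum + entry.size, 0);
  return [...entries] // copy so the caller's array is not reordered
    .sort((a, b) => b.size - a.size)
    .slice(0, topN)
    .map((entry) => ({ ...entry, share: total ? entry.size / total : 0 }));
}

const ranked = rankBySize([
  { name: "word/document.xml", size: 1000 },
  { name: "word/media/image1.png", size: 3000 },
]);
```

With these two sample entries the image comes first and accounts for three quarters of the total.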

Interpreting the analysis output

The report lists each internal file path with its uncompressed size in kilobytes. Folders such as word/media, ppt/media or xl/media usually contain images. If you see individual items occupying hundreds or thousands of kilobytes, those assets are primary candidates for optimisation via resizing or recompression in their native format. Likewise, unusually large XML parts may indicate embedded objects or very complex sheets.
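One possible way to turn an entry into a report line is sketched below. The 500 KB threshold, the wording of the flags and the describeEntry name are assumptions for illustration only; the media folder names come from the paragraph above.

```javascript
// Folders that usually hold images and other media in Office packages.
const MEDIA_FOLDERS = ["word/media/", "ppt/media/", "xl/media/"];

// Format one entry as "<size> KB  <path>" and flag unusually large parts.
// entry.size is assumed to be the uncompressed size in bytes.
function describeEntry(entry, largeThresholdKb = 500) {
  const kb = entry.size / 1024;
  const isMedia = MEDIA_FOLDERS.some((prefix) => entry.name.startsWith(prefix));
  let flag = "";
  if (kb >= largeThresholdKb) {
    flag = isMedia ? "  <- large media, consider resizing" : "  <- large part";
  }
  return `${kb.toFixed(1).padStart(10)} KB  ${entry.name}${flag}`;
}

const mediaLine = describeEntry({ name: "ppt/media/image3.png", size: 2048 * 1024 });
const xmlLine = describeEntry({ name: "word/document.xml", size: 10240 });
```

Separating media from other large parts matters because the remedies differ: images can be resized or recompressed, whereas a large XML part usually points to embedded objects or a very complex sheet.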

The tool does not modify the original document; it only reveals structure. You can then open the file in Word, PowerPoint or Excel and deliberately replace or remove the heavy components. For example, compressing pictures within PowerPoint, replacing full-resolution photos with versions sized to the slide dimensions, or deleting unused slide masters can substantially reduce the final package size.

Benefits of client-side inspection

Running the analysis in your browser avoids uploading potentially sensitive corporate or personal material to third-party servers. The script only sees filenames and binary sizes; it does not attempt to interpret the semantic content of the XML beyond what is necessary to list entries. This makes the analyser suitable for initial triage in environments with strict data-handling rules.

It also provides faster feedback than repeatedly guessing and exporting. With a single run you can identify the items most responsible for the file weight and target those specifically, instead of randomly deleting slides or shrinking text.

Using the analysis as part of a clean-up workflow

A practical workflow combines the analyser with the image tools elsewhere on this site. After identifying large images inside a PPTX, export those graphics to disk, resize or recompress them with suitable settings, and reinsert the optimised versions into the presentation. Running the analyser again should confirm the reduction in total package size.

For spreadsheets, large size often stems from extensive formatting, embedded charts or objects. The report may reveal specific sheets or drawings as the drivers of complexity. Cleaning or simplifying those elements can make workbooks easier to share with systems that impose upload limits.

These tools are part of the wider Compress It Small hub, so you can jump back to the homepage at any time to discover image, office, and workflow guides that complement your PDF tasks.