When working with Office documents, especially in automation-heavy environments or where data insights come from unconventional sources, it is easy to underestimate the hidden structure behind familiar formats. One recent project grew out of a deceptively simple question: what exactly is inside a PowerPoint file? That curiosity led us to build a lightweight, flexible tool we call the pptx-parser, which we now use internally and are excited to share with the community.

The Idea

PowerPoint presentations are often treated as static visual documents, but in reality, they’re structured collections of XML files zipped into a .pptx container. These files define not only the content of each slide, but also how that content is organized, referenced, and presented. We saw potential in tapping into this structure to extract meaningful information that would otherwise be hidden.
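
To make the container structure concrete, here is a minimal sketch in Python that uses only the standard library. It is an illustration, not necessarily how the pptx-parser does it: the function name list_slide_parts is hypothetical, and the sketch relies on the usual convention that slide parts are stored as ppt/slides/slideN.xml inside the archive.

```python
import re
import zipfile

# Conventional location of slide parts inside a .pptx archive.
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def list_slide_parts(pptx_source):
    """Return the slide XML part names in a .pptx, sorted by part number.

    pptx_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(pptx_source) as archive:
        matches = [SLIDE_PART.fullmatch(name) for name in archive.namelist()]
    # Sort numerically so that slide10.xml comes after slide2.xml.
    numbered = sorted((int(m.group(1)), m.group(0)) for m in matches if m)
    return [name for _, name in numbered]
```

Note that the number in a part name does not have to match the order in which slides are shown; the display order is defined in ppt/presentation.xml.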

The core idea behind the pptx-parser is to read and interpret the p:cNvPr (NonVisualDrawingProperties) elements. Each one carries metadata for an object such as a text box, image, or shape, which are essentially the building blocks of a slide. By parsing these elements, we’re able to uncover a rich layer of information about how a presentation is built.

What It Does

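The pptx-parser operates by unpacking the .pptx archive, walking the XML of each slide, and extracting every instance of the p:cNvPr element. These NonVisualDrawingProperties carry the metadata behind each visual object on a slide: its ID, its name, and any descriptive label (the descr attribute, commonly used for alt text). The type of object follows from the enclosing element, such as p:sp for a shape or p:pic for a picture. Position and size are stored separately, in the object's transform (xfrm) element.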
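
The extraction step can be sketched as follows. This is a minimal example under stated assumptions rather than the pptx-parser's actual code: the function name extract_cnvpr and the shape of the returned dictionaries are hypothetical. The namespace URI is the standard PresentationML one behind the p: prefix.

```python
import zipfile
import xml.etree.ElementTree as ET

# PresentationML namespace bound to the "p:" prefix in slide XML.
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"


def extract_cnvpr(pptx_source):
    """Collect every p:cNvPr element found in the slide parts of a .pptx."""
    results = []
    with zipfile.ZipFile(pptx_source) as archive:
        for part in archive.namelist():
            if not (part.startswith("ppt/slides/slide") and part.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(part))
            # p:cNvPr sits two levels below the object it describes,
            # for example p:sp > p:nvSpPr > p:cNvPr.
            for shape in root.iter():
                for nv in shape:
                    props = nv.find(f"{{{P_NS}}}cNvPr")
                    if props is None:
                        continue
                    results.append({
                        "part": part,
                        # Local tag name of the enclosing object: sp, pic, ...
                        "kind": shape.tag.rsplit("}", 1)[-1],
                        "id": props.get("id"),
                        "name": props.get("name"),
                        "descr": props.get("descr"),
                    })
    return results
```

ElementTree elements have no parent pointer, so the sketch looks downward from each candidate object instead of upward from each p:cNvPr. The shape tree itself (p:spTree) also has a p:cNvPr entry and will appear in the results with kind "spTree".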

The parser processes this data into readable summaries, such as “Textbox on Slide 3, titled ‘Team Overview’,” providing a fast and insightful view into a presentation’s structure. It also flags potential issues in the XML (such as broken references or missing attributes) and logs every step of the process, giving users full transparency into what was parsed successfully and what wasn’t.
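
A rough sketch of how such summaries and warnings could be produced for a single slide is shown below. Everything named here (summarize_slide, KIND_LABELS, the logger name, the message wording) is a hypothetical illustration, not the tool's real interface. The sketch only detects malformed XML and missing id or name attributes; checking for broken references would additionally require reading the relationship parts and is left out.

```python
import logging
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"

# Hypothetical display labels for the enclosing element of a p:cNvPr.
KIND_LABELS = {"sp": "Shape", "pic": "Picture", "cxnSp": "Connector",
               "graphicFrame": "Graphic frame", "grpSp": "Group",
               "spTree": "Shape tree"}

log = logging.getLogger("pptx_parser_sketch")


def summarize_slide(slide_xml, slide_number):
    """Return (summaries, problems) for the XML of one slide."""
    summaries, problems = [], []
    try:
        root = ET.fromstring(slide_xml)
    except ET.ParseError as exc:
        problems.append(f"Slide {slide_number}: XML is not well-formed ({exc})")
        log.error(problems[-1])
        return summaries, problems

    for shape in root.iter():
        for nv in shape:
            props = nv.find(f"{{{P_NS}}}cNvPr")
            if props is None:
                continue
            kind = shape.tag.rsplit("}", 1)[-1]
            # The schema requires both id and name on p:cNvPr.
            missing = [a for a in ("id", "name") if a not in props.attrib]
            if missing:
                missing_text = ", ".join(missing)
                problems.append(
                    f"Slide {slide_number}: cNvPr inside {kind} lacks {missing_text}")
                log.warning(problems[-1])
                continue
            # A shape whose p:cNvSpPr has txBox="1" is a text box.
            sp_props = nv.find(f"{{{P_NS}}}cNvSpPr")
            if sp_props is not None and sp_props.get("txBox") in ("1", "true"):
                label = "Text box"
            else:
                label = KIND_LABELS.get(kind, kind)
            name = props.get("name")
            summaries.append(f"{label} on Slide {slide_number}, named '{name}'")
            log.info(summaries[-1])
    return summaries, problems
```

Calling logging.basicConfig(level=logging.INFO) before summarize_slide makes the success and warning messages visible on the console, which mirrors the idea of logging each step as it happens.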

The User Experience

Ease of use was a key goal from the start. The tool comes with a clean and intuitive GUI, allowing users to drag and drop .pptx files for immediate analysis. As the parser runs, it displays real-time logs of the operation—including both error messages and success confirmations—making the entire process visible and understandable, even for non-developers.

One of the strengths of the pptx-parser is its flexibility. It can be used:

  • As a desktop application for local analysis,
  • As a packaged executable for distribution to other users, or
  • As a web-based application for teams that need remote access or integration into a broader toolchain.

Why It Matters

Parsing Office files is often viewed as tedious and overly technical, but it doesn’t have to be. The pptx-parser offers a bridge between visual design and structured data, enabling automation, validation, and even insight discovery in environments where presentations play a central role.

Whether you’re auditing slide content for consistency, extracting metadata for compliance reporting, or just trying to understand how a complex deck was assembled, this tool turns PowerPoint into something that’s easier to work with programmatically—without sacrificing usability for power.

This project started as an experiment, but it’s quickly becoming a core tool in our tech toolbox. And we think others might find it just as useful.

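Parsing .pptx files is just one example. The same approach works for any XML-based format: other Office files with the x suffix, such as .docx and .xlsx, are likewise ZIP containers of XML parts, and plain XML dialects such as XSD can be parsed the same way. See also our XSD2XLSX approach.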
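
As a small illustration of the same pattern applied to XSD, the sketch below lists the named element declarations in a schema. The function name list_xsd_elements is hypothetical and unrelated to the XSD2XLSX project; the namespace URI is the standard XML Schema one.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema namespace, usually bound to "xs:" or "xsd:".
XS_NS = "http://www.w3.org/2001/XMLSchema"


def list_xsd_elements(xsd_text):
    """Return (name, type) pairs for every named xs:element in a schema."""
    root = ET.fromstring(xsd_text)
    return [
        (el.get("name"), el.get("type"))
        for el in root.iter(f"{{{XS_NS}}}element")
        if el.get("name") is not None  # skip ref-only declarations
    ]
```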

Want to know more about the project?

Here are links to further material on the project:
GitHub: https://github.com/andreas-buehlmeier/pptx-parser
YouTube: https://youtu.be/NRIaaqDFLOw